Context Compression


26. Context Compression Engine (CCE)

The Context Compression Engine prevents token overflow in long-running discussions by deterministically compressing older messages before they are sent to the LLM. It uses the open-source context-compression-engine library — a zero-dependency TypeScript package that runs in sub-2ms without any LLM calls.

Problem

In nyxCore discussions, every user and assistant message is stored in the database and sent as full conversation history with each new LLM call. Token counts grow monotonically:

| Message # | Estimated Prompt Tokens |
| --- | --- |
| 1 | ~2k |
| 5 | ~20k |
| 10 | ~45k |
| 15 | ~80k |
| 20+ | ~99k+ (overflow) |

Without compression, long discussions hit provider context limits, causing truncated responses or outright failures. The DISCUSSION_MAX_TOKENS (16384) completion limit compounds the problem: once the prompt consumes most of the context window, fewer tokens remain than the completion limit allows, and the model runs out of output space mid-response.
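
To make the compounding concrete, here is a back-of-envelope check. The ~100k context window is an assumption consistent with the overflow point in the table above, not a documented provider limit:

```ts
// Back-of-envelope check; the ~100k context window is an assumption
// drawn from the overflow point in the table above.
const contextWindow = 100_000;
const promptTokens = 85_000;            // a long discussion, around message 15-20
const completionBudget = 16_384;        // DISCUSSION_MAX_TOKENS
const remaining = contextWindow - promptTokens;  // 15_000
console.log(remaining < completionBudget);       // true → the completion gets truncated
```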

Solution

CCE sits between message retrieval and LLM invocation. It compresses older messages using deterministic text-level strategies (deduplication, fuzzy matching, convergence) while preserving recent messages verbatim.

```mermaid
flowchart LR
  DB[(Database)] -->|all messages| EST{Estimate tokens}
  EST -->|< 80k| LLM[Send to LLM]
  EST -->|> 80k| CCE[compress]
  CCE -->|compressed history| LLM
```
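
As a toy illustration of the dedup strategy, using the compress() call documented under CCE Configuration below (nothing is read off the result here, so no assumptions about its shape are needed):

```ts
import { compress } from "context-compression-engine";

// Two user messages with identical content; dedup should drop one of them.
const toy = compress(
  [
    { id: "a", index: 0, role: "user", content: "the deploy failed again" },
    { id: "b", index: 1, role: "user", content: "the deploy failed again" }, // exact duplicate
    { id: "c", index: 2, role: "assistant", content: "Checking the logs now." },
  ],
  { tokenBudget: 1_000, minRecencyWindow: 1, dedup: true, fuzzyDedup: true, forceConverge: true },
);
console.log(toy); // expect the duplicate user message to be gone
```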

Integration Point

File: src/server/services/discussion-service.ts

The compression is applied inside processDiscussion(), after loading all messages from the database and before passing them to the streaming functions:

```ts
import { compress } from "context-compression-engine";
import type { Message as CCEMessage } from "context-compression-engine";

const CONTEXT_TOKEN_BUDGET = 80_000; // compress once the estimated prompt exceeds this
const RECENCY_WINDOW = 5;            // the most recent messages are always kept verbatim
```
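
The chars / 3.5 heuristic used in step 2 of the flow below can be sketched as follows. The LLMMessage type is assumed to come from the provider layer, and multimodal content is ignored for brevity:

```ts
// Sketch of the chars / 3.5 estimate (step 2 below); not a real tokenizer.
// String content only, for brevity; see Content Type Handling for the
// multimodal conversion.
function estimateTokens(messages: LLMMessage[]): number {
  const chars = messages.reduce(
    (sum, m) => sum + (typeof m.content === "string" ? m.content.length : 0),
    0,
  );
  return Math.ceil(chars / 3.5);
}
```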

Flow

  1. Load messages — All user/assistant messages from discussion.messages
  2. Estimate tokens — chars / 3.5 heuristic (reasonable for Claude/GPT tokenizers)
  3. Guard check — Only compress if estimatedTokens > CONTEXT_TOKEN_BUDGET AND messages.length > RECENCY_WINDOW
  4. Map to CCE format — Each message gets { id, index, role, content } (required by CCE)
  5. Compress — Deterministic, sub-2ms, no network calls
  6. Map back — Convert CCE output to LLMMessage[] for the provider
  7. Fallback — On any error, use the uncompressed history (fail-open; see the sketch below)
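
A minimal sketch of these steps wired together, as they might look inside processDiscussion(). The helper name, the LLMMessage type, and the `messages` field on the compress() result are assumptions; the real integration lives in discussion-service.ts:

```ts
// Hypothetical helper wiring steps 2-7 together. The `messages` field on
// the compress() result is an assumption; check the library's typings.
function compressHistory(history: LLMMessage[]): LLMMessage[] {
  // Steps 2-3: estimate and guard.
  if (
    estimateTokens(history) <= CONTEXT_TOKEN_BUDGET ||
    history.length <= RECENCY_WINDOW
  ) {
    return history;
  }

  try {
    // Step 4: map to the CCE message shape ({ id, index, role, content }).
    const cceMessages: CCEMessage[] = history.map((m, i) => ({
      id: `msg-${i}`,
      index: i,
      role: m.role,
      content:
        typeof m.content === "string"
          ? m.content
          : m.content.map((b) => ("text" in b ? b.text : "")).join(""),
    }));

    // Step 5: deterministic compression, no network calls.
    const result = compress(cceMessages, {
      tokenBudget: CONTEXT_TOKEN_BUDGET,
      minRecencyWindow: RECENCY_WINDOW,
      dedup: true,
      fuzzyDedup: true,
      forceConverge: true, // full option reference in the next section
    });

    // Step 6: map back to the provider's message shape.
    return result.messages.map((m) => ({ role: m.role, content: m.content }));
  } catch (err) {
    // Step 7: fail open; a long prompt beats a broken one.
    console.warn(`[discussion] Context compression failed, using full history: ${err}`);
    return history;
  }
}
```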

CCE Configuration

```ts
const result = compress(cceMessages, {
  tokenBudget: CONTEXT_TOKEN_BUDGET, // 80k token target
  minRecencyWindow: RECENCY_WINDOW,  // Last 5 messages always preserved
  dedup: true,                       // Remove exact duplicate content
  fuzzyDedup: true,                  // Remove near-duplicate content
  forceConverge: true,               // Guarantee output fits budget
});
```

| Option | Value | Purpose |
| --- | --- | --- |
| tokenBudget | 80,000 | Target token count for compressed history |
| minRecencyWindow | 5 | Recent messages kept verbatim (uncompressed) |
| dedup | true | Removes messages with identical content |
| fuzzyDedup | true | Removes messages with near-identical content |
| forceConverge | true | Guarantees output stays within budget (aggressive trim) |

Content Type Handling

LLMMessage.content can be string | Array<TextBlock | ImageBlock>. CCE expects plain strings, so a conversion step is required:

```ts
content: typeof m.content === "string"
  ? m.content
  : m.content.map((b) => ("text" in b ? b.text : "")).join("")
```
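
A self-contained version of the same conversion, with hypothetical block shapes (the real TextBlock/ImageBlock definitions live alongside LLMMessage in the provider types):

```ts
// Hypothetical block shapes for illustration only.
type TextBlock = { type: "text"; text: string };
type ImageBlock = { type: "image"; data: string };

// Flatten multimodal content into the plain string CCE expects.
// Image blocks carry no text, so they are silently dropped (see Limitations).
function flattenContent(content: string | Array<TextBlock | ImageBlock>): string {
  if (typeof content === "string") return content;
  return content.map((b) => ("text" in b ? b.text : "")).join("");
}
```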

Logging

Successful compression logs a summary line:

[discussion] Context compressed: 22 msgs → 14 msgs (2.3x token ratio, char ratio 2.1x, 8 compressed, 14 preserved)

On failure:

[discussion] Context compression failed, using full history: <error>
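
The message counts and ratios in the summary line can be derived from the inputs and outputs alone, without assuming any stats fields on the compress() result. A sketch, assuming history (input) and compressed (output) arrays from the helper above:

```ts
// Assemble a partial summary from the arrays themselves, so no particular
// stats fields on the compress() result need to be assumed.
const tokenRatio = estimateTokens(history) / estimateTokens(compressed);
console.log(
  `[discussion] Context compressed: ${history.length} msgs → ` +
    `${compressed.length} msgs (${tokenRatio.toFixed(1)}x token ratio)`,
);
```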

Complementary Measures

CCE is part of a broader token management strategy:

| Measure | File | Purpose |
| --- | --- | --- |
| CCE compression | discussion-service.ts | Reduce prompt tokens in long conversations |
| Truncation warning | discussion-service.ts | Detect when completion hits maxTokens limit |
| Token counter UI | discussions/[id]/page.tsx | Show prompt/completion tokens + cost per msg |
| Step digest | step-digest.ts | Compress workflow step outputs for downstream |

Limitations

  • Text-only: Image blocks in multimodal messages are dropped during compression (text extracted, images discarded)
  • No semantic awareness: Compression is purely structural (dedup, fuzzy match) — it does not summarize or rephrase content
  • Token estimation: Uses chars / 3.5 heuristic, not a real tokenizer — actual token counts may vary ±15%
  • Single codepath: Only applied in processDiscussion() — workflow engine uses step digests instead

Implementation Pain Points

These issues were encountered during integration and are documented to prevent re-investigation:

| Attempted | Error | Resolution |
| --- | --- | --- |
| importanceScoring option | Not in CompressOptions type | Removed; fuzzyDedup achieves similar intent |
| CCEMessage without id/index | Required-fields validation error | Map with id: `msg-${i}`, index: i |
| Direct as { prop } cast on content | TS2339 on union type | Extract via typeof check + .map() join |

Future Considerations

  • Escalating Summarizer (E-07): Use Haiku to generate semantic summaries of compressed message groups, improving context quality beyond structural dedup
  • Per-provider budgets: Different providers have different context windows — budget could be adjusted dynamically based on modelOverride
  • Compression metrics: Expose compression ratio to Prometheus via a nyxcore_discussion_compression_ratio gauge (see the sketch below)
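
A minimal sketch of that metrics wiring with prom-client; only the gauge name comes from this document, everything else (including where the ratio is computed) is assumed:

```ts
import { Gauge } from "prom-client";

// Hypothetical metric wiring; only the gauge name comes from this doc.
const compressionRatio = new Gauge({
  name: "nyxcore_discussion_compression_ratio",
  help: "Estimated token ratio (before/after) of the last context compression",
});

// After a successful compress() call:
compressionRatio.set(estimateTokens(history) / estimateTokens(compressed));
```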