Context Compression


26. Context Compression Engine (CCE)

The Context Compression Engine prevents token overflow in long-running discussions by deterministically compressing older messages before they are sent to the LLM. It uses the open-source context-compression-engine library — a zero-dependency TypeScript package that runs in sub-2ms without any LLM calls.

Problem

In nyxCore discussions, every user and assistant message is stored in the database and sent as full conversation history with each new LLM call. Token counts grow monotonically:

| Message # | Estimated Prompt Tokens |
| --- | --- |
| 1 | ~2k |
| 5 | ~20k |
| 10 | ~45k |
| 15 | ~80k |
| 20+ | ~99k+ (overflow) |

Without compression, long discussions hit provider context limits, causing truncated responses or outright failures. The DISCUSSION_MAX_TOKENS (16384) completion limit compounds the problem: once the prompt consumes most of the context window, fewer tokens remain than the completion limit allows, and the model runs out of output space mid-response.
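
To make the compounding concrete, here is a back-of-envelope check. The ~100k context window is an assumption consistent with the overflow point in the table above, not a documented provider limit:

```ts
// Back-of-envelope check; the ~100k context window is an assumption
// drawn from the overflow point in the table above.
const contextWindow = 100_000;
const promptTokens = 85_000;            // a long discussion, around message 15-20
const completionBudget = 16_384;        // DISCUSSION_MAX_TOKENS
const remaining = contextWindow - promptTokens;  // 15_000
console.log(remaining < completionBudget);       // true → the completion gets truncated
```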

Solution

CCE sits between message retrieval and LLM invocation. It compresses older messages using deterministic text-level strategies (deduplication, fuzzy matching, convergence) while preserving recent messages verbatim.

```mermaid
flowchart LR
  DB[(Database)] -->|all messages| EST{Estimate tokens}
  EST -->|< 80k| LLM[Send to LLM]
  EST -->|> 80k| CCE[compress]
  CCE -->|compressed history| LLM
```
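
As a toy illustration of the dedup strategy, using the compress() call documented under CCE Configuration below (nothing is read off the result here, so no assumptions about its shape are needed):

```ts
import { compress } from "context-compression-engine";

// Two user messages with identical content; dedup should drop one of them.
const toy = compress(
  [
    { id: "a", index: 0, role: "user", content: "the deploy failed again" },
    { id: "b", index: 1, role: "user", content: "the deploy failed again" }, // exact duplicate
    { id: "c", index: 2, role: "assistant", content: "Checking the logs now." },
  ],
  { tokenBudget: 1_000, minRecencyWindow: 1, dedup: true, fuzzyDedup: true, forceConverge: true },
);
console.log(toy); // expect the duplicate user message to be gone
```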

Integration Point

File: src/server/services/discussion-service.ts

The compression is applied inside processDiscussion(), after loading all messages from the database and before passing them to the streaming functions:

```ts
import { compress } from "context-compression-engine";
import type { Message as CCEMessage } from "context-compression-engine";

const CONTEXT_TOKEN_BUDGET = 80_000; // compress once the estimated prompt exceeds this
const RECENCY_WINDOW = 5;            // the most recent messages are always kept verbatim
```
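
The chars / 3.5 heuristic used in step 2 of the flow below can be sketched as follows. The LLMMessage type is assumed to come from the provider layer, and multimodal content is ignored for brevity:

```ts
// Sketch of the chars / 3.5 estimate (step 2 below); not a real tokenizer.
// String content only, for brevity; see Content Type Handling for the
// multimodal conversion.
function estimateTokens(messages: LLMMessage[]): number {
  const chars = messages.reduce(
    (sum, m) => sum + (typeof m.content === "string" ? m.content.length : 0),
    0,
  );
  return Math.ceil(chars / 3.5);
}
```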

Flow

  1. Load messages — All user/assistant messages from discussion.messages
  2. Estimate tokens — chars / 3.5 heuristic (reasonable for Claude/GPT tokenizers)
  3. Guard check — Only compress if estimatedTokens > CONTEXT_TOKEN_BUDGET AND messages.length > RECENCY_WINDOW
  4. Map to CCE format — Each message gets { id, index, role, content } (required by CCE)
  5. Compress — Deterministic, sub-2ms, no network calls
  6. Map back — Convert CCE output to LLMMessage[] for the provider
  7. Fallback — On any error, use the uncompressed history (fail-open; see the sketch below)
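
A minimal sketch of these steps wired together, as they might look inside processDiscussion(). The helper name, the LLMMessage type, and the `messages` field on the compress() result are assumptions; the real integration lives in discussion-service.ts:

```ts
// Hypothetical helper wiring steps 2-7 together. The `messages` field on
// the compress() result is an assumption; check the library's typings.
function compressHistory(history: LLMMessage[]): LLMMessage[] {
  // Steps 2-3: estimate and guard.
  if (
    estimateTokens(history) <= CONTEXT_TOKEN_BUDGET ||
    history.length <= RECENCY_WINDOW
  ) {
    return history;
  }

  try {
    // Step 4: map to the CCE message shape ({ id, index, role, content }).
    const cceMessages: CCEMessage[] = history.map((m, i) => ({
      id: `msg-${i}`,
      index: i,
      role: m.role,
      content:
        typeof m.content === "string"
          ? m.content
          : m.content.map((b) => ("text" in b ? b.text : "")).join(""),
    }));

    // Step 5: deterministic compression, no network calls.
    const result = compress(cceMessages, {
      tokenBudget: CONTEXT_TOKEN_BUDGET,
      minRecencyWindow: RECENCY_WINDOW,
      dedup: true,
      fuzzyDedup: true,
      forceConverge: true, // full option reference in the next section
    });

    // Step 6: map back to the provider's message shape.
    return result.messages.map((m) => ({ role: m.role, content: m.content }));
  } catch (err) {
    // Step 7: fail open; a long prompt beats a broken one.
    console.warn(`[discussion] Context compression failed, using full history: ${err}`);
    return history;
  }
}
```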

CCE Configuration

```ts
const result = compress(cceMessages, {
  tokenBudget: CONTEXT_TOKEN_BUDGET, // 80k token target
  minRecencyWindow: RECENCY_WINDOW,  // Last 5 messages always preserved
  dedup: true,                       // Remove exact duplicate content
  fuzzyDedup: true,                  // Remove near-duplicate content
  forceConverge: true,               // Guarantee output fits budget
});
```

| Option | Value | Purpose |
| --- | --- | --- |
| tokenBudget | 80,000 | Target token count for compressed history |
| minRecencyWindow | 5 | Recent messages kept verbatim (uncompressed) |
| dedup | true | Removes messages with identical content |
| fuzzyDedup | true | Removes messages with near-identical content |
| forceConverge | true | Guarantees output stays within budget (aggressive trim) |

Content Type Handling

LLMMessage.content can be string | Array<TextBlock | ImageBlock>. CCE expects plain strings, so a conversion step is required:

```ts
content: typeof m.content === "string"
  ? m.content
  : m.content.map((b) => ("text" in b ? b.text : "")).join("")
```
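
A self-contained version of the same conversion, with hypothetical block shapes (the real TextBlock/ImageBlock definitions live alongside LLMMessage in the provider types):

```ts
// Hypothetical block shapes for illustration only.
type TextBlock = { type: "text"; text: string };
type ImageBlock = { type: "image"; data: string };

// Flatten multimodal content into the plain string CCE expects.
// Image blocks carry no text, so they are silently dropped (see Limitations).
function flattenContent(content: string | Array<TextBlock | ImageBlock>): string {
  if (typeof content === "string") return content;
  return content.map((b) => ("text" in b ? b.text : "")).join("");
}
```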

Logging

Successful compression logs a summary line:

[discussion] Context compressed: 22 msgs → 14 msgs (2.3x token ratio, char ratio 2.1x, 8 compressed, 14 preserved)

On failure:

[discussion] Context compression failed, using full history: <error>
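
The message counts and ratios in the summary line can be derived from the inputs and outputs alone, without assuming any stats fields on the compress() result. A sketch, assuming history (input) and compressed (output) arrays from the helper above:

```ts
// Assemble a partial summary from the arrays themselves, so no particular
// stats fields on the compress() result need to be assumed.
const tokenRatio = estimateTokens(history) / estimateTokens(compressed);
console.log(
  `[discussion] Context compressed: ${history.length} msgs → ` +
    `${compressed.length} msgs (${tokenRatio.toFixed(1)}x token ratio)`,
);
```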

Complementary Measures

CCE is part of a broader token management strategy:

| Measure | File | Purpose |
| --- | --- | --- |
| CCE compression | discussion-service.ts | Reduce prompt tokens in long conversations |
| Truncation warning | discussion-service.ts | Detect when completion hits maxTokens limit |
| Token counter UI | discussions/[id]/page.tsx | Show prompt/completion tokens + cost per msg |
| Step digest | step-digest.ts | Compress workflow step outputs for downstream |

Limitations

  • Text-only: Image blocks in multimodal messages are dropped during compression (text extracted, images discarded)
  • No semantic awareness: Compression is purely structural (dedup, fuzzy match) — it does not summarize or rephrase content
  • Token estimation: Uses chars / 3.5 heuristic, not a real tokenizer — actual token counts may vary ±15%
  • Single codepath: Only applied in processDiscussion() — workflow engine uses step digests instead

Implementation Pain Points

These issues were encountered during integration and are documented to prevent re-investigation:

| Attempted | Error | Resolution |
| --- | --- | --- |
| importanceScoring option | Not in CompressOptions type | Removed; fuzzyDedup achieves similar intent |
| CCEMessage without id/index | Required-fields validation error | Map with id: `msg-${i}`, index: i |
| Direct as { prop } cast on content | TS2339 on union type | Extract via typeof check + .map() join |

Future Considerations

  • Escalating Summarizer (E-07): Use Haiku to generate semantic summaries of compressed message groups, improving context quality beyond structural dedup
  • Per-provider budgets: Different providers have different context windows — budget could be adjusted dynamically based on modelOverride
  • Compression metrics: Expose compression ratio to Prometheus via a nyxcore_discussion_compression_ratio gauge (see the sketch below)
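
A minimal sketch of that metrics wiring with prom-client; only the gauge name comes from this document, everything else (including where the ratio is computed) is assumed:

```ts
import { Gauge } from "prom-client";

// Hypothetical metric wiring; only the gauge name comes from this doc.
const compressionRatio = new Gauge({
  name: "nyxcore_discussion_compression_ratio",
  help: "Estimated token ratio (before/after) of the last context compression",
});

// After a successful compress() call:
compressionRatio.set(estimateTokens(history) / estimateTokens(compressed));
```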