Context Compression
26. Context Compression Engine (CCE)
The Context Compression Engine prevents token overflow in long-running discussions by deterministically compressing older messages before they are sent to the LLM. It uses the open-source context-compression-engine library — a zero-dependency TypeScript package that runs in under 2 ms and makes no LLM calls.
Problem
In nyxCore discussions, every user and assistant message is stored in the database and sent as full conversation history with each new LLM call. Token counts grow monotonically:
| Message # | Estimated Prompt Tokens |
|---|---|
| 1 | ~2k |
| 5 | ~20k |
| 10 | ~45k |
| 15 | ~80k |
| 20+ | ~99k+ (overflow) |
Without compression, long discussions hit provider context limits, causing truncated responses or outright failures. The `DISCUSSION_MAX_TOKENS` (16384) completion limit compounds the problem: when the prompt consumes most of the context window, the model runs out of output space.
Solution
CCE sits between message retrieval and LLM invocation. It compresses older messages using deterministic text-level strategies (deduplication, fuzzy matching, convergence) while preserving recent messages verbatim.
Integration Point
File: `src/server/services/discussion-service.ts`
Compression is applied inside `processDiscussion()`, after loading all messages from the database and before passing them to the streaming functions:
```typescript
import { compress } from "context-compression-engine";
import type { Message as CCEMessage } from "context-compression-engine";

const CONTEXT_TOKEN_BUDGET = 80_000;
const RECENCY_WINDOW = 5;
```
Flow
- Load messages — All `user`/`assistant` messages from `discussion.messages`
- Estimate tokens — `chars / 3.5` heuristic (reasonable for Claude/GPT tokenizers)
- Guard check — Only compress if `estimatedTokens > CONTEXT_TOKEN_BUDGET` AND `messages.length > RECENCY_WINDOW`
- Map to CCE format — Each message gets `{ id, index, role, content }` (required by CCE)
- Compress — Deterministic, sub-2ms, no network calls
- Map back — Convert CCE output to `LLMMessage[]` for the provider
- Fallback — On any error, use uncompressed history (fail-open)
CCE Configuration
```typescript
const result = compress(cceMessages, {
  tokenBudget: CONTEXT_TOKEN_BUDGET, // 80k token target
  minRecencyWindow: RECENCY_WINDOW,  // Last 5 messages always preserved
  dedup: true,                       // Remove exact duplicate content
  fuzzyDedup: true,                  // Remove near-duplicate content
  forceConverge: true,               // Guarantee output fits budget
});
```
| Option | Value | Purpose |
|---|---|---|
| `tokenBudget` | `80_000` | Target token count for compressed history |
| `minRecencyWindow` | `5` | Recent messages kept verbatim (uncompressed) |
| `dedup` | `true` | Removes messages with identical content |
| `fuzzyDedup` | `true` | Removes messages with near-identical content |
| `forceConverge` | `true` | Guarantees output stays within budget (aggressive trim) |
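To make the `dedup: true` behavior concrete, here is a minimal sketch of an exact-duplicate filter — an illustration of the idea, not CCE's actual implementation:

```typescript
interface Msg {
  id: string;
  index: number;
  role: "user" | "assistant";
  content: string;
}

// Keep the first occurrence of each distinct content string, drop exact repeats.
function dedupExact(messages: Msg[]): Msg[] {
  const seen = new Set<string>();
  return messages.filter((m) => {
    const key = m.content.trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

`fuzzyDedup` extends this idea to near-identical content (e.g. by comparing normalized or similarity-hashed text rather than exact strings).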
Content Type Handling
`LLMMessage.content` can be `string | Array<TextBlock | ImageBlock>`. CCE expects plain strings, so a conversion step is required:

```typescript
content: typeof m.content === "string"
  ? m.content
  : m.content.map((b) => ("text" in b ? b.text : "")).join("")
```
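Wrapped as a standalone helper with local stand-in types (the `TextBlock`/`ImageBlock` field layouts here are assumptions for illustration, not the real `LLMMessage` definitions):

```typescript
type TextBlock = { type: "text"; text: string };
type ImageBlock = { type: "image"; url: string };
type Content = string | Array<TextBlock | ImageBlock>;

// Flatten multimodal content to plain text. Image blocks carry no "text"
// field, so they are dropped — matching the text-only limitation noted below.
function toPlainText(content: Content): string {
  return typeof content === "string"
    ? content
    : content.map((b) => ("text" in b ? b.text : "")).join("");
}
```

The `"text" in b` check narrows the union at the type level, which avoids the TS2339 error a direct cast produces (see Implementation Pain Points).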
Logging
Successful compression logs a summary line:

```
[discussion] Context compressed: 22 msgs → 14 msgs (2.3x token ratio, char ratio 2.1x, 8 compressed, 14 preserved)
```

On failure:

```
[discussion] Context compression failed, using full history: <error>
```
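The ratios in the summary line can be derived directly from the before/after histories; `compressionSummary` is a hypothetical helper name used for illustration:

```typescript
interface Msg {
  content: string;
}

// Build a log line comparing history size before and after compression.
function compressionSummary(before: Msg[], after: Msg[]): string {
  const chars = (ms: Msg[]) => ms.reduce((n, m) => n + m.content.length, 0);
  const charRatio = chars(before) / Math.max(1, chars(after));
  return (
    `[discussion] Context compressed: ${before.length} msgs → ` +
    `${after.length} msgs (char ratio ${charRatio.toFixed(1)}x)`
  );
}
```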
Complementary Measures
CCE is part of a broader token management strategy:
| Measure | File | Purpose |
|---|---|---|
| CCE compression | `discussion-service.ts` | Reduce prompt tokens in long conversations |
| Truncation warning | `discussion-service.ts` | Detect when completion hits `maxTokens` limit |
| Token counter UI | `discussions/[id]/page.tsx` | Show prompt/completion tokens + cost per msg |
| Step digest | `step-digest.ts` | Compress workflow step outputs for downstream |
Limitations
- Text-only: Image blocks in multimodal messages are dropped during compression (text extracted, images discarded)
- No semantic awareness: Compression is purely structural (dedup, fuzzy match) — it does not summarize or rephrase content
- Token estimation: Uses `chars / 3.5` heuristic, not a real tokenizer — actual token counts may vary ±15%
- Single codepath: Only applied in `processDiscussion()` — workflow engine uses step digests instead
Implementation Pain Points
These issues were encountered during integration and are documented to prevent re-investigation:
| Attempted | Error | Resolution |
|---|---|---|
| `importanceScoring` option | Not in `CompressOptions` type | Removed; `fuzzyDedup` achieves similar intent |
| `CCEMessage` without `id`/`index` | Required-fields validation error | Map with ``id: `msg-${i}`, index: i`` |
| Direct `as { prop }` cast on `content` | TS2339 on union type | Extract via `typeof` check + `.map()` join |
Future Considerations
- Escalating Summarizer (E-07): Use Haiku to generate semantic summaries of compressed message groups, improving context quality beyond structural dedup
- Per-provider budgets: Different providers have different context windows — budget could be adjusted dynamically based on `modelOverride`
- Compression metrics: Expose compression ratio to Prometheus via a `nyxcore_discussion_compression_ratio` gauge
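The per-provider budget idea could be as simple as a lookup keyed by `modelOverride`. The model names and window sizes below are placeholders, not authoritative limits:

```typescript
// Hypothetical per-model context budgets (illustrative values only).
const MODEL_BUDGETS: Record<string, number> = {
  "claude-sonnet": 150_000,
  "gpt-4o": 100_000,
};

const DEFAULT_BUDGET = 80_000;

// Pick a token budget for the compression pass, falling back to the
// current fixed 80k budget when no override matches.
function budgetFor(modelOverride?: string): number {
  if (modelOverride && MODEL_BUDGETS[modelOverride] !== undefined) {
    return MODEL_BUDGETS[modelOverride];
  }
  return DEFAULT_BUDGET;
}
```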
