Every Claude Code user hits it eventually. Your session is deep into a refactor, the context bar creeps toward the limit, and then — a hard pause. Claude Code runs /compact, summarizes the conversation, and you wait 15–30 seconds while it catches up. If you're running an autonomous agent overnight, that pause can break the task entirely.
Context Gateway from Compresr.ai (a YC-backed company) solves this by sitting between your agent and the LLM API, compressing history in the background before you hit the limit. When compaction triggers, the summary is already computed. No wait.
Here's how it works and how to set it up.
Why Claude Code's Context Limit Is a Real Problem
Claude Code, like most LLM-backed agents, operates within a fixed context window. Claude Sonnet 4.6 supports 200K tokens, but a long session with large files, multiple tool calls, and verbose output can approach that limit in a few hours of active work.
When the limit approaches, Claude Code runs /compact — it asks the model to summarize everything so far, then continues with a compressed history. The problem isn't accuracy (the summaries are usually fine). The problem is latency:
- You're mid-task, actively waiting
- Autonomous runs get interrupted, sometimes breaking multi-step workflows
- The compaction call is billed at full token rates for the full context
Context Gateway addresses all three.
What Context Gateway Does Differently
Context Gateway is a proxy. It intercepts every API call between your agent and the LLM, tracking conversation length in real time.
When conversation length crosses a configurable threshold (default: 75% of the context window), Context Gateway starts computing a summary in the background using a separate, cheaper summarizer model. By the time Claude Code would normally trigger /compact, the summary is already computed and cached.
The result: compaction still happens, but the wait is near-instant because the heavy lifting was already done.
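The pre-compaction pattern described above can be sketched in a few lines. This is an illustrative model of the idea, not Context Gateway's actual source: the class name, methods, and the stand-in summarizer are all hypothetical.

```python
import threading

class PreCompactor:
    """Sketch of background pre-compaction: once history crosses a
    threshold, summarize in a worker thread so compaction finds the
    result already cached."""

    def __init__(self, context_window=200_000, threshold=0.75):
        self.limit = int(context_window * threshold)  # e.g. 150K tokens
        self.cached_summary = None
        self._worker = None

    def on_message(self, history_tokens, history):
        # Called on every proxied API call; start summarizing early.
        if history_tokens >= self.limit and self._worker is None:
            self._worker = threading.Thread(
                target=self._summarize, args=(history,), daemon=True)
            self._worker.start()

    def _summarize(self, history):
        # Stand-in for a call to the cheap summarizer model.
        self.cached_summary = f"summary of {len(history)} messages"

    def compact(self, history):
        # At compaction time, return the precomputed summary if one
        # was started; otherwise fall back to the old slow path.
        if self._worker is not None:
            self._worker.join()
        return self.cached_summary or f"summary of {len(history)} messages"
```

The key property is that the expensive summarization call overlaps with normal work instead of blocking the session at compaction time.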
Additional capabilities:
- Instant /compact: Same Claude Code command, no wait
- History log: logs/history_compaction.jsonl records every compaction event
- Slack notifications: Optional alerts when compaction fires
- Custom threshold: Trigger compaction at 60%, 75%, or 90% — configurable per project
Install Context Gateway
Installation is a single curl command:
curl -fsSL https://compresr.ai/api/install | sh
After install, run the interactive setup wizard:
context-gateway
The TUI wizard walks you through:
- Agent selection: Choose claude_code, cursor, openclaw, or custom
- Summarizer model: Pick the model used for background compression (more on this below)
- API key: Key for the summarizer model's endpoint
- Compaction threshold: Percentage of context window that triggers pre-computation (default 75%)
- Slack integration: Optional webhook for compaction event notifications
The proxy runs locally and routes Claude Code's API calls through itself. No data leaves your machine to Compresr's servers — Context Gateway is self-hosted.
Choosing the Right Summarizer Model
This is the most important configuration decision. The summarizer model handles background compression, so you want something that's:
- Fast (low latency for background computation)
- Inexpensive (it runs on every compaction event)
- Good at summarization (not just text generation)
- OpenAI-compatible endpoint (Context Gateway uses the standard /chat/completions format)
Context Gateway accepts any OpenAI-compatible API as the summarizer. That includes ModelsLab's LLM endpoint, which gives you access to Llama 3.3 70B, Qwen 2.5 32B, and Mistral models — all via a single pay-as-you-go API without subscriptions.
To configure a ModelsLab model as your summarizer:
# In the TUI wizard, when prompted for summarizer API:
# Endpoint: https://modelslab.com/api/v6/llm/chat/completions
# Model: meta-llama/llama-3.3-70b-instruct (fast, good at summarization)
# Alternative: Qwen/Qwen2.5-32b-instruct (stronger reasoning)
# API key: your ModelsLab API key (docs.modelslab.com)
Llama 3.3 70B is the practical default here. It produces tight, accurate summaries of code conversations and runs significantly cheaper per token than Claude Sonnet. Since compaction summaries are a defined task with clear output expectations (not open-ended generation), a strong 70B model is sufficient for most projects.
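For reference, the summarizer call the gateway makes is just a standard OpenAI-compatible /chat/completions request. The sketch below builds such a request with Python's stdlib; the endpoint and model name are the ModelsLab values from the wizard example above, and the system prompt is illustrative, not Context Gateway's actual prompt.

```python
import json
import urllib.request

# ModelsLab's OpenAI-compatible endpoint (from the wizard example above).
ENDPOINT = "https://modelslab.com/api/v6/llm/chat/completions"

def build_summary_request(history_text, api_key,
                          model="meta-llama/llama-3.3-70b-instruct"):
    """Construct a chat-completions request asking a cheap model to
    summarize a coding session. Prompt wording is illustrative."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Summarize this coding session. Keep file paths, "
                        "decisions made, and open tasks."},
            {"role": "user", "content": history_text},
        ],
        "temperature": 0.2,  # summaries should stay close to deterministic
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
```

Any provider that speaks this format can be dropped in as the summarizer without other changes.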
Running Context Gateway With Claude Code
Once configured, start the proxy before opening Claude Code:
# Start Context Gateway proxy
context-gateway start
# In a separate terminal, open Claude Code
claude
Context Gateway intercepts at the API layer — Claude Code doesn't know it's there, and you don't change anything about how you work. The only visible difference: /compact is instant.
To verify it's working, check the log:
tail -f logs/history_compaction.jsonl
Each compaction event writes a JSON line with timestamp, token counts before/after, and the model used for summarization.
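A quick way to read those events is to parse each JSON line and compute the compression ratio. The field names below are illustrative, not a documented schema; check your own logs/history_compaction.jsonl for the exact keys.

```python
import json

# Hypothetical log line in the shape described above (timestamp,
# token counts before/after, summarizer model).
line = ('{"ts": "2025-01-10T03:12:44Z", "tokens_before": 151200, '
        '"tokens_after": 9800, "model": "meta-llama/llama-3.3-70b-instruct"}')

event = json.loads(line)
ratio = event["tokens_before"] / event["tokens_after"]
print(f"compressed {event['tokens_before']} -> {event['tokens_after']} "
      f"tokens ({ratio:.0f}x) using {event['model']}")
```

Piping the whole file through a loop like this gives a quick view of how aggressively your sessions are being compressed over time.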
Setting the Threshold for Your Workflow
The 75% default works for most workflows, but there are cases where you'd tune it:
| Workflow | Recommended Threshold | Reason |
|---|---|---|
| Interactive development | 75% (default) | Leaves headroom for manual work after compaction fires |
| Overnight autonomous runs | 60% | More aggressive — reduces risk of hitting limit mid-task |
| Short sessions (<2h) | 85% | Fewer compactions for shorter contexts |
| Very large files (>50K tokens/file) | 60% | File reads spike context fast; pre-compute early |
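To make the percentages in the table concrete, here is the threshold arithmetic on a 200K-token window:

```python
# Threshold percentages converted to absolute token counts on a
# 200K-token context window.
CONTEXT_WINDOW = 200_000

for pct in (0.60, 0.75, 0.85):
    trigger = int(CONTEXT_WINDOW * pct)
    print(f"{pct:.0%} threshold -> pre-compute at {trigger:,} tokens")
```

So a 60% threshold starts background summarization a full 50K tokens earlier than the 85% setting, which is the margin that matters for overnight runs.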
What It Doesn't Fix
Context Gateway is good at eliminating the wait, but it doesn't expand the context window. If you're working on a codebase so large that even the compressed summary exceeds available context, you'll need a different approach:
- Retrieval-augmented context: Use CLAUDE.md with explicit pointers to relevant files so Claude doesn't need to hold everything in context
- Per-feature branches: Isolate large changes to smaller branches so sessions stay shorter
- MCP file references: Route large file reads through MCP tools that return targeted excerpts instead of full file content
For most developers, those situations are edge cases. The typical Claude Code bottleneck is the compaction wait, and Context Gateway eliminates it.
The Bigger Picture: API Layer Optimization
Context Gateway is one of several tools emerging in the "AI agent infrastructure" layer — proxies, caches, and routers that sit between coding agents and LLM APIs to reduce cost and latency without changing agent behavior.
This pattern matters for teams running AI agents at scale:
- Background summarization with a cheap model cuts per-session cost
- Caching repeated tool outputs reduces redundant API calls
- Routing different task types to different models (cheap for boilerplate, strong for architecture) compounds the savings
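The routing idea in the last bullet reduces to a small lookup: map task types to models and fall back to a strong default. This is a minimal sketch with placeholder model names, not any particular router's implementation.

```python
# Illustrative task-type routing: cheap model for background and
# boilerplate work, stronger (placeholder-named) model for the rest.
ROUTES = {
    "summarize": "meta-llama/llama-3.3-70b-instruct",
    "boilerplate": "meta-llama/llama-3.3-70b-instruct",
    "architecture": "strong-model",  # placeholder for your frontier model
}

def pick_model(task_type, default="strong-model"):
    """Route a task to a model, defaulting to the strong one."""
    return ROUTES.get(task_type, default)
```

In practice a router like this sits in the same proxy layer as Context Gateway, so the agent never needs to know which model served each call.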
ModelsLab's LLM API is designed to fit into exactly these patterns — OpenAI-compatible endpoint, access to 100+ models under one API key, pay-as-you-go so you're not paying subscription minimums for background jobs. Full documentation at docs.modelslab.com.
If you're running Claude Code on a team or running autonomous overnight tasks, Context Gateway is worth the 5-minute setup.