Cut Claude Code's input tokens by rendering bulky context as images — the same system prompt, tool docs, and history, in a fraction of the tokens.
An image's token cost is fixed by its pixel dimensions, not by how much text
is inside it. Dense content (code, JSON, tool output) packs ~3.1 chars per
image-token vs ~1 char per text-token on real Claude Code traffic. pxpipe is
a local proxy that exploits the gap: it rewrites the bulky parts of each
request into compact PNGs before it leaves your machine. At current Fable
list prices that lands as a ~59–70% lower end-to-end bill — but prices
move and workloads differ, so the durable number is the token cut itself,
measured per-request against a free count_tokens counterfactual in
~/.pxpipe/events.jsonl.
This is what the model sees instead of text:
~48k chars of system prompt + tool docs: ≈25k tokens as text, ≈2.7k image tokens as this page. Real pipeline output; the model reads renders like this at 100/100 (see benchmarks).
Fable 5 (the default, 100/100 reader) — plain left, pxpipe right:
Fable-AB-Demo.mp4
pxpipe counts an exact token 10/10 across 39 imaged filler files
(matches grep line-for-line), gets the multi-step ledger arithmetic right,
and ends the session at $6.06 with context to spare (73.5k/1M) vs
$42.21 at 96% full. One caveat visible in the clip: the pxpipe arm
needed a nudge to match the requested one-line output format.
Opus 4.8 (disabled by default) — same layout:
Opus-AB-Demo.mp4
Text needles read fine on both arms; the imaged phrase-count doesn't read on Opus — and pxpipe says so instead of fabricating a number. That misread rate is why Opus is opt-in.
npx pxpipe-proxy                                   # proxy on 127.0.0.1:47821
ANTHROPIC_BASE_URL=http://127.0.0.1:47821 claude   # point Claude Code at it

Dashboard at http://127.0.0.1:47821/: tokens saved, every text→image conversion side by side, kill switch, live model chips. Responses stream normally: pxpipe compresses the request only, never the model's output. Recent turns stay text; the system prompt, tool docs, and older bulk history are imaged.
- It is lossy. Exact 12-char hex strings in dense imaged content: 13/15 on Fable 5, 0/15 on Opus — and misses are silent confabulations, not errors. Byte-exact values (IDs, hashes, secrets) must stay text; recent turns do. A dedicated verbatim-risk guard is not built yet.
- Escape hatch: subagents on non-allowlisted models pass through as text, so route byte-exact work there (CLAUDE_CODE_SUBAGENT_MODEL=claude-sonnet-4-6, or model: sonnet in agent frontmatter).
- Real work: SWE-bench Lite pilot 10/10 both arms at −65% request size; SWE-bench Pro 14/19 ON vs 15/19 OFF at −60%, verdicts agree 18/19, and the single split re-resolved 3/3 on replication, which points to run-to-run variance, not compression. Small n; receipts in eval/.
- Workload-dependent. Wins on token-dense content (~1 char/token), loses money on sparse prose (~3.5 chars/token); a profitability gate (calibrated on N=391 production rows) images only where the math wins.
- Model scope: default PXPIPE_MODELS=claude-fable-5,gpt-5.6. Opus 4.7/4.8 misread ~7% of renders and GPT 5.5 degrades on imaged context, so both are opt-in via PXPIPE_MODELS or the dashboard chips. PXPIPE_MODELS=off disables imaging. Everything else passes through byte-identical. On the GPT path, tool definitions stay native JSON and no Anthropic cache_control markers are used.
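The model-scope rule above can be sketched as a small allowlist check. This is an illustration only, not pxpipe's actual parser: the function names, the exact-match comparison, and the handling of an empty value are assumptions; only the default list and the off value come from the text above.

```typescript
// Default scope quoted in this README.
const DEFAULT_MODELS: string[] = ["claude-fable-5", "gpt-5.6"];

// Hypothetical parser for a PXPIPE_MODELS-style value.
// Assumptions: unset or empty falls back to the default; "off" disables imaging.
function parseAllowlist(raw: string | undefined): string[] {
  if (raw === undefined || raw.trim() === "") return [...DEFAULT_MODELS];
  if (raw.trim().toLowerCase() === "off") return [];
  return raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Assumption: exact string match. The real proxy may normalize model IDs differently.
function shouldImage(model: string, allowlist: string[]): boolean {
  return allowlist.includes(model);
}
```

In a Node process the input would typically be process.env.PXPIPE_MODELS; any model for which shouldImage returns false would be forwarded untouched.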
Measured with novel random-number problems the model cannot have memorized:
| test | N | text | pxpipe (image) | tokens |
|---|---|---|---|---|
| novel arithmetic, claude-fable-5 | 100 | 100% | 100% | −38% |
| novel arithmetic, claude-opus-4-8 | 100 | 100% | 93% | −38% |
| gist recall A/B (decisions, values, paths, names, negations; with distractors; 15k-45k char sessions), Fable 5 | 98/arm | 98/98 | 98/98 | - |
| state tracking (value mutated 3x, final/first/count), Fable 5 | 18/arm | 18/18 | 18/18 | - |
| confabulation on never-stated facts (lower is better), Fable 5 | 16/arm | 0/16 | 0/16 | - |
| verbatim 12-char hex recall, dense render, Opus | 15 | 15/15 | 0/15 | - |
| verbatim 12-char hex recall, dense render, Fable 5 | 15 | - | 13/15 | - |
SWE-bench run totals, receipts, and caveats:
eval/swe-bench/ ·
eval/swe-bench-pro/ ·
eval/needle-haystack/ ·
eval/gist-recall/ · analysis in
FINDINGS.md. (GSM8K scored 96% imaged, but it's in training
data — memorized answers survive misreads — so we lead with the novel-number
evals.)
tool_result string ──► wrap at 1928px-wide columns ──► pack ~92,000 chars/page ──► PNG[]
The proxy intercepts /v1/messages, rewrites eligible bulk into image
blocks, splices them back cache-friendly (static prefix preserved, prompt
caching keeps working), and forwards. A 1928×1928 image costs ≈4,761 vision
tokens and holds ≈92,000 chars, so text wins only above ~19 chars/token —
Claude Code traffic runs ~1.91 (N=391). A per-request estimator decides;
sparse prose stays text. Events log to ~/.pxpipe/events.jsonl.
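The break-even arithmetic above can be checked with a short sketch. It is not the shipped estimator: the decide function and its fields are hypothetical, and it makes the pessimistic simplifying assumption that every page is billed at the full 1928×1928 cost, even when only partly filled.

```typescript
// Figures quoted in this README, treated as constants for illustration.
const PAGE_VISION_TOKENS = 4761; // cost of one 1928x1928 page
const PAGE_CAPACITY_CHARS = 92_000; // chars that fit on one page

// Chars carried per image token when a page is full: the break-even density.
const breakEvenCharsPerToken = PAGE_CAPACITY_CHARS / PAGE_VISION_TOKENS;

interface GateDecision {
  pages: number;
  imageTokens: number;
  textTokens: number;
  image: boolean;
}

// Hypothetical gate: image only when the rendered pages cost fewer tokens
// than sending the same content as text.
// Assumption: each page costs the full PAGE_VISION_TOKENS.
function decide(chars: number, textTokens: number): GateDecision {
  const pages = Math.max(1, Math.ceil(chars / PAGE_CAPACITY_CHARS));
  const imageTokens = pages * PAGE_VISION_TOKENS;
  return { pages, imageTokens, textTokens, image: imageTokens < textTokens };
}

// Dense content at about 1.91 chars/token clears the gate easily.
const dense = decide(92_000, Math.round(92_000 / 1.91));
// A short, sparse block (about 3.5 chars/token) does not under this model.
const sparse = decide(6_000, Math.round(6_000 / 3.5));
```

Under these assumptions breakEvenCharsPerToken comes out near 19.3, matching the ~19 figure in the text; the real estimator also has to account for partial pages and caching, which this sketch ignores.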
import { renderTextToPngs, transformAnthropicMessages } from "pxpipe";

const imgs = await renderTextToPngs(toolResultText); // RenderedImage[]
const { body, applied, info } = await transformAnthropicMessages({
  body: requestBytes,
  model: "claude-fable-5",
});

options.keepSharp(block) pins blocks as text; options.emitRecoverable returns the originals of imaged blocks. Pure-JS runtime (Node and edge/Workers); @napi-rs/canvas is build-time only. Full API: src/core/index.ts.
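Since byte-exact values must stay text, one plausible use of keepSharp is a predicate that pins any block containing a long hex-like identifier. The sketch below is hypothetical: the TextBlock shape and the 12-character threshold are assumptions made for illustration, and the real block type and option wiring are defined in src/core/index.ts.

```typescript
// Hypothetical block shape; the library's real type may differ.
interface TextBlock {
  type: string;
  text?: string;
}

// Matches runs of 12 or more hex characters (SHAs, hashes, long numeric IDs).
// No global flag, so repeated test() calls are stateless.
const HEX_ID = /\b[0-9a-f]{12,}\b/i;

// Returns true when the block should be pinned as text instead of imaged.
function keepSharp(block: TextBlock): boolean {
  return typeof block.text === "string" && HEX_ID.test(block.text);
}
```

A predicate like this is deliberately over-inclusive: pinning a block that did not need it only costs tokens, while imaging a hash risks a silent misread.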
pnpm install && pnpm test
pnpm run build   # regenerates dist/

Is the headline end-to-end, or only on the requests you touched?
End-to-end, the whole bill. Most compression tools report savings only on the input slice they touched, which flatters the number. The end-to-end denominator is every production request: the small ones pxpipe correctly left untouched, all cache writes and reads, and all output tokens (which the proxy never compresses). On a 13,709-request snapshot that was 59% ($100 → ~$41); a later 8,904-compressed-request trace measured ~70%. Compressed-only runs higher (~72–74%) and is quoted separately, never as the headline. The exact figure is workload-dependent, so reproduce it on your own log.
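The difference between the two denominators is easy to see on a toy log. The rows below are invented for illustration, not measured data, and the Row fields are hypothetical names rather than the schema of events.jsonl.

```typescript
// Hypothetical per-request row: cost without pxpipe, cost actually billed.
interface Row {
  baselineCost: number;
  billedCost: number;
  compressed: boolean;
}

// Fraction saved over a set of rows: 1 - billed / baseline.
function saving(rows: Row[]): number {
  const base = rows.reduce((sum, r) => sum + r.baselineCost, 0);
  const billed = rows.reduce((sum, r) => sum + r.billedCost, 0);
  return base === 0 ? 0 : 1 - billed / base;
}

// Illustrative numbers only: one compressed request, one left untouched.
const rows: Row[] = [
  { baselineCost: 80, billedCost: 24, compressed: true },
  { baselineCost: 20, billedCost: 20, compressed: false },
];

const endToEnd = saving(rows); // every request in the denominator
const compressedOnly = saving(rows.filter((r) => r.compressed)); // flattering subset
```

On this toy log the compressed-only figure is about 0.70 while the end-to-end figure is about 0.56, which is why untouched requests belong in the headline denominator.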
How is the math measured?
Both sides of the same request, at the same moment. For every /v1/messages
POST the proxy fires a free count_tokens probe on the original uncompressed
body (the counterfactual) in parallel with the real forward, and reads
Anthropic's actually-billed usage block off the response. Both land in the
same row of ~/.pxpipe/events.jsonl, so there is no turn-count or
run-to-run confound. Dollar conversion uses Fable 5 list ratios: input ×1.0,
cache write ×1.25, cache read ×0.1, output ×5. Cache pricing is applied
identically to both sides, so the caching discount cancels and cannot be
double-counted as "savings". Re-derive it yourself from the events log: the
formula and field names are documented in src/core/baseline.ts.
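As a rough sketch of the dollar conversion described above, the list ratios can be applied as weights to both sides of a row. The Usage field names and the savedFraction helper are assumptions for illustration; the authoritative formula and field names are the ones in src/core/baseline.ts.

```typescript
// Hypothetical token buckets for one side of a request.
interface Usage {
  input: number; // uncached input tokens
  cacheWrite: number; // tokens written to the prompt cache
  cacheRead: number; // tokens read from the prompt cache
  output: number; // output tokens (never compressed)
}

// Fable 5 list ratios quoted above, relative to one input token.
const WEIGHTS: Usage = { input: 1.0, cacheWrite: 1.25, cacheRead: 0.1, output: 5 };

// Cost in "input-token equivalents"; multiply by the input price for dollars.
function costUnits(u: Usage): number {
  return (
    u.input * WEIGHTS.input +
    u.cacheWrite * WEIGHTS.cacheWrite +
    u.cacheRead * WEIGHTS.cacheRead +
    u.output * WEIGHTS.output
  );
}

// Fraction saved on one row: counterfactual vs actually billed.
// Assumption: both sides are already split into the same four buckets.
function savedFraction(counterfactual: Usage, billed: Usage): number {
  const base = costUnits(counterfactual);
  return base === 0 ? 0 : 1 - costUnits(billed) / base;
}
```

Because output tokens carry the largest weight and are identical on both sides, they pull the per-row saving well below the raw input-token cut.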
What does it actually compress? Three kinds of input blocks, each behind a profitability gate:
- large tool_result bodies (file reads, command output, logs) above ~6k chars of token-dense content
- older collapsed history: turns behind the live tail get re-rendered as image pages; recent turns always stay text
- the static system prompt + tool docs slab
Everything else passes through byte-identical: your messages, recent turns,
the model's output (it is the response, the proxy never touches it), sparse
prose, and anything too small to win. Models outside the allowlist pass
through entirely — the default scope is Fable 5 and GPT 5.6 only. Opus 4.8
and GPT 5.5 read imaged content measurably worse (FINDINGS.md 2026-06-16),
so they are deliberately opt-in via the dashboard or PXPIPE_MODELS, never
silently imaged.
Has it ever failed for real, outside the benchmarks? Yes, once in weeks of daily use: the model recalled a person's name from imaged chat history and got it confidently wrong. No error, just a plausible wrong name. That is the documented failure mode: exact strings in imaged content are not byte-safe. Coding sessions tolerate this because the agent re-reads files before editing; pure chat recall has no such check. This failure mode is measured, not anecdotal: the legibility audit quantifies exact-string recall off rendered pages (blind reads top out at 63% on dense identifiers, with every miss predicted by a glyph-confusability matrix) and documents the shipped mitigations — page geometry clamped to the API's resample cap so billed pixels actually reach the vision encoder, and exact identifiers (SHAs, numbers) riding alongside as text.
Why does the README read like an AI wrote it? Because one did. Most of this repo's commits — the code and the docs — were authored by Opus/Fable agent sessions running behind pxpipe itself, reading their own collapsed history as image pages while they worked.
- Lossy (above); verbatim recall from images is unreliable.
- PNG encoding adds latency to large requests before they leave.
- ASCII/Latin-1 is well tested; CJK works, but is handled conservatively.
Hypotheses, not claims — they ship as numbers with an n or they get cut:
sharper glyph rendering (eval/glyph-matrix/, paused mid-run), whether
imaged bulk stretches effective context (~2x the real content in the same
1M window), and whether a smaller active context improves long-task
accuracy.
MIT.


