Cache Mode (CAG)
Cache-Augmented Generation (CAG) embeds your full context directly in the system prompt and lets the provider’s prompt cache handle the rest. Subsequent queries over the same context pay a fraction of the input-token cost, up to 90% off on Google and Anthropic.

Use cache mode when your context fits inside the provider’s window and you plan to ask multiple questions against it. For corpora that exceed the window, stay on default RLM mode (programmatic navigation).
CAG vs default RLM
| Mode | How it works | Best for |
|---|---|---|
| Default RLM | Context loaded into a Python REPL context variable; LLM writes code to navigate it programmatically | Very large codebases, exploratory analysis, unknown questions |
| Cache (CAG) | Full context embedded in the system prompt and cached at the provider | Repeated questions on the same docs, study sessions, batch Q&A |
Enable cache mode with the `--cache` flag on any query, or set `cache.enabled: true` in `.rlmx/rlmx.yaml`.
End-to-end example
Interrogate a documentation folder three times: the first query warms the cache, the next two ride it.

1. Estimate the cost
Output
2. Warm the cache
Output (stderr)
3. Run cached queries
Approximate cost per query
Each `--cache` invocation hashes the context; matching content hashes hit the cached system prompt instead of re-sending it.
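The cost pattern of this example (one full-price warmup query, then discounted queries that reuse the cache) can be approximated with a few lines of arithmetic. The sketch below is illustrative only: the function name, the 100,000-token context, and the $1.00 per million input tokens price are hypothetical placeholders, and the 90% discount is the upper bound quoted on this page rather than a guaranteed rate. It counts only the shared context tokens, not the question or output tokens.

```python
def cached_run_cost(context_tokens, n_queries, price_per_mtok, cached_discount=0.90):
    """Return (uncached_total, cached_total) input cost for the shared context.

    price_per_mtok is a hypothetical price per million input tokens;
    check your provider's current pricing before relying on the numbers.
    """
    if n_queries <= 0:
        return 0.0, 0.0
    full = context_tokens / 1_000_000 * price_per_mtok
    # Without caching, every query re-sends the context at full price.
    uncached_total = full * n_queries
    # With caching, the first query pays full price (it warms the cache);
    # each later query pays only the undiscounted remainder.
    cached_total = full + full * (1 - cached_discount) * (n_queries - 1)
    return uncached_total, cached_total


uncached, cached = cached_run_cost(
    context_tokens=100_000, n_queries=3, price_per_mtok=1.00
)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

With a single query the two totals are equal, which is why cache mode only pays off from the second question onward.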
How it works
When `--cache` is enabled, RLMX:
- Computes a SHA-256 content hash over sorted context items
- Builds a session ID: `{cache.session-prefix}-{hash}` (or just the hash)
- Writes the full context into the system prompt under a `## Context Files` block
- Sends the request with provider-specific cache headers (`cache_control` for Anthropic, explicit cached content for Gemini, etc.)
- Receives usage metrics from the provider, including `cache_read_tokens`, which are billed at the discount rate
storage.enabled is auto/always).
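The hashing and session-ID steps can be sketched as follows. This is a minimal illustration, not RLMX's actual implementation: the page does not specify how context items are serialized before hashing, so the path-to-text mapping, the length-prefixed encoding, and both function names are assumptions.

```python
import hashlib


def content_hash(items):
    """SHA-256 over context items in sorted order.

    `items` is a hypothetical mapping of file path -> file text.
    """
    h = hashlib.sha256()
    for path in sorted(items):
        # Length-prefix each field so ("ab", "c") and ("a", "bc")
        # cannot collide after concatenation.
        for field in (path, items[path]):
            data = field.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


def session_id(items, session_prefix=None):
    """Build '{prefix}-{hash}', or just the hash when no prefix is set."""
    digest = content_hash(items)
    return f"{session_prefix}-{digest}" if session_prefix else digest


docs = {"guide.md": "# Guide", "api.md": "# API"}
print(session_id(docs, session_prefix="myproject"))
```

Sorting before hashing means the same set of files yields the same session ID regardless of the order in which they were discovered, while any change to a path or its content produces a different hash and therefore a fresh cache entry.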
Provider support
| Provider | Cache limit | Discount on cached input | TTL behavior |
|---|---|---|---|
| Google Gemini | 1,000,000 tokens | ~90% | Implicit + explicit; retention: long maps to 1-hour TTL |
| Anthropic | 200,000 tokens | ~90% | Ephemeral (~5 min) or long-lived via cache_control |
| OpenAI | 128,000 tokens | ~50% | Automatic prompt caching, no TTL knob |
| Amazon Bedrock | 128,000 tokens | Provider-dependent | Inherits underlying model support |
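As a rough illustration of how the limits in this table translate into a go/no-go check, the sketch below computes utilization against a provider's cache limit. The dictionary keys and function names are hypothetical; only the token limits are taken from the table above, and the token count itself is assumed to come from elsewhere (for example, an estimate run).

```python
# Cache limits in tokens, copied from the provider table above.
# The provider keys are illustrative, not RLMX identifiers.
PROVIDER_CACHE_LIMITS = {
    "google": 1_000_000,
    "anthropic": 200_000,
    "openai": 128_000,
    "bedrock": 128_000,
}


def utilization(provider, context_tokens):
    """Fraction of the provider's cache limit used by the context."""
    return context_tokens / PROVIDER_CACHE_LIMITS[provider]


def fits_cache(provider, context_tokens):
    """True when the context is within the provider's cache limit."""
    return context_tokens <= PROVIDER_CACHE_LIMITS[provider]


for name in PROVIDER_CACHE_LIMITS:
    pct = utilization(name, 150_000) * 100
    print(f"{name}: {pct:.0f}% used, fits={fits_cache(name, 150_000)}")
```

A 150,000-token context fits comfortably on Google and Anthropic under these limits but exceeds the OpenAI and Bedrock figures, which is the situation where default RLM mode is the better choice.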
Configuration
Cache behavior lives under `cache:` in `.rlmx/rlmx.yaml`:
| Field | Description |
|---|---|
| `enabled` | Turn cache mode on by default for every rlmx invocation. CLI `--cache` overrides this per-run. |
| `retention` | `short` for ephemeral caches, `long` for extended TTL. Maps to provider-specific behavior. |
| `ttl` | Explicit TTL in seconds when the provider supports it. |
| `expire-time` | ISO 8601 timestamp for Google explicit caching. Mutually exclusive with `ttl`. |
| `session-prefix` | Namespace for the cache session ID; useful when multiple projects share a provider account. |
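The constraints in this table can be expressed as a small validation routine. The sketch below is a hypothetical helper, not part of RLMX: it assumes the `cache:` section has already been parsed into a plain dictionary, and it checks only the rules stated above (allowed `retention` values, `ttl` in seconds, `expire-time` as a timestamp, and mutual exclusivity of the two).

```python
from datetime import datetime


def validate_cache_config(cache):
    """Return a list of problems found in a parsed `cache:` mapping."""
    errors = []
    if "ttl" in cache and "expire-time" in cache:
        errors.append("ttl and expire-time are mutually exclusive")

    retention = cache.get("retention")
    if retention is not None and retention not in ("short", "long"):
        errors.append("retention must be 'short' or 'long'")

    ttl = cache.get("ttl")
    if ttl is not None:
        is_int = isinstance(ttl, int) and not isinstance(ttl, bool)
        if not is_int or ttl <= 0:
            errors.append("ttl must be a positive number of seconds")

    expire = cache.get("expire-time")
    if expire is not None:
        try:
            # fromisoformat accepts a subset of ISO 8601; normalize a
            # trailing 'Z' so older Python versions parse it too.
            datetime.fromisoformat(expire.replace("Z", "+00:00"))
        except (ValueError, AttributeError):
            errors.append("expire-time must be an ISO 8601 timestamp")
    return errors


print(validate_cache_config({"enabled": True, "retention": "long", "ttl": 3600}))
print(validate_cache_config({"ttl": 3600, "expire-time": "2030-01-01T00:00:00Z"}))
```

Returning a list of messages rather than raising on the first problem lets a caller report every configuration issue in one pass.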
The rlmx cache command
rlmx cache is the operator-facing entry point for CAG. It has two modes:
| Invocation | Behavior |
|---|---|
| `rlmx cache --context <path> --estimate` | Prints token count, provider limit, utilization %, and projected first-query cost. No LLM calls. |
| `rlmx cache --context <path>` | Issues a one-iteration warmup query to prime the provider cache. |
rlmx cache.
Practical patterns
Study session over a codebase
Warm once, then ask as many follow-ups as you want within the TTL.

Cached batch interrogation

Batch mode always enables caching; it’s effectively `rlmx cache` plus a question loop. See Batch Mode for the questions-file format and cost math.
Budget-capped cached run
Set a hard spend ceiling so runaway iterations can’t overshoot cache savings.

Automatic fallback to RLM

If the context inflates past the provider limit, RLMX logs a warning and automatically falls back to RLM navigation.

When NOT to use cache mode
- Context is too large for the provider window (use default RLM with `storage.enabled: auto`)
- Questions span different contexts: each unique context pays its own warmup cost
- You only plan to ask one question — the first-query cost equals a non-cached query, so there’s no savings
See also
Batch Mode
Bulk interrogation that stacks caching with the Gemini Batch API for up to 95% savings.
CLI Reference
Every flag on `rlmx cache` and `--cache` documented.
Configuration
The `cache:` section of `rlmx.yaml` in full.
Provider limits
Max cacheable context size by provider.