
Response Caching

Cache identical requests to reduce costs and latency.

osmAPI caches responses for identical requests, so you don't pay for the same completion twice. Cached responses are served instantly with near-zero latency.

Cost Savings

Scenario                   Without Caching   With Caching   Savings
1,000 identical requests   $15.00            $0.02          ~99.8%
45% duplicate rate         $10.00            $5.50          45%
Retry after error          $0.04             $0.02          50%

Cached responses carry zero inference cost: you pay only for the first request that populates the cache, plus a small storage fee (see Pricing below).
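The savings column is the relative cost reduction. For the first table row, the arithmetic works out as:

```javascript
// Figures taken from the first row of the table above.
const withoutCaching = 15.0; // 1,000 identical requests, every one billed
const withCaching = 0.02;    // only the first request (plus cache overhead) is billed
const savings = (1 - withCaching / withoutCaching) * 100; // ≈ 99.87%, i.e. ~99.8%
```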


Use Cases

Development & Testing

Cache identical prompts during development so only the first call costs anything.

// Repeated calls resolve instantly from cache
const response = await client.chat.completions.create({
	model: "gpt-4o",
	messages: [{ role: "user", content: "Explain Zero-Trust Architecture." }],
});

FAQ & Support Bots

Common questions like "What's your refund policy?" are answered instantly from cache.

Batch Processing

When processing large datasets that contain duplicate items, caching deduplicates the work automatically: each unique entry is billed only once.

// Duplicate entries in dataBatch are served from cache after the first call
for (const entry of dataBatch) {
	await client.chat.completions.create({
		model: "gpt-4o",
		messages: [{ role: "user", content: `Categorize: ${entry}` }],
	});
}

How It Works

  1. Hash Request: osmAPI generates a unique key from your full request payload.
  2. Check Cache: Looks for a matching cached response.
  3. Cache Hit: Returns the cached response instantly.
  4. Cache Miss: Routes to the provider, returns the response, and caches it.

What's Included in the Cache Key

  • Model ID
  • Full message history (roles, content, order)
  • Temperature, top-p, and other parameters
  • Max tokens and stop sequences
  • Tool definitions

Any change in parameters creates a different cache key, preventing cross-contamination between different request types.


Setup

Caching requires Data Retention to be enabled with "Full Retention" mode, since osmAPI needs to store payloads to serve them from cache.

  1. Enable Data Retention (Full Retention) in your organization settings.
  2. Turn on Response Caching in your project settings.
  3. Set your TTL (time-to-live) for cached responses.

TTL (Time-to-Live)

Configure how long cached responses are valid — from 10 seconds up to 1 year. Default is 60 seconds.
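Conceptually, TTL expiry means each cached entry records when it was stored and is treated as a miss once it is older than the configured window. A simplified sketch (not osmAPI's internals):

```javascript
const DEFAULT_TTL_MS = 60 * 1000; // 60-second default, as noted above

function makeTtlCache(ttlMs = DEFAULT_TTL_MS) {
	const entries = new Map();
	return {
		// Returns undefined for absent or expired entries (both count as misses)
		get(key, now = Date.now()) {
			const entry = entries.get(key);
			if (!entry || now - entry.storedAt > ttlMs) return undefined;
			return entry.value;
		},
		set(key, value, now = Date.now()) {
			entries.set(key, { value, storedAt: now });
		},
	};
}
```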


Detecting Cache Hits

A cached response has zero token usage:

{
	"usage": {
		"prompt_tokens": 0,
		"completion_tokens": 0,
		"total_tokens": 0,
		"cost_usd_total": 0
	}
}
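In client code, that shape makes cache hits easy to detect. A small helper, assuming the zero-usage convention shown above (`isCacheHit` is an illustrative name, not part of the SDK):

```javascript
// A response is a cache hit when its usage block reports zero tokens and zero cost
function isCacheHit(response) {
	const u = response.usage;
	return !!u && u.total_tokens === 0 && u.cost_usd_total === 0;
}
```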

Tips for Better Cache Hit Rates

  • Normalize Input: Clean and standardize user input before sending.
  • Use Temperature 0: For deterministic tasks, set temperature: 0.
  • Avoid Dynamic Values: Don't put timestamps or session IDs in prompts.
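As a sketch of the first tip, a normalization pass might trim the input, collapse whitespace runs, and standardize trailing punctuation so superficially different inputs map to the same cache key. The rules here are illustrative; tune them to your own traffic:

```javascript
function normalizePrompt(input) {
	return input
		.trim()
		.replace(/\s+/g, " ")       // collapse runs of whitespace
		.replace(/[?!.\s]+$/, "?"); // standardize trailing punctuation (FAQ-style queries)
}
```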

Best For

  • Knowledge base queries
  • Classification and sentiment analysis
  • CI/CD test pipelines
  • Multi-step tasks with shared context

Pricing

Caching uses the Data Retention storage layer at $0.01 per 1 million tokens. The savings from avoiding repeated inference typically far exceed storage costs.
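As a worked example of that rate (the token count is illustrative):

```javascript
const STORAGE_RATE_PER_MTOK = 0.01; // $0.01 per 1M tokens, per the pricing above
const responseTokens = 2000;        // assumed size of one cached response
const storageCost = (responseTokens / 1_000_000) * STORAGE_RATE_PER_MTOK;
// ≈ $0.00002, i.e. two-thousandths of a cent per cached response
```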
