
Response Caching

Cache identical requests to reduce costs and latency.

osmAPI caches responses for identical requests, so you don't pay for the same completion twice. Cached responses are served instantly with near-zero latency.

Cost Savings

Scenario                   Without Caching   With Caching   Savings
1,000 identical requests   $15.00            $0.02          ~99.8%
45% duplicate rate         $10.00            $5.50          45%
Retry after error          $0.04             $0.02          50%

Cached responses carry zero inference cost: you pay only for the first request that populates the cache, plus a small storage fee (see Pricing below).
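The savings column is the relative cost reduction. For the first table row, the arithmetic works out as:

```javascript
// Figures taken from the first row of the table above.
const withoutCaching = 15.0; // 1,000 identical requests, every one billed
const withCaching = 0.02;    // only the first request (plus cache overhead) is billed
const savings = (1 - withCaching / withoutCaching) * 100; // ≈ 99.87%, i.e. ~99.8%
```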


Use Cases

Development & Testing

Cache identical prompts during development so only the first call costs anything.

// Repeated calls resolve instantly from cache
const response = await client.chat.completions.create({
	model: "gpt-4o",
	messages: [{ role: "user", content: "Explain Zero-Trust Architecture." }],
});

FAQ & Support Bots

Common questions like "What's your refund policy?" are answered instantly from cache.

Batch Processing

When processing large datasets that contain duplicate items, caching deduplicates the work automatically: each unique entry is billed only once.

// Duplicate entries in dataBatch are served from cache after the first call
for (const entry of dataBatch) {
	await client.chat.completions.create({
		model: "gpt-4o",
		messages: [{ role: "user", content: `Categorize: ${entry}` }],
	});
}

How It Works

  1. Hash Request: osmAPI generates a unique key from your full request payload.
  2. Check Cache: Looks for a matching cached response.
  3. Cache Hit: Returns the cached response instantly.
  4. Cache Miss: Routes to the provider, returns the response, and caches it.

What's Included in the Cache Key

  • Model ID
  • Full message history (roles, content, order)
  • Temperature, top-p, and other parameters
  • Max tokens and stop sequences
  • Tool definitions

Any change in parameters creates a different cache key, preventing cross-contamination between different request types.


Setup

Caching requires Data Retention to be enabled with "Full Retention" mode, since osmAPI needs to store payloads to serve them from cache.

  1. Enable Data Retention (Full Retention) in your organization settings.
  2. Turn on Response Caching in your project settings.
  3. Set your TTL (time-to-live) for cached responses.

TTL (Time-to-Live)

Configure how long cached responses are valid — from 10 seconds up to 1 year. Default is 60 seconds.
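Conceptually, TTL expiry means each cached entry records when it was stored and is treated as a miss once it is older than the configured window. A simplified sketch (not osmAPI's internals):

```javascript
const DEFAULT_TTL_MS = 60 * 1000; // 60-second default, as noted above

function makeTtlCache(ttlMs = DEFAULT_TTL_MS) {
	const entries = new Map();
	return {
		// Returns undefined for absent or expired entries (both count as misses)
		get(key, now = Date.now()) {
			const entry = entries.get(key);
			if (!entry || now - entry.storedAt > ttlMs) return undefined;
			return entry.value;
		},
		set(key, value, now = Date.now()) {
			entries.set(key, { value, storedAt: now });
		},
	};
}
```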


Detecting Cache Hits

A cached response has zero token usage:

{
	"usage": {
		"prompt_tokens": 0,
		"completion_tokens": 0,
		"total_tokens": 0,
		"cost_usd_total": 0
	}
}
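In client code, that shape makes cache hits easy to detect. A small helper, assuming the zero-usage convention shown above (`isCacheHit` is an illustrative name, not part of the SDK):

```javascript
// A response is a cache hit when its usage block reports zero tokens and zero cost
function isCacheHit(response) {
	const u = response.usage;
	return !!u && u.total_tokens === 0 && u.cost_usd_total === 0;
}
```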

Tips for Better Cache Hit Rates

  • Normalize Input: Clean and standardize user input before sending.
  • Use Temperature 0: For deterministic tasks, set temperature: 0.
  • Avoid Dynamic Values: Don't put timestamps or session IDs in prompts.
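As a sketch of the first tip, a normalization pass might trim the input, collapse whitespace runs, and standardize trailing punctuation so superficially different inputs map to the same cache key. The rules here are illustrative; tune them to your own traffic:

```javascript
function normalizePrompt(input) {
	return input
		.trim()
		.replace(/\s+/g, " ")       // collapse runs of whitespace
		.replace(/[?!.\s]+$/, "?"); // standardize trailing punctuation (FAQ-style queries)
}
```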

Best For

  • Knowledge base queries
  • Classification and sentiment analysis
  • CI/CD test pipelines
  • Multi-step tasks with shared context

Pricing

Caching uses the Data Retention storage layer at $0.01 per 1 million tokens. The savings from avoiding repeated inference typically far exceed storage costs.
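As a worked example of that rate (the token count is illustrative):

```javascript
const STORAGE_RATE_PER_MTOK = 0.01; // $0.01 per 1M tokens, per the pricing above
const responseTokens = 2000;        // assumed size of one cached response
const storageCost = (responseTokens / 1_000_000) * STORAGE_RATE_PER_MTOK;
// ≈ $0.00002, i.e. two-thousandths of a cent per cached response
```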
