Realtime API
Build voice agents with real-time speech-to-speech via WebSocket
osmAPI proxies OpenAI's Realtime API, enabling real-time speech-to-speech conversations via WebSocket. Build voice agents, interactive voice assistants, and real-time transcription systems.
Available Models
| Model | Price (Text) | Price (Audio) | Notes |
|---|---|---|---|
| gpt-realtime | $4/16 per 1M tokens | $32/64 per 1M audio tokens | Full capability |
| gpt-realtime-mini | $1/4 per 1M tokens | $8/16 per 1M audio tokens | Cost-effective |
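
As a rough illustration of how the table translates into a per-session cost, the sketch below assumes each price pair is listed as input/output (for example, $4 input and $16 output per 1M text tokens for gpt-realtime). That reading is an assumption, since the table does not label the two numbers, and `PRICES` and `estimateCostUsd` are hypothetical names rather than part of osmAPI.

```javascript
// Prices in USD per 1M tokens, copied from the table above.
// ASSUMPTION: each pair is [input, output]; the table does not label the two numbers.
const PRICES = {
  "gpt-realtime": { text: [4, 16], audio: [32, 64] },
  "gpt-realtime-mini": { text: [1, 4], audio: [8, 16] },
};

// Hypothetical helper: estimate the cost of a session from its token counts.
function estimateCostUsd(model, usage) {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  // Sum (tokens * USD per 1M tokens) first, then scale down once.
  const weighted =
    usage.textIn * price.text[0] +
    usage.textOut * price.text[1] +
    usage.audioIn * price.audio[0] +
    usage.audioOut * price.audio[1];
  return weighted / 1_000_000;
}
```

Under this reading, a session's cost is dominated by audio tokens, so the choice between the two models matters most for long voice conversations.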
WebSocket Connection
Connect to the Realtime API via WebSocket:
// Node.js client from the "ws" npm package; the browser WebSocket cannot set headers
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.osmapi.com/v1/realtime?model=gpt-realtime",
  {
    headers: {
      Authorization: "Bearer YOUR_OSM_API_KEY",
    },
  }
);

ws.on("open", () => {
  // Ask the model to generate a response with text and audio output
  ws.send(
    JSON.stringify({
      type: "response.create",
      response: {
        modalities: ["text", "audio"],
        instructions: "You are a helpful assistant.",
      },
    })
  );
});

ws.on("message", (data) => {
  // ws delivers messages as Buffers; convert to a string before parsing
  const event = JSON.parse(data.toString());
  console.log("Event:", event.type);
});

The gateway automatically adds the required OpenAI-Beta header when proxying to OpenAI. You do not need to include it in your client-side connection.
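
Logging `event.type` is enough for a first test, but a real client usually routes each event to its own handler. The following is a minimal sketch of that pattern. `createEventRouter` is a hypothetical helper, and the event names in the usage section (`response.text.delta`, `error`) are assumptions based on OpenAI's Realtime API that may differ between API versions.

```javascript
// Hypothetical helper: route each incoming Realtime event to a handler keyed by event.type.
function createEventRouter(handlers, onUnhandled = () => {}) {
  return function route(raw) {
    // Messages may arrive as Buffers or strings; String() handles both.
    const event = JSON.parse(String(raw));
    if (Object.hasOwn(handlers, event.type)) {
      handlers[event.type](event);
    } else {
      onUnhandled(event);
    }
    return event.type;
  };
}

// ASSUMPTION: these event names follow OpenAI's Realtime API and may vary by version.
const transcript = [];
const route = createEventRouter({
  "response.text.delta": (event) => transcript.push(event.delta),
  "error": (event) => console.error("Realtime error:", event.error),
});

// Usage with the connection above: ws.on("message", route);
```

Keeping the router free of any socket reference makes it easy to exercise with recorded events before connecting to the live API.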
Realtime Transcription
For streaming transcription only (no conversation), add the intent=transcription query parameter to the connection URL:
const ws = new WebSocket(
  "wss://api.osmapi.com/v1/realtime?model=gpt-realtime&intent=transcription",
  {
    headers: {
      Authorization: "Bearer YOUR_OSM_API_KEY",
    },
  }
);

WebRTC Sessions
For browser-based real-time voice, create an ephemeral session token:
curl -X POST "https://api.osmapi.com/v1/realtime/sessions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-realtime",
    "voice": "alloy"
  }'

The response contains a client_secret that browser clients use for direct WebRTC connections.
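
The same request can be issued from a Node.js backend, which then hands only the short-lived secret to the browser. The sketch below is one way to do that. `createEphemeralSecret` is a hypothetical helper, and the response shape (`client_secret.value`) is an assumption based on OpenAI's session objects, so check an actual response before relying on it.

```javascript
// Sketch: create an ephemeral session server-side and return the client secret.
// ASSUMPTION: the response mirrors OpenAI's shape, { client_secret: { value, ... } }.
async function createEphemeralSecret(apiKey, fetchImpl = fetch) {
  const res = await fetchImpl("https://api.osmapi.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-realtime", voice: "alloy" }),
  });
  if (!res.ok) {
    throw new Error(`Session request failed with status ${res.status}`);
  }
  const session = await res.json();
  // Accept either a nested { value } object or a bare string.
  const secret = session.client_secret?.value ?? session.client_secret;
  if (typeof secret !== "string") {
    throw new Error("Response did not include a client_secret");
  }
  return secret;
}
```

The `fetchImpl` parameter defaults to the global `fetch` and exists only so the function can be tested with a stub. Calling this from the server keeps the long-lived osmAPI key out of browser code.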
Transcription Sessions
Create ephemeral tokens for WebSocket-based streaming transcription:
curl -X POST "https://api.osmapi.com/v1/realtime/transcription_sessions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-transcribe"
  }'

The Realtime API uses persistent WebSocket connections. Each connection can last up to 60 minutes. Authentication is done via the Bearer token in the connection headers.
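
Because a connection is capped at 60 minutes, long-running agents may want to reconnect shortly before the limit instead of waiting to be disconnected. The sketch below computes that delay. `msUntilReconnect`, the one-minute safety margin, and the `connect()` function in the usage comment are all hypothetical, not part of osmAPI.

```javascript
const MAX_CONNECTION_MS = 60 * 60 * 1000; // 60-minute limit stated above

// Hypothetical helper: how long to wait before proactively reconnecting.
function msUntilReconnect(openedAtMs, nowMs, safetyMarginMs = 60_000) {
  const deadline = openedAtMs + MAX_CONNECTION_MS - safetyMarginMs;
  // Never return a negative delay if the deadline has already passed.
  return Math.max(0, deadline - nowMs);
}

// Usage sketch (connect() is a hypothetical function that opens a new WebSocket):
// const openedAt = Date.now();
// setTimeout(() => { ws.close(); connect(); }, msUntilReconnect(openedAt, Date.now()));
```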
Use Cases
- Voice Assistants: Build Siri/Alexa-like voice interfaces
- Real-time Transcription: Live subtitles, meeting notes
- Phone Agents: Interactive voice response (IVR) systems
- Language Learning: Real-time pronunciation feedback