
Realtime API

Build voice agents with real-time speech-to-speech via WebSocket


osmAPI proxies OpenAI's Realtime API, enabling real-time speech-to-speech conversations via WebSocket. Build voice agents, interactive voice assistants, and real-time transcription systems.

Available Models

Model              Price (Text)            Price (Audio)                  Notes
gpt-realtime       $4 / $16 per 1M tokens  $32 / $64 per 1M audio tokens  Full capability
gpt-realtime-mini  $1 / $4 per 1M tokens   $8 / $16 per 1M audio tokens   Cost-effective
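The prices in the table translate directly into per-session cost estimates. The sketch below assumes the $a/b figures are input/output rates per million tokens (the table does not state this explicitly), and the token counts are illustrative inputs:

```javascript
// Rough cost estimate from the pricing table above.
// Assumption: "$4/16" means $4 per 1M input tokens, $16 per 1M output tokens.
function sessionCostUSD({ textIn, textOut, audioIn, audioOut }, model) {
  const rates = {
    "gpt-realtime":      { textIn: 4, textOut: 16, audioIn: 32, audioOut: 64 },
    "gpt-realtime-mini": { textIn: 1, textOut: 4,  audioIn: 8,  audioOut: 16 },
  };
  const r = rates[model];
  return (
    (textIn * r.textIn + textOut * r.textOut +
     audioIn * r.audioIn + audioOut * r.audioOut) / 1e6
  );
}

// e.g. 10k audio-in + 20k audio-out tokens on gpt-realtime-mini:
// sessionCostUSD({ textIn: 0, textOut: 0, audioIn: 10000, audioOut: 20000 },
//                "gpt-realtime-mini")  // → 0.4 (USD)
```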

WebSocket Connection

Connect to the Realtime API via WebSocket:

// Node.js example using the "ws" package; the browser's built-in
// WebSocket constructor does not accept custom headers.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.osmapi.com/v1/realtime?model=gpt-realtime",
  {
    headers: {
      Authorization: "Bearer YOUR_OSM_API_KEY",
    },
  }
);

ws.on("open", () => {
  // Send a text message
  ws.send(
    JSON.stringify({
      type: "response.create",
      response: {
        modalities: ["text", "audio"],
        instructions: "You are a helpful assistant.",
      },
    })
  );
});

ws.on("message", (data) => {
  // "ws" delivers a Buffer; convert to a string before parsing
  const event = JSON.parse(data.toString());
  console.log("Event:", event.type);
});

The gateway automatically adds the required OpenAI-Beta header when proxying to OpenAI. You do not need to include it in your client-side connection.
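Audio from the model arrives as a stream of delta events rather than a single payload. A minimal sketch of collecting that audio, assuming OpenAI's event naming where response.audio.delta carries base64-encoded PCM16 audio in its delta field (the exact event name can vary by API version):

```javascript
// Decode one audio delta event into a Buffer of PCM16 samples.
// Returns null for events that carry no audio.
function decodeAudioDelta(event) {
  if (event.type !== "response.audio.delta") return null;
  return Buffer.from(event.delta, "base64");
}

// Hypothetical wiring inside the message handler shown above:
// const chunks = [];
// ws.on("message", (data) => {
//   const event = JSON.parse(data.toString());
//   const pcm = decodeAudioDelta(event);
//   if (pcm) chunks.push(pcm);
//   if (event.type === "response.done") playAudio(Buffer.concat(chunks));
// });
```

playAudio here is a placeholder for whatever audio sink your application uses.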

Realtime Transcription

For streaming transcription only (no conversation), use the intent=transcription parameter:

const ws = new WebSocket(
  "wss://api.osmapi.com/v1/realtime?model=gpt-realtime&intent=transcription",
  {
    headers: {
      Authorization: "Bearer YOUR_OSM_API_KEY",
    },
  }
);
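Once the transcription socket is open, you stream audio to it as events. A sketch assuming OpenAI's input_audio_buffer.append event, which carries base64-encoded PCM16 audio (the 32 KB chunk size is an arbitrary choice, not an API requirement):

```javascript
// Build one append event from a raw PCM16 audio chunk.
function makeAppendEvent(pcmChunk) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcmChunk.toString("base64"),
  });
}

// Hypothetical usage once the socket above is open:
// for (const chunk of readPcmChunks("meeting.raw", 32 * 1024)) {
//   ws.send(makeAppendEvent(chunk));
// }
// Transcripts then arrive on the same socket as
// conversation.item.input_audio_transcription events.
```

readPcmChunks is a placeholder for your own audio source (microphone capture, file reader, etc.).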

WebRTC Sessions

For browser-based real-time voice, create an ephemeral session token:

curl -X POST "https://api.osmapi.com/v1/realtime/sessions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-realtime",
    "voice": "alloy"
  }'

The response contains a client_secret that browser clients use for direct WebRTC connections.
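The intended flow is: your backend mints the session, then hands only the short-lived client_secret to the browser. A sketch assuming the response shape matches OpenAI's sessions endpoint ({ client_secret: { value, expires_at }, ... }); the /my-backend/realtime-session route is a hypothetical endpoint on your own server:

```javascript
// Extract the short-lived token that is safe to expose to the browser.
function extractClientSecret(session) {
  return session.client_secret.value;
}

// Hypothetical browser usage:
// const session = await (await fetch("/my-backend/realtime-session")).json();
// const token = extractClientSecret(session);
// const pc = new RTCPeerConnection();
// ... create an SDP offer, then POST it with
// `Authorization: Bearer ${token}` to complete the WebRTC handshake.
```

Keeping your long-lived OSM API key on the server and exposing only the ephemeral secret is the point of this two-step flow.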

Transcription Sessions

Create ephemeral tokens for WebSocket-based streaming transcription:

curl -X POST "https://api.osmapi.com/v1/realtime/transcription_sessions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-transcribe"
  }'

The Realtime API uses persistent WebSocket connections; each connection can last up to 60 minutes. Authenticate by passing your Bearer token in the connection headers.
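Because connections cap out at 60 minutes, long-running agents need a reconnect strategy. A minimal exponential-backoff schedule for retries (the timings are illustrative, not prescribed by the API):

```javascript
// Delays for successive reconnect attempts: doubles each try, capped at 30 s.
function backoffDelays(attempts, baseMs = 1000, capMs = 30000) {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, capMs)
  );
}

// backoffDelays(5) → [1000, 2000, 4000, 8000, 16000]
```

In practice you would also resend session configuration (instructions, modalities) after each reconnect, since server-side session state does not survive the old socket.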

Use Cases

  • Voice Assistants: Build Siri/Alexa-like voice interfaces
  • Real-time Transcription: Live subtitles, meeting notes
  • Phone Agents: Interactive voice response (IVR) systems
  • Language Learning: Real-time pronunciation feedback
