
Audio

Text-to-Speech and Speech-to-Text using OpenAI and Groq models

osmAPI supports audio endpoints for Text-to-Speech (TTS) and Speech-to-Text (STT) through an OpenAI-compatible API. Generate spoken audio from text or transcribe audio files to text.

Available Models

Text-to-Speech (TTS)

| Model | Provider | Price | Quality |
|---|---|---|---|
| tts-1 | OpenAI | $15/1M chars | Standard, fast |
| tts-1-hd | OpenAI | $30/1M chars | HD quality |
| gpt-4o-mini-tts | OpenAI | ~$12/1M audio tokens | Best, supports instructions |

Speech-to-Text (STT)

| Model | Provider | Price | Notes |
|---|---|---|---|
| whisper-1 | OpenAI | $0.006/min | All response formats, supports translation |
| gpt-4o-transcribe | OpenAI | $0.006/min | JSON only, high accuracy |
| gpt-4o-mini-transcribe | OpenAI | $0.003/min | JSON only, 50% cheaper |
| gpt-4o-transcribe-diarize | OpenAI | $0.006/min | JSON only, high accuracy, with speaker diarization |
| whisper-large-v3 | Groq | $0.111/hr | Ultra-fast (189x real-time), supports translation |
| whisper-large-v3-turbo | Groq | $0.04/hr | Fastest (216x real-time) |

Text-to-Speech

Basic Usage

curl -X POST "https://api.osmapi.com/v1/audio/speech" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, welcome to osmAPI!",
    "voice": "alloy"
  }' \
  --output speech.mp3

With Voice Instructions (gpt-4o-mini-tts only)

curl -X POST "https://api.osmapi.com/v1/audio/speech" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "This is exciting news!",
    "voice": "coral",
    "instructions": "Speak with enthusiasm and energy"
  }' \
  --output speech.mp3

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | tts-1, tts-1-hd, or gpt-4o-mini-tts |
| input | string | Yes | Text to speak. Max 4,096 characters. |
| voice | string | Yes | Voice: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer |
| response_format | string | No | mp3 (default), opus, aac, flac, wav, pcm |
| speed | number | No | Speed 0.25 to 4.0. Default 1.0. |
| instructions | string | No | Voice tone/emotion instructions. gpt-4o-mini-tts only. |
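The limits in the parameter table can be enforced client-side before spending a request. The sketch below is illustrative, not part of any SDK; the helper name and structure are assumptions:

```python
# Hypothetical client-side validation of a /v1/audio/speech payload,
# mirroring the parameter table above.
VOICES = {"alloy", "ash", "ballad", "coral", "echo", "fable",
          "nova", "onyx", "sage", "shimmer", "verse"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def build_speech_payload(model, text, voice, response_format="mp3",
                         speed=1.0, instructions=None):
    """Validate arguments against the documented limits and return a
    dict suitable for JSON-encoding as the request body."""
    if len(text) > 4096:
        raise ValueError("input exceeds the 4,096 character limit")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if response_format not in FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if instructions is not None and model != "gpt-4o-mini-tts":
        raise ValueError("instructions is only supported by gpt-4o-mini-tts")
    payload = {"model": model, "input": text, "voice": voice,
               "response_format": response_format, "speed": speed}
    if instructions is not None:
        payload["instructions"] = instructions
    return payload
```

Catching these errors locally avoids a round trip that the server would reject anyway.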

Available Voices

alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse

tts-1 and tts-1-hd support the first 10 voices; gpt-4o-mini-tts supports all 11, including verse. The marin and cedar voices are exclusive to gpt-4o-mini-tts and are not included in the general-purpose list above.

Speech-to-Text

Transcription

curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1

With Groq (ultra-fast)

curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@audio.mp3 \
  -F model=groq/whisper-large-v3

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (max 25 MB). Formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| model | string | Yes | Model ID |
| language | string | No | ISO-639-1 language code (e.g., en, es) |
| response_format | string | No | json (default), text, srt, verbose_json, vtt |
| temperature | number | No | 0 to 1. Default 0. |
| prompt | string | No | Text to guide transcription style |
| stream | boolean | No | true to enable streaming. gpt-4o-transcribe models only. |
| timestamp_granularities[] | array | No | Timestamp granularity for verbose_json format. Values: "word", "segment". |
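When verbose_json is requested with timestamp_granularities[]=word (supported by whisper-1), each word comes back with start/end offsets in seconds. A small sketch of consuming that shape; the sample response below is abbreviated and illustrative (real responses also carry duration, language, and segments):

```python
def words_in_range(transcription: dict, start: float, end: float) -> list:
    """Return the words whose timestamps fall entirely inside
    [start, end] seconds of the audio."""
    return [w["word"] for w in transcription.get("words", [])
            if w["start"] >= start and w["end"] <= end]

# Abbreviated verbose_json response with word-level timestamps.
sample = {
    "text": "Hello everyone, welcome to the meeting.",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.4},
        {"word": "everyone,", "start": 0.45, "end": 0.9},
        {"word": "welcome", "start": 1.0, "end": 1.3},
        {"word": "to", "start": 1.3, "end": 1.4},
        {"word": "the", "start": 1.4, "end": 1.5},
        {"word": "meeting.", "start": 1.5, "end": 1.9},
    ],
}
print(words_in_range(sample, 0.0, 1.0))  # words spoken in the first second
```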

Streaming Transcription

For long audio files, enable streaming to get text as it's transcribed instead of waiting for the full result:

curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@meeting.mp3 \
  -F model=gpt-4o-mini-transcribe \
  -F stream=true

The response is Server-Sent Events (SSE) with progressive transcript chunks:

data: {"type":"transcript.text.delta","delta":"Hello "}
data: {"type":"transcript.text.delta","delta":"everyone, "}
data: {"type":"transcript.text.delta","delta":"welcome to the meeting."}
data: {"type":"transcript.text.done","text":"Hello everyone, welcome to the meeting."}
data: [DONE]

Streaming is only supported by gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize. The whisper-1 and Groq models do not support streaming — the parameter is silently ignored.
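Assembling a transcript from the SSE events shown above is a small fold over the `data:` lines. This is a minimal sketch that parses raw SSE lines directly rather than using any SDK helper; the event names follow the sample payloads:

```python
import json

def assemble_transcript(sse_lines):
    """Fold transcript.text.delta events into the final text, preferring
    the transcript.text.done payload when it arrives."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # ignore comments, event names, blank keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        event = json.loads(data)
        if event.get("type") == "transcript.text.delta":
            parts.append(event["delta"])
        elif event.get("type") == "transcript.text.done":
            return event["text"]
    return "".join(parts)
```

Feeding it the five sample lines above yields the same string as the `transcript.text.done` event.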

Translation

Translate non-English audio to English text:

curl -X POST "https://api.osmapi.com/v1/audio/translations" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@spanish_audio.mp3 \
  -F model=whisper-1

Only whisper-1 (OpenAI) and whisper-large-v3 (Groq) support translation.

SDK Compatibility

Fully compatible with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    api_key="your-osm-api-key",
    base_url="https://api.osmapi.com/v1"
)

# Text-to-Speech
response = client.audio.speech.create(
    model="tts-1",
    input="Hello from osmAPI!",
    voice="alloy"
)
response.write_to_file("output.mp3")  # stream_to_file() is deprecated in newer SDK versions

# Speech-to-Text
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("audio.mp3", "rb")
)
print(transcription.text)

JavaScript:

import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
	apiKey: "your-osm-api-key",
	baseURL: "https://api.osmapi.com/v1",
});

// Text-to-Speech
const audio = await client.audio.speech.create({
	model: "tts-1",
	input: "Hello from osmAPI!",
	voice: "alloy",
});
const buffer = Buffer.from(await audio.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);

// Speech-to-Text
const transcription = await client.audio.transcriptions.create({
	model: "whisper-1",
	file: fs.createReadStream("audio.mp3"),
});
console.log(transcription.text);

Pricing

  • TTS: Billed per input character (whitespace excluded)
  • STT: Billed per second of audio
  • Cost is deducted from credits in credits/hybrid mode, or charged to your provider key in API keys mode
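The billing rules above make cost estimation straightforward arithmetic. A back-of-the-envelope sketch using rates copied from the pricing tables on this page (treat them as examples, not a billing source of truth):

```python
# Per-unit rates derived from the model tables above.
TTS_PER_CHAR = {"tts-1": 15 / 1_000_000, "tts-1-hd": 30 / 1_000_000}
STT_PER_SECOND = {
    "whisper-1": 0.006 / 60,                # $0.006 per minute
    "gpt-4o-mini-transcribe": 0.003 / 60,   # $0.003 per minute
    "groq/whisper-large-v3": 0.111 / 3600,  # $0.111 per hour
}

def tts_cost(model: str, text: str) -> float:
    # TTS is billed per input character, whitespace excluded.
    billable = sum(1 for c in text if not c.isspace())
    return billable * TTS_PER_CHAR[model]

def stt_cost(model: str, seconds: float) -> float:
    # STT is billed per second of audio.
    return seconds * STT_PER_SECOND[model]
```

For example, speaking an 11-character string like "Hello world" with tts-1 bills 10 non-whitespace characters, and a one-minute whisper-1 transcription costs $0.006.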

Cost Tracking

Every audio response includes an x-osm-response-cost header with the request cost in USD:

# TTS — cost in response header (binary body can't include JSON)
curl -D- https://api.osmapi.com/v1/audio/speech \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello!","voice":"alloy"}' \
  -o speech.mp3

# Response headers include:
# x-osm-response-cost: 0.0000750000
# x-request-id: abc123

# STT — cost in the response header (the JSON body contains the transcript)
curl -D- https://api.osmapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1

# Response headers include:
# x-osm-response-cost: 0.0001200000

See the Cost Breakdown guide for details on USD vs INR billing.
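HTTP header names are case-insensitive, so a robust reader should normalize before looking up the cost. A minimal sketch over a plain dict of response headers (the function name is illustrative):

```python
def response_cost(headers: dict):
    """Extract the x-osm-response-cost header as a float (USD), or None
    when the header is absent. Header-name lookup is case-insensitive."""
    normalized = {k.lower(): v for k, v in headers.items()}
    value = normalized.get("x-osm-response-cost")
    return float(value) if value is not None else None
```

The same logic works whether the headers come from `requests`, `httpx`, or the OpenAI SDK's raw-response wrappers, since each exposes a mapping of header names to values.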
