Audio
Text-to-Speech and Speech-to-Text using OpenAI and Groq models
osmAPI supports audio endpoints for Text-to-Speech (TTS) and Speech-to-Text (STT) through an OpenAI-compatible API. Generate spoken audio from text or transcribe audio files to text.
Available Models
Text-to-Speech (TTS)
| Model | Provider | Price | Quality |
|---|---|---|---|
| tts-1 | OpenAI | $15/1M chars | Standard, fast |
| tts-1-hd | OpenAI | $30/1M chars | HD quality |
| gpt-4o-mini-tts | OpenAI | ~$12/1M audio tokens | Best, supports instructions |
Speech-to-Text (STT)
| Model | Provider | Price | Notes |
|---|---|---|---|
| whisper-1 | OpenAI | $0.006/min | All response formats, supports translation |
| gpt-4o-transcribe | OpenAI | $0.006/min | JSON only, high accuracy |
| gpt-4o-mini-transcribe | OpenAI | $0.003/min | JSON only, 50% cheaper |
| gpt-4o-transcribe-diarize | OpenAI | $0.006/min | JSON only, high accuracy, with speaker diarization |
| whisper-large-v3 | Groq | $0.111/hr | Ultra-fast (189x real-time), supports translation |
| whisper-large-v3-turbo | Groq | $0.04/hr | Fastest (216x real-time) |
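To compare models for a given recording, the prices in the table can be turned into a rough estimate. The sketch below assumes billing is linear per second (the Pricing section says STT is billed per second) with no minimum charge or rounding, which this page does not specify. `estimate_stt_cost` and `STT_PRICE_PER_HOUR` are illustrative names, not part of the API, and the `groq/whisper-large-v3-turbo` ID is assumed by analogy with `groq/whisper-large-v3`.

```python
from decimal import Decimal

# USD per hour of audio, taken from the table above
# (per-minute prices are multiplied by 60).
STT_PRICE_PER_HOUR = {
    "whisper-1": Decimal("0.006") * 60,
    "gpt-4o-transcribe": Decimal("0.006") * 60,
    "gpt-4o-mini-transcribe": Decimal("0.003") * 60,
    "gpt-4o-transcribe-diarize": Decimal("0.006") * 60,
    "groq/whisper-large-v3": Decimal("0.111"),
    "groq/whisper-large-v3-turbo": Decimal("0.04"),  # assumed model ID
}


def estimate_stt_cost(model: str, seconds: float) -> Decimal:
    """Estimate the USD cost of transcribing `seconds` of audio.

    Assumption: cost scales linearly per second, no minimum, no rounding.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return STT_PRICE_PER_HOUR[model] * Decimal(str(seconds)) / 3600


if __name__ == "__main__":
    # Compare every model for a 30-minute recording.
    for name in STT_PRICE_PER_HOUR:
        print(f"{name}: ${estimate_stt_cost(name, 30 * 60):.4f}")
```

Using `Decimal` keeps the small per-second amounts exact instead of accumulating floating-point error.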
Text-to-Speech
Basic Usage
curl -X POST "https://api.osmapi.com/v1/audio/speech" \
-H "Authorization: Bearer $OSM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello, welcome to osmAPI!",
"voice": "alloy"
}' \
--output speech.mp3
With Voice Instructions (gpt-4o-mini-tts only)
curl -X POST "https://api.osmapi.com/v1/audio/speech" \
-H "Authorization: Bearer $OSM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini-tts",
"input": "This is exciting news!",
"voice": "coral",
"instructions": "Speak with enthusiasm and energy"
}' \
--output speech.mp3
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | tts-1, tts-1-hd, or gpt-4o-mini-tts |
| input | string | Yes | Text to speak. Max 4,096 characters. |
| voice | string | Yes | Voice: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer |
| response_format | string | No | mp3 (default), opus, aac, flac, wav, pcm |
| speed | number | No | Speed 0.25 to 4.0. Default 1.0 |
| instructions | string | No | Voice tone/emotion instructions. Only gpt-4o-mini-tts. |
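Because `input` is capped at 4,096 characters, longer text has to be spread across several requests. The following is a minimal sketch of one way to split it, breaking at whitespace where possible. `split_for_tts` is a hypothetical helper, and it assumes the limit counts Python `str` characters, which this page does not spell out.

```python
MAX_TTS_CHARS = 4096  # limit from the parameters table


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers to break at the last whitespace that keeps the chunk within
    the limit; falls back to a hard cut when a single run of
    non-whitespace characters is longer than the limit.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        window = remaining[:limit + 1]
        # Index of the last whitespace character in the window, or -1.
        cut = max((i for i, ch in enumerate(window) if ch.isspace()), default=-1)
        if cut <= 0:
            cut = limit  # no usable break point: hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each chunk can then be sent as its own `/v1/audio/speech` request and the resulting audio files concatenated in order.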
Available Voices
alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse
tts-1 and tts-1-hd support the first 10 voices (alloy through shimmer). gpt-4o-mini-tts supports all 11 voices, including verse. The voices marin and cedar are exclusive to gpt-4o-mini-tts, so they are not included in the general list above.
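The parameter and voice rules above can be checked on the client before a request is sent. This is only a sketch: `build_speech_payload` is a hypothetical helper, and whether the server rejects or silently ignores an unsupported combination is not stated on this page, so the checks simply mirror the documented constraints.

```python
BASE_VOICES = ("alloy", "ash", "ballad", "coral", "echo",
               "fable", "nova", "onyx", "sage", "shimmer")

# Voice sets per model, as described in "Available Voices".
VOICES_BY_MODEL = {
    "tts-1": set(BASE_VOICES),
    "tts-1-hd": set(BASE_VOICES),
    "gpt-4o-mini-tts": set(BASE_VOICES) | {"verse", "marin", "cedar"},
}


def build_speech_payload(model, text, voice, speed=1.0, instructions=None):
    """Return a JSON-serializable body for POST /v1/audio/speech.

    Raises ValueError for combinations this page documents as unsupported.
    """
    if model not in VOICES_BY_MODEL:
        raise ValueError(f"unknown TTS model: {model}")
    if voice not in VOICES_BY_MODEL[model]:
        raise ValueError(f"voice {voice!r} is not available for {model}")
    if not text or len(text) > 4096:
        raise ValueError("input must be 1 to 4,096 characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if instructions is not None and model != "gpt-4o-mini-tts":
        raise ValueError("instructions are only supported by gpt-4o-mini-tts")
    payload = {"model": model, "input": text, "voice": voice, "speed": speed}
    if instructions is not None:
        payload["instructions"] = instructions
    return payload
```

The returned dictionary can be passed to `json.dumps` and used as the request body in place of the inline JSON in the curl examples.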
Speech-to-Text
Transcription
curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
-H "Authorization: Bearer $OSM_API_KEY" \
-F file=@audio.mp3 \
-F model=whisper-1
With Groq (ultra-fast)
curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
-H "Authorization: Bearer $OSM_API_KEY" \
-F file=@audio.mp3 \
-F model=groq/whisper-large-v3
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (max 25MB). Formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| model | string | Yes | Model ID |
| language | string | No | ISO-639-1 language code (e.g., en, es) |
| response_format | string | No | json (default), text, srt, verbose_json, vtt |
| temperature | number | No | 0 to 1. Default 0. |
| prompt | string | No | Guide transcription style |
| stream | string | No | true to enable streaming. Only gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize. |
| timestamp_granularities[] | array | No | Timestamp granularity for verbose_json format. Values: "word", "segment". |
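The `file` constraints can be checked locally before uploading. A minimal sketch, assuming "25MB" means 25 MiB (the page does not say whether it is decimal or binary) and that the format is identified by file extension; `upload_problems` is a hypothetical helper.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25MB" means 25 MiB
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga",
                        "m4a", "ogg", "wav", "webm"}


def upload_problems(filename: str, size_bytes: int) -> list[str]:
    """Return reasons the upload would likely be rejected (empty if none)."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}")
    return problems
```

For a file on disk, call it as `upload_problems(path.name, path.stat().st_size)` with `path` a `pathlib.Path`.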
Streaming Transcription
For long audio files, enable streaming to get text as it's transcribed instead of waiting for the full result:
curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
-H "Authorization: Bearer $OSM_API_KEY" \
-F file=@meeting.mp3 \
-F model=gpt-4o-mini-transcribe \
-F stream=true
The response is Server-Sent Events (SSE) with progressive transcript chunks:
data: {"type":"transcript.text.delta","delta":"Hello "}
data: {"type":"transcript.text.delta","delta":"everyone, "}
data: {"type":"transcript.text.delta","delta":"welcome to the meeting."}
data: {"type":"transcript.text.done","text":"Hello everyone, welcome to the meeting."}
data: [DONE]
Streaming is only supported by gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize. The whisper-1 and Groq models do not support streaming; for them the parameter is silently ignored.
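On the client side, the stream can be consumed line by line. The sketch below assumes events arrive exactly in the shape of the sample above (one `data:` line per event, `transcript.text.delta` carrying a `delta` field, and a final `[DONE]` sentinel); `iter_transcript_deltas` is an illustrative name, and the lines would normally come from an HTTP client's streaming iterator rather than a list.

```python
import json
from typing import Iterable, Iterator


def iter_transcript_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from SSE lines shaped like the sample above."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(data)
        if event.get("type") == "transcript.text.delta":
            yield event["delta"]


sample = [
    'data: {"type":"transcript.text.delta","delta":"Hello "}',
    'data: {"type":"transcript.text.delta","delta":"everyone, "}',
    'data: {"type":"transcript.text.delta","delta":"welcome to the meeting."}',
    'data: {"type":"transcript.text.done","text":"Hello everyone, welcome to the meeting."}',
    "data: [DONE]",
]
print("".join(iter_transcript_deltas(sample)))
# prints: Hello everyone, welcome to the meeting.
```

The `transcript.text.done` event repeats the full text, so a client that only needs the final transcript could read its `text` field instead of joining the deltas.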
Translation
Translate non-English audio to English text:
curl -X POST "https://api.osmapi.com/v1/audio/translations" \
-H "Authorization: Bearer $OSM_API_KEY" \
-F file=@spanish_audio.mp3 \
-F model=whisper-1
Only whisper-1 (OpenAI) and whisper-large-v3 (Groq) support translation.
SDK Compatibility
Fully compatible with the OpenAI SDK:
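from openai import OpenAI
client = OpenAI(
    api_key="your-osm-api-key",
    base_url="https://api.osmapi.com/v1",
)
# Text-to-Speech: stream the audio to a file as it arrives
# (calling stream_to_file on a non-streaming response is deprecated in the SDK)
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    input="Hello from osmAPI!",
    voice="alloy",
) as response:
    response.stream_to_file("output.mp3")
# Speech-to-Text: the context manager closes the file after the upload
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcription.text)
import OpenAI from "openai";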
import fs from "fs";
const client = new OpenAI({
  apiKey: "your-osm-api-key",
  baseURL: "https://api.osmapi.com/v1",
});
// Text-to-Speech
const audio = await client.audio.speech.create({
  model: "tts-1",
  input: "Hello from osmAPI!",
  voice: "alloy",
});
const buffer = Buffer.from(await audio.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);
// Speech-to-Text
const transcription = await client.audio.transcriptions.create({
  model: "whisper-1",
  file: fs.createReadStream("audio.mp3"),
});
console.log(transcription.text);
Pricing
- TTS: Billed per input character (whitespace excluded)
- STT: Billed per second of audio
- Cost is deducted from credits in credits/hybrid mode, or charged to your provider key in API keys mode
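As a worked example of the per-character rule, the sketch below counts billable characters and applies the per-1M-character prices from the TTS model table. It assumes "whitespace" means whatever Python's `str.isspace()` matches and that no rounding or minimum applies, neither of which this page states. gpt-4o-mini-tts is left out because its listed price is per audio token. The function names are illustrative.

```python
from decimal import Decimal

# USD per 1M characters, from the TTS model table.
TTS_PRICE_PER_MILLION_CHARS = {
    "tts-1": Decimal("15"),
    "tts-1-hd": Decimal("30"),
}


def billable_chars(text: str) -> int:
    """Count characters, skipping whitespace (assumed: str.isspace())."""
    return sum(1 for ch in text if not ch.isspace())


def estimate_tts_cost(model: str, text: str) -> Decimal:
    """Estimate the USD cost of one TTS request for `text`."""
    price = TTS_PRICE_PER_MILLION_CHARS[model]
    return price * billable_chars(text) / Decimal(1_000_000)


print(billable_chars("Hello, welcome to osmAPI!"))              # 22
print(estimate_tts_cost("tts-1", "Hello, welcome to osmAPI!"))  # 0.00033
```

The x-osm-response-cost header remains the authoritative figure; an estimate like this is only useful for budgeting before a request is made.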
Cost Tracking
Every audio response includes an x-osm-response-cost header with the request cost in USD:
# TTS: cost is in the response header (the binary body can't include JSON)
curl -D- https://api.osmapi.com/v1/audio/speech \
-H "Authorization: Bearer $OSM_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"tts-1","input":"Hello!","voice":"alloy"}' \
-o speech.mp3
# Response headers include:
# x-osm-response-cost: 0.0000750000
# x-request-id: abc123
# STT: cost is in the response header and can also be parsed from the response
curl -D- https://api.osmapi.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OSM_API_KEY" \
-F file=@audio.mp3 \
-F model=whisper-1
# Response headers include:
# x-osm-response-cost: 0.0001200000
See the Cost Breakdown guide for details on USD vs INR billing.
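In application code, the same header can be read from whatever HTTP client is in use. A minimal sketch, assuming the header value is a plain decimal string as in the examples above; `response_cost_usd` is a hypothetical helper that works on any mapping of header names to values.

```python
from decimal import Decimal
from typing import Mapping, Optional


def response_cost_usd(headers: Mapping[str, str]) -> Optional[Decimal]:
    """Return the x-osm-response-cost header as a Decimal, or None if absent.

    Header names are matched case-insensitively, as HTTP requires.
    """
    for name, value in headers.items():
        if name.lower() == "x-osm-response-cost":
            return Decimal(value.strip())
    return None


# Summing costs across several responses without float rounding error.
costs = [
    response_cost_usd({"x-osm-response-cost": "0.0000750000"}),
    response_cost_usd({"X-OSM-Response-Cost": "0.0001200000"}),
]
total = sum((c for c in costs if c is not None), Decimal(0))
print(total)
```

With the OpenAI Python SDK, response headers should be reachable through the raw-response wrapper, for example `client.audio.speech.with_raw_response.create(...)` followed by `.headers`, though this page does not show that usage itself.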