
Audio

Text-to-Speech and Speech-to-Text using OpenAI and Groq models

osmAPI supports audio endpoints for Text-to-Speech (TTS) and Speech-to-Text (STT) through an OpenAI-compatible API. Generate spoken audio from text or transcribe audio files to text.

Available Models

Text-to-Speech (TTS)

| Model | Provider | Price | Quality |
|---|---|---|---|
| tts-1 | OpenAI | $15/1M chars | Standard, fast |
| tts-1-hd | OpenAI | $30/1M chars | HD quality |
| gpt-4o-mini-tts | OpenAI | ~$12/1M audio tokens | Best, supports instructions |

Speech-to-Text (STT)

| Model | Provider | Price | Notes |
|---|---|---|---|
| whisper-1 | OpenAI | $0.006/min | All response formats, supports translation |
| gpt-4o-transcribe | OpenAI | $0.006/min | JSON only, high accuracy |
| gpt-4o-mini-transcribe | OpenAI | $0.003/min | JSON only, 50% cheaper |
| gpt-4o-transcribe-diarize | OpenAI | $0.006/min | JSON only, high accuracy, with speaker diarization |
| whisper-large-v3 | Groq | $0.111/hr | Ultra-fast (189x real-time), supports translation |
| whisper-large-v3-turbo | Groq | $0.04/hr | Fastest (216x real-time) |

Text-to-Speech

Basic Usage

curl -X POST "https://api.osmapi.com/v1/audio/speech" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, welcome to osmAPI!",
    "voice": "alloy"
  }' \
  --output speech.mp3

With Voice Instructions (gpt-4o-mini-tts only)

curl -X POST "https://api.osmapi.com/v1/audio/speech" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "This is exciting news!",
    "voice": "coral",
    "instructions": "Speak with enthusiasm and energy"
  }' \
  --output speech.mp3

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | tts-1, tts-1-hd, or gpt-4o-mini-tts |
| input | string | Yes | Text to speak. Max 4,096 characters. |
| voice | string | Yes | Voice: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer |
| response_format | string | No | mp3 (default), opus, aac, flac, wav, pcm |
| speed | number | No | Speed 0.25 to 4.0. Default 1.0. |
| instructions | string | No | Voice tone/emotion instructions. gpt-4o-mini-tts only. |
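The limits in the parameter table can be enforced client-side before spending a request. The sketch below is illustrative, not part of any SDK; the helper name and structure are assumptions:

```python
# Hypothetical client-side validation of a /v1/audio/speech payload,
# mirroring the parameter table above.
VOICES = {"alloy", "ash", "ballad", "coral", "echo", "fable",
          "nova", "onyx", "sage", "shimmer", "verse"}
FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}

def build_speech_payload(model, text, voice, response_format="mp3",
                         speed=1.0, instructions=None):
    """Validate arguments against the documented limits and return a
    dict suitable for JSON-encoding as the request body."""
    if len(text) > 4096:
        raise ValueError("input exceeds the 4,096 character limit")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if response_format not in FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if instructions is not None and model != "gpt-4o-mini-tts":
        raise ValueError("instructions is only supported by gpt-4o-mini-tts")
    payload = {"model": model, "input": text, "voice": voice,
               "response_format": response_format, "speed": speed}
    if instructions is not None:
        payload["instructions"] = instructions
    return payload
```

Catching these errors locally avoids a round trip that the server would reject anyway.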

Available Voices

alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse

tts-1 and tts-1-hd support the first 10 voices; gpt-4o-mini-tts supports all 11, including verse. The marin and cedar voices are exclusive to gpt-4o-mini-tts and are not included in the general-purpose list above.

Speech-to-Text

Transcription

curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1

With Groq (ultra-fast)

curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@audio.mp3 \
  -F model=groq/whisper-large-v3

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (max 25 MB). Formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm |
| model | string | Yes | Model ID |
| language | string | No | ISO-639-1 language code (e.g., en, es) |
| response_format | string | No | json (default), text, srt, verbose_json, vtt |
| temperature | number | No | 0 to 1. Default 0. |
| prompt | string | No | Text to guide transcription style |
| stream | boolean | No | true to enable streaming. gpt-4o-transcribe models only. |
| timestamp_granularities[] | array | No | Timestamp granularity for verbose_json format. Values: "word", "segment". |
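When verbose_json is requested with timestamp_granularities[]=word (supported by whisper-1), each word comes back with start/end offsets in seconds. A small sketch of consuming that shape; the sample response below is abbreviated and illustrative (real responses also carry duration, language, and segments):

```python
def words_in_range(transcription: dict, start: float, end: float) -> list:
    """Return the words whose timestamps fall entirely inside
    [start, end] seconds of the audio."""
    return [w["word"] for w in transcription.get("words", [])
            if w["start"] >= start and w["end"] <= end]

# Abbreviated verbose_json response with word-level timestamps.
sample = {
    "text": "Hello everyone, welcome to the meeting.",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.4},
        {"word": "everyone,", "start": 0.45, "end": 0.9},
        {"word": "welcome", "start": 1.0, "end": 1.3},
        {"word": "to", "start": 1.3, "end": 1.4},
        {"word": "the", "start": 1.4, "end": 1.5},
        {"word": "meeting.", "start": 1.5, "end": 1.9},
    ],
}
print(words_in_range(sample, 0.0, 1.0))  # words spoken in the first second
```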

Streaming Transcription

For long audio files, enable streaming to get text as it's transcribed instead of waiting for the full result:

curl -X POST "https://api.osmapi.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@meeting.mp3 \
  -F model=gpt-4o-mini-transcribe \
  -F stream=true

The response is Server-Sent Events (SSE) with progressive transcript chunks:

data: {"type":"transcript.text.delta","delta":"Hello "}
data: {"type":"transcript.text.delta","delta":"everyone, "}
data: {"type":"transcript.text.delta","delta":"welcome to the meeting."}
data: {"type":"transcript.text.done","text":"Hello everyone, welcome to the meeting."}
data: [DONE]

Streaming is only supported by gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize. The whisper-1 and Groq models do not support streaming — the parameter is silently ignored.
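Assembling a transcript from the SSE events shown above is a small fold over the `data:` lines. This is a minimal sketch that parses raw SSE lines directly rather than using any SDK helper; the event names follow the sample payloads:

```python
import json

def assemble_transcript(sse_lines):
    """Fold transcript.text.delta events into the final text, preferring
    the transcript.text.done payload when it arrives."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # ignore comments, event names, blank keep-alives
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        event = json.loads(data)
        if event.get("type") == "transcript.text.delta":
            parts.append(event["delta"])
        elif event.get("type") == "transcript.text.done":
            return event["text"]
    return "".join(parts)
```

Feeding it the five sample lines above yields the same string as the `transcript.text.done` event.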

Translation

Translate non-English audio to English text:

curl -X POST "https://api.osmapi.com/v1/audio/translations" \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@spanish_audio.mp3 \
  -F model=whisper-1

Only whisper-1 (OpenAI) and whisper-large-v3 (Groq) support translation.

SDK Compatibility

Fully compatible with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    api_key="your-osm-api-key",
    base_url="https://api.osmapi.com/v1"
)

# Text-to-Speech
response = client.audio.speech.create(
    model="tts-1",
    input="Hello from osmAPI!",
    voice="alloy"
)
response.write_to_file("output.mp3")  # stream_to_file() is deprecated in newer SDK versions

# Speech-to-Text
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("audio.mp3", "rb")
)
print(transcription.text)

JavaScript:

import OpenAI from "openai";
import fs from "fs";

const client = new OpenAI({
	apiKey: "your-osm-api-key",
	baseURL: "https://api.osmapi.com/v1",
});

// Text-to-Speech
const audio = await client.audio.speech.create({
	model: "tts-1",
	input: "Hello from osmAPI!",
	voice: "alloy",
});
const buffer = Buffer.from(await audio.arrayBuffer());
fs.writeFileSync("output.mp3", buffer);

// Speech-to-Text
const transcription = await client.audio.transcriptions.create({
	model: "whisper-1",
	file: fs.createReadStream("audio.mp3"),
});
console.log(transcription.text);

Pricing

  • TTS: Billed per input character (whitespace excluded)
  • STT: Billed per second of audio
  • Cost is deducted from credits in credits/hybrid mode, or charged to your provider key in API keys mode
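The billing rules above make cost estimation straightforward arithmetic. A back-of-the-envelope sketch using rates copied from the pricing tables on this page (treat them as examples, not a billing source of truth):

```python
# Per-unit rates derived from the model tables above.
TTS_PER_CHAR = {"tts-1": 15 / 1_000_000, "tts-1-hd": 30 / 1_000_000}
STT_PER_SECOND = {
    "whisper-1": 0.006 / 60,                # $0.006 per minute
    "gpt-4o-mini-transcribe": 0.003 / 60,   # $0.003 per minute
    "groq/whisper-large-v3": 0.111 / 3600,  # $0.111 per hour
}

def tts_cost(model: str, text: str) -> float:
    # TTS is billed per input character, whitespace excluded.
    billable = sum(1 for c in text if not c.isspace())
    return billable * TTS_PER_CHAR[model]

def stt_cost(model: str, seconds: float) -> float:
    # STT is billed per second of audio.
    return seconds * STT_PER_SECOND[model]
```

For example, speaking an 11-character string like "Hello world" with tts-1 bills 10 non-whitespace characters, and a one-minute whisper-1 transcription costs $0.006.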

Cost Tracking

Every audio response includes an x-osm-response-cost header with the request cost in USD:

# TTS — cost in response header (binary body can't include JSON)
curl -D- https://api.osmapi.com/v1/audio/speech \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"tts-1","input":"Hello!","voice":"alloy"}' \
  -o speech.mp3

# Response headers include:
# x-osm-response-cost: 0.0000750000
# x-request-id: abc123

# STT — cost in the response header (the JSON body contains the transcript)
curl -D- https://api.osmapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OSM_API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1

# Response headers include:
# x-osm-response-cost: 0.0001200000

See the Cost Breakdown guide for details on USD vs INR billing.
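HTTP header names are case-insensitive, so a robust reader should normalize before looking up the cost. A minimal sketch over a plain dict of response headers (the function name is illustrative):

```python
def response_cost(headers: dict):
    """Extract the x-osm-response-cost header as a float (USD), or None
    when the header is absent. Header-name lookup is case-insensitive."""
    normalized = {k.lower(): v for k, v in headers.items()}
    value = normalized.get("x-osm-response-cost")
    return float(value) if value is not None else None
```

The same logic works whether the headers come from `requests`, `httpx`, or the OpenAI SDK's raw-response wrappers, since each exposes a mapping of header names to values.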
