



Node.js example:

import { writeFileSync } from 'node:fs';
import OpenAI from 'openai';

const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '', // your api.ai.cc API key
});

const main = async () => {
  const answer = await api.chat.completions.create({
    model: 'openai/gpt-audio',
    modalities: ['text', 'audio'],
    audio: { voice: 'alloy', format: 'wav' },
    messages: [
      {
        role: 'user',
        content: 'Tell me, why is the sky blue?',
      },
    ],
  });

  console.log(answer.choices[0]);

  // The spoken reply arrives base64-encoded; decode it and write the raw
  // binary WAV bytes (no text encoding option -- the data is not text).
  writeFileSync(
    'answer.wav',
    Buffer.from(answer.choices[0].message.audio.data, 'base64'),
  );
};

main();

Python example:

import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",  # your api.ai.cc API key
)

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Tell me, why is the sky blue?",
        },
    ],
)

print(response.choices[0])

# The spoken reply arrives base64-encoded; decode it and write the raw bytes.
wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
with open("answer.wav", "wb") as f:
    f.write(wav_bytes)

Product Detail
GPT-Audio is OpenAI's state-of-the-art audio AI system. It interprets and generates high-fidelity speech and audio across several modes: speech-to-speech, speech-to-text, text-to-speech, and multimodal audio reasoning. It is designed for both voice-driven workflows and sophisticated conversational AI applications.
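For the speech-to-text and audio-reasoning directions, a recording can be sent as an input part of a chat message. A minimal Python sketch, assuming the api.ai.cc gateway mirrors OpenAI's input_audio content-part schema; the question.wav filename and the prompt are placeholders:

import base64

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")

# Read a local recording and base64-encode it for the request body.
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text"],  # request a text-only answer for transcription/reasoning
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize what is said in this recording."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)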
⚙️ Technical Specifications
- Model Type: Foundation Model (Transformer-based architecture)
- Modalities Supported: Audio (input/output), Text (input/output), Multimodal speech-text-audio reasoning
- Input Formats: WAV, MP3, FLAC, PCM
- Output Formats: WAV, MP3, FLAC (16 kHz or 44.1 kHz, mono/stereo)
- Languages: Multilingual coverage (over 50 languages and accents)
- Maximum Audio Length: Up to 30 minutes per segment
🚀 Performance Benchmarks
- Word Error Rate (WER): <2% on standard speech datasets (LibriSpeech, CommonVoice)
- MOS (Mean Opinion Score) for Speech Synthesis: 4.8/5 (near human parity)
- Speaker Verification Accuracy: 98.9%
- Response Latency: ~600 ms average for real-time TTS
- Ambient Noise Robustness: Functions effectively with background noise up to 85 dB
✨ Key Features
- Full-duplex conversation: Seamlessly handles simultaneous speech recognition and synthesis for dynamic interactions.
- Emotion and intonation control: Generates remarkably natural and expressive speech output with fine-tuned emotional nuances.
- Speaker Identification: Reliably differentiates multiple speakers in multi-participant audio environments.
- Noise Robustness: Maintains high accuracy even in noisy and dynamic environments, ensuring clear communication.
- Custom Voice Profiles: Offers the ability to train or select virtual voices, perfect for brand consistency or accessibility (see the sketch after this list).
- Multimodal reasoning: Integrates audio cues, spoken data, and textual prompts for a comprehensive, hybrid understanding of context.
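A minimal sketch of the voice and tone controls, assuming the gateway accepts OpenAI's built-in voice names (shimmer is assumed to be available alongside the alloy voice used above) and that a system prompt can steer delivery:

import base64

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    # Pick a different built-in voice; "shimmer" is an assumption here.
    audio={"voice": "shimmer", "format": "wav"},
    messages=[
        # Steer intonation and emotion through the instructions.
        {"role": "system", "content": "Speak slowly, in a warm, reassuring tone."},
        {"role": "user", "content": "Read tonight's weather forecast."},
    ],
)
with open("forecast.wav", "wb") as f:
    f.write(base64.b64decode(response.choices[0].message.audio.data))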
💰 GPT Audio API Pricing
- Input: $33.60 / 1M audio tokens; $2.63 / 1M text tokens
- Output: $67.20 / 1M audio tokens; $10.50 / 1M text tokens
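As a back-of-the-envelope illustration of these rates (the token counts below are hypothetical; real counts come from the usage field of each response):

# Hypothetical request: 2,000 audio input tokens, 200 text input tokens,
# 1,500 audio output tokens, 300 text output tokens.
cost = (
    2_000 / 1_000_000 * 33.60    # audio input
    + 200 / 1_000_000 * 2.63     # text input
    + 1_500 / 1_000_000 * 67.20  # audio output
    + 300 / 1_000_000 * 10.50    # text output
)
print(f"${cost:.4f}")  # ≈ $0.1717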
💡 Use Cases
- Conversational AI Agents: Powering advanced customer service, intelligent voice chatbots, and responsive digital assistants.
- Accessibility Tools: Enabling real-time speech-to-text captioning for live events and efficient voice translation for global communication.
- Content Creation: Facilitating automated narration for articles, professional podcast production, and interactive audiobooks.
- Voice-based Reasoning: Enhancing audio search capabilities, intuitive spoken command interfaces, and sophisticated multimodal analytics for deeper insights.
Code Sample
Full runnable text-to-speech examples in Node.js and Python appear at the top of this page; OpenAI's official API documentation covers the complete parameter reference. The sketch below extends those examples to a multi-turn voice conversation.
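A follow-up turn can reference a prior spoken reply by its id instead of resending the audio bytes. A minimal sketch, assuming the gateway mirrors OpenAI's multi-turn audio format (an id returned on the assistant message's audio object):

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")

first = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Tell me, why is the sky blue?"}],
)

# Reference the spoken reply by id in the next turn rather than resending it.
followup = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {"role": "user", "content": "Tell me, why is the sky blue?"},
        {"role": "assistant", "audio": {"id": first.choices[0].message.audio.id}},
        {"role": "user", "content": "And why are sunsets red?"},
    ],
)
print(followup.choices[0].message.audio.transcript)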
🆚 Comparison with Other Models
vs OpenAI Whisper: GPT-Audio offers a more expansive range of functionalities, notably including expressive speech synthesis, going beyond the transcription-focused capabilities of Whisper.
vs OpenAI GPT-4o (Omni): While GPT-4o is a flagship multimodal model supporting comprehensive voice, text, vision, and audio inputs, GPT-Audio is specifically optimized for high-fidelity audio tasks. It delivers superior speech recognition accuracy and more natural, expressive text-to-speech output, making it the specialized choice for intricate audio processing needs.
vs Deepgram Aura: Deepgram Aura excels in granular voice profile control for highly customized voice experiences. However, GPT-Audio distinguishes itself by incorporating a full multimodal audio reasoning layer, providing a deeper contextual understanding of audio inputs.
❓ Frequently Asked Questions (FAQs)
Q: What modalities does GPT-Audio support?
A: GPT-Audio supports speech-to-speech, speech-to-text, text-to-speech, and multimodal audio reasoning, covering a wide range of audio AI functionalities.
Q: How natural does the generated speech sound?
A: GPT-Audio generates highly natural and expressive speech output thanks to its advanced emotion and intonation control, achieving near-human parity (MOS 4.8/5).
Q: Does GPT-Audio work in noisy environments?
A: Yes. GPT-Audio features robust noise handling and can function accurately with background noise up to 85 dB, making it suitable for various real-world settings.
Q: How does GPT-Audio differ from GPT-4o?
A: While GPT-4o is a general-purpose multimodal AI, GPT-Audio is highly specialized and optimized for high-fidelity audio tasks, offering superior speech recognition accuracy and more natural, expressive TTS output specifically for audio processing.
Q: Can I use custom voices?
A: Yes. GPT-Audio allows for the training or selection of custom virtual voice profiles, enabling unique branding, character voices, or specific accessibility needs.