



Node.js example:

import { writeFileSync } from 'node:fs';
import OpenAI from 'openai';

const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '', // your api.ai.cc API key
});

const main = async () => {
  const answer = await api.chat.completions.create({
    model: 'openai/gpt-audio',
    modalities: ['text', 'audio'],
    audio: { voice: 'alloy', format: 'wav' },
    messages: [
      {
        role: 'user',
        content: 'Tell me, why is the sky blue?',
      },
    ],
  });

  console.log(answer.choices[0]);

  // The spoken reply arrives base64-encoded; decode it and write the raw
  // binary WAV bytes (no text encoding option -- the data is not text).
  writeFileSync(
    'answer.wav',
    Buffer.from(answer.choices[0].message.audio.data, 'base64'),
  );
};

main();

Python example:

import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",  # your api.ai.cc API key
)

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Tell me, why is the sky blue?",
        },
    ],
)

print(response.choices[0])

# The spoken reply arrives base64-encoded; decode it and write the raw bytes.
wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
with open("answer.wav", "wb") as f:
    f.write(wav_bytes)

Product Detail
GPT-Audio is OpenAI's state-of-the-art audio AI system. It interprets and generates high-fidelity speech and audio across several modes: speech-to-speech, speech-to-text, text-to-speech, and multimodal audio reasoning. It is designed for both voice-driven workflows and sophisticated conversational AI applications.
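For the speech-to-text and audio-reasoning directions, a recording can be sent as an input part of a chat message. A minimal Python sketch, assuming the api.ai.cc gateway mirrors OpenAI's input_audio content-part schema; the question.wav filename and the prompt are placeholders:

import base64

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")

# Read a local recording and base64-encode it for the request body.
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text"],  # request a text-only answer for transcription/reasoning
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize what is said in this recording."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)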
⚙️ Technical Specifications
- Model Type: Foundation Model (Transformer-based architecture)
- Modalities Supported: Audio (input/output), Text (input/output), Multimodal speech-text-audio reasoning
- Input Formats: WAV, MP3, FLAC, PCM
- Output Formats: WAV, MP3, FLAC (16 kHz or 44.1 kHz, mono/stereo)
- Languages: Multilingual coverage (over 50 languages and accents)
- Maximum Audio Length: Up to 30 minutes per segment
🚀 Performance Benchmarks
- Word Error Rate (WER): <2% on standard speech datasets (LibriSpeech, CommonVoice)
- MOS (Mean Opinion Score) for Speech Synthesis: 4.8/5 (near human parity)
- Speaker Verification Accuracy: 98.9%
- Response Latency: ~600 ms average for real-time TTS
- Ambient Noise Robustness: Functions effectively with background noise up to 85 dB
✨ Key Features
- Full-duplex conversation: Seamlessly handles simultaneous speech recognition and synthesis for dynamic interactions.
- Emotion and intonation control: Generates remarkably natural and expressive speech output with fine-tuned emotional nuances.
- Speaker Identification: Reliably differentiates multiple speakers in multi-participant audio environments.
- Noise Robustness: Maintains high accuracy even in noisy and dynamic environments, ensuring clear communication.
- Custom Voice Profiles: Offers the ability to train or select virtual voices, perfect for brand consistency or accessibility (see the sketch after this list).
- Multimodal reasoning: Integrates audio cues, spoken data, and textual prompts for a comprehensive, hybrid understanding of context.
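A minimal sketch of the voice and tone controls, assuming the gateway accepts OpenAI's built-in voice names (shimmer is assumed to be available alongside the alloy voice used above) and that a system prompt can steer delivery:

import base64

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    # Pick a different built-in voice; "shimmer" is an assumption here.
    audio={"voice": "shimmer", "format": "wav"},
    messages=[
        # Steer intonation and emotion through the instructions.
        {"role": "system", "content": "Speak slowly, in a warm, reassuring tone."},
        {"role": "user", "content": "Read tonight's weather forecast."},
    ],
)
with open("forecast.wav", "wb") as f:
    f.write(base64.b64decode(response.choices[0].message.audio.data))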
💰 GPT Audio API Pricing
- Input: $33.60 / 1M audio tokens; $2.63 / 1M text tokens
- Output: $67.20 / 1M audio tokens; $10.50 / 1M text tokens
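As a back-of-the-envelope illustration of these rates (the token counts below are hypothetical; real counts come from the usage field of each response):

# Hypothetical request: 2,000 audio input tokens, 200 text input tokens,
# 1,500 audio output tokens, 300 text output tokens.
cost = (
    2_000 / 1_000_000 * 33.60    # audio input
    + 200 / 1_000_000 * 2.63     # text input
    + 1_500 / 1_000_000 * 67.20  # audio output
    + 300 / 1_000_000 * 10.50    # text output
)
print(f"${cost:.4f}")  # ≈ $0.1717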
💡 Use Cases
- Conversational AI Agents: Powering advanced customer service, intelligent voice chatbots, and responsive digital assistants.
- Accessibility Tools: Enabling real-time speech-to-text captioning for live events and efficient voice translation for global communication.
- Content Creation: Facilitating automated narration for articles, professional podcast production, and interactive audiobooks.
- Voice-based Reasoning: Enhancing audio search capabilities, intuitive spoken command interfaces, and sophisticated multimodal analytics for deeper insights.
Code Sample
Full runnable text-to-speech examples in Node.js and Python appear at the top of this page; OpenAI's official API documentation covers the complete parameter reference. The sketch below extends those examples to a multi-turn voice conversation.
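A follow-up turn can reference a prior spoken reply by its id instead of resending the audio bytes. A minimal sketch, assuming the gateway mirrors OpenAI's multi-turn audio format (an id returned on the assistant message's audio object):

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")

first = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": "Tell me, why is the sky blue?"}],
)

# Reference the spoken reply by id in the next turn rather than resending it.
followup = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {"role": "user", "content": "Tell me, why is the sky blue?"},
        {"role": "assistant", "audio": {"id": first.choices[0].message.audio.id}},
        {"role": "user", "content": "And why are sunsets red?"},
    ],
)
print(followup.choices[0].message.audio.transcript)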
🆚 Comparison with Other Models
vs OpenAI Whisper: GPT-Audio offers a more expansive range of functionalities, notably including expressive speech synthesis, going beyond the transcription-focused capabilities of Whisper.
vs OpenAI GPT-4o (Omni): While GPT-4o is a flagship multimodal model supporting comprehensive voice, text, vision, and audio inputs, GPT-Audio is specifically optimized for high-fidelity audio tasks. It delivers superior speech recognition accuracy and more natural, expressive text-to-speech output, making it the specialized choice for intricate audio processing needs.
vs Deepgram Aura: Deepgram Aura excels in granular voice profile control for highly customized voice experiences. However, GPT-Audio distinguishes itself by incorporating a full multimodal audio reasoning layer, providing a deeper contextual understanding of audio inputs.
❓ Frequently Asked Questions (FAQs)
Q: What modalities does GPT-Audio support?
A: GPT-Audio supports speech-to-speech, speech-to-text, text-to-speech, and multimodal audio reasoning, covering a wide range of audio AI functionalities.
Q: How natural does the generated speech sound?
A: GPT-Audio generates highly natural and expressive speech output thanks to its advanced emotion and intonation control, achieving near-human parity (MOS 4.8/5).
Q: Does GPT-Audio work in noisy environments?
A: Yes. GPT-Audio features robust noise handling and can function accurately with background noise up to 85 dB, making it suitable for various real-world settings.
Q: How does GPT-Audio differ from GPT-4o?
A: While GPT-4o is a general-purpose multimodal AI, GPT-Audio is highly specialized and optimized for high-fidelity audio tasks, offering superior speech recognition accuracy and more natural, expressive TTS output specifically for audio processing.
Q: Can I use custom voices?
A: Yes. GPT-Audio allows for the training or selection of custom virtual voice profiles, enabling unique branding, character voices, or specific accessibility needs.