GPT Audio
Whether it is recognizing complex utterances, synthesizing expressive responses, or reasoning across modalities, GPT Audio remains remarkably responsive and adaptable.
Free $1 Tokens for New Members
Text to Speech
import { writeFileSync } from 'node:fs';
import OpenAI from 'openai';

const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '',
});

const main = async () => {
  const answer = await api.chat.completions.create({
    model: 'openai/gpt-audio',
    modalities: ['text', 'audio'],
    audio: { voice: 'alloy', format: 'wav' },
    messages: [
      {
        role: 'user',
        content: 'Tell me, why is the sky blue?'
      }
    ],
  });

  console.log(answer.choices[0]);

  // Decode the base64 audio returned by the API and save it as a WAV file.
  writeFileSync(
    'answer.wav',
    Buffer.from(answer.choices[0].message.audio.data, 'base64')
  );
};

main();

                                
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",    
)

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Tell me, why is the sky blue?"
        },
    ],
)

print(response.choices[0])

# Decode the base64 audio returned by the API and save it as a WAV file.
wav_bytes = base64.b64decode(response.choices[0].message.audio.data)
with open("answer.wav", "wb") as f:
    f.write(wav_bytes)
GPT Audio

Product Detail

GPT-Audio, a state-of-the-art audio AI system from OpenAI, represents a significant leap in audio technology. It is capable of interpreting and generating high-fidelity speech and audio with remarkable precision across various modes, including speech-to-speech, speech-to-text, text-to-speech, and advanced multimodal audio reasoning. This system is designed to streamline both voice-driven workflows and sophisticated conversational AI solutions.

⚙️ Technical Specifications

  • Model Type: Foundation Model (Transformer-based architecture)
  • Modalities Supported: Audio (input/output), Text (input/output), Multimodal speech-text-audio reasoning (see the audio-input sketch after this list)
  • Input Formats: WAV, MP3, FLAC, PCM
  • Output Formats: WAV, MP3, FLAC (16kHz or 44.1kHz, mono/stereo)
  • Languages: Multilingual coverage (over 50 languages and accents)
  • Maximum Audio Length: Up to 30 minutes per segment
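As a minimal sketch of the audio-input direction listed above, the snippet below sends a local WAV file through the same chat-completions endpoint used in the Text to Speech samples and asks for a plain-text transcription. The input_audio content shape follows OpenAI's convention for audio-capable chat models; whether api.ai.cc forwards it unchanged, and the file name question.wav, are assumptions made for illustration.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",
)

# Read a local recording and base64-encode it; the input_audio content type
# expects base64 data plus a format hint ("question.wav" is a hypothetical file).
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-audio",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this recording."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        },
    ],
)

# No audio modality was requested, so the reply arrives as plain text.
print(response.choices[0].message.content)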

🚀 Performance Benchmarks

  • Word Error Rate (WER): <2% on standard speech datasets (LibriSpeech, CommonVoice)
  • MOS (Mean Opinion Score) for Speech Synthesis: 4.8/5 (near human parity)
  • Speaker Verification Accuracy: 98.9%
  • Response Latency: 600 ms average for real-time TTS
  • Ambient Noise Robustness: Functions effectively with up to 85 dB of background noise

✨ Key Features

  • Full-duplex conversation: Seamlessly handles simultaneous speech recognition and synthesis for dynamic interactions.
  • Emotion and intonation control: Generates remarkably natural and expressive speech output with fine-tuned emotional nuances (see the sketch after this list).
  • Speaker Identification: Reliably differentiates multiple speakers in multi-participant audio environments.
  • Noise Robustness: Maintains high accuracy even in noisy and dynamic environments, ensuring clear communication.
  • Custom Voice Profiles: Offers the ability to train or select virtual voices, perfect for brand consistency or accessibility.
  • Multimodal reasoning: Integrates audio cues, spoken data, and textual prompts for a comprehensive, hybrid understanding of context.
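Most of these features are reached through the same call shown in the samples above: the audio.voice parameter selects the voice, and delivery style can be steered by describing it in the prompt. The sketch below illustrates that pattern; the assumption that emotion and intonation are controlled through prompt instructions rather than a dedicated parameter, and the voice name "alloy", are carried over from the earlier samples rather than confirmed behaviour of this endpoint.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",
)

response = client.chat.completions.create(
    model="openai/gpt-audio",
    modalities=["text", "audio"],
    # Voice selection: "alloy" is the voice used in the samples above.
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            # Delivery style is described in plain language; this sketch
            # assumes there is no separate "emotion" parameter.
            "role": "system",
            "content": "Speak warmly and enthusiastically, at a relaxed pace.",
        },
        {"role": "user", "content": "Welcome our new customers to the platform."},
    ],
)

# Save the expressive spoken reply returned by the model.
with open("welcome.wav", "wb") as f:
    f.write(base64.b64decode(response.choices[0].message.audio.data))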

💰 GPT Audio API Pricing

  • Input: $33.60 / 1M audio tokens; $2.63 / 1M text tokens
  • Output: $67.20 / 1M audio tokens; $10.50 / 1M text tokens (a worked cost estimate follows below)
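As a worked example of how these rates combine, the snippet below estimates the cost of one text-in, audio-out request using the text input and audio output rates above. The token counts (200 text tokens in, 2,500 audio tokens out) are hypothetical; actual counts are reported in the usage field of each API response.

# Hypothetical token counts for a single text-to-speech style request.
text_tokens_in = 200
audio_tokens_out = 2_500

# Rates from the pricing list above, converted to cost per token.
TEXT_INPUT_RATE = 2.63 / 1_000_000
AUDIO_OUTPUT_RATE = 67.20 / 1_000_000

cost = text_tokens_in * TEXT_INPUT_RATE + audio_tokens_out * AUDIO_OUTPUT_RATE
print(f"Estimated cost: ${cost:.6f}")  # ≈ $0.168526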

💡 Use Cases

  • Conversational AI Agents: Powering advanced customer service, intelligent voice chatbots, and responsive digital assistants.
  • Accessibility Tools: Enabling real-time speech-to-text captioning for live events and efficient voice translation for global communication.
  • Content Creation: Facilitating automated narration for articles, professional podcast production, and interactive audiobooks.
  • Voice-based Reasoning: Enhancing audio search capabilities, intuitive spoken command interfaces, and sophisticated multimodal analytics for deeper insights.

Code Sample

The Text to Speech samples at the top of this page show the basic text-in, audio-out call in both Node.js and Python. For the full parameter reference, see OpenAI's official API documentation.
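For the speech-to-speech mode described above, the same endpoint can in principle take audio in and return audio out by combining the input_audio content type with the audio output options from the earlier samples. The sketch below assumes api.ai.cc passes both through unchanged and uses a hypothetical input file named prompt.wav; treat it as an outline rather than a verified integration.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",
)

# Encode a spoken question ("prompt.wav" is a hypothetical input file).
with open("prompt.wav", "rb") as f:
    spoken_question = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai/gpt-audio",
    # Request both a text transcript and an audio reply.
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Answer the question in this recording."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": spoken_question, "format": "wav"},
                },
            ],
        },
    ],
)

# Save the spoken reply; a text version is available in message.audio.transcript.
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(response.choices[0].message.audio.data))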

🆚 Comparison with Other Models

vs OpenAI Whisper: GPT-Audio offers a more expansive range of functionalities, notably including expressive speech synthesis, going beyond the transcription-focused capabilities of Whisper.

vs OpenAI GPT-4o (Omni): While GPT-4o is a flagship multimodal model supporting comprehensive voice, text, vision, and audio inputs, GPT-Audio is specifically optimized for high-fidelity audio tasks. It delivers superior speech recognition accuracy and more natural, expressive text-to-speech output, making it the specialized choice for intricate audio processing needs.

vs Deepgram Aura: Deepgram Aura excels in granular voice profile control for highly customized voice experiences. However, GPT-Audio distinguishes itself by incorporating a full multimodal audio reasoning layer, providing a deeper contextual understanding of audio inputs.

❓ Frequently Asked Questions (FAQs)

Q: What are the main modes supported by GPT-Audio?
A: GPT-Audio supports speech-to-speech, speech-to-text, text-to-speech, and multimodal audio reasoning, covering a wide range of audio AI functionalities.
Q: How natural is the speech generated by GPT-Audio?
A: GPT-Audio generates highly natural and expressive speech output thanks to its advanced emotion and intonation control capabilities, achieving near-human parity.
Q: Can GPT-Audio perform accurately in noisy environments?
A: Yes, GPT-Audio features robust noise handling and can function accurately even with background noise levels up to 85dB, making it suitable for various real-world settings.
Q: What is the primary difference between GPT-Audio and OpenAI's GPT-4o?
A: While GPT-4o is a general-purpose multimodal AI, GPT-Audio is highly specialized and optimized for high-fidelity audio tasks, offering superior speech recognition accuracy and more natural, expressive TTS output specifically for audio processing.
Q: Are custom voice profiles possible with GPT-Audio?
A: Absolutely. GPT-Audio allows for the training or selection of custom virtual voice profiles, enabling unique branding, character voices, or specific accessibility needs.

Learn how you can transform your company with AICC APIs

Discover how to revolutionize your business with the AICC API! Unlock powerful tools to automate processes, enhance decision-making, and personalize customer experiences.
Contact sales