Qwen3-Omni Captioner
It accepts audio input and returns rich text captions, in real-time streaming or batch mode, without requiring an input prompt.
JavaScript:

import OpenAI from 'openai';

// Point the OpenAI SDK at the AI/ML API endpoint.
const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '', // your API key
});

const main = async () => {
  const response = await api.chat.completions.create({
    model: 'alibaba/qwen3-omni-30b-a3b-captioner',
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'input_audio',
            input_audio: {
              // URL of the audio file to caption
              data: 'https://cdn.ai.cc/eagle/files/elephant/cJUTeeQmpodIV1Q3MWDAL_vibevoice-output-7b98283fd3974f48ba90e91d2ee1f971.mp3'
            }
          }
        ]
      }
    ],
  });

  // The caption text is returned as a normal chat completion.
  console.log(response.choices[0].message.content);
};

main();
Python:

from openai import OpenAI

# Point the OpenAI SDK at the AI/ML API endpoint.
client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",  # your API key
)

response = client.chat.completions.create(
    model="alibaba/qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        # URL of the audio file to caption
                        "data": "https://cdn.aimlapi.com/eagle/files/elephant/cJUTeeQmpodIV1Q3MWDAL_vibevoice-output-7b98283fd3974f48ba90e91d2ee1f971.mp3"
                    }
                }
            ]
        },
    ],
)

# The caption text is returned as a normal chat completion.
print(response.choices[0].message.content)
Qwen3-Omni Captioner

Product Detail

Unveiling Qwen3-Omni Captioner: A Multilingual Omni-Modal AI Powerhouse

Discover Qwen3-Omni Captioner, Alibaba Cloud’s state-of-the-art, natively end-to-end multilingual omni-modal foundation model. Engineered to redefine AI interaction, it seamlessly processes diverse inputs including text, images, audio, and video. This innovative model delivers real-time streaming responses in both natural text and speech, maintaining exceptional performance across all modalities without degradation. Qwen3-Omni stands as a leading multimodal AI solution, offering unparalleled capabilities.

⚙️Technical Deep Dive

  • Thinker-Talker Architecture: This unique design intelligently separates text generation (the Thinker) from real-time speech synthesis (the Talker). This enables highly specialized and efficient processing for both distinct tasks.
  • Ultra-Low-Latency Streaming: The Talker component predicts multi-codebook sequences autoregressively. Its Multi-Token Predictor (MTP) module outputs residual codebooks for the current audio frame, which are then incrementally synthesized into a waveform by the Code2Wav renderer. This sophisticated process ensures seamless, real-time audio output.
  • AuT Audio Encoder: Powering the model's audio capabilities, the AuT encoder is meticulously trained on an extensive dataset of 20 million hours of audio data. This vast training provides exceptionally strong and generalizable audio feature extraction.
  • MoE Architecture: Both the Thinker and Talker subsystems are built upon Mixture-of-Experts (MoE) models. This architecture facilitates high concurrency and rapid inference by activating only a subset of parameters per token, leading to superior efficiency.
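The efficiency claim in the MoE bullet can be illustrated with a toy top-k router: for each token, only the k highest-scoring experts are evaluated, so compute scales with k rather than with the total number of experts. This is a minimal sketch of the general technique with made-up experts and gate scores, not Qwen3-Omni's actual routing code:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_scores, k=2):
    """Run only the top-k experts for this token and mix their
    outputs by the renormalized gate weights."""
    # Rank experts by gate score and keep the k best.
    topk = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in topk])
    # Weighted sum of the selected experts' outputs; the remaining
    # experts are never evaluated, which is where the speedup comes from.
    return sum(w * experts[i](token) for w, i in zip(weights, topk))

# Toy experts: each is just a scalar function of the token value.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_scores = [0.1, 2.0, 1.5, -1.0]  # made-up router logits for one token

out = moe_forward(3.0, experts, gate_scores, k=2)
```

With k=1 this degenerates to picking the single best expert; larger k trades compute for a smoother mixture.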

📊Performance Highlights

Qwen3-Omni establishes itself as a leader, achieving state-of-the-art results on 22 out of 36 audio and audio-visual benchmarks. It notably surpasses strong closed-source models, including Gemini 2.5 Pro and GPT-4o-Transcribe, across various performance metrics.

  • Text Understanding: Demonstrates competitive performance against top models in MMLU, GPQA, reasoning, and complex code tasks.
  • Audio Recognition (ASR): Achieves a Word Error Rate (WER) on par with or superior to Seed-ASR and GPT-4o-Transcribe across numerous datasets.
  • Multimodal Reasoning: Exhibits robust performance in challenging audio-visual question answering and comprehensive video description benchmarks.
  • Speech Generation: Delivers high-quality multilingual speech synthesis, maintaining consistent speaker identity across 10 different languages.
  • Streaming Latency: Features an impressive ultra-low first-packet latency of approximately 211 ms, ensuring near-instantaneous speech responses.
  • Audio Captioning: The specially fine-tuned model excels in generating detailed, highly accurate captions for arbitrary audio content.
Performance Benchmarks: Qwen3-Omni's competitive results across audio and audio-visual tasks, as presented in the original source.
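Since several of the figures above are Word Error Rates, it is worth recalling how WER is conventionally computed: the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal implementation of that standard metric, unrelated to any model's internals:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words.
score = wer("the cat sat on the mat", "the cat sat on a mat")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.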

💡Key Capabilities

  • Advanced Architecture: Features an MoE-based Thinker–Talker design, integrating Audio Transformer (AuT) pretraining and innovative multi-codebook speech synthesis for low-latency and exceptionally high-fidelity output.
  • Extensive Reasoning: The specialized Thinking model variant significantly enhances reasoning abilities across all supported modalities, ensuring a deeper understanding of complex inputs.
  • Customization: Offers robust customization options, allowing users to fine-tune the model's behavior, tone, and interaction style via intuitive system prompts.
  • Open-Source Audio Captioner: The fine-tuned Qwen3-Omni-30B-A3B-Captioner variant provides highly detailed and low-hallucination audio descriptions, making advanced captioning accessible.
  • Real-Time Interaction: Designed for natural turn-taking in conversations, supporting immediate text or speech responses for a fluid and engaging user experience.
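The customization bullet maps onto the OpenAI-compatible request shape used in the code samples: a system message carrying tone and style instructions is placed ahead of the audio content. The sketch below only builds the request payload (no network call); the style prompt and audio URL are placeholders, and whether the Captioner variant itself honors system prompts is an assumption here (the customization claim above refers to Qwen3-Omni generally).

```python
def build_caption_request(audio_url: str, style_prompt: str) -> dict:
    """Assemble an OpenAI-style chat payload pairing a system prompt
    (tone/style instructions) with an audio input."""
    return {
        "model": "alibaba/qwen3-omni-30b-a3b-captioner",
        "messages": [
            # The system prompt steers tone and interaction style.
            {"role": "system", "content": style_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": audio_url}}
                ],
            },
        ],
    }

payload = build_caption_request(
    "https://example.com/clip.mp3",  # placeholder URL
    "Describe the audio in one concise, formal sentence.",
)
```

The resulting dict can be passed as keyword arguments to `client.chat.completions.create(**payload)` with the client configured as in the samples above.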

🚀Diverse Use Cases

  • Development of advanced multilingual chatbots capable of understanding both audio and visual inputs.
  • Real-time streaming transcription and translation services across a multitude of languages.
  • In-depth audio and video content analysis, including automated summarization and detailed captioning.
  • Creation of sophisticated multimodal question answering and reasoning systems.
  • Design of intuitive voice assistants with natural speech comprehension and rich multimodal understanding.
  • Enabling interactive multimedia content generation and seamless navigation experiences.

💻API & Integration

API Pricing:

  • Input: $4.0005
  • Output: $3.213

API Integration:

Qwen3-Omni Captioner is easily accessible via the AI/ML API. For comprehensive documentation, detailed integration guides, and further API references, please visit the official documentation.

Code Sample: see the JavaScript and Python examples above.

🆚Qwen3-Omni vs. Leading Models

  • vs Gemini 2.5 Pro: Qwen3-Omni matches or surpasses Gemini’s performance on audio-video benchmarks and offers superior open-source accessibility. It provides comparable ASR performance with significantly lower latency in streaming speech generation.
  • vs Seed-ASR: Qwen3-Omni achieves superior or highly comparable Word Error Rates while extending its capabilities to broader multimodal domains well beyond simple audio processing.
  • vs GPT-4o: Qwen3-Omni excels particularly in multimodal audio and video tasks, all while maintaining robust proficiency in traditional text-based tasks. It features lower latency streaming audio output, largely due to its native multi-codebook speech codec.

Frequently Asked Questions

1. What makes Qwen3-Omni Captioner a unique AI model?

Qwen3-Omni Captioner is unique due to its nature as an end-to-end multilingual omni-modal foundation model. It supports diverse inputs like text, images, audio, and video, and provides real-time streaming text and speech outputs. Its innovative Thinker-Talker architecture and MoE design ensure exceptional performance and ultra-low latency across all modalities.

2. How does Qwen3-Omni achieve its ultra-low-latency real-time speech output?

The model achieves this through its "Talker" component, which uses a Multi-Token Predictor (MTP) to autoregressively predict multi-codebook sequences. These residual codebooks are then incrementally synthesized into waveforms by the Code2Wav renderer, enabling seamless, frame-by-frame audio streaming with minimal delay.
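The control flow described in this answer can be sketched abstractly: at each step a predictor emits the residual codebooks for one audio frame, and a renderer converts them into samples that are appended to the output stream immediately, instead of waiting for the whole utterance. The code below is a toy simulation of that incremental pipeline with invented numeric values, not the actual MTP or Code2Wav implementation:

```python
def predict_frame_codebooks(frame_idx, num_codebooks=4):
    """Stand-in for the MTP step: emit residual codebook indices
    for one frame (toy values, not real codec output)."""
    return [(frame_idx * 7 + c) % 16 for c in range(num_codebooks)]

def render_frame(codebooks, samples_per_frame=3):
    """Stand-in for the Code2Wav renderer: turn one frame's
    codebooks into a few waveform samples (trivially, here)."""
    return [sum(codebooks) + s for s in range(samples_per_frame)]

def stream_synthesize(num_frames):
    """Incremental synthesis: yield the audio accumulated so far
    after every frame, so playback can begin after frame one."""
    waveform = []
    for f in range(num_frames):
        codebooks = predict_frame_codebooks(f)    # autoregressive step
        waveform.extend(render_frame(codebooks))  # immediate rendering
        yield list(waveform)  # audio available after this frame

chunks = list(stream_synthesize(3))
```

The key property is that the first chunk is available after a single frame, which is the structural reason a first-packet latency on the order of one frame (the ~211 ms figure cited above) is possible.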

3. How does Qwen3-Omni's performance compare to other leading AI models?

Qwen3-Omni demonstrates state-of-the-art results on 22 out of 36 audio and audio-visual benchmarks. It often outperforms or matches strong closed-source models such as Gemini 2.5 Pro, Seed-ASR, and GPT-4o, particularly excelling in multimodal tasks, ASR accuracy, and offering lower streaming latency.

4. Can I customize Qwen3-Omni's responses and interaction style?

Yes, Qwen3-Omni offers extensive customization options. Its behavior, including tone and style of interaction, is fully configurable via system prompts. This allows users to tailor the model's responses to specific application needs and user preferences.

5. What are the primary applications and use cases for Qwen3-Omni Captioner?

Qwen3-Omni Captioner is highly versatile, ideal for applications like multilingual chatbots with multimodal understanding, real-time transcription and translation, detailed audio and video content analysis, advanced multimodal question answering, natural voice assistants, and interactive multimedia content generation.

Learn how you can transform your company with AICC APIs

Discover how to revolutionize your business with the AICC API! Unlock powerful tools to automate processes, enhance decision-making, and personalize customer experiences.
Contact sales