



import OpenAI from 'openai';

const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '',
});

const main = async () => {
  const response = await api.chat.completions.create({
    model: 'alibaba/qwen3-omni-30b-a3b-captioner',
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'input_audio',
            input_audio: {
              data: 'https://cdn.ai.cc/eagle/files/elephant/cJUTeeQmpodIV1Q3MWDAL_vibevoice-output-7b98283fd3974f48ba90e91d2ee1f971.mp3',
            },
          },
        ],
      },
    ],
  });

  console.log(response.choices[0].message.content);
};

main();
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",
)

response = client.chat.completions.create(
    model="alibaba/qwen3-omni-30b-a3b-captioner",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://cdn.aimlapi.com/eagle/files/elephant/cJUTeeQmpodIV1Q3MWDAL_vibevoice-output-7b98283fd3974f48ba90e91d2ee1f971.mp3"
                    },
                }
            ],
        },
    ],
)

print(response.choices[0].message.content)
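The samples above pass a hosted URL in the `data` field. If the gateway also follows the OpenAI chat-completions audio convention, base64-encoded bytes with an explicit `format` field should work as well; the sketch below builds such a payload without sending it. Whether this endpoint accepts base64 in addition to URLs is an assumption, not something confirmed on this page.

```python
import base64


def build_audio_message(audio_bytes: bytes, audio_format: str = "mp3") -> dict:
    """Build a chat message carrying base64-encoded audio.

    Follows the OpenAI `input_audio` content-part shape; base64 support
    on this particular endpoint is assumed, not documented here.
    """
    return {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                    "format": audio_format,  # e.g. "mp3" or "wav"
                },
            }
        ],
    }


# Dummy bytes stand in for a real audio file read from disk.
msg = build_audio_message(b"abc", "wav")
```

The resulting `msg` can be dropped into the `messages` list of either sample above in place of the URL-based message.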
AI Playground

Test all API models in the sandbox environment before you integrate.
We provide more than 300 models to integrate into your app.


Product Detail
Unveiling Qwen3-Omni Captioner: A Multilingual Omni-Modal AI Powerhouse
Discover Qwen3-Omni Captioner, Alibaba Cloud’s state-of-the-art, natively end-to-end multilingual omni-modal foundation model. Engineered to redefine AI interaction, it seamlessly processes diverse inputs including text, images, audio, and video. This innovative model delivers real-time streaming responses in both natural text and speech, maintaining exceptional performance across all modalities without degradation. Qwen3-Omni stands as a leading multimodal AI solution, offering unparalleled capabilities.
⚙️Technical Deep Dive
- Thinker-Talker Architecture: This unique design intelligently separates text generation (the Thinker) from real-time speech synthesis (the Talker). This enables highly specialized and efficient processing for both distinct tasks.
- Ultra-Low-Latency Streaming: The Talker component predicts multi-codebook sequences autoregressively. Its Multi-Token Predictor (MTP) module outputs residual codebooks for the current audio frame, which are then incrementally synthesized into a waveform by the Code2Wav renderer. This sophisticated process ensures seamless, real-time audio output.
- AuT Audio Encoder: Powering the model's audio capabilities, the AuT encoder is meticulously trained on an extensive dataset of 20 million hours of audio data. This vast training provides exceptionally strong and generalizable audio feature extraction.
- MoE Architecture: Both the Thinker and Talker subsystems are built upon Mixture-of-Experts (MoE) models. This architecture facilitates high concurrency and rapid inference by activating only a subset of parameters per token, leading to superior efficiency.
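The MoE bullet above can be illustrated with a toy top-k router: a gate scores every expert, only the k highest-scoring experts run, and their outputs are mixed with renormalized gate weights. This is a generic sketch of Mixture-of-Experts routing, not Qwen3-Omni's actual implementation; the experts and gate here are stand-ins.

```python
import math


def moe_forward(x, experts, gate_logits, k=2):
    """Toy MoE step: activate only the top-k experts per input.

    `experts` is a list of callables, `gate_logits` one score per expert.
    Only the selected experts execute, which is where the efficiency
    win of sparse activation comes from.
    """
    # Pick the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over only the selected logits so mixing weights sum to 1.
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the activated experts; the rest stay idle.
    return sum(w * experts[i](x) for w, i in zip(weights, top))


# Four hypothetical "experts" over scalars, two of which get activated.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
y = moe_forward(3.0, experts, gate_logits=[0.1, 2.0, 1.0, -1.0], k=2)
```

With k=2 only two of the four experts run per token-analog, which is the mechanism the bullet credits for high concurrency and fast inference.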
📊Performance Highlights
Qwen3-Omni establishes itself as a leader, achieving state-of-the-art results on 22 out of 36 audio and audio-visual benchmarks. It notably surpasses strong closed-source models, including Gemini 2.5 Pro and GPT-4o-Transcribe, across various performance metrics.
- Text Understanding: Demonstrates competitive performance against top models in MMLU, GPQA, reasoning, and complex code tasks.
- Audio Recognition (ASR): Achieves a Word Error Rate (WER) on par with or superior to Seed-ASR and GPT-4o-Transcribe across numerous datasets.
- Multimodal Reasoning: Exhibits robust performance in challenging audio-visual question answering and comprehensive video description benchmarks.
- Speech Generation: Delivers high-quality multilingual speech synthesis, maintaining consistent speaker identity across 10 different languages.
- Streaming Latency: Features an impressive ultra-low first-packet latency of approximately 211 ms, ensuring near-instantaneous speech responses.
- Audio Captioning: The specially fine-tuned model excels in generating detailed, highly accurate captions for arbitrary audio content.
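The WER cited in the ASR bullet is the standard metric: word-level edit distance divided by the number of reference words. A minimal reference implementation of the metric itself (generic code, not tied to any particular benchmark harness):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)


# One substituted word ("a" for "the") over a six-word reference.
score = wer("the cat sat on the mat", "the cat sat on a mat")
```

Lower is better; a WER "on par with or superior to" another system means fewer word-level edits are needed to turn the hypothesis into the reference transcript.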

💡Key Capabilities
- Advanced Architecture: Features an MoE-based Thinker–Talker design, integrating Audio Transformer (AuT) pretraining and innovative multi-codebook speech synthesis for low-latency and exceptionally high-fidelity output.
- Extensive Reasoning: The specialized Thinking model variant significantly enhances reasoning abilities across all supported modalities, ensuring a deeper understanding of complex inputs.
- Customization: Offers robust customization options, allowing users to fine-tune the model's behavior, tone, and interaction style via intuitive system prompts.
- Open-Source Audio Captioner: The fine-tuned Qwen3-Omni-30B-A3B-Captioner variant provides highly detailed and low-hallucination audio descriptions, making advanced captioning accessible.
- Real-Time Interaction: Designed for natural turn-taking in conversations, supporting immediate text or speech responses for a fluid and engaging user experience.
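The customization capability above maps directly to a system message in the chat payload. The sketch below builds such a request body without sending it; the persona text and URL are illustrative, while the model name and `input_audio` shape come from the samples earlier on this page.

```python
def build_request(audio_url: str, persona: str) -> dict:
    """Build a chat-completions payload that steers tone via a system prompt."""
    return {
        "model": "alibaba/qwen3-omni-30b-a3b-captioner",
        "messages": [
            # The system prompt fixes behavior, tone, and interaction style.
            {"role": "system", "content": persona},
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": audio_url}}
                ],
            },
        ],
    }


req = build_request(
    "https://example.com/clip.mp3",  # placeholder URL
    "Describe the audio in one terse sentence; note speakers and background sounds.",
)
```

Passing `req["model"]` and `req["messages"]` to `client.chat.completions.create(...)` from the Python sample above yields a caption in the requested style.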
🚀Diverse Use Cases
- Development of advanced multilingual chatbots capable of understanding both audio and visual inputs.
- Real-time streaming transcription and translation services across a multitude of languages.
- In-depth audio and video content analysis, including automated summarization and detailed captioning.
- Creation of sophisticated multimodal question answering and reasoning systems.
- Design of intuitive voice assistants with natural speech comprehension and rich multimodal understanding.
- Enabling interactive multimedia content generation and seamless navigation experiences.
💻API & Integration
API Pricing:
- Input: $4.0005
- Output: $3.213
API Integration:
Qwen3-Omni Captioner is easily accessible via the AI/ML API. For comprehensive documentation, detailed integration guides, and further API references, please visit the official documentation available here.
Code Sample: <snippet data-name="open-ai.audio-qwen" data-model="alibaba/qwen3-omni-30b-a3b-captioner"></snippet>
🆚Qwen3-Omni vs. Leading Models
- vs Gemini 2.5 Pro: Qwen3-Omni matches or surpasses Gemini’s performance on audio-video benchmarks and offers superior open-source accessibility. It provides comparable ASR performance with significantly lower latency in streaming speech generation.
- vs Seed-ASR: Qwen3-Omni achieves superior or highly comparable Word Error Rates while extending its capabilities to broader multimodal domains well beyond simple audio processing.
- vs GPT-4o: Qwen3-Omni excels particularly in multimodal audio and video tasks, all while maintaining robust proficiency in traditional text-based tasks. It features lower latency streaming audio output, largely due to its native multi-codebook speech codec.
❓Frequently Asked Questions
What makes Qwen3-Omni Captioner unique?
Qwen3-Omni Captioner is unique due to its nature as an end-to-end multilingual omni-modal foundation model. It supports diverse inputs like text, images, audio, and video, and provides real-time streaming text and speech outputs. Its innovative Thinker-Talker architecture and MoE design ensure exceptional performance and ultra-low latency across all modalities.
How does the model achieve ultra-low-latency audio streaming?
The model achieves this through its "Talker" component, which uses a Multi-Token Predictor (MTP) to autoregressively predict multi-codebook sequences. These residual codebooks are then incrementally synthesized into waveforms by the Code2Wav renderer, enabling seamless, frame-by-frame audio streaming with minimal delay.
How does Qwen3-Omni compare with other leading models?
Qwen3-Omni demonstrates state-of-the-art results on 22 out of 36 audio and audio-visual benchmarks. It often outperforms or matches strong closed-source models such as Gemini 2.5 Pro, Seed-ASR, and GPT-4o, particularly excelling in multimodal tasks and ASR accuracy, and offering lower streaming latency.
Can the model's behavior be customized?
Yes, Qwen3-Omni offers extensive customization options. Its behavior, including tone and style of interaction, is fully configurable via system prompts. This allows users to tailor the model's responses to specific application needs and user preferences.
What are the main use cases?
Qwen3-Omni Captioner is highly versatile, ideal for applications like multilingual chatbots with multimodal understanding, real-time transcription and translation, detailed audio and video content analysis, advanced multimodal question answering, natural voice assistants, and interactive multimedia content generation.
Learn how you can transform your company with AICC APIs


