



const axios = require('axios').default;

// Pre-configured client for the AI.CC API; put your API key after "Bearer ".
const api = axios.create({
  baseURL: 'https://api.ai.cc/v1',
  headers: { Authorization: 'Bearer ' },
});

const main = async () => {
  const response = await api.post('/stt', {
    model: 'aai/universal',
    url: 'https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3',
  });
  console.log('[transcription]', response.data.results.channels[0].alternatives[0].transcript);
};

main().catch(console.error);
import requests

headers = {"Authorization": "Bearer "}


def main():
    url = "https://api.ai.cc/v1/stt"
    data = {
        "model": "aai/universal",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3",
    }
    response = requests.post(url, json=data, headers=headers)
    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        transcript = response_data["results"]["channels"][0]["alternatives"][0][
            "transcript"
        ]
        print("[transcription]", transcript)


if __name__ == "__main__":
    main()


Product Detail
AssemblyAI's Universal series represents the pinnacle of Speech-to-Text (STT) technology, engineered to transform spoken language into highly accurate and intelligible text. These advanced models are meticulously trained on over 12.5 million hours of diverse multilingual audio data, allowing them to excel in complex, real-world conversational settings. They adeptly manage multiple speakers, various accents, and challenging background noise with exceptional fidelity.
⚙ Technical Specifications
- ✓ Architecture: Universal-1 leverages a Conformer encoder paired with a recurrent neural network transducer (RNN-T) model, optimized for both speed and accuracy.
- ✓ Encoder Details: Features convolutional layers for 4x subsampling, positional encoding, and 24 Conformer layers, totaling approximately 600 million parameters. Each Conformer block utilizes chunk-wise attention on 8-second audio segments for faster processing and robustness to varying audio lengths.
- ✓ Decoder: Comprises a two-layer LSTM predictor with a joiner, employing a WordPiece tokenizer trained on extensive multilingual corpora.
- ✓ Parallel Processing: Designed for highly parallelized encoder computation, enabling large-scale, low-latency inference, ideal for real-time applications.
- ✓ Timestamping: Ensures precise time alignment for accurate word-level timestamp estimation.
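The word-level timestamps mentioned above can be pulled out of a response in a few lines. The nesting below (`results` → `channels` → `alternatives`, with a `words` list holding `start`/`end` in seconds) follows the transcript path used in the code samples on this page, but the `words` field itself is an assumption for illustration, not a documented guarantee; check it against a real response.

```python
# Hedged sketch: extract word-level timestamps, ASSUMING the response carries
# a "words" list next to "transcript" (same nesting as the transcript path
# in the samples above). Verify against an actual API response.

def word_timestamps(response_data):
    """Return (word, start, end) tuples from the assumed response shape."""
    alt = response_data["results"]["channels"][0]["alternatives"][0]
    return [(w["word"], w["start"], w["end"]) for w in alt.get("words", [])]

# Tiny mock response, only to demonstrate the traversal:
mock = {
    "results": {"channels": [{"alternatives": [{
        "transcript": "hello world",
        "words": [
            {"word": "hello", "start": 0.12, "end": 0.48},
            {"word": "world", "start": 0.55, "end": 0.97},
        ],
    }]}]},
}

print(word_timestamps(mock))  # [('hello', 0.12, 0.48), ('world', 0.55, 0.97)]
```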
📈 Performance Benchmarks
- ✓ State-of-the-Art WER: Achieves industry-leading Word Error Rate (WER) on English, outperforming numerous commercial ASR providers and open-source models, including OpenAI’s Whisper Large-v3 and NVIDIA’s Canary-1B.
- ✓ Enhanced Robustness: Demonstrates superior noise robustness and strong performance in telephony and other challenging acoustic environments.
- ✓ Multilingual Competence: Shows competitive WER across Spanish, French, and German datasets, exhibiting robust cross-language capabilities.
- ✓ Qualitative Improvement: Human evaluations reveal a 60% preference for Universal-1 transcriptions over the previous-generation Conformer-2, underscoring significant qualitative transcription enhancements.
💰 API Pricing
$0.004725 per minute of audio
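At the listed rate, cost scales linearly with audio duration. A quick back-of-the-envelope estimate (plain arithmetic from the price above, no API calls):

```python
RATE_PER_MINUTE = 0.004725  # USD per minute, from the pricing listed above

def transcription_cost(duration_seconds: float) -> float:
    """Estimated cost in USD for transcribing audio of the given length."""
    return duration_seconds / 60 * RATE_PER_MINUTE

# Example: a one-hour recording costs about 28 cents.
print(round(transcription_cost(3600), 4))  # 0.2835
```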
📣 Core Features & Capabilities
- ✓ High-Accuracy Transcription: Delivers precise transcriptions, complete with punctuation, capitalization, and advanced text formatting.
- ✓ Speaker Diarization: Intelligently identifies and differentiates individual speakers within the audio.
- ✓ Advanced Entity Recognition: Accurately recognizes and transcribes proper nouns and alphanumeric content (e.g., phone numbers, email addresses).
- ✓ Real-time Processing: Offers low-latency real-time transcription with exceptional scalability and efficiency.
- ✓ Customization & Fine-tuning: Provides flexible options for fine-tuning and customization to suit diverse enterprise use cases.
- ✓ Ethical AI: Integrates rigorous strategies for bias mitigation, content safety, and hallucination reduction.
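The diarization capability above can be turned into a readable dialogue by grouping consecutive words by speaker. In this sketch, the per-word `speaker` key is an assumed field name for illustration only; the real response schema may differ.

```python
from itertools import groupby

def dialogue(words):
    """Collapse per-word speaker labels into (speaker, utterance) turns.

    Assumes each word dict carries "word" and "speaker" keys (illustrative
    field names, not a confirmed API shape).
    """
    return [
        (speaker, " ".join(w["word"] for w in group))
        for speaker, group in groupby(words, key=lambda w: w["speaker"])
    ]

words = [
    {"word": "hi", "speaker": 0},
    {"word": "there", "speaker": 0},
    {"word": "hello", "speaker": 1},
]
print(dialogue(words))  # [(0, 'hi there'), (1, 'hello')]
```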
💻 Code Sample
See the Node.js and Python examples at the top of this page for a complete request against the `aai/universal` model.
🔗 Comparison with Other Models
► Universal vs GPT-5
While GPT-5 boasts an enormous 400,000-token context window and advanced hierarchical reasoning, making it ideal for large-scale language understanding and generation, it is less suited for real-time STT processing compared to Universal. Universal is purpose-built for high-accuracy speech transcription.
► Universal vs GPT-4.1
GPT-4.1 specializes in coding tasks and structured code manipulation with a smaller context window. Though optimized for developer-focused scenarios, it lacks the broad speech recognition and multimodal integration capabilities that are central to AssemblyAI Universal.
► Universal vs OpenAI o3
OpenAI o3 primarily serves legacy agent tasks with basic image understanding. It exhibits higher latency and less accurate multimodal reasoning compared to AssemblyAI Universal, rendering it less effective for modern real-time transcription and multimodal applications.
📜 Frequently Asked Questions
1. What makes AssemblyAI Universal stand out in speech-to-text technology?
AssemblyAI Universal excels due to its training on over 12.5 million hours of multilingual audio data, allowing it to handle complex real-world scenarios with high accuracy, including multiple speakers, diverse accents, and significant background noise.
2. What are the key technical components of Universal-1?
Universal-1 employs a Conformer encoder with 24 layers and approximately 600 million parameters, combined with an RNN-T model. It features chunk-wise attention for faster processing and a two-layer LSTM decoder with a WordPiece tokenizer.
3. How does Universal perform compared to other leading ASR models?
Universal achieves state-of-the-art Word Error Rate (WER) on English, surpassing models like OpenAI’s Whisper Large-v3 and NVIDIA’s Canary-1B. It also shows competitive WER in Spanish, French, and German, demonstrating strong cross-language robustness.
4. What unique capabilities does AssemblyAI Universal offer?
Beyond high-accuracy transcription, it offers speaker diarization, accurate recognition of proper nouns and alphanumeric content, low-latency real-time transcription, and flexible customization options for enterprise use.
5. Is Universal suitable for real-time applications?
Yes, Universal's architecture is specifically designed for highly parallelized computation and enables large-scale, low-latency inference, making it ideally suited for real-time transcription and applications requiring immediate processing.
Learn how you can transform your company with AICC APIs


