



const axios = require('axios').default;

// Pre-configured client for the AI.CC API (API key omitted here).
const api = axios.create({
  baseURL: 'https://api.ai.cc/v1',
  headers: { Authorization: 'Bearer ' },
});

const main = async () => {
  // Submit a hosted audio file to Slam-1 for transcription.
  const response = await api.post('/stt', {
    model: 'aai/slam-1',
    url: 'https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3',
  });

  console.log('[transcription]', response.data.results.channels[0].alternatives[0].transcript);
};

main();
import requests

headers = {"Authorization": "Bearer "}


def main():
    url = "https://api.ai.cc/v1/stt"
    data = {
        "model": "aai/slam-1",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3",
    }

    # Submit a hosted audio file to Slam-1 for transcription.
    response = requests.post(url, json=data, headers=headers)
    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        transcript = response_data["results"]["channels"][0]["alternatives"][0]["transcript"]
        print("[transcription]", transcript)


if __name__ == "__main__":
    main()
AI Playground

Test all API models in the sandbox environment before you integrate.
We provide more than 300 models to integrate into your app.


Product Detail
Slam-1 stands as AssemblyAI's groundbreaking Speech Language Model (SLM), uniquely designed to unify large language model architecture with advanced automatic speech recognition (ASR) encoders. This powerful combination delivers superior speech-to-text transcription accuracy. Tailored specifically for speech tasks, Slam-1 offers a profound understanding of context and semantics, enabling promptable and highly customizable transcription. It intelligently adapts to specialized industry terminology and complex spoken content, making it an ideal solution for critical use cases in healthcare, legal, sales, and technical domains that require precise, context-aware transcriptions.
Technical Specifications
Performance Benchmarks
✅ Reduces missed entity rates by up to 66%, particularly for names, medical, and technical terms.
✅ Decreases formatting errors by approximately 20%.
✅ Preferred by over 72% of end users in blind tests versus competing models.
✅ Achieves more reliable transcript quality in noisy and specialized contexts.
✅ Delivers robustness against hallucinations through a multi-modal architecture that simultaneously processes audio and language.
Architecture Breakdown
Slam-1’s architecture merges a speech encoder with an adapter layer tuned to connect acoustic features to a frozen large language model, enabling powerful semantic understanding. This multi-modal design surpasses traditional audio-to-text models by interpreting spoken content holistically, supporting both accurate transcription and contextual reasoning. The approach leverages prompt engineering to dynamically customize transcription accuracy for industry-specific vocabularies and speech patterns.
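The flow can be pictured as three stages: encode audio into acoustic features, project those features into the LLM's embedding space through the adapter, and let the frozen LLM decode text conditioned on them plus an optional prompt. The Python sketch below is purely illustrative and is not AssemblyAI's implementation; the placeholder functions, the 512/4096 dimensions, and the 160-sample frame hop are assumptions chosen only to make the shape of the pipeline concrete.

import numpy as np

rng = np.random.default_rng(0)

def speech_encoder(audio):
    # Placeholder: map raw 16 kHz audio to one 512-dim acoustic feature per 160-sample hop.
    return rng.standard_normal((len(audio) // 160, 512))

def adapter(features, projection):
    # The adapter is the trained bridge: it projects acoustic features
    # into the frozen LLM's embedding space (4096-dim here, by assumption).
    return features @ projection

def frozen_llm_decode(audio_embeddings, prompt):
    # Placeholder for the frozen LLM, which conditions on the projected audio
    # embeddings and an optional text prompt to produce the transcript.
    return f"<transcript from {audio_embeddings.shape[0]} audio embeddings, prompt: {prompt!r}>"

audio = rng.standard_normal(16000 * 5)           # stand-in for 5 seconds of audio
projection = rng.standard_normal((512, 4096))    # adapter weights (learned in the real model)
print(frozen_llm_decode(adapter(speech_encoder(audio), projection), prompt="cardiology terms"))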
API Pricing
Get started for just $0.002625 per minute
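At that rate, one hour of audio works out to roughly $0.16 and 100 hours to $15.75. A minimal helper for estimating a bill, using only the per-minute rate listed above:

RATE_PER_MINUTE = 0.002625  # USD per minute of audio, as listed above

def transcription_cost(audio_minutes):
    """Estimated charge in USD for a given number of audio minutes."""
    return audio_minutes * RATE_PER_MINUTE

for minutes in (10, 60, 100 * 60):
    print(f"{minutes:>5} min -> ${transcription_cost(minutes):.4f}")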
Core Features & Capabilities
✨ Speech and Language Integration: Seamlessly combines speech encoder and LLM for promptable and customizable transcription workflows.
⚙️ Fine-Tuning & Customization: Enables domain-specific adaptation through simple prompts, eliminating the need for complex retraining (see the request sketch after this list).
🎯 High Accuracy: Offers superior recognition of rare and domain-specific terms, significantly improving downstream analytics and reducing manual review efforts.
🗣️ Multi-Channel & Speaker Diarization: Fully supports complex audio streams with accurate speaker separation and timestamps provided out of the box.
🏢 Enterprise Ready: Specifically designed to reduce post-processing effort and enhance transcript quality in high-stakes industries such as healthcare and legal.
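As a minimal sketch of the prompt-driven customization mentioned above, the request below reuses the /stt call from the Python code sample and adds a prompt field listing domain terms. That field name and format are assumptions for illustration only; confirm the exact customization parameters against the API reference.

import requests

headers = {"Authorization": "Bearer "}

data = {
    "model": "aai/slam-1",
    "url": "https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3",
    # Hypothetical field: bias recognition toward domain-specific vocabulary.
    "prompt": "Cardiology consult: expect terms such as atrial fibrillation, apixaban, echocardiogram.",
}

response = requests.post("https://api.ai.cc/v1/stt", json=data, headers=headers)
response.raise_for_status()
print(response.json()["results"]["channels"][0]["alternatives"][0]["transcript"])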
Code Sample
See the Node.js and Python request examples at the top of this page for a complete /stt call against the aai/slam-1 model.
Comparison with Other Models
VS AssemblyAI Universal: Slam-1 distinguishes itself with promptable, highly customizable transcription featuring superior entity recognition for specialized domains. In contrast, AssemblyAI Universal is optimized for broader language support and lower latency, catering to general transcription needs.
VS GPT-4.1 (audio transcription use): Slam-1 is purpose-built and highly optimized for speech-to-text, incorporating robust multi-channel and speaker diarization features. GPT-4.1, on the other hand, primarily focuses on general Natural Language Processing (NLP) tasks and lacks native audio processing capabilities essential for comprehensive transcription.
Frequently Asked Questions (FAQ)
Q: What makes Slam-1 unique among speech-to-text solutions?
A: Slam-1 is unique due to its innovative architecture that unifies a speech encoder with a large language model (LLM). This integration allows it to understand context and semantics at a deep level, providing significantly higher accuracy and enabling promptable, customizable transcription for complex and specialized content, outperforming traditional ASR systems.
Q: How does Slam-1 ensure high accuracy for specialized terminology?
A: Slam-1 leverages prompt engineering and its LLM capabilities to adapt dynamically to specific industry vocabularies. This allows users to customize the model to recognize rare names, medical terms, legal jargon, and technical phrases with superior precision without requiring extensive retraining, thereby reducing missed entity rates significantly.
Q: What industries benefit most from Slam-1's capabilities?
A: Industries requiring precise and context-aware transcription benefit immensely. This includes healthcare (for medical dictation and patient records), legal (for court proceedings and depositions), sales (for call analytics), and technical domains (for detailed technical discussions and documentation). Slam-1's high accuracy and customization are critical in these high-stakes environments.
Q: Does Slam-1 support multi-speaker audio transcription?
A: Yes, Slam-1 comes with multi-channel and speaker diarization features built-in. This means it can accurately separate different speakers in complex audio streams and provide timestamps for each speaker's contribution, making it ideal for meetings, interviews, and other multi-participant recordings.
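To turn such a response into readable per-speaker turns, the sketch below assumes a word-level structure (a "words" list carrying "speaker" and timestamp fields inside each alternative), extrapolated from the channel/alternative layout used in the code sample above; adjust the keys to match the actual response schema.

from itertools import groupby

def speaker_turns(alternative):
    """Collapse consecutive words from the same speaker into (speaker, text) turns."""
    words = alternative.get("words", [])
    return [
        (speaker, " ".join(w["word"] for w in group))
        for speaker, group in groupby(words, key=lambda w: w.get("speaker", 0))
    ]

# Toy example of the assumed structure:
example_alternative = {
    "words": [
        {"word": "Hello", "speaker": 0, "start": 0.0, "end": 0.4},
        {"word": "there.", "speaker": 0, "start": 0.4, "end": 0.7},
        {"word": "Hi!", "speaker": 1, "start": 0.9, "end": 1.1},
    ]
}
for speaker, text in speaker_turns(example_alternative):
    print(f"Speaker {speaker}: {text}")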
Q: How does Slam-1 address the issue of transcription "hallucinations"?
A: Slam-1's multi-modal architecture is designed for robustness against hallucinations. By simultaneously processing both audio and language data, it can cross-reference and validate information from acoustic features against semantic understanding, significantly reducing the likelihood of generating inaccurate or fabricated content in its transcriptions.
Learn how you can transform your company with AICC APIs


