Whisper
OpenAI's Whisper offers robust, multilingual speech-to-text capabilities. Trained on a large and diverse audio dataset, the model is released under the MIT license and is free for commercial use.
$1 in Free Tokens for New Members
Speech to Text
const axios = require('axios').default;

// Pre-configured client for the ai.cc API.
const api = axios.create({
  baseURL: 'https://api.ai.cc/v1',
  headers: { Authorization: 'Bearer ' }, // append your API key
});

const main = async () => {
  // Submit a remote audio file to the speech-to-text endpoint.
  const response = await api.post('/stt', {
    model: '#g1_whisper-large',
    url: 'https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3',
  });

  console.log('[transcription]', response.data.results.channels[0].alternatives[0].transcript);
};

main();
import requests

headers = {"Authorization": "Bearer "}  # append your API key


def main():
    url = "https://api.ai.cc/v1/stt"
    data = {
        "model": "#g1_whisper-large",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_unconditional/sample-0.mp3",
    }

    response = requests.post(url, json=data, headers=headers)

    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        transcript = response_data["results"]["channels"][0]["alternatives"][0][
            "transcript"
        ]
        print("[transcription]", transcript)


if __name__ == "__main__":
    main()

One API 300+ AI Models

Save 20% on Costs & $1 Free Tokens
  • AI Playground

    Test all API models in the sandbox environment before you integrate.

    We provide more than 300 models to integrate into your app.

Whisper

Product Detail

Understanding OpenAI's Whisper Model: A Comprehensive Overview

The Whisper model, developed by OpenAI, stands as a pivotal advancement in automatic speech recognition (ASR) and speech translation technology. Released to the public to foster AI research, Whisper models are designed for robustness, generalization, and to identify potential biases in AI systems. They are particularly effective for English speech recognition but offer strong multilingual capabilities.

Important Note: Use of Whisper models for transcribing non-consensual recordings or in high-risk decision-making contexts is strongly discouraged due to potential inaccuracies and ethical concerns.

Basic Information & Evolution

  • Model Name: Whisper
  • Developer: OpenAI
  • Release History: Original series in September 2022, followed by large-v2 in December 2022, and large-v3 in November 2023.
  • Model Type: Sequence-to-sequence ASR (Automatic Speech Recognition) and Speech Translation Model.

Whisper Model Versions Overview

Size   | Parameters | Relative Speed
tiny   | 39 M       | ~32x
base   | 74 M       | ~16x
small  | 244 M      | ~6x
medium | 769 M      | ~2x
large  | 1550 M     | 1x
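The relative-speed column can be read as a rough multiplier against the large model, useful for back-of-envelope runtime estimates. A minimal sketch (the 120-second baseline below is a hypothetical measurement, not a benchmark from this page):

```python
# Relative inference speeds from the table above (large = 1x baseline).
RELATIVE_SPEED = {"tiny": 32, "base": 16, "small": 6, "medium": 2, "large": 1}

def estimated_runtime(seconds_with_large: float, model: str) -> float:
    """Scale a runtime measured with 'large' by the table's speed multiplier."""
    return seconds_with_large / RELATIVE_SPEED[model]

# If a clip takes a (hypothetical) 120 s with large, tiny should take roughly:
print(estimated_runtime(120, "tiny"))  # 3.75
```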

Key Features of Whisper Models

  • ✓ Multilingual Capabilities: Strong performance across approximately 10 languages, with ongoing evaluation for broader applications like voice activity detection and speaker classification.
  • ✓ Robustness: Exceptionally resilient to diverse accents, dialects, and noisy audio environments.
  • ✓ Versatile Applications: Ideal for speech transcription, language translation, and automated subtitle generation.
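Since automated subtitle generation is a headline use case, here is a minimal sketch of turning timed transcript segments into SRT. The `(start, end, text)` segment shape is an assumption for illustration; real ASR output formats vary by API.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render (start_s, end_s, text) triples as an SRT subtitle document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

segments = [(0.0, 2.5, "Hello world."), (2.5, 5.75, "This is a subtitle sketch.")]
print(segments_to_srt(segments))
```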

Intended Use Cases 🚀

Whisper models are primarily intended for developers and researchers. They are valuable tools for integrating advanced speech-to-text functionalities into various applications, enhancing accessibility features, and supporting linguistic research initiatives.

Technical Details ⚙️

Architecture:

The Whisper model is built on an encoder-decoder Transformer architecture, trained end-to-end with large-scale weak supervision on audio paired with transcripts collected from the internet. This training regime yields robust, general-purpose speech representations.

Training Data:

Training involved a massive 680,000 hours of internet-sourced audio and corresponding transcripts. This dataset was meticulously balanced:

  • ‣ 65% English audio with English transcripts.
  • ‣ 18% Non-English audio with English transcripts.
  • ‣ 17% Non-English audio with matching non-English transcripts.

In total, the training data covered 98 distinct languages.
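In absolute terms, those proportions of the 680,000 hours work out as follows (simple arithmetic on the figures above):

```python
TOTAL_HOURS = 680_000  # total training audio, per the breakdown above

SPLIT = {
    "English audio with English transcripts": 0.65,
    "Non-English audio with English transcripts": 0.18,
    "Non-English audio with matching transcripts": 0.17,
}

assert abs(sum(SPLIT.values()) - 1.0) < 1e-9  # the three slices cover everything

for label, fraction in SPLIT.items():
    print(f"{label}: ~{round(TOTAL_HOURS * fraction):,} hours")
```

That is roughly 442,000 hours of English, 122,400 hours of non-English audio paired with English translations, and 115,600 hours of non-English audio with matching transcripts.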

Performance Metrics & Considerations:

Research indicates Whisper models generally outperform many existing ASR systems, exhibiting enhanced robustness to accents, background noise, and specialized technical language. They deliver nearly state-of-the-art accuracy in both speech recognition and zero-shot translation from multiple languages into English.

However, performance can vary significantly across languages, particularly in low-resource or less commonly studied ones. Accuracy may also differ based on various accents, dialects, and demographic groups. The models can occasionally generate repetitive text, a characteristic that can often be mitigated through techniques like beam search and temperature scheduling.
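The repetition mitigation mentioned above can be sketched as a temperature-fallback loop. This is an illustrative sketch, not Whisper's actual implementation: `decode` is a dummy stand-in for a real decoding call, and the 2.4 compression-ratio threshold mirrors the default used in the open-source whisper package.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Looping, repetitive text compresses extremely well, so a high
    raw/compressed size ratio is a cheap signal of degenerate output."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def transcribe_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                             max_ratio=2.4):
    """Retry decoding at increasing temperature until the output no longer
    looks repetitive. `decode(t)` stands in for a real decoding call."""
    text = ""
    for t in temperatures:
        text = decode(t)
        if compression_ratio(text) <= max_ratio:
            return text, t  # first acceptable result wins
    return text, temperatures[-1]  # give up and keep the last attempt

# Demo with a dummy decoder: greedy decoding (t=0.0) "loops"; sampling recovers.
outputs = {0.0: "the " * 50,
           0.2: "the weather today is mild with a light breeze"}
text, used_t = transcribe_with_fallback(lambda t: outputs.get(t, outputs[0.2]))
print(used_t, text)  # falls back from 0.0 to 0.2
```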

Knowledge Cutoff:

The audio and text data used for training the Whisper models do not include information beyond mid-2022.

Usage and Integration 💻

  • Code Samples/SDK: Developers can access Whisper functionalities via available SDKs and code samples for integration into their applications.
  • Tutorials: Explore guides such as the Speech-to-text Multimodal Experience in NodeJS for practical implementation insights.
  • Maximum File Size: The current limit for audio file processing is 2 GB.
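Given the stated 2 GB limit, a client-side pre-flight check avoids a doomed upload. A minimal sketch (the limit value is taken from this page; the helper name is illustrative):

```python
import os
import tempfile

MAX_BYTES = 2 * 1024 ** 3  # 2 GB limit stated above

def check_upload_size(path: str, limit: int = MAX_BYTES) -> int:
    """Return the file size, raising before upload if it exceeds the limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size:,} bytes; the limit is {limit:,}")
    return size

# Demo: a 1 KiB scratch file passes; the same file fails a tiny artificial limit.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
    f.write(b"\x00" * 1024)
    path = f.name
print(check_upload_size(path))          # 1024
try:
    check_upload_size(path, limit=512)  # simulate an oversized upload
except ValueError as e:
    print("rejected:", e)
os.remove(path)
```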

Ethical Considerations & Licensing ⚖️

  • ⚠ Ethical Guidelines: OpenAI provides comprehensive guidance on responsible usage, emphasizing the importance of privacy and ethical deployment of AI technologies.
  • ⚠ Bias Mitigation: Continuous efforts are underway to reduce biases in speech recognition accuracy across different languages, accents, and demographic groups.
  • ⓘ License Type: Whisper models are released under the MIT License, permitting both commercial and non-commercial use.


Frequently Asked Questions (FAQ)

Q1: What is the primary purpose of OpenAI's Whisper model?
A1: The Whisper model is an advanced ASR and speech translation model primarily intended for AI research into model robustness, generalization, and biases. It also excels in English speech recognition and offers strong multilingual capabilities.

Q2: What are the main applications of the Whisper model?
A2: It can be used for various tasks including speech transcription, translating spoken language into text, and generating subtitles for audio and video content.

Q3: How many languages does Whisper support?
A3: The models were trained on data covering 98 languages and show strong performance in roughly 10 languages, with varying accuracy for others.

Q4: Are there any ethical concerns regarding the use of Whisper?
A4: Yes, OpenAI strongly discourages its use for transcribing non-consensual recordings or in high-risk decision-making processes due to potential inaccuracies and privacy concerns. Users are advised to follow OpenAI's ethical guidelines.

Q5: Is the Whisper model open source?
A5: Yes, Whisper models are released under the MIT license, allowing for both commercial and non-commercial use by developers and researchers.

Learn how you can transform your company with AICC APIs

Discover how to revolutionize your business with the AICC API! Unlock powerful tools to automate processes, enhance decision-making, and personalize customer experiences.
Contact sales
