



const fs = require('fs');
const path = require('path');
const axios = require('axios');

// axios.create() is a factory function, not a constructor — no `new`.
const api = axios.create({
  baseURL: 'https://api.ai.cc/v1',
  headers: { Authorization: 'Bearer ' }, // insert your API key after "Bearer "
});

const main = async () => {
  try {
    // Request speech synthesis for a two-speaker script.
    const response = await api.post('/tts', {
      model: 'microsoft/vibevoice-1.5b',
      script: 'Speaker 0: Hello there! Speaker 1: Hi, how are you?',
      speakers: [
        { preset: 'Frank [EN]' }
      ]
    });

    const responseData = response.data;
    const audioUrl = responseData.audio.url;
    const fileName = responseData.audio.file_name;

    // Download the generated audio as a stream. Use plain axios here, not the
    // `api` instance, so the API Authorization header is not sent to the audio host.
    const audioResponse = await axios.get(audioUrl, { responseType: 'stream' });
    const dist = path.resolve(__dirname, fileName);
    const writeStream = fs.createWriteStream(dist);
    audioResponse.data.pipe(writeStream);

    writeStream.on('finish', () => {
      console.log('Audio saved to:', dist);
      console.log(`Duration: ${responseData.duration} seconds`);
      console.log(`Sample rate: ${responseData.sample_rate} Hz`);
    });
  } catch (error) {
    console.error('Error:', error.message);
  }
};

main();
import os

import requests


def main():
    url = "https://api.ai.cc/v1/tts"
    headers = {
        "Authorization": "Bearer ",  # insert your API key after "Bearer "
    }
    payload = {
        "model": "microsoft/vibevoice-1.5b",
        "script": "Speaker 0: Hello there! Speaker 1: Hi, how are you?",
        "speakers": [
            {"preset": "Frank [EN]"}
        ],
    }
    try:
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()  # Raise an exception for bad status codes
        response_data = response.json()

        audio_url = response_data["audio"]["url"]
        file_name = response_data["audio"]["file_name"]

        # Download the generated audio in chunks and write it to disk.
        audio_response = requests.get(audio_url, stream=True)
        audio_response.raise_for_status()
        dist = os.path.join(os.path.dirname(__file__), file_name)
        with open(dist, "wb") as write_stream:
            for chunk in audio_response.iter_content(chunk_size=8192):
                if chunk:
                    write_stream.write(chunk)

        print("Audio saved to:", dist)
        print(f"Duration: {response_data['duration']} seconds")
        print(f"Sample rate: {response_data['sample_rate']} Hz")
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
    except Exception as e:
        print(f"Error: {e}")


main()
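The `script` field in the requests above encodes multi-speaker dialogue as segments prefixed with `Speaker 0:`, `Speaker 1:`, and so on. A small helper can assemble that format from a list of turns (`build_script` is a hypothetical convenience function, not part of any SDK):

```python
def build_script(turns):
    """Join (speaker_index, text) pairs into the 'Speaker N: text'
    format used by the /tts endpoint's `script` field."""
    return " ".join(f"Speaker {i}: {text}" for i, text in turns)


script = build_script([(0, "Hello there!"), (1, "Hi, how are you?")])
print(script)  # Speaker 0: Hello there! Speaker 1: Hi, how are you?
```

This reproduces the exact `script` string used in both samples above, which makes longer dialogues easier to maintain than a hand-written one-liner.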
Product Detail
VibeVoice 1.5B stands as a groundbreaking AI voice synthesis model, meticulously engineered to deliver high-quality, natural-sounding speech. It boasts exceptional expressive tone modulation, adapting flawlessly across diverse languages and contexts. This highly scalable and versatile solution empowers content creators, developers, and enterprises by providing advanced voice generation capabilities for a wide array of applications, including virtual assistants, audiobooks, gaming, and multimedia production.
✨ Key Capabilities & Input Versatility
VibeVoice 1.5B masterfully processes various input types to produce lifelike speech with nuanced prosody, ensuring adaptability for any project. It supports:
- ✓ Plain Text: For simple and direct speech generation.
- ✓ SSML (Speech Synthesis Markup Language): Enabling fine-grained control over speech attributes like pauses, pronunciation, and intonation.
- ✓ Emotional/Style Tags: To infuse specific emotions and distinct speaking styles into the output.
This model adeptly handles conversational dialogue, narration, and character voices, delivering dynamic intonation that makes every utterance sound genuinely human.
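As a sketch of what SSML-style control looks like in practice, a passage can be wrapped in markup that dictates pauses and emphasis. The exact tag subset VibeVoice 1.5B accepts is not documented on this page, so treat the tags below as an assumption based on standard SSML:

```python
# Hypothetical SSML fragment; the supported tag set is an assumption
# (standard SSML 1.1 tags shown for illustration).
ssml = (
    "<speak>"
    'Hello there! <break time="500ms"/> '
    '<emphasis level="strong">How are you?</emphasis>'
    "</speak>"
)
print(ssml)
```

Such a string would replace plain text in the request body wherever the API accepts SSML input.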
🚀 Unmatched Performance & Output Quality
- ⏳ Latency: Optimized for near real-time voice generation, VibeVoice 1.5B is perfectly suited for interactive applications such as chatbots and live broadcasts, ensuring immediate and fluid communication.
- 🎧 Audio Quality: It consistently produces studio-grade audio, characterized by clear articulation, natural intonation, and seamless transitions. This makes it ideal for both professional and consumer-facing applications demanding superior audio fidelity.
- 💬 Expressiveness: The model provides granular control over emotional tone, emphasis, pacing, and accent adaptations. This flexibility allows users to perfectly align the voice output with specific storytelling requirements and branding needs.
🧠 Advanced Technical Architecture
VibeVoice 1.5B is built upon a sophisticated transformer-based neural Text-to-Speech (TTS) backbone. It incorporates advanced prosody modeling modules, leveraging multi-layer self-attention mechanisms and convolutional layers specifically optimized for temporal acoustic feature extraction. The model's exceptional performance is a result of extensive training on a vast corpus of multi-lingual speech recordings and richly annotated emotional speech datasets, ensuring robust generalization across a wide range of speakers and styles.
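To illustrate the kind of computation the self-attention mechanism mentioned above performs, here is a generic scaled dot-product attention sketch in NumPy. This is textbook attention, not VibeVoice's actual architecture, weights, or dimensions, which are not published on this page:

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Generic attention: softmax(Q K^T / sqrt(d)) V.
    Each output frame is a weighted mix of all value frames."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (T, T) pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # weighted sum of values


T, d = 4, 8  # toy values: 4 time steps, 8-dim acoustic features
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

In a TTS backbone, letting every time step attend to every other is what allows prosody at one point in an utterance to be conditioned on distant context.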
💲 API Pricing
- 💰 $0.042 per generated minute
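At $0.042 per generated minute, job cost is simple to estimate. The helper below assumes pro-rata per-second billing; the actual rounding granularity is not stated on this page:

```python
PRICE_PER_MINUTE = 0.042  # USD, from the pricing above


def estimate_cost(duration_seconds):
    """Estimate generation cost, assuming simple pro-rata billing."""
    return duration_seconds / 60 * PRICE_PER_MINUTE


# e.g. a 10-minute audiobook chapter:
print(f"${estimate_cost(600):.3f}")  # $0.420
```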
⭐ Core Features at a Glance
- 📝 Multimodal Input Processing: Accepts diverse input formats, including textual content enriched with embedded emotional cues and precise phoneme-level instructions, offering unparalleled control over the synthetic voice.
- 🎧 Expressive Voice Customization: Enables detailed adjustment of critical speech attributes such as pitch, speaking speed, emotional undertones, and subtle speaker identity variations, allowing for perfect voice alignment with your creative vision.
- 🌐 Multilingual and Multidialect Support: Delivers consistently natural voice outputs across numerous languages and regional dialects, maintaining high-fidelity voice quality for a truly global reach.
💡 Diverse Applications
- 👤 Virtual Assistants & Chatbots: Facilitate engaging, human-like interactions, enhancing customer support and digital companionship.
- 📚 Audiobook & Podcast Narration: Generate dynamic voice performances with distinct character differentiation and emotion, bringing narratives vividly to life.
- 🎮 Gaming & Animation: Create realistic character voices with extensive style flexibility, contributing to deeply immersive storytelling and gameplay experiences.
- 📖 Accessibility Tools: Provide high-quality screen reader voices with customizable expressiveness, significantly enhancing the user experience for everyone.
- 🌎 Content Localization: Enable fast, natural voice dubbing across multiple languages, effortlessly supporting global content distribution and broader audience reach.
📝 Code Sample
// Example VibeVoice 1.5B API usage via the REST endpoint documented above.
// Emotional/style cues can be carried in the script text itself (see
// "Emotional/Style Tags" under Key Capabilities).
const payload = {
  model: "microsoft/vibevoice-1.5b",
  script: "Speaker 0: Hello, this is VibeVoice 1.5B speaking!",
  speakers: [{ preset: "Frank [EN]" }]
};

fetch("https://api.ai.cc/v1/tts", {
  method: "POST",
  headers: {
    Authorization: "Bearer ", // insert your API key after "Bearer "
    "Content-Type": "application/json"
  },
  body: JSON.stringify(payload)
})
  .then(res => res.json())
  .then(data => console.log("Generated audio:", data.audio.url))
  .catch(error => console.error("Error synthesizing voice:", error));
📈 VibeVoice 1.5B vs. Competitors
- vs. Eleven Music: While Eleven Music specializes in AI-driven music generation with intricate composition capabilities, VibeVoice 1.5B distinguishes itself by excelling in natural and expressive voice synthesis, specifically for spoken audio.
- vs. Suno AI: Compared to Suno AI's focus on music generation features, VibeVoice 1.5B's core strength lies in its superior speech quality, unparalleled prosody control, and multilingual voice delivery, meticulously designed for conversational contexts rather than musical content.
- vs. Udio: Udio typically targets simpler audio production with limited voice synthesis. VibeVoice, conversely, offers significantly higher fidelity, detailed emotional variation, and broader application support tailored for professional voice generation requirements.
- vs. MusicAI Sandbox: MusicAI Sandbox is primarily geared towards creative music experimentation. In stark contrast, VibeVoice 1.5B prioritizes realistic spoken voice output, providing advanced fine-tuning options for a diverse range of vocal characteristics and styles.
- vs. AIMusic.fm: AIMusic.fm largely automates music creation with restricted customization options. VibeVoice provides granular control over speech parameters and extensive style adaptability, specifically tailored for speech-centric projects.
☝ Frequently Asked Questions (FAQs)
1. What neural vocoder architecture enables VibeVoice 1.5B's expressive speech synthesis?
VibeVoice 1.5B employs an efficient flow-matching diffusion architecture, meticulously optimized for emotional expressiveness and voice quality at its 1.5-billion parameter scale. This architecture features hierarchical waveform generation that captures both macro-prosodic patterns and micro-intonation details, coupled with style-adaptive normalization to preserve speaker identity across various emotional states.
2. How does the model achieve emotional expressiveness within its compact parameter budget?
The model implements highly efficient emotional prosody modeling through distilled emotion embeddings. These capture the acoustic correlates of different emotional states without requiring extensive parameter overhead. This, combined with shared emotional feature extractors and optimized pitch/timing networks, allows for an impressive emotional range.
3. What voice customization capabilities does VibeVoice 1.5B offer?
VibeVoice 1.5B provides efficient voice adaptation through few-shot learning from limited audio samples and parameter-efficient fine-tuning. Users can adjust voice attributes including pitch, speaking rate, and emotional intensity. It also supports style transfer from reference audio and basic accent adaptation while maintaining computational efficiency.
4. How does VibeVoice 1.5B balance quality and efficiency for different deployment scenarios?
The model employs intelligent resource allocation, directing computational budget to the most perceptually important aspects of speech generation. This includes adaptive quality scaling, efficient attention mechanisms, and optimized audio processing pipelines. This balanced approach ensures strong performance across diverse deployment environments, from cloud instances to edge devices.
5. What practical applications benefit most from VibeVoice 1.5B's efficient design?
Its efficiency makes it exceptionally suitable for applications such as mobile voice assistants, embedded systems with limited computational resources, multi-tenant cloud services requiring cost-effective voice generation, real-time interactive applications with strict latency requirements, and educational platforms serving many simultaneous users.