



// Node.js (OpenAI SDK) example
const { OpenAI } = require('openai');

// Point the OpenAI SDK at the AI.CC-compatible endpoint.
const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '', // your AI.CC API key
});

const main = async () => {
  const result = await api.chat.completions.create({
    model: 'nvidia/nemotron-nano-12b-v2-vl',
    messages: [
      {
        role: 'system',
        content: 'You are an AI assistant who knows everything.',
      },
      {
        role: 'user',
        content: 'Tell me, why is the sky blue?',
      },
    ],
  });

  const message = result.choices[0].message.content;
  console.log(`Assistant: ${message}`);
};

main();
# Python (OpenAI SDK) example
from openai import OpenAI

# Point the OpenAI SDK at the AI.CC-compatible endpoint.
client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",  # your AI.CC API key
)

response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant who knows everything.",
        },
        {
            "role": "user",
            "content": "Tell me, why is the sky blue?",
        },
    ],
)

message = response.choices[0].message.content
print(f"Assistant: {message}")
AI Playground

Test all API models in the sandbox environment before you integrate.
We provide more than 300 models to integrate into your app.


Product Detail
Nemotron Nano 12B V2 VL is NVIDIA's state-of-the-art 12-billion-parameter open multimodal vision-language model, engineered for video understanding, complex multi-image document reasoning, and natural language generation. Its hybrid Transformer-Mamba architecture balances the high accuracy of transformers with the memory-efficient sequence modeling of Mamba, enabling high throughput and low-latency inference. This makes it well suited for demanding tasks involving extensive text and imagery, particularly long-form documents and videos.
🚀 Technical Specifications
- Model Size: 12.6 billion parameters
- Architecture: Hybrid Transformer-Mamba sequence model
- Context Window: Ultra-long, supporting up to 128,000 tokens
- Input Modalities: Text, multi-image documents, video frames
✨ Performance Benchmarks
- OCRBench v2: Achieves leading optical character recognition accuracy for document understanding tasks.
- Multimodal Reasoning: Boasts an impressive average score of ≈74 across key benchmarks including MMMU, MathVista, AI2D, ChartQA, DocVQA, and Video-MME.
- Video Comprehension: Enhanced by Efficient Video Sampling (EVS), enabling long-form video processing with significantly reduced inference costs.
- Multilingual Accuracy: Delivers robust performance across diverse languages, ensuring strong visual question answering and precise document parsing globally.
💡 Key Features
- ✅ Low Latency VL Inference: Optimized for exceptionally fast, high-throughput reasoning on combined text and image data.
- ✅ Efficient Long-Context Processing: Capable of handling extensive videos and documents up to 128K tokens through innovative token reduction techniques.
- ✅ Multi-Image & Video Understanding: Provides simultaneous analysis of multiple images and video frames for comprehensive scene interpretation and summarization.
- ✅ High-Resolution & Wide Layout Support: Expertly processes tiled images and panoramic inputs, making it ideal for charts, forms, and complex visual documents.
- ✅ Multimodal Querying: Supports advanced visual question answering, document data extraction, multi-step reasoning, and dense captioning across multiple languages (see the sketch after this list).
- ✅ Hybrid Transformer-Mamba Architecture: Skillfully balances the high accuracy of traditional transformers with the memory efficiency of Mamba, enhancing inference scalability.
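The sketch below illustrates multimodal querying through the same OpenAI-compatible endpoint used in the examples at the top of this page. It is a minimal sketch, assuming the AI.CC gateway accepts OpenAI-style image_url content parts for vision models; the image URL is a hypothetical placeholder.

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")  # your AI.CC API key

# Minimal sketch: ask a question about a single image, assuming the endpoint
# accepts OpenAI-style "image_url" content parts for vision models.
response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show? Summarize the key trend."},
                # Hypothetical placeholder URL; replace with your own image.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)

The same content-list pattern extends to multiple images in one message, which is how multi-image document reasoning is typically exercised.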
💲 Nemotron Nano 12B V2 VL API Pricing
Input: $0.22155 / 1M tokens
Output: $0.66465 / 1M tokens
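As a rough illustration of the rates above, the snippet below estimates the cost of a single request from its token counts; the token numbers in the example are hypothetical.

# Rough cost estimate for one request at the listed per-million-token rates.
INPUT_PRICE_PER_M = 0.22155   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.66465  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the listed rates."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )

# Example: a 12,000-token document prompt with a 1,500-token answer (hypothetical counts).
print(f"${estimate_cost(12_000, 1_500):.6f}")  # ≈ $0.003656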
🎯 Key Use Cases
- Document Intelligence: Automate extraction and analysis of complex documents like invoices, contracts, receipts, and manuals with high precision.
- Visual Question Answering (VQA): Query intricate images, charts, or video scenes to receive detailed and accurate answers.
- Video Analytics: Perform comprehensive summarization, action detection, and scene understanding for long-form video content (see the sketch after this list).
- Data Analysis & Reporting: Automatically generate structured reports with high accuracy from diverse multimodal data inputs.
- Media Asset Management: Enable dense captioning and comprehensive indexing for video content and extensive multimedia libraries.
- Cross-Lingual Multimodal Tasks: Seamlessly handle diverse language inputs combined with images for broad global applications.
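For the video analytics use case, a common pattern is to sample a handful of frames and send them as multiple image parts in one request. The sketch below follows that pattern under the same assumption as above (OpenAI-style image_url content parts); the frame URLs are hypothetical placeholders, and frame extraction itself (e.g. with ffmpeg) is left out.

from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")  # your AI.CC API key

# Hypothetical, pre-extracted frame URLs (e.g. one frame every 30 seconds).
frame_urls = [
    "https://example.com/frames/frame_000.jpg",
    "https://example.com/frames/frame_030.jpg",
    "https://example.com/frames/frame_060.jpg",
]

# Build one user message containing the question plus all sampled frames.
content = [{"type": "text", "text": "Summarize what happens across these video frames."}]
content += [{"type": "image_url", "image_url": {"url": url}} for url in frame_urls]

response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)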
💻 Code Sample
See the Node.js and Python examples at the top of this page for complete chat completion requests against this model, and the hedged multimodal sketches in the sections above for image and video-frame input patterns.
🆚 Comparison with Other Leading Models
Nemotron Nano 12B V2 VL vs. Qwen3 32B VL: Nemotron demonstrates superior performance in OCR and video benchmarks, making it optimally suited for real-time applications. Qwen3, on the other hand, prioritizes broader versatility across tasks.
Nemotron Nano 12B V2 VL vs. LLaVA-1.5: While LLaVA-1.5 is a competitive research model known for innovative multimodal instruction tuning, Nemotron Nano 12B V2 VL outperforms it in document intelligence, OCR, and extended video reasoning by incorporating dedicated vision encoders and efficient video sampling techniques.
Nemotron Nano 12B V2 VL vs. Eagle 2.5: Although Eagle 2.5 is strong in general visual question answering, Nemotron offers more specialized capabilities in chart reasoning, intricate document understanding, and comprehensive video comprehension.
Nemotron Nano 12B V2 VL vs. InternVL 14B V2: Nemotron's unique hybrid Mamba-Transformer backbone achieves significantly greater throughput on long-context tasks, positioning it as a more suitable choice for real-time AI agents processing dense visual and text data.
❓ Frequently Asked Questions (FAQ)
Q: What is Nemotron Nano 12B V2 VL?
A: It's NVIDIA's 12-billion-parameter open multimodal vision-language model, excelling in video understanding and document reasoning. Its core innovation is a hybrid Transformer-Mamba architecture that balances accuracy with memory efficiency for low-latency inference.
Q: How does it handle long documents and videos?
A: It supports an ultra-long context window of up to 128,000 tokens, combined with Efficient Video Sampling (EVS) and token reduction techniques to process lengthy content efficiently and cost-effectively.
Q: What are its key use cases?
A: Key applications include document intelligence, visual question answering (VQA), video analytics, data analysis and reporting, media asset management, and cross-lingual multimodal tasks.
Q: How does it perform on benchmarks?
A: Nemotron Nano 12B V2 VL achieves leading accuracy in OCRBench v2 for document understanding and an average multimodal reasoning score of ≈74 across benchmarks such as MMMU, MathVista, and DocVQA.