Nemotron Nano 12B V2 VL
Optimized for low-latency deployment, it excels in optical character recognition (OCR), chart reasoning, document comprehension, and long-form video analysis.
Node.js:

const { OpenAI } = require('openai');

const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '',
});

const main = async () => {
  const result = await api.chat.completions.create({
    model: 'nvidia/nemotron-nano-12b-v2-vl',
    messages: [
      {
        role: 'system',
        content: 'You are an AI assistant who knows everything.',
      },
      {
        role: 'user',
        content: 'Tell me, why is the sky blue?'
      }
    ],
  });

  const message = result.choices[0].message.content;
  console.log(`Assistant: ${message}`);
};

main();
                                
Python:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",
)

response = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant who knows everything.",
        },
        {
            "role": "user",
            "content": "Tell me, why is the sky blue?"
        },
    ],
)

message = response.choices[0].message.content

print(f"Assistant: {message}")
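Because the model is a vision-language model, a request can also carry image content alongside text. The sketch below builds an OpenAI-style multimodal message; the image URL is a placeholder, and support for the `image_url` content-part format on this endpoint is an assumption based on the OpenAI-compatible SDK usage shown above.

```python
def build_vision_messages(question: str, image_url: str) -> list:
    """Build an OpenAI-style chat payload mixing text and one image.

    The `image_url` content-part convention follows the OpenAI chat
    completions format; support on this endpoint is assumed, not confirmed.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_vision_messages(
    "What does this chart show?",
    "https://example.com/chart.png",  # placeholder image URL
)
print(messages[0]["content"][0]["text"])
```

These messages would be passed as the `messages` argument to `client.chat.completions.create`, exactly as in the text-only example above.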

One API 300+ AI Models

Save 20% on Costs & $1 Free Tokens
  • AI Playground

    Test all API models in the sandbox environment before you integrate.

    We provide more than 300 models to integrate into your app.

Nemotron Nano 12B V2 VL

Product Detail

Nemotron Nano 12B V2 VL is NVIDIA's open 12-billion-parameter multimodal vision-language model, engineered for video understanding, multi-image document reasoning, and natural language generation. Its hybrid Transformer-Mamba architecture balances the high accuracy of transformers with the memory-efficient sequence modeling of Mamba, enabling high throughput and low-latency inference on demanding inputs that combine extensive text and imagery, particularly long-form documents and videos.

🚀 Technical Specifications

  • Model Size: 12.6 billion parameters
  • Architecture: Hybrid Transformer-Mamba sequence model
  • Context Window: Ultra-long, supporting up to 128,000 tokens
  • Input Modalities: Text, multi-image documents, video frames
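As a rough deployment sanity check, the parameter count above translates directly into a weight footprint. The arithmetic below assumes 16-bit (2-byte) weights and ignores activation and KV-cache memory, so treat it as a lower bound, not a sizing guide.

```python
PARAMS = 12.6e9          # 12.6 billion parameters (from the spec above)
BYTES_PER_PARAM = 2      # fp16 / bf16 weights (assumption)

weight_bytes = PARAMS * BYTES_PER_PARAM
print(f"~{weight_bytes / 1e9:.1f} GB of weights at 16-bit precision")
```

Halving the precision to 8-bit roughly halves this figure; actual serving memory is higher once the 128K-token context's KV cache is accounted for.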

✨ Performance Benchmarks

  • OCRBench v2: Achieves leading accuracy in optical character recognition for superior document understanding tasks.
  • Multimodal Reasoning: Boasts an impressive average score of ≈74 across key benchmarks including MMMU, MathVista, AI2D, ChartQA, DocVQA, and Video-MME.
  • Video Comprehension: Enhanced by Efficient Video Sampling (EVS), enabling long-form video processing with significantly reduced inference costs.
  • Multilingual Accuracy: Delivers robust performance across diverse languages, ensuring strong visual question answering and precise document parsing globally.

💡 Key Features

  • Low Latency VL Inference: Optimized for exceptionally fast, high-throughput reasoning on combined text and image data.
  • Efficient Long-Context Processing: Capable of handling extensive videos and documents up to 128K tokens through innovative token reduction techniques.
  • Multi-Image & Video Understanding: Provides simultaneous analysis of multiple images and video frames for comprehensive scene interpretation and summarization.
  • High-Resolution & Wide Layout Support: Expertly processes tiled images and panoramic inputs, making it ideal for charts, forms, and complex visual documents.
  • Multimodal Querying: Supports advanced visual question answering, document data extraction, multi-step reasoning, and dense captioning across multiple languages.
  • Hybrid Transformer-Mamba Architecture: Skillfully balances the high accuracy of traditional transformers with the memory efficiency of Mamba, enhancing inference scalability.

💲 Nemotron Nano 12B V2 VL API Pricing

Input: $0.22155 / 1M tokens

Output: $0.66465 / 1M tokens
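A quick way to sanity-check spend is to apply the per-million-token rates above to a request's token counts (which OpenAI-compatible APIs typically report in `response.usage`):

```python
INPUT_RATE = 0.22155 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.66465 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 50K-token document summarized into a 1K-token answer:
print(f"${request_cost(50_000, 1_000):.4f}")  # → $0.0117
```

The example request lands around a cent, which illustrates why input-heavy workloads like long-document analysis are dominated by the input rate.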

🎯 Key Use Cases

  • Document Intelligence: Automate extraction and analysis of complex documents like invoices, contracts, receipts, and manuals with high precision.
  • Visual Question Answering (VQA): Query intricate images, charts, or video scenes to receive detailed and accurate answers.
  • Video Analytics: Perform comprehensive summarization, action detection, and scene understanding for long-form video content.
  • Data Analysis & Reporting: Automatically generate structured reports with high accuracy from diverse multimodal data inputs.
  • Media Asset Management: Enable dense captioning and comprehensive indexing for video content and extensive multimedia libraries.
  • Cross-Lingual Multimodal Tasks: Seamlessly handle diverse language inputs combined with images for broad global applications.

💻 Code Sample

The Node.js and Python samples near the top of this page show how to call the model through the OpenAI-compatible chat completions endpoint; substitute your own API key before running them.

🆚 Comparison with Other Leading Models

Nemotron Nano 12B V2 VL vs. Qwen3 32B VL: Nemotron demonstrates superior performance in OCR and video benchmarks, making it optimally suited for real-time applications. Qwen3, on the other hand, prioritizes broader versatility across tasks.

Nemotron Nano 12B V2 VL vs. LLaVA-1.5: While LLaVA-1.5 is a competitive research model known for innovative multimodal instruction tuning, Nemotron Nano 12B V2 VL outperforms it in document intelligence, OCR, and extended video reasoning by incorporating dedicated vision encoders and efficient video sampling techniques.

Nemotron Nano 12B V2 VL vs. Eagle 2.5: Although Eagle 2.5 is strong in general visual question answering, Nemotron offers more specialized capabilities in chart reasoning, intricate document understanding, and comprehensive video comprehension.

Nemotron Nano 12B V2 VL vs. InternVL 14B V2: Nemotron's unique hybrid Mamba-Transformer backbone achieves significantly greater throughput on long-context tasks, positioning it as a more suitable choice for real-time AI agents processing dense visual and text data.

❓ Frequently Asked Questions (FAQ)

Q: What is Nemotron Nano 12B V2 VL and its core innovation?

A: It's NVIDIA's 12-billion-parameter open multimodal vision-language model, excelling in video understanding and document reasoning. Its core innovation is a hybrid Transformer-Mamba architecture that balances accuracy with memory efficiency for low-latency inference.

Q: How does Nemotron Nano 12B V2 VL handle long documents and videos?

A: It supports an ultra-long context window of up to 128,000 tokens, combined with Efficient Video Sampling (EVS) and innovative token reduction techniques to process lengthy content efficiently and cost-effectively.

Q: What are the primary use cases for this model?

A: Key applications include document intelligence, visual question answering (VQA), video analytics, data analysis & reporting, media asset management, and cross-lingual multimodal tasks.

Q: How does its performance compare in OCR and multimodal reasoning?

A: Nemotron Nano 12B V2 VL achieves leading accuracy in OCRBench v2 for document understanding and an average multimodal reasoning score of ≈74 across various benchmarks like MMMU, MathVista, and DocVQA.

Learn how you can transform your company with AICC APIs

Discover how to revolutionize your business with the AICC API! Unlock powerful tools to automate processes, enhance decision-making, and personalize customer experiences.
Contact sales
