Available now on ModelsLab · Language Model

NVIDIA: Nemotron Nano 12B 2 VL (free)

Document Intelligence. Video Understanding.

Efficient Multimodal Reasoning. Production Ready.

Hybrid Architecture

Mamba-Transformer Efficiency

35% higher throughput than the prior generation, with memory-efficient sequence modeling.

Document Processing

OCR and Chart Reasoning

Leading OCRBench v2 performance. Handles invoices, receipts, charts, and multi-page documents.

Video Capability

Long-Form Video Sampling

Efficient video sampling reduces inference cost while maintaining comprehension accuracy.

Examples

See what NVIDIA: Nemotron Nano 12B 2 VL (free) can create

Copy any prompt below and try it yourself in the playground.

Invoice Analysis

Extract line items, totals, and vendor information from this invoice image. Provide structured output with dates, amounts, and payment terms.

Chart Interpretation

Analyze this quarterly revenue chart. Describe trends, identify peak periods, and summarize year-over-year growth patterns.

Document Summarization

Summarize this multi-page technical manual in 3-4 key points, highlighting installation steps and safety warnings.

Video Comprehension

Watch this product demonstration video and describe the main features shown, key use cases, and any technical specifications mentioned.

For Developers

A few lines of code.
Vision and text. Twelve billion parameters.

ModelsLab handles the infrastructure: fast inference, auto-scaling, and a developer-friendly API. No GPU management needed.

  • Serverless: scales to zero, scales to millions
  • Pay per token, no minimums
  • Python and JavaScript SDKs, plus REST API
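
import requests

# Send a chat completion request to the ModelsLab LLM endpoint.
response = requests.post(
    "https://modelslab.com/api/v7/llm/chat/completions",
    json={
        "key": "YOUR_API_KEY",  # your ModelsLab API key
        "prompt": "",           # the text prompt to send
        "model_id": "",         # the model identifier (left blank in the source)
    },
    timeout=60,
)
response.raise_for_status()  # raise on HTTP errors before parsing the body
print(response.json())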
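
The request above could be wrapped for reuse along the following lines. This is a minimal sketch, not an official client: the field names `key`, `prompt`, and `model_id` and the endpoint URL come from the snippet above, while the helper names `build_payload` and `send`, the input validation, and the 60-second timeout are assumptions made for illustration. It uses only the Python standard library so it runs without extra dependencies.

```python
import json
import urllib.request

# Endpoint taken from the snippet above.
API_URL = "https://modelslab.com/api/v7/llm/chat/completions"


def build_payload(api_key: str, prompt: str, model_id: str) -> dict:
    """Assemble the JSON body from the three fields shown in the snippet.

    The empty-value checks are an assumption added for illustration;
    the source does not document server-side validation.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not model_id:
        raise ValueError("model_id must not be empty")
    return {"key": api_key, "prompt": prompt, "model_id": model_id}


def send(payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response.

    urllib raises HTTPError on 4xx/5xx responses, so failures surface
    as exceptions rather than as silently parsed error bodies.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping payload construction separate from the network call makes the request body easy to inspect or unit-test before any tokens are spent.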

FAQ

Common questions about NVIDIA: Nemotron Nano 12B 2 VL (free)

Read the docs

It's a 12-billion-parameter open multimodal model designed for document intelligence and video understanding. The hybrid Transformer-Mamba architecture delivers 35% higher throughput than prior generations while maintaining leading accuracy on OCR and reasoning benchmarks.

Process invoices, receipts, and manuals; perform visual question answering; summarize documents and videos; extract text from images; analyze charts and diagrams. It handles up to four 1k×2k resolution images plus long text prompts.
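
For documents longer than four pages, one possible approach is to split the page images into groups of at most four. The sketch below assumes the four-image figure stated above applies per request, which the source does not spell out; `batch_pages` and the file names are hypothetical.

```python
from typing import Iterator, List, Sequence

# Stated limit from the text above; treating it as a per-request cap is an assumption.
MAX_IMAGES_PER_REQUEST = 4


def batch_pages(
    pages: Sequence[str], size: int = MAX_IMAGES_PER_REQUEST
) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` page references, in order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield list(pages[start:start + size])


# Hypothetical ten-page manual, one image file per page.
pages = [f"manual_page_{i:02d}.png" for i in range(1, 11)]
for group in batch_pages(pages):
    print(len(group), group[0], "to", group[-1])
```

Each group can then be sent with its own prompt, and the partial summaries combined in a final text-only request.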

Input tokens cost $0.20/1M and output tokens cost $0.60/1M, making it one of the most cost-effective vision-language models. It's ideal for high-volume document processing and video analysis applications.
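
As a rough illustration of the pricing above, the arithmetic can be sketched as a small estimator. The two rates come from the text; the function name `estimate_cost_usd` and the token counts in the example are made up for illustration, and actual billing may differ.

```python
# Rates quoted above, in US dollars per one million tokens.
INPUT_USD_PER_MILLION = 0.20
OUTPUT_USD_PER_MILLION = 0.60


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a workload from its input and output token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


# Hypothetical workload: 1,000,000 input tokens and 500,000 output tokens.
# That is $0.20 for input plus $0.30 for output, about $0.50 in total.
print(f"${estimate_cost_usd(1_000_000, 500_000):.2f}")
```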

The 12B v2 VL offers 35% higher throughput than the prior generation in long-document scenarios, along with improved accuracy across vision and reasoning benchmarks. It uses an enhanced hybrid architecture built on Nemotron Nano v2 and the RADIOv2.5 vision encoder.

Yes, it's marked as ready for commercial use. It's optimized for NVIDIA GPU-accelerated systems and supports vLLM and TRT-LLM runtime engines across multiple hardware microarchitectures.

Nemotron Nano 12B v2 VL supports English, German, Spanish, French, Italian, and Japanese, making it suitable for multilingual document and video processing workflows.

Ready to create?

Start generating with NVIDIA: Nemotron Nano 12B 2 VL (free) on ModelsLab.