Parakeet.cpp vs Whisper: Which Self-Hosted ASR Is Right for Your App in 2026?

Adhik Joshi | 7 min read | Audio Generation


A new C++ implementation of NVIDIA's Parakeet ASR models hit Hacker News this week and the benchmarks are hard to ignore: 96x faster than CPU inference, no Python runtime, no ONNX dependency, and it runs natively on Apple Silicon GPU via Metal. If you've been using Whisper for speech-to-text in your app, it's worth understanding what Parakeet.cpp changes — and where Whisper still makes sense.

What Is Parakeet.cpp?

Parakeet.cpp is a pure C++ inference engine for NVIDIA's Parakeet family of ASR models. It's built on Axiom — a lightweight tensor library with automatic Metal GPU acceleration — and requires no Python, no ONNX runtime, and no heavyweight ML framework. Just C++ and one tensor library.

The headline number: ~27ms encoder inference on Apple Silicon GPU for 10 seconds of audio using the 110M model. That's 96x faster than the same model running on CPU. For streaming applications, the Nemotron variant supports configurable latency from 80ms to 1,120ms — real-time transcription at the low end.

The library supports five NVIDIA Parakeet model variants:

  • tdt-ctc-110m — 110M param, offline, English only, dual CTC/TDT decoder heads
  • tdt-600m (V2/V3) — 600M param, offline, multilingual (V3 adds 25 European languages), word-level timestamps
  • eou-120m — 120M param, streaming, English, RNNT with end-of-utterance detection
  • nemotron-600m — 600M param, streaming, multilingual, configurable latency (80–1120ms)
  • sortformer-117m — speaker diarization, up to 4 speakers

All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.

What Is Whisper?

OpenAI's Whisper has been the de facto standard for open-source ASR since 2022. It's accurate, multilingual (99 languages), and handles messy real-world audio reasonably well. The tradeoffs are well-known:

  • Python-first — requires PyTorch, which means large container images and GPU memory overhead
  • Offline only — not designed for streaming or real-time transcription
  • Slow on CPU, especially the larger models (large-v3 is 1.55B params)
  • No native speaker diarization

The community has shipped workarounds — faster-whisper (CTranslate2 backend), whisper.cpp (llama.cpp author's C++ port), and various quantized variants. These close the performance gap, but they're still targeting Whisper's architecture, which wasn't designed with production throughput in mind.

Head-to-Head Comparison

| Feature | Parakeet.cpp | Whisper / faster-whisper |
| --- | --- | --- |
| Language | C++ (no Python) | Python (+ C++ port available) |
| Inference speed | ~27ms / 10s audio (GPU) | ~150–500ms / 10s audio (GPU) |
| Streaming | Yes (eou-120m, nemotron-600m) | No (offline only) |
| Speaker diarization | Yes (sortformer, 4 speakers) | No (third-party add-on) |
| Word timestamps | Yes (TDT decoder) | Yes |
| Multilingual | Yes (V3: 25 European langs) | Yes (99 languages) |
| Apple Silicon | Native Metal acceleration | Via CoreML / MPS |
| Container footprint | Small (no Python runtime) | Large (PyTorch + dependencies) |
| License | CC-BY-4.0 (commercial OK) | MIT (commercial OK) |

Using Parakeet.cpp: Code Examples

The API is minimal. Basic transcription in C++:

#include <parakeet/parakeet.hpp>

// Offline transcription (110M model)
parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu(); // Metal acceleration on Apple Silicon

auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;

Choose your decoder at call time — CTC for speed, TDT for accuracy:

// Fast greedy decode
auto result = t.transcribe("audio.wav", parakeet::Decoder::CTC);

// Better accuracy (default)
auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT);

Word-level timestamps:

auto result = t.transcribe("audio.wav", parakeet::Decoder::TDT, /*timestamps=*/true);
for (const auto& w : result.word_timestamps) {
    std::cout << "[" << w.start << "s - " << w.end << "s] " << w.word << "\n";
}

Streaming transcription with the Nemotron model (configurable latency from 80ms):

// latency_frames: 0=80ms, 1=160ms, 6=560ms, 13=1120ms
auto cfg = parakeet::make_nemotron_600m_config(/*latency_frames=*/0);
parakeet::NemotronTranscriber t("model.safetensors", "vocab.txt", cfg);

while (auto chunk = get_audio_chunk()) {
    auto text = t.transcribe_chunk(chunk);
    if (!text.empty()) std::cout << text << std::flush;
}

Speaker diarization (who said what):

parakeet::Sortformer model(parakeet::make_sortformer_117m_config());
model.load_state_dict(axiom::io::safetensors::load("sortformer.safetensors"));

auto wav = parakeet::read_wav("meeting.wav");
auto features = parakeet::preprocess_audio(wav.samples, {.normalize = false});
auto segments = model.diarize(features);

for (const auto& seg : segments) {
    std::cout << "Speaker " << seg.speaker_id 
              << " [" << seg.start << "s-" << seg.end << "s]: " 
              << seg.text << "\n";
}

When to Use Parakeet.cpp

Parakeet.cpp is the right choice when:

  • You're building a native macOS/iOS app — Metal acceleration means you get GPU-level speed without any Python overhead. Great for desktop productivity tools, meeting recorders, or voice interfaces.
  • You need real-time/streaming transcription — Nemotron at 80ms latency is competitive with commercial real-time APIs, and eou-120m handles English with end-of-utterance detection for voice command applications.
  • You need speaker diarization built-in — Sortformer handles up to 4 speakers without tacking on pyannote or similar.
  • You're targeting edge/embedded deployment — No Python runtime means a significantly smaller deployment footprint.
  • Throughput is the bottleneck — At 96x CPU speed, you can process large audio archives much faster than any Python-based pipeline.

When Whisper Still Makes Sense

Parakeet.cpp isn't a universal replacement:

  • Non-European language support — Whisper handles 99 languages. Parakeet V3 covers 25 European languages; for Thai, Arabic, Japanese, etc., Whisper still wins.
  • Noisy/accented audio from unknown sources — Whisper's training data diversity is its strength. For podcast transcription or user-generated content with variable audio quality, Whisper large-v3 tends to be more robust.
  • Python-native stack — If your entire pipeline is Python and you're already paying for a GPU, faster-whisper with CTranslate2 is easier to integrate.
  • Existing Whisper tooling — If you're using diarization add-ons, PromptWhisper, or other ecosystem tools that assume Whisper's output format, migration has a cost.

Self-Hosted vs. Speech Recognition API

Both Parakeet.cpp and Whisper require you to manage model weights, GPU provisioning, and infrastructure. For many production applications, that's engineering time that doesn't scale:

  • GPU instances run 24/7 even when traffic is zero
  • You handle model updates, driver compatibility, and burst scaling
  • Cold-start latency hits hard if you're spinning up containers on demand

ModelsLab's Speech Recognition API gives you production ASR without the infra overhead. Send an audio URL, get back a transcript — no GPU required, pay per second of audio processed, burst to any scale instantly.

import requests

response = requests.post(
    "https://modelslab.com/api/v6/voice/speech_to_text",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY"
    },
    json={
        "audio_url": "https://example.com/meeting.wav",
        "language": "en"
    }
)

data = response.json()
print(data["text"])  # full transcript

Use self-hosted Parakeet.cpp when you need sub-30ms latency for real-time use cases, edge deployment, or when data privacy requires on-premise processing. Use a managed API when you want production-ready transcription without the infra overhead.

Bottom Line

Parakeet.cpp is one of the most technically interesting ASR releases in a while — pure C++, Metal acceleration, 96x CPU speedup, streaming support, and speaker diarization built-in. If you're building a native app or need real-time transcription with minimal footprint, it's worth benchmarking against your current setup.

For production applications where you need reliable, scalable transcription without managing GPU infrastructure, an API like ModelsLab still makes more sense. But parakeet.cpp is worth watching: it's the kind of project that tends to define the self-hosted standard for the next few years.

Try the parakeet.cpp repo on GitHub and compare benchmarks on your target hardware. The 600M multilingual V3 model is the one to test for accuracy; the 110M is your latency baseline.
