1. Accuracy & WER comparisons
Commenters repeatedly compare Voxtral's word error rate (WER) with that of GPT‑4o Mini Transcribe and Whisper, and question whether the published numbers are comparable.
- “The linked article claims the average word error rate for Voxtral mini v2 is lower than GPT‑4o mini transcribe.”
- “The thing that makes it particularly misleading is that models that do transcription to lowercase and then use inverse text normalization… end up making a very different class of mistakes than Whisper.”
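The normalization point in the second quote can be illustrated with a small WER calculation. This is a minimal sketch, not the scoring method used by any benchmark mentioned in the thread: `wer`, `word_edit_distance`, and `simple_normalize` are hypothetical helper names, and the normalizer only lowercases and strips punctuation (it does not attempt inverse text normalization such as turning "twenty dollars" into "$20").

```python
import re


def simple_normalize(text):
    """Toy normalizer (an assumption): lowercase and drop punctuation."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)  # keep word chars, spaces, apostrophes
    return " ".join(text.split())


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, normalize=False):
    """Word error rate: word-level edit distance divided by reference length."""
    if normalize:
        reference = simple_normalize(reference)
        hypothesis = simple_normalize(hypothesis)
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "Hello, World. It costs twenty dollars."
hypothesis = "hello world it costs twenty dollars"

print(f"raw WER:        {wer(reference, hypothesis):.3f}")
print(f"normalized WER: {wer(reference, hypothesis, normalize=True):.3f}")
```

On this made-up pair, the unnormalized score counts every casing or punctuation difference as a word error, while the normalized score counts none. The same transcript can therefore look much better or worse depending on the scoring pipeline, which is why cross-model WER averages are hard to compare.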
2. Multilingual coverage & language‑switching quirks
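The 13‑language claim draws both praise and criticism, with complaints focusing on mixed‑language audio and on languages outside the supported list (Polish, for example, being detected as Russian).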
- “The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.”
- “I tried English + Polish… it thinks you speak Russian.”
3. Diarization limitations
Diarization is available only on Voxtral Mini Transcribe V2; the Voxtral Realtime model (Voxtral Mini 4B) does not support it.
- “The diarization is on Voxtral Mini Transcribe V2, not Voxtral Mini 4B.”
- “The Voxtral Realtime model doesn’t support diarization.”
4. Deployment, cost & latency
Open weights, local inference, and API pricing are the main discussion points.
- “The 4 GB model can run locally with vLLM.”
- “The API is $0.003/min.”
- “With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.”
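To put the quoted API price in perspective, here is a back-of-the-envelope cost estimate. It is a sketch under stated assumptions: the $0.003/min figure comes from the comment above, and `transcription_cost_usd` is a hypothetical helper that assumes strictly linear billing with no rounding, minimum charge, or volume discount.

```python
PRICE_PER_MINUTE_USD = 0.003  # figure quoted in the thread, not verified here


def transcription_cost_usd(audio_seconds, price_per_minute=PRICE_PER_MINUTE_USD):
    """Estimate API cost, assuming linear per-minute billing (an assumption)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 60 * price_per_minute


one_hour = 60 * 60  # seconds
print(f"1 hour:      ${transcription_cost_usd(one_hour):.2f}")
print(f"1,000 hours: ${transcription_cost_usd(1000 * one_hour):.2f}")
```

At that rate an hour of audio costs about $0.18 and a thousand hours about $180, which is the baseline any local deployment would need to beat once hardware and operations are counted.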
These four themes (accuracy, multilingualism, diarization, and deployment economics) drive the bulk of the conversation.