NVIDIA has released Nemotron OCR v2, a multilingual optical character recognition system that runs at 34.7 pages per second on a single A100 GPU. The model was trained on 12 million synthetic images across six languages including Japanese, Korean, Russian, and Chinese. Normalized Edit Distance (NED) scores for non-English languages, where lower is better, fell from between 0.56 and 0.92 in the previous model to between 0.035 and 0.069 in v2.
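To make the NED figures concrete, here is a minimal sketch of how such a score can be computed. It assumes the common convention of dividing the Levenshtein distance by the length of the longer string; the article does not say which normalization NVIDIA uses, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical.

    Assumption: normalize by the longer string's length.
    """
    if not prediction and not reference:
        return 0.0
    return levenshtein(prediction, reference) / max(len(prediction), len(reference))
```

Under this convention, a score of 0.035 corresponds to roughly 3 to 4 character errors per 100 characters of the longer string, while 0.92 means almost nothing was recognized correctly.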
The breakthrough came from synthetic data generation rather than architectural changes. NVIDIA built a pipeline that programmatically renders text onto images with precise bounding boxes, transcriptions, and reading order information at word, line, and paragraph levels. This approach avoids the cost and delay of manual annotation and sidesteps the noise in web-scraped PDFs.
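The reason rendered data yields exact labels is that the renderer itself decides where every glyph goes. The toy layout below illustrates the idea with fixed-width glyphs and greedy word wrapping; it is not NVIDIA's pipeline, and `layout_paragraph`, `char_w`, and `line_h` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    x0: int
    y0: int
    x1: int
    y1: int


def union(boxes: List[Box]) -> Box:
    """Smallest axis-aligned box that contains all the given boxes."""
    return Box(
        min(b.x0 for b in boxes), min(b.y0 for b in boxes),
        max(b.x1 for b in boxes), max(b.y1 for b in boxes),
    )


def layout_paragraph(text: str, page_width: int, char_w: int = 10, line_h: int = 20) -> dict:
    """Place words left to right, wrapping at page_width, and record every box.

    Assumption: each character is char_w pixels wide and each line is line_h tall.
    A real renderer would query font metrics instead.
    """
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")

    lines, current, x, y = [], [], 0, 0
    for word in words:
        width = len(word) * char_w
        if current and x + width > page_width:
            lines.append(current)        # close the line and wrap
            current, x, y = [], 0, y + line_h
        current.append({"text": word, "box": Box(x, y, x + width, y + line_h)})
        x += width + char_w              # advance past the word plus one space
    lines.append(current)

    line_annotations = [
        {
            "text": " ".join(w["text"] for w in line),
            "box": union([w["box"] for w in line]),
            "words": line,
        }
        for line in lines
    ]
    return {
        "text": " ".join(l["text"] for l in line_annotations),
        "box": union([l["box"] for l in line_annotations]),
        "lines": line_annotations,
    }
```

Because word positions are chosen by the code, the line and paragraph boxes follow by taking unions, and the transcription at each level is exact by construction.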
Key components include mOSCAR, a multilingual web corpus covering 163 language subsets, and a modified version of the SynthDoG rendering engine. The pipeline generates multi-level annotations: axis-aligned boxes and 4-point quads for words, lines, and paragraphs, plus a relation graph for reading order. Fonts and backgrounds are randomized to simulate real documents.
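One way to picture these annotations is as regions that carry a 4-point quad, from which the axis-aligned box is derived, plus a list of "read before" edges between regions. The sketch below is an assumed representation for illustration only; the field names and edge format are hypothetical and do not reflect the published dataset schema.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Region:
    region_id: str
    level: str          # "word", "line", or "paragraph"
    text: str
    quad: List[Point]   # four corner points; handles rotated or skewed text

    def aabb(self) -> Tuple[float, float, float, float]:
        """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the quad."""
        xs = [x for x, _ in self.quad]
        ys = [y for _, y in self.quad]
        return (min(xs), min(ys), max(xs), max(ys))


def reading_order(regions: List[Region], edges: List[Tuple[str, str]]) -> List[str]:
    """Linearize a relation graph with Kahn's topological sort.

    Assumption: each edge (a, b) means region a is read before region b.
    """
    ids = [r.region_id for r in regions]
    successors: Dict[str, List[str]] = {i: [] for i in ids}
    indegree: Dict[str, int] = {i: 0 for i in ids}
    for before, after in edges:
        if before not in successors or after not in successors:
            raise ValueError(f"edge references unknown region: {(before, after)}")
        successors[before].append(after)
        indegree[after] += 1

    ready = deque(i for i in ids if indegree[i] == 0)
    order: List[str] = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(ids):
        raise ValueError("reading-order graph contains a cycle")
    return order
```

Storing reading order as a graph rather than a flat index leaves room for layouts such as multi-column pages, where several regions may have no defined order relative to one another.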
The synthetic dataset and the model are publicly available on Hugging Face as nvidia/OCR-Synthetic-Multilingual-v1 and nvidia/nemotron-ocr-v2, respectively. A browser demo allows direct testing of the model without installation.
Source: huggingface.co