Sihari reordering
The dependent vowel ਿ is written before its consonant but must be encoded after it. gurmukhifix moves it back.
OCR mangles connected scripts: sihari lands before its consonant, nuktas drift, diacritics scatter. gurmukhifix repairs OCR output from Tesseract, Surya, Gemini or any other engine into well-formed Unicode. It targets Gurmukhi and other Indic scripts first (Urdu & Farsi are experimental), and it never corrupts text that was already correct, including Gurbani.
ਿਸੱਖ ਧਰਮ → ਸਿੱਖ ਧਰਮ
Sihari ਿ reordered after its base consonant: the #1 systematic Gurmukhi OCR error.
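The before/after pair above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not gurmukhifix's actual implementation: the function name reorder_sihari and the consonant ranges are assumptions made for illustration, and vowel carriers are ignored. A sihari is moved only when nothing valid precedes it, so text that is already in encoding order is left alone.

```python
import re

SIHARI = "\u0A3F"   # GURMUKHI VOWEL SIGN I
NUKTA = "\u0A3C"    # GURMUKHI SIGN NUKTA
# Gurmukhi consonant letters (simplified ranges, an assumption of this sketch).
CONSONANT = "\u0A15-\u0A39\u0A59-\u0A5E"

# A sihari that does NOT follow a consonant (or consonant + nukta) and is
# immediately followed by a consonant, optionally carrying a nukta.
_LEADING_SIHARI = re.compile(
    f"(?<![{CONSONANT}{NUKTA}]){SIHARI}([{CONSONANT}]{NUKTA}?)"
)


def reorder_sihari(text: str) -> str:
    """Move a visually-ordered sihari to after its base consonant."""
    return _LEADING_SIHARI.sub(lambda m: m.group(1) + SIHARI, text)


if __name__ == "__main__":
    raw = "\u0A3F\u0A38\u0A71\u0A16 \u0A27\u0A30\u0A2E"  # sihari typed first
    print(reorder_sihari(raw))
```

The negative lookbehind is what keeps correct text intact: in a word such as ਕਿਸ the sihari already follows ਕ, so the pattern does not match and nothing moves.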
Paste raw OCR text and watch it become clean, well-formed Unicode. Everything runs in your browser.
This is a lightweight in-browser preview covering Gurmukhi, Punjabi, Hindi and Devanagari. The installable package is authoritative: it adds Gurbani dictionary-gating, a verbatim-scripture lock, and Urdu/Farsi. Install it with pip install gurmukhifix.
Drop in an image or a PDF. Digital PDFs are read straight from their text layer; scans and photos are OCR'd in your browser with Tesseract.js (multi-page, up to 15), then cleaned by gurmukhifix. Everything runs locally — nothing is uploaded. Handwriting accuracy is limited; this demos the pipeline, not production OCR.
gurmukhifix is a post-processor, not an OCR engine. Tesseract turns the image into characters; gurmukhifix applies the linguistic rules Tesseract can't.
Run any engine — Tesseract (TSV/hOCR), Surya, Gemini, Google Vision. gurmukhifix reads them all.
≥85% passes through, <60% is flagged, the middle band is corrected.
A fix is applied only if it lowers script-validity badness — correct text is never changed.
Corrected text, a per-fix report and preserved layout metadata.
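The triage step can be sketched as follows. This is a hedged illustration, not the package's code: it assumes the percentages are per-word OCR confidence as reported in the conf column of Tesseract's TSV output, and the names triage and triage_tsv are hypothetical.

```python
import csv
import io


def triage(confidence: float, passthrough: float = 85.0,
           flag_below: float = 60.0) -> str:
    """Sort one word into a band by its confidence (assumed 0-100 scale)."""
    if confidence >= passthrough:
        return "pass"      # trusted, left untouched
    if confidence < flag_below:
        return "flag"      # too unreliable to fix automatically
    return "correct"       # middle band: candidate for rule-based fixes


def triage_tsv(tsv_text: str) -> list[tuple[str, str]]:
    """Read Tesseract TSV text and return (word, band) pairs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    result = []
    for row in reader:
        word = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Page, block and line rows carry conf -1 and no text; skip them.
        if not word or conf < 0:
            continue
        result.append((word, triage(conf)))
    return result
```

Keeping the thresholds as parameters makes it easy to tune the bands per engine, since confidence scales differ between OCR systems.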
The dependent vowel ਿ is written before its consonant but must be encoded after it. gurmukhifix moves it back.
A nukta after a vowel sign (ਸਾ਼) is reordered to the canonical consonant+nukta+vowel (ਸ਼ਾ).
Corrections require validity evidence. Already-correct Unicode round-trips byte-for-byte — enforced by CI.
Orphaned matras, impossible sequences and out-of-script code-points are surfaced with severity.
Parallel batch processing and a SQLite store that promotes repeatedly-confirmed corrections.
Bounding boxes flow through end-to-end so downstream tools can rebuild the page.
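The nukta rule and the "only if it lowers badness" gate can be illustrated together. The sketch below is an assumption-laden toy, not the library's scoring: badness here simply counts nuktas and vowel signs that do not sit on a valid base, and the names badness, reorder_nukta and gated_fix are hypothetical.

```python
import re

NUKTA = "\u0A3C"
CONSONANT = "\u0A15-\u0A39\u0A59-\u0A5E"            # simplified ranges
VOWEL_SIGN = "\u0A3E-\u0A42\u0A47\u0A48\u0A4B\u0A4C"  # dependent vowel signs

# consonant + vowel sign + nukta (the misordered sequence)
_NUKTA_AFTER_VOWEL = re.compile(f"([{CONSONANT}])([{VOWEL_SIGN}]){NUKTA}")
# a nukta that does not directly follow a consonant
_STRAY_NUKTA = re.compile(f"(?<![{CONSONANT}]){NUKTA}")
# a vowel sign that follows neither a consonant nor a nukta
_STRAY_VOWEL = re.compile(f"(?<![{CONSONANT}{NUKTA}])[{VOWEL_SIGN}]")


def badness(text: str) -> int:
    """Toy validity score: number of marks without a valid base."""
    return len(_STRAY_NUKTA.findall(text)) + len(_STRAY_VOWEL.findall(text))


def reorder_nukta(text: str) -> str:
    """Rewrite consonant+vowel+nukta as consonant+nukta+vowel."""
    return _NUKTA_AFTER_VOWEL.sub(
        lambda m: m.group(1) + NUKTA + m.group(2), text
    )


def gated_fix(text: str) -> str:
    """Apply the candidate fix only when it strictly lowers badness."""
    candidate = reorder_nukta(text)
    return candidate if badness(candidate) < badness(text) else text
```

Because the gate demands a strict improvement, input with a badness of zero can never be rewritten, which is the property behind the byte-for-byte round-trip guarantee described above.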
Gurbani was written larivaar, as one unbroken stream of letters, and was later printed pad-chhed, with a space between each word. Deciding where words begin is exactly what OCR gets wrong, and it is what gurmukhifix reasons about, gated against a verbatim Gurbani lexicon so that a real scripture word is never split or rewritten.


One shared engine, per-script rules via extends. Gurmukhi, Punjabi, Hindi and Devanagari run in the demo above; Urdu & Farsi ship in the package as experimental (structural-only). Click any script for a plain-English deep-dive.
Gurmukhi (in the demo): The script of Sikh scripture and one of the writing systems for Punjabi.
Punjabi ਪੰ (in the demo): Punjabi written in the Gurmukhi script; it builds on every Gurmukhi rule.
Hindi हि (in the demo): Hindi written in the Devanagari script.
Devanagari दे (in the demo): The shared base script behind Hindi, Marathi, Nepali and Sanskrit.
Urdu اُ (experimental): Urdu in the Nasta'liq style, a connected, right-to-left script.
Farsi فا (experimental): Persian (Farsi), written in Arabic script with Persian-specific letters.
On PyPI, MIT-licensed and free for anyone. gurmukhifix reads output from any OCR engine: Tesseract, Surya, Gemini or Google Vision.
pip install gurmukhifix
tesseract page.png out --oem 1 --psm 6 tsv
gurmukhifix correct --input out.tsv \
    --lang gurmukhi --output ./results
gurmukhifix batch --input-dir ./pages \
    --lang devanagari --workers 4