[tesseract-ocr] Guidance for optimising model.

Thomas Mon, 19 Jan 2026 07:31:11 -0800

Hi,

Thank you for reading. I'm new here and am having a hard time getting my head
around how the training works.
I read https://github.com/tesseract-ocr/tesstrain, but I can't figure out
what the best approach is.

My situation is like this:
I have about 22,000 pages (PDF / TIFF images), all with the same font and
similar content.
They contain English plus IAST-transliterated Sanskrit or Bengali.

There is an IAST.traineddata model that works quite well but makes some
mistakes, e.g. the dot below a ḍ or above a ṁ is sometimes missing.
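
To make this kind of mistake measurable, here is a minimal Python sketch that
compares one OCR line with its hand-corrected version and lists the positions
where only a diacritic was lost. It is not part of Tesseract; the function
names are made up for illustration, and it assumes the two lines have the same
number of characters (marks dropped, no letters inserted or deleted).

```python
from __future__ import annotations

import unicodedata


def strip_marks(text: str) -> str:
    """Remove combining marks (dot below, dot above, macron, ...) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lost_diacritics(ocr: str, corrected: str) -> list[tuple[int, str, str]]:
    """Return (index, ocr_char, corrected_char) where only a diacritic differs.

    Assumption: both lines have equal length after NFC normalization.
    """
    ocr_nfc = unicodedata.normalize("NFC", ocr)
    cor_nfc = unicodedata.normalize("NFC", corrected)
    if len(ocr_nfc) != len(cor_nfc):
        raise ValueError("lengths differ; align the lines first")
    return [
        (i, o, c)
        for i, (o, c) in enumerate(zip(ocr_nfc, cor_nfc))
        # Same base letter, different character: a mark was lost or changed.
        if o != c and strip_marks(o) == strip_marks(c)
    ]


if __name__ == "__main__":
    print(lost_diacritics("paṇdita", "paṇḍita"))
```

Counting these hits over a few corrected pages would show which characters
the existing model gets wrong most often.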

I want to optimise this model to work as accurately as possible on my data
set. I don't care if it can no longer handle other fonts.

I was thinking that I could run my existing model on some pages, correct the
output and feed it back somehow, but I can't figure out how. The information
I find online mixes instructions for different versions (3, 4, 5).
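
As a rough illustration of the "correct the output and feed it back" idea,
the sketch below pairs single-line images with corrected transcriptions in
the layout the tesstrain README describes: each line image next to a
`.gt.txt` file with the same base name. The function name, the `corrections`
dictionary and the choice of NFC normalization are assumptions for
illustration; the normalization should match whatever the existing model
uses.

```python
from pathlib import Path
import shutil
import unicodedata


def write_ground_truth(line_images, corrections, out_dir):
    """Copy line images to out_dir and write a matching .gt.txt for each.

    line_images: iterable of paths to single-line images (.png or .tif)
    corrections: dict mapping image stem -> corrected transcription
    Images without a correction are skipped. Returns the written .gt.txt paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image in map(Path, line_images):
        text = corrections.get(image.stem)
        if not text:
            continue  # no hand-corrected text for this line yet
        shutil.copy2(image, out / image.name)
        gt_path = out / f"{image.stem}.gt.txt"
        # NFC keeps characters such as ḍ and ṁ as single code points (assumption).
        normalized = unicodedata.normalize("NFC", text.strip())
        gt_path.write_text(normalized + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

If I read the tesstrain README correctly, the output directory would be
`data/MODEL_NAME-ground-truth`, and fine-tuning from an existing model is then
started with something like `make training MODEL_NAME=... START_MODEL=...
TESSDATA=...`; the exact values depend on the local setup.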

If there is a clear step-by-step, command-by-command guide, that would be
very useful.

Any assistance will be greatly appreciated. If it turns out to be difficult,
I might be able to collect some donations as a reward for someone who does
the training for me.

References:

Data set: all PDFs found here that start with a copyright notice from ISKCON
MEDIA VEDIC LIBRARY (please don't worry about the copyright, it's my late
friend's work that I am trying to preserve):
https://vedicilluminations.com/spiritual-library/Acharya%20Books/

Existing model:
https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST

Probably related langdata:
https://github.com/tesseract-ocr/langdata/tree/main/iast

