Hi,

Thank you for reading. I'm new here and am having a hard time getting my 
head around how the training works.
I read https://github.com/tesseract-ocr/tesstrain , but I can't figure out 
what the best approach is.

My situation is like this:
I have about 22,000 pages (PDF / TIFF images), all with the same font and 
similar content.
It contains English + IAST transliterated Sanskrit or Bengali.

There is an IAST.traineddata model that works quite well but makes some 
mistakes, e.g. the dot below a ḍ or above a ṁ is sometimes missing.

I want to optimise this model to work as well as possible for my data 
set; I don't care that it won't be able to handle other fonts any more.

I was thinking that I could run my existing model on some pages, correct 
the output and feed it back somehow, but I can't figure out how. All the 
info I find online mixes versions (3, 4, 5).
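
To make the idea concrete, here is a rough sketch of the "correct the output 
and feed it back" step as I understand it from the tesstrain README: each 
line image (.tif) gets a single-line transcription with the same base name 
and the extension .gt.txt, placed in data/MODEL_NAME-ground-truth. The 
function names, the file names and the assumption that I already have 
cropped line images are mine, not something from the README, so please 
correct me if this is the wrong direction.

```python
from pathlib import Path


def write_ground_truth(gt_dir, corrections):
    """Write one <stem>.gt.txt file next to each corrected line image.

    corrections maps a line-image file name to its hand-corrected text.
    (Hypothetical helper; the layout follows my reading of the tesstrain README.)
    """
    gt_dir = Path(gt_dir)
    written = []
    for image_name, text in corrections.items():
        image_path = gt_dir / image_name
        if not image_path.exists():
            continue  # skip transcriptions that have no matching line image
        # A transcription must be a single line, so collapse all whitespace.
        line = " ".join(text.split())
        gt_path = gt_dir / (image_path.stem + ".gt.txt")
        gt_path.write_text(line + "\n", encoding="utf-8")
        written.append(gt_path.name)
    return sorted(written)


def images_without_ground_truth(gt_dir):
    """List .tif line images that still lack a .gt.txt transcription."""
    gt_dir = Path(gt_dir)
    return sorted(
        p.name
        for p in gt_dir.glob("*.tif")
        if not (gt_dir / (p.stem + ".gt.txt")).exists()
    )
```

If I read the README correctly, fine-tuning would then be started with 
something like "make training MODEL_NAME=... START_MODEL=...", pointing 
START_MODEL at the existing IAST model, but I have not verified this.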

If there is a clear step-by-step, command-by-command guide, that would be 
very useful.

Any assistance will be greatly appreciated. If it turns out to be difficult, 
I might be able to collect some donations to give as a reward for someone 
who does the training for me.

References:

Data set: all PDFs found here that start with a copyright notice from 
ISKCON MEDIA VEDIC LIBRARY (please don't worry about the copyright; it's my 
late friend's work that I am trying to preserve):
https://vedicilluminations.com/spiritual-library/Acharya%20Books/

Existing model:
https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST

Probably related langdata:
https://github.com/tesseract-ocr/langdata/tree/main/iast
