Hi! Thank you for reading. I'm new here and am having a hard time getting my head around how the training works. I read https://github.com/tesseract-ocr/tesstrain, but I can't figure out what the best approach is.
My situation: I have about 22,000 pages (PDF / TIFF images), all in the same font and with similar content. The text is English mixed with IAST-transliterated Sanskrit or Bengali. There is an existing IAST.traineddata model that works quite well but makes some mistakes, e.g. the dot below a ḍ or above a ṁ is sometimes missing.

I want to optimise this model to work as well as possible for my data set; I don't care if it can no longer handle other fonts. I was thinking that I could run the existing model on some pages, correct the output, and feed it back somehow, but I can't figure out how. The information I find online mixes Tesseract versions 3, 4 and 5. A clear step-by-step, command-by-command guide would be very useful.

Any assistance will be greatly appreciated. If it turns out to be difficult, I might be able to collect some donations as a reward for someone who does the training for me.

References:
Data set: all PDFs found here that start with a copyright notice from ISKCON MEDIA VEDIC LIBRARY (please don't worry about the copyright; it's my late friend's work that I am trying to preserve): <https://vedicilluminations.com/spiritual-library/Acharya%20Books/>
Existing model: https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST
Probably related langdata: https://github.com/tesseract-ocr/langdata/tree/main/iast
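
To make the "correct the output" idea concrete, here is a rough Python sketch of how I would measure the problem before any training. It compares the raw OCR text of a line with my hand-corrected version and counts which diacritics the model dropped (for example d where ḍ was expected). The function names are my own invention and are not part of Tesseract or tesstrain, and I am assuming both texts are plain Unicode strings. It only catches one-for-one substitutions, so it is a diagnostic aid, not a training step.

```python
import unicodedata
from collections import Counter
from difflib import SequenceMatcher


def strip_marks(text: str) -> str:
    """Remove combining marks, e.g. 'ḍ' -> 'd' and 'ṁ' -> 'm'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def dropped_diacritics(ocr_text: str, corrected_text: str) -> Counter:
    """Count characters where the OCR kept the base letter but lost its mark."""
    # Normalise both sides so precomposed and decomposed forms compare equal.
    ocr = unicodedata.normalize("NFC", ocr_text)
    ref = unicodedata.normalize("NFC", corrected_text)
    counts = Counter()
    matcher = SequenceMatcher(None, ocr, ref, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only same-length substitutions are compared, position by position.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for got, want in zip(ocr[i1:i2], ref[j1:j2]):
            if got != want and strip_marks(want) == got:
                counts[want] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical example line: the OCR output lost the dot below the 'ḍ'.
    print(dropped_diacritics("paṇdita saṁsāra", "paṇḍita saṁsāra"))
```

Running this over a few corrected pages should show whether the errors really are concentrated on characters like ḍ and ṁ, which would help decide how many corrected lines are needed.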

