Hi all, I am trying to train Tesseract for the Sinhalese language, to recognize text in old Sinhalese newspapers. I am new to Tesseract and have a few questions about how to prepare training data for the best results. These are my questions:
1. What is the best resolution (dpi) for training data?
2. I plan to apply binarization and some enhancements as preprocessing before running OCR. Will Tesseract give better results if I train it on preprocessed images, or on raw images (attached herewith)?
3. I don't have the font used in these images, so I can't generate training data myself. Is there any way to create training data other than using scanned images of the newspapers?
4. Sinhalese has a huge character set, including various diacritics that modify the phonetic sound or meaning of a letter. What steps should I take to increase accuracy?

Any help would be appreciated.

Regards,
Ruwanka De Silva

