Hi Ruwanka!

1. 300 to 500 dpi.
2. Preprocessing is necessary. Regarding the sample you gave: lines should be as horizontal as possible, and the text should be black on a white background.
3. The font doesn't have to be identical, just "sufficiently" (very) similar.
4. Diacritics shouldn't be a big issue, at least not the dash-above-character kind. Just make sure you have a sufficiently large sample size (at least 10 specimens) for each character and each character-diacritic combination.
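For point 4, a quick way to check the sample size is to count how often each symbol appears in your box files. Here is a minimal sketch in Python. It assumes the usual Tesseract box-file layout (one symbol per line, followed by left, bottom, right, top and page number); the function names and the sample data are made up for illustration.

```python
from collections import Counter

MIN_SPECIMENS = 10  # threshold from point 4; raise it if you can


def count_specimens(box_text):
    """Count how often each symbol occurs in the text of a box file."""
    counts = Counter()
    for line in box_text.splitlines():
        # Split from the right so the symbol stays whole even if it
        # consists of several code points (base letter + diacritic).
        parts = line.strip().rsplit(" ", 5)
        if len(parts) != 6 or not all(p.isdigit() for p in parts[1:]):
            continue  # skip blank or malformed lines
        counts[parts[0]] += 1
    return counts


def under_sampled(counts, minimum=MIN_SPECIMENS):
    """Return the symbols that occur fewer than `minimum` times."""
    return {symbol: n for symbol, n in counts.items() if n < minimum}


if __name__ == "__main__":
    # Hypothetical data: 12 boxes for one symbol, 3 for another.
    sample = "\n".join(["ක 10 20 30 40 0"] * 12 + ["කා 50 20 80 40 0"] * 3)
    for symbol, n in under_sampled(count_specimens(sample)).items():
        print(f"{symbol}: only {n} specimens")
```

For real data, read each box file as UTF-8 and pass its contents to `count_specimens`. Any symbol it reports needs more scanned examples before training.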
Good luck
Raffael

On Sunday, 15 March 2015 15:45:37 UTC+1, Ruwanka De Silva wrote:
>
> Hi All,
>
> I am trying to train Tesseract for the Sinhalese language, to recognize text
> in old Sinhalese newspapers. I am new to Tesseract and I have a few
> questions about how to prepare training data for best results. These are
> my questions:
>
> 1. What is the best resolution (dpi) for training data?
> 2. I plan to do binarization and some enhancements as preprocessing
> before doing OCR, so will Tesseract give better results if I train it on
> preprocessed images or if I train it on raw images (attached herewith)?
> 3. I don't have the font used in these images, so I couldn't create
> training data myself. Is there any solution for creating training data
> other than using scanned images of newspapers?
> 4. Sinhalese has a huge character set which includes different diacritics that
> modify the phonetic sound/meaning of a letter, so what steps do I
> have to take in order to increase accuracy?
>
> Any help would be appreciated.
>
> Regards,
> Ruwanka De Silva

