[tesseract-ocr] Preparing training data for new language

Ruwanka De Silva Sun, 15 Mar 2015 07:46:22 -0700

Hi All,

I am trying to train Tesseract for the Sinhalese language, in order to recognize text 
in old Sinhalese newspapers. I am new to Tesseract and I have a few 
questions about how to prepare training data for the best results. These are 
my questions:


1. What is the best resolution (dpi) for training data?
2. I plan to do binarization and some enhancements as preprocessing 
before doing OCR, so will Tesseract give better results if I train it on 
preprocessed images or if I train it on raw 
images (attached herewith)?
3. I don't have the font used in these images, so I couldn't create 
training data myself. Is there any way to create training data 
other than using scanned images of newspapers?
4. Sinhalese has a huge character set that includes various diacritics which 
modify the phonetic sound/meaning of a letter, so what steps do I 
have to take in order to increase accuracy?

Any help would be appreciated.

Regards,
Ruwanka De Silva

