Hi Ruwanka!

1. 300 to 500 dpi.
2. Preprocessing is necessary. Regarding the sample you gave: lines should 
be as horizontal as possible, and the text should be black on a white background.
3. The font doesn't have to be identical, just "sufficiently" (very) 
similar.
4. Diacritics shouldn't be a big issue, at least not the 
dash-above-character kind. Just make sure you have a sufficiently large 
sample size (at least 10 specimens) for each 
character(-diacritic combination).
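
To illustrate point 2 (black text on a white background), here is a minimal sketch in plain Python. It is not a Tesseract API and not necessarily how you should preprocess your scans. Otsu's method is my own choice of thresholding technique, the function names are hypothetical, and the "flip if mostly black" rule assumes that text covers less area than the background. Deskewing (making lines horizontal) is not covered here.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold t for 8-bit gray values (0..255).

    Pixels with value <= t are treated as the dark class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize_black_on_white(image):
    """Binarize a grayscale image given as a list of rows of ints (0..255).

    Returns rows containing only 0 (black) and 255 (white). If the result
    is mostly black, it is inverted, on the assumption (mine, not from the
    Tesseract docs) that text covers less area than the background.
    """
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    binary = [[0 if p <= t else 255 for p in row] for row in image]

    black = sum(1 for row in binary for p in row if p == 0)
    white = len(flat) - black
    if black > white:
        binary = [[255 - p for p in row] for row in binary]
    return binary


if __name__ == "__main__":
    # Tiny made-up example: light text on a dark page gets flipped.
    page = [
        [20, 20, 20, 20],
        [20, 220, 230, 20],
        [20, 20, 20, 20],
    ]
    for row in binarize_black_on_white(page):
        print(row)
```

For real scans you would load the pixels with an imaging library and probably use its built-in thresholding; the sketch only shows the idea.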

Good luck

Raffael

On Sunday, 15 March 2015 15:45:37 UTC+1, Ruwanka De Silva wrote:
>
> Hi All,
>
> I am trying to train Tesseract for the Sinhalese language, to recognize text 
> in old Sinhalese newspapers. I am new to Tesseract and I have a few 
> questions about how to prepare training data for best results. These are 
> my questions:
>
> 1. What is the best resolution (dpi) for training data?
> 2. I plan to do binarization and some enhancements as preprocessing 
> before doing OCR, so will Tesseract give the best results if I train it on 
> preprocessed images, or if I train it on raw 
> images (attached herewith)?
> 3. I don't have the font used in these images, so I can't create 
> training data myself. Is there any way to create training data 
> other than using scanned images of newspapers?
> 4. Sinhalese has a huge character set that includes different diacritics that 
> modify the phonetic sound/meaning of a letter, so what steps do I 
> have to take in order to increase accuracy?
>
> Any help would be appreciated.
>
> Regards,
> Ruwanka De Silva
>
