Training only involves putting the data it requires into a few appropriate files and running a few appropriate commands; no programming is required.
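As a purely optional aid when preparing those files, a short script can check that every accented character occurs often enough in the training text. The following is only a sketch: the function names and the `min_count=5` default are my own assumptions (take the real minimum from the wiki), and the character list is copied from the question. Characters such as ɑ̄ are a base letter plus a combining mark, so the script counts them as one unit.

```python
import unicodedata
from collections import Counter

# Characters from the original question. Some (e.g. "ɑ̄") are a base letter
# followed by a combining accent, so they cannot be counted per code point.
REQUIRED = (
    "ā ɑ̄ ē ī ō ū ǖ Ā Ē Ī Ō Ū Ǖ á ɑ́ é í ó ú ǘ Á É Í Ó Ú Ǘ "
    "ǎ ɑ̌ ě ǐ ǒ ǔ ǚ Ǎ Ě Ǐ Ǒ Ǔ Ǚ à ɑ̀ è ì ò ù ǜ À È Ì Ò Ù Ǜ "
    "a ɑ e i o u ü A E I O U Ü"
).split()


def clusters(text):
    """Split text into units: a base character plus any combining marks."""
    out = []
    # NFC merges base + accent into one code point wherever Unicode has one.
    for ch in unicodedata.normalize("NFC", text):
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the accent to the preceding character
        else:
            out.append(ch)
    return out


def underrepresented(text, required=REQUIRED, min_count=5):
    """Return {character: count} for required characters seen fewer than
    min_count times. min_count=5 is a placeholder, not the wiki's figure."""
    counts = Counter(clusters(text))
    missing = {}
    for item in required:
        key = unicodedata.normalize("NFC", item)
        if counts[key] < min_count:
            missing[key] = counts[key]
    return missing


if __name__ == "__main__":
    # Hypothetical sample; replace it with the contents of your training text.
    sample = "mā má mǎ mà ma " * 5
    for char, seen in underrepresented(sample).items():
        print(f"{char!r}: only {seen} sample(s)")
```

Anything the script reports needs more occurrences in the text before the training images are generated.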
The procedure is documented on the wiki:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Take a look at the source training data for Vietnamese, which has many diacritical marks similar to those in your text, and then adapt it to your needs:
http://sourceforge.net/projects/vietocr/files/lang%20data%20for%20tesseract-ocr/source%20training%20data%20for%20vietnamese/boxtiff-3.02.vie.zip

If possible, attach here a one- or two-page text file containing every character of your alphabet, with at least the required minimum number of samples of each. Make the text as realistic as you can, as the wiki suggests.

On Wednesday, August 28, 2013 3:19:39 AM UTC-5, JOSE MARIA GARCIA NAÑEZ wrote:
>
> Hi y'all!
> I have some resources, mainly linguistics material, written entirely in
> pinyin, so no hanzi whatsoever. I've tried to OCR the data with
> commercial software such as ABBYY, Acrobat, etc., but with no luck. The
> problem arises from the following set of characters:
> { o ā ɑ̄ ē ī ō ū ǖ Ā Ē Ī Ō Ū
> Ǖ á ɑ́ é í ó ú ǘ Á É Í Ó Ú Ǘ ǎ ɑ̌ ě ǐ ǒ ǔ ǚ Ǎ Ě Ǐ Ǒ Ǔ Ǚ à ɑ̀ è ì ò ù ǜ
> À È Ì Ò Ù Ǜ a ɑ e i o u ü A E I O U Ü }.
> I've tried everything, but no matter how much training I do, the
> programs just won't get them right. Even ABBYY FineReader languages that
> do contain some of the characters, such as Czech, fail to recognize them.
> I've been looking for a solution in forums for about a year, but my
> attempts have been futile so far. I cannot believe there's no way to work
> this out so, having no idea about programming, I've decided to ask in
> this forum.
>
> Any help will be much appreciated. Thanks in advance.
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

