Re: OCR romanized Asian languages

Quan Nguyen Thu, 29 Aug 2013 17:19:04 -0700

Training only involves putting the required data into a few files and 
running a few commands; no programming is required.


http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Take a look at the source training data for Vietnamese, which has many 
diacritical marks similar to those in your language, and adapt it to your needs.

http://sourceforge.net/projects/vietocr/files/lang%20data%20for%20tesseract-ocr/source%20training%20data%20for%20vietnamese/boxtiff-3.02.vie.zip

If possible, attach here a one- or two-page text file containing all the 
characters of your alphabet, with at least the required minimum number of 
samples for each character. Make the text as realistic as possible, as the 
Wiki suggests.
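
If it helps, here is a rough sketch of how the sample text could be checked 
before attaching it. It is not part of Tesseract; it is only a small Python 
helper that counts how often each needed character appears and lists the ones 
that fall short. The names REQUIRED, MIN_SAMPLES, clusters and under_sampled 
are made up for this example, and the threshold of 10 is an assumption: check 
the Wiki for the numbers it actually recommends. The character list is taken 
from the original question.

```python
import unicodedata
from collections import Counter

# Assumed threshold, for illustration only; see the training Wiki for real figures.
MIN_SAMPLES = 10

# Characters from the original question. The "ɑ" variants have no precomposed
# form, so they are written as U+0251 followed by a combining mark.
REQUIRED = (
    "ā ē ī ō ū ǖ Ā Ē Ī Ō Ū Ǖ "
    "á é í ó ú ǘ Á É Í Ó Ú Ǘ "
    "ǎ ě ǐ ǒ ǔ ǚ Ǎ Ě Ǐ Ǒ Ǔ Ǚ "
    "à è ì ò ù ǜ À È Ì Ò Ù Ǜ "
    "a e i o u ü A E I O U Ü "
    "\u0251 \u0251\u0304 \u0251\u0301 \u0251\u030c \u0251\u0300"
).split()


def clusters(text):
    """Split text into base characters, each with its trailing combining marks."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the combining mark to the previous base character
        else:
            out.append(ch)
    return out


def under_sampled(text, required=REQUIRED, minimum=MIN_SAMPLES):
    """Return {character: count} for required characters seen fewer than `minimum` times."""
    counts = Counter(clusters(text))
    wanted = [unicodedata.normalize("NFC", ch) for ch in required]
    return {ch: counts[ch] for ch in wanted if counts[ch] < minimum}
```

To use it, read the sample text as UTF-8 (for example with 
open("sample.txt", encoding="utf-8").read(), where sample.txt is a placeholder 
name) and pass it to under_sampled(); an empty result means every listed 
character meets the chosen threshold.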

On Wednesday, August 28, 2013 3:19:39 AM UTC-5, JOSE MARIA GARCIA NAÑEZ 
wrote:
>
> Hi y'all!
> I have some resources, mainly linguistics material, written entirely in 
> pinyin, with no hanzi whatsoever. I've tried to OCR the data with 
> commercial software such as ABBYY, Acrobat, etc., but no luck. The problem 
> arises from the following set of characters { o ā ɑ̄ ē ī ō ū ǖ Ā Ē Ī Ō Ū 
> Ǖ á ɑ́ é í ó ú ǘ Á É Í Ó Ú Ǘ   ǎ ɑ̌ ě ǐ ǒ ǔ ǚ Ǎ Ě Ǐ Ǒ Ǔ Ǚ à ɑ̀ è ì ò ù ǜ 
> À È Ì Ò Ù Ǜ a ɑ e i o u ü A E I O U Ü }. I've tried it all, but no matter 
> how much training, they just won't get them right. Even ABBYY FineReader's 
> languages that do contain some of the characters, such as Czech, fail to 
> recognize them. I've been looking for a solution in forums for about a 
> year, but my attempts have been futile so far. I cannot believe there's no 
> way to work this out, so, having no idea about programming, I've decided 
> to ask in this forum.
>
> Any help will be much appreciated. Thanks in advance.
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

