Re: Tess v3 not recognising accented Esperanto characters.

Donaldo Tue, 09 Oct 2012 19:35:12 -0700

 

I found that someone in the tesseract-ocr group recommended using a config
parameter to switch on a new method:

    enable_new_segsearch 1

so I created a new epo.config file (there wasn't one before) with that one
line in it. I generated a new epo.traineddata file and reran my test:

    combine_tessdata epo.
    sudo cp epo.traineddata /usr/share/tesseract-ocr/tessdata/
    tesseract ../monato.tif monato3 -l epo

The result was exactly the same as without the config file.


I then took the epo.number-dawg from the distribution of tesseract-ocr 3.02.
It is quite a bit shorter than the one in eng.number-dawg.
I extracted it and made a new dawg with my epo.unicharset file:

    dawg2wordlist epo.unicharset epo.number-dawg epo.number-list
    wordlist2dawg ./epo.number-list ./epo.number-dawg ./epo.unicharset

This reduced the number of number errors from 5 to 2.

(This editor is playing up (Firefox 15.0): the left margin has gone wonky.)

At this point I rescanned my test document and saved it as monato2.png
(instead of monato.tif) and reran the test.
This reduced the number of errors from 16 to 8 (in 4371),
so the error rate is 0.18%. Most of the errors are now related to spaces
and end-of-lines (e.g. two end-of-lines where there was only one, or a
spurious character at the end of a line). At least these are easy to find,
but I wonder why they happen. That's all for now.
Donaldo
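
For anyone wanting to check the arithmetic behind the 0.18% figure, here is
a minimal sketch. The function name error_rate_percent and the labels are
made up for illustration. The message does not say what unit the 4371
total counts, so the code treats it as an opaque count.

```python
def error_rate_percent(errors: int, total: int) -> float:
    """Return errors as a percentage of the total units checked."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * errors / total


# Total size of the test document as quoted in the message (unit not stated).
TOTAL = 4371

# Error counts quoted in the message, before and after rescanning to PNG.
for label, errors in [("before rescan", 16), ("after rescan", 8)]:
    rate = error_rate_percent(errors, TOTAL)
    print(f"{label}: {errors}/{TOTAL} = {rate:.2f}%")
```

With 8 errors in 4371 this gives 0.18%, matching the figure quoted above.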



