I found that someone in the tesseract-ocr group recommended using a config parameter to switch on a new method:

    enable_new_segsearch 1

so I created a new epo.config file (there wasn't one before) with that one line in it. I generated a new epo.traineddata file and reran my test:

    combine_tessdata epo.
    sudo cp epo.traineddata /usr/share/tesseract-ocr/tessdata/
    tesseract ../monato.tif monato3 -l epo

The result was exactly the same as without the config file.

I then took the epo.number-dawg from the tesseract-ocr 3.02 distribution. It is quite a bit shorter than eng.number-dawg. I extracted its word list and built a new dawg with my epo.unicharset file:

    dawg2wordlist epo.unicharset epo.number-dawg epo.number-list
    wordlist2dawg ./epo.number-list ./epo.number-dawg ./epo.unicharset

This reduced the number of number errors from 5 to 2.

(This editor is playing up (Firefox 15.0): the left margin has gone wonky.)

At this point I rescanned my test document, saved it as monato2.png (instead of monato.tif), and reran the test. This reduced the number of errors from 16 to 8 (in 4371), so the error rate is 0.18%. Most of the errors are now related to spaces and end-of-lines (e.g. two end-of-lines where there was only one, or a spurious character at the end of a line). At least these are easy to find, but I wonder why they happen.

That's all for now.

Donaldo
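
P.S. For anyone who wants to check the error-rate arithmetic above, here is a minimal Python sketch. The function name error_rate is just my own label, and it assumes 4371 is the total count that the errors are measured against.

```python
def error_rate(errors: int, total: int) -> float:
    """Return the error rate as a percentage of the total count."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * errors / total


# Figures from the test runs: 16 errors before the rescan, 8 after.
before = error_rate(16, 4371)
after = error_rate(8, 4371)
print(f"before: {before:.2f}%  after: {after:.2f}%")
```

Running it on the two counts gives roughly 0.37% before the rescan and 0.18% after.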

