On Wed, Oct 03, 2012 at 08:08:16PM -0700, Donaldo wrote:
> This time I got 0.3% character errors (c.p. 1.3% before).

Awesome!

> Here are some of the main errors:
>
> 10 -> l0
> 0,1 % → O,l %
> 0,0000001 -> 0,000000l
> [It apparently is stuck on USA decimal separator, so it thinks that this is
> alphabetic?]

You may well be able to improve these by creating a number-dawg file.
I can't find documentation for it now, but hopefully searching the
mailing list should provide something. Use the English one as a guide
(indeed you may well be able to copy it wholesale.) To see the English
one, unpack the eng.traineddata with

  combine_tessdata -u eng.traineddata eng.

and then undawg the number-dawg file with

  dawg2wordlist eng.unicharset eng.number-dawg numbers

and the 'numbers' file is what you want.

> tro -> tro Ŭ [strange character at end of line.]

I suspect that character was a smudge or mark on the paper, after the
real text? If so preprocessing more would be a solution; I don't know
whether Tesseract can be made more strict about ignoring them. With
your accuracy so high already though, I wouldn't worry about it.

What are you using to test the accuracy, by the way?

I don't know the answers to your other questions. Hopefully others may
be able to help. Good job though, and thanks again for sharing the
process!

Nick
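
For reference, a character error figure like the 0.3% quoted above is
commonly computed as the edit distance between the OCR output and a
hand-checked transcription, divided by the length of the transcription.
The Python sketch below illustrates that idea only. It is an assumption,
not necessarily how the numbers in this thread were measured, and the
function names and the sample strings (assembled from the error examples
quoted above) are purely illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    substitutions, insertions and deletions turning reference
    into hypothesis."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only: ground truth vs. OCR output with the
    # digit/letter confusions mentioned in the thread.
    truth = "10 0,1 % 0,0000001"
    ocr = "l0 O,l % 0,000000l"
    print(f"{character_error_rate(truth, ocr):.1%}")
```

Dividing by the reference length rather than the OCR output length
keeps the figure comparable between runs on the same test page.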

