I am training Tesseract 3.02.02 to improve OCR accuracy on scanned PDFs in English. Accuracy with stock Tesseract and image preprocessing alone is quite poor, and I believe training and modifying the dictionary will improve it substantially. I am creating a new language from several scanned PDFs, generating tiff/box pairs per the wiki directions and using jTessBoxEditor2 to correct errors in the box files.
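For context, the 3.02 training pipeline I am following is roughly the sequence below, sketched from the TrainingTesseract3 wiki; the eng.myfont.exp0 filenames are placeholders, not my actual files:

```shell
# Sketch of the 3.02 training steps from the wiki (placeholder filenames).
# 1. Generate an editable box file for each training image:
tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox
#    ...correct the .box file in jTessBoxEditor, then:
# 2. Train on the corrected tiff/box pair:
tesseract eng.myfont.exp0.tif eng.myfont.exp0 box.train
# 3. Extract the character set from the corrected box files:
unicharset_extractor eng.myfont.exp0.box
# 4. Cluster features (font_properties must list each training font):
mftraining -F font_properties -U unicharset -O eng.unicharset eng.myfont.exp0.tr
cntraining eng.myfont.exp0.tr
# 5. Rename the outputs with the language prefix
#    (inttemp -> eng.inttemp, normproto -> eng.normproto, etc.) and combine:
combine_tessdata eng.
```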
There seem to be a lot of issues with overlapping characters being misidentified:

- "&" as "86"
- "fl" as "fi"
- "tt" as "m"
- "an" as "m"

There seem to be multiple ways to deal with overlapping characters:

1. Limit tessedit_char_whitelist so the output doesn't include unwanted Unicode characters.
2. Add rules to unicharambigs. It seems this should only be used for statistically significant cases like "iii" -> "m".
3. Train by splitting the characters and hoping they get recognized separately, e.g. "fl" -> "f" "l".
4. Train by recognizing the combined blob, e.g. "fl" -> "fl".

In addition, when running Tesseract on the same source tiff files used to create the tiff/box pairs, I was expecting a perfect match. However, it still produces misidentifications. Would training redundant tiff/box pairs help improve accuracy? Am I missing any best practices or improvements?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
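P.S. To make options 1 and 2 above concrete, here is a rough sketch of what I have in mind; the filenames, the whitelist contents, and the config name "letters_only" are placeholders, and the unicharambigs line is my reading of the v1 format rather than a tested rule:

```shell
# Option 1: restrict output characters via a config file (3.x syntax).
# "letters_only" is a hypothetical config file containing the single line:
#   tessedit_char_whitelist abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
tesseract input.tif output -l mylang letters_only

# Option 2: a mylang.unicharambigs entry for the "iii" -> "m" case.
# v1 line format: <wrong-count> <wrong tokens> <right-count> <right tokens> <flag>
# where flag 0 marks an optional ambiguity and 1 a mandatory replacement.
cat > mylang.unicharambigs <<'EOF'
v1
3 i i i 1 m 0
EOF
```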

