I am training Tesseract 3.02.02 to improve OCR accuracy on scanned PDFs in 
English. Accuracy with stock Tesseract plus image manipulation is pretty 
poor, and I believe training and modifying the dictionary will improve it 
a lot. I am creating a new language from several scanned PDFs, generating 
TIFF/box pairs by following the wiki directions and using jTessBoxEditor2 
to correct errors in the box files.

Many of the errors seem to involve overlapping characters being 
misidentified, for example:

   - "&" as "86"
   - "fl" as "fi"
   - "tt" as "m"
   - "an" as "m"

There seem to be multiple ways to deal with overlapping characters:
1. Limit tessedit_char_whitelist so the output doesn't include unwanted 
Unicode characters.
2. Add rules to unicharambigs. It seems this should be reserved for 
statistically significant cases like "iii" -> "m".
3. Train with the characters split into separate boxes, hoping they get 
recognized separately, e.g. "fl" -> "f" "l".
4. Train on the combined blob as a single unit, e.g. "fl" -> "fl".
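
To make options 1 and 2 concrete, here is a minimal Python sketch that 
generates the text of a unicharambigs file and a one-line whitelist config. 
It assumes the "v1" unicharambigs layout described in the training wiki 
(tab-separated fields: source count, source characters, target count, target 
characters, and a type flag where 1 means mandatory replacement and 0 means 
optional). The helper names and the "86" -> "&" rule are hypothetical, only 
for illustration, and whether such a rule actually helps would need testing.

```python
def ambig_line(source, target, mandatory):
    """Format one unicharambigs rule: OCR'd 'source' may really be 'target'."""
    src = list(source)  # one entry per character
    tgt = list(target)
    fields = [
        str(len(src)), " ".join(src),
        str(len(tgt)), " ".join(tgt),
        "1" if mandatory else "0",  # assumed: 1 = always replace, 0 = hint
    ]
    return "\t".join(fields)


def unicharambigs(rules):
    """Build the file text, starting with the 'v1' version header."""
    lines = ["v1"] + [ambig_line(s, t, m) for s, t, m in rules]
    return "\n".join(lines) + "\n"


def whitelist_config(allowed):
    """Build a config-file line restricting the output character set."""
    return "tessedit_char_whitelist " + allowed + "\n"


# Hypothetical rules based on the errors listed above.
rules = [
    ("iii", "m", False),
    ("86", "&", False),
]
print(unicharambigs(rules), end="")
print(whitelist_config("abcdefghijklmnopqrstuvwxyz0123456789&"), end="")
```

Marking the rules as optional leaves the final choice to the dictionary and 
classifier scores, which seems safer than a mandatory replacement for a 
string like "86" that can also occur legitimately.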

In addition, when running Tesseract on the same source TIFF files used to 
create the TIFF/box pairs, I expected a perfect match. However, it still 
produces misidentifications. Would training on redundant TIFF/box pairs 
help improve accuracy?
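
One way to quantify those remaining misidentifications is to compare the 
corrected box file against the OCR output and tally the confusions. The 
sketch below is only an illustration: it assumes each box-file line starts 
with the symbol followed by its coordinates and page number, it ignores 
whitespace in the OCR text, and the function names and sample coordinates 
are made up.

```python
from collections import Counter
from difflib import SequenceMatcher


def box_symbols(box_text):
    """Return the symbol (first field) from each non-empty box-file line."""
    return [line.split()[0] for line in box_text.splitlines() if line.strip()]


def confusions(truth, ocr):
    """Count (expected, recognized) pairs where the two strings differ."""
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts


# Hypothetical box-file content (coordinates are invented) and OCR output.
box_text = """\
R 10 20 30 60 0
& 32 20 55 60 0
D 57 20 80 60 0
fl 90 20 115 60 0
a 117 20 135 45 0
t 137 20 150 55 0
"""
ocr_text = "R86D fiat"

truth = "".join(box_symbols(box_text))
ocr = "".join(ocr_text.split())  # box files carry no spaces or newlines
for (expected, got), n in confusions(truth, ocr).most_common():
    print(repr(expected), "->", repr(got), "x", n)
```

Running this over every training page would show which confusions are 
frequent enough to justify extra training samples or a unicharambigs rule.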

Are there any best practices or improvements I am missing?

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
