I am looking to use tesseract to recognize numerical values only. The source material is 1970s-era typewritten data in tables, so accuracy is very important. I have recompiled tesseract with a modified value for "tessedit_char_whitelist" so that it only returns numerals. This improves the result significantly, but I am curious whether I can improve it further. I think I need to train tesseract for a domain that is very specific to my problem. I have some questions:
1. Is training tesseract the best way forward? Are there other suggestions for improving accuracy?
2. Instead of generating new training pages on my own, I was planning on using actual scan data of the numbers to generate the box files, etc. It seems like this would greatly improve the recognition rate, since the same typewritten font is used everywhere. Is this a valid assumption?
3. The numbers returned have a set number of decimal places. This means I could, in theory, load the dictionary files with every possible number combination that I expect to see. CPU time is unimportant to me (within reason). Would it be a good idea to have ~10 million entries in the dictionary file?
4. If I do train tesseract, I would like to create a visual tutorial as I do it to help others, since I haven't seen one available. Any suggestions for making this helpful to others?

Thanks in advance.
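
P.S. To make question 3 more concrete: besides a dictionary, the fixed number of decimal places could also be checked after recognition. Below is a minimal Python sketch of that idea. It is not part of tesseract, and the layout it assumes (up to 4 integer digits and exactly 2 decimal places) is a placeholder rather than my real data format. The names INT_DIGITS, DECIMALS, is_plausible and flag_suspects are made up for illustration.

```python
import re

# Assumed layout (placeholder values, not the real data format):
# 1 to INT_DIGITS integer digits, a decimal point, exactly DECIMALS digits.
INT_DIGITS = 4
DECIMALS = 2

PATTERN = re.compile(r"\d{1,%d}\.\d{%d}" % (INT_DIGITS, DECIMALS))


def is_plausible(token):
    """Return True if an OCR token matches the assumed fixed-decimal layout."""
    return PATTERN.fullmatch(token) is not None


def flag_suspects(tokens):
    """Return the tokens that do not match the layout, in their original order."""
    return [t for t in tokens if not is_plausible(t)]


if __name__ == "__main__":
    # Tokens as they might come back from one table row of OCR output.
    row = ["12.50", "7.1", "1Z.50", "0.05"]
    print(flag_suspects(row))
```

Anything this flags would still need a manual look or a second OCR pass. It only catches values with the wrong shape, not digits that were misread as other digits.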

