On Wed, Oct 8, 2008 at 2:33 PM, Weg <[EMAIL PROTECTED]> wrote:
>
> I am looking to use tesseract to convert numerical values only. The
> source of information is 70s era typewritten data in tables, so
> accuracy is very important. I have recompiled tesseract with a
> modified value for "tessedit_char_whitelist" to only return numerals.
> This improves the result significantly, but I am curious if I can
> improve it further. I think I need to train tesseract to a domain that
> is very specific to my challenge. I have some questions:
>
> 1. Is training tesseract the best way forward? Are there other
> suggestions for improving accuracy?

For your material, training is probably the best way forward and would yield the quickest results, but you may have some page formatting issues. It depends on the pages. In general, though, I think the fixed pitch detection works quite well.

> 2. Instead of generating new training pages on my own, I was planning
> on using actual scan data of the numbers to generate box
> files, etc. It seems like this would greatly improve the
> recognition rate, since the same typewritten font, etc. is used
> everywhere. Is this a valid assumption?

Right on.

> 3. The numbers returned have a set number of decimal places. This
> means I theoretically could load the Dictionary files with every
> possible number combination that I expect to see. CPU time is
> unimportant to me (within reason). Would it be a good idea to have
> ~10 million entries in the dictionary file?

Well, it is possible, but there is already a number parser. You would be better off hacking that specifically for your application.

> 4. If I do train tesseract, I would like to create a visual tutorial
> to help others as I do it, since I haven't seen one available. Any
> suggestions for making this helpful to others?

Nice idea. I suppose a good way would be to add a document to the wiki with images in it. I believe this is possible, but a bit long-winded: you have to check the images in to svn, and then point to them from the wiki. In any case, it would be great to have more documentation, and I can add you to the list of developers so you can check in to svn.

> Thanks in advance.
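
On question 3, here is a rough illustration of why a 10-million-entry dictionary should not be needed: a fixed number of decimal places can be described by a pattern instead of being enumerated. The sketch below is plain Python applied to the OCR text output, not the number parser inside tesseract, and the names (INT_DIGITS, DECIMALS, is_valid_value, flag_suspect_tokens) and the "up to 5 digits, 2 decimals" format are assumptions made up for the example.

```python
import re

# Assumed format for illustration only: 1 to 5 integer digits followed by
# exactly 2 decimal places. Adjust both counts to match the real tables.
INT_DIGITS = 5
DECIMALS = 2
FIXED_DECIMAL = re.compile(r"\d{1,%d}\.\d{%d}" % (INT_DIGITS, DECIMALS))


def is_valid_value(token):
    """Return True if the token has the expected fixed-decimal shape."""
    return FIXED_DECIMAL.fullmatch(token) is not None


def flag_suspect_tokens(line):
    """Split one line of OCR output on whitespace and return the tokens
    that do not match the expected shape, so they can be reviewed by hand."""
    return [tok for tok in line.split() if not is_valid_value(tok)]


if __name__ == "__main__":
    # Made-up sample line; "3.7" and "1,25" do not fit the assumed format.
    print(flag_suspect_tokens("12.50 3.7 100.00 1,25"))
```

A check like this only flags values with the wrong shape. It cannot catch a digit that was misread as another digit, which is where training on the actual typewritten font should help.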

