Questions about training for numerical values only

Weg Wed, 08 Oct 2008 14:33:40 -0700

I am looking to use tesseract to convert numerical values only.  The
source of information is 70s era typewritten data in tables, so
accuracy is very important.  I have recompiled tesseract with a
modified value for "tessedit_char_whitelist" so that it returns only
numerals. This improves the result significantly, but I am curious
whether I can improve it further. I think I need to train tesseract on
a domain that is very specific to my challenge. I have some questions:


1. Is training tesseract the best way forward? Are there other
suggestions for improving accuracy?

2.  Instead of generating new training pages on my own, I was planning
on using actual scan data of the numbers to generate box files, etc.
It seems like this would greatly improve the recognition rate, since
the same typewritten font is used everywhere. Is this a valid
assumption?

3.  The numbers returned have a set number of decimal places. This
means I theoretically could load the Dictionary files with every
possible number combination that I expect to see.  CPU time is
unimportant to me (within reason).  Would it be a good idea to have
~10 million entries in the dictionary file?

4. If I do train tesseract, I would like to create a visual tutorial
to help others as I do it, since I haven't seen one available.  Any
suggestions for making this helpful to others?
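
To make question 3 concrete, here is a minimal Python sketch of how
such a word list could be generated and how OCR output could be checked
against the expected format. The layout of 4 integer digits and 3
decimal places, the function names, and the output file name are all
assumptions for illustration only; I do not know whether tesseract's
dictionary handling copes with a list like this.

```python
import re


def fixed_decimal_strings(int_digits, frac_digits):
    """Yield every number string with exactly int_digits digits, a '.',
    and exactly frac_digits digits (assumes frac_digits >= 1)."""
    total = int_digits + frac_digits
    for n in range(10 ** total):
        digits = str(n).zfill(total)  # left-pad with zeros to a fixed width
        yield digits[:int_digits] + "." + digits[int_digits:]


def write_word_list(path, int_digits, frac_digits):
    """Write one candidate number per line; return how many were written."""
    count = 0
    with open(path, "w", encoding="ascii") as handle:
        for text in fixed_decimal_strings(int_digits, frac_digits):
            handle.write(text + "\n")
            count += 1
    return count


def matches_format(text, int_digits, frac_digits):
    """Return True if text has exactly the expected fixed-decimal layout."""
    pattern = r"[0-9]{%d}\.[0-9]{%d}" % (int_digits, frac_digits)
    return re.fullmatch(pattern, text) is not None
```

With the assumed layout, write_word_list("numbers.txt", 4, 3) would
write 10,000,000 lines, which is where the ~10 million figure comes
from. The matches_format check is a cheaper alternative worth
comparing: it flags any returned value that does not fit the layout
without needing a dictionary at all.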

Thanks in advance.