Dmitri gave the detailed answer.  

A possible shortcut: try higher-resolution images.  

Another shortcut: pre-process the images with GraphicsMagick to get a 
photocopy-like effect, so that Tesseract can choose a correct threshold 
value. My previous posts might help.
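
To illustrate why contrast matters for the threshold, here is a minimal 
sketch of Otsu's method, a common way to pick a global black/white cutoff 
from a grayscale histogram. It is not a copy of Tesseract's or 
GraphicsMagick's code; the function names otsu_threshold and binarize are 
made up for this example, and the input is assumed to be a flat list of 
0-255 grey values.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best splits pixels into two classes.

    Illustrative sketch of Otsu's method: try every cutoff and keep the one
    that maximizes the between-class variance of "dark" vs "light" pixels.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate cutoff
    sum_dark = 0    # sum of grey values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # nothing on the dark side yet
        w_light = total - w_dark
        if w_light == 0:
            break               # everything is on the dark side
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Hypothetical data: dark text (around 30) on a light card (around 200).
    sample = [30] * 50 + [200] * 50
    t = otsu_threshold(sample)
    print("threshold:", t)
    print("white pixels:", binarize(sample, t).count(255))
```

When text and background form two well-separated clusters, as in the sample 
data, the cutoff falls cleanly between them. On a low-contrast scan the 
clusters overlap and any single cutoff misclassifies some pixels, which is 
what the pre-processing step is meant to avoid.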

On Saturday, May 30, 2015 at 10:17:56 AM UTC-4, S Kirkwood wrote:
>
> Hi, I am working on a project that requires OCR.  I have not used 
> Tesseract much before, aside from using it on some basic examples using the 
> command line tool.  My goal is to use OCR on insurance cards to get all of 
> the characters and then find certain information such as the ID of the 
> cardholder from the output.  In this, accuracy is critical, as a single 
> misread character messes up the entire ID.  
>
> My concern stems from this need for extreme accuracy, which, according to 
> this discussion thread 
> <https://groups.google.com/forum/#!topic/tesseract-ocr/YO9XhsAWW_k>, 
> appears to be possible only by running character recognition on each 
> individual character on the card.  The following quote is the source of 
> most of my worries:
>
>> But if accuracy is critical in your app, in the long run I would 
>> absolutely avoid using any parts of Tesseract except the char classifier. 
>> I.e. crop every single char out of your source image and run Tess in the 
>> single char PSM. I think it should be easy as long as the location of 
>> every character is quite stable among your source images. 
>> ImageMagick/shell scripts would suffice.
>>
>
> However, the images I will be processing differ vastly in layout - not 
> stable like the example I linked to.   Some examples of how the format may 
> differ follow:
>  
>
> <https://lh3.googleusercontent.com/-mPGe6BSmfSU/VWiQQMzkD8I/AAAAAAAAAA8/1WwUjQpPRkE/s1600/Sample_Card_2.jpg>
>  
> <https://lh3.googleusercontent.com/-ovzD1qb6x8g/VWiQWG6zP-I/AAAAAAAAABE/Sb6vNLozPoY/s1600/Sample_Card_3.jpg>
>  
> <https://lh3.googleusercontent.com/-K78wt72YzXA/VWiQinq_wiI/AAAAAAAAABM/wcYKEzXBYdI/s1600/Sample_Card_4.jpg>
>  
>
> I have run Tesseract on samples and while it works for most of the 
> characters, there will be cases where it misreads a single character (such 
> as registering an "H" when the character is a "W") or, even worse, an 
> entire phrase (such as registering "No New Rum" when the phrase is actually 
> "No Referral Required").  Because of errors like this, I would not be able 
> to use the output that Tesseract currently gives me.
>
> Is there a realistic way to use Tesseract for this kind of endeavor?
>
> Thanks for taking the time to read,
> Scott
>  
>
