I am experimenting with scanning invoices to save on keystrokes and am
currently evaluating Tesseract. I have two invoices from the same
supplier, both scanned with the same settings on the same
scanner. The quality of the paper documents is similar. The only
difference between the two documents is the data content: numbers,
products, prices.

Running the two images, which come from the same source, through
Tesseract produces significantly different results.

The first line of sample "A" is shown below:

     P FIBEMI; go; 1 _ 1005824227  `

The first line of sample "B" is below:

     REMIT TO‘ _ 1005822166             "



Another example is from sample "A":

     1`0TAL AMOUNT DUE


The same text from sample "B":

    TUTAL AMOUNT UUE


My plan was to take the output and map it to a tab-delimited text file
for subsequent processing. I have written a small Java program to
parse the OCR output using string processing and pattern recognition
to identify specific bits of data. For example, it finds the index of
"REMIT TO" and then extracts the substring of data that follows,
using that index value.
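
A minimal sketch of that kind of index-based lookup is below. It is not
the actual program; the class name, the method name and the rule "take
the first run of digits after the label" are assumptions made for
illustration. It works on the sample "B" line and finds nothing on the
sample "A" line, because the label text is not there to be found.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names and the digit rule are illustrative only.
class InvoiceFieldExtractor {

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /**
     * Finds the label with indexOf, then returns the first run of digits
     * that appears after it. Returns empty if the label or digits are missing.
     */
    static Optional<String> numberAfterLabel(String line, String label) {
        int start = line.indexOf(label);
        if (start < 0) {
            return Optional.empty(); // exact label text not present in the OCR line
        }
        Matcher m = DIGITS.matcher(line.substring(start + label.length()));
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // First lines of the two samples; \u2018 is the stray quote after "REMIT TO".
        String sampleA = "P FIBEMI; go; 1 _ 1005824227  `";
        String sampleB = "REMIT TO\u2018 _ 1005822166             \"";

        for (String line : new String[] {sampleA, sampleB}) {
            String value = numberAfterLabel(line, "REMIT TO").orElse("<not found>");
            // One tab-delimited record per line: field name, then value.
            System.out.println("REMIT TO\t" + value);
        }
    }
}
```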

My problem is that, in order to parse the output with any degree of
predictability, I need consistency in the OCR output, and I am not
getting that right now. The string "REMIT TO" is returned from OCR as
"P FIBEMI; go; 1" in sample "A" and "REMIT TO‘" in sample "B".
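
To put a number on how far apart these strings are, here is a rough
sketch that compares OCR text to the expected label using Levenshtein
edit distance. This is an assumption about how the variation could be
measured or tolerated, not something Tesseract provides and not part of
the program described above; the class name, method names and the
threshold are hypothetical.

```java
// Hypothetical helper: approximate comparison of OCR text to an expected label.
class FuzzyLabelMatcher {

    /** Standard dynamic-programming Levenshtein distance, two rows at a time. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True if the OCR text is within maxEdits single-character edits of expected. */
    static boolean roughlyMatches(String ocrText, String expected, int maxEdits) {
        return editDistance(ocrText.trim(), expected) <= maxEdits;
    }

    public static void main(String[] args) {
        String[] candidates = {"TUTAL AMOUNT UUE", "1`0TAL AMOUNT DUE"};
        for (String c : candidates) {
            System.out.println(c + " -> " + editDistance(c, "TOTAL AMOUNT DUE"));
        }
        // The threshold of 3 is an arbitrary illustration, not a tuned value.
        System.out.println(roughlyMatches("P FIBEMI; go; 1", "REMIT TO", 3));
    }
}
```

A comparison like this would cope with "TUTAL AMOUNT UUE", but the
sample "A" reading of "REMIT TO" is too far from the original for any
reasonable threshold, so it does not remove the need for more
consistent OCR output.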

The scanner is set to grayscale at 300 dpi. I have tried black and
white at 300 dpi with similar results.

Are there significant variances between scanners in terms of image
quality? I am using a Canon multifunction. If I were to switch to an
HP or something similar, would I get more consistent results?

Any tips on improving the consistency of the OCR output?

Thanks.

TC



