I am experimenting with scanning invoices to save on keystrokes and am
currently evaluating Tesseract. I have two invoices from the same
supplier, both scanned with the same settings on the same
scanner. The quality of the paper documents is similar. The only
difference between the two documents is the data content: numbers,
products and prices.
Running the two images through Tesseract produces significantly
different results, even though they come from the same source.
The first line of sample "A" is shown below:
P FIBEMI; go; 1 _ 1005824227 `
The first line of sample "B" is below:
REMIT TO‘ _ 1005822166 "
Another example is from sample "A":
1`0TAL AMOUNT DUE
The same text from sample "B":
TUTAL AMOUNT UUE
My plan was to take the output and map it to a tab-delimited text file
for subsequent processing. I have written a small Java program to
parse the OCR output using string processing and pattern recognition
to identify specific bits of data in the OCR output. For example, find
the index of "REMIT TO" and then extract the substring of data
relative to that index.
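
A minimal sketch of this kind of index-based extraction, for illustration
only. The class and method names are made up and this is not the actual
program. It also assumes the data of interest is the first run of digits
after the label, which may not hold for every field on the invoice.

```java
import java.util.Optional;

// Hypothetical helper, not the actual program: finds a known label in one
// line of OCR output and returns the first digit run that follows it.
class LabelExtractor {

    // Returns the first run of digits after `label` in `line`, or
    // Optional.empty() if the label or a digit run is not present.
    static Optional<String> numberAfterLabel(String line, String label) {
        int labelStart = line.indexOf(label);
        if (labelStart < 0) {
            return Optional.empty(); // label was not recognised verbatim
        }
        int i = labelStart + label.length();
        // Skip any noise characters between the label and the number.
        while (i < line.length() && !Character.isDigit(line.charAt(i))) {
            i++;
        }
        int start = i;
        // Consume the digit run itself.
        while (i < line.length() && Character.isDigit(line.charAt(i))) {
            i++;
        }
        return start == i ? Optional.empty() : Optional.of(line.substring(start, i));
    }

    public static void main(String[] args) {
        // First lines of samples "A" and "B" as returned by the OCR.
        String sampleA = "P FIBEMI; go; 1 _ 1005824227 `";
        String sampleB = "REMIT TO\u2018 _ 1005822166 \"";
        System.out.println(numberAfterLabel(sampleB, "REMIT TO")); // Optional[1005822166]
        System.out.println(numberAfterLabel(sampleA, "REMIT TO")); // Optional.empty
    }
}
```

An exact indexOf match works on the sample "B" line but finds nothing in
the sample "A" line, because the label itself was misread.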
My problem is that in order to parse the output with any degree of
predictability I need consistency in the OCR output, and I am not
getting that right now. The string "REMIT TO" is returned from OCR as
"P FIBEMI; go; 1" in sample "A" and as "REMIT TO‘" in sample "B".
The scanner is set to grayscale at 300 dpi. I have tried black and
white at 300 dpi with similar results.
Are there significant variances between scanners in terms of image
quality? I am using a Canon multifunction. If I were to switch to an HP
or something similar, would I get more consistent results?
Any tips on improving the consistency of the OCR output?
Thanks.
TC