OCR output issue with mixed size fonts

Arun Wed, 08 Jun 2011 19:53:19 -0700

Hello Tess Group,

I am running into a few issues with the OCR output when the document has
words that vary in size.
Basically, I am trying to OCR invoice documents. A sample document that
gives me trouble is here:


Sample 1:
http://docs.mylifedocs.com/mld/BerryTemplate.tif

I need to extract the Date, Invoice No and Total amount from this
document.
After running OCR, the Date and Invoice number come out fine.
However, the word "Total" in the bottom-right corner ends up as
garbage like "'|'g|ga|".

I assumed that this could be due to noise and stray marks before
the word "Total",
so I manually cleared most of the unwanted data, and the template ended
up looking like this:

Sample 2:
http://docs.mylifedocs.com/mld/Berry.tif

After I ran OCR again, the word "Total" came out as "Tgtal".

With this background, I have a few questions.

1. Why does "Total" end up as "Tgtal", and is there a way to correct
it?

2. Is there a programmatic way to convert my image from Sample 1 to
Sample 2, so that I have a better chance of getting good OCR data?

3. Are there any other best practices I can apply to the image, or
Tesseract settings I can change, to improve my OCR results?
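
For question 1, one workaround I can imagine is correcting the OCR text
afterwards by fuzzy-matching each word against the field labels I expect
on the invoice. Below is a minimal sketch of that idea using only Python's
standard difflib module. It is not a Tesseract feature; the label list,
the function name correct_label and the 0.7 cutoff are my own assumptions
for illustration.

```python
import difflib

# Hypothetical list of field labels expected on the invoice (an assumption).
EXPECTED_LABELS = ["Date", "Invoice", "Total"]


def correct_label(word, labels=EXPECTED_LABELS, cutoff=0.7):
    """Return the closest expected label for an OCR'd word.

    If no label is at least `cutoff` similar, the word is returned unchanged.
    The comparison is case-sensitive.
    """
    matches = difflib.get_close_matches(word, labels, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

With these assumptions, correct_label("Tgtal") would give back "Total",
while heavy garbage such as "'|'g|ga|" is too far from any label and is
left unchanged, so this would only help with the Sample 2 output.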

I tried different DPIs for the image; 300 DPI seemed to give the best
results so far.


I would appreciate your feedback.

Regards
Arun











