Hello Tess Group, I am running into a few issues with the OCR output when a document has words that vary in size. Basically, I am trying to OCR invoice documents. A sample document that I am having trouble with is here:
Sample 1: http://docs.mylifedocs.com/mld/BerryTemplate.tif

I need to extract the Date, Invoice No and Total amount from this document. After OCR, the Date and Invoice number come out fine. However, the word "Total" in the bottom right corner ends up as garbage like "'|'g|ga|".

I assumed this could be due to noise and random data before the word "Total", so I manually cleared most of the unwanted data, and the template ended up like this:

Sample 2: http://docs.mylifedocs.com/mld/Berry.tif

After I ran OCR again, the word "Total" came out as "Tgtal".

With this background, I have a few questions:

1. Why does "Total" end up as "Tgtal", and is there a way to correct it?
2. Is there a programmatic way to convert my image from Sample 1 to Sample 2, so that I have a chance of getting better OCR data?
3. Are there any other best practices I can apply to the image, or other Tesseract settings, to improve my OCR?

I tried different DPIs for the image; 300 DPI has seemed the best so far.

Appreciate your feedback.

Regards,
Arun
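P.S. To make question 2 concrete, here is a rough sketch of the kind of cleanup I did by hand between Sample 1 and Sample 2. It is only an illustration under some assumptions: the page has already been binarized into a grid of 0/1 values (1 = ink), the unwanted marks are small connected blobs, and the function name and the `min_pixels` threshold are made up for this example. It is not a Tesseract feature, and loading the TIFF into such a grid would need an imaging library that is not shown here.

```python
from collections import deque


def remove_small_components(image, min_pixels):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    every 4-connected ink blob smaller than min_pixels has been erased.

    Hypothetical helper for illustration; not part of Tesseract.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search to collect one connected blob of ink.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Blobs below the threshold are treated as noise and erased.
            if len(blob) < min_pixels:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned
```

One obvious risk with this approach is that a threshold large enough to remove the noise could also erase small legitimate marks such as periods or the dots on "i", so the threshold would have to be tuned per document.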

