I wonder if there is anything consistent about the invoice design? 

For instance, I notice that your invoice has "Honda" logos at the top left 
and top right, which give you two anchors from which you could work out the 
resolution and the location/orientation of the table of data.
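
To make the two-anchor idea concrete, here is a rough Python sketch. It 
assumes you have already located the centre of each logo, and that you keep 
one "reference" invoice on which you have measured the field positions by 
hand. All coordinates and the name fit_similarity are made up for 
illustration. Two point pairs are enough to fix scale, rotation and offset, 
but not skew or perspective, so a photo taken at an angle would need more 
than this.

```python
import cmath
import math


def fit_similarity(ref_a, ref_b, img_a, img_b):
    """Fit scale + rotation + translation from two matching anchor points.

    ref_a/ref_b are the anchor positions on the reference invoice,
    img_a/img_b are the same anchors found on the new image.
    Returns (scale, angle_in_degrees, to_image), where to_image maps a
    reference (x, y) point to the new image.
    """
    ra, rb = complex(*ref_a), complex(*ref_b)
    ia, ib = complex(*img_a), complex(*img_b)
    m = (ib - ia) / (rb - ra)  # one complex number holds scale and rotation
    t = ia - m * ra            # translation

    def to_image(point):
        z = m * complex(*point) + t
        return (z.real, z.imag)

    return abs(m), math.degrees(cmath.phase(m)), to_image


# Hypothetical numbers: logos at (100, 80) and (1100, 80) on the reference,
# found at (50, 40) and (550, 40) on a lower-resolution scan.
scale, angle, to_image = fit_similarity((100, 80), (1100, 80), (50, 40), (550, 40))
field_corner = to_image((600, 400))  # where a reference field lands on the scan
```

You would map each corner of every field rectangle this way and crop the 
result before passing it to Tesseract.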

You could also look at table recognition techniques, which would let you 
find the rectangles to crop automatically instead of hard-coding them:

http://www.researchgate.net/publication/220781373_Automatic_Table_Detection_in_Document_Images/links/0fcfd5107ee667db68000000

http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6628801&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6628801
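
I am not reproducing the methods from those papers here, but the simplest 
form of the idea is easy to sketch: if the table has printed ruling lines, 
rows and columns that are almost entirely dark are probably rulings, and 
the gaps between them are the cells. The Python below assumes a binarised, 
deskewed image held as a list of rows (1 = dark pixel) with rulings one 
pixel thick. On a real scan you would have to merge neighbouring indices 
and tolerate broken lines. The function names and the 0.9 threshold are my 
own choices.

```python
def find_rulings(binary, min_fill=0.9):
    """Return (rows, cols): indices of rows/columns that are mostly dark."""
    height, width = len(binary), len(binary[0])
    rows = [y for y in range(height) if sum(binary[y]) >= min_fill * width]
    cols = [x for x in range(width)
            if sum(row[x] for row in binary) >= min_fill * height]
    return rows, cols


def cells_between(rows, cols):
    """Return (left, top, right, bottom) boxes for the cell interiors.

    right and bottom are exclusive, so a box can be used directly as a
    slice range when cropping.
    """
    return [(cols[j] + 1, rows[i] + 1, cols[j + 1], rows[i + 1])
            for i in range(len(rows) - 1)
            for j in range(len(cols) - 1)]


# Toy 9x7 "image" with rulings on rows 0, 3, 6 and columns 0, 4, 8.
grid = [[1 if (y in (0, 3, 6) or x in (0, 4, 8)) else 0 for x in range(9)]
        for y in range(7)]
rows, cols = find_rulings(grid)
boxes = cells_between(rows, cols)
```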

I would suggest using OpenCV rather than ImageMagick, as it provides more 
advanced algorithms for understanding your image (such as edge detection 
and pattern recognition, e.g. for finding the Honda logos).
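
To show what I mean by pattern recognition for the logos, here is the basic 
template-matching idea written out in plain Python, so it runs without any 
libraries: slide a small picture of the logo over the page and keep the 
offset where the sum of squared differences is smallest. OpenCV's 
cv2.matchTemplate with the cv2.TM_SQDIFF method computes the same measure 
far faster, and that is what I would use in practice. The tiny "logo" and 
"page" below are invented test data, and this naive version does not cope 
with changes in scale or rotation.

```python
def match_template(image, template):
    """Return ((x, y), score) for the best top-left offset of template.

    image and template are greyscale images as lists of rows.
    score is the sum of squared differences; 0 means an exact match.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            score = sum((image[y + dy][x + dx] - template[dy][dx]) ** 2
                        for dy in range(th) for dx in range(tw))
            if score < best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score


# Invented data: a 3x3 cross standing in for the logo, pasted into a
# blank 12x8 page with its top-left corner at x=7, y=2.
logo = [[0, 255, 0],
        [255, 255, 255],
        [0, 255, 0]]
page = [[0] * 12 for _ in range(8)]
for dy, row in enumerate(logo):
    for dx, value in enumerate(row):
        page[2 + dy][7 + dx] = value

position, score = match_template(page, logo)
```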

I think your problem is best solved by identifying the discrete rectangles 
and running OCR on each one separately. Otherwise you will get noise that 
is difficult to filter out when looking for what you need, e.g. a person's 
name.

Cheers

On Tuesday, 18 November 2014 19:53:08 UTC, Vinay Matam wrote:
>
> Hi All,
>
> I really need your help with one of the projects that I am working on. I 
> am using Tesseract 3.02 on an Ubuntu machine.
>
> I have an invoice (please see the attached file). I want to extract some 
> information from that invoice, such as Advisor Name, Invoice Number, 
> Invoice Date, License No, Mileage, etc.
>
> I have tried to extract the whole data from the image to a text file. By 
> doing some pre-processing on the image using Imagemagick, I was able to 
> extract the info to some extent. However, I am not totally satisfied with 
> the output. 
> I need your input on how I should extract the information. Should I first 
> crop the image into separate rectangles and then OCR them individually? I 
> tried this and got great results. But the images are not all the same size 
> or resolution, so the rectangle co-ordinates will not work in all cases, 
> and I do not think this method will work on all images (scanned, taken 
> from a mobile, or PDF files).
>
> Then I thought of using Regular expressions on the extracted data and then 
> pick up the data that I require from the whole text file. But this method 
> also does not seem to be working. 
>
> I am totally in a confused state now. Any help or inputs are much 
> appreciated. .. :) I have attached a sample image and the extracted output.
>
> Thanks,
> Vinay.
>
