Is there anything consistent about the invoice design? For instance, I notice that your invoice has "Honda" logos at the top left and top right, which essentially gives you two anchors from which you could extrapolate the resolution and the location/orientation of the table of data.
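To make the two-anchor idea concrete, here is a rough sketch, assuming you have already located the centre of each logo in both a reference invoice and the new scan (for example with template matching in an image library; that step is not shown). Two point pairs are enough to recover scale, rotation and translation, which you can then apply to crop rectangles measured once on the reference. All function names and coordinates below are made up for illustration, and the sketch assumes the scan is only scaled, rotated and shifted, with no perspective distortion.

```python
import cmath
import math


def similarity_from_anchors(ref_left, ref_right, scan_left, scan_right):
    """Return (s, t) such that scan_point = s * ref_point + t.

    Points are (x, y) tuples treated as complex numbers x + yj.
    abs(s) is the scale factor, cmath.phase(s) the rotation in radians.
    """
    a1, a2 = complex(*ref_left), complex(*ref_right)
    b1, b2 = complex(*scan_left), complex(*scan_right)
    if a1 == a2:
        raise ValueError("reference anchors must be two distinct points")
    s = (b2 - b1) / (a2 - a1)  # combined scale and rotation
    t = b1 - s * a1            # translation
    return s, t


def map_point(s, t, point):
    """Map one reference (x, y) point into scan coordinates."""
    z = s * complex(*point) + t
    return (z.real, z.imag)


def map_rectangle(s, t, rect):
    """Map a reference rectangle (x, y, width, height) to four scan corners.

    Corners are returned in order: top-left, top-right,
    bottom-right, bottom-left.
    """
    x, y, w, h = rect
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return [map_point(s, t, c) for c in corners]


if __name__ == "__main__":
    # Hypothetical logo centres: the reference invoice, then a new scan
    # taken at twice the resolution.
    s, t = similarity_from_anchors((100, 50), (900, 50), (200, 100), (1800, 100))
    print("scale:", abs(s), "rotation (deg):", math.degrees(cmath.phase(s)))
    # Hypothetical field rectangle measured once on the reference invoice.
    print(map_rectangle(s, t, (100, 200, 300, 40)))
```

If the rotation comes out as more than a degree or two, it is probably easier to de-rotate the whole scan first and then crop axis-aligned rectangles, rather than cropping rotated boxes.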
You could also look at techniques for table recognition, which would let you automate the rectangular cropping:

http://www.researchgate.net/publication/220781373_Automatic_Table_Detection_in_Document_Images/links/0fcfd5107ee667db68000000
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6628801&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6628801

Rather than ImageMagick, I would suggest using OpenCV, as it provides more advanced algorithms for understanding your image (such as edge detection and pattern recognition, e.g. for the Honda logos). I think your problem is best served by identifying the discrete rectangles first; otherwise you will get noise that is difficult to filter for what you need, e.g. a person's name.

Cheers

On Tuesday, 18 November 2014 19:53:08 UTC, Vinay Matam wrote:
>
> Hi All,
>
> I really need your help with one of the projects that I am working on. I
> am using Tesseract 3.02 on a Ubuntu machine.
>
> I have an invoice (please see the attached file). I want to extract some
> information from that invoice like Advisor Name, Invoice Number, Invoice
> Date, License No, Mileage etc..
>
> I have tried to extract the whole data from the image to a text file. By
> doing some pre-processing on the image using Imagemagick, I was able to
> extract the info to some extent. However, I am not totally satisfied with
> the output.
> I need your inputs on how I should extract the information. Shall I first
> crop the specific portion of the image to different rectangles and then OCR
> them individually..? I tried this way and gained great results. But again
> in this case, not all the images are in the same size with same resolution
> and hence the rectangles co-ordinates will not work on all the cases. I
> thought this method will not work on all images (scanned, taken from mobile
> or pdf files).
>
> Then I thought of using Regular expressions on the extracted data and then
> pick up the data that I require from the whole text file. But this method
> also does not seem to be working.
>
> I am totally in a confused state now. Any help or inputs are much
> appreciated. .. :) I have attached a sample image and the extracted output.
>
> Thanks,
> Vinay.

