[tesseract-ocr] Re: Need Help with extracting info from Invoice

Vinay Matam Wed, 19 Nov 2014 11:23:28 -0800

Thanks, Allistair, for replying. The invoices come in a wide variety of 
layouts with no single standard format. All of them contain the fields I 
mentioned in my earlier post, but those fields may appear at different 
locations in the image. Our solution needs to extract them regardless of 
the invoice format.


I will certainly check the links you provided. I also have another idea 
that I will try to implement, and I will post an update here. :)

Thanks again.. :)
Vinay

On Wednesday, November 19, 2014 5:26:34 AM UTC+5:30, Allistair C wrote:
>
> I wonder if there is anything consistent about the invoice design? 
>
> For instance I notice that your invoice has "Honda" logos on the top left 
> and top right essentially providing 2 anchors from which you could 
> extrapolate resolution and location/orientation of the table of data.
>
> You could also look at techniques for table recognition thereby automating 
> your rectangular cropping modes.
>
>
> http://www.researchgate.net/publication/220781373_Automatic_Table_Detection_in_Document_Images/links/0fcfd5107ee667db68000000
>
>
> http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6628801&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6628801
>
> I would suggest that, rather than using ImageMagick, you use OpenCV 
> instead, as it provides more advanced algorithms for understanding your 
> image (such as edge detection/pattern recognition, e.g. for the Honda 
> logos).
>
> I think the problem you have is best served by trying to identify the 
> discrete rectangles; otherwise you will get noise that is difficult to 
> filter out to find what you need, e.g. a person's name.
>
> Cheers
>
> On Tuesday, 18 November 2014 19:53:08 UTC, Vinay Matam wrote:
>>
>> Hi All,
>>
>> I really need your help with one of the projects that I am working on. I 
>> am using Tesseract 3.02 on an Ubuntu machine.
>>
>> I have an invoice (please see the attached file). I want to extract some 
>> information from that invoice like Advisor Name, Invoice Number, Invoice 
>> Date, License No, Mileage etc..
>>
>> I have tried to extract all the data from the image to a text file. By 
>> doing some pre-processing on the image using ImageMagick, I was able to 
>> extract the info to some extent. However, I am not fully satisfied with 
>> the output. 
>> I need your input on how I should extract the information. Should I first 
>> crop specific portions of the image into separate rectangles and then OCR 
>> them individually? I tried this and got good results. But the images are 
>> not all the same size or resolution, so fixed rectangle coordinates will 
>> not work in every case. I do not think this method will work on all 
>> images (scanned, taken from a mobile phone, or PDF files).
>>
>> Then I thought of running regular expressions on the extracted text and 
>> picking out the data that I require from the whole text file. But this 
>> method does not seem to be working either. 
>>
>> I am quite confused now. Any help or input is much appreciated. :) 
>> I have attached a sample image and the extracted output.
>>
>> Thanks,
>> Vinay.
>>
>
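
The two-anchor idea quoted above can be sketched as plain coordinate 
arithmetic. This is only a sketch under stated assumptions: the anchor 
positions (for example the centres of the two logos) are assumed to have 
been located already by some other means, and all coordinates below are 
hypothetical. It handles scale, rotation and shift, but not the 
perspective distortion that photos taken with a mobile phone can have.

```python
def similarity_from_anchors(ref_a, ref_b, img_a, img_b):
    """Build a mapping from reference-template points to image points.

    ref_a/ref_b are the two anchors in the reference layout, img_a/img_b
    the same anchors as found in the new image. Points are (x, y) tuples.
    """
    ra, rb = complex(*ref_a), complex(*ref_b)
    ia, ib = complex(*img_a), complex(*img_b)
    if ra == rb:
        raise ValueError("reference anchors must be distinct")
    # One complex factor encodes both the scale and the rotation.
    factor = (ib - ia) / (rb - ra)

    def to_image(point):
        z = ia + factor * (complex(*point) - ra)
        return (z.real, z.imag)

    return to_image, abs(factor)


def map_rect(to_image, rect):
    """Map an (x, y, w, h) crop box; returns the axis-aligned bounding box."""
    x, y, w, h = rect
    corners = [to_image(p) for p in
               ((x, y), (x + w, y), (x, y + h), (x + w, y + h))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


if __name__ == "__main__":
    # Hypothetical anchors: the new image is the template at twice the size.
    to_image, scale = similarity_from_anchors(
        (100, 50), (900, 50), (200, 100), (1800, 100))
    print(scale, map_rect(to_image, (300, 400, 200, 50)))
```

Crop rectangles then need to be defined only once, on a reference 
layout, and are rescaled for each new image before the regions are passed 
to OCR.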
