On Tuesday, May 10, 2016 at 7:35:06 AM UTC-4, Javed Shaikh wrote:
> I have to load non-readable PDFs, which are mainly invoices. They are
> mostly scans of Excel-generated data and are in tabular format. I am able
> to read the data within these tables; however, in some cases the position or
> column of a particular value in the table is important to me (so as to
> determine what attributes I need to set in my code).
> Some of the scans are pretty complex (with certain columns blank, so I need
> to assume a 0 or blank value), but after the OCR is done these minor yet
> significant details are lost.

The hOCR output includes coordinates of where on the page the text was found. You could use this with your favorite XML parser as a starting point.

Tom
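
To illustrate the idea, here is a minimal sketch in Python using only the standard library. It assumes the hOCR file parses as well-formed XML and that each word is an element with class "ocrx_word" whose title attribute carries "bbox x0 y0 x1 y1". The sample markup is hand-written in that shape (not real Tesseract output), and the column boundaries and the assign_column helper are hypothetical; you would have to work out the real column edges for your own invoice layout.

```python
import re
import xml.etree.ElementTree as ET

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words(hocr_text):
    """Return a list of (text, x0, y0, x1, y1) for each ocrx_word element."""
    root = ET.fromstring(hocr_text)
    words = []
    # Check the class attribute instead of the tag name, so the XHTML
    # namespace prefix on tags does not matter.
    for elem in root.iter():
        if elem.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(elem.get("title", ""))
        text = "".join(elem.itertext()).strip()
        if match and text:
            words.append((text, *map(int, match.groups())))
    return words


def assign_column(x0, column_starts):
    """Hypothetical helper: index of the last column whose left edge is <= x0."""
    index = 0
    for i, start in enumerate(column_starts):
        if x0 >= start:
            index = i
    return index


# Hand-written sample in the shape of hOCR; the quantity column is blank.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<span class="ocr_line" title="bbox 40 100 600 130">
<span class="ocrx_word" title="bbox 40 100 120 130; x_wconf 93">Widget</span>
<span class="ocrx_word" title="bbox 450 100 520 130; x_wconf 90">12.50</span>
</span></body></html>"""

# Assumed left edges (in pixels) of the item, quantity, and price columns.
COLUMN_STARTS = [0, 200, 400]

for text, x0, y0, x1, y1 in hocr_words(SAMPLE):
    print(text, "-> column", assign_column(x0, COLUMN_STARTS))
```

Because each word is placed by its x coordinate rather than by its order in the line, a blank column simply ends up with no words assigned to it, and your code can then fill in 0 or an empty value for it.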

