On Tuesday, May 10, 2016 at 7:35:06 AM UTC-4, Javed Shaikh wrote:
>
>
> I have to load non-readable PDFs, which are mainly invoices. They are 
> mostly scans of Excel-generated data in tabular format. I am able to 
> read the data within these tables; however, in some cases the position or 
> column of a particular value in the table matters to me (it determines 
> which attributes I need to set in my code).
> Some of the scans are fairly complex (certain columns are blank, so I need 
> to assume a 0 or blank value), but after OCR these minor yet 
> significant details are lost. 
>

The hOCR output includes the coordinates of where on the page each piece of 
text was found (as a bbox in each element's title attribute). You could parse 
it with your favorite XML parser as a starting point.
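
As a rough illustration of that idea, here is a minimal Python sketch. It 
assumes the hOCR was produced with something like "tesseract input.png out hocr" 
and that words appear as elements with class "ocrx_word" carrying a 
"bbox x0 y0 x1 y1" entry in their title attribute. The helper names 
(hocr_words, assign_column), the inline SAMPLE document, and the column 
boundaries are made up for the example; real boundaries would have to come 
from your own invoice layout.

```python
import re
import xml.etree.ElementTree as ET

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words(root):
    """Return (text, x0, y0, x1, y1) for every ocrx_word element under root."""
    words = []
    for el in root.iter():
        # Tags may be namespaced (XHTML), so filter on the class attribute.
        if el.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(el.get("title", ""))
        if match is None:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        text = "".join(el.itertext()).strip()
        words.append((text, x0, y0, x1, y1))
    return words


def assign_column(x0, x1, column_edges):
    """Return the index of the column containing the word's horizontal
    center, or None if it falls outside every column."""
    center = (x0 + x1) / 2
    for index, (left, right) in enumerate(column_edges):
        if left <= center < right:
            return index
    return None


# Hypothetical one-line hOCR fragment: the middle column is blank in the scan.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 600 200">
<span class="ocr_line" title="bbox 30 40 560 70">
<span class="ocrx_word" title="bbox 30 40 110 70; x_wconf 96">Widget</span>
<span class="ocrx_word" title="bbox 480 40 560 70; x_wconf 91">12.50</span>
</span></div></body></html>"""

# Assumed column boundaries in pixels (left, right); not taken from any real invoice.
COLUMN_EDGES = [(0, 200), (200, 400), (400, 600)]

words = hocr_words(ET.fromstring(SAMPLE))
row = [""] * len(COLUMN_EDGES)
for text, x0, _y0, x1, _y1 in words:
    column = assign_column(x0, x1, COLUMN_EDGES)
    if column is not None:
        row[column] = text

print(row)
```

A column with no word inside its range stays as an empty string, which is 
where a default such as 0 could be filled in. For a file on disk, 
ET.parse(path).getroot() can be passed to hocr_words instead of the parsed 
sample string.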

Tom
