Re: [tesseract-ocr] Re: is hOCR the best route to convert a large number of repetitive forms into structured data?

Helmut Wollmersdorfer Wed, 15 Jul 2015 00:42:07 -0700


On Tuesday, 14 July 2015 22:43:09 UTC+2, [email protected] wrote:
>
> Thanks everyone for helpful pointers! These all appear to be different 
> ways of describing the position of the identified words on the page? This 
> definitely seems like it would help me produce structured data because I 
> can classify the words as belonging to certain attributes of a json object 
> for each page based on their vertical and horizontal positions.
>
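
A minimal sketch of the position-based classification described in the quoted
message, assuming Tesseract's hOCR output, where each word is a span of class
ocrx_word whose title attribute carries "bbox x0 y0 x1 y1". The field names
and pixel regions in FIELD_REGIONS are hypothetical and would have to be
measured on the real form; the helper names are made up for illustration.

```python
import json
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Hypothetical form layout: field name -> (x0, y0, x1, y1) region in pixels.
FIELD_REGIONS = {
    "name": (0, 80, 600, 130),
    "date": (0, 130, 600, 180),
}

# Tiny hand-written hOCR fragment, used only for demonstration.
SAMPLE = """
<span class='ocr_line' title='bbox 36 92 500 116'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Jane</span>
<span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 93'>Doe</span>
</span>
<span class='ocr_line' title='bbox 36 140 500 164'>
<span class='ocrx_word' title='bbox 36 140 150 164; x_wconf 90'>2015-07-14</span>
</span>
"""


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def classify(words, regions):
    """Assign each word to the first region containing its bbox centre."""
    record = {field: [] for field in regions}
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for field, (rx0, ry0, rx1, ry1) in regions.items():
            if rx0 <= cx < rx1 and ry0 <= cy < ry1:
                record[field].append((x0, text))
                break
    # Assumes single-line fields: words are ordered left to right by x0.
    return {f: " ".join(t for _, t in sorted(ws)) for f, ws in record.items()}


def hocr_to_record(hocr, regions=FIELD_REGIONS):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return classify(parser.words, regions)


if __name__ == "__main__":
    print(json.dumps(hocr_to_record(SAMPLE), indent=2))
```

Words whose centre falls outside every region are dropped, which is probably
what you want for the fixed labels printed on the form, as long as the regions
are drawn around the fill-in areas only.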


There is another HTML format that positions the text via CSS classes (i.e. it 
is valid HTML): the output of pdf2htmlEX. See the example here:

http://coolwanglu.github.io/pdf2htmlEX/demo/geneve.html

Project:

https://github.com/coolwanglu/pdf2htmlEX

 

