On Tuesday, 14 July 2015 22:43:09 UTC+2, [email protected] wrote:
>
> Thanks everyone for helpful pointers! These all appear to be different
> ways of describing the position of the identified words on the page? This
> definitely seems like it would help me produce structured data because I
> can classify the words as belonging to certain attributes of a JSON object
> for each page based on their vertical and horizontal positions.
>
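The position-based classification described in the quoted message could look roughly like the sketch below. It is only an illustration: it assumes the word bounding boxes have already been extracted as (x0, y0, x1, y1) pixel tuples (for example from the bbox values in hOCR output), and the region names, coordinates and the classify_words helper are all made up for the example.

```python
import json

# Hypothetical page regions: field name -> (left, top, right, bottom) in pixels.
# Real layouts would need their own coordinates.
REGIONS = {
    "title": (0, 0, 1000, 100),
    "body": (0, 100, 1000, 900),
    "footer": (0, 900, 1000, 1000),
}


def classify_words(words, regions=REGIONS):
    """Group words into fields by where the centre of each bounding box falls."""
    page = {name: [] for name in regions}
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for name, (left, top, right, bottom) in regions.items():
            if left <= cx < right and top <= cy < bottom:
                page[name].append(text)
                break  # each word is assigned to at most one region
    # Join the words of each field in the order they were given.
    return {name: " ".join(tokens) for name, tokens in page.items()}


# Made-up sample input: (word, bounding box) pairs in reading order.
words = [
    ("Annual", (50, 20, 200, 60)),
    ("Report", (210, 20, 380, 60)),
    ("Revenue", (50, 150, 220, 190)),
    ("grew", (230, 150, 320, 190)),
    ("Page", (450, 930, 520, 960)),
    ("1", (530, 930, 545, 960)),
]

result = classify_words(words)
print(json.dumps(result, indent=2))
```

Words whose centre falls outside every region are simply dropped here; whether that is acceptable depends on the documents being processed.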
There is another HTML output format that positions the text via CSS classes (i.e. it is valid HTML): pdf2htmlEX.

See the example here: http://coolwanglu.github.io/pdf2htmlEX/demo/geneve.html
Project: https://github.com/coolwanglu/pdf2htmlEX

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f0784b55-3e97-4095-b0ce-f650a4bf3bff%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

