Thanks everyone for the helpful pointers! If I understand correctly, these are all different ways of describing the position of the recognized words on the page? That definitely seems like it would help me produce structured data, because I can assign the words on each page to attributes of a JSON object based on their vertical and horizontal positions.
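To make sure I am describing the idea clearly, here is a rough sketch of what I have in mind. It assumes the word boxes have already been parsed out of the hOCR or TSV output into (text, left, top, right, bottom) tuples in pixels. The sample words, the region names, the region coordinates, and the `classify` function are all made up for illustration; nothing here comes from Tesseract itself.

```python
import json

# Hypothetical word boxes, as they might look after parsing hOCR/TSV output:
# (text, left, top, right, bottom) in pixels.
WORDS = [
    ("Invoice", 40, 30, 140, 60),
    ("#1234", 150, 30, 220, 60),
    ("Total:", 40, 700, 110, 730),
    ("$99.00", 120, 700, 200, 730),
]

# Hypothetical page regions: JSON attribute name -> (left, top, right, bottom).
REGIONS = {
    "header": (0, 0, 600, 100),
    "footer": (0, 650, 600, 800),
}


def classify(words, regions):
    """Assign each word to the first region that contains its center point."""
    page = {name: [] for name in regions}
    page["unassigned"] = []
    for text, left, top, right, bottom in words:
        cx = (left + right) / 2
        cy = (top + bottom) / 2
        for name, (rl, rt, rr, rb) in regions.items():
            if rl <= cx < rr and rt <= cy < rb:
                page[name].append(text)
                break
        else:
            # No region matched this word.
            page["unassigned"].append(text)
    # Join the words of each region in input order.
    return {name: " ".join(tokens) for name, tokens in page.items()}


if __name__ == "__main__":
    print(json.dumps(classify(WORDS, REGIONS), indent=2))
```

Using the center of each box rather than its corners is just my guess at a simple rule that tolerates boxes straddling a region boundary; real pages would probably need the regions tuned per layout.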
I am afraid that, since I am so new to Tesseract and OCR in general, I am missing important points or asking stupid questions, so unless you all suggest otherwise I will spend quite a bit of time with the Tesseract source code on GitHub.

On Tuesday, July 14, 2015 at 12:21:42 PM UTC-7, jsbien wrote:
> Quote/Cytat - Tom Morris <[email protected]> (Tue 14 Jul 2015 08:35:19 PM CEST):
>
> > On Tuesday, July 14, 2015 at 2:47:40 AM UTC-4, James Owers wrote:
> >> You should consider also using the PAGE format. You can use this tool for
> >> conversion: http://www.primaresearch.org/tools/TesseractOCRToPAGE
> >
> > Most PAGE format tools aren't available as open source and use a custom
> > license specific to the lab that produces them and the primary thing that
> > PAGE adds over hOCR (ground truth text) doesn't sound like it's needed here.
>
> In what sense PAGE adds ground truth text over hOCR? In my opinion
> hOCR is as good as PAGE for ground truth texts.
>
> Personally I find the simple TSV format potentially quite useful. You
> can find a sample output here:
>
> http://teksty.klf.uw.edu.pl/12/
> http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv
>
> Regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> [email protected], [email protected],
> http://fleksem.klf.uw.edu.pl/~jsbien/

