Thanks, everyone, for the helpful pointers! If I understand correctly, 
these formats (hOCR, PAGE, TSV) are all different ways of describing the 
position of the recognized words on the page? That definitely seems like 
it would help me produce structured data, because I can assign each word 
to an attribute of a per-page JSON object based on its vertical and 
horizontal position.
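
To make the idea concrete, here is a minimal sketch of that kind of 
position-based classification. It assumes the word boxes have already been 
parsed out of the OCR output into dicts with pixel coordinates (left, top, 
width, height). The zone names and coordinates in ZONES, the SAMPLE_WORDS 
data, and the classify_words helper are all made up for illustration; none 
of them come from Tesseract itself.

```python
import json

# Hypothetical page zones: attribute name -> (left, top, right, bottom) in pixels.
ZONES = {
    "title": (0, 0, 1000, 150),
    "body": (0, 150, 1000, 900),
}

# Hypothetical word boxes, as they might look after parsing OCR output.
SAMPLE_WORDS = [
    {"text": "Wonderland", "left": 300, "top": 40, "width": 200, "height": 30},
    {"text": "Alice", "left": 100, "top": 40, "width": 90, "height": 30},
    {"text": "in", "left": 200, "top": 40, "width": 40, "height": 30},
    {"text": "the", "left": 200, "top": 200, "width": 50, "height": 30},
    {"text": "Down", "left": 100, "top": 200, "width": 80, "height": 30},
    {"text": "rabbit-hole", "left": 260, "top": 200, "width": 180, "height": 30},
]


def classify_words(words, zones):
    """Assign each word to the first zone that contains the centre of its box."""
    grouped = {name: [] for name in zones}
    for word in words:
        cx = word["left"] + word["width"] / 2
        cy = word["top"] + word["height"] / 2
        for name, (x0, y0, x1, y1) in zones.items():
            if x0 <= cx < x1 and y0 <= cy < y1:
                grouped[name].append(word)
                break  # words outside every zone are silently dropped

    # Crude reading order: sort by top edge, then left edge. Real OCR output
    # has slightly uneven baselines, so lines would need to be grouped with
    # a tolerance rather than by exact "top" values.
    return {
        name: " ".join(
            w["text"] for w in sorted(ws, key=lambda w: (w["top"], w["left"]))
        )
        for name, ws in grouped.items()
    }


if __name__ == "__main__":
    print(json.dumps(classify_words(SAMPLE_WORDS, ZONES), indent=2))
```

Using the centre of each box rather than its corners means a word that 
slightly overlaps a zone boundary still lands in exactly one attribute.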

Since I am so new to Tesseract and to OCR in general, I am afraid I may be 
missing important points or asking stupid questions. Unless you all 
suggest otherwise, I will spend quite a bit of time with the Tesseract 
source code on GitHub.

On Tuesday, July 14, 2015 at 12:21:42 PM UTC-7, jsbien wrote:
>
> Quote - Tom Morris <[email protected]> (Tue 14 Jul 2015 
> 08:35:19 PM CEST): 
>
> > On Tuesday, July 14, 2015 at 2:47:40 AM UTC-4, James Owers wrote: 
> >> 
> >> You should also consider using the PAGE format. You can use this 
> >> tool for conversion: 
> >> http://www.primaresearch.org/tools/TesseractOCRToPAGE 
> >> 
> > 
> > Most PAGE format tools aren't available as open source and use a 
> > custom license specific to the lab that produces them, and the 
> > primary thing that PAGE adds over hOCR (ground truth text) doesn't 
> > sound like it's needed here. 
>
> In what sense does PAGE add ground truth text over hOCR? In my opinion 
> hOCR is as good as PAGE for ground truth texts. 
>
> Personally I find the simple TSV format potentially quite useful. You   
> can find a sample output here: 
>
> http://teksty.klf.uw.edu.pl/12/ 
> http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv 
>
> Regards 
>
> Janusz 
>
> -- 
> Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra   
> Lingwistyki Formalnej) 
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics 
> Department) 
> [email protected], [email protected], 
> http://fleksem.klf.uw.edu.pl/~jsbien/ 
>
