Quote/Cytat - Tom Morris <[email protected]> (Tue 14 Jul 2015 08:35:19 PM CEST):

On Tuesday, July 14, 2015 at 2:47:40 AM UTC-4, James Owers wrote:

You should consider also using the PAGE format. You can use this tool for
conversion: http://www.primaresearch.org/tools/TesseractOCRToPAGE


Most PAGE format tools aren't available as open source and use a custom
license specific to the lab that produces them and the primary thing that
PAGE adds over hOCR (ground truth text) doesn't sound like it's needed here.

In what sense does PAGE add ground truth text over hOCR? In my opinion, hOCR is as good as PAGE for ground truth texts.

Personally I find the simple TSV format potentially quite useful. You can find a sample output here:

http://teksty.klf.uw.edu.pl/12/
http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv
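
To give a rough idea of how little is needed to get from hOCR to a flat TSV, here is a minimal Python sketch. It is only an illustration: the column layout (x0, y0, x1, y1, word) is an assumption and not necessarily the layout of the sample file above, and the names HocrWords and hocr_to_tsv are hypothetical. It relies only on the standard hOCR convention that words are span elements of class ocrx_word with a "bbox x0 y0 x1 y1" entry in the title attribute.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 90"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collects (x0, y0, x1, y1, word) tuples from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._bbox = None   # bounding box of the word currently being read
        self._text = []     # text fragments of that word

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = (a.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            word = "".join(self._text).strip()
            if word:
                self.rows.append((*self._bbox, word))
            self._bbox = None


def hocr_to_tsv(hocr):
    """Return one tab-separated line per word: x0, y0, x1, y1, word (assumed layout)."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return "\n".join("\t".join(str(f) for f in row) for row in parser.rows)


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 36 92 200 116'>"
        "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 90'>Alice</span> "
        "<span class='ocrx_word' title='bbox 100 92 150 116'>was</span>"
        "</span>"
    )
    print(hocr_to_tsv(sample))
```

Each output line can then be processed with ordinary text tools (cut, sort, grep, a spreadsheet), which is the main attraction of such a format. The sketch does not handle word spans nested inside other word spans.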

Regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/
