Quote - Tom Morris <[email protected]> (Tue 14 Jul 2015 08:35:19 PM CEST):
On Tuesday, July 14, 2015 at 2:47:40 AM UTC-4, James Owers wrote:
You should also consider using the PAGE format. You can use this tool for
conversion: http://www.primaresearch.org/tools/TesseractOCRToPAGE
Most PAGE format tools aren't available as open source and use a custom
license specific to the lab that produces them, and the primary thing that
PAGE adds over hOCR (ground truth text) doesn't sound like it's needed here.
In what sense does PAGE add ground truth text over hOCR? In my opinion
hOCR is as good as PAGE for ground truth texts.
Personally I find the simple TSV format potentially quite useful. You
can find a sample output here:
http://teksty.klf.uw.edu.pl/12/
http://teksty.klf.uw.edu.pl/12/1/alice_1.png.hocr.tsv
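
To illustrate the idea, here is a minimal sketch of how word-level data could be pulled out of hOCR and written as TSV. It relies only on the `ocrx_word` class and the `bbox` property in the `title` attribute, both of which appear in Tesseract's hOCR output. The column layout (x0, y0, x1, y1, text) and the names `HocrWordParser` and `hocr_to_tsv` are assumptions made for this example; they are not necessarily what the sample file linked above uses.

```python
import csv
import io
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 10 20 60 40; x_wconf 93"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collect (x0, y0, x1, y1, text) per ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently open, if any
        self._text = []
        self._depth = 0    # nesting level of tags inside the open word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # inline markup inside a word, e.g. <strong>
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word element itself: store the finished row.
        self.words.append(self._bbox + ("".join(self._text).strip(),))
        self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def hocr_to_tsv(hocr: str) -> str:
    """Return one TSV row per word; the column order is an assumed layout."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["x0", "y0", "x1", "y1", "text"])
    writer.writerows(parser.words)
    return out.getvalue()


# Made-up hOCR fragment, for illustration only.
SAMPLE = """<span class='ocr_line' title='bbox 10 20 200 40'>
<span class='ocrx_word' title='bbox 10 20 60 40; x_wconf 93'>Alice</span>
<span class='ocrx_word' title='bbox 70 20 110 40; x_wconf 90'><strong>was</strong></span>
</span>"""

print(hocr_to_tsv(SAMPLE))
```

A flat table like this is easy to load into a spreadsheet or to diff against corrected text, which is why I find it handy for ground truth work.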
Regards
Janusz
--
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/