Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

Janusz S. Bien Mon, 09 Dec 2013 22:03:01 -0800

Quote - matthew christy <matt.chri...@gmail.com> (Mon 09 Dec 2013 11:05:25 PM CET):

I realized after talking to Bryan that someone would also have to develop
code to cut the images of the boxes from the page image TIFF based on the
boxes identified in the box file. However, since Tesseract and
jTessBoxEditor are based on rectangles instead of polygons, these glyph
images will end up with a lot of noise due to character overlap. So that
will also have to be edited out.
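
The cutting step described above can be sketched roughly as follows. This is only an illustration, not code from Franken+ or jTessBoxEditor. It assumes the six-field box file line "<glyph> <left> <bottom> <right> <top> <page>" with the origin at the bottom-left corner of the page, and it stands in a plain list of pixel rows for the page image. The names Box, parse_box_line, to_crop_rect and crop are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line (assumed six-field form, see lead-in)."""
    glyph, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(glyph, int(left), int(bottom), int(right), int(top), int(page))


def to_crop_rect(box: Box, page_height: int) -> Tuple[int, int, int, int]:
    """Convert bottom-left-origin box coordinates to a top-left-origin
    (left, upper, right, lower) rectangle, the order most image
    libraries expect for cropping."""
    return (box.left, page_height - box.top, box.right, page_height - box.bottom)


def crop(pixels: List[List[int]], rect: Tuple[int, int, int, int]) -> List[List[int]]:
    """Cut the rectangle out of an image stored as a list of pixel rows."""
    left, upper, right, lower = rect
    return [row[left:right] for row in pixels[upper:lower]]


if __name__ == "__main__":
    # Toy 6-row by 8-column "page"; each pixel value encodes its position.
    page = [[y * 10 + x for x in range(8)] for y in range(6)]
    box = parse_box_line("a 2 1 5 4 0")
    print(crop(page, to_crop_rect(box, page_height=len(page))))
```

Because the cut is rectangular, any part of a neighbouring glyph that falls inside the box comes along with it, which is the noise mentioned above.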

Where do the polygons come from? The hot print technology doesn't allow for overlapping characters; the "sort" body was always rectangular, cf. e.g.


http://en.wikipedia.org/wiki/Sort_%28typesetting%29

You probably mean characters belonging to ligatures. Ligatures, in my opinion, should be treated as single Unicode characters and assigned a Private Use Area code if not available in the standard.
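
A minimal sketch of that assignment, under stated assumptions: ligatures that already have a code point in the Alphabetic Presentation Forms block (U+FB00 to U+FB06) keep it, and anything else gets the next free code point from the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF). The function name and the allocation order are hypothetical, and the Private Use Area values mean nothing outside a private agreement between the font and the training data.

```python
from typing import Dict, Iterable

# Ligatures that already have a code point in the Unicode standard
# (Alphabetic Presentation Forms block).
STANDARD_LIGATURES = {
    "ff": "\uFB00",
    "fi": "\uFB01",
    "fl": "\uFB02",
    "ffi": "\uFB03",
    "ffl": "\uFB04",
    "st": "\uFB06",
}

# Private Use Area in the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF


def assign_ligature_codes(ligatures: Iterable[str]) -> Dict[str, str]:
    """Map each ligature (written as its component letters) to a single
    character: the standard one if it exists, else a fresh PUA one."""
    mapping: Dict[str, str] = {}
    next_pua = PUA_START
    for lig in ligatures:
        if lig in mapping:
            continue  # already assigned
        if lig in STANDARD_LIGATURES:
            mapping[lig] = STANDARD_LIGATURES[lig]
        else:
            if next_pua > PUA_END:
                raise ValueError("Private Use Area exhausted")
            mapping[lig] = chr(next_pua)
            next_pua += 1
    return mapping


if __name__ == "__main__":
    for lig, ch in assign_ligature_codes(["fi", "ct", "fi"]).items():
        print(f"{lig} -> U+{ord(ch):04X}")
```

With such a table, each ligature appears in the box file as one character with one box.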


Best regards

Janusz

--

Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

