Quote/Cytat - matthew christy <matt.chri...@gmail.com> (Mon 09 Dec 2013 11:05:25 PM CET):

I realized after talking to Bryan that someone would also have to develop
code to cut the images of the boxes from the page image TIFF, based on the
boxes identified in the box file. However, since Tesseract and
jTessBoxEditor work with rectangular boxes instead of polygons, these glyph
images will end up with a lot of noise due to character overlap. So that
will also have to be edited out.
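
For illustration only, the cropping step described in the quoted message might look roughly like the sketch below. It assumes the usual Tesseract box file line layout (glyph, left, bottom, right, top, page, with coordinates measured from the bottom-left corner of the page) and stands in a plain list of pixel rows for the TIFF. The names Box, parse_box_line, to_crop_rect and crop are made up for this example and are not part of Tesseract or jTessBoxEditor.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """One line of a box file (hypothetical helper type)."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so that the glyph field is whatever precedes
    # the five numeric fields.
    glyph, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(glyph, int(left), int(bottom), int(right), int(top), int(page))


def to_crop_rect(box: Box, page_height: int) -> Tuple[int, int, int, int]:
    # Box coordinates are assumed to be measured from the bottom-left corner,
    # while image rows are counted from the top, so the y axis is flipped.
    # The result is (left, upper, right, lower) with a top-left origin.
    return (box.left, page_height - box.top, box.right, page_height - box.bottom)


def crop(pixels: List[List[int]], rect: Tuple[int, int, int, int]) -> List[List[int]]:
    # Cut the rectangle out of an image stored as a list of pixel rows.
    left, upper, right, lower = rect
    return [row[left:right] for row in pixels[upper:lower]]


# Tiny 5x4 stand-in "page": each pixel value encodes its row and column.
page_pixels = [[r * 10 + c for c in range(5)] for r in range(4)]
box = parse_box_line("a 1 1 3 3 0")
glyph_image = crop(page_pixels, to_crop_rect(box, page_height=len(page_pixels)))
```

A crop like this is always rectangular, so any part of a neighbouring glyph that falls inside the box is copied along with it. That is the noise the quoted message refers to.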

Where do the polygons come from? The hot print technology doesn't allow for overlapping characters; the body of a "sort" was always rectangular, cf. e.g.

http://en.wikipedia.org/wiki/Sort_%28typesetting%29

You probably mean characters belonging to ligatures. Ligatures, in my opinion, should be treated as single Unicode characters and assigned a Private Use Area code point if they are not available in the standard.
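
As a rough sketch of that idea (not an existing tool), a lookup could prefer the ligature code points Unicode already defines in the Alphabetic Presentation Forms block (U+FB00 to U+FB06) and fall back to the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF) for everything else. The class name LigatureTable, the method code_for and the allocation order are assumptions made for this example; a real project would have to agree on fixed PUA assignments.

```python
# Ligatures that already have their own code points in Unicode.
STANDARD_LIGATURES = {
    "ff": "\uFB00",
    "fi": "\uFB01",
    "fl": "\uFB02",
    "ffi": "\uFB03",
    "ffl": "\uFB04",
    "\u017Ft": "\uFB05",  # long s + t
    "st": "\uFB06",
}

# Private Use Area of the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF


class LigatureTable:
    """Hypothetical helper: maps a ligature's components to one character."""

    def __init__(self, first_free: int = PUA_START) -> None:
        self._next = first_free
        self._assigned = {}

    def code_for(self, components: str) -> str:
        # Use the standard code point when Unicode defines one.
        if components in STANDARD_LIGATURES:
            return STANDARD_LIGATURES[components]
        # Otherwise hand out the next free PUA code point, once per ligature.
        if components not in self._assigned:
            if self._next > PUA_END:
                raise ValueError("Private Use Area exhausted")
            self._assigned[components] = chr(self._next)
            self._next += 1
        return self._assigned[components]


table = LigatureTable()
fi = table.code_for("fi")  # defined in the standard
ct = table.code_for("ct")  # not in the standard, gets a PUA code point
```

With such a table, each ligature box in the training data would carry exactly one character.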

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
