jsbien, I've attached an example from one of our documents. Consider the capital 'T', which overhangs the 'u', and the 'k', which underlies the 'e'. We've also found cases where, in certain fonts, almost all of the italic characters overlap. These are not ligatures.
Thanks,
Bryan

On Tuesday, December 10, 2013 12:02:37 AM UTC-6, jsbien wrote:
>
> Quote/Cytat - matthew christy <matt.c...@gmail.com> (Mon 09 Dec 2013 11:05:25 PM CET):
>
> > I realized after talking to Bryan that someone would also have to develop
> > code to cut the images of the boxes from the page image tiff based on the
> > boxes identified in the box file. However, since Tesseract and the
> > jTessBoxEditor are based on squares instead of polygons these glyph images
> > will end up with a lot of noise due to character overlap. So that will also
> > have to be edited out.
>
> Where do the polygons come from? The hot print technology doesn't allow
> for overlapping characters, the "sort" body was always rectangular,
> cf. e.g.
>
> http://en.wikipedia.org/wiki/Sort_%28typesetting%29
>
> You probably mean characters belonging to ligatures. Ligatures in my
> opinion should be treated as single Unicode characters and assigned
> a Private Use Area code if not available in the standard.
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
<<attachment: example.png>>
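
For anyone who wants to check how often this happens in their own box files, here is a rough sketch in Python. It is not code from our project, and all of the function names and the sample coordinates are made up for illustration. It assumes the usual Tesseract box-file line layout (`glyph left bottom right top page`, with the origin at the bottom-left of the page image), converts a box to a top-left-origin crop rectangle of the kind most image libraries expect, and lists pairs of boxes on the same page whose rectangles intersect. Those pairs are the glyph images that would pick up noise from a neighbour when cut out as plain rectangles.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line, assumed to be 'glyph left bottom right top page'."""
    # Split from the right so the glyph field is kept whole.
    glyph, left, bottom, right, top, page = line.strip().rsplit(maxsplit=5)
    return Box(glyph, int(left), int(bottom), int(right), int(top), int(page))


def to_crop_rect(box: Box, page_height: int) -> Tuple[int, int, int, int]:
    """Convert bottom-left-origin box coordinates to (left, upper, right, lower)
    with a top-left origin, the form an image-cropping call typically expects."""
    return (box.left, page_height - box.top, box.right, page_height - box.bottom)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles are on the same page and share any area."""
    return (
        a.page == b.page
        and a.left < b.right
        and b.left < a.right
        and a.bottom < b.top
        and b.bottom < a.top
    )


def overlapping_pairs(boxes: List[Box]) -> List[Tuple[str, str]]:
    """Return the glyph pairs whose bounding boxes intersect."""
    pairs = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if overlaps(a, b):
                pairs.append((a.glyph, b.glyph))
    return pairs


# Hypothetical sample lines, not taken from a real document.
sample = [
    "T 10 20 40 60 0",
    "u 35 20 55 45 0",
    "k 60 20 80 60 0",
]
boxes = [parse_box_line(line) for line in sample]
```

With these invented values, the 'T' box extends over the left edge of the 'u' box, so a rectangular crop of either glyph would include part of the other. The pairwise loop is quadratic, which should be fine for a single page but would need a smarter approach for a whole book.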