jsbien,

I've attached an example from one of our documents. Consider the capital 
'T', which overhangs the 'u', and the 'k', which underlies the 'e'. We've 
also found instances where, in certain fonts, almost all of the italic 
characters overlap. These are not ligatures.
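
For reference, here is a minimal sketch of the box-cutting step Matthew 
describes below, and of how the overlap shows up in the box data. It assumes 
the usual Tesseract box file layout (symbol, left, bottom, right, top, page, 
with the origin at the bottom-left of the page). The function names, the 
sample coordinates, and the page height are made up for illustration; they 
are not taken from our documents.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the symbol field is kept whole.
    parts = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = map(int, parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def crop_rect(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Convert a box (bottom-left origin) to (left, upper, right, lower)
    with a top-left origin, the form most image libraries expect."""
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)


def overlapping_pairs(boxes: List[Box]) -> List[Tuple[str, str]]:
    """Return symbol pairs whose rectangles intersect on the same page."""
    pairs = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if (a.page == b.page
                    and a.left < b.right and b.left < a.right
                    and a.bottom < b.top and b.bottom < a.top):
                pairs.append((a.char, b.char))
    return pairs


# Hypothetical box lines: the 'T' box reaches over the start of the 'u' box.
SAMPLE_LINES = ["T 10 20 40 60 0", "u 35 20 55 45 0", "n 60 20 80 45 0"]
SAMPLE_BOXES = [parse_box_line(line) for line in SAMPLE_LINES]
PAGE_HEIGHT = 100  # assumed image height in pixels
```

Each tuple from crop_rect could be passed to something like Pillow's 
Image.crop to cut the glyph image. Any pair reported by overlapping_pairs 
is a place where a rectangular crop will pick up part of the neighbouring 
character.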

Thanks,
Bryan

On Tuesday, December 10, 2013 12:02:37 AM UTC-6, jsbien wrote:
>
> Quote - matthew christy <matt.c...@gmail.com> (Mon 09 Dec 2013 
> 11:05:25 PM CET): 
>
> > I realized after talking to Bryan that someone would also have to 
> > develop code to cut the images of the boxes from the page image TIFF, 
> > based on the boxes identified in the box file. However, since Tesseract 
> > and jTessBoxEditor are based on rectangular boxes instead of polygons, 
> > these glyph images will end up with a lot of noise due to character 
> > overlap. So that will also have to be edited out. 
>
> Where do the polygons come from? Hot metal typesetting doesn't allow 
> for overlapping characters; the "sort" body was always rectangular, 
> cf. e.g. 
>
> http://en.wikipedia.org/wiki/Sort_%28typesetting%29 
>
> You probably mean characters belonging to ligatures. Ligatures, in my 
> opinion, should be treated as single Unicode characters and assigned 
> Private Use Area code points if not available in the standard. 
>
> Best regards 
>
> Janusz 
>
> -- 
> Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra   
> Lingwistyki Formalnej) 
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics 
> Department) 
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, 
> http://fleksem.klf.uw.edu.pl/~jsbien/ 
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

<<attachment: example.png>>