On 4 December 2012 13:58, zdenko podobny <[email protected]> wrote:
>
> Where did you find "advertised features of tesseract is that it works
> equally well for black-on-white and white-on-black text"? I never heard
> about it.

It used to be mentioned fairly prominently, in the README in the wiki, I think. It's still mentioned here:
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf
in section 2:

    Processing follows a traditional step-by-step pipeline, but some of
    the stages were unusual in their day, and possibly remain so even
    now. The first step is a connected component analysis in which
    outlines of the components are stored. This was a computationally
    expensive design decision at the time, but had a significant
    advantage: by inspection of the nesting of outlines, and the number
    of child and grandchild outlines, it is simple to detect inverse
    text and recognize it as easily as black-on-white text. Tesseract
    was probably the first OCR engine able to handle white-on-black
    text so trivially.

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
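
To make the "nesting of outlines" idea from that section concrete, here is a rough sketch in plain Python. It is not Tesseract's code: the function names, the '#'/'.' grid format, and the "any grandchild means inverse" rule are all my own assumptions, made up for illustration. It labels connected components, records which component encloses which, and flags a black component that has grandchildren as a likely inverse-text block (white glyphs on black whose holes are black again).

```python
from collections import deque

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def label_components(grid):
    """Return (colors, parents) for the components of a '#'/'.' grid.

    '#' is black (8-connected), '.' is white (4-connected).
    parents[i] is the component enclosing component i, or None for
    the outer background. Hypothetical helper, not a Tesseract API.
    """
    # Pad with white so the outer background is a single component.
    width = len(grid[0]) + 2
    rows = ["." * width] + ["." + r + "." for r in grid] + ["." * width]
    height = len(rows)
    labels = [[-1] * width for _ in range(height)]
    colors, parents = [], []
    for y in range(height):
        for x in range(width):
            if labels[y][x] != -1:
                continue
            color = rows[y][x]
            label = len(colors)
            colors.append(color)
            # The first pixel of a component in raster order has an
            # opposite-colour pixel on its left, which belongs to the
            # enclosing component.
            parents.append(labels[y][x - 1] if x > 0 else None)
            labels[y][x] = label
            queue = deque([(y, x)])
            steps = EIGHT if color == "#" else FOUR
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in steps:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and rows[ny][nx] == color):
                        labels[ny][nx] = label
                        queue.append((ny, nx))
    return colors, parents


def find_inverse_blocks(grid):
    """List (label, child_count, grandchild_count) for black components
    that have at least one grandchild (an assumed, simplistic rule)."""
    colors, parents = label_components(grid)
    children = {i: [] for i in range(len(colors))}
    for i, parent in enumerate(parents):
        if parent is not None:
            children[parent].append(i)
    result = []
    for i, color in enumerate(colors):
        if color != "#":
            continue
        grandchildren = sum(len(children[c]) for c in children[i])
        if grandchildren > 0:
            result.append((i, len(children[i]), grandchildren))
    return result


# A black "O" on white: one child (the hole), no grandchildren.
normal = ["#####",
          "#...#",
          "#...#",
          "#####"]

# A white "o" on a black block: the glyph is a child of the block,
# and the glyph's black hole is a grandchild.
inverse = ["#######",
           "#######",
           "##...##",
           "##.#.##",
           "##...##",
           "#######",
           "#######"]

print(find_inverse_blocks(normal))
print(find_inverse_blocks(inverse))
```

A real engine would work on traced outlines rather than pixel labels and would need thresholds that survive noise, so take this only as a picture of why counting children and grandchildren separates the two cases.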

