Just check out Ray's October 2007 paper "An Overview of the Tesseract OCR Engine" where it says:
"The first step is a connected component analysis in which outlines of the components are stored. This was a computationally expensive design decision at the time, but had a significant advantage: by inspection of the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text. Tesseract was probably the first OCR engine able to handle white-on-black text so trivially."

And in fact, in our own application we pass the binarized image to tesseract as a white-on-black image after preprocessing, and we have never had problems with that. Of course, our training images are also white-on-black, so this might also affect our findings.

Marcus

On Tuesday, December 4, 2012 2:58:26 PM UTC+1, zdenop wrote:
>
> Where did you find "advertised features of tesseract is that it works
> equally well for black-on-white and white-on-black text"? I never heard
> about it.
> See forum for other experience:
> https://groups.google.com/d/topic/tesseract-ocr/XoX6t5Ih1IM/discussion
>
> --
> Zdenko
>
> On Tue, Dec 4, 2012 at 2:42 PM, Speedy <[email protected]> wrote:
>
>> Why is a black background a problem? One of the advertised features of
>> tesseract is that it works equally well for black-on-white and
>> white-on-black text.
>>
>> Marcus
>>
>>
>> On Tuesday, December 4, 2012 11:11:36 AM UTC+1, zdenop wrote:
>>
>>> Search forum. I remember discussion about similar topic.
>>> AFAIR: tesseract has problem with letter(symbol) that consists of
>>> several not connected parts (e.g. dots, lines) - solution should be to
>>> preprocess image (blur).
>>>
>>> Generally: black background is problem. Quality of image is too low
>>> (JPEG, quality: 75), there is no information about DPI... Anyway this "LED"
>>> font is not standard font, so maybe training will be need.
>>>
>>> --
>>> Zdenko
>>>
>>> On Tue, Dec 4, 2012 at 12:43 AM, mike oldfield <[email protected]> wrote:
>>>
>>>>
>>>> <https://lh5.googleusercontent.com/-Ly6oR_Rmkag/UL04-iH5XaI/AAAAAAAAAAU/J-T592D8834/s1600/1.jpg>
>>>> Hello
>>>>
>>>> I`d like to recognize LED-like numbers/digits.
>>>> I attached image (jpg, 680x320, brightness 65%, contrast 100%).
>>>> Is there any libraries or presets to decode these digits? For example
>>>> googledocuments conversion and free-ocr.com doesn`t work.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected]
>>>>
>>>> To unsubscribe from this group, send email to
>>>> tesseract-oc...@googlegroups.com
>>>>
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
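
P.S. If someone wants to rule out polarity as the cause for a particular image, a quick check is to normalize the binarized image to dark-on-light before OCR and compare the results. Below is a minimal sketch in plain Python. It is not what Tesseract does internally (the paper describes outline nesting, not this), and the function name `normalize_polarity`, the 0/255 pixel convention and the "border is mostly background" heuristic are all assumptions made for illustration.

```python
def normalize_polarity(image):
    """Return a copy of a binarized image with dark text on a light background.

    `image` is a list of rows; each pixel is 0 (black) or 255 (white).
    Assumption (hypothetical heuristic): the image border is mostly
    background, so a mostly dark border means white-on-black text.
    """
    if not image or not image[0]:
        return []

    # Collect the border pixels: top row, bottom row, left and right columns.
    border = list(image[0]) + list(image[-1])
    for row in image[1:-1]:
        border.append(row[0])
        border.append(row[-1])

    dark = sum(1 for pixel in border if pixel < 128)

    if dark * 2 > len(border):
        # Mostly dark border: treat as white-on-black and invert.
        return [[255 - pixel for pixel in row] for row in image]

    # Already dark-on-light: return an unchanged copy.
    return [list(row) for row in image]


# Tiny stand-in for an LED-style image: one bright segment on black.
led = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
normalized = normalize_polarity(led)
```

The normalized image can then be saved and fed to tesseract as usual. If recognition is the same for both polarities, the black background is probably not the culprit and the image quality or the non-standard "LED" font is the more likely suspect.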

