Hey,

Any algorithm or white paper suggestions for text extraction, especially when the text is not overlay text but part of the image itself? Most of the algorithms I have seen online are compute-intensive.
--
Regards,
Saurabh Gandhi

On Sat, Mar 5, 2011 at 11:20 AM, Dmitry Silaev <[email protected]> wrote:
> Zdravko,
>
> You should do text detection before passing images to Tesseract.
> Text detection is the process of determining which image regions
> contain text. Even if an image contains no text, Tesseract will
> still treat it as an image of text.
>
> Before recognition, Tess applies a so-called binarization algorithm,
> which converts an RGB image to a monochrome one (black for text and
> white for background). For your sample image, the Otsu binarization
> used in Tesseract (http://en.wikipedia.org/wiki/Otsu%27s_method) would
> certainly give a number of skewed vertical lines resembling
> backslashes, and further recognition classifies them as such.
>
> "textord_heavy_nr" and some other variables control size-based noise
> removal, but they work satisfactorily only when there is a significant
> body of good text surrounded by some amount of noise. In your image
> everything is noise, so it won't work.
>
> Therefore you need to extend your pre-processing so that you feed Tess
> only images that actually contain text. Decisions can be made based on
> contrast estimation, distinctive color distribution, etc.
>
> HTH
>
> Warm regards,
> Dmitry Silaev
>
> On Fri, Mar 4, 2011 at 5:25 PM, zdravco <[email protected]> wrote:
> > Hello,
> >
> > I am using tesseract in my project after some image pre-processing.
> > There are some false negatives I was hoping tesseract would eliminate
> > by producing no output. However, sometimes I get strange output
> > from almost blank images.
> > Here is the sample image:
> > https://picasaweb.google.com/zdravco/TesseractTest#5580227257541654274
> >
> > When I run it with tesseract rev. 552 using the English language I get:
> > " \\\\ R \."
> >
> > Does anyone know if there are some options in tesseract that could
> > eliminate this noise? Or maybe I could improve my input image with
> > some further pre-processing. I have also tried to recompile tesseract
> > with "textord_heavy_nr" set to TRUE, but then the output is:
> > "an \\“ R \".
> >
> > Thanks,
> > Zdravko
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "tesseract-ocr" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en.
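
To make the contrast-estimation idea from Dmitry's reply concrete, here is a minimal sketch of a pre-OCR check built on Otsu's method. It is not Tesseract's own code. The function names (otsu, looks_like_text) and the cutoff values (min_contrast=60, min_separability=0.8) are assumptions chosen for illustration and would need tuning on real images. The input is assumed to be a flat list of 8-bit gray values, obtained with whatever image library you already use.

```python
from typing import List, Sequence, Tuple


def histogram(pixels: Sequence[int]) -> List[int]:
    """Count how many pixels fall on each of the 256 gray levels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    return hist


def otsu(pixels: Sequence[int]) -> Tuple[int, float]:
    """Return (threshold, separability) for 8-bit gray values.

    Pixels <= threshold form the dark class. Separability is the
    between-class variance divided by the total variance (0.0 to 1.0).
    """
    hist = histogram(pixels)
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    mean_all = sum_all / total
    var_all = sum(c * (lv - mean_all) ** 2 for lv, c in enumerate(hist)) / total
    if var_all == 0:
        return 0, 0.0  # perfectly flat image, nothing to separate

    best_t, best_between = 0, 0.0
    w0 = 0    # pixel count of the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, the split is meaningless
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t, best_between / var_all


def looks_like_text(pixels: Sequence[int],
                    min_contrast: float = 60.0,
                    min_separability: float = 0.8) -> bool:
    """Heuristic gate: only send the image to OCR if the two Otsu classes
    are far apart (contrast) and well separated. Cutoffs are assumptions."""
    t, sep = otsu(pixels)
    dark = [p for p in pixels if p <= t]
    light = [p for p in pixels if p > t]
    if not dark or not light:
        return False
    contrast = sum(light) / len(light) - sum(dark) / len(dark)
    return contrast >= min_contrast and sep >= min_separability


# Synthetic data: a nearly blank page with faint noise, and a page with
# dark strokes on a light background.
blank = [200 + (i % 11) for i in range(1000)]
text = [30] * 200 + [220] * 800
print(looks_like_text(blank), looks_like_text(text))
```

On a nearly blank image Otsu still finds a threshold, but the two classes it produces sit only a few gray levels apart, so the contrast test rejects the image before it ever reaches the recognizer. The cost is one pass over the pixels plus a 256-step loop, which is cheap next to full text detection. It only answers "is there any high-contrast content here", not where the text is.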

