Hey,

Any algorithm or whitepaper suggestions for text extraction, especially
when the text is not overlay text but part of the image itself? Most
algorithms I have seen on the internet are compute-intensive.

--
Regards,
Saurabh Gandhi




On Sat, Mar 5, 2011 at 11:20 AM, Dmitry Silaev <[email protected]> wrote:

> Zdravko,
>
> You should do text detection before passing images to Tesseract.
> Text detection is the process of determining which image regions
> contain text. Even if an image contains no text, Tesseract will still
> treat it as an image of text.
>
> Before recognition, Tess applies a so-called binarization algorithm,
> which converts an RGB image to a monochrome one (black for text and
> white for background). For your sample image, the Otsu binarization
> used in Tesseract (http://en.wikipedia.org/wiki/Otsu%27s_method) would
> certainly give a number of skewed vertical lines resembling
> backslashes, and the subsequent recognition classifies them as such.
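
A minimal sketch of Otsu's method in plain Python may help show why an almost blank image still produces "ink". It is not Tesseract's own code: the function names and the flat list of 8-bit grayscale values are assumptions made for illustration. The point is that Otsu always picks some threshold, so even faint noise on a nearly uniform background ends up as black pixels.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels <= t form the dark class, pixels > t the light class.
    `pixels` is a flat iterable of 8-bit grayscale values (an assumption
    of this sketch, not Tesseract's internal image format).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # pixel count of the dark class so far
    sum0 = 0  # intensity sum of the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty, nothing left to split
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        score = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(pixels, t):
    """Map pixels to 0 (black, treated as text) or 255 (white, background)."""
    return [0 if p <= t else 255 for p in pixels]


# A nearly blank image: bright background with a little slightly darker noise.
almost_blank = [250] * 95 + [240] * 5
t = otsu_threshold(almost_blank)
black = binarize(almost_blank, t).count(0)
print(t, black)  # the noise pixels come out black even though there is no text
```

The threshold lands between the noise and the background, so the noise is kept as foreground. This is consistent with the stray marks described above, although the real image and Tesseract's implementation will differ in detail.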
>
> "textord_heavy_nr" and some other variables control size-based noise
> removal, but they work satisfactorily only when there's a significant
> body of good text surrounded by some amount of noise. In your image
> everything is noise, so it won't work.
>
> Therefore you need to extend your pre-processing so that you feed Tess
> only images that actually contain text. Decisions can be made based on
> contrast estimation, distinctive color distribution, etc.
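
As one possible reading of "contrast estimation", here is a small Python sketch of a gate that skips OCR for low-contrast images. The names `robust_contrast` and `worth_ocr`, the percentile choice, and the `min_contrast=60` cutoff are all hypothetical and would need tuning on real data; the input is again assumed to be a flat list of 8-bit grayscale values.

```python
def robust_contrast(pixels, low=0.05, high=0.95):
    """Spread between a low and a high percentile of the intensities.

    Using percentiles instead of min/max keeps a few outlier pixels
    from making a blank image look high-contrast.
    """
    ordered = sorted(pixels)
    n = len(ordered)
    lo = ordered[int(low * (n - 1))]
    hi = ordered[int(high * (n - 1))]
    return hi - lo


def worth_ocr(pixels, min_contrast=60):
    """Return True if the image has enough contrast to plausibly hold text.

    min_contrast is an illustrative value, not a recommended setting.
    """
    return len(pixels) > 0 and robust_contrast(pixels) >= min_contrast


almost_blank = [250] * 95 + [240] * 5   # faint noise on a bright background
text_like = [30] * 20 + [240] * 80      # dark strokes on a bright background

print(worth_ocr(almost_blank))  # low contrast, skip OCR
print(worth_ocr(text_like))     # high contrast, pass to OCR
```

A check like this is cheap (one sort per image) and only filters out obviously empty inputs. It does not locate text regions, so it complements rather than replaces proper text detection.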
>
> HTH
>
> Warm regards,
> Dmitry Silaev
>
>
>
>
>
> On Fri, Mar 4, 2011 at 5:25 PM, zdravco <[email protected]> wrote:
> > Hello,
> >
> > I am using tesseract in my project after some image pre-processing.
> > There are some false negatives I was hoping tesseract would eliminate
> > by producing no output. However, sometimes there is a strange output
> > that I get from almost blank images.
> > Here is the sample image:
> > https://picasaweb.google.com/zdravco/TesseractTest#5580227257541654274
> >
> > When I run it with tesseract rev. 552 using the English language I get:
> > " \\\\ R \."
> >
> > Does anyone know if there are options in tesseract that could
> > eliminate this noise? Or could I improve my input image with some
> > further pre-processing? I have also tried recompiling tesseract
> > with "textord_heavy_nr" set to TRUE, but then the output is:
> > "an \\“ R \".
> >
> > Thanks,
> > Zdravko
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> [email protected].
> > For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
> >
> >
>
