Re: Numbers & Noise

Zvezdoslav Kunov Fri, 25 Feb 2011 05:03:46 -0800

Thank you for your idea, Cong.

Before I start on preprocessing, I would like to know whether there is
some way to improve the page layout analysis module so that it can
distinguish letters/digits from other stuff (noise of any kind), or at
least a way to make Tesseract process every blob/contour separately.
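
To make "process every blob/contour separately" concrete, here is a
minimal sketch in plain Python, not anything taken from Tesseract itself.
It assumes the image is already binarized into a list of rows of 0/1
values; find_blobs and crop are hypothetical helper names, and 8-connectivity
is my own choice. Handing each crop to Tesseract is left out.

```python
from collections import deque


def find_blobs(binary):
    """Return inclusive bounding boxes (top, left, bottom, right) of the
    8-connected foreground blobs in a 0/1 image given as a list of rows."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def crop(binary, box):
    """Cut the bounding box out of the image as its own small image.
    Pixels of other blobs that fall inside the box are kept as they are."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in binary[top:bottom + 1]]
```

Each crop could then be recognized on its own, and blobs that are
obviously too small or too large to be a character could be skipped
before recognition.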


On Feb 22, 4:35 am, Cong Nguyen <[email protected]> wrote:
> Dear Zvezdoslav Kunov,
>
> I have some ideas for preprocessing:
>
> 1. Threshold the image; two simple methods to evaluate:
>     - static threshold: keep pixels that have lower intensity
>     - adaptive threshold
>
> 2. Run connected-component analysis
>     - filter objects/clusters based on their boundary
>
> 3. Based on the median of the object/cluster boundaries, calculate a
> scale factor (depending on the trained character size) and scale the image
>
> After that, I think we should get good results.
>
> Cong.
>
> P/S: here are illustrations about the approach:
> extracted ROI (I cropped manually :)): https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#557633504...
> scaled image: https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#557633504...
> tesseract ocr recognition result for scaled image: https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#557633505...
> You can find simple application at: http://code.google.com/p/tesseractdotnet/
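
Steps 1 and 3 of the quoted approach, plus the filtering part of step 2,
might look roughly like the sketch below. This is my reading of the
steps, not Cong's actual code. It assumes a grayscale image as a list of
rows of 0..255 values with dark text, and bounding boxes given as inclusive
(top, left, bottom, right) tuples that come from a connected-component
pass (not shown). The function names, the cutoff of 128, the size
limits, and trained_height are all placeholder assumptions.

```python
from statistics import median


def static_threshold(gray, cutoff=128):
    """Step 1 (static variant): pixels darker than cutoff become foreground."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def filter_boxes(boxes, min_height=5, max_height=100, max_width=100):
    """Step 2 (filtering only): drop boxes too small or too large to be a
    character. Boxes are inclusive (top, left, bottom, right) tuples."""
    kept = []
    for top, left, bottom, right in boxes:
        height = bottom - top + 1
        width = right - left + 1
        if min_height <= height <= max_height and width <= max_width:
            kept.append((top, left, bottom, right))
    return kept


def scale_factor(boxes, trained_height):
    """Step 3: ratio that brings the median box height to trained_height,
    the (assumed) character height the recognizer was trained on."""
    heights = [bottom - top + 1 for top, _left, bottom, _right in boxes]
    if not heights:
        raise ValueError("no boxes left after filtering")
    return trained_height / median(heights)
```

The image would then be resized by the returned factor before
recognition. Using the median rather than the mean keeps a few leftover
noise blobs from skewing the estimated character height.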

