Thank you for the idea, Cong. Before I start on preprocessing, I would like to know whether there is a way to improve the page layout analysis module so that it can distinguish letters and digits from everything else (noise of any kind). Failing that, is there a way to make Tesseract process every blob/contour separately?
On Feb 22, 4:35 am, Cong Nguyen <[email protected]> wrote:
> Dear Zvezdoslav Kunov,
>
> I have some ideas for preprocessing:
>
> 1. Apply thresholding image, analyze two simple method:
>    - static threshold: keep pixels have lower intensity
>    - adaptive threshold
>
> 2. Do connected component
>    - filter objects/clusters based on boundary
>
> 3. Based-on median of objects/clusters boundary, calculate scale
>    factor (depend on trained character size) and apply scaling image
>
> After that, I think we should get good results.
>
> Cong.
>
> P/S: here are illustrations about the approach:
> extracted ROI (I cropped manually :)):
> https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#557633504...
> scaled image:
> https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#557633504...
> tesseract ocr recognition result for scaled image:
> https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#557633505...
> You can find simple application at: http://code.google.com/p/tesseractdotnet/
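For reference, the three steps quoted above could be sketched roughly as follows. This is only an illustration in plain Python on a tiny hand-made image, not code from Tesseract or from tesseractdotnet. The function names, the cutoff of 128, the size limits and the trained character height of 30 pixels are all assumptions made for the example, and only the static threshold variant is shown.

```python
from collections import deque
from statistics import median


def threshold(gray, cutoff=128):
    """Static threshold: pixels darker than cutoff become foreground (1)."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def connected_components(binary):
    """Bounding boxes (top, left, bottom, right) of 4-connected blobs."""
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def filter_boxes(boxes, min_size=2, max_size=100):
    """Drop blobs whose box is too small (specks) or too large (frames)."""
    kept = []
    for top, left, bottom, right in boxes:
        height, width = bottom - top + 1, right - left + 1
        if min_size <= height <= max_size and min_size <= width <= max_size:
            kept.append((top, left, bottom, right))
    return kept


def scale_factor(boxes, trained_height=30):
    """Factor that maps the median blob height onto the assumed trained height."""
    heights = [bottom - top + 1 for top, _, bottom, _ in boxes]
    return trained_height / median(heights)


# Toy 6x7 grayscale image: two 3x2 "characters" and one single-pixel speck.
W, B = 255, 0
gray = [
    [W, W, W, W, W, W, W],
    [W, B, B, W, B, B, W],
    [W, B, B, W, B, B, W],
    [W, B, B, W, B, B, W],
    [W, W, W, W, W, W, W],
    [B, W, W, W, W, W, W],
]

boxes = connected_components(threshold(gray))
kept = filter_boxes(boxes)   # the single-pixel speck is dropped
factor = scale_factor(kept)  # median height 3 -> 30 / 3
print(kept, factor)
```

Each box in `kept` is also a candidate region that could be cropped and handed to the recognizer on its own, which is the per-blob processing asked about above. With real scans the size limits would need tuning, since dots on "i" and punctuation are small enough to be mistaken for noise.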

