Dear Andres,
The recognition results I showed were obtained with my simple .NET wrapper for the Tesseract 3.01 engine (link here: http://code.google.com/p/tesseractdotnet/). There was no automatic ROI detection: I cropped the ROI manually and then used my company's software to filter the image. As for filtering, you can analyze a control set to work out how to estimate suitable parameters.

Thanks,
Cong.

From: [email protected] [mailto:[email protected]] On Behalf Of Andres
Sent: Wednesday, February 23, 2011 4:02 AM
To: [email protected]
Subject: Re: Image pre-processing for good OCR results

Hello,

A few comments from my side. Sorry if they are disorganized, but I don't have much time right now.

In OpenCV you can threshold with the Otsu algorithm. It is not mentioned in the documentation of the threshold function, but the parameter is CV_THRESH_OTSU. Otsu's method chooses the threshold automatically from a histogram of the image computed beforehand: http://en.wikipedia.org/wiki/Otsu%27s_method

I tried it in my project (a licence plate recognition system). The results looked much better to the eye, but surprisingly Tesseract did worse on them. Otsu changed the stroke thickness of the letters: the letters I had trained Tesseract on were bolder than the ones the Otsu threshold produced, which may explain my problem. So it could still be a good solution for you.

If you want to run some quick preprocessing tests with OpenCV, you can use this: http://code.google.com/p/cvpreprocessor/
It's not a complete tool, but it helps.

I think your system is close to mine in certain respects. I was thinking of applying skeletonization, or something similar, to the fonts and then training Tesseract on the modified letters. The acquired images would then go through the same process before running Tesseract. I haven't tried that yet.
Skeletonization: http://homepages.inf.ed.ac.uk/rbf/HIPR2/skeleton.htm

As Tom Morris said, you have some constraints on the text layout.
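As an aside on the Otsu thresholding mentioned above, here is a minimal pure-Python sketch of what the method computes: it builds the gray-level histogram and picks the threshold that maximizes the between-class variance. The function names `otsu_threshold` and `binarize` and the synthetic pixel values are illustrative assumptions, not part of OpenCV. In OpenCV's Python bindings the same result should come from `cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)`.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t for 8-bit gray values (0..255).

    Pixels with value > t are treated as one class, the rest as the other.
    """
    # Histogram of gray levels: this is the "previous histogram" step.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of the gray values of those pixels
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue          # no pixels in the low class yet
        w_high = total - w_low
        if w_high == 0:
            break             # all pixels are in the low class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance (up to a constant factor of 1/total**2).
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map every pixel to 0 or 255 depending on the threshold t."""
    return [255 if p > t else 0 for p in pixels]


# Synthetic example (made-up values): dark background, bright lettering.
pixels = [10] * 50 + [20] * 50 + [200] * 50 + [210] * 50
t = otsu_threshold(pixels)
print(t, binarize(pixels, t).count(255))
```

Because the threshold depends only on the histogram, it says nothing about stroke thickness, which is consistent with the mismatch between trained and thresholded letters described above.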
Tesseract gives you the coordinates of each character, and you can work with that. You may need a grouping algorithm such as k-means to compute some statistics: http://en.wikipedia.org/wiki/Kmeans
OpenCV has an implementation of k-means; ask me for a snippet if you need one.

Question to Cong Nguyen: is the program you used here available on the web, or is it something you have for your own projects?
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366764338605922

Cheers,
Andres
www.visiondepatentes.com.ar

2011/2/22 Tom Morris <[email protected]>

On Feb 20, 9:02 pm, Jon Andersen <[email protected]> wrote:
> My project at http://RecordAGrave.com is about recording headstones from
> graves and posting the text and images on the Net so that people can
> research their family history. I would appreciate some advice on how to
> pre-process these headstone images to get the best results from Tesseract
> OCR. I have thousands of 1-2 MB jpg images of headstones to process.

Post-image capture is too late for one of the most important enhancements, namely high contrast lighting. It's not really an issue with stones that have the carving painted or are otherwise naturally high contrast, but for many stones sharp oblique lighting is important to get an image that's readable by humans, let alone OCR software.

Once you've got the best quality image capture you can manage, you'll probably find that you need to use different image processing pipelines for different types of stones and carving, so the first step will be to categorize the stone and figure out which pipeline to run it through (or run it through them all and compare the results).

In addition to image processing, you may also be able to improve results by making use of the fact that the vocabulary and layout of the text are much more constrained than free text.

It'll be interesting to see what kind of results you get.
I suspect it's going to be a fairly challenging project for the general case, but you may be able to pick off the low-hanging fruit and gradually expand the types of stones you can handle.

Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
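As a note on Andres's suggestion of grouping Tesseract's character coordinates with k-means, the idea can be sketched in plain Python as follows. This is only an illustration under several assumptions: the `(character, x, y)` tuples are made-up values rather than real Tesseract output, the number of text lines `k` is assumed to be known in advance, and `kmeans_1d` and `group_into_lines` are hypothetical helper names. OpenCV's own `cv2.kmeans` could probably be used in place of the hand-written loop.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's k-means on scalar values. Returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centers evenly over the range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members)
                               if members else centers[c])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


def group_into_lines(chars, k):
    """Cluster characters by y coordinate and return one string per line."""
    centers, labels = kmeans_1d([y for _, _, y in chars], k)
    lines = {}
    for (ch, x, _), label in zip(chars, labels):
        lines.setdefault(label, []).append((x, ch))
    # Lines ordered top to bottom, characters ordered left to right.
    ordered = sorted(lines, key=lambda label: centers[label])
    return ["".join(ch for _, ch in sorted(lines[label]))
            for label in ordered]


# Hypothetical character boxes (character, x, y) for a two-line inscription.
chars = [("J", 10, 12), ("O", 30, 11), ("H", 50, 13), ("N", 70, 12),
         ("1", 10, 48), ("9", 30, 50), ("0", 50, 49), ("0", 70, 49)]
print(group_into_lines(chars, 2))
```

Once the characters are grouped into lines, per-line statistics such as mean character height or spacing can be checked against the expected layout of the inscription.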

