Jon,

You will certainly need to implement most of the steps that Cong Nguyen suggests. However, complications arise if you want the pre-processing to be fully automatic. You are going to process real photographic images, so fonts, backgrounds, lighting conditions, etc. vary a lot. That is why a "one size fits all" method (particularly for ROI detection and background removal) won't work. You will find that your fixed pipeline works fine on the first and second images but fails on the third.
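To make the alternative to a fixed pipeline concrete, here is a minimal sketch of one stage that tries several candidate methods on each image and keeps whichever scores best. Everything in it is an assumption for illustration: the image is a plain list of grayscale rows, the two thresholding functions stand in for real binarization algorithms, and `ink_ratio_score` with its `target` of 0.2 is a made-up metric (in practice the mean OCR confidence may be a better one).

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # grayscale rows, pixel values 0..255


def threshold_fixed(img: Image, t: int = 128) -> Image:
    """Binarize with a fixed cut-off: 1 = ink (dark), 0 = background."""
    return [[1 if p < t else 0 for p in row] for row in img]


def threshold_mean(img: Image) -> Image:
    """Binarize using the image's own mean brightness as the cut-off."""
    pixels = [p for row in img for p in row]
    return threshold_fixed(img, t=sum(pixels) // len(pixels))


def ink_ratio_score(binary: Image, target: float = 0.2) -> float:
    """Hypothetical metric: the closer to the expected ink fraction, the better."""
    pixels = [p for row in binary for p in row]
    return -abs(sum(pixels) / len(pixels) - target)


def choose_best(
    img: Image,
    candidates: Dict[str, Callable[[Image], Image]],
    score: Callable[[Image], float],
) -> Tuple[str, Image]:
    """Run every candidate on this image and return the best-scoring one."""
    results = {name: fn(img) for name, fn in candidates.items()}
    best = max(results, key=lambda name: score(results[name]))
    return best, results[best]
```

On a badly lit image such as `[[40, 40, 40, 40, 10], [40, 40, 40, 40, 40]]` the fixed threshold marks every pixel as ink, so `choose_best` picks the mean-based method instead. The same pattern applies to any stage: only the candidates and the metric change.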
There are two possible ways to solve this. If you still want to do it automatically, you'll need to choose several algorithms for every pipeline stage and implement logic that decides for each image, based on some metric, which algorithm would work (or has worked) best. Or you can give up on the automatic approach and switch to manually selecting a pre-processing scenario for each image, based on your experience.

The next complication is getting results from Tesseract. Since the quality of text in photographic images is really low, you usually can't rely on Tesseract's top-choice recognition results representing the actual text. IMHO the best approach here is to get all of Tesseract's choices for every character and then resolve the uncertainty using a language model (bigram and trigram statistics). This is the best you can do, because a dictionary won't help you much, at least for last names.

And then you'll have to locate the names within the recognition results. The first problem here is that there can be several per headstone. The second is that Tesseract will try to recognize as text everything it sees in the image, including noise left over from pre-processing. So this task can also pose some difficulties. But this seems to be mainly a question of engineering, not of research...

To conclude, it all depends on how serious you are about investing your time and effort into your project ))

HTH

Warm regards,
Dmitry Silaev

On Mon, Feb 21, 2011 at 6:45 PM, Jon Andersen <[email protected]> wrote:
> Whoops, sorry - links were broken for a bit. I just fixed the image links,
> they should work now.
> Thanks!!
> -Jon
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
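The character-level disambiguation described in the reply above (keep every candidate per character, then let bigram statistics pick the most plausible sequence) can be sketched as a small Viterbi search. This is only an illustration under stated assumptions: the `Candidates` structure is a hypothetical stand-in for whatever per-character alternatives and confidences you extract from Tesseract, the bigram log-probabilities are toy values rather than real statistics, and `unseen_logp` and `lm_weight` are made-up tuning knobs.

```python
import math
from typing import Dict, List, Tuple

# Per character position: list of (character, OCR confidence in (0, 1]).
# Assumed: each character appears at most once per position.
Candidates = List[List[Tuple[str, float]]]


def top_choice(candidates: Candidates) -> str:
    """Baseline: take the most confident character at every position."""
    return "".join(max(pos, key=lambda c: c[1])[0] for pos in candidates)


def best_string(
    candidates: Candidates,
    bigram_logp: Dict[Tuple[str, str], float],
    unseen_logp: float = -10.0,  # penalty for bigrams missing from the table
    lm_weight: float = 1.0,      # how much to trust the language model
) -> str:
    """Viterbi search: maximize OCR log-confidence plus bigram log-probability."""
    if not candidates:
        return ""
    # best[c] = (score, text) of the best path so far that ends in character c
    best = {c: (math.log(conf), c) for c, conf in candidates[0]}
    for position in candidates[1:]:
        nxt = {}
        for c, conf in position:
            score, text = max(
                (s + lm_weight * bigram_logp.get((p, c), unseen_logp), t)
                for p, (s, t) in best.items()
            )
            nxt[c] = (score + math.log(conf), text + c)
        best = nxt
    return max(best.values())[1]


# Toy example: the engraving says SMITH, but OCR prefers '5' and '1'.
cands = [
    [("5", 0.6), ("S", 0.4)],
    [("M", 0.9)],
    [("1", 0.55), ("I", 0.45)],
    [("T", 0.9)],
    [("H", 0.9)],
]
bigrams = {("S", "M"): -2.0, ("M", "I"): -1.5, ("I", "T"): -1.5, ("T", "H"): -1.0}

print(top_choice(cands))            # 5M1TH
print(best_string(cands, bigrams))  # SMITH
```

A trigram model would follow the same scheme with the search state extended to the last two characters. Real bigram tables would have to be estimated from a suitable corpus, for example a list of surnames.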

