I work for a real estate data modeling company, and we produce several thousand images daily that we need OCR'd. Right now I'm using a C++ wrapper to run the ocropus binary and capture its output, and then I search through those results for strings like bbox x1 y1 x2 y2 to get my data. The problem is that if the scanned image is off by a few pixels, my program won't return the correct values. What I need is to give ocropus a specific bounding box and have it return the OCR'd content (if any) of that box. Is this possible? Ideally, I would give ocropus 20-30 bounding boxes per image and get back just the data I need. At the moment, I'm allowing a ten-pixel tolerance on the areas returned in the hOCR text, which works, but not as well as I would like.
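For reference, here is a minimal sketch of the tolerance matching I'm describing, not my production code. The names Box, parse_bboxes, within_tolerance and find_match are made up for illustration, and it assumes the hOCR output carries each box as "bbox x1 y1 x2 y2" with non-negative integer pixel coordinates.

```cpp
#include <cstdlib>
#include <regex>
#include <string>
#include <vector>

// Axis-aligned box in pixel coordinates, as in hOCR's "bbox x1 y1 x2 y2".
// (Hypothetical helper type, not part of ocropus.)
struct Box {
    int x1, y1, x2, y2;
};

// Collect every "bbox x1 y1 x2 y2" occurrence found in an hOCR string.
std::vector<Box> parse_bboxes(const std::string& hocr) {
    static const std::regex re(R"(bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+))");
    std::vector<Box> boxes;
    for (std::sregex_iterator it(hocr.begin(), hocr.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        boxes.push_back({std::stoi(m[1].str()), std::stoi(m[2].str()),
                         std::stoi(m[3].str()), std::stoi(m[4].str())});
    }
    return boxes;
}

// True if every coordinate of a is within tol pixels of the matching one in b.
bool within_tolerance(const Box& a, const Box& b, int tol) {
    return std::abs(a.x1 - b.x1) <= tol && std::abs(a.y1 - b.y1) <= tol &&
           std::abs(a.x2 - b.x2) <= tol && std::abs(a.y2 - b.y2) <= tol;
}

// Index of the first parsed box that matches the expected box, or -1 if none.
int find_match(const std::vector<Box>& boxes, const Box& expected, int tol) {
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        if (within_tolerance(boxes[i], expected, tol)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With the tolerance set to 10, a box reported at 103 48 398 82 still matches an expected box of 100 50 400 80, but any scan that shifts by more than the tolerance falls through, which is the weakness I'm trying to get away from.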
-- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

