On Thursday, May 16, 2013 10:52:51 PM UTC-4, Mike Masinick wrote:
> So, I have several hundred thousand scans of sports cards that look
> similar to the attached. I want to scan the text at the top of the page
> and extract at least the 8 digit number. Ideally more of the text as well,
> but the 8 digit number is the most important. Before I spend a ton of time
> researching the best way to train tesseract for this font, is there a
> suggested way to preprocess an image like this to get the best results?
> It seems to only grab the 8 digit number correctly about 1/10th of the
> time. It gets the numbers wrong a lot.
>
> I'm using tesseract on Amazon EC2 with the Image::OCR::Tesseract perl
> module. Any suggestions much appreciated. Might also be willing to pay
> for somebody to create training data for me if anybody is well versed in
> this and can save me the time of having to figure it out....
>
Why not use barcode recognition software and take advantage of the error correction inherent in barcodes? The barcode to the left of the bottom line encodes the number on the right.

If you want to OCR the rest of the text, I'd focus on cropping and preprocessing, as others have suggested, rather than training.

Tom
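
A minimal sketch of how the two readings might be combined once you have them, assuming the barcode decoder and tesseract each hand back a plain string. The function name `pick_card_number` and both inputs are hypothetical, and the decoding steps themselves are not shown. It pulls standalone 8-digit runs out of each string and prefers the barcode value, falling back to the OCR text only when no barcode reading is available.

```python
import re

# Exactly 8 digits, not embedded in a longer run of digits.
EIGHT_DIGITS = re.compile(r"(?<!\d)\d{8}(?!\d)")


def pick_card_number(ocr_text, barcode_text=None):
    """Return (number, source), where source is 'barcode', 'ocr', or None.

    ocr_text     -- raw text returned by the OCR step (assumed input)
    barcode_text -- raw text returned by a barcode decoder, or None if
                    no barcode was read (assumed input)
    """
    # Prefer the barcode reading, following the suggestion above.
    if barcode_text:
        match = EIGHT_DIGITS.search(barcode_text)
        if match:
            return match.group(0), "barcode"

    # Fall back to the first standalone 8-digit run in the OCR text.
    match = EIGHT_DIGITS.search(ocr_text or "")
    if match:
        return match.group(0), "ocr"

    # Nothing usable: flag the scan for manual review.
    return None, None
```

With several hundred thousand scans, recording the `source` alongside each number makes it easy to pull out the OCR-only results later and spot-check them.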

