On 2007-04-11, Dylan Beaudette wrote:
> Hi everyone,
>
> I am about to embark on an exciting adventure into the land of optical
> character recognition, processing nearly 1,000 documents and extracting
> numbers from them. I am interested in any anecdotal wisdom regarding:
>
> 1. efficient scanning parameters:
> DPI
> color / BW / grayscale

B&W, as high DPI as feasible.

> 2. pre-processing steps one might do with imagemagick

Clipping off borders is recommended.

> 3. any filtering that one might do to get ready for the OCR

Make sure there are no handwritten notes, post-it pieces, or other
miscellaneous cruft on the documents before scanning them. If the paper is
colored or there are ghost images (such as the back-side printing showing
through thin paper), scan in grayscale and then carefully reduce to B&W with
an appropriate hand-picked threshold. I think I used pnmremap to do that the
last time that need came up for me.

> I plan to use Google's new OCR project, ocropus, which currently uses
> the 'tesseract' engine. Naive attempts to OCR these documents are resulting in
> marginal accuracy, so any help is appreciated. Vertical and horizontal lines
> on the original documents are confusing the OCR, so removing them might be a
> start. I have thought about extracting each 'cell' of data with imagemagick,
> and then running the resulting mini-images through the OCR... that might be a
> last resort though...

Neat. I've never tried that. The only OCR engine I've successfully used is
gocr, which was pretty decent and worked out of the box with minimal tweaking.
I tried Clara, but it seemed unstable and I gave up before I could figure out
how to make it work.

-- 
Henry House
+1 530 753 3361 ext. 13
Please don't send me HTML mail! My mail system frequently rejects it.
The unintelligible text that may follow is a digital signature.
See <http://hajhouse.org/pgp> to find out how to use it.
My OpenPGP key: <http://hajhouse.org/hajhouse.asc>.
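
P.S. For anyone who wants to see the grayscale-to-B&W reduction and the ruled-line removal discussed in this message spelled out, here is a minimal sketch in plain Python. It is not what pnmremap or imagemagick do internally; it is only a stand-in operating on a nested list of 0-255 gray values. The function names, the cutoff of 128, and the `min_fraction` of 0.8 are all assumptions chosen for illustration, and a real cutoff would still need to be picked by hand for each batch of scans.

```python
def threshold(gray, cutoff):
    """Reduce a grayscale image to B&W.

    gray is a list of rows of 0-255 values. Pixels darker than the
    hand-picked cutoff become 1 (ink); everything else becomes 0 (paper).
    """
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def remove_ruled_lines(bw, min_fraction=0.8):
    """Blank every row or column that is mostly ink.

    A row (or column) counts as a ruled line when at least min_fraction
    of its pixels are ink. This assumes the scan is not skewed, and it
    also erases any text pixels that sit on a removed line.
    """
    height = len(bw)
    width = len(bw[0]) if height else 0
    line_rows = {y for y in range(height)
                 if sum(bw[y]) >= min_fraction * width}
    line_cols = {x for x in range(width)
                 if sum(bw[y][x] for y in range(height)) >= min_fraction * height}
    return [[0 if (y in line_rows or x in line_cols) else bw[y][x]
             for x in range(width)]
            for y in range(height)]


# Tiny made-up scan: 230 = paper, 170 = ghost image, 30/40 = ink.
gray = [
    [230, 230, 30, 230, 230],
    [30,  30,  30, 30,  30],   # horizontal rule
    [230, 170, 30, 40,  230],  # ghost at x=1, real ink at x=3
    [230, 230, 30, 230, 230],
    [230, 230, 30, 230, 230],  # column x=2 is a vertical rule
]
bw = threshold(gray, cutoff=128)
cleaned = remove_ruled_lines(bw)
for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

With this toy input the ghost pixel falls on the paper side of the cutoff, both rules are blanked, and only the single ink pixel in the third row survives. On real scans the lines are rarely perfectly straight, so treat the row/column test as a starting point rather than a finished filter.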
_______________________________________________
vox-tech mailing list
[email protected]
http://lists.lugod.org/mailman/listinfo/vox-tech
