I am entirely new to the OCR world, and I am anything but technically proficient in this area. I set up a fresh Fedora 14 virtual machine for a project, then downloaded, built and installed tesseract with leptonica. There were no errors. I didn't do anything to train it - my impression was that eng.traineddata was at least the initial input for this.
My project amounts to digitizing thousands of index cards. The language is English, so I would not think that I need any language-specific training, although I imagine that I might have to explain something about the fonts, etc. The text is typed on index card stock, so the cards have red and blue lines on them. They mostly date from the first half of the 20th century.

On a trial run with an image of one of the cards, I'm getting essentially random bits. I'm barely getting strings of three vowels or consonants together, so I'm guessing that the training is insufficient. I've read the "Training Tesseract 3.0" page, but that seemingly addresses training it for another language. Clearly the system comes pre-trained with at least some notion of English. There are signs that one can also train it on fonts, which I'd guess I want to do, but it's far from clear what steps are required, or how what I do now relates to the previously trained English data. What do I need to do here? (Is there something else I can read?)

Also, what should I do about the lines? I can probably maximize the color skew (in my sample image I flipped all of the color extremes in Lightroom), but I definitely can't make the lines disappear without actual image editing. I notice that tesseract is able to process full-color files. Is it feasible to teach tesseract to ignore the colored lines and pay attention only to the grey/black printing?
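
To make the question about the lines concrete, the kind of filtering being asked about could be sketched as a preprocessing step that runs before the image is handed to tesseract. This is plain Python and not anything tesseract itself provides; the function name `keep_dark_neutral` and both threshold values are illustrative assumptions that would need tuning against real scans of the cards.

```python
def keep_dark_neutral(pixels, max_chroma=40, max_brightness=128):
    """Map RGB pixels to 0 (ink) or 255 (background).

    pixels: list of rows, each row a list of (r, g, b) tuples in 0..255.
    A pixel counts as ink only if it is both dark and nearly colorless,
    so saturated red or blue ruling lines are pushed to background.
    The thresholds are guesses, not values taken from tesseract.
    """
    out = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            chroma = max(r, g, b) - min(r, g, b)  # 0 for a pure grey
            brightness = (r + g + b) // 3
            is_ink = chroma <= max_chroma and brightness <= max_brightness
            out_row.append(0 if is_ink else 255)
        out.append(out_row)
    return out


# Tiny made-up sample: paper, black ink, red line / blue line, grey ink, white.
sample = [
    [(250, 245, 235), (30, 30, 35), (200, 40, 50)],
    [(40, 60, 190), (90, 90, 90), (255, 255, 255)],
]
result = keep_dark_neutral(sample)
print(result)
```

On a real card the pixel rows would come from whatever image library is at hand, and the resulting black-and-white image would be saved and passed to tesseract in place of the color scan. Faded typewriter ink may be lighter than the brightness cutoff assumed here, so that value in particular is a starting point only.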

