getting started...

Brian Sat, 16 Apr 2011 18:44:11 -0700

I am entirely new to the OCR world, and far from technically
proficient in this area. I set up a fresh Fedora 14 virtual
machine for a project, then downloaded, built and installed tesseract
with leptonica. There were no errors. I didn't do anything to train it
- my impression was that the eng.traineddata file was at least the
starting point for this.


My project amounts to digitizing thousands of index cards. The
language is English - so I would not think that I need to do
language-specific training, although I imagine that I might have to
explain something about the fonts, etc. The text is typed on index
card stock - so the cards have red and blue lines on them. They mostly
date from the first half of the 20th century. On a trial run with an
image of one of the cards, I'm getting essentially random bits. I
barely get strings of three vowels or consonants together, so I'm
guessing that the training is insufficient.

I've read the "Training Tesseract 3.0" page - but that seems to
address training it for another language. Clearly the system comes
pre-trained with at least some notion of English. There are hints that
one can also train it on fonts - which I'd guess I want to do - but
it's far from clear what steps are required, or how anything I train
now relates to the previously trained English data.  What do I need to
do here?  (Is there something else I can read?)

Also, what should I do about the lines?  I can probably maximize the
color skew (in my sample image I flipped all of the color extremes in
Lightroom) but I definitely can't make them disappear without actual
image editing. I notice that tesseract is able to process full-color
files. Is it feasible to teach tesseract to ignore the colored
lines and pay attention only to the grey/black type?
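
To make the question about the lines concrete, here is a minimal
sketch of the kind of filtering meant above, done as a preprocessing
step before the image reaches tesseract rather than inside it. It is
an illustration only, not something tesseract provides: the function
name keep_dark_neutral and both thresholds are made up for this
example, and the pixel values are invented stand-ins for paper, typed
ink, and the red and blue rulings. A pixel counts as ink only if it is
both dark and close to grey; everything else is turned white.

```python
def keep_dark_neutral(pixels, max_value=110, max_spread=60):
    """Binarize an RGB image, keeping only dark, nearly grey pixels.

    pixels is a list of rows, each row a list of (r, g, b) tuples
    with channel values in 0..255. Returns rows of 0 (ink) or 255
    (background). Both thresholds are guesses to be tuned per scan.
    """
    out = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            brightest = max(r, g, b)
            # Spread is 0 for pure grey and large for saturated colors.
            spread = brightest - min(r, g, b)
            is_ink = brightest <= max_value and spread <= max_spread
            out_row.append(0 if is_ink else 255)
        out.append(out_row)
    return out


# Invented sample values: paper, black ink, red line,
# then blue line, faded grey ink, dark red line.
sample = [
    [(235, 225, 200), (30, 30, 30), (200, 40, 40)],
    [(40, 60, 200), (70, 65, 60), (100, 20, 20)],
]

cleaned = keep_dark_neutral(sample)
for row in cleaned:
    print(row)
```

On the sample, only the two ink pixels survive as black; the paper and
all three colored pixels come out white. Reading a real scan into this
list-of-tuples form, and writing the result back out as an image,
would need an imaging library and is not shown here.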

