So I've been tossing an idea around in my head for a while now, and
I think it deserves discussion.

As I understand it, the box/tif training steps reduce each font's
character shapes to simplified prototypes, which can then be
compared quickly with the blobs Tesseract reads later.

The recommended way to create box/tif files has been through scans,
but this is time-consuming, and impractical for languages with a
large number of characters. So some of us have opted to generate the
image and box files with programs instead (my lazytrain does this, I
think jTessBoxEditor does too, and from memory there are one or two
Python-based programs). In theory this may be inferior, as it won't
capture the distortions that scanning commonly introduces, but in
practice it seems to work well.
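
To make the generated route concrete, here is a minimal sketch of
the box-file half of it. It assumes the glyphs have already been
drawn onto an image by some rendering library (that step is left
out) and that their rectangles are known in top-left-origin pixel
coordinates. The function name and the sample positions are
hypothetical. The only thing taken from Tesseract is the box line
layout: symbol, left, bottom, right, top, page, with y measured up
from the bottom of the image.

```python
from typing import List, Tuple

# (character, left, top, right, bottom) in image coordinates, where y
# grows downwards from the top-left corner, as most drawing libraries
# report it.
GlyphRect = Tuple[str, int, int, int, int]


def to_box_lines(glyphs: List[GlyphRect], image_height: int,
                 page: int = 0) -> List[str]:
    """Convert top-left-origin glyph rectangles into box file lines."""
    lines = []
    for char, left, top, right, bottom in glyphs:
        # Box files measure y upwards from the bottom edge, so flip it.
        box_bottom = image_height - bottom
        box_top = image_height - top
        lines.append(f"{char} {left} {box_bottom} {right} {box_top} {page}")
    return lines


if __name__ == "__main__":
    # Hypothetical positions for two glyphs on a 100-pixel-tall image.
    sample = [("a", 10, 40, 30, 60), ("b", 35, 20, 55, 60)]
    print("\n".join(to_box_lines(sample, image_height=100)))
```

Because the program placed every glyph itself, the boxes are exact
and need no manual correction, which is most of the time saved over
scanning.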

If we accept that as a valid way to train, it seems more sensible to
extract the character shape prototypes straight from the font files.
They have the ideal shapes embedded in them, so it shouldn't be
particularly difficult, and it would make training easier and
faster. I imagine the training files would also be smaller, as there
would only be one prototype for each character.
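
As a rough illustration of what working from the font's shapes
might involve: TrueType outlines are built from straight lines and
quadratic Bezier curves, so one conceivable first step is to flatten
each curve into short straight segments. The sketch below does only
that, for a single curve whose control points are assumed to have
been read from the font already. The function name is hypothetical,
and nothing here reflects how Tesseract actually stores or builds
its prototypes.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def flatten_quadratic(p0: Point, p1: Point, p2: Point,
                      steps: int = 8) -> List[Point]:
    """Approximate a quadratic Bezier curve with `steps` straight segments.

    p0 and p2 are the on-curve end points, p1 is the off-curve control
    point. Returns steps + 1 points running from p0 to p2.
    """
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Bernstein weights for a quadratic curve.
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * p0[0] + b * p1[0] + c * p2[0],
                       a * p0[1] + b * p1[1] + c * p2[1]))
    return points


if __name__ == "__main__":
    # A hypothetical arch-shaped curve, sampled into four segments.
    for x, y in flatten_quadratic((0, 0), (1, 2), (2, 0), steps=4):
        print(f"{x:.2f} {y:.2f}")
```

Whether polylines like these could be converted into the features
the classifier expects is exactly the open question.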

Is there anything I'm missing with this proposal? Does it sound
sensible? If so I'll open a ticket for it, and will have a stab at
doing it.

Thanks for any input,

Nick
