So I've been tossing an idea around in my head for a while now, and I think it deserves discussion.
As I understand it, the box/tif steps reduce the varying character shapes of each font to simplified prototypes, which can be compared quickly with the blobs tesseract later reads.

The recommended way to create box/tif files has been through scans, but this is time-consuming and impractical for languages with a large number of characters. So some of us have opted to generate the image and box files with a few different programs (my lazytrain, jTessBoxEditor I think, and one or two Python-based programs, from memory). In theory this may be inferior, as it won't capture the common scanning distortions of letters, but in practice it seems to work well.

If we accept that as a valid way to train, it seems more sensible to extract the character shape prototypes straight from font files. They have the ideal shapes embedded in them, so it shouldn't be particularly difficult. It would make training easier and faster, and I imagine the training files would be smaller, as there would be only one prototype for each character.

Is there anything I'm missing with this proposal? Does it sound sensible? If so I'll open a ticket for it and have a stab at doing it.

Thanks for any input,
Nick
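
P.S. To make the synthetic route a bit more concrete, here is a rough sketch of how box entries could be written straight from glyph metrics instead of being corrected by hand after a scan. It assumes the usual box line layout of character, left, bottom, right, top, and page, with y measured upward from the bottom of the image. The `Glyph` tuple, the `box_lines` function, and the sample numbers are all made up for illustration; real metrics would have to come from the font file, and this does not touch the prototype extraction itself.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """Hypothetical per-character metrics, in pixels at the rendered size."""
    char: str
    x_min: int    # left edge of the ink, relative to the pen position
    y_min: int    # bottom edge of the ink, relative to the baseline
    x_max: int    # right edge of the ink, relative to the pen position
    y_max: int    # top edge of the ink, relative to the baseline
    advance: int  # how far the pen moves after drawing this glyph


def box_lines(glyphs: List[Glyph], origin_x: int, baseline_y: int,
              page: int = 0) -> List[str]:
    """Lay glyphs out on one text line and return box-file style entries.

    Each entry is "<char> <left> <bottom> <right> <top> <page>".
    """
    lines = []
    pen_x = origin_x
    for g in glyphs:
        left = pen_x + g.x_min
        right = pen_x + g.x_max
        bottom = baseline_y + g.y_min
        top = baseline_y + g.y_max
        lines.append(f"{g.char} {left} {bottom} {right} {top} {page}")
        pen_x += g.advance  # move the pen to the start of the next glyph
    return lines


if __name__ == "__main__":
    # Invented metrics for two glyphs; "g" has a descender below the baseline.
    sample = [Glyph("a", 1, 0, 10, 11, 12), Glyph("g", 1, -5, 10, 11, 12)]
    print("\n".join(box_lines(sample, origin_x=20, baseline_y=100)))
```

The point is only that once the shapes and metrics are read from the font, the boxes fall out of the layout arithmetic, so there is nothing to fix up by hand.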

