On Friday, October 12, 2012 8:50:25 AM UTC-4, Nick White wrote:
> Hi Tom, thanks for your thoughts.
>
> A key reason for not using scans when training is when the character
> set is quite large, so it would take many pages of real scans to get
> a few samples of each. Plus I found the process of box editing quite
> error-prone when dealing with large sets. For my Ancient Greek
> training, due to different combinations of diacritics which can
> apply to many characters, the character set was a lot larger than it
> looked at first. Finding and scanning 'real' pages which definitely
> contain all characters would be difficult. I'm sure the same would
> be true for other scripts which make use of many diacritics.
>
Sorry, let me clarify. I wasn't suggesting using scans; I was suggesting using images created by taking representative texts and representative fonts and rendering page images from them (which I suspect is what your current automated training program does).

> > • font 'hints' which cause the glyph to be rendered differently at
> >   different resolutions
> > • kerning information which affects glyph placement relative to its
> >   neighbors
>
> Aren't these two arguments *for* using font information? As one
> could encode the information for characters at a few different
> sizes, in a more representative fashion than you could from half a
> dozen examples of characters from a page scan.

Except that you have to understand not only the data, but also how it interacts with the font rasterization machinery. If you just render the text, that's all taken care of for you. Rendering images at different font sizes may be a good idea if that's representative of what you'll encounter in your real-world images.

Perhaps it's possible to interpret the font information directly, but my suspicion is that you'll be introducing at least as many problems as you're solving.

Tom

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
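
One practical upside of rendering from text rather than hunting for scans is that character coverage can be checked before anything is rendered. The following is only a minimal sketch of that idea, using the Python standard library; the function name `missing_chars` and the sample strings are hypothetical and are not part of Tesseract's training tools.

```python
import unicodedata


def missing_chars(text, required):
    """Return the required characters that never occur in text, sorted.

    Both inputs are normalized to NFC first, so a base letter followed by
    a combining diacritic compares equal to its precomposed form.
    Caveat: a letter+diacritic sequence with no precomposed code point
    stays as separate code points and is checked piece by piece.
    """
    seen = set(unicodedata.normalize("NFC", text))
    needed = set(unicodedata.normalize("NFC", required))
    return sorted(needed - seen)


if __name__ == "__main__":
    # Hypothetical sample text and required character set.
    sample_text = "λόγος"
    required = "λόγοςω"
    print(missing_chars(sample_text, required))
```

If the returned list is non-empty, more text containing those characters would need to be added before rendering the training pages.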

