Hi all,

All of the glyphs in the sample with the o's were taken from the same document. This is part of the crux of the problem we were having with training Tesseract from documents like these. Most likely, these exemplars represent different forms of the same character: most are probably lower-case o's in a roman font, some look like o's in an italic font, and a few are likely upper-case O's or even zeros that were mislabeled. In this example you can also see that Franken+ makes it easy to spot where the same glyph has been identified at different point sizes, which is not unusual for these documents. Title pages often use larger point sizes (and sometimes different font families), and footnotes or printed marginalia are often set smaller than the body text.
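For anyone who finds it easier to picture this in code, here is a minimal Python sketch of the situation described above. It is not Franken+'s actual data model: the `Glyph` record, its fields, and the sample values are all made up for illustration. It only shows how exemplars that share one label can still fall into several distinct style and point-size groups.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One exemplar image cut from a page (hypothetical record, not Franken+'s format)."""
    label: str        # character the exemplar is currently labelled as
    style: str        # "roman" or "italic", as judged by a human reviewer
    point_size: int   # approximate point size on the page


def group_exemplars(glyphs):
    """Group exemplars by label, then by (style, point size)."""
    groups = defaultdict(lambda: defaultdict(list))
    for g in glyphs:
        groups[g.label][(g.style, g.point_size)].append(g)
    return groups


# Made-up sample: every exemplar was labelled "o", but they are not all alike.
sample = [
    Glyph("o", "roman", 12),
    Glyph("o", "roman", 12),
    Glyph("o", "italic", 12),
    Glyph("o", "roman", 24),   # e.g. from a title page
    Glyph("o", "roman", 8),    # e.g. from a footnote
]

groups = group_exemplars(sample)
for (style, size), members in sorted(groups["o"].items()):
    print(f"o / {style} / {size}pt: {len(members)} exemplar(s)")
```

A single label ends up covering four different groups here, which is roughly the mix the sample image shows.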
There is also the possibility, as Janusz noted, that in any early modern printed document a printer used more than one punch for a character, whether a different character from the same set or the same character from another typeface set. I am not familiar with 19th-century documents, since our project is concerned with 15th- to 18th-century printed works. I have not seen a '1' used in place of an 'I' in our documents (we do have 45 million page images, however), but I have seen typefaces where the glyph for a '1' is essentially the same (to my eye, and probably to an OCR engine) as an 'I'.

All of this is why we created Franken+. Using it, we are able to pick out just the exemplar o's that we want for the font we are training. For example, if we are training for a roman font, we pick only the roman exemplars and unselect the italic ones. Likewise, if we're currently looking at the lower-case o's, we unselect any upper-case O's (I admit that in this example it's hard to tell the difference). Franken+ even lets us reclassify a particular glyph image by changing the Unicode code point associated with it. So if we find an upper-case O in our set of images for a lower-case o, we can reclassify it, and it then appears with the other set of glyph images.

What we are doing is creating a training set (a 'font' in Tesseract's terms) for the roman variants of a typeface from a document, and then creating a separate font for the italic variants of a typeface from the same document. We can then use Franken+ to combine the training for the two fonts and OCR a document with Tesseract trained for both. That seems to be working pretty well for us so far, but we still have lots more testing to do on this front. We also intend to do more testing on training Tesseract with exemplars of different point sizes.

I think that the Gamera classifier tool and its associated OCR engine are probably better suited to OCR'ing early modern texts.
Unlike Tesseract, it recognizes characters better as you give it more training on the kinds of varied glyphs you'll find in early modern printing. However, we found that it has a lot more trouble than Tesseract in dealing with noise, images, skew, and many of the other problems that the page images we are OCR'ing exhibit. It had so much trouble that OCR'ing a 300-page document took over 9 hours in some cases.

Thanks,
Matt

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en