Re: Training Tesseract for early printed text

matthew christy Mon, 09 Dec 2013 13:52:36 -0800

Hi all,

All of those glyphs in the sample with the o's were taken from the same 
document. This is part of the crux of the problem we were having with 
training Tesseract from documents like these. Likely, all of these 
exemplars represent different forms of the same character. For example, 
most are likely lower-case o in a roman font; some look to be o in an 
italic font; and likely a few of those are upper-case O's or even some 
zeros that were mis-labeled. In this example you are also seeing that 
Franken+ allows us to easily see where we have identified the same glyph in 
different point sizes, which is not unusual for these documents. Title 
pages often have larger point sizes (and sometimes even use different font 
families) and footnotes or printed marginalia are often smaller than the 
text body font.


There is also the possibility, as Janusz noted, that in any early modern 
print document a printer used more than one punch for a character, whether 
of another character from the same set, or of the same character from 
another typeface set. I am not familiar with 19th century documents since 
our project is concerned with 15th-18th century printed works. I have not 
seen a '1' used in place of an 'I' in our documents (we do have 45 million 
page images, however), but I have seen typefaces where the glyph for a '1' 
is essentially the same (to my eye, and probably to an OCR engine) as an 
'I'.

All of this is why we created Franken+. Using it we are able to pick out 
just the exemplar o's that we want for the font we are training. For 
example, if we are training for a roman font, we would only pick exemplars 
of those and unselect the italics exemplars. Likewise, if we're currently 
looking at the lower-case o's we would unselect any upper-case o's (I admit 
that in this example it's hard to tell the difference). Franken+ even lets 
us reclassify a particular glyph image by changing the Unicode code point 
associated with it. So if we find an upper-case o in our set of images for 
a lower-case o we can reclassify it and it appears with the other set of 
glyph images.
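
To make the select/unselect/reclassify idea concrete, here is a minimal 
sketch in Python. It is not Franken+ code and does not use any Franken+ 
API; the Glyph record, its fields, and the helper names are all 
hypothetical, chosen only to illustrate the workflow described above.

```python
from dataclasses import dataclass, replace

# Hypothetical record for one glyph image cut from a page scan.
@dataclass(frozen=True)
class Glyph:
    image_id: str      # hypothetical identifier for the cropped image
    codepoint: int     # Unicode code point currently assigned to it
    style: str         # "roman" or "italic", as judged by a person
    selected: bool = True

def select_for_font(glyphs, codepoint, style):
    """Keep only the selected exemplars of one character in one style."""
    return [g for g in glyphs
            if g.selected and g.codepoint == codepoint and g.style == style]

def reclassify(glyph, new_codepoint):
    """Return a copy of the glyph assigned to a different code point."""
    return replace(glyph, codepoint=new_codepoint)

glyphs = [
    Glyph("img-001", ord("o"), "roman"),
    Glyph("img-002", ord("o"), "italic"),
    Glyph("img-003", ord("o"), "roman"),  # really an upper-case O, mis-labeled
]

# Move the mis-labeled image over to the upper-case O set.
glyphs[2] = reclassify(glyphs[2], ord("O"))

# Exemplars to use when training the roman font's lower-case o.
roman_o = select_for_font(glyphs, ord("o"), "roman")
```

After the reclassification only img-001 is left as a roman lower-case o 
exemplar; the italic image and the re-labeled upper-case O are excluded.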

What we are doing is creating a training set (a 'font' in Tesseract's 
terms) of the roman variants of a typeface from a document, and then 
creating a separate font for the italic variants of a typeface from the 
same document. We can then use Franken+ to combine the training for the two 
fonts and OCR a document with Tesseract trained for both. That seems to be 
working pretty well for us so far, but we still have lots more testing to 
do on this front. We also intend to do some more testing on training 
Tesseract with exemplars of different point sizes.
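
As a rough illustration of what "training for both" involves on the 
Tesseract side: in Tesseract 3.x training, each font is described by one 
line in a font_properties file, giving the font name followed by five 0/1 
flags (italic, bold, fixed, serif, fraktur). The sketch below builds such 
lines for a roman and an italic font. The font names docRoman and docItalic 
are hypothetical, and this is only an assumption about how a combined 
training might be described, not a description of what Franken+ does 
internally.

```python
def font_properties_line(name, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Format one font_properties entry: <name> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([name] + [str(int(flag)) for flag in flags])

# Hypothetical font names for the two variants taken from one document.
lines = [
    font_properties_line("docRoman", serif=True),
    font_properties_line("docItalic", italic=True, serif=True),
]

# Contents that would be written to the font_properties file.
font_properties = "\n".join(lines) + "\n"
```

The font name in each line has to match the font name used in the 
corresponding training file names for Tesseract to associate them.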

I think that the Gamera classifier tool and its associated OCR engine are 
probably the kind of thing that's more suited to OCR'ing early modern 
texts. As opposed to Tesseract, it recognizes characters better as you give 
it more training on the kinds of varied glyphs you'll find in early modern 
printing. However, we found that it tends to have a lot more trouble than 
Tesseract in dealing with noise, images, skewing and many of the other 
problems that the page images we are OCR'ing exhibit. It had so much 
trouble that OCR'ing a 300-page document took over 9 hours in some cases.

Thanks,
Matt

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

