Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

matthew christy Mon, 09 Dec 2013 07:20:59 -0800

Hi all,

Thanks for the comments. I was not aware that there were concerns with 
Aletheia's availability or trouble getting access to the tool. We have not 
had any problems with that ourselves.


We are only using Aletheia as a tool to identify each glyph and create 
tiff/box file pairs for each page processed. We are not using the PAGE 
format or anything like that. Early in our process of trying to create 
training for Tesseract with early modern printed documents we had begun to 
use Aletheia as a tool to create tiff/box pairs for Tesseract (with some 
translation, of course). At the point that we decided that we had to create 
some new mechanism for training Tesseract, we already had quite a lot of 
page images processed with Aletheia which had all been corrected and 
checked by hand. So we started from that when creating Franken+. 

We have discussed internally some of the suggestions that you are all 
making above. Being able to use Tesseract's built in box file generator and 
jTessBoxEditor as input instead of Aletheia is one. Creating a version that 
runs on other platforms is another that would follow the above (as Aletheia 
also only runs on Windows). Making our own repository for Franken+ is also 
a good idea. It will take some time to get to all of that though. In the 
meantime we wanted to share what we had created, and this is a beta 
release. As open-source code we also hope that others will feel free to 
make some of these changes themselves and share with the rest of us. In 
fact, I think getting Franken+ to work with Tesseract/jTessBoxEditor input 
should be a simple matter of adjusting the coordinate system that Franken+ 
is expecting in the incoming box files (since Tesseract and Aletheia box 
files have 0,0 in different corners).

-

Franken+ was created not only to allow us to identify the best possible 
exemplars of each glyph in our training documents, but to generate tiffs 
with accompanying box files that could be used to train Tesseract as well. 
The early modern printed documents that we are trying to use to train 
Tesseract are far from ideal. They suffer from many problems introduced at 
every stage of the process from the original printing, to 250-550 years of 
use and storage, to the digitization of the document. We found in early 
testing that the more examples of glyphs we tried to train Tesseract 
with--often with highly variable example images for each glyph--the worse 
Tesseract did. 

In a nutshell Franken+ allows us to process several pages (with Aletheia), 
see every exemplar of each glyph discovered, pick some small number of 
ideal samples for each (we are in the process of testing whether Tesseract 
does better when Franken+ is used to pick one example of each glyph, 5 
examples, or more), and generate a set of tiff images and box files that 
are produced by creating a Franken-doc using only those set of exemplars 
identified in Franken+. However, we have found that using Franken+ has 
other advantages: being able to easily identify miss-labeled glyphs, doing 
typeface comparison of glyphs in a document, quickly identifying and 
removing "junk" glyphs, being able to quickly identify the different point 
sizes used in a doc, etc. And we also added some extra features that allow 
users to complete the Tesseract training process with Franken+ rather than 
having to go back to the command line.

We have seen a variable amount of improvement in our OCR results with 
Tesseract using training generated with Franken+. Some of that improvement 
has been quite good; upwards of 15-20%, without adding dictionaries. We are 
continuing to test variables in Franken+ training to see what generates the 
best results. We'll add all of that information to the eMOP and Franken+ 
pages when we have it.

We do appreciate the comments and suggestions and if anyone is interested 
in getting Franken+ to work with Tesseract's tiff/box pairs before we can 
get to it, please do. We are always happy to get help and we do want 
Franken+ to be an active open-source project.
Thanks,
Matt Christy

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

--- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

Reply via email to