Hi all, Thanks for the comments. I was not aware that there were concerns with Aletheia's availability or trouble getting access to the tool. We have not had any problems with that ourselves.
We are only using Aletheia as a tool to identify each glyph and create tiff/box file pairs for each page processed. We are not using the PAGE format or anything like that. Early in our process of trying to create training for Tesseract with early modern printed documents we had begun to use Aletheia as a tool to create tiff/box pairs for Tesseract (with some translation, of course). At the point that we decided that we had to create some new mechanism for training Tesseract, we already had quite a lot of page images processed with Aletheia which had all been corrected and checked by hand. So we started from that when creating Franken+. We have discussed internally some of the suggestions that you are all making above. Being able to use Tesseract's built in box file generator and jTessBoxEditor as input instead of Aletheia is one. Creating a version that runs on other platforms is another that would follow the above (as Aletheia also only runs on Windows). Making our own repository for Franken+ is also a good idea. It will take some time to get to all of that though. In the meantime we wanted to share what we had created, and this is a beta release. As open-source code we also hope that others will feel free to make some of these changes themselves and share with the rest of us. In fact, I think getting Franken+ to work with Tesseract/jTessBoxEditor input should be a simple matter of adjusting the coordinate system that Franken+ is expecting in the incoming box files (since Tesseract and Aletheia box files have 0,0 in different corners). - Franken+ was created not only to allow us to identify the best possible exemplars of each glyph in our training documents, but to generate tiffs with accompanying box files that could be used to train Tesseract as well. The early modern printed documents that we are trying to use to train Tesseract are far from ideal. They suffer from many problems introduced at every stage of the process from the original printing, to 250-550 years of use and storage, to the digitization of the document. We found in early testing that the more examples of glyphs we tried to train Tesseract with--often with highly variable example images for each glyph--the worse Tesseract did. In a nutshell Franken+ allows us to process several pages (with Aletheia), see every exemplar of each glyph discovered, pick some small number of ideal samples for each (we are in the process of testing whether Tesseract does better when Franken+ is used to pick one example of each glyph, 5 examples, or more), and generate a set of tiff images and box files that are produced by creating a Franken-doc using only those set of exemplars identified in Franken+. However, we have found that using Franken+ has other advantages: being able to easily identify miss-labeled glyphs, doing typeface comparison of glyphs in a document, quickly identifying and removing "junk" glyphs, being able to quickly identify the different point sizes used in a doc, etc. And we also added some extra features that allow users to complete the Tesseract training process with Franken+ rather than having to go back to the command line. We have seen a variable amount of improvement in our OCR results with Tesseract using training generated with Franken+. Some of that improvement has been quite good; upwards of 15-20%, without adding dictionaries. We are continuing to test variables in Franken+ training to see what generates the best results. We'll add all of that information to the eMOP and Franken+ pages when we have it. We do appreciate the comments and suggestions and if anyone is interested in getting Franken+ to work with Tesseract's tiff/box pairs before we can get to it, please do. We are always happy to get help and we do want Franken+ to be an active open-source project. Thanks, Matt Christy -- -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To post to this group, send email to tesseract-ocr@googlegroups.com To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en --- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.