Janusz,

The TCP (Text Creation Partnership) is interested in creating ground truth for historic texts by hand-keying them: http://www.textcreationpartnership.org/
We use thousands of their documents for ground truth comparisons, and have generated our word frequency lists from them. I just realized, however, that they use only a limited set of ligatures in their transcriptions.

I apologize for reading your suggestions as though you were advocating typographical accuracy above searchability. Our initial findings are that trying to train Tesseract to recognize these ligatures is less effective than training it to treat them as separate characters. In other words, we're getting better results by normalizing on the front end, in terms of both Tesseract's accuracy and its efficiency.

Having a sophisticated search engine that offers different versions of the text would be interesting--we'll have to look into that. Clemens Neudecker from IMPACT is one of our collaborators.

Thanks,
b

On Tue, Dec 10, 2013 at 2:48 PM, Janusz S. Bien <jsb...@mimuw.edu.pl> wrote:

> Quote/Cytat - Bryan Tarpley <bptarp...@gmail.com> (Tue 10 Dec 2013
> 09:28:41 PM CET):
>
>> Janusz,
>>
>> I'm going to try to interpret your comments as constructive criticism :)
>
> That is definitely my intention.
>
>> We tried using MUFI. There simply does not exist in MUFI a Unicode value
>> for "ke," for example (we looked:
>> http://www.ub.uib.no/elpub/2003/r/000001/MUFI-standard-1.0.pdf).
>
> You can make your own assignment. You can get an idea of how it was done in
> the IMPACT project e.g. from my note
>
> http://bc.klf.uw.edu.pl/288/
>
> The problem is that you also need a font compatible with your
> assignments. In the IMPACT project the font used by Aletheia was changed as
> often as was needed. I understand this can be a problem for you if you
> are not familiar with font development software.
>
>> I strongly disagree that we're training on different character shapes than
>> those occurring in the texts. We're actually cutting out images of the
>> characters themselves and training on those.
>> What you are saying is that
>> we should not treat them as separate entities, that we should value
>> typographical faithfulness over readability in our OCR. You seem to be
>> advocating a kind of purity or exact consistency with the original
>> typesetting that is not the immediate goal of the eMOP project.
>
> This is not a question of ideology but of Tesseract accuracy and
> efficiency. I'm not a Tesseract expert, so it is just a hypothesis that
> better results can be achieved by training on the original data.
>
>> Our ultimate concern is to make these texts searchable for early modern
>> scholars--not to produce 100% typographically faithful textual simulacra.
>> We believe this caliber of work (the production of scholarly digital
>> editions) is best left to textual scholars, not machines. How is a
>> scholar supposed to search for instances of the word "turkey" if there are no
>> Unicode values they could enter using the keyboard (or even copy and paste
>> from the character map) for "ke?"
>
> You just have to normalize the text before using it in the search engine. If
> your search engine is sufficiently sophisticated, you can offer several
> versions of your texts. In our search engine the user by default searches
> the normalized text but can also search for the original spelling with
> ligatures. More information is available in my note
>
> http://bc.klf.uw.edu.pl/289/
>
> and the search engine is available at
>
> http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_1/
> http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_2/
>
>> There exist great initiatives like the
>> TCP which are more interested in the kind of digitization you seem to be
>> advocating.
>
> I'm not familiar with this project. I would appreciate a link.
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

--
Bryan Tarpley
Graduate Research Assistant
Texas A&M | IDHMC
LAAH 439
bptarp...@tamu.edu

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
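
The normalize-before-search idea discussed in this thread can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not eMOP's or IMPACT's actual code: the `normalize` and `search` helpers are hypothetical names, and U+E000 is an arbitrary private-use code point standing in for a ligature such as "ke" that has no standard Unicode value.

```python
import unicodedata

# Hypothetical private-use assignment for a ligature with no standard
# code point (the thread's "ke" example); U+E000 is an arbitrary choice.
CUSTOM_LIGATURES = {"\ue000": "ke"}


def normalize(text):
    """Expand ligatures into separate characters for searching."""
    for ligature, expansion in CUSTOM_LIGATURES.items():
        text = text.replace(ligature, expansion)
    # NFKC expands standard compatibility ligatures such as U+FB01 (fi)
    # and maps long s (U+017F) to a plain "s".
    return unicodedata.normalize("NFKC", text)


def search(documents, query, original=False):
    """Return the indexes of the documents that contain the query.

    By default both the query and the documents are normalized; with
    original=True the query is matched against the untouched text.
    """
    if original:
        return [i for i, doc in enumerate(documents) if query in doc]
    needle = normalize(query)
    return [i for i, doc in enumerate(documents) if needle in normalize(doc)]
```

With this arrangement a scholar can type "turkey" on an ordinary keyboard and still find a transcription that stores the ligature, while passing `original=True` restricts the match to the spelling exactly as transcribed.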