Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

Janusz S. Bien Tue, 10 Dec 2013 12:49:07 -0800

Quote/Cytat - Bryan Tarpley <bptarp...@gmail.com> (Tue 10 Dec 2013 09:28:41 PM CET):

Janusz,


I'm going to try to interpret your comments as constructive criticism :)


That is definitely my intention.


We tried using MUFI.  There simply does not exist in MUFI a unicode value
for "ke," for example (we looked:
http://www.ub.uib.no/elpub/2003/r/000001/MUFI-standard-1.0.pdf).

You can make your own assignment. You can get an idea of how it was done in the IMPACT project, e.g. from my note


http://bc.klf.uw.edu.pl/288/

The problem is that you also need a font compatible with your assignments. In the IMPACT project the font used by Aletheia was changed as often as needed. I understand this can be a problem for you if you are not familiar with font development software.

I strongly disagree that we're training on different character shapes than
those occurring in the texts.  We're actually cutting out images of the
characters themselves and training on those.  What you are saying is that
we should not treat them as separate entities, that we should value
typographical faithfulness over readability in our OCR.  You seem to be
advocating a kind of purity or exact consistency with the original
typesetting that is not the immediate goal of the eMOP project.

This is not a question of ideology but of Tesseract accuracy and efficiency. I'm not a Tesseract expert, so it is just a hypothesis that better results can be achieved by training on the original data.

Our ultimate concern is to make these texts searchable for early modern
scholars--not to produce 100% typographically faithful textual simulacra.
 We believe this caliber of work (the production of scholarly digital
editions) is best left to textual scholars, not machines.  How is a scholar
supposed to search for instances of the word "turkey" if there are no
unicode values they could enter using the keyboard (or even copy and paste
from the character map) for "ke?"

You just have to normalize the text before using it in the search engine. If your search engine is sufficiently sophisticated, you can offer several versions of your texts. In our search engine the user by default searches the normalized text but can also search for the original spelling with ligatures. More information is available in my note


http://bc.klf.uw.edu.pl/289/

and the search engine is available at

http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_1/
http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_2/
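
To make the idea concrete, here is a minimal Python sketch of keeping both an original and a normalized version of each text and searching the normalized one by default. It is only an illustration under stated assumptions: the code point U+E000 for the "ke" ligature is a hypothetical Private Use Area assignment invented for this example (it is not taken from MUFI or from the IMPACT project), and the names `PUA_EXPANSIONS`, `normalize`, `index_record` and `search` are hypothetical too. It does not describe how Poliqarp works internally.

```python
import unicodedata

# Hypothetical private assignment: U+E000 stands for a "ke" ligature.
# This code point is an assumption made for illustration only.
PUA_EXPANSIONS = {
    "\uE000": "ke",
}


def normalize(text):
    """Return a keyboard-searchable form of a transcription."""
    # Expand privately assigned ligatures first; Unicode knows nothing about them.
    expanded = "".join(PUA_EXPANSIONS.get(ch, ch) for ch in text)
    # NFKC folds standard compatibility characters such as the long s
    # (U+017F) and the st ligature (U+FB06) into plain letters.
    return unicodedata.normalize("NFKC", expanded).lower()


def index_record(ocr_text):
    """Keep the original spelling and the normalized form side by side."""
    return {"original": ocr_text, "normalized": normalize(ocr_text)}


def search(records, query, original=False):
    """Return positions of matching records.

    By default the query is normalized and matched against the normalized
    text; with original=True it is matched verbatim against the original
    spelling, ligatures included.
    """
    field = "original" if original else "normalized"
    needle = query if original else normalize(query)
    return [i for i, rec in enumerate(records) if needle in rec[field]]
```

With records built by `index_record`, a user who types "turkey" on an ordinary keyboard finds a text transcribed with the ligature, while `original=True` still allows searching for the ligature itself.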

There exist great initiatives like the
TCP which are more interested in the kind of digitization you seem to be
advocating.


I'm not familiar with this project. I would appreciate a link.

Best regards

Janusz




--

Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
