Quote - Bryan Tarpley <bptarp...@gmail.com> (Tue 10 Dec 2013 09:28:41 PM CET):

Janusz,

I'm going to try to interpret your comments as constructive criticism :)

That is definitely my intention.


We tried using MUFI.  There simply does not exist in MUFI a unicode value
for "ke," for example (we looked:
http://www.ub.uib.no/elpub/2003/r/000001/MUFI-standard-1.0.pdf).

You can make your own assignment. You can get an idea of how it was done in the IMPACT project, e.g. from my note

http://bc.klf.uw.edu.pl/288/

The problem is that you also need a font compatible with your assignments. In the IMPACT project the font used by Aletheia was changed as often as needed. I understand this can be a problem for you if you are not familiar with font development software.
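
To illustrate what such an assignment might look like, here is a minimal Python sketch. The code point U+E000 and the names LIGATURE_TO_PUA, is_private_use and encode_ligature are hypothetical, chosen only for illustration; they are not the MUFI or IMPACT assignments. The only assumption is that the new glyphs go into the Unicode Private Use Area of the Basic Multilingual Plane (U+E000 to U+F8FF).

```python
# Range of the Private Use Area in the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF

# Hypothetical private assignments: ligature name -> code point.
LIGATURE_TO_PUA = {
    "ke": 0xE000,  # made-up code point for the "ke" ligature
}


def is_private_use(code_point: int) -> bool:
    """Return True if the code point lies in the BMP Private Use Area."""
    return PUA_START <= code_point <= PUA_END


def encode_ligature(name: str) -> str:
    """Return the one-character string assigned to the named ligature."""
    code_point = LIGATURE_TO_PUA[name]
    if not is_private_use(code_point):
        raise ValueError(f"U+{code_point:04X} is outside the Private Use Area")
    return chr(code_point)
```

The font used for display and training then has to contain a glyph at exactly the same code point, which is the compatibility problem mentioned above.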

I strongly disagree that we're training on different character shapes than
those occurring in the texts.  We're actually cutting out images of the
characters themselves and training on those.  What you are saying is that
we should not treat them as separate entities, that we should value
typographical faithfulness over readability in our OCR.  You seem to be
advocating a kind of purity or exact consistency with the original
typesetting that is not the immediate goal of the eMOP project.

This is not a question of ideology but of Tesseract accuracy and efficiency. I'm not a Tesseract expert, so it is only a hypothesis that better results can be achieved by training on the original data.

Our ultimate concern is to make these texts searchable for early modern
scholars--not to produce 100% typographically faithful textual simulacra.
We believe this caliber of work (the production of scholarly digital
editions) is best left to textual scholars, not machines.  How is a scholar
supposed to search for instances of the word "turkey" if there are no
unicode values they could enter using the keyboard (or even copy and paste
from the character map) for "ke?"

You just have to normalize the text before using it in the search engine. If your search engine is sufficiently sophisticated, you can offer several versions of your texts. In our search engine the user searches the normalized text by default but can also search for the original spelling with ligatures. More information is available in my note

http://bc.klf.uw.edu.pl/289/

and the search engine is available at

http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_1/
http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_2/
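
The idea of keeping both versions of the text can be sketched in a few lines of Python. This is only an illustration and not how Poliqarp works internally; the table PUA_TO_PLAIN, the code point U+E000 for "ke", and the functions normalize, index_entry and search are all hypothetical.

```python
# Hypothetical table: privately assigned ligature character -> plain letters.
PUA_TO_PLAIN = {
    "\ue000": "ke",  # made-up code point for the "ke" ligature
}

# str.maketrans accepts a dict mapping single characters to strings.
_TABLE = str.maketrans(PUA_TO_PLAIN)


def normalize(text: str) -> str:
    """Replace every ligature character with its plain-letter expansion."""
    return text.translate(_TABLE)


def index_entry(original: str) -> dict:
    """Keep both the original OCR text and its normalized form."""
    return {"original": original, "normalized": normalize(original)}


def search(entries: list, query: str, original_spelling: bool = False) -> list:
    """Search the normalized text by default, the original on request."""
    field = "original" if original_spelling else "normalized"
    return [entry for entry in entries if query in entry[field]]


entries = [index_entry("tur\ue000y")]
hits = search(entries, "turkey")  # matches via the normalized text
```

With this arrangement a query typed on an ordinary keyboard finds the word, while the ligature is still preserved in the original text for those who want to search for it.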

There exist great initiatives like the
TCP which are more interested in the kind of digitization you seem to be
advocating.

I'm not familiar with this project. I would appreciate a link.

Best regards

Janusz




--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
