Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

Bryan Tarpley Tue, 10 Dec 2013 13:16:17 -0800

Janusz,

The TCP (Text Creation Partnership) is interested in creating ground truth
for historic texts by hand-keying them:
http://www.textcreationpartnership.org/


We use thousands of their documents for ground truth comparisons, and have
generated our word frequency lists from them.  I just realized, however, that
they use only a limited set of ligatures in their transcriptions.  I
apologize for reading your suggestions as though you were advocating
typographical accuracy above searchability.  Our initial findings are that
training Tesseract to recognize these ligatures is less effective
than training it to treat them as separate characters.  In other words,
we're getting better results by normalizing on the front end, in terms of
both Tesseract's accuracy and its efficiency.
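
To make "normalizing on the front end" concrete, here is a minimal sketch
in Python of expanding ligature glyphs into separate characters in
transcription text before it is used for training or comparison.  This is
only an illustration, not the actual eMOP or Franken+ code.  The function
name and the table are assumptions; the U+FB00..U+FB06 code points are
the standard Unicode ligatures, while the private-use code point for "ke"
is an invented placeholder, since no standard value exists for it.

```python
# Illustrative ligature table. Keys U+FB00..U+FB06 are real Unicode
# presentation forms; U+E000 is a HYPOTHETICAL private-use assignment
# for a "ke" ligature, made up for this example.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",   # long s + t, deliberately modernized to "st"
    "\ufb06": "st",
    "\ue000": "ke",   # hypothetical private-use code point
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def normalize_ligatures(text):
    """Return text with each known ligature replaced by separate letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_ligatures("\ufb01sh and tur\ue000y"))
```

A search index built from the normalized text can then be queried with
ordinary keyboard input such as "turkey".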

Having a sophisticated search engine that offers different versions of text
would be interesting--we'll have to look into that.  Clemens Neudecker from
IMPACT is one of our collaborators.

Thanks,
b


On Tue, Dec 10, 2013 at 2:48 PM, Janusz S. Bien <jsb...@mimuw.edu.pl> wrote:

> Quote/Cytat - Bryan Tarpley <bptarp...@gmail.com> (Tue 10 Dec 2013
> 09:28:41 PM CET):
>
>
>  Janusz,
>>
>> I'm going to try to interpret your comments as constructive criticism :)
>>
>
> That is definitely my intention.
>
>
>
>> We tried using MUFI.  There simply does not exist in MUFI a unicode value
>> for "ke," for example (we looked:
>> http://www.ub.uib.no/elpub/2003/r/000001/MUFI-standard-1.0.pdf).
>>
>
> You can make your own assignment. You can get an idea how it was done in
> the IMPACT project e.g. from my note
>
> http://bc.klf.uw.edu.pl/288/
>
> The problem is that you also need a font compatible with your
> assignments. In the IMPACT project the font used by Aletheia was changed as
> often as it was needed. I understand this can be a problem for you if you
> are not familiar with font development software.
>
>
>  I
>> strongly disagree that we're training on different character shapes than
>> those occurring in the texts.  We're actually cutting out images of the
>> characters themselves and training on those.  What you are saying is that
>> we should not treat them as separate entities, that we should value
>> typographical faithfulness over readability in our OCR.  You seem to be
>> advocating a kind of purity or exact consistency with the original
>> typesetting that is not the immediate goal of the eMOP project.
>>
>
> This is not a question of ideology but of Tesseract accuracy and
> efficiency. I'm not a Tesseract expert so it is just a hypothesis that
> better results can be achieved by training on the original data.
>
>
>  Our
>> ultimate concern is to make these texts searchable for early modern
>> scholars--not to produce 100% typographically faithful textual simulacra.
>>  We believe this caliber of work (the production of scholarly digital
>> editions) is best left to textual scholars, not machines.  How is a scholar
>> supposed to search for instances of the word "turkey" if there are no
>> unicode values they could enter using the keyboard (or even copy and paste
>> from the character map) for "ke"?
>>
>
> You just have to normalize the text before using it in the search engine. If
> your search engine is sufficiently sophisticated, you can offer several
> versions of your texts. In our search engine the user by default searches
> the normalized text but can search also for original spelling with
> ligatures. More information is available in my note
>
> http://bc.klf.uw.edu.pl/289/
>
> and the search engine is available at
>
> http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_1/
> http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_2/
>
>
>  There exist great initiatives like the
>> TCP which are more interested in the kind of digitization you seem to be
>> advocating.
>>
>
> I'm not familiar with this project. I would appreciate a link.
>
>
> Best regards
>
> Janusz
>
>
>
>
> --
> Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>



-- 
Bryan Tarpley
Graduate Research Assistant
Texas A&M | IDHMC
LAAH 439
bptarp...@tamu.edu

