[tesseract-ocr] Re: Latin language

Ryan Baumann Fri, 21 Nov 2014 13:12:34 -0800

On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer wrote:
>
>
>
> Am Freitag, 21. November 2014 00:40:39 UTC+1 schrieb Guido Milanese:
>>
>> I am a regular user of tesseract and it's an essential tool for my daily 
>> work, so thank you, before anything else. The support for Ancient Greek is 
>> simply superb -- works like a charm. I did not find a support for Latin -- 
>> I mean the Latin language, not the Latin alphabet. Is there any project for 
>> this?
>>
>> Thank you very much for your kind attention.
>> guido, italy
>>
>
> Would be nice to have for me too, because of old scientific (zoological, 
> botanic) texts, which mostly contain Latin and Greek besides the native 
> language.
>
> Do you have a good Latin dictionary for training?
>
> Helmut Wollmersdorfer
>

Coincidentally, I recently began looking into this for my own use. I
decided the easiest course would probably be to adapt the excellent, open
work done by Nick White for Ancient Greek. Unfortunately I'm not very far
along yet, as one of the first steps is making sure I can correctly
replicate the existing process for Ancient Greek on my own machine (the
mftraining step in the grc repository seems to be taking quite some time).

You can find my work-in-progress
here: https://github.com/ryanfb/latinocr-lattraining

Right now it should just build the following (from the same Perseus sources):

- training_text.txt
- lat.word.txt
- lat.freq.txt
- lat.unicharambigs
- lat.wordlist

Note that this is very preliminary: so far I've only made trivial changes,
so that I can start figuring out what I need to clean up in the
input/processing.

Note also that there's a modification here in tools/wordlistfromperseus.sh
to strip <foreign> tags instead of skipping files with foreign words
altogether. I think this would help Ancient Greek as well (though how much
it will improve or alter overall accuracy I don't know). For Greek, this
change results in the wordlist being 7202347 lines for me instead of
5605967, or a 28% increase in the size of the corpus. I originally did this
with Saxon/XSLT, but the processing was slow, so I switched to using Perl
so I could apply a non-greedy regex substitution instead (which is much
faster):
https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
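
To illustrate why the non-greedy match matters, here is a minimal sketch in
Python. It is not the Perl from the commit above. It assumes that "stripping"
means removing each <foreign> element together with its contents, and the
names FOREIGN_RE and strip_foreign and the sample line are made up for the
example.

```python
import re

# Non-greedy ".*?" stops at the first closing tag, so two <foreign>
# elements on the same line are removed separately and the text between
# them survives. A greedy ".*" would swallow everything from the first
# opening tag to the last closing tag.
# Assumption: elements are not nested and are written as open/close pairs.
FOREIGN_RE = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)


def strip_foreign(xml_text: str) -> str:
    """Remove <foreign>...</foreign> elements, leaving a space in their place."""
    return FOREIGN_RE.sub(" ", xml_text)


# Hypothetical sample line, not taken from the Perseus sources.
sample = (
    'arma <foreign lang="greek">μῆνιν</foreign> virumque '
    '<foreign lang="greek">ἄειδε</foreign> cano'
)

cleaned_words = strip_foreign(sample).split()
print(cleaned_words)
```

With a greedy pattern the word between the two elements ("virumque" in the
sample) would be dropped as well, which is the kind of loss the non-greedy
substitution avoids.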

-Ryan

