On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer wrote:
>
>
>
> On Friday, 21 November 2014 00:40:39 UTC+1, Guido Milanese wrote:
>>
>> I am a regular user of tesseract and it's an essential tool for my daily 
>> work, so thank you, before anything else. The support for Ancient Greek is 
>> simply superb -- works like a charm. I did not find a support for Latin -- 
>> I mean the Latin language, not the Latin alphabet. Is there any project for 
>> this?
>>
>> Thank you very much for your kind attention.
>> guido, italy
>>
>
> Would be nice to have for me too, because of old scientific (zoological, 
> botanic) texts, which mostly contain Latin and Greek besides the native 
> language.
>
> Do you have a good Latin dictionary for training?
>
> Helmut Wollmersdorfer
>

Coincidentally, I recently began looking into this for my own use. I 
decided the easiest course would probably be to adapt the excellent, open 
work done by Nick White for Ancient Greek. Unfortunately I'm not very far 
along yet, as one of the first steps is making sure I can correctly 
replicate the existing process for Ancient Greek on my own machine (the 
mftraining step in the grc repository seems to be taking quite some time).

You can find my work-in-progress 
here: https://github.com/ryanfb/latinocr-lattraining

Right now it should just build the following (from the same Perseus sources):

- training_text.txt
- lat.word.txt
- lat.freq.txt
- lat.unicharambigs
- lat.wordlist

Note that this is very preliminary: so far I've only made trivial changes, 
so that I can start figuring out what I need to clean up in the input and 
processing.

Note also that I modified tools/wordlistfromperseus.sh to strip <foreign> 
tags instead of skipping files with foreign words altogether. I think this 
would help Ancient Greek as well (though I don't know how much it will 
improve or alter overall accuracy). For Greek, this change results in a 
wordlist of 7202347 lines for me instead of 5605967, or roughly a 28% 
increase in the size of the corpus. I originally did this with Saxon/XSLT, 
but the processing was slow, so I switched to Perl so that I could apply a 
non-greedy regex substitution instead (which is much faster): 
https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
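
For anyone who wants to see the idea without reading the commit, here is a 
minimal sketch of a non-greedy substitution of this kind. It is written in 
Python rather than Perl and is not the code from the repository. The 
function name strip_foreign, the pattern, and the sample string are all made 
up for illustration, and I am assuming here that the whole <foreign> element 
(tags plus the text inside) gets dropped:

```python
import re

# Hypothetical pattern, not taken from the repository.
# ".*?" is non-greedy, so each match ends at the nearest </foreign>
# instead of running on to the last one in the input.
# re.DOTALL lets an element span several lines.
FOREIGN_RE = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)


def strip_foreign(xml_text: str) -> str:
    """Remove every <foreign>...</foreign> element, including its content."""
    return FOREIGN_RE.sub("", xml_text)


# Made-up sample with two foreign elements on one line.
sample = (
    '<p>arma <foreign lang="greek">μῆνιν</foreign> virumque '
    '<foreign lang="greek">ἄειδε</foreign> cano</p>'
)

print(strip_foreign(sample))
```

With a greedy ".*" the single match would run from the first opening tag to 
the last closing tag and take "virumque" with it, which is why the 
non-greedy form matters for lines that contain more than one foreign element.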

-Ryan 
