On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer wrote:
>
> On Friday, 21 November 2014 00:40:39 UTC+1, Guido Milanese wrote:
>>
>> I am a regular user of tesseract and it's an essential tool for my daily
>> work, so thank you, before anything else. The support for Ancient Greek is
>> simply superb -- works like a charm. I did not find a support for Latin --
>> I mean the Latin language, not the Latin alphabet. Is there any project for
>> this?
>>
>> Thank you very much for your kind attention.
>> guido, italy
>
> Would be nice to have for me too, because of old scientific (zoological,
> botanic) texts, which mostly contain Latin and Greek besides the native
> language.
>
> Do you have a good Latin dictionary for training?
>
> Helmut Wollmersdorfer

Coincidentally, I recently began looking into this for my own use. I decided the easiest course would probably be to adapt the excellent, open work done by Nick White for Ancient Greek. Unfortunately I'm not very far along yet, as one of the first steps is making sure I can correctly replicate the existing process for Ancient Greek on my own machine (the mftraining step in the grc repository seems to be taking quite some time).

You can find my work-in-progress here:
https://github.com/ryanfb/latinocr-lattraining

Right now that should just build you (from the same Perseus sources):

- training_text.txt
- lat.word.txt
- lat.freq.txt
- lat.unicharambigs
- lat.wordlist

Note that this is very preliminary: so far I've only made trivial alterations so that I can start figuring out what I need to clean up in the input/processing.

Note also that there's a modification in tools/wordlistfromperseus.sh to strip <foreign> tags instead of skipping files with foreign words altogether. I think this would help Ancient Greek as well (though I don't know how much it will improve or alter overall accuracy). For Greek, this change results in the wordlist being 7202347 lines for me instead of 5605967, about a 28% increase in the size of the corpus. I originally did this with Saxon/XSLT, but the processing was slow, so I switched to Perl so I could apply a non-greedy regex substitution instead (which is much faster):
https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77

-Ryan
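
To illustrate why a non-greedy match matters when stripping <foreign> tags, here is a minimal sketch in Python (not the Perl used in the repository). It assumes that each <foreign> element is removed together with the text it encloses; the pattern, the function name strip_foreign, and the sample line are all hypothetical and are not taken from the linked commit.

```python
import re

# Assumption: a <foreign> element and its enclosed text are dropped together.
# The non-greedy ".*?" stops at the first closing tag, so text that sits
# between two separate <foreign> elements is kept.
NON_GREEDY_RE = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)

# Greedy variant, shown only for contrast: ".*" runs to the LAST closing tag
# and swallows everything in between.
GREEDY_RE = re.compile(r"<foreign\b[^>]*>.*</foreign>", re.DOTALL)


def strip_foreign(xml_text: str, pattern: re.Pattern = NON_GREEDY_RE) -> str:
    """Return xml_text with every match of pattern removed."""
    return pattern.sub("", xml_text)


# Hypothetical sample line with two foreign-language spans.
sample = (
    'arma <foreign lang="greek">μῆνιν</foreign> virumque '
    '<foreign lang="greek">ἄειδε</foreign> cano'
)

print(strip_foreign(sample).split())             # non-greedy result
print(strip_foreign(sample, GREEDY_RE).split())  # greedy result, for contrast
```

With the greedy pattern the word between the two elements is lost along with them, which is the failure the non-greedy form avoids.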

