Thank you for your very promising answer. Would you please tell me/us how to cooperate on your project?
Best wishes,
guido milanese

On Friday, 21 November 2014 at 22:12:17 UTC+1, Ryan Baumann wrote:
>
> On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer wrote:
>>
>> On Friday, 21 November 2014 at 00:40:39 UTC+1, Guido Milanese wrote:
>>>
>>> I am a regular user of tesseract and it's an essential tool for my daily
>>> work, so thank you, before anything else. The support for Ancient Greek is
>>> simply superb -- works like a charm. I did not find support for Latin --
>>> I mean the Latin language, not the Latin alphabet. Is there any project for
>>> this?
>>>
>>> Thank you very much for your kind attention.
>>> guido, italy
>>>
>>
>> Would be nice to have for me too, because of old scientific (zoological,
>> botanic) texts, which mostly contain Latin and Greek besides the native
>> language.
>>
>> Do you have a good Latin dictionary for training?
>>
>> Helmut Wollmersdorfer
>>
>
> Coincidentally, I recently began looking into this for my own use. I
> decided the easiest course would probably be to adapt the excellent, open
> work done by Nick White for Ancient Greek. Unfortunately I'm not very far
> along yet, as part of the first steps are making sure I can correctly
> replicate the existing process for Ancient Greek on my own machine (the
> mftraining step in the grc repository seems to be taking quite some time).
>
> You can find my work-in-progress here:
> https://github.com/ryanfb/latinocr-lattraining
>
> Right now that should just build you (from the same Perseus sources):
>
> - training_text.txt
> - lat.word.txt
> - lat.freq.txt
> - lat.unicharambigs
> - lat.wordlist
>
> Note that this is very initial, as I've just trivially altered it at this
> point so that I can start figuring out what I need to clean up in the
> input/processing.
>
> Note also that there's a modification here in tools/wordlistfromperseus.sh
> to strip <foreign> tags instead of skipping files with foreign words
> altogether.
> I think this would help Ancient Greek as well (though how much
> it will improve or alter overall accuracy I don't know). For Greek, this
> change results in the wordlist being 7202347 lines for me instead of
> 5605967, or a 28% increase in the size of the corpus. I originally did this
> with Saxon/XSLT, but the processing was slow, so I switched to using Perl
> so I could apply a non-greedy regex substitution instead (which is much
> faster):
> https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
>
> -Ryan
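
For anyone following along, here is a minimal Python sketch of the kind of non-greedy substitution Ryan describes. It is not the Perl from the linked commit. It assumes that "stripping" means dropping each <foreign> element together with the text it encloses, and the names `strip_foreign` and `percent_increase` and the sample string are made up for illustration. It also recomputes the corpus growth from the two line counts quoted in the message.

```python
import re

# Assumption: each <foreign>...</foreign> element is removed together with
# its contents. The non-greedy ".*?" stops at the first closing tag, so the
# text between two separate <foreign> elements is preserved.
# Nested or self-closing <foreign/> elements are not handled in this sketch.
FOREIGN_ELEMENT = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)


def strip_foreign(xml_text: str) -> str:
    """Remove every <foreign> element (tags and contents) from xml_text."""
    return FOREIGN_ELEMENT.sub("", xml_text)


def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# Hypothetical sample line, not taken from the Perseus sources.
sample = (
    '<p>arma <foreign lang="greek">μῆνιν</foreign> virumque '
    "<foreign>ἄειδε</foreign> cano</p>"
)
print(strip_foreign(sample))

# Line counts quoted in the message: 5605967 before, 7202347 after.
print(percent_increase(5605967, 7202347))
```

A greedy `.*` in the same pattern would match from the first opening tag to the last closing tag and so would also delete "virumque" in the sample, which is presumably why the non-greedy form matters here.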

