Pull requests or patches are more than welcome, as I'm just getting familiar with the Tesseract training process myself. I've just pushed a few changes to get possibly-better output for the training_text and word/frequency files, but incorporating Latin-specific changes for unicharambigs may be something where someone with more domain-specific knowledge of both Latin and Tesseract will be able to do a better job than me. Due to the upcoming US holidays, I probably won't be able to do much more work on it this week.
Best,
-Ryan

On Saturday, November 22, 2014 4:15:12 AM UTC-5, Guido Milanese wrote:
>
> Thank you for your very promising answer. Would you please tell me/us how
> to co-operate in your project?
>
> Best wishes,
> guido milanese
>
> On Friday, 21 November 2014 22:12:17 UTC+1, Ryan Baumann wrote:
>>
>> On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer
>> wrote:
>>>
>>> On Friday, 21 November 2014 00:40:39 UTC+1, Guido Milanese wrote:
>>>>
>>>> I am a regular user of tesseract and it's an essential tool for my
>>>> daily work, so thank you, before anything else. The support for Ancient
>>>> Greek is simply superb -- works like a charm. I did not find support for
>>>> Latin -- I mean the Latin language, not the Latin alphabet. Is there any
>>>> project for this?
>>>>
>>>> Thank you very much for your kind attention.
>>>> guido, italy
>>>>
>>>
>>> Would be nice to have for me too, because of old scientific (zoological,
>>> botanic) texts, which mostly contain Latin and Greek besides the native
>>> language.
>>>
>>> Do you have a good Latin dictionary for training?
>>>
>>> Helmut Wollmersdorfer
>>>
>>
>> Coincidentally, I recently began looking into this for my own use. I
>> decided the easiest course would probably be to adapt the excellent, open
>> work done by Nick White for Ancient Greek. Unfortunately I'm not very far
>> along yet, as part of the first steps is making sure I can correctly
>> replicate the existing process for Ancient Greek on my own machine (the
>> mftraining step in the grc repository seems to be taking quite some time).
>>
>> You can find my work-in-progress here:
>> https://github.com/ryanfb/latinocr-lattraining
>>
>> Right now that should just build you (from the same Perseus sources):
>>
>> - training_text.txt
>> - lat.word.txt
>> - lat.freq.txt
>> - lat.unicharambigs
>> - lat.wordlist
>>
>> Note that this is very preliminary: so far I've only trivially altered
>> it so that I can start figuring out what I need to clean up in the
>> input/processing.
>>
>> Note also that there's a modification here in
>> tools/wordlistfromperseus.sh to strip <foreign> tags instead of skipping
>> files with foreign words altogether. I think this would help Ancient Greek
>> as well (though how much it will improve or alter overall accuracy I don't
>> know). For Greek, this change results in the wordlist being 7202347 lines
>> for me instead of 5605967, or a 28% increase in the size of the corpus. I
>> originally did this with Saxon/XSLT, but the processing was slow, so I
>> switched to using Perl so I could apply a non-greedy regex substitution
>> instead (which is much faster):
>> https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
>>
>> -Ryan
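For anyone who wants to experiment with the <foreign> handling described in the quoted message without reading the Perl, here is a rough Python sketch of a non-greedy regex substitution. It is not the code from the linked commit. The names strip_foreign and corpus_growth are made up for illustration, and the sketch assumes that "stripping" means dropping each <foreign> element together with its contents; the actual script may treat the contents differently.

```python
import re

# Non-greedy match of one <foreign ...>...</foreign> element, across newlines.
# Assumption: the element AND its contents are dropped; the real script in
# tools/wordlistfromperseus.sh may behave differently.
FOREIGN_RE = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)


def strip_foreign(xml_text: str) -> str:
    """Return xml_text with every <foreign> element removed."""
    return FOREIGN_RE.sub("", xml_text)


def corpus_growth(before: int, after: int) -> float:
    """Relative increase in wordlist size, as a fraction of the old size."""
    return (after - before) / before


if __name__ == "__main__":
    sample = (
        '<p>arma <foreign lang="greek">λόγος</foreign> virumque '
        "<foreign>a</foreign> cano</p>"
    )
    print(strip_foreign(sample))
    # Line counts quoted in the message above.
    print(f"{corpus_growth(5605967, 7202347):.1%}")
```

The non-greedy `.*?` is what matters here: a greedy `.*` would match from the first opening tag to the last closing tag in the file and throw away all of the text in between.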

