[tesseract-ocr] Re: Latin language

Ryan Baumann Mon, 24 Nov 2014 08:16:41 -0800

Pull requests or patches are more than welcome, as I'm just getting 
familiar with the Tesseract training process myself. I've just pushed a few 
changes to get possibly-better output for the training_text and 
word/frequency files, but incorporating Latin-specific changes for 
unicharambigs may be something where someone with more domain-specific 
knowledge of both Latin and Tesseract will be able to do a better job than 
me. Due to the upcoming US holidays, I probably won't be able to do much 
more work on it this week.


Best,
-Ryan

On Saturday, November 22, 2014 4:15:12 AM UTC-5, Guido Milanese wrote:
>
> Thank you for your very promising answer. Would you please tell me/us how 
> to co-operate in your project?
>
> Best wishes,
> guido milanese
>
> On Friday, November 21, 2014 22:12:17 UTC+1, Ryan Baumann wrote:
>>
>> On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer 
>> wrote:
>>>
>>>
>>>
>>> On Friday, November 21, 2014 00:40:39 UTC+1, Guido Milanese wrote:
>>>>
>>>> I am a regular user of tesseract and it's an essential tool for my 
>>>> daily work, so thank you, before anything else. The support for Ancient 
>>>> Greek is simply superb -- works like a charm. I did not find any support for 
>>>> Latin -- I mean the Latin language, not the Latin alphabet. Is there any 
>>>> project for this?
>>>>
>>>> Thank you very much for your kind attention.
>>>> guido, italy
>>>>
>>>
>>> Would be nice to have for me too, because of old scientific (zoological, 
>>> botanic) texts, which mostly contain Latin and Greek besides the native 
>>> language.
>>>
>>> Do you have a good Latin dictionary for training?
>>>
>>> Helmut Wollmersdorfer
>>>
>>
>> Coincidentally, I recently began looking into this for my own use. I 
>> decided the easiest course would probably be to adapt the excellent, open 
>> work done by Nick White for Ancient Greek. Unfortunately I'm not very far 
>> along yet, as one of the first steps is making sure I can correctly 
>> replicate the existing process for Ancient Greek on my own machine (the 
>> mftraining step in the grc repository seems to be taking quite some time).
>>
>> You can find my work-in-progress here: 
>> https://github.com/ryanfb/latinocr-lattraining
>>
>> Right now that should just build you (from the same Perseus sources):
>>
>> - training_text.txt
>> - lat.word.txt
>> - lat.freq.txt
>> - lat.unicharambigs
>> - lat.wordlist
>>
>> Note that this is very initial, as I've just trivially altered it at this 
>> point so that I can start figuring out what I need to clean up in the 
>> input/processing.
>>
>> Note also that there's a modification here in 
>> tools/wordlistfromperseus.sh to strip <foreign> tags instead of skipping 
>> files with foreign words altogether. I think this would help Ancient Greek 
>> as well (though how much it will improve or alter overall accuracy I don't 
>> know). For Greek, this change results in the wordlist being 7202347 lines 
>> for me instead of 5605967, or a 28% increase in the size of the corpus. I 
>> originally did this with Saxon/XSLT, but the processing was slow, so I 
>> switched to using Perl so I could apply a non-greedy regex substitution 
>> instead (which is much faster): 
>> https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
>>
>> -Ryan 
>>
>
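
The <foreign>-stripping change described in the quoted message can be sketched 
roughly as follows. This is not the Perl from the linked commit, only a Python 
illustration of the same idea. It assumes that "strip" means dropping each 
<foreign> element together with the text inside it, and the names 
strip_foreign and FOREIGN_RE are made up for this example.

```python
import re

# Assumption: "stripping <foreign> tags" means removing the whole element,
# including the foreign-language text it wraps. The non-greedy ".*?" stops at
# the first closing tag, so text between two separate elements is kept.
FOREIGN_RE = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)


def strip_foreign(xml_text: str) -> str:
    """Remove every <foreign>...</foreign> element from a string of XML."""
    return FOREIGN_RE.sub("", xml_text)


# Hypothetical input line, made up for illustration.
sample = (
    '<p>arma <foreign lang="greek">μῆνιν</foreign> virumque '
    '<foreign lang="greek">ἄειδε</foreign> cano</p>'
)
print(strip_foreign(sample))
```

With a greedy ".*" the match would run from the first opening tag to the last 
closing tag and swallow "virumque" as well, which is presumably why the 
non-greedy form matters. Nested or self-closing <foreign> elements are not 
handled by this sketch.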

