[tesseract-ocr] Re: Latin language

Guido Milanese Sat, 22 Nov 2014 01:15:34 -0800

Thank you for your very promising answer. Would you please tell me/us how to 
co-operate on your project?


Best wishes,
guido milanese

On Friday, 21 November 2014 22:12:17 UTC+1, Ryan Baumann wrote:
>
> On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer wrote:
>>
>>
>>
>> On Friday, 21 November 2014 00:40:39 UTC+1, Guido Milanese wrote:
>>>
>>> I am a regular user of tesseract and it's an essential tool for my daily 
>>> work, so thank you, before anything else. The support for Ancient Greek is 
>>> simply superb -- works like a charm. I did not find any support for Latin -- 
>>> I mean the Latin language, not the Latin alphabet. Is there any project for 
>>> this?
>>>
>>> Thank you very much for your kind attention.
>>> guido, italy
>>>
>>
>> This would be nice for me to have too, because of old scientific 
>> (zoological, botanical) texts, which mostly contain Latin and Greek 
>> alongside the native language.
>>
>> Do you have a good Latin dictionary for training?
>>
>> Helmut Wollmersdorfer
>>
>
> Coincidentally, I recently began looking into this for my own use. I 
> decided the easiest course would probably be to adapt the excellent, open 
> work done by Nick White for Ancient Greek. Unfortunately I'm not very far 
> along yet, as one of the first steps is making sure I can correctly 
> replicate the existing process for Ancient Greek on my own machine (the 
> mftraining step in the grc repository seems to be taking quite some time).
>
> You can find my work-in-progress here: 
> https://github.com/ryanfb/latinocr-lattraining
>
> Right now it should just build the following for you (from the same 
> Perseus sources):
>
> - training_text.txt
> - lat.word.txt
> - lat.freq.txt
> - lat.unicharambigs
> - lat.wordlist
>
> Note that this is very preliminary: I've only trivially altered it at this 
> point so that I can start figuring out what I need to clean up in the 
> input/processing.
>
> Note also that there's a modification here in tools/wordlistfromperseus.sh 
> to strip <foreign> tags instead of skipping files with foreign words 
> altogether. I think this would help Ancient Greek as well (though how much 
> it will improve or alter overall accuracy I don't know). For Greek, this 
> change results in the wordlist being 7202347 lines for me instead of 
> 5605967, or a 28% increase in the size of the corpus. I originally did this 
> with Saxon/XSLT, but the processing was slow, so I switched to using Perl 
> so I could apply a non-greedy regex substitution instead (which is much 
> faster): 
> https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
>
> -Ryan 
>
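
For readers following along, here is a minimal sketch of the <foreign> handling Ryan describes above, written in Python rather than the Perl used in the linked commit. The pattern, the function name strip_foreign, the keep_content switch, and the sample line are all assumptions made for illustration. The actual substitution in tools/wordlistfromperseus.sh may differ, in particular on whether the text inside the tag is kept or dropped.

```python
import re

# Non-greedy match: ".*?" stops at the first closing tag, so two <foreign>
# elements on the same line are handled separately instead of swallowing
# everything between the first opening tag and the last closing tag.
# Assumption: elements are not nested and are written as <foreign ...>...</foreign>.
FOREIGN_ELEMENT = re.compile(r"<foreign\b[^>]*>(.*?)</foreign>", re.DOTALL)


def strip_foreign(text, keep_content=False):
    """Remove <foreign> elements from text.

    With keep_content=False (the assumed default) the element and the words
    inside it are dropped; with keep_content=True only the tags are removed.
    """
    replacement = r"\1" if keep_content else ""
    return FOREIGN_ELEMENT.sub(replacement, text)


# Hypothetical sample line, not taken from the Perseus sources.
sample = (
    'arma <foreign lang="greek">μῆνιν</foreign> virumque '
    '<foreign lang="greek">ἄειδε</foreign> cano'
)

print(strip_foreign(sample).split())
print(strip_foreign(sample, keep_content=True).split())
```

A greedy ".*" in the same pattern would also delete "virumque" here, which is why the non-greedy form matters. As a check on the figures quoted above: (7202347 - 5605967) / 5605967 is about 0.285, which matches the stated increase of roughly 28%.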

