[tesseract-ocr] Re: Latin language

Ryan Baumann Tue, 16 Dec 2014 13:48:02 -0800

I've resumed work on this during the week, but the mftraining step is a 
bottleneck that makes the tweak/train/test/repeat feedback loop quite 
slow:


http://ryanfb.github.io/latinocr/

I've incorporated the Latin from Bruce Robertson's Greek/Latin spellcheck 
dictionary in his "rigaudon" OCR repository (
https://github.com/brobertson/rigaudon/), a process that might also be 
portable back to Ancient Greek (though not for frequency, as the Greek 
lacks frequency data). Right now I'm tweaking the process to try to see 
what works and what doesn't for various ligatures and the notoriously 
tricky long s. I've also updated the repos with conditional runtime code 
for running on a Mac, so that I won't have to spend as much time doing 
complicated branch management.
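
As a rough illustration of folding dictionary frequency data into the 
training word lists, here is a minimal sketch. It is not the code in the 
repository: the format of the rigaudon dictionary is not described here, so 
the sketch assumes both sources have already been parsed into word-to-count 
mappings, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Mapping


def merge_counts(corpus: Mapping[str, int],
                 dictionary: Mapping[str, int]) -> Counter:
    """Add dictionary counts to corpus counts (assumed pre-parsed inputs)."""
    merged = Counter(corpus)
    merged.update(dictionary)  # Counter.update adds counts, it does not replace
    return merged


def frequent_words(counts: Mapping[str, int], top_n: int) -> List[str]:
    """Return the top_n words, highest count first, ties broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


# Toy data, invented for the example.
merged = merge_counts({"et": 5, "in": 3}, {"et": 2, "ab": 4})
word_list = sorted(merged)             # candidate content for a full word list
top_words = frequent_words(merged, 2)  # candidate content for a frequency list
```

A source without frequency data (as with the Greek) could still contribute 
to the full word list this way, just not to the ranking.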

Also, if there are any particular (open/free) fonts that you think would be 
helpful with training for texts typically printed in Latin, I would love to 
hear about them so I can incorporate them into the training process. I've 
added Cardo (a free Bembo-style font with wide coverage) and some Fell 
fonts I came across (http://iginomarini.com/fell/the-revival-fonts/), as 
well as retaining some of the GFS fonts. Right now I'm not training on 
bold/italic variants until I'm pretty confident I've ironed out any other 
issues with the training process. I've also pulled macrons out of 
allchars.txt for the same reason, figuring I can add them back in later 
while leaving them on the tessedit_char_blacklist.
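
Pulling macrons out of a character inventory can be sketched as below. This 
is only an illustration under assumptions: it treats the inventory as a plain 
string of characters (the actual layout of allchars.txt is not shown here), 
and the helper names are hypothetical.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def has_macron(ch: str) -> bool:
    """True if ch is a combining macron or a letter carrying one."""
    # NFD splits precomposed letters: 'ā' (U+0101) becomes 'a' + U+0304.
    return COMBINING_MACRON in unicodedata.normalize("NFD", ch)


def split_macrons(chars: str):
    """Return (kept, removed): the inventory without macron characters,
    and the macron characters that were taken out."""
    kept = "".join(ch for ch in chars if not has_macron(ch))
    removed = "".join(ch for ch in chars if has_macron(ch))
    return kept, removed


# Toy inventory, invented for the example.
kept, removed = split_macrons("aāeēſæ")
```

The `removed` string is the kind of value that could later be supplied to 
tessedit_char_blacklist if the macrons are added back to the inventory.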

-Ryan

On Monday, November 24, 2014 11:16:01 AM UTC-5, Ryan Baumann wrote:
>
> Pull requests or patches are more than welcome, as I'm just getting 
> familiar with the Tesseract training process myself. I've just pushed a few 
> changes to get possibly-better output for the training_text and 
> word/frequency files, but incorporating Latin-specific changes for 
> unicharambigs may be something where someone with more domain-specific 
> knowledge of both Latin and Tesseract will be able to do a better job than 
> me. Due to the upcoming US holidays, I probably won't be able to do much 
> more work on it this week.
>
> Best,
> -Ryan
>
> On Saturday, November 22, 2014 4:15:12 AM UTC-5, Guido Milanese wrote:
>>
>> Thank you for your very promising answer. Would you please tell me/us how 
>> to co-operate in your project?
>>
>> Best wishes,
>> guido milanese
>>
>> On Friday, 21 November 2014 22:12:17 UTC+1, Ryan Baumann wrote:
>>>
>>> On Friday, November 21, 2014 6:44:46 AM UTC-5, Helmut Wollmersdorfer 
>>> wrote:
>>>>
>>>>
>>>>
>>>> On Friday, 21 November 2014 00:40:39 UTC+1, Guido Milanese wrote:
>>>>>
>>>>> I am a regular user of tesseract and it's an essential tool for my 
>>>>> daily work, so thank you, before anything else. The support for Ancient 
>>>>> Greek is simply superb -- works like a charm. I could not find support 
>>>>> for Latin -- I mean the Latin language, not the Latin alphabet. Is 
>>>>> there any project for this?
>>>>>
>>>>> Thank you very much for your kind attention.
>>>>> guido, italy
>>>>>
>>>>
>>>> This would be nice for me to have too, because of old scientific 
>>>> (zoological, botanical) texts, which mostly contain Latin and Greek 
>>>> besides the native language.
>>>>
>>>> Do you have a good Latin dictionary for training?
>>>>
>>>> Helmut Wollmersdorfer
>>>>
>>>
>>> Coincidentally, I recently began looking into this for my own use. I 
>>> decided the easiest course would probably be to adapt the excellent, open 
>>> work done by Nick White for Ancient Greek. Unfortunately I'm not very far 
>>> along yet, as one of the first steps is making sure I can correctly 
>>> replicate the existing process for Ancient Greek on my own machine (the 
>>> mftraining step in the grc repository seems to be taking quite some time).
>>>
>>> You can find my work-in-progress here: 
>>> https://github.com/ryanfb/latinocr-lattraining
>>>
>>> Right now that should just build the following for you (from the same 
>>> Perseus sources):
>>>
>>> - training_text.txt
>>> - lat.word.txt
>>> - lat.freq.txt
>>> - lat.unicharambigs
>>> - lat.wordlist
>>>
>>> Note that this is very preliminary: so far I've only trivially altered 
>>> it so that I can start figuring out what I need to clean up in the 
>>> input/processing.
>>>
>>> Note also that there's a modification here in 
>>> tools/wordlistfromperseus.sh to strip <foreign> tags instead of skipping 
>>> files with foreign words altogether. I think this would help Ancient Greek 
>>> as well (though how much it will improve or alter overall accuracy I don't 
>>> know). For Greek, this change results in the wordlist being 7202347 lines 
>>> for me instead of 5605967, or a 28% increase in the size of the corpus. I 
>>> originally did this with Saxon/XSLT, but the processing was slow, so I 
>>> switched to using Perl so I could apply a non-greedy regex substitution 
>>> instead (which is much faster): 
>>> https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
>>>
>>> -Ryan 
>>>
>>
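
The <foreign>-stripping change described in the quoted 21 November message 
can be sketched roughly as follows. The original used Perl; this is a Python 
illustration, not the code from the linked commit, and it assumes the intent 
is to drop each <foreign> element together with its content. The function 
name and sample text are hypothetical.

```python
import re

# Non-greedy '.*?' stops at the first closing tag, so two <foreign> elements
# in the same text are removed separately instead of as one long span that
# would swallow everything between them.
FOREIGN_RE = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)


def strip_foreign(xml_text: str) -> str:
    """Remove every <foreign>...</foreign> element, content included."""
    return FOREIGN_RE.sub("", xml_text)


# Invented sample, not taken from the Perseus sources.
sample = 'arma <foreign lang="greek">λόγος</foreign> virumque <foreign>x</foreign> cano'
cleaned = strip_foreign(sample)

# Relative growth of the Greek wordlist, using the line counts quoted above.
growth = (7202347 - 5605967) / 5605967
```

With a greedy `.*` the pattern would also delete "virumque" in the sample, 
which is why the non-greedy form matters here.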

