Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

O CR Fri, 10 Apr 2020 07:24:46 -0700

Thank you for responding.
I did the fine-tuning on the best Latin float model, and then I converted 
the model to integer. But it is still slower than the fast integer Latin 
model.
Any other ideas to make it faster?
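
For anyone following along, the fine-tune-then-convert sequence described 
above can be sketched roughly as below. This is only a sketch: the directory 
layout, the model name "names", the list file, and the iteration count are 
placeholders rather than the values actually used, and the run helper is a 
made-up convenience that only prints each command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: every path and name below is a hypothetical placeholder.
BEST_MODEL="tessdata_best/script/Latin.traineddata"  # assumed location of the float model
WORK_DIR="output"                                    # assumed training output directory
CHECKPOINT="$WORK_DIR/names_checkpoint"              # lstmtraining appends _checkpoint to --model_output

# Print each command; execute it only when RUN=1 (dry run by default).
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# 1. Extract the LSTM component from the best (float) model.
run combine_tessdata -e "$BEST_MODEL" "$WORK_DIR/Latin.lstm"

# 2. Fine-tune from that LSTM on the prepared line data.
run lstmtraining \
    --continue_from "$WORK_DIR/Latin.lstm" \
    --traineddata "$BEST_MODEL" \
    --train_listfile "$WORK_DIR/names.training_files.txt" \
    --model_output "$WORK_DIR/names" \
    --max_iterations 400

# 3. Stop training and write an integer traineddata file.
run lstmtraining --stop_training --convert_to_int \
    --continue_from "$CHECKPOINT" \
    --traineddata "$BEST_MODEL" \
    --model_output "$WORK_DIR/names_fast.traineddata"
```

Running the script as-is only echoes the three commands; setting RUN=1 in 
the environment executes them.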


On Friday, 10 April 2020 14:17:55 UTC+2, shree wrote:
>
> The file is probably there as script/Latin.traineddata. 
> You can copy it to wherever you are looking for the best traineddata files.
>
> On Fri, Apr 10, 2020, 16:59 O CR <[email protected]> wrote:
>
>> Which language do I have to use, given that Latin isn't supported?
>> ./tesstrain.sh --fonts_dir "/usr/share/fonts" *--lang Latin* 
>> --linedata_only  --noextract_font_properties --langdata_dir ./langdata 
>> --tessdata_dir ./tessdata  --output_dir ./output
>>
>> On Wednesday, 8 April 2020 18:27:15 UTC+2, shree wrote:
>>>
>>> I suggest you fine-tune Latin.traineddata using text of the kind you 
>>> expect. It will have a smaller unicharset, and when you convert it to a 
>>> fast integer model, it should be smaller in size.
>>>
>>> On Wed, Apr 8, 2020, 20:39 O CR <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to read names in images with Tesseract LSTM. Names like:
>>>>
>>>> Śerena Kovitch
>>>>
>>>> ŁAGUNA EVREIST
>>>>
>>>> Äna Optici
>>>>
>>>> Orğu Moninck
>>>>
>>>>
>>>> (I don't have to recognize words)
>>>>
>>>>
>>>> Latin.traineddata (fast integer) handles the diacritics well, but it 
>>>> contains a lot of characters I don't need, such as numbers, %, ﹕, ﹖, ﹗, 
>>>> ﹙, ﹚, ﹛, ﹜, ﹝, ﹞, ﹟, ﹠, ﹡, ﹢, ﹣, ﹤, ﹥, ﹦, ﹨, ﹩, ﹪, ﹫, and many more. 
>>>> As a result, Latin.traineddata is too slow.
>>>>
>>>> So I thought I would take eng.traineddata (best float for LSTM) and 
>>>> train it for the diacritics. But there are almost 400 diacritics, so I 
>>>> don't know whether fine-tuning for that many characters is a good idea.
>>>>
>>>> I tried it anyway, but the quality is very poor.
>>>>
>>>> I trained with eng.training_text (an English text of 72 lines), to which 
>>>> I added all the diacritics several times. The character error rate 
>>>> reported by lstmeval is around 0.1. I ran a test on 80 documents, each 
>>>> containing one name, and 30 names were read correctly. The processing 
>>>> time is similar to that of Latin.traineddata.
>>>>
>>>>
>>>> What can I do to get a model that is as good as Latin.traineddata on 
>>>> diacritics but much faster at OCR?
>>>>
>>>>
>>>> Thank you.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

