Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

Lorenzo Bolzani Fri, 10 Apr 2020 10:35:26 -0700

I thought this might lead to some insights useful for the OP, but as the
matter is getting more mysterious I'm opening a new thread so as not to
hijack this one.



Lorenzo


On Fri, 10 Apr 2020 at 17:27, Lorenzo Bolzani <
[email protected]> wrote:

> Hi,
> I started writing this email thinking that removing some characters should
> not make any real difference: I think the model parameters do not change
> with fine tuning and even when removing a few layers the bulk of the model
> remains the same.
>
> I decided to test it and I found a very strange thing. I have 14 custom
> trained models and I found out that 2/3 of these are twice as slow as the
> others.
>
> The slow ones are as slow as the standard ones, "eng", "spa", etc.
>
> I do not remember ever converting them to fast models. All the models are
> about 6.4MB, all trained with ocr-d (tesstrain).
>
> The speed difference is visible both from Python code (tesserocr API
> wrapper) and from the command line (I repeat the same recognition 100
> times, with one run as warm-up).
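The timing procedure described here (same recognition repeated 100 times,
first run discarded as warm-up) could be sketched roughly as below. This is
only an illustration, not the script actually used: the function names
benchmark and compare_models, the image path, the tessdata directory and the
model names are all hypothetical. Only PyTessBaseAPI, SetImageFile and
GetUTF8Text come from tesserocr.

```python
import statistics
import time


def benchmark(recognize, runs=100, warmup=1):
    """Call recognize() `runs` times; the first `warmup` calls are not timed.

    Returns (median_seconds, best_seconds) over the timed calls.
    """
    timings = []
    for i in range(runs):
        start = time.perf_counter()
        recognize()
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard warm-up runs (model loading, caches)
            timings.append(elapsed)
    return statistics.median(timings), min(timings)


def compare_models(image_path, tessdata_dir, model_names, runs=100):
    """Hypothetical helper: time each traineddata on the same image."""
    import tesserocr  # imported lazily so benchmark() works without it

    results = {}
    for name in model_names:
        with tesserocr.PyTessBaseAPI(path=tessdata_dir, lang=name) as api:

            def recognize():
                # Setting the image again forces a fresh recognition pass.
                api.SetImageFile(image_path)
                return api.GetUTF8Text()

            results[name] = benchmark(recognize, runs=runs)
    return results
```

Comparing the median rather than the mean keeps a single slow outlier from
hiding (or faking) a 2x difference between models.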
>
> The oldest ones (maybe trained with 4.0.0-beta?), from 2018, are generally
> faster except for one. All use a reduced charset but the size of the
> charset makes no difference.
>
> Any ideas?
>
>
> Bye
>
> Lorenzo
>
> On Wed, 8 Apr 2020 at 17:09, O CR <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I am trying to read names on images with Tesseract LSTM. Names like:
>>
>> Śerena Kovitch
>>
>> ŁAGUNA EVREIST
>>
>> Äna Optici
>>
>> Orğu Moninck
>>
>>
>> (I don't have to recognize words)
>>
>>
>> Latin.traineddata (fast integer) is doing well with the diacritics, but
>> there are a lot of characters I don't need like numbers, %, ﹕ ,﹖ ,﹗,﹙ ,﹚
>> ,﹛ ,﹜ ,﹝ ,﹞ ,﹟ ,﹠ ,﹡ ,﹢ ,﹣ ,﹤,﹥,﹦ ,﹨ ,﹩ ﹪ ,﹫, and many more. As a
>> result, Latin.traineddata is too slow.
>>
>> So I thought I would take eng.traineddata (best float for LSTM) and
>> fine-tune it for the diacritics. But there are almost 400 diacritics, so I
>> don't know whether fine-tuning for that many characters is a good idea.
>>
>> I tried it anyway, but the quality is very poor.
>>
>> I trained with eng.training_text (an English text of 72 lines), to which I
>> added all the diacritics several times. The character error rate reported
>> by lstmeval is around 0.1. I ran a test with 80 documents (one name per
>> document) and 30 names were read correctly. The recognition time is similar
>> to Latin.traineddata.
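For reference, 30 correct names out of 80 is an exact-match rate of 37.5%.
Both figures quoted here (whole-name accuracy and character error rate) can
be recomputed on one's own test set with something like the sketch below.
It is an assumption-laden illustration: edit_distance and evaluate are
hypothetical helper names, the error rate is a plain Levenshtein distance
divided by the ground-truth length, and it is not guaranteed to match what
lstmeval reports.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(pairs):
    """pairs: iterable of (ground_truth, ocr_output) strings.

    Returns (exact_match_rate, character_error_rate).
    """
    # Normalise to NFC so that a precomposed character and the equivalent
    # base letter + combining mark compare as equal.
    norm = [(unicodedata.normalize("NFC", t.strip()),
             unicodedata.normalize("NFC", o.strip())) for t, o in pairs]
    exact = sum(1 for t, o in norm if t == o)
    errors = sum(edit_distance(t, o) for t, o in norm)
    chars = sum(len(t) for t, _ in norm)
    return exact / len(norm), errors / chars
```

A name with a single dropped diacritic counts as a complete miss for the
exact-match rate but as only one character error, which is why a low
character error rate can coexist with few fully correct names.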
>>
>>
>> What can I do to get a model that is as good as Latin.traineddata on
>> diacritics but is much faster in ocr reading?
>>
>>
>> Thank you.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

