Re: [tesseract-ocr] As good as Latin.traineddata (fast integer) but faster

Lorenzo Bolzani Fri, 10 Apr 2020 08:27:45 -0700

 Hi,
I started writing this email thinking that removing some characters should
not make any real difference: I think the number of model parameters does
not change with fine-tuning, and even when removing a few layers the bulk of
the model remains the same.


I decided to test it and found something very strange. I have 14 custom
trained models, and roughly 2/3 of them are twice as slow as the others.

The slow ones are as slow as the standard ones, "eng", "spa", etc.

I do not remember ever converting them to fast models. All the models are
about 6.4MB, all trained with ocr-d (tesstrain).

The speed difference is visible both from Python code (tesserocr API wrapper)
and from the command line (I repeat the same recognition 100 times, with one
run as warmup).
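
A minimal sketch of this kind of timing loop, for anyone who wants to
reproduce the comparison. The names benchmark and make_recognizer are
hypothetical helpers, not part of tesserocr, and the default run and warmup
counts are only assumptions that mirror the description above. The only
tesserocr calls used are PyTessBaseAPI, SetImageFile and GetUTF8Text.

```python
import statistics
import time


def benchmark(recognize, runs=100, warmup=1):
    """Time recognize() over `runs` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        recognize()  # warm-up calls are executed but not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        recognize()
        timings.append(time.perf_counter() - start)
    return {
        "runs": len(timings),
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "min_s": min(timings),
    }


def make_recognizer(tessdata_dir, lang, image_path):
    """Return a zero-argument callable that runs one recognition pass."""
    import tesserocr  # imported lazily so benchmark() is usable without it

    api = tesserocr.PyTessBaseAPI(path=tessdata_dir, lang=lang)

    def recognize():
        # Set the image again on every call so the recognition is redone
        # instead of returning a cached result.
        api.SetImageFile(image_path)
        return api.GetUTF8Text()

    return recognize
```

Calling benchmark(make_recognizer("/path/to/tessdata", "eng", "sample.png"))
once per model (the path, model name and image name are placeholders) gives
numbers that can be compared directly between the fast and the slow models.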

The oldest ones (maybe trained with 4.0.0-beta?), from 2018, are generally
faster except for one. All use a reduced charset but the size of the
charset makes no difference.

Any ideas?


Bye

Lorenzo

On Wed, 8 Apr 2020 at 17:09, O CR <[email protected]> wrote:

> Hi all,
>
> I am trying to read names in images with Tesseract LSTM. Names like:
>
> Śerena Kovitch
>
> ŁAGUNA EVREIST
>
> Äna Optici
>
> Orğu Moninck
>
>
> (I don't have to recognize words)
>
>
> Latin.traineddata (fast integer) is doing well with the diacritics, but
> there are a lot of characters I don't need like numbers, %, ﹕ ,﹖ ,﹗,﹙ ,﹚
> ,﹛ ,﹜ ,﹝ ,﹞ ,﹟ ,﹠ ,﹡ ,﹢ ,﹣ ,﹤,﹥,﹦ ,﹨ ,﹩ ﹪ ,﹫, and many more. As a
> result, Latin.traineddata is too slow.
>
> So I thought I would take eng.traineddata (best float for LSTM) and
> fine-tune it on the diacritics. But there are almost 400 diacritics, so I
> don't know whether fine-tuning for that many characters is a good idea.
>
> However I tried it but the quality is very poor.
>
> I trained with eng.training_text (an English text of 72 lines), to which I
> added all the diacritics several times. The char error rate during lstmeval
> is around 0.1. In a test with 80 documents (one name per document), only 30
> names were read correctly. The recognition time is similar to
> Latin.traineddata.
>
>
> What can I do to get a model that is as good as Latin.traineddata on
> diacritics but is much faster in ocr reading?
>
>
> Thank you.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

