I thought this might lead to some insights useful for the OP, but as the matter is getting more mysterious I'm opening a new thread so as not to hijack the original one.
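For anyone who wants to reproduce the comparison, here is a minimal sketch of the kind of timing loop described in the quoted message below (one untimed warmup run, then 100 timed recognitions). It is not the exact script I used: the helper names `benchmark` and `time_model` are made up for illustration, and the model name and paths in the usage note are placeholders. It assumes the tesserocr wrapper is installed.

```python
import statistics
import time


def benchmark(fn, repeats=100, warmup=1):
    """Call fn() `warmup` times untimed, then time `repeats` further calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "runs": len(timings),
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


def time_model(lang, image_path, tessdata_dir, repeats=100):
    """Time one traineddata file on one image (hypothetical helper, needs tesserocr)."""
    from tesserocr import PyTessBaseAPI  # imported lazily: optional dependency

    with PyTessBaseAPI(path=tessdata_dir, lang=lang) as api:
        def recognize():
            # Re-set the image on every run so a previous result is not reused.
            api.SetImageFile(image_path)
            api.GetUTF8Text()

        return benchmark(recognize, repeats=repeats, warmup=1)
```

Calling something like `time_model("my_model", "sample.png", "/path/to/tessdata")` once per traineddata file and comparing the medians should show whether a model falls in the slow or the fast group. The timed section includes reading the image file, but that cost is the same for every model.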
Lorenzo

On Fri, 10 Apr 2020 at 17:27, Lorenzo Bolzani <[email protected]> wrote:

> Hi,
> I started writing this email thinking that removing some characters should
> not make any real difference: I think the model parameters do not change
> with fine tuning and even when removing a few layers the bulk of the model
> remains the same.
>
> I decided to test it and I found a very strange thing. I have 14 custom
> trained models and I found out that 2/3 of these are twice as slow as the
> others.
>
> The slow ones are as slow as the standard ones, "eng", "spa", etc.
>
> I do not remember ever converting them to fast models. All the models are
> about 6.4MB, all trained with ocr-d (tesstrain).
>
> The speed difference is visible from python code (tesserocr API wrapper)
> and from command line (I repeat the same recognition 100 times, with one as
> warmup).
>
> The oldest ones (maybe trained with 4.0.0-beta?), from 2018, are generally
> faster except for one. All use a reduced charset but the size of the
> charset makes no difference.
>
> Any ideas?
>
>
> Bye
>
> Lorenzo
>
> On Wed, 8 Apr 2020 at 17:09, O CR <[email protected]> wrote:
>
>> Hi all,
>>
>> I try to read names on images with tesseract LSTM. Names like:
>>
>> Śerena Kovitch
>>
>> ŁAGUNA EVREIST
>>
>> Äna Optici
>>
>> Orğu Moninck
>>
>>
>> (I don't have to recognize words)
>>
>>
>> Latin.traineddata (fast integer) is doing well with the diacritics, but
>> there are a lot of characters I don't need like numbers, %, ﹕ ,﹖ ,﹗,﹙ ,﹚
>> ,﹛ ,﹜ ,﹝ ,﹞ ,﹟ ,﹠ ,﹡ ,﹢ ,﹣ ,﹤,﹥,﹦ ,﹨ ,﹩ ﹪ ,﹫,and much more. And so
>> Latin.traineddata is too slow.
>>
>> So I thought I take eng.traineddata (best float for LSTM) and I train it
>> for the diacritics. But there are almost 400 diacritics. So I don't know if
>> fine-tuning for such amount of characters is a good idea?
>>
>> However I tried it but the quality is very poor.
>>
>> I trained with eng.training_text (an English text of 72 lines) and I added
>> all the diacritics several times. The char error rate during lstmeval is
>> around 0.1. I did a test with 80 documents, and I read 30 names correct.
>> (on each document there is one name). (time is similar to Latin.traineddata)
>>
>>
>> What can I do to get a model that is as good as Latin.traineddata on
>> diacritics but is much faster in ocr reading?
>>
>>
>> Thank you.

