Hi,

I started writing this email thinking that removing some characters should not make any real difference: as far as I know, the number of model parameters does not change with fine-tuning, and even when a few layers are removed the bulk of the model remains the same.
I decided to test it and found something very strange. I have 14 custom-trained models, and two thirds of them are twice as slow as the others. The slow ones are as slow as the standard ones ("eng", "spa", etc.). I do not remember ever converting any of them to fast models.

All the models are about 6.4 MB, and all were trained with ocr-d (tesstrain). The speed difference is visible both from Python code (the tesserocr API wrapper) and from the command line (I repeat the same recognition 100 times, with one run as warmup). The oldest ones, from 2018 (maybe trained with 4.0.0-beta?), are generally faster, except for one. All use a reduced charset, but the size of the charset makes no difference.

Any ideas?

Bye
Lorenzo

On Wed, 8 Apr 2020 at 17:09, O CR <[email protected]> wrote:

> Hi all,
>
> I am trying to read names in images with Tesseract LSTM. Names like:
>
> Śerena Kovitch
> ŁAGUNA EVREIST
> Äna Optici
> Orğu Moninck
>
> (I don't have to recognize words.)
>
> Latin.traineddata (fast integer) does well with the diacritics, but
> there are a lot of characters I don't need, like numbers, %, ﹕, ﹖, ﹗,
> ﹙, ﹚, ﹛, ﹜, ﹝, ﹞, ﹟, ﹠, ﹡, ﹢, ﹣, ﹤, ﹥, ﹦, ﹨, ﹩, ﹪, ﹫, and many more.
> As a result, Latin.traineddata is too slow.
>
> So I thought I would take eng.traineddata (best float for LSTM) and
> train it for the diacritics. But there are almost 400 diacritics, so I
> don't know whether fine-tuning for that many characters is a good idea.
>
> I tried it anyway, but the quality is very poor.
>
> I trained with eng.training_text (an English text of 72 lines), to which
> I added all the diacritics several times. The char error rate during
> lstmeval is around 0.1. I ran a test with 80 documents (one name per
> document) and read 30 names correctly. The recognition time is similar
> to Latin.traineddata.
>
> What can I do to get a model that is as good as Latin.traineddata on
> diacritics but much faster at OCR reading?
>
> Thank you.
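For anyone who wants to reproduce the timing comparison from my reply (100 recognitions per model, the first one discarded as warmup), here is a minimal sketch of how it could be done with tesserocr. It is not the exact script I ran: the function names `time_calls` and `compare_models` are made up for illustration, and the image path, model names and tessdata directory are placeholders you would have to supply. It only measures the difference and does not explain it.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], repeats: int = 100, warmup: int = 1) -> List[float]:
    """Call fn `repeats` times in total; the first `warmup` calls are not timed."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats - warmup):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def compare_models(image_path: str, models: List[str], tessdata_dir: str) -> Dict[str, float]:
    """Return the median recognition time in seconds for each model name."""
    # Imported here so that time_calls can be used without tesserocr installed.
    import tesserocr
    from PIL import Image

    image = Image.open(image_path)
    results = {}
    for name in models:
        with tesserocr.PyTessBaseAPI(path=tessdata_dir, lang=name) as api:

            def recognize() -> str:
                # Setting the image again forces a fresh recognition pass.
                api.SetImage(image)
                return api.GetUTF8Text()

            results[name] = statistics.median(time_calls(recognize))
    return results
```

A call would look something like `compare_models("sample.png", ["eng", "my_model"], "/path/to/tessdata")`, where all three arguments are placeholders. I use the median rather than the mean so that a single slow run does not distort the comparison.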
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLzCP36cc30L5eUSPHZKJA7N0RgAETHTW-eiCvzOyPu64A%40mail.gmail.com.

