Please see https://tesseract-ocr.github.io/tessdoc/Data-Files-in-tessdata_fast
It seems that Ray used a smaller network spec for many languages when training for tessdata_fast, to speed them up. However, since their float versions are not available, training has to be done using tessdata_best models. That might explain the result you got.

Fine-tuning for impact does not change the model itself. Plus-minus fine-tuning or replacing the top layer may do that.

On Fri, Apr 10, 2020, 19:54 O CR <[email protected]> wrote:

> Thank you for responding.
> I did the fine-tuning on the best Latin float model, and I converted the
> model to integer. But it's still slower than the fast integer Latin model.
> Any other ideas to make it faster?
>
> On Friday, April 10, 2020 at 14:17:55 UTC+2, shree wrote:
>>
>> The file is probably there as script/Latin.traineddata
>> You can copy it to wherever you are looking for the best traineddata files.
>>
>> On Fri, Apr 10, 2020, 16:59 O CR <[email protected]> wrote:
>>
>>> Which language do I have to use? Because Latin isn't supported.
>>> ./tesstrain.sh --fonts_dir "/usr/share/fonts" *--lang Latin* --linedata_only --noextract_font_properties --langdata_dir ./langdata --tessdata_dir ./tessdata --output_dir ./output
>>>
>>> On Wednesday, April 8, 2020 at 18:27:15 UTC+2, shree wrote:
>>>>
>>>> I suggest you fine-tune Latin.traineddata using text of the kind you
>>>> expect. It will have a smaller unicharset, and when you convert it to a
>>>> fast integer model, it should be smaller in size.
>>>>
>>>> On Wed, Apr 8, 2020, 20:39 O CR <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am trying to read names on images with Tesseract LSTM. Names like:
>>>>>
>>>>> Śerena Kovitch
>>>>>
>>>>> ŁAGUNA EVREIST
>>>>>
>>>>> Äna Optici
>>>>>
>>>>> Orğu Moninck
>>>>>
>>>>> (I don't have to recognize words)
>>>>>
>>>>> Latin.traineddata (fast integer) is doing well with the diacritics,
>>>>> but there are a lot of characters I don't need, like numbers, %, ﹕ ,﹖ ,﹗,﹙ ,﹚ ,﹛ ,﹜ ,﹝ ,﹞ ,﹟ ,﹠ ,﹡ ,﹢ ,﹣ ,﹤,﹥,﹦ ,﹨ ,﹩ ﹪ ,﹫, and many more.
>>>>> And so Latin.traineddata is too slow.
>>>>>
>>>>> So I thought I would take eng.traineddata (best float for LSTM) and train
>>>>> it for the diacritics. But there are almost 400 diacritics, so I don't
>>>>> know if fine-tuning for such a number of characters is a good idea.
>>>>>
>>>>> I tried it anyway, but the quality is very poor.
>>>>>
>>>>> I trained with eng.training_text (an English text of 72 lines) and I
>>>>> added all the diacritics several times. The char error rate during
>>>>> lstmeval is around 0.1. I did a test with 80 documents, and I read 30
>>>>> names correctly (on each document there is one name). The time is
>>>>> similar to Latin.traineddata.
>>>>>
>>>>> What can I do to get a model that is as good as Latin.traineddata on
>>>>> diacritics but is much faster in OCR reading?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVi7b5GeJYinwKfYBDcgKXY%3DOYzj%2B3%3DnFQbfS4UEjK0RQ%40mail.gmail.com.
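For reference, the workflow discussed in this thread (extract the float LSTM from a tessdata_best model, fine-tune it on your own text, then convert the checkpoint to an integer model) could be sketched roughly as below. This is only a sketch under assumptions: it builds and prints the command lines and does not run them, all paths and the `names` model prefix are hypothetical, and the iteration count is an arbitrary placeholder. The training list file name assumes tesstrain.sh was run with `--lang Latin --output_dir ./output` as in the quoted command.

```python
import shlex
from typing import List

# Hypothetical locations, chosen only for illustration.
BEST_MODEL = "tessdata_best/script/Latin.traineddata"
LIST_FILE = "output/Latin.training_files.txt"  # assumed tesstrain.sh output
PREFIX = "output/names"                        # hypothetical model prefix


def extract_lstm_cmd(traineddata: str, lstm_out: str) -> List[str]:
    # combine_tessdata -e extracts one component (here the float LSTM).
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_cmd(lstm: str, traineddata: str, list_file: str,
                 prefix: str, max_iterations: int) -> List[str]:
    # Fine-tuning for impact: continue from the extracted network and keep
    # the original traineddata, so the network itself is unchanged.
    return [
        "lstmtraining",
        "--continue_from", lstm,
        "--traineddata", traineddata,
        "--train_listfile", list_file,
        "--model_output", prefix,
        "--max_iterations", str(max_iterations),
    ]


def convert_to_int_cmd(checkpoint: str, traineddata: str,
                       out: str) -> List[str]:
    # Turn the float checkpoint into an integer traineddata file.
    return [
        "lstmtraining",
        "--stop_training",
        "--convert_to_int",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", out,
    ]


def build_pipeline(max_iterations: int = 400) -> List[List[str]]:
    # 400 iterations is a placeholder, not a recommendation.
    lstm = PREFIX + ".lstm"
    checkpoint = PREFIX + "_checkpoint"  # written by lstmtraining
    return [
        extract_lstm_cmd(BEST_MODEL, lstm),
        finetune_cmd(lstm, BEST_MODEL, LIST_FILE, PREFIX, max_iterations),
        convert_to_int_cmd(checkpoint, BEST_MODEL, PREFIX + "_int.traineddata"),
    ]


if __name__ == "__main__":
    for cmd in build_pipeline():
        print(shlex.join(cmd))
```

Note that the last step only changes the weights from float to integer. The network spec stays that of the tessdata_best model, which may be why the converted model is still slower than the tessdata_fast one.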

