Thank you for responding. I fine-tuned the best (float) Latin model and converted the result to an integer model, but it is still slower than the fast integer Latin model. Any other ideas to make it faster?
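To make "slower" concrete, here is a minimal sketch of how the two models could be timed on the same images. It assumes the `tesseract` binary is on the PATH; the helper names `build_command` and `time_model`, the image file names, and the model name `Latin_finetuned` are all hypothetical placeholders, not anything from the actual setup.

```python
import subprocess
import time


def build_command(image_path, lang, tessdata_dir, psm=None):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    cmd = [
        "tesseract", str(image_path), "stdout",
        "--tessdata-dir", str(tessdata_dir),
        "-l", lang,
    ]
    if psm is not None:
        # Page segmentation mode is optional; tesseract uses its default otherwise.
        cmd += ["--psm", str(psm)]
    return cmd


def time_model(image_paths, lang, tessdata_dir, psm=None):
    """Return total wall-clock seconds needed to OCR all images with one model."""
    start = time.perf_counter()
    for image_path in image_paths:
        subprocess.run(
            build_command(image_path, lang, tessdata_dir, psm),
            capture_output=True,
            check=True,  # raise if tesseract exits with an error
        )
    return time.perf_counter() - start


# Hypothetical usage (paths and model names are placeholders):
#   images = ["name_01.png", "name_02.png"]
#   print(time_model(images, "Latin", "./tessdata_fast/script"))
#   print(time_model(images, "Latin_finetuned", "./output"))
```

Each call starts a new tesseract process, so the totals include model loading time as well as recognition time; with only a few images per run that overhead can dominate the comparison.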

On Friday, 10 April 2020 14:17:55 UTC+2, shree wrote:
>
> The file is probably there as script/Latin.traineddata
> You can copy to wherever you are looking for the best traineddata files.
>
> On Fri, Apr 10, 2020, 16:59 O CR <[email protected]> wrote:
>
>> Which language do I have to use? Because Latin isn't supported.
>> ./tesstrain.sh --fonts_dir "/usr/share/fonts" *--lang Latin*
>> --linedata_only --noextract_font_properties --langdata_dir ./langdata
>> --tessdata_dir ./tessdata --output_dir ./output
>>
>> On Wednesday, 8 April 2020 18:27:15 UTC+2, shree wrote:
>>>
>>> I suggest you fine-tune Latin.traineddata using text of the kind you
>>> expect. It will have a smaller unicharset and when you convert to fast
>>> integer model, it should be smaller in size.
>>>
>>> On Wed, Apr 8, 2020, 20:39 O CR <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I try to read names on images with tesseract LSTM. Names like:
>>>>
>>>> Śerena Kovitch
>>>>
>>>> ŁAGUNA EVREIST
>>>>
>>>> Äna Optici
>>>>
>>>> Orğu Moninck
>>>>
>>>> (I don't have to recognize words)
>>>>
>>>> Latin.traineddata (fast integer) is doing well with the diacritics, but
>>>> there are a lot of characters I don't need like numbers, %, ﹕ ,﹖ ,﹗,﹙
>>>> ,﹚ ,﹛ ,﹜ ,﹝ ,﹞ ,﹟ ,﹠ ,﹡ ,﹢ ,﹣ ,﹤,﹥,﹦ ,﹨ ,﹩ ﹪ ,﹫,and much more. And so
>>>> Latin.traineddata is too slow.
>>>>
>>>> So I thought I take eng.traineddata (best float for LSTM) and I train
>>>> it for the diacritics. But there are almost 400 diacritics. So I don't
>>>> know if fine-tuning for such amount of characters is a good idea?
>>>>
>>>> However I tried it but the quality is very poor.
>>>>
>>>> I trained with eng.training_text (a English text of 72 lines) and I
>>>> added all the diacritics several times. The char error rate during
>>>> lstmeval is around 0.1. I did a test with 80 documents, and I read 30
>>>> names correct. (on each document there is one name).
>>>> (time is similar to Latin.traineddata)
>>>>
>>>> What can I do to get a model that is as good as Latin.traineddata on
>>>> diacritics but is much faster in ocr reading?
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/b9ddf333-1229-45d3-9a02-809973294a47%40googlegroups.com

