> You can experiment and see if it is better.

I think I'll do just that, thanks for the idea.
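To compare the legacy and LSTM models on the same images, the recognized text can be scored against the ground truth line by line. Below is a minimal sketch in Python, assuming a plain Levenshtein-based character and word error rate. The function names `edit_distance` and `error_rates` are made up for illustration, and the numbers may not match what lstmeval reports, since its exact definition of the rates may differ. The sample strings are a truth line and a tesseract output line taken from the original post.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion from ref
                           cur[j - 1] + 1,       # insertion into ref
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def error_rates(truth, ocr):
    """Return (char error %, word error %) relative to the truth length."""
    cer = 100.0 * edit_distance(truth, ocr) / max(len(truth), 1)
    truth_words, ocr_words = truth.split(), ocr.split()
    wer = 100.0 * edit_distance(truth_words, ocr_words) / max(len(truth_words), 1)
    return cer, wer


truth = "6CUEN 6 CU EN"
print(error_rates(truth, "6CUEN 6 CU EN"))  # identical strings
print(error_rates(truth, "6CUEN 1 CU EN"))  # one character differs
```

Running the same scoring over the output of each model (for example the text printed by `tesseract ... stdout -l iqi`) gives one comparable number per model and per page.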
2018-06-01 10:29 GMT+02:00 ShreeDevi Kumar <[email protected]>:

> From what I understand from the documentation provided by Ray Smith
> regarding LSTM training, the models have been trained on hundreds of
> thousands of lines and hundreds of fonts. The network spec used for
> training from scratch will therefore be optimized for such large models.
>
> You seem to have a different requirement, hence I suggested building the
> legacy tesseract model.
>
> You can experiment and see if it is better.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Fri, Jun 1, 2018 at 12:23 PM, Julien Jemine <[email protected]>
> wrote:
>
>> Hi Shree,
>>
>> Thanks for your answer.
>> If you don't mind, could you explain why it'd be better?
>>
>> On Thursday, 31 May 2018 at 17:25:47 UTC+2, shree wrote:
>>>
>>> >I've trained an LSTM model for a custom language from scratch as
>>> explained here
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>
>>> >The language only has about 100 words and 17 characters, so it's pretty
>>> simple.
>>>
>>> For such a small model, try to build the legacy version rather than LSTM.
>>>
>>> $tesstrain_dir/tesstrain.sh \
>>>   --lang $Lang \
>>>   --exposures "0" \
>>>   --fonts_dir $fonts_dir \
>>>   --fontlist $fonts_for_training \
>>>   --langdata_dir $langdata_dir \
>>>   --tessdata_dir $tessdata_dir \
>>>   --training_text $langdata_dir/$Lang/$Lang.training_text \
>>>   --output_dir $train_output_dir
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Thu, May 31, 2018 at 3:43 PM, Julien Jemine <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've trained an LSTM model for a custom language from scratch as
>>>> explained here
>>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>>
>>>> The language only has about 100 words and 17 characters, so it's pretty
>>>> simple.
>>>>
>>>> When I run lstmeval on my model, I get a perfect match:
>>>> [icm@u16-offcao-07] train1$ lstmeval --model /home/icm/share/tessdata/iqi.traineddata --eval_listfile iqitrain2/iqi.training_files.txt --verbosity 2
>>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Arial.exp0.lstmf
>>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Calibri.exp0.lstmf
>>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> Truth:6CUEN 6 CU EN
>>>> OCR :6CUEN 6 CU EN
>>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.lstmf
>>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> Truth:6CUEN 6 CU EN
>>>> OCR :6CUEN 6 CU EN
>>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Verdana.exp0.lstmf
>>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> Truth:6CUEN 6 CU EN
>>>> OCR :6CUEN 6 CU EN
>>>> Truth:6CUEN 6 CU EN
>>>> OCR :6CUEN 6 CU EN
>>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>
>>>> However, when I put my iqi.traineddata file in my tessdata folder and
>>>> try to run tesseract on the same tif file, I get errors:
>>>> [icm@u16-offcao-07] train1$ tesseract iqitrain2/iqi.training_img.txt stdout -l iqi
>>>> Page 0 : /home/icm/train1/iqitrain2/iqi.Arial.exp0.tif
>>>> 6CFEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN
>>>> 6CUEN 1 CU EN
>>>> Page 1 : /home/icm/train1/iqitrain2/iqi.Calibri.exp0.tif
>>>>
>>>> 6CM 10FEEN 0 6 FEE 13CUEN 11 6 FE EEN 1116
>>>> 6UEN 16 FE
>>>> Page 2 : /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.tif
>>>>
>>>> 6TM 13CUEN 13 1 EN 11CUE 11 CU EN 12B 11 16
>>>> 6 6 CU EN
>>>> Page 3 : /home/icm/train1/iqitrain2/iqi.Verdana.exp0.tif
>>>>
>>>> ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16
>>>> 6CUEN 6 CU EN
>>>>
>>>> Now the really frustrating part: I have the opposite phenomenon with
>>>> the "eng" language! (with eng.traineddata taken from tessdata_best)
>>>> lstmeval gives me a few errors (Eval Char error rate=2.4665552, Word
>>>> error rate=16.666667)
>>>> tesseract gives me the right answer! (But the images are generated with
>>>> tesstrain.sh and very common fonts, it's probably to be expected).
>>>>
>>>> Am I doing something wrong?
>>>> What's going on here?
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/67286720-c624-4239-a812-3c76d7603cf1%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/67286720-c624-4239-a812-3c76d7603cf1%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>> For more options, visit https://groups.google.com/d/optout.

