Hi Shree,

Thanks for your answer. If you don't mind, could you explain why the legacy version would be better than LSTM for such a small model?

On Thursday, 31 May 2018 at 17:25:47 UTC+2, shree wrote:
>
> > I've trained a LSTM model for a custom language from scratch as explained
> > here
> > <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
> >
> > The language only has about 100 words and 17 characters, so it's pretty
> > simple.
>
> For such a small model, try to build the legacy version rather than LSTM.
>
> $tesstrain_dir/tesstrain.sh \
>   --lang $Lang \
>   --exposures "0" \
>   --fonts_dir $fonts_dir \
>   --fontlist $fonts_for_training \
>   --langdata_dir $langdata_dir \
>   --tessdata_dir $tessdata_dir \
>   --training_text $langdata_dir/$Lang/$Lang.training_text \
>   --output_dir $train_output_dir
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, May 31, 2018 at 3:43 PM, Julien Jemine <[email protected]> wrote:
>
>> Hi,
>>
>> I've trained a LSTM model for a custom language from scratch as explained
>> here
>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>
>> The language only has about 100 words and 17 characters, so it's pretty
>> simple.
>>
>> When I run lstmeval on my model, I get a perfect match:
>>
>> [icm@u16-offcao-07] train1$ lstmeval --model /home/icm/share/tessdata/iqi.traineddata --eval_listfile iqitrain2/iqi.training_files.txt --verbosity 2
>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Arial.exp0.lstmf
>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Calibri.exp0.lstmf
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> Truth:6CUEN 6 CU EN
>> OCR :6CUEN 6 CU EN
>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.lstmf
>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> Truth:6CUEN 6 CU EN
>> OCR :6CUEN 6 CU EN
>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Verdana.exp0.lstmf
>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> Truth:6CUEN 6 CU EN
>> OCR :6CUEN 6 CU EN
>> Truth:6CUEN 6 CU EN
>> OCR :6CUEN 6 CU EN
>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> OCR :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>
>> However, when I put my iqi.traineddata file in my tessdata folder and try
>> to run tesseract on the same tif file, I get errors:
>>
>> [icm@u16-offcao-07] train1$ tesseract iqitrain2/iqi.training_img.txt stdout -l iqi
>> Page 0 : /home/icm/train1/iqitrain2/iqi.Arial.exp0.tif
>> 6CFEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN
>> 6CUEN 1 CU EN
>> Page 1 : /home/icm/train1/iqitrain2/iqi.Calibri.exp0.tif
>>
>> 6CM 10FEEN 0 6 FEE 13CUEN 11 6 FE EEN 1116
>> 6UEN 16 FE
>> Page 2 : /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.tif
>>
>> 6TM 13CUEN 13 1 EN 11CUE 11 CU EN 12B 11 16
>> 6 6 CU EN
>> Page 3 : /home/icm/train1/iqitrain2/iqi.Verdana.exp0.tif
>>
>> ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16
>> 6CUEN 6 CU EN
>>
>> Now the really frustrating part: I have the opposite phenomenon with the
>> "eng" language! (with eng.traineddata taken from tessdata_best)
>> lstmeval gives me a few errors (Eval Char error rate=2.4665552, Word
>> error rate=16.666667)
>> tesseract gives me the right answer! (But the images are generated with
>> tesstrain.sh and very common fonts, it's probably to be expected).
>>
>> Am I doing something wrong?
>> What's going on here?

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/827ac3ce-21dc-448b-901c-28faea02cfa0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

