Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.
> I have fixed the Langdata folder now. And also the previous files are different from the file now.

Look at the error messages. Search for 'Failed'. You now have more langdata related errors.
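In case it helps, here is a minimal sketch of the "search for 'Failed'" step. It assumes you have saved the console output to a file first; the file name train.log and the helper name find_failures are only illustrative and do not come from the thread.

```shell
#!/bin/sh
# find_failures LOGFILE
# Print every line of a saved training log that contains "Failed",
# prefixed with its line number in the log (grep -n).
# Hypothetical helper: the name and the log file are assumptions.
find_failures() {
    grep -n 'Failed' "$1"
}

# Possible usage (train.log is an assumed name):
#   training/tesstrain.sh <your usual options> 2>&1 | tee train.log
#   find_failures train.log
```

Each match should point at a file the script could not load, so the paths printed after "Failed" are the first thing to check.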
Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.
Thanks a lot for replying, Shree. I will be asking more doubts in future because of people like you. I'll get back to you if the problem still exists. Thanks a lot.

Regards,
Adarsh Shukla
Junior Developer Trainee, Turning Cloud Solutions

On Thu, Feb 15, 2018 at 1:34 PM, ShreeDevi Kumar wrote:
> You are missing langdata files
>
> Failed to load script unicharset from:/home/adarsh/tesseract/langdata/Latin.unicharset
> Failed to read data from: /home/adarsh/tesseract/langdata/radical-stroke.txt
> Error reading radical code table /home/adarsh/tesseract/langdata/radical-stroke.txt
>
> Even after you fix the above, this is only first step of LSTM training process.
> It creates a starter traineddata and lstmf files to be used by lstmtraining.
> The starter traineddata cannot be used to OCR.
>
> Please read wiki pages regarding training 4.0
Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.
You are missing langdata files:

Failed to load script unicharset from:/home/adarsh/tesseract/langdata/Latin.unicharset
Failed to read data from: /home/adarsh/tesseract/langdata/radical-stroke.txt
Error reading radical code table /home/adarsh/tesseract/langdata/radical-stroke.txt

Even after you fix the above, this is only the first step of the LSTM training process. It creates a starter traineddata and lstmf files to be used by lstmtraining. The starter traineddata cannot be used for OCR.

Please read the wiki pages regarding training 4.0.

ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Feb 15, 2018 at 12:52 PM, wrote:
> adarsh@adarsh-X555LJ:~/tesseract$ training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --noextract_font_properties --langdata_dir /home/adarsh/tesseract/langdata --training_text /home/adarsh/tesseract/langdata/eng/eng.training_text --linedata_only --tessdata_dir /home/tessdata/tessdata --output_dir ~/tesstutorial/engtrain --overwrite
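To check the two files named in the 'Failed' messages of this reply before re-running tesstrain.sh, something like the sketch below may be enough. It only tests that the files are readable in the directory passed to --langdata_dir; the function name check_langdata is made up for illustration, and the file list covers just the two files from the errors quoted here, not everything training might need.

```shell
#!/bin/sh
# check_langdata DIR
# Report whether the two files from the error messages are readable in DIR.
# Returns 0 if both are present, 1 if at least one is missing.
# Hypothetical helper: not part of Tesseract or its training scripts.
check_langdata() {
    dir=$1
    missing=0
    for f in Latin.unicharset radical-stroke.txt; do
        if [ -r "$dir/$f" ]; then
            echo "ok      $dir/$f"
        else
            echo "MISSING $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage, with the directory from the command in this thread:
#   check_langdata /home/adarsh/tesseract/langdata
```

A clean result here only means these two errors should go away; as noted above, the output of tesstrain.sh is still just the starting point for lstmtraining.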
[tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.
adarsh@adarsh-X555LJ:~/tesseract$ training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --noextract_font_properties --langdata_dir /home/adarsh/tesseract/langdata --training_text /home/adarsh/tesseract/langdata/eng/eng.training_text --linedata_only --tessdata_dir /home/tessdata/tessdata --output_dir ~/tesstutorial/engtrain --overwrite

=== Starting training for language 'eng'
[Thu Feb 15 11:56:06 IST 2018] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts --font=Arial Bold --outputbase=/tmp/font_tmp.zQ3JffkHYN/sample_text.txt --text=/tmp/font_tmp.zQ3JffkHYN/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN
Rendered page 0 to file /tmp/font_tmp.zQ3JffkHYN/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial Bold
Rendering using Arial Italic
Rendering using Arial
Rendering using Courier New Bold Italic
Rendering using Courier New
Rendering using Courier New Italic
Rendering using Courier New Bold
Rendering using Arial Bold Italic
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold_Italic.exp0 --max_pages=3 --font=Courier New Bold Italic --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial.exp0 --max_pages=3 --font=Arial --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Italic.exp0 --max_pages=3 --font=Arial Italic --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold.exp0 --max_pages=3 --font=Arial Bold --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New.exp0 --max_pages=3 --font=Courier New --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold.exp0 --max_pages=3 --font=Courier New Bold --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold_Italic.exp0 --max_pages=3 --font=Arial Bold Italic --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Italic.exp0 --max_pages=3 --font=Courier New Italic --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Italic.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold_Italic.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold_Italic.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial.exp0.tif
Rendered page 1 to file