Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.

2018-02-15 Thread ShreeDevi Kumar
> I have fixed the Langdata folder now. And also the previous files are
> different from the file now.

Look at the error messages.
Search for 'Failed'

You now have more langdata related errors.
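
One way to follow this advice is to save the tesstrain.sh output to a file and filter it. The sketch below is only illustrative: the `list_failures` helper name is made up for this example, and the sample log is assembled from messages quoted in this thread rather than from a full run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line of a saved training log that
# reports a failure, prefixed with its line number in the log.
list_failures() {
  grep -n -E 'Failed|Error' "$1"
}

# Small sample log built from messages quoted in this thread.
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
=== Phase I: Generating training images ===
Rendering using Arial Bold
Failed to load script unicharset from:/home/adarsh/tesseract/langdata/Latin.unicharset
Failed to read data from: /home/adarsh/tesseract/langdata/radical-stroke.txt
Error reading radical code table /home/adarsh/tesseract/langdata/radical-stroke.txt
EOF

list_failures "$sample_log"
```

For a real run, the output would first need to be captured, for example by appending `2>&1 | tee train.log` to the tesstrain.sh command and then calling `list_failures train.log`.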



Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.

2018-02-15 Thread Adarsh Shukla
Thanks a lot for replying, Shree.
I will be asking more questions in the future because of people like you.
I'll get back to you if the problem still exists. Thanks a lot.

Regards

Adarsh



Re: [tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.

2018-02-15 Thread ShreeDevi Kumar
You are missing langdata files

Failed to load script unicharset from:/home/adarsh/tesseract/langdata/Latin.unicharset

Failed to read data from: /home/adarsh/tesseract/langdata/radical-stroke.txt
Error reading radical code table /home/adarsh/tesseract/langdata/radical-stroke.txt
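
A quick pre-flight check can confirm that these files exist before tesstrain.sh is run again. This is a minimal sketch: the two top-level file names come from the error messages above and the training text path from the original command, while the `check_langdata` helper is hypothetical. Other langdata files may also be required.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which expected langdata files are missing.
# Usage: check_langdata LANGDATA_DIR LANG
check_langdata() {
  local dir="$1" lang="$2" missing=0 f
  for f in "Latin.unicharset" "radical-stroke.txt" "$lang/$lang.training_text"; do
    if [ -f "$dir/$f" ]; then
      echo "ok:      $dir/$f"
    else
      echo "MISSING: $dir/$f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]   # return status is 0 only if nothing is missing
}

# Default path taken from the command in this thread; adjust for your setup.
check_langdata "${LANGDATA_DIR:-/home/adarsh/tesseract/langdata}" eng \
  || echo "Fix the missing files before running tesstrain.sh."
```

The function returns a non-zero status when anything is missing, so it can also gate a training script with `check_langdata "$dir" eng && training/tesstrain.sh ...`.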

Even after you fix the above, this is only the first step of the LSTM
training process.

It creates a starter traineddata and lstmf files to be used by
lstmtraining.

The starter traineddata cannot be used for OCR.

Please read the wiki pages on training 4.0.
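
For orientation, the follow-up steps might look roughly like the sketch below. It is not a verified recipe: the flags and the network spec are recalled from the Tesseract 4.0 training wiki and should be checked against `lstmtraining --help`, the `OUT_DIR` location and the iteration count are hypothetical, and the file names under `TRAIN_DIR` assume the usual layout that tesstrain.sh writes with `--linedata_only`. By default the script only prints the commands; setting `RUN=1` executes them.

```shell
#!/usr/bin/env bash
# Sketch of the two lstmtraining steps that follow tesstrain.sh --linedata_only.
TRAIN_DIR="${TRAIN_DIR:-$HOME/tesstutorial/engtrain}"  # --output_dir used in this thread
OUT_DIR="${OUT_DIR:-$HOME/tesstutorial/engoutput}"     # hypothetical output location

# Print a command; execute it only when RUN=1 is set in the environment.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Step 1: train a model from the starter traineddata and the lstmf file list.
train_step() {
  run lstmtraining \
    --traineddata "$TRAIN_DIR/eng/eng.traineddata" \
    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
    --model_output "$OUT_DIR/base" \
    --train_listfile "$TRAIN_DIR/eng.training_files.txt" \
    --max_iterations 5000
}

# Step 2: convert the last checkpoint into a traineddata usable for OCR.
finalize_step() {
  run lstmtraining --stop_training \
    --continue_from "$OUT_DIR/base_checkpoint" \
    --traineddata "$TRAIN_DIR/eng/eng.traineddata" \
    --model_output "$OUT_DIR/eng.traineddata"
}

train_step
finalize_step
```

Only the file written by the second step is meant to be passed to tesseract for recognition; the starter traineddata under `TRAIN_DIR` is an input to training.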



ShreeDevi



[tesseract-ocr] Error in training Tesseract 4.0. Training gets completed somehow but then the output it gives after reading the pdf is incorrect.

2018-02-14 Thread adarsh
adarsh@adarsh-X555LJ:~/tesseract$ training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang eng \
  --noextract_font_properties \
  --langdata_dir /home/adarsh/tesseract/langdata \
  --training_text /home/adarsh/tesseract/langdata/eng/eng.training_text \
  --linedata_only \
  --tessdata_dir /home/tessdata/tessdata \
  --output_dir ~/tesstutorial/engtrain \
  --overwrite

=== Starting training for language 'eng'
[Thu Feb 15 11:56:06 IST 2018] /usr/local/bin/text2image 
--fonts_dir=/usr/share/fonts --font=Arial Bold 
--outputbase=/tmp/font_tmp.zQ3JffkHYN/sample_text.txt 
--text=/tmp/font_tmp.zQ3JffkHYN/sample_text.txt 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN
Rendered page 0 to file /tmp/font_tmp.zQ3JffkHYN/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial Bold
Rendering using Arial Italic
Rendering using Arial
Rendering using Courier New Bold Italic
Rendering using Courier New
Rendering using Courier New Italic
Rendering using Courier New Bold
Rendering using Arial Bold Italic
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold_Italic.exp0 
--max_pages=3 --font=Courier New Bold Italic 
--text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial.exp0 --max_pages=3 
--font=Arial --text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Italic.exp0 --max_pages=3 
--font=Arial Italic 
--text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold.exp0 --max_pages=3 
--font=Arial Bold 
--text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New.exp0 --max_pages=3 
--font=Courier New 
--text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold.exp0 
--max_pages=3 --font=Courier New Bold 
--text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold_Italic.exp0 
--max_pages=3 --font=Arial Bold Italic 
--text=/home/adarsh/tesseract/langdata/eng/eng.training_text
[Thu Feb 15 11:56:27 IST 2018] /usr/local/bin/text2image 
--fontconfig_tmpdir=/tmp/font_tmp.zQ3JffkHYN --fonts_dir=/usr/share/fonts 
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 
--outputbase=/tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Italic.exp0 
--max_pages=3 --font=Courier New Italic 
--text=/home/adarsh/tesseract/langdata/eng/eng.training_text
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Italic.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New.exp0.tif
Rendered page 0 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Italic.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Courier_New_Bold_Italic.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial_Bold_Italic.exp0.tif
Rendered page 1 to file /tmp/tmp.kisZVM4Xbo/eng/eng.Arial.exp0.tif
Rendered page 1 to file