Training Tesseract 4.0.0 is different from the process for 3.0x.

Training using images is not supported for Tesseract 4.0.0.

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
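
For orientation, the fine-tuning flow described on that wiki page can be sketched roughly as below. This is only a sketch under assumptions, not a tested recipe: the paths, the iteration count, and the `run` and `finetune` helper names are made up for illustration, the traineddata file is assumed to be a float model (for example from tessdata_best), and the line data is assumed to have been rendered already by `tesstrain.sh --linedata_only`. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of LSTM fine-tuning for Tesseract 4.0. All paths are assumptions.
# Commands are printed by default; set DRY_RUN=0 to actually run them.

# run CMD ARGS...: print the command (dry run) or execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# finetune LANG OUT_DIR TRAINEDDATA (hypothetical helper)
finetune() {
  ft_lang=$1
  ft_out=$2
  ft_data=$3
  # 1. Extract the LSTM model from the existing traineddata file.
  run combine_tessdata -e "$ft_data" "$ft_out/$ft_lang.lstm"
  # 2. Continue training from that model on the rendered line data.
  run lstmtraining \
    --model_output "$ft_out/finetuned" \
    --continue_from "$ft_out/$ft_lang.lstm" \
    --traineddata "$ft_data" \
    --train_listfile "$ft_out/$ft_lang.training_files.txt" \
    --max_iterations 400
}

finetune kor "$HOME/tesstutorial/kor" "$HOME/tessdata_best/kor.traineddata"
```

The dry-run default is deliberate: it lets you check the generated paths against what `tesstrain.sh` actually wrote before starting a long training run.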

On Thu 5 Apr, 2018, 1:36 AM Fanatico, <fanatico.s...@gmail.com> wrote:

> Hi, I'm new to Tesseract and OCR in general, and need some help training
> Tesseract.
>
> Config
> Platform: Mac OS X 10.13.3
> Tesseract Version: 4.0.0-beta.1
> leptonica: 1.75.3
>   libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>
> images used
>
> kor.AppleMyungjo.exp1.tif
>
>
> <https://lh3.googleusercontent.com/-HfEWwZudKjE/WsUtd-CH2iI/AAAAAAAAHig/u_gQpXArU4cU4jREJJegB2dIjo3tqv3lwCLcBGAs/s1600/kor.AppleMyungjo.exp1.tif>
>
>
> kor.AppleMyungjo.exp0.tif
>
>
> <https://lh3.googleusercontent.com/-OGn-qgzxBgE/WsUr2NKqeBI/AAAAAAAAHiQ/aZ7PnPiB7qwHvyXTGb-wHVyGJ4Gs-N9GwCLcBGAs/s1600/kor.AppleMyungjo.exp0.tif>
>
>
> Step by step
> I'm trying to train (fine-tune) Tesseract to better detect commas (")
> and dots (.) in Korean, but I'm getting some errors. Here is what I have
> done so far:
>
> 1 - Got the images. I'm using 2 .tif images (both images have only 1 line
> and a few characters)
> 2 - Renamed the images to kor.AppleMyungjo.exp0.tif and
> kor.AppleMyungjo.exp1.tif
> 3 - Created the .box file for each image ```tesseract
> [language].[fontname].exp[samplenumber].tif
> [language].[fontname].exp[samplenumber] -l [language] batch.nochop
> makebox``` (one of them came out empty)
> 4 - Corrected the .box files using the site
> https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positioning
> in the file)
> 5 - Created the .tr files for each image ```tesseract
> kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train ``` (both
> images produced an empty .tr file)
> 6 - Created the unicharset file ```unicharset_extractor [box file 0] [box
> file 1]...```
> 7 - Created the font_properties file, which contains only ```AppleMyungjo 0 0 1 0 0```
> 8 - Cloned the tesseract repo to my mac, path ```~/projects/tesseract```
> 9 - Cloned the langdata repo to my mac, path ```~/projects/langdata```
> 10 - Found the folder where brew installed my tesseract, path
> ```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
> 11 - Executed the ```~/projects/tesseract/training/tesstrain.sh``` file
>
>
> ```
> sudo ~/projects/tesseract/training/tesstrain.sh \
>   --fonts_dir /Library/Fonts  \
>   --lang kor \
>   --linedata_only  \
>   --noextract_font_properties  \
>   --exposures "0"    \
>   --langdata_dir ~/projects/langdata \
>   --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
>   --output_dir ~/tesstutorial/kor \
>   --fontlist "AppleMyungjo"
> ```
> and got the error:
> ```
> === Starting training for language 'kor'
> mktemp: illegal option -- -
> usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
>        mktemp [-d] [-q] [-u] -t prefix
> [Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image
> --fonts_dir=/Library/Fonts --font=AppleMyungjo
> --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
> Fontconfig error: Cannot load default config file
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image
> --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words
> --leading=32 --char_spacing=0.0 --exposure=0
> --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0
> --max_pages=3 --font=AppleMyungjo
> --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
> Fontconfig error: Cannot load default config file
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ```
>
> I found that the ```Fontconfig error: Cannot load default config file```
> was caused by the mktemp call failing on the Mac. I fixed it by changing
> this line:
>
> training/tesstrain_utils.sh
> ```diff
> - export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
> + export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
> ```
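
A form that should behave the same with both BSD (macOS) and GNU mktemp is to pass a full template path rather than `--tmpdir` or `-t`. With BSD mktemp, `-t` treats its argument as a prefix and appends its own random suffix, which is why the directory in the log below is named `font_tmp.XXXXXXXXXX.X52wexDs`. A minimal sketch, where `make_font_cache_dir` is a hypothetical helper name and not part of tesstrain_utils.sh:

```shell
#!/bin/sh
# Create a temp dir for the fontconfig cache without GNU-only options.
# make_font_cache_dir is a hypothetical helper, not part of tesstrain_utils.sh.
make_font_cache_dir() {
  # Use TMPDIR if set (macOS sets it with a trailing slash), else /tmp.
  base=${TMPDIR:-/tmp}
  base=${base%/}
  # A full template path as the only operand works with BSD and GNU mktemp.
  mktemp -d "$base/font_tmp.XXXXXXXXXX"
}

FONT_CONFIG_CACHE=$(make_font_cache_dir) || exit 1
export FONT_CONFIG_CACHE
echo "fontconfig cache dir: $FONT_CONFIG_CACHE"
```
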
> After executing the same code I get:
>
> ```
> === Starting training for language 'kor'
> [Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image
> --fonts_dir=/Library/Fonts --font=AppleMyungjo
> --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt
> --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt
> --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image
> --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs
> --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32
> --char_spacing=0.0 --exposure=0
> --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0
> --max_pages=3 --font=AppleMyungjo
> --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ERROR:
> /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box
> does not exist or is not readable
> ```
>
> So I'm stuck at these 2 errors. I do have this file in the folder where I'm
> running the command, ```~/projects/ocr/trainning/```, but what can I do to
> make it work?
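
Note that in the log, tesstrain.sh looks for the .box file that text2image should have written into its own temporary directory (the /var/folders/... path), not for the hand-made box file in your working directory. The log does not show why text2image wrote nothing, so one way to narrow it down is to run text2image by hand with the arguments tesstrain.sh logged and check what it produced. A minimal sketch, where `check_rendered` is a hypothetical helper and the paths in the comments are assumptions:

```shell
#!/bin/sh
# check_rendered BASE: report whether BASE.tif and BASE.box both exist
# and are non-empty. Returns 0 only if both files are present.
# check_rendered is a hypothetical helper, not part of the Tesseract tools.
check_rendered() {
  status=0
  for ext in tif box; do
    if [ -s "$1.$ext" ]; then
      echo "ok: $1.$ext"
    else
      echo "missing or empty: $1.$ext"
      status=1
    fi
  done
  return $status
}

# Example use (assumed paths), after rendering by hand:
#   text2image --list_available_fonts --fonts_dir /Library/Fonts
#   text2image --fonts_dir /Library/Fonts --font AppleMyungjo \
#     --text ~/projects/langdata/kor/kor.training_text \
#     --outputbase /tmp/kor_test/kor.AppleMyungjo.exp0
#   check_rendered /tmp/kor_test/kor.AppleMyungjo.exp0
```

If the font name does not show up in the `--list_available_fonts` output, that would be the first thing to look at.
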
>
>
> Thanks for reading all this text and for your time
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a3d11945-97ef-4b2d-9626-96364c7884cb%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/a3d11945-97ef-4b2d-9626-96364c7884cb%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

