Training Tesseract 4.0.0 is different from the process for 3.0x. Training from images is not supported for Tesseract 4.0.0.
See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

On Thu 5 Apr, 2018, 1:36 AM Fanatico, <fanatico.s...@gmail.com> wrote:

> Hi, I'm new to tesseract and OCR in general, and need some help to train my tesseract.
>
> Config
> Platform: Mac OS X 10.13.3
> Tesseract Version: 4.0.0-beta.1
> leptonica: 1.75.3
> libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
>
> Images used
>
> kor.AppleMyungjo.exp1.tif
> <https://lh3.googleusercontent.com/-HfEWwZudKjE/WsUtd-CH2iI/AAAAAAAAHig/u_gQpXArU4cU4jREJJegB2dIjo3tqv3lwCLcBGAs/s1600/kor.AppleMyungjo.exp1.tif>
>
> kor.AppleMyungjo.exp0.tif
> <https://lh3.googleusercontent.com/-OGn-qgzxBgE/WsUr2NKqeBI/AAAAAAAAHiQ/aZ7PnPiB7qwHvyXTGb-wHVyGJ4Gs-N9GwCLcBGAs/s1600/kor.AppleMyungjo.exp0.tif>
>
> Step by step
> I'm trying to train (fine tune) my tesseract to better detect commas (") and dots (.) in Korean, but I'm getting some errors. Here is what I have done so far:
>
> 1 - Got the images. I'm using 2 .tif images (both images have only 1 line and a few characters)
> 2 - Renamed the images to kor.AppleMyungjo.exp0.tif and kor.AppleMyungjo.exp1.tif
> 3 - Created the .box file for each image ```tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] -l [language] batch.nochop makebox``` (one of them came out empty)
> 4 - Corrected the .box files using the site https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positioning into the file)
> 5 - Created the .tr files for each image ```tesseract kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train ``` (both images got an empty .tr file)
> 6 - Created the unicharset file ```unicharset_extractor [box file 0] [box file 1]...```
> 7 - Created the font_properties file, which only has the line ```AppleMyungjo 0 0 1 0 0```
> 8 - Cloned the tesseract repo to my Mac, path ```~/projects/tesseract```
> 9 - Cloned the langdata repo to my Mac, path ```~/projects/langdata```
> 10 - Found the folder where brew installed my tesseract, path ```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
> 11 - Executed the ```~/projects/tesseract/training/tesstrain.sh``` file
>
> ```
> sudo ~/projects/tesseract/training/tesstrain.sh \
>   --fonts_dir /Library/Fonts \
>   --lang kor \
>   --linedata_only \
>   --noextract_font_properties \
>   --exposures "0" \
>   --langdata_dir ~/projects/langdata \
>   --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
>   --output_dir ~/tesstutorial/kor \
>   --fontlist "AppleMyungjo"
> ```
> and got the error:
> ```
> === Starting training for language 'kor'
> mktemp: illegal option -- -
> usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
>        mktemp [-d] [-q] [-u] -t prefix
> [Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
> Fontconfig error: Cannot load default config file
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
> Fontconfig error: Cannot load default config file
> ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
> ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
> ```
>
> I found that the ```Fontconfig error: Cannot load default config file``` was caused by the mktemp call failing on Mac. I fixed it by replacing the code:
>
> training/tesstrain_utils.sh
> ```diff
> - export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
> + export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
> ```
> After executing the same command I get:
>
> ```
> === Starting training for language 'kor'
> [Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs
>
> === Phase I: Generating training images ===
> Rendering using AppleMyungjo
> [Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
> ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
> ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
> ```
>
> So I'm stuck at these 2 errors. I do have this file in the folder where I'm executing the code, ```~/projects/ocr/trainning/```, but what can I do to make it work?
>
> Thanks for reading all this text and for your time
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a3d11945-97ef-4b2d-9626-96364c7884cb%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
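
On the `mktemp: illegal option -- -` message in the quoted log: the usage text printed there shows that the mktemp shipped with macOS has no `--tmpdir` option, and the second log shows that with `-t` the template is treated as a prefix (hence `font_tmp.XXXXXXXXXX.X52wexDs`). A rough sketch of a call that avoids both options is below. It assumes that passing a full-path template is accepted by both GNU and BSD mktemp; `make_tmpdir` is a hypothetical helper name, not something from `tesstrain_utils.sh`. It only addresses the temp directory, not the missing `.box` file.

```shell
#!/bin/sh
# Sketch: create a temp directory without the GNU-only --tmpdir option
# or the BSD-style "-t prefix" form. make_tmpdir is a hypothetical
# helper, not part of tesstrain_utils.sh.
make_tmpdir() {
    base="${TMPDIR:-/tmp}"   # fall back to /tmp when TMPDIR is unset
    base="${base%/}"         # drop one trailing slash, if present
    # A full-path template ending in X's is given directly to mktemp.
    mktemp -d "${base}/$1.XXXXXXXXXX"
}

FONT_CONFIG_CACHE=$(make_tmpdir font_tmp) || exit 1
export FONT_CONFIG_CACHE
echo "fontconfig cache dir: ${FONT_CONFIG_CACHE}"
```

With this form the X's are replaced in place, so the directory name should look like `font_tmp.` followed by ten random characters rather than carrying a literal `XXXXXXXXXX` in the middle.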