Hi Shikamuk,

Hello from neighboring Georgia! You're exactly right: the issue is that you don't have hye.traineddata yet. For a completely new character set, you need to run the tesseract command without "-l yournewlanguage". The line you're referring to describes what to do after you have already trained Tesseract on one font in your new language. Since you are training for a unique script, it doesn't really matter which existing language Tesseract uses for this first pass; the initial box files will be equally bad no matter what.
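To make the "without -l" point concrete, here is a rough Python sketch of the bootstrap step. It builds the same command that shows up in your "set -x" trace, minus the "-l hye" part. The function names makebox_command and make_boxes are made up for this example and are not part of tess_school. It also assumes that Tesseract falls back to its default language (eng, as far as I know) when no "-l" is given, so eng.traineddata has to be present in your tessdata directory.

```python
import subprocess
from pathlib import Path


def makebox_command(tif_path):
    """Build the box-file bootstrap command for one training image.

    No "-l" flag is passed, so Tesseract uses its default language
    instead of looking for a traineddata file that does not exist yet.
    """
    tif = Path(tif_path)
    # Output base is the image name without ".tif",
    # e.g. hye.DejaVu_Sans.exp0.tif -> hye.DejaVu_Sans.exp0
    output_base = tif.with_suffix("")
    return ["tesseract", str(tif), str(output_base), "batch.nochop", "makebox"]


def make_boxes(directory=".", dry_run=True):
    """Print (and optionally run) the makebox command for every .tif file.

    Hypothetical helper: with dry_run=True nothing is executed, the
    commands are only printed and returned.
    """
    commands = [makebox_command(p) for p in sorted(Path(directory).glob("*.tif"))]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling make_boxes(".") only prints the commands; make_boxes(".", dry_run=False) actually runs them and needs the tesseract binary on your PATH. Either way, the resulting box files will still need manual correction.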

I don't suggest using auto_train.sh at this stage, because you will need to edit the boxfiles generated by make_boxes.sh before continuing the training process. Instead, I suggest running make_boxes.sh on its own, and then using merge_boxes.py and align_boxfile.py along with manual editing to get the boxfiles in order. I've made some small modifications to the scripts and README to make this clear, so I suggest doing 'git pull' to get the latest copy.

Hope that helps!

Derek

On Sunday, June 3, 2012 10:29:26 PM UTC+4, shikamuk wrote:
>
> Hey, Derek.
> Thank you for scripts, they seem to work.
>
> However, a couple of questions:
>
> 0. So, I've compiled svn version of tesseract and installed it to the
> /local/tesseract-svn prefix with all language files.
> I've also exported /local/tesseract-svn/bin in PATH so that binaries
> from there can be called from scripts.
>
> 1. Then, I've created the text.txt file with a nice long text in it.
>
> 2. I've run
> python text2img.py -b -i _some_fonts_here
> Now I have png files.
>
> 3. Then I run png2tif.sh and get all tif files.
> That's correct.
>
> 4. Then I am supposed to run autotrain.sh, right?
> Anyway, it is failing on the first step - make_boxes.sh
> I debugged the script by putting "set -x" there and I have
>
> ---
> + LANG=hye
> + for file in '*.tif'
> ++ basename hye.Dejavu_Serifbold.exp0.tif
> + filename=hye.Dejavu_Serifbold.exp0.tif
> + filename=hye.Dejavu_Serifbold.exp0
> + tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox
> Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to
> the parent directory of your "tessdata" directory.
> Failed loading language 'hye'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
> ---
>
> and the same messages for the all fonts.
>
> Obviously, there is no hye.traineddata file there.
> I wonder if it should be there on this step, when I am bootstrapping a
> new language?
>
> According to the
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> while bootstrapping a new language one has to issue:
> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
>
> which is what make_boxes.sh script tries to do and what is failed from
> the commandline as well:
>
> $tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox
> Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata
> Please make sure the TESSDATA_PREFIX environment variable is set to
> the parent directory of your "tessdata" directory.
> Failed loading language 'hy'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
>
> Any ideas?
>
>
> On May 24, 11:02 pm, Derek Dohler <[email protected]> wrote:
> > Hi all,
> >
> > I have been doing a lot of tesseract training recently, so I decided to
> > put together some Python and shell scripts to speed up the process. I
> > haven't done any prep to prepare these for public consumption, but they
> > have made my life a lot easier, so I thought I'd throw them out on the list
> > in case anyone else finds them useful.
> >
> > Just a head's up, the default language is Georgian because that's what
> > I'm training for, so make sure to change that to your language when
> > training.
> >
> > https://github.com/ddohler/tess_school
> >
> > Cheers,
> > Derek

