> Hello from neighboring Georgia!

Yay! Thank you, I'll do a git pull and give it a try! Not right now, though, because I am under a heavy workload.
I've also already noticed that I can get it to work without "-l". Thank you again; I expect I will have further questions.

Thank you,
ლორაირ

On Jun 4, 8:03 pm, Derek wrote:
> Hi Shikamuk,
>
> Hello from neighboring Georgia! You're exactly right, the issue is that you
> don't have hye.traineddata yet. For completely new character sets, you need
> to issue the tesseract command without "-l yournewlanguage". The line
> you're referring to is suggesting what to do after you have trained
> Tesseract on one font in your new language. Since you are training for a
> unique script, it doesn't really matter what you use as the language code;
> you will get equally bad results no matter what.
>
> I don't suggest using auto_train.sh at this stage; you will need to edit
> the boxfiles generated by make_boxes.sh before continuing the training
> process, so I suggest running make_boxes.sh on its own, and then using
> merge_boxes.py and align_boxfile.py along with manual editing to get the
> boxfiles in order before continuing with the training process. I've made
> some small modifications to the scripts and README to make this clear, so I
> suggest doing 'git pull' to get the latest copy.
>
> Hope that helps!
>
> Derek
>
> On Sunday, June 3, 2012 10:29:26 PM UTC+4, shikamuk wrote:
> > Hey, Derek.
> > Thank you for scripts, they seem to work.
> >
> > However, a couple of questions:
> >
> > 0. So, I've compiled svn version of tesseract and installed it to the
> > /local/tesseract-svn prefix with all language files.
> > I've also exported /local/tesseract-svn/bin in PATH so that binaries
> > from there can be called from scripts.
> >
> > 1. Then, I've created the text.txt file with a nice long text in it.
> >
> > 2. I've run
> > python text2img.py -b -i _some_fonts_here
> > Now I have png files.
> >
> > 3. Then I run png2tif.sh and get all tif files.
>
> That's correct.
>
> > 4. Then I am supposed to run autotrain.sh, right?
> > Anyway, it is failing on the first step - make_boxes.sh
> > I debugged the script by putting "set -x" there and I have
> >
> > ---
> > + LANG=hye
> > + for file in '*.tif'
> > ++ basename hye.Dejavu_Serifbold.exp0.tif
> > + filename=hye.Dejavu_Serifbold.exp0.tif
> > + filename=hye.Dejavu_Serifbold.exp0
> > + tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox
> > Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata
> > Please make sure the TESSDATA_PREFIX environment variable is set to
> > the parent directory of your "tessdata" directory.
> > Failed loading language 'hye'
> > Tesseract couldn't load any languages!
> > Could not initialize tesseract.
> > ---
> >
> > and the same messages for the all fonts.
> >
> > Obviously, there is no hye.traineddata file there.
> > I wonder if it should be there on this step, when I am bootstrapping a
> > new language?
> >
> > According to the
> > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> > while bootstrapping a new language one has to issue:
> > tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
> >
> > which is what make_boxes.sh script tries to do and what is failed from
> > the commandline as well:
> >
> > $ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox
> > Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata
> > Please make sure the TESSDATA_PREFIX environment variable is set to
> > the parent directory of your "tessdata" directory.
> > Failed loading language 'hy'
> > Tesseract couldn't load any languages!
> > Could not initialize tesseract.
> >
> > Any ideas?
> >
> > On May 24, 11:02 pm, Derek Dohler wrote:
> > > Hi all,
> > >
> > > I have been doing a lot of tesseract training recently, so I decided to
> > > put together some Python and shell scripts to speed up the process.
> > > I haven't done any prep to prepare these for public consumption, but they
> > > have made my life a lot easier, so I thought I'd throw them out on the list
> > > in case anyone else finds them useful.
> > >
> > > Just a head's up, the default language is Georgian because that's what
> > > I'm training for, so make sure to change that to your language when
> > > training.
> > >
> > > https://github.com/ddohler/tess_school
> > >
> > > Cheers,
> > > Derek
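
For anyone following along, the "-l" advice in this thread can be sketched as a small shell function. This is only an illustration and not part of tess_school: make_boxes, its arguments, and the DRY_RUN variable are made-up names, and the tessdata directory is passed in explicitly on the assumption that it is the directory holding the *.traineddata files (/local/tesseract-svn/share/tessdata in the error output above). The function passes "-l <lang>" only when <lang>.traineddata already exists and omits it otherwise, so tesseract falls back to its default language data (assumed to be installed).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from tess_school.
# Usage: make_boxes <lang> <tessdata_dir> [image_dir]
# Set DRY_RUN=1 to print the tesseract commands instead of running them.

make_boxes() {
    local lang="$1" tessdata_dir="$2" img_dir="${3:-.}"
    local tif base lang_args=""

    # Only request the new language once it has been trained at least once.
    if [ -f "$tessdata_dir/$lang.traineddata" ]; then
        lang_args="-l $lang"
    fi

    for tif in "$img_dir"/*.tif; do
        # Skip the unexpanded glob when the directory has no .tif files.
        [ -e "$tif" ] || continue
        base="${tif%.tif}"

        # $lang_args is deliberately unquoted: it expands to zero or two words.
        if [ -n "${DRY_RUN:-}" ]; then
            echo tesseract "$tif" "$base" $lang_args batch.nochop makebox
        else
            tesseract "$tif" "$base" $lang_args batch.nochop makebox
        fi
    done
}

# Preview the commands for the .tif files in the current directory:
# DRY_RUN=1 make_boxes hye /local/tesseract-svn/share/tessdata .
```

With DRY_RUN=1 nothing is executed, which makes it easy to check whether "-l" will be passed before starting a long run. The box files produced this way still need the manual correction described above (merge_boxes.py, align_boxfile.py, hand editing) before training continues.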

