Re: Scripts to semi-automate training

Balthazar Rouberol Fri, 08 Jun 2012 20:23:09 -0700

Hello all,

I've written a small Python tool that takes over the training process, as 
well as the generation of the tif (multipage supported) and the boxfile: 
https://github.com/BaltoRouberol/TesseractTrainer


This can be useful when you want to train Tesseract on a given font and 
therefore have to create the tif yourself.
With this tool, you specify a text and a font (among other things), and a 
multipage tif containing your text in that font is generated, along with 
the corresponding boxfile.
Since the boxfile is generated along with the tif, you can be confident of 
its accuracy and skip the boxfile checking step. 
The training process can thus be fully automated, from the tif generation to 
the combination of the traineddata file.
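
To illustrate why a generated boxfile needs no manual checking, here is a minimal sketch of the idea, not TesseractTrainer's actual code. It assumes a hypothetical fixed-width grid layout (`glyph_w`, `glyph_h`, `margin` and the page size are made-up parameters); a real tool would take each glyph's bounding box from the font renderer instead. The output follows the Tesseract box file line format `<char> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left corner of the page.

```python
def make_box_lines(text, glyph_w=20, glyph_h=30, margin=10,
                   page_w=400, page_h=200, line_gap=10):
    """Lay `text` out on a hypothetical fixed-width grid and return box lines.

    All layout parameters are illustrative assumptions. Coordinates are in
    pixels, measured from the bottom-left corner of the page.
    """
    lines = []
    x = margin              # left edge of the next glyph
    y_from_top = margin     # distance from the page top to the glyph top
    page = 0
    for ch in text:
        # Start a new text line on an explicit newline or when the row is full.
        if ch == "\n" or x + glyph_w > page_w - margin:
            x = margin
            y_from_top += glyph_h + line_gap
            # Start a new page when the next row would not fit.
            if y_from_top + glyph_h > page_h - margin:
                y_from_top = margin
                page += 1
            if ch == "\n":
                continue
        if ch.isspace():
            # Whitespace takes up room on the page but gets no box entry.
            x += glyph_w
            continue
        top = page_h - y_from_top   # convert to a bottom-left origin
        bottom = top - glyph_h
        lines.append(f"{ch} {x} {bottom} {x + glyph_w} {top} {page}")
        x += glyph_w
    return lines


if __name__ == "__main__":
    print("\n".join(make_box_lines("ab\nc")))
```

Because the same layout data would drive both the image and the box entries, the two cannot disagree, which is the property described above.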

I'll be happy to get feedback from you!

Balthazar

On Wednesday, 6 June 2012 20:26:40 UTC+2, shikamuk wrote:
>
> > Hello from neighboring Georgia! 
> Yay! 
> Thank you, I'll do a git pull and give it a try! 
> Not right now, because I'm swamped at the moment. 
>
> I've also already noticed that I can get it to work without "-l". 
> Thank you again, I guess I may have further questions. 
>
> მადლობტ 
>
> ლორაირ 
>
>
> On Jun 4, 8:03 pm, Derek <[email protected]> wrote: 
> > Hi Shikamuk, 
> > 
> > Hello from neighboring Georgia! You're exactly right, the issue is that you 
> > don't have hye.traineddata yet. For completely new character sets, you need 
> > to issue the tesseract command without "-l yournewlanguage". The line 
> > you're referring to is suggesting what to do after you have trained 
> > Tesseract on one font in your new language. Since you are training for a 
> > unique script, it doesn't really matter what you use as the language code; 
> > you will get equally bad results no matter what. 
> > 
> > I don't suggest using auto_train.sh at this stage; you will need to edit 
> > the boxfiles generated by make_boxes.sh before continuing the training 
> > process, so I suggest running make_boxes.sh on its own, and then using 
> > merge_boxes.py and align_boxfile.py along with manual editing to get the 
> > boxfiles in order before continuing with the training process. I've made 
> > some small modifications to the scripts and README to make this clear, so I 
> > suggest doing 'git pull' to get the latest copy. 
> > 
> > Hope that helps! 
> > 
> > Derek 
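
Derek's point about dropping "-l" when bootstrapping can be sketched as a small command builder. This is only an illustration: `makebox_command` is a hypothetical helper, not part of tess_school, and the argument order simply mirrors the command lines quoted in this thread.

```python
def makebox_command(tif_name, lang=None):
    """Build the argument list for a 'makebox' run on one training image.

    Hypothetical helper. With lang=None the "-l" option is omitted, which is
    what a brand-new language needs while no <lang>.traineddata exists yet.
    """
    if not tif_name.endswith(".tif"):
        raise ValueError("expected a .tif file name")
    output_base = tif_name[:-len(".tif")]   # tesseract appends the extension
    cmd = ["tesseract", tif_name, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


if __name__ == "__main__":
    # Bootstrapping: no "-l" at all.
    print(" ".join(makebox_command("hye.Dejavu_Serifbold.exp0.tif")))
```

The resulting list could be handed to `subprocess.run` once the tesseract binary is on PATH.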
> > 
> > On Sunday, June 3, 2012 10:29:26 PM UTC+4, shikamuk wrote: 
> > 
> > > Hey, Derek. 
> > > Thank you for the scripts, they seem to work. 
> > 
> > > However, a couple of questions: 
> > 
> > > 0. So, I've compiled the svn version of tesseract and installed it to the 
> > > /local/tesseract-svn prefix with all language files. 
> > > I've also added /local/tesseract-svn/bin to PATH so that the binaries 
> > > there can be called from the scripts. 
> > 
> > > 1. Then, I've created the text.txt file with a nice long text in it. 
> > 
> > > 2.  I've run 
> > > python text2img.py -b -i _some_fonts_here 
> > > Now I have png files. 
> > 
> > > 3. Then I run png2tif.sh and get all tif files. 
> > > That's correct. 
> > 
> > > 4. Then I am supposed to run auto_train.sh, right? 
> > > Anyway, it fails at the first step, make_boxes.sh. 
> > > I debugged the script by putting "set -x" in it, and I get: 
> > 
> > > --- 
> > > + LANG=hye 
> > > + for file in '*.tif' 
> > > ++ basename hye.Dejavu_Serifbold.exp0.tif 
> > > + filename=hye.Dejavu_Serifbold.exp0.tif 
> > > + filename=hye.Dejavu_Serifbold.exp0 
> > > + tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox 
> > > Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata 
> > > Please make sure the TESSDATA_PREFIX environment variable is set to 
> > > the parent directory of your "tessdata" directory. 
> > > Failed loading language 'hye' 
> > > Tesseract couldn't load any languages! 
> > > Could not initialize tesseract. 
> > > --- 
> > 
> > > and the same messages for all the fonts. 
> > 
> > > Obviously, there is no hye.traineddata file there. 
> > > I wonder whether it should be there at this step, when I am bootstrapping 
> > > a new language? 
> > 
> > > According to 
> > > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
> > > when bootstrapping a new language one has to issue: 
> > > tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox 
> > 
> > > which is what the make_boxes.sh script tries to do, and which also fails 
> > > from the command line: 
> > 
> > > $ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox 
> > > Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata 
> > > Please make sure the TESSDATA_PREFIX environment variable is set to 
> > > the parent directory of your "tessdata" directory. 
> > > Failed loading language 'hy' 
> > > Tesseract couldn't load any languages! 
> > > Could not initialize tesseract. 
> > 
> > > Any ideas? 
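
The error above comes down to a missing `<lang>.traineddata` file under the `tessdata` directory. A minimal sketch of checking for that file before passing "-l" follows; the helper names are hypothetical, and the layout (`<prefix>/tessdata/<lang>.traineddata`) is taken from the error message quoted in this thread, which describes TESSDATA_PREFIX as the parent directory of "tessdata".

```python
import os


def traineddata_path(tessdata_prefix, lang):
    """Return where <lang>.traineddata is expected under the given prefix.

    Hypothetical helper; `tessdata_prefix` is the parent directory of the
    "tessdata" directory, as the tesseract error message describes it.
    """
    return os.path.join(tessdata_prefix, "tessdata", lang + ".traineddata")


def has_traineddata(tessdata_prefix, lang):
    """True if the language data file exists, i.e. "-l <lang>" can load it."""
    return os.path.isfile(traineddata_path(tessdata_prefix, lang))


if __name__ == "__main__":
    # The prefix is read from the environment here; "." is only a fallback
    # chosen for this sketch.
    prefix = os.environ.get("TESSDATA_PREFIX", ".")
    for lang in ("hye", "hy"):
        print(lang, has_traineddata(prefix, lang))
```

A training script could use such a check to decide whether to pass "-l" at all when generating the first boxfiles for a new language.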
> > 
> > > On May 24, 11:02 pm, Derek Dohler <[email protected]> wrote: 
> > > > Hi all, 
> > 
> > > > I have been doing a lot of tesseract training recently, so I decided to 
> > > > put together some Python and shell scripts to speed up the process. I 
> > > > haven't done anything to prepare these for public consumption, but they 
> > > > have made my life a lot easier, so I thought I'd throw them out on the 
> > > > list in case anyone else finds them useful. 
> > 
> > > > Just a heads-up: the default language is Georgian because that's what 
> > > > I'm training for, so make sure to change that to your language when 
> > > > training. 
> > 
> > > > https://github.com/ddohler/tess_school 
> > 
> > > > Cheers, 
> > > > Derek

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
