It will work on Windows XP, provided you have installed Python 2.7 and PIL.
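
A quick way to confirm the prerequisites before trying the tool is sketched below. This is only a rough check, not part of TesseractTrainer itself: the function name missing_modules is made up for illustration, and it only tests that the modules can be imported (PIL by default), nothing more.

```python
import importlib
import sys


def missing_modules(names=("PIL",)):
    """Return the entries of `names` that cannot be imported.

    Hypothetical helper for illustration; "PIL" is the only
    dependency named in this thread.
    """
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Report the interpreter version and any modules that failed to import.
print("Python %d.%d" % (sys.version_info[0], sys.version_info[1]))
print("Missing modules: %s" % (missing_modules() or "none"))
```

If PIL shows up as missing, install it first and run the check again.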

On Sat, Jun 23, 2012 at 1:21 PM, blavatsky3 < [email protected]> wrote:
> Hello Balthazar,
>
> Do you have a 'windows' version of your box trainer for tesseract ?
>
> kind regards
>
> Richard
>
>
> On Saturday, June 9, 2012 4:17:11 AM UTC+10, Balthazar Rouberol wrote:
>>
>> Hello all,
>>
>> I've written a small Python tool taking over the training process, and
>> also the tif (multipage supported) and boxfile generation:
>> https://github.com/BaltoRouberol/TesseractTrainer
>>
>> This can be useful when you want to train Tesseract on a given font, and
>> you thus have to create the tif yourself.
>> With this tool, you specify a text, a font (among other things) and a
>> multipage tif containing your text/font will then be generated, along with
>> the corresponding boxfile.
>> This allows you to be 100% sure of the boxfile accuracy, and skip the
>> boxfile checking process.
>> The training process can now be fully automated, from the tif generation
>> to the traineddata file combination.
>>
>> I'll be happy to get feedback from you!
>>
>> Balthazar
>>
>> On Wednesday, June 6, 2012 20:26:40 UTC+2, shikamuk wrote:
>>>
>>> > Hello from neighboring Georgia!
>>> Yay!
>>> Thank you, I'll do a git pull and give it a try!
>>> Not right now, cause I am under the load.
>>>
>>> I've also already noticed that without "-l" I can get it work.
>>> Thank you again, I guess I may have further questions.
>>>
>>> მადლობტ
>>>
>>> ლორაირ
>>>
>>>
>>> On Jun 4, 8:03 pm, Derek <[email protected]> wrote:
>>> > Hi Shikamuk,
>>> >
>>> > Hello from neighboring Georgia! You're exactly right, the issue is that you
>>> > don't have hye.traineddata yet. For completely new character sets, you need
>>> > to issue the tesseract command without "-l yournewlanguage". The line
>>> > you're referring to is suggesting what to do after you have trained
>>> > Tesseract on one font in your new language.
>>> > Since you are training for a unique script, it doesn't really matter
>>> > what you use as the language code; you will get equally bad results
>>> > no matter what.
>>> >
>>> > I don't suggest using auto_train.sh at this stage; you will need to edit
>>> > the boxfiles generated by make_boxes.sh before continuing the training
>>> > process, so I suggest running make_boxes.sh on its own, and then using
>>> > merge_boxes.py and align_boxfile.py along with manual editing to get the
>>> > boxfiles in order before continuing with the training process. I've made
>>> > some small modifications to the scripts and README to make this clear, so I
>>> > suggest doing 'git pull' to get the latest copy.
>>> >
>>> > Hope that helps!
>>> >
>>> > Derek
>>> >
>>> > On Sunday, June 3, 2012 10:29:26 PM UTC+4, shikamuk wrote:
>>> >
>>> > > Hey, Derek.
>>> > > Thank you for scripts, they seem to work.
>>> >
>>> > > However, a couple of questions:
>>> >
>>> > > 0. So, I've compiled svn version of tesseract and installed it to the
>>> > > /local/tesseract-svn prefix with all language files.
>>> > > I've also exported /local/tesseract-svn/bin in PATH so that binaries
>>> > > from there can be called from scripts.
>>> >
>>> > > 1. Then, I've created the text.txt file with a nice long text in it.
>>> >
>>> > > 2. I've run
>>> > > python text2img.py -b -i _some_fonts_here
>>> > > Now I have png files.
>>> >
>>> > > 3. Then I run png2tif.sh and get all tif files.
>>> > > That's correct.
>>> >
>>> > > 4. Then I am supposed to run autotrain.sh, right?
>>> > > Anyway, it is failing on the first step - make_boxes.sh
>>> > > I debugged the script by putting "set -x" there and I have
>>> >
>>> > > ---
>>> > > + LANG=hye
>>> > > + for file in '*.tif'
>>> > > ++ basename hye.Dejavu_Serifbold.exp0.tif
>>> > > + filename=hye.Dejavu_Serifbold.exp0.tif
>>> > > + filename=hye.Dejavu_Serifbold.exp0
>>> > > + tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox
>>> > > Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata
>>> > > Please make sure the TESSDATA_PREFIX environment variable is set to
>>> > > the parent directory of your "tessdata" directory.
>>> > > Failed loading language 'hye'
>>> > > Tesseract couldn't load any languages!
>>> > > Could not initialize tesseract.
>>> > > ---
>>> >
>>> > > and the same messages for the all fonts.
>>> >
>>> > > Obviously, there is no hye.traineddata file there.
>>> > > I wonder if it should be there on this step, when I am bootstrapping a
>>> > > new language?
>>> >
>>> > > According to the
>>> > > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>> > > while bootstrapping a new language one has to issue:
>>> > > tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
>>> >
>>> > > which is what make_boxes.sh script tries to do and what is failed from
>>> > > the commandline as well:
>>> >
>>> > > $ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox
>>> > > Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata
>>> > > Please make sure the TESSDATA_PREFIX environment variable is set to
>>> > > the parent directory of your "tessdata" directory.
>>> > > Failed loading language 'hy'
>>> > > Tesseract couldn't load any languages!
>>> > > Could not initialize tesseract.
>>> >
>>> > > Any ideas?
>>> >
>>> > > On May 24, 11:02 pm, Derek Dohler <[email protected]> wrote:
>>> > > > Hi all,
>>> >
>>> > > > I have been doing a lot of tesseract training recently, so I decided to
>>> > > > put together some Python and shell scripts to speed up the process. I
>>> > > > haven't done any prep to prepare these for public consumption, but they
>>> > > > have made my life a lot easier, so I thought I'd throw them out on the list
>>> > > > in case anyone else finds them useful.
>>> >
>>> > > > Just a head's up, the default language is Georgian because that's what
>>> > > > I'm training for, so make sure to change that to your language when
>>> > > > training.
>>> >
>>> > > > https://github.com/ddohler/tess_school
>>> >
>>> > > > Cheers,
>>> > > > Derek
>>
>> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
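
For anyone scripting the box-file step discussed in this thread, here is a minimal sketch of building the makebox command with the "-l" option left out when the language has no traineddata file yet. It is not taken from tess_school or TesseractTrainer: the function name makebox_command and its parameters are invented for illustration, and only the command layout (image, output base, optional "-l lang", then batch.nochop makebox) comes from the messages quoted above.

```python
import os


def makebox_command(tif_path, lang=None):
    """Build the tesseract argument list for generating a box file.

    Hypothetical helper. With lang=None the "-l" option is omitted,
    which is what the thread suggests when bootstrapping a language
    that has no traineddata yet.
    """
    # Output base is the tif name without its final extension.
    output_base = os.path.splitext(tif_path)[0]
    cmd = ["tesseract", tif_path, output_base]
    if lang is not None:
        cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


# First pass for a brand-new language: no "-l" at all.
print(" ".join(makebox_command("hye.DejaVu_Sansitalic.exp0.tif")))
```

The returned list can be handed to subprocess.call once tesseract is on the PATH. After a first traineddata file exists, passing lang="hye" puts the "-l hye" option back.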

