Hello Balthazar,

Do you have a Windows version of your box trainer for Tesseract?

Kind regards,
Richard

On Saturday, June 9, 2012 4:17:11 AM UTC+10, Balthazar Rouberol wrote:
>
> Hello all,
>
> I've written a small Python tool that takes over the training process, and
> also the tif (multipage supported) and boxfile generation:
> https://github.com/BaltoRouberol/TesseractTrainer
>
> This can be useful when you want to train Tesseract on a given font, and
> you thus have to create the tif yourself.
> With this tool, you specify a text, a font (among other things) and a
> multipage tif containing your text/font will then be generated, along with
> the corresponding boxfile.
> This allows you to be 100% sure of the boxfile accuracy, and skip the
> boxfile checking process.
> The training process can now be fully automated, from the tif generation
> to the traineddata file combination.
>
> I'll be happy to get feedback from you!
>
> Balthazar
>
> On Wednesday, June 6, 2012 20:26:40 UTC+2, shikamuk wrote:
>>
>> > Hello from neighboring Georgia!
>> Yay!
>> Thank you, I'll do a git pull and give it a try!
>> Not right now, because I am under a heavy workload.
>>
>> I've also already noticed that without "-l" I can get it to work.
>> Thank you again, I guess I may have further questions.
>>
>> მადლობტ
>>
>> ლორაირ
>>
>>
>> On Jun 4, 8:03 pm, Derek <[email protected]> wrote:
>> > Hi Shikamuk,
>> >
>> > Hello from neighboring Georgia! You're exactly right, the issue is that you
>> > don't have hye.traineddata yet. For completely new character sets, you need
>> > to issue the tesseract command without "-l yournewlanguage". The line
>> > you're referring to is suggesting what to do after you have trained
>> > Tesseract on one font in your new language. Since you are training for a
>> > unique script, it doesn't really matter what you use as the language code;
>> > you will get equally bad results no matter what.
>> >
>> > I don't suggest using auto_train.sh at this stage; you will need to edit
>> > the boxfiles generated by make_boxes.sh before continuing the training
>> > process, so I suggest running make_boxes.sh on its own, and then using
>> > merge_boxes.py and align_boxfile.py along with manual editing to get the
>> > boxfiles in order before continuing with the training process. I've made
>> > some small modifications to the scripts and README to make this clear, so I
>> > suggest doing 'git pull' to get the latest copy.
>> >
>> > Hope that helps!
>> >
>> > Derek
>> >
>> > On Sunday, June 3, 2012 10:29:26 PM UTC+4, shikamuk wrote:
>> >
>> > > Hey, Derek.
>> > > Thank you for the scripts, they seem to work.
>> >
>> > > However, a couple of questions:
>> >
>> > > 0. So, I've compiled the svn version of tesseract and installed it to the
>> > > /local/tesseract-svn prefix with all language files.
>> > > I've also added /local/tesseract-svn/bin to PATH so that binaries
>> > > from there can be called from scripts.
>> >
>> > > 1. Then, I've created the text.txt file with a nice long text in it.
>> >
>> > > 2. I've run
>> > > python text2img.py -b -i _some_fonts_here
>> > > Now I have png files.
>> >
>> > > 3. Then I run png2tif.sh and get all tif files.
>> > > That's correct.
>> >
>> > > 4. Then I am supposed to run autotrain.sh, right?
>> > > Anyway, it is failing on the first step, make_boxes.sh.
>> > > I debugged the script by putting "set -x" there and I get
>> >
>> > > ---
>> > > + LANG=hye
>> > > + for file in '*.tif'
>> > > ++ basename hye.Dejavu_Serifbold.exp0.tif
>> > > + filename=hye.Dejavu_Serifbold.exp0.tif
>> > > + filename=hye.Dejavu_Serifbold.exp0
>> > > + tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox
>> > > Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata
>> > > Please make sure the TESSDATA_PREFIX environment variable is set to
>> > > the parent directory of your "tessdata" directory.
>> > > Failed loading language 'hye'
>> > > Tesseract couldn't load any languages!
>> > > Could not initialize tesseract.
>> > > ---
>> >
>> > > and the same messages for all the fonts.
>> >
>> > > Obviously, there is no hye.traineddata file there.
>> > > I wonder if it should be there at this step, when I am bootstrapping a
>> > > new language?
>> >
>> > > According to
>> > > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>> > > while bootstrapping a new language one has to issue:
>> > > tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
>> >
>> > > which is what the make_boxes.sh script tries to do, and which fails from
>> > > the command line as well:
>> >
>> > > $ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox
>> > > Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata
>> > > Please make sure the TESSDATA_PREFIX environment variable is set to
>> > > the parent directory of your "tessdata" directory.
>> > > Failed loading language 'hy'
>> > > Tesseract couldn't load any languages!
>> > > Could not initialize tesseract.
>> >
>> > > Any ideas?
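
For anyone hitting the same error: the failure comes from passing "-l" for a language that has no traineddata file yet, and Derek's advice is to leave "-l" out when bootstrapping. Below is a minimal Python sketch of that decision, not code from tess_school. The function name makebox_command is hypothetical, the tessdata directory is passed in explicitly (the sketch does not look at TESSDATA_PREFIX), and the language code is assumed to be the first dot-separated part of the file name, following the [lang].[fontname].exp[num].tif convention quoted above.

```python
from pathlib import Path


def makebox_command(tif_path, tessdata_dir):
    """Build the tesseract makebox command for one training image.

    "-l <lang>" is added only when <lang>.traineddata already exists in
    tessdata_dir; for a brand-new language it is left out.
    (Hypothetical helper, not part of tesseract or tess_school.)
    """
    tif = Path(tif_path)
    if tif.suffix != ".tif":
        raise ValueError(f"expected a .tif file, got {tif.name!r}")

    out_base = tif.with_suffix("")       # e.g. hye.Dejavu_Serifbold.exp0
    lang = tif.name.split(".", 1)[0]     # e.g. hye

    cmd = ["tesseract", str(tif), str(out_base)]
    if (Path(tessdata_dir) / f"{lang}.traineddata").is_file():
        cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


if __name__ == "__main__":
    # Paths taken from the trace above; the result depends on what is
    # actually installed in the tessdata directory.
    print(" ".join(makebox_command(
        "hye.Dejavu_Serifbold.exp0.tif",
        "/local/tesseract-svn/share/tessdata",
    )))
```

The returned list can be handed to subprocess.run as-is. Keeping the check in one place means the same script works both before and after the first traineddata file for the language exists.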
>> >
>> > > On May 24, 11:02 pm, Derek Dohler <[email protected]> wrote:
>> > > > Hi all,
>> >
>> > > > I have been doing a lot of tesseract training recently, so I decided to
>> > > > put together some Python and shell scripts to speed up the process. I
>> > > > haven't done any prep to prepare these for public consumption, but they
>> > > > have made my life a lot easier, so I thought I'd throw them out on the list
>> > > > in case anyone else finds them useful.
>> >
>> > > > Just a heads-up, the default language is Georgian because that's what
>> > > > I'm training for, so make sure to change that to your language when
>> > > > training.
>> >
>> > > > https://github.com/ddohler/tess_school
>> >
>> > > > Cheers,
>> > > > Derek
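
On Balthazar's point about generating the boxfile together with the tif: the reason the boxes can be trusted is that whatever renders the text already knows where each glyph was drawn. A minimal Python sketch of turning one such glyph rectangle into a boxfile line follows. It is not taken from TesseractTrainer, and the function name box_line is hypothetical. It assumes the Tesseract 3 box layout as I understand it (character, left, bottom, right, top, page, with the origin at the bottom-left of the image) and that the renderer reports rectangles with a top-left origin, as most imaging libraries do.

```python
def box_line(char, left, top, right, bottom, image_height, page=0):
    """Format one boxfile line from a glyph rectangle.

    Input: rectangle in image coordinates (origin at the top-left corner).
    Output: "<char> <left> <bottom> <right> <top> <page>" with y values
    measured from the bottom edge of the image (assumed box layout).
    """
    flipped_bottom = image_height - bottom
    flipped_top = image_height - top
    return f"{char} {left} {flipped_bottom} {right} {flipped_top} {page}"


# A glyph drawn at x 10..30 and y 20..50 on a 100-pixel-high page 0.
print(box_line("a", 10, 20, 30, 50, image_height=100))  # a 10 50 30 80 0
```

Writing one such line per rendered glyph, with the page argument tracking the page of a multipage tif, gives a boxfile that matches the image by construction.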

