Hello Balthazar,

Do you have a Windows version of your box trainer for Tesseract?

Kind regards,

Richard

On Saturday, June 9, 2012 4:17:11 AM UTC+10, Balthazar Rouberol wrote:
>
> Hello all,
>
> I've written a small Python tool that automates the training process, as 
> well as the generation of the tif (multipage supported) and the boxfile: 
> https://github.com/BaltoRouberol/TesseractTrainer
>
> This can be useful when you want to train Tesseract on a given font, and 
> you thus have to create the tif yourself.
> With this tool, you specify a text, a font (among other things) and a 
> multipage tif containing your text/font will then be generated, along with 
> the corresponding boxfile.
> This allows you to be 100% sure of the boxfile accuracy, and skip the 
> boxfile checking process. 
> The training process can now be fully automated, from the tif generation 
> to the traineddata file combination.
>
> I'll be happy to get feedback from you!
>
> Balthazar
>
> Le mercredi 6 juin 2012 20:26:40 UTC+2, shikamuk a écrit :
>>
>> > Hello from neighboring Georgia! 
>> Yay! 
>> Thank you, I'll do a git pull and give it a try! 
>> Not right now, though, because I am swamped. 
>>
>> I had also already noticed that I can get it to work without "-l". 
>> Thank you again; I expect I will have further questions. 
>>
>> მადლობტ (thank you) 
>>
>> ლორაირ 
>>
>>
>> On Jun 4, 8:03 pm, Derek <[email protected]> wrote: 
>> > Hi Shikamuk, 
>> > 
>> > Hello from neighboring Georgia! You're exactly right, the issue is that 
>> > you don't have hye.traineddata yet. For completely new character sets, 
>> > you need to issue the tesseract command without "-l yournewlanguage". 
>> > The line you're referring to is suggesting what to do after you have 
>> > trained Tesseract on one font in your new language. Since you are 
>> > training for a unique script, it doesn't really matter what you use as 
>> > the language code; you will get equally bad results no matter what. 
>> > 
>> > I don't suggest using auto_train.sh at this stage; you will need to 
>> > edit the boxfiles generated by make_boxes.sh before continuing the 
>> > training process, so I suggest running make_boxes.sh on its own, and 
>> > then using merge_boxes.py and align_boxfile.py along with manual 
>> > editing to get the boxfiles in order before continuing with the 
>> > training process. I've made some small modifications to the scripts and 
>> > README to make this clear, so I suggest doing 'git pull' to get the 
>> > latest copy. 
>> > 
>> > Hope that helps! 
>> > 
>> > Derek 
>> > 
>> > On Sunday, June 3, 2012 10:29:26 PM UTC+4, shikamuk wrote: 
>> > 
>> > > Hey, Derek. 
>> > > Thank you for the scripts, they seem to work. 
>> > 
>> > > However, a couple of questions: 
>> > 
>> > > 0. So, I've compiled the svn version of tesseract and installed it to 
>> > > the /local/tesseract-svn prefix with all language files. 
>> > > I've also added /local/tesseract-svn/bin to PATH so that binaries 
>> > > from there can be called from scripts. 
>> > 
>> > > 1. Then, I've created the text.txt file with a nice long text in it. 
>> > 
>> > > 2.  I've run 
>> > > python text2img.py -b -i _some_fonts_here 
>> > > Now I have png files. 
>> > 
>> > > 3. Then I run png2tif.sh and get all tif files. 
>> > > That's correct. 
>> > 
>> > > 4. Then I am supposed to run autotrain.sh, right? 
>> > > Anyway, it fails at the first step, make_boxes.sh. 
>> > > I debugged the script by adding "set -x" to it, and I get: 
>> > 
>> > > --- 
>> > > + LANG=hye 
>> > > + for file in '*.tif' 
>> > > ++ basename hye.Dejavu_Serifbold.exp0.tif 
>> > > + filename=hye.Dejavu_Serifbold.exp0.tif 
>> > > + filename=hye.Dejavu_Serifbold.exp0 
>> > > + tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox 
>> > > Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata 
>> > > Please make sure the TESSDATA_PREFIX environment variable is set to 
>> > > the parent directory of your "tessdata" directory. 
>> > > Failed loading language 'hye' 
>> > > Tesseract couldn't load any languages! 
>> > > Could not initialize tesseract. 
>> > > --- 
>> > 
>> > > and the same messages for all the fonts. 
>> > 
>> > > Obviously, there is no hye.traineddata file there. 
>> > > I wonder whether it should be there at this step, when I am 
>> > > bootstrapping a new language? 
>> > 
>> > > According to 
>> > > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
>> > > while bootstrapping a new language one has to issue: 
>> > > tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox 
>> > 
>> > > which is what the make_boxes.sh script tries to do, and which also 
>> > > fails from the command line: 
>> > 
>> > > $ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox 
>> > > Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata 
>> > > Please make sure the TESSDATA_PREFIX environment variable is set to 
>> > > the parent directory of your "tessdata" directory. 
>> > > Failed loading language 'hy' 
>> > > Tesseract couldn't load any languages! 
>> > > Could not initialize tesseract. 
>> > 
>> > > Any ideas? 
>> > 
>> > > On May 24, 11:02 pm, Derek Dohler <[email protected]> wrote: 
>> > > > Hi all, 
>> > 
>> > > > I have been doing a lot of tesseract training recently, so I decided 
>> > > > to put together some Python and shell scripts to speed up the 
>> > > > process. I haven't prepared these for public consumption, but they 
>> > > > have made my life a lot easier, so I thought I'd throw them out on 
>> > > > the list in case anyone else finds them useful. 
>> > 
>> > > > Just a heads-up: the default language is Georgian because that's 
>> > > > what I'm training for, so make sure to change that to your language 
>> > > > when training. 
>> > 
>> > > > https://github.com/ddohler/tess_school 
>> > 
>> > > > Cheers, 
>> > > > Derek
>
>
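
For reference, the bootstrapping step discussed in the quoted thread (running tesseract without "-l" while no traineddata exists for the new language yet) could be sketched in Python roughly as follows. This is only a sketch under assumptions, not code from tess_school or TesseractTrainer: the helper names makebox_command and make_boxes are hypothetical, a tesseract binary is assumed to be on PATH, and the images are assumed to be named [lang].[fontname].exp[num].tif.

```python
import subprocess
from pathlib import Path


def makebox_command(tif_path, lang=None):
    """Build the box-generation command for one training image.

    Hypothetical helper. With lang=None the "-l" option is left out,
    which is what the thread recommends for a brand-new language.
    """
    tif = Path(tif_path)
    # Output base is the image name without its ".tif" extension.
    base = str(tif.with_suffix(""))
    cmd = ["tesseract", str(tif), base]
    if lang is not None:
        # Only useful once <lang>.traineddata exists in tessdata.
        cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


def make_boxes(directory=".", lang=None):
    """Run the box-generation command on every .tif in a directory."""
    for tif in sorted(Path(directory).glob("*.tif")):
        # check=True raises CalledProcessError if tesseract fails.
        subprocess.run(makebox_command(tif, lang), check=True)
```

Calling make_boxes() with no language covers the first pass; the generated boxfiles still need manual correction before training continues, as noted in the thread.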

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
