It will work on Windows XP, provided you have installed Python 2.7 and PIL.
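
As a quick sanity check before running the tool on Windows, something like the following sketch can confirm that PIL is importable under the installed Python. The function name check_prerequisites is only illustrative and is not part of TesseractTrainer.

```python
import importlib
import sys


def check_prerequisites(modules=("PIL",)):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in modules:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # The module is not installed for this interpreter.
            status[name] = False
    return status


if __name__ == "__main__":
    print("Python %d.%d" % sys.version_info[:2])
    for name, ok in check_prerequisites().items():
        print("%s: %s" % (name, "found" if ok else "missing"))
```

If PIL is reported as missing, it has to be installed for the same Python interpreter that will run the tool.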

On Sat, Jun 23, 2012 at 1:21 PM, blavatsky3 <[email protected]> wrote:

> Hello Balthazar,
>
> Do you have a Windows version of your box trainer for Tesseract?
>
> kind regards
>
> Richard
>
>
> On Saturday, June 9, 2012 4:17:11 AM UTC+10, Balthazar Rouberol wrote:
>>
>> Hello all,
>>
>> I've written a small Python tool that automates the training process, as
>> well as the generation of the tif (multipage supported) and the boxfile:
>> https://github.com/BaltoRouberol/TesseractTrainer
>>
>> This can be useful when you want to train Tesseract on a given font, and
>> you thus have to create the tif yourself.
>> With this tool, you specify a text, a font (among other things) and a
>> multipage tif containing your text/font will then be generated, along with
>> the corresponding boxfile.
>> This allows you to be 100% sure of the boxfile accuracy, and skip the
>> boxfile checking process.
>> The training process can now be fully automated, from the tif generation
>> to the traineddata file combination.
>>
>> I'll be happy to get feedback from you!
>>
>> Balthazar
>>
>> On Wednesday, June 6, 2012 20:26:40 UTC+2, shikamuk wrote:
>>>
>>> > Hello from neighboring Georgia!
>>> Yay!
>>> Thank you, I'll do a git pull and give it a try!
>>> Not right now, because I am swamped with work.
>>>
>>> I've also already noticed that without "-l" I can get it to work.
>>> Thank you again, I guess I may have further questions.
>>>
>>> მადლობტ (thank you)
>>>
>>> ლორაირ
>>>
>>>
>>> On Jun 4, 8:03 pm, Derek <[email protected]> wrote:
>>> > Hi Shikamuk,
>>> >
>>> > Hello from neighboring Georgia! You're exactly right, the issue is that
>>> > you don't have hye.traineddata yet. For completely new character sets,
>>> > you need to issue the tesseract command without "-l yournewlanguage".
>>> > The line you're referring to is suggesting what to do after you have
>>> > trained Tesseract on one font in your new language. Since you are
>>> > training for a unique script, it doesn't really matter what you use as
>>> > the language code; you will get equally bad results no matter what.
>>> >
>>> > I don't suggest using auto_train.sh at this stage; you will need to edit
>>> > the boxfiles generated by make_boxes.sh before continuing the training
>>> > process, so I suggest running make_boxes.sh on its own, and then using
>>> > merge_boxes.py and align_boxfile.py along with manual editing to get the
>>> > boxfiles in order before continuing with the training process. I've made
>>> > some small modifications to the scripts and README to make this clear,
>>> > so I suggest doing 'git pull' to get the latest copy.
>>> >
>>> > Hope that helps!
>>> >
>>> > Derek
>>> >
>>> > On Sunday, June 3, 2012 10:29:26 PM UTC+4, shikamuk wrote:
>>> >
>>> > > Hey, Derek.
>>> > > Thank you for the scripts; they seem to work.
>>> >
>>> > > However, a couple of questions:
>>> >
>>> > > 0. So, I've compiled the svn version of tesseract and installed it to
>>> > > the /local/tesseract-svn prefix with all language files.
>>> > > I've also added /local/tesseract-svn/bin to PATH so that binaries
>>> > > from there can be called from scripts.
>>> >
>>> > > 1. Then, I've created the text.txt file with a nice long text in it.
>>> >
>>> > > 2.  I've run
>>> > > python text2img.py -b -i _some_fonts_here
>>> > > Now I have png files.
>>> >
>>> > > 3. Then I run png2tif.sh and get all tif files.
>>> > > That's correct.
>>> >
>>> > > 4. Then I am supposed to run auto_train.sh, right?
>>> > > Anyway, it is failing on the first step - make_boxes.sh
>>> > > I debugged the script by putting "set -x" there and I have
>>> >
>>> > > ---
>>> > > + LANG=hye
>>> > > + for file in '*.tif'
>>> > > ++ basename hye.Dejavu_Serifbold.exp0.tif
>>> > > + filename=hye.Dejavu_Serifbold.exp0.tif
>>> > > + filename=hye.Dejavu_Serifbold.exp0
>>> > > + tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox
>>> > > Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata
>>> > > Please make sure the TESSDATA_PREFIX environment variable is set to
>>> > > the parent directory of your "tessdata" directory.
>>> > > Failed loading language 'hye'
>>> > > Tesseract couldn't load any languages!
>>> > > Could not initialize tesseract.
>>> > > ---
>>> >
>>> > > and the same messages for all the fonts.
>>> >
>>> > > Obviously, there is no hye.traineddata file there.
>>> > > I wonder if it should be there at this step, when I am bootstrapping
>>> > > a new language?
>>> >
>>> > > According to
>>> > > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>> > > while bootstrapping a new language one has to issue:
>>> > > tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
>>> >
>>> > > which is what the make_boxes.sh script tries to do, and which also
>>> > > fails from the command line:
>>> >
>>> > > $ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox
>>> > > Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata
>>> > > Please make sure the TESSDATA_PREFIX environment variable is set to
>>> > > the parent directory of your "tessdata" directory.
>>> > > Failed loading language 'hy'
>>> > > Tesseract couldn't load any languages!
>>> > > Could not initialize tesseract.
>>> >
>>> > > Any ideas?
>>> >
>>> > > On May 24, 11:02 pm, Derek Dohler <[email protected]> wrote:
>>> > > > Hi all,
>>> >
>>> > > > I have been doing a lot of tesseract training recently, so I decided
>>> > > > to put together some Python and shell scripts to speed up the process.
>>> > > > I haven't done anything to prepare these for public consumption, but
>>> > > > they have made my life a lot easier, so I thought I'd throw them out
>>> > > > on the list in case anyone else finds them useful.
>>> >
>>> > > > Just a heads-up: the default language is Georgian because that's what
>>> > > > I'm training for, so make sure to change that to your language when
>>> > > > training.
>>> >
>>> > > > https://github.com/ddohler/tess_school
>>> >
>>> > > > Cheers,
>>> > > > Derek
>>
>>  --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
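
To illustrate Derek's advice quoted above (drop "-l" while no traineddata file exists yet for the new language), here is a minimal Python sketch that builds the makebox command line. The helper name makebox_command and its tessdata_dir parameter are hypothetical and are not part of tess_school or Tesseract; only the command layout is taken from the thread.

```python
import os


def makebox_command(tif_path, lang=None, tessdata_dir=None):
    """Build the argument list for a Tesseract 3 'makebox' run.

    "-l <lang>" is added only when <lang>.traineddata already exists in
    tessdata_dir; otherwise the option is left out, as suggested in the
    thread for bootstrapping a completely new language.
    """
    # Output base name: the tif path without its final extension.
    base = os.path.splitext(tif_path)[0]
    cmd = ["tesseract", tif_path, base]
    if lang and tessdata_dir:
        trained = os.path.join(tessdata_dir, lang + ".traineddata")
        if os.path.isfile(trained):
            cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


if __name__ == "__main__":
    # The tessdata path is the one reported in the error message above.
    print(" ".join(makebox_command(
        "hye.Dejavu_Serifbold.exp0.tif",
        lang="hye",
        tessdata_dir="/local/tesseract-svn/share/tessdata",
    )))
```

The returned list can be passed to subprocess.call() once the tesseract binary is on PATH.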
