Re: traineddata file size varies according to box file images?

Frederico Ferro Schuh Tue, 25 Feb 2014 23:21:09 -0800

I created my traineddata by following these two guides:

    http://blog.cedric.ws/how-to-train-tesseract-301
    https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3


Below I describe every step I used, in detail.
I named my test font hwdigitbig.
Here are the steps:

- Create 1 box file for each of my TIF files (each TIF holds samples for 1 
digit):
    tesseract eng.hwdigitbig.exp0.tif eng.hwdigitbig.exp0 batch.nochop makebox
    tesseract eng.hwdigitbig.exp1.tif eng.hwdigitbig.exp1 batch.nochop makebox
    ...
    tesseract eng.hwdigitbig.exp9.tif eng.hwdigitbig.exp9 batch.nochop makebox

- Open box files in jTessBoxEditor and fix incorrect values

- Also in jTessBoxEditor, split/merge invalid bounding boxes (I get many 
bad bounding boxes in those samples, some spanning 3 characters vertically; 
I guess I need to clean the images a bit)

- Retrain tesseract with fixed box files for each digit
    tesseract eng.hwdigitbig.exp0.tif eng.hwdigitbig.exp0.box nobatch box.train
    ...
    tesseract eng.hwdigitbig.exp9.tif eng.hwdigitbig.exp9.box nobatch box.train

- Generate unicharset for all boxes together
    unicharset_extractor eng.hwdigitbig.exp0.box eng.hwdigitbig.exp1.box \
        eng.hwdigitbig.exp2.box eng.hwdigitbig.exp3.box \
        eng.hwdigitbig.exp4.box eng.hwdigitbig.exp5.box \
        eng.hwdigitbig.exp6.box eng.hwdigitbig.exp7.box \
        eng.hwdigitbig.exp8.box eng.hwdigitbig.exp9.box

- Font properties file (the simplest font possible, no effects applied to 
it)
    echo "hwdigitbig 0 0 0 0 0" > font_properties

- Clustering step (2 commands, all trained box files together on each 
command)
    mftraining -F font_properties -U unicharset -O eng.unicharset \
        eng.hwdigitbig.exp0.box.tr eng.hwdigitbig.exp1.box.tr \
        eng.hwdigitbig.exp2.box.tr eng.hwdigitbig.exp3.box.tr \
        eng.hwdigitbig.exp4.box.tr eng.hwdigitbig.exp5.box.tr \
        eng.hwdigitbig.exp6.box.tr eng.hwdigitbig.exp7.box.tr \
        eng.hwdigitbig.exp8.box.tr eng.hwdigitbig.exp9.box.tr

    cntraining eng.hwdigitbig.exp0.box.tr eng.hwdigitbig.exp1.box.tr \
        eng.hwdigitbig.exp2.box.tr eng.hwdigitbig.exp3.box.tr \
        eng.hwdigitbig.exp4.box.tr eng.hwdigitbig.exp5.box.tr \
        eng.hwdigitbig.exp6.box.tr eng.hwdigitbig.exp7.box.tr \
        eng.hwdigitbig.exp8.box.tr eng.hwdigitbig.exp9.box.tr

- Renaming generated files. The resulting files are:
    eng.shapetable
    eng.normproto
    eng.inttemp
    eng.pffmtable

- Generating traineddata
    combine_tessdata eng

- The last step generates this file (137 KB):
    eng.traineddata

- I then rename this file to my new test language name, which is the same 
as my font name:
    hwdigitbig.traineddata
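
For reference, the steps after the box files have been corrected can be sketched as one shell script. This is only a sketch under assumptions, not a tested recipe: it assumes the Tesseract 3.0x training tools are on the PATH, that font_properties already exists, and that the normalization step is cntraining (the tool name used in the Tesseract 3 training guide). LANG_CODE, FONT, DRY_RUN, run, names and train_all are illustrative names of my own, not part of Tesseract.

```shell
#!/bin/sh
# Sketch of the training steps above (Tesseract 3.0x tools assumed on PATH).
# LANG_CODE, FONT and DRY_RUN are illustrative names, not Tesseract settings.
LANG_CODE=eng
FONT=hwdigitbig
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = also run them

# Print a command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Print the ten per-digit file names with the given suffix,
# e.g. "names .box" gives eng.hwdigitbig.exp0.box ... eng.hwdigitbig.exp9.box
names() {
    for i in 0 1 2 3 4 5 6 7 8 9; do
        printf '%s.%s.exp%s%s ' "$LANG_CODE" "$FONT" "$i" "$1"
    done
}

train_all() {
    # Training pass over each corrected box file (yields the .box.tr files).
    for i in 0 1 2 3 4 5 6 7 8 9; do
        base="$LANG_CODE.$FONT.exp$i"
        run tesseract "$base.tif" "$base.box" nobatch box.train
    done
    # The unquoted $(names ...) is split into separate arguments on purpose.
    run unicharset_extractor $(names .box)
    run mftraining -F font_properties -U unicharset \
        -O "$LANG_CODE.unicharset" $(names .box.tr)
    run cntraining $(names .box.tr)
    # Prefix the clustering outputs with the language code.
    for f in shapetable normproto inttemp pffmtable; do
        run mv "$f" "$LANG_CODE.$f"
    done
    run combine_tessdata "$LANG_CODE"
    run mv "$LANG_CODE.traineddata" "$FONT.traineddata"
}
```

Calling train_all with the default DRY_RUN=1 only prints the commands, which makes it easy to compare them with the list above; setting DRY_RUN=0 before calling it would actually run them.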


So that concludes the steps I used.
The traineddata generated with the steps above is 137 KB, whether I use 
my big samples of 6000 characters per digit or the reduced files of 1000 
samples per digit.
The OCR results are not satisfactory at all; in fact, even the default 
eng language gives better results for handwriting recognition.
Any ideas/suggestions?
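
One check that might help narrow this down is counting how many samples of each label the corrected box files actually contain. The sketch below assumes the Tesseract 3 box format, where each line is "glyph left bottom right top page"; count_boxes is a helper name made up for this example.

```shell
# count_boxes FILE...: for each box file, print "file glyph count"
# for every glyph label found in the first column.
count_boxes() {
    for f in "$@"; do
        awk -v file="$f" '{ n[$1]++ } END { for (c in n) print file, c, n[c] }' "$f" | sort
    done
}
```

For example, count_boxes eng.hwdigitbig.exp*.box would show whether each file really holds about 6000 (or 1000) boxes of a single digit, or whether some labels were left wrong after editing.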

Thank you very much!

On Wednesday, February 26, 2014 7:31:03 AM UTC+8, peiman F. wrote:
>
> I have this problem too. 
> I used jTessBoxEditor to train Tesseract. 
> My TIF file had 34000 words, and I built it from a 50-page TIFF file, 
>
> but the output trained file was 1.5 MB and didn't detect any words! 
>
> Does jTessBoxEditor have a problem? 
>
> On 2/25/14, Bernard Polarski <[email protected]> wrote: 
> > How do you produce your traineddata ? 
> > 
> > 
> > 
> > On Tuesday, 25 February 2014 17:51:39 UTC+1, Frederico Ferro Schuh wrote: 
> >> 
> >> Hello all, 
> >> 
> >> I'm training Tesseract to recognize handwritten digits, and I have 
> >> provided it about 6000 samples of each digit, in 10 different box files, 
> >> one for each digit. Each box file is a 2152x2152 TIF file. However, the 
> >> resulting traineddata file I get after completing the training procedure 
> >> is only 137 kb. 
> >> I went through the process again, providing smaller sample files (1000 
> >> samples of each digit), and ended up with the same traineddata size of 
> >> 137 kb. 
> >> Is this size reasonable or am I doing something wrong? 
> >> I assume something is wrong because my results are pretty bad so far. 
> >> 
> >> I've attached the sample image I am using for the digit 0. 
> >> 
> >> Thanks in advance, 
> >> Fred 
> >> 
> > 
> > -- 
> > -- 
> > You received this message because you are subscribed to the Google 
> > Groups "tesseract-ocr" group. 
> > To post to this group, send email to 
> > [email protected] 
> > To unsubscribe from this group, send email to 
> > [email protected] 
> > For more options, visit this group at 
> > http://groups.google.com/group/tesseract-ocr?hl=en 
> > 
> > --- 
> > You received this message because you are subscribed to the Google 
> > Groups "tesseract-ocr" group. 
> > To unsubscribe from this group and stop receiving emails from it, send 
> > an email to [email protected]. 
> > For more options, visit https://groups.google.com/groups/opt_out. 
> > 
>

