There are alternative approaches to training. tesstrain.sh in the tesseract repo works from training text and fonts, creating synthetic training data as multi-page TIFFs.
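For illustration, a synthetic-data run with tesstrain.sh looks roughly like this. The font name, directory layout, and output path below are assumptions; adjust them to your checkout of the tesseract, langdata, and tessdata repos:

```shell
# Render training text with a specific font into line images + lstmf files.
# --linedata_only produces the LSTM line data needed for lstmtraining.
src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --fontlist "Impact Condensed" \
  --lang eng \
  --linedata_only \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/engtrain
```

The output directory will then contain the `eng.training_files.txt` listfile that lstmtraining consumes.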
The tesstrain repo uses a Makefile for training from images with their corresponding ground truth. For fine-tuning for a font, both can work. Both are also quite fast to try, as you only need to run about 400 iterations.

On Fri, Apr 3, 2020, 16:53 Shree Devi Kumar <[email protected]> wrote:

> As per the info given by Ray Smith, lead developer of tesseract, if you just need to fine-tune for a new font face, use fine-tune by impact.
>
> His example uses the training text from the langdata repo (approx 80 lines) rendered with the font, generating lstmf files and then running lstmtraining on that for about 400 iterations.
>
> Using too few lines or too many iterations will lead to suboptimal results.
>
> You can whitelist only digits to further improve your results.
>
> The above info is for LSTM training - neural network based. That is the only engine that allows fine-tuning.
>
> Your second approach is for the legacy engine. That does not have any option for fine-tuning.
>
> You can see the shreeshrii/tess4training repo for my replication of the tesstutorials by Ray.
>
> On Fri, Apr 3, 2020, 16:40 hmaster <[email protected]> wrote:
>
>> Hello,
>>
>> I am trying to improve accuracy for my use case by fine-tuning. Currently I'm getting between 80-90% accuracy on my scanned images, and around 60% for images taken via phone.
>> I'm running on a Jetson Nano, using:
>>
>> ```
>> tesseract 4.1.1-rc2-21-gf4ef
>>  leptonica-1.78.0
>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>  Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
>> ```
>>
>> I'm training on a single image, just to understand the mechanism and learn about it. I'm using a scanned receipt as an example, at 600 dpi; ImageMagick's identify says it's 1696x3930.
>>
>> I'm confused a bit by this, as the script still runs and the error rate keeps dropping.
>> I've read the tutorials, examples, and scripts, and it's all too much for now; I've been at it for about 2-3 weeks.
>>
>> There are a couple of things that are still unclear to me, and I have some questions:
>>
>> 1. Do I need to create single-line images from each image I have? (~3000)
>> 2. Would it help if I create ground-truth text files - for the entire image, or should I create one only for a single line? (That is, must I have tiff, box, and ground-truth files for each image?)
>> 3. Some of the words in my images are not found in eng.training_files.txt; would it speed things up or help if I add them?
>> 4. Is there a way to do fine-tuning with my own images and my own eng.training_files.txt data, without running tesstrain.sh?
>>
>> I could not find details about how to train/fine-tune with my own tif/box files. I have created a folder with my data and passed it to tesstrain.sh via my_box_tiff_dir, but it's not using those, from what I can tell, as it creates synthetic data.
>> As said above, it's unclear to me whether I need to generate the ground-truth data as well, whether I still need to fiddle with/fix the box files, etc.
>>
>> Sorry if I asked too many questions; I've invested so much time in this, and I'm not sure where exactly I'm going wrong.
>>
>> I've followed the steps in a few of the questions posted in this group, and I am getting decent results; however, they are not as good as using the traineddata_best on its own.
>>
>> The steps I've done were:
>>
>> *Method 1*
>> 1. Create box files via lstmbox and fix any mistakes: tesseract img.tif img --dpi 600 lstmbox
>> 2. Extract the lstm model from eng.traineddata_best
>> 3. Run lstmtraining for fine-tuning: lstmtraining --continue_from ...
>> 4. Generate eng.traineddata: lstmtraining --stop_training ...
>>
>> *Method 2*
>> 1. Create box files via lstmbox and fix any mistakes: tesseract img.tif img --dpi 600 lstmbox
>> 2. Create lstmf files: tesseract img.tif img --dpi 600 lstm.train
>> 3. Extract the unicharset: unicharset_extractor *.box
>> 4. shapeclustering -F font_properties -U unicharset *.tr
>> 5. mftraining -F font_properties -U unicharset -O eng.unicharset *.tr
>> 6. cntraining *.tr
>> 7. Rename inttemp, normproto, pffmtable, shapetable
>> 8. combine_tessdata eng.
>>
>> Thank you for your support and help with my endeavor.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a2a43c7e-c658-4d22-af1c-32dbd1d5b2f4%40googlegroups.com.
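The fine-tune-by-impact workflow discussed in this thread (Method 1 above, plus the digit-whitelist tip) can be sketched as shell commands. The directory layout, checkpoint names, and the use of a single img.tif are assumptions for illustration; adapt them to your own files and listfile:

```shell
# Sketch of LSTM fine-tuning ("fine-tune by impact") - assumed paths.

# 1. Extract the LSTM model from the best traineddata:
combine_tessdata -e tessdata_best/eng.traineddata eng.lstm

# 2. Create an lstmf file from an image + corrected box file
#    (repeat per image; list the resulting .lstmf paths in a listfile):
tesseract img.tif img --dpi 600 lstm.train
ls *.lstmf > eng.training_files.txt

# 3. Fine-tune for roughly 400 iterations:
lstmtraining \
  --model_output finetuned/ft \
  --continue_from eng.lstm \
  --traineddata tessdata_best/eng.traineddata \
  --train_listfile eng.training_files.txt \
  --max_iterations 400

# 4. Pack the best checkpoint back into a usable traineddata:
lstmtraining --stop_training \
  --continue_from finetuned/ft_checkpoint \
  --traineddata tessdata_best/eng.traineddata \
  --model_output finetuned/eng.traineddata

# Optional: restrict recognition to digits at run time:
tesseract receipt.tif out -c tessedit_char_whitelist=0123456789
```

As noted above, this applies only to the LSTM engine; the legacy-engine steps in Method 2 build a model from scratch and have no fine-tuning equivalent.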

