There are alternative approaches to training. tesstrain.sh in the tesseract repo works from training text and fonts, creating synthetic training data as multi-page TIFFs.
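For illustration, a synthetic-data run with tesstrain.sh looks roughly like this. The font name, directory layout, and output path below are assumptions; adjust them to your checkout of the tesseract, langdata, and tessdata repos:

```shell
# Render training text with a specific font into line images + lstmf files.
# --linedata_only produces the LSTM line data needed for lstmtraining.
src/training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --fontlist "Impact Condensed" \
  --lang eng \
  --linedata_only \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/engtrain
```

The output directory will then contain the `eng.training_files.txt` listfile that lstmtraining consumes.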
The tesstrain repo uses a Makefile for training from images with their corresponding ground truth. For fine-tuning for a font, both can work. Both are also quite fast to try, as you only need to run about 400 iterations.

On Fri, Apr 3, 2020, 16:53 Shree Devi Kumar <[email protected]> wrote:

> As per the info given by Ray Smith, lead developer of tesseract, if you just need to fine-tune for a new font face, use fine-tune by impact.
>
> His example uses the training text from the langdata repo (approx 80 lines) rendered with the font, generating lstmf files and then running lstmtraining on that for about 400 iterations.
>
> Using too few lines or too many iterations will lead to suboptimal results.
>
> You can whitelist only digits to further improve your results.
>
> The above info is for LSTM training - neural network based. That is the only engine that allows fine-tuning.
>
> Your second approach is for the legacy engine. That does not have any option for fine-tuning.
>
> You can see the shreeshrii/tess4training repo for my replication of the tesstutorials by Ray.
>
> On Fri, Apr 3, 2020, 16:40 hmaster <[email protected]> wrote:
>
>> Hello,
>>
>> I am trying to improve accuracy for my use case by fine-tuning. Currently I'm getting between 80-90% accuracy on my scanned images, and around 60% for images taken via phone.
>> I'm running on a Jetson Nano, using:
>>
>> ```
>> tesseract 4.1.1-rc2-21-gf4ef
>>  leptonica-1.78.0
>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>  Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
>> ```
>>
>> I'm training on a single image, just to understand the mechanism and learn about it. I'm using a scanned receipt as an example, at 600 dpi; ImageMagick's identify says it's 1696x3930.
>>
>> I'm confused a bit by this, as the script still runs and the error rate keeps dropping.
>> I've read the tutorials, examples, and scripts, and it's all too much for now; I've been at it for about 2-3 weeks.
>>
>> There are a couple of things that are still unclear to me, and I have some questions:
>>
>> 1. Do I need to create single-line images from each image I have? (~3000)
>> 2. Would it help if I create ground-truth text files - for the entire image, or should I create one only for a single line? (That is, must I have tiff, box, and ground-truth files for each image?)
>> 3. Some of the words in my images are not found in eng.training_files.txt; would it speed things up or help if I add them?
>> 4. Is there a way to do fine-tuning with my own images and my own eng.training_files.txt data, without running tesstrain.sh?
>>
>> I could not find details about how to train/fine-tune with my own tif/box files. I have created a folder with my data and passed it to tesstrain.sh via my_box_tiff_dir, but it's not using those, from what I can tell, as it creates synthetic data.
>> As said above, it's unclear to me whether I need to generate the ground-truth data as well, whether I still need to fiddle with/fix the box files, etc.
>>
>> Sorry if I asked too many questions; I've invested so much time in this, and I'm not sure where exactly I'm going wrong.
>>
>> I've followed the steps in a few of the questions posted in this group, and I am getting decent results; however, they are not as good as using the traineddata_best on its own.
>>
>> The steps I've done were:
>>
>> *Method 1*
>> 1. Create box files via lstmbox and fix any mistakes: tesseract img.tif img --dpi 600 lstmbox
>> 2. Extract the lstm model from eng.traineddata_best
>> 3. Run lstmtraining for fine-tuning: lstmtraining --continue_from ...
>> 4. Generate eng.traineddata: lstmtraining --stop_training ...
>>
>> *Method 2*
>> 1. Create box files via lstmbox and fix any mistakes: tesseract img.tif img --dpi 600 lstmbox
>> 2. Create lstmf files: tesseract img.tif img --dpi 600 lstm.train
>> 3. Extract the unicharset: unicharset_extractor *.box
>> 4. shapeclustering -F font_properties -U unicharset *.tr
>> 5. mftraining -F font_properties -U unicharset -O eng.unicharset *.tr
>> 6. cntraining *.tr
>> 7. Rename inttemp, normproto, pffmtable, shapetable
>> 8. combine_tessdata eng.
>>
>> Thank you for your support and help with my endeavor.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a2a43c7e-c658-4d22-af1c-32dbd1d5b2f4%40googlegroups.com.
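The fine-tune-by-impact workflow discussed in this thread (Method 1 above, plus the digit-whitelist tip) can be sketched as shell commands. The directory layout, checkpoint names, and the use of a single img.tif are assumptions for illustration; adapt them to your own files and listfile:

```shell
# Sketch of LSTM fine-tuning ("fine-tune by impact") - assumed paths.

# 1. Extract the LSTM model from the best traineddata:
combine_tessdata -e tessdata_best/eng.traineddata eng.lstm

# 2. Create an lstmf file from an image + corrected box file
#    (repeat per image; list the resulting .lstmf paths in a listfile):
tesseract img.tif img --dpi 600 lstm.train
ls *.lstmf > eng.training_files.txt

# 3. Fine-tune for roughly 400 iterations:
lstmtraining \
  --model_output finetuned/ft \
  --continue_from eng.lstm \
  --traineddata tessdata_best/eng.traineddata \
  --train_listfile eng.training_files.txt \
  --max_iterations 400

# 4. Pack the best checkpoint back into a usable traineddata:
lstmtraining --stop_training \
  --continue_from finetuned/ft_checkpoint \
  --traineddata tessdata_best/eng.traineddata \
  --model_output finetuned/eng.traineddata

# Optional: restrict recognition to digits at run time:
tesseract receipt.tif out -c tessedit_char_whitelist=0123456789
```

As noted above, this applies only to the LSTM engine; the legacy-engine steps in Method 2 build a model from scratch and have no fine-tuning equivalent.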

