Re: [tesseract-ocr] How to achieve very high fine-tuning accuracy on a particular font of english? (requirement: char error rate < 0.1%)

sai sumanth Kalluri Fri, 22 Nov 2019 06:34:30 -0800

Hi,
How do I roll back to Version 4.1?

Sumanth


On Thu, Jul 11, 2019 at 6:58 PM Lorenzo Bolzani <[email protected]> wrote:

> Hi, a few things I would try (*I never trained on cursive fonts*):
>
> - I would use a stable tesseract version (4.1 right now)
> - 0.7 is not a very good score for a text this clean
> - I think 6000 lines is not much; it is hard to tell if it is enough, since
> this is not a classic font
> - data pre processing may help, but the sample looks perfectly clean. This
> is already processed.
> - How much testing data did you use? 20%? Real world accuracy will always
> be a little worse than testing accuracy because you pick the model that
> best fits the test dataset. But do not trust your gut on this difference,
> it's very hard to estimate it informally. Make sure the real document is
> processed in the same way as the training/test data.
> - do some data augmentation: bold, noise, stretch, skew, blur, tiny
> rotations, etc. to generate more data (not too much, maybe 3 to 5 times
> more), also keep the original data. If you use python you can use imgaug.
> - if you can find the font (it should be possible), add some synthetic data
> too (again with augmentation). There are online tools to find fonts by
> samples.
> - small label errors are not a big problem if you have a lot of data and
> if you do not overfit too much. In this case you can first train one model
> with current data, then use it to tell you which samples do not match the
> gt.txt files according to this model. It will likely find most of the
> mislabeled data. Fix it and then of course train again on the new data. If
> this is English text you could even run a spell check on the gt.txt files
> to find some errors.
> - restrict the output charset only to the characters you need
> - there is some "noise/dust" around the text; it is probably just JPEG
> compression. I would apply a simple threshold and save the files as PNG.
> Noise should not be a problem if it is present in the training data and
> prediction data but maybe you are getting this extra noise because you
> saved the file on disk and maybe at runtime you won't have it. Maybe
> tesseract will remove it for you, but if you want to remove a source of
> doubt just threshold them.
> - check the boxes of the recognized text to understand what is going on
> (see ocr_boxes.py or maybe hocr output)
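
The "train one model, then use it to find samples that do not match the gt.txt files" idea could be sketched roughly as below. This is only a minimal illustration: it assumes you have already collected, for every line image, the ground-truth text and the text produced by the first model (how you run the model is up to you), and the names `edit_distance` and `rank_suspect_lines` are made up for this example. Lines are ranked by character-level edit distance relative to the ground-truth length, so the worst disagreements come first.

```python
from typing import Iterable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # drop ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def rank_suspect_lines(
    samples: Iterable[Tuple[str, str, str]],
) -> List[Tuple[str, float]]:
    """samples holds (name, gt_text, model_text) triples.

    Returns (name, error_rate) for every line where the model disagrees
    with the label, worst disagreement first.
    """
    suspects = []
    for name, gt_text, model_text in samples:
        errors = edit_distance(gt_text, model_text)
        if errors:
            suspects.append((name, errors / max(len(gt_text), 1)))
    return sorted(suspects, key=lambda item: (-item[1], item[0]))
```

A flagged line is not necessarily mislabeled; the model may simply be wrong there, so the list is a starting point for manual review rather than an automatic fix.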
>
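
For the augmentation bullet, here is a toy sketch of the "3 to 5 times more data, keep the originals" idea. In practice a library such as imgaug (mentioned above) would do the pixel work; here images are plain nested lists of gray values so that the snippet runs without dependencies, and the helper names (`add_noise`, `skew`, `augment`) and the parameter ranges are assumptions chosen only for illustration.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def add_noise(img: Image, rng: random.Random, amount: float = 0.02) -> Image:
    """Replace a small random fraction of pixels with a random gray level."""
    return [[rng.randrange(256) if rng.random() < amount else px for px in row]
            for row in img]


def skew(img: Image, max_shift: int, background: int = 255) -> Image:
    """Shear horizontally: the bottom row stays, the top row moves max_shift px.

    Assumes abs(max_shift) is smaller than the image width.
    """
    height, width = len(img), len(img[0])
    out = []
    for y, row in enumerate(img):
        shift = round(max_shift * (height - 1 - y) / max(height - 1, 1))
        if shift >= 0:
            out.append([background] * shift + row[:width - shift])
        else:
            out.append(row[-shift:] + [background] * (-shift))
    return out


def augment(img: Image, copies: int = 3, seed: int = 0) -> List[Image]:
    """Return the original image plus `copies` randomly distorted variants."""
    rng = random.Random(seed)
    variants = [img]  # the original is always kept
    for _ in range(copies):
        sheared = skew(img, rng.randint(-2, 2))
        variants.append(add_noise(sheared, rng))
    return variants
```

Each variant keeps the same dimensions and reuses the original gt.txt label, since none of these distortions change the text.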
> - Your text has long/tall legs, the body is 35px but it goes up to 120
> with the legs. So I think it is important to understand how your lines are
> cropped. The input size for the LSTM is 48(*) so if you feed lines 120px
> tall these are going to be downscaled a lot and the core part will suffer
> most. So maybe (just speculating) it is better to trim the "legs" and the
> top a little (see the example). In any case I'd try to understand what
> images are fed to the NN at training time and prediction time.
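
The downscaling concern is simple arithmetic and can be checked quickly. The sketch below assumes the line image is scaled uniformly to the network input height (48 px, per the footnote, which is itself a guess) and uses a hypothetical tighter crop of 70 px for comparison; `scaled_body_height` is just an illustrative helper.

```python
def scaled_body_height(line_height_px: float,
                       body_height_px: float,
                       net_input_px: float = 48) -> float:
    """Height of the letter body after the whole line is scaled to net_input_px."""
    return body_height_px * net_input_px / line_height_px


# Full 120 px line with a 35 px body, as described above.
print(scaled_body_height(120, 35))  # 14.0

# Hypothetical tighter crop of 70 px around the same 35 px body.
print(scaled_body_height(70, 35))   # 24.0
```

Under these assumptions the letter bodies shrink to about 14 px with the full crop, versus 24 px with the tighter one.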
>
> - your text is aligned extremely well, it does not look like something out
> of a scanner. Is this real scanned text?
> - as this is English text, consider doing a dictionary spell check/fix.
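
A dictionary check of this kind could look roughly like the following. It is only a sketch using Python's standard `difflib`; the word list is something you would have to supply, and the function name `suggest_fixes` and the 0.8 similarity cutoff are arbitrary choices for illustration. The same function can be pointed at gt.txt contents to spot likely label typos.

```python
import difflib
import re
from typing import Dict, Iterable


def suggest_fixes(text: str, vocabulary: Iterable[str],
                  cutoff: float = 0.8) -> Dict[str, str]:
    """Map each unknown word in text to its closest vocabulary word, if any."""
    vocab = sorted({w.lower() for w in vocabulary})
    known = set(vocab)
    fixes = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known:
            continue
        # Best single candidate above the similarity cutoff, or nothing.
        matches = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        if matches:
            fixes[word] = matches[0]
    return fixes


words = ["teach", "tesseract", "to", "recognize", "this", "font"]
print(suggest_fixes("teach tesseract to recogmize this font", words))
```

Automatic replacement can also introduce errors (names, rare words), so it is probably safer to log the suggestions first and only apply them once they look trustworthy.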
>
> - maybe also consider training from scratch using only a lot of
> synthetic data with very similar fonts only, then fine tune with real data
> (if you have enough time)
>
>
>
> (*) According to this page:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
>
> the input size for the "fast" models is 36 or 48; I suppose it is 48 for all
> the "best" models.
>
>
>
> Lorenzo
>
>
>
> On Tue, 9 Jul 2019 at 08:22, sai sumanth Kalluri <
> [email protected]> wrote:
>
>> Hi!
>>
>> I'm trying to teach tesseract to recognize a particularly tricky font of
>> the English language (I do not know the name of the font, and no online
>> tool could find it either) and I have a very high accuracy
>> requirement. It is completely *okay if my model does not generalize to
>> other fonts* and works only on this font. Following are the details
>> about what I've done so far.
>>
>> - I'm using: tesseract 5.0.0-alpha-174-g60b4c
>>                 leptonica-1.78.0
>>                 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng
>> 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>                 Found AVX2
>>                 Found AVX
>>                 Found SSE
>> - I have approx. 6000 lines of training data, each line has around 12-15
>> words. I'm guessing around 1 in 50 lines has a mislabelled character (how
>> much does that affect the result?).
>> - I'm *fine-tuning* the *'eng.traineddata' best* model using this
>> data.
>> - The training as well as the testing data are properly scanned document
>> images in jpg format so I'm assuming no data preprocessing is required.
>> - Also when I apply the final trained model to a document with approx. 50
>> lines of text, I believe the error rate is definitely higher than what
>> lstmeval is telling me.
>> - I have trained tesseract on this data incrementally from 300 iterations
>> to 6000 iterations and the best I could achieve was *after 4200
>> iterations: Eval Char error rate=0.70714604, Word error rate=1.922281*
>> - After that it has more or less saturated and I even suspect overfitting
>> from the kind of errors it's making.
>> - I need to achieve *~0.1 char error rate*. What can be my next steps?
>> (It is possible for me to create more training data if that's an option, but
>> I would prefer something simpler, changing network parameters perhaps?)
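
To get a feel for how much "1 in 50 lines has a mislabelled character" weighs against the 0.1% target, a back-of-the-envelope estimate helps. The sketch below uses the figures from the post (12-15 words per line, taken as 13.5) plus one assumption that is not in the post: roughly 6 characters per word including the following space. `label_noise_rate` is an illustrative helper, not part of any tool.

```python
def label_noise_rate(lines_per_error: float,
                     words_per_line: float,
                     chars_per_word: float) -> float:
    """Approximate share of mislabelled characters, in percent."""
    chars_per_line = words_per_line * chars_per_word
    return 100.0 / (lines_per_error * chars_per_line)


# 1 wrong character per 50 lines, ~13.5 words per line,
# ~6 characters per word (assumed, including the space).
noise = label_noise_rate(50, 13.5, 6)
print(f"label noise ~ {noise:.3f}% of characters")
print(f"target CER   = 0.100%, reported CER = 0.707%")
```

Under these assumptions the label noise comes out at a few hundredths of a percent: well below the reported 0.707%, but no longer negligible next to a 0.1% target, since evaluation against noisy labels counts correct predictions as errors.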
>>
>> (NOTE: The font is indeed very tricky, sometimes even for the human eye,
>> and I have attached a small sample of it with this post)
>> Thanks in advance!
>>
>> (PROBABLY UNNECESSARY DETAIL: full stops (.) and commas (,) are very
>> frequently mislabelled in the training data, but I really don't care about
>> punctuation for my project; I only want accurate detection of the other
>> characters. Should I be worrying about this?)
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/012416b0-4605-494b-a12f-f939ead3d62e%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/012416b0-4605-494b-a12f-f939ead3d62e%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>

