Hi,

How do I roll back to version 4.1?

Sumanth

On Thu, Jul 11, 2019 at 6:58 PM Lorenzo Bolzani <[email protected]> wrote:

> Hi, a few things I would try (*I never trained on cursive fonts*):
>
> - I would use a stable tesseract version (4.1 right now)
> - 0.7 is not a very good score for a text this clean
> - I think 6000 lines is not much; it is hard to tell if it is enough, since
>   this is not a classic font
> - data preprocessing may help, but the sample looks perfectly clean. This
>   is already processed.
> - How much testing data did you use? 20%? Real world accuracy will always
>   be a little worse than testing accuracy because you pick the model that
>   best fits the test dataset. But do not trust your gut on this difference,
>   it's very hard to estimate it informally. Make sure the real document is
>   processed in the same way as the training/test data
> - do some data augmentation: bold, noise, stretch, skew, blur, tiny
>   rotations, etc. to generate more data (not too much, maybe 3 to 5 times
>   more), and also keep the original data. If you use python you can use
>   imgaug.
> - if you can find the font (it should be possible), add some synthetic
>   data too (again with augmentation). There are online tools to find fonts
>   by samples.
> - small label errors are not a big problem if you have a lot of data and
>   if you do not overfit too much. In this case you can first train one
>   model with the current data, then use it to tell you which samples do not
>   match the gt.txt files according to this model. It will likely find most
>   of the mislabeled data. Fix it and then of course train again on the new
>   data. If this is English text you could even run a spell check on the
>   gt.txt files to find some errors.
> - restrict the output charset to only the characters you need
> - there is some "noise/dust" around the text, probably it is just the jpeg
>   compression. I would apply a simple threshold and save the files as png.
>   Noise should not be a problem if it is present in both the training data
>   and the prediction data, but maybe you are getting this extra noise
>   because you saved the file on disk and maybe at runtime you won't have
>   it. Maybe tesseract will remove it for you, but if you want to remove a
>   source of doubt just threshold them.
> - check the boxes of the recognized text to understand what is going on
>   (see ocr_boxes.py or maybe hocr output)
>
> - Your text has long/tall legs: the body is 35px but it goes up to 120
>   with the legs. So I think it is important to understand how your lines
>   are cropped. The input size for the LSTM is 48(*), so if you feed lines
>   120px tall these are going to be downscaled a lot and the core part will
>   suffer most. So maybe (just speculating) it is better to trim the "legs"
>   and the top a little (see the example). In any case I'd try to understand
>   what images are fed to the NN at training time and prediction time.
>
> - your text is aligned extremely well, it does not look like something out
>   of a scanner. Is this real scanned text?
> - as this is English text, consider doing a dictionary spell check/fix.
>
> - maybe also consider training from scratch using only a lot of synthetic
>   data with very similar fonts, then fine-tuning with real data (if you
>   have enough time)
>
> (*) According to this page:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
>
> the input size for the "fast" models is 36 or 48; I suppose it is 48 for
> all the "best" models.
>
> Lorenzo
>
> On Tue, Jul 9, 2019 at 08:22, sai sumanth Kalluri <[email protected]> wrote:
>
>> Hi!
>>
>> I'm trying to teach tesseract to recognize a particularly tricky English
>> font (I do not know the name of the font, and no online tool could find
>> it either), and I have a very high accuracy requirement. It is completely
>> *okay if my model does not generalize to other fonts* and works only on
>> this font. Following are the details of what I've done so far.
>>
>> - I'm using: tesseract 5.0.0-alpha-174-g60b4c
>>     leptonica-1.78.0
>>     libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 :
>>     libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>     Found AVX2
>>     Found AVX
>>     Found SSE
>> - I have approx. 6000 lines of training data; each line has around 12-15
>>   words. I'm guessing around 1 in 50 lines has a mislabelled character
>>   (how much does that affect the result?).
>> - I'm *fine-tuning* the *'eng.traineddata' best* model using this data.
>> - The training as well as the testing data are properly scanned document
>>   images in jpg format, so I'm assuming no data preprocessing is required.
>> - Also, when I apply the final trained model to a document with approx. 50
>>   lines of text, I believe the error rate is definitely higher than what
>>   lstmeval is telling me.
>> - I have trained tesseract on this data incrementally from 300 iterations
>>   to 6000 iterations and the best I could achieve was *after 4200
>>   iterations: Eval Char error rate=0.70714604, Word error rate=1.922281*
>> - After that it has more or less saturated, and I even suspect overfitting
>>   from the kind of errors it's making.
>> - I need to achieve *~0.1 char error rate*. What can be my next steps?
>>   (It is possible for me to create more training data if that's an option,
>>   but I would prefer something simpler, changing a network parameter
>>   perhaps?)
>>
>> (NOTE: The font is indeed very tricky, sometimes even for the human eye,
>> and I have attached a small sample of it with this post)
>> Thanks in advance!
>>
>> (PROBABLY UNNECESSARY DETAIL: full stops (.) and commas (,) are very
>> frequently mislabelled in the training data, but I really don't care about
>> punctuation for my project; I only want accurate detection of the other
>> characters. Should I be worrying about this?)
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/012416b0-4605-494b-a12f-f939ead3d62e%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
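
The suggestion in the quoted reply to let a first-pass model point out samples that disagree with their gt.txt files could be sketched roughly as below. This is only a minimal sketch, assuming the ground-truth text and the model's output for each line have already been loaded into Python strings (how the OCR output is obtained is not shown). The names `edit_distance`, `char_error_rate`, `rank_suspects`, the sample data and the 2% threshold are all illustrative assumptions, not part of tesseract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(truth: str, predicted: str) -> float:
    """Edit distance as a percentage of the ground-truth length."""
    if not truth:
        return 0.0 if not predicted else 100.0
    return 100.0 * edit_distance(truth, predicted) / len(truth)


def rank_suspects(samples, threshold=2.0):
    """Return (name, cer) pairs with CER above threshold, worst first.

    samples maps a line name to a (ground_truth, model_output) tuple.
    """
    scored = [(name, char_error_rate(gt, ocr))
              for name, (gt, ocr) in samples.items()]
    suspects = [(name, cer) for name, cer in scored if cer > threshold]
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Hypothetical data: the second ground-truth line contains a typo ("ovr").
samples = {
    "line_0001": ("the quick brown fox", "the quick brown fox"),
    "line_0002": ("jumps ovr the lazy dog", "jumps over the lazy dog"),
}
for name, cer in rank_suspects(samples):
    print(f"{name}: CER {cer:.1f}% -> check its gt.txt by hand")
```

A high score only means the model and the label disagree; either one may be wrong, so the flagged lines still need a manual look before retraining.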


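
The remark in the quoted reply about tall "legs" comes down to simple arithmetic, sketched below. It assumes the whole line image is scaled uniformly to the network input height (48 px, as the reply supposes for the "best" models). The function name `body_height_after_scaling` and the 60 px cropped height are hypothetical values chosen for illustration.

```python
def body_height_after_scaling(body_px: float, line_px: float,
                              input_px: float = 48) -> float:
    """Height of the letter bodies after the line is scaled to input_px."""
    if line_px <= 0:
        raise ValueError("line_px must be positive")
    return body_px * input_px / line_px


# Figures from the thread: 35 px body inside a 120 px tall line.
full = body_height_after_scaling(35, 120)

# Hypothetical crop that trims the legs so the line is only 60 px tall.
cropped = body_height_after_scaling(35, 60)

print(f"uncropped line: body is {full:.0f} px tall at the network input")
print(f"cropped line:   body is {cropped:.0f} px tall at the network input")
```

Under these assumptions the letter bodies get 14 px of the 48 px input when the full 120 px line is fed in, and 28 px after the crop, which is why the cropping of training and prediction images is worth checking.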