Sorry, I forgot to mention that I'm fine-tuning this with the "best" eng.traineddata file as the base model.

On Tuesday, 9 July 2019 11:43:28 UTC+5:30, sai sumanth Kalluri wrote:
>
> Hi!
>
> I'm trying to teach Tesseract to recognize a particularly tricky English
> font (I do not know the name of the font, and no online tool could
> identify it either), and I have a very high accuracy requirement. It is
> completely okay if my model does not generalize to other fonts and works
> only on this font. Here are the details of what I've done so far.
>
> - I'm using:
>   tesseract 5.0.0-alpha-174-g60b4c
>    leptonica-1.78.0
>     libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 :
>     libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>    Found AVX2
>    Found AVX
>    Found SSE
> - I have approx. 6000 lines of training data, and each line has around
>   12-15 words. I'm guessing around 1 in 50 lines has a mislabelled
>   character (how much does that affect the result?).
> - Also, when I apply the final trained model to a document with approx. 50
>   lines of text, I believe the error rate is definitely higher than what
>   lstmeval is telling me.
> - I have trained Tesseract on this data incrementally from 300 iterations
>   to 6000 iterations, and the best I could achieve was after 4200
>   iterations:
>   Eval Char error rate=0.70714604, Word error rate=1.922281
> - After that it has more or less saturated, and I even suspect overfitting
>   from the kind of errors it's making.
> - I need to achieve ~0.1 char error rate. What can be my next steps? (It
>   is possible for me to create more training data if that's an option, but
>   I would prefer something simpler, changing a network parameter perhaps?)
>
> (NOTE: The font is indeed very tricky, sometimes even for the human eye,
> and I have attached a small sample of it with this post.)
>
> Thanks in advance!
>
> (PROBABLY UNNECESSARY DETAIL: full stops (.) and commas (,) are very
> frequently mislabelled in the training data, but I really don't care about
> punctuation for my project; I only want accurate detection of the other
> characters. Should I be worrying about this?)
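
One way to check the suspicion that the real-world error rate is higher than the figure reported during evaluation is to measure it directly on the 50-line document. The following is only a rough sketch, not how lstmeval computes its number: it assumes you have (ground truth, OCR output) line pairs as plain strings, it assumes the reported figure is a percentage, and the helper names (`edit_distance`, `strip_chars`, `char_error_rate`) are made up for illustration. Since punctuation does not matter for this project, full stops and commas are dropped from both sides before comparing.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def strip_chars(text: str, chars: str = ".,") -> str:
    """Remove characters that should not count towards the error rate."""
    return text.translate(str.maketrans("", "", chars))


def char_error_rate(pairs, ignore: str = ".,") -> float:
    """Character error rate in percent over (ground_truth, ocr_output) pairs."""
    errors = 0
    total = 0
    for ref, hyp in pairs:
        ref = strip_chars(ref, ignore)
        hyp = strip_chars(hyp, ignore)
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return 100.0 * errors / total if total else 0.0
```

Calling `char_error_rate` on the line pairs from the test document, once with `ignore=".,"` and once with `ignore=""`, would show how much of the measured error comes from punctuation alone.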

