Search the forum for Cursive

On Thu, 11 Jul 2019, 13:00 sai sumanth Kalluri, <[email protected]> wrote:
> Thanks for the reply but that link does not lead anywhere. Could you
> please correct it?
>
> On Thursday, 11 July 2019 12:34:38 UTC+5:30, shree wrote:
>>
>> See
>> https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!searchin/tesseract-ocr/Cursive/tesseract-ocr/6naBkXZvTlI
>>
>> On Thu, 11 Jul 2019, 11:58 sai sumanth Kalluri, <[email protected]>
>> wrote:
>>
>>> Can somebody please give me some advice regarding this?
>>>
>>> On Tuesday, 9 July 2019 11:52:28 UTC+5:30, sai sumanth Kalluri wrote:
>>>>
>>>> Hi!
>>>>
>>>> I'm trying to teach tesseract to recognize a particularly tricky font
>>>> of the English language (I do not know the name of the font, and no
>>>> online tool could identify it either) and I have a very high accuracy
>>>> requirement. It is completely *okay if my model does not generalize to
>>>> other fonts* and works only on this font. Following are the details
>>>> about what I've done so far.
>>>>
>>>> - I'm using: tesseract 5.0.0-alpha-174-g60b4c
>>>>   leptonica-1.78.0
>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 :
>>>>   libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>>>   Found AVX2
>>>>   Found AVX
>>>>   Found SSE
>>>> - I have approx. 6000 lines of training data; each line has around
>>>>   12-15 words. I'm guessing around 1 in 50 lines has a mislabelled
>>>>   character (how much does that affect the result?).
>>>> - I'm *fine-tuning* the *'eng.traineddata' best* model using this
>>>>   data.
>>>> - The training as well as the testing data are properly scanned
>>>>   document images in jpg format, so I'm assuming no data preprocessing
>>>>   is required.
>>>> - Also, when I apply the final trained model to a document with
>>>>   approx. 50 lines of text, I believe the error rate is definitely
>>>>   higher than what lstmeval is telling me.
>>>> - I have trained tesseract on this data incrementally from 300
>>>>   iterations to 6000 iterations, and the best I could achieve was
>>>>   *after 4200 iterations: Eval Char error rate=0.70714604, Word error
>>>>   rate=1.922281*
>>>> - After that it has more or less saturated, and I even suspect
>>>>   overfitting from the kind of errors it's making.
>>>> - I need to achieve *~0.1 char error rate*. What can be my next
>>>>   steps? (It is possible for me to create more training data if that's
>>>>   an option, but I would prefer something simpler, changing network
>>>>   parameters perhaps?)
>>>>
>>>> (NOTE: The font is indeed very tricky, sometimes even for the human
>>>> eye, and I have attached a small sample of it with this post.)
>>>> Thanks in advance!
>>>>
>>>> (PROBABLY UNNECESSARY DETAIL: full stops (.) and commas (,) are very
>>>> frequently mislabelled in the training data, but I really don't care
>>>> about punctuation for my project; I only want accurate detection of
>>>> the other characters. Should I be worrying about this?)
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/35abc1cd-552b-405c-85be-9e0af720b04d%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/35abc1cd-552b-405c-85be-9e0af720b04d%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
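One point in the original post is that the error rate seen on a real 50-line document seems higher than what the evaluation tool reports. A minimal sketch for checking this independently is below: it computes a character error rate from (ground truth, OCR output) line pairs using plain Levenshtein distance. The function names and the `ignore` parameter are made up for illustration and are not part of tesseract. The result is expressed in percent on the assumption that this matches the scale of the figures quoted in the post; the evaluation tool's own counting may differ in detail.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(pairs: Iterable[Tuple[str, str]], ignore: str = "") -> float:
    """Character error rate in percent over (ground_truth, ocr_output) pairs.

    Characters listed in `ignore` (a hypothetical convenience parameter)
    are stripped from both strings before comparison.
    """
    total_errors = 0
    total_chars = 0
    for ref, hyp in pairs:
        ref = "".join(c for c in ref if c not in ignore)
        hyp = "".join(c for c in hyp if c not in ignore)
        total_errors += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("no ground-truth characters to evaluate")
    return 100.0 * total_errors / total_chars
```

Calling `char_error_rate(pairs, ignore=".,")` gives a figure that leaves out full stops and commas, which the post says do not matter for this project. Comparing that number on the held-out document against the number on the evaluation set is one way to see whether the gap is real.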

