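Hi, yes, at the very least you can use an automatic thresholding method, such as Otsu's, to find the best threshold for each image. But Otsu implementations usually have a few parameters of their own (for example tile size or smoothing in the adaptive variants), so you need to fine-tune those a little too.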
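To make the idea concrete, here is a minimal sketch of the plain (global) form of Otsu's method in pure Python. It assumes the image is already available as a flat list of 8-bit gray values, and the names `otsu_threshold` and `binarize` are made up for this example. In practice you would more likely call a library routine, for example OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray values in 0..255.
    Pixels with value <= t form one class, pixels > t the other.
    """
    pixels = list(pixels)
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is in the lower class, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map every pixel above t to 255 (white) and the rest to 0 (black)."""
    return [255 if p > t else 0 for p in pixels]


if __name__ == "__main__":
    # Toy "image": half dark pixels, half bright pixels.
    sample = [10] * 50 + [200] * 50
    t = otsu_threshold(sample)
    print("threshold:", t)
    print("white pixels:", binarize(sample, t).count(255))
```

This global version has nothing to tune. The parameters show up once you blur first or compute the threshold per tile, and that is the part that needs checking against your own images.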
What worked best for me was to first do a rough normalization of the images (lightness, contrast) and then do the thresholding. To do this you have to measure the current brightness and/or apply a CLAHE adaptive correction.

https://stackoverflow.com/questions/25008458/how-to-apply-clahe-on-rgb-color-images

I think tesseract is an LSTM+CTC based solution. I think by default it uses one convolutional layer (https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs). So yes, theoretically it could do both the cleanup and the text conversion. Maybe a single conv layer is not enough and you may need more. I would start from scratch with a ton of synthetic data mixed with real data (plus augmentation).

Is it going to work better than external cleanup + fine-tuning? I do not know, and obviously it depends on the specific data.

Note: maybe there is some automatic pre-processing that does a thresholding internally before feeding the data to the NN. If this is the case, it obviously needs to be removed.

BTW: I see now that tesseract can accept a three-channel input, but I do not know how the pre-trained models are configured.

Bye

Lorenzo

On Thu, 4 Apr 2019 at 02:01, Du Kotomi <[email protected]> wrote:

> Thank you so much for sharing.
>
> It seems a very complicated cleanup. It would be very useful if you could
> provide some preprocessing script. And I am wondering whether there are
> also some thresholds that depend on the individual images, right?
>
> By the way, I have read some papers about LSTM+CTC for OCR. The
> advantage of such techniques comes from deep learning: we can get any
> complicated feature from convolution. So theoretically, there is no need
> to do such preprocessing. What do you think about this?
>
> On Wed, Apr 3, 2019 at 21:17 Lorenzo Bolzani <[email protected]> wrote:
>
>> Hi, I train with real data. I use grayscale images, I think color makes
>> no difference.
>>
>> I do a very thorough image cleanup: background removal, denoising,
>> straightening, sharpening, illumination correction, contrast stretching,
>> etc. before passing the text to tesseract. This part is likely better
>> done on color images (you can split into RGB/HSV channels depending on
>> what you need).
>>
>> So my final output is already almost "binary" and I do not do any real
>> binarization/thresholding. I'm not sure if tesseract does it or not, but
>> the difference would be minimal.
>>
>> All the images are rescaled so that the text always has the same height,
>> about 35/40px, with no border or a small (1/2px) border. Try with an
>> evaluation set and see what works best for you.
>>
>> Bye
>>
>> Lorenzo
>>
>> On Wed, 3 Apr 2019 at 11:08, Du Kotomi <[email protected]> wrote:
>>
>>> If we use the text2image tool, there is no such problem.
>>>
>>> What about training with our real data? I have enough images for
>>> training. Do I need to do some preprocessing, like binarizing or
>>> changing the dpi, before doing the LSTM training?
>>>
>>> On Wed, Apr 3, 2019 at 16:36 Shree Devi Kumar <[email protected]> wrote:
>>>
>>>> Usually for LSTM training we use synthetic images created by the
>>>> text2image program from training text and fonts, via tesstrain.sh or
>>>> tesstrain.py. Hence there is no question of binarization or dpi, as
>>>> the program creates images as expected by the tesseract training
>>>> process.
>>>>
>>>> On Wed, Apr 3, 2019 at 12:31 PM Du Kotomi <[email protected]> wrote:
>>>>
>>>>> Anybody here?
>>>>>
>>>>> On Wed, Apr 3, 2019 at 09:57 <[email protected]> wrote:
>>>>>
>>>>>> Sorry to disturb you again. I have sent my issue before, but no one
>>>>>> has given an answer. I really need your help.
>>>>>>
>>>>>> I went through the source code and found that tesseract does Otsu
>>>>>> thresholding and puts the binary pix in the Thresholder object.
>>>>>> But it seems the Thresholder object is not invoked if I use the
>>>>>> LSTM engine.
>>>>>> As for dpi, the tesseract wiki says that 300 dpi works best.
>>>>>> This is a requirement for the tesseract 3.0 engine or earlier, right?
>>>>>> If I train LSTM tesseract, it doesn't matter whether I binarize
>>>>>> or change the dpi of the images, right?
>>>>>>
>>>>>> I would very much appreciate any response. Thank you so much!
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f1004b09-daa5-4d6b-909b-ad8eac267d34%40googlegroups.com
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>> --
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

