Hi Shree,
I'd love to, but this is a commercial project, so I cannot share the
current solution.

I will try to find the old scripts I used for the first attempts. Basically
it was something like this:

- normalize lightness
- make illumination uniform (CLAHE on HSV "V" channel)
- denoise/divide to remove background (with custom level based on noise
estimation)
- normalize text size to a fixed value
- remove "dust" with morphological operations
- remove light gray shades with a "soft threshold"
- stretch contrast/histogram
- straighten text (and dewarp for very long lines)
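
As an illustration of the last cleanup steps in the list above, here is a
minimal numpy sketch of a "soft threshold" followed by a contrast stretch.
It is only one reading of the term: pixels lighter than a cut-off are pushed
to white while darker pixels keep their gray value, so nothing is binarized.
The function names, the cut-off of 200 and the 1st/99th percentiles are
assumptions made for the example, not values from the original scripts.

```python
import numpy as np


def soft_threshold(gray, cutoff=200):
    """Push light gray shades to white; darker pixels keep their value.

    `cutoff` is a hypothetical default, to be tuned per image source.
    """
    out = gray.copy()
    out[gray >= cutoff] = 255
    return out


def stretch_contrast(gray, low_pct=1, high_pct=99):
    """Linearly map the [low_pct, high_pct] percentile range to [0, 255]."""
    lo, hi = (float(p) for p in np.percentile(gray, [low_pct, high_pct]))
    if hi <= lo:
        # Flat image: nothing to stretch.
        return gray.copy()
    scaled = (gray.astype(np.float32) - lo) * (255.0 / (hi - lo))
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


# Synthetic example: light gray "paper" with one darker stroke.
page = np.full((40, 40), 210, dtype=np.uint8)
page[10:30, 18:22] = 90
cleaned = stretch_contrast(soft_threshold(page))
```

Both functions expect a single-channel uint8 image, for example the result
of cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).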

I used OpenCV and PIL.

The main problem is that a ton of fine-tuning is required for each of these
steps if the input is random pictures from smartphones, scanners, etc.
It also depends on how noisy the background is and on whether color can be
used as a hint for background detection. For example, converting the image
to HSV makes it very simple to remove colored noise or a colored background:
you select the parts with high saturation with a numpy mask and set them to
white or black depending on their luminance.
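
The saturation mask described above could look roughly like this. The sketch
computes HSV saturation and value directly from RGB with numpy so that it
runs without OpenCV; with OpenCV the same channels come from
cv2.cvtColor(img, cv2.COLOR_BGR2HSV), where S and V range from 0 to 255 for
8-bit images. The function name, the saturation cut-off of 0.35 and the use
of V as the "luminance" are assumptions for illustration.

```python
import numpy as np


def remove_colored_areas(rgb, sat_min=0.35, lum_split=128):
    """Flatten strongly colored pixels to white or black.

    rgb: uint8 array of shape (H, W, 3).
    sat_min, lum_split: hypothetical thresholds, to be tuned per source.
    """
    pix = rgb.astype(np.float32)
    value = pix.max(axis=2)                # HSV "V" channel, 0..255
    chroma = value - pix.min(axis=2)
    # HSV saturation in [0, 1]; black pixels (value == 0) get saturation 0.
    sat = np.divide(chroma, value, out=np.zeros_like(chroma), where=value > 0)

    colored = sat >= sat_min               # boolean numpy mask
    out = rgb.copy()
    out[colored & (value >= lum_split)] = 255   # bright colored areas -> white
    out[colored & (value < lum_split)] = 0      # dark colored areas -> black
    return out


# Synthetic example: gray paper, a bright red mark, a dark blue mark, gray text.
img = np.full((4, 4, 3), 240, dtype=np.uint8)
img[0, 0] = (255, 40, 40)
img[1, 1] = (0, 0, 90)
img[2, 2] = (30, 30, 30)
flattened = remove_colored_areas(img)
```

Gray pixels have near-zero saturation, so the paper and the text itself are
left untouched; only the colored areas are flattened.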

Measuring noise, blurriness, contrast, etc. helps to decide which processing
to apply, or how strongly to apply it in proportion to the measured value.
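
A rough sketch of such measurements, using common heuristics rather than the
ones from the original scripts: contrast as the spread between the 5th and
95th percentiles, sharpness as the variance of a 4-neighbour Laplacian, and
noise as a robust (median-based) estimate from the same Laplacian. The
function names, the percentiles and the max_noise scale are all assumptions.

```python
import numpy as np


def laplacian(gray):
    """4-neighbour Laplacian of the interior pixels, as float32."""
    g = gray.astype(np.float32)
    return (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
            - 4.0 * g[1:-1, 1:-1])


def measure(gray):
    """Return rough contrast, sharpness and noise figures for a gray image."""
    lap = laplacian(gray)
    p5, p95 = np.percentile(gray, [5, 95])
    # Median absolute deviation: robust to the few large values at text edges.
    mad = np.median(np.abs(lap - np.median(lap)))
    return {
        "contrast": float(p95 - p5),
        "sharpness": float(lap.var()),
        # The kernel's squared weights sum to 20, so for independent Gaussian
        # noise the Laplacian's standard deviation is sqrt(20) * sigma.
        "noise": float(1.4826 * mad / np.sqrt(20.0)),
    }


def denoise_strength(noise, max_noise=20.0):
    """Hypothetical mapping from measured noise to a 0..1 filter strength."""
    return float(np.clip(noise / max_noise, 0.0, 1.0))
```

The returned values are only comparable between images of similar size and
text height, so they are best used after the size normalization step.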

Many fine-tuning values also depend on the image/text size.

Gaussian difference and divide are the best ways I found for general cleanup.
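
A minimal sketch of both operations, assuming "divide" means dividing the
image by a heavily blurred copy of itself (an estimate of the background)
and "Gaussian difference" means a difference of two Gaussian blurs. The blur
is written in plain numpy so the example is self-contained; in OpenCV the
usual calls would be cv2.GaussianBlur and cv2.divide. The function names and
sigma values are assumptions and depend on the text size.

```python
import numpy as np


def gaussian_blur(gray, sigma):
    """Separable Gaussian blur with edge padding; returns a float array."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float32)
    kernel = np.exp(-(x * x) / (2.0 * sigma * sigma))
    kernel /= kernel.sum()
    padded = np.pad(gray.astype(np.float32), radius, mode="edge")
    # Blur along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def divide_by_background(gray, sigma=15.0):
    """Divide the image by its blurred version to flatten the background."""
    background = gaussian_blur(gray, sigma)
    ratio = gray.astype(np.float32) / np.maximum(background, 1.0)
    return np.clip(np.rint(ratio * 255.0), 0, 255).astype(np.uint8)


def difference_of_gaussians(gray, fine=1.0, coarse=4.0):
    """Band-pass filter: fine blur minus coarse blur (signed float result)."""
    return gaussian_blur(gray, fine) - gaussian_blur(gray, coarse)
```

For the divide, sigma should be clearly larger than the stroke width, so
that the blurred copy follows the illumination and not the text.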

Sometimes multiply works great for detail enhancement of low-contrast
images.
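
A small sketch, assuming "multiply" refers to the multiply blend mode known
from image editors, here with the image blended with itself. White stays
white and black stays black, while mid-grays get darker, which widens the
gap between faint strokes and light paper. The function name and the passes
parameter are hypothetical.

```python
import numpy as np


def multiply_blend(gray, passes=1):
    """Multiply a uint8 gray image with itself `passes` times."""
    norm = gray.astype(np.float32) / 255.0
    out = norm.copy()
    for _ in range(passes):
        out = out * norm
    return np.clip(np.rint(out * 255.0), 0, 255).astype(np.uint8)


# White, light paper, faint text, black.
sample = np.array([[255, 230, 180, 0]], dtype=np.uint8)
blended = multiply_blend(sample)
```

Since the paper also gets slightly darker, a contrast stretch afterwards is
usually needed to bring the background back to white.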

I can try to put together a small sample script, because there are not many
around, or at least they are not easy to find. I do not have much time for
it, but I'll try.



Bye

Lorenzo


On Wed, Apr 3, 2019 at 18:31 Shree Devi Kumar <
[email protected]> wrote:

> Hi Lorenzo,
>
> Do you have a script for image pre-processing? Please share, if possible.
> It will be helpful to many.
>
> On Wed, Apr 3, 2019 at 6:47 PM Lorenzo Bolzani <[email protected]>
> wrote:
>
>> Hi, I train with real data. I use grayscale images, I think color makes
>> no difference.
>>
>> I do a very good image cleanup: background removal, denoise,
>> straightening, sharpening, illumination correction, contrast stretching,
>> etc. before passing the text to Tesseract. This part is likely better done
>> on color images (you can split them into RGB/HSV channels depending on
>> what you need).
>>
>> So my final output is already almost "binary" and I do not do any real
>> binarization/thresholding. I'm not sure whether Tesseract does it or not,
>> but the difference would be minimal.
>>
>> All the images are rescaled so that the text always has the same height,
>> about 35 to 40 px, with no border or a small (1 or 2 px) border. Try with
>> an evaluation set and see what works best for you.
>>
>>
>> Bye
>>
>> Lorenzo
>>
>> On Wed, Apr 3, 2019 at 11:08 Du Kotomi <[email protected]>
>> wrote:
>>
>>> If we use the text2image tool, there is no such problem.
>>>
>>> What about training with our real data? I have enough images for
>>> training. Do I need to do some preprocessing, such as binarization or
>>> resizing to a given dpi, before LSTM training?
>>>
>>> On Wed, Apr 3, 2019 at 16:36 Shree Devi Kumar <[email protected]>
>>> wrote:
>>>
>>>> Usually for LSTM training we use synthetic images created by the
>>>> text2image program from training text and fonts, via tesstrain.sh or
>>>> tesstrain.py. Hence there is no question of binarization or dpi, as the
>>>> program creates images as expected by the Tesseract training process.
>>>>
>>>> On Wed, Apr 3, 2019 at 12:31 PM Du Kotomi <[email protected]> wrote:
>>>>
>>>>> Anybody here?
>>>>>
>>>>> On Wed, Apr 3, 2019 at 09:57 <[email protected]> wrote:
>>>>>
>>>>>> Sorry to disturb you again. I sent my question before, but no one
>>>>>> has answered. I really need your help.
>>>>>>
>>>>>>
>>>>>> I went through the source code and found that Tesseract does Otsu
>>>>>> thresholding and puts the binary pix in the Thresholder object.
>>>>>> But it seems the Thresholder object is not invoked if I use
>>>>>> the LSTM engine.
>>>>>> As for dpi, the Tesseract wiki says 300 dpi works best.
>>>>>> That is a requirement for the Tesseract 3.0 engine or earlier, right?
>>>>>> If I train LSTM Tesseract, it doesn't matter whether I binarize
>>>>>> or change the dpi of the images, right?
>>>>>>
>>>>>>
>>>>>> I would very much appreciate any response. Thank you so
>>>>>> much!
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/f1004b09-daa5-4d6b-909b-ad8eac267d34%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f1004b09-daa5-4d6b-909b-ad8eac267d34%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>
>

