Re: [tesseract-ocr] Re: OCR fails on a preprocessed visually good looking image

Ger Hobbelt Thu, 01 Oct 2020 11:57:55 -0700

Hi,

AFAICT tesseract OCR quality deteriorates a lot when it is fed 'inverted
colors', i.e. white text on black background. (I can't dig up the tesseract
blog / article where I first saw this mentioned, and google fails me in this
regard right this minute, sorry.)

Second, from what I gather from all the applications/code I've investigated
which feed images to tesseract, the last stage is always a [type of]
'threshold' stage where text is converted to a simple black&white picture:
tesseract expects black text on white background.

Given your purple+yellow "image test" image, a simple threshold action very
probably would render that as white text on black background, which is the
wrong way around if you want to get the best performance from tesseract.

Hence a potential solution vector would be:

- find ways to 'preprocess' your images to ensure each is converted to
black text on white background in a subsequent thresholding pass. (Do the
thresholding yourself in your preprocess to have maximum control over the
image you feed to tesseract.)

  (Quick initial thought: it might be good enough to count pixels per hue,
then find the two major 'bulges' in the color distribution and code a quick
filter which assigns the hues in the smaller hump to black and the ones in
the larger hump to white, on the assumption that text covers fewer pixels
than its background.
  Another way would be to run a threshold filter and then do this counting
on the threshold /output/: pixels there can only be either black or white,
as the threshold action outputs a monochrome image, so the counting code
would be extremely easy: flip the colors if the black pixel count happens
to be larger than the white pixel count. Just a rough idea, this.)

- Quick google on 'tesseract white text black background' pops up this as
the top entry for me:
https://stackoverflow.com/questions/39002966/detect-white-characters-on-black-background-using-tesseract

  Did a quick scan of that one; it sounds like it might be worth checking
out further.
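
The 'threshold, count, flip' idea above can be sketched roughly as follows. This is a minimal illustration, not tested against the image from this thread: it works on a plain list-of-rows grayscale image so it has no dependencies, the fixed cutoff of 128 is an arbitrary assumption (an adaptive or Otsu-style threshold would likely do better on real photos), and the function names are made up for this example.

```python
def threshold(gray, cutoff=128):
    """Binarize a grayscale image (rows of 0..255 ints): 0 = black, 255 = white.

    The cutoff of 128 is an arbitrary assumption for this sketch.
    """
    return [[255 if px >= cutoff else 0 for px in row] for row in gray]


def ensure_dark_text_on_light(binary):
    """Invert a binarized image when black pixels outnumber white ones.

    Assumption: text covers less area than its background, so the
    majority color is the background and should end up white.
    """
    black = sum(px == 0 for row in binary for px in row)
    white = sum(px == 255 for row in binary for px in row)
    if black > white:
        return [[255 - px for px in row] for row in binary]
    return binary


# Toy 4x6 "image": a bright stroke on a dark background.
gray = [
    [20, 20, 20, 20, 20, 20],
    [20, 230, 230, 230, 20, 20],
    [20, 20, 230, 20, 20, 20],
    [20, 20, 20, 20, 20, 20],
]
result = ensure_dark_text_on_light(threshold(gray))
for row in result:
    # '#' marks black (text) pixels, '.' marks white (background) pixels.
    print("".join("#" if px == 0 else "." for px in row))
```

With a real image you would load the pixels through whatever imaging library you already use and hand the flipped result to tesseract; the counting logic stays the same.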

HTH

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   [email protected]
mobile: +31-6-11 120 978
--------------------------------------------------

On Thu, Oct 1, 2020 at 3:46 PM Jean-Marc Spaggiari <[email protected]>
wrote:

> Hi Fabian,
>
> Are you able to try removing the camera picture on the left? Or does it
> have to stay there? Maybe you can split your picture into smaller ones by
> looking for vertical delimiters?
>
> JM
>
> On Wednesday, 30 September 2020 at 06:50:44 UTC-4,
> [email protected] wrote:
>
>> Hello,
>>
>> I am currently working on OCR for detecting text in some cropped
>> regions of interest. For most of the ROIs it works fine, but for example in
>> the attached image tesseract ignores 'Test'. I have tested different --psm
>> modes. DPI looks fine to me as well.
>>
>>    - Any suggestions for further testing or preprocessing?
>>    - Should I try to provide a set of ROIs for tesseract to train on?
>>
>> Thanks for your help!
>>
>> [image: cropped_roi_tesseract.png]
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/43d66ca1-10f9-40aa-ac02-5d9c8de2f598n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/43d66ca1-10f9-40aa-ac02-5d9c8de2f598n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

