Thanks for this, Zdenko. I've had a look at some resources on 'greyscale 
closing' and kind of get it.  However, my app is currently in C#, and the 
library I'm using wraps all the pix functions.  I will try to build the 
sample in C++ and see what it does.
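For anyone following along, greyscale closing is just a dilation (local maximum) followed by an erosion (local minimum) with the same structuring element. A minimal pure-Python sketch of the idea on a 2D list of intensities (these function names are mine, not from Leptonica or any other library):

```python
def dilate(img, r=1):
    """Greyscale dilation: each pixel becomes the max over a (2r+1)x(2r+1) window."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out

def erode(img, r=1):
    """Greyscale erosion: each pixel becomes the min over the same window."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = min(
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out

def grey_close(img, r=1):
    """Closing = dilation then erosion; removes dark specks smaller than the window."""
    return erode(dilate(img, r), r)
```

The practical effect is that dark features smaller than the structuring element (speckle noise) are wiped out, while larger dark regions (real text strokes) survive; Leptonica provides the same operation natively for pix images.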

Iain

On Sunday, August 4, 2024 at 12:44:41 PM UTC+1 zdenop wrote:

> tesseract unnamed.jpg -
> Estimating resolution as 182
>
>  i.e., no words were recognized. So the problem could be in the parameters 
> you used for OCR...
>
> Before OCR I suggest preprocessing the image, and maybe detecting empty 
> pages.
> Have a look at the Leptonica example for normalizing uneven illumination 
> (pixBackgroundNorm in 
> https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c) 
> and then binarize the image.
> I think with somewhat more "aggressive" parameters you can get a clean empty 
> page, so you will not need to modify your OCR parameters...
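The principle behind background normalization can be sketched briefly: estimate the background with a heavy blur of the greyscale image, then divide it out so the illumination becomes flat before binarization. This is only a rough pure-Python illustration of that idea, not Leptonica's actual pixBackgroundNorm algorithm; the radius and target values here are arbitrary:

```python
def box_blur(img, r):
    """Crude box blur used as a local background estimate."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def normalize_background(img, r=15, target=200):
    """Divide each pixel by the local background estimate and rescale,
    so shadows and uneven lighting flatten out before thresholding."""
    bg = box_blur(img, r)
    h, w = len(img), len(img[0])
    return [[min(255, int(img[y][x] * target / max(1.0, bg[y][x])))
             for x in range(w)] for y in range(h)]
```

After this step, a global threshold (Otsu or even a fixed value) has a much easier job, because the "white" of the page sits at roughly the same intensity everywhere.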
>
> Zdenko
>
>
> On Sun 4. 8. 2024 at 13:22, Iain Downs <[email protected]> wrote:
>
>> In the event that anyone else has a similar issue, this is how I 
>> approached it.
>>
>> Firstly, make a histogram of the number of pixels with each intensity (so 
>> an array of 256 numbers).
>>
>> When you inspect this, you get results like those below.
>>
>> [image: Finding empty pages.png]
>>
>> This is after a little smoothing and taking the log of the values.
>>
>> You can see that the properly blank pages show few or no very dark 
>> (black) pixels, whereas the pages with some text, even a small amount, 
>> have a fair number.
>>
>> I simply set a cutoff level (in this case 1) and a cutoff intensity (in 
>> my case 80): provided the smoothed log histogram reaches the level of 1 
>> at an intensity below 80, the page is text; otherwise it is blank.
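The steps above can be sketched as follows. This is my own minimal reconstruction, not the exact code; the smoothing radius, cutoff level of 1, and cutoff intensity of 80 are the values described here and will need tuning for other scans:

```python
import math

def is_text_page(pixels, cutoff_level=1.0, cutoff_intensity=80, smooth_radius=2):
    """Blank-page test on a flat list of 8-bit greyscale pixel values.

    Builds a 256-bin intensity histogram, smooths it, takes log10 of the
    counts, and reports text if the smoothed log histogram reaches
    cutoff_level at any intensity below cutoff_intensity (i.e. there is
    a real population of dark pixels)."""
    # 1. Histogram: count of pixels at each intensity 0..255.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    # 2. Smooth with a small moving average.
    smoothed = []
    for i in range(256):
        lo, hi = max(0, i - smooth_radius), min(256, i + smooth_radius + 1)
        smoothed.append(sum(hist[lo:hi]) / (hi - lo))
    # 3. Log of the counts (log10(1 + n) so empty bins stay at 0).
    logh = [math.log10(1 + s) for s in smoothed]
    # 4. Text if the curve reaches the cutoff level in the dark region.
    return any(v >= cutoff_level for v in logh[:cutoff_intensity])
```

A genuinely blank page with a few specks of noise stays below the level-1 line in the dark bins, while even a short line of text contributes hundreds of dark pixels and crosses it.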
>>
>> You can also see the problem tesseract has (with its default 
>> binarisation): the intensity distribution is distinctly bimodal.  I think 
>> this is due to bleed-through from the reverse of the page.  Of course, 
>> that is essentially what Otsu's method uses to separate 'black' from 
>> 'white'.
>>
>> Iain
>> On Tuesday, July 16, 2024 at 5:38:02 PM UTC+1 Iain Downs wrote:
>>
>>> I'm working on processing scanned paperback books with tesseract (the 
>>> C++ API at the moment).  One issue I've found is that when a page has 
>>> little or no text, tesseract gets over-keen and interprets the noise as 
>>> text.
>>>
>>> The image below is the raw page.  In this case it's the inside front 
>>> cover of a book.
>>> [image: HookRawPage.jpg]
>>> This is the image after tesseract has processed it (binarization) and 
>>> before the character recognition.
>>> [image: HookPostProcessed.jpg]
>>>
>>> tesseract suggests that there are 160 or so words (by some definition of 
>>> word!) on this page as per the attached (Hook02Small.txt).
>>>
>>> This also happens on pages which DO contain text, but only a small 
>>> amount.  I suspect that the binarization (possibly Otsu?) is to blame.  I 
>>> can probably do something to detect entirely blank pages, but I'm less 
>>> sure what to do with mainly blank pages.
>>>
>>> Any suggestions most welcome!
>>>
>>> Iain
>>>
>>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/212acf62-1157-4c16-962d-aac775815456n%40googlegroups.com.
