Thanks for this, Zdenko. I've had a look at some resources on 'greyscale closing' and broadly get it. However, my app is currently in C# and the library I'm using wraps all the pix functions. I will try to build the sample in C++ and see what it does.
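For anyone else puzzling over the term: greyscale closing is just a max filter (dilation) followed by a min filter (erosion), which fills in dark specks smaller than the kernel. Below is a brute-force sketch in Python purely to illustrate the idea (this thread's code is C++/C#; real code would use Leptonica's fast grayscale morphology routines rather than these loops):

```python
import numpy as np

def grey_dilate(img, size=3):
    """Max filter: each pixel becomes the maximum of its size x size neighbourhood."""
    r = size // 2
    padded = np.pad(img, r, mode='edge')
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = padded[y:y + size, x:x + size].max()
    return out

def grey_erode(img, size=3):
    """Min filter: each pixel becomes the minimum of its size x size neighbourhood."""
    r = size // 2
    padded = np.pad(img, r, mode='edge')
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = padded[y:y + size, x:x + size].min()
    return out

def grey_close(img, size=3):
    """Closing = dilation then erosion; removes dark specks smaller than the kernel
    while leaving larger dark regions (real text strokes) essentially intact."""
    return grey_erode(grey_dilate(img, size), size)

if __name__ == "__main__":
    page = np.full((7, 7), 200, dtype=np.uint8)  # light paper
    page[3, 3] = 0                               # one dark noise speck
    print(grey_close(page)[3, 3])                # prints: 200 (speck removed)
```

Note the asymmetry that makes this useful for noisy scans: a one-pixel speck vanishes entirely, but a dark region larger than the kernel survives the round trip, so genuine text is preserved.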
Iain

On Sunday, August 4, 2024 at 12:44:41 PM UTC+1 zdenop wrote:

> tesseract unnamed.jpg -
> Estimating resolution as 182
>
> e.g. no recognized word... So the problem could be in the parameters you used for OCR...
>
> Before OCR I suggest image preprocessing and maybe the detection of empty pages.
> Have a look at the Leptonica example for normalizing uneven illumination (pixBackgroundNorm in https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c) and then binarize the image.
> I think with some more "aggressive" parameters you can get a clean empty page, so you will not need to modify your OCR parameters...
>
> Zdenko
>
> On Sun, 4 Aug 2024 at 13:22, Iain Downs <[email protected]> wrote:
>
>> In the event that anyone else has a similar issue, this is how I approached it.
>>
>> Firstly, make a histogram of the number of pixels at each intensity (so an array of 256 numbers).
>>
>> When you inspect this you get results like the below.
>>
>> [image: Finding empty pages.png]
>>
>> This is after a little smoothing and taking the log of the values.
>>
>> You can see that the properly blank pages show little or no very dark (black) pixels, whereas the pages with some text, even a small amount, have a fair number.
>>
>> I simply set a cutoff level (in this case 1) and a cutoff intensity (in my case 80): provided the log-smoothed histogram reaches level 1 at an intensity below 80, the page contains text; otherwise it is blank.
>>
>> You can also see the problem that tesseract has (with default binarisation): the intensity distribution is distinctly bimodal. I think this is due to bleedthrough from the reverse of the page. Of course, that is essentially what Otsu uses to pick out 'black' from 'white'.
>>
>> Iain
>>
>> On Tuesday, July 16, 2024 at 5:38:02 PM UTC+1 Iain Downs wrote:
>>
>>> I'm working on processing scanned paperback books with tesseract (C++ API at the moment).
>>> One issue I've found is that when a page has little or no text, tesseract gets overkeen and interprets the noise as text.
>>>
>>> The image below is the raw page. In this case it's the inside front cover of a book.
>>> [image: HookRawPage.jpg]
>>> This is the image after tesseract has processed it (binarization) and before the character recognition.
>>> [image: HookPostProcessed.jpg]
>>>
>>> tesseract suggests that there are 160 or so words (by some definition of word!) on this page, as per the attached (Hook02Small.txt).
>>>
>>> This also happens on pages which DO contain text, but only a small amount. I suspect that the binarization (possibly Otsu?) is to blame. I can probably do something to detect entirely blank pages, but I'm less sure what to do with mainly blank pages.
>>>
>>> Any suggestions most welcome!
>>>
>>> Iain

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/212acf62-1157-4c16-962d-aac775815456n%40googlegroups.com.
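For readers who want to reproduce the histogram test quoted in this thread, here is a minimal sketch in Python (the thread itself uses the C++ API and a C# wrapper; NumPy is used only to illustrate the algorithm). The cutoffs of 1 and 80 are the values quoted in the post and would likely need tuning for other scanners:

```python
import numpy as np

def is_blank_page(gray, intensity_cutoff=80, log_level_cutoff=1.0):
    """Blank-page test from the thread: histogram the 8-bit intensities,
    smooth lightly, take log10, and look for a substantial dark-pixel peak
    below the cutoff intensity. The cutoffs (80, 1) come from the post."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    smoothed = np.convolve(hist, np.ones(5) / 5, mode='same')  # light smoothing
    log_hist = np.log10(smoothed + 1.0)                        # +1 avoids log(0)
    # Pages with text show a dark peak above the level cutoff; blank pages do not.
    return log_hist[:intensity_cutoff].max() < log_level_cutoff

if __name__ == "__main__":
    blank = np.full((100, 100), 230, dtype=np.uint8)  # uniformly light page
    text = blank.copy()
    text.ravel()[:500] = 20                           # 500 near-black "ink" pixels
    print(is_blank_page(blank), is_blank_page(text))  # prints: True False
```

This also suggests why default (Otsu-style) binarisation struggles on these pages: with bleedthrough, the histogram is bimodal but contains no true black, so the global threshold falls between the paper peak and the bleedthrough peak, and the bleedthrough binarises as "ink" for the recogniser to hallucinate words from.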

