(apologies for the typos and uncorrected mobile phone autocorrect eff-ups in that text just now)
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   [email protected]
mobile: +31-6-11 120 978
--------------------------------------------------

On Sat, 1 Nov 2025, 16:48 Ger Hobbelt, <[email protected]> wrote:

> I suspected something like this.
>
> FYI, a technical detail that is very relevant for your case: when somebody feeds tesseract a white-text-on-dark-background image, tesseract OFTEN SEEMS TO WORK. Until you think it's doing fine and you get a very-hard-to-notice lower total quality of OCR output than with comparable black text on a white background. Here's what's going on under the hood and why I emphatically advise everybody to NEVER feed tesseract white-on-black:
>
> Tesseract picks up your image and looks at its metadata: width, height and RGB/number of colors. Fine so far.
> Now it goes and looks at the image pixels and runs a so-called segmentation process. Fundamentally, it runs its own thresholding filter over your pixels to produce a pure 0/1 black & white copy of the picture: this one is simpler and faster to search as tesseract applies algorithms to discover the position and size of each bit of text: the bounding-boxes list. Every box (a horizontal rectangle) surrounds [one] [word] [each]. Like I did with the square brackets [•••] just now. (For C++ code readers: yes, I'm skipping stuff and not being *exact* about what happens. RTFC if you want the absolute and definitive truth.)
>
> Now each of these b-boxes (bounding boxes) is clipped (extracted) from your source image and fed, one vertical pixel line after another, into the LSTM OCR engine, which spits out a synchronous stream of probabilities: think "30% chance that was an 'a' just now, 83% chance it was a 'd' and 57% chance I was looking at a 'b' instead. Meanwhile, here's all the rest of the alphabet, but their chances are very low indeed."
> So the next bit of tesseract logic looks at this and picks the most probable occurrence: 'd'. (Again, way more complex than this, but this is the base of it all and very relevant for our "don't ever do white-on-black even while it might seem to work just fine right now!")
>
> By the time tesseract has 'decoded' the perceived word in that little b-box image, it may have 'read' the word 'dank', for example. The 'd' was just the first character in there.
> Tesseract ALSO has collected the top rankings (you may have noticed that my 'probabilities' did not add up to 100%, so we call them rankings instead of probabilities).
> It also calculated a ranking for the word as a whole, say 78% (and rankings are not real percentages, so I'm lying through my teeth here. RTFC if you need that for comfort. Meanwhile I stick to the storyline here...)
>
> Now there's a tiny single line of code in tesseract which gets to look at that number. It is one of the many "heuristics" in there. And it says: "if this word ranking is below 0.7 (70%), we need to TRY AGAIN: Invert(!!!) that word box image and run it through the engine once more! When you're done, compare the ranking of the word you got this second time around and may the best one win!"
> For a human, the heuristic seems obvious and flawless.
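>
> In (toy) Python, the heuristic boils down to something like the sketch below. This is the control flow only; ocr_word() and invert() are fake stand-ins I made up so the logic can actually run (RTFC for the real C++):
>
> INVERT_RETRY_BENCHMARK = 0.7   # the heuristic's magic number
>
> def ocr_word(box_image):
>     # Fake stand-in for the LSTM engine: returns (text, word ranking).
>     return box_image["text"], box_image["ranking"]
>
> def invert(box_image):
>     # Fake stand-in for flipping the word-box pixels black <-> white.
>     return box_image["inverted"]
>
> def recognize_word(box_image):
>     text, ranking = ocr_word(box_image)                      # first pass, image as-is
>     if ranking < INVERT_RETRY_BENCHMARK:                     # not confident enough?
>         inv_text, inv_ranking = ocr_word(invert(box_image))  # second pass, inverted
>         if inv_ranking > ranking:                            # may the best one win
>             return inv_text, inv_ranking
>     return text, ranking
>
> # One of the failure scenarios described below, in numbers: gibberish at
> # 0.67 triggers the retry, the intended 'dank' comes back at 0.65, and
> # the gibberish still wins.
> box = {"text": "Q0618", "ranking": 0.67,
>        "inverted": {"text": "dank", "ranking": 0.65}}
> print(recognize_word(box))   # -> ('Q0618', 0.67)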
> In actual practice, however, the engine can be a little crazy sometimes when it's fed horribly unexpected pixel input, and there's a small but noticeable number of times where the gibberish wins because the engine got stoned as a squirrel and announced the inverted pixels have a 71% ranking for 'Q0618'. Highest bidder wins and you get gibberish (at best) or a totally incorrect word like 'quirk' (at worst): both are very wrong, but your chances of discovering the second kind of fault are nigh impossible, particularly when you have automated this process and you process images in bulk.
>
> Two ways (3, rather!) in which this has a detrimental effect on your output quality:
>
> 1: if you start with white-on-black, tesseract's 'segmentation' has to deal with white-on-black too, and my findings are: the b-boxes discovery delivers worse results. That's bad in two ways, as both (2) and (3) then don't receive optimal input image clippings.
> 2: by now you will have guessed it: you started with white-on-black (white-on-green in your specific case), so the first round through tesseract feeds it a bunch of highly unexpected 'crap' it was never taught to deal with: gibberish is the result, and lots of 'words' arrive at that heuristic with rankings way below that 0.7 benchmark, so the second run saves your ass by rerunning the INVERTED image and very probably observing serious winners that time, so everything LOOKS good for the test image.
> Meanwhile, we know that the tesseract engine, like any neural net, can go nuts and output gibberish at surprisingly high confidence rankings: assuming your first run delivered gibberish with such a high confidence, barely or quite a lot higher than the 0.7 benchmark, you WILL NOT GET THAT SECOND RUN and thus crazy stuff will be your end result. Ouch.
>
> 3: same as (2) but twisted in the other direction: tesseract has a bout of self-doubt somehow (computer pixel fonts like yours are a candidate for this) and thus produces the intended word 'dank' during the second run, but at a surprisingly LOW ranking of, say, 65%, while the first-round gibberish had the rather idiotic ranking of 67%: still below the 0.7 benchmark, but "winner takes all" now has to obey and let the gibberish pass anyhow: 'dank' scored just a wee bit lower! Again, a flat-out failure in terms of total output quality, but it happens. Rarely, but often enough to screw you up.
>
> Of course you can argue the same from the by-design black-on-white input, so what's the real catch here?! When you ensure, BEFOREHAND, that tesseract receives black-on-white, high-contrast input images, (1) will do a better job, hence reducing your total error rate. (2) is a non-scenario now because your first round gets black-on-white, as everybody trained for, so no crazy confusion this way. Thus another, notable, improvement in total error rate / quality.
> (3) still happens, but in the reverse order: the first round produces the intended 'dank' word at low confidence, so the second round is run and gibberish wins, OUCH!, **but** the actual probability of this happening just dropped a lot, as your 'not passing the benchmark' test now depends on the 'lacking confidence' part of the scenario, which is (obviously?) *rarer* than the *totally-confused-but-rather-confident* first part of the original scenario (3).
>
> Thus all 3 failure modes have a significantly lower probability of actually occurring when you feed tesseract black-on-white text, as it was designed to eat that kind of porridge.
>
> Therefore: high contrast is good. Better yet: flip it around (invert the image), possibly after having done the to-greyscale conversion yourself as well. Your images will thank you. (Bonus points! Not having to execute the second run means spending about half the time in the CPU-intensive neural net: higher performance and fewer errors all at the same time 🥳🥳)
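>
> A minimal sketch of that pre-flight fix in Python, using the cv2 (OpenCV) and numpy packages; note that the 128 cutoff in the polarity check is an assumption of mine that you should tune against your own material:
>
> import cv2
> import numpy as np
>
> def preflight(path_in, path_out):
>     grey = cv2.imread(path_in, cv2.IMREAD_GRAYSCALE)   # the to-greyscale conversion
>     if grey is None:
>         raise IOError("cannot read " + path_in)
>     # A mostly-dark image is very probably light text on a dark background,
>     # so flip it around: tesseract then gets the black-on-white it expects.
>     if np.mean(grey) < 128:
>         grey = 255 - grey
>     cv2.imwrite(path_out, grey)
>
> preflight("time.png", "time_bw.png")   # then: tesseract time_bw.png out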
>
> Why does tesseract have that 0.7 heuristic then? That's a story for another time, but it has its uses...
>
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> On Sat, 1 Nov 2025, 06:01 Michael Schuh, <[email protected]> wrote:
>
>> Rucha > Green? Why?
>>
>> Ger > Indeed, why? (What is the thought that drove you to run this particular imagemagick command?)
>>
>> Fair questions. I saw both black and white in the text, so I picked a background color that does not exist in the text and has high contrast. tesseract did a great job with the green background. I want to process images to extract Palo Alto, California tide data, date, and time, and then plot the results against xtide predictions. I am close to processing a day's worth of images collected once a minute, so I will see how well the green background works. If I have problems, I will definitely try using your (Ger's and Rucha's) advice.
>>
>> Thank you, Ger and Rucha, very much for your advice.
>>
>> Best Regards,
>> Michael
>>
>> On Fri, Oct 31, 2025 at 5:52 PM Ger Hobbelt <[email protected]> wrote:
>>
>>> Indeed, why? (What is the thought that drove you to run this particular imagemagick command?) While it might help with visually debugging something you're trying, the simplest path towards "black text on white background" is
>>>
>>> 1. convert any image to greyscale (and see for yourself if that output is easily legible; if it's not, chances are the machine will have trouble too, so more preprocessing /before/ the greyscale transform is needed then);
>>> 2. use a 'threshold' (a.k.a. binarization) step to possibly help (though tesseract can oftentimes do a better job with greyscale instead of hard black & white, as there's more 'detail' in the image pixels then. YMMV). A cv2 sketch of both steps follows at the bottom of this mail.
>>>
>>> You can do this many ways: using imagemagick is one, openCV another. For one-offs I use Krita / Photoshop filter layers (stacking the filters to get what I want).
>>> Anything, really, that gets you something that approaches 'crisp dark/black text on a clean, white background, text characters about 30px high' (dpi is irrelevant, though often mentioned elsewhere: tesseract does digital image pixels, not classical printer-mindset dots-per-inch).
>>>
>>> Note that 'simplest path towards' does not mean 'always the best way'.
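>>>
>>> In cv2 (openCV) terms, both steps look roughly like this; a sketch, not gospel, and as said: you may well get better results feeding tesseract the greyscale rather than the hard black & white:
>>>
>>> import cv2
>>>
>>> grey = cv2.imread("time.png", cv2.IMREAD_GRAYSCALE)   # step 1: greyscale
>>>
>>> # Step 2 (optional): Otsu picks the binarization threshold automatically
>>> # from the image histogram.
>>> _, bw = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
>>>
>>> cv2.imwrite("time_grey.png", grey)
>>> cv2.imwrite("time_bw.png", bw)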
>>>
>>> Met vriendelijke groeten / Best regards,
>>>
>>> Ger Hobbelt
>>>
>>> On Fri, Oct 31, 2025 at 5:46 AM Rucha Patil <[email protected]> wrote:
>>>
>>>> Green? Why? I don't know if this might resolve the issue. Lmk the behavior, I'm curious. But you need an image that has a white background and black text. You can achieve this easily using cv2 functions (quick sketch at the bottom of this mail).
>>>>
>>>> On Thu, Oct 30, 2025 at 1:26 PM Michael Schuh <[email protected]> wrote:
>>>>
>>>>> I am trying to extract the date and time from
>>>>>
>>>>> [image: time.png]
>>>>>
>>>>> I have successfully used tesseract to extract text from other images. tesseract does not find any text in the above image:
>>>>>
>>>>> michael@argon:~/michael/trunk/src/tides$ tesseract time.png out
>>>>> Estimating resolution as 142
>>>>>
>>>>> michael@argon:~/michael/trunk/src/tides$ cat out.txt
>>>>>
>>>>> michael@argon:~/michael/trunk/src/tides$ ls -l out.txt
>>>>> -rw-r----- 1 michael michael 0 Oct 30 08:58 out.txt
>>>>>
>>>>> Any help you can give me would be appreciated. I attached the time.png file I used above.
>>>>>
>>>>> Thanks,
>>>>> Michael
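>>>>
>>>> For example, something along these lines (a sketch; adjust to your image):
>>>>
>>>> import cv2
>>>>
>>>> img = cv2.imread("time.png", cv2.IMREAD_GRAYSCALE)
>>>> img = cv2.bitwise_not(img)            # light-on-dark -> dark-on-light
>>>> cv2.imwrite("time_fixed.png", img)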

