(apologies for the typos and uncorrected mobile phone autocorrect eff-ups
in that text just now)

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   [email protected]
mobile: +31-6-11 120 978
--------------------------------------------------

On Sat, 1 Nov 2025, 16:48 Ger Hobbelt, <[email protected]> wrote:

> I suspected something like this.
>
> FYI, a technical detail that is very relevant for your case: when somebody
> feeds tesseract a white-text-on-dark-background image, tesseract OFTEN
> SEEMS TO WORK. Until you think it's doing fine, while you are getting a very
> hard-to-notice lower total quality of OCR output than with comparable black
> text on a white background. Here's what's going on under the hood and why I
> emphatically advise everybody to NEVER feed tesseract white on black:
>
> Tesseract picks up your image and looks at its metadata: width,
> height and RGB/number of colors. Fine so far.
> Now it goes and looks at the image pixels and runs a so-called
> segmentation process. Fundamentally, it runs its own thresholding filter
> over your pixels to produce a pure 0/1 black & white copy of the picture:
> this one is simpler and faster to search as tesseract applies algorithms
> to discover the position and size of each bit of text: the bounding-boxes
> list. Every box (a horizontal rectangle) surrounds [one] [word] [each].
> Like I did with the square brackets [•••] just now. (For C++ code readers:
> yes, I'm skipping stuff and not being *exact* about what happens. RTFC if
> you want the absolute and definitive truth.)
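
The segmentation idea can be sketched in a few lines of plain Python (a toy illustration only, with a made-up 2×8 "image" and a fixed cutoff; tesseract's real segmentation is far more involved):

```python
# Toy illustration (NOT tesseract's actual code): threshold a tiny
# greyscale "image" to 0/1 ink, then find word-ish bounding boxes by
# looking for runs of columns that contain ink, separated by blank gaps.

def binarize(img, threshold=128):
    """Dark pixels (below the cutoff) become ink (1), the rest paper (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in img]

def word_boxes(binary):
    """Return (left, top, right, bottom) for each run of inked columns."""
    h, w = len(binary), len(binary[0])
    inked = [any(binary[y][x] for y in range(h)) for x in range(w)]
    boxes, start = [], None
    for x in range(w + 1):
        if x < w and inked[x]:
            if start is None:
                start = x          # a new run of inked columns begins
        elif start is not None:    # run just ended: emit its bounding box
            ys = [y for y in range(h) for xx in range(start, x) if binary[y][xx]]
            boxes.append((start, min(ys), x - 1, max(ys)))
            start = None
    return boxes

# Two "words" of dark pixels on a light background, separated by a gap:
img = [
    [255,  10,  10, 255, 255, 255,  10, 255],
    [255,  10,  10, 255, 255,  10,  10, 255],
]
print(word_boxes(binarize(img)))  # → [(1, 0, 2, 1), (5, 0, 6, 1)]
```
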
>
> Now each of these b-boxes (bounding boxes) is clipped (extracted) from
> your source image and fed, one vertical pixel line after another, into the
> LSTM OCR engine, which spits out a synchronous stream of probabilities:
> think "30% chance that was an 'a' just now, 83% chance it was a 'd' and 57%
> chance I was looking at a 'b' instead. Meanwhile, here's all the rest of the
> alphabet, but their chances are very low indeed."
> The next bit of tesseract logic looks at this and picks the most
> probable occurrence: 'd'. (Again, it's way more complex than this, but this
> is the base of it all and very relevant for our "don't ever do white-on-black
> even while it might seem to work just fine right now!")
>
> By the time tesseract has 'decoded' the perceived word in that little
> b-box image, it may have 'read' the word 'dank', for example. The 'd' was
> just the first character in there.
> Tesseract ALSO has collected the top rankings (you may have noticed that
> my 'probabilities' did not add up to 100%, so we call them rankings instead
> of probabilities).
> It also calculated a ranking for the word as a whole, say 78% (and
> rankings are not real percentages, so I'm lying through my teeth here; RTFC
> if you need that for comfort. Meanwhile I'll stick to the storyline here...)
>
> Now there's a tiny single line of code in tesseract which gets to look
> at that number. It is one of the many "heuristics" in there. And it says:
> "if this word ranking is below 0.7 (70%), we need to TRY AGAIN: invert(!!!)
> that word box image and run it through the engine once more! When you're
> done, compare the ranking of the word you got this second time around and
> may the best one win!"
> For a human, the heuristic seems obvious and flawless. In actual practice,
> however, the engine can be a little crazy sometimes when it's fed horribly
> unexpected pixel input, and there's a small but noticeable number of times
> where the gibberish wins because the engine got stoned as a squirrel and
> announced the inverted pixels have a 71% ranking for 'Q0618'. Highest
> bidder wins and you get gibberish (at best) or a totally incorrect word
> like 'quirk' at worst: both are very wrong, but your chance of discovering
> the second kind of fault is nigh zero, particularly when you have
> automated this process and you process images in bulk.
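
In pseudo-Python, the heuristic amounts to something like this (a sketch with a faked engine and invented scores, purely to show how a confidently-wrong first pass suppresses the retry, and how a narrow loser stays lost):

```python
# Sketch of the retry heuristic (NOT tesseract's actual code). `engine`
# stands in for the LSTM recognizer: it maps an image to (text, ranking).

RETRY_THRESHOLD = 0.7  # the "0.7 benchmark" from the story above

def recognize(image, engine, invert):
    text, rank = engine(image)
    if rank < RETRY_THRESHOLD:              # self-doubt? try the inverted image
        text2, rank2 = engine(invert(image))
        if rank2 > rank:                    # winner takes all
            text, rank = text2, rank2
    return text, rank

invert = lambda img: "inverted-" + img

# Case A: confidently-wrong first pass (0.71 >= 0.7) blocks the retry entirely.
engine_a = {"word": ("Q0618", 0.71), "inverted-word": ("dank", 0.93)}.__getitem__
print(recognize("word", engine_a, invert))  # → ('Q0618', 0.71)

# Case B: both passes below the benchmark; gibberish still wins by a hair.
engine_b = {"word": ("Q0618", 0.67), "inverted-word": ("dank", 0.65)}.__getitem__
print(recognize("word", engine_b, invert))  # → ('Q0618', 0.67)
```
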
>
> Two ways (three, rather!) this has a detrimental effect on your output
> quality:
>
> 1: if you start with white-on-black, tesseract's segmentation has to deal
> with white-on-black too, and my finding is: the b-box discovery delivers
> worse results. That's bad in two ways, as both (2) and (3) then don't
> receive optimal input image clippings.
> 2: by now you will have guessed it: you started with white-on-black
> (white-on-green in your specific case), so the first round through tesseract
> feeds it a bunch of highly unexpected 'crap' it was never taught to
> deal with: gibberish is the result, and lots of 'words' arrive at that
> heuristic with rankings way below that 0.7 benchmark, so the second run
> saves your ass by rerunning the INVERTED image and very probably observing
> serious winners that time, so everything LOOKS good for the test image.
>
> Meanwhile, we know that the tesseract engine, like any neural net, can go
> nuts and output gibberish at surprisingly high confidence rankings: assuming
> your first run delivered gibberish with such a high confidence, barely or
> quite a lot higher than the 0.7 benchmark, you WILL NOT GET THAT SECOND RUN
> and thus crazy stuff will be your end result. Ouch.
>
> 3: same as (2) but twisted in the other direction: tesseract has a
> bout of self-doubt somehow (computer pixel fonts like yours are a candidate
> for this) and thus produces the intended word 'dank' during the second run,
> but at a surprisingly LOW ranking of, say, 65%, while the first round's
> gibberish had the rather idiotic ranking of 67%: still below the 0.7
> benchmark, but "winner takes all" now has to obey and lets the gibberish
> pass anyhow: 'dank' scored just a wee bit lower!
> Again, a fat failure in terms of total output quality, but it happens.
> Rarely, but often enough to screw you up.
>
> Of course you can argue the same from the by-design black-on-white input,
> so what's the real catch here?! When you ensure, BEFOREHAND, that tesseract
> receives black-on-white, high-contrast input images, (1) will do a better
> job, hence reducing your total error rate. (2) is a non-scenario now
> because your first round gets black-on-white, as everybody trained for, so
> no crazy confusion this way. Thus another notable improvement in total
> error rate / quality.
> (3) still happens, but in the reverse order: the first round produces the
> intended word 'dank' at low confidence, so the second round is run and
> gibberish wins, OUCH!, **but** the actual probability of this happening
> just dropped a lot, as your 'not passing the benchmark' test now depends
> on the 'lacking confidence' part of the scenario, which is (obviously?)
> *rarer* than the *totally-confused-but-rather-confident* first part of the
> original scenario (3).
>
> Thus all 3 failure modes have a significantly lower probability of
> actually occurring when you feed tesseract black-on-white text, as it was
> designed to eat that kind of porridge.
>
> Therefore: high contrast is good. Better yet: flip it around (invert the
> image), possibly after having done the to-greyscale conversion yourself as
> well. Your images will thank you. (Bonus points! Not having to execute the
> second run means spending about half the time in the CPU-intensive neural
> net: higher performance and fewer errors all at the same time 🥳🥳)
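
A minimal greyscale-then-invert pass, shown here in plain Python on raw RGB tuples just to make the arithmetic concrete (with real images you'd do the same via imagemagick, OpenCV, or PIL; the luma weights below are the usual Rec. 601 ones):

```python
# Convert RGB pixels to greyscale with the standard luma weights, then
# invert so white-on-dark text becomes dark-on-white, which is what
# tesseract expects as its by-design input.

def to_greyscale(rgb_pixels):
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]

def invert(grey_pixels):
    return [255 - v for v in grey_pixels]

# A white text pixel next to a dark green background pixel:
pixels = [(255, 255, 255), (0, 100, 0)]
print(invert(to_greyscale(pixels)))  # → [0, 196]
```
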
>
>
>
> Why does tesseract have that 0.7 heuristic then? That's a story for
> another time, but it has its uses...
>
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> --------------------------------------------------
> web:    http://www.hobbelt.com/
>         http://www.hebbut.net/
> mail:   [email protected]
> mobile: +31-6-11 120 978
> --------------------------------------------------
>
> On Sat, 1 Nov 2025, 06:01 Michael Schuh, <[email protected]> wrote:
>
>> Rucha > Green? Why?
>>
>> Ger > Indeed, why? (What is the thought that drove you to run this
>> particular imagemagick command?)
>>
>> Fair questions.  I saw both black and white in the text so I picked a
>> background color that does not exist in the text and has high contrast.
>>  tesseract did a great job with the green background.  I want to process
>> images to extract Palo Alto California tide data, date, and time and then
>> plot the results against xtide predictions.  I am close to processing a
>> day's worth of images collected once a minute so I will see how well the
>> green background works.  If I have problems, I will definitely try using
>> your (Ger and Rucha's) advice.
>>
>> Thank you, Ger and Rucha, very much for your advice.
>>
>> Best Regards,
>>    Michael
>>
>> On Fri, Oct 31, 2025 at 5:52 PM Ger Hobbelt <[email protected]>
>> wrote:
>>
>>> Indeed, why? (What is the thought that drove you to run this particular
>>> imagemagick command?)  While it might help visually debugging something
>>> you're trying, the simplest path towards "black text on white background"
>>> is
>>>
>>> 1. convert any image to greyscale (and see for yourself if that
>>> output is easily legible; if it's not, chances are the machine will have
>>> trouble too, so more preprocessing /before/ the greyscale transform is
>>> needed then);
>>> 2. use a 'threshold' (a.k.a. binarization) step, which may help (though
>>> tesseract can oftentimes do a better job with greyscale instead of hard
>>> black & white, as there's more 'detail' in the image pixels then. YMMV).
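
Step 2, in plain Python on raw greyscale values, just to pin down what 'threshold' means here (any image tool offers this as a one-liner; the 128 cutoff is an arbitrary midpoint):

```python
# Binarization: every pixel darker than the cutoff becomes full black,
# everything else becomes full white.

def threshold(grey_pixels, cutoff=128):
    return [0 if v < cutoff else 255 for v in grey_pixels]

print(threshold([12, 200, 90, 250]))  # → [0, 255, 0, 255]
```
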
>>>
>>> You can do this many ways, using imagemagick is one, openCV another. For
>>> one-offs I use Krita / Photoshop filter layers (stacking the filters to get
>>> what I want).
>>> Anything really that gets you something that approaches 'crisp
>>> dark/black text on a clean, white background, text characters about 30px
>>> high' (dpi is irrelevant, though often mentioned elsewhere: tesseract does
>>> digital image pixels, not classical printer mindset dots-per-inch).
>>>
>>> Note that 'simplest path towards' does not mean 'always the best way'.
>>>
>>> Met vriendelijke groeten / Best regards,
>>>
>>> Ger Hobbelt
>>>
>>> --------------------------------------------------
>>> web:    http://www.hobbelt.com/
>>>         http://www.hebbut.net/
>>> mail:   [email protected]
>>> mobile: +31-6-11 120 978
>>> --------------------------------------------------
>>>
>>>
>>> On Fri, Oct 31, 2025 at 5:46 AM Rucha Patil <
>>> [email protected]> wrote:
>>>
>>>> Green? Why? I don't know if this might resolve the issue. Let me know
>>>> the behavior, I'm curious. But you need an image that has a white
>>>> background and black text. You can achieve this easily using cv2
>>>> functions.
>>>>
>>>> On Thu, Oct 30, 2025 at 1:26 PM Michael Schuh <[email protected]>
>>>> wrote:
>>>>
>>>>> I am trying to extract the date and time from
>>>>>
>>>>> [image: time.png]
>>>>>
>>>>> I have successfully used tesseract to extract text from other images.
>>>>> tesseract does not find any text in the above image:
>>>>>
>>>>>    michael@argon:~/michael/trunk/src/tides$ tesseract time.png out
>>>>>    Estimating resolution as 142
>>>>>
>>>>>    michael@argon:~/michael/trunk/src/tides$ cat out.txt
>>>>>
>>>>>    michael@argon:~/michael/trunk/src/tides$ ls -l out.txt
>>>>>    -rw-r----- 1 michael michael 0 Oct 30 08:58 out.txt
>>>>>
>>>>> Any help you can give me would be appreciated.  I attached the
>>>>> time.png file I used above.
>>>>>
>>>>> Thanks,
>>>>>    Michael
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/77ac0d2b-7796-4f17-8bc6-0e70a9653adan%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/77ac0d2b-7796-4f17-8bc6-0e70a9653adan%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>

