Re: [tesseract-ocr] Trouble extracting date and time from image

Ger Hobbelt Thu, 30 Oct 2025 11:57:28 -0700

I cannot emphasize this single item (in a long list of stuff one can/must
do before feeding any image to an OCR engine) enough: *tesseract has been
trained to 'read' books, i.e. black text on a white background. Consequently,
any image preprocessing step(s) that get you there are strongly advised.*
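
As a rough illustration of "getting there", here is a minimal sketch that binarizes a grayscale image and flips it when the background turns out to be dark. The function name `to_black_on_white`, the fixed threshold of 128, and the "majority colour is the background" heuristic are all assumptions made for this example, not something tesseract or its documentation prescribes. Loading the pixels from a file is left to whatever image library you prefer.

```python
def to_black_on_white(gray_rows, threshold=128):
    """Binarize a grayscale image and force dark text on a light background.

    gray_rows: list of rows, each a list of ints in 0..255 (0 = black).
    Returns a new list of rows containing only 0 (ink) and 255 (paper).
    """
    # Binarize with a fixed global threshold (an assumption; real-world
    # images often need an adaptive method instead).
    binary = [[255 if px >= threshold else 0 for px in row] for row in gray_rows]

    # Text normally covers far less area than the background, so the
    # majority value is taken to be the background (a heuristic).
    total = sum(len(row) for row in binary)
    white = sum(px == 255 for row in binary for px in row)
    background_is_dark = white * 2 < total

    # Invert light-on-dark images so the result is black text on white.
    if background_is_dark:
        binary = [[255 - px for px in row] for row in binary]
    return binary
```

An image that is already black on white passes through unchanged apart from the binarization, so the step should be safe to apply to every input.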

This, and lots of other "*I don't wanna hear this 🥴*" important details
show up in the documents and emails listed below:
(I know people like Twitter-sized or shorter text, but you've got some
reading to do if you want to be successful at OCRing stuff. We all have to,
it's not simple.)

*- https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html 🎯*
- https://github.com/tesseract-ocr/tessdoc/blob/main/tess3/FAQ-Old.md#is-there-a-minimum--maximum-text-size-it-wont-read-screen-text
- https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ?pli=1
- https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ

and then a bunch of messages that are related. I'd rather not repeat
myself, so please take your time and read those threads. Some of it may
sound crazy at first, but you're doing something that touches on the
edge of the original design goals, and that means you're bound to meet some
"weird behaviour" along the way.

Before I let myself out, *the second most
important piece of advice I can give everyone: use HOCR (which is HTML
content plus coordinates) or TSV output instead of anything else. Do not, I
repeat: !DO NOT! output txt format just because every internet wizard out
there does it in their blog. The txt (text) format is minimal-information, and
you are way better off with a maximal-information output for when you need
to diagnose trouble.* Plus, now that you've seen the workflow diagram that's
part of the info above, *turning HOCR/TSV into TXT should be part of your
postprocessing*, AFAIAC.

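
To make the "TSV first, TXT as postprocessing" idea concrete, here is a minimal sketch that rebuilds text lines from TSV output (for example the `out.tsv` produced by `tesseract time.png out tsv`). It assumes the usual TSV header with the `level`, `page_num`, `block_num`, `par_num`, `line_num`, `conf` and `text` columns. The function name `tsv_to_text` and the `min_conf` cut-off are made up for this example.

```python
import csv
import io


def tsv_to_text(tsv, min_conf=0.0):
    """Rebuild plain text lines from Tesseract TSV output.

    tsv: the full TSV content as one string, header row included.
    min_conf: words with a confidence below this value are dropped.
    """
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = {}  # (page, block, paragraph, line) -> words, in reading order
    for row in reader:
        # Level 5 rows are words; the other levels describe pages,
        # blocks, paragraphs and lines and carry no text of their own.
        if row["level"] != "5":
            continue
        word = (row["text"] or "").strip()
        if not word or float(row["conf"]) < min_conf:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines.setdefault(key, []).append(word)
    return "\n".join(" ".join(words) for words in lines.values())
```

Because the coordinates and per-word confidences stay available in the TSV file, you can still inspect them when a word goes missing or comes out wrong; the plain text is merely the last step.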
Other directly or tangentially relevant blurbs to read are listed below. (Again, consider
reading the entire threads: OCR is one of those activities where 'quickly
scanning my textbooks to pass my exam', as you previously learned at school,
is not going to get you closer to success faster. On the contrary.)

- https://groups.google.com/g/tesseract-ocr/c/jWdpUF7mTxE
- https://groups.google.com/g/tesseract-ocr/c/vrBc1FPeprQ/m/GxTlapF-BwAJ
- https://groups.google.com/g/tesseract-ocr/c/c_S7GG5njkw/m/OPQ6q5zBAQAJ
- https://groups.google.com/g/tesseract-ocr/c/8BerjYWGGQU/m/KwSz7724AQAJ
- https://groups.google.com/g/tesseract-ocr/c/YLOkyuOMsrs/m/wEKTYtfQAAAJ

HTH

Met vriendelijke groeten / Best regards,

Ger Hobbelt


On Thu, Oct 30, 2025 at 6:26 PM Michael Schuh <[email protected]> wrote:

> I am trying to extract the date and time from
>
> [image: time.png]
>
> I have successfully used tesseract to extract text from other images.
> tesseract does not find any text in the above image:
>
>    michael@argon:~/michael/trunk/src/tides$ tesseract time.png out
>    Estimating resolution as 142
>
>    michael@argon:~/michael/trunk/src/tides$ cat out.txt
>
>    michael@argon:~/michael/trunk/src/tides$ ls -l out.txt
>    -rw-r----- 1 michael michael 0 Oct 30 08:58 out.txt
>
> Any help you can give me would be appreciated.  I attached the time.png
> file I used above.
>
> Thanks,
>    Michael
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/77ac0d2b-7796-4f17-8bc6-0e70a9653adan%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/77ac0d2b-7796-4f17-8bc6-0e70a9653adan%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

