Here is the resend for the group.
Cheers
Nor
-------- Forwarded Message --------
Subject: Re: [tesseract-ocr] App to adjust imgage scaling
Date: Fri, 21 Jul 2023 12:45:21 -0400
From: astro <njsgas...@gmail.com>
To: Ger Hobbelt <g...@hobbelt.com>
Hi Ger,
The images I'm scanning are trail camera images that have the
date/time printed in the bottom corner of the picture. I'm trying to
extract the date/time values from the image. Normally the images are
1440x1080 at 96 dpi. The only way I could get tesseract to read some of
the time stamp was by upping the image size. I have since changed my
strategy and used ImageMagick to crop the bottom corner of the image
that contains the date/time to a 540x70 image, leaving it at 96 dpi (see
attached). That seems to work very well. I'm currently looking to
increase the reliability by trying various things, including correcting
the output where possible.
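
One way to approach "correcting the output where possible" is to map characters that OCR commonly confuses with digits back to digits, then accept the result only if it parses as a real date/time. This is a minimal sketch, not something from the thread: the confusion table and the stamp layout (`STAMP_FORMAT`) are assumptions and need adjusting to what the camera actually prints. A stamp with letters in it (month names, AM/PM) would need a narrower table.

```python
from datetime import datetime

# Characters that OCR output may contain in place of digits.
# Illustrative assumption; extend or trim after looking at real output.
CONFUSABLES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1",
                             "S": "5", "B": "8", "Z": "2"})

# Hypothetical stamp layout; change it to match the camera's overlay.
STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_stamp(raw, fmt=STAMP_FORMAT):
    """Return a datetime for an OCR'd stamp, or None if it cannot be repaired."""
    # Collapse stray whitespace/newlines, then swap look-alike characters.
    cleaned = " ".join(raw.split()).translate(CONFUSABLES)
    try:
        return datetime.strptime(cleaned, fmt)
    except ValueError:
        return None  # leave it for manual review instead of guessing
```

Returning `None` rather than a best guess keeps bad reads visible, so they can be re-run with different preprocessing.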
Thanks for the reply.
Cheers
Nor
On 7/21/2023 12:01 PM, Ger Hobbelt wrote:
6000*4500?!
Hm, sounds way too large for a simple text.
I'm guessing here, but it might be that you got thwarted by the
various "dpi" notes re ocr/tesseract out there.
Bottom line: IIRC tesseract was trained on text of around 30px high.
(Note that I use PX = pixels as the relevant unit of measure. I don't
care about dpi because that's something only really relevant to
printing press people: desktop publishing, etc.)
While a lot of folks hang onto dpi as a unit of measure, it's a
derived one and only relevant when you scan printed pages, which turns
"points" (and picas and ...) into pixels, which is where dpi pops up.
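
To make the points-to-pixels relation concrete: a typographic point is 1/72 inch, so a scan at a given dpi turns a point size into `point_size * dpi / 72` pixels. The sketch below just evaluates that. The 10 pt sample size is an assumption for illustration, and the nominal point size is larger than the visible glyphs, so the actual letters come out somewhat smaller.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def points_to_pixels(point_size, dpi):
    """Pixel height of a font's nominal point size when scanned at dpi."""
    return round(point_size * dpi / POINTS_PER_INCH, 1)


# Hypothetical 10 pt body text at a few scan resolutions.
for dpi in (72, 96, 200, 300):
    print(dpi, "dpi ->", points_to_pixels(10, dpi), "px")
```

For a photo with text burned into it there is no printed point size at all, which is why only the pixel height matters there.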
Anyway, the key bit for every image you feed to an ocr engine like
tesseract is to match the "x-height" of the training material as
closely as possible for any attempt at a good/optimal match.
For tesseract, this means you should aim for each line of text to be
somewhere between 20 and 50 pixels high (and as clean looking in black
& white / greyscale as possible, but that comes second, after getting
that line height into the 20-50px range). Computers work in PX, not DPI,
so it's PX that's the driving criterion.
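
The arithmetic behind that is small: measure how tall the text is now, pick a target inside 20-50 px, and scale by the ratio. A minimal sketch, where the function name and the default target of 32 px are assumptions chosen for illustration:

```python
def scale_percent(measured_px, target_px=32, lo=20, hi=50):
    """Percentage to resize by so text measured_px high ends up target_px high."""
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    if not lo <= target_px <= hi:
        raise ValueError("target_px should stay inside the 20-50 px range")
    return round(100.0 * target_px / measured_px, 1)
```

For example, digits that are 12 px high need roughly a 267% resize to reach 32 px, while digits that are already 64 px high should be halved.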
Since you mention "picking out a date" I ASSUME your text area is one
line of text only.
Drop all image areas that do not contain text.
Make sure the text is black on a white background (you may need to
invert your image when this is a video grab or some such, for example).
There's a long wiki page about improving image quality for tesseract
processing too.
But first try to extract that line of text, scale it so the digits are
between 20-50px high and try some sizes within that range.
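
Trying several sizes can be scripted. The sketch below only builds the command lines (it does not run them) for a few candidate heights: one ImageMagick `convert ... -resize` call and one `tesseract ... stdout --psm 7` call per size, where `--psm 7` tells tesseract to treat the image as a single text line. The file names, the candidate heights, and the measured digit height are assumptions.

```python
def sweep_commands(src, measured_px, heights=(20, 30, 40, 50)):
    """Yield (height, convert_args, tesseract_args) for each candidate height."""
    for h in heights:
        out = f"line_{h}px.png"  # hypothetical output name
        percent = round(100.0 * h / measured_px, 1)
        convert = ["convert", src, "-resize", f"{percent}%", out]
        # --psm 7: treat the image as a single text line.
        ocr = ["tesseract", out, "stdout", "--psm", "7"]
        yield h, convert, ocr


# Print the commands for a crop whose digits are currently 16 px high.
for height, convert, ocr in sweep_commands("stamp.png", 16):
    print(height, " ".join(convert), "&&", " ".join(ocr))
```

Each argument list can be handed to `subprocess.run` as-is; comparing the outputs across sizes shows which height reads most reliably.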
Second most important bit, I find, is making sure the input image has
black text on white background or anything greyscale/luminance-wise
that approaches this as best as possible. SOME tesseract modes /
settings can cope with white text on black BG, but that's you getting
rather lucky so don't bet on it.
tesseract is *engineered* for black text on a white background in
input images (paper book scans).
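
A rough way to decide whether an image needs inverting is to look at its mean brightness: if most pixels are dark, the text is probably light on dark. This is a heuristic sketch under that assumption (text covers a minority of the pixels), written on plain nested lists of 8-bit grey values so it runs without any image library. On real files, ImageMagick's `-negate` option does the inversion.

```python
def needs_invert(gray_rows, threshold=128):
    """True when the image is mostly dark, i.e. likely light text on dark."""
    pixels = [p for row in gray_rows for p in row]
    if not pixels:
        raise ValueError("empty image")
    return sum(pixels) / len(pixels) < threshold


def invert(gray_rows):
    """Return a negated copy of an 8-bit greyscale image."""
    return [[255 - p for p in row] for row in gray_rows]
```

The threshold of 128 is an arbitrary midpoint; a busy photo background can fool it, so checking a few samples by eye is still worthwhile.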
If you need further assistance on this forum/mailing list, attach the
image and the tesseract commandline you tried; those messages get more
feedback as they are less of a guessing game ;-)
PS: third most important work item that lots of folks do wrong: when
clipping/extracting lines of text, postprocess those line images by
adding a nice large white=BACKGROUND COLOR boundary around the entire
line. Personally, I favor a "border" like that of about 0.5 to 1.0 times
the size of the line itself. The added border should be SMOOTHLY
transitioning from the actual image background to prevent false edge
detections in tesseract itself: this problem doesn't happen for clean
paper book scans (which already have a plain white background) but is
an important aspect when extracting from "busy backgrounds".
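
The border sizing can be sketched as follows, again on plain nested lists of grey values. It reads "size of the line" as the line height, which is an assumption, and it pads with a flat fill colour only. It does NOT do the smooth transition described above, so it only suits crops whose background is already close to the fill colour.

```python
def pad_line(gray_rows, ratio=0.75, fill=255):
    """Surround a greyscale line image with a border of ratio * line height."""
    if not gray_rows or not gray_rows[0]:
        raise ValueError("empty image")
    if not 0.5 <= ratio <= 1.0:
        raise ValueError("ratio is expected to be between 0.5 and 1.0")
    height, width = len(gray_rows), len(gray_rows[0])
    border = max(1, round(height * ratio))
    blank = [fill] * (width + 2 * border)
    padded = [list(blank) for _ in range(border)]       # top margin
    for row in gray_rows:
        padded.append([fill] * border + list(row) + [fill] * border)
    padded.extend(list(blank) for _ in range(border))   # bottom margin
    return padded
```

On real files the flat version corresponds to ImageMagick's `-bordercolor white -border NxN`; blending into a busy background takes more work than this.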
Anyway, that topic is the size of a book all by itself, so take it
slow and get prio 1 right first: 1 line of text to ocr = 20-50px high.
Cheers,
Ger
On Fri, 21 Jul 2023, 13:35 astro, <njsgas...@gmail.com> wrote:
Hi Ger,
Thanks for your response. Yes, I found ImageMagick. It looks to be
very powerful and easy to implement. I tried it out by upping the
image to 300 dpi and 6000x4500 and ran the image through the OCR
process, but tesseract had difficulty in picking out the date on
the image. I guess I will have to play around to see if I can
improve things.
Cheers
Nor
On 7/21/2023 12:13 AM, Ger Hobbelt wrote:
Check out ImageMagick, an open source image toolset. Specifically
the 'convert' tool, look for commandline usage and application
parameters/arguments, where you will find several ways to
resize/rescale the image.
Also useful to "tweak" the image as part of your ocr
preprocessing pipeline before your image reaches tesseract.
Another big one would be OpenCV, but that would require you to
write programs (python software or similar) while ImageMagick can
accomplish a lot of what you want or might need and can be driven
by some simple batch / Powershell / shell lines: much easier to
get success that way if you're not already comfortable with
coding software.
https://legacy.imagemagick.org/Usage/resize/
May appear overwhelming at first; read and try the various ways
mentioned there to get a grasp and discover what you need to do
for your scenario specifically. Ocr is not a simple process
pipeline, so take your time with it.
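
For automating this over a folder of images, a small driver loop is enough. The sketch below uses Python to call ImageMagick once per file; the same loop could equally be a batch or PowerShell one-liner. The function names, the `*.jpg` pattern, and the PNG output are assumptions. The executable name is a parameter because ImageMagick 7 installs a `magick` command, and on Windows a bare `convert` may resolve to an unrelated system utility.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, ops, exe="convert"):
    """Argument list: executable, input file, ImageMagick options, output file."""
    return [exe, str(src), *ops, str(dst)]


def batch_convert(in_dir, out_dir, ops, pattern="*.jpg", exe="convert"):
    """Run the same ImageMagick options over every matching file in in_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(Path(in_dir).glob(pattern)):
        dst = out_dir / (src.stem + ".png")
        # check=True stops the batch on the first failing file.
        subprocess.run(build_command(src, dst, ops, exe), check=True)
        done.append(dst)
    return done
```

A call such as `batch_convert("in", "out", ["-resize", "200%"], exe="magick")` (hypothetical folder names) would rescale every JPEG in `in` and write PNGs to `out`.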
On Thu, 20 Jul 2023, 15:03 nor s, <njsgas...@gmail.com> wrote:
I'm trying to run tesseract-OCR on images that come to me at
72 dpi. The program is unable to decode these images and
requires a scale of 200 dpi or better to be successful. Is
there a program available, similar to tesseract-OCR, that
would read a command line, convert a 72 dpi image to 200
dpi or some other specified value, and save it in a specified
location? I'm running Windows 10.
I can make these changes in Photoshop, but I'm trying to
automate the process since I have a lot of images to scan.
Any suggestions would be greatly appreciated.
Thanks
Nor
--
You received this message because you are subscribed to the
Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from
it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/b6075062-921e-4da9-acdf-b0364dc3c960n%40googlegroups.com
<https://groups.google.com/d/msgid/tesseract-ocr/b6075062-921e-4da9-acdf-b0364dc3c960n%40googlegroups.com?utm_medium=email&utm_source=footer>.