Here is the resend for the group.
Cheers
 Nor
-------- Forwarded Message --------
Subject:        Re: [tesseract-ocr] App to adjust imgage scaling
Date:   Fri, 21 Jul 2023 12:45:21 -0400
From:   astro <njsgas...@gmail.com>
To:     Ger Hobbelt <g...@hobbelt.com>



Hi Ger,
 The images I'm scanning are trail camera images that have the date/time printed in the bottom corner of the picture, and I'm trying to extract those date/time values. The images are normally 1440x1080 at 96 dpi. At first, the only way I could get tesseract to read some of the time stamp was by upping the image size. I have since changed my strategy and used ImageMagick to crop the bottom corner of the image that contains the date/time to a 540x70 image, leaving it at 96 dpi (see attached). That seems to work very well. I'm now looking to increase the reliability by trying various things, including correcting the output where possible.
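
For the "correcting the output" step, a minimal sketch could look like the following. It assumes the stamp is laid out as MM/DD/YYYY HH:MM:SS, which the thread does not state, so both the format and the character substitution table are hypothetical and need adjusting to the actual camera overlay.

```python
import re
from datetime import datetime

# Assumed look-alike substitutions for a digits-only stamp; extend as needed.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1",
                             "|": "1", "S": "5", "B": "8", "Z": "2"})

# Hypothetical overlay layout: MM/DD/YYYY HH:MM:SS
STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"
STAMP_PATTERN = re.compile(
    r"(\d{2})\D(\d{2})\D(\d{4})\s+(\d{2})\D(\d{2})\D(\d{2})")


def parse_stamp(ocr_text):
    """Return a datetime parsed from raw OCR text, or None if it is not usable."""
    cleaned = ocr_text.strip().translate(DIGIT_FIXES)
    match = STAMP_PATTERN.search(cleaned)
    if match is None:
        return None
    month, day, year, hour, minute, second = match.groups()
    try:
        # strptime rejects impossible values such as month 13 or hour 25.
        return datetime.strptime(
            f"{month}/{day}/{year} {hour}:{minute}:{second}", STAMP_FORMAT)
    except ValueError:
        return None
```

Returning None rather than a guess makes it easy to collect the images that still need a second OCR attempt or a manual look.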

Thanks for the reply.

Cheers
 Nor

On 7/21/2023 12:01 PM, Ger Hobbelt wrote:
6000*4500?!

Hm, sounds way too large for a simple text.

I'm guessing here, but it might be that you got thwarted by the various "dpi" notes re ocr/tesseract out there.

Bottom line: IIRC tesseract was trained on text of around 30px high. (Note that I use PX = pixels as the relevant unit of measure. I don't care about dpi because that's something only really relevant to printing press people: desktop publishing, etc.) While a lot of folks hang onto dpi as a unit of measure, it's derivative and only relevant when you scan printed pages, which turns "points" (and picas and ....) into pixels, which is where dpi pops up.

Anyway, the key bit for every image you feed to an OCR engine like tesseract is to match the "x height" of the training material as closely as possible, for any attempt at a good/optimal match. For tesseract, this means you should aim for each line of text to be somewhere between 20 and 50 pixels high (and as clean looking in black & white / greyscale as possible, but that comes second, after getting that line height into the 20-50px range). Computers work in PX, not DPI, so it's PX that's the driving criterion.
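
To make the pixel-height rule concrete, here is a small sketch that turns a measured text height into an ImageMagick `-resize` percentage. The function name and the default target of 32px are illustrative assumptions; the measured height is something you read off the image yourself, for example in an image viewer.

```python
def resize_args(measured_px, target_px=32):
    """Build ImageMagick -resize arguments that scale text to target_px high.

    measured_px: current height of the text line in pixels (measured by hand).
    target_px:   desired height; must lie in the 20-50px range discussed above.
    """
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    if not 20 <= target_px <= 50:
        raise ValueError("target_px should stay within 20-50 pixels")
    percent = round(100 * target_px / measured_px)
    return ["-resize", f"{percent}%"]
```

For text that is 16px high this yields `-resize 200%`, and text that is already 64px high would be scaled down with `-resize 50%`. The image dimensions and dpi never enter the calculation.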

Since you mention "picking out a date", I ASSUME your text area is one line of text only.
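
If the input really is a single line, tesseract can be told so with its page segmentation mode 7 ("treat the image as a single text line"). The sketch below only assembles and runs that command; the helper names are made up for illustration, and it assumes `tesseract` is on the PATH.

```python
import subprocess


def tesseract_cmd(image_path, psm=7):
    """Return the tesseract command line for a single-line image."""
    # "stdout" makes tesseract print the text instead of writing a .txt file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def run_ocr(image_path):
    """Run tesseract on image_path and return the recognised text."""
    result = subprocess.run(tesseract_cmd(image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Keeping the command in a list also makes it easy to paste the exact invocation into a follow-up message.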

Drop all image areas that do not contain text.
Make sure the text is black on a white background (you may need to invert your image when this is a video grab or some such, for example). There's a long wiki page about improving image quality for tesseract processing too. But first try to extract that line of text, scale it so the digits are between 20 and 50px high, and try some sizes within that range.
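
Dropping everything except the text area amounts to a crop. As a sketch, assuming the stamp sits in a bottom corner and using the 1440x1080 frame and 540x70 crop sizes mentioned in this thread, the ImageMagick `-crop` geometry can be computed like this (which corner to use is an assumption, hence the `right` switch):

```python
def bottom_corner_crop(img_w, img_h, crop_w, crop_h, right=False):
    """Build ImageMagick arguments that keep only a bottom-corner region.

    The geometry is WxH+X+Y, with X and Y measured from the top-left corner.
    """
    if crop_w > img_w or crop_h > img_h:
        raise ValueError("crop region is larger than the image")
    x = img_w - crop_w if right else 0
    y = img_h - crop_h
    # +repage discards the leftover canvas offset after cropping.
    return ["-crop", f"{crop_w}x{crop_h}+{x}+{y}", "+repage"]
```

For a 1440x1080 frame this gives `-crop 540x70+0+1010 +repage` for the bottom-left corner, which would be placed between the input and output file names of an ImageMagick command.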

Second most important bit, I find, is making sure the input image has black text on white background or anything greyscale/luminance-wise that approaches this as best as possible. SOME tesseract modes / settings can cope with white text on black BG, but that's you getting rather lucky so don't bet on it.
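
One way to decide automatically whether an image needs inverting is to look at the typical brightness of the crop. The sketch below assumes the background covers more than half of the pixels, which is usually but not always true for a tight text crop, and takes greyscale values (0 = black, 255 = white) obtained from whatever image library is at hand. The function names are illustrative.

```python
from statistics import median


def needs_invert(gray_pixels, threshold=128):
    """Return True when the background looks dark, i.e. light text on dark.

    gray_pixels: flat sequence of greyscale values in the range 0-255.
    Assumes the background is the majority of the pixels, so the median
    value reflects the background rather than the text.
    """
    if not gray_pixels:
        raise ValueError("no pixel data")
    return median(gray_pixels) < threshold


def polarity_args(gray_pixels):
    """Return ["-negate"] for ImageMagick when the image should be inverted."""
    return ["-negate"] if needs_invert(gray_pixels) else []
```

The threshold of 128 is simply the midpoint of the 0-255 range; crops with strong lighting gradients may need something less naive.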

tesseract is *engineered* for black text on a white background in its input images (paper book scans).

If you need further assistance on this forum/mailing list, attach the image and the tesseract commandline you tried; those messages get more feedback as they are less of a guessing game ;-)

PS: third most important work item that lots of folks do wrong: when clipping/extracting lines of text, postprocess those line images by adding a nice large white = BACKGROUND COLOR boundary around the entire line. Personally, I favor a "border" like that of about 0.5 to 1.0 times the height of the line itself. The added border should transition SMOOTHLY from the actual image background to prevent false edge detections in tesseract itself: this problem doesn't happen for clean paper book scans (which already have a plain white background) but is an important aspect when extracting from "busy backgrounds". Anyway, that topic is the size of a book all by itself, so take it slow and get prio 1 right first: 1 line of text to ocr = 20-50px high.
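
As a rough sketch of the border sizing only, the padding can be derived from the line height and handed to ImageMagick's `-bordercolor` and `-border` options. Note the limitation: this adds a flat border, so it only blends in when the crop background is already plain white (for example after thresholding); the smooth transition for busy backgrounds is not covered here. The function name and default ratio are assumptions.

```python
def border_args(line_height_px, ratio=0.75, color="white"):
    """Build ImageMagick arguments for a flat border around a text line.

    ratio: border width as a fraction of the line height (0.5 to 1.0).
    """
    if line_height_px <= 0:
        raise ValueError("line_height_px must be positive")
    if not 0.5 <= ratio <= 1.0:
        raise ValueError("ratio should stay within 0.5-1.0")
    pad = round(line_height_px * ratio)
    return ["-bordercolor", color, "-border", f"{pad}x{pad}"]
```

For a 32px line and the default ratio this produces `-bordercolor white -border 24x24`.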

Cheers,

Ger




On Fri, 21 Jul 2023, 13:35 astro, <njsgas...@gmail.com> wrote:

    Hi Ger,
    Thanks for your response. Yes, I found ImageMagick. It looks to be
    very powerful and easy to implement. I tried it out by upping the
    image to 300 dpi and 6000x4500 and ran the image through the OCR
    process, but tesseract had difficulty picking out the date on
    the image. I guess I will have to play around to see if I can
    improve things.

    Cheers
     Nor

    On 7/21/2023 12:13 AM, Ger Hobbelt wrote:
    Check out ImageMagick, an open source image toolset. Specifically
    the 'convert' tool, look for commandline usage and application
    parameters/arguments, where you will find several ways to
    resize/rescale the image.
    It is also useful for "tweaking" the image as part of your OCR
    preprocessing pipeline before your image reaches tesseract.

    Another big one would be OpenCV, but that would require you to
    write programs (python software or similar) while ImageMagick can
    accomplish a lot of what you want or might need and can be driven
    by some simple batch / Powershell / shell lines: much easier to
    get success that way if you're not already comfortable with
    coding software.

    https://legacy.imagemagick.org/Usage/resize/
    May appear overwhelming at first; read and try the various ways
    mentioned there to get a grasp and discover what you need to do
    for your scenario specifically. Ocr is not a simple process
    pipeline, so take your time with it.


    On Thu, 20 Jul 2023, 15:03 nor s, <njsgas...@gmail.com> wrote:

        I'm trying to run tesseract-OCR on images that come to me at
        72 dpi. The program is unable to decode these images and
        requires a scale of 200 dpi or better to be successful. Is
        there a program available, similar to tesseract-OCR, that
        would read a command line, convert a 72 dpi image to 200
        dpi or some other specified value, and save it in a specified
        location? I'm running Windows 10.
        I can make these changes in Photoshop, but I'm trying to
        automate the process since I have a lot of images to scan.

        Any suggestion would be greatly appreciated.

        Thanks
         Nor
        --
        You received this message because you are subscribed to the
        Google Groups "tesseract-ocr" group.
        To unsubscribe from this group and stop receiving emails from
        it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
        To view this discussion on the web visit
        https://groups.google.com/d/msgid/tesseract-ocr/b6075062-921e-4da9-acdf-b0364dc3c960n%40googlegroups.com



