Re: [tesseract-ocr] Reading # from image only ~75% successful

Dmitri Silaev Fri, 29 Sep 2017 18:36:46 -0700

Hi Ben,

What you want to achieve is not possible with Tesseract alone. At all. Nor
with ABBYY or any other OCR engine, if you use them out of the box.
Well, maybe something *might* be done if you combine it with some
ImageMagick-fu, but I'm not sure, and only if you put some serious
restrictions on the images. Maybe an interactive mobile app would let you
disguise some of those restrictions in an unobtrusive manner.


I think what you really want is a system that can work with arbitrary
images, not only with scanned paper pages, which are what a regular OCR
system is designed for. Such a system would implement special logic to
detect, enhance, and recognize text in "OCR-tough" conditions. I'm not even
going to list here which conditions in your images make them tough for OCR.
There are lots of them.

And yes, such systems exist. If you'd like to know more, just PM me, I'd be
happy to help.

Best regards,
Dmitri Silaev
www.CustomOCR.com



On Fri, Sep 29, 2017 at 12:12 PM, Ben Schipper <[email protected]>
wrote:

> I am attempting to read a fairly large 6-digit number from an image using
> Tesseract 3.02 on a Windows 7 machine.
>
> I have been able to get slightly better results by resampling the image to
> 300 dpi using ImageMagick, but I am still only able to get ~75% accuracy.
> I have tried some other options (-lat, -blur, -contrast-stretch), but they
> only seem to make it worse. (I am not a graphic designer, so most sources of
> image manipulation help are Greek to me.)
>
> Since the image does not contain many dictionary words I am using a config
> file to disable the dictionary
> (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns)
>
> load_freq_dawg 0
> load_system_dawg 0
> load_punc_dawg 0
>
> Whitelisting digits only didn't help: it just returned more characters as
> digits, which made it more difficult to pull out the 6-digit # that I
> wanted.
>
> Unfortunately, the data that I am pulling from the image can be located in
> different regions of the image, so I can't crop the image.
>
> Image samples attached.  The largest text is the # that I would like to
> extract in both cases.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/b02239a9-51bb-40de-af87-db2e2bea0574%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
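
On the whitelisting point in the quoted message: one possible way to pull
the wanted value out of noisy digit-only output is to keep only runs of
exactly six digits. The sketch below is just an illustration under that
assumption, not something Tesseract provides. The function name and the
sample string are made up for the example, and it cannot help when the OCR
misreads a digit inside the number itself.

```python
import re

# A run of exactly six digits: not preceded and not followed by another digit.
SIX_DIGITS = re.compile(r"(?<!\d)\d{6}(?!\d)")


def find_six_digit_numbers(ocr_text):
    """Return every standalone 6-digit run in the OCR output, in order."""
    return SIX_DIGITS.findall(ocr_text)


# Hypothetical OCR output, invented for illustration only.
sample = "1234567 88\n402917\n12 3456"
print(find_six_digit_numbers(sample))
```

If more than one candidate comes back, something else (position, size of
the text box) would still be needed to choose between them.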

