Re: [tesseract-ocr] OCR failing on simple and clear text codes

Dmitri Silaev Wed, 20 May 2015 05:48:18 -0700

>
> I can't really use pre defined patterns since the code pattern and font
> can change over time.


Consider using somewhat more flexible patterns, by means of '*'. Also, you
can put more than one pattern in "user-patterns". And fonts have nothing to
do with patterns.
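
To get a feel for what a looser pattern would accept, here is a rough Python
sketch that mimics the limited pattern syntax with an ordinary regular
expression. It is not Tesseract code. The class letters and the '\*' repeat
marker follow my reading of the comment in "dict/trie.h"; the ASCII-only
classes, the "one or more" meaning of '\*', and the helper name
tess_pattern_to_regex are assumptions made for illustration.

```python
import re

# Assumed ASCII approximations of the character classes named in dict/trie.h.
CLASS_MAP = {
    "c": "[A-Za-z]",     # alphabetic
    "d": "[0-9]",        # digit
    "n": "[A-Za-z0-9]",  # alphanumeric
    "a": "[a-z]",        # lower case
    "A": "[A-Z]",        # upper case
}


def tess_pattern_to_regex(pattern):
    """Translate a user-patterns style string into a compiled Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "*":
                if not parts:
                    raise ValueError("'\\*' must follow a character or class")
                # Assumption: the preceding item occurs one or more times.
                parts[-1] += "+"
            elif nxt in CLASS_MAP:
                parts.append(CLASS_MAP[nxt])
            elif nxt == "\\":
                parts.append(re.escape("\\"))  # escaped literal backslash
            else:
                raise ValueError("unsupported class: \\" + nxt)
            i += 2
        else:
            parts.append(re.escape(ch))  # ordinary literal character
            i += 1
    return re.compile("".join(parts))


fixed = tess_pattern_to_regex(r"\A\A\d\d\d\d\d\d\d\d\d\d")
flexible = tess_pattern_to_regex(r"\A\A\d\*")
```

Here "fixed" only accepts two upper-case letters followed by exactly ten
digits, while "flexible" accepts any number of digits after the two letters,
so it keeps working if the code length changes.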

Implementing your own char-by-char segmentation is relatively easy even
with ImageMagick and shell scripts, provided you receive nicely binarized
and cleaned source images. As far as I can see, that is indeed the case. I
suggest connected-component (CC) labeling. For one possible implementation,
see my reply here:
https://groups.google.com/d/msg/tesseract-ocr/STHaLGYsiCo/pYZyAG2AuMAJ
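
As a minimal sketch of CC labeling (not the implementation from the linked
reply), the Python below finds 8-connected foreground blobs in an already
binarized image and returns their bounding boxes from left to right. The
image is assumed to be a list of rows of 0/1 values with 1 = ink; the
function name connected_components is made up for this example.

```python
from collections import deque


def connected_components(image):
    """Return (left, top, right, bottom) boxes of 8-connected blobs, left to right."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # New blob: flood-fill it breadth-first and track its extent.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    # Tuples sort by their first element, i.e. by the left edge.
    return sorted(boxes)


sample = [
    [0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
boxes = connected_components(sample)
```

Each box can then be cropped out and passed to the classifier on its own.
Glyphs made of several blobs, or touching characters, would need extra
merging or splitting that this sketch does not attempt.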

From my experience, a problem like your #3 cannot be solved reliably by
parameter tweaking alone. You defeat one issue, and eventually another
arises. Then you waste time investigating whether it is caused by a recent
parameter change or is independent of it. Change back, tweak another, fight
a new issue. Repeat.

A better way is to *force* the conditions for reliable OCR: preprocessing,
white-/blacklists, your own segmentation using layout priors, etc.

Or, at the very least, *postprocess* the OCR output. E.g. at some positions
your O's are definitely zeros. I know people who ended up with *thousands*
of such rules for Tess output in an app that allows much more diverse input
than yours.
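
A positional rule of that kind could be sketched in Python as follows. The
layout (two letters, then digits) is taken from the pattern quoted further
down in this thread; the confusion tables and the name fix_code are
illustrative assumptions, not a vetted rule set.

```python
# Assumed look-alike substitutions; extend them from your own error logs.
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1",
                          "S": "5", "B": "8", "Z": "2"})
TO_LETTER = str.maketrans({"0": "O", "1": "I", "5": "S", "8": "B"})


def fix_code(raw, letters=2):
    """Apply position-dependent corrections to one OCR'd code string."""
    text = "".join(raw.split())  # drop any whitespace the OCR inserted
    head = text[:letters].translate(TO_LETTER)  # positions that must be letters
    tail = text[letters:].translate(TO_DIGIT)   # positions that must be digits
    return head + tail


cleaned = fix_code("AB0123O5678S")
```

Because each rule only fires where the layout says a letter or a digit must
be, a legitimate O in the two-letter prefix is left alone.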

-Dmitri





On Wed, May 20, 2015 at 2:52 PM, Yoann Nicod <[email protected]> wrote:

> Thanks for your reply,
>
> I can't really use pre defined patterns since the code pattern and font
> can change over time.
> I like the idea of segmenting the characters myself before giving them to
> tesseract one by one, but it looks time consuming (coding it, I mean).
> Isn't there any other suitable method? In particular to solve the 3rd
> issue, which I think must be easy to solve.
>
> On Wednesday, May 20, 2015 at 12:29:08 PM UTC+2, Dmitri Silaev wrote:
>>
>> One no-brainer method to try out would be turning off all dictionaries
>> and using your own custom "user-patterns" file. Since you mentioned "your
>> application", I suppose you can program, so you can take a look at the
>> comment preceding the read_pattern_list() declaration in "dict/trie.h"
>> for more details.
>>
>> It seems all your strings are of the same format:
>> \A\A\d\d\d\d\d\d\d\d\d\d
>> (Tess understands very limited pattern syntax).
>>
>> But if accuracy is critical in your app, in the long run I would
>> absolutely avoid using any parts of Tesseract except the char classifier.
>> I.e. crop every single char out of your source image and run Tess in the
>> single char PSM. I think it should be easy as long as the location of
>> every character is quite stable among your source images.
>> ImageMagick/shell scripts would suffice.
>>
>> Best regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/0da310e9-57b6-41a1-a363-66d35dc1bc19%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/0da310e9-57b6-41a1-a363-66d35dc1bc19%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

