Re: [tesseract-ocr] [4.00] Extra symbols produced

Lorenzo Bolzani Fri, 01 Mar 2019 01:46:26 -0800

Yes, I have the same problem: some characters are split, and sometimes one
character even comes out as three ("O0O", for example).


https://github.com/tesseract-ocr/tesseract/issues/1778


I wrote some fairly complex code to try to limit the problem (with psm 13). The
idea is this:

Process each symbol individually with the iterator:
 - add the symbol to the current group
 - check if you can close the group
 - if you can close it, pick the best symbol/symbols and add them to the
result; leave the rest for the following check.

The criteria for "closing" a group are based on the distance between symbols,
symbol size and confidence. You also need to take care not to drop the spaces,
as these are not handled as symbols. Quite a mess.
You need to look at the next symbol to decide what to do: a symbol can be
"cancelled" by the next one or by the one after that. My code does not fix
the problem completely, but the result is reasonable (with false negatives and
a few false positives).
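
To make the idea concrete, here is a much simplified sketch in Python. It is
not my actual code: it only uses horizontal overlap and confidence, keeps a
single symbol per group, and ignores spaces and symbol size. The Symbol class,
the dedupe_symbols name and the 0.5 overlap threshold are all assumptions made
for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Symbol:
    """One recognized symbol (hypothetical structure, not a tesseract type)."""
    text: str
    conf: float  # confidence, higher is better
    left: int    # left edge of the bounding box, in pixels
    right: int   # right edge of the bounding box, in pixels


def overlap_ratio(a: Symbol, b: Symbol) -> float:
    """Horizontal overlap as a fraction of the narrower of the two boxes."""
    inter = min(a.right, b.right) - max(a.left, b.left)
    if inter <= 0:
        return 0.0
    smaller = min(a.right - a.left, b.right - b.left)
    return inter / smaller if smaller > 0 else 0.0


def dedupe_symbols(symbols: List[Symbol], min_overlap: float = 0.5) -> str:
    """Group symbols whose boxes overlap and keep the most confident of each group."""
    result: List[Symbol] = []
    group: List[Symbol] = []
    for sym in symbols:
        # Close the group once the new symbol overlaps none of its members.
        if group and all(overlap_ratio(sym, g) < min_overlap for g in group):
            result.append(max(group, key=lambda s: s.conf))
            group = []
        group.append(sym)
    if group:
        result.append(max(group, key=lambda s: s.conf))
    return "".join(s.text for s in result)


# "15" read as "156": the "6" sits on the same box as the "5".
symbols = [
    Symbol("1", 95.0, 0, 10),
    Symbol("5", 93.0, 12, 22),
    Symbol("6", 60.0, 13, 22),
]
print(dedupe_symbols(symbols))
```

A real version would fill the symbols from the iterator and would need the
extra rules described above (distance, size, spaces, looking two symbols
ahead).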

If you want to try this, I suggest first writing some code to visualize the
boxes, like this:

[image: ocr_boxes_sanit2_11500.png]
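
One possible way to get such a picture without extra libraries is to write the
boxes out as an SVG overlay and open it in a browser. This is only a sketch
and not the code that produced the image above: the boxes_to_svg name and the
(text, left, top, right, bottom) tuple layout are assumptions, and the
coordinates are assumed to have the origin in the top-left corner.

```python
from typing import Iterable, Optional, Tuple
from xml.sax.saxutils import escape, quoteattr

# Hypothetical box layout: (text, left, top, right, bottom), top-left origin.
Box = Tuple[str, int, int, int, int]


def boxes_to_svg(boxes: Iterable[Box], width: int, height: int,
                 image_href: Optional[str] = None) -> str:
    """Return an SVG document with one rectangle and one label per symbol box."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    if image_href:
        # Optionally show the page image underneath the boxes.
        parts.append(
            f'<image href={quoteattr(image_href)} '
            f'width="{width}" height="{height}"/>'
        )
    for text, left, top, right, bottom in boxes:
        parts.append(
            f'<rect x="{left}" y="{top}" width="{right - left}" '
            f'height="{bottom - top}" fill="none" stroke="red"/>'
        )
        # Put the recognized text just above the box, kept inside the canvas.
        parts.append(
            f'<text x="{left}" y="{max(top - 2, 10)}" font-size="10" '
            f'fill="blue">{escape(text)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = boxes_to_svg([("1", 5, 15, 15, 35), ("5", 17, 15, 27, 35)], 100, 50)
with open("ocr_boxes.svg", "w", encoding="utf-8") as fh:
    fh.write(svg)
```

Overlapping rectangles then show immediately where one character was reported
as two or three symbols.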


The very latest version of tesseract (check out and build from GitHub)
handles boxes in a different (better) way, so you may want to use that if you
try this. I do not know whether it fixes this problem too.


Lorenzo

On Fri, 1 Mar 2019 at 10:07, <[email protected]> wrote:

> Gday.
>
> Using 4.00, compiled from release src, Linux env, LSTM engine.
>
> I have pages produced from PDFs (ghostscript) with 300 dpi, then
> greyscaled using opencv.
>
> Found an issue where the OCR output for some specific regions has more
> symbols than there are in the image.
>
> Example: there is an outstanding "word" with "15" in it (actually, it is a
> part of date - like "15 OCT", identified as two words - which is correct).
> Box coords are correct, no other symbols fit in, but output from running
> tesseract .. --psm 11 --dpi 300 is "156" (instead of "15").
>
> If I cut that part of the image and save it as a separate file, then OCR
> it with psm=6 (or 7) - output is "15" (correct).
>
> I encountered such behavior only on several symbol combinations - like
> "15"->"156", "08"->"0O8". It looks like when the confidence levels of the
> top two identified symbols are very close, both symbols go to the output
> instead of one.
>
> Has anyone had the same issue?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/f8649172-a33b-4d29-900d-fc49ff5d42bc%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/f8649172-a33b-4d29-900d-fc49ff5d42bc%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

