[tesseract-ocr] Block detection in document header

Peter Sun, 12 Aug 2018 08:06:01 -0700

Hi everyone,

Does anyone know a way to make tesseract treat large horizontal
distances as separators between blocks?

Consider the attached document (it's just a random example from the web;
tesseract shows the same behavior on similar documents). Tesseract
consistently fails to recognize the two separate blocks in the header and
instead reads the words line by line across both blocks.

The output then looks like this:
COUR EUROPEENNE EUROPEAN COURT
des of
DROITS DE L’HOMME HUMAN RIGHTS

Where it should clearly look like this:
COUR EUROPEENNE
des
DROITS DE L’HOMME

EUROPEAN COURT
of
HUMAN RIGHTS

Looking at the blocks, it becomes clear that tesseract does not recognize
the two header blocks as separate, even though they are clearly
distinguishable.
Is there a way to tweak tesseract's block/paragraph detection to be more
sensitive to this and correctly separate the header blocks?
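
To make the desired separation concrete, here is a minimal sketch in Python.
It is a post-processing illustration, not a Tesseract setting: it assumes
word bounding boxes and line numbers are already available (for example from
Tesseract's TSV output). The Word type, the helper names split_columns and
column_text, the coordinates, and the 100-pixel threshold are all made up
for this example.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    left: int   # x coordinate of the left edge, in pixels
    width: int  # box width, in pixels
    line: int   # line number the word was assigned to


def split_columns(words: List[Word], min_gap: int) -> List[List[Word]]:
    """Split words into columns at horizontal gaps of at least min_gap pixels."""
    if not words:
        return []
    # Merge the horizontal extents of all words into covered intervals.
    spans = sorted((w.left, w.left + w.width) for w in words)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start - merged[-1][1] < min_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Assign each word to the interval that contains its left edge.
    columns: List[List[Word]] = [[] for _ in merged]
    for w in words:
        for i, (start, end) in enumerate(merged):
            if start <= w.left <= end:
                columns[i].append(w)
                break
    return columns


def column_text(column: List[Word]) -> str:
    """Join one column's words into lines, top to bottom, left to right."""
    lines = {}
    for w in sorted(column, key=lambda w: (w.line, w.left)):
        lines.setdefault(w.line, []).append(w.text)
    return "\n".join(" ".join(ws) for _, ws in sorted(lines.items()))


# Hypothetical boxes for the header above (coordinates are invented).
words = [
    Word("COUR", 50, 80, 1), Word("EUROPEENNE", 140, 200, 1),
    Word("EUROPEAN", 600, 160, 1), Word("COURT", 770, 100, 1),
    Word("des", 170, 50, 2), Word("of", 710, 30, 2),
    Word("DROITS", 40, 110, 3), Word("DE", 160, 40, 3),
    Word("L’HOMME", 210, 140, 3),
    Word("HUMAN", 620, 110, 3), Word("RIGHTS", 740, 120, 3),
]
columns = split_columns(words, min_gap=100)
print("\n\n".join(column_text(c) for c in columns))
```

With these invented coordinates the sketch prints the two header blocks one
after the other instead of interleaving them line by line. The threshold
would need tuning per document.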

This problem has been haunting me for a while now. Tesseract is such a
powerful tool and does such a great job with tasks that are far more
complex that I just cannot accept that it can't get this right.

Thanks in advance for your help,
best,
Peter

PS:
Find below the version I'm using. I do not think this is a version-specific
problem, though, since the issue is the same with version 3.
tesseract 4.0.0-beta.3-199-gba757
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff
4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found SSE
