Hi everyone,

does anyone know a way to make tesseract treat large horizontal 
distances as separators between blocks?

Consider the attached document (it's just a random example from the web; 
tesseract shows the same behavior on similar documents). Tesseract 
consistently fails to recognize the two separate blocks in the header and 
instead reads the words line by line.

The output then looks like this:
COUR EUROPEENNE EUROPEAN COURT
des of
DROITS DE L’HOMME HUMAN RIGHTS

It should instead look like this:
COUR EUROPEENNE 
des 
DROITS DE L’HOMME 

EUROPEAN COURT
of
HUMAN RIGHTS


Looking at the blocks, it becomes clear that tesseract does not recognize 
the two header blocks as separate, even though they are clearly 
distinguishable.
Is there a way to tweak tesseract's block/paragraph detection to be more 
sensitive to this and correctly separate the header blocks?
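
To make the separation I am after more concrete, here is a minimal Python 
sketch that splits each line at unusually large horizontal gaps and collects 
the pieces into columns. It is only an illustration of the desired grouping, 
not something tesseract does internally. The word coordinates, the 200-pixel 
threshold, and the function names are made up for this example; in practice 
the boxes would have to come from tesseract's word-level output (for 
instance the TSV output), which I am assuming here.

```python
from typing import List, Tuple

# (text, left, right) in pixels. Hypothetical values, not real tesseract output.
Word = Tuple[str, int, int]


def split_line_at_gaps(words: List[Word], min_gap: int) -> List[List[str]]:
    """Split one text line into segments wherever the horizontal gap
    between neighbouring words is at least min_gap pixels."""
    segments: List[List[str]] = []
    prev_right = None
    for text, left, right in sorted(words, key=lambda w: w[1]):
        if prev_right is None or left - prev_right >= min_gap:
            segments.append([])  # a large gap starts a new segment
        segments[-1].append(text)
        prev_right = right
    return segments


def split_header(lines: List[List[Word]], min_gap: int) -> List[List[str]]:
    """Collect segment i of every line into column i."""
    columns: List[List[str]] = []
    for words in lines:
        for i, segment in enumerate(split_line_at_gaps(words, min_gap)):
            while len(columns) <= i:
                columns.append([])
            columns[i].append(" ".join(segment))
    return columns


# Made-up word boxes for the three header lines of the example document.
header = [
    [("COUR", 100, 180), ("EUROPEENNE", 195, 400),
     ("EUROPEAN", 900, 1060), ("COURT", 1075, 1180)],
    [("des", 230, 270), ("of", 1020, 1050)],
    [("DROITS", 80, 190), ("DE", 205, 245), ("L’HOMME", 260, 410),
     ("HUMAN", 910, 1020), ("RIGHTS", 1035, 1160)],
]

columns = split_header(header, min_gap=200)
for column in columns:
    print("\n".join(column))
    print()
```

This naive version assumes every line contains a piece of every column; a 
line with text in only one column would always end up in the first one, so 
it is no substitute for proper block detection.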

This problem has been haunting me for a while now. Tesseract is such a 
powerful tool and does such a great job with tasks that are far more 
complex that I just cannot accept that it can't get this right.

Thanks in advance for your help,
best,
Peter


PS:
Find below the version I'm using. I do not think this is specific to the 
version, though: the issue is the same with version 3.
tesseract 4.0.0-beta.3-199-gba757
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found SSE
