Hi everyone, does anyone know a way to make Tesseract treat large horizontal distances as separators between blocks?
Consider the attached document (it's just a random example from the web; Tesseract shows the same behavior on similar documents). Tesseract consistently fails to recognize the two separate blocks in the header and instead reads the words line by line. The output then looks like this:

COUR EUROPEENNE EUROPEAN COURT
des of
DROITS DE L’HOMME HUMAN RIGHTS

Whereas it should clearly look like this:

COUR EUROPEENNE
des
DROITS DE L’HOMME

EUROPEAN COURT
of
HUMAN RIGHTS

Looking at the detected blocks, it becomes clear that Tesseract does not treat the two header blocks as separate, even though they are clearly distinguishable by the wide horizontal gap between them.

Is there a way to tweak Tesseract's block/paragraph detection to be more sensitive to this and correctly separate the header blocks?

This problem has been haunting me for a while now. Tesseract is such a powerful tool and does such a great job with tasks that are far more complex that I just cannot accept that it can't get this right.

Thanks in advance for your help,
best,
Peter

PS: Below is the version I'm using. I do not think this is a version problem, though; the issue is the same with version 3.

tesseract 4.0.0-beta.3-199-gba757
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found SSE
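To make the expected reading order for the header example concrete, here is a minimal Python sketch of the grouping described above: words on a line are split wherever the horizontal gap exceeds a threshold, and the resulting segments are merged into blocks by horizontal overlap. This only illustrates the desired behavior and is not how Tesseract segments pages. The function name `split_blocks`, the `min_gap` parameter, the tuple layout, and all coordinates are invented for illustration; in practice the word boxes could come from Tesseract's TSV or hOCR output.

```python
from typing import Dict, List, Tuple

# Hypothetical word record: (text, left_x, right_x, line_no) in pixels.
Word = Tuple[str, int, int, int]


def split_blocks(words: List[Word], min_gap: int) -> List[str]:
    """Group words into blocks, splitting each line at gaps wider than min_gap."""
    blocks: List[Dict] = []  # each block tracks its x-extent and its words
    for line_no in sorted({w[3] for w in words}):
        line = sorted((w for w in words if w[3] == line_no), key=lambda w: w[1])

        # Split the line into segments at every wide horizontal gap.
        segments = [[line[0]]]
        for prev, cur in zip(line, line[1:]):
            if cur[1] - prev[2] > min_gap:
                segments.append([cur])
            else:
                segments[-1].append(cur)

        # Attach each segment to the first block it overlaps horizontally,
        # or start a new block if it overlaps none.
        for seg in segments:
            left, right = seg[0][1], seg[-1][2]
            for block in blocks:
                if left <= block["right"] and right >= block["left"]:
                    block["left"] = min(block["left"], left)
                    block["right"] = max(block["right"], right)
                    block["words"].extend(w[0] for w in seg)
                    break
            else:
                blocks.append(
                    {"left": left, "right": right, "words": [w[0] for w in seg]}
                )

    blocks.sort(key=lambda b: b["left"])  # read blocks left to right
    return [" ".join(b["words"]) for b in blocks]


# Invented coordinates that mimic the two-column header.
header: List[Word] = [
    ("COUR", 100, 180, 0), ("EUROPEENNE", 190, 380, 0),
    ("EUROPEAN", 700, 860, 0), ("COURT", 870, 960, 0),
    ("des", 220, 260, 1), ("of", 820, 850, 1),
    ("DROITS", 100, 200, 2), ("DE", 210, 245, 2), ("L’HOMME", 255, 380, 2),
    ("HUMAN", 720, 820, 2), ("RIGHTS", 830, 940, 2),
]

for block_text in split_blocks(header, min_gap=100):
    print(block_text)
```

With a threshold of 100 pixels, the wide gap in the middle of each line splits the header into a left and a right block, which are then read one after the other instead of line by line.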

