Re: Wrong space parsed pdf

Tilman Hausherr Thu, 25 Jan 2018 11:06:06 -0800

The font has some extremely high values that we use for our heuristics, and these are misleading the software.

I'll see if something can be done... but I suspect that it requires a change that would break other text extractions, so we can't commit it to the repository.
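
To illustrate the kind of failure described above, here is a simplified sketch of a width-based word-break heuristic. It is not PDFBox's actual algorithm: the class name, the method, the `referenceWidth` parameter, and all numbers are hypothetical and chosen only for illustration. The assumption is that a space is emitted whenever the gap between two text chunks exceeds a threshold derived from a width reported by the font.

```java
// Hypothetical illustration only; this is NOT the PDFBox implementation.
final class SpaceHeuristicSketch {

    /**
     * Joins text chunks, inserting a space when the horizontal gap between
     * two consecutive chunks exceeds referenceWidth * tolerance.
     *
     * @param texts          the chunks in reading order
     * @param startX         start x position of each chunk
     * @param widths         rendered width of each chunk
     * @param referenceWidth width taken from the font metrics (assumed input)
     * @param tolerance      scaling factor applied to the reference width
     */
    static String join(String[] texts, float[] startX, float[] widths,
                       float referenceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        float threshold = referenceWidth * tolerance;
        for (int i = 0; i < texts.length; i++) {
            if (i > 0) {
                // Gap between the end of the previous chunk and the start of this one.
                float endOfPrevious = startX[i - 1] + widths[i - 1];
                float gap = startX[i] - endOfPrevious;
                if (gap > threshold) {
                    out.append(' ');
                }
            }
            out.append(texts[i]);
        }
        return out.toString();
    }
}
```

With a plausible reference width, a small gap is recognized as a word break. If the font reports an inflated value, the threshold grows with it and the same gap is no longer treated as a space, which yields merged words such as "aboutmidnight".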


Tilman

On 25.01.2018 at 15:20, Hesham Gneady wrote:

Hello,

When extracting text from a PDF with PDFBox v2.0.8, many spaces are ignored,
so words are merged together. You can test with a 1-page sample PDF from
here:

https://www.dropbox.com/s/9i9ofl3tje4iy3k/wrong_space_parsed_sample.pdf?dl=1

You can see incorrectly read words such as:

aboutmidnight, andbefore, CountyDonegal, ...

I have tried using PDFTextStripper.setAverageCharTolerance(...) to control
space sensitivity, but it made no difference.

Any idea why this happens and how to fix it?

Best regards,

Hesham



