Ignore my last post, I completely forgot what it was really about. I'll look at this matter again.

Tilman

On 21.12.2017 at 10:43, Tilman Hausherr wrote:
Thanks, and yes, it is what I mentioned: the pages I looked at don't have spaces. PDF is mostly a graphic format. Spaces are not needed; glyphs are simply placed at the correct position.

Tilman



On 21.12.2017 at 02:21, Dan Liu wrote:
Hello all:
     I'm using PDFBox 2.0.8. The test PDF file can be downloaded from http://proj.gz-yibo.com:2880/nk7.pdf

------------------
With best regards
Daniel







------------------ Original ------------------
From:  "Tilman Hausherr";<[email protected]>;
Date:  Wed, Dec 20, 2017 04:43 PM
To:  "users"<[email protected]>;

Subject:  Re: all spaces between english words is lost after extraction



Hi,

Please upload your file to a sharehoster. Also mention what PDFBox
version you are using.

If the PDF doesn't have spaces (most PDFs don't), then you won't get any
positions for them.

High level PDFBox text extraction (i.e. just get text) creates spaces by
using heuristics.
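
As a rough illustration of what such a heuristic can look like, here is a
minimal sketch. It is not PDFBox's actual algorithm: the class name, the
Glyph type, and the 0.3 gap factor are all made up for the example. The idea
is only that a space is emitted when the horizontal gap between two
neighbouring glyphs is large compared to the glyph width.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a gap-based space heuristic; not PDFBox code.
class SpaceHeuristicSketch {

    /** One positioned glyph: character, left edge x, and advance width. */
    public static final class Glyph {
        final char c;
        final float x;
        final float width;

        public Glyph(char c, float x, float width) {
            this.c = c;
            this.x = x;
            this.width = width;
        }
    }

    /** Assumed threshold: a gap wider than 30% of the previous glyph counts as a space. */
    static final float GAP_FACTOR = 0.3f;

    /** Joins glyphs (already sorted left to right on one line), inserting spaces at wide gaps. */
    public static String joinWithSpaces(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                // Distance between the right edge of the previous glyph and the left edge of this one.
                float gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.c);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph('a', 0f, 5f),
                new Glyph('b', 5f, 5f),
                new Glyph('c', 14f, 5f), // 4-unit gap after 'b'
                new Glyph('d', 19f, 5f));
        System.out.println(joinWithSpaces(glyphs)); // prints "ab cd"
    }
}
```

Code that only prints each character with its position never runs a step
like this, which is why the words come out glued together.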

Tilman

On 20.12.2017 at 03:46, Dan Liu wrote:
Hello all:
     I extract the text using the code from
https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/
, but all spaces between English words are lost.

Such as:
"severe acute respiratory syndrome"

becomes:
severeacuterespiratorysyndrome

The attachment is the original text.


------------------

With best regards
Daniel


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
