Re: Couldn't be retrieve some of character's locations.

Tilman Hausherr Mon, 28 Aug 2017 08:27:37 -0700

On 28.08.2017 at 09:36, 二川村田 wrote:

Hello.


I found the cause.

The order of the characters retrieved by
PDFTextStripper.processTextPosition differs from the order in the result of stripper.getText.



Hi,

it's more complex. I ran your code and was surprised too.

What your code does is to get the text, then for each character in the decoded text use its offset to access the list you got by overriding processTextPosition().

This failed after some time because "25" appears twice in the PDF, both times at the exact same x/y position. You can see this by looking at the page content stream with the PDFDebugger command line application; you'll find this segment twice:


  10.3477 0 0 10.4288 534.7 29.2994 Tm
  (25) Tj

534.7 29.2994 is the position.

PDFBox text extraction detects this duplicate and has it only once in the result.


To prevent this from happening, use this call:

    stripper.setSuppressDuplicateOverlappingText(false);

Of course, using only PDFTextStripper.processTextPosition works too.
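
To illustrate why the offsets drift apart, here is a minimal, self-contained sketch that does not use PDFBox at all. The Glyph class, the sample data (apart from the 534.7 29.2994 position) and the exact-match duplicate check are assumptions made for the illustration; the real detection in PDFBox is more involved.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateSuppressionDemo {

    /** Hypothetical stand-in for a PDFBox TextPosition: one character and its x/y origin. */
    public static final class Glyph {
        public final char ch;
        public final double x;
        public final double y;

        public Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Glyphs in the order a processTextPosition() override would collect them. */
    public static List<Glyph> sampleGlyphs() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('2', 534.7, 29.2994));
        glyphs.add(new Glyph('5', 540.0, 29.2994)); // x of the second digit is invented
        glyphs.add(new Glyph('2', 534.7, 29.2994)); // "25" drawn again at the same place
        glyphs.add(new Glyph('5', 540.0, 29.2994));
        glyphs.add(new Glyph('A', 100.0, 20.0));    // invented text that follows
        return glyphs;
    }

    /** Simplified extraction: optionally drops a glyph already seen at the same position. */
    public static String extractText(List<Glyph> glyphs, boolean suppressDuplicates) {
        StringBuilder text = new StringBuilder();
        Set<String> seen = new HashSet<>();
        for (Glyph g : glyphs) {
            String key = g.ch + "@" + g.x + "," + g.y;
            boolean firstTime = seen.add(key);
            if (suppressDuplicates && !firstTime) {
                continue; // duplicate overlapping glyph: not part of the text
            }
            text.append(g.ch);
        }
        return text.toString();
    }

    /** First offset where text and glyph list disagree, or -1 if they line up. */
    public static int firstMismatch(String text, List<Glyph> glyphs) {
        for (int i = 0; i < text.length() && i < glyphs.size(); i++) {
            if (text.charAt(i) != glyphs.get(i).ch) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sampleGlyphs();
        for (boolean suppress : new boolean[] {true, false}) {
            String text = extractText(glyphs, suppress);
            System.out.println("suppress=" + suppress + " text=" + text
                    + " firstMismatch=" + firstMismatch(text, glyphs));
        }
    }
}
```

With suppression on, the text is shorter than the collected list, so every offset after the dropped duplicate points at the wrong glyph. With suppression off, the two stay aligned.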

Tilman

