On 28.08.2017 at 09:36, 二川村田 wrote:
Hello.

I noticed the cause.

The cause is that PDFTextStripper.processTextPosition and
stripper.getText return the characters in a different order.


Hi,

It's more complex than that. I ran your code and was surprised too.

Your code gets the text, then for each character in the decoded text it uses the character's offset to index into the list you built by overriding processTextPosition().

This fails after a while because "25" appears twice in the PDF, both times at the exact same x/y position. You can see this by looking at the page content stream with the PDFDebugger command line application; you'll find this segment twice:

  10.3477 0 0 10.4288 534.7 29.2994 Tm
  (25) Tj

534.7 29.2994 is the position.

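PDFBox text extraction detects this duplicate and outputs it only once, so from that point on the character offsets in the extracted text no longer line up with the indexes in your list.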

To prevent this from happening, use this call:

    stripper.setSuppressDuplicateOverlappingText(false);

Of course, relying only on PDFTextStripper.processTextPosition works too.
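
To illustrate why the offsets drift, here is a minimal sketch in plain Java. It does not use PDFBox at all: Glyph, sampleGlyphs, extractText and firstMismatch are hypothetical names made up for this example, and the duplicate check is an exact-match simplification (the real stripper is more tolerant and also inserts spaces and line separators, which this sketch ignores). Only the position 534.7 29.2994 is taken from your PDF; the other coordinates are invented.

```java
import java.util.ArrayList;
import java.util.List;

final class DuplicateSuppressionSketch {

    // Hypothetical stand-in for a text position: one character plus its x/y.
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // The glyphs in the order a processTextPosition() override would collect
    // them: "25" is drawn twice at exactly the same position.
    static List<Glyph> sampleGlyphs() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("A", 100.0, 29.2994));
        glyphs.add(new Glyph("2", 534.7, 29.2994));
        glyphs.add(new Glyph("5", 540.0, 29.2994));
        glyphs.add(new Glyph("2", 534.7, 29.2994)); // duplicate
        glyphs.add(new Glyph("5", 540.0, 29.2994)); // duplicate
        glyphs.add(new Glyph("B", 560.0, 29.2994));
        return glyphs;
    }

    // Simplified model of text extraction: when suppressDuplicates is true,
    // a glyph with the same text at the same x/y as an earlier one is dropped.
    static String extractText(List<Glyph> glyphs, boolean suppressDuplicates) {
        StringBuilder text = new StringBuilder();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            if (suppressDuplicates) {
                for (Glyph k : kept) {
                    if (k.text.equals(g.text) && k.x == g.x && k.y == g.y) {
                        duplicate = true;
                        break;
                    }
                }
            }
            if (!duplicate) {
                kept.add(g);
                text.append(g.text);
            }
        }
        return text.toString();
    }

    // Index of the first character whose offset in the text does not point at
    // the matching glyph in the list, or -1 if text and list line up fully.
    static int firstMismatch(String text, List<Glyph> glyphs) {
        int n = Math.min(text.length(), glyphs.size());
        for (int i = 0; i < n; i++) {
            if (!glyphs.get(i).text.equals(String.valueOf(text.charAt(i)))) {
                return i;
            }
        }
        return text.length() == glyphs.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sampleGlyphs();
        String suppressed = extractText(glyphs, true);   // "A25B"
        String complete = extractText(glyphs, false);    // "A2525B"
        System.out.println(suppressed + " -> first mismatch at " + firstMismatch(suppressed, glyphs));
        System.out.println(complete + " -> first mismatch at " + firstMismatch(complete, glyphs));
    }
}
```

With suppression on, the text is one "25" shorter than the list, so offset 3 in the text ("B") points at the second "2" in the list. With suppression off, both have the same length and every offset matches.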

Tilman

