On 28.08.2017 at 09:36, 二川村田 wrote:
Hello.
I found the cause.
It is the difference in the order of the characters retrieved by
PDFTextStripper.processTextPosition and by stripper.getText.
Hi,
it's more complex. I ran your code and was surprised too.
Your code gets the text and then, for each character in the decoded
text, uses that character's offset as an index into the list you built
by overriding processTextPosition().
This fails after a while because "25" appears twice in the PDF, both
times at the exact same x/y position. You can see this by looking at the
page content stream with the PDFDebugger command line application;
you'll find this segment twice:
10.3477 0 0 10.4288 534.7 29.2994 Tm
(25) Tj
534.7 29.2994 is the position.
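
For reference, the six numbers before Tm are the text matrix operands
a b c d e f, and the last two (e and f) are the translation, which is
where the position above comes from. A minimal sketch that pulls them
out of such a line; TextMatrixDemo and translationOf are made-up names
for illustration, and the sketch ignores any transformation applied
outside this operator:

```java
// Illustration only: reads the translation (e, f) from an "a b c d e f Tm" line.
class TextMatrixDemo {

    /** Returns {e, f}, the translation part of a text matrix operator line. */
    static double[] translationOf(String tmLine) {
        String[] parts = tmLine.trim().split("\\s+");
        if (parts.length != 7 || !parts[6].equals("Tm")) {
            throw new IllegalArgumentException("expected: a b c d e f Tm");
        }
        // Operands 0..3 are a, b, c, d (scale/rotation); 4 and 5 are e and f.
        return new double[] {
            Double.parseDouble(parts[4]),
            Double.parseDouble(parts[5])
        };
    }

    public static void main(String[] args) {
        double[] pos = translationOf("10.3477 0 0 10.4288 534.7 29.2994 Tm");
        System.out.println("x=" + pos[0] + " y=" + pos[1]);
    }
}
```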
PDFBox text extraction detects this duplicate and includes it only once
in the result, so from that point on the offsets into the text no longer
match the indices in your list.
To prevent this from happening, use this call:
stripper.setSuppressDuplicateOverlappingText(false);
Of course, doing "only PDFTextStripper.processTextPosition" works too.
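
To illustrate why the offsets drift, here is a small self-contained
sketch. It is a simplified model and not PDFBox's actual duplicate
detection: Glyph is a hypothetical stand-in for TextPosition, the "A"
and "B" glyphs and the x position of "5" are made up, and duplicates are
matched by exact equality of character and position.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only; this is not how PDFBox implements duplicate detection.
class DuplicateOffsetDemo {

    /** Hypothetical stand-in for a TextPosition: one character and its origin. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Every glyph drawn on the page, as a collecting override would record them. */
    static List<Glyph> drawnGlyphs() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('A', 100.0, 29.2994)); // made-up neighbour
        glyphs.add(new Glyph('2', 534.7, 29.2994));
        glyphs.add(new Glyph('5', 540.5, 29.2994)); // x of '5' is assumed
        glyphs.add(new Glyph('2', 534.7, 29.2994)); // "25" drawn a second time
        glyphs.add(new Glyph('5', 540.5, 29.2994));
        glyphs.add(new Glyph('B', 560.0, 29.2994)); // made-up neighbour
        return glyphs;
    }

    /** Text with every glyph kept (models suppression switched off). */
    static String allText(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** Text with a glyph dropped if the same character was already seen at the same spot. */
    static String textWithoutDuplicates(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.ch == g.ch && k.x == g.x && k.y == g.y) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                sb.append(g.ch);
            }
        }
        return sb.toString();
    }

    /** First text offset whose character differs from the glyph at that index, or -1. */
    static int firstMismatch(String text, List<Glyph> glyphs) {
        for (int i = 0; i < text.length(); i++) {
            if (i >= glyphs.size() || glyphs.get(i).ch != text.charAt(i)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = drawnGlyphs();
        String suppressed = textWithoutDuplicates(glyphs);
        System.out.println(suppressed + " first mismatch at " + firstMismatch(suppressed, glyphs));
        String full = allText(glyphs);
        System.out.println(full + " first mismatch at " + firstMismatch(full, glyphs));
    }
}
```

In this model the offsets agree up to the dropped duplicate and are off
by two after it; with every glyph kept, text and list line up again.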
Tilman