Thank you for your reply.

I had forgotten that I also need the space information between words,
so I used both PDFTextStripper.processTextPosition and stripper.getText.

I have now modified my source to call setSuppressDuplicateOverlappingText(false).

With that change, I was able to retrieve the text correctly.
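
To illustrate the problem described in the quoted reply below, here is a minimal, self-contained sketch. It does not use PDFBox: Glyph, extract and firstMismatch are hypothetical names, the second x coordinate (540.0) is invented for the example, and the duplicate check is only a simplified stand-in for what PDFBox really does. It shows how indexing a per-glyph list with offsets taken from de-duplicated text goes wrong once a duplicate has been dropped.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateOffsetSketch {

    /** Hypothetical stand-in for a positioned glyph; this is not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Concatenates the glyph texts. When suppressDuplicates is true, a glyph
     * with the same text at the exact same x/y as an earlier one is skipped
     * (a simplified model of duplicate suppression).
     */
    static String extract(List<Glyph> glyphs, boolean suppressDuplicates) {
        Set<String> seen = new HashSet<>();
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            String key = g.text + "@" + g.x + "," + g.y;
            if (suppressDuplicates && !seen.add(key)) {
                continue; // same text at the same position: drop it
            }
            sb.append(g.text);
        }
        return sb.toString();
    }

    /**
     * Returns the first offset in text whose character differs from the glyph
     * stored at the same index in the list, or -1 if every offset matches.
     */
    static int firstMismatch(String text, List<Glyph> glyphs) {
        for (int i = 0; i < text.length(); i++) {
            if (i >= glyphs.size()
                    || !glyphs.get(i).text.equals(String.valueOf(text.charAt(i)))) {
                return i;
            }
        }
        return -1;
    }

    /** "25" drawn twice at the same position, followed by one more glyph. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("2", 534.7, 29.2994));
        glyphs.add(new Glyph("5", 540.0, 29.2994));
        glyphs.add(new Glyph("2", 534.7, 29.2994)); // duplicate of the first "2"
        glyphs.add(new Glyph("5", 540.0, 29.2994)); // duplicate of the first "5"
        glyphs.add(new Glyph("A", 100.0, 50.0));
        return glyphs;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sample();
        String suppressed = extract(glyphs, true);
        String full = extract(glyphs, false);
        System.out.println(suppressed + " " + firstMismatch(suppressed, glyphs)); // 25A 2
        System.out.println(full + " " + firstMismatch(full, glyphs));             // 2525A -1
    }
}
```

With suppression on, the text is one "25" shorter than the glyph list, so offset 2 ("A") points at the duplicated "2". With suppression off, the text and the list stay aligned.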

Thank you.


2017-08-29 0:26 GMT+09:00 Tilman Hausherr <[email protected]>:
> On 28.08.2017 at 09:36, 二川村田 wrote:
>>
>> Hello.
>>
>> I noticed the cause.
>>
>> It is the difference in character order between what
>> PDFTextStripper.processTextPosition and stripper.getText retrieve.
>
>
>
> Hi,
>
> it's more complex. I ran your code and was surprised too.
>
> What your code does is to get the text, then for each character in the
> decoded text use its offset to access the list you got by overriding
> processTextPosition().
>
> This failed after some time because "25" appears twice in the PDF, both times
> at the exact same x/y position. You can see this by looking at the page content
> stream with the PDFDebugger command-line application; you'll find this segment
> twice:
>
>   10.3477 0 0 10.4288 534.7 29.2994 Tm
>   (25) Tj
>
> 534.7 29.2994 is the position.
>
> PDFBox text extraction detects this duplicate and has it only once in the
> result.
>
> To prevent this from happening, use this call:
>
>     stripper.setSuppressDuplicateOverlappingText(false);
>
> Of course, using only PDFTextStripper.processTextPosition works too.
>
> Tilman
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
