Thank you for your reply. I had forgotten to mention that I also need the spacing information for the words, which is why I used both PDFTextStripper.processTextPosition and stripper.getText.
However, I have now modified my source to call setSuppressDuplicateOverlappingText, and with that change the text is retrieved correctly. Thank you.

2017-08-29 0:26 GMT+09:00 Tilman Hausherr <[email protected]>:
> On 28.08.2017 at 09:36, 二川村田 wrote:
>>
>> Hello.
>>
>> I noticed the cause.
>>
>> The difference of the characters order that retrieved by
>> PDFTextStripper.processTextPosition and stripper.getText is that.
>
> Hi,
>
> it's more complex. I ran your code and was surprised too.
>
> What your code does is to get the text, then for each character in the
> decoded text use its offset to access the list you got by overriding
> processTextPosition().
>
> This failed after some time because "25" appears twice in the PDF but at the
> exact same x/y position. You can see this by looking at the page content
> stream with PDFDebugger command line application, you'll find this segment
> twice:
>
> 10.3477 0 0 10.4288 534.7 29.2994 Tm
> (25) Tj
>
> 534.7 29.2994 is the position.
>
> PDFBox text extraction detects this duplicate and has it only once in the
> result.
>
> To prevent this from happening, use this call:
>
> stripper.setSuppressDuplicateOverlappingText(false);
>
> of course, doing "only PDFTextStripper.processTextPosition" works too.
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
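For anyone finding this thread later, here is a minimal plain-Java sketch of why the offsets drift. It does not use PDFBox at all: Glyph, suppressDuplicates, textOf, firstMismatch and sample are hypothetical names made up for this illustration, the exact-equality duplicate check is a deliberate simplification and not PDFBox's real algorithm, and the x coordinate of the "5" glyph is invented. Only the 534.7 29.2994 start position comes from the content stream quoted above.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateOffsetDemo {

    /** Hypothetical stand-in for one recorded text position (not a PDFBox class). */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Simplified model: drop a glyph if the same character was already kept at the same x/y. */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.ch == g.ch && k.x == g.x && k.y == g.y) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    /** Concatenates the characters of the glyphs, in list order. */
    static String textOf(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** First index where text and the recorded list disagree, or -1 if they line up. */
    static int firstMismatch(String text, List<Glyph> recorded) {
        for (int i = 0; i < text.length(); i++) {
            if (i >= recorded.size() || recorded.get(i).ch != text.charAt(i)) {
                return i;
            }
        }
        return -1;
    }

    /** "25" drawn twice at the same position, followed by unrelated text "AB". */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        for (int pass = 0; pass < 2; pass++) {
            glyphs.add(new Glyph('2', 534.7, 29.2994));
            glyphs.add(new Glyph('5', 540.0, 29.2994)); // x of '5' is made up
        }
        glyphs.add(new Glyph('A', 10.0, 50.0));
        glyphs.add(new Glyph('B', 16.0, 50.0));
        return glyphs;
    }

    public static void main(String[] args) {
        List<Glyph> recorded = sample(); // every glyph, duplicates included

        // Suppression on: the text has the duplicate only once, so offsets drift.
        String suppressed = textOf(suppressDuplicates(recorded));
        System.out.println(firstMismatch(suppressed, recorded)); // prints 2

        // Suppression off: text and recorded list stay aligned.
        String unsuppressed = textOf(recorded);
        System.out.println(firstMismatch(unsuppressed, recorded)); // prints -1
    }
}
```

In this model the recorded list holds "2525AB" while the suppressed text is "25AB", so indexing the list by text offset goes wrong from offset 2 onward. With suppression turned off, both sides contain the duplicate and the offsets line up again.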

