Hi,

> On 05.04.2017 at 21:46, David Patterson <[email protected]> wrote:
>
> Hello,
>
> I'm trying to extract the text from a PDF that was saved from a Word document.
>
> I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a Windows machine.
>
> I'm using the following code to get the text:
>
>     PDDocument pdDocument = PDDocument.load( pdfFile );
>     PDFTextStripper stripper = new PDFTextStripper();
>     String rawText = stripper.getText( pdDocument );
>     // end of code excerpt
>
> I'm running the same code on a collection of files. Most work as expected.
> I can see the following in the text of the Table of Contents:
>
>     5.15.1 ADDENDA...................................................................................... 1
>     5.15.2 YOU ARE HERE ............................................................................ 2
>     5.15.3 INTRODUCTION ............................................................................ 4
>
> However, for two files, what I see is:
>
>     5.16 xxx SYSTEM PROCEDURES ............................................................ 1
>     ADDENDA............................................................................................... 1 5.16.1
>     YOU ARE HERE ...................................................................................... 2 5.16.2
>     INTRODUCTION ....................................................................................... 4 5.16.3
>
> Note: the outline numbers (5.16.1, etc.) are at the end of the line, not at the beginning.
>
> A) Is this a known, solvable problem?
> B) If not, is there a different way I can try to extract the data?
> C) If not, can I help debug/diagnose the problem? I cannot send the offending PDF file out of my system.

Try calling setSortByPosition(true) on your PDFTextStripper instance before you call getText(). By default the stripper does not sort the text by position:

https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition(boolean)

BR
Maruan

> Thanks
>
> Dave Patterson
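
For readers wondering why sorting by position could help here: text in a PDF content stream is not necessarily stored in visual reading order, so a writer may emit the heading and page number first and the outline number last. The sketch below is only a toy model of that idea, not PDFBox's actual algorithm. The `Fragment` class, the coordinates, and the sample data are all hypothetical, chosen to mirror the symptom in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: this is NOT how PDFBox implements setSortByPosition.
class SortByPositionSketch {

    // Hypothetical text fragment: its text and where it starts on the page.
    // For simplicity, y grows downward and fragments on one line share the same y.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed content-stream order: heading, page number, then outline number.
    static List<Fragment> sample() {
        return Arrays.asList(
            new Fragment("ADDENDA", 100, 10),
            new Fragment("1", 500, 10),
            new Fragment("5.16.1", 50, 10),
            new Fragment("YOU ARE HERE", 100, 20),
            new Fragment("2", 500, 20),
            new Fragment("5.16.2", 50, 20));
    }

    // Joins fragments in the given order, starting a new line whenever y changes.
    static String render(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        double lastY = 0;
        for (Fragment f : fragments) {
            if (!first) {
                sb.append(f.y == lastY ? " " : "\n");
            }
            sb.append(f.text);
            lastY = f.y;
            first = false;
        }
        return sb.toString();
    }

    // Returns a copy ordered top-to-bottom, then left-to-right.
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                              .thenComparingDouble(f -> f.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<Fragment> stream = sample();
        System.out.println(render(stream));                 // outline numbers trail each line
        System.out.println(render(sortByPosition(stream))); // outline numbers lead each line
    }
}
```

In this model, rendering in stream order puts "5.16.1" after the page number, while rendering the position-sorted list puts it back in front of "ADDENDA". Whether the two problem files really store their text this way is an assumption; trying setSortByPosition(true) on them is the quickest check.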

