Thanks for the suggestion. I'll let the list know tomorrow when I get a chance to test it.
Dave Patterson

On Wed, Apr 5, 2017 at 4:02 PM, Maruan Sahyoun <[email protected]> wrote:

> Hi,
>
> > On 05.04.2017 at 21:46, David Patterson <[email protected]> wrote:
> >
> > Hello,
> >
> > I’m trying to extract the text from a PDF that was saved from a Word
> > document.
> >
> > I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a
> > Windows machine.
> >
> > I’m using the following code to get the text:
> >
> > PDDocument pdDocument = PDDocument.load( pdfFile );
> > PDFTextStripper stripper = new PDFTextStripper();
> > String rawText = stripper.getText( pdDocument );
> > // end of code excerpt
> >
> > I’m running the same code on a collection of files. Most work as
> > expected. I can see the following in the text of the Table of Contents:
> >
> > 5.15.1 ADDENDA.......................................... 1
> > 5.15.2 YOU ARE HERE .................................... 2
> > 5.15.3 INTRODUCTION .................................... 4
> >
> > However, for two files, what I see is:
> >
> > 5.16 xxx SYSTEM PROCEDURES ............................. 1
> > ADDENDA................................................. 1 5.16.1
> > YOU ARE HERE ........................................... 2 5.16.2
> > INTRODUCTION ........................................... 4 5.16.3
> >
> > Note: the outline numbers (5.16.1, etc.) are at the end of the line,
> > not at the beginning.
> >
> > A) Is this a known, solvable problem?
> > B) If not, is there a different way I can try to extract the data?
> > C) If not, can I help debug/diagnose the problem? I cannot send the
> > offending PDF file out of my system.
>
> try PDFTextStripper.setSortByPosition(true);
> https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition(boolean)
>
> BR
> Maruan
>
> > Thanks
> >
> > Dave Patterson
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
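
For readers following the thread, here is a rough illustration of why sorting by position might help with the misplaced outline numbers. A PDF writer can emit text in an order that differs from the visual layout, and the stripper's default is to keep that content-stream order. The sketch below is not PDFBox code and does not reproduce its implementation; the `Fragment` class, the coordinates, and the method names are all hypothetical, chosen only to mimic one table-of-contents line in which the outline number is written last.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox type. */
    static final class Fragment {
        final String text;
        final double x; // assumed distance from the left edge of the page
        final double y; // assumed distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order given, i.e. the order they were written. */
    static String streamOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Joins fragments top to bottom, then left to right. */
    static String positionOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments); // leave the input untouched
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return streamOrder(sorted);
    }

    public static void main(String[] args) {
        // One table-of-contents line whose outline number was written last,
        // even though it sits at the far left of the line on the page.
        List<Fragment> line = new ArrayList<>();
        line.add(new Fragment("ADDENDA", 100, 50));
        line.add(new Fragment("1", 500, 50));
        line.add(new Fragment("5.16.1", 40, 50));

        System.out.println(streamOrder(line));   // ADDENDA 1 5.16.1
        System.out.println(positionOrder(line)); // 5.16.1 ADDENDA 1
    }
}
```

In the snippet from the original question, the suggestion would presumably amount to calling `stripper.setSortByPosition( true );` on the `PDFTextStripper` instance before `stripper.getText( pdDocument )`. Whether that fixes the two problem files has not yet been confirmed in this thread.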

