Hi,

On 12.08.2011 21:03, Harper, Brad wrote:
Hello:

Here's a brand new PDFBox user with a problem

    Aug 12, 2011 11:57:52 AM org.apache.pdfbox.filter.FlateFilter decode
    SEVERE: Stop reading corrupt stream

I've found these [possibly] related issues

    https://issues.apache.org/jira/browse/PDFBOX-872   [resolved]
    https://issues.apache.org/jira/browse/PDFBOX-697   [unresolved], mentioned as a possible duplicate of 872

I'm using PDFBox version 1.6 / Windows XP/ Java 7. Two PDF docs in
question are 1.4. One was created by the PDFComplete plugin to Windows
Word 2010. The other was created by OpenOffice 3.x Write from the
original Word .docx file.

Comments in the issues [above] seem related to encryption/decryption,
but the docs have not been encrypted [unless these producing tools do so
implicitly]. The files can both be viewed in Adobe Acrobat Reader and
don't require a password.
It is easy to check whether a document is encrypted: just call PDDocument#isEncrypted.

The code in question looks like

    String result = null;

    try ( FileInputStream fis = new FileInputStream( file ); ) {
        PDFParser parser = new PDFParser( fis );
        parser.parse();
        COSDocument cd = parser.getDocument();
        PDDocument  pd = new PDDocument( cd );
        cd.close();
        PDFTextStripper stripper = new PDFTextStripper();
        result = stripper.getText( pd );
        pd.close();
    }
    catch ( FileNotFoundException ex ) { ...
    }
    catch ( IOException ex ) {...
    }



Any hints or suggestions on how to proceed?
You should use something like this to extract the text:

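PDFTextStripper stripper = new PDFTextStripper();
PDDocument  pd = PDDocument.load("example.pdf");
Writer output = new OutputStreamWriter( System.out );
stripper.writeText( pd, output );  // pass the document loaded above
output.flush();                    // make sure the buffered text reaches System.out
pd.close();                        // close the document only after extraction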

Have a look at [1] for further details.

If your problem still persists, create an issue on JIRA [2] and attach a sample pdf if possible.
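
As background on the log message itself: FlateFilter handles FlateDecode streams, which hold zlib-compressed data. The following is a minimal sketch of how truncated or invalid zlib data fails to inflate. It uses only java.util.zip and is not PDFBox's own code. The class name FlateCheck and its two helper methods are made up for illustration, and the sketch assumes the data has already been taken out of the PDF.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of PDFBox.
public class FlateCheck {

    // Compress bytes with zlib, the format used by FlateDecode streams.
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns true only if the data inflates cleanly up to the end of the stream.
    public static boolean inflatesCleanly(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false; // truncated: input ran out before the end of the stream
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // corrupt: the bytes are not valid zlib data
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] good = deflate("BT /F1 12 Tf (Hello) Tj ET".getBytes());
        byte[] truncated = Arrays.copyOf(good, good.length / 2);
        byte[] garbage = {0x00, 0x01, 0x02, 0x03};
        System.out.println(inflatesCleanly(good));
        System.out.println(inflatesCleanly(truncated));
        System.out.println(inflatesCleanly(garbage));
    }
}
```

Only the intact stream should pass the check. A message like the one above therefore suggests that the bytes reaching the decoder were incomplete or altered, which is why a sample PDF helps when filing an issue.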

Thanks.

BR
Andreas Lehmkühler

[1] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java
[2] https://issues.apache.org/jira/browse/PDFBOX
