Parsing read-only PDFs

Ian Rogers Sat, 02 Apr 2016 08:37:57 -0700

Hi,

I am using PDFBox 1.8.2 because that is the version in the NuGet package
available for .NET.


My question is this: I am reading the PDF files with the following
code:
            PDDocument pdDoc = PDDocument.load(path_to_file);
            java.util.List allPages = pdDoc.getDocumentCatalog().getAllPages();
            PDPage firstPage = (PDPage)allPages.get(0);
            PDStream contents = firstPage.getContents();
            COSStream content = contents.getStream();
            Debug.WriteLine(content.getStreamTokens());

This works great until the PDF has password security that does not allow
modifying the contents but does allow freely reading and copying them. In
that case I get an IO exception with the following stack trace:

   at org.apache.pdfbox.cos.COSStream.doDecode(COSName , Int32 )
   at org.apache.pdfbox.cos.COSStream.doDecode()
   at org.apache.pdfbox.cos.COSStream.getUnfilteredStream()
   at org.apache.pdfbox.pdfparser.PDFStreamParser..ctor(COSStream stream)
   at org.apache.pdfbox.cos.COSStream.getStreamTokens()

I also tried the PDFTextStripper utility, and it seems to parse PDF
documents fine both with and without the above-mentioned password security.
I looked through the 1.8.10 source to compare it with what I am doing, but
I can't see where I am going wrong.

Any help or pointers would be much appreciated.

Thanks,
Ian
