Yes, sorry. As you point out, that should be fixed in PDFBox 1.8.8. A vote was
just taken on that release, so it should be out very soon. When I last looked
at integrating PDFBox 1.8.8-SNAPSHOT, the upgrade required us to change one
test (I think?) in Tika, which is why you're getting a failed build. Your error
message is not what I was getting, but it was in that test.
In short…by early next week (I hope), Tika trunk will be good to go with PDFBox
1.8.8.
If you’d like the one or two lines of code to change to get Tika to build
with 1.8.8-SNAPSHOT, let me know.
Best,
Tim
From: Peter Bowyer
Sent: Thursday, December 11, 2014 12:43 PM
To: [email protected]
Subject: Encrypted PDF issues & build issues
Hi list,
I'm having issues with encrypted PDFs.
The PDF test cases pass, but extraction fails on my own encrypted PDF (sample
file at https://dl.dropboxusercontent.com/u/2460167/encryption.pdf ; its
password is 'testing123').
To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts
the text without issue. Unfortunately I need the metadata too.
$ pdftotext -opw testing123 encrypted.pdf
I'm running on Centos 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64 1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64 1:1.7.0.71-2.5.3.1.el6 @updates
java-1.7.0-openjdk-devel.x86_64 1:1.7.0.71-2.5.3.1.el6 @updates
Some outputs:
$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
at javax.crypto.Cipher.doFinal(Cipher.java:1805)
at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
... 19 more
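For context, the innermost exception in the trace above comes from the JDK's AES cipher, not from Tika: a padded block cipher refuses any ciphertext whose length is not a whole number of 16-byte blocks. The following is only a minimal sketch of that JDK behaviour using javax.crypto directly. It is not PDFBox code and does not explain why PDFBox hands the cipher data of the wrong length. The class name BlockSizeDemo, the method decryptAfterDropping, and the choice of AES/CBC/PKCS5Padding with a zero IV are assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Hypothetical demo class; not part of Tika or PDFBox.
public class BlockSizeDemo {

    // Encrypts a short message, drops `dropped` bytes from the end of the
    // ciphertext, and tries to decrypt what is left. Returns "ok" on success
    // or the name of the exception the JDK raised.
    public static String decryptAfterDropping(int dropped) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        SecretKey key = gen.generateKey();
        // All-zero IV: acceptable for a demo only, never for real data.
        IvParameterSpec iv = new IvParameterSpec(new byte[16]);

        Cipher enc = Cipher.getInstance("AES/CBC/PKCS5Padding");
        enc.init(Cipher.ENCRYPT_MODE, key, iv);
        // "hello" is padded up to one full 16-byte block.
        byte[] ciphertext = enc.doFinal("hello".getBytes(StandardCharsets.UTF_8));

        byte[] damaged = Arrays.copyOf(ciphertext, ciphertext.length - dropped);

        Cipher dec = Cipher.getInstance("AES/CBC/PKCS5Padding");
        dec.init(Cipher.DECRYPT_MODE, key, iv);
        try {
            dec.doFinal(damaged);
            return "ok";
        } catch (IllegalBlockSizeException e) {
            // Input length was not a multiple of the 16-byte AES block size.
            return "IllegalBlockSizeException";
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(decryptAfterDropping(0)); // intact ciphertext
        System.out.println(decryptAfterDropping(1)); // one byte short of a block
    }
}
```

If this reading is right, the message means that an encrypted stream reached the cipher with a byte count that is not a multiple of 16, which would be consistent with a stream-length handling bug on the PDFBox side.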
I searched the pdfbox issue tracker and found
https://issues.apache.org/jira/browse/PDFBOX-2469 and
https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to
related issues. The tickets say a number of these issues are fixed in
the 1.8.8 snapshot if you run using the Non-Sequential Parser.
So I edited `tika-parsers/pom.xml` and set
<pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edited
`tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties`
and enabled the non-sequential parser.
Now Tika won't build. If I change PDFParser.properties back, it still won't
build.
Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream
due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec <<<
FAILURE!
...
Results :
Tests in error:
testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest):
Non-Sequential Parser failed on test file
/root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract
PDF content
System info:
root@31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
root@31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4;
2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family:
"unix"
I originally tried with both Java 1.7 and Java 1.6. In the latest attempts I've
tested only with Java 1.7.
Can anyone advise please?
Thanks,
Peter