Hi list,

I'm having issues with encrypted PDFs.

The PDF test cases pass, but parsing fails on my own encrypted PDF. A sample file is at https://dl.dropboxusercontent.com/u/2460167/encryption.pdf (password: 'testing123').

To rule out a problem with the PDF itself, I tested with Xpdf, and pdftotext extracts the text without issue. Unfortunately I need the metadata too.

    $ pdftotext -opw testing123 encrypted.pdf

I'm running on CentOS 6.6, and the Java packages installed are:

    java-1.6.0-openjdk.x86_64        1:1.6.0.33-1.13.5.1.el6_6
    java-1.6.0-openjdk-devel.x86_64  1:1.6.0.33-1.13.5.1.el6_6
    java-1.7.0-openjdk.x86_64        1:1.7.0.71-2.5.3.1.el6     @updates
    java-1.7.0-openjdk-devel.x86_64  1:1.7.0.71-2.5.3.1.el6     @updates

Some outputs:

    $ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
    INFO - Document is encrypted
    Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
    Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        ... 7 more
    Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
        at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
        ... 19 more

I searched the PDFBox issue tracker and found https://issues.apache.org/jira/browse/PDFBOX-2469 and https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to related issues. According to the tickets, a number of these issues are fixed in the 1.8.8 snapshot when running with the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set <pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edited `tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties` and enabled the non-sequential parser.

Now Tika won't build. I changed PDFParser.properties back and it still won't build.
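
For reference, the root exception in the trace above can be reproduced outside Tika and PDFBox. The following is only a minimal sketch using the standard javax.crypto API, not PDFBox's actual decryption code: the class name BlockSizeDemo, its methods, and the all-zero key and IV are hypothetical and chosen purely for illustration. It shows that a padded AES cipher rejects input whose length is not a multiple of the 16-byte block size, which is consistent with the decrypter being handed a stream of the wrong length.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class; not part of Tika or PDFBox.
final class BlockSizeDemo {
    // Fixed all-zero key and IV, for illustration only (never do this for real data).
    private static final byte[] KEY = new byte[16];
    private static final byte[] IV = new byte[16];

    // Builds an AES/CBC cipher with PKCS5 padding in the requested mode.
    private static Cipher cipher(int mode) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(mode, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        return c;
    }

    // Encrypts the plaintext; the result is always a multiple of 16 bytes long.
    static byte[] encrypt(byte[] plain) {
        try {
            return cipher(Cipher.ENCRYPT_MODE).doFinal(plain);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if decryption succeeds, false if the cipher
    // rejects the input length with IllegalBlockSizeException.
    static boolean decrypts(byte[] cipherText) {
        try {
            cipher(Cipher.DECRYPT_MODE).doFinal(cipherText);
            return true;
        } catch (IllegalBlockSizeException e) {
            return false;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] ct = encrypt("hello".getBytes(StandardCharsets.UTF_8));
        byte[] truncated = Arrays.copyOf(ct, ct.length - 1); // drop the last byte
        System.out.println("full ciphertext decrypts: " + decrypts(ct));
        System.out.println("truncated ciphertext decrypts: " + decrypts(truncated));
    }
}
```

This does not say why PDFBox ends up with such a stream for this file; it only narrows the symptom to the length of the data reaching the cipher.
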

    Running org.apache.tika.parser.pdf.PDFParserTest
    ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
    ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
    ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
    ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
    ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
    INFO [main] (PDFParser.java:259) - Document is encrypted
    [Fatal Error] :1:1: Content is not allowed in prolog.
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
    [Fatal Error] :1:1: Content is not allowed in prolog.
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
    ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
    Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec <<< FAILURE!
    ...
    Results :

    Tests in error:
      testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Non-Sequential Parser failed on test file /root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
      testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content

System info:

    root@31 [~/tika-trunk]# java -version
    java version "1.7.0_71"
    OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
    OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

    root@31 [~/tika-trunk]# mvn -version
    Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T21:58:10+01:00)
    Maven home: /usr/share/apache-maven
    Java version: 1.7.0_71, vendor: Oracle Corporation
    Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
    Default locale: en_GB, platform encoding: UTF-8
    OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family: "unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts I've tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter
