Upgrade just made in Tika trunk. The integration required more than changing the one test… Sorry about that!
Let us know if there are any surprises with the upgrade.

From: Allison, Timothy B. [mailto:[email protected]]
Sent: Thursday, December 11, 2014 2:41 PM
To: [email protected]
Subject: RE: Encrypted PDF issues & build issues

Y, sorry. As you point out, that should be fixed in PDFBox 1.8.8. A vote was just taken for that, so that will be out very soon.

Last I looked at integrating PDFBox 1.8.8-SNAPSHOT, the upgrade requires us to change one test (I think?) in Tika… which is why you’re getting a failed build. Your error message is not what I was getting, but it was in that test.

In short… by early next week (I hope), Tika trunk will be good to go with PDFBox 1.8.8. If you’d like the one or two lines of code to change to get Tika to build with 1.8.8-SNAPSHOT, let me know.

Best,

Tim

From: Peter Bowyer [mailto:[email protected]]
Sent: Thursday, December 11, 2014 12:43 PM
To: [email protected]
Subject: Encrypted PDF issues & build issues

Hi list,

I'm having issues with encrypted PDFs. The PDF test cases pass, but Tika fails on my own encrypted PDF (sample file at https://dl.dropboxusercontent.com/u/2460167/encryption.pdf ; its password is 'testing123').

To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on Centos 6.6, and the Java packages installed are:

java-1.6.0-openjdk.x86_64 1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64 1:1.7.0.71-2.5.3.1.el6 @updates
java-1.7.0-openjdk-devel.x86_64 1:1.7.0.71-2.5.3.1.el6 @updates

Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
    at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
    at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
    at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
    at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
    at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
    ... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
    at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
    at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
    at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
    at javax.crypto.Cipher.doFinal(Cipher.java:1805)
    at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
    ... 19 more

I searched the pdfbox issue tracker and found https://issues.apache.org/jira/browse/PDFBOX-2469 and https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to related issues. The ticket status says a number of these issues are fixed in the 1.8.8 snapshot when you run using the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set <pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edited `tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties` and enabled the non-sequential parser.

Now Tika won't build. I changed PDFParser.properties back and it won't build either.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec <<< FAILURE!
...

Results :

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Non-Sequential Parser failed on test file /root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
  testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content

System info:

root@31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

root@31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family: "unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts I've tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter
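
For anyone reading the innermost "Caused by" in the stack trace earlier in this thread: the sketch below shows, using only the JDK, the condition under which a padded AES cipher raises IllegalBlockSizeException. It does not reproduce the PDFBox issue itself and says nothing about why PDFBox handed the cipher a stream of the wrong length. The class name BlockSizeDemo, the method name, the all-zero key and IV, and the sample plaintext are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class; not part of Tika or PDFBox.
class BlockSizeDemo {

    /**
     * Encrypts a short message with AES/CBC/PKCS5Padding, drops the given
     * number of trailing ciphertext bytes, and tries to decrypt the result.
     * Intended for bytesToDrop between 0 and 15.
     * Returns true if decryption fails with IllegalBlockSizeException.
     */
    public static boolean failsWithBlockSizeError(int bytesToDrop) {
        try {
            // All-zero key and IV: acceptable for a demo only.
            SecretKeySpec key = new SecretKeySpec(new byte[16], "AES");
            IvParameterSpec iv = new IvParameterSpec(new byte[16]);

            Cipher enc = Cipher.getInstance("AES/CBC/PKCS5Padding");
            enc.init(Cipher.ENCRYPT_MODE, key, iv);
            byte[] cipherText =
                enc.doFinal("hello, encrypted stream".getBytes(StandardCharsets.UTF_8));

            // Simulate a stream whose length is no longer a whole number of blocks.
            byte[] input = Arrays.copyOf(cipherText, cipherText.length - bytesToDrop);

            Cipher dec = Cipher.getInstance("AES/CBC/PKCS5Padding");
            dec.init(Cipher.DECRYPT_MODE, key, iv);
            try {
                dec.doFinal(input);
                return false; // decrypted cleanly
            } catch (IllegalBlockSizeException e) {
                return true;  // input length was not a multiple of the block size
            }
        } catch (GeneralSecurityException e) {
            // Any other crypto failure is unexpected in this demo.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("intact ciphertext fails:    " + failsWithBlockSizeError(0));
        System.out.println("truncated ciphertext fails: " + failsWithBlockSizeError(1));
    }
}
```

The point of the sketch is only that the exception is about the length of the data reaching the cipher, not about the password being wrong.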
