RE: Encrypted PDF issues & build issues

Allison, Timothy B. Mon, 15 Dec 2014 09:15:07 -0800

The upgrade has just been made in Tika trunk.  The integration required more 
than changing the one test… Sorry about that!

Let us know if there are any surprises with the upgrade.

From: Allison, Timothy B. [mailto:[email protected]]
Sent: Thursday, December 11, 2014 2:41 PM
To: [email protected]
Subject: RE: Encrypted PDF issues & build issues

Yes, sorry.  As you point out, that should be fixed in PDFBox 1.8.8.  A vote 
was just taken on that release, so it should be out very soon.  When I last 
looked at integrating PDFBox 1.8.8-SNAPSHOT, the upgrade required us to change 
one test (I think?) in Tika, which is why you’re getting a failed build.  Your 
error message is not what I was getting, but it was in that test.

In short, by early next week (I hope), Tika trunk will be good to go with 
PDFBox 1.8.8.

If you’d like the one or two lines of code to change to get Tika to build 
with 1.8.8-SNAPSHOT, let me know.

Best,

           Tim

From: Peter Bowyer [mailto:[email protected]]
Sent: Thursday, December 11, 2014 12:43 PM
To: [email protected]
Subject: Encrypted PDF issues & build issues

Hi list,

I'm having issues with encrypted PDFs.

The PDF test cases pass, but Tika fails on my own encrypted PDF. A sample file 
is at https://dl.dropboxusercontent.com/u/2460167/encryption.pdf and its 
password is 'testing123'.

To rule out a problem with the PDF I tested with Xpdf, and pdftotext extracts 
the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running on CentOS 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64                       1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64                 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64                       1:1.7.0.71-2.5.3.1.el6 @updates
java-1.7.0-openjdk-devel.x86_64                 1:1.7.0.71-2.5.3.1.el6 @updates

Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        ... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
        at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
        ... 19 more
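
For readers unfamiliar with the innermost exception in the trace above: the JDK's AES cipher, when used with padding, rejects any ciphertext whose length is not a whole number of 16-byte blocks. The following is only a minimal sketch of that behaviour using the plain `javax.crypto` API. It does not involve PDFBox or Tika, and it says nothing about why PDFBox handed the cipher such data for this file. The class name `BlockSizeDemo`, the all-zero key and IV, and the choice of CBC mode are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class BlockSizeDemo {
    // Hypothetical all-zero key and IV, for illustration only.
    private static final byte[] KEY = new byte[16];
    private static final byte[] IV = new byte[16];

    private static Cipher cipher(int mode) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(mode, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        return c;
    }

    /** Encrypts the plaintext; padded output is a whole number of 16-byte blocks. */
    static byte[] encrypt(byte[] plaintext) {
        try {
            return cipher(Cipher.ENCRYPT_MODE).doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns "ok" on success, or the simple class name of the cipher's exception. */
    static String decryptOutcome(byte[] ciphertext) {
        try {
            cipher(Cipher.DECRYPT_MODE).doFinal(ciphertext);
            return "ok";
        } catch (GeneralSecurityException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] whole = encrypt("hello".getBytes(StandardCharsets.UTF_8));
        // Dropping one byte leaves a length that is not a multiple of 16.
        byte[] truncated = Arrays.copyOf(whole, whole.length - 1);
        System.out.println(whole.length + " bytes: " + decryptOutcome(whole));
        System.out.println(truncated.length + " bytes: " + decryptOutcome(truncated));
    }
}
```

In this sketch the intact ciphertext decrypts, while the one with a byte removed fails with `IllegalBlockSizeException`, the same exception type that appears at the bottom of the trace.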

I searched the PDFBox issue tracker and found 
https://issues.apache.org/jira/browse/PDFBOX-2469 and 
https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to 
related issues. The ticket statuses say a number of these issues are fixed in 
the 1.8.8 snapshot, provided you run using the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set 
<pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edited 
`tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties`
 and enabled the non-sequential parser.

Now Tika won't build. I changed PDFParser.properties back and it still won't 
build.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0 (origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
 INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0 (origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec <<< FAILURE!
...
Results :

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Non-Sequential Parser failed on test file /root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
  testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content
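
A note on the `[Fatal Error] :1:1: Content is not allowed in prolog.` lines in the test output above: this is the message the JDK's built-in XML parser reports when its input does not begin like an XML document. One plausible reading, which the log itself does not confirm, is that some XML content inside the PDF reached an XML parser while still encrypted or otherwise corrupted. The sketch below only reproduces the parser behaviour with the standard `javax.xml.parsers` API; the class name `PrologDemo` and the sample inputs are made up for illustration, and nothing here calls Tika or PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class PrologDemo {
    /** Returns -1 if the bytes parse as XML, otherwise the line of the fatal parse error. */
    static int firstErrorLine(byte[] data) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(data));
            return -1;
        } catch (SAXParseException e) {
            // The default error handler usually also prints a "[Fatal Error]" line to stderr.
            return e.getLineNumber();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wellFormed = "<a/>".getBytes(StandardCharsets.UTF_8);
        byte[] notXml = "not xml".getBytes(StandardCharsets.UTF_8);
        System.out.println("well-formed input: " + firstErrorLine(wellFormed));
        System.out.println("non-XML input, error line: " + firstErrorLine(notXml));
    }
}
```

The well-formed input parses cleanly, while the non-XML input fails on its first line, matching the `:1:1:` position in the log.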

System info:
root@31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

root@31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family: "unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts I've 
tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter
