Hi list,

I'm having issues with encrypted PDFs.

The PDF test cases pass, but extraction fails on my own encrypted PDF (sample
file: https://dl.dropboxusercontent.com/u/2460167/encryption.pdf , password
'testing123').

To rule out a problem with the PDF I tested with Xpdf, and pdftotext
extracts the text without issue. Unfortunately I need the metadata too.

$ pdftotext -opw testing123 encrypted.pdf

I'm running CentOS 6.6, and the Java packages installed are:
java-1.6.0-openjdk.x86_64                       1:1.6.0.33-1.13.5.1.el6_6
java-1.6.0-openjdk-devel.x86_64                 1:1.6.0.33-1.13.5.1.el6_6
java-1.7.0-openjdk.x86_64                       1:1.7.0.71-2.5.3.1.el6      @updates
java-1.7.0-openjdk-devel.x86_64                 1:1.7.0.71-2.5.3.1.el6      @updates


Some outputs:

$ java -jar tika-app-1.7-SNAPSHOT.jar --password=testing123 ~/sample.pdf
INFO - Document is encrypted
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:150)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:161)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:146)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:440)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:115)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:233)
        at javax.crypto.CipherInputStream.read(CipherInputStream.java:209)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:312)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:413)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decrypt(SecurityHandler.java:386)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptObject(SecurityHandler.java:361)
        at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.proceedDecryption(SecurityHandler.java:192)
        at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:158)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:134)
        ... 7 more
Caused by: javax.crypto.IllegalBlockSizeException: Input length must be multiple of 16 when decrypting with padded cipher
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:750)
        at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:676)
        at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:420)
        at javax.crypto.Cipher.doFinal(Cipher.java:1805)
        at javax.crypto.CipherInputStream.getMoreData(CipherInputStream.java:112)
        ... 19 more
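
To show the innermost exception in isolation, here is a minimal sketch that uses only the JDK's javax.crypto classes. It is not PDFBox's code. The class name BlockSizeDemo, the all-zero key and IV, and the choice of AES/CBC/PKCS5Padding are assumptions made purely for illustration. It only demonstrates that a padded AES cipher rejects input whose length is not a multiple of 16 bytes, which is what the trace above appears to report for one of the streams in my PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class BlockSizeDemo {
    // Demo-only key and IV (all zeros); hypothetical values, never use in practice.
    private static final byte[] KEY = new byte[16];
    private static final byte[] IV = new byte[16];

    // Builds an AES/CBC cipher with PKCS5 padding in the requested mode.
    static Cipher cipher(int mode) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(mode, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
        return c;
    }

    // Encrypts the plaintext; the result is always a multiple of 16 bytes.
    static byte[] encrypt(byte[] plain) throws Exception {
        return cipher(Cipher.ENCRYPT_MODE).doFinal(plain);
    }

    // Tries to decrypt and reports either the plaintext or the block-size error.
    static String tryDecrypt(byte[] data) throws Exception {
        try {
            byte[] plain = cipher(Cipher.DECRYPT_MODE).doFinal(data);
            return "ok: " + new String(plain, StandardCharsets.UTF_8);
        } catch (IllegalBlockSizeException e) {
            return "IllegalBlockSizeException: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] ct = encrypt("hello".getBytes(StandardCharsets.UTF_8));
        // Well-formed ciphertext decrypts normally.
        System.out.println(tryDecrypt(ct));
        // Dropping one byte leaves 15 bytes, which the padded cipher rejects.
        System.out.println(tryDecrypt(Arrays.copyOf(ct, ct.length - 1)));
    }
}
```

If this reading is right, the failure would come from the length of the data handed to the cipher rather than from a wrong password, but that is only my guess from the trace.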




I searched the PDFBox issue tracker and found
https://issues.apache.org/jira/browse/PDFBOX-2469 and
https://issues.apache.org/jira/browse/PDFBOX-2510, which in turn link to
related issues. According to the tickets, a number of these issues are fixed
in the 1.8.8 snapshot, provided you use the Non-Sequential Parser.

So I edited `tika-parsers/pom.xml` and set
<pdfbox.version>1.8.8-SNAPSHOT</pdfbox.version>. I also edited
`tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties`
to enable the non-sequential parser.

Now Tika won't build because of test errors in PDFParserTest (output below).
Reverting PDFParser.properties doesn't help; the build still fails.

Running org.apache.tika.parser.pdf.PDFParserTest
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 7 0
(origin offset 0)
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
 INFO [main] (PDFParser.java:259) - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (FlateFilter.java:107) - FlateFilter: stop reading corrupt
stream due to a DataFormatException
ERROR [main] (NonSequentialPDFParser.java:1998) - Can't find the object 0 0
(origin offset 0)
Tests run: 29, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.359 sec
<<< FAILURE!
...
Results :

Tests in error:
  testSequentialParser(org.apache.tika.parser.pdf.PDFParserTest): Non-Sequential Parser failed on test file /root/tika-trunk/tika-parsers/target/test-classes/test-documents/testPDF_protected.pdf
  testProtectedPDF(org.apache.tika.parser.pdf.PDFParserTest): Unable to extract PDF content


System info:
root@31 [~/tika-trunk]# java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (rhel-2.5.3.1.el6-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

root@31 [~/tika-trunk]# mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4;
2014-08-11T21:58:10+01:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-504.el6.x86_64", arch: "amd64", family:
"unix"

I tried originally with both Java 1.7 and Java 1.6. In the latest attempts
I've tested only with Java 1.7.

Can anyone advise please?

Thanks,
Peter
