[ 
https://issues.apache.org/jira/browse/TIKA-267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sascha Szott updated TIKA-267:
------------------------------

    Summary: encrypted pdf files aren't handled properly  (was: encrypted files 
aren't handled properly)

> encrypted pdf files aren't handled properly
> -------------------------------------------
>
>                 Key: TIKA-267
>                 URL: https://issues.apache.org/jira/browse/TIKA-267
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux 8.10, JRE 1.5
>            Reporter: Sascha Szott
>            Priority: Critical
>   Original Estimate: 0.08h
>  Remaining Estimate: 0.08h
>
> While extracting full text from a batch of PDF documents, I noticed odd 
> behaviour in Tika when it processes encrypted documents (documents that 
> restrict specific actions, e.g. editing or printing). To extract content 
> from an encrypted PDF document, you do not always have to decrypt it. For 
> instance, when creating an encrypted PDF document, the author can allow 
> content extraction without requiring a password. Unfortunately, Tika's PDF 
> parser is not aware of this at the moment. I therefore suggest a minor 
> change to the parse method in class org.apache.tika.parser.pdf.PDFParser: 
> add a check ("is copying allowed") before trying to decrypt the document.
> To be more precise, here is a code snippet:
> public void parse(...) throws ... {
>   PDDocument pdfDocument = PDDocument.load(stream);
>   try {
>     // Decrypt the document only if it is encrypted and copying is not allowed
>     if (pdfDocument.isEncrypted()
>         && !pdfDocument.getCurrentAccessPermission().canExtractContent()) {
>       try {
>         pdfDocument.decrypt("");
>       } catch (Exception e) {
>         // Ignore
>       }
>     }
>     ...
> Another solution to this problem would be to eliminate the "isEncrypted" 
> check since PDFBox seems to handle the extraction of content out of encrypted 
> documents correctly (and throws an IOException in case of failure).
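
The proposed check can also be looked at in isolation from PDFBox. The following is a minimal sketch, not the actual Tika or PDFBox code: `DecryptionPolicy`, `EncryptedDocument`, `needsDecryption` and `prepareForExtraction` are hypothetical names introduced only for illustration, and the interface merely stands in for the three `PDDocument` calls used in the snippet above.

```java
/**
 * Sketch only: models the "is copying allowed" check without depending on PDFBox.
 * All names in this class are hypothetical, not Tika or PDFBox API.
 */
final class DecryptionPolicy {

    /** Hypothetical stand-in for the PDDocument calls used in the quoted snippet. */
    interface EncryptedDocument {
        boolean isEncrypted();
        boolean canExtractContent();
        void decrypt(String password) throws Exception;
    }

    private DecryptionPolicy() {
    }

    /** Decryption is attempted only for encrypted documents that forbid extraction. */
    static boolean needsDecryption(boolean encrypted, boolean canExtractContent) {
        return encrypted && !canExtractContent;
    }

    /**
     * Tries the empty password when the policy asks for it.
     * Returns true if a decryption attempt was made, false otherwise.
     */
    static boolean prepareForExtraction(EncryptedDocument doc) {
        if (!needsDecryption(doc.isEncrypted(), doc.canExtractContent())) {
            return false;
        }
        try {
            doc.decrypt("");
        } catch (Exception e) {
            // Ignored, mirroring the quoted snippet: a failed attempt is not fatal here.
        }
        return true;
    }
}
```

With this policy, an encrypted document that permits content extraction is passed on untouched, and only an encrypted document that forbids extraction triggers the empty-password attempt.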

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
