[ 
https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292203#comment-14292203
 ] 

Tim Allison edited comment on TIKA-1489 at 1/26/15 7:17 PM:
------------------------------------------------------------

I haven't been able to find a standard in XMP or elsewhere.  Dublin Core's 
[[accessRights]] and [[rights]] are the closest I could find, but they aren't a 
good fit. 

Has anyone had any luck finding a standard?

I just opened MSWord to see what is available there with the current 
document format.  I don't have Information Rights Management (IRM) set up, so I 
can't see exactly what that offers, but it looks like MSWord has these options:
* Read only
* Restricted editing
** tracked changes
** comments
** filling in forms
** read only (yes, again)
* Restricted Access (this is what I can't experiment with)
** Edit permission
** Copy permission
** Print permission

LibreOffice's Writer appears to have:
* Read Only
* Record Changes

There are clearly some overlaps with the permissions allowed in PDF, but there 
are also some differences.  For most of Tika's use cases (I think), we'd want 
to set a general Tika Metadata key/value for "do not extract text" if both PDF 
fields were false or if the MSOffice Copy permission were false.  Does that 
sound right?

||Application||Permission Name||
|PDF|CanExtractContent|
|PDF|CanExtractForAccessibility|
|MSOffice|Copy permission|
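
To make the proposed rule concrete, here is a minimal sketch in plain Java. 
Everything in it is an assumption for discussion: the class name, the method 
names and the metadata key string are hypothetical and do not exist in Tika or 
PDFBox.  The boolean inputs stand in for whatever each parser would report for 
the three permissions in the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for discussion only; not part of Tika or PDFBox. */
final class ExtractionPolicy {

    // Hypothetical key name, invented for this sketch.
    static final String DO_NOT_EXTRACT_TEXT = "access_permission:do_not_extract_text";

    /** PDF: extraction is blocked only if both extraction flags are false. */
    static boolean pdfBlocksExtraction(boolean canExtractContent,
                                       boolean canExtractForAccessibility) {
        return !canExtractContent && !canExtractForAccessibility;
    }

    /** MSOffice: extraction is blocked when the Copy permission is denied. */
    static boolean msOfficeBlocksExtraction(boolean copyPermission) {
        return !copyPermission;
    }

    /** Builds the single general key/value pair that a parser would report. */
    static Map<String, String> toMetadata(boolean blocked) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DO_NOT_EXTRACT_TEXT, Boolean.toString(blocked));
        return metadata;
    }

    public static void main(String[] args) {
        // PDF with both extraction permissions denied.
        System.out.println(toMetadata(pdfBlocksExtraction(false, false)));
        // PDF that still allows extraction for accessibility.
        System.out.println(toMetadata(pdfBlocksExtraction(false, true)));
        // MSOffice document with Copy permission denied.
        System.out.println(toMetadata(msOfficeBlocksExtraction(false)));
    }
}
```

With this rule a PDF that denies CanExtractContent but allows 
CanExtractForAccessibility would not be flagged, which is one of the choices 
that needs agreement.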

Should we start with PDFBox's AccessPermission as a model and add where 
necessary from there?


> PDF Text extraction without permission
> --------------------------------------
>
>                 Key: TIKA-1489
>                 URL: https://issues.apache.org/jira/browse/TIKA-1489
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Tilman Hausherr
>
> In TIKA-1442, text extraction works on files like 717226.pdf that don't have 
> text extraction permission. The permissions in PDF files are only enforced 
> by the application (i.e. PDFBox); the text information isn't stored 
> separately in encrypted form. 
> The PDFBox ExtractText command line does throw an exception.
> So I wonder why TIKA is able to extract text. Either TIKA or the PDFBox call 
> it uses bypasses the permission checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
