PDF encryption and access permissions are tricky (see, e.g., the discussion and 
links here: https://issues.apache.org/jira/browse/TIKA-1489 ).  There are 
potentially two passwords for a PDF document: the owner password and the user 
password.  Often, the user password is set to the empty string.  This still 
lets only the owner modify the document, but can effectively give "read" 
access to any user.

Aside from encryption, but related to it, a PDF file has various 
AccessPermissions.  Among other permissions, an owner can specify whether or 
not text may be extracted and/or whether or not text may be extracted for 
accessibility.  As of Tika 1.8, you can have Tika respect these permissions by 
sending in an AccessChecker via the ParseContext.
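
To make the two extraction permissions concrete, here is a rough, 
stdlib-only sketch of the decision a permission check has to make.  This is 
not Tika's or PDFBox's code: the class and method names are hypothetical, and 
I'm assuming the 1-based bit positions from the PDF spec's permission flags 
(bit 5 = copy/extract content, bit 10 = extract for accessibility).

```java
// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class PdfPermissionSketch {

    // Assumed 1-based bit positions in the PDF permission flags (/P entry).
    static final int EXTRACT_BIT = 5;                    // copy/extract text and graphics
    static final int EXTRACT_FOR_ACCESSIBILITY_BIT = 10; // extract for accessibility

    private PdfPermissionSketch() {}

    /** Returns true if the given 1-based bit is set in the permission flags. */
    static boolean isSet(int flags, int bit) {
        return (flags & (1 << (bit - 1))) != 0;
    }

    /**
     * Decides whether text extraction should proceed.
     *
     * @param flags              the raw permission flags of the document
     * @param allowAccessibility true if the caller treats its extraction as
     *                           being for accessibility purposes
     */
    static boolean mayExtract(int flags, boolean allowAccessibility) {
        if (isSet(flags, EXTRACT_BIT)) {
            return true; // general extraction is permitted
        }
        // General extraction is denied; fall back to the accessibility
        // permission only if the caller opted in to it.
        return allowAccessibility && isSet(flags, EXTRACT_FOR_ACCESSIBILITY_BIT);
    }
}
```

For example, with flags where only bit 10 is set (512), 
PdfPermissionSketch.mayExtract(512, false) is false, while 
PdfPermissionSketch.mayExtract(512, true) is true.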


1) What is the preferred way to extract text from a 
pdf("-that-can-be-read-in-AcrobatReader")? 

If you only want text from the PDF document itself (not attachments/embedded 
documents) and you are only parsing PDFs, then it might make sense to use pure 
PDFBox.  <unconfirmed> I haven't checked recently, but I _think_ that Tika may 
be pulling out some text from annotations or maybe AcroFields that 
PDFTextStripper isn't </unconfirmed>.  I can look into this if it matters to 
you.  Tika also extracts normalized metadata and does a bit more with metadata 
than if you were using the PDFTextStripper.

2) Does the second approach possibly return more than text? Blobs? Binary data?

The second approach will leverage the full power of Tika to extract content 
from embedded documents/attachments.  The first approach will only extract text 
from the outer PDF document.  You can extract binary data (embedded images or 
other embedded files) in Tika by sending in an EmbeddedDocumentExtractor via 
the ParseContext instead of the Parser.class.


