PDF encryption and access permissions are tricky (see, e.g., the discussion and
links here: https://issues.apache.org/jira/browse/TIKA-1489 ). A PDF document
can have up to two passwords: the owner password and the user password. Often,
the user password is set to the empty string. The owner password is then still
needed to modify the document, but anyone can open it, which effectively gives
"read" access to the user.

Aside from encryption, but related to it, a PDF file has various
AccessPermissions. Among other permissions, an owner can specify whether or
not text may be extracted and/or whether or not text may be extracted for
accessibility. As of Tika 1.8, you can have Tika respect these permissions by
sending in an AccessChecker via the ParseContext.
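
To make the two extraction permissions concrete, here is a minimal sketch in
plain Java of the kind of decision an access check has to make. It does not
call Tika or PDFBox: the class and method names are made up for illustration,
and the bit positions (5 for extraction, 10 for extraction for accessibility)
are the ones I understand the PDF specification to use for the permission
flags, so treat them as an assumption rather than as Tika's implementation.

```java
/** Hypothetical helper: models the permission decision only, no Tika/PDFBox calls. */
final class PdfPermissionBits {

    // Assumed bit positions, 1-based (bit 1 is the lowest-order bit).
    static final int EXTRACT_BIT = 5;                    // copy/extract text and graphics
    static final int EXTRACT_FOR_ACCESSIBILITY_BIT = 10; // extract for accessibility

    /** Returns true if the given 1-based bit is set in the permission flags. */
    static boolean isBitSet(int permissions, int bit) {
        return (permissions & (1 << (bit - 1))) != 0;
    }

    /**
     * Extraction is allowed if the general extract bit is set, or if the caller
     * opts in to accessibility extraction and the accessibility bit is set.
     */
    static boolean mayExtractText(int permissions, boolean allowExtractionForAccessibility) {
        if (isBitSet(permissions, EXTRACT_BIT)) {
            return true;
        }
        return allowExtractionForAccessibility
                && isBitSet(permissions, EXTRACT_FOR_ACCESSIBILITY_BIT);
    }

    public static void main(String[] args) {
        // Flags with only the accessibility bit set.
        int accessibilityOnly = 1 << (EXTRACT_FOR_ACCESSIBILITY_BIT - 1);
        System.out.println(mayExtractText(accessibilityOnly, false));
        System.out.println(mayExtractText(accessibilityOnly, true));
    }
}
```

The point of the second parameter is that a document can forbid general
extraction while still allowing extraction for accessibility, so the caller
has to say which of the two it is asking for.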

1) What is the preferred way to extract text from a
pdf("-that-can-be-read-in-AcrobatReader")?
If you only want text from the PDF document itself (not attachments/embedded
documents) and you are only parsing PDFs, then it might make sense to use pure
PDFBox. <unconfirmed> I haven't checked recently, but I _think_ that Tika may be
pulling out some text from annotations or maybe AcroFields that PDFTextStripper
isn't </unconfirmed>... I can look into this if it matters to you. Tika also
extracts normalized metadata and does a bit more with metadata than you would
get from the PDFTextStripper alone.

2) Does the second approach possibly return more than text? Blobs? Binary data?
The second approach will leverage the full power of Tika to extract content
from embedded documents/attachments. The first approach will only extract text
from the outer PDF document. You can extract binary data (embedded images or
other embedded files) in Tika by registering an EmbeddedDocumentExtractor in
the ParseContext (keyed on EmbeddedDocumentExtractor.class) instead of
registering a Parser under Parser.class.