extracting text from an "encrypted" pdf

Clemens Wyss DEV Fri, 08 May 2015 08:33:49 -0700

When I try to extract text from an "encrypted" document (one that can 
nevertheless be opened in Acrobat Reader) with:


pdfDocument = PDDocument.load( TIKA_FILES_DIR + "doc1.pdf" ); // "dauertewig.pdf" );
PDFTextStripper pdfStripper = new PDFTextStripper();
parsedText = pdfStripper.getText( pdfDocument );

I get an empty string, and "o.apache.pdfbox.pdfparser.PDFParser - Document is 
encrypted" is logged.
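To double-check that the file itself declares encryption (rather than the log message being misleading), a rough stdlib-only sketch like the one below could be used. It relies on the fact that an encrypted PDF carries an /Encrypt entry in its trailer or cross-reference stream dictionary. The class and method names are hypothetical and not part of PDFBox or Tika, and a plain byte scan is only a heuristic: it can report a false positive if the literal "/Encrypt" happens to appear elsewhere in the file.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or Tika: a rough heuristic only.
public class EncryptMarkerCheck {

    // Returns true if the raw bytes contain the name "/Encrypt", which an
    // encrypted PDF carries in its trailer or cross-reference stream dictionary.
    public static boolean declaresEncryption(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java EncryptMarkerCheck <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(declaresEncryption(bytes)
                ? "file declares an /Encrypt entry"
                : "no /Encrypt entry found");
    }
}
```

If this reports an /Encrypt entry for a file that Acrobat Reader opens without prompting, one plausible explanation (an assumption on my side) is that the document is encrypted with an empty user password.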

When, on the other hand, I do:

ContentHandler handler = new BodyContentHandler( -1 );
ParseContext context = new ParseContext();
parser = new AutoDetectParser();
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context );
parsedText = handler.toString();

I do get some text/content of that very same PDF. 

1) What is the preferred way to extract text from a PDF (one that can be 
read in Acrobat Reader)? 
2) Does the second approach possibly return more than text? Blobs? Binary data?
