Re: extracting text from an "encrypted" pdf

Clemens Wyss DEV Fri, 08 May 2015 08:52:14 -0700

Thanks for the very fast answer.
> new StandardDecryptionMaterial( password );
I have no password. The PDF is a public user manual.


> That is TIKA, isn't it?
True


-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Friday, 8 May 2015 17:44
To: [email protected]
Subject: Re: extracting text from an "encrypted" pdf

On 08.05.2015 at 17:36, Clemens Wyss DEV wrote:
> When I try to extract an "encrypted" (which can be read in AcrobatReader) 
> document with:
>
> pdfDocument = PDDocument.load( is );

add
if( document.isEncrypted() )
{
  StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
  document.openProtection( sdm );
}

or use loadNonSeq()
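
A minimal sketch of the first option, assuming PDFBox 1.8 on the classpath. It also assumes that a PDF which opens in a reader without a password prompt is protected only by an owner password, so the empty string is tried as the user password. The class name EncryptedPdfText and the constant ASSUMED_USER_PASSWORD are made up for illustration.

```java
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.util.PDFTextStripper;

public class EncryptedPdfText {

    // Assumption: the file has no user ("open") password, only an owner
    // password, so the empty string is enough to decrypt it.
    static final String ASSUMED_USER_PASSWORD = "";

    static String extractText(InputStream is) throws Exception {
        PDDocument document = PDDocument.load(is);
        try {
            if (document.isEncrypted()) {
                // Decrypt before handing the document to the stripper.
                document.openProtection(
                        new StandardDecryptionMaterial(ASSUMED_USER_PASSWORD));
            }
            return new PDFTextStripper().getText(document);
        } finally {
            // Always release the document's resources.
            document.close();
        }
    }
}
```

If the file does have a real user password, openProtection() will fail and the actual password has to be supplied instead of the empty string.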

> PDFTextStripper pdfStripper = new PDFTextStripper();
> parsedText = pdfStripper.getText( pdfDocument );
>
> I get an empty string, and " o.apache.pdfbox.pdfparser.PDFParser - Document 
> is encrypted" is logged.
>
> When, on the other hand, I do:
>
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> parser = new AutoDetectParser();
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> parsedText = handler.toString();
>
> I get to see the text/content of the very pdf.
>
> 1) What is the preferred way to extract text from a
> pdf("-that-can-be-read-in-AcrobatReader")?
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java?view=markup&sortby=date

>   
> 2) Does the second approach possibly return "more than text"? Blobs? Binary 
> data?

That is TIKA, isn't it?

Tilman

>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>


