Thx for the very fast answer.

> new StandardDecryptionMaterial( password );
I have no password. The pdf is a public user manual.

> That is TIKA, isn't it?
True

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Friday, 8 May 2015 17:44
To: [email protected]
Subject: Re: extracting text from an "encrypted" pdf

On 08.05.2015 at 17:36, Clemens Wyss DEV wrote:
> When I try to extract an "encrypted" (which can be read in AcrobatReader)
> document with:
>
> pdfDocument = PDDocument.load( is );

add

if( document.isEncrypted() )
{
    StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
    document.openProtection( sdm );
}

or use loadNonSeq()

> PDFTextStripper pdfStripper = new PDFTextStripper();
> parsedText = pdfStripper.getText( pdfDocument );
>
> I get an empty string, and " o.apache.pdfbox.pdfparser.PDFParser - Document
> is encrypted" is logged.
>
> When, on the other hand, I do:
>
> ContentHandler handler = new BodyContentHandler( -1 );
> ParseContext context = new ParseContext();
> parser = new AutoDetectParser();
> context.set( Parser.class, parser );
> parser.parse( is, handler, metadata, context );
> parsedText = handler.toString();
>
> I get to see the text/content of the very pdf.
>
> 1) What is the preferred way to extract text from a
> pdf("-that-can-be-read-in-AcrobatReader")?

https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java?view=markup&sortby=date

> 2) Does the second approach possibly return "more than text"? Blobs? Binary
> data?

That is TIKA, isn't it?

Tilman

> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
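
For the "I have no password" case, here is a minimal sketch of the snippet suggested above, put together as one class. It assumes PDFBox 1.8.x on the classpath (encrypted files probably also need the Bouncy Castle jars) and it assumes that a pdf which opens in AcrobatReader without a prompt has an empty user password, so "" is passed as the password. The class name EncryptedPdfText and the method name extract are made up for illustration; they are not part of PDFBox.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.util.PDFTextStripper;

// Hypothetical helper class, written against the PDFBox 1.8 API.
public class EncryptedPdfText {

    // Loads a pdf from the stream, decrypts it if needed, and returns its text.
    public static String extract(InputStream is) throws IOException {
        PDDocument document = PDDocument.load(is);
        try {
            if (document.isEncrypted()) {
                try {
                    // Assumption: the document has an empty user password.
                    StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
                    document.openProtection(sdm);
                } catch (BadSecurityHandlerException e) {
                    throw new IOException("Unsupported security handler", e);
                } catch (CryptographyException e) {
                    throw new IOException("Could not decrypt with an empty password", e);
                }
            }
            PDFTextStripper pdfStripper = new PDFTextStripper();
            return pdfStripper.getText(document);
        } finally {
            // Always release the document, even if decryption or stripping fails.
            document.close();
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new FileInputStream(args[0]);
        try {
            System.out.println(extract(is));
        } finally {
            is.close();
        }
    }
}
```

If the document really does need a user password, openProtection should fail rather than return an empty string, which makes that case easier to tell apart from a pdf that simply has no text.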

