Hi
On 12/09/16 22:19, Sergey Beryozkin wrote:
Hi Tim
This is very helpful, thanks.
I'll experiment with the code below.

By the way, I've found out AutoDetectParser may not work if the (pdf)
stream is an attachment stream which may not support a mark.
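
For reference, a stream without mark support can generally be wrapped in a java.io.BufferedInputStream before it is handed to a parser. This is a generic JDK-level workaround, not something suggested in this thread, and whether it resolves the AutoDetectParser issue described here is an assumption. The class and method names below (MarkSupport, ensureMarkSupported, noMarkStream) are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkSupport {

    /** Returns the stream unchanged if it supports mark/reset, otherwise buffers it. */
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Hypothetical stand-in for an attachment stream that reports no mark support. */
    public static InputStream noMarkStream(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = noMarkStream("%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        InputStream in = ensureMarkSupported(raw);

        in.mark(8);              // remember the current position
        int first = in.read();   // peek at the first byte
        in.reset();              // rewind so a parser would see the stream from the start

        System.out.println(in.markSupported() + " " + (first == in.read())); // prints "true true"
    }
}
```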

I've been wondering, would it make sense to pass a MediaType identifying
the data format as either a ParseContext or Metadata property for
AutoDetectParser to avoid trying to read the stream ?
My demo works with PDF & ODT files, and before a parse call I already
know the media type

It does indeed help if Metadata provides a Content-Type hint - I nearly started creating a patch before seeing the obvious, that Tika was already supporting it :-)
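
A minimal sketch of passing such a hint, assuming Tika is on the classpath. Metadata.CONTENT_TYPE and the 3 param parse call are existing Tika API; the class and method names (ContentTypeHint, hintedMetadata, parseToText) are hypothetical. How much of the stream the detector still reads when a hint is present is not something this sketch establishes.

```java
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToTextContentHandler;

public class ContentTypeHint {

    /** Builds a Metadata object that declares the media type up front. */
    public static Metadata hintedMetadata(String mediaType) {
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_TYPE, mediaType);
        return metadata;
    }

    /** Parses the stream to plain text, passing the already-known media type as a hint. */
    public static String parseToText(InputStream in, String mediaType) throws Exception {
        ToTextContentHandler handler = new ToTextContentHandler();
        new AutoDetectParser().parse(in, handler, hintedMetadata(mediaType));
        return handler.toString();
    }
}
```

With a PDF stream in hand, the call would be parseToText(pdfInputStream, "application/pdf").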

Sergey

Thanks, Sergey
On 12/09/16 14:26, Allison, Timothy B. wrote:
Hi Sergey,

Is this code good enough to get all the content (and metadata) out of
a 'simple' PDF ?
Yes, but...

For example, Tim has mentioned that it is possible to handle embedded
PDF attachments - I don't even know what they are, to me every PDF is
just a text when I look at it :-).

PDFs can have regular attachments (.doc, .ppt, etc., even other PDFs).
There are two traditional ways to get content from embedded files
inlined in the xhtml:

Option 1 (for the 3 param call to parse):
Parser parser = new AutoDetectParser();
ToTextContentHandler contentHandler = new ToTextContentHandler();
Metadata m = new Metadata();
parser.parse(pdfInputStream, contentHandler, m); //3 param parse

Option 2 (for the 4 param call to parse):
Parser parser = new AutoDetectParser();
Metadata m = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, parser); //NEED TO ADD PARSER FOR EMBEDDED DOCS
ToTextContentHandler contentHandler = new ToTextContentHandler();
parser.parse(pdfInputStream, contentHandler, m, context); //4 param call to parse

Another option is to use the RecursiveParserWrapper.  This returns a
List<Metadata>, where the first Metadata object represents the
container document, and the subsequent Metadata objects represent
embedded documents.  The text content for each document is stored in
the RecursiveParserWrapper.TIKA_CONTENT field within each Metadata
object.

Option 3

        Parser p = new AutoDetectParser();
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p,
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
        ParseContext context = new ParseContext();

        try (InputStream is = getResourceAsStream("/test-documents/" + filePath)) {
            wrapper.parse(is, new DefaultHandler(), new Metadata(), context);
        }
        return wrapper.getMetadata(); // List<Metadata>, container document first

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, September 9, 2016 10:06 AM
To: user@tika.apache.org
Subject: How to parse PDF files effectively with Tika

Hi All

While I've experimented with writing simple demo code which creates
a Tika PDFParser (and a few other parsers) and provides a
ToTextContentHandler for it to return the content, I'm realizing I'm
not really sure what the best strategy is.

For example, Tim has mentioned that it is possible to handle embedded
PDF attachments - I don't even know what they are, to me every PDF is
just a text when I look at it :-). Besides, I'm not sure whether
ToTextContentHandler is missing some content.

Here is the basic code I have:

PDFParser parser = new PDFParser();
Metadata m = new Metadata();
ParseContext context = new ParseContext();
ToTextContentHandler contentHandler = new ToTextContentHandler();
parser.parse(pdfInputStream, contentHandler, m, context);

// work with the returned content, and filled-in Metadata
String content = contentHandler.toString();

Is this code good enough to get all the content (and metadata) out of
a 'simple' PDF ?

How can I enhance this code to handle the embedded attachments too?
Ideally such that it continues supporting both 'simple' and 'complex'
PDFs.

I'd like to understand it better so that I can enhance our CXF Tika
integration code a bit

Thanks, Sergey



--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
