I have a PDF document with an embedded docx attachment, and
tika.parseToString(file) was not returning the contents of the docx.
I dug around a bit in PDFExtractor and found that when I changed this call:
    embeddedExtractor.parseEmbedded(
            stream,
            new EmbeddedContentHandler(new BodyContentHandler(localHandler)),
            metadata,
            false);
to:
    embeddedExtractor.parseEmbedded(
            stream,
            new EmbeddedContentHandler(handler),
            metadata,
            false);
in other words, when I stopped wrapping the handler in a BodyContentHandler
(which passes through only what is inside "body" elements), I was able to get
the content of the attached document.
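
To illustrate what requiring "body" elements means here, below is a minimal sketch that uses only the JDK's SAX classes. It is not Tika's code: BodyOnlySketch and extract are hypothetical names, and the filter is a simplified stand-in for a handler that keeps only text nested inside a body element. It shows that the same text is kept when the events arrive wrapped in a body element and dropped when they do not.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified stand-in for a body-only filter; not Tika's actual class.
final class BodyOnlySketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside a <body> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting depth at <body>, and keep counting for its descendants.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside <body> is silently dropped.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Parses the given XML string and returns only the text found inside <body>.
    static String extract(String xml) throws Exception {
        BodyOnlySketch handler = new BodyOnlySketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("[" + extract("<html><body><p>hello</p></body></html>") + "]"); // prints [hello]
        System.out.println("[" + extract("<div><p>hello</p></div>") + "]");                // prints []
    }
}
```

If the embedded document's events reach a filter like this without an enclosing body element, its text would be discarded. Whether that is what actually happens in PDFExtractor is an assumption, not something confirmed here.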
When I attached the same inner document to a docx file instead, its content
was extracted without this change. Does anyone know why this change is
required in PDFExtractor? Is this a bad solution?
Unfortunately, I can't share the documents.
Best,
Tim