On Thu, 6 Oct 2016, Ingo Siebert wrote:
On 05.10.2016 at 20:04, Nick Burch wrote:
On Wed, 5 Oct 2016, Ingo Siebert wrote:
I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail with multipart/mixed content.

How do you want to get the various parts back? All text inlined, or a special callback for each part? What about the metadata for the parts?

An MS Office document also consists of several parts and chapters, and I get them as one string.

An MS Office document can have other documents, images, sounds etc. embedded in it too! You have to ask Tika for those in the same way.

At least for my use case it would be sufficient to get the data concatenated into one string, but it would also be nice if I could get the parts separately.

If you pop an EmbeddedDocumentExtractor onto the ParseContext, that'll be called to let you handle each part in turn. You might want a ParsingEmbeddedDocumentExtractor to give you parsed contents rather than raw parts.
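
For illustration, here is a rough sketch of what registering an EmbeddedDocumentExtractor on the ParseContext could look like. It is not code from this thread: the class names EmbeddedPartsSketch and PartCollector and the method extractParts are made up for the example, and it assumes tika-parsers is on the classpath. Which parts of a mail are handed to the extractor (all body parts, or only attachments) may differ between Tika versions, so treat the result as "whatever Tika reports as embedded".

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
class EmbeddedPartsSketch {

    // Collects the plain text of every embedded part Tika hands over.
    static class PartCollector implements EmbeddedDocumentExtractor {
        final List<String> parts = new ArrayList<>();
        private final Parser parser = new AutoDetectParser();

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true; // accept every part; filter on metadata here if needed
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // Shield the container's stream so parsing a part cannot close it.
            InputStream shielded = new FilterInputStream(stream) {
                @Override
                public void close() { /* intentionally a no-op */ }
            };
            BodyContentHandler text = new BodyContentHandler(-1); // -1: no write limit
            try {
                // Fresh ParseContext: this sketch does not try to collect
                // parts nested inside parts.
                parser.parse(shielded, text, metadata, new ParseContext());
            } catch (TikaException e) {
                throw new IOException("Failed to parse embedded part", e);
            }
            String type = metadata.get(Metadata.CONTENT_TYPE);
            parts.add(type + ": " + text.toString().trim());
        }
    }

    // Parses a container (e.g. a multipart/mixed mail) and returns one
    // entry per embedded part, instead of one concatenated string.
    static List<String> extractParts(InputStream container)
            throws IOException, SAXException, TikaException {
        PartCollector collector = new PartCollector();
        ParseContext context = new ParseContext();
        context.set(EmbeddedDocumentExtractor.class, collector);
        new AutoDetectParser().parse(
                container, new BodyContentHandler(-1), new Metadata(), context);
        return collector.parts;
    }
}
```

Calling EmbeddedPartsSketch.extractParts(...) with an InputStream over the .eml file should then give one list entry per part. If a single concatenated string is enough, the custom extractor is probably not needed: setting the parser itself on the context (context.set(Parser.class, parser)) lets the default ParsingEmbeddedDocumentExtractor parse the parts inline into the outer handler.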

Nick
