Re: Tika: parsing mixed content e-mails

Nick Burch Thu, 06 Oct 2016 01:59:47 -0700

On Thu, 6 Oct 2016, Ingo Siebert wrote:

> On 05.10.2016 at 20:04, Nick Burch wrote:
>> On Wed, 5 Oct 2016, Ingo Siebert wrote:
>>> I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail with multipart/mixed content.
>> How do you want to get the various parts back? All text inlined, or a special callback for each part? What about the metadata for the parts?
> An MS Office document also consists of several parts and chapters, and I get them as one string.

An MS Office document can have other documents, images, sounds etc. embedded in it too! You have to ask Tika for those in the same way.

> At least for my use case it would be sufficient to get the data concatenated into one string, but it would also be nice if I could get the parts separately.

If you pop an EmbeddedDocumentExtractor onto the ParseContext, that'll be called to let you handle each part in turn. You might want a ParsingEmbeddedDocumentExtractor to give you parsed contents rather than raw parts.
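
A rough sketch of that approach, assuming the Tika 1.x EmbeddedDocumentExtractor interface (shouldParseEmbedded / parseEmbedded). The class name MailPartsExample, the Part holder and the extractParts helper are made up for illustration and are not part of Tika. Which parts actually reach the extractor depends on the mail parser in your Tika version, so treat this as a starting point rather than a reference implementation:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class MailPartsExample {

    /** Hypothetical holder: one part's metadata plus its extracted text. */
    public static final class Part {
        public final Metadata metadata;
        public final String text;

        Part(Metadata metadata, String text) {
            this.metadata = metadata;
            this.text = text;
        }
    }

    /** Collects every part handed over by the container parser. */
    static final class CollectingExtractor implements EmbeddedDocumentExtractor {
        private final Parser parser;
        private final ParseContext context;
        final List<Part> parts = new ArrayList<>();

        CollectingExtractor(Parser parser, ParseContext context) {
            this.parser = parser;
            this.context = context;
        }

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true; // accept every part; filter on metadata here if needed
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // Parse the part into its own handler instead of the outer one,
            // so each part's text stays separate. -1 disables the write limit.
            BodyContentHandler partHandler = new BodyContentHandler(-1);
            try {
                parser.parse(stream, partHandler, metadata, context);
            } catch (TikaException e) {
                // Keep whatever text was extracted before the failure.
            }
            parts.add(new Part(metadata, partHandler.toString()));
        }
    }

    /** Parses a message and returns the parts seen by the extractor. */
    public static List<Part> extractParts(InputStream in)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        CollectingExtractor extractor = new CollectingExtractor(parser, context);
        // Registering the parser lets nested containers recurse as well.
        context.set(Parser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, extractor);
        parser.parse(in, new BodyContentHandler(-1), new Metadata(), context);
        return extractor.parts;
    }
}
```

If all you need is the text of every part inlined into the main handler, setting context.set(EmbeddedDocumentExtractor.class, new ParsingEmbeddedDocumentExtractor(context)) together with context.set(Parser.class, parser) should be enough, and no custom class is needed.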


Nick
