Re: Tika: parsing mixed content e-mails

Ingo Siebert Thu, 06 Oct 2016 04:30:36 -0700

On 06.10.2016 at 10:58, Nick Burch wrote:
> On Thu, 6 Oct 2016, Ingo Siebert wrote:
>> On 05.10.2016 at 20:04, Nick Burch wrote:
>>> On Wed, 5 Oct 2016, Ingo Siebert wrote:
>>>> I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail with multipart/mixed content.
>>> How do you want to get the various parts back? All text inlined, or a special callback for each part? What about the metadata for the parts?
>> An MS Office document also consists of several parts and chapters, and I get them as one string.
> An MS Office document can have other documents, images, sounds etc. embedded in it too! You have to ask Tika for those in the same way.

Ok, I never thought about that.
Thank you.

>> At least for my use case it would be sufficient to get the data concatenated into one string, but it would also be nice if I could get the parts separately.
> If you pop an EmbeddedDocumentExtractor onto the ParseContext, that'll be called to let you handle each part in turn. You might want a ParsingEmbeddedDocumentExtractor to give you parsed contents rather than raw parts.

I'll take a look at the ParsingEmbeddedDocumentExtractor you mentioned, thank you.
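
For readers following the thread, the suggestion above could be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the class `MailPartLister`, the helper `CollectingExtractor`, and the sample message are hypothetical names made up here, and the example assumes tika-parsers is on the classpath and calls the RFC822Parser directly. Which parts actually reach the extractor may differ between Tika versions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mail.RFC822Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class MailPartLister {

    // Hypothetical sample message with two parts, used only for illustration.
    static final String SAMPLE_MAIL =
            "From: a@example.org\r\n"
            + "To: b@example.org\r\n"
            + "Subject: test\r\n"
            + "MIME-Version: 1.0\r\n"
            + "Content-Type: multipart/mixed; boundary=\"XYZ\"\r\n"
            + "\r\n"
            + "--XYZ\r\n"
            + "Content-Type: text/plain; charset=us-ascii\r\n"
            + "\r\n"
            + "Hello\r\n"
            + "--XYZ\r\n"
            + "Content-Type: text/html; charset=us-ascii\r\n"
            + "\r\n"
            + "<p>Hello</p>\r\n"
            + "--XYZ--\r\n";

    /** Hypothetical helper: records the content type of each part handed over by the parser. */
    static class CollectingExtractor implements EmbeddedDocumentExtractor {
        final List<String> contentTypes = new ArrayList<>();

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true; // accept every part
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // The raw part is available via 'stream'; here we only note its type.
            contentTypes.add(metadata.get(Metadata.CONTENT_TYPE));
        }
    }

    /** Parses the message and returns the content types of the parts the extractor saw. */
    static List<String> listParts(InputStream mail) throws Exception {
        CollectingExtractor extractor = new CollectingExtractor();
        ParseContext context = new ParseContext();
        // This is the "pop an EmbeddedDocumentExtractor onto the ParseContext" step.
        context.set(EmbeddedDocumentExtractor.class, extractor);
        new RFC822Parser().parse(mail, new BodyContentHandler(-1), new Metadata(), context);
        return extractor.contentTypes;
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(SAMPLE_MAIL.getBytes(StandardCharsets.US_ASCII));
        System.out.println("parts: " + listParts(in));
    }
}
```

Registering a ParsingEmbeddedDocumentExtractor (constructed with the same ParseContext) in place of the custom class should hand each part to a parser rather than exposing the raw stream, which is closer to the "parsed contents" variant mentioned above.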

In the meantime I tested the following code, which returns only one part because the HtmlParser is used. I think the AutoDetectParser chooses the wrong parser, because if I manually choose the RFC822Parser the output is quite fine.

What's your opinion?

final AutoDetectParser wrapped = new AutoDetectParser();
final RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped,
        new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
wrapper.parse(tikaInputStream, new DefaultHandler(), metadata, new ParseContext());

final List<Metadata> list = wrapper.getMetadata();
System.out.println("parts: " + list.size());

Ingo
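
One way to check the suspicion about parser selection is to ask Tika which media type it detects for the message, since the AutoDetectParser picks its parser from that type. The following is a small sketch under assumptions: the class name `DetectMailType` and the inline sample are hypothetical, and the detected type depends on the Tika version and on the actual bytes of the message, so no particular result is claimed here.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.Tika;

public class DetectMailType {

    /** Returns the media type Tika detects for the given leading bytes of a document. */
    static String detectType(byte[] prefix) {
        return new Tika().detect(prefix);
    }

    public static void main(String[] args) throws Exception {
        byte[] data;
        if (args.length > 0) {
            // Pass the path of the problematic message to inspect the real input.
            data = Files.readAllBytes(Paths.get(args[0]));
        } else {
            // Hypothetical stand-in message, used only so the sketch runs without a file.
            data = ("From: a@example.org\r\n"
                    + "Subject: test\r\n"
                    + "MIME-Version: 1.0\r\n"
                    + "Content-Type: multipart/mixed; boundary=\"XYZ\"\r\n"
                    + "\r\n"
                    + "--XYZ\r\n"
                    + "Content-Type: text/html\r\n"
                    + "\r\n"
                    + "<html><body>Hello</body></html>\r\n"
                    + "--XYZ--\r\n").getBytes(StandardCharsets.US_ASCII);
        }
        System.out.println("detected type: " + detectType(data));
    }
}
```

If the printed type is text/html rather than message/rfc822 for the real message, that would be consistent with the HtmlParser being chosen.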
