On 06.10.2016 at 10:58, Nick Burch wrote:
On Thu, 6 Oct 2016, Ingo Siebert wrote:
On 05.10.2016 at 20:04, Nick Burch wrote:
On Wed, 5 Oct 2016, Ingo Siebert wrote:
I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail with multipart/mixed content.

How do you want to get the various parts back? All text inlined, or a special callback for each part? What about the metadata for the parts?

An MS Office document also consists of several parts and chapters, and I get them as one string.

An MS Office document can have other documents, images, sounds etc. embedded in it too! You have to ask Tika for those in the same way.
Ok, I never thought about that.
Thank you.


At least for my use case it would be sufficient to get the data concatenated into one string, but it would also be nice if I could get the parts separately.

If you pop an EmbeddedDocumentExtractor onto the ParseContext, that'll be called to let you handle each part in turn. You might want a ParsingEmbeddedDocumentExtractor to give you parsed contents rather than raw parts.
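A minimal sketch of that suggestion, assuming the Tika 1.x EmbeddedDocumentExtractor interface (shouldParseEmbedded and parseEmbedded). The class name CollectingExtractor and the helper parseParts are hypothetical names made up for illustration; the extractor simply records the declared content type and raw bytes of every part it is handed. How many parts a given message produces depends on the Tika version and on which parser handles the message.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper: collects each embedded part as raw bytes.
public class CollectingExtractor implements EmbeddedDocumentExtractor {
    public final List<String> types = new ArrayList<>();
    public final List<byte[]> bodies = new ArrayList<>();

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        return true; // accept every part; filter on metadata here if needed
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // Copy the part without closing the stream; the calling parser owns it.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        types.add(metadata.get(Metadata.CONTENT_TYPE));
        bodies.add(buffer.toByteArray());
    }

    // Usage sketch: register the extractor on the ParseContext, then parse.
    public static CollectingExtractor parseParts(InputStream mail) throws Exception {
        CollectingExtractor extractor = new CollectingExtractor();
        ParseContext context = new ParseContext();
        context.set(EmbeddedDocumentExtractor.class, extractor);
        new AutoDetectParser().parse(mail, new BodyContentHandler(-1), new Metadata(), context);
        return extractor;
    }
}
```

To get parsed text rather than raw bytes, the same context.set call could register a ParsingEmbeddedDocumentExtractor built from the context instead; in that case it is probably also necessary to put a Parser into the context so the embedded parts have something to be parsed with.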


I'll take a look at ParsingEmbeddedDocumentExtractor you mentioned, thank you.

In the meantime I tested the following code, which returns only one part because the HTMLParser is used. I think the AutoDetectParser chooses the wrong parser, because if I manually choose the RFC822Parser the output is fine.
What's your opinion?

final AutoDetectParser wrapped = new AutoDetectParser();
// Collect plain text for each document; -1 disables the write limit
final RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped,
        new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
wrapper.parse(tikaInputStream, new DefaultHandler(), metadata, new ParseContext());
// One Metadata entry per document: the container first, then each embedded part
final List<Metadata> list = wrapper.getMetadata();
System.out.println("parts: " + list.size());
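One way to check the suspicion about parser selection is to ask the detector directly which media type it assigns to the message, since AutoDetectParser picks its parser from that type. The following is only a sketch: the class name DetectionCheck and the method detectedType are hypothetical, and the result for any particular e-mail depends on its first bytes and on the Tika version.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;

// Hypothetical diagnostic: reports the media type AutoDetectParser would act on.
public class DetectionCheck {
    public static String detectedType(byte[] data) throws IOException {
        AutoDetectParser parser = new AutoDetectParser();
        // Detection needs mark/reset support, which ByteArrayInputStream provides.
        try (InputStream in = new ByteArrayInputStream(data)) {
            MediaType type = parser.getDetector().detect(in, new Metadata());
            return type.toString();
        }
    }
}
```

If this reports text/html for the e-mail rather than message/rfc822, that would be consistent with the HTMLParser being chosen. After a normal parse, the X-Parsed-By values in the Metadata should give a similar hint about which parsers actually ran.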

Ingo

