On 06.10.2016 at 10:58, Nick Burch wrote:
On Thu, 6 Oct 2016, Ingo Siebert wrote:
On 05.10.2016 at 20:04, Nick Burch wrote:
On Wed, 5 Oct 2016, Ingo Siebert wrote:
I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an
e-mail with multipart/mixed content.
How do you want to get the various parts back? All text inlined, or
a special callback for each part? What about the metadata for the
parts?
An MS Office document also consists of several parts and chapters, and
I get them back as one string.
An MS Office document can have other documents, images, sounds, etc.
embedded in it too! You have to ask Tika for those in the same way.
Ok, I never thought about that.
Thank you.
At least for my use case it would be sufficient to get the data
concatenated into one string, but it would also be nice if I could get
the parts separately.
If you pop an EmbeddedDocumentExtractor onto the ParseContext, that'll
be called to let you handle each part in turn. You might want a
ParsingEmbeddedDocumentExtractor to give you parsed contents rather
than raw parts.
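
The suggestion above can be sketched roughly as follows. This is only an illustration, assuming tika-parsers 1.13 on the classpath: the class name EmbeddedPartsSketch, the CollectingExtractor helper and the collectPartTypes method are hypothetical names made up for this example, and RFC822Parser is called directly instead of relying on auto-detection. Which parts are handed to the extractor, as opposed to being inlined into the main text, may differ between Tika versions.

```java
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mail.RFC822Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class EmbeddedPartsSketch {

    /** Hypothetical helper: records the content type of every part it is handed. */
    static class CollectingExtractor implements EmbeddedDocumentExtractor {
        final List<String> contentTypes = new ArrayList<>();

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true; // accept every part the parser offers
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // The raw bytes of the part are available on 'stream'; this sketch
            // ignores them and only notes the declared content type.
            contentTypes.add(metadata.get(Metadata.CONTENT_TYPE));
        }
    }

    /** Parses one e-mail and returns the content types of the parts seen. */
    static List<String> collectPartTypes(InputStream mail)
            throws IOException, SAXException, TikaException {
        CollectingExtractor extractor = new CollectingExtractor();
        ParseContext context = new ParseContext();
        // Register the extractor so the parser calls it for each part.
        context.set(EmbeddedDocumentExtractor.class, extractor);
        // -1 disables the write limit of the body handler.
        new RFC822Parser().parse(mail, new BodyContentHandler(-1),
                new Metadata(), context);
        return extractor.contentTypes;
    }
}
```

To get parsed text for each part instead of the raw stream, a ParsingEmbeddedDocumentExtractor built from a ParseContext that has a Parser registered could be put on the context in place of the collector.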
I'll take a look at ParsingEmbeddedDocumentExtractor you mentioned,
thank you.
In the meantime I tested the following code, which returns only one
part because the HTMLParser is used.
I think the AutoDetectParser chooses the wrong parser, because if I
manually choose the RFC822Parser, the output is quite fine.
What's your opinion?
final AutoDetectParser wrapped = new AutoDetectParser();
final RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped,
        new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
wrapper.parse(tikaInputStream, new DefaultHandler(), metadata,
        new ParseContext());
final List<Metadata> list = wrapper.getMetadata();
System.out.println("parts: " + list.size());
Ingo