On Thu, 6 Oct 2016, Ingo Siebert wrote:
On 05.10.2016 at 20:04, Nick Burch wrote:
On Wed, 5 Oct 2016, Ingo Siebert wrote:
I just used Tika (org.apache.tika:tika-parsers:1.13) to parse an e-mail
with multipart/mixed content.
How do you want to get the various parts back? All text inlined, or a
special callback for each part? What about the metadata for the parts?
An MS Office document also consists of several parts and chapters, and I get
them as one string.
An MS Office document can have other documents, images, sounds etc. embedded
in it too! You have to ask Tika for those in the same way.
At least for my use case it would be sufficient to get the data concatenated
into one string, but it would also be nice if I got the parts separately.
If you pop an EmbeddedDocumentExtractor onto the ParseContext, that'll be
called to let you handle each part in turn. You might want a
ParsingEmbeddedDocumentExtractor to give you parsed contents rather than
raw parts.
Nick
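
For anyone finding this thread later, here is a minimal sketch of the approach
described above, assuming the Tika 1.x API with tika-parsers on the classpath.
The class name PartCollector and its getters are hypothetical, made up for
illustration. It registers a custom EmbeddedDocumentExtractor on the
ParseContext and parses each part into its own string, so the parts come back
separately along with the content type Tika reports for each one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper (not part of Tika): keeps the text of every part separately.
class PartCollector implements EmbeddedDocumentExtractor {
    private final Parser parser;
    private final ParseContext context;
    private final List<String> parts = new ArrayList<>();
    private final List<String> contentTypes = new ArrayList<>();

    PartCollector(Parser parser, ParseContext context) {
        this.parser = parser;
        this.context = context;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        return true; // accept every part; filter on metadata here if needed
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // Parse the part into its own handler instead of the outer one,
        // so its text is not inlined into the parent document's text.
        BodyContentHandler partText = new BodyContentHandler(-1); // -1: no write limit
        try {
            parser.parse(stream, partText, metadata, context);
        } catch (TikaException e) {
            throw new IOException("Could not parse embedded part", e);
        }
        parts.add(partText.toString());
        contentTypes.add(metadata.get(Metadata.CONTENT_TYPE));
    }

    List<String> getParts() { return parts; }
    List<String> getContentTypes() { return contentTypes; }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PartCollector <file>");
            return;
        }
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        PartCollector collector = new PartCollector(parser, context);
        // This registration is what makes parsers hand embedded parts to us.
        context.set(EmbeddedDocumentExtractor.class, collector);

        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, new BodyContentHandler(-1), new Metadata(), context);
        }
        for (int i = 0; i < collector.getParts().size(); i++) {
            System.out.println("--- part " + i + " (" + collector.getContentTypes().get(i) + ")");
            System.out.println(collector.getParts().get(i));
        }
    }
}
```

Because the same ParseContext is passed to the inner parse, nested attachments
should also end up in the collector as separate entries. If one concatenated
string is enough, registering new ParsingEmbeddedDocumentExtractor(context)
instead should write the parsed text of each part into the outer handler.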