Hi, Mark.

If you use the Tika facade, all text content, including attachments, is
sent to the ContentHandler passed to parse(...). You can use
XHTMLContentHandler to receive each part of the email in its own <div
class="email-entry">. Tika usually parses content recursively and emits
everything to the ContentHandler.
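
A minimal sketch of this, assuming Tika 1.x on the classpath. It uses
AutoDetectParser directly instead of the facade, and ToXMLContentHandler to
serialize the XHTML events into a string. The class and method names
(EmlToXhtml, toXhtml) are just for illustration:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class EmlToXhtml {
    // Hypothetical helper: parses a message and returns the XHTML Tika emits.
    public static String toXhtml(InputStream in)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Register the parser in the context so embedded documents
        // (attachments) are parsed with it as well.
        context.set(Parser.class, parser);
        ContentHandler handler = new ToXMLContentHandler();
        parser.parse(in, handler, new Metadata(), context);
        return handler.toString();
    }
}
```

The returned string holds the whole message, attachments included, as one
XHTML document.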

If you need more fine-grained control, take a look at RecursiveParserWrapper
(
http://tika.apache.org/1.8/api/org/apache/tika/parser/RecursiveParserWrapper.html).
It returns a metadata object for each parsed document and each of its
children, with the content stored in that metadata object. It isn't
thread-safe (so create a new instance for each thread), and you have to
reset it after each parse call. Also, this approach is not suitable for
large files, since their content is held in memory.
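
A rough sketch of that usage, assuming the Tika 1.8 API from the Javadoc
linked above (getMetadata() and reset() are tied to that generation of the
class, so check the Javadoc for the version you run). RecursiveEml and
parseAll are illustrative names:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Not thread-safe: create one instance per thread.
public class RecursiveEml {
    private final RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
            new AutoDetectParser(),
            // Plain-text content for each document, no write limit (-1).
            new BasicContentHandlerFactory(
                    BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

    // Returns one Metadata per document: the message first, then its children.
    public List<Metadata> parseAll(InputStream in)
            throws IOException, SAXException, TikaException {
        try {
            // The handler passed here is ignored by the wrapper.
            wrapper.parse(in, new DefaultHandler(), new Metadata(),
                    new ParseContext());
            // Copy the list before reset() clears the wrapper's state.
            return new ArrayList<>(wrapper.getMetadata());
        } finally {
            wrapper.reset();
        }
    }
}
```

The text of each document can then be read with
metadata.get(RecursiveParserWrapper.TIKA_CONTENT).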

If you need even more fine-grained control, use Apache James Mime4j
(which Tika itself uses to parse emails). If your application is
email-centric and you don't need the metadata normalization that Tika
provides for email messages, it can be the right way to go. Each body part
of a multipart message can still be parsed by Tika. I recommend setting at
least the content type in the metadata object, taken from the Content-Type
header of the corresponding MIME part, before parsing it with Tika. You'll
get metadata and content for each message part, and you can stream the
content if it's large.
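
As a sketch of that combination, assuming Mime4j 0.7.x and Tika 1.x on the
classpath: Mime4j walks the message and hands each body part to Tika with
the part's content type set as a hint. PartsWithTika and extractParts are
made-up names, and BodyContentHandler buffers the text in memory here, so
for large parts you would pass a streaming handler instead:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.james.mime4j.MimeException;
import org.apache.james.mime4j.parser.AbstractContentHandler;
import org.apache.james.mime4j.parser.MimeStreamParser;
import org.apache.james.mime4j.stream.BodyDescriptor;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PartsWithTika {
    // Returns the extracted text of every body part, in message order.
    public static List<String> extractParts(InputStream message)
            throws IOException, MimeException {
        final Parser tika = new AutoDetectParser();
        final List<String> texts = new ArrayList<String>();

        MimeStreamParser mime = new MimeStreamParser();
        // Decode base64 / quoted-printable before handing bytes to Tika.
        mime.setContentDecoding(true);
        mime.setContentHandler(new AbstractContentHandler() {
            @Override
            public void body(BodyDescriptor bd, InputStream is)
                    throws MimeException, IOException {
                Metadata metadata = new Metadata();
                // Content type from the part's own MIME header.
                metadata.set(Metadata.CONTENT_TYPE, bd.getMimeType());
                BodyContentHandler handler = new BodyContentHandler(-1);
                try {
                    tika.parse(is, handler, metadata, new ParseContext());
                } catch (SAXException | TikaException e) {
                    throw new MimeException(e);
                }
                texts.add(handler.toString());
            }
        });
        mime.parse(message);
        return texts;
    }
}
```

body(...) is called once per leaf part, so a multipart message with two
parts produces two entries.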

-- 
Best regards,
Konstantin Gribov

Thu, 4 June 2015 at 8:07, Mark Kerzner <[email protected]>:

> Hi,
>
> usually I just do new Tika().parse(myfile...), and Tika does all the work.
>
> Is there anything special about *.eml files? How does Tika treat
> attachments? What would be a reference for me to read?
>
> Thank you
>
> --
> Mark Kerzner, Managing Partner, Elephant Scale <http://elephantscale.com/>
> Mobile: 713-724-2534, Skype: mark.kerzner1
> https://www.linkedin.com/in/markkerzner
> To schedule a meeting with me: http://www.meetme.so/markkerzner
>
>
