Hi, Mark. If you use the Tika facade, all text content, including attachments, is delivered to the ContentHandler passed to parse(...). You can use XHTMLContentHandler to receive each part of the email in its own <div class="email-entry">. Tika usually parses content recursively and emits everything to the ContentHandler.
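To illustrate the per-part <div class="email-entry"> structure described above, here is a minimal sketch of a plain SAX handler that collects the text of each such div into its own string. It uses only JDK classes. The class name EmailEntryCollector and the collect(...) helper are made up for this example, and the XHTML fed to it in the usage note is a hand-written stand-in, not real Tika output. Text inside nested divs is folded into the enclosing entry, which is a simplification.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: gathers the text of each <div class="email-entry">. */
class EmailEntryCollector extends DefaultHandler {
    private final List<String> parts = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    // Number of open <div> elements inside the current email-entry; 0 means outside.
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name)) {
            return;
        }
        if (depth > 0) {
            depth++;                       // nested div inside an entry
        } else if ("email-entry".equals(atts.getValue("class"))) {
            depth = 1;                     // a new entry starts here
            current.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name) || depth == 0) {
            return;
        }
        if (--depth == 0) {                // the entry's own div has closed
            parts.add(current.toString().trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    List<String> getParts() {
        return parts;
    }

    /** Demo only: runs the handler over an XHTML string with the JDK SAX parser. */
    static List<String> collect(String xhtml) throws Exception {
        EmailEntryCollector handler = new EmailEntryCollector();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getParts();
    }
}
```

With Tika, an instance of such a handler would be passed as the ContentHandler argument of Parser.parse(stream, handler, metadata, context), and getParts() read afterwards. For a quick check without Tika, EmailEntryCollector.collect("<html><body><div class=\"email-entry\"><p>Hello</p></div></body></html>") returns a single-element list.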
If you need more fine-grained control, take a look at RecursiveParserWrapper (http://tika.apache.org/1.8/api/org/apache/tika/parser/RecursiveParserWrapper.html). It returns a metadata object for each parsed document and each of its children, with the content stored in that metadata object. It isn't thread-safe (so create a new object for each thread), and you have to reset it after each parse call. Also, this method is not suitable for large files, since their content is held in memory.

If you need even more fine-grained control, use Apache James Mime4j (which Tika itself uses to parse emails). If your application is email-centric and you don't need the metadata normalization that Tika provides for email messages, this can be the right way to go. Each body part of a multipart message can still be parsed by Tika. I recommend setting at least the content type on the metadata object, taken from the MIME Content-Type header of the corresponding part of the multipart/* message, before parsing that part with Tika. You'll get metadata and content for each message part, and you can stream the content if it is large.

--
Best regards,
Konstantin Gribov

On Thu, 4 June 2015 at 8:07, Mark Kerzner <[email protected]> wrote:

> Hi,
>
> usually I just do new Tika().parse(myfile...), and Tika does all the work.
>
> Is there anything special about *.eml files? How does Tika treat
> attachments? What would be a reference for me to read?
>
> Thank you
>
> --
> Mark Kerzner, Managing Partner, Elephant Scale <http://elephantscale.com/>
> Mobile: 713-724-2534, Skype: mark.kerzner1
> https://www.linkedin.com/in/markkerzner
> To schedule a meeting with me: http://www.meetme.so/markkerzner
