Thank you, Konstantin. That is a wealth of information that will last me for both my current project and the next two :)
Mark

On Thu, Jun 4, 2015 at 3:44 AM, Konstantin Gribov <[email protected]> wrote:

> Hi, Mark.
>
> If you use the Tika facade, all text content, including attachments, is
> sent to the ContentHandler passed to parse(...). You can use
> XHTMLContentHandler to receive each part of the email in its own
> <div class="email-entry">. Tika usually parses content recursively and
> emits everything to the ContentHandler.
>
> If you need more fine-grained control, take a look at
> RecursiveParserWrapper
> (http://tika.apache.org/1.8/api/org/apache/tika/parser/RecursiveParserWrapper.html).
> It returns a metadata object for each parsed document and each of its
> children, with the content stored in that metadata object. It isn't thread
> safe (so create a new object for each thread), and you have to reset it
> after each parse call. Also, this method is not suitable for large files,
> since their content will be stored in memory.
>
> If you need even more fine-grained control, use Apache James Mime4j
> (which Tika itself uses to parse emails). If your application is
> email-centric and you don't need the metadata normalization that Tika
> provides for email messages, it can be the right way. Also, each multipart
> message body can be parsed by Tika. I recommend setting at least the
> content type on the metadata object, taken from the MIME Content-Type of
> the appropriate multipart/* headers, before parsing the body with Tika.
> You'll get metadata and content for each message part and can stream the
> content if it's quite large.
>
> --
> Best regards,
> Konstantin Gribov
>
> On Thu, Jun 4, 2015 at 8:07, Mark Kerzner <[email protected]> wrote:
>
>> Hi,
>>
>> Usually I just do new Tika().parse(myfile...), and Tika does all the work.
>>
>> Is there anything special about *.eml files? How does Tika treat
>> attachments? What would be a reference for me to read?
>>
>> Thank you
>>
>> --
>> Mark Kerzner, Managing Partner, Elephant Scale <http://elephantscale.com/>
>> Mobile: 713-724-2534, Skype: mark.kerzner1
>> https://www.linkedin.com/in/markkerzner
>> To schedule a meeting with me: http://www.meetme.so/markkerzner

--
Mark Kerzner, President & CEO, SHMsoft <http://shmsoft.com/>
To schedule a meeting with me: http://www.meetme.so/markkerzner
Mobile: 713-724-2534
Skype: mark.kerzner1
Office: One Riverway Suite 1700 Houston, TX 77056
*Privileged and Confidential* <http://shmsoft.com/>
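
For anyone following the last suggestion in the thread (take the Content-Type from each multipart header and put it on the metadata object before parsing that part), here is a minimal sketch of the header-handling step only. It uses nothing but the JDK. The class `ContentTypeSketch`, its nested `ParsedType`, and the `parse` method are hypothetical names made up for illustration; they are not part of Tika or Mime4j, and Mime4j's own header parsing is more complete than this.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not a Tika or Mime4j class.
final class ContentTypeSketch {

    /** Media type plus its parameters, e.g. "text/plain" and {charset=UTF-8}. */
    static final class ParsedType {
        final String mediaType;
        final Map<String, String> params;

        ParsedType(String mediaType, Map<String, String> params) {
            this.mediaType = mediaType;
            this.params = params;
        }
    }

    /**
     * Parses a Content-Type header value into a lower-cased media type and
     * its parameters. Simplification (assumption): the value is split on ';',
     * so a quoted parameter value that itself contains ';' is not handled.
     */
    static ParsedType parse(String headerValue) {
        // limit -1 keeps trailing empty pieces, so pieces[0] always exists
        String[] pieces = headerValue.split(";", -1);
        String mediaType = pieces[0].trim().toLowerCase(Locale.ROOT);
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 1; i < pieces.length; i++) {
            int eq = pieces[i].indexOf('=');
            if (eq < 0) {
                continue; // skip empty or malformed parameters
            }
            String name = pieces[i].substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = pieces[i].substring(eq + 1).trim();
            // strip one pair of surrounding double quotes, if present
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            params.put(name, value);
        }
        return new ParsedType(mediaType, params);
    }

    public static void main(String[] args) {
        ParsedType t = parse("Multipart/Mixed; boundary=\"frontier\"");
        System.out.println(t.mediaType + " " + t.params);
        // prints: multipart/mixed {boundary=frontier}
    }
}
```

The resulting media type string is what one would presumably store under Tika's Content-Type metadata key (`Metadata.CONTENT_TYPE`) before handing that part's stream to the parser, and the `boundary` parameter is what separates the parts of a multipart/* body.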
