woot! ---- Chris Mattmann [email protected]
-----Original Message-----
From: Mark Kerzner <[email protected]>
Reply-To: <[email protected]>
Date: Thursday, June 4, 2015 at 9:42 PM
To: Tika User <[email protected]>
Subject: Re: Tika parsing of emails

>Thank you, Konstantin. That is a wealth of information that will last me
>for both my current project and the next two :)
>
>Mark
>
>On Thu, Jun 4, 2015 at 3:44 AM, Konstantin Gribov <[email protected]>
>wrote:
>
>Hi, Mark.
>
>If you use the Tika facade, all text content, including attachments, is
>sent to the ContentHandler passed to parse(...). You can use
>XHTMLContentHandler to receive each part of the email in its own
><div class="email-entry">. Tika usually parses content recursively and
>emits everything to the ContentHandler.
>
>If you need more fine-grained control, take a look at
>RecursiveParserWrapper
>(http://tika.apache.org/1.8/api/org/apache/tika/parser/RecursiveParserWrapper.html).
>It returns a metadata object for each parsed document and for each of its
>children, with the content stored in that metadata object. It isn't
>thread safe (so create a new object for each thread), and you have to
>reset it after each parse call. Also, this method is not suitable for
>large files, since their content will be held in memory.
>
>If you need even more fine-grained control, use Apache James Mime4j
>(which Tika itself uses to parse emails). If your application is
>email-centric and you don't need the metadata normalization that Tika
>provides for email messages, it can be the right way. Each multipart
>message body can still be parsed by Tika. I recommend setting at least
>the content-type info on the metadata object, taken from the MIME
>Content-Type of the appropriate multipart/* headers, before parsing the
>body with Tika. You'll get metadata and content for each message part,
>and you can stream the content if it's quite large.
>
>--
>Best regards,
>Konstantin Gribov
>
>Thu, Jun 4, 2015 at 8:07, Mark Kerzner <[email protected]>:
>
>Hi,
>
>usually I just do new Tika().parse(myfile...), and Tika does all the work.
>
>Is there anything special about *.eml files? How does Tika treat
>attachments? What would be a reference for me to read?
>
>Thank you
>
>--
>Mark Kerzner, Managing Partner, Elephant Scale <http://elephantscale.com/>
>Mobile: 713-724-2534 <tel:713-724-2534>, Skype: mark.kerzner1
>https://www.linkedin.com/in/markkerzner
>
>To schedule a meeting with me: http://www.meetme.so/markkerzner
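
A note on Konstantin's last suggestion above (take the Content-Type of each MIME part and give it to Tika as a hint before parsing that part). The sketch below shows only the extraction step, using nothing but the JDK. The class name PartContentType and the method mediaType are hypothetical and are not part of Tika or Mime4j. With Mime4j you would normally read the already-parsed header from the library instead of handling raw header text, so treat this as an illustration of what the hint is, not as the recommended implementation.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Tika or Mime4j. */
public final class PartContentType {

    private PartContentType() {
    }

    /**
     * Returns the bare media type (for example "application/pdf") found in the
     * raw header block of a single MIME part, or an empty Optional when the
     * block has no usable Content-Type header.
     */
    public static Optional<String> mediaType(String rawHeaders) {
        // Unfold continuation lines: a line break followed by spaces or tabs
        // continues the previous header line.
        String unfolded = rawHeaders.replaceAll("\\r?\\n[ \\t]+", " ");

        for (String line : unfolded.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" line
            }
            String name = line.substring(0, colon).trim();
            if (!name.equalsIgnoreCase("Content-Type")) {
                continue; // header names are case-insensitive
            }
            String value = line.substring(colon + 1).trim();
            // Drop parameters such as "; charset=UTF-8" or "; name=...".
            int semicolon = value.indexOf(';');
            if (semicolon >= 0) {
                value = value.substring(0, semicolon);
            }
            value = value.trim().toLowerCase(Locale.ROOT);
            return value.isEmpty() ? Optional.empty() : Optional.of(value);
        }
        return Optional.empty();
    }
}
```

The value returned here is what would then be set on the Tika metadata object for that part, for example with metadata.set(Metadata.CONTENT_TYPE, type), before handing the part body to the parser as a stream.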
