Thank you Ken, this is great! I've created a link to your blog post on the Tika wiki:
https://wiki.apache.org/tika/TikaResources

Thank you again!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Ken Krugler <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, July 11, 2013 1:50 PM
To: "[email protected]" <[email protected]>
Subject: Blog post on extracting text features using Tika

>Hi all,
>
>I just posted part 1 of a series on extracting text features for machine
>learning...
>
>http://www.scaleunlimited.com/2013/07/10/text-feature-selection-for-machine-learning-part-1/
>
>It uses a modified version of the Tika RFC822 parser to process mbox
>files.
>
>I decided it was time to try to share some of what I'd learned over the
>years in processing text for classification, clustering and other related
>ML tasks.
>
>It undoubtedly has some things that are unclear or even incorrect, so
>please comment :)
>
>Thanks,
>
>-- Ken
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
