awesome work Mark and Ken

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Mark Kerzner <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, July 20, 2015 at 4:22 PM
To: Tika User <[email protected]>
Subject: Re: robust Tika and Hadoop

>Hi, Tim,
>
>here is my Tika with Hadoop project, tested on Enron,
>http://freeeed.org/, and it works quite well.
>
>Mark
>
>On Mon, Jul 20, 2015 at 6:20 PM, Ken Krugler
><[email protected]> wrote:
>
>Hi Tim,
>
>When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it
>with a TikaCallable
>(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)
>
>This lets us orphan the parsing thread if it times out
>(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)
>
>And provides a bit of protection against things like NoSuchMethodErrors
>that can be thrown by Tika if the mime-type detection code tries to use a
>parser that we exclude, in order to keep the Hadoop job jar size to
>something reasonable.
>
>-- Ken
>
>________________________________________
>From: Allison, Timothy B.
>Sent: July 15, 2015 4:38:56am PDT
>To: [email protected]
>Subject: robust Tika and Hadoop
>
>All,
>
> I’d like to fill out our Wiki a bit more on using Tika robustly within
>Hadoop. I’m aware of Behemoth [0], Nanite [1] and Morphlines [2]. I
>haven’t looked carefully into these packages yet.
>
> Does anyone have any recommendations for specific configurations/design
>patterns that will defend against oom and permanent hangs within Hadoop?
>
> Thank you!
>
> Best,
>
> Tim
>
>[0] https://github.com/DigitalPebble/behemoth
>[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
>--
>Mark Kerzner, President & CEO, SHMsoft <http://shmsoft.com/>,
>To schedule a meeting with me: http://www.meetme.so/markkerzner
>
>Mobile: 713-724-2534
>Skype: mark.kerzner1
>Office: One Riverway Suite 1700
>Houston, TX 77056
>
>Privileged and Confidential
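
For anyone who wants to see the shape of the pattern Ken describes (run the parse in a separate thread, give up on it after a timeout, and treat Errors such as NoSuchMethodError as a failed parse), here is a minimal sketch using only java.util.concurrent. It is not Bixo's TikaCallable or SimpleParser. The class name TimeLimitedParser, the parse() method, and the choice to return null on failure are all assumptions made for illustration, and the Callable<String> argument stands in for whatever code actually calls Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not Bixo code): runs one parse task under a time limit. */
final class TimeLimitedParser {

    // Daemon worker threads, so an orphaned (hung) parse cannot keep the JVM alive.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs parseTask and returns its result, or null if the task timed out
     * or failed. parseTask is a stand-in for the code that invokes Tika.
     */
    String parse(Callable<String> parseTask, long timeoutMillis) {
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker. If it ignores the interrupt, it is simply
            // left behind (orphaned) and the caller moves on.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // e.getCause() may be an Exception or an Error such as
            // NoSuchMethodError; either way this document is a failed parse.
            return null;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the calling thread and give up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return null;
        }
    }
}
```

A caller would do something like `new TimeLimitedParser().parse(() -> extractText(doc), 30_000)`, where extractText is a hypothetical method wrapping the Tika call. Note that this only guards against hangs and thrown Errors inside the task; a parse that exhausts the heap can still take down the whole JVM, so it does not by itself address the OOM part of Tim's question.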
