Hi Tim, here is my Tika with Hadoop project: http://freeeed.org/. It has been tested on the Enron data set and works quite well.

Mark

On Mon, Jul 20, 2015 at 6:20 PM, Ken Krugler <[email protected]> wrote:

> Hi Tim,
>
> When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it
> with a TikaCallable (
> https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java
> )
>
> This lets us orphan the parsing thread if it times out (
> https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187
> )
>
> And provides a bit of protection against things like NoSuchMethodErrors
> that can be thrown by Tika if the mime-type detection code tries to use a
> parser that we exclude, in order to keep the Hadoop job jar size to
> something reasonable.
>
> -- Ken
>
> ------------------------------
>
> *From:* Allison, Timothy B.
> *Sent:* July 15, 2015 4:38:56am PDT
> *To:* [email protected]
> *Subject:* robust Tika and Hadoop
>
> All,
>
> I’d like to fill out our Wiki a bit more on using Tika robustly within
> Hadoop. I’m aware of Behemoth [0], Nanite [1] and Morphlines [2]. I
> haven’t looked carefully into these packages yet.
>
> Does anyone have any recommendations for specific configurations/design
> patterns that will defend against oom and permanent hangs within Hadoop?
>
> Thank you!
>
> Best,
>
> Tim
>
> [0] https://github.com/DigitalPebble/behemoth
> [1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr

--
Mark Kerzner, President & CEO, SHMsoft <http://shmsoft.com/>,
To schedule a meeting with me: http://www.meetme.so/markkerzner
Mobile: 713-724-2534
Skype: mark.kerzner1
Office: One Riverway Suite 1700 Houston, TX 77056
*Privileged and Confidential * <http://shmsoft.com/>
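
For the wiki, the timeout pattern Ken describes (run the parse in its own thread and give up on it if it takes too long) might look roughly like the sketch below. This is not Bixo's TikaCallable. `TimedParse`, `runWithTimeout` and the fallback value are hypothetical names chosen for illustration, and the `Callable` stands in for a real Tika parse call. Only java.util.concurrent is used, so it assumes nothing about a particular Tika version.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or Bixo.
class TimedParse {

    // Daemon threads, so an orphaned (hung) parse cannot keep the JVM alive.
    private static final ThreadFactory DAEMON_FACTORY = r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    };

    /**
     * Runs parseTask in a separate thread and waits at most timeoutMillis.
     * Returns fallback if the task times out or throws anything at all,
     * including Errors such as NoSuchMethodError.
     */
    static <T> T runWithTimeout(Callable<T> parseTask, long timeoutMillis, T fallback) {
        // One executor per call: a hung thread is abandoned instead of
        // blocking the next document.
        ExecutorService executor = Executors.newSingleThreadExecutor(DAEMON_FACTORY);
        Future<T> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; a truly hung parser may ignore this,
            // in which case the thread is simply orphaned.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task threw; the original Throwable is e.getCause().
            return fallback;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the caller can see it.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // In a real job the Callable would wrap the Tika parse of one document.
        String fast = runWithTimeout(() -> "parsed", 1000L, "timed out");
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000L); // simulates a parser that hangs
            return "parsed";
        }, 200L, "timed out");
        System.out.println(fast);
        System.out.println(slow);
    }
}
```

This only covers hangs and thrown errors. It does nothing about OOM: an orphaned thread still holds whatever memory it has allocated, so the memory side would need a separate defence.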
