I would add Nutch to the list too, Tim :-) +1 from me.
--
Chris Mattmann
[email protected]

-----Original Message-----
From: "Allison, Timothy B." <[email protected]>
Reply-To: <[email protected]>
Date: Wednesday, July 15, 2015 at 4:38 AM
To: "[email protected]" <[email protected]>
Subject: robust Tika and Hadoop

>All,
>
> I’d like to fill out our Wiki a bit more on using Tika robustly within
>Hadoop. I’m aware of Behemoth [0], Nanite [1] and Morphlines [2]. I
>haven’t looked carefully into these packages yet.
>
> Does anyone have any recommendations for specific configurations/design
>patterns that will defend against oom and permanent hangs within Hadoop?
>
> Thank you!
>
> Best,
>
> Tim
>
>[0] https://github.com/DigitalPebble/behemoth
>[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
