All,
I'd like to fill out our Wiki a bit more on using Tika robustly within
Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't
looked carefully into these packages yet.
Does anyone have recommendations for specific configurations or design
patterns that defend against OOM errors and permanent hangs within Hadoop?
Thank you!
Best,
Tim
[0] https://github.com/DigitalPebble/behemoth
[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/