Ken,
To confirm your strategy: one new Thread for each call to Tika, add timeout
exception handling, orphan the thread.
Out of curiosity, three questions:
1) If I had more time to read your code, the answer would be obvious, sorry.
How are you organizing your ingest? Are you concatenating files into a
SequenceFile or doing something else? Are you processing each file in a
single map step, or batching files in your mapper?
2) Somewhat related to the first question: in addition to orphaning the
parsing thread, are you doing anything else, like setting a maximum number of
tasks per JVM? Are you configuring a maximum number of retries, etc.?
3) Are you adding the AutoDetectParser to your ParseContext so that you'll
get content from embedded files?
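[For readers of the archive: question 3 refers to Tika's recursive-parsing hook, i.e. registering the parser itself in the ParseContext so container formats yield content from their embedded documents. A minimal sketch, assuming Tika is on the classpath; the sample input here is just made-up plain text:]

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class EmbeddedParse {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Without this line, container formats (zip, msg, docx with
        // attachments, ...) yield only top-level content; with it, Tika
        // recurses into embedded documents using the same parser.
        context.set(Parser.class, parser);

        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new ByteArrayInputStream(
                "hello from tika".getBytes(StandardCharsets.UTF_8))) {
            parser.parse(in, handler, metadata, context);
        }
        System.out.println(handler.toString().trim());
    }
}
```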
Thank you, again.
Best,
Tim
From: Ken Krugler [mailto:[email protected]]
Sent: Monday, July 20, 2015 7:21 PM
To: [email protected]
Subject: RE: robust Tika and Hadoop
Hi Tim,
When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a
TikaCallable
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)
This lets us orphan the parsing thread if it times out
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)
And it provides a bit of protection against things like NoSuchMethodErrors,
which Tika can throw if the mime-type detection code tries to use a parser
that we exclude in order to keep the Hadoop job jar at a reasonable size.
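[For readers of the archive: the orphaned-thread timeout pattern Ken describes can be sketched in plain Java, with no Tika dependency. The `parseWithTimeout` helper and the timeout values are illustrative placeholders, not Bixo's actual API:]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutParse {
    // Run one unit of work on a fresh daemon thread; if it exceeds the
    // timeout, abandon (orphan) the thread instead of blocking the task.
    static String parseWithTimeout(Callable<String> work, long timeoutMs)
            throws Exception {
        FutureTask<String> task = new FutureTask<>(work);
        Thread t = new Thread(task, "tika-parse");
        t.setDaemon(true); // a daemon thread won't keep the JVM alive
        t.start();
        try {
            return task.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Orphan the thread: interrupt it and move on without joining.
            t.interrupt();
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Fast work completes; hung work is abandoned after the timeout.
        String ok = parseWithTimeout(() -> "parsed", 1000);
        String hung = parseWithTimeout(() -> {
            Thread.sleep(60_000); // simulate a permanent parser hang
            return "never";
        }, 100);
        System.out.println(ok + " / " + hung); // prints "parsed / null"
    }
}
```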
-- Ken
________________________________
From: Allison, Timothy B.
Sent: July 15, 2015 4:38:56am PDT
To: [email protected]
Subject: robust Tika and Hadoop
All,
I'd like to fill out our Wiki a bit more on using Tika robustly within
Hadoop. I'm aware of Behemoth [0], Nanite [1] and Morphlines [2]. I haven't
looked carefully into these packages yet.
Does anyone have any recommendations for specific configurations/design
patterns that will defend against OOMs and permanent hangs within Hadoop?
Thank you!
Best,
Tim
[0] https://github.com/DigitalPebble/behemoth
[1]
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2]
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
--------------------------