Thank you, Ken!

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, July 21, 2015 10:23 AM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

Responses inline below.

-- Ken

________________________________

From: Allison, Timothy B.

Sent: July 21, 2015 5:29:37am PDT

To: user@tika.apache.org

Subject: RE: robust Tika and Hadoop

Ken,
  To confirm your strategy: spawn one new Thread for each call to Tika, add 
timeout exception handling, and orphan the thread on timeout.

Correct.
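
For reference, here's a minimal sketch of that pattern (the class name and 
the 30-second budget are illustrative assumptions, not Bixo's actual code - 
that lives in TikaCallable and SimpleParser, linked below):

import java.io.InputStream;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TimeoutParser {
    private static final long MAX_PARSE_MS = 30_000L;  // assumed budget

    public static String parseOrOrphan(final InputStream is) throws Exception {
        FutureTask<String> task = new FutureTask<>(() -> {
            BodyContentHandler handler = new BodyContentHandler(-1);
            new AutoDetectParser().parse(is, handler, new Metadata(),
                new ParseContext());
            return handler.toString();
        });
        Thread t = new Thread(task, "tika-parse");
        t.setDaemon(true);  // a daemon thread can't keep a finished JVM alive
        t.start();
        try {
            return task.get(MAX_PARSE_MS, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Deliberately no interrupt or join: orphan the hung thread
            // and let bounded JVM reuse clean it up later.
            throw e;
        }
    }
}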



Out of curiosity, three questions:
1)      If I had more time to read your code, the answer would be 
obvious... sorry. How are you organizing your ingest?  Are you concatenating 
files into a SequenceFile or doing something else?  Are you processing each 
file in a single map step, or batching files in your mapper?

Files are effectively concatenated, as each record (Cascading Tuple, or Hadoop 
KV pair) has the raw bytes plus a bunch of other data (headers returned, etc.).

The parse phase is a map operation, so it's batch processing of all files 
successfully downloaded during that fetch loop.
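
As a rough illustration only (the key/value types and counter names are 
assumptions, and this is plain MapReduce rather than Cascading), that parse 
map step could look like:

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParseMapper extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    protected void map(Text url, BytesWritable raw, Context ctx)
            throws IOException, InterruptedException {
        String text;
        try {
            // TimeoutParser is the timeout-and-orphan sketch from above.
            text = TimeoutParser.parseOrOrphan(
                new ByteArrayInputStream(raw.copyBytes()));
        } catch (Exception e) {
            // Timed out or failed to parse: count it and skip this document.
            ctx.getCounter("parse", "failed").increment(1);
            return;
        }
        ctx.write(url, new Text(text));
    }
}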


2)      Somewhat related to the first question, in addition to orphaning the 
parsing thread, are you doing anything else, like setting a maximum number of 
tasks per JVM?  Are you configuring the max number of retries, etc.?

If by "tasks per JVM" you mean the # of times we reuse the JVM, then yes - 
otherwise the orphan threads would eventually clog things up.

For retries, typically we don't set it (so it defaults to 4), but in practice 
I'd recommend something like 2 - that way you get one retry and then it fails; 
otherwise you typically fail four times on that error that could never possibly 
happen, but does.
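
For reference, on Hadoop 1.x those two knobs are mapred.job.reuse.jvm.num.tasks 
and mapred.map.max.attempts (MR2 renames the latter to 
mapreduce.map.maxattempts). A sketch with assumed values:

import org.apache.hadoop.conf.Configuration;

public class JobSettings {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Bound JVM reuse so orphaned parser threads die when the JVM
        // exits (the value 10 is an assumption, not from this thread).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 10);
        // One retry and then fail, instead of the default 4 attempts.
        conf.setInt("mapred.map.max.attempts", 2);
        return conf;
    }
}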


3)      Are you adding the AutoDetectParser to your ParseContext so that you'll 
get content from embedded files?

No, not typically, as we're usually ignoring archive files. But that's a good 
point - with current versions of Tika we could now handle those more easily. It 
gets a bit tricky, though, as the UID for content is the URL, and now we'd have 
multiple sub-docs that we'd want to index separately.
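
For anyone who does want embedded documents, the usual Tika pattern is to 
register the parser in the ParseContext so it recurses into attachments; a 
minimal sketch:

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class EmbeddedAwareParse {
    public static String parse(InputStream is) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Registering the parser in the context tells Tika to re-use it
        // for embedded documents (archive entries, attachments, etc.).
        context.set(Parser.class, parser);
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(is, handler, new Metadata(), context);
        // Embedded content is appended to the main document's text.
        return handler.toString();
    }
}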


From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Monday, July 20, 2015 7:21 PM
To: user@tika.apache.org
Subject: RE: robust Tika and Hadoop

Hi Tim,

When we use Tika with Bixo (https://github.com/bixo/bixo/) we wrap it with a 
TikaCallable 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java)

This lets us orphan the parsing thread if it times out 
(https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/SimpleParser.java#L187)

It also provides a bit of protection against things like NoSuchMethodError, 
which Tika can throw if the mime-type detection code tries to use a parser 
that we've excluded to keep the Hadoop job jar size reasonable.
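
A minimal sketch of that guard (doTikaParse() is a hypothetical stand-in for 
the real parse call):

import java.util.concurrent.Callable;

public class GuardedCallable implements Callable<String> {
    @Override
    public String call() throws Exception {
        try {
            return doTikaParse();
        } catch (Throwable t) {
            // Catch Throwable, not just Exception, so Errors such as
            // NoSuchMethodError (from a parser excluded from the job jar)
            // fail only this one document instead of killing the task.
            throw new Exception("Tika parse failed", t);
        }
    }

    // Hypothetical stand-in for the real parser.parse(...) call.
    private String doTikaParse() throws Exception {
        return "";
    }
}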

-- Ken

________________________________

From: Allison, Timothy B.

Sent: July 15, 2015 4:38:56am PDT

To: user@tika.apache.org

Subject: robust Tika and Hadoop

All,

  I'd like to fill out our Wiki a bit more on using Tika robustly within 
Hadoop.  I'm aware of Behemoth [0], Nanite [1] and Morphlines [2].  I haven't 
looked carefully into these packages yet.

  Does anyone have any recommendations for specific configurations/design 
patterns that will defend against OOMs and permanent hangs within Hadoop?

  Thank you!

        Best,

                  Tim


[0] https://github.com/DigitalPebble/behemoth
[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr