Hi,
I'm a Hadoop newbie trying to run Nutch 2.3, with HBase as the backend, on
EMR. Since Nutch uses hadoop-1.2.0, we chose AMI version 2.4.2, which
comes with Hadoop 1.0.3 and HBase 0.92.0.
After I build Nutch, it crawls without problems in local mode. But when it is
run in distributed mode, the job stops at the injector step with the following
error:
Injecting seed URLs
/home/hadoop/.../runtime/deploy/bin/nutch inject s3://myemrbucket/urls/
-crawlId 8
15/11/16 11:23:21 INFO crawl.InjectorJob: InjectorJob: starting at
2015-11-16 11:23:21
15/11/16 11:23:21 INFO crawl.InjectorJob: InjectorJob: Injecting
urlDir: s3://myemrbucket/urls
15/11/16 11:23:21 INFO s3native.NativeS3FileSystem: Created AmazonS3
with InstanceProfileCredentialsProvider
15/11/16 11:23:23 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'8_webpage'Assuming they are the same.
15/11/16 11:23:23 INFO crawl.InjectorJob: InjectorJob: Using class
org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/16 11:23:23 INFO mapred.JobClient: Default number of map tasks:
null
15/11/16 11:23:23 INFO mapred.JobClient: Setting default number of map
tasks based on cluster size to : 2
15/11/16 11:23:23 INFO mapred.JobClient: Default number of reduce
tasks: 0
15/11/16 11:23:24 INFO security.ShellBasedUnixGroupsMapping: add hadoop
to shell userGroupsCache
15/11/16 11:23:24 INFO mapred.JobClient: Setting group to hadoop
15/11/16 11:23:25 INFO input.FileInputFormat: Total input paths to
process : 1
15/11/16 11:23:25 INFO lzo.GPLNativeCodeLoader: Loaded native gpl
library
15/11/16 11:23:25 WARN lzo.LzoCodec: Could not find build properties
file with revision hash
15/11/16 11:23:25 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev UNKNOWN]
15/11/16 11:23:25 WARN snappy.LoadSnappy: Snappy native library is
available
15/11/16 11:23:25 INFO snappy.LoadSnappy: Snappy native library loaded
15/11/16 11:23:25 INFO mapred.JobClient: Running job:
job_201511101059_0054
15/11/16 11:23:26 INFO mapred.JobClient: map 0% reduce 0%
15/11/16 11:23:46 INFO mapred.JobClient: Task Id :
attempt_201511101059_0054_m_000000_0, Status : FAILED
Error:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;
15/11/16 11:23:55 INFO mapred.JobClient: Task Id :
attempt_201511101059_0054_m_000000_1, Status : FAILED
Error:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;
15/11/16 11:24:04 INFO mapred.JobClient: Task Id :
attempt_201511101059_0054_m_000000_2, Status : FAILED
Error:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;
15/11/16 11:24:19 INFO mapred.JobClient: Job complete:
job_201511101059_0054
15/11/16 11:24:19 INFO mapred.JobClient: Counters: 7
15/11/16 11:24:19 INFO mapred.JobClient: Job Counters
15/11/16 11:24:19 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=27542
15/11/16 11:24:19 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
15/11/16 11:24:19 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
15/11/16 11:24:19 INFO mapred.JobClient: Rack-local map tasks=4
15/11/16 11:24:19 INFO mapred.JobClient: Launched map tasks=4
15/11/16 11:24:19 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
15/11/16 11:24:19 INFO mapred.JobClient: Failed map tasks=1
15/11/16 11:24:19 ERROR crawl.InjectorJob: InjectorJob:
java.lang.RuntimeException: job failed: name=[8]inject
s3://myemrbucket/urls, jobid=job_201511101059_0054
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Error running:
/home/hadoop/.../runtime/deploy/bin/nutch inject
s3://myemrbucket/urls/ -crawlId 8
This is happening because of the version mismatch between the HBase jars
used by Nutch (hbase-0.94.14.jar) and EMR (hbase-0.92.0.jar).
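For context, here is a minimal sketch (my own illustration, not code from Gora
or Nutch) of the API change that I believe triggers the NoSuchMethodError: in
HBase 0.94 the HColumnDescriptor setters return the descriptor itself so calls
can be chained, while in 0.92 setMaxVersions(int) returns void, so bytecode
compiled against 0.94 asks for a method signature the 0.92 jar doesn't have.

import org.apache.hadoop.hbase.HColumnDescriptor;

public class MismatchSketch {
    public static void main(String[] args) {
        // Compiled against hbase-0.94.x, where setMaxVersions(int) returns
        // HColumnDescriptor, so this chained call is legal at compile time.
        HColumnDescriptor family = new HColumnDescriptor("f").setMaxVersions(1);

        // On a cluster that only ships hbase-0.92.x, the task JVM resolves the
        // call by its full descriptor
        // "(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;".
        // The 0.92 method is "(I)V" (void return), so resolution fails with
        // java.lang.NoSuchMethodError -- the same string the failed tasks report.
        System.out.println(family.getMaxVersions());
    }
}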
*Things I have tried*
-------------------
- Added the following to hadoop-user-env.sh, as described at
<http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-config_hadoop-user-env.sh.html>:
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=/home/hadoop/pathtomy/hbase-0.94.14.jar:$HADOOP_CLASSPATH
- Added my HBase jar to the task trackers' classpath using -libjars
(<http://grepalex.com/2013/02/25/hadoop-libjars/>):
hadoop jar apache-nutch-2.3.job org.apache.nutch.crawl.InjectorJob \
    -libjars s3://myemrbucket/hbase-0.94.14.jar s3://myemrbucket/urls/ -crawlId 8
- Added -D mapreduce.user.classpath.first=true to my crawl script, per
<http://stackoverflow.com/a/11697223/3999239> (I've also tried adding the
three options specified in that post)
- Added the following to build.xml, per
<http://stackoverflow.com/questions/682852/apache-ant-manifest-class-path>:
<manifest>
  <attribute name="Class-Path" value="./lib/"/>
</manifest>
- Tried adding
HADOOP_TASKTRACKER_OPTS="-classpath /home/hadoop/pathtomy/hbase-0.94.14.jar ${HADOOP_TASKTRACKER_OPTS}"
to hadoop-user-env.sh, per
<http://groups.google.com/a/cloudera.org/forum/#!topic/scm-users/rAsh09voJ8A>
(adding this causes the command `hadoop tasktracker restart` to fail)
- Replaced EMR's hbase-0.92.0.jar and hbase.jar with my hbase-0.94.14.jar
(this works, but I don't want to risk destabilizing the cluster just for my
application)
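Every failed attempt dies with the same error, which suggests the 0.92 jar
still wins on the task classpath. A hypothetical check (names are mine, not
part of Nutch) that could be run as a tiny standalone job, or copied into a
mapper's setup(), to see which jar the task JVM actually resolves:

import org.apache.hadoop.hbase.HColumnDescriptor;

// Hypothetical diagnostic: print the jar that the running JVM loads
// HColumnDescriptor from, to confirm whether 0.92 or 0.94 is picked up.
public class WhichHBaseJar {
    public static void main(String[] args) {
        System.out.println("HColumnDescriptor loaded from: "
                + HColumnDescriptor.class.getProtectionDomain()
                        .getCodeSource().getLocation());
    }
}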
None of these approaches works the way I want it to. How can I get past this
frustrating issue?
Thanks,