Hi,
I'm a Hadoop newbie trying to run Nutch 2.3, with HBase as the backend, on
EMR. Since Nutch uses hadoop-1.2.0, we chose AMI version 2.4.2, which
comes with Hadoop 1.0.3 and HBase 0.92.0.
After I build Nutch, it crawls without problems in local mode. But when it is
run in distributed mode, the job stops at the injector step with the following
error:
Injecting seed URLs
/home/hadoop/.../runtime/deploy/bin/nutch inject s3://myemrbucket/urls/
-crawlId 8
15/11/16 11:23:21 INFO crawl.InjectorJob: InjectorJob: starting at
2015-11-16 11:23:21
15/11/16 11:23:21 INFO crawl.InjectorJob: InjectorJob: Injecting
urlDir: s3://myemrbucket/urls
15/11/16 11:23:21 INFO s3native.NativeS3FileSystem: Created AmazonS3
with InstanceProfileCredentialsProvider
15/11/16 11:23:23 WARN store.HBaseStore: Mismatching schema's names.
Mappingfile schema: 'webpage'. PersistentClass schema's name:
'8_webpage'Assuming they are the same.
15/11/16 11:23:23 INFO crawl.InjectorJob: InjectorJob: Using class
org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/16 11:23:23 INFO mapred.JobClient: Default number of map tasks:
null
15/11/16 11:23:23 INFO mapred.JobClient: Setting default number of map
tasks based on cluster size to : 2
15/11/16 11:23:23 INFO mapred.JobClient: Default number of reduce
tasks: 0
15/11/16 11:23:24 INFO security.ShellBasedUnixGroupsMapping: add hadoop
to shell userGroupsCache
15/11/16 11:23:24 INFO mapred.JobClient: Setting group to hadoop
15/11/16 11:23:25 INFO input.FileInputFormat: Total input paths to
process : 1
15/11/16 11:23:25 INFO lzo.GPLNativeCodeLoader: Loaded native gpl
library
15/11/16 11:23:25 WARN lzo.LzoCodec: Could not find build properties
file with revision hash
15/11/16 11:23:25 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev UNKNOWN]
15/11/16 11:23:25 WARN snappy.LoadSnappy: Snappy native library is
available
15/11/16 11:23:25 INFO snappy.LoadSnappy: Snappy native library loaded
15/11/16 11:23:25 INFO mapred.JobClient: Running job:
job_201511101059_0054
15/11/16 11:23:26 INFO mapred.JobClient: map 0% reduce 0%
15/11/16 11:23:46 INFO mapred.JobClient: Task Id :
attempt_201511101059_0054_m_000000_0, Status : FAILED
Error:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;
15/11/16 11:23:55 INFO mapred.JobClient: Task Id :
attempt_201511101059_0054_m_000000_1, Status : FAILED
Error:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;
15/11/16 11:24:04 INFO mapred.JobClient: Task Id :
attempt_201511101059_0054_m_000000_2, Status : FAILED
Error:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;
15/11/16 11:24:19 INFO mapred.JobClient: Job complete:
job_201511101059_0054
15/11/16 11:24:19 INFO mapred.JobClient: Counters: 7
15/11/16 11:24:19 INFO mapred.JobClient: Job Counters
15/11/16 11:24:19 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=27542
15/11/16 11:24:19 INFO mapred.JobClient: Total time spent by all
reduces waiting after reserving slots (ms)=0
15/11/16 11:24:19 INFO mapred.JobClient: Total time spent by all
maps waiting after reserving slots (ms)=0
15/11/16 11:24:19 INFO mapred.JobClient: Rack-local map tasks=4
15/11/16 11:24:19 INFO mapred.JobClient: Launched map tasks=4
15/11/16 11:24:19 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
15/11/16 11:24:19 INFO mapred.JobClient: Failed map tasks=1
15/11/16 11:24:19 ERROR crawl.InjectorJob: InjectorJob:
java.lang.RuntimeException: job failed: name=[8]inject
s3://myemrbucket/urls, jobid=job_201511101059_0054
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
Error running:
/home/hadoop/.../runtime/deploy/bin/nutch inject
s3://myemrbucket/urls/ -crawlId 8
This is happening because of the version mismatch between the HBase jars
used by Nutch (hbase-0.94.14.jar) and EMR (hbase-0.92.0.jar).
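For context, here is a minimal sketch (my own illustration, not code from Gora
or Nutch) of the API change that I believe triggers the NoSuchMethodError: in
HBase 0.94 the HColumnDescriptor setters return the descriptor itself so calls
can be chained, while in 0.92 setMaxVersions(int) returns void, so bytecode
compiled against 0.94 asks for a method signature the 0.92 jar doesn't have.

import org.apache.hadoop.hbase.HColumnDescriptor;

public class MismatchSketch {
    public static void main(String[] args) {
        // Compiled against hbase-0.94.x, where setMaxVersions(int) returns
        // HColumnDescriptor, so this chained call is legal at compile time.
        HColumnDescriptor family = new HColumnDescriptor("f").setMaxVersions(1);

        // On a cluster that only ships hbase-0.92.x, the task JVM resolves the
        // call by its full descriptor
        // "(I)Lorg/apache/hadoop/hbase/HColumnDescriptor;".
        // The 0.92 method is "(I)V" (void return), so resolution fails with
        // java.lang.NoSuchMethodError -- the same string the failed tasks report.
        System.out.println(family.getMaxVersions());
    }
}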
*Things I have tried*
-------------------
- Added the following to hadoop-user-env.sh, as described at
<http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-config_hadoop-user-env.sh.html>:
export HADOOP_USER_CLASSPATH_FIRST=true
export HADOOP_CLASSPATH=/home/hadoop/pathtomy/hbase-0.94.14.jar:$HADOOP_CLASSPATH
- Added my HBase jar to the task trackers' classpath using -libjars
(<http://grepalex.com/2013/02/25/hadoop-libjars/>):
hadoop jar apache-nutch-2.3.job org.apache.nutch.crawl.InjectorJob \
    -libjars s3://myemrbucket/hbase-0.94.14.jar s3://myemrbucket/urls/ -crawlId 8
- Added -D mapreduce.user.classpath.first=true to my crawl script, per
<http://stackoverflow.com/a/11697223/3999239> (I've also tried adding the
three options specified in that post)
- Added the following to build.xml, per
<http://stackoverflow.com/questions/682852/apache-ant-manifest-class-path>:
<manifest>
  <attribute name="Class-Path" value="./lib/"/>
</manifest>
- Tried adding
HADOOP_TASKTRACKER_OPTS="-classpath /home/hadoop/pathtomy/hbase-0.94.14.jar ${HADOOP_TASKTRACKER_OPTS}"
to hadoop-user-env.sh, per
<http://groups.google.com/a/cloudera.org/forum/#!topic/scm-users/rAsh09voJ8A>
(adding this causes the command `hadoop tasktracker restart` to fail)
- Replaced EMR's hbase-0.92.0.jar and hbase.jar with my hbase-0.94.14.jar
(this works, but I don't want to risk destabilizing the cluster just for my
application)
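Every failed attempt dies with the same error, which suggests the 0.92 jar
still wins on the task classpath. A hypothetical check (names are mine, not
part of Nutch) that could be run as a tiny standalone job, or copied into a
mapper's setup(), to see which jar the task JVM actually resolves:

import org.apache.hadoop.hbase.HColumnDescriptor;

// Hypothetical diagnostic: print the jar that the running JVM loads
// HColumnDescriptor from, to confirm whether 0.92 or 0.94 is picked up.
public class WhichHBaseJar {
    public static void main(String[] args) {
        System.out.println("HColumnDescriptor loaded from: "
                + HColumnDescriptor.class.getProtectionDomain()
                        .getCodeSource().getLocation());
    }
}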
None of these approaches works the way I want it to. How can I get past this
frustrating issue?
Thanks,