Hi Lewis,

Thanks for the reply! I actually managed to solve the problem and replied to
the thread with the solution; I wonder where that went. Now I am stuck on
another problem: I am not able to run Nutch in distributed mode using REST
(http://www.mail-archive.com/user%40nutch.apache.org/msg14024.html).
Can you tell me how I can solve this?



**How I solved the CLASSPATH problem**

For some reason, Hadoop on the EMR boxes does not pick up the "-D
mapreduce...=true" options. So I set the options in the bootstrap script
when spinning up the cluster instead:


1. Create a new custom mapred XML file in S3:

s3://myemrbucket/custom-mapred.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.user.classpath.first</name>
    <value>true</value>
  </property>
  <!-- only this should be sufficient, but included the below options also, just in case -->
  <property>
    <name>mapreduce.task.classpath.first</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.job.user.classpath.first</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.task.classpath.user.precedence</name>
    <value>true</value>
  </property>
</configuration>
--

2. Supply the file to the bootstrap action with the -M option when spinning
up the cluster:

--bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-M","s3://myemrbucket/custom-mapred.xml"]
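
In case it helps anyone who would rather script the two steps above, here is a rough Python sketch that writes a Hadoop-style configuration document from a dict and formats the bootstrap action string. The function names (build_hadoop_config, bootstrap_action) are just names I made up for illustration, and the sketch only covers the classpath properties; it does not upload anything to S3 or start a cluster.

```python
import xml.etree.ElementTree as ET

# Property names and values copied from custom-mapred.xml in step 1.
CLASSPATH_PROPERTIES = {
    "mapreduce.user.classpath.first": "true",
    "mapreduce.task.classpath.first": "true",
    "mapreduce.job.user.classpath.first": "true",
    "mapreduce.task.classpath.user.precedence": "true",
}


def build_hadoop_config(properties):
    """Return Hadoop-style configuration XML for a dict of name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


def bootstrap_action(config_s3_path):
    """Format the configure-hadoop bootstrap action value used in step 2."""
    return (
        "Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,"
        'Args=["-M","%s"]' % config_s3_path
    )


if __name__ == "__main__":
    # Write the file locally; copying it to the bucket is a separate step.
    with open("custom-mapred.xml", "w") as handle:
        handle.write(build_hadoop_config(CLASSPATH_PROPERTIES))
    print(bootstrap_action("s3://myemrbucket/custom-mapred.xml"))
```

The output is not pretty-printed, but Hadoop only needs it to be well-formed XML.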


This should give a cluster with the above options set in
/home/hadoop/conf/mapred-site.xml. I'd be happy to help anyone facing the
same issue.
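
If you want to confirm that the options actually landed on the cluster, something like the Python sketch below could be run on the master node. It is only an illustration: the helper names (read_hadoop_properties, missing_settings) are made up here, and I am assuming the file keeps the usual Hadoop <configuration>/<property> layout shown above.

```python
import xml.etree.ElementTree as ET

# The setting that should be enough on its own, per the config above.
EXPECTED = {"mapreduce.user.classpath.first": "true"}


def read_hadoop_properties(xml_text):
    """Parse Hadoop configuration XML into a dict of name -> value."""
    root = ET.fromstring(xml_text)
    result = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            result[name] = value
    return result


def missing_settings(xml_text, expected):
    """Return expected settings that are absent or differ, with the found value."""
    actual = read_hadoop_properties(xml_text)
    return {
        name: actual.get(name)
        for name, value in expected.items()
        if actual.get(name) != value
    }


if __name__ == "__main__":
    # Path taken from the message above.
    with open("/home/hadoop/conf/mapred-site.xml") as handle:
        problems = missing_settings(handle.read(), EXPECTED)
    print("all set" if not problems else "not set as expected: %r" % problems)
```

An empty result from missing_settings means every expected property is present with the expected value.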

Thanks,

On Fri, Nov 20, 2015 at 1:28 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Ketan,
>
> <
> http://www.mail-archive.com/[email protected]&q=from:%22Ketan+Bhokray%22
> >
> On Wed, Nov 18, 2015 at 2:00 AM, <[email protected]>
> wrote:
>
> >
> > Nutch+Hbase on EMR CLASSPATH issue
> >         31865 by: Ketan Bhokray
> >
> >
> > I'm a hadoop newbie and trying to run Nutch 2.3, with Hbase as backend, on
> > EMR. Since Nutch uses hadoop-1.2.0, we chose the AMI version:2.4.2 which
> > comes with Hadoop 1.0.3 and HBase 0.92.0.
> >
> > When I build Nutch, it is crawling without problem on local mode. But when
> > run in distributed mode, the job stops at injector step with the following
> > exception:
>
>
> Can you please try running 2.X-SNAPSHOT from source?
> http://svn.apache.org/repos/asf/nutch/branches/2.x/
> This works with the following stack
>
> Apache Avro 1.7.6, Apache Hadoop 1.2.1 and 2.5.2, Apache HBase
> 0.98.8-hadoop2 (although also tested with 1.X), Apache Cassandra 2.0.2,
> Apache Solr 4.10.3, MongoDB 2.6.X, Apache Accumulo 1.5.1, Apache Spark 1.4.1
>
> Please let us know how you get on. If this does not work then Nutch trunk
> runs flawlessly on EMR.
> Thanks
>



-- 
*KETAN BHOKRAY*
_______________________________________________________
*DEPARTMENT PLACEMENT COORDINATOR*
*AEROSPACE DEPARTMENT | IIT BOMBAY*
_______________________________________________________
*Mob:* +91-9619589938 | *Alt e-mail:* [email protected]

*“If it falls to your lot to be a street sweeper, sweep streets like
Michelangelo painted pictures. Sweep streets so well that all the host of
heaven and earth will have to pause and say: Here lived a great street
sweeper who swept his job well.”   [Martin Luther King Jr.]*
