Hi Meraj,

The *nutch*.job file and the deploy directory are at the same level. Do I
need to change the location of the job file?

Thanks in advance,



On Mon, Sep 8, 2014 at 10:03 PM, Meraj A. Khan <[email protected]> wrote:

> AFAIK, the script does not go by the mode you set, but by the presence of
> the *nutch*.job file in the directory a level above the script itself,
> i.e. ../*.job.
>
> Can you please check if you have the Hadoop job file at the appropriate
> location?
>
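Meraj's point above can be sketched as a tiny shell check. This is my assumption of the logic, not the actual bin/crawl source, and the `detect_mode` helper is hypothetical: the script decides between local and distributed mode by looking for a *.job file one level above itself, not by any mode variable.

```shell
# Hypothetical sketch of the check described above: the run mode is
# derived from the presence of a *.job file, not from a mode setting.
detect_mode() {
  # $1: the directory one level above the crawl script (i.e. ..)
  if ls "$1"/*.job >/dev/null 2>&1; then
    echo distributed   # job file found: submit to Hadoop
  else
    echo local         # no job file: run locally
  fi
}
```

On this reading, keeping the nutch-*.job file in the directory above the script (e.g. in runtime/deploy when running runtime/deploy/bin/crawl) is what switches the script into distributed mode.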
> On Mon, Sep 8, 2014 at 9:22 AM, Simon Z <[email protected]> wrote:
>
> > Thank you very much, Meraj, for your reply. I also thought it was a typo.
> >
> > I had set numFetchers via numSlaves, and the generator's echo showed that
> > numFetchers is 8 (numTasks=`expr $numSlaves \* 2`, that is 4 times 2),
> > but the generator's output showed that the run mode is "local" and it
> > generated exactly one mapper, although I had changed mode=distributed.
> > Any idea about this, please?
> >
> > Many regards,
> >
> > Simon
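For reference, the numTasks arithmetic quoted above works out as follows (values taken from the thread: numSlaves=4):

```shell
# The derivation quoted above: numFetchers is numSlaves * 2.
numSlaves=4
numTasks=`expr $numSlaves \* 2`
echo "$numTasks"   # 8
```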
> >
> >
> >
> >
> > On Mon, Sep 8, 2014 at 7:18 AM, Meraj A. Khan <[email protected]> wrote:
> >
> > > I think that is a typo, and it is actually CrawlDirectory. As for the
> > > single map task issue, although I have not tried it yet, we can control
> > > the number of fetchers with the numFetchers parameter when doing the
> > > generate via bin/generate.
> > > On Sep 7, 2014 9:23 AM, "Simon Z" <[email protected]> wrote:
> > >
> > > > Hi Julien,
> > > >
> > > > What do you mean by "<crawlID>", please? I am using Nutch 1.8 and
> > > > followed the instructions in the tutorial mentioned before, and I seem
> > > > to have a similar situation, that is, fetch runs in only one map task.
> > > > I am running on a cluster of four nodes on Hadoop 2.4.1.
> > > >
> > > > Note that the map task can be assigned to any node, but only one map
> > > > runs each round.
> > > >
> > > > I have set
> > > >
> > > > numSlaves=4
> > > > mode=distributed
> > > >
> > > >
> > > > The seed URL list includes five different websites from different
> > > > hosts.
> > > >
> > > >
> > > > Is there any settings I missed out?
> > > >
> > > > Thanks in advance.
> > > >
> > > > Regards,
> > > >
> > > > Simon
> > > >
> > > >
> > > > On Fri, Aug 29, 2014 at 10:39 PM, Julien Nioche <[email protected]> wrote:
> > > >
> > > > > No, just do 'bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>'
> > > > > from the master node. It internally calls the nutch script for the
> > > > > individual commands, which takes care of sending the job jar to your
> > > > > Hadoop cluster, see
> > > > > https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 29 August 2014 15:24, S.L <[email protected]> wrote:
> > > > >
> > > > > > Sorry Julien, I overlooked the directory names.
> > > > > >
> > > > > > My understanding is that the Hadoop job is submitted to a cluster
> > > > > > by running the following command on the RM node: bin/hadoop <.job
> > > > > > file> <params>.
> > > > > >
> > > > > > Are you suggesting I submit the script instead of the Nutch .job
> > > > > > jar, like below?
> > > > > >
> > > > > > bin/hadoop bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche <[email protected]> wrote:
> > > > > >
> > > > > > > As the name runtime/deploy suggests, it is used exactly for that
> > > > > > > purpose ;-) Just make sure HADOOP_HOME/bin is added to the path
> > > > > > > and run the script, that's all.
> > > > > > > Look at the bottom of the nutch script for details.
> > > > > > >
> > > > > > > Julien
> > > > > > >
> > > > > > > PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU
> > > > > > > (http://sched.co/1pbE15n) where we'll cover things like these
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 29 August 2014 14:30, S.L <[email protected]> wrote:
> > > > > > >
> > > > > > > > Thanks, can this be used on a Hadoop cluster?
> > > > > > > >
> > > > > > > > Sent from my HTC
> > > > > > > >
> > > > > > > > ----- Reply message -----
> > > > > > > > From: "Julien Nioche" <[email protected]>
> > > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > > Subject: Nutch 1.7 fetch happening in a single map task.
> > > > > > > > Date: Fri, Aug 29, 2014 9:00 AM
> > > > > > > >
> > > > > > > > See
> > > > > > > > http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
> > > > > > > >
> > > > > > > > just go to runtime/deploy/bin and run the script from there.
> > > > > > > >
> > > > > > > > Julien
> > > > > > > >
> > > > > > > >
> > > > > > > > On 29 August 2014 13:38, Meraj A. Khan <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Julien,
> > > > > > > > >
> > > > > > > > > I have 15 domains and they are all being fetched in a single
> > > > > > > > > map task, which does not fetch all the URLs no matter what
> > > > > > > > > depth or topN I give.
> > > > > > > > >
> > > > > > > > > I am submitting the Nutch job jar, which seems to be using
> > > > > > > > > the Crawl.java class. How do I use the crawl script on a
> > > > > > > > > Hadoop cluster? Are there any pointers you can share?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > > On Aug 29, 2014 4:40 AM, "Julien Nioche" <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Meraj,
> > > > > > > > > >
> > > > > > > > > > The generator will place all the URLs in a single segment
> > > > > > > > > > if they all belong to the same host, for politeness
> > > > > > > > > > reasons. Otherwise it will use whichever value is passed
> > > > > > > > > > with the -numFetchers parameter in the generation step.
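The politeness rule Julien describes can be illustrated with a toy sketch (this is not Nutch code; the `count_hosts` helper is hypothetical): since URLs are keyed by host, a seed list drawn from a single host collapses into one group, and hence one fetch map task, regardless of numFetchers.

```shell
# Toy illustration of host-keyed grouping: same-host URLs land
# together, so one host means one partition (and one fetch map task).
count_hosts() {
  for u in "$@"; do
    echo "$u" | cut -d/ -f3   # extract the host part of the URL
  done | sort | uniq -c | sed 's/^ *//'
}
count_hosts http://a.com/1 http://a.com/2 http://b.com/1
```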
> > > > > > > > > >
> > > > > > > > > > Why don't you use the crawl script in /bin instead of
> > > > > > > > > > tinkering with the (now deprecated) Crawl class? It comes
> > > > > > > > > > with a good default configuration and should make your
> > > > > > > > > > life easier.
> > > > > > > > > >
> > > > > > > > > > Julien
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 28 August 2014 06:47, Meraj A. Khan <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I
> > > > > > > > > > > noticed that there is only a single reducer in the
> > > > > > > > > > > generate-partition job. I am running into a situation
> > > > > > > > > > > where the subsequent fetch runs in only a single map
> > > > > > > > > > > task (I believe as a consequence of the single reducer
> > > > > > > > > > > in the earlier phase). How can I force Nutch to fetch in
> > > > > > > > > > > multiple map tasks? Is there a setting to force more
> > > > > > > > > > > than one reducer in the generate-partition job so that
> > > > > > > > > > > there are more map tasks?
> > > > > > > > > > >
> > > > > > > > > > > Please also note that I have commented out the code in
> > > > > > > > > > > Crawl.java to skip the LinkInversion phase, as I don't
> > > > > > > > > > > need the scoring of the URLs that Nutch crawls; every
> > > > > > > > > > > URL is equally important to me.
> > > > > > > > > > >
> > > > > > > > > > > Thanks.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > >
> > > > > > > > > > Open Source Solutions for Text Engineering
> > > > > > > > > >
> > > > > > > > > > http://digitalpebble.blogspot.com/
> > > > > > > > > > http://www.digitalpebble.com
> > > > > > > > > > http://twitter.com/digitalpebble
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Open Source Solutions for Text Engineering
> > > > > > >
> > > > > > > http://digitalpebble.blogspot.com/
> > > > > > > http://www.digitalpebble.com
> > > > > > > http://twitter.com/digitalpebble
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
