Julien,

Thank you for the decisive advice. Using the crawl script seems to have
solved the problem of the crawl terminating abruptly; the bin/crawl
script respects the depth and topN parameters and iterates accordingly.

However, I have an issue with the number of map tasks used for the
fetch phase: it is always 1. I see that the script sets the numFetchers
parameter at generate time equal to the number of slaves, which is 3 in
my case, yet only a single map task is being used, under-utilizing my
Hadoop cluster and slowing down the crawl.

I see that after the CrawlDb update phase there are millions of
'db_unfetched' URLs, yet the generate phase only creates a single
segment with about 20-30k URLs, and as a result only a single map task
is used for the fetch phase. I guess I need to make the generate phase
produce more than one segment; how do I do that using the bin/crawl
script?
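For reference, this is roughly what I would expect the manual generate step to look like if I ran it outside the script to split the fetch list (the -numFetchers and -maxNumSegments flags are my reading of the Generator options; the crawldb/segments paths are placeholders for my own):

```shell
# Hypothetical manual generate step (paths are placeholders).
# -numFetchers controls how the fetch list is partitioned across hosts,
# -maxNumSegments allows more than one segment per generate round,
# so the subsequent fetch could run in up to 3 map tasks:
bin/nutch generate crawl/crawldb crawl/segments \
  -topN 50000 \
  -numFetchers 3 \
  -maxNumSegments 3
```

Is something along these lines what the bin/crawl script should be doing, or is there a cleaner way to get more segments per round?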

Please note that this is for Nutch 1.7 on Hadoop 2.3.0.

Thanks.


On Fri, Aug 29, 2014 at 10:39 AM, Julien Nioche <
[email protected]> wrote:

> No, just do 'bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>' from
> the master node. It internally calls the nutch script for the individual
> commands, which takes care of sending the job jar to your hadoop cluster,
> see https://github.com/apache/nutch/blob/trunk/src/bin/nutch#L271
>
>
>
>
> On 29 August 2014 15:24, S.L <[email protected]> wrote:
>
> > Sorry Julien , I overlooked the directory names.
> >
> > My understanding is that a Hadoop job is submitted to a cluster by
> > running the following command on the RM node: bin/hadoop .job file
> > <params>
> >
> > Are you suggesting I submit the script instead of the Nutch .job jar like
> > below?
> >
> > bin/hadoop  bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
> >
> >
> > On Fri, Aug 29, 2014 at 10:01 AM, Julien Nioche <
> > [email protected]> wrote:
> >
> > > As the name runtime/deploy suggests - it is used exactly for that
> > > purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and
> > > run the script, that's all.
> > > Look at the bottom of the nutch script for details.
> > >
> > > Julien
> > >
> > > PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
> > > http://sched.co/1pbE15n) where we'll cover things like these
> > >
> > >
> > >
> > > On 29 August 2014 14:30, S.L <[email protected]> wrote:
> > >
> > > > Thanks, can this be used on a hadoop cluster?
> > > >
> > > > Sent from my HTC
> > > >
> > > > ----- Reply message -----
> > > > From: "Julien Nioche" <[email protected]>
> > > > To: "[email protected]" <[email protected]>
> > > > Subject: Nutch 1.7 fetch happening in a single map task.
> > > > Date: Fri, Aug 29, 2014 9:00 AM
> > > >
> > > > See
> > > >
> > http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
> > > >
> > > > just go to runtime/deploy/bin and run the script from there.
> > > >
> > > > Julien
> > > >
> > > >
> > > > On 29 August 2014 13:38, Meraj A. Khan <[email protected]> wrote:
> > > >
> > > > > Hi Julien,
> > > > >
> > > > > I have 15 domains and they are all being fetched in a single map
> > > > > task which does not fetch all the URLs no matter what depth or
> > > > > topN I give.
> > > > >
> > > > > I am submitting the Nutch job jar which seems to be using the
> > > Crawl.java
> > > > > class, how do I use the Crawl script on a Hadoop cluster, are there
> > any
> > > > > pointers you can share?
> > > > >
> > > > > Thanks.
> > > > > On Aug 29, 2014 4:40 AM, "Julien Nioche" <
> > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi Meraj,
> > > > > >
> > > > > > The generator will place all the URLs in a single segment if
> > > > > > they all belong to the same host, for politeness reasons.
> > > > > > Otherwise it will use whichever value is passed with the
> > > > > > -numFetchers parameter in the generation step.
> > > > > >
> > > > > > Why don't you use the crawl script in /bin instead of tinkering
> > > > > > with the (now deprecated) Crawl class? It comes with a good
> > > > > > default configuration and should make your life easier.
> > > > > >
> > > > > > Julien
> > > > > >
> > > > > >
> > > > > > On 28 August 2014 06:47, Meraj A. Khan <[email protected]>
> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I
> > > > > > > noticed that there is only a single reducer in the generate
> > > > > > > partition job. I am running into a situation where the
> > > > > > > subsequent fetch runs in only a single map task (I believe as
> > > > > > > a consequence of the single reducer in the earlier phase).
> > > > > > > How can I force Nutch to fetch in multiple map tasks? Is
> > > > > > > there a setting to force more than one reducer in the
> > > > > > > generate-partition job, so as to have more map tasks?
> > > > > > >
> > > > > > > Please also note that I have commented out the code in
> > > > > > > Crawl.java so as not to do the LinkInversion phase, as I
> > > > > > > don't need the scoring of the URLs that Nutch crawls; every
> > > > > > > URL is equally important to me.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Open Source Solutions for Text Engineering
> > > > > >
> > > > > > http://digitalpebble.blogspot.com/
> > > > > > http://www.digitalpebble.com
> > > > > > http://twitter.com/digitalpebble
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Open Source Solutions for Text Engineering
> > > >
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> > > > http://twitter.com/digitalpebble
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
