As the name runtime/deploy suggests, it is used for exactly that purpose
;-) Just make sure HADOOP_HOME/bin is on the PATH and run the script;
that's all.
Look at the bottom of the nutch script for details.
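For anyone following along, a minimal sketch of what that looks like on a cluster. The install paths, seed directory, Solr URL, and round count below are assumptions for illustration; adjust them to your own setup:

```shell
# Assumed locations; substitute your own Hadoop install and Nutch checkout.
export HADOOP_HOME=/opt/hadoop
export PATH="$HADOOP_HOME/bin:$PATH"   # bin/crawl and bin/nutch invoke the hadoop command

cd apache-nutch-1.7/runtime/deploy     # built by 'ant runtime'; contains the Nutch job jar

# Nutch 1.7 crawl script usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
bin/crawl urls crawl http://localhost:8983/solr/ 2
```

With HADOOP_HOME/bin on the PATH, the script picks up the job jar in runtime/deploy and submits each phase (inject, generate, fetch, parse, updatedb, ...) to the cluster instead of running them locally.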

Julien

PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
http://sched.co/1pbE15n) where we'll cover things like these



On 29 August 2014 14:30, S.L <[email protected]> wrote:

> Thanks, can this be used on a Hadoop cluster?
>
> Sent from my HTC
>
> ----- Reply message -----
> From: "Julien Nioche" <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: Nutch 1.7 fetch happening in a single map task.
> Date: Fri, Aug 29, 2014 9:00 AM
>
> See
> http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
>
> just go to runtime/deploy/bin and run the script from there.
>
> Julien
>
>
> On 29 August 2014 13:38, Meraj A. Khan <[email protected]> wrote:
>
> > Hi Julien,
> >
> > I have 15 domains and they are all being fetched in a single map task,
> > which does not fetch all the URLs no matter what depth or topN I give.
> >
> > I am submitting the Nutch job jar, which seems to be using the Crawl.java
> > class. How do I use the crawl script on a Hadoop cluster? Are there any
> > pointers you can share?
> >
> > Thanks.
> > On Aug 29, 2014 4:40 AM, "Julien Nioche" <[email protected]>
> > wrote:
> >
> > > Hi Meraj,
> > >
> > > The generator will place all the URLs in a single segment if they all
> > > belong to the same host, for politeness reasons. Otherwise it will use
> > > whichever value is passed with the -numFetchers parameter in the
> > > generation step.
> > >
> > > Why don't you use the crawl script in /bin instead of tinkering with the
> > > (now deprecated) Crawl class? It comes with a good default configuration
> > > and should make your life easier.
> > >
> > > Julien
> > >
> > >
> > > On 28 August 2014 06:47, Meraj A. Khan <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am running Nutch 1.7 on a Hadoop 2.3.0 cluster and I noticed that
> > > > there is only a single reducer in the generate-partition job. I am
> > > > running into a situation where the subsequent fetch runs in only a
> > > > single map task (I believe as a consequence of the single reducer in
> > > > the earlier phase). How can I force Nutch to fetch in multiple map
> > > > tasks? Is there a setting to force more than one reducer in the
> > > > generate-partition job so as to get more map tasks?
> > > >
> > > > Please also note that I have commented out the code in Crawl.java to
> > > > skip the LinkInversion phase, as I don't need the scoring of the URLs
> > > > that Nutch crawls; every URL is equally important to me.
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > >
> >
>
>
>
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
