Hi Julien,

I have 15 domains, and they are all being fetched in a single map task which
does not fetch all of the URLs no matter what depth or topN I give.

I am submitting the Nutch job jar, which seems to be using the Crawl.java
class. How do I use the crawl script on a Hadoop cluster? Are there any
pointers you can share?
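For reference, this is roughly how I am launching it at the moment (the seed
directory, output directory, depth and topN below are just placeholders):

    hadoop jar apache-nutch-1.7.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 1000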

Thanks.
On Aug 29, 2014 4:40 AM, "Julien Nioche" <lists.digitalpeb...@gmail.com>
wrote:

> Hi Meraj,
>
> The generator will place all the URLs in a single segment if they all
> belong to the same host, for politeness reasons. Otherwise it will use
> whichever value is passed with the -numFetchers parameter in the generation
> step.
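> For example, something like this at the generation step (the crawldb and
> segments paths and the fetcher count below are just placeholders):
>
>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -numFetchers 4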
>
> Why don't you use the crawl script in /bin instead of tinkering with the
> (now deprecated) Crawl class? It comes with a good default configuration
> and should make your life easier.
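> When run from runtime/deploy, bin/crawl submits each phase (inject,
> generate, fetch, parse, updatedb, and so on) to the Hadoop cluster via
> bin/nutch. The exact arguments vary slightly by version (check the usage
> line at the top of the script), but it is roughly:
>
>   bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>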
>
> Julien
>
>
> On 28 August 2014 06:47, Meraj A. Khan <mera...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am running Nutch 1.7 on a Hadoop 2.3.0 cluster, and I noticed that there
> > is only a single reducer in the generate-partition job. I am running into a
> > situation where the subsequent fetch is only running in a single map task
> > (I believe as a consequence of the single reducer in the earlier phase).
> > How can I force Nutch to fetch in multiple map tasks? Is there a setting to
> > force more than one reducer in the generate-partition job, so that there
> > are more map tasks?
> >
> > Please also note that I have commented out the code in Crawl.java so that
> > it does not do the LinkInversion phase, as I don't need the scoring of the
> > URLs that Nutch crawls; every URL is equally important to me.
> >
> > Thanks.
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
