Hey Peter,

Give the topN parameter a bigger value. Also, use:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>

<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>

I'm not sure why you see the queue mode as byHost and not byDomain. Did it
print that in the logs?
I should have asked you this before: are you using Nutch 1.x or 2.x?
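
For reference, the generator and the fetcher read their grouping modes from
separate properties, so changing generate.count.mode alone may not change the
queue mode that the fetcher reports. A minimal sketch for nutch-site.xml,
assuming Nutch 1.x property names (the values are illustrative, not tuned
recommendations):

```xml
<!-- Sketch only: assumes Nutch 1.x property names; values are illustrative. -->
<property>
  <name>fetcher.queue.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>partition.url.mode</name>
  <value>byDomain</value>
</property>

<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
</property>
```

If the log still reports byHost after a change like this, it is worth checking
that the edited nutch-site.xml is the one the job actually loads.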

thanks,
Tejas Patil


On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto
<[email protected]> wrote:

> Hi Tejas,
>
> I changed generate.count.mode to domain and generate.max.count to 100,
> but it still shows the queue mode as byHost and not byDomain.
>
>
>
> peterbarretto wrote
> > Hi Tejas
> >
> > The fetcher.threads.per.host property has been deprecated and replaced
> > with fetcher.threads.per.queue.
> > I am not sure if fetcher.threads.per.queue will help the fetching, as the
> > generator only generates the fetchlist from 2 or 3 domains. How can I tell
> > the generator to create a fetchlist with an equal number of URLs from all
> > domains?
> >
> > I am sure there are URLs from the other domains, but I guess that since
> > their URL score is lower, it fetches from only 2 domains.
> >
> > I will try increasing fetcher.threads.per.queue to 5, see if the fetch
> > speed increases, and let you know.
> >
> > Tejas Patil wrote
> >> Hey Peter,
> >>
> >> I am guessing that you have just increased the global thread count. Have
> >> you also increased "fetcher.threads.per.host"? This will improve the
> >> crawl rate, as multiple threads can attack the same site. Don't make it
> >> too high, or else the system will get overloaded. The Nutch wiki has an
> >> article [0] about the potential reasons for slow crawls and some good
> >> suggestions.
> >>
> >> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> >>
> >> Thanks,
> >> Tejas Patil
> >>
> >>
> >> On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <peterbarretto08@> wrote:
> >>
> >>> I tried increasing the number of threads to 50, but the speed is not
> >>> affected.
> >>>
> >>> I tried changing the partition.url.mode value to byDomain and
> >>> fetcher.queue.mode to byDomain, but it still does not help the speed.
> >>> It seems to get URLs from 2 domains now, and the other domains are not
> >>> getting crawled. Is this due to the URL score? If so, how do I crawl
> >>> URLs from all the domains?
> >>>
> >>>
> >>> lewis john mcgibbney wrote
> >>> > Increase the number of threads when fetching.
> >>> > Also, please see nutch-default.xml for the partitioning of URLs; if you
> >>> > know your target domains, you may wish to adapt the policy.
> >>> > Lewis
> >>> >
> >>> > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> >>> >> I want to increase the number of URLs fetched at a time in Nutch. I
> >>> >> have around 10 websites to crawl, so how can I crawl all the sites at
> >>> >> once? Right now I am fetching 1 site with a fetch delay of 2 seconds,
> >>> >> but it is too slow. How can I fetch concurrently from different
> >>> >> domains?
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> View this message in context:
> >>> >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> >>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>> >>
> >>> >
> >>> > --
> >>> > *Lewis*
> >>>
> >>>
>
>
