Hey Peter,

Give a bigger value for the topN parameter. Also, use:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>

Not sure why you see the queue mode as byHost and not byDomain. Did it print
that in the logs?

I should have asked you this before: are you using Nutch 1.x or 2.x?

Thanks,
Tejas Patil

On Tue, Jan 29, 2013 at 12:08 AM, peterbarretto <[email protected]> wrote:

> Hi Tejas,
>
> I changed generate.count.mode to domain and generate.max.count to 100,
> but it still shows the queue mode as byHost and not byDomain.
>
> peterbarretto wrote
> > Hi Tejas,
> >
> > The fetcher.threads.per.host property has been deprecated and replaced
> > with fetcher.threads.per.queue.
> > I am not sure if fetcher.threads.per.queue will help the fetching, as the
> > generator only generates the fetchlist from 2 or 3 domains. How can I tell
> > the generator to create a fetchlist with an equal number of urls from all
> > domains?
> >
> > I am sure there are urls from the other domains, but I guess since the url
> > score is lower it fetches from only 2 domains.
> >
> > I will try increasing fetcher.threads.per.queue to 5, see if the fetch
> > speed increases, and let you know.
>
> > Tejas Patil wrote
> >> Hey Peter,
> >>
> >> I am guessing that you have just increased the global thread count. Have
> >> you also increased "fetcher.threads.per.host"? This will improve the
> >> crawl rate, as multiple threads can attack the same site. Don't make it
> >> too high or else the system will get overloaded. The Nutch wiki has an
> >> article [0] about the potential reasons for slow crawls and some good
> >> suggestions.
> >>
> >> [0] : https://wiki.apache.org/nutch/OptimizingCrawls
> >>
> >> Thanks,
> >> Tejas Patil
> >>
> >> On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto <peterbarretto08@> wrote:
> >>
> >>> I tried increasing the number of threads to 50 but the speed is not
> >>> affected.
> >>>
> >>> I tried changing the partition.url.mode value to byDomain and
> >>> fetcher.queue.mode to byDomain, but still it does not help the speed.
> >>> It seems to get urls from 2 domains now and the other domains are not
> >>> getting crawled. Is this due to the url score? If so, how do I crawl
> >>> urls from all the domains?
> >>>
> >>> lewis john mcgibbney wrote
> >>> > Increase the number of threads when fetching.
> >>> > Also please see nutch-default.xml for partitioning of urls; if you
> >>> > know your target domains you may wish to adapt the policy.
> >>> > Lewis
> >>> >
> >>> > On Sunday, January 27, 2013, peterbarretto <peterbarretto08@> wrote:
> >>> >> I want to increase the number of urls fetched at a time in Nutch. I
> >>> >> have around 10 websites to crawl, so how can I crawl all the sites at
> >>> >> a time? Right now I am fetching 1 site with a fetch delay of 2 seconds
> >>> >> but it is too slow. How do I fetch concurrently from different domains?
> >>> >>
> >>> >> --
> >>> >> View this message in context:
> >>> >> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
> >>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>> >
> >>> > --
> >>> > *Lewis*
> >>>
> >>> --
> >>> View this message in context:
> >>> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
> >>> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>>
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036976.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
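
For reference, the settings discussed in this thread could be collected in nutch-site.xml roughly as below. This is only a sketch for Nutch 1.x, not a tested configuration: the values 50 and 5 are the ones mentioned in the thread rather than tuned recommendations. It also assumes that the "queue mode" printed in the fetcher log is driven by fetcher.queue.mode and not by generate.count.mode, which would be one possible explanation for the log still showing byHost after only the generator setting was changed.

```xml
<configuration>
  <!-- Generator: no per-domain cap on URLs in a fetchlist (-1 = unlimited),
       with counts taken per domain, as suggested in the reply above -->
  <property>
    <name>generate.max.count</name>
    <value>-1</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>

  <!-- Fetcher: group fetch queues by domain (assumed to be the setting
       behind the "queue mode" log line) -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byDomain</value>
  </property>

  <!-- Global thread count (50 was the value tried in the thread) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>

  <!-- Threads allowed per queue; replaces fetcher.threads.per.host.
       5 is the value proposed in the thread; too high can overload a site -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
  </property>
</configuration>
```

The topN value is not set here because it is passed on the command line to the generate step (the -topN option) rather than through nutch-site.xml.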

