Try decreasing the number of fetcher threads instead...

On Wed, Feb 22, 2012 at 2:33 PM, Bharat Goyal <[email protected]> wrote:
> Went through the checklist and made some changes, as in increased the no
> of fetcher threads from default 10 to 30, but I still see Nutch eating
> up all the resources; the CPU usage is as high as 100%.
>
> -Bharat
>
> On Tuesday 21 February 2012 04:45 PM, Julien Nioche wrote:
>
>> See http://wiki.apache.org/nutch/OptimizingCrawls for a checklist
>>
>> On 21 February 2012 10:47, Bharat Goyal <[email protected]> wrote:
>>
>>> No of fetcher threads is equal to the default value (10). What is the
>>> optimum value for no of threads? Also, the fetching and parsing are
>>> not separate.
>>>
>>> -Bharat
>>>
>>> On Tuesday 21 February 2012 04:11 PM, Lewis John Mcgibbney wrote:
>>>
>>>> How many fetcher threads do you have at play?
>>>> Also, are you separating fetching and parsing?
>>>>
>>>> These are (generally speaking) places to get started.
>>>>
>>>> On Tue, Feb 21, 2012 at 8:19 AM, Bharat Goyal <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a list of around 1000 seed URLs, which I crawl till depth=2 or 3.
>>>>> This is done on a local machine (with no other large
>>>>> resource-consuming processes running) with this configuration:
>>>>> Dual Core (2.4 GHz),
>>>>> 4GB RAM
>>>>>
>>>>> It takes around 14-15 hours to crawl this seed list, which generates
>>>>> around 21k web page content. Is there any way this can be optimized
>>>>> to take less time? Nutch (1.2) settings are all default.
>>>>>
>>>>> Thanks for the help.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Bharat Goyal
>>>>>
>>>>> DISCLAIMER
>>>>> This email is intended only for the person or the entity to whom it is
>>>>> addressed and may contain information which is confidential and
>>>>> privileged.
>>>>> Any review, retransmission, dissemination or any other use of the said
>>>>> information by person or entities other than intended recipient is
>>>>> unauthorized and prohibited. If you are not the intended recipient,
>>>>> please delete this email and contact the sender.
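
For comparing runs before and after a thread-count change, the figures quoted in the original message can be turned into an average fetch rate. This is only a rough sketch in Python using the numbers from the thread (about 21k pages in 14-15 hours, 10 default fetcher threads). The function and variable names are illustrative, and the result is an average that says nothing about where the time is actually spent.

```python
# Back-of-the-envelope crawl throughput from the figures quoted in the thread.
# All names here are illustrative; nothing in this sketch is part of Nutch.

PAGES = 21_000              # "around 21k web page content"
HOURS_RANGE = (14, 15)      # "around 14-15 hours"
DEFAULT_THREADS = 10        # default fetcher thread count mentioned in the thread


def pages_per_second(pages: int, hours: float) -> float:
    """Average number of pages fetched per second over the whole crawl."""
    return pages / (hours * 3600)


def seconds_per_page_per_thread(pages: int, hours: float, threads: int) -> float:
    """Average wall-clock seconds one thread spends per page,
    assuming the pages are spread evenly across the threads."""
    return threads * hours * 3600 / pages


if __name__ == "__main__":
    for hours in HOURS_RANGE:
        rate = pages_per_second(PAGES, hours)
        per_thread = seconds_per_page_per_thread(PAGES, hours, DEFAULT_THREADS)
        print(f"{hours} h: {rate:.3f} pages/s overall, "
              f"{per_thread:.1f} s per page per thread")
```

With the quoted figures this works out to roughly 0.4 pages per second overall. Recomputing the same number after each configuration change gives a simple way to tell whether raising or lowering the thread count actually helped.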

