See http://wiki.apache.org/nutch/OptimizingCrawls for a checklist.

On 21 February 2012 10:47, Bharat Goyal <[email protected]> wrote:

> The number of fetcher threads is equal to the default value (10). What is
> the optimum value for the number of threads? Also, the fetching and parsing
> are not separate.
>
> -Bharat
>
> On Tuesday 21 February 2012 04:11 PM, Lewis John Mcgibbney wrote:
>
>> How many fetcher threads do you have at play?
>> Also, are you separating fetching and parsing?
>>
>> These are (generally speaking) places to get started.
>>
>> On Tue, Feb 21, 2012 at 8:19 AM, Bharat Goyal <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a list of around 1000 seed URLs, which I crawl to depth=2 or 3.
>>> This is done on a local machine (with no other large resource-consuming
>>> processes running) with the following configuration:
>>> Dual Core (2.4 GHz),
>>> 4GB RAM
>>>
>>> It takes around 14-15 hours to crawl this seed list, which generates
>>> around 21k pages of web content. Is there any way this can be optimized
>>> to take less time? The Nutch (1.2) settings are all default.
>>>
>>> Thanks for the help.
>>>
>>> Regards,
>>>
>>> Bharat Goyal
>>>
>>> DISCLAIMER
>>> This email is intended only for the person or the entity to whom it is
>>> addressed and may contain information which is confidential and
>>> privileged.
>>> Any review, retransmission, dissemination or any other use of the said
>>> information by person or entities other than intended recipient is
>>> unauthorized and prohibited. If you are not the intended recipient,
>>> please delete this email and contact the sender.

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
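
As a rough sanity check on the figures quoted in the thread (about 21k pages in 14 to 15 hours, with the default 10 fetcher threads), the observed throughput can be worked out as below. This is only a back-of-the-envelope sketch: the variable and function names are made up for illustration and are not part of Nutch, and it assumes the whole 14 to 15 hours was spent fetching.

```python
# Figures quoted in the thread; everything else here is illustrative.
PAGES = 21_000          # "around 21k" pages fetched
HOURS_LOW = 14          # lower bound of the reported crawl time
HOURS_HIGH = 15         # upper bound of the reported crawl time
THREADS = 10            # default fetcher thread count mentioned in the thread


def pages_per_second(pages: int, hours: float) -> float:
    """Average crawl throughput over the whole run."""
    return pages / (hours * 3600)


def seconds_per_page_per_thread(rate: float, threads: int) -> float:
    """Average wall-clock time one thread spends per page at a given rate."""
    return threads / rate


best_rate = pages_per_second(PAGES, HOURS_LOW)
worst_rate = pages_per_second(PAGES, HOURS_HIGH)

print(f"throughput: {worst_rate:.2f} to {best_rate:.2f} pages/s")
print(
    "per-thread time per page: "
    f"{seconds_per_page_per_thread(best_rate, THREADS):.1f} to "
    f"{seconds_per_page_per_thread(worst_rate, THREADS):.1f} s"
)
```

At roughly 0.4 pages per second, each of the 10 threads averages somewhere around 24 to 26 seconds per page. That suggests the threads may be spending most of their time waiting rather than downloading, and the checklist linked above is the place to look for likely causes.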

