Hi,

Based on our experience I would recommend running Nutch on a Hadoop pseudo-cluster with a bit more memory and at least 4 CPU cores. Fetching and parsing those URLs won't be a problem, but updating the crawldb and generating fetch lists will be.
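For the impolite, internal-only crawl described below, overrides along these lines in nutch-site.xml could be a starting point. This is only a sketch for Nutch 1.x; the property names are standard, but the thread counts are guesses you would need to tune against your hardware:

```xml
<!-- nutch-site.xml: sketch for a fast crawl of your own intranet mirror. -->
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <value>0</value>
    <description>No politeness delay; only safe against your own server.</description>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
    <description>Total fetcher threads; tune to your CPU and bandwidth.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>50</value>
    <description>All URLs share one host queue, so allow many threads on it.</description>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Follow internal links only, as requested.</description>
  </property>
</configuration>
```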
Are you also indexing? Then that will also be a very costly process.

Cheers

On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote:
> Hi,
>
> I am looking for advice on how to configure Nutch (and Solr) to crawl a
> private Wikipedia mirror.
>
> - It is my mirror on an intranet so I do not need to be polite to
>   myself.
> - I need to complete this 11 million page crawl as fast as I
>   reasonably can.
> - Both crawler and mirror are 1.7GB machines dedicated to this task.
> - I only need to crawl internal links (not external).
> - Eventually I will need to update the crawl but a monthly update will
>   be sufficient.
>
> Any advice (and sample config files) would be much appreciated!
>
> Fred
--

