To add to this, you might wish to have a look at the rest of the wiki, in particular
http://wiki.apache.org/nutch/NutchHadoopTutorial

This is a significant step up from running a crawl command, but it will greatly reduce the complexity and disadvantages of running the kind of process you describe on a single workstation.

Lewis

________________________________________
From: Hannes Carl Meyer [[email protected]]
Sent: 26 February 2011 18:02
To: [email protected]
Subject: Re: Can I use the Nutch crawl command for large crawls?

I would not recommend using the Crawl command for large crawls, because:

1. Tuning Hadoop is not possible at all
2. Incremental crawling is also pretty difficult because you can't control the different processes/steps

