Two things to note. First, db.ignore.external.links doesn't quite work the way you might expect: if a URL inside your domain redirects to a URL outside it, Nutch will end up crawling and indexing that external domain as well. The way around this is to use a whitelist instead of db.ignore.external.links; see the sketch below.
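One common way to implement that whitelist (a sketch, not the only option: it assumes the urlfilter-regex plugin is enabled in plugin.includes, and example.com/example.org stand in for your real seed hosts) is conf/regex-urlfilter.txt:

# accept only URLs on the whitelisted hosts
+^https?://([a-z0-9-]+\.)*example\.com/
+^https?://([a-z0-9-]+\.)*example\.org/

# rules are applied top to bottom and the first match wins,
# so this final rule rejects everything else
-.

With that catch-all '-.' at the end, a redirect to an external domain gets dropped by the filter regardless of what db.ignore.external.links does.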
Second, set fetcher.threads.per.host to at least 2. I've seen a Nutch fetch drag on forever because a single slow host was all that was left in the queue. Don't set it too high, however, as you can effectively DOS the sites you crawl. Also, put limits on your generates: you want URLs from as many different servers as possible in each fetchlist so that the load is spread around. A sketch of both settings follows.
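Here's a minimal sketch of those settings for nutch-site.xml (the values are illustrative starting points, not tuned recommendations; generate.max.count and generate.count.mode are the standard properties for capping per-host URLs in a fetchlist, but double-check that your Nutch version honors them):

<property>
<name>fetcher.threads.per.host</name>
<value>2</value>
<description>Allow two concurrent fetches per host so a single slow
host can't stall the whole fetch; keep this small to stay polite.</description>
</property>

<property>
<name>generate.max.count</name>
<value>1000</value>
<description>Cap the number of URLs taken from any one host (see
generate.count.mode) in each generated fetchlist, so one big site
can't dominate a segment.</description>
</property>

<property>
<name>generate.count.mode</name>
<value>host</value>
<description>Count generate.max.count per host; set to "domain" to
count per domain instead.</description>
</property>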
> </description> > </property> > > <property> > <name>fetcher.threads.per.queue</name> > <value>20</value> > <description>This number is the maximum number of threads that > should be allowed to access a queue at one time.</description> > </property> > > <property> > <name>fetcher.threads.per.host</name> > <value>1</value> > <description>This number is the maximum number of threads that > should be allowed to access a host at one time.</description> > </property> > > Data points > ----------------------- > * Currently 177 unique websites will grow to ~30,000 websites > * PDF/Word/Excel heavy sites ~50% of the documents are non-HTML > * 64,000 unique webpage documents reported in 'webpage' Hbase table > * Hbase 'webpage' table is 14GB > > Crawl performance > ----------------------- > * bin/nutch crawl urls -depth 8 -topN 10000 > * Initial crawl takes ~4 hours > * jconsole reports the Heap usage usually hovers around 1GB occasional > spikes to about 2GB > * max heap size is set to default > * CPU use is typically below 10% > * Network peak data received: 30Mbps > * Subsequent crawl takes ~2 hours > > Indexing > ----------------------- > * Using ElasticSearch for the index > * ElasticSearch index is 752MB with 50,000 unique documents > * Custom index plugin adds specific vertical domain information to the > index > * Indexing 64,000 documents via bin/nutch elasticindex takes ~5minutes > > Questions > ----------------------- > * Given my current setup, is the crawl that I'm performing taking > roughly the same time that others might expect? > * If this crawl is taking much longer than you might expect what would > you suggest trying to decrease the crawl time? > * Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000' > and calling each step individually? Why? > * Are there more recent/improved versions of > http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch > 2.x? > * Why would Hbase show 64,000 documents but ElasticSearch only 50,000? > * What other questions should I be asking? > > Thanks, > Matt >

