Hi,
I have just performed my first full crawl of a collection of sites
within a vertical domain using Nutch 2.1. This is a restricted crawl
where I limit the crawl to just the URLs in the
seed.txt file and set db.ignore.external.links to true. This crawl
is being performed on a development server and deployment in
production is likely to be EC2. I'm posting my crawl results here
hoping that others might share how they have configured their
environments, moving from smaller crawls on development machines to
larger, distributed crawls.
I'm specifically interested in suggestions for speeding up both the
initial crawl and subsequent re-crawls in local mode first. Also
suggestions and approaches to determine an optimal cost/speed EC2
configuration. Initially I'd like to avoid a multi-server setup if
possible as I'm likely to only have funds for a single server for the
time being.
I've only been poking around with Nutch off and on for a few weeks, so
apologies if I'm butchering terminology or concepts.
I've read in other posts that using the native Hadoop libraries is likely
to improve performance, but I have been unable to find helpful information
about how to load them on OS X while using Nutch 2. At startup I see:
* util.NativeCodeLoader - Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
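For context, that warning seems to mean the JVM can't find a native
libhadoop on java.library.path. My working assumption (untested, and the
path below is mine, not a standard one) is that building the native
library locally and pointing the JVM at it would quiet the warning:

```shell
# Assumptions: bin/nutch passes NUTCH_OPTS through to the JVM, and a native
# libhadoop built for OS X lives in the directory below. Apache doesn't ship
# OS X native binaries, so the library has to be built locally first.
export NUTCH_OPTS="-Djava.library.path=$HOME/hadoop/lib/native/Mac_OS_X-x86_64-64"
bin/nutch crawl urls -depth 8 -topN 10000
```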
Single Server
-----------------------
* OS X 10.7.4
* 2.7 GHz Intel Core i5 Quad Core
* 16GB memory
* 25Mbps download speed over consumer broadband (RCN)
* 1.95Mbps upload speed
* 1TB SATA hard drive, 7200 rpm
Nutch 2.X HEAD
-----------------------
* Local mode bin/nutch crawl urls -depth 8 -topN 10000
* Using HBase 0.90.6 (was encountering hung threads with 0.90.4)
* Portions of my configuration... (are there other more relevant bits?)
<property>
<name>fetcher.server.delay</name>
<value>4.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>50</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>20</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time.</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>1</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
Data points
-----------------------
* Currently 177 unique websites; this will grow to ~30,000 websites
* PDF/Word/Excel-heavy sites; ~50% of the documents are non-HTML
* 64,000 unique webpage documents reported in the 'webpage' HBase table
* HBase 'webpage' table is 14GB
Crawl performance
-----------------------
* bin/nutch crawl urls -depth 8 -topN 10000
* Initial crawl takes ~4 hours
* jconsole reports heap usage usually hovering around 1GB, with
occasional spikes to about 2GB
* max heap size is set to the default
* CPU use is typically below 10%
* Network peak data received: 30Mbps
* Subsequent crawl takes ~2 hours
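As a sanity check on the ~4 hours, here's the back-of-envelope politeness
floor I computed, assuming pages are spread evenly across hosts and one
request per host every fetcher.server.delay seconds (my
fetcher.threads.per.queue of 20 may relax that, so treat it as a rough
lower bound):

```shell
# Rough lower bound on fetch time from the politeness settings above.
# Assumes an even spread of pages across hosts, which is a simplification.
pages=64000   # unique documents in the 'webpage' table
hosts=177     # unique websites in the seed list
delay=4.0     # fetcher.server.delay (seconds)

awk -v p=$pages -v h=$hosts -v d=$delay 'BEGIN {
  pph = p / h                              # pages per host
  printf "~%.0f pages/host -> floor of ~%.0f minutes\n", pph, pph * d / 60
}'
```

That comes out to roughly 362 pages/host and a floor of about 24 minutes,
so presumably most of the 4-hour wall-clock time goes to generate/parse/
updatedb overhead and HBase I/O rather than politeness delays.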
Indexing
-----------------------
* Using ElasticSearch for the index
* ElasticSearch index is 752MB with 50,000 unique documents
* Custom index plugin adds specific vertical domain information to the index
* Indexing 64,000 documents via bin/nutch elasticindex takes ~5 minutes
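In case the counting method matters for the HBase vs. ElasticSearch
discrepancy, this is roughly how I'm comparing the two numbers (the
'nutch' index name is a placeholder -- substitute whatever your
elastic.index property is set to):

```shell
# Compare the two document counts (ES index name is a placeholder).
echo "count 'webpage'" | hbase shell          # rows in the HBase table
curl -s 'http://localhost:9200/nutch/_count'  # docs ElasticSearch reports
```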
Questions
-----------------------
* Given my current setup, is the crawl that I'm performing taking
roughly the same time that others might expect?
* If this crawl is taking much longer than you might expect what would
you suggest trying to decrease the crawl time?
* Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
and calling each step individually? Why?
* Are there more recent/improved versions of
http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
2.x?
* Why would HBase show 64,000 documents but ElasticSearch only 50,000?
* What other questions should I be asking?
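For anyone answering the step-by-step question above, this is the loop I
understand to be the alternative to the one-shot crawl command (flags are
from my reading of the Nutch 2.x bin/nutch usage, so corrections welcome;
<clustername> is a placeholder):

```shell
# One pass per -depth level. -all operates on every unfetched/unparsed
# batch; passing explicit batch ids is the more precise alternative.
bin/nutch inject urls                      # seed the webpage table
for depth in 1 2 3 4 5 6 7 8; do
  bin/nutch generate -topN 10000           # mark a batch for fetching
  bin/nutch fetch -all                     # fetch the marked pages
  bin/nutch parse -all                     # parse fetched content
  bin/nutch updatedb                       # fold new links back into the table
done
bin/nutch elasticindex <clustername> -all  # index into ElasticSearch
```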
Thanks,
Matt