Hi,
I have just performed my first full crawl of a collection of sites
within a vertical domain using Nutch 2.1. This is a restricted crawl
where I limit the crawl to just the URLs in the
seed.txt file and set db.ignore.external.links to true. This crawl
is being performed on a development server and deployment in
production is likely to be EC2. I'm posting my crawl results here
hoping that others might share how they have configured their
environments, moving from smaller crawls on development machines to
larger, distributed crawls.
I'm specifically interested in suggestions for speeding up both the
initial crawl and subsequent re-crawls in local mode first. Also
suggestions and approaches to determine an optimal cost/speed EC2
configuration. Initially I'd like to avoid a multi-server setup if
possible as I'm likely to only have funds for a single server for the
time being.
I've only been poking around with Nutch off and on for a few weeks, so
apologies if I'm butchering terminology or concepts.
I've read in other posts that using the native Hadoop libraries is likely
to improve performance, but I have been unable to find helpful information
about how to load them on OS X while using Nutch 2. At startup I see:
* util.NativeCodeLoader - Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
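For context, that warning seems to mean the JVM can't find a native
libhadoop on java.library.path. My working assumption (untested, and the
path below is mine, not a standard one) is that building the native
library locally and pointing the JVM at it would quiet the warning:

```shell
# Assumptions: bin/nutch passes NUTCH_OPTS through to the JVM, and a native
# libhadoop built for OS X lives in the directory below. Apache doesn't ship
# OS X native binaries, so the library has to be built locally first.
export NUTCH_OPTS="-Djava.library.path=$HOME/hadoop/lib/native/Mac_OS_X-x86_64-64"
bin/nutch crawl urls -depth 8 -topN 10000
```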
Single Server
-----------------------
* OS X 10.7.4
* 2.7 GHz Intel Core i5 Quad Core
* 16GB memory
* 25Mbps download speed over consumer broadband (RCN)
* 1.95Mbps upload speed
* 1TB SATA hard drive, 7200 rpm
Nutch 2.X HEAD
-----------------------
* Local mode bin/nutch crawl urls -depth 8 -topN 10000
* Using HBase 0.90.6 (was encountering hung threads with 0.90.4)
* Portions of my configuration... (are there other more relevant bits?)
<property>
<name>fetcher.server.delay</name>
<value>4.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>50</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>20</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time.</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>1</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
Data points
-----------------------
* Currently 177 unique websites; this will grow to ~30,000 websites
* PDF/Word/Excel-heavy sites; ~50% of the documents are non-HTML
* 64,000 unique webpage documents reported in the 'webpage' HBase table
* HBase 'webpage' table is 14GB
Crawl performance
-----------------------
* bin/nutch crawl urls -depth 8 -topN 10000
* Initial crawl takes ~4 hours
* jconsole reports heap usage usually hovering around 1GB, with
occasional spikes to about 2GB
* max heap size is set to the default
* CPU use is typically below 10%
* Network peak data received: 30Mbps
* Subsequent crawl takes ~2 hours
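As a sanity check on the ~4 hours, here's the back-of-envelope politeness
floor I computed, assuming pages are spread evenly across hosts and one
request per host every fetcher.server.delay seconds (my
fetcher.threads.per.queue of 20 may relax that, so treat it as a rough
lower bound):

```shell
# Rough lower bound on fetch time from the politeness settings above.
# Assumes an even spread of pages across hosts, which is a simplification.
pages=64000   # unique documents in the 'webpage' table
hosts=177     # unique websites in the seed list
delay=4.0     # fetcher.server.delay (seconds)

awk -v p=$pages -v h=$hosts -v d=$delay 'BEGIN {
  pph = p / h                              # pages per host
  printf "~%.0f pages/host -> floor of ~%.0f minutes\n", pph, pph * d / 60
}'
```

That comes out to roughly 362 pages/host and a floor of about 24 minutes,
so presumably most of the 4-hour wall-clock time goes to generate/parse/
updatedb overhead and HBase I/O rather than politeness delays.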
Indexing
-----------------------
* Using ElasticSearch for the index
* ElasticSearch index is 752MB with 50,000 unique documents
* Custom index plugin adds specific vertical domain information to the index
* Indexing 64,000 documents via bin/nutch elasticindex takes ~5 minutes
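In case the counting method matters for the HBase vs. ElasticSearch
discrepancy, this is roughly how I'm comparing the two numbers (the
'nutch' index name is a placeholder -- substitute whatever your
elastic.index property is set to):

```shell
# Compare the two document counts (ES index name is a placeholder).
echo "count 'webpage'" | hbase shell          # rows in the HBase table
curl -s 'http://localhost:9200/nutch/_count'  # docs ElasticSearch reports
```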
Questions
-----------------------
* Given my current setup, is the crawl that I'm performing taking
roughly the same time that others might expect?
* If this crawl is taking much longer than you might expect what would
you suggest trying to decrease the crawl time?
* Should I be moving away from 'bin/nutch crawl -depth 8 -topN 10000'
and calling each step individually? Why?
* Are there more recent/improved versions of
http://wiki.apache.org/nutch/Crawl scripts that are written for Nutch
2.x?
* Why would HBase show 64,000 documents but ElasticSearch only 50,000?
* What other questions should I be asking?
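For anyone answering the step-by-step question above, this is the loop I
understand to be the alternative to the one-shot crawl command (flags are
from my reading of the Nutch 2.x bin/nutch usage, so corrections welcome;
<clustername> is a placeholder):

```shell
# One pass per -depth level. -all operates on every unfetched/unparsed
# batch; passing explicit batch ids is the more precise alternative.
bin/nutch inject urls                      # seed the webpage table
for depth in 1 2 3 4 5 6 7 8; do
  bin/nutch generate -topN 10000           # mark a batch for fetching
  bin/nutch fetch -all                     # fetch the marked pages
  bin/nutch parse -all                     # parse fetched content
  bin/nutch updatedb                       # fold new links back into the table
done
bin/nutch elasticindex <clustername> -all  # index into ElasticSearch
```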
Thanks,
Matt