Hi,

Great feedback, suggestions, and activity on this list.
Based on guidance from the list I stopped using the bin/nutch crawl command and am now calling each step individually. Julien, you suggested that I start with https://issues.apache.org/jira/secure/attachment/12535851/NUTCH-1087-2.1.patch. I'm more comfortable working with Ruby than shell scripts, so I ported the script to Ruby and added some additional logging to help me better understand the timing and output of each step.

There are a few parameters used in the shell script that I'm unclear about -- what impact they have, or whether they are being used at all -- and I'd love feedback on what they mean and how I might tweak them.

Called during Generate & Fetch:
-------------------------
mapred.reduce.tasks.speculative.execution=false
mapred.map.tasks.speculative.execution=false
mapred.compress.map.output=true

Called during Parse:
-------------------------
mapred.skip.attempts.to.start.skipping=2
mapred.skip.map.max.skip.records=1

I've run 12 crawl iterations over the 177 websites that I'm crawling, and I'm wondering if the results are what others might expect.
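In case it helps to see what I mean by "timing each step", the Ruby port wraps each nutch command in a small timing helper, roughly like this (a minimal sketch; `timed` and `run_step` are my own helper names, not anything from Nutch):

```ruby
# Minimal sketch of the per-step timing wrapper in my Ruby port.
# The real script shells out to the nutch commands listed below.
def timed(label)
  start = Time.now
  result = yield
  elapsed = Time.now - start
  puts format("%-10s took %.1f s", label, elapsed)
  [result, elapsed]
end

# Each crawl step is just a shell-out to the corresponding nutch command:
def run_step(label, cmd)
  timed(label) { system(cmd) }
end

# e.g. run_step("generate", "nutch generate -topN 50000 -numFetchers 1 -noFilter")
```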
These are my crawling commands:
-------------------------
0) nutch inject #{options[:seed_dir]}

Loop:
1) nutch generate -D mapred.map.tasks=2 -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -numFetchers 1 -noFilter
2) nutch fetch -D mapred.map.tasks=2 -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true <BATCH_ID>
3) nutch parse -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 <BATCH_ID>
4) nutch updatedb

Iterations #2-#5 resulted in:
-------------------------
Average iteration time: 30-35 minutes

Iterations #6-#12 resulted in (realized I should be timing each step):
-------------------------
Average generate time: 250 seconds
Average fetch time: 400 seconds
Average parse time: 450 seconds
Average update time: 300 seconds
Average total iteration time: 20-25 minutes

HBase size after 12 iterations: 11.02GB

After the 12th iteration, readdb -stats resulted in the following output:
-------------------------
WebTable statistics start
Statistics for WebTable:
status 2 (status_fetched): 39611
min score: 0.0
retry 0: 43146
jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=7829, MAP_INPUT_RECORDS=45859, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=1339, MAP_OUTPUT_BYTES=2430527, COMMITTED_HEAP_BYTES=27249123328, COMBINE_INPUT_RECORDS=183436, SPLIT_RAW_BYTES=78062, REDUCE_INPUT_RECORDS=463, REDUCE_INPUT_GROUPS=118, COMBINE_OUTPUT_RECORDS=463, REDUCE_OUTPUT_RECORDS=118, MAP_OUTPUT_RECORDS=183436}, FileSystemCounters={FILE_BYTES_READ=32439253, FILE_BYTES_WRITTEN=32913783}, File Output Format Counters ={BYTES_WRITTEN=2520}}}}
retry 1: 2713
status 5 (status_redir_perm): 1373
max score: 19.345
TOTAL urls: 45859
status 4 (status_redir_temp): 346
status 1 (status_unfetched): 4529
avg score: 0.04870584
WebTable statistics: done

Some questions:
-------------------------
1) After 12 iterations I'm still seeing more than 4,500 documents out of ~45,000 that are unfetched. How might I go about determining why the unfetched URLs are not being fetched?

2) Any suggestions for modifying the iteration steps and/or parameters for each step in successive iterations to decrease crawl times and/or increase the number of fetched URLs? topN? threads?

3) Any additional information on what the mapred-related parameters do?
mapred.reduce.tasks.speculative.execution=false
mapred.map.tasks.speculative.execution=false
mapred.compress.map.output=true
mapred.skip.attempts.to.start.skipping=2
mapred.skip.map.max.skip.records=1

4) During my local, single-node crawl I've seen a few sites throw 500 errors and become unresponsive. How can I ensure that I'm not DOSing and crashing the sites I'm crawling? My current settings:
* fetcher.server.delay=5.0
* fetcher.threads.fetch=100
* fetcher.threads.per.queue=100
* fetcher.threads.per.host=100
* db.fetch.schedule.class=org.apache.nutch.crawl.AdaptiveFetchSchedule
* http.timeout=30000
* db.ignore.external.links=true

5) What value should I set for gora.buffer.read.limit? Currently it's set to the default of 10000. During fetch steps #6-#12, nearly 50% of the time was spent reading from HBase, and I was seeing gora.buffer.read.limit=10000 show up for several minutes in the logs.

Thanks,
Matt

On Fri, Sep 28, 2012 at 8:21 AM, Julien Nioche <[email protected]> wrote:
> Hi Matt
>
>>> the fetch step is likely to take most of the time and the time it takes
>>> it mostly a matter of the distribution of hosts/IP/domains in your
>>> fetchlist. Search the WIKI for details on performance tips
>>
>> Thanks. Most of the urls that I'm fetching are each on their own
>> IP/hosts and unique servers.
>
> Ok, you might want to use a large number of threads then
> (fetcher.threads.fetch)
>
> [...]
>
>>>> * Why would Hbase show 64,000 documents but ElasticSearch only 50,000?
>>>
>>> redirections? sounds quite a lot though
>>
>> Thoughts for how I would identify which are redirects?
>
> try using 'nutch readdb' to dump the content of the webtable and inspect
> the URLs
>
> Julien
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
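P.S. For question 1, my current plan is to follow the readdb suggestion: dump the webtable to text (nutch readdb -dump <out_dir>) and tally which URLs are still status 1. A rough Ruby sketch of the tally -- the dump's field layout varies by Nutch version, so the patterns here are assumptions to adjust against a real dump:

```ruby
# Hypothetical parser for a `nutch readdb -dump <dir>` text dump.
# Assumes each record begins with its URL and contains a "status: N" line;
# adjust the regexes to the actual dump format of your Nutch version.
def unfetched_urls(dump_text)
  urls = []
  current = nil
  dump_text.each_line do |line|
    if line =~ %r{^(https?://\S+)}
      current = $1                   # start of a new record
    elsif line =~ /status:\s*1\b/    # 1 == status_unfetched
      urls << current if current
    end
  end
  urls
end
```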

