Hi,
I am trying to find ways to improve the performance of an incremental
crawl that currently starts from approximately 200 seed URLs, with
depth = 3 and topN = 500. I collected per-step timing information using
the patch I provided in <https://issues.apache.org/jira/browse/NUTCH-838>.
The crawl script is essentially:
# seed the crawldb with the injected URLs
bin/nutch inject crawldb urls

for ((i = 1; i <= 3; i++))
do
  # select the top 500 URLs due for fetching into a new segment
  bin/nutch generate crawldb segments -topN 500
  export SEGMENT=segments/`ls -tr segments | tail -1`

  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  # merge the fetch/parse results back into the crawldb
  bin/nutch updatedb crawldb $SEGMENT -filter -normalize
done

# build the linkdb, index into Solr and remove duplicates
bin/nutch invertlinks linkdb -dir segments
bin/nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
bin/nutch solrdedup http://127.0.0.1:8983/solr
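For context on what each per-step number covers: the timings themselves
come from the NUTCH-838 patch, but measuring a single step from the
shell would look roughly like this (simplified sketch, not the actual
patch):

  # record wall-clock time around one step, e.g. generate
  START=`date +%s`
  bin/nutch generate crawldb segments -topN 500
  END=`date +%s`
  echo "generate took $((END - START)) seconds"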
I have run the above script 46 times so far. The total runtime grew
from 00:01:28 in the first run to a peak of 01:48:20 in run 17. I then
set fetcher.threads.per.host to 3, after which the runtime dropped back
to 00:16:27 in run 19; it is currently at 00:50:32 in run 46.
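For reference, the override in my conf/nutch-site.xml now looks like
this (only the property I changed is shown):

  <property>
    <name>fetcher.threads.per.host</name>
    <value>3</value>
  </property>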
The step that used to take most of the time was fetch, but that was
solved by the configuration change above. The steps that now dominate
are generate (grown from 00:00:06 to 00:04:56 per run), updatedb (from
00:00:02 to 00:05:33 per run) and solrindex (from 00:00:07 to 00:14:20
per run).
What suggestions do you have for improving performance, given that:
1. I am planning to crawl and recrawl all of these URLs regularly.
2. The number of injected sites will grow from 200 today to roughly
2,500. At the current ratio of about 1,000 URLs per site, the crawldb
will eventually contain around 2.5M URLs (it currently holds slightly
more than 200K).
Thanks for your support and insights.
Best regards,
Jeroen
PS: If necessary, I can send an Excel overview with all the timings I
collected.