Hi,

I am trying to find ways to improve the performance of an incremental crawl that currently starts from approximately 200 URLs, with depth = 3 and topN = 500. I collected timing information according to the patch I provided in <https://issues.apache.org/jira/browse/NUTCH-838>:

bin/nutch inject crawldb urls
# one generate/fetch/parse/update round per depth level (3 in total)
for ((i=1; i <= 3; i++))
do
    bin/nutch generate crawldb segments -topN 500
    # pick up the segment that was just generated
    export SEGMENT=segments/`ls -tr segments | tail -1`
    bin/nutch fetch $SEGMENT -noParsing
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawldb $SEGMENT -filter -normalize
done
bin/nutch invertlinks linkdb -dir segments
bin/nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
bin/nutch solrdedup http://127.0.0.1:8983/solr
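
For context, the per-step timings below were collected with the NUTCH-838 patch; purely as an illustration of what is being measured, a rough shell-level sketch of timing a single step would be:

    # illustrative only, not the NUTCH-838 patch: time one step at the shell level
    start=$(date +%s)
    bin/nutch generate crawldb segments -topN 500
    echo "generate took $(( $(date +%s) - start )) seconds"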

So far I have run the above script 46 times. The total runtime grew from 00:01:28 to a maximum of 01:48:20 in iteration 17. I then set fetcher.threads.per.host=3, after which the runtime dropped back to 00:16:27 in iteration 19; it currently stands at 00:50:32 in iteration 46.
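
For reference, the property override (in my case in conf/nutch-site.xml) looks like this:

    <property>
      <name>fetcher.threads.per.host</name>
      <value>3</value>
    </property>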

The step that used to take most of the time was fetch, but that was solved by the configuration change above. The steps that dominate now are generate (up from 00:00:06 to 00:04:56 per iteration), updatedb (up from 00:00:02 to 00:05:33 per iteration) and solrindex (up from 00:00:07 to 00:14:20 per iteration).

What suggestions do you have for improving performance, given that:

1. I'm planning to crawl and recrawl all these URLs regularly.
2. The total number of injected sites will grow from 200 today to approximately 2500; since 200 sites already yield slightly more than 200K URLs, the crawldb should eventually contain around 2.5M URLs.

Thanks for your support and insights.

Best regards,


Jeroen

PS: If necessary I can send an Excel overview with all the timings I collected.
