Hi,
I am trying to find ways to improve the performance of an incremental
crawl that currently starts from approximately 200 seed URLs, with
depth = 3 and topN = 500. I collected per-step timing information using
the patch I provided in <https://issues.apache.org/jira/browse/NUTCH-838>.
The crawl script is essentially:
# seed the crawldb with the injected URLs
bin/nutch inject crawldb urls

for ((i = 1; i <= 3; i++))
do
  # select the top 500 URLs due for fetching into a new segment
  bin/nutch generate crawldb segments -topN 500
  export SEGMENT=segments/`ls -tr segments | tail -1`

  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  # merge the fetch/parse results back into the crawldb
  bin/nutch updatedb crawldb $SEGMENT -filter -normalize
done

# build the linkdb, index into Solr and remove duplicates
bin/nutch invertlinks linkdb -dir segments
bin/nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
bin/nutch solrdedup http://127.0.0.1:8983/solr
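For context on what each per-step number covers: the timings themselves
come from the NUTCH-838 patch, but measuring a single step from the
shell would look roughly like this (simplified sketch, not the actual
patch):

  # record wall-clock time around one step, e.g. generate
  START=`date +%s`
  bin/nutch generate crawldb segments -topN 500
  END=`date +%s`
  echo "generate took $((END - START)) seconds"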
I have run the above script 46 times so far. The total runtime grew
from 00:01:28 in the first run to a peak of 01:48:20 in run 17. I then
set fetcher.threads.per.host to 3, after which the runtime dropped back
to 00:16:27 in run 19; it is currently at 00:50:32 in run 46.
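For reference, the override in my conf/nutch-site.xml now looks like
this (only the property I changed is shown):

  <property>
    <name>fetcher.threads.per.host</name>
    <value>3</value>
  </property>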
The step that used to take most of the time was fetch, but that was
solved by the configuration change above. The steps that now dominate
are generate (grown from 00:00:06 to 00:04:56 per run), updatedb (from
00:00:02 to 00:05:33 per run) and solrindex (from 00:00:07 to 00:14:20
per run).
What suggestions do you have for improving performance, given that:
1. I am planning to crawl and recrawl all of these URLs regularly.
2. The number of injected sites will grow from 200 today to roughly
2,500. At the current ratio of about 1,000 URLs per site, the crawldb
will eventually contain around 2.5M URLs (it currently holds slightly
more than 200K).
Thanks for your support and insights.
Best regards,
Jeroen
PS: If necessary, I can send an Excel overview with all the timings I
collected.