Hi,

I found a recrawl script that incrementally indexes a list of URLs. It basically performs the following steps (minor details left out):

nutch inject crawldb urls
for ((i=1; i <= depth ; i++))
do
        nutch generate crawldb segments -topN 500
        export SEGMENT=segments/`ls -tr segments|tail -1`
        nutch fetch $SEGMENT -noParsing
        nutch parse $SEGMENT
        nutch updatedb crawldb $SEGMENT -filter -normalize
done
nutch invertlinks linkdb -dir segments
nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
nutch solrdedup http://127.0.0.1:8983/solr

Why are the invertlinks and solrindex steps run on the entire segments directory rather than only on the last $SEGMENT? I'd like to know because the number of segment directories equals depth * (the number of times the recrawl script has been run), which of course keeps growing over time and might become a performance problem.
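
(As an aside: if the sheer number of segment directories becomes the problem, I assume the mergesegs tool could be used to periodically collapse them into a single segment. An untested sketch, where segments_merged is just a name I picked:

nutch mergesegs segments_merged -dir segments
rm -rf segments
mv segments_merged segments

I haven't actually tried this, so corrections are welcome.)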

Wouldn't the version below work just as well, while running the invertlinks and solrindex steps only on the last segment? What would be the difference if I did it this way?

nutch inject crawldb urls
for ((i=1; i <= depth ; i++))
do
        nutch generate crawldb segments -topN 500
        export SEGMENT=segments/`ls -tr segments|tail -1`
        nutch fetch $SEGMENT -noParsing
        nutch parse $SEGMENT
        nutch updatedb crawldb $SEGMENT -filter -normalize
        nutch invertlinks linkdb $SEGMENT
        nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb $SEGMENT
done
nutch solrdedup http://127.0.0.1:8983/solr
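
As a middle ground, since invertlinks and solrindex both accept a list of segments, I could also collect only the segments created during the current run and pass just those once at the end. Another untested sketch (NEWSEGS is my own variable):

nutch inject crawldb urls
NEWSEGS=""
for ((i=1; i <= depth ; i++))
do
        nutch generate crawldb segments -topN 500
        export SEGMENT=segments/`ls -tr segments|tail -1`
        NEWSEGS="$NEWSEGS $SEGMENT"   # remember the segments created in this run
        nutch fetch $SEGMENT -noParsing
        nutch parse $SEGMENT
        nutch updatedb crawldb $SEGMENT -filter -normalize
done
nutch invertlinks linkdb $NEWSEGS
nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb $NEWSEGS
nutch solrdedup http://127.0.0.1:8983/solr

Would that be equivalent to the original script as far as the linkdb and the index are concerned?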

Any insights are appreciated.

Regards,


Jeroen