Hi,
I found a recrawl script to incrementally index a list of URLs. It
basically contains the following steps (minor details left out):
# Seed the crawldb with the initial URL list.
nutch inject crawldb urls
for ((i=1; i <= depth; i++))
do
  # Select up to 500 top-scoring URLs due for fetching into a new segment.
  nutch generate crawldb segments -topN 500
  # Pick up the segment that was just created (the newest directory).
  export SEGMENT=segments/`ls -tr segments | tail -1`
  # Fetch without parsing, then parse as a separate step.
  nutch fetch $SEGMENT -noParsing
  nutch parse $SEGMENT
  # Fold the fetch results back into the crawldb.
  nutch updatedb crawldb $SEGMENT -filter -normalize
done
# Note that these last steps run over ALL segments, not just the new ones:
nutch invertlinks linkdb -dir segments
nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
nutch solrdedup http://127.0.0.1:8983/solr
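(As I read the invertlinks usage, -dir segments is just shorthand for
listing every segment directory explicitly, so each run effectively does
something like the following, with SEG1, SEG2, ... standing in for every
accumulated segment:

nutch invertlinks linkdb segments/SEG1 segments/SEG2 segments/SEG3

That is, all of them get reprocessed every time. Please correct me if I
misread that.)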
Why are the invertlinks and solrindex steps run on the entire segments
dir rather than only on the last $SEGMENT? I ask because the number of
segment directories equals depth * (number of times the recrawl script
has been run), which of course keeps growing over time and might become
a performance problem.
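To make that concrete with made-up numbers: at depth=3 and one recrawl
per day, a month leaves 3 * 30 = 90 segment directories, and every run
feeds all 90 to invertlinks and solrindex again. The count is easy to
watch:

ls -d segments/*/ | wc -l   # number of accumulated segment directories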
Doesn't the version below work just as well, running the invertlinks and
solrindex steps only on the last segment? What would the difference be
if I did it this way?
nutch inject crawldb urls
for ((i=1; i <= depth; i++))
do
  nutch generate crawldb segments -topN 500
  export SEGMENT=segments/`ls -tr segments | tail -1`
  nutch fetch $SEGMENT -noParsing
  nutch parse $SEGMENT
  nutch updatedb crawldb $SEGMENT -filter -normalize
  # Changed: invert links and index only the segment just fetched.
  nutch invertlinks linkdb $SEGMENT
  nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb $SEGMENT
done
nutch solrdedup http://127.0.0.1:8983/solr
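If indexing per segment is indeed safe, I'm also thinking about pruning
old segments once they have been indexed. A rough sketch of my own (not
from the original script; KEEP is a knob I made up, and head -n -N
assumes GNU coreutils):

# Keep only the KEEP newest segments, delete the rest.
KEEP=10
ls -tr segments | head -n -$KEEP | while read old
do
  rm -r "segments/$old"
done

But I don't know whether the linkdb or a later solrindex still needs the
older segments, which is really the same question as above.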
Any insights are appreciated.
Regards,
Jeroen