Hi,
I found a recrawl script to incrementally index a list of URLs. It
basically contains the following steps (minor details left out):
# Seed the crawldb with the initial URL list.
nutch inject crawldb urls
for ((i=1; i <= depth; i++))
do
  # Select up to 500 top-scoring URLs due for fetching into a new segment.
  nutch generate crawldb segments -topN 500
  # Pick up the segment that was just created (the newest directory).
  export SEGMENT=segments/`ls -tr segments | tail -1`
  # Fetch without parsing, then parse as a separate step.
  nutch fetch $SEGMENT -noParsing
  nutch parse $SEGMENT
  # Fold the fetch results back into the crawldb.
  nutch updatedb crawldb $SEGMENT -filter -normalize
done
# Note that these last steps run over ALL segments, not just the new ones:
nutch invertlinks linkdb -dir segments
nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
nutch solrdedup http://127.0.0.1:8983/solr
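(As I read the invertlinks usage, -dir segments is just shorthand for
listing every segment directory explicitly, so each run effectively does
something like the following, with SEG1, SEG2, ... standing in for every
accumulated segment:

nutch invertlinks linkdb segments/SEG1 segments/SEG2 segments/SEG3

That is, all of them get reprocessed every time. Please correct me if I
misread that.)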
Why are the invertlinks and solrindex steps run on the entire segments
dir rather than only on the last $SEGMENT? I ask because the number of
segment directories equals depth * (number of times the recrawl script
has been run), which of course keeps growing over time and might become
a performance problem.
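To make that concrete with made-up numbers: at depth=3 and one recrawl
per day, a month leaves 3 * 30 = 90 segment directories, and every run
feeds all 90 to invertlinks and solrindex again. The count is easy to
watch:

ls -d segments/*/ | wc -l   # number of accumulated segment directories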
Doesn't the version below work just as well, running the invertlinks and
solrindex steps only on the last segment? What would the difference be
if I did it this way?
nutch inject crawldb urls
for ((i=1; i <= depth; i++))
do
  nutch generate crawldb segments -topN 500
  export SEGMENT=segments/`ls -tr segments | tail -1`
  nutch fetch $SEGMENT -noParsing
  nutch parse $SEGMENT
  nutch updatedb crawldb $SEGMENT -filter -normalize
  # Changed: invert links and index only the segment just fetched.
  nutch invertlinks linkdb $SEGMENT
  nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb $SEGMENT
done
nutch solrdedup http://127.0.0.1:8983/solr
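If indexing per segment is indeed safe, I'm also thinking about pruning
old segments once they have been indexed. A rough sketch of my own (not
from the original script; KEEP is a knob I made up, and head -n -N
assumes GNU coreutils):

# Keep only the KEEP newest segments, delete the rest.
KEEP=10
ls -tr segments | head -n -$KEEP | while read old
do
  rm -r "segments/$old"
done

But I don't know whether the linkdb or a later solrindex still needs the
older segments, which is really the same question as above.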
Any insights are appreciated.
Regards,
Jeroen