On 2010-07-01 23:01, Jeroen van Vianen wrote:
Hi,
I found a recrawl script to incrementally index a list of URLs. It
basically contains the following steps (minor details left out):
nutch inject crawldb urls
for ((i=1; i <= depth ; i++))
do
  nutch generate crawldb segments -topN 500
  export SEGMENT=segments/`ls -tr segments|tail -1`
  nutch fetch $SEGMENT -noParsing
  nutch parse $SEGMENT
  nutch updatedb crawldb $SEGMENT -filter -normalize
done
nutch invertlinks linkdb -dir segments
nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
nutch solrdedup http://127.0.0.1:8983/solr
Why are the invertlinks and solrindex steps done on the entire segments
dir rather than only on the last $SEGMENT? I'd like to know because the
number of segment directories equals depth * (number of times the
recrawl script has been run), which of course keeps growing over time
and might become a performance problem.
Wouldn't the version below work just as well, while running the
invertlinks and solrindex steps only on the last segment? What
difference would it make if I did it this way?
nutch inject crawldb urls
for ((i=1; i <= depth ; i++))
do
  nutch generate crawldb segments -topN 500
  export SEGMENT=segments/`ls -tr segments|tail -1`
  nutch fetch $SEGMENT -noParsing
  nutch parse $SEGMENT
  nutch updatedb crawldb $SEGMENT -filter -normalize
  nutch invertlinks linkdb $SEGMENT
  nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb $SEGMENT
done
nutch solrdedup http://127.0.0.1:8983/solr
Any insights are appreciated.
In the second version of your script, the linkdb is updated
incrementally, which means the inlink (and anchor text) information is
also collected incrementally, and for the same target page it changes as
you collect more inlinks. Eventually the linkdbs will be the same.
However, in the second script the set of inlinks available for the same
page when segment10 is indexed will be different from the set available
when segment1 is indexed - in fact, the universe of inlinks (and
associated anchor text) for segment1 will be extremely limited, because
it will come only from the data in segment1.
This means in turn that the accumulated anchor text for the same page
will be different when you use the first and the second script - the
first script will submit for indexing a much richer set of anchor texts,
because it will work from a complete linkdb.
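To make the difference concrete, here is a toy model in plain shell. It
is only a sketch and involves no Nutch at all: the functions pages_of,
links_of, anchors_for and index_segment, the page names and the anchor
strings are all made up for illustration. A string variable stands in
for the linkdb, and "indexing" a segment simply records the anchors
known for each of its pages at that moment.

```sh
#!/bin/sh
# Toy model only (hypothetical data and helper names, no Nutch involved).

# Pages fetched in each segment.
pages_of() {
  case "$1" in
    1) echo "home about" ;;
    2) echo "blog" ;;
    3) echo "contact" ;;
  esac
}

# Outlinks found in each segment, one "target|anchor text" per line.
links_of() {
  case "$1" in
    1) printf '%s\n' "about|About us" ;;
    2) printf '%s\n' "about|our team" "home|start page" ;;
    3) printf '%s\n' "home|back home" ;;
  esac
}

# anchors_for LINKDB PAGE: anchors the linkdb holds for PAGE, joined by "; ".
anchors_for() {
  printf '%s\n' "$1" | awk -F'|' -v page="$2" '
    $1 == page { out = (out == "" ? $2 : out "; " $2) }
    END { print out }'
}

# index_segment LINKDB N: emit "page=anchors" for every page in segment N,
# using whatever the linkdb contains at the time of the call.
index_segment() {
  for page in $(pages_of "$2"); do
    printf '%s=%s\n' "$page" "$(anchors_for "$1" "$page")"
  done
}

# Like the first script: build the linkdb from all segments, then index.
batch_index=$(
  linkdb=$(for n in 1 2 3; do links_of "$n"; done)
  for n in 1 2 3; do index_segment "$linkdb" "$n"; done
)

# Like the second script: grow the linkdb and index one segment at a time.
incremental_index=$(
  linkdb=""
  for n in 1 2 3; do
    linkdb=$(printf '%s\n' "$linkdb"; links_of "$n")
    index_segment "$linkdb" "$n"
  done
)

echo "Complete linkdb, then index everything:"
echo "$batch_index"
echo "Index each segment as it arrives:"
echo "$incremental_index"
```

In this toy run both variants finish with the same linkdb, but only the
first one indexes "home" and "about" with all of their anchors. The
second one keeps whatever was known when segment 1 was indexed, since
those pages are not submitted for indexing again.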
In a situation where your crawling frontier is already relatively
stable, i.e. you already collected most of the link graph and you work
with segment10000 and segment10001 ;) the linkdb will already be mostly
complete and changing only slightly. If you can dismiss these changes as
mostly irrelevant, then the second script will become roughly equivalent
to the first script.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com