Re: Recrawl script question

Andrzej Bialecki Thu, 01 Jul 2010 14:41:25 -0700

On 2010-07-01 23:01, Jeroen van Vianen wrote:

Hi,


I found a recrawl script to incrementally index a list of URLs. It
basically contains the following steps (minor details left out):

nutch inject crawldb urls
for ((i=1; i <= depth; i++))
do
    nutch generate crawldb segments -topN 500
    export SEGMENT=segments/$(ls -tr segments | tail -1)
    nutch fetch "$SEGMENT" -noParsing
    nutch parse "$SEGMENT"
    nutch updatedb crawldb "$SEGMENT" -filter -normalize
done
nutch invertlinks linkdb -dir segments
nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb segments/*
nutch solrdedup http://127.0.0.1:8983/solr

Why are the invertlinks and solrindex steps done on the entire segments
dir rather than only the last $SEGMENT? I'd like to know because the
number of segment directories equals depth * (number of times the recrawl
script has been run), which of course keeps growing over time and
might become a performance problem.

Doesn't the below version work just as well while doing the invertlinks
and solrindex steps only on the last segment? What is the difference if
I would do it this way?

nutch inject crawldb urls
for ((i=1; i <= depth; i++))
do
    nutch generate crawldb segments -topN 500
    export SEGMENT=segments/$(ls -tr segments | tail -1)
    nutch fetch "$SEGMENT" -noParsing
    nutch parse "$SEGMENT"
    nutch updatedb crawldb "$SEGMENT" -filter -normalize
    nutch invertlinks linkdb "$SEGMENT"
    nutch solrindex http://127.0.0.1:8983/solr crawldb linkdb "$SEGMENT"
done
nutch solrdedup http://127.0.0.1:8983/solr

Any insights are appreciated.

In the second version of your script, the linkdb is updated incrementally, which means the inlink (and anchor text) information is also collected incrementally, and for the same target page it changes as you collect more inlinks. Eventually the linkdb-s will be the same.

However, in the second script, for segment10 the set of inlinks will be different for the same page than in segment1. In fact, the universe of inlinks (and associated anchor text) for segment1 will be extremely limited, because it will come only from the data in segment1.

This means in turn that the accumulated anchor text for the same page will be different when you use the first and the second script: the first script will submit for indexing a much richer set of anchor texts, because it will work from a complete linkdb.

In a situation where your crawling frontier is already relatively stable, i.e. you already collected most of the link graph and you work with segment10000 and segment10001 ;) the linkdb will already be mostly complete and changing only slightly. If you can dismiss these changes as mostly irrelevant, then the second script will become roughly equivalent to the first script.
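
The difference between the two scripts can be illustrated with a toy model in plain bash. This is only a sketch and does not call Nutch at all: the page names, anchors, and the functions invert_links, index_page, first_script and second_script are made up for illustration. Each "segment" is assumed to hold one fetched page plus the outlinks found on it, and "indexing" a page simply prints the anchor text the linkdb holds for it at that moment.

```bash
#!/usr/bin/env bash
# Toy model only (requires bash 4+ for associative arrays); no Nutch involved.
# All names and data below are hypothetical.

seg_page=(  "a.html"                    # page fetched in segment 1
            "b.html"                    # segment 2
            "c.html" )                  # segment 3
seg_links=( "b.html=about"              # outlinks found in segment 1 (target=anchor)
            "a.html=home c.html=docs"   # segment 2
            "a.html=start b.html=team" )  # segment 3

declare -A linkdb                       # target page -> accumulated anchor texts

invert_links() {                        # merge one segment's outlinks into linkdb
    local pair
    for pair in $1; do                  # unquoted on purpose: split on spaces
        linkdb["${pair%%=*}"]+=" ${pair#*=}"
    done
}

index_page() {                          # print what would be sent to the index
    echo "$1:${linkdb[$1]:- (none)}"
}

first_script() {                        # invert all segments, then index all
    local i
    linkdb=()
    for i in "${!seg_links[@]}"; do invert_links "${seg_links[i]}"; done
    for i in "${!seg_page[@]}";  do index_page   "${seg_page[i]}";  done
}

second_script() {                       # invert and index one segment at a time
    local i
    linkdb=()
    for i in "${!seg_page[@]}"; do
        invert_links "${seg_links[i]}"
        index_page   "${seg_page[i]}"
    done
}

echo "first script:";  first_script
echo "second script:"; second_script
```

In this toy data, a.html is fetched in the first segment but is only linked to from pages fetched later. The first variant therefore indexes it with both anchors ("home start"), while the second indexes it with none, and b.html gets "about" instead of "about team". Pages fetched in late segments (c.html here) come out the same in both variants, which mirrors the stable-frontier case described above.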


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
