Hi all,

I am implementing a script to do some incremental and staggered indexing
(I believe my colleague Matthew Painter has already asked some questions
along these lines), and I am seeing a few issues with what I've written
that I could use some explanation or clarification on.

The script below runs through the commands needed for a whole-web crawl
(on a limited URL list). The seeds/urls file contains only one line,
http://www.bbc.co.uk/news. The first pass through the while loop crawls only
this URL. I would then expect the second iteration to crawl the pages linked
from the original URL; instead, it crawls only http://www.bbc.co.uk/news
again. It is only on the third iteration that the pages linked from the
original URL are actually picked up. Is this expected behaviour? If not, is
there something I've done wrong in the script?
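
In case it helps with diagnosis, the state of the crawldb and of each
segment can be inspected between iterations with something like the
following (a sketch; I'm assuming the readdb and readseg tools in this
Nutch build support these options):

$nutch readdb $crawlDir/crawldb -stats   # status counts (db_unfetched, db_fetched, ...)
$nutch readseg -list $segment            # generated/fetched/parsed counts for one segment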

Many thanks for any help

Chris


The script:
--------------------------------

#
# Facilitate incremental crawling with Nutch and Solr
#

# Set the location of the Nutch runtime/local directory
NUTCH_HOME=/solr/nutch/runtime/local

# Specify the Solr location
SOLR_HOST=10.0.2.251
SOLR_PORT=7080

# Specify options for the crawl
depth=3

# The Nutch executable
nutch=$NUTCH_HOME/bin/nutch

# Directories relating to Nutch functionality
sourceUrlDir=$NUTCH_HOME/seeds
crawlDir=$NUTCH_HOME/crawl

echo "Inject the URLs to crawl into Nutch"
$nutch inject $crawlDir/crawldb $sourceUrlDir

i=0
while [[ $i -lt $depth ]]
do

        echo "Generate the list of URLs to crawl"
        $nutch generate $crawlDir/crawldb $crawlDir/segments

        echo "Retrieve a segment"
        segment=`ls -d $crawlDir/segments/2* | tail -1`

        echo "Fetch that segment"
        $nutch fetch $segment

        echo "Parse the retrieved segment for URLs"
        $nutch parse $segment

        echo "Update the crawl database with the results of the crawl and
parse"
        $nutch updatedb $crawlDir/crawldb $segment

        # Invert the links of the crawl results
        #$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments

        # Push the whole lot off to Solr
        #$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*

        ((i++))
done

echo "Invert links and push it off to solr"
$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ \
        $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*

echo "Deleting used crawl directory"
rm -r $crawlDir
