Hi all, I am implementing a script to do some incremental and staggered indexing (I believe my colleague Matthew Painter has already asked some questions along these lines), and I am seeing a few issues with what I've written that I could use some clarification on.
The script below runs through the commands needed to do a whole-web crawl (on a limited URL list). The seeds/urls file contains only one line, http://www.bbc.co.uk/news. The first run through the while loop crawls only this URL. I would then expect the second iteration of the loop to crawl the pages linked to by the original URL. However, in the second iteration it again crawls only http://www.bbc.co.uk/news. It is only on the third iteration through the loop that the pages linked from the original URL are actually picked up. Is this expected behaviour? If not, is there something I've done wrong in the script?

Many thanks for any help,

Chris

The script:
--------------------------------
#!/bin/bash
#
# Facilitate incremental crawling with Nutch and Solr
#

# Set the location of the Nutch runtime/local directory
NUTCH_HOME=/solr/nutch/runtime/local

# Specify the Solr location
SOLR_HOST=10.0.2.251
SOLR_PORT=7080

# Specify options for the crawl
depth=3

# The Nutch executable
nutch=$NUTCH_HOME/bin/nutch

# Directories relating to Nutch functionality
sourceUrlDir=$NUTCH_HOME/seeds
crawlDir=$NUTCH_HOME/crawl

echo "Inject the URLs to crawl into Nutch"
$nutch inject $crawlDir/crawldb $sourceUrlDir

i=0
while [[ $i -lt $depth ]]
do
    echo "Generate the list of URLs to crawl"
    $nutch generate $crawlDir/crawldb $crawlDir/segments

    echo "Retrieve a segment"
    segment=`ls -d $crawlDir/segments/2* | tail -1`

    echo "Fetch that segment"
    $nutch fetch $segment

    echo "Parse the retrieved segment for URLs"
    $nutch parse $segment

    echo "Update the crawl database with the results of the crawl and parse"
    $nutch updatedb $crawlDir/crawldb $segment

    # Invert the links of the crawl results
    #$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments

    # Push the whole lot off to Solr
    #$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*

    ((i++))
done

echo "Invert links and push it off to Solr"
$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*

echo "Deleting used crawl directory"
rm -r $crawlDir
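
In case it helps with the diagnosis, here is a rough sketch of how the crawl database could be inspected between iterations (this assumes the standard readdb tool exposed by bin/nutch in 1.x; the exact statistics printed may vary by version). Dropping it in after the updatedb step should show whether the outlinks from the parse have actually landed in the crawldb as unfetched entries after each pass:

    # Diagnostic sketch (assumes the Nutch 1.x readdb tool):
    # print crawl database statistics after each updatedb so the number of
    # unfetched entries can be compared from one iteration to the next.
    echo "Crawl database stats after iteration $i"
    $nutch readdb $crawlDir/crawldb -stats

If the unfetched count rises straight after the first updatedb, the outlinks are being recorded and the question becomes why the second generate does not select them; if it only rises later, the parse/updatedb side would be the place to look.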

