Hi Chris,

Your script looks OK. Could it be a redirection, e.g. http://bbc.co.uk/news => http://www.bbc.co.uk/news?

Note:
- segment=`ls -d $crawlDir/segments/2* | tail -1` won't work in distributed mode.
- '$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments' will take more and more time as you get more segments; use '$nutch invertlinks $crawlDir/linkdb $segment' instead (rough sketch below).
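To check the redirect quickly (assuming curl is available on the crawl box; this is just a diagnostic, not part of the script), something like:

    curl -sI http://www.bbc.co.uk/news | grep -i '^location'

will print the Location header if the seed answers with a 301/302.

For the invertlinks/solrindex point, here is a rough, untested sketch of what the commented-out block inside your loop could look like, reusing the variables from your script and passing only the segment fetched in that iteration:

    # invert links for just the segment fetched in this iteration;
    # invertlinks merges the new links into the existing linkdb
    $nutch invertlinks $crawlDir/linkdb $segment

    # index only that segment into Solr; same positional arguments as your
    # commented-out line, but a single segment instead of $crawlDir/segments/*
    $nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb \
        $crawlDir/linkdb $segment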
Jul

On 13 July 2011 14:46, Chris Alexander <[email protected]> wrote:

> Hi all,
>
> I am implementing a script for doing some incremental and staggered
> indexing (I believe my colleague Matthew Painter has already asked some
> questions to this effect) and I am seeing a few issues with what I've
> written that I could use an explanation / clarification for.
>
> The script below runs through the commands needed to do a whole-web crawl
> (on a limited URL list). The seeds/urls file contains only one line,
> http://www.bbc.co.uk/news. The first run through the while loop crawls
> only this URL. I would then expect the second loop to crawl the pages
> linked to by the original URL. However, instead it crawls only
> http://www.bbc.co.uk/news in the second iteration too. It is only on the
> third iteration through the loop that pages that were linked from the
> original URL are actually picked up. Is this expected behaviour? If not,
> is there something I've done wrong in the scripts?
>
> Many thanks for any help
>
> Chris
>
>
> The script:
> --------------------------------
>
> #
> # Facilitate incremental crawling with Nutch and Solr
> #
>
> # Set the location of the Nutch runtime/local directory
> NUTCH_HOME=/solr/nutch/runtime/local
>
> # Specify the Solr location
> SOLR_HOST=10.0.2.251
> SOLR_PORT=7080
>
> # Specify options for the crawl
> depth=3
>
> # The Nutch executable
> nutch=$NUTCH_HOME/bin/nutch
>
> # Directories relating to Nutch functionality
> sourceUrlDir=$NUTCH_HOME/seeds
> crawlDir=$NUTCH_HOME/crawl
>
> echo "Inject the URLs to crawl into Nutch"
> $nutch inject $crawlDir/crawldb $sourceUrlDir
>
> i=0
> while [[ $i -lt $depth ]]
> do
>     echo "Generate the list of URLs to crawl"
>     $nutch generate $crawlDir/crawldb $crawlDir/segments
>
>     echo "Retrieve a segment"
>     segment=`ls -d $crawlDir/segments/2* | tail -1`
>
>     echo "Fetch that segment"
>     $nutch fetch $segment
>
>     echo "Parse the retrieved segment for URLs"
>     $nutch parse $segment
>
>     echo "Update the crawl database with the results of the crawl and parse"
>     $nutch updatedb $crawlDir/crawldb $segment
>
>     # Invert the links of the crawl results
>     #$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
>
>     # Push the whole lot off to Solr
>     #$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*
>
>     ((i++))
> done
>
> echo "Invert links and push it off to solr"
> $nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
> $nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*
>
> echo "Deleting used crawl directory"
> rm -r $crawlDir

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

