Hi Jul,

Yes, it looks like that is the case - an unfortunate URL to pick! Many thanks
for the tip too, greatly appreciated.
Chris

On 13 July 2011 15:05, Julien Nioche <[email protected]> wrote:

> Hi Chris
>
> Your script looks OK, could it be a redirection e.g.
> http://bbc.co.uk/news => http://www.bbc.co.uk/news ?
>
> Note:
> segment=`ls -d $crawlDir/segments/2* | tail -1` won't work in distributed
> mode
>
> $nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments => will take
> more and more time as you get more segments; use '$nutch invertlinks
> $crawlDir/linkdb $segment' instead
>
> Jul
>
> On 13 July 2011 14:46, Chris Alexander <[email protected]> wrote:
>
> > Hi all,
> >
> > I am implementing a script for doing some incremental and staggered
> > indexing (I believe my colleague Matthew Painter has already asked some
> > questions to this effect) and I am seeing a few issues with what I've
> > written that I could use an explanation / clarification for.
> >
> > The script below runs through the commands needed to do a whole-web crawl
> > (on a limited URL list). The seeds/urls file contains only one line,
> > http://www.bbc.co.uk/news. The first run through the while loop crawls
> > only this URL. I would then expect the second loop to crawl the pages
> > linked to by the original URL. However instead it crawls only
> > http://www.bbc.co.uk/news in the second iteration too. It is only on the
> > third iteration through the loop that pages that were linked from the
> > original URL are actually picked up. Is this expected behaviour? If not,
> > is there something I've done wrong in the script?
> >
> > Many thanks for any help
> >
> > Chris
> >
> >
> > The script:
> > --------------------------------
> >
> > #
> > # Facilitate incremental crawling with Nutch and Solr
> > #
> >
> > # Set the location of the Nutch runtime/local directory
> > NUTCH_HOME=/solr/nutch/runtime/local
> >
> > # Specify the Solr location
> > SOLR_HOST=10.0.2.251
> > SOLR_PORT=7080
> >
> > # Specify options for the crawl
> > depth=3
> >
> > # The Nutch executable
> > nutch=$NUTCH_HOME/bin/nutch
> >
> > # Directories relating to Nutch functionality
> > sourceUrlDir=$NUTCH_HOME/seeds
> > crawlDir=$NUTCH_HOME/crawl
> >
> > echo "Inject the URLs to crawl into Nutch"
> > $nutch inject $crawlDir/crawldb $sourceUrlDir
> >
> > i=0
> > while [[ $i -lt $depth ]]
> > do
> >
> >     echo "Generate the list of URLs to crawl"
> >     $nutch generate $crawlDir/crawldb $crawlDir/segments
> >
> >     echo "Retrieve a segment"
> >     segment=`ls -d $crawlDir/segments/2* | tail -1`
> >
> >     echo "Fetch that segment"
> >     $nutch fetch $segment
> >
> >     echo "Parse the retrieved segment for URLs"
> >     $nutch parse $segment
> >
> >     echo "Update the crawl database with the results of the crawl and parse"
> >     $nutch updatedb $crawlDir/crawldb $segment
> >
> >     # Invert the links of the crawl results
> >     #$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
> >
> >     # Push the whole lot off to Solr
> >     #$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*
> >
> >     ((i++))
> > done
> >
> > echo "Invert links and push it off to solr"
> > $nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
> > $nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*
> >
> > echo "Deleting used crawl directory"
> > rm -r $crawlDir
> >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
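
A quick way to confirm the redirection Julien suspects, independent of Nutch,
is to look at the HTTP response headers for the seed URL. This is only a
sketch using curl; the exact redirect target is an assumption based on the
thread:

# Print the status line and any Location header for the non-www seed URL.
# A 301/302 pointing at http://www.bbc.co.uk/news would confirm the redirect.
curl -sI http://bbc.co.uk/news | grep -iE '^(HTTP|Location)'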

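On the first note (picking the latest segment in distributed mode): when the
crawl directory lives on HDFS there is no local path for `ls -d` to list, so
the segment name has to come from the Hadoop filesystem instead. A rough
sketch, assuming the path is the last column of `hadoop fs -ls` output (the
column layout varies between Hadoop versions):

# List the segments on HDFS, keep only the segment paths, and take the newest
# one (segment names are timestamps, so a plain sort puts the latest last).
segment=`hadoop fs -ls $crawlDir/segments/ | awk '{print $NF}' | grep "segments/2" | sort | tail -1`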

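On the second note, the per-segment form of invertlinks can replace the
commented-out lines inside the while loop, so each iteration only processes
the segment it just fetched instead of rescanning all of them. A sketch along
the lines Julien suggests, reusing the variables from the script above;
indexing each segment into Solr as it goes is optional and simply mirrors the
solrindex call at the end of the script:

# After updatedb, invert the links of only the segment that was just fetched.
$nutch invertlinks $crawlDir/linkdb $segment

# Optionally, index just that segment into Solr on each iteration as well.
$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $segment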