Hi Jul,

Yes, it looks like that's the case: the seed URL redirects, so the first
iteration only registers the redirect target, and the outlinks don't get
fetched until the third pass. An unfortunate URL to pick! Many thanks for the
tips too, greatly appreciated.

Chris

On 13 July 2011 15:05, Julien Nioche <[email protected]> wrote:

> Hi Chris
>
> Your script looks OK. Could it be a redirection, e.g.
> http://bbc.co.uk/news => http://www.bbc.co.uk/news ?
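>
> A quick way to confirm the redirect from the command line (a hypothetical
> check, assuming curl is available on the crawl box):
>
> # a 301/302 status plus a Location header pointing at the other host
> # would confirm it
> curl -sI http://bbc.co.uk/news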
>
> Note:
> segment=`ls -d $crawlDir/segments/2* | tail -1` won't work in distributed
> mode, since the segments live on HDFS rather than the local filesystem.
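>
> If you do move to distributed mode, one common workaround (a sketch,
> assuming the crawl dir lives on HDFS and the hadoop CLI is on the PATH)
> is to list the segments through the Hadoop filesystem shell instead:
>
> # the path is the last field of each listing line; segment names are
> # timestamps, so the last entry is the newest
> segment=`hadoop fs -ls $crawlDir/segments | tail -1 | awk '{print $NF}'`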
>
> $nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments => will take
> more and more time as you get more segments; use '$nutch invertlinks
> $crawlDir/linkdb $segment' instead
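>
> i.e. inside your while loop, straight after the updatedb step, invert only
> the segment you just fetched:
>
> $nutch invertlinks $crawlDir/linkdb $segment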
>
> Jul
>
> On 13 July 2011 14:46, Chris Alexander <[email protected]> wrote:
>
> > Hi all,
> >
> > I am implementing a script for doing some incremental and staggered
> > indexing (I believe my colleague Matthew Painter has already asked some
> > questions to this effect) and I am seeing a few issues with what I've
> > written that I could use an explanation / clarification for.
> >
> > The script below runs through the commands needed to do a whole-web crawl
> > (on a limited URL list). The seeds/urls file contains only one line,
> > http://www.bbc.co.uk/news. The first run through the while loop crawls
> > only this URL. I would then expect the second loop to crawl the pages
> > linked to by the original URL. However, instead it crawls only
> > http://www.bbc.co.uk/news in the second iteration too. It is only on the
> > third iteration through the loop that pages linked from the original URL
> > are actually picked up. Is this expected behaviour? If not, is there
> > something I've done wrong in the script?
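> >
> > A quick sanity check between iterations (using the standard readdb tool)
> > would show what each pass actually fetched:
> >
> >        # print URL counts by status (db_unfetched, db_fetched, ...)
> >        $nutch readdb $crawlDir/crawldb -stats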
> >
> > Many thanks for any help
> >
> > Chris
> >
> >
> > The script:
> > --------------------------------
> >
> > #
> > # Facilitate incremental crawling with Nutch and Solr
> > #
> >
> > # Set the location of the Nutch runtime/local directory
> > NUTCH_HOME=/solr/nutch/runtime/local
> >
> > # Specify the Solr location
> > SOLR_HOST=10.0.2.251
> > SOLR_PORT=7080
> >
> > # Specify options for the crawl
> > depth=3
> >
> > # The Nutch executable
> > nutch=$NUTCH_HOME/bin/nutch
> >
> > # Directories relating to Nutch functionality
> > sourceUrlDir=$NUTCH_HOME/seeds
> > crawlDir=$NUTCH_HOME/crawl
> >
> > echo "Inject the URLs to crawl into Nutch"
> > $nutch inject $crawlDir/crawldb $sourceUrlDir
> >
> > i=0
> > while [[ $i -lt $depth ]]
> > do
> >
> >        echo "Generate the list of URLs to crawl"
> >        $nutch generate $crawlDir/crawldb $crawlDir/segments
> >
> >        echo "Retrieve a segment"
> >        segment=`ls -d $crawlDir/segments/2* | tail -1`
> >
> >        echo "Fetch that segment"
> >        $nutch fetch $segment
> >
> >        echo "Parse the retrieved segment for URLs"
> >        $nutch parse $segment
> >
> >        echo "Update the crawl database with the results of the crawl and
> > parse"
> >        $nutch updatedb $crawlDir/crawldb $segment
> >
> >        # Invert the links of the crawl results
> >        #$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
> >
> >        # Push the whole lot off to Solr
> >        #$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*
> >
> >        ((i++))
> > done
> >
> > echo "Invert links and push it off to solr"
> > $nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
> > $nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb
> > $crawlDir/linkdb $crawlDir/segments/*
> >
> > echo "Deleting used crawl directory"
> > rm -r $crawlDir
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
