Hi Chris

Your script looks OK. Could it be a redirection (e.g.
http://bbc.co.uk/news => http://www.bbc.co.uk/news)?
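
A quick way to check is to look at the seed's entry in the crawldb after the
first round (the commands below just reuse the variables from your script):

  $nutch readdb $crawlDir/crawldb -url http://www.bbc.co.uk/news
  $nutch readdb $crawlDir/crawldb -stats

If the status shows a redirect (db_redir_temp / db_redir_perm), the first
fetch only recorded the redirect target, which would explain why the linked
pages don't get picked up until a later round.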

Note: segment=`ls -d $crawlDir/segments/2* | tail -1` won't work in
distributed mode.
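
In distributed mode the segments live on HDFS, so a plain ls on the local
filesystem won't see them. An untested sketch of an alternative, assuming the
hadoop command is on your PATH:

  segment=`hadoop fs -ls $crawlDir/segments | grep segments/2 | tail -1 | awk '{print $NF}'`

That lists the segment directories through the Hadoop filesystem shell and
keeps the last one, the same way your ls | tail -1 does locally.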

$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments will take more
and more time as you accumulate segments; use '$nutch invertlinks
$crawlDir/linkdb $segment' instead, so only the newly fetched segment is
processed.
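
For example, inside the loop (in place of the commented-out invertlinks
line), something like:

  $nutch invertlinks $crawlDir/linkdb $segment

The links from each new segment get merged into the existing linkdb, so
nothing is lost by dropping -dir.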

Jul

On 13 July 2011 14:46, Chris Alexander <[email protected]> wrote:

> Hi all,
>
> I am implementing a script for doing some incremental and staggered
> indexing (I believe my colleague Matthew Painter has already asked some
> questions along these lines), and I am seeing a few issues with what I've
> written that I could use some clarification on.
>
> The script below runs through the commands needed to do a whole-web crawl
> (on a limited URL list). The seeds/urls file contains only one line,
> http://www.bbc.co.uk/news. The first run through the while loop crawls only
> this URL. I would then expect the second iteration to crawl the pages
> linked to by the original URL. However, it again crawls only
> http://www.bbc.co.uk/news in the second iteration. It is only on the third
> iteration through the loop that the pages linked from the original URL are
> actually picked up. Is this expected behaviour? If not, is there something
> I've done wrong in the script?
>
> Many thanks for any help
>
> Chris
>
>
> The script:
> --------------------------------
>
> #
> # Facilitate incremental crawling with Nutch and Solr
> #
>
> # Set the location of the Nutch runtime/local directory
> NUTCH_HOME=/solr/nutch/runtime/local
>
> # Specify the Solr location
> SOLR_HOST=10.0.2.251
> SOLR_PORT=7080
>
> # Specify options for the crawl
> depth=3
>
> # The Nutch executable
> nutch=$NUTCH_HOME/bin/nutch
>
> # Directories relating to Nutch functionality
> sourceUrlDir=$NUTCH_HOME/seeds
> crawlDir=$NUTCH_HOME/crawl
>
> echo "Inject the URLs to crawl into Nutch"
> $nutch inject $crawlDir/crawldb $sourceUrlDir
>
> i=0
> while [[ $i -lt $depth ]]
> do
>
>        echo "Generate the list of URLs to crawl"
>        $nutch generate $crawlDir/crawldb $crawlDir/segments
>
>        echo "Retrieve a segment"
>        segment=`ls -d $crawlDir/segments/2* | tail -1`
>
>        echo "Fetch that segment"
>        $nutch fetch $segment
>
>        echo "Parse the retrieved segment for URLs"
>        $nutch parse $segment
>
>        echo "Update the crawl database with the results of the crawl and parse"
>        $nutch updatedb $crawlDir/crawldb $segment
>
>        # Invert the links of the crawl results
>        #$nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
>
>        # Push the whole lot off to Solr
>        #$nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*
>
>        ((i++))
> done
>
> echo "Invert links and push it off to solr"
> $nutch invertlinks $crawlDir/linkdb -dir $crawlDir/segments
> $nutch solrindex http://$SOLR_HOST:$SOLR_PORT/solr/ $crawlDir/crawldb $crawlDir/linkdb $crawlDir/segments/*
>
> echo "Deleting used crawl directory"
> rm -r $crawlDir
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
