Use the -maxNumSegments option on the generator to make more segments per generate run. Then loop through them with bash and fetch and parse each one.
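Something along these lines should do it (a rough sketch only, assuming the same crawl/ layout as in your script below; the -maxNumSegments value, the grep filter on the listing, and moving finished segments to crawl/old are my own assumptions, adjust to taste):

# generate up to 3 segments in one pass instead of one
bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -maxNumSegments 3 -noFilter

# loop over all freshly generated segments and process each one
for SEGMENT in `hadoop fs -ls crawl/segments | grep segments | awk '{print $8}'`
do
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  # move the finished segment out of the way so the next listing stays clean
  hadoop fs -mv $SEGMENT crawl/old
done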
On Thursday 29 December 2011 21:29:02 Bai Shen wrote:
> Currently, I'm using a shell script to run my nutch crawl. It seems to
> work okay, but it only generates one segment at a time. Does anybody have
> any suggestions for how to improve it, make it work with multiple segments,
> etc?
>
> Thanks.
>
>
> while true
> do
> bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noParm
> export SEGMENT=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`
> bin/nutch fetch $SEGMENT
> bin/nutch parse $SEGMENT
> bin/nutch updatedb crawl/crawldb $SEGMENT
> bin/nutch invertlinks crawl/linkdb $SEGMENT
> bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
> bin/nutch solrdedup http://solr:8080/solr
> hadoop fs -mv $SEGMENT crawl/old
> done
--
Markus Jelsma - CTO - Openindex