Use the -maxNumSegments option on the generator to make more segments per generate run. Then loop through them with bash and fetch and parse each one.
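Something along these lines should do it (a rough sketch only, assuming the same crawl/ layout as in your script below; the -maxNumSegments value, the grep filter on the listing, and moving finished segments to crawl/old are my own assumptions, adjust to taste):

# generate up to 3 segments in one pass instead of one
bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -maxNumSegments 3 -noFilter

# loop over all freshly generated segments and process each one
for SEGMENT in `hadoop fs -ls crawl/segments | grep segments | awk '{print $8}'`
do
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  # move the finished segment out of the way so the next listing stays clean
  hadoop fs -mv $SEGMENT crawl/old
done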
On Thursday 29 December 2011 21:29:02 Bai Shen wrote:
> Currently, I'm using a shell script to run my nutch crawl. It seems to
> work okay, but it only generates one segment at a time. Does anybody have
> any suggestions for how to improve it, make it work with multiple segments,
> etc?
>
> Thanks.
>
>
> while true
> do
> bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noParm
> export SEGMENT=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`
> bin/nutch fetch $SEGMENT
> bin/nutch parse $SEGMENT
> bin/nutch updatedb crawl/crawldb $SEGMENT
> bin/nutch invertlinks crawl/linkdb $SEGMENT
> bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
> bin/nutch solrdedup http://solr:8080/solr
> hadoop fs -mv $SEGMENT crawl/old
> done
--
Markus Jelsma - CTO - Openindex