Currently I'm using the shell script below to run my Nutch crawl. It seems to
work okay, but it only generates and processes one segment at a time. Does
anybody have suggestions for how to improve it, in particular for making it
work with multiple segments?
Thanks.
while true
do
  # Generate a new segment with the top-scoring 10,000 URLs
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
  # Pick up the segment that was just created (timestamped names sort last)
  SEGMENT=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`
  # Fetch and parse it, then fold the results back into the crawldb and linkdb
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb $SEGMENT
  # Index into Solr and remove duplicate documents
  bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
  bin/nutch solrdedup http://solr:8080/solr
  # Move the processed segment out of the active segments directory
  hadoop fs -mv $SEGMENT crawl/old
done
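
In case it helps to show what I mean, here is the rough direction I've been thinking about (an untested sketch, not something I've run): have generate create several segments per round via its -maxNumSegments option, then loop over whatever is sitting in crawl/segments. The segment count of 5 is an arbitrary choice, the awk filter assumes the eight-column listing format of hadoop fs -ls (which can differ between Hadoop versions), and the paths and Solr URL are just the ones from the script above.

while true
do
  # Generate several segments per round instead of one
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -maxNumSegments 5 -noFilter -noNorm
  # Process every segment currently in crawl/segments
  # (require all eight columns so the "Found N items" header is skipped)
  for SEGMENT in `hadoop fs -ls crawl/segments | awk 'NF == 8 {print $8}'`
  do
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    bin/nutch invertlinks crawl/linkdb $SEGMENT
    bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
    hadoop fs -mv $SEGMENT crawl/old
  done
  # Dedup once per round rather than once per segment
  bin/nutch solrdedup http://solr:8080/solr
done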