Currently I'm using the shell script below to run my Nutch crawl. It seems to
work okay, but it only generates and processes one segment at a time. Does
anybody have suggestions for how to improve it, in particular for making it
work with multiple segments?
Thanks.
while true
do
  # Generate a new segment with the top-scoring 10,000 URLs
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
  # Pick up the segment that was just created (timestamped names sort last)
  SEGMENT=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`
  # Fetch and parse it, then fold the results back into the crawldb and linkdb
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb $SEGMENT
  # Index into Solr and remove duplicate documents
  bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
  bin/nutch solrdedup http://solr:8080/solr
  # Move the processed segment out of the active segments directory
  hadoop fs -mv $SEGMENT crawl/old
done
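
In case it helps to show what I mean, here is the rough direction I've been thinking about (an untested sketch, not something I've run): have generate create several segments per round via its -maxNumSegments option, then loop over whatever is sitting in crawl/segments. The segment count of 5 is an arbitrary choice, the awk filter assumes the eight-column listing format of hadoop fs -ls (which can differ between Hadoop versions), and the paths and Solr URL are just the ones from the script above.

while true
do
  # Generate several segments per round instead of one
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -maxNumSegments 5 -noFilter -noNorm
  # Process every segment currently in crawl/segments
  # (require all eight columns so the "Found N items" header is skipped)
  for SEGMENT in `hadoop fs -ls crawl/segments | awk 'NF == 8 {print $8}'`
  do
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    bin/nutch invertlinks crawl/linkdb $SEGMENT
    bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
    hadoop fs -mv $SEGMENT crawl/old
  done
  # Dedup once per round rather than once per segment
  bin/nutch solrdedup http://solr:8080/solr
done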