I don't understand — you can fetch or parse segments simultaneously and update the crawldb with a list of segments (or a segment dir). What's the problem?
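The approach suggested below — generate several segments in one pass with maxNumSegments, fetch and parse each one, then update the crawldb with all of them at once — could be sketched roughly like this. This is only a sketch: the paths, the -topN value, and the segment count are illustrative, and it assumes a local-filesystem layout rather than the `hadoop fs` paths used in the original script.

```shell
#!/bin/sh
# Sketch of a multi-segment crawl loop (illustrative paths and values).

CRAWLDB=crawl/crawldb
SEGDIR=crawl/segments

# Generate up to 4 segments in one generate pass instead of one.
bin/nutch generate "$CRAWLDB" "$SEGDIR" -topN 10000 \
    -noFilter -noNorm -maxNumSegments 4

# Fetch and parse each new segment. At this stage the segments are
# independent of each other, so each iteration touches only its own data.
for SEGMENT in "$SEGDIR"/*; do
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
done

# updatedb accepts a list of segments, so one call covers all of them.
bin/nutch updatedb "$CRAWLDB" "$SEGDIR"/*
```

Since each fetch/parse pair works on its own segment, the loop body could in principle also be run in parallel (e.g. backgrounding each iteration and waiting before the single updatedb call), which is presumably the "work on them simultaneously" point raised in the thread.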
On Tuesday 03 January 2012 19:34:05 Bai Shen wrote:
> Isn't the whole point of segments to be able to work on them
> simultaneously? That's the problem I've been having with them.
>
> On Mon, Jan 2, 2012 at 7:27 AM, Markus Jelsma <[email protected]> wrote:
> > Use the maxNumSegments option on the generator to make more segments.
> > Then loop through them with bash and fetch and parse.
> >
> > On Thursday 29 December 2011 21:29:02 Bai Shen wrote:
> > > Currently, I'm using a shell script to run my nutch crawl. It seems to
> > > work okay, but it only generates one segment at a time. Does anybody
> > > have any suggestions for how to improve it, make it work with multiple
> > > segments, etc?
> > >
> > > Thanks.
> > >
> > > while true
> > > do
> > >   bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noNorm
> > >   export SEGMENT=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`
> > >   bin/nutch fetch $SEGMENT
> > >   bin/nutch parse $SEGMENT
> > >   bin/nutch updatedb crawl/crawldb $SEGMENT
> > >   bin/nutch invertlinks crawl/linkdb $SEGMENT
> > >   bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
> > >   bin/nutch solrdedup http://solr:8080/solr
> > >   hadoop fs -mv $SEGMENT crawl/old
> > > done

--
Markus Jelsma - CTO - Openindex

