I don't understand; you can fetch or parse segments simultaneously and update 
the crawldb with a list of segments (or a segments directory). What's the problem?
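To illustrate what this means in practice: in Nutch 1.x, `updatedb` accepts either individual segments or `-dir <segmentsDir>`, so fetch/parse can run per segment and the crawldb is updated once at the end. A dry-run sketch with illustrative paths — commands are printed, not executed; set `RUN=` (empty) to run them for real:

```shell
#!/usr/bin/env bash
set -e
# Dry-run by default: commands are echoed, not executed.
# Set RUN= (empty) to run the real nutch/hadoop commands.
RUN="${RUN:-echo}"

# Fetch each segment concurrently -- fetching does not touch the crawldb.
for SEGMENT in crawl/segments/*; do
  $RUN bin/nutch fetch "$SEGMENT" &
done
wait

# Parse each segment, then update the crawldb once for all of them.
for SEGMENT in crawl/segments/*; do
  $RUN bin/nutch parse "$SEGMENT"
done

# One updatedb call covering every segment under the directory.
$RUN bin/nutch updatedb crawl/crawldb -dir crawl/segments
```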

On Tuesday 03 January 2012 19:34:05 Bai Shen wrote:
> Isn't the whole point of segments to be able to work on them
> simultaneously?  That's the problem I've been having with them.
> 
> On Mon, Jan 2, 2012 at 7:27 AM, Markus Jelsma 
<[email protected]>wrote:
> > use the maxNumSegments option on the generator to make more segments. Then
> > loop through them with bash and fetch and parse.
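A sketch of the loop Markus describes, assuming Nutch 1.x where the generator accepts `-maxNumSegments` (the segment count and topN here are illustrative). It dry-runs by default, printing each command instead of executing it:

```shell
#!/usr/bin/env bash
set -e
# Dry-run by default; set RUN= (empty) to execute for real.
RUN="${RUN:-echo}"

# Generate several segments in one pass instead of one at a time.
$RUN bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -maxNumSegments 4

# Loop over every generated segment, not just the newest one.
for SEGMENT in crawl/segments/*; do
  $RUN bin/nutch fetch "$SEGMENT"
  $RUN bin/nutch parse "$SEGMENT"
  $RUN bin/nutch updatedb crawl/crawldb "$SEGMENT"
done
```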
> > 
> > > On Thursday 29 December 2011 21:29:02 Bai Shen wrote:
> > > Currently, I'm using a shell script to run my nutch crawl.  It seems to
> > > work okay, but it only generates one segment at a time.  Does anybody
> > > have any suggestions for how to improve it, make it work with multiple
> > > segments, etc?
> > > 
> > > Thanks.
> > > 
> > > 
> > > while true
> > > do
> > >   bin/nutch generate crawl/crawldb crawl/segments -topN 10000 -noFilter -noParm
> > >   export SEGMENT=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`
> > >   bin/nutch fetch $SEGMENT
> > >   bin/nutch parse $SEGMENT
> > >   bin/nutch updatedb crawl/crawldb $SEGMENT
> > >   bin/nutch invertlinks crawl/linkdb $SEGMENT
> > >   bin/nutch solrindex http://solr:8080/solr crawl/crawldb -linkdb crawl/linkdb $SEGMENT
> > >   bin/nutch solrdedup http://solr:8080/solr
> > >   hadoop fs -mv $SEGMENT crawl/old
> > > done
> > 
> > --
> > Markus Jelsma - CTO - Openindex

-- 
Markus Jelsma - CTO - Openindex
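One fragile spot in the quoted script is `hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`, which assumes the last line of the listing is the newest segment. Since segment names are generation timestamps, sorting the names makes the selection explicit. A local-filesystem illustration of the same selection logic (plain `ls` standing in for `hadoop fs -ls`; the directory names are invented):

```shell
#!/usr/bin/env bash
set -e
# Simulate a segments directory with timestamp-named segments,
# the way the generator names them (names here are invented).
mkdir -p demo/segments/20120101000000 demo/segments/20120103120000

# Sort the names so the newest segment is chosen deterministically,
# instead of depending on listing order.
SEGMENT=$(ls -d demo/segments/* | sort | tail -1)
echo "$SEGMENT"    # -> demo/segments/20120103120000
```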
