Isn't the whole point of segments to be able to work on them
simultaneously?  That's the problem I've been having with them: my
script only ever produces one at a time.

On Mon, Jan 2, 2012 at 7:27 AM, Markus Jelsma <[email protected]> wrote:

> Use the -maxNumSegments option on the generator to make more segments,
> then loop through them with bash and fetch and parse each one.
>
> On Thursday 29 December 2011 21:29:02 Bai Shen wrote:
> > Currently, I'm using a shell script to run my nutch crawl.  It seems
> > to work okay, but it only generates one segment at a time.  Does
> > anybody have any suggestions for how to improve it, make it work with
> > multiple segments, etc?
> >
> > Thanks.
> >
> >
> > while true
> > do
> >   bin/nutch generate crawl/crawldb crawl/segments -topN 10000 \
> >     -noFilter -noNorm
> >   export SEGMENT=`hadoop fs -ls crawl/segments | tail -1 | awk '{print $8}'`
> >   bin/nutch fetch $SEGMENT
> >   bin/nutch parse $SEGMENT
> >   bin/nutch updatedb crawl/crawldb $SEGMENT
> >   bin/nutch invertlinks crawl/linkdb $SEGMENT
> >   bin/nutch solrindex http://solr:8080/solr crawl/crawldb \
> >     -linkdb crawl/linkdb $SEGMENT
> >   bin/nutch solrdedup http://solr:8080/solr
> >   hadoop fs -mv $SEGMENT crawl/old
> > done
>
> --
> Markus Jelsma - CTO - Openindex
>
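
For reference, a minimal sketch of the multi-segment loop Markus
suggests, assuming Nutch 1.x generator syntax.  The -topN value, the
segment count of 5, and the crawl/segments layout are illustrative,
carried over from the script above; the tail/awk handling of the
hadoop fs -ls listing assumes its usual header line and eight-column
output.

# Generate several segments in one pass instead of one per iteration.
bin/nutch generate crawl/crawldb crawl/segments -topN 10000 \
  -maxNumSegments 5 -noFilter -noNorm

# Work through every segment the generator produced.  tail -n +2 skips
# the "Found N items" header of hadoop fs -ls; $8 is the path column,
# as in the original script.
for SEGMENT in `hadoop fs -ls crawl/segments | tail -n +2 | awk '{print $8}'`
do
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
done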
