You mean the script mentioned on the wiki page? I've never used it, but it's probably going to stop.
Maybe you are better off trying the steps manually at first as it might give you a better understanding of what's going on.

bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
bin/nutch fetch $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Please read on the wiki what these commands actually do.

On Tuesday 28 September 2010 15:51:11 Dennis wrote:
> Thanks, Markus,
> Another question, the script will stop, right? I mean, I am not going to
> crawl for 100 days, I need it finish it's job.
> Dennis
>
> --- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:
>
> From: Markus Jelsma <[email protected]>
> Subject: Re: crawl www
> To: "Dennis" <[email protected]>
> Cc: [email protected]
> Date: Tuesday, September 28, 2010, 9:16 PM
>
> Oh, you don't need to crawl-urlfilter.txt. It's being used by the crawl
> command only and if you're about to crawl the internet (!), you will need
> the steps i explained in the other e-mail. You can forget about the crawl
> command in this case.
>
> On Tuesday 28 September 2010 14:58:32 Dennis wrote:
> > Sorry for interrupting, Markus,
> >
> > But I'm not quite understand. How do I "update your DB's"?, What should I
> > do about "crawl-urlfilter.txt"? Thanks
> >
> > Dennis
> >
> > --- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:
> >
> > From: Markus Jelsma <[email protected]>
> > Subject: Re: crawl www
> > To: [email protected]
> > Date: Tuesday, September 28, 2010, 8:19 PM
> >
> > Dennis, you shouldn't hyjack my thread ;)
> >
> > Anyway. it's all about crawl, update your DB's and recrawl and keep
> > repeating the same loop over and over.
> >
> > Cheers,
> >
> > On Tuesday 28 September 2010 10:08:00 Dennis wrote:
> > > Hi, all,
> > > I want to crawl the whole www, how do I config "crawl-urlfilter.txt"? It
> > > used to be:
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > > Thanks
> > > Dennis
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
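
P.S. On the question of whether the crawl stops: if you run the manual steps yourself, it stops whenever you stop repeating them. Below is a rough sketch of one way to wrap the same commands in a loop with a fixed number of rounds. It is not the wiki script and not part of Nutch; the function name crawl_rounds and the NUTCH and CRAWL_DIR variables are made up for illustration, and the paths are assumed to match the commands above.

```shell
#!/bin/sh
# Sketch only: crawl_rounds, NUTCH and CRAWL_DIR are illustrative names,
# not something Nutch provides. Adjust the paths to your installation.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

crawl_rounds() {
    rounds="$1"      # number of generate/fetch/updatedb cycles to run
    seed_dir="$2"    # directory holding the seed URL list

    # Seed the crawldb once, before the first round.
    "$NUTCH" inject "$CRAWL_DIR/crawldb" "$seed_dir" || return 1

    i=1
    while [ "$i" -le "$rounds" ]; do
        "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" || return 1

        # Pick the most recently created segment, as in the manual steps.
        segment="$CRAWL_DIR/segments/$(ls -tr "$CRAWL_DIR/segments" | tail -1)"

        "$NUTCH" fetch "$segment" || return 1
        "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" -filter -normalize || return 1

        i=$((i + 1))
    done

    # Build the link database once all rounds are done.
    "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
}
```

For example, `crawl_rounds 3 urls` would inject the seeds, run three rounds and then finish. The loop gives up at the first step that fails, so it cannot run forever; how many rounds make sense for your crawl is something you will have to find out by trying.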

