You mean the script mentioned on the wiki page? I've never used it, but it's 
probably going to stop.

Maybe you are better off trying the steps manually at first as it might give 
you a better understanding of what's going on.


bin/nutch inject crawl/crawldb urls

bin/nutch generate crawl/crawldb crawl/segments

export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

bin/nutch fetch $SEGMENT

bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
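
If you want to repeat the generate/fetch/updatedb cycle a few times, you can 
wrap the steps above in a small shell loop. This is only a sketch; the three 
rounds and the crawl/ directory layout are assumptions, so adjust them to 
your own setup:

bin/nutch inject crawl/crawldb urls

# repeat generate -> fetch -> updatedb for a fixed number of rounds
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Unlike the crawl command, this stops after a fixed number of rounds instead 
of running to some depth you can't easily control.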


Please read on the wiki what these commands actually do.



On Tuesday 28 September 2010 15:51:11 Dennis wrote:
> Thanks, Markus,
> Another question, the script will stop, right? I mean, I am not going to
>  crawl for 100 days; I need it to finish its job.
> 
> Dennis
> 
> --- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:
> 
> From: Markus Jelsma <[email protected]>
> Subject: Re: crawl www
> To: "Dennis" <[email protected]>
> Cc: [email protected]
> Date: Tuesday, September 28, 2010, 9:16 PM
> 
> Oh, you don't need crawl-urlfilter.txt. It's used by the crawl command
> only, and if you're about to crawl the internet (!), you will need the
> steps I explained in the other e-mail. You can forget about the crawl
> command in this case.
> 
> On Tuesday 28 September 2010 14:58:32 Dennis wrote:
> > Sorry for interrupting, Markus,
> >
> > But I don't quite understand. How do I "update your DB's"? What should I
> >  do about "crawl-urlfilter.txt"? Thanks
> >
> >
> > Dennis
> >
> > --- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:
> >
> > From: Markus Jelsma <[email protected]>
> > Subject: Re: crawl www
> > To: [email protected]
> > Date: Tuesday, September 28, 2010, 8:19 PM
> >
> > Dennis, you shouldn't hijack my thread ;)
> 
> > Anyway, it's all about crawl, update your DBs, and recrawl, repeating
> >  the same loop over and over.
> >
> > Cheers,
> >
> > On Tuesday 28 September 2010 10:08:00 Dennis wrote:
> > > Hi, all,
> > > I want to crawl the whole www. How do I configure "crawl-urlfilter.txt"?
> > > It used to be:
> > >
> > > # accept hosts in MY.DOMAIN.NAME
> > > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > >
> > > Thanks
> > > Dennis
> >
> 
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
