Thanks, Markus
Dennis

--- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:

From: Markus Jelsma <[email protected]>
Subject: Re: crawl www
To: "Dennis" <[email protected]>
Cc: [email protected]
Date: Tuesday, September 28, 2010, 9:08 PM

You should read a bit; maybe this will help.

http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Crawl

In short, in Nutch you need to have a CrawlDB, a DB listing your URLs. To
start fetching URLs you need to generate a fetch list from your CrawlDB.
These are the URLs you're going to fetch in the first and subsequent cycles.
When done fetching, you can parse the fetched pages and get proper content.
Now you've got a fully parsed segment.

Later you need to update your CrawlDB and add the newly found URLs from your
parsed segment. This way your CrawlDB grows and the new URLs can be used to
generate your subsequent fetch list.
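
As a rough illustration of that loop, here is a toy Python sketch (not Nutch
code). The in-memory WEB graph, the status strings and the function bodies are
all made up for the example; only the step names follow the Nutch commands
inject, generate, fetch, parse and updatedb.

```python
# Hypothetical stand-in for the web: page URL -> outlinks found when parsing it.
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}


def generate(crawldb, top_n):
    """Build a fetch list: up to top_n URLs the CrawlDB marks as unfetched."""
    unfetched = sorted(url for url, status in crawldb.items() if status == "unfetched")
    return unfetched[:top_n]


def fetch_and_parse(fetch_list):
    """Return a 'parsed segment': each fetched URL mapped to its outlinks."""
    return {url: WEB.get(url, []) for url in fetch_list}


def updatedb(crawldb, segment):
    """Mark fetched URLs as done and add newly discovered URLs as unfetched."""
    for url, outlinks in segment.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n=10):
    """Run the generate -> fetch/parse -> updatedb cycle for a number of rounds."""
    crawldb = {url: "unfetched" for url in seeds}  # the 'inject' step
    for _ in range(rounds):
        fetch_list = generate(crawldb, top_n)
        if not fetch_list:
            break  # nothing left to fetch
        segment = fetch_and_parse(fetch_list)
        updatedb(crawldb, segment)
    return crawldb
```

Calling crawl(["http://a.example/"], rounds=1) fetches only the seed and leaves
its two outlinks unfetched; each further round picks those up and grows the
CrawlDB again.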

Finally you need to update your LinkDB (holding anchors to URLs) and index
the parsed content in Nutch 1.x or a Solr instance.
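
The LinkDB step can be pictured as inverting the links found in the segments.
A minimal sketch, assuming each parsed segment is a dict of source URL to a
list of (target URL, anchor text) pairs. That layout and the invert_links name
are hypothetical and have nothing to do with Nutch's on-disk format.

```python
from collections import defaultdict


def invert_links(segments):
    """Map each target URL to the (source URL, anchor text) pairs pointing at it."""
    linkdb = defaultdict(list)
    for segment in segments:
        for source, outlinks in segment.items():
            for target, anchor in outlinks:
                linkdb[target].append((source, anchor))
    return dict(linkdb)


# Hypothetical parsed segments from two crawl cycles.
segments = [
    {"http://a.example/": [("http://b.example/", "about B")]},
    {"http://c.example/": [("http://b.example/", "see B"),
                           ("http://a.example/", "home")]},
]
linkdb = invert_links(segments)
```

Here linkdb["http://b.example/"] holds both incoming anchors, which is the kind
of information an indexer can use alongside the page content.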



On Tuesday 28 September 2010 14:58:32 Dennis wrote:
> Sorry for interrupting, Markus,
> 
> But I don't quite understand. How do I "update your DB's"? What should I
>  do about "crawl-urlfilter.txt"? Thanks
> 
> 
> Dennis
> 
> --- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:
> 
> From: Markus Jelsma <[email protected]>
> Subject: Re: crawl www
> To: [email protected]
> Date: Tuesday, September 28, 2010, 8:19 PM
> 
> Dennis, you shouldn't hijack my thread ;)
> 
> Anyway, it's all about crawl, update your DB's and recrawl, and keep
>  repeating the same loop over and over.
> 
> Cheers,
> 
> On Tuesday 28 September 2010 10:08:00 Dennis wrote:
> > Hi, all,
> > I want to crawl the whole www, how do I config "crawl-urlfilter.txt"?
> > It used to be:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > Thanks
> > Dennis
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350




      
