Thanks, Markus

Dennis

--- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:

From: Markus Jelsma <[email protected]>
Subject: Re: crawl www
To: "Dennis" <[email protected]>
Cc: [email protected]
Date: Tuesday, September 28, 2010, 9:08 PM

You should read a bit; maybe this will help.

http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/Crawl

In short, in Nutch you need to have a CrawlDB, a DB listing your URLs. To start fetching URLs you need to generate a fetch list from your CrawlDB. These are the URLs you're going to fetch in the first and subsequent cycles. When done fetching, you can parse the fetched pages and get proper content. Now you've got a fully parsed segment.

Later you need to update your CrawlDB and add the newly found URLs in your parsed segment. This way your CrawlDB grows and new URLs can be used to generate your subsequent fetch list.

Finally you need to update your LinkDB (holding anchors to URLs) and index the parsed content in Nutch 1.x or a Solr instance.

On Tuesday 28 September 2010 14:58:32 Dennis wrote:
> Sorry for interrupting, Markus,
>
> But I don't quite understand. How do I "update your DBs"? What should I
> do about "crawl-urlfilter.txt"? Thanks
>
>
> Dennis
>
> --- On Tue, 9/28/10, Markus Jelsma <[email protected]> wrote:
>
> From: Markus Jelsma <[email protected]>
> Subject: Re: crawl www
> To: [email protected]
> Date: Tuesday, September 28, 2010, 8:19 PM
>
> Dennis, you shouldn't hijack my thread ;)
>
> Anyway, it's all about crawl, update your DBs and recrawl, and keep
> repeating the same loop over and over.
>
> Cheers,
>
> On Tuesday 28 September 2010 10:08:00 Dennis wrote:
> > Hi, all,
> > I want to crawl the whole www, how do I config "crawl-urlfilter.txt"? It
> > used to be:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> > Thanks
> > Dennis
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
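
For readers following the thread, the loop Markus describes (generate a fetch list from the CrawlDB, fetch and parse into a segment, update the CrawlDB with newly found URLs, repeat, then build the LinkDB) can be sketched as a small toy model. This is only an illustration of the cycle under stated assumptions, not Nutch's API: TOY_WEB, the example.* URLs, and the functions inject, generate, fetch_and_parse, update_db, invert_links and crawl are all hypothetical names made up for this sketch, and nothing is actually fetched over the network.

```python
# Toy model of the crawl cycle described above. Not Nutch code: all names
# and URLs here are hypothetical and exist only for illustration.

# Pretend "web": page URL -> outlinks that parsing the page would find.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}


def inject(crawldb, seeds):
    """Add seed URLs to the CrawlDB as not-yet-fetched entries."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Build a fetch list: up to top_n URLs that are still unfetched."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch_and_parse(fetch_list):
    """Simulate fetching and parsing; the result plays the role of a segment."""
    return {url: TOY_WEB.get(url, []) for url in fetch_list}


def update_db(crawldb, segment):
    """Mark fetched URLs and add newly discovered URLs to the CrawlDB."""
    for url, outlinks in segment.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def invert_links(linkdb, segment):
    """Record, for each target URL, which pages link to it."""
    for url, outlinks in segment.items():
        for link in outlinks:
            linkdb.setdefault(link, set()).add(url)


def crawl(seeds, rounds, top_n=10):
    """Run generate -> fetch/parse -> update for a number of rounds."""
    crawldb, linkdb, segments = {}, {}, []
    inject(crawldb, seeds)
    for _ in range(rounds):
        fetch_list = generate(crawldb, top_n)
        if not fetch_list:
            break  # nothing left to fetch
        segment = fetch_and_parse(fetch_list)
        update_db(crawldb, segment)
        segments.append(segment)
    # Build the LinkDB from all segments once the crawl rounds are done.
    for segment in segments:
        invert_links(linkdb, segment)
    return crawldb, linkdb, segments


if __name__ == "__main__":
    crawldb, linkdb, segments = crawl(["http://a.example/"], rounds=3)
    for url in sorted(crawldb):
        print(url, crawldb[url], sorted(linkdb.get(url, ())))
```

Each round only fetches what the previous rounds discovered, which is why the CrawlDB has to be updated before the next fetch list is generated. A real crawl would also apply the URL filter rules (the crawl-urlfilter.txt question above) when new URLs are added; that step is left out of this sketch.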

