> However, how is topN determined?

It's just the top N unfetched pages sorted by decreasing score. Pages
will be re-fetched only after some larger amount of time, 30 days per
default, see property db.fetch.interval.default.
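The selection rule described above ("top N unfetched pages sorted by decreasing score", with re-fetch only after the fetch interval elapses) can be sketched in a few lines. This is an illustrative model, not Nutch's actual generator code; the field names and the `select_top_n` helper are invented for the example, and only the 30-day default mirrors `db.fetch.interval.default`:

```python
# Illustrative sketch of generator topN selection (not Nutch's real API).
# A page is eligible if it was never fetched, or if its fetch interval
# (default 30 days, cf. db.fetch.interval.default) has elapsed.
FETCH_INTERVAL = 30 * 24 * 3600  # seconds

def select_top_n(pages, top_n, now):
    """pages: dicts with 'url', 'score', 'last_fetch' (None if unfetched).
    Returns the URLs of the top_n highest-scoring eligible pages."""
    due = [p for p in pages
           if p["last_fetch"] is None
           or now - p["last_fetch"] >= FETCH_INTERVAL]
    due.sort(key=lambda p: p["score"], reverse=True)
    return [p["url"] for p in due[:top_n]]
```

Note that a recently fetched page is skipped no matter how high its score, which is why high-scoring hub pages do not permanently crowd out new, unfetched inner pages.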
> If I am crawling inside a domain, there will be links from almost every
> inner page to the menu items. Wouldn't that increase the score of the
> menu/navigation items ?

Yes. And that's what you expect. These pages are hubs containing many
outlinks. So you want to re-fetch them first to detect links to new pages.

>> How do I limit nutch to crawl only certain domains ?

You did it right. But you need time to get all pages fetched.

Sebastian

On 08/12/2012 06:29 PM, Sourajit Basak wrote:
> I proceeded like this ..
>
> 1. inject the urls
> 2. run generate
> 3. run fetch
> 4. run parse
> 5. run generate with topN 1000
> .. repeat 3 & 4
> ...
> 6. run generate with topN 1000
>
> This seems to be fetching the inner pages. However, how is topN
> determined ? If I am crawling inside a domain, there will be links from
> almost every inner page to the menu items. Wouldn't that increase the
> score of the menu/navigation items ?
>
> On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak
> <[email protected]> wrote:
>
>> How do I limit nutch to crawl only certain domains ?
>>
>> For e.g. let's say I have 2 domains. I put the following in a text file
>> and inject the crawldb:
>>
>> http://www.domain1.com
>> http://name.domain2.com
>>
>> Now, I wish to crawl all pages only in the above 2 domains.
>>
>> To do that, I added these to the regex filter (config file):
>>
>> +^http://www\.domain1\.com
>> +^http://name\.domain2\.com
>>
>> However, it seems to crawl only the (home) top-most page of the above
>> domains. How do I visit all inner pages ?
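One point worth making explicit: the regex-urlfilter rules quoted above do not restrict the crawl to the home pages, since any URL that starts with one of those prefixes is accepted, inner pages included. A minimal sketch of that matching behavior, assuming the same two `+` (include) rules; Nutch's regex-urlfilter plugin uses Java regexes, but anchoring works the same for these patterns, and the `accepted` helper is invented for the example:

```python
import re

# The two include rules from the message above, as Python patterns.
# (Illustrative model of regex-urlfilter '+' rules, not Nutch code.)
rules = [
    re.compile(r"^http://www\.domain1\.com"),
    re.compile(r"^http://name\.domain2\.com"),
]

def accepted(url):
    """Keep the URL if any include pattern matches its prefix."""
    return any(r.match(url) for r in rules)
```

So the filter is not what limits the first run to the home pages; as the answer above says, inner links only enter the crawldb after repeated generate/fetch/parse (and updatedb) rounds, so the crawl simply needs more cycles.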

