Do I need to repeat this iteration several times to crawl all the domains satisfactorily?
These domains may not link to each other; I only want to group related websites together. So, if I assume that each domain has at most 100 links per page, and I have 5 domains, do I need to set topN = 5 * 100 during each 'generate' phase?

On Sun, Aug 12, 2012 at 10:27 PM, Sebastian Nagel <[email protected]> wrote:

> > However, how is topN determined?
> It's just the top N unfetched pages sorted by decreasing score.
> Pages will be re-fetched only after some larger amount of time,
> 30 days per default, see property db.fetch.interval.default.
>
> > If I am crawling inside a domain, there will be links from almost every
> > inner pages to the menu items. Wouldn't that increase the score of the
> > menu/navigation items ?
> Yes. And that's what you expect. These pages are hubs containing many
> outlinks. So you want to re-fetch them first to detect links to new pages.
>
> >> How do I limit nutch to crawl only certain domains ?
> You did it right. But you need time to get all pages fetched.
>
> Sebastian
>
> On 08/12/2012 06:29 PM, Sourajit Basak wrote:
> > I proceeded like this ..
> >
> > 1. inject the urls
> > 2. run generate
> > 3. run fetch
> > 4. run parse
> > 5. run generate with topN 1000
> > .. repeat 3 & 4
> > ...
> > 6. run generate with topN 1000
> >
> > This seems to be fetching the inner pages. However, how is topN determined ?
> > If I am crawling inside a domain, there will be links from almost every
> > inner pages to the menu items. Wouldn't that increase the score of the
> > menu/navigation items ?
> >
> > On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak <[email protected]> wrote:
> >
> >> How do I limit nutch to crawl only certain domains ?
> >>
> >> For e.g. lets say, I have 2 domains. I put the following in a text file
> >> and inject the crawldb
> >>
> >> http://www.domain1.com
> >> http://name.domain2.com
> >>
> >> Now, I wish to crawl all pages only in the above 2 domains.
> >>
> >> To do that, I added these to the regex filter (config file)
> >>
> >> +^http://www\.domain1\.com
> >> +^http://name\.domain2\.com
> >>
> >> However, it seems to crawl only the (home) top most page of the above
> >> domains only. How do I visit all inner pages ?
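
For what it's worth, here is a toy Python sketch of the selection rule Sebastian describes above ("the top N unfetched pages sorted by decreasing score"). It is not Nutch code: the names `make_crawldb`, `generate`, and `rounds_to_fetch_all`, the `example` URLs, and the scores are all made up for illustration. It also assumes every URL is already known up front, whereas in a real crawl new URLs only show up as pages are fetched and parsed, so a real crawl would likely need more rounds than this model suggests.

```python
def make_crawldb(domains, pages_per_domain):
    """Build a toy crawl database (hypothetical structure, not Nutch's)."""
    crawldb = {}
    for d in range(domains):
        for p in range(pages_per_domain):
            url = f"http://site{d}.example/page{p}"
            # Made-up score: earlier pages score higher.
            crawldb[url] = {"score": 1.0 / (p + 1), "fetched": False}
    return crawldb


def generate(crawldb, top_n):
    """Return at most top_n unfetched URLs, highest score first."""
    unfetched = [(url, rec["score"]) for url, rec in crawldb.items()
                 if not rec["fetched"]]
    # Sort by decreasing score; break ties by URL so the order is stable.
    unfetched.sort(key=lambda item: (-item[1], item[0]))
    return [url for url, _ in unfetched[:top_n]]


def rounds_to_fetch_all(crawldb, top_n):
    """Count generate/fetch rounds until nothing unfetched remains."""
    rounds = 0
    while True:
        batch = generate(crawldb, top_n)
        if not batch:
            return rounds
        for url in batch:
            crawldb[url]["fetched"] = True  # stand-in for fetch + parse
        rounds += 1


# 5 domains with 100 known pages each, tried with different topN values.
for top_n in (100, 150, 500, 1000):
    db = make_crawldb(5, 100)
    print(top_n, rounds_to_fetch_all(db, top_n))
```

In this model topN only caps how many pages go into one round. A smaller value means more rounds, and a value larger than the number of unfetched pages changes nothing.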

