I proceeded like this:

1. inject the URLs
2. run generate
3. run fetch
4. run parse
5. run generate with topN 1000, then repeat steps 3 and 4
6. run generate with topN 1000

This seems to be fetching the inner pages. However, how is topN determined? If I am crawling inside a domain, almost every inner page will link to the menu items. Wouldn't that increase the score of the menu/navigation items?

On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak <[email protected]> wrote:

> How do I limit Nutch to crawl only certain domains?
>
> For example, say I have 2 domains. I put the following in a text file
> and inject it into the crawldb:
>
> http://www.domain1.com
> http://name.domain2.com
>
> Now, I wish to crawl all pages in only the above 2 domains.
>
> To do that, I added these to the regex filter (config file):
>
> +^http://www\.domain1\.com
> +^http://name\.domain2\.com
>
> However, it seems to crawl only the topmost (home) page of the above
> domains. How do I visit all inner pages?
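
For reference, here is a rough Python sketch of the two mechanisms this thread touches on. It is not Nutch code, only an illustration under two assumptions: that the regex URL filter lets the first matching +/- rule decide and drops URLs that match no rule, and that generate with topN keeps the N highest-scoring candidate URLs. The function names (accepts, select_top_n), the sample URLs, and the scores are all made up for the example.

```python
import re

# Hypothetical rule list mirroring the two "+" lines from the regex filter
# in the quoted message. Each entry is (sign, compiled pattern).
RULES = [
    ("+", re.compile(r"^http://www\.domain1\.com")),
    ("+", re.compile(r"^http://name\.domain2\.com")),
]


def accepts(url, rules=RULES):
    """Return True if the URL passes the filter.

    Assumption: the first rule whose pattern is found in the URL decides
    ("+" accepts, "-" rejects); a URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


def select_top_n(scored_urls, top_n):
    """Drop filtered-out URLs, then keep the top_n highest-scoring ones."""
    candidates = [(url, score) for url, score in scored_urls if accepts(url)]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in candidates[:top_n]]


if __name__ == "__main__":
    # Made-up scores; in a real crawl these would come from the crawldb.
    sample = [
        ("http://www.domain1.com/", 1.0),
        ("http://www.domain1.com/menu", 2.5),
        ("http://other.example/", 9.0),
        ("http://name.domain2.com/a", 0.3),
    ]
    print(select_top_n(sample, 2))
```

In this toy model, a heavily linked page such as a menu item would simply sort earlier in the list if its score is higher; the off-domain URL never becomes a candidate, however high its score, because no "+" rule matches it.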

