Re: limit nutch to all pages within a certain domain

Sourajit Basak Sun, 12 Aug 2012 09:29:32 -0700

I proceeded like this:

1. inject the urls
2. run generate
3. run fetch
4. run parse
5. run generate with topN 1000, then repeat steps 3 and 4
6. run generate with topN 1000 again, and so on
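
For reference, the cycle above could be written out roughly as follows. This is only a dry-run sketch assuming the Nutch 1.x command-line layout; the paths crawl/crawldb, crawl/segments and urls, and the <segment> placeholder, are made up for illustration, and nothing is executed. The updatedb step does not appear in the list above; it is included here on the assumption that the crawldb has to be updated from each segment before the next generate can see newly discovered links.

```python
# Dry-run sketch: builds the command lines for the cycle, runs nothing.
# Assumes Nutch 1.x; all paths and "<segment>" are illustrative placeholders.

def crawl_commands(rounds, top_n=1000, crawldb="crawl/crawldb",
                   segments="crawl/segments", seeds="urls"):
    nutch = "bin/nutch"
    segment = "<segment>"  # stands for the newest directory under `segments`
    cmds = [[nutch, "inject", crawldb, seeds]]
    for i in range(rounds):
        generate = [nutch, "generate", crawldb, segments]
        if i > 0:
            # the first generate in the list above is run without topN
            generate += ["-topN", str(top_n)]
        cmds.append(generate)
        cmds.append([nutch, "fetch", segment])
        cmds.append([nutch, "parse", segment])
        # assumption: merge the fetched and parsed results back into the crawldb
        cmds.append([nutch, "updatedb", crawldb, segment])
    return cmds


if __name__ == "__main__":
    for cmd in crawl_commands(rounds=2):
        print(" ".join(cmd))
```

In a real run, <segment> would be replaced by the directory that the preceding generate created.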

This seems to be fetching the inner pages. However, how is topN determined?
If I am crawling inside a domain, there will be links from almost every
inner page to the menu items. Wouldn't that increase the score of the
menu/navigation items?
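
To make the question concrete, here is a toy model of the selection step. It is not Nutch's code: it assumes that generate takes the topN highest-scoring URLs that are currently due for fetching, and it treats "already fetched" as "not due", which is a simplification. The URLs and scores are invented. Under those assumptions, a heavily linked menu page does rank first, but only until it has been fetched; after that it drops out of the candidates and the inner pages get their turn.

```python
# Toy model only; names, URLs and scores are illustrative, not Nutch internals.

def generate(crawldb, top_n):
    """Return up to top_n URLs that are due for fetch, highest score first."""
    due = [(url, entry["score"]) for url, entry in crawldb.items()
           if not entry["fetched"]]
    due.sort(key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in due[:top_n]]


crawldb = {
    "http://www.domain1.com/":       {"score": 1.0, "fetched": True},
    "http://www.domain1.com/menu":   {"score": 5.0, "fetched": False},
    "http://www.domain1.com/page-a": {"score": 0.4, "fetched": False},
    "http://www.domain1.com/page-b": {"score": 0.3, "fetched": False},
}

first = generate(crawldb, top_n=2)    # the menu page ranks first
for url in first:
    crawldb[url]["fetched"] = True    # pretend the fetch succeeded
second = generate(crawldb, top_n=2)   # only the remaining inner page is due
```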

On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak <[email protected]> wrote:

> How do I limit nutch to crawl only certain domains?
>
> For example, let's say I have 2 domains. I put the following in a text file
> and inject it into the crawldb:
>
> http://www.domain1.com
> http://name.domain2.com
>
> Now, I wish to crawl all pages only in the above 2 domains.
>
> To do that, I added these to the regex filter (config file)
>
> +^http://www\.domain1\.com
> +^http://name\.domain2\.com
>
> However, it seems to crawl only the top-most (home) page of each of the
> above domains. How do I visit all inner pages?
