> However, how is topN determined?
It's simply the top N unfetched pages, sorted by decreasing score.
Pages are re-fetched only after a longer interval, 30 days by default;
see the property db.fetch.interval.default.
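For reference, that interval can be overridden in nutch-site.xml. A sketch of such an override (the value is in seconds; 2592000 s = 30 days, which matches the default):

```xml
<!-- nutch-site.xml: override the default re-fetch interval -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- seconds; 2592000 s = 30 days -->
  <value>2592000</value>
</property>
```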

> If I am crawling inside a domain, there will be links from almost every
> inner page to the menu items. Wouldn't that increase the score of the
> menu/navigation items?
Yes, and that's what you want: these pages are hubs containing many
outlinks, so you want to re-fetch them first to discover links to new pages.
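To make the effect concrete, here is a toy sketch (not Nutch's actual scoring code; Nutch uses an OPIC-style scoring plugin, and all names below are hypothetical) of why a navigation page linked from many inner pages floats to the top of a generate step that picks the top N unfetched pages by decreasing score:

```python
# Toy sketch, NOT Nutch code: each inlink contributes a little score,
# so the menu page linked from every inner page ends up ranked first
# when the next "generate" picks the top-N unfetched pages.

def select_top_n(crawldb, n):
    """Pick the n highest-scoring pages that are not yet fetched."""
    unfetched = [p for p in crawldb if not p["fetched"]]
    unfetched.sort(key=lambda p: p["score"], reverse=True)
    return unfetched[:n]

# Hypothetical crawldb entries and inlink counts for illustration.
crawldb = [
    {"url": "http://example.com/menu",   "score": 0.0, "fetched": False},
    {"url": "http://example.com/page-a", "score": 0.0, "fetched": False},
    {"url": "http://example.com/page-b", "score": 0.0, "fetched": False},
]
inlink_counts = {
    "http://example.com/menu": 50,   # linked from every inner page
    "http://example.com/page-a": 2,
    "http://example.com/page-b": 1,
}

# Every inlink adds a small score contribution.
for page in crawldb:
    page["score"] += 0.01 * inlink_counts[page["url"]]

top = select_top_n(crawldb, 2)
print([p["url"] for p in top])
# → ['http://example.com/menu', 'http://example.com/page-a']
```

The heavily linked menu page wins the top slot, which is exactly the hub behavior described above.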

>> How do I limit nutch to crawl only certain domains ?
You did it right, but you need to give the crawl time to fetch all pages.
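For reference, a regex-urlfilter.txt along the lines described below would look roughly like this. Note that the patterns carry no end anchor, so they already match inner pages such as http://www.domain1.com/some/page; rules are applied in order, and a final reject-all line is typical:

```
# regex-urlfilter.txt (sketch)
# accept anything under the two hosts
+^http://www\.domain1\.com
+^http://name\.domain2\.com
# reject everything else
-.
```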

Sebastian

On 08/12/2012 06:29 PM, Sourajit Basak wrote:
> I proceeded like this ..
> 
> 1. inject the urls
> 2. run generate
> 3. run fetch
> 4. run parse
> 5. run generate with topN 1000
> .. repeat 3 & 4
> ...
> 6. run generate with topN 1000
> 
> This seems to be fetching the inner pages. However, how is topN
> determined? If I am crawling inside a domain, there will be links from
> almost every inner page to the menu items. Wouldn't that increase the
> score of the menu/navigation items?
> 
> On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak 
> <[email protected]>wrote:
> 
>> How do I limit nutch to crawl only certain domains ?
>>
>> For example, let's say I have 2 domains. I put the following in a text
>> file and inject it into the crawldb:
>>
>> http://www.domain1.com
>> http://name.domain2.com
>>
>> Now, I wish to crawl all pages only in the above 2 domains.
>>
>> To do that, I added these to the regex filter (config file)
>>
>> +^http://www\.domain1\.com
>> +^http://name\.domain2\.com
>>
>> However, it seems to crawl only the topmost (home) page of each of the
>> above domains. How do I visit all the inner pages?
