On 08/12/2012 07:14 PM, Sourajit Basak wrote:
> Do I need to carry this iteration several times to crawl all the domains
> satisfactorily ?
Yes, you have to loop over generate-fetch-update cycles. In trunk there is
a script src/bin/crawl which does this.
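The loop can be sketched roughly as below. This is a minimal sketch of the generate-fetch-parse-updatedb cycle that src/bin/crawl automates; the paths, DEPTH, and TOPN values are illustrative assumptions, and NUTCH defaults to `echo` here so the sketch only prints the commands (point it at your real bin/nutch to actually crawl):

```shell
# Dry-run sketch of the Nutch crawl cycle. NUTCH defaults to `echo`
# so running this just prints the commands; set NUTCH=bin/nutch to crawl.
NUTCH=${NUTCH:-echo bin/nutch}
CRAWLDB=crawl/crawldb      # assumed crawldb location
SEGDIR=crawl/segments      # assumed segments directory
DEPTH=3                    # number of generate-fetch-update cycles
TOPN=500                   # top-scored unfetched pages per cycle

for round in $(seq 1 "$DEPTH"); do
  $NUTCH generate "$CRAWLDB" "$SEGDIR" -topN "$TOPN"
  # the newest directory under $SEGDIR is the segment just generated
  segment="$SEGDIR/$(ls "$SEGDIR" 2>/dev/null | sort | tail -n 1)"
  $NUTCH fetch "$segment"
  $NUTCH parse "$segment"
  $NUTCH updatedb "$CRAWLDB" "$segment"
done
```

The real script also handles invertlinks and indexing steps, which are omitted here.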

> These domains may not have links among themselves. This is just to group
> related websites together. So, if I assume, on average each domain has
> (max) 100 links per page, and I have 5 domains; I need to set topN = 5 *
> 100 during each 'generate' phase ?
For large sites you can take a larger topN, because the frontier grows
exponentially: the 100 pages fetched in the second cycle theoretically carry
10,000 outlinks. In practice many link targets are shared, so you'll get far
fewer new outlinks.
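The back-of-envelope arithmetic for the numbers in this thread (5 seed domains, roughly 100 outlinks per page, ignoring shared targets) looks like this; real crawls yield far fewer candidates because many outlinks repeat:

```shell
# Theoretical frontier growth per cycle, ignoring duplicate link targets.
SEEDS=5
LINKS_PER_PAGE=100

round1=$SEEDS                          # injected seed pages
round2=$((round1 * LINKS_PER_PAGE))    # candidates after cycle 1
round3=$((round2 * LINKS_PER_PAGE))    # theoretical upper bound after cycle 2
echo "cycle 1: $round1 pages, cycle 2: up to $round2, cycle 3: up to $round3"
```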

> 
> On Sun, Aug 12, 2012 at 10:27 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> 
>>> However, how is topN determined?
>> It's just the top N unfetched pages sorted by decreasing score.
>> Pages will be re-fetched only after a longer interval, 30 days by
>> default; see the property db.fetch.interval.default.
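(That interval is configurable; a sketch of an override in conf/nutch-site.xml, assuming the value is given in seconds, 30 days = 2592000 s:

```xml
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
</property>
```

Lower it only if your sites actually change that often.)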
>>
>>> If I am crawling inside a domain, there will be links from almost every
>>> inner page to the menu items. Wouldn't that increase the score of the
>>> menu/navigation items?
>> Yes, and that's the desired behavior. These pages are hubs containing many
>> outlinks, so you want to re-fetch them first to discover links to new pages.
>>
>>>> How do I limit Nutch to crawl only certain domains?
>> You did it right. But you need time to get all pages fetched.
>>
>> Sebastian
>>
>> On 08/12/2012 06:29 PM, Sourajit Basak wrote:
>>> I proceeded like this ..
>>>
>>> 1. inject the urls
>>> 2. run generate
>>> 3. run fetch
>>> 4. run parse
>>> 5. run generate with topN 1000
>>> .. repeat 3 & 4
>>> ...
>>> 6. run generate with topN 1000
>>>
>>> This seems to be fetching the inner pages. However, how is topN
>>> determined? If I am crawling inside a domain, there will be links from
>>> almost every inner page to the menu items. Wouldn't that increase the
>>> score of the menu/navigation items?
>>>
>>> On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak <sourajit.ba...@gmail.com> wrote:
>>>
>>>> How do I limit Nutch to crawl only certain domains?
>>>>
>>>> For example, let's say I have 2 domains. I put the following in a text
>>>> file and inject them into the crawldb:
>>>>
>>>> http://www.domain1.com
>>>> http://name.domain2.com
>>>>
>>>> Now, I wish to crawl all pages only in the above 2 domains.
>>>>
>>>> To do that, I added these to the regex filter (config file)
>>>>
>>>> +^http://www\.domain1\.com
>>>> +^http://name\.domain2\.com
>>>>
>>>> However, it seems to crawl only the topmost (home) page of each of
>>>> these domains. How do I visit all the inner pages?
>>>>
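(A note on the regex filter above: in the stock conf/regex-urlfilter.txt, rules are applied top to bottom and the first match decides, and the default file ends with a catch-all `+.` that accepts everything else, which would let the crawl escape the two domains. A sketch of a filter restricted to only these domains, where the rule order is the point:

```
# accept the two domains
+^http://www\.domain1\.com
+^http://name\.domain2\.com
# reject everything else (replaces the default catch-all +.)
-.
```

A URL matching no rule is also rejected, so the final `-.` mainly makes the intent explicit.)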
>>>
>>
>>
> 
