Re: limit nutch to all pages within a certain domain

Sourajit Basak Sun, 12 Aug 2012 10:14:29 -0700

Do I need to repeat this iteration several times to crawl all the domains
satisfactorily?


These domains may not link to one another; the list just groups related
websites together. So, if I assume each domain has at most 100 links per
page, and I have 5 domains, do I need to set topN = 5 * 100 during each
'generate' phase?
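
To make the question concrete, here is a rough Python sketch of the generate/fetch cycle as a toy model. It is not Nutch code: the crawldb dict, the `links` graph and the "+1 per inlink" scoring are all assumptions made up for illustration. It only shows that topN caps the pages fetched per round, while pages deeper in the link graph need further rounds however large topN is.

```python
def generate(crawldb, top_n):
    """Return up to top_n unfetched URLs, highest score first (toy model)."""
    unfetched = [url for url, rec in crawldb.items() if not rec["fetched"]]
    unfetched.sort(key=lambda url: crawldb[url]["score"], reverse=True)
    return unfetched[:top_n]


def crawl(seeds, links, top_n, max_rounds=50):
    """Repeat generate -> fetch -> update until nothing is left unfetched.

    `links` is a hypothetical dict mapping each URL to its outlinks.
    Returns (number of rounds that fetched something, final crawldb).
    """
    crawldb = {url: {"score": 1.0, "fetched": False} for url in seeds}
    rounds = 0
    while rounds < max_rounds:
        segment = generate(crawldb, top_n)
        if not segment:
            break  # every known page has been fetched
        for url in segment:
            crawldb[url]["fetched"] = True
            for out in links.get(url, []):
                # Newly discovered pages enter the crawldb as unfetched.
                rec = crawldb.setdefault(out, {"score": 0.0, "fetched": False})
                rec["score"] += 1.0  # crude stand-in for link-based scoring
        rounds += 1
    return rounds, crawldb


# Hypothetical site: a links to b and c, which both link to d.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
rounds_small, _ = crawl(["a"], links, top_n=1)
rounds_large, db = crawl(["a"], links, top_n=1000)
print(rounds_small, rounds_large)
```

In this toy graph the crawl takes 4 rounds with top_n=1 and 3 rounds with any top_n of 2 or more, because page d is two links away from the seed.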

On Sun, Aug 12, 2012 at 10:27 PM, Sebastian Nagel <
[email protected]> wrote:

> > However, how is topN determined?
> It's just the top N unfetched pages sorted by decreasing score.
> Pages will be re-fetched only after some larger amount of time,
> 30 days by default, see property db.fetch.interval.default.
>
> > If I am crawling inside a domain, there will be links from almost every
> > inner page to the menu items. Wouldn't that increase the score of the
> > menu/navigation items?
> Yes. And that's what you expect. These pages are hubs containing many
> outlinks. So you want to re-fetch them first to detect links to new pages.
>
> >> How do I limit nutch to crawl only certain domains ?
> You did it right. But you need time to get all pages fetched.
>
> Sebastian
>
> On 08/12/2012 06:29 PM, Sourajit Basak wrote:
> > I proceeded like this ..
> >
> > 1. inject the urls
> > 2. run generate
> > 3. run fetch
> > 4. run parse
> > 5. run generate with topN 1000
> > .. repeat 3 & 4
> > ...
> > 6. run generate with topN 1000
> >
> > This seems to be fetching the inner pages. However, how is topN
> > determined? If I am crawling inside a domain, there will be links from
> > almost every inner page to the menu items. Wouldn't that increase the
> > score of the menu/navigation items?
> >
> > On Sun, Aug 12, 2012 at 9:25 PM, Sourajit Basak <
> [email protected]>wrote:
> >
> >> How do I limit nutch to crawl only certain domains ?
> >>
> >> For example, let's say I have 2 domains. I put the following in a
> >> text file and inject it into the crawldb
> >>
> >> http://www.domain1.com
> >> http://name.domain2.com
> >>
> >> Now, I wish to crawl all pages only in the above 2 domains.
> >>
> >> To do that, I added these to the regex filter (config file)
> >>
> >> +^http://www\.domain1\.com
> >> +^http://name\.domain2\.com
> >>
> >> However, it seems to crawl only the topmost (home) page of each of
> >> the above domains. How do I visit all inner pages?
> >>
> >>
> >>
> >>
> >>
> >
>
>
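
For reference, the two filter rules quoted above can be tried out with a small Python sketch. It mimics how I understand the regex URL filter to behave: rules are checked top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL matching no rule is dropped. The helper names and the closing "-." rule are my own assumptions for illustration, not something taken from this thread.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# The two rules from the original question, plus an assumed catch-all reject.
RULES = load_rules(r"""
+^http://www\.domain1\.com
+^http://name\.domain2\.com
-.
""")

for url in ("http://www.domain1.com/about.html",
            "http://name.domain2.com/",
            "http://other.example.org/"):
    print(url, accepts(RULES, url))
```

Inner pages of both domains pass these rules, which fits the reply above: the filter is fine, and the inner pages only show up after further generate/fetch rounds.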
