RE: Crawling a specific site only

Markus Jelsma Wed, 18 Dec 2013 01:39:22 -0800
Increase it to a reasonably high value, or don't set it at all; it will then
attempt to crawl as much as it can. Also check generate.count.mode and
generate.max.count.
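
As a rough sketch, the two generator properties could be overridden in
conf/nutch-site.xml along these lines. The file location and the values shown
are assumptions (they are what nutch-default.xml is understood to ship with:
no per-host cap, counted per host), so verify them against your own Nutch
version before relying on them.

```xml
<!-- Sketch only: goes inside the <configuration> element of conf/nutch-site.xml. -->
<!-- Values are assumed defaults, not tuned recommendations. -->
<property>
  <name>generate.max.count</name>
  <!-- -1 is assumed to mean "no limit" on URLs per host/domain in one fetch list -->
  <value>-1</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- how URLs are grouped when generate.max.count is applied: host or domain -->
  <value>host</value>
</property>
```

If generate.max.count is set to a small positive number, it can hold back URLs
from a single site even when topN is large, which is why both are worth
checking for a one-site crawl.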
 
 
-----Original message-----
> From:Vangelis karv <[email protected]>
> Sent: Wednesday 18th December 2013 9:56
> To: [email protected]
> Subject: RE: Crawling a specific site only
> 
> Can you be a little more specific about that, Tejas?
> 
> > Date: Tue, 17 Dec 2013 23:32:46 -0800
> > Subject: Re: Crawling a specific site only
> > From: [email protected]
> > To: [email protected]
> > 
> > You should bump the value of topN instead of setting it to 2000. That
> > would make a lot more of the URLs eligible for fetching.
> > 
> > Thanks,
> > Tejas
> > 
> > 
> > On Tue, Dec 17, 2013 at 3:02 AM, Vangelis karv
> > <[email protected]> wrote:
> > 
> > > Markus and Wang, thank you very much for your fast responses. I forgot to
> > > mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and
> > > ignore.external.links ideas are awesome! What really bothers me is that
> > > dreaded "-topN". I really want to live without it! :) I hate it when I
> > > open my database and see that I have, for example, 2000 links unfetched,
> > > which means they are not parsed (and so useless), and only 2000 fetched.
> > >
> > > > Subject: Re: Crawling a specific site only
> > > > From: [email protected]
> > > > To: [email protected]
> > > > Date: Tue, 17 Dec 2013 18:53:55 +0800
> > > >
> > > > Hi,
> > > > Just set
> > > >         <property>
> > > >           <name>db.ignore.external.links</name>
> > > >           <value>true</value>
> > > >         </property>
> > > > and run the crawl script several times; the default number of pages to
> > > > be added is 50,000.
> > > >
> > > > Is it right?
> > > > Wang
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Vangelis karv <[email protected]>
> > > > Reply-to: [email protected]
> > > > To: [email protected] <[email protected]>
> > > > Subject: Crawling a specific site only
> > > > Date: Tue, 17 Dec 2013 12:15:00 +0200
> > > >
> > > > Hi again! My goal is to crawl a specific site. I want to crawl all the
> > > > links that exist under that site. For example, if I decide to crawl
> > > > http://www.uefa.com/, I want to parse all of its links (photos, videos,
> > > > HTML pages, etc.) and not only the best-scoring URLs for this site
> > > > (topN). So my question is: how can we tell Nutch to crawl everything in
> > > > a site and not only the pages that have the best score?
> > > >
> > > >
> > > >
> > >
> > >
>