Markus and Wang, thank you very much for your fast responses. I forgot to mention that I use Nutch 2.2.1 and MySQL. Both the DomainFilter and ignore.external.links ideas are awesome! What really bothers me is that dreaded "-topN". I really want to live without it! :) I hate it when I open my database and see that, for example, 2000 links are unfetched, which means they are never parsed and therefore useless, while only 2000 are fetched.
> Subject: Re: Crawling a specific site only
> From: [email protected]
> To: [email protected]
> Date: Tue, 17 Dec 2013 18:53:55 +0800
>
> HI
> Just set
> <name>db.ignore.external.links</name>
> <value>true</value>
> and run crawl script for several times, the default number of pages to
> be added is 50,000.
>
> Is it right?
> Wang
>
>
> -----Original Message-----
> From: Vangelis karv <[email protected]>
> Reply-to: [email protected]
> To: [email protected] <[email protected]>
> Subject: Crawling a specific site only
> Date: Tue, 17 Dec 2013 12:15:00 +0200
>
> Hi again! My goal is to crawl a specific site. I want to crawl all the links
> that exist under that site. For example, if I decide to crawl
> http://www.uefa.com/, I want to parse all its inlinks (photos, videos, HTML
> pages, etc.) and not only the best-scoring URLs for this site = topN. So, my
> question here is: how can we tell Nutch to crawl everything in a site and not
> only the pages that have the best score?
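
For reference, the property Wang suggests would be set in conf/nutch-site.xml. This is a sketch: the property name comes from the reply above, while the description text is my own wording of its effect.

```xml
<!-- conf/nutch-site.xml: keep the crawl inside the seed site(s) -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to a different host than the
  page they were found on are ignored, so the crawl never leaves the
  seed site.</description>
</property>
```

As for living without "-topN": the generate step only limits the number of URLs per fetch round when -topN is passed, so running the generate/fetch/parse cycle without that option (or with a value larger than the number of unfetched URLs) should eventually select everything in the WebTable. Exact script options vary between Nutch versions, so check bin/crawl and bin/nutch generate for your 2.2.1 install.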

