db.ignore.external.links will not necessarily keep you within one domain when it comes to redirections. Try the domain filter or the URL filter instead.
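For example, with the regex URL filter (urlfilter-regex, enabled in the stock plugin.includes) the catch-all +. rule at the end of conf/regex-urlfilter.txt can be replaced with accept rules for the two seed domains only. Treat the snippet below as a sketch against the Nutch 1.x defaults; adjust file names and patterns to your install:

    # conf/regex-urlfilter.txt (sketch)
    # skip file:, ftp: and mailto: URLs, as in the default file
    -^(file|ftp|mailto):
    # accept only the two seed domains and their subdomains
    +^https?://([a-z0-9-]+\.)*mysite\.com/
    +^https?://([a-z0-9-]+\.)*mysite2\.com/
    # reject everything else (the default catch-all +. line is removed)
    -.

The domain filter is simpler still: enable urlfilter-domain in plugin.includes and list the allowed domains, one per line, in conf/domain-urlfilter.txt (again a sketch; the plugin and file name are the Nutch 1.x defaults):

    # conf/domain-urlfilter.txt (sketch)
    mysite.com
    mysite2.com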
On Wednesday 06 July 2011 17:32:53 lewis john mcgibbney wrote:
> Hi C.B.,
>
> To fetch all the pages in your two domains, I would start with a breadth of
> search equal to one, i.e. -depth 1. This way, after every crawl you can
> evaluate how your crawldb and linkdb are looking with readdb and readlinkdb
> respectively and operate in an incremental manner. This will also allow you
> to use Luke, and you can see the quality of your index.
>
> Please note that we can use URL filters for filtering out domains we do not
> want to search; also have a look at the redirect properties, as well as
> db.ignore.external.links and db.ignore.internal.links in nutch-site.xml.
> Once you read the description of these properties you will get an idea of
> the type of conf setting that will provide the best results.
>
> On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <[email protected]> wrote:
> > Hello,
> >
> > I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
> >
> > and in my urls/sites file, I have two sites like:
> >
> > http://www.mysite.com
> > http://www.mysite2.com
> >
> > I would like to crawl those two sites to infinite depth, and just
> > index all the pages in these sites. But I don't want it to go to remote
> > sites, like facebook, if there is a link from those sites.
> >
> > How do I do it? I know this is a primitive question, but I have looked
> > at all the documentation but could not figure it out.
> >
> > Best Regards,
> > C.B.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
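For completeness, the property-based approach mentioned above is just an override in conf/nutch-site.xml. A minimal sketch (property name as in Nutch 1.x, where it defaults to false):

    <?xml version="1.0"?>
    <configuration>
      <!-- ignore outlinks that leave the host a page was found on;
           as noted above, redirects to other hosts can still slip through -->
      <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
      </property>
    </configuration>

With either this property or the filters in place, the original command can be re-run at a shallow depth and checked incrementally, e.g. bin/nutch crawl urls -dir crawl -depth 1 followed by bin/nutch readdb crawl/crawldb -stats to see which hosts actually ended up in the crawldb.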

