Hi C.B.,

To fetch all the pages in your two domains, I would start with a crawl depth of one, i.e. -depth 1. That way, after every crawl you can evaluate how your crawldb and linkdb look with readdb and readlinkdb respectively, and operate in an incremental manner. It also lets you use Luke to check the quality of your index after each round.
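Concretely, a first round might look something like this (a sketch based on the command in your message; -topN 1000 and the linkdump output directory are just example values):

bin/nutch crawl urls -dir crawl -depth 1 -topN 1000
bin/nutch readdb crawl/crawldb -stats
bin/nutch readlinkdb crawl/linkdb -dump crawl/linkdump

readdb -stats prints the counts per status (db_fetched, db_unfetched, and so on), which is a quick sanity check that your filters are keeping the crawl where you want it before you increase -depth.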
Please note that you can use URL filters to filter out the domains you do not want to crawl. Also have a look at the redirect properties, as well as db.ignore.external.links and db.ignore.internal.links in nutch-site.xml. Once you read the descriptions of these properties you will get an idea of which configuration settings will give the best results; a small example is sketched below the quoted message.

On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <[email protected]> wrote:
> Hello,
>
> I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
>
> and in my urls/sites file, I have two sites like:
>
> http://www.mysite.com
> http://www.mysite2.com
>
> I would like to crawl those two sites to infinite depth, and just
> index all the pages in these sites. But I don't want it to go to remote
> sites, like facebook, if there is a link from those sites.
>
> How do I do it? I know this is a primitive question, but I have looked
> at all the documentation but could not figure it out.
>
> Best Regards,
> C.B.

--
*Lewis*
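For example, a minimal nutch-site.xml sketch (just an assumed starting point for a "stay on these two sites" crawl; do check the property descriptions in nutch-default.xml before copying):

<?xml version="1.0"?>
<configuration>
  <!-- Outlinks that lead to a different host (e.g. facebook) are dropped,
       so the crawl stays on the two seed sites. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>

If you prefer the URL filter route instead, the stock conf/regex-urlfilter.txt ends with an accept-everything line ("+."); replacing it with host-specific rules such as the following (adjust the patterns to your real hostnames) keeps the fetcher on your domains:

+^http://www\.mysite\.com/
+^http://www\.mysite2\.com/
-.

Either approach stops the crawl from wandering off to external sites; db.ignore.external.links is usually the least fiddly.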

