Hello Lewis,

Thank you very much. I have indeed figured it out, and now Nutch is indexing like I want.
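In case it helps anyone searching the archives later: the db.ignore.external.links property Lewis points at below is the one that keeps the crawl from following outlinks to other hosts. A rough sketch of the kind of entry that would go in conf/nutch-site.xml (the property name is Nutch's own; the value and description text are just illustrative):

    <configuration>
      <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
        <description>Discard outlinks that point to a different host,
        so the crawl never leaves the seed sites.</description>
      </property>
    </configuration>

The URLFilters he mentions are another way to get a similar effect; a sketch of conf/regex-urlfilter.txt, using the site names from my original mail (patterns here are only illustrative, not a complete filter file):

    +^http://www\.mysite\.com
    +^http://www\.mysite2\.com
    -.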
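For reference, the depth-1-at-a-time approach suggested below translates into something roughly like the following (the -topN value and the dump directory are just example values):

    # one shallow crawl round against the seed list in urls/
    bin/nutch crawl urls -dir crawl -depth 1 -topN 1000

    # then inspect what ended up in the crawldb and linkdb
    bin/nutch readdb crawl/crawldb -stats
    bin/nutch readlinkdb crawl/linkdb -dump linkdump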
Best,
C.B.

On Wed, Jul 6, 2011 at 6:32 PM, lewis john mcgibbney <[email protected]> wrote:
> Hi C.B.,
>
> To fetch all the pages in your two domains, I would start with a breadth of
> search equal to one, i.e. -depth 1. This way, after every crawl you can
> evaluate how your crawldb and linkdb are looking with readdb and readlinkdb
> respectively, and operate in an incremental manner. This will also allow you
> to use Luke, so you can see the quality of your index.
>
> Please note that we can use URLFilters for filtering out domains we do not
> want to search. Also have a look at the redirect properties, as well as
> db.ignore.external.links and db.ignore.internal.links in nutch-site.xml.
> Once you read the description of these properties you will get an idea of
> the type of conf setting which will provide the best results.
>
> On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <[email protected]> wrote:
>
>> Hello,
>>
>> I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
>>
>> and in my urls/sites file, I have two sites like:
>>
>> http://www.mysite.com
>> http://www.mysite2.com
>>
>> I would like to crawl those two sites to infinite depth, and just
>> index all the pages in these sites. But I don't want it to go to remote
>> sites, like Facebook, if there is a link from those sites.
>>
>> How do I do it? I know this is a primitive question, but I have looked
>> through all the documentation but could not figure it out.
>>
>> Best Regards,
>> C.B.
>>
>
>
>
> --
> *Lewis*
>

