Hi James,

As Markus said, you can set db.ignore.external.links to true so that you only process outlinks within the same domain as the page they're found on.
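In nutch-site.xml that override would look something like this (a minimal sketch; db.ignore.external.links is a standard Nutch property, the description text is my own wording):

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to a different domain than the
  page they were found on are ignored when updating the crawl db.</description>
</property>
```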
This has one (usually minor) side effect: you toss links that go between domains that are in your seed list. If that's an issue, then you could take your 5K URLs, extract the domains, dedup them, and then use that list with domain-based URL filtering.

-- Ken

On May 9, 2012, at 8:09am, James Ford wrote:

> Hello,
>
> I am wondering how to crawl only the domains of an injected seed without
> adding external URLs to the database?
>
> Let's say I have 5k URLs in my seed, and I want Nutch to crawl everything (or
> some million URLs) for each domain in the fastest way possible.
>
> What settings should I use?
>
> I will have topN at about 20k, and I want db_unfetched to be around 20k
> for each iteration.
>
> What should I set "db.max.outlinks.per.page" to? I was wondering about
> setting it to 4, to get 4*5k=20k for the first iteration.
>
> Can anyone help me?
>
> Thanks,
> James Ford
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
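The extract-and-dedup step Ken describes could be sketched like this (an illustrative Python helper, not part of Nutch; it treats the URL hostname as the "domain", whereas a real setup might want public-suffix-aware registered domains):

```python
# Sketch: pull the hostnames out of a seed URL list and deduplicate them,
# producing a list suitable as input to domain-based URL filtering.
from urllib.parse import urlparse

def seed_domains(seed_urls):
    """Return the sorted, deduplicated hostnames found in seed_urls."""
    domains = set()
    for url in seed_urls:
        host = urlparse(url).hostname
        if host:                      # skip malformed or host-less URLs
            domains.add(host.lower())
    return sorted(domains)

seeds = [
    "http://www.example.com/page1",
    "http://www.example.com/page2",
    "https://blog.example.org/",
]
print(seed_domains(seeds))
```

The resulting one-domain-per-line list can then be written to a file and referenced by whatever domain URL filter configuration you use.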

