Hi James,

As Markus said, you can set db.ignore.external.links to true so that you only 
process outlinks within the same domain as the page they're found on.
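For reference, a minimal nutch-site.xml fragment (property name as found in nutch-default.xml; double-check it against your Nutch version):

```xml
<!-- Restrict outlinks to the domain of the page they were found on. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to external domains are ignored.</description>
</property>
```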

This has one (usually minor) side effect: you toss links that go between 
domains that are in your seed list.

If that's an issue, then you could take your 5K URLs, extract the domains, 
dedup, and then use that list with domain-based URL filtering.
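A rough sketch of that extraction step, assuming a seed file with one URL per line (file names here are just examples; the resulting list would feed Nutch's domain URL filter, conf/domain-urlfilter.txt with the urlfilter-domain plugin enabled):

```shell
# Example seed file; in practice this would be your real 5K-URL seed list.
printf '%s\n' \
  'http://www.example.com/page1' \
  'https://www.example.com/page2' \
  'http://blog.another.org/post' > seeds.txt

# Strip the scheme, then everything from the first '/' or ':' onward,
# then sort -u to dedup the host names.
sed -E 's|^[a-z]+://([^/:]+).*|\1|' seeds.txt | sort -u > domains.txt
cat domains.txt
# blog.another.org
# www.example.com
```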

-- Ken

On May 9, 2012, at 8:09am, James Ford wrote:

> Hello,
> 
> I am wondering how to crawl only the domains of an injected seed, without
> adding external URLs to the database?
> 
> Let's say I have 5k URLs in my seed, and I want Nutch to crawl everything (or
> some million URLs) for each domain in the fastest way possible.
> 
> What settings should I use?
> 
> I will have topN at about 20k, and I want db_unfetched to be around 20k
> for each iteration.
> 
> What should I set "db.max.outlinks.per.page" to? I was wondering about
> setting it to 4, to get 4*5k=20k for the first iteration?
> 
> Can anyone help me? 
> 
> Thanks,
> James Ford
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Make-Nutch-to-crawl-internal-urls-only-tp3974397.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



