Re: Per-page crawling policy

2006-01-17 Thread Ken Krugler
Hi Andrzej, 1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which then means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency. The policies approach that I described is able to follow and distribute the
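
A rough sketch of the bottleneck Ken describes (illustrative Java; not Nutch's actual fetcher code, and all names here are invented): with polite crawling, fetch parallelism is capped per host, so a frontier concentrated on a few hosts leaves most fetcher threads idle no matter how many you start.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class PoliteFetchLimiter {
    private final int maxThreadsPerHost;
    private final Map<String, Semaphore> perHost = new ConcurrentHashMap<>();

    PoliteFetchLimiter(int maxThreadsPerHost) {
        this.maxThreadsPerHost = maxThreadsPerHost;
    }

    // Returns true if a fetch slot for this host is free; false means the
    // thread must wait or skip the URL -- the politeness bottleneck.
    boolean tryAcquire(String host) {
        return perHost
            .computeIfAbsent(host, h -> new Semaphore(maxThreadsPerHost))
            .tryAcquire();
    }

    void release(String host) {
        perHost.get(host).release();
    }
}

With, say, maxThreadsPerHost = 2 and a frontier dominated by three hosts, at most six fetches run concurrently; spreading URLs across many hosts, as the policies approach aims to, is what restores throughput.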

Re: Per-page crawling policy

2006-01-16 Thread Ken Krugler
Hi Andrzej, I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a crawl frontier. Let's suppose we have several initial seed urls, each with a different subjective quality. We would like to crawl these, and expand the

Re: Per-page crawling policy

2006-01-16 Thread Andrzej Bialecki
Hi Ken, First of all, thanks for sharing your insights, that's a very interesting read. Ken Krugler wrote: This sounds like the TrustRank algorithm. See http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust attenuation via trust dampening (reducing the trust level as you get
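
For reference, the dampening rule under discussion is simple enough to state in a few lines (a sketch; BETA and MIN_TRUST are illustrative values, not taken from the paper or from Nutch):

class TrustDampening {
    static final float BETA = 0.85f;       // dampening factor (assumed)
    static final float MIN_TRUST = 0.05f;  // stop expanding below this (assumed)

    // Trust of a page reached d links away from a seed with trust t0:
    // each hop keeps only a fraction BETA of its parent's trust.
    static float trustAtDepth(float t0, int d) {
        return t0 * (float) Math.pow(BETA, d);
    }

    static boolean worthCrawling(float t0, int d) {
        return trustAtDepth(t0, d) >= MIN_TRUST;
    }
}

So a page three hops from a 1.0-trust seed gets 0.85^3 ≈ 0.61, and the frontier naturally stops expanding once inherited trust falls below the cutoff.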

Re: Per-page crawling policy

2006-01-06 Thread Jack Tang
Hi Andrzej, The idea brings vertical search into Nutch, and it is definitely great :) I think Nutch should add an information-retrieval layer into the whole architecture, and export some abstract interface, say UrlBasedInformationRetrieve (you can implement your URL grouping idea here?),
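
A guess at the abstraction Jack is proposing, as a hypothetical sketch -- the interface name is his, but the methods are invented here for illustration:

public interface UrlBasedInformationRetrieve {
    // Assign a URL to a group (e.g. by host, path pattern, or topic).
    String groupFor(String url);

    // Decide whether a URL belonging to the given group should be fetched.
    boolean accept(String url, String group);
}

Plugins could then supply their own grouping (per-host, per-policy, per-topic) without the core crawler knowing which strategy is in play.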

Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Hi, I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a crawl frontier. Let's suppose we have several initial seed urls, each with a different subjective quality. We would like to crawl these, and expand the crawling
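
One minimal way to picture the proposal (a sketch under assumed names, not Nutch code): seeds carry a subjective quality score, and the frontier hands out the highest-quality URLs first.

import java.util.PriorityQueue;

class ScoredUrl implements Comparable<ScoredUrl> {
    final String url;
    final float quality;  // subjective seed quality, e.g. 0..1

    ScoredUrl(String url, float quality) {
        this.url = url;
        this.quality = quality;
    }

    public int compareTo(ScoredUrl o) {
        return Float.compare(o.quality, quality);  // highest quality first
    }
}

class Frontier {
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>();

    void addSeed(String url, float quality) {
        queue.add(new ScoredUrl(url, quality));
    }

    ScoredUrl next() {
        return queue.poll();  // null when the frontier is exhausted
    }
}

Expansion would then enqueue outlinks with a score derived from the parent's, for instance dampened as in the TrustRank exchange above.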

Re: Per-page crawling policy

2006-01-05 Thread Byron Miller
Excellent ideas. That is what I'm hoping to do: use some of the social bookmarking ideas to build the starter sites and link maps from. I hope to work with Simpy or other bookmarking projects to build something of a popularity map (human-edited authority) to merge and calculate against a

Re: Per-page crawling policy

2006-01-05 Thread Stefan Groschupf
I like the idea, and it is another step in the direction of vertical search, where I personally see the biggest chance for Nutch. How to implement it? Surprisingly, I think that it's very simple - just adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and
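
The one-field change quoted here, sketched in isolation (an illustration, not the actual Nutch patch -- CrawlDatum itself is elided):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

class CrawlDatumWithPolicy /* extends CrawlDatum */ {
    private byte policyId;  // index into an external table of named policies

    public byte getPolicyId() { return policyId; }
    public void setPolicyId(byte policyId) { this.policyId = policyId; }

    // Writable-style (de)serialization, matching how Nutch persists
    // its crawl data structures:
    public void write(DataOutput out) throws IOException {
        out.writeByte(policyId);
    }

    public void readFields(DataInput in) throws IOException {
        policyId = in.readByte();
    }
}

A single byte keeps the per-URL overhead minimal while still allowing 256 distinct policies, which is presumably why the proposal calls it "very simple".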

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Stefan Groschupf wrote: I like the idea, and it is another step in the direction of vertical search, where I personally see the biggest chance for Nutch. How to implement it? Surprisingly, I think that it's very simple - just adding a CrawlDatum.policyId field would suffice, assuming we

Re: Per-page crawling policy

2006-01-05 Thread Doug Cutting
Stefan Groschupf wrote: Before we start adding metadata and more metadata, why not add metadata to the CrawlDatum once, in general? Then we can have any kind of plugin that adds and processes metadata that belongs to a URL. +1 This feature strikes me as something that might prove very
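
The more general alternative being +1'd here, sketched (illustrative only; Nutch did later grow a metadata map on CrawlDatum, but this is not that code):

import java.util.HashMap;
import java.util.Map;

class CrawlDatumMetadata {
    private final Map<String, String> metaData = new HashMap<>();

    // Each plugin reads and writes its own keys, e.g. "policyId"
    // or "trust", without touching the CrawlDatum class itself.
    public void put(String key, String value) {
        metaData.put(key, value);
    }

    public String get(String key) {
        return metaData.get(key);
    }
}

The trade-off versus a dedicated policyId field: more bytes per URL and string-keyed lookups, but no schema change every time a plugin needs a new attribute.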

Re: Per-page crawling policy

2006-01-05 Thread Andrzej Bialecki
Doug Cutting wrote: Stefan Groschupf wrote: Before we start adding metadata and more metadata, why not add metadata to the CrawlDatum once, in general? Then we can have any kind of plugin that adds and processes metadata that belongs to a URL. +1 This feature strikes me as

Re: Per-page crawling policy

2006-01-05 Thread Neal Whitley
Andrzej, This sounds like another great way to create more of a vertical search application. By defining trusted seed sources we can limit the scope of the crawl to a more suitable set of links. Also, being able to apply additional rules by domain/host or by trusted source would be
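
One reading of the per-domain/host rules Neal mentions, as a hypothetical filter (names and thresholds invented here; this is not an existing Nutch URLFilter):

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class HostRuleFilter {
    private final Map<String, Float> hostTrust = new HashMap<>();

    void setTrust(String host, float trust) {
        hostTrust.put(host, trust);
    }

    // Accept a URL only if its host has been assigned at least the
    // given trust level; unknown hosts default to zero trust.
    boolean accept(String url, float threshold) {
        try {
            String host = URI.create(url).getHost();
            return host != null
                && hostTrust.getOrDefault(host, 0f) >= threshold;
        } catch (IllegalArgumentException e) {
            return false;  // malformed URL
        }
    }
}

Layered behind a trusted seed list, a filter like this keeps the crawl scoped to the vertical without needing per-page scoring.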