Hi Andrzej,
1. Even with a pretty broad area of interest, you wind up focusing
on a subset of all domains, which then means that the max-threads-
per-host limit (for polite crawling) starts killing your efficiency.
The policies approach that I described is able to follow and
distribute the
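The per-host politeness limit mentioned above can be sketched as a per-host semaphore. This is a minimal illustration, not Nutch's actual fetcher code; the class and method names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Sketch: cap the number of concurrent fetches per host (polite crawling). */
public class HostThrottle {
    private final int maxThreadsPerHost;
    private final Map<String, Semaphore> perHost = new ConcurrentHashMap<>();

    HostThrottle(int maxThreadsPerHost) {
        this.maxThreadsPerHost = maxThreadsPerHost;
    }

    /** Returns true if a fetch slot for this host is available, taking it. */
    boolean tryAcquire(String host) {
        return perHost
            .computeIfAbsent(host, h -> new Semaphore(maxThreadsPerHost))
            .tryAcquire();
    }

    /** Call after the fetch finishes; assumes tryAcquire succeeded earlier. */
    void release(String host) {
        perHost.get(host).release();
    }
}
```

When a crawl concentrates on a few domains, every fetcher thread ends up blocked in `tryAcquire` for the same handful of hosts, which is the efficiency loss described above.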
Hi Andrzej,
I've been toying with the following idea, which is an extension of
the existing URLFilter mechanism and the concept of a crawl
frontier.
Let's suppose we have several initial seed urls, each with a
different subjective quality. We would like to crawl these, and
expand the
Hi Ken,
First of all, thanks for sharing your insights, that's a very
interesting read.
Ken Krugler wrote:
This sounds like the TrustRank algorithm. See
http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust
attenuation via trust dampening (reducing the trust level as you get
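Trust dampening, as described in the TrustRank paper, multiplies a page's trust by a factor below one for each link hop away from a trusted seed. A minimal sketch, with an illustrative damping factor:

```java
/** Minimal sketch of TrustRank-style trust dampening. */
public class TrustDampening {
    static final double BETA = 0.85; // damping factor per hop (illustrative value)

    /** Trust of a page reached in `hops` steps from a seed with trust `seedTrust`. */
    static double dampenedTrust(double seedTrust, int hops) {
        return seedTrust * Math.pow(BETA, hops);
    }
}
```

So a seed with trust 1.0 confers roughly 0.72 two hops out; pages far from any seed receive almost no trust.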
Hi Andrzej
The idea brings vertical search into Nutch, and it is definitely great :)
I think Nutch should add an information-retrieval layer to the whole
architecture and export some abstract interface, say
UrlBasedInformationRetrieve (you could implement your URL grouping idea
here?),
Hi,
I've been toying with the following idea, which is an extension of the
existing URLFilter mechanism and the concept of a crawl frontier.
Let's suppose we have several initial seed urls, each with a different
subjective quality. We would like to crawl these, and expand the
crawling
Excellent ideas, and that is what I'm hoping to do: use some of the
social bookmarking ideas to build the starter sites and link maps from.
I hope to work with Simpy or other bookmarking projects to build
something of a popularity map (human-edited authority) to merge and
calculate against a
I like the idea, and it is another step in the direction of vertical
search, where I personally see the biggest chance for Nutch.
How to implement it? Surprisingly, I think that it's very simple -
just adding a CrawlDatum.policyId field would suffice, assuming we
have a means to store and
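The `CrawlDatum.policyId` proposal can be illustrated with a stand-in record and a registry. This is not Nutch's actual `CrawlDatum` API; the class names and the string-valued policies are hypothetical, just to show how a small integer field can select a per-URL crawl policy stored elsewhere:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for a CrawlDatum extended with the proposed policyId field. */
public class PolicyDatum {
    final String url;
    final int policyId; // key into an external registry of crawl policies

    PolicyDatum(String url, int policyId) {
        this.url = url;
        this.policyId = policyId;
    }
}

/** Hypothetical registry mapping a policyId to a named crawl policy. */
class PolicyRegistry {
    private final Map<Integer, String> policies = new HashMap<>();

    void register(int id, String policyName) {
        policies.put(id, policyName);
    }

    /** Resolve the policy that governs this datum's URL. */
    String lookup(PolicyDatum d) {
        return policies.get(d.policyId);
    }
}
```

The appeal of the integer field is exactly what the mail says: the datum itself stays small, and all policy detail lives in a separately stored table.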
Stefan Groschupf wrote:
I like the idea, and it is another step in the direction of vertical
search, where I personally see the biggest chance for Nutch.
How to implement it? Surprisingly, I think that it's very simple -
just adding a CrawlDatum.policyId field would suffice, assuming we
Stefan Groschupf wrote:
Before we start adding metadata field by field, why not add generic
metadata support to the CrawlDatum once; then we can have any kind of
plugin that adds and processes metadata that belongs to a URL.
+1
This feature strikes me as something that might prove very
Doug Cutting wrote:
Stefan Groschupf wrote:
Before we start adding metadata field by field, why not add generic
metadata support to the CrawlDatum once; then we can have any kind of
plugin that adds and processes metadata that belongs to a URL.
+1
This feature strikes me as
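Stefan's generic-metadata suggestion amounts to giving the datum an open key/value map that any plugin can read or write. A minimal sketch (again an illustrative class, not Nutch's actual `CrawlDatum`; the keys shown are hypothetical examples):

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative datum carrying a generic metadata map for plugins to use. */
public class MetaDatum {
    final String url;
    private final Map<String, String> metaData = new HashMap<>();

    MetaDatum(String url) {
        this.url = url;
    }

    /** A plugin stores its per-URL state under its own key. */
    void putMeta(String key, String value) {
        metaData.put(key, value);
    }

    /** Returns null if no plugin has set this key. */
    String getMeta(String key) {
        return metaData.get(key);
    }
}
```

Under this design the `policyId` discussed earlier becomes just one entry in the map, so new per-URL annotations need no further schema changes.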
Andrzej,
This sounds like another great way to create more of a vertical
search application as well. By defining trusted seed sources we can
limit the scope of the crawl to a more suitable set of links.
Also, being able to apply additional rules by domain/host or by
trusted source would be