The WebGraph and LinkRank classes work together. The WebGraph is where
links from either the same domain or the same host can be ignored (or
allowed). The configuration parameters:
link.ignore.internal.host = true|false
link.ignore.internal.domain = true|false
can be used to change that behavior. By default it ignores links from
the same domain and the same host. So a link from news.google.com wouldn't
be counted and wouldn't raise the score for www.google.com. The WebGraph
just builds the lists of inlinks, outlinks, and nodes. Then the
LinkRank class processes that to create the score. LinkRank follows
the original PageRank formula very closely, which is something like:
(1 - dampingFactor) + (dampingFactor * totalInlinkScore)
Where totalInlinkScore is calculated from all the inlinks pointing
to a page, taking into account that this is iterative and pages all
start off with a rankOne score, which is (1 / numLinksInWebGraph).
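To make the iteration concrete, here is a minimal Python sketch of the formula above. It is not the Nutch code. The function name link_rank and the toy graph are made up for illustration, and it assumes two things the description leaves open: that numLinksInWebGraph means the number of nodes, and that each inlink contributes its current score divided by its number of outlinks, as in the original PageRank paper.

```python
def link_rank(outlinks, iterations=10, damping=0.85):
    """Toy iterative link scoring over a dict of page -> list of outlinked pages."""
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)

    # Every page starts at rankOne (assumption: 1 / number of nodes).
    rank_one = 1.0 / len(nodes)
    scores = {n: rank_one for n in nodes}

    # Invert the outlink lists into inlink lists.
    inlinks = {n: [] for n in nodes}
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks[dst].append(src)

    # Fixed number of passes, no convergence check.
    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            # Assumption: an inlink passes on its score split evenly over its outlinks.
            total_inlink_score = sum(
                scores[src] / len(outlinks[src]) for src in inlinks[n]
            )
            new_scores[n] = (1 - damping) + damping * total_inlink_score
        scores = new_scores
    return scores


# Hypothetical three-page graph: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = link_rank(graph)
```

In this toy graph c ends up above b, since c collects score from both a and b while b only gets half of a's.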
The differences are:
1. The Loops class can be used to identify and remove spam/problem
links. This class was supposed to identify reciprocal links and
link cycles and then allow those links to be removed. Problem is
the class is very expensive computationally. You can set the
depth you want it to run but it is worse than exponential so I
wouldn't do more than 1-3 depth if at all. That will get you
reciprocal links and small link cycles (a->b->c->a). Really this
doesn't add much to score in the end, I would just leave it off
and not run this job.
2. You can limit duplicate links from pages and domains. Say page A
points to B twice, you can limit it and only count it once.
3. There is a damping factor which is by default set to 0.85. This
is the same as the original pagerank paper. This is configurable
with the link.analyze.damping.factor parameter.
4. LinkRank runs a given number of iterations. Ideally the job would
iterate until the scores converge; currently it is a set number of
iterations.
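The link filtering described above (internal host/domain links and duplicate links) can be sketched roughly like this in Python. It is an illustration, not how WebGraph does it. The function names and keyword arguments are hypothetical, and the domain is naively taken as the last two labels of the host, which is wrong for suffixes like co.uk.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL, or an empty string if there is none."""
    return urlsplit(url).hostname or ""


def naive_domain_of(url):
    # Assumption: the domain is the last two labels of the host.
    return ".".join(host_of(url).split(".")[-2:])


def filter_links(links, ignore_internal_host=True,
                 ignore_internal_domain=True, limit_duplicates=True):
    """Filter (source_url, target_url) pairs before they are counted as links."""
    kept = []
    seen = set()
    for src, dst in links:
        # Same idea as link.ignore.internal.host = true
        if ignore_internal_host and host_of(src) == host_of(dst):
            continue
        # Same idea as link.ignore.internal.domain = true
        if ignore_internal_domain and naive_domain_of(src) == naive_domain_of(dst):
            continue
        # Page A pointing to B twice is only counted once.
        if limit_duplicates:
            if (src, dst) in seen:
                continue
            seen.add((src, dst))
        kept.append((src, dst))
    return kept
```

With both ignore flags on, a link from a page on news.google.com to www.google.com is dropped. With only the host flag on it is kept, because the hosts differ.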
I don't remember if the PageRank scores ignored internal links or not.
But beyond that, yes, the LinkRank scores should be equivalent (close
enough) to PageRank scores. Some things to consider:
1. PageRank is just one of over 200 signals that Google uses (if they
still use it) to determine relevancy. Even if Google still uses
it, it has most likely changed. Link analysis scores are good
global relevancy scores, but a link score does not a search engine
make today. Oh how I wish it was that simple. LinkRank is a good
starting point, that's it.
2. This is only as good as the number of pages you have crawled. The
larger your set of crawled segments, the better the scores get.
3. A link is a link, it is content agnostic. If you crawl 100m pages
and do a LinkRank on that you will see all the usual suspects
(Google, YouTube, Facebook) but you will also see things like the
Flash download. To LinkRank a link is a link, it doesn't care
whether the target is a viewable piece of content.
Hope that answered some of your questions.
Dennis
Although it is configurable, it does ignore reciprocal links and links
within the same domain
On 07/13/2011 03:25 AM, Nutch User - 1 wrote:
Does anyone know how the LinkRank scores are calculated exactly? The
only sources of information I have are this wiki page:
(http://wiki.apache.org/nutch/NewScoring) and the source code of the tool.
Is this the only difference from PageRank:
"
It is different from PageRank in that nepotistic links such as links
internal to a website and reciprocal links between websites can be
ignored. The number of iterations can also be configured; by default 10
iterations are performed.
"
?
I.e. if internal links are not ignored, would the LinkRank scores be
equivalent to PageRank scores?