The WebGraph and LinkRank classes work together. The WebGraph is where links from the same domain or the same host can be ignored (or allowed). The configuration parameters:

link.ignore.internal.host = true|false
link.ignore.internal.domain = true|false

can be used to change that behavior. By default it ignores links from the same domain and the same host, so a link from news.google.com wouldn't be counted and wouldn't raise the score for www.google.com. The WebGraph just builds the lists of inlinks, outlinks, and nodes. Then the LinkRank class processes those to create the score. LinkRank follows the original PageRank formula very closely, which is something like:

(1 - dampingFactor) + (dampingFactor * totalInlinkScore)

Where totalInlinkScore is calculated from all the inlinks pointing to a page, taking into account that this is iterative and that pages all start off with the rankOne score, which is (1 / numLinksInWebGraph).
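
To make the iteration concrete, here is a minimal Python sketch of the formula above. It is not the Nutch code. The function name link_rank and its parameters are made up for illustration, and it assumes that each inlink contributes its source page's current score divided by that page's number of outlinks (as in the original PageRank formulation), which the description above does not spell out.

```python
from collections import defaultdict


def link_rank(edges, damping=0.85, iterations=10):
    """Hypothetical helper: score pages from a list of (source, target) links."""
    if not edges:
        return {}

    nodes = {page for edge in edges for page in edge}
    outdegree = defaultdict(int)
    inlinks = defaultdict(list)
    for source, target in edges:
        outdegree[source] += 1
        inlinks[target].append(source)

    # Every page starts at rankOne = 1 / number of links in the graph.
    rank_one = 1.0 / len(edges)
    scores = {page: rank_one for page in nodes}

    # Fixed number of passes, no convergence check.
    for _ in range(iterations):
        new_scores = {}
        for page in nodes:
            # Assumption: a source splits its score evenly over its outlinks.
            total_inlink_score = sum(
                scores[source] / outdegree[source] for source in inlinks[page]
            )
            new_scores[page] = (1 - damping) + damping * total_inlink_score
        scores = new_scores
    return scores
```

Under these assumptions a page with no inlinks always ends up at (1 - dampingFactor), and two pages that link only to each other drift toward a score of 1 as the number of iterations grows.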

The differences are:

  1. The Loops class can be used to identify and remove spam/problem
     links.  This class was supposed to identify reciprocal links and
     link cycles and then allow those links to be removed.  Problem is
     the class is very expensive computationally.  You can set the
     depth you want it to run to, but the cost is worse than
     exponential, so I wouldn't go beyond a depth of 1-3, if at all.
     That will get you reciprocal links and small link cycles
     (a->b->c->a).  Really this doesn't add much to the score in the
     end; I would just leave it off and not run this job.
  2. You can limit duplicate links from pages and domains.  Say page A
     points to B twice, you can limit it and only count it once.
  3. There is a damping factor which is by default set to 0.85.  This
     is the same as in the original PageRank paper.  This is
     configurable with the link.analyze.damping.factor parameter.
  4. LinkRank runs a given number of iterations.  Ideally the job would
     iterate until the scores converge; currently it runs a set number
     of iterations.

I don't remember if the PageRank scores ignored internal links or not. But beyond that, yes, the LinkRank scores should be equivalent (close enough) to PageRank scores. Some things to consider:

  1. PageRank is just one of over 200 signals that Google uses (if they
     still use it) to determine relevancy.  Even if Google still uses
     it, it has most likely changed.  Link analysis scores are good
     global relevancy scores, but a link score does not a search engine
     make today.  Oh how I wish it was that simple.  LinkRank is a good
     starting point, that's it.
  2. This is only as good as the number of pages you have crawled.  The
     larger your set of crawled segments the better the scores get.
  3. A link is a link, it is content agnostic.  If you crawl 100m pages
     and do a LinkRank on that you will see all the usual suspects
     (Google, YouTube, Facebook) but you will also see things like the
     flash download.  To LinkRank a link is a link; it doesn't care
     whether the target is a viewable piece of content.

Hope that answered some of your questions.

Dennis

Although it is configurable, it does ignore reciprocal links and links within the same domain

On 07/13/2011 03:25 AM, Nutch User - 1 wrote:
Does anyone know how the LinkRank scores are calculated exactly? The
only sources of information I have are this wiki page:
(http://wiki.apache.org/nutch/NewScoring) and the source code of the tool.

Is this the only difference from PageRank:
"
It is different from PageRank in that nepotistic links such as links
internal to a website and reciprocal links between websites can be
ignored. The number of iterations can also be configured; by default 10
iterations are performed.
"
?

I.e. if internal links are not ignored, would the LinkRank scores be
equivalent to PageRank scores?
