Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NewScoring

New page:
This page describes the new scoring (i.e. WebGraph and Link Analysis) 
functionality in Nutch as of revision 723441.

== General Information ==
The new scoring functionality can be found in 
org.apache.nutch.scoring.webgraph.  This package contains multiple programs 
that build web graphs, perform a stable, convergent link analysis, and update 
the crawldb with the resulting scores.  These programs assume that fetching 
cycles have already been completed and that the user now wants to build a 
global webgraph from those segments and, from that webgraph, perform link 
analysis to obtain a single global relevancy score for each url.  Building a 
webgraph assumes that all links are stored in the current segments to be 
processed; links are not held over from one processing cycle to another.  
Global link-analysis scores are based on the links currently available, and 
scores will change as the link structure of the webgraph changes.

Currently the scoring jobs are not integrated into the Nutch script as commands 
and must be run by their full class names, in the form 
{{{bin/nutch org.apache.nutch.scoring.webgraph.XXXX}}}.

=== WebGraph ===
The WebGraph program is the first job that must be run once all segments are 
fetched and ready to be processed.  WebGraph is found at 
org.apache.nutch.scoring.webgraph.WebGraph. Below is a printout of the 
program's usage.

{{{
usage: WebGraph
 -help                      show this help message
 -segment <segment>         the segment(s) to use
 -webgraphdb <webgraphdb>   the web graph database to use
}}}

The WebGraph program can take multiple segments to process and requires an 
output directory in which to place the completed web graph components.  
WebGraph creates three different components: an inlink database, an outlink 
database, and a node database.  The inlink database is a listing of each url 
and all of its inlinks.  The outlink database is a listing of each url and all 
of its outlinks.  The node database is a listing of each url with node meta 
information, including the number of inlinks and outlinks and, eventually, the 
score for that node.
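The shape of these three components can be sketched with plain dictionaries. This is only an illustration of the data model, not Nutch's actual MapReduce implementation, and the example urls are made up:

```python
# Simplified sketch of the three WebGraph components, built from a
# flat list of (source_url, target_url) links extracted from segments.
from collections import defaultdict

links = [
    ("http://a.com/", "http://b.com/"),
    ("http://a.com/", "http://c.com/"),
    ("http://b.com/", "http://c.com/"),
]

outlinks = defaultdict(list)  # url -> all of its outlinks
inlinks = defaultdict(list)   # url -> all of its inlinks
for src, dst in links:
    outlinks[src].append(dst)
    inlinks[dst].append(src)

# node database: per-url meta information, eventually holding the score
nodes = {
    url: {"num_inlinks": len(inlinks[url]),
          "num_outlinks": len(outlinks[url]),
          "score": 0.0}
    for url in set(outlinks) | set(inlinks)
}
```

In the real webgraph each of these is a MapFile keyed by url, which is what lets the later jobs join them efficiently.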

=== Loops ===
Once the web graph is built we can begin the process of link analysis.  Loops 
is an optional program that attempts to help weed out spam sites by identifying 
link cycles in a web graph.  An example of a link cycle would be sites A, B, C, 
and D, where A links to B, which links to C, which links to D, which links back 
to A.  This program is computationally expensive and usually, due to its time 
and space requirements, can't be run at more than a three- or four-level 
depth.  While it does identify sites that appear to be spam, and those links 
are then discounted in the later LinkRank program, its benefit-to-cost ratio is 
very low.  It is included in this package for completeness and because there 
may be a better way to perform this function with a different algorithm, but 
its use on current production webgraphs is discouraged.  Loops is found at 
org.apache.nutch.scoring.webgraph.Loops. Below is a printout of the program's 
usage.

{{{
usage: Loops
 -help                      show this help message
 -webgraphdb <webgraphdb>   the web graph database to use
}}}
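The A-to-B-to-C-to-D-to-A cycle described above can be found with a depth-limited search. The toy function below is a hypothetical sketch of the idea, not the actual Loops algorithm, which runs as a series of MapReduce jobs over the whole graph (hence the expense):

```python
# Depth-limited search for a link cycle such as A -> B -> C -> D -> A.
def find_cycle(graph, start, max_depth=4):
    """Return a cycle back to `start` within max_depth hops, or None."""
    def dfs(node, path):
        if len(path) > max_depth:
            return None
        for nxt in graph.get(node, []):
            if nxt == start:
                return path + [nxt]  # closed the loop back to start
            if nxt not in path:
                found = dfs(nxt, path + [nxt])
                if found:
                    return found
        return None
    return dfs(start, [start])

graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
```

Even this toy version visits every simple path out of each node up to the depth limit, which hints at why the real job's cost grows so quickly past three or four levels.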

=== LinkRank ===
With the web graph built we can now run LinkRank to perform an iterative link 
analysis.  LinkRank is a PageRank-like link-analysis program that converges to 
stable global scores for each url.  Like PageRank, the LinkRank program starts 
with a common score for all urls.  It then creates a global score for each url 
based on the number of incoming links, the scores of the pages providing those 
links, and the number of outgoing links from those pages.  The process is 
iterative, and scores tend to converge after a given number of iterations.  It 
differs from PageRank in that nepotistic links, such as links internal to a 
website and reciprocal links between websites, can be ignored.  The number of 
iterations can also be configured; by default 10 iterations are performed.  
Unlike the previous OPIC scoring, the LinkRank program does not keep scores 
from one processing run to the next.  The web graph and the link scores are 
recreated on each processing run, so we don't have the problem of 
ever-increasing scores.  LinkRank requires the WebGraph program to have 
completed successfully, and it stores its output scores for each url in the 
node database of the webgraph. LinkRank is found at 
org.apache.nutch.scoring.webgraph.LinkRank. Below is a printout of the 
program's usage.

{{{
usage: LinkRank
 -help                      show this help message
 -webgraphdb <webgraphdb>   the web graph db to use
}}}
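The iterative process can be sketched as a PageRank-style power iteration: every url starts with a common score, and each pass distributes a page's score evenly across its outlinks. This is a simplified stand-in for LinkRank's actual formula (the damping value here is the classic PageRank 0.85, an assumption, not a documented LinkRank default):

```python
# PageRank-style iteration sketch: start from a common score and
# repeatedly redistribute scores along outlinks until they settle.
def link_rank(outlinks, iterations=10, damping=0.85):
    urls = set(outlinks) | {u for outs in outlinks.values() for u in outs}
    score = {u: 1.0 / len(urls) for u in urls}  # common starting score
    for _ in range(iterations):
        nxt = {u: (1 - damping) / len(urls) for u in urls}
        for src, outs in outlinks.items():
            if outs:
                share = damping * score[src] / len(outs)
                for dst in outs:
                    nxt[dst] += share  # credit each outlink target
        score = nxt
    return score

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = link_rank(graph)
```

Running more iterations changes the numbers less and less, which is the convergence the text describes; a url with more, better-scored inlinks (C above) ends up with a higher score than one with fewer (B).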

=== ScoreUpdater ===
Once the LinkRank program has been run and link analysis is completed, the 
scores must be updated into the crawl database to work with the current Nutch 
functionality.  The ScoreUpdater program takes the scores stored in the node 
database of the webgraph and updates them in the crawldb.  If a url exists in 
the crawldb that doesn't exist in the webgraph, its score is cleared in the 
crawldb.  The ScoreUpdater requires that the WebGraph and LinkRank programs 
have both been run, and requires a crawl database to update.  ScoreUpdater is 
found at org.apache.nutch.scoring.webgraph.ScoreUpdater. Below is a printout 
of the program's usage.

{{{
usage: ScoreUpdater
 -crawldb <crawldb>         the crawldb to use
 -help                      show this help message
 -webgraphdb <webgraphdb>   the webgraphdb to use
}}}
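The merge that ScoreUpdater performs can be sketched as follows. This is a hypothetical dict-based stand-in for the actual MapReduce job, with made-up urls and scores:

```python
# Sketch of the ScoreUpdater merge: node-database scores are written
# into the crawldb, and a crawldb url that is missing from the
# webgraph has its score cleared.
def update_scores(crawldb, node_db):
    for url, entry in crawldb.items():
        if url in node_db:
            entry["score"] = node_db[url]["score"]
        else:
            entry["score"] = 0.0  # url not in the webgraph: clear it
    return crawldb

crawldb = {
    "http://a.com/": {"score": 1.5},         # stale score to overwrite
    "http://gone.example/": {"score": 2.0},  # absent from the webgraph
}
node_db = {"http://a.com/": {"score": 0.42}}  # LinkRank output
updated = update_scores(crawldb, node_db)
```

Clearing the score of a url that is absent from the webgraph keeps the crawldb consistent with the per-run recomputation described for LinkRank above.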
