Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/NewScoring

------------------------------------------------------------------------------
  This page describes the new scoring (i.e. !WebGraph and Link Analysis) functionality in Nutch as of revision 723441.
  
  == General Information ==
  The new scoring functionality can be found in org.apache.nutch.scoring.webgraph.  This package contains multiple programs that build web graphs, perform a stable convergent link-analysis, and update the crawldb with those scores.  These programs assume that fetching cycles have already been completed and that the user now wants to build a global webgraph from those segments, and from that webgraph perform link-analysis to get a single global relevancy score for each url.  Building a webgraph assumes that all links are stored in the current segments to be processed.  Links are not held over from one processing cycle to another.  Global link-analysis scores are based on the links currently available, and scores will change as the link structure of the webgraph changes.

  Currently the scoring jobs are not integrated into the Nutch script as 
commands and must be run in the form bin/nutch 
org.apache.nutch.scoring.webgraph.XXXX.
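  The whole run can be sketched as a sequence of these four commands.  This is only an illustration: the directory names (crawl/webgraphdb, crawl/crawldb, crawl/segments/...) are placeholders, and any option not shown in the usage printouts below is an assumption, so verify each command with -help against your build first.

  {{{
  # Build the web graph from one or more fetched segments (paths are examples).
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
    -webgraphdb crawl/webgraphdb -segment crawl/segments/20090214120000

  # Optionally look for link cycles to discount (expensive; see Loops below).
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb

  # Run the iterative link analysis; scores are written to the node database.
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb

  # Update the crawldb with the new scores.
  bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater \
    -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
  }}}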
  
  === WebGraph ===
  The !WebGraph program is the first job that must be run once all segments are fetched and ready to be processed.  !WebGraph is found at org.apache.nutch.scoring.webgraph.!WebGraph.  Below is a printout of the program's usage.
  
  {{{
  usage: WebGraph
   -webgraphdb <webgraphdb>   the web graph database to use
  }}}
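  For example, a run over two segments might look like the following.  The printout above is truncated by the diff, so the -segment option and the segment paths here are assumptions; confirm them with -help.

  {{{
  # Hypothetical invocation: -segment and the segment paths are assumptions.
  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph \
    -webgraphdb crawl/webgraphdb \
    -segment crawl/segments/20090214120000 \
    -segment crawl/segments/20090215120000
  }}}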
  
  The !WebGraph program can take multiple segments to process and requires an output directory in which to place the completed web graph components.  !WebGraph creates three different components: an inlink database, an outlink database, and a node database.  The inlink database is a listing of each url and all of its inlinks.  The outlink database is a listing of each url and all of its outlinks.  The node database is a listing of each url with node meta information, including the number of inlinks and outlinks, and eventually the score for that node.
  
  === Loops ===
  Once the web graph is built we can begin the process of link analysis.  Loops is an optional program that attempts to help weed out spam sites by determining link cycles in a web graph.  An example of a link cycle would be sites A, B, C, and D, where A links to B, which links to C, which links to D, which links back to A.  This program is computationally expensive and usually, due to its time and space requirements, can't be run at more than a three or four level depth.  While it does identify sites which appear to be spam, and those links are then discounted in the later !LinkRank program, its benefit to cost ratio is very low.  It is included in this package for completeness and because there may be a better way to perform this function with a different algorithm.  But on current large production webgraphs, its use is discouraged.  Loops is found at org.apache.nutch.scoring.webgraph.Loops.  Below is a printout of the program's usage.
  
  {{{
  usage: Loops
  }}}
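  If you do decide to run it on a small webgraph, an illustrative invocation (with a placeholder webgraph directory) would be:

  {{{
  # Optional and expensive; discouraged on large production webgraphs.
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb
  }}}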
  
  === LinkRank ===
  With the web graph built we can now run !LinkRank to perform an iterative link analysis.  !LinkRank is a !PageRank-like link analysis program that converges to stable global scores for each url.  Similar to !PageRank, the !LinkRank program starts with a common score for all urls.  It then creates a global score for each url based on the number of incoming links, the scores for those links, and the number of outgoing links from the page.  The process is iterative and scores tend to converge after a given number of iterations.  It is different from !PageRank in that nepotistic links, such as links internal to a website and reciprocal links between websites, can be ignored.  The number of iterations can also be configured; by default 10 iterations are performed.  Unlike the previous OPIC scoring, the !LinkRank program does not keep scores from one processing time to another.  The web graph and the link scores are recreated at each processing run, so we don't have the problem of ever increasing scores.  !LinkRank requires the !WebGraph program to have completed successfully, and it stores its output scores for each url in the node database of the webgraph.  !LinkRank is found at org.apache.nutch.scoring.webgraph.!LinkRank.  Below is a printout of the program's usage.
  
  {{{
  usage: LinkRank
  }}}
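  An illustrative invocation follows; the webgraph directory is a placeholder, and the iteration count is left at its default of 10 here:

  {{{
  # Runs the iterative analysis; scores end up in the webgraph's node database.
  bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb
  }}}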
  
  === ScoreUpdater ===
  Once the !LinkRank program has been run and link analysis is completed, the scores must be updated into the crawl database to work with the current Nutch functionality.  The !ScoreUpdater program takes the scores stored in the node database of the webgraph and updates them into the crawldb.  If a url exists in the crawldb that doesn't exist in the webgraph, then its score is cleared in the crawldb.  !ScoreUpdater requires that the !WebGraph and !LinkRank programs have both been run, and requires a crawl database to update.  !ScoreUpdater is found at org.apache.nutch.scoring.webgraph.!ScoreUpdater.  Below is a printout of the program's usage.
  
  {{{
  usage: ScoreUpdater
  }}}
