On Feb 2, 2011, at 5:18 AM, David Saile wrote:

> Hi all,
> 
> I have a question concerning updating a site's score in Nutch 1.2.
> 
> In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a call to 
>       scfilters.updateDbScore((Text)key, oldSet ? old : null, result, 
> linkList);
> 
> During debugging, I discovered that this method is executed in the 
> org.apache.nutch.scoring.opic.OPICScoringFilter class.  The code for this 
> method is the following:
>       /** Increase the score by a sum of inlinked scores. */
>  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List 
> inlinked) throws ScoringFilterException {
>    float adjust = 0.0f;
>    for (int i = 0; i < inlinked.size(); i++) {
>      CrawlDatum linked = (CrawlDatum)inlinked.get(i);
>      adjust += linked.getScore();
>    }
>    if (old == null) old = datum;
>    datum.setScore(old.getScore() + adjust);
>  }
> 
> To my understanding, this code would increase a site's score based on its 
> inlinks every time the site is crawled. So even if the site has not been 
> modified and no new inlink was discovered, the site's score will increase.
> 
> Is my understanding of this mechanism correct? 
> If so, could anyone explain to me why a site's score is increased in any case? 
> I would expect it to only change if either its content has changed, or a new 
> inlink has been discovered.
> 

Your observations are correct. We recently ran into this exact same issue and 
have determined that the OPICScoringFilter is not suitable for crawls where 
pages will be re-fetched / re-parsed. The page score is increased each time 
the page is fetched, eventually resulting in a score of Infinity.

The "Online Page Importance Computation" (OPIC) score algorithm is described in 
this paper => http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html

The purpose of the algorithm is that you do not have to maintain the entire 
link graph in memory to compute the score imparted to inlinks and outlinks. The 
downside is that you cannot determine if a page's score has already been 
included in the outlinks to another page. Hence the infinite score growth you 
have observed.

This behavior only appears if you are re-fetching / re-parsing pages.
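
To make the growth concrete, here is a minimal sketch that re-applies the 
quoted updateDbScore logic several times with an unchanged inlink list. It does 
not use Nutch itself: Datum is a hypothetical stand-in for CrawlDatum that 
models only the score, and simulate() and the sample scores are made up for 
illustration.

```java
import java.util.List;

public class OpicGrowthSketch {
    // Hypothetical stand-in for Nutch's CrawlDatum; only the score is modelled.
    static final class Datum {
        private float score;
        Datum(float score) { this.score = score; }
        float getScore() { return score; }
        void setScore(float score) { this.score = score; }
    }

    // Same logic as the updateDbScore method quoted above.
    static void updateDbScore(Datum old, Datum datum, List<Datum> inlinked) {
        float adjust = 0.0f;
        for (Datum linked : inlinked) {
            adjust += linked.getScore();
        }
        if (old == null) old = datum;
        datum.setScore(old.getScore() + adjust);
    }

    // Re-applies the update `rounds` times; the inlink list never changes.
    static float simulate(float initial, List<Datum> inlinks, int rounds) {
        Datum stored = new Datum(initial);
        for (int i = 0; i < rounds; i++) {
            Datum result = new Datum(stored.getScore());
            updateDbScore(stored, result, inlinks);
            stored = result;
        }
        return stored.getScore();
    }

    public static void main(String[] args) {
        // Two inlinks with fixed (made-up) scores; nothing new is discovered.
        List<Datum> inlinks = List.of(new Datum(0.5f), new Datum(0.25f));
        for (int rounds = 0; rounds <= 3; rounds++) {
            System.out.println(rounds + " update(s): "
                    + simulate(1.0f, inlinks, rounds));
        }
    }
}
```

With these sample values the stored score goes 1.0, 1.75, 2.5, 3.25: the same 
0.75 of inlink score is added again on every update, even though nothing about 
the page or its inlinks has changed. In this sketch the inlink scores are held 
constant, so the growth is only linear.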

Blessings,
TwP
