On Feb 2, 2011, at 5:18 AM, David Saile wrote:
> Hi all,
>
> I have a question concerning updating a site's score in Nutch 1.2.
>
> In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a call to
> scfilters.updateDbScore((Text)key, oldSet ? old : null, result,
> linkList);
>
> During debugging, I discovered that this method is executed in the
> org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
> method is the following:
> /** Increase the score by a sum of inlinked scores. */
> public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List
> inlinked) throws ScoringFilterException {
> float adjust = 0.0f;
> for (int i = 0; i < inlinked.size(); i++) {
> CrawlDatum linked = (CrawlDatum)inlinked.get(i);
> adjust += linked.getScore();
> }
> if (old == null) old = datum;
> datum.setScore(old.getScore() + adjust);
> }
>
> To my understanding, this code would increase a site's score based on its
> inlinks every time the site is crawled. So even if the site has not been
> modified and no new inlink was discovered, the site's score will increase.
>
> Is my understanding of this mechanism correct?
> If so, could anyone explain to me why a site's score is increased in any case?
> I would expect it to only change if either its content has changed, or a new
> inlink has been discovered.
>
Your observations are correct. We recently ran into this exact issue and
have determined that the OPICScoringFilter is not suitable for crawls where
pages will be re-fetched / re-parsed. The page score is increased again
each time the page is fetched, eventually resulting in a score of Infinity.
The "Online Page Importance Computation" (OPIC) score algorithm is described in
this paper => http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
The purpose of the algorithm is that you do not have to maintain the entire
link graph in memory to compute the score imparted to inlinks and outlinks. The
downside is that you cannot determine whether a page's score has already been
included in the outlinks to another page. Hence the infinite score growth you
have observed.
This behavior only appears if you are re-fetching / re-parsing pages.
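
To make the effect concrete, here is a minimal sketch that mirrors the
arithmetic of the updateDbScore method you quoted, using plain floats in place
of CrawlDatum. The class OpicGrowthDemo and the helpers updateScore and
scoreAfter are hypothetical names made up for this illustration; they are not
part of Nutch. The sketch also assumes the inlink contributions stay fixed
between rounds, which is a simplification.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo; not Nutch code.
class OpicGrowthDemo {

    /** Same arithmetic as the quoted updateDbScore: old score plus the sum of inlink scores. */
    static float updateScore(float oldScore, List<Float> inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /** Applies the update once per simulated re-fetch, with unchanged inlinks (an assumption). */
    static float scoreAfter(int rounds, float initialScore, List<Float> inlinkScores) {
        float score = initialScore;
        for (int i = 0; i < rounds; i++) {
            score = updateScore(score, inlinkScores);
        }
        return score;
    }

    public static void main(String[] args) {
        // Two inlinks that never change between rounds.
        List<Float> inlinks = Arrays.asList(0.5f, 0.25f);
        for (int round = 1; round <= 4; round++) {
            System.out.println("round " + round + ": " + scoreAfter(round, 1.0f, inlinks));
        }
        // Prints 1.75, 2.5, 3.25, 4.0: the same inlinks are counted again every round.
    }
}
```

Nothing about the page or its inlinks changes between rounds, yet the score
rises by the same 0.75 each time, because the update has no record of which
contributions it has already added.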
Blessings,
TwP