Thanks for pointing me to that information. However, the OPIC algorithm seems more suitable for my needs, as it computes scores without having to build an entire WebGraph.
I think I still don't understand the nature of the problem with the OPIC algorithm. It seems to me that the problem Tim described, of scores growing toward infinity, is avoided in the OPIC algorithm for dynamic graphs, where the score is reset after a certain time window. Inspecting the Nutch code, I could not find any mechanism to start a new time window.

Was Nutch using the algorithm for static graphs prior to Dennis' new scoring tools?

Thanks for all your help!
David

On 3 February 2011 at 14:10, Julien Nioche wrote:

> Dennis' new scoring tools have been designed to replace the OPIC
> implementation. See http://wiki.apache.org/nutch/NewScoring and
> http://wiki.apache.org/nutch/NewScoringIndexingExample
>
> HTH
>
> Julien
>
>
> On 3 February 2011 12:40, David Saile <[email protected]> wrote:
>
>> On 2 February 2011 at 17:04, Tim Pease wrote:
>>
>>> On Feb 2, 2011, at 5:18 AM, David Saile wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a question concerning updating a site's score in Nutch 1.2.
>>>>
>>>> In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a call to
>>>> scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);
>>>>
>>>> During debugging, I discovered that this method is executed in the
>>>> org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
>>>> method is the following:
>>>> /** Increase the score by a sum of inlinked scores. */
>>>> public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
>>>>     List inlinked) throws ScoringFilterException {
>>>>   float adjust = 0.0f;
>>>>   for (int i = 0; i < inlinked.size(); i++) {
>>>>     CrawlDatum linked = (CrawlDatum)inlinked.get(i);
>>>>     adjust += linked.getScore();
>>>>   }
>>>>   if (old == null) old = datum;
>>>>   datum.setScore(old.getScore() + adjust);
>>>> }
>>>>
>>>> To my understanding, this code would increase a site's score based on
>>>> its inlinks every time the site is crawled. So even if the site has
>>>> neither been modified nor gained a new inlink, its score will increase.
>>>>
>>>> Is my understanding of this mechanism correct?
>>>> If so, could anyone explain to me why a site's score is increased in any
>>>> case? I would expect it to only change if either its content has changed or
>>>> a new inlink has been discovered.
>>>>
>>>
>>> Your observations are correct. We recently ran into this exact same issue
>>> and have determined that the OPICScoringFilter is not suitable for crawls
>>> where pages will be re-fetched / re-parsed. The page score will continually
>>> be increased each time it is fetched, eventually resulting in a score of
>>> Infinity.
>>>
>>> The "Online Page Importance Computation" (OPIC) score algorithm is
>>> described in this paper =>
>>> http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
>>>
>>> The purpose of the algorithm is that you do not have to maintain the
>>> entire link graph in memory to compute the score imparted to inlinks and
>>> outlinks. The downside is that you cannot determine if a page's score has
>>> already been included in the outlinks to another page. Hence the infinite
>>> score growth you have observed.
>>>
>>> This behavior only appears if you are re-fetching / re-parsing pages.
>>>
>>> Blessings,
>>> TwP
>>
>> Thank you very much for your reply, Tim!
>>
>> Is it correct to assume that you could make the OPIC score algorithm more
>> precise by only updating the score in two cases:
>>
>> 1) If a site has a modified outlink (i.e. the outlink was added or
>> deleted since the last fetch), update the score of the target site of this
>> outlink.
>>
>> 2) If a site's score has changed since the last fetch, update the
>> score of all targets of outlinks on this site.
>>
>> (given that you actually had the required information at hand)?
>>
>> Cheers
>> David
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
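
For anyone reading this thread later, here is a minimal sketch of the growth discussed above. It is not Nutch code: the class and method names (OpicGrowthDemo, updateScore, scoreAfterCycles) are made up for illustration, and plain floats stand in for CrawlDatum scores. It only assumes that the same inlink contributions are summed into the stored score again on every re-fetch cycle, as in the quoted updateDbScore method.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone illustration; not part of Nutch. */
public class OpicGrowthDemo {

    /** Same arithmetic as the quoted method: old score plus the sum of inlink scores. */
    public static float updateScore(float oldScore, List<Float> inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /** Applies the update once per simulated re-fetch, with an unchanged set of inlinks. */
    public static float scoreAfterCycles(float initialScore, List<Float> inlinkScores, int cycles) {
        float score = initialScore;
        for (int i = 0; i < cycles; i++) {
            score = updateScore(score, inlinkScores);
        }
        return score;
    }

    public static void main(String[] args) {
        // Assumed example values: one page with initial score 1.0 and two static inlinks.
        List<Float> inlinks = Arrays.asList(0.5f, 0.25f);
        for (int cycles = 0; cycles <= 3; cycles++) {
            System.out.println(cycles + " cycles: " + scoreAfterCycles(1.0f, inlinks, cycles));
        }
    }
}
```

With these assumed values the score rises by 0.75 on every cycle (1.0, 1.75, 2.5, 3.25) although neither the page nor its inlinks changed, which is the behavior described in the thread.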

