Dennis' new scoring tools have been designed to replace the OPIC implementation. See http://wiki.apache.org/nutch/NewScoring and http://wiki.apache.org/nutch/NewScoringIndexingExample
HTH

Julien

On 3 February 2011 12:40, David Saile <[email protected]> wrote:

> On 02.02.2011 at 17:04, Tim Pease wrote:
>
> > On Feb 2, 2011, at 5:18 AM, David Saile wrote:
> >
> >> Hi all,
> >>
> >> I have a question concerning updating a site's score in Nutch 1.2.
> >>
> >> In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a
> >> call to
> >>     scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);
> >>
> >> During debugging, I discovered that this method is executed in the
> >> org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for
> >> this method is the following:
> >>     /** Increase the score by a sum of inlinked scores. */
> >>     public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
> >>         List inlinked) throws ScoringFilterException {
> >>       float adjust = 0.0f;
> >>       for (int i = 0; i < inlinked.size(); i++) {
> >>         CrawlDatum linked = (CrawlDatum)inlinked.get(i);
> >>         adjust += linked.getScore();
> >>       }
> >>       if (old == null) old = datum;
> >>       datum.setScore(old.getScore() + adjust);
> >>     }
> >>
> >> To my understanding, this code increases a site's score based on its
> >> inlinks every time the site is crawled. So even if the site has not
> >> been modified and no new inlink was discovered, the site's score
> >> will increase.
> >>
> >> Is my understanding of this mechanism correct?
> >> If so, could anyone explain to me why a site's score is increased in
> >> any case? I would expect it to change only if either its content has
> >> changed or a new inlink has been discovered.
> >
> > Your observations are correct. We recently ran into this exact same
> > issue and have determined that the OPICScoringFilter is not suitable
> > for crawls where pages will be re-fetched / re-parsed. The page score
> > is increased each time the page is fetched, eventually resulting in a
> > score of Infinity.
> >
> > The "Online Page Importance Computation" (OPIC) score algorithm is
> > described in this paper =>
> > http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
> >
> > The purpose of the algorithm is that you do not have to maintain the
> > entire link graph in memory to compute the score imparted to inlinks
> > and outlinks. The downside is that you cannot determine if a page's
> > score has already been included in the outlinks to another page.
> > Hence the infinite score growth you have observed.
> >
> > This behavior only appears if you are re-fetching / re-parsing pages.
> >
> > Blessings,
> > TwP
>
> Thank you very much for your reply, Tim!
>
> Is it correct to assume that you could make the OPIC score algorithm
> more precise by only updating the score in two cases:
>
> 1) If a site has a modified outlink (i.e. the outlink was added or
>    deleted since the last fetch), update the score of the target site
>    of this outlink.
>
> 2) If a site's score has changed since the last fetch, update the
>    score of all targets of outlinks on this site.
>
> (given the case you actually had the required information at hand)?
>
> Cheers
> David

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
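
To make the behaviour discussed in the thread concrete, here is a minimal, self-contained Java sketch. It is a toy model, not Nutch code: the class OpicGrowthSketch, the methods naiveUpdate and deltaUpdate, and the example URLs are all hypothetical names introduced for illustration. naiveUpdate mirrors the arithmetic of the quoted updateDbScore method; deltaUpdate is one possible reading of David's proposal, assuming the contribution last applied per inlink source could be stored somewhere, which the thread itself notes OPIC does not do.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the score update discussed above; not part of Nutch. */
class OpicGrowthSketch {

    /** Same arithmetic as the quoted updateDbScore: old score plus all inlink scores. */
    static float naiveUpdate(float oldScore, List<Float> inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /**
     * Hypothetical delta-based update. 'applied' holds the contribution that was
     * last added per source URL and is modified in place; 'current' holds the
     * contributions seen in this cycle. The two maps must be distinct objects.
     */
    static float deltaUpdate(float oldScore, Map<String, Float> applied,
                             Map<String, Float> current) {
        float adjust = 0.0f;
        // New or changed inlinks: add only the change in contribution.
        for (Map.Entry<String, Float> e : current.entrySet()) {
            float before = applied.getOrDefault(e.getKey(), 0.0f);
            adjust += e.getValue() - before;
        }
        // Inlinks that disappeared: take back what was added earlier.
        for (Map.Entry<String, Float> e : applied.entrySet()) {
            if (!current.containsKey(e.getKey())) {
                adjust -= e.getValue();
            }
        }
        // Remember what has now been accounted for.
        applied.clear();
        applied.putAll(current);
        return oldScore + adjust;
    }

    public static void main(String[] args) {
        // Three update cycles with an unchanged set of inlinks.
        List<Float> inlinks = List.of(0.5f, 0.25f);
        float naive = 1.0f;
        for (int i = 0; i < 3; i++) {
            naive = naiveUpdate(naive, inlinks);
        }
        System.out.println("naive after 3 cycles: " + naive);   // prints 3.25

        Map<String, Float> applied = new HashMap<>();
        Map<String, Float> current =
            Map.of("http://a.example/", 0.5f, "http://b.example/", 0.25f);
        float delta = 1.0f;
        for (int i = 0; i < 3; i++) {
            delta = deltaUpdate(delta, applied, current);
        }
        System.out.println("delta after 3 cycles: " + delta);   // prints 1.75
    }
}
```

With fixed inlink scores the naive update grows by the same amount on every cycle, while the delta variant stops changing once the inlinks have been counted. In a real crawl the inlink contributions are themselves derived from scores that keep growing, so the naive growth would likely be faster than in this toy example. The delta variant trades that for extra per-link state, which is the bookkeeping OPIC was designed to avoid.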

