Dennis' new scoring tools have been designed to replace the OPIC
implementation. See http://wiki.apache.org/nutch/NewScoring and
http://wiki.apache.org/nutch/NewScoringIndexingExample

HTH

Julien


On 3 February 2011 12:40, David Saile <[email protected]> wrote:

>
> On 02.02.2011 at 17:04, Tim Pease wrote:
>
> >
> > On Feb 2, 2011, at 5:18 AM, David Saile wrote:
> >
> >> Hi all,
> >>
> >> I have a question concerning updating a site's score in Nutch 1.2.
> >>
> >> In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a call to
> >>      scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);
> >>
> >> During debugging, I discovered that this method is executed in the
> >> org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
> >> method is the following:
> >>      /** Increase the score by a sum of inlinked scores. */
> >> public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked) throws ScoringFilterException {
> >>  float adjust = 0.0f;
> >>  for (int i = 0; i < inlinked.size(); i++) {
> >>    CrawlDatum linked = (CrawlDatum)inlinked.get(i);
> >>    adjust += linked.getScore();
> >>  }
> >>  if (old == null) old = datum;
> >>  datum.setScore(old.getScore() + adjust);
> >> }
> >>
> >> To my understanding, this code increases a site's score based on its
> >> inlinks every time the site is crawled. So even if the site has not been
> >> modified and no new inlink was discovered, the site's score will increase.
> >>
> >> Is my understanding of this mechanism correct?
> >> If so, could anyone explain to me why a site's score is increased in any
> >> case? I would expect it to change only if either its content has changed or
> >> a new inlink has been discovered.
> >>
> >
> > Your observations are correct. We recently ran into this exact same issue
> > and have determined that the OPICScoringFilter is not suitable for crawls
> > where pages will be re-fetched / re-parsed. The page score is increased
> > each time the page is fetched, eventually resulting in a score of Infinity.
> >
> > The "Online Page Importance Computation" (OPIC) score algorithm is
> > described in this paper:
> > http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
> >
> > The purpose of the algorithm is that you do not have to maintain the
> > entire link graph in memory to compute the score imparted to inlinks and
> > outlinks. The downside is that you cannot determine whether a page's score
> > has already been included in the outlinks to another page. Hence the
> > infinite score growth you have observed.
> >
> > This behavior only appears if you are re-fetching / re-parsing pages.
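
The growth described above can be reproduced outside Nutch with a minimal sketch. It mirrors only the arithmetic of the quoted updateDbScore (old score plus the sum of inlink scores). The class OpicGrowthSketch and its methods are hypothetical names for illustration, and the inlink contributions are assumed to stay fixed between updates, which is a simplification.

```java
import java.util.Arrays;
import java.util.List;

class OpicGrowthSketch {
    /** Same arithmetic as the quoted method: old score plus the sum of inlink scores. */
    static float updateScore(float oldScore, List<Float> inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /** Apply the update once per simulated re-fetch, with unchanged inlinks (assumption). */
    static float simulate(float initial, List<Float> inlinkScores, int rounds) {
        float score = initial;
        for (int i = 0; i < rounds; i++) {
            score = updateScore(score, inlinkScores);
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical inlink contributions; nothing about the page or its links changes.
        List<Float> inlinks = Arrays.asList(0.5f, 0.25f);
        for (int rounds = 0; rounds <= 3; rounds++) {
            System.out.println(rounds + " updates -> " + simulate(1.0f, inlinks, rounds));
        }
    }
}
```

With these values the score moves from 1.0 to 1.75, 2.5 and 3.25 even though no link was added. In a real crawl the inlink contributions are themselves derived from scores that keep being inflated, so the growth compounds rather than staying linear.
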
> >
> > Blessings,
> > TwP
>
> Thank you very much for your reply, Tim!
>
> Is it correct to assume that you could make the OPIC score algorithm more
> precise by only updating the score in two cases:
>
>        1) If a site has a modified outlink (i.e. the outlink was added or
> deleted since the last fetch), update the score of the target site of this
> outlink.
>
>        2) If a site's score has changed since the last fetch, update the
> score of all targets of outlinks on this site.
>
> (given the case you actually had the required information at hand)?
>
> Cheers
> David
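
One way to read the two cases above is as a delta update: remember what each link last contributed and apply only the difference. The sketch below is not part of Nutch. The class DeltaScoreSketch, its methods, and the per-link bookkeeping map are all hypothetical, and it assumes the extra state (the last contribution per link) is actually available, which is the caveat raised in the question.

```java
import java.util.HashMap;
import java.util.Map;

class DeltaScoreSketch {
    // Last contribution recorded per "source -> target" link (hypothetical bookkeeping).
    private final Map<String, Float> lastContribution = new HashMap<>();
    // Current score per target URL.
    private final Map<String, Float> score = new HashMap<>();

    float getScore(String url) {
        return score.getOrDefault(url, 0.0f);
    }

    /** Record what source currently passes to target and apply only the change. */
    void applyLink(String source, String target, float contribution) {
        String key = source + " -> " + target;
        float previous = lastContribution.getOrDefault(key, 0.0f);
        float delta = contribution - previous;
        if (delta != 0.0f) {
            score.merge(target, delta, Float::sum);
            lastContribution.put(key, contribution);
        }
    }

    /** Case 1, deleted outlink: withdraw whatever the link contributed before. */
    void removeLink(String source, String target) {
        applyLink(source, target, 0.0f);
    }

    public static void main(String[] args) {
        DeltaScoreSketch db = new DeltaScoreSketch();
        db.applyLink("a", "t", 0.5f);   // case 1: new outlink a -> t
        db.applyLink("a", "t", 0.5f);   // re-fetch with no change: no effect
        System.out.println(db.getScore("t"));
        db.applyLink("a", "t", 0.75f);  // case 2: a's score changed, so its contribution changed
        System.out.println(db.getScore("t"));
    }
}
```

The cost is the extra map keyed by link, which is the kind of per-link state OPIC is meant to avoid keeping.
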




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
