Thanks for pointing me to that information. 

However, the OPIC algorithm seems more suitable for my needs, as it creates
scores without the need to compute an entire WebGraph.

I think I still don't understand the nature of the problem with the
OPIC algorithm. It seems to me that the problem Tim described, scores growing
towards infinity, is avoided in the OPIC algorithm for dynamic graphs, where the
score is reset after a certain time window.
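
To make the difference concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class and method names are made up for illustration, and the inlink contribution is assumed to be the same on every fetch. The first method mirrors the cumulative update from OPICScoringFilter, the second keeps only the contributions from the most recent fetches, which is how I understand the time-window idea.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; none of these names exist in Nutch. */
public class OpicWindowSketch {

    /** Cumulative update: every re-fetch adds the same inlink sum again. */
    static float cumulative(float initial, float inlinkSum, int fetches) {
        float score = initial;
        for (int i = 0; i < fetches; i++) {
            score += inlinkSum; // grows without bound as fetches increase
        }
        return score;
    }

    /** Windowed update (assumption): only the last `window` fetches count. */
    static float windowed(float inlinkSum, int fetches, int window) {
        Deque<Float> recent = new ArrayDeque<>();
        for (int i = 0; i < fetches; i++) {
            recent.addLast(inlinkSum);
            if (recent.size() > window) {
                recent.removeFirst(); // drop contributions older than the window
            }
        }
        float score = 0.0f;
        for (float contribution : recent) {
            score += contribution;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(cumulative(1.0f, 0.5f, 10)); // prints 6.0
        System.out.println(windowed(0.5f, 10, 4));      // prints 2.0
    }
}
```

With the cumulative version the score keeps rising with the number of fetches, while the windowed version stays bounded by the window size times the per-fetch contribution.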

Inspecting the Nutch code, I could not find a mechanism to start a new
time window. Was Nutch using the algorithm for static graphs prior to Dennis'
new scoring tools?

Thanks for all your help!
David



On 03.02.2011 at 14:10, Julien Nioche wrote:

> Dennis' new scoring tools have been designed to replace the OPIC
> implementation. See http://wiki.apache.org/nutch/NewScoring and
> http://wiki.apache.org/nutch/NewScoringIndexingExample
> 
> HTH
> 
> Julien
> 
> 
> On 3 February 2011 12:40, David Saile <[email protected]> wrote:
> 
>> 
>> On 02.02.2011 at 17:04, Tim Pease wrote:
>> 
>>> 
>>> On Feb 2, 2011, at 5:18 AM, David Saile wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I have a question concerning updating a site's score in Nutch 1.2.
>>>> 
>>>> In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a call to
>>>>    scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);
>>>> 
>>>> During debugging, I discovered that this method is executed in the
>>>> org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
>>>> method is the following:
>>>>   /** Increase the score by a sum of inlinked scores. */
>>>>   public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
>>>>       List inlinked) throws ScoringFilterException {
>>>>     float adjust = 0.0f;
>>>>     for (int i = 0; i < inlinked.size(); i++) {
>>>>       CrawlDatum linked = (CrawlDatum)inlinked.get(i);
>>>>       adjust += linked.getScore();
>>>>     }
>>>>     if (old == null) old = datum;
>>>>     datum.setScore(old.getScore() + adjust);
>>>>   }
>>>> 
>>>> To my understanding, this code increases a site's score based on its
>>>> inlinks every time the site is crawled. So even if the site has not been
>>>> modified and no new inlink has been discovered, the site's score will
>>>> increase.
>>>> 
>>>> Is my understanding of this mechanism correct?
>>>> If so, could anyone explain to me why a site's score is increased in any
>>>> case? I would expect it to change only if either its content has changed or
>>>> a new inlink has been discovered.
>>>> 
>>> 
>>> Your observations are correct. We recently ran into this exact same issue
>>> and have determined that the OPICScoringFilter is not suitable for crawls
>>> where pages will be re-fetched / re-parsed. The page score will continually
>>> be increased each time it is fetched, eventually resulting in a score of
>>> Infinity.
>>> 
>>> The "Online Page Importance Computation" (OPIC) score algorithm is
>>> described in this paper:
>>> http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
>>> 
>>> The purpose of the algorithm is that you do not have to maintain the
>>> entire link graph in memory to compute the score imparted to inlinks and
>>> outlinks. The downside is that you cannot determine whether a page's score
>>> has already been included in the outlinks to another page. Hence the
>>> infinite score growth you have observed.
>>> 
>>> This behavior only appears if you are re-fetching / re-parsing pages.
>>> 
>>> Blessings,
>>> TwP
>> 
>> Thank you very much for your reply, Tim!
>> 
>> Is it correct to assume that you could make the OPIC score algorithm more
>> precise by only updating the score in two cases:
>> 
>>      1) If a site has a modified outlink (i.e. the outlink was added or
>> deleted since the last fetch), update the score of the target-site of this
>> outlink.
>> 
>>      2) If a site's score has changed since the last fetch, you have to
>> update the score of all targets of outlinks on this site.
>> 
>> (given the case you actually had the required information at hand)?
>> 
>> Cheers
>> David
> 
> 
> 
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
