Help on this would be greatly appreciated!

I am trying to modify Nutch so that recrawling becomes more incremental. 
This requires a more iterative algorithm like OPIC, instead of 
building an entire WebGraph.

Thanks
David

Begin forwarded message:

> From: David Saile <[email protected]>
> Date: 4 February 2011 16:03:41 CET
> To: [email protected]
> Subject: Re: ScoringFilter always increasing a fetched site's score
> 
> Thanks for pointing me to that information. 
> 
> However, the OPIC algorithm seems more suitable for my needs, as it computes 
> scores without the need to build an entire WebGraph.
> 
> I think I still don't understand the nature of the problem with the 
> OPIC algorithm. It seems to me that the problem Tim described, of scores 
> growing toward infinity, is avoided in the OPIC algorithm for dynamic graphs, 
> where the score is reset after a certain time window. 
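
To make the time-window idea concrete, here is a minimal sketch of what a reset could look like. It is not Nutch code and not the estimator from the OPIC paper: the class WindowedScores, its methods, and the hard reset at the window boundary are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: scores accumulate only within the current time window. */
class WindowedScores {
    private final long windowMillis;
    private long windowStart;
    private final Map<String, Float> received = new HashMap<>();

    WindowedScores(long windowMillis, long now) {
        this.windowMillis = windowMillis;
        this.windowStart = now;
    }

    /** Records an inlink contribution, starting a new window first if the old one expired. */
    void add(String url, float contribution, long now) {
        if (now - windowStart >= windowMillis) {
            received.clear();   // forget everything from the previous window
            windowStart = now;
        }
        received.merge(url, contribution, Float::sum);
    }

    /** Score of a page within the current window; 0 if nothing was received. */
    float score(String url) {
        return received.getOrDefault(url, 0.0f);
    }
}
```

With a window of 1000 ms, two contributions of 0.5 at t=0 and t=500 add up to 1.0, while a contribution arriving at t=1000 starts from zero again, so a score cannot grow across windows.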
> 
> Inspecting the Nutch code, I could not find any mechanism for starting a new 
> time window. Was Nutch using the algorithm for static graphs prior to 
> Dennis' new scoring tools?  
> 
> Thanks for all your help!
> David
> 
> 
> 
> On 03.02.2011 at 14:10, Julien Nioche wrote:
> 
>> Dennis' new scoring tools have been designed to replace the OPIC
>> implementation. See http://wiki.apache.org/nutch/NewScoring and
>> http://wiki.apache.org/nutch/NewScoringIndexingExample
>> 
>> HTH
>> 
>> Julien
>> 
>> 
>> On 3 February 2011 12:40, David Saile <[email protected]> wrote:
>> 
>>> 
>>> On 02.02.2011 at 17:04, Tim Pease wrote:
>>> 
>>>> 
>>>> On Feb 2, 2011, at 5:18 AM, David Saile wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I have a question concerning updating a site's score in Nutch 1.2.
>>>>> 
>>>>> In the reduce method of org.apache.nutch.crawl.CrawlDbReducer I found a call to
>>>>>   scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);
>>>>> 
>>>>> During debugging, I discovered that this method is executed in the
>>>>> org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
>>>>> method is the following:
>>>>>   /** Increase the score by a sum of inlinked scores. */
>>>>>   public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
>>>>>       List inlinked) throws ScoringFilterException {
>>>>>     float adjust = 0.0f;
>>>>>     for (int i = 0; i < inlinked.size(); i++) {
>>>>>       CrawlDatum linked = (CrawlDatum)inlinked.get(i);
>>>>>       adjust += linked.getScore();
>>>>>     }
>>>>>     if (old == null) old = datum;
>>>>>     datum.setScore(old.getScore() + adjust);
>>>>>   }
>>>>> 
>>>>> To my understanding, this code increases a site's score based on its
>>>>> inlinks every time the site is crawled. So even if the site has not been
>>>>> modified and no new inlink has been discovered, the site's score will
>>>>> increase.
>>>>> 
>>>>> Is my understanding of this mechanism correct?
>>>>> If so, could anyone explain to me why a site's score is increased in every
>>>>> case? I would expect it to change only if either its content has changed or
>>>>> a new inlink has been discovered.
>>>>> 
>>>> 
>>>> Your observations are correct. We recently ran into this exact same issue
>>>> and have determined that the OPICScoringFilter is not suitable for crawls
>>>> where pages will be re-fetched / re-parsed. The page score is increased
>>>> each time the page is fetched, eventually resulting in a score of
>>>> Infinity.
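
The growth can be reproduced outside Nutch with a small simulation of the additive update quoted above. This is only a sketch: OpicGrowthDemo and its methods are hypothetical names, plain floats stand in for CrawlDatum, and the inlink scores are held constant between rounds (in a real crawl they would grow as well, which makes matters worse).

```java
import java.util.List;

/** Hypothetical stand-in for the additive update in the quoted updateDbScore. */
class OpicGrowthDemo {

    /** Mirrors the quoted logic: new score = old score + sum of inlink scores. */
    static float updateDbScore(float oldScore, List<Float> inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /** Applies the same, unchanged inlinks once per recrawl round. */
    static float scoreAfterRounds(float initial, List<Float> inlinkScores, int rounds) {
        float score = initial;
        for (int i = 0; i < rounds; i++) {
            score = updateDbScore(score, inlinkScores);
        }
        return score;
    }

    public static void main(String[] args) {
        List<Float> inlinks = List.of(0.5f, 0.25f); // the same two inlinks every round
        for (int rounds = 0; rounds <= 3; rounds++) {
            System.out.println(rounds + " rounds -> " + scoreAfterRounds(1.0f, inlinks, rounds));
        }
    }
}
```

Starting from 1.0 with two unchanged inlinks worth 0.5 and 0.25, the score is 1.75 after one round and 3.25 after three, although nothing about the page or its inlinks has changed.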
>>>> 
>>>> The "Online Page Importance Computation" (OPIC) score algorithm is
>>>> described in this paper:
>>>> http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
>>>> 
>>>> The purpose of the algorithm is that you do not have to maintain the
>>>> entire link graph in memory to compute the score imparted to inlinks and
>>>> outlinks. The downside is that you cannot determine whether a page's score has
>>>> already been included in the outlinks to another page. Hence the infinite
>>>> score growth you have observed.
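
For comparison, the basic step from the paper, as I read it, hands out a page's "cash" to its outlinks and then resets that cash to zero, so it only ever needs the outlinks of the page currently being fetched. The sketch below is an illustration under that reading, not Nutch's implementation: OpicStep, its fields, and the choice to skip pages without outlinks are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of one OPIC-style step: distribute a page's cash, then reset it. */
class OpicStep {
    final Map<String, Double> cash = new HashMap<>();
    final Map<String, Double> history = new HashMap<>();

    /** Processes one fetched page using only that page's own outlinks. */
    void fetch(String url, List<String> outlinks) {
        double c = cash.getOrDefault(url, 0.0);
        if (outlinks.isEmpty() || c == 0.0) {
            return; // nothing to hand out; pages without outlinks are skipped in this sketch
        }
        cash.put(url, 0.0);                  // the page gives away all of its cash
        history.merge(url, c, Double::sum);  // and remembers how much it has handed out
        double share = c / outlinks.size();
        for (String target : outlinks) {
            cash.merge(target, share, Double::sum);
        }
    }

    /** Sum of all cash currently held; distribution moves cash around but does not create it. */
    double totalCash() {
        double sum = 0.0;
        for (double c : cash.values()) {
            sum += c;
        }
        return sum;
    }
}
```

Because the cash is zeroed once it has been distributed, fetching the same page a second time without new incoming cash changes nothing in this sketch, which is the property the additive update lacks.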
>>>> 
>>>> This behavior only appears if you are re-fetching / re-parsing pages.
>>>> 
>>>> Blessings,
>>>> TwP
>>> 
>>> Thank you very much for your reply, Tim!
>>> 
>>> Is it correct to assume that you could make the OPIC score algorithm more
>>> precise by only updating the score in two cases:
>>> 
>>>     1) If a site has a modified outlink (i.e. the outlink was added or
>>> deleted since the last fetch), update the score of the target site of this
>>> outlink.
>>> 
>>>     2) If a site's score has changed since the last fetch, you have to
>>> update the score of all targets of outlinks on this site.
>>> 
>>> (given that you actually had the required information at hand)?
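
The two cases above can be sketched as a delta update: remember what each source last contributed to each target, and on re-fetch apply only the difference. This is an illustration, not an existing Nutch API. DeltaScoring, refetch, and the equal split of the source score across outlinks are assumptions, and the per-link bookkeeping is exactly the "required information" that plain OPIC does not keep.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: apply only the change in each source's contribution. */
class DeltaScoring {
    // source URL -> (target URL -> contribution applied at the last fetch)
    private final Map<String, Map<String, Float>> lastContribution = new HashMap<>();
    private final Map<String, Float> scores = new HashMap<>();

    float score(String url) {
        return scores.getOrDefault(url, 0.0f);
    }

    /** Re-fetches a source page and updates its targets by the difference only. */
    void refetch(String source, float sourceScore, Collection<String> outlinks) {
        Set<String> current = new HashSet<>(outlinks);
        float share = current.isEmpty() ? 0.0f : sourceScore / current.size();

        Map<String, Float> previous = lastContribution.get(source);
        if (previous == null) {
            previous = new HashMap<>();
        }

        // Targets linked now or at the last fetch; removed links get a negative delta.
        Set<String> affected = new HashSet<>(previous.keySet());
        affected.addAll(current);

        Map<String, Float> updated = new HashMap<>();
        for (String target : affected) {
            float now = current.contains(target) ? share : 0.0f;
            float before = previous.getOrDefault(target, 0.0f);
            if (now != before) {
                scores.merge(target, now - before, Float::sum);
            }
            if (current.contains(target)) {
                updated.put(target, now);
            }
        }
        lastContribution.put(source, updated);
    }
}
```

Re-fetching an unchanged page then produces a delta of zero for every target, so scores stay put; a removed outlink takes its contribution back, and a changed source score is propagated to all current targets.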
>>> 
>>> Cheers
>>> David
>> 
>> 
>> 
>> 
>> -- 
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
> 
