Help on this would be greatly appreciated! I am trying to modify Nutch in such a way that recrawling becomes more incremental. This requires a more iterative algorithm like OPIC, instead of creating an entire WebGraph.
Thanks
David

Begin forwarded message:

> From: David Saile <[email protected]>
> Date: 4 February 2011 16:03:41 CET
> To: [email protected]
> Subject: Re: ScoringFilter always increasing a fetched site's score
>
> Thanks for pointing me to that information.
>
> However, the OPIC algorithm seems more suitable for my needs, as it creates
> scores without the need to compute an entire WebGraph.
>
> I think I still don't understand the nature of the problem with the
> OPIC algorithm. It seems to me that the problem Tim described, of scores
> growing toward infinity, is avoided in the OPIC algorithm for dynamic graphs,
> where the score is reset after a certain time window.
>
> Inspecting the Nutch code, I could not find a mechanism to start a new
> time window. Was Nutch using the algorithm for static graphs prior to
> Dennis' new scoring tools?
>
> Thanks for all your help!
> David
>
> On 03.02.2011 at 14:10, Julien Nioche wrote:
>
>> Dennis' new scoring tools have been designed to replace the OPIC
>> implementation. See http://wiki.apache.org/nutch/NewScoring and
>> http://wiki.apache.org/nutch/NewScoringIndexingExample
>>
>> HTH
>>
>> Julien
>>
>> On 3 February 2011 12:40, David Saile <[email protected]> wrote:
>>
>>> On 02.02.2011 at 17:04, Tim Pease wrote:
>>>
>>>> On Feb 2, 2011, at 5:18 AM, David Saile wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a question concerning updating a site's score in Nutch 1.2.
>>>>>
>>>>> In the reduce method of org.apache.nutch.crawl.CrawlDbReducer I found a call to
>>>>>
>>>>>   scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);
>>>>>
>>>>> During debugging, I discovered that this method is executed in the
>>>>> org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
>>>>> method is the following:
>>>>>
>>>>>   /** Increase the score by a sum of inlinked scores. */
>>>>>   public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
>>>>>       List inlinked) throws ScoringFilterException {
>>>>>     float adjust = 0.0f;
>>>>>     for (int i = 0; i < inlinked.size(); i++) {
>>>>>       CrawlDatum linked = (CrawlDatum)inlinked.get(i);
>>>>>       adjust += linked.getScore();
>>>>>     }
>>>>>     if (old == null) old = datum;
>>>>>     datum.setScore(old.getScore() + adjust);
>>>>>   }
>>>>>
>>>>> To my understanding, this code increases a site's score based on its
>>>>> inlinks every time the site is crawled. So even if the site has not been
>>>>> modified and no new inlink was discovered, the site's score will increase.
>>>>>
>>>>> Is my understanding of this mechanism correct?
>>>>> If so, could anyone explain to me why a site's score is increased in any
>>>>> case? I would expect it to change only if either its content has changed
>>>>> or a new inlink has been discovered.
>>>>
>>>> Your observations are correct. We recently ran into this exact same issue
>>>> and have determined that the OPICScoringFilter is not suitable for crawls
>>>> where pages will be re-fetched / re-parsed. The page score will continually
>>>> be increased each time it is fetched, eventually resulting in a score of
>>>> Infinity.
>>>>
>>>> The "Online Page Importance Computation" (OPIC) score algorithm is
>>>> described in this paper =>
>>>> http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
>>>>
>>>> The purpose of the algorithm is that you do not have to maintain the
>>>> entire link graph in memory to compute the score imparted to inlinks and
>>>> outlinks. The downside is that you cannot determine whether a page's score
>>>> has already been included in the outlinks to another page. Hence the
>>>> infinite score growth you have observed.
>>>>
>>>> This behavior only appears if you are re-fetching / re-parsing pages.
>>>>
>>>> Blessings,
>>>> TwP
>>>
>>> Thank you very much for your reply, Tim!
>>>
>>> Is it correct to assume that you could make the OPIC score algorithm more
>>> precise by only updating the score in two cases:
>>>
>>> 1) If a site has a modified outlink (i.e. the outlink was added or
>>> deleted since the last fetch), update the score of the target site of this
>>> outlink.
>>>
>>> 2) If a site's score has changed since the last fetch, you have to
>>> update the score of all targets of outlinks on this site.
>>>
>>> (given the case you actually had the required information at hand)?
>>>
>>> Cheers
>>> David
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
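
To make the behaviour discussed in this thread concrete, here is a minimal, self-contained sketch. It does not use any Nutch classes. `cumulativeUpdate` mirrors the arithmetic of the quoted `updateDbScore`, while `DeltaScorer` is a purely hypothetical illustration of the "only update on change" idea from the thread. It assumes the previously applied contribution of every inlink is stored and that each update sees the complete current set of inlinks, which is information Nutch's OPIC filter does not keep. All class, method, and key names are made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OpicRecrawlSketch {

    /** Same arithmetic as the quoted updateDbScore: old score plus the sum of all inlink scores. */
    static float cumulativeUpdate(float oldScore, List<Float> inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /**
     * Hypothetical delta-based variant (an assumption, not Nutch code): remembers what
     * each inlink source contributed last time and applies only the difference.
     */
    static final class DeltaScorer {
        private final Map<String, Float> applied = new HashMap<>();
        private float score;

        DeltaScorer(float initialScore) {
            this.score = initialScore;
        }

        /** The map is assumed to hold ALL current inlinks, keyed by source URL. */
        float update(Map<String, Float> inlinkContributions) {
            // New or changed inlinks: add only the change since the last update.
            for (Map.Entry<String, Float> e : inlinkContributions.entrySet()) {
                float previous = applied.getOrDefault(e.getKey(), 0.0f);
                score += e.getValue() - previous;
            }
            // Inlinks that disappeared: withdraw their earlier contribution.
            for (Map.Entry<String, Float> e : applied.entrySet()) {
                if (!inlinkContributions.containsKey(e.getKey())) {
                    score -= e.getValue();
                }
            }
            applied.clear();
            applied.putAll(inlinkContributions);
            return score;
        }
    }

    public static void main(String[] args) {
        // Three recrawls with an unchanged set of inlinks.
        float cumulative = 1.0f;
        for (int i = 0; i < 3; i++) {
            cumulative = cumulativeUpdate(cumulative, List.of(0.5f, 0.25f));
        }
        System.out.println("cumulative update after 3 recrawls: " + cumulative);

        DeltaScorer delta = new DeltaScorer(1.0f);
        float stable = 0.0f;
        for (int i = 0; i < 3; i++) {
            stable = delta.update(Map.of("http://a.example/", 0.5f, "http://b.example/", 0.25f));
        }
        System.out.println("delta update after 3 recrawls: " + stable);
    }
}
```

With these sample values the cumulative variant keeps growing on every recrawl (it ends at 3.25 after three passes), while the delta variant stays at 1.75 because nothing changed between passes. The price is the extra per-inlink state, which is exactly what OPIC is designed to avoid.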

