Hi,

Here is an old question, but it has an answer from Markus too :)
http://lucene.472066.n3.nabble.com/PageRank-or-Opic-td4118842.html

Kind Regards,
Furkan KAMACI

On Wed, Nov 16, 2016 at 11:32 AM, Markus Jelsma <[email protected]> wrote:

> WebGraph is superior to opic. It eats resources but if you can spare them,
> use it. Also, if you recrawl already fetched URLs, scores will go wrong
> with opic.
> Markus
>
> -----Original message-----
> > From: Michael Coffey <[email protected]>
> > Sent: Wednesday 16th November 2016 7:15
> > To: [email protected]
> > Subject: Re: How can I Score?
> >
> > Aha! I was wrong when I said I was using all default settings. I forgot
> > I had followed a tutorial that told me to put |scoring-depth| instead of
> > |scoring-opic| into the plugin.includes property. Now I get a variety of
> > scores.
> >
> > Anyway, what is the general advice on which scoring method to use? Is
> > there any recommended reading? I am planning to crawl broadly across the
> > www for data mining (not necessarily search) covering millions of sites.
> >
> >
> > From: lewis john mcgibbney <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Tuesday, November 15, 2016 12:09 AM
> > Subject: Re: How can I Score?
> >
> > Hi Michael,
> > Replies inline
> >
> > On Sat, Nov 12, 2016 at 7:10 PM, <[email protected]> wrote:
> >
> > > From: Michael Coffey <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Cc:
> > > Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> > > Subject: How can I Score?
> > > When the generator is used with -topN, it is supposed to choose the
> > > highest-scoring urls.
> > >
> >
> > Yes this is the threshold of how many top scoring URLs you wish to generate
> > into a new Fetch list and subsequently fetch. When you use the crawl
> > script, the -topN is calculated as follows
> >
> > $numSlaves * 50000
> >
> > By default, we assume that you are running on one machine (local mode)
> > therefore the numSlaves variable is set to 1.
> >
> > > In my case, all the urls in my db have a score of zero, except the ones
> > > injected.
> > >
> >
> > This is a bit strange. I would not expect them to have absolutely zero...
> > are you sure that it is not marginally above zero? Which scoring
> > plugin/mechanism are you currently using?
> >
> > > How can I cause scores to be computed and stored?
> >
> > Scores for each and every CrawlDatum are computed automatically
> > out-of-the-box.
> >
> > > I am using the standard crawl script.
> >
> > OK
> >
> > > Do I need to enable the various webgraph lines in the script?
> >
> > Not unless you wish to use the WebGraph scoring implementation...
> > Lewis
> >
> > --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
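
A side note on the -topN arithmetic quoted above: the following is only a minimal sketch in Python, assuming the crawl script multiplies the number of worker nodes by 50000 exactly as Lewis describes. The function name `top_n` and the `urls_per_node` parameter are hypothetical and exist only to illustrate the calculation; they are not part of Nutch.

```python
def top_n(num_slaves: int = 1, urls_per_node: int = 50000) -> int:
    """Return the fetch-list size from the thread: numSlaves * 50000.

    Both parameter names are illustrative assumptions, not Nutch settings.
    """
    if num_slaves < 1:
        raise ValueError("num_slaves must be at least 1")
    return num_slaves * urls_per_node


if __name__ == "__main__":
    # Local mode, as in the thread: one machine, so numSlaves is 1.
    print(top_n())
    # A hypothetical four-node cluster.
    print(top_n(4))
```

In local mode this gives a fetch list of at most 50000 URLs per generate cycle, which is why having usable scores matters: the generator picks the highest-scoring URLs up to that limit.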

