WebGraph is superior to OPIC. It eats resources, but if you can spare them, use it. Also, if you recrawl already fetched URLs, scores will go wrong with OPIC.

Markus

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Wednesday 16th November 2016 7:15
> To: [email protected]
> Subject: Re: How can I Score?
>
> Aha! I was wrong when I said I was using all default settings. I forgot I had
> followed a tutorial that told me to put |scoring-depth| instead of
> |scoring-opic| into the plugin.includes property. Now I get a variety of
> scores.
> Anyway, what is the general advice on which scoring method to use? Is there
> any recommended reading? I am planning to crawl broadly across the www for
> data mining (not necessarily search) covering millions of sites.
>
>
> From: lewis john mcgibbney <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Tuesday, November 15, 2016 12:09 AM
> Subject: Re: How can I Score?
>
> Hi Michael,
> Replies inline
>
> On Sat, Nov 12, 2016 at 7:10 PM, <[email protected]> wrote:
>
> > From: Michael Coffey <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Cc:
> > Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> > Subject: How can I Score?
> > When the generator is used with -topN, it is supposed to choose the
> > highest-scoring urls.
> >
> Yes this is the threshold of how many top scoring URLs you wish to generate
> into a new Fetch list and subsequently fetch. When you use the crawl
> script, the -topN is calculated as follows
>
> $numSlaves * 50000
>
> By default, we assume that you are running on one machine (local mode)
> therefore the numSlaves variable is set to 1.
>
>
> > In my case, all the urls in my db have a score of zero, except the ones
> > injected.
> >
> This is a bit strange. I would not expect them to have absolutely zero...
> are you sure that it is not marginally above zero? Which scoring
> plugin/mechanism are you currently using?
>
>
> > How can I cause scores to be computed and stored?
> >
> Scores for each and every CrawlDatum are computed automatically
> out-of-the-box.
>
>
> > I am using the standard crawl script.
> >
> OK
>
> > Do I need to enable the various webgraph lines in the script?
> >
>
> Not unless you wish to use the WebGraph scoring implementation...
> Lewis
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
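
For reference, the -topN arithmetic Lewis describes in the quoted thread can be sketched in shell roughly as below. This is only an illustration under stated assumptions: numSlaves=1 is the local-mode default he mentions, while the variable name sizeFetchlist is a label chosen here and may not match what your copy of the crawl script uses.

```shell
#!/bin/sh
# Sketch of the -topN calculation described in the thread.
# numSlaves=1 is the local-mode default mentioned by Lewis.
numSlaves=1

# sizeFetchlist is an illustrative name (assumption), holding the value
# that would be passed to the generator as -topN.
sizeFetchlist=$((numSlaves * 50000))

echo "topN = $sizeFetchlist"
```

On a single machine this works out to 50000 URLs per generate round, so if only a handful of URLs ever show a non-zero score, the scoring plugin is a more likely culprit than the -topN limit.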

