Aha! I was wrong when I said I was using all default settings. I forgot I had
followed a tutorial that told me to put |scoring-depth| instead of
|scoring-opic| into the plugin.includes property. Now I get a variety of scores.
Anyway, what is the general advice on which scoring method to use? Is there any
recommended reading? I am planning to crawl broadly across the www for data
mining (not necessarily search) covering millions of sites.
From: lewis john mcgibbney <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, November 15, 2016 12:09 AM
Subject: Re: How can I Score?
Hi Michael,
Replies inline
On Sat, Nov 12, 2016 at 7:10 PM, <[email protected]> wrote:
> From: Michael Coffey <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.
Yes, -topN is the cap on how many of the top-scoring URLs are generated
into a new fetch list and subsequently fetched. When you use the crawl
script, -topN is calculated as follows:
$numSlaves * 50000
By default, the script assumes that you are running on one machine (local
mode), so the numSlaves variable is set to 1.
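As a rough sketch of that arithmetic in shell: the variable name
sizeFetchlist and the echoed command line are illustrative assumptions, not
copied from the crawl script, so check your own copy of bin/crawl for the
exact names.

```shell
#!/bin/sh
# Number of worker nodes; 1 corresponds to local mode on a single machine.
numSlaves=1

# Fetch list size handed to the generator as -topN.
# (sizeFetchlist is an assumed name used here for illustration.)
sizeFetchlist=$((numSlaves * 50000))

echo "generate -topN $sizeFetchlist"
```

With numSlaves left at 1, each generate round selects at most 50000 URLs.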
> In my case, all the urls in my db have a score of zero, except the ones
> injected.
>
This is a bit strange. I would not expect them to be exactly zero...
are you sure that the scores are not marginally above zero? Which scoring
plugin/mechanism are you currently using?
> How can I cause scores to be computed and stored?
Scores for each and every CrawlDatum are computed automatically
out-of-the-box.
> I am using the standard crawl script.
OK
> Do I need to enable the various webgraph lines in the script?
>
>
Not unless you wish to use the WebGraph scoring implementation...
Lewis
--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney