Hi,

This is an old question, but here is an earlier answer from Markus as well :)
http://lucene.472066.n3.nabble.com/PageRank-or-Opic-td4118842.html

Kind Regards,
Furkan KAMACI

On Wed, Nov 16, 2016 at 11:32 AM, Markus Jelsma <[email protected]>
wrote:

> WebGraph is superior to opic. It eats resources, but if you can spare them,
> use it. Also, if you recrawl already fetched URLs, scores will go wrong
> with opic.
> Markus
>
>
>
> -----Original message-----
> > From:Michael Coffey <[email protected]>
> > Sent: Wednesday 16th November 2016 7:15
> > To: [email protected]
> > Subject: Re: How can I Score?
> >
> > Aha! I was wrong when I said I was using all default settings. I forgot
> > I had followed a tutorial that told me to put |scoring-depth| instead of
> > |scoring-opic| into the plugin.includes property. Now I get a variety of
> > scores.
> > Anyway, what is the general advice on which scoring method to use? Is
> > there any recommended reading? I am planning to crawl broadly across the
> > www for data mining (not necessarily search) covering millions of sites.
> >
> >
> >       From: lewis john mcgibbney <[email protected]>
> >  To: "[email protected]" <[email protected]>
> >  Sent: Tuesday, November 15, 2016 12:09 AM
> >  Subject: Re: How can I Score?
> >
> > Hi Michael,
> > Replies inline
> >
> > On Sat, Nov 12, 2016 at 7:10 PM, <[email protected]> wrote:
> >
> > > From: Michael Coffey <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Cc:
> > > Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> > > Subject: How can I Score?
> > > When the generator is used with -topN, it is supposed to choose the
> > > highest-scoring urls.
> >
> >
> > Yes, this is the threshold of how many top-scoring URLs you wish to
> > generate into a new fetch list and subsequently fetch. When you use the
> > crawl script, the -topN is calculated as follows:
> >
> > $numSlaves * 50000
> >
> > By default, we assume that you are running on one machine (local mode),
> > so the numSlaves variable is set to 1.
> >
> >
> > > In my case, all the urls in my db have a score of zero, except the ones
> > > injected.
> > >
> >
> > This is a bit strange. I would not expect them to have absolutely zero...
> > are you sure that it is not marginally above zero? Which scoring
> > plugin/mechanism are you currently using?
> >
> >
> > > How can I cause scores to be computed and stored?
> >
> >
> > Scores for each and every CrawlDatum are computed automatically
> > out-of-the-box.
> >
> >
> > > I am using the standard crawl script.
> >
> >
> > OK
> >
> >
> > > Do I need to enable the various webgraph lines in the script?
> > >
> > >
> > Not unless you wish to use the WebGraph scoring implementation...
> > Lewis
> >
> >
> > --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
> >
> >
> >
>
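
To illustrate the -topN arithmetic and the "highest-scoring URLs" selection
discussed in the quoted thread, here is a minimal Python sketch. It is not
Nutch code: the names URLS_PER_NODE, top_n_limit and select_top_n are
hypothetical, and the selection is a plain sort that only approximates what
the generator does. Only the 50000-per-node figure and the default of one
node come from the thread.

```python
from typing import Dict, List

# Per-node fetch list size quoted in the thread ($numSlaves * 50000).
# The constant name is hypothetical, chosen for this sketch only.
URLS_PER_NODE = 50000


def top_n_limit(num_slaves: int = 1) -> int:
    """Return the -topN value for a given node count (1 = local mode)."""
    return num_slaves * URLS_PER_NODE


def select_top_n(scores: Dict[str, float], top_n: int) -> List[str]:
    """Pick the top_n highest-scoring URLs (illustrative, not Nutch's logic)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:top_n]]


if __name__ == "__main__":
    # Made-up URLs and scores, purely for demonstration.
    example_scores = {
        "http://example.com/a": 0.1,
        "http://example.com/b": 1.0,
        "http://example.com/c": 0.5,
    }
    print(top_n_limit())
    print(select_top_n(example_scores, 2))
```

If every non-injected URL has the same score, as in the original question,
a ranking like this has nothing to distinguish the candidates, which is why
the choice of scoring plugin matters for -topN.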
