RE: How can I Score?

2016-11-17 Thread Vladimir Loubenski
Hi,
Is scoring implemented for Nutch 2.3.1?
I always receive 0 for the scoring field.
Do I need to run a special step to receive it?

Regards,
Vladimir.


-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: November-17-16 7:52 AM
To: user@nutch.apache.org
Subject: Re: How can I Score?

Hi,

one solution is to include the plugin
  scoring-opic
in the property
  plugin.includes

That should work given that injected URLs have a non-zero score and are not 
redirects. OPIC is a good choice for frontier selection (via -topN).

For Nutch 1.x the webgraph steps in combination with scoring-link are an 
alternative, esp. if you want to use the score mainly in the index to rank 
search results.

Best,
Sebastian

On 11/13/2016 04:17 AM, Yongyao Jiang wrote:
> I was thinking about the same question too. My guess is scoring 
> happens when you run fetch command.
> 
> This page may help, https://wiki.apache.org/nutch/NutchScoring
> 
> On Sat, Nov 12, 2016 at 10:07 PM, Michael Coffey 
> <mcof...@yahoo.com.invalid>
> wrote:
> 
>> When the generator is used with -topN, it is supposed to choose the 
>> highest-scoring urls. In my case, all the urls in my db have a score 
>> of zero, except the ones injected.
>> How can I cause scores to be computed and stored? I am using the 
>> standard crawl script. Do I need to enable the various webgraph lines in the 
>> script?
>>
> 
> 
> 



Re: How can I Score?

2016-11-17 Thread Sebastian Nagel
Hi,

one solution is to include the plugin
  scoring-opic
in the property
  plugin.includes

That should work given that injected URLs have a non-zero score and are
not redirects. OPIC is a good choice for frontier selection (via -topN).
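
A quick way to see which scoring plugin a configuration actually enables is
to test the plugin ids against the plugin.includes regular expression. The
sketch below is only an illustration: the embedded nutch-site.xml content and
its plugin list are assumed, the helper names are hypothetical, and it relies
on the understanding that Nutch matches each plugin id against the whole
pattern.

```python
import re
import xml.etree.ElementTree as ET

# Assumed nutch-site.xml content; the plugin list is illustrative only.
NUTCH_SITE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>"""


def plugin_includes(xml_text):
    """Return the value of the plugin.includes property, or '' if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            return prop.findtext("value", default="").strip()
    return ""


def enabled(pattern, plugin_ids):
    """Keep the plugin ids that the pattern matches in full."""
    rx = re.compile(pattern)
    return [p for p in plugin_ids if rx.fullmatch(p)]


SCORING_PLUGINS = ["scoring-opic", "scoring-depth", "scoring-link"]
active = enabled(plugin_includes(NUTCH_SITE), SCORING_PLUGINS)
print(active)  # ['scoring-opic']
```

If the list comes back empty, or shows a plugin other than the one you
expected, plugin.includes is the first place to look.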

For Nutch 1.x the webgraph steps in combination with scoring-link are
an alternative, esp. if you want to use the score mainly in the index
to rank search results.

Best,
Sebastian

On 11/13/2016 04:17 AM, Yongyao Jiang wrote:
> I was thinking about the same question too. My guess is scoring happens
> when you run fetch command.
> 
> This page may help, https://wiki.apache.org/nutch/NutchScoring
> 
> On Sat, Nov 12, 2016 at 10:07 PM, Michael Coffey 
> wrote:
> 
>> When the generator is used with -topN, it is supposed to choose the
>> highest-scoring urls. In my case, all the urls in my db have a score of
>> zero, except the ones injected.
>> How can I cause scores to be computed and stored? I am using the standard
>> crawl script. Do I need to enable the various webgraph lines in the script?
>>
> 
> 
> 



Re: How can I Score?

2016-11-16 Thread Furkan KAMACI
Hi,

Here is an old question but an answer from Markus too :)
http://lucene.472066.n3.nabble.com/PageRank-or-Opic-td4118842.html

Kind Regards,
Furkan KAMACI

On Wed, Nov 16, 2016 at 11:32 AM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> WebGraph is superior to opic. It eats resources but if you can spare them,
> use it. Also, if you recrawl already fetched URLs, scores will go wrong
> with opic.
> Markus
>
>
>
> -Original message-
> > From:Michael Coffey <mcof...@yahoo.com.INVALID>
> > Sent: Wednesday 16th November 2016 7:15
> > To: user@nutch.apache.org
> > Subject: Re: How can I Score?
> >
> > Aha! I was wrong when I said I was using all default settings. I forgot
> I had followed a tutorial that told me to put |scoring-depth| instead of
> |scoring-opic| into the plugin.includes property. Now I get a variety of
> scores.
> > Anyway, what is the general advice on which scoring method to use? Is
> there any recommended reading? I am planning to crawl broadly across the
> www for data mining (not necessarily search) covering millions of sites.
> >
> >
> >   From: lewis john mcgibbney <lewi...@apache.org>
> >  To: "user@nutch.apache.org" <user@nutch.apache.org>
> >  Sent: Tuesday, November 15, 2016 12:09 AM
> >  Subject: Re: How can I Score?
> >
> > Hi Michael,
> > Replies inline
> >
> > On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org>
> wrote:
> >
> > > From: Michael Coffey <mcof...@yahoo.com.invalid>
> > > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > > Cc:
> > > Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> > > Subject: How can I Score?
> > > When the generator is used with -topN, it is supposed to choose the
> > > highest-scoring urls.
> >
> >
> > Yes this is the threshold of how many top scoring URLs you wish to
> generate
> > into a new Fetch list and subsequently fetch. When you use the crawl
> > script, the -topN is calculated as follows
> >
> > $numSlaves * 5
> >
> > By default, we assume that you are running on one machine (local mode)
> > therefore the numSlaves variable is set to 1.
> >
> >
> > > In my case, all the urls in my db have a score of zero, except the ones
> > > injected.
> > >
> >
> > This is a bit strange. I would not expect them to have absolutely zero...
> > are you sure that it is not marginally above zero? Which scoring
> > plugin/mechanism are you currently using?
> >
> >
> > > How can I cause scores to be computed and stored?
> >
> >
> > Scores for each and every CrawlDatum are computed automatically
> > out-of-the-box.
> >
> >
> > > I am using the standard crawl script.
> >
> >
> > OK
> >
> >
> > > Do I need to enable the various webgraph lines in the script?
> > >
> > >
> > Not unless you wish to use the WebGraph scoring implementation...
> > Lewis
> >
> >
> > --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
> >
> >
> >
>


RE: How can I Score?

2016-11-16 Thread Markus Jelsma
WebGraph is superior to opic. It eats resources but if you can spare them, use 
it. Also, if you recrawl already fetched URLs, scores will go wrong with opic.
Markus
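
The re-crawl problem can be illustrated with a toy model. It assumes a
simplified OPIC-style update in which a fetched page splits its score equally
among its outlinks and each target adds what it receives to its existing
score. This is a sketch of the idea only, not the scoring-opic plugin's code;
the function name and the tiny link graph are hypothetical.

```python
def opic_round(scores, outlinks, fetched):
    """One simplified update: each fetched page splits its score equally
    among its outlinks, and each target adds its share to its own score."""
    updated = dict(scores)
    for page in fetched:
        targets = outlinks.get(page, [])
        if not targets:
            continue
        share = scores[page] / len(targets)
        for target in targets:
            updated[target] = updated.get(target, 0.0) + share
    return updated


# Hypothetical link graph: one injected seed linking to two pages.
outlinks = {"seed": ["a", "b"]}
scores = {"seed": 1.0, "a": 0.0, "b": 0.0}

first = opic_round(scores, outlinks, ["seed"])
second = opic_round(first, outlinks, ["seed"])  # the same page fetched again
print(first["a"], second["a"])  # 0.5 1.0
```

In this model the second fetch of the same page hands out its score again, so
the targets' scores grow with every re-crawl even though the link graph has
not changed.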

 
 
-Original message-
> From:Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Wednesday 16th November 2016 7:15
> To: user@nutch.apache.org
> Subject: Re: How can I Score?
> 
> Aha! I was wrong when I said I was using all default settings. I forgot I had 
> followed a tutorial that told me to put |scoring-depth| instead of 
> |scoring-opic| into the plugin.includes property. Now I get a variety of 
> scores.
> Anyway, what is the general advice on which scoring method to use? Is there 
> any recommended reading? I am planning to crawl broadly across the www for 
> data mining (not necessarily search) covering millions of sites.
> 
> 
>   From: lewis john mcgibbney <lewi...@apache.org>
>  To: "user@nutch.apache.org" <user@nutch.apache.org> 
>  Sent: Tuesday, November 15, 2016 12:09 AM
>  Subject: Re: How can I Score?
>
> Hi Michael,
> Replies inline
> 
> On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:
> 
> > From: Michael Coffey <mcof...@yahoo.com.invalid>
> > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > Cc:
> > Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> > Subject: How can I Score?
> > When the generator is used with -topN, it is supposed to choose the
> > highest-scoring urls.
> 
> 
> Yes this is the threshold of how many top scoring URLs you wish to generate
> into a new Fetch list and subsequently fetch. When you use the crawl
> script, the -topN is calculated as follows
> 
> $numSlaves * 5
> 
> By default, we assume that you are running on one machine (local mode)
> therefore the numSlaves variable is set to 1.
> 
> 
> > In my case, all the urls in my db have a score of zero, except the ones
> > injected.
> >
> 
> This is a bit strange. I would not expect them to have absolutely zero...
> are you sure that it is not marginally above zero? Which scoring
> plugin/mechanism are you currently using?
> 
> 
> > How can I cause scores to be computed and stored?
> 
> 
> Scores for each and every CrawlDatum are computed automatically
> out-of-the-box.
> 
> 
> > I am using the standard crawl script.
> 
> 
> OK
> 
> 
> > Do I need to enable the various webgraph lines in the script?
> >
> >
> Not unless you wish to use the WebGraph scoring implementation...
> Lewis
> 
> 
> -- 
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
> 
> 
>


Re: How can I Score?

2016-11-15 Thread Michael Coffey
Aha! I was wrong when I said I was using all default settings. I forgot I had 
followed a tutorial that told me to put |scoring-depth| instead of 
|scoring-opic| into the plugin.includes property. Now I get a variety of scores.
Anyway, what is the general advice on which scoring method to use? Is there any 
recommended reading? I am planning to crawl broadly across the www for data 
mining (not necessarily search) covering millions of sites.


  From: lewis john mcgibbney <lewi...@apache.org>
 To: "user@nutch.apache.org" <user@nutch.apache.org> 
 Sent: Tuesday, November 15, 2016 12:09 AM
 Subject: Re: How can I Score?
   
Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.


Yes this is the threshold of how many top scoring URLs you wish to generate
into a new Fetch list and subsequently fetch. When you use the crawl
script, the -topN is calculated as follows

$numSlaves * 5

By default, we assume that you are running on one machine (local mode)
therefore the numSlaves variable is set to 1.


> In my case, all the urls in my db have a score of zero, except the ones
> injected.
>

This is a bit strange. I would not expect them to have absolutely zero...
are you sure that it is not marginally above zero? Which scoring
plugin/mechanism are you currently using?


> How can I cause scores to be computed and stored?


Scores for each and every CrawlDatum are computed automatically
out-of-the-box.


> I am using the standard crawl script.


OK


> Do I need to enable the various webgraph lines in the script?
>
>
Not unless you wish to use the WebGraph scoring implementation...
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


   

Re: How can I Score?

2016-11-15 Thread lewis john mcgibbney
Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.


Yes this is the threshold of how many top scoring URLs you wish to generate
into a new Fetch list and subsequently fetch. When you use the crawl
script, the -topN is calculated as follows

$numSlaves * 5

By default, we assume that you are running on one machine (local mode)
therefore the numSlaves variable is set to 1.
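
As a rough picture of what -topN means, selecting the N highest-scoring URLs
can be sketched as below. This is a simplified illustration and not the Nutch
generator itself, which also takes other criteria into account (for example
whether a URL is due for fetching). The function name and the sample URLs and
scores are made up.

```python
import heapq


def select_top_n(crawl_db, top_n):
    """Return the top_n (url, score) pairs with the highest scores."""
    return heapq.nlargest(top_n, crawl_db.items(), key=lambda item: item[1])


# Hypothetical CrawlDb contents: URL -> score.
crawl_db = {
    "http://example.org/": 1.0,
    "http://example.org/a": 0.25,
    "http://example.org/b": 0.5,
    "http://example.org/c": 0.0,
}

fetch_list = select_top_n(crawl_db, 2)
print(fetch_list)
# [('http://example.org/', 1.0), ('http://example.org/b', 0.5)]
```

If every non-injected URL has a score of zero, a selection like this cannot
tell those URLs apart, which is why the scoring plugin matters when -topN is
used.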


> In my case, all the urls in my db have a score of zero, except the ones
> injected.
>

This is a bit strange. I would not expect them to have absolutely zero...
are you sure that it is not marginally above zero? Which scoring
plugin/mechanism are you currently using?


> How can I cause scores to be computed and stored?


Scores for each and every CrawlDatum are computed automatically
out-of-the-box.


> I am using the standard crawl script.


OK


> Do I need to enable the various webgraph lines in the script?
>
>
Not unless you wish to use the WebGraph scoring implementation...
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: How can I Score?

2016-11-12 Thread Yongyao Jiang
I was thinking about the same question too. My guess is scoring happens
when you run fetch command.

This page may help, https://wiki.apache.org/nutch/NutchScoring

On Sat, Nov 12, 2016 at 10:07 PM, Michael Coffey 
wrote:

> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls. In my case, all the urls in my db have a score of
> zero, except the ones injected.
> How can I cause scores to be computed and stored? I am using the standard
> crawl script. Do I need to enable the various webgraph lines in the script?
>



-- 
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University


How can I Score?

2016-11-12 Thread Michael Coffey
When the generator is used with -topN, it is supposed to choose the 
highest-scoring urls. In my case, all the urls in my db have a score of zero, 
except the ones injected.
How can I cause scores to be computed and stored? I am using the standard crawl 
script. Do I need to enable the various webgraph lines in the script?