RE: How can I Score?
Hi,

Is scoring implemented for Nutch 2.3.1? I always receive 0 for the scoring field. Do I need to run a special step to receive it?

Regards,
Vladimir.

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: November-17-16 7:52 AM
To: user@nutch.apache.org
Subject: Re: How can I Score?

Hi,

one solution is to include the plugin scoring-opic in the property plugin.includes.

That should work given that injected URLs have a non-zero score and are not redirects.

OPIC is a good choice for frontier selection (via -topN). For Nutch 1.x the webgraph steps in combination with scoring-link are an alternative, esp. if you want to use the score mainly in the index to rank search results.

Best,
Sebastian

On 11/13/2016 04:17 AM, Yongyao Jiang wrote:
> I was thinking about the same question too. My guess is scoring
> happens when you run the fetch command.
>
> This page may help,
> https://wiki.apache.org/nutch/NutchScoring
>
> On Sat, Nov 12, 2016 at 10:07 PM, Michael Coffey
> <mcof...@yahoo.com.invalid> wrote:
>
>> When the generator is used with -topN, it is supposed to choose the
>> highest-scoring urls. In my case, all the urls in my db have a score
>> of zero, except the ones injected.
>> How can I cause scores to be computed and stored? I am using the
>> standard crawl script. Do I need to enable the various webgraph lines
>> in the script?
Re: How can I Score?
Hi,

one solution is to include the plugin scoring-opic in the property plugin.includes.

That should work given that injected URLs have a non-zero score and are not redirects.

OPIC is a good choice for frontier selection (via -topN). For Nutch 1.x the webgraph steps in combination with scoring-link are an alternative, esp. if you want to use the score mainly in the index to rank search results.

Best,
Sebastian

On 11/13/2016 04:17 AM, Yongyao Jiang wrote:
> I was thinking about the same question too. My guess is scoring happens
> when you run the fetch command.
>
> This page may help, https://wiki.apache.org/nutch/NutchScoring
>
> On Sat, Nov 12, 2016 at 10:07 PM, Michael Coffey wrote:
>
>> When the generator is used with -topN, it is supposed to choose the
>> highest-scoring urls. In my case, all the urls in my db have a score of
>> zero, except the ones injected.
>> How can I cause scores to be computed and stored? I am using the standard
>> crawl script. Do I need to enable the various webgraph lines in the script?
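To check which scoring plugin a configuration actually turns on, the suggestion above can be sketched in a few lines of Python. This is only an illustration under assumptions: the XML below is a hypothetical nutch-site.xml override whose plugin list is made up apart from the scoring-opic entry, the helper name enabled_scoring_plugins is invented for this example, and it assumes the plugin.includes value is a regular expression matched against the whole plugin id.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml override. Only the scoring-opic entry comes
# from the thread; the other plugin ids are illustrative placeholders.
NUTCH_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
"""

# Scoring plugins mentioned in this thread.
SCORING_PLUGINS = ("scoring-opic", "scoring-depth", "scoring-link")


def enabled_scoring_plugins(site_xml):
    """Return the scoring plugins whose id matches plugin.includes.

    Assumption: the property value is a regex that must match the
    complete plugin id, hence re.fullmatch.
    """
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            pattern = re.compile(prop.findtext("value", default="").strip())
            return [p for p in SCORING_PLUGINS if pattern.fullmatch(p)]
    return []  # property not overridden in this file


print(enabled_scoring_plugins(NUTCH_SITE_XML))  # ['scoring-opic']
```

If the list comes back with scoring-depth, or empty, that may explain scores that stay at zero for everything except the injected URLs.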
Re: How can I Score?
Hi,

Here is an old question but an answer from Markus too :)

http://lucene.472066.n3.nabble.com/PageRank-or-Opic-td4118842.html

Kind Regards,
Furkan KAMACI

On Wed, Nov 16, 2016 at 11:32 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> WebGraph is superior to opic. It eats resources but if you can spare them,
> use it. Also, if you recrawl already fetched URLs, scores will go wrong
> with opic.
> Markus
>
> -Original message-
> > From: Michael Coffey <mcof...@yahoo.com.INVALID>
> > Sent: Wednesday 16th November 2016 7:15
> > To: user@nutch.apache.org
> > Subject: Re: How can I Score?
> >
> > Aha! I was wrong when I said I was using all default settings. I forgot
> > I had followed a tutorial that told me to put |scoring-depth| instead of
> > |scoring-opic| into the plugin.includes property. Now I get a variety of
> > scores.
> > Anyway, what is the general advice on which scoring method to use? Is
> > there any recommended reading? I am planning to crawl broadly across the
> > www for data mining (not necessarily search) covering millions of sites.
> >
> > From: lewis john mcgibbney <lewi...@apache.org>
> > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > Sent: Tuesday, November 15, 2016 12:09 AM
> > Subject: Re: How can I Score?
> >
> > Hi Michael,
> > Replies inline
> >
> > On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:
> >
> > > From: Michael Coffey <mcof...@yahoo.com.invalid>
> > > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > > Cc:
> > > Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> > > Subject: How can I Score?
> > > When the generator is used with -topN, it is supposed to choose the
> > > highest-scoring urls.
> >
> > Yes this is the threshold of how many top scoring URLs you wish to
> > generate into a new Fetch list and subsequently fetch. When you use the
> > crawl script, the -topN is calculated as follows
> >
> > $numSlaves * 5
> >
> > By default, we assume that you are running on one machine (local mode)
> > therefore the numSlaves variable is set to 1.
> >
> > > In my case, all the urls in my db have a score of zero, except the ones
> > > injected.
> >
> > This is a bit strange. I would not expect them to have absolutely zero...
> > are you sure that it is not marginally above zero? Which scoring
> > plugin/mechanism are you currently using?
> >
> > > How can I cause scores to be computed and stored?
> >
> > Scores for each and every CrawlDatum are computed automatically
> > out-of-the-box.
> >
> > > I am using the standard crawl script.
> >
> > OK
> >
> > > Do I need to enable the various webgraph lines in the script?
> >
> > Not unless you wish to use the WebGraph scoring implementation...
> > Lewis
> >
> > --
> > http://home.apache.org/~lewismc/
> > @hectorMcSpector
> > http://www.linkedin.com/in/lmcgibbney
RE: How can I Score?
WebGraph is superior to opic. It eats resources but if you can spare them, use it. Also, if you recrawl already fetched URLs, scores will go wrong with opic.

Markus

-Original message-
> From: Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Wednesday 16th November 2016 7:15
> To: user@nutch.apache.org
> Subject: Re: How can I Score?
>
> Aha! I was wrong when I said I was using all default settings. I forgot I had
> followed a tutorial that told me to put |scoring-depth| instead of
> |scoring-opic| into the plugin.includes property. Now I get a variety of
> scores.
> Anyway, what is the general advice on which scoring method to use? Is there
> any recommended reading? I am planning to crawl broadly across the www for
> data mining (not necessarily search) covering millions of sites.
>
> From: lewis john mcgibbney <lewi...@apache.org>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Sent: Tuesday, November 15, 2016 12:09 AM
> Subject: Re: How can I Score?
>
> Hi Michael,
> Replies inline
>
> On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:
>
> > From: Michael Coffey <mcof...@yahoo.com.invalid>
> > To: "user@nutch.apache.org" <user@nutch.apache.org>
> > Cc:
> > Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> > Subject: How can I Score?
> > When the generator is used with -topN, it is supposed to choose the
> > highest-scoring urls.
>
> Yes this is the threshold of how many top scoring URLs you wish to generate
> into a new Fetch list and subsequently fetch. When you use the crawl
> script, the -topN is calculated as follows
>
> $numSlaves * 5
>
> By default, we assume that you are running on one machine (local mode)
> therefore the numSlaves variable is set to 1.
>
> > In my case, all the urls in my db have a score of zero, except the ones
> > injected.
>
> This is a bit strange. I would not expect them to have absolutely zero...
> are you sure that it is not marginally above zero? Which scoring
> plugin/mechanism are you currently using?
>
> > How can I cause scores to be computed and stored?
>
> Scores for each and every CrawlDatum are computed automatically
> out-of-the-box.
>
> > I am using the standard crawl script.
>
> OK
>
> > Do I need to enable the various webgraph lines in the script?
>
> Not unless you wish to use the WebGraph scoring implementation...
> Lewis
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
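One way to picture the recrawl caveat above is a toy, OPIC-style update in Python. This is a simplified sketch and not Nutch's actual scoring-opic code: it assumes each fetched page splits its current score equally among its outlinks and each target adds whatever it receives, and the names opic_update, scores, and outlinks are invented for the example.

```python
def opic_update(scores, outlinks, fetched):
    """One simplified update round (assumption, not Nutch's implementation).

    Every page in `fetched` divides its current score equally among its
    outlinks; each target adds the shares it receives to its own score.
    """
    contributions = {}
    for page in fetched:
        targets = outlinks.get(page, [])
        if not targets:
            continue
        share = scores.get(page, 0.0) / len(targets)
        for target in targets:
            contributions[target] = contributions.get(target, 0.0) + share

    updated = dict(scores)
    for target, extra in contributions.items():
        updated[target] = updated.get(target, 0.0) + extra
    return updated


# Hypothetical tiny crawl: one injected seed linking to two pages.
scores = {"seed": 1.0}
outlinks = {"seed": ["a", "b"]}

after_first_fetch = opic_update(scores, outlinks, fetched=["seed"])
print(after_first_fetch)  # {'seed': 1.0, 'a': 0.5, 'b': 0.5}

# Fetching the same seed again hands out the same shares a second time.
after_refetch = opic_update(after_first_fetch, outlinks, fetched=["seed"])
print(after_refetch)  # {'seed': 1.0, 'a': 1.0, 'b': 1.0}
```

In this toy model nothing about the link graph changed between the two rounds, yet the targets' scores doubled, which is one plausible reading of "scores will go wrong" when already fetched URLs are recrawled.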
Re: How can I Score?
Aha! I was wrong when I said I was using all default settings. I forgot I had followed a tutorial that told me to put |scoring-depth| instead of |scoring-opic| into the plugin.includes property. Now I get a variety of scores.

Anyway, what is the general advice on which scoring method to use? Is there any recommended reading? I am planning to crawl broadly across the www for data mining (not necessarily search) covering millions of sites.

From: lewis john mcgibbney <lewi...@apache.org>
To: "user@nutch.apache.org" <user@nutch.apache.org>
Sent: Tuesday, November 15, 2016 12:09 AM
Subject: Re: How can I Score?

Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.

Yes this is the threshold of how many top scoring URLs you wish to generate into a new Fetch list and subsequently fetch. When you use the crawl script, the -topN is calculated as follows

$numSlaves * 5

By default, we assume that you are running on one machine (local mode) therefore the numSlaves variable is set to 1.

> In my case, all the urls in my db have a score of zero, except the ones
> injected.

This is a bit strange. I would not expect them to have absolutely zero... are you sure that it is not marginally above zero? Which scoring plugin/mechanism are you currently using?

> How can I cause scores to be computed and stored?

Scores for each and every CrawlDatum are computed automatically out-of-the-box.

> I am using the standard crawl script.

OK

> Do I need to enable the various webgraph lines in the script?

Not unless you wish to use the WebGraph scoring implementation...
Lewis

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
Re: How can I Score?
Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM, <user-digest-h...@nutch.apache.org> wrote:

> From: Michael Coffey <mcof...@yahoo.com.invalid>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.

Yes this is the threshold of how many top scoring URLs you wish to generate into a new Fetch list and subsequently fetch. When you use the crawl script, the -topN is calculated as follows

$numSlaves * 5

By default, we assume that you are running on one machine (local mode) therefore the numSlaves variable is set to 1.

> In my case, all the urls in my db have a score of zero, except the ones
> injected.

This is a bit strange. I would not expect them to have absolutely zero... are you sure that it is not marginally above zero? Which scoring plugin/mechanism are you currently using?

> How can I cause scores to be computed and stored?

Scores for each and every CrawlDatum are computed automatically out-of-the-box.

> I am using the standard crawl script.

OK

> Do I need to enable the various webgraph lines in the script?

Not unless you wish to use the WebGraph scoring implementation...
Lewis

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
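The -topN behaviour described in this reply can be sketched in Python as follows. It is a rough model, not the generator's real code: fetch_list_size and select_top_n are hypothetical helper names, the factor 5 is taken verbatim from the message and may differ in your copy of the crawl script, and the URLs are placeholders.

```python
import heapq


def fetch_list_size(num_slaves=1, per_slave=5):
    """-topN as described in the reply: $numSlaves * 5.

    num_slaves defaults to 1, matching the local-mode assumption.
    """
    return num_slaves * per_slave


def select_top_n(scores, top_n):
    """Pick the top_n highest-scoring URLs from a {url: score} mapping."""
    return heapq.nlargest(top_n, scores, key=scores.get)


# Hypothetical crawl db: only the injected URL has a non-zero score.
crawl_db = {
    "http://example.org/": 1.0,
    "http://example.org/a": 0.0,
    "http://example.org/b": 0.0,
    "http://example.org/c": 0.0,
}

print(fetch_list_size())  # 5
print(select_top_n(crawl_db, 2))
```

When every non-injected URL scores zero, as in the original question, the score carries no information for choosing among them, so whatever lands in the fetch list beyond the injected URL is effectively arbitrary.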
Re: How can I Score?
I was thinking about the same question too. My guess is scoring happens when you run the fetch command.

This page may help, https://wiki.apache.org/nutch/NutchScoring

On Sat, Nov 12, 2016 at 10:07 PM, Michael Coffey wrote:

> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls. In my case, all the urls in my db have a score of
> zero, except the ones injected.
> How can I cause scores to be computed and stored? I am using the standard
> crawl script. Do I need to enable the various webgraph lines in the script?

--
Yongyao Jiang
https://www.linkedin.com/in/yongyao-jiang-42516164
Ph.D. Student in Earth Systems and GeoInformation Sciences
NSF Spatiotemporal Innovation Center
George Mason University
How can I Score?
When the generator is used with -topN, it is supposed to choose the highest-scoring urls. In my case, all the urls in my db have a score of zero, except the ones injected. How can I cause scores to be computed and stored? I am using the standard crawl script. Do I need to enable the various webgraph lines in the script?