Re: How can I Score?

2016-11-15 Thread Michael Coffey
Aha! I was wrong when I said I was using all default settings. I forgot I had 
followed a tutorial that told me to put scoring-depth instead of 
scoring-opic into the plugin.includes property. Now I get a variety of scores.
Anyway, what is the general advice on which scoring method to use? Is there any 
recommended reading? I am planning to crawl broadly across the web for data 
mining (not necessarily search), covering millions of sites.
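
For anyone checking the same thing: a quick way to confirm which scoring plugin is active is to look at plugin.includes. A minimal sketch, assuming you run from the Nutch runtime directory; the property value shown is illustrative, not an exact default:

   # Show the effective plugin.includes value.
   grep -A 3 'plugin.includes' conf/nutch-site.xml conf/nutch-default.xml

   # For OPIC scoring the value should contain scoring-opic rather than
   # scoring-depth, along the lines of (illustrative; keep your other plugins):
   #   protocol-http|urlfilter-regex|parse-(html|tika)|scoring-opic|urlnormalizer-(pass|regex|basic)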


  From: lewis john mcgibbney 
 To: "user@nutch.apache.org"  
 Sent: Tuesday, November 15, 2016 12:09 AM
 Subject: Re: How can I Score?
   
Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM,  wrote:

> From: Michael Coffey 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.


Yes, this is the threshold for how many top-scoring URLs you wish to generate
into a new fetch list and subsequently fetch. When you use the crawl
script, -topN is calculated as follows:

$numSlaves * 5

By default, we assume that you are running on one machine (local mode),
so the numSlaves variable is set to 1.
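
As a sketch, the way this reaches the Generator looks roughly like the following (paths follow the usual layout, and the multiplier is the figure quoted above; both may differ in your version of the script):

   numSlaves=1
   sizeFetchlist=$((numSlaves * 5))
   bin/nutch generate crawl/crawldb crawl/segments -topN "$sizeFetchlist" -numFetchers "$numSlaves"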


> In my case, all the urls in my db have a score of zero, except the ones
> injected.
>

This is a bit strange. I would not expect them to be absolutely zero...
are you sure they are not marginally above zero? Which scoring
plugin/mechanism are you currently using?
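
One way to check (a sketch; the crawl/crawldb path is an assumption about your layout):

   # Print crawldb statistics, including min/avg/max score.
   bin/nutch readdb crawl/crawldb -stats

   # Or dump the per-URL records (scores included) for closer inspection.
   bin/nutch readdb crawl/crawldb -dump crawldb_dump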


> How can I cause scores to be computed and stored?


Scores for each and every CrawlDatum are computed automatically
out-of-the-box.


> I am using the standard crawl script.


OK


> Do I need to enable the various webgraph lines in the script?
>
>
Not unless you wish to use the WebGraph scoring implementation...
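
For reference, the WebGraph route is roughly this sequence (a sketch; db and segment paths are illustrative):

   # Build the web graph from fetched segments, compute LinkRank scores,
   # then write them back into the crawldb.
   bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
   bin/nutch linkrank -webgraphdb crawl/webgraphdb
   bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb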
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


   

Re: [MASSMAIL]Re: how to insert nutch into ambari ecosystem ?

2016-11-15 Thread Eyeris Rodriguez Rueda
Thanks Lewis.
The Nutch crawl script has an automatic option to detect whether it is running 
in distributed or local mode.
As you said, I have copied Nutch onto the cluster and also compiled it as a job 
with its configuration, and it is done.
That is a complex task because Ambari has a lot of components that are 
interesting.
I am learning about Accumulo and YARN because they are new to me.
Thanks for your answer.
Eyeris.


----- Original Message -----
From: "lewis john mcgibbney" 
To: user@nutch.apache.org
Sent: Tuesday, November 15, 2016 13:55:57
Subject: [MASSMAIL]Re: how to insert nutch into ambari ecosystem ?

Hi Eyeris,
Replies inline

On Fri, Oct 28, 2016 at 8:51 PM,  wrote:

> From: Eyeris Rodriguez Rueda 
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 28 Oct 2016 09:43:59 -0400 (CDT)
> Subject: how to insert nutch into ambari ecosystem ?
> Hi all.
> I have installed the Ambari ecosystem and its services are running
> OK (Accumulo, YARN, ZooKeeper and others).
>

Good.


> My environment is a small cluster with 8 servers using Ubuntu Server 14.04,
> because Ambari is not yet compatible with Ubuntu Server 16.04.
>

OK


> But I don't know how to insert Nutch into the Ambari ecosystem to crawl
> and also index with Solr.
> Any help or advice will be appreciated.
>
>
Well there are two parts to this.

One is us working over on the Ambari/BigTop platforms to ensure that the
relevant compatible packaging is created, such that the option to build
Nutch with the Hadoop stack is shipped and available within Ambari. This is
probably a fair amount of work... but it would be useful, there is no doubt
about that.

The other is that when launching Hadoop clusters with Ambari and wishing to
run Nutch on there, you can do so as you would normally. Just log
into the head node and launch your Nutch crawler in deploy mode... simple
as that.
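
For example (a sketch; NUTCH_HOME, the seed dir, crawl dir and round count are all illustrative):

   # Run the crawl script from the deploy runtime so jobs go to the Hadoop cluster.
   cd $NUTCH_HOME/runtime/deploy
   bin/crawl urls/ crawl/ 2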
Any issues, let us know.
lewis
The University of Informatics Sciences invites you to participate in the 
Scientific Conference UCIENCIA 2016, November 24-26.
http://uciencia.eventos.uci.cu/


Re: how to insert nutch into ambari ecosystem ?

2016-11-15 Thread lewis john mcgibbney
Hi Eyeris,
Replies inline

On Fri, Oct 28, 2016 at 8:51 PM,  wrote:

> From: Eyeris Rodriguez Rueda 
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 28 Oct 2016 09:43:59 -0400 (CDT)
> Subject: how to insert nutch into ambari ecosystem ?
> Hi all.
> I have installed the Ambari ecosystem and its services are running
> OK (Accumulo, YARN, ZooKeeper and others).
>

Good.


> My environment is a small cluster with 8 servers using Ubuntu Server 14.04,
> because Ambari is not yet compatible with Ubuntu Server 16.04.
>

OK


> But I don't know how to insert Nutch into the Ambari ecosystem to crawl
> and also index with Solr.
> Any help or advice will be appreciated.
>
>
Well there are two parts to this.

One is us working over on the Ambari/BigTop platforms to ensure that the
relevant compatible packaging is created, such that the option to build
Nutch with the Hadoop stack is shipped and available within Ambari. This is
probably a fair amount of work... but it would be useful, there is no doubt
about that.

The other is that when launching Hadoop clusters with Ambari and wishing to
run Nutch on there, you can do so as you would normally. Just log
into the head node and launch your Nutch crawler in deploy mode... simple
as that.
Any issues, let us know.
lewis


RE: Nutch 2.3.1 REST calls to DB

2016-11-15 Thread Vladimir Loubenski
Thank you Lewis!
About the second question:

POST /db
   {
      "batchId": "batch-id"
   }

I replaced batch-id with a batchId value from the database.
It still doesn't work.
Regards,
Vladimir.

-Original Message-
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: November-15-16 11:53 AM
To: user@nutch.apache.org
Subject: Re: Nutch 2.3.1 REST calls to DB

Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM,  wrote:

> From: Vladimir Loubenski 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation mentions the following syntax for DB
> calls: https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>{
>   "startKey":"com.google",
>   "endKey":"com.yahoo",
>   "isKeysReversed":"true"
>}
>

Well, essentially you are running a DB query here; this is because we are 
attempting to obtain data from one of the Gora-supported databases. If you wish 
to read the code then please see 
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter'
object. More on the particular object semantics can be seen at 
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
In this case you are setting a start key and an end key from which to scan and 
for which to return a results Iterator. Please note that right now we do not 
have consistency in whether start keys and end keys are inclusive within the 
Gora Query API.
Now on to the 'isKeysReversed' aspect of the JSON configuration. This one 
relates to whether or not your keys represent a URL in its reversed form.
In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to the 
improvements this offers us in terms of query and scan performance.
Let's take an example:

'org.apache.nutch...'

This means that we can scan initially for 'org' then 'apache', meaning that we 
are scanning a significantly reduced subset of the data contained within the 
WebGraph DB. On the other hand, let's consider the following:

'http://nutch.apache.org...'

This would mean that we query by 'http://', then 'nutch'

The issue with querying for 'http://' is that more or less EVERY key within the 
DB would contain 'http://', meaning that the range we have to scan is 
significantly larger and our query is not going to be very efficient at all.
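
As a sketch, such a range scan could look like this (host and port are assumptions; a Nutch 2.x REST server is typically started with 'bin/nutch nutchserver'):

   # Query the db between two reversed-domain keys.
   curl -s -X POST http://localhost:8081/db \
     -H 'Content-Type: application/json' \
     -d '{"startKey": "com.google", "endKey": "com.yahoo", "isKeysReversed": "true"}'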



>
> The call below doesn't work for me. It always returns an empty result:
> POST /db
>{
>   "batchId": "batch-id"
>}
>
>
Please ensure that you have replaced the right-hand side "batch-id" value with 
the value of one of your BatchID identifiers. These are created at the generate 
phase of a crawl cycle. In order to obtain a list of all the BatchIDs you've 
created, you would need to query your database separately, outside of Nutch, and 
build a list of BatchID results.
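
A sketch of the call with a concrete identifier in place (the batch id shown is made up; host and port are assumptions):

   # POST /db filtered by a real batch id taken from your datastore.
   curl -s -X POST http://localhost:8081/db \
     -H 'Content-Type: application/json' \
     -d '{"batchId": "1479168000-1234"}'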
hth
Lewis


Re: user Digest 7 Nov 2016 19:53:09 -0000 Issue 2672

2016-11-15 Thread lewis john mcgibbney
Hi Eyeris,
I've just tried Nutch master branch to parse outlinks from a number of RSS
Feeds, an example being 'http://www.jpl.nasa.gov/blog/feed/'. This works
perfectly with both the feed and parse-tika plugins. Outlinks are extracted
accordingly.
Can you provide an example of the RSS Feeds you are looking to parse
outlinks from? Are they valid?
An excellent resource to use for this kind of troubleshooting is the
ParseChecker tool:
https://wiki.apache.org/nutch/bin/nutch%20parsechecker
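
For example (a sketch, using the feed mentioned above):

   # Fetch and parse the feed, dumping the extracted text and outlinks.
   bin/nutch parsechecker -dumpText http://www.jpl.nasa.gov/blog/feed/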
hth


On Mon, Nov 7, 2016 at 11:53 AM,  wrote:

> From: Eyeris Rodriguez Rueda 
> To: user@nutch.apache.org
> Cc:
> Date: Sun, 6 Nov 2016 12:14:29 -0500 (CST)
> Subject: how to insert outlinks from rss in crawldb ?
>
> Hi.
> I am using Nutch 1.12 and Solr 4.10.3.
>
> RSS is a significant way to discover new URLs to fetch.
>
> Links detected in an RSS feed are not inserted into the crawldb as new URLs.
> Can anybody tell me why?
> Please, can anybody help me or point me in the right direction to insert
> outlinks from feeds into the crawldb, and visit them in the next iteration?
>
> I have activated only the Tika parser because, when using both (tika and feed),
> the content and outlinks fields are empty in Solr.
>
> They are not inserted in the crawldb and are also not visited in the next
> iterations of the crawl.
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Nutch 2.3.1 REST calls to DB

2016-11-15 Thread lewis john mcgibbney
Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM,  wrote:

> From: Vladimir Loubenski 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation mentions the following syntax for DB
> calls: https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>{
>   "startKey":"com.google",
>   "endKey":"com.yahoo",
>   "isKeysReversed":"true"
>}
>

Well, essentially you are running a DB query here; this is because we are
attempting to obtain data from one of the Gora-supported databases. If you
wish to read the code then please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter'
object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
In this case you are setting a start key and an end key from which to scan
and for which to return a results Iterator. Please note that right now we
do not have consistency in whether start keys and end keys are inclusive
within the Gora Query API.
Now on to the 'isKeysReversed' aspect of the JSON configuration. This one
relates to whether or not your keys represent a URL in its reversed form.
In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to
the improvements this offers us in terms of query and scan performance.
Let's take an example:

'org.apache.nutch...'

This means that we can scan initially for 'org' then 'apache', meaning that
we are scanning a significantly reduced subset of the data contained within
the WebGraph DB. On the other hand, let's consider the following:

'http://nutch.apache.org...'

This would mean that we query by 'http://', then 'nutch'

The issue with querying for 'http://' is that more or less EVERY key within
the DB would contain 'http://', meaning that the range we have to scan is
significantly larger and our query is not going to be very efficient at
all.



>
> The call below doesn't work for me. It always returns an empty result:
> POST /db
>{
>   "batchId": "batch-id"
>}
>
>
Please ensure that you have replaced the right-hand side "batch-id" value
with the value of one of your BatchID identifiers. These are created at the
generate phase of a crawl cycle. In order to obtain a list of all the BatchIDs
you've created, you would need to query your database separately, outside of
Nutch, and build a list of BatchID results.
hth
Lewis


Re: How can I Score?

2016-11-15 Thread lewis john mcgibbney
Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM,  wrote:

> From: Michael Coffey 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.


Yes, this is the threshold for how many top-scoring URLs you wish to generate
into a new fetch list and subsequently fetch. When you use the crawl
script, -topN is calculated as follows:

$numSlaves * 5

By default, we assume that you are running on one machine (local mode),
so the numSlaves variable is set to 1.


> In my case, all the urls in my db have a score of zero, except the ones
> injected.
>

This is a bit strange. I would not expect them to be absolutely zero...
are you sure they are not marginally above zero? Which scoring
plugin/mechanism are you currently using?


> How can I cause scores to be computed and stored?


Scores for each and every CrawlDatum are computed automatically
out-of-the-box.


> I am using the standard crawl script.


OK


> Do I need to enable the various webgraph lines in the script?
>
>
Not unless you wish to use the WebGraph scoring implementation...
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney