Re: How can I Score?
Aha! I was wrong when I said I was using all default settings. I forgot I had followed a tutorial that told me to put |scoring-depth| instead of |scoring-opic| into the plugin.includes property. Now I get a variety of scores. Anyway, what is the general advice on which scoring method to use? Is there any recommended reading? I am planning to crawl broadly across the web for data mining (not necessarily search), covering millions of sites.

From: lewis john mcgibbney
To: "user@nutch.apache.org"
Sent: Tuesday, November 15, 2016 12:09 AM
Subject: Re: How can I Score?

Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM, wrote:
> From: Michael Coffey
> To: "user@nutch.apache.org"
> Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.

Yes, this is the threshold of how many top-scoring URLs you wish to generate into a new fetch list and subsequently fetch. When you use the crawl script, -topN is calculated as $numSlaves * 5. By default, we assume that you are running on one machine (local mode), therefore the numSlaves variable is set to 1.

> In my case, all the urls in my db have a score of zero, except the ones
> injected.

This is a bit strange. I would not expect them to have absolutely zero... are you sure it is not marginally above zero? Which scoring plugin/mechanism are you currently using?

> How can I cause scores to be computed and stored?

Scores for each and every CrawlDatum are computed automatically out of the box.

> I am using the standard crawl script.

OK

> Do I need to enable the various webgraph lines in the script?

Not unless you wish to use the WebGraph scoring implementation...

Lewis
--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
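For reference, the plugin.includes change described above lives in conf/nutch-site.xml. A minimal sketch follows; the surrounding plugin list is illustrative and should be adapted to your own setup, with only the scoring-opic entry being the point of the example:

```xml
<property>
  <name>plugin.includes</name>
  <!-- Swap scoring-depth back to scoring-opic (the default) to get
       OPIC link-based scores; the rest of this list is illustrative. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```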
Re: [MASSMAIL]Re: how to insert nutch into ambari ecosystem ?
Thanks Lewis. The Nutch crawl script has an automatic option to detect whether it is in distributed or local mode. As you said, I have copied Nutch onto a cluster and also compiled it as a job with its configuration, and it is done. That is a complex task because Ambari has a lot of components that are interesting. I am learning about Accumulo and YARN because they are new to me. Thanks for your answer.
Eyeris.

----- Original Message -----
From: "lewis john mcgibbney"
To: user@nutch.apache.org
Sent: Tuesday, 15 November 2016 13:55:57
Subject: [MASSMAIL]Re: how to insert nutch into ambari ecosystem ?

Hi Eyeris,
Replies inline

On Fri, Oct 28, 2016 at 8:51 PM, wrote:
> From: Eyeris Rodriguez Rueda
> To: user@nutch.apache.org
> Date: Fri, 28 Oct 2016 09:43:59 -0400 (CDT)
> Subject: how to insert nutch into ambari ecosystem ?
> Hi all.
> I have installed the ambari ecosystem and its services are running
> ok (accumulo, yarn, zookeeper and others).

Good.

> My environment is a small cluster with 8 servers using Ubuntu Server 14.04
> because ambari is not yet compatible with Ubuntu Server 16.04.

OK

> But I don't know how to insert nutch into the ambari ecosystem to crawl
> and also index with solr.
> Please, any help or advice will be appreciated.

Well, there are two parts to this. One is us working over on the Ambari/BigTop platforms to ensure that the relevant compatible packaging is created, such that the option to build Nutch with the Hadoop stack is shipped and available within Ambari. This is probably a fair amount of work... but it would be useful, there is no doubt about that. The other is that when launching Hadoop clusters with Ambari and wishing to run Nutch on them, you can do so as you would normally. Just log into the head node and launch your Nutch crawler in deploy mode... simple as that. Any issues, let us know.
lewis

The University of Informatics Sciences invites you to participate in the Scientific Conference UCIENCIA 2016, November 24-26.
http://uciencia.eventos.uci.cu/
Re: how to insert nutch into ambari ecosystem ?
Hi Eyeris,
Replies inline

On Fri, Oct 28, 2016 at 8:51 PM, wrote:
> From: Eyeris Rodriguez Rueda
> To: user@nutch.apache.org
> Date: Fri, 28 Oct 2016 09:43:59 -0400 (CDT)
> Subject: how to insert nutch into ambari ecosystem ?
> Hi all.
> I have installed the ambari ecosystem and its services are running
> ok (accumulo, yarn, zookeeper and others).

Good.

> My environment is a small cluster with 8 servers using Ubuntu Server 14.04
> because ambari is not yet compatible with Ubuntu Server 16.04.

OK

> But I don't know how to insert nutch into the ambari ecosystem to crawl
> and also index with solr.
> Please, any help or advice will be appreciated.

Well, there are two parts to this. One is us working over on the Ambari/BigTop platforms to ensure that the relevant compatible packaging is created, such that the option to build Nutch with the Hadoop stack is shipped and available within Ambari. This is probably a fair amount of work... but it would be useful, there is no doubt about that. The other is that when launching Hadoop clusters with Ambari and wishing to run Nutch on them, you can do so as you would normally. Just log into the head node and launch your Nutch crawler in deploy mode... simple as that. Any issues, let us know.
lewis
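Launching the crawler in deploy mode from the head node, as described above, can be sketched as follows. This is only an illustration: the seed directory, crawl directory, Solr URL, and round count are all placeholder assumptions, and the exact bin/crawl arguments vary by Nutch version.

```shell
# Run from the Nutch runtime/deploy directory on the head node; the
# apache-nutch-*.job file built by 'ant' is what gets submitted to Hadoop.
# Seed dir, crawl dir, Solr URL and the 2 rounds are illustrative only.
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls/ crawl/ 2
```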
RE: Nutch 2.3.1 REST calls to DB
Thank you Lewis! About the second question:

POST /db
{
  "batchId": "batch-id"
}

I replaced "batch-id" with a batchId value from the database. It doesn't work.
Regards,
Vladimir.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: November-15-16 11:53 AM
To: user@nutch.apache.org
Subject: Re: Nutch 2.3.1 REST calls to DB

Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM, wrote:
> From: Vladimir Loubenski
> To: "user@nutch.apache.org"
> Date: Tue, 8 Nov 2016 17:53:59 +0000
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation mentions the following syntax for DB
> calls: https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>    {
>      "startKey":"com.google",
>      "endKey":"com.yahoo",
>      "isKeysReversed":"true"
>    }

Essentially you are running a DB query here, because we are attempting to obtain data from one of the Gora-supported databases. If you wish to read the code then please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter' object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
Here you are setting a start key and an end key from which to scan, and for which to return a results Iterator. Please note that right now we do not have consistency in whether start keys and end keys are inclusive within the Gora Query API.

Now on to the 'isKeysReversed' aspect of the JSON configuration. This relates to whether or not your keys represent a URL in its reversed form. In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to the improvements this offers in query and scan performance. Let's take an example: 'org.apache.nutch...'. This means that we can scan initially for 'org', then 'apache', meaning that we are scanning a significantly reduced subset of the data contained within the WebGraph DB. On the other hand, let's consider 'http://nutch.apache.org...'. This would mean that we query by 'http://', then 'nutch'. The issue with querying for 'http://' is that more or less EVERY key within the DB contains 'http://', meaning that our query path is significantly increased and our query is not going to be efficient at all.

> Call below doesn't work for me. It always returns an empty result.
> POST /db
>    {
>      "batchId": "batch-id"
>    }

Please ensure that you have replaced the right-hand-side "batch-id" value with the value of one of your BatchID identifiers. These are created at the generate phase of a crawl cycle. In order to obtain a list of all BatchIDs you've created, you would need to query your database separately outside of Nutch and create a list of BatchID results.
hth
Lewis
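As a concrete illustration of the /db call discussed above, the filter payload can be built and inspected like this. The port-8081 endpoint is an assumption about a default nutchserver setup, not something stated in the thread:

```python
import json

# Hypothetical endpoint: this assumes 'bin/nutch nutchserver' is listening
# on localhost:8081 -- check your own configuration before using it.
DB_ENDPOINT = "http://localhost:8081/db"

# Keys are given in reversed-host form (com.google, com.yahoo), so
# isKeysReversed is set to "true" to match.
payload = json.dumps({
    "startKey": "com.google",
    "endKey": "com.yahoo",
    "isKeysReversed": "true",
})
print(payload)
# Send it with, for example:
#   curl -s -X POST -H 'Content-Type: application/json' \
#        -d "$PAYLOAD" http://localhost:8081/db
```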
Re: user Digest 7 Nov 2016 19:53:09 -0000 Issue 2672
Hi Eyeris,
I've just tried the Nutch master branch to parse outlinks from a number of RSS feeds, an example being 'http://www.jpl.nasa.gov/blog/feed/'. This works perfectly with both the feed and parse-tika plugins. Outlinks are extracted accordingly. Can you provide an example of the RSS feeds you are looking to parse outlinks from? Are they valid? An excellent resource for this kind of troubleshooting is the ParseChecker tool: https://wiki.apache.org/nutch/bin/nutch%20parsechecker
hth

On Mon, Nov 7, 2016 at 11:53 AM, wrote:
> From: Eyeris Rodriguez Rueda
> To: user@nutch.apache.org
> Date: Sun, 6 Nov 2016 12:14:29 -0500 (CST)
> Subject: how to insert outlinks from rss in crawldb ?
>
> Hi.
> I am using nutch 1.12 and solr 4.10.3.
> RSS is a significant way to discover new URLs to fetch.
> All links detected in an RSS feed are not inserted in the crawldb as new
> URLs. Can anybody tell me why?
> Please, can anybody help me or point me in the right direction to insert
> outlinks from feeds into the crawldb, and visit them in the next iteration?
> I have activated only the tika parser, because when using both (tika and
> feed) the content and outlinks fields are empty in solr.
> The outlinks are not inserted in the crawldb and are not visited in the
> next iterations of the crawl.

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
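A ParseChecker invocation for the feed case discussed above can be sketched as follows; this is a usage illustration (run from the Nutch runtime directory), with the feed URL taken from Lewis's example:

```shell
# Sketch: show which outlinks Nutch extracts from a feed URL.
# -dumpText also prints the parsed text alongside the outlink list.
bin/nutch parsechecker -dumpText "http://www.jpl.nasa.gov/blog/feed/"
```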
Re: Nutch 2.3.1 REST calls to DB
Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM, wrote:
> From: Vladimir Loubenski
> To: "user@nutch.apache.org"
> Date: Tue, 8 Nov 2016 17:53:59 +0000
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation mentions the following syntax for DB
> calls: https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>    {
>      "startKey":"com.google",
>      "endKey":"com.yahoo",
>      "isKeysReversed":"true"
>    }

Essentially you are running a DB query here, because we are attempting to obtain data from one of the Gora-supported databases. If you wish to read the code then please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter' object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
Here you are setting a start key and an end key from which to scan, and for which to return a results Iterator. Please note that right now we do not have consistency in whether start keys and end keys are inclusive within the Gora Query API.

Now on to the 'isKeysReversed' aspect of the JSON configuration. This relates to whether or not your keys represent a URL in its reversed form. In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to the improvements this offers in query and scan performance. Let's take an example: 'org.apache.nutch...'. This means that we can scan initially for 'org', then 'apache', meaning that we are scanning a significantly reduced subset of the data contained within the WebGraph DB. On the other hand, let's consider 'http://nutch.apache.org...'. This would mean that we query by 'http://', then 'nutch'. The issue with querying for 'http://' is that more or less EVERY key within the DB contains 'http://', meaning that our query path is significantly increased and our query is not going to be efficient at all.

> Call below doesn't work for me. It always returns an empty result.
> POST /db
>    {
>      "batchId": "batch-id"
>    }

Please ensure that you have replaced the right-hand-side "batch-id" value with the value of one of your BatchID identifiers. These are created at the generate phase of a crawl cycle. In order to obtain a list of all BatchIDs you've created, you would need to query your database separately outside of Nutch and create a list of BatchID results.
hth
Lewis
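The reversed-key idea described above can be sketched in a few lines. This is a simplified illustration of the scheme, not Nutch's exact TableUtil.reverseUrl output format:

```python
from urllib.parse import urlparse

def reverse_host_key(url):
    # Turn http://nutch.apache.org/path into a key that starts with the
    # reversed host ("org.apache.nutch"), so all rows for one domain sort
    # together and a scan for 'org.apache' touches far fewer keys than a
    # scan that had to start from the common 'http://' prefix.
    parts = urlparse(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return "%s:%s%s" % (reversed_host, parts.scheme, parts.path)

print(reverse_host_key("http://nutch.apache.org/"))
```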
Re: How can I Score?
Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM, wrote:
> From: Michael Coffey
> To: "user@nutch.apache.org"
> Date: Sun, 13 Nov 2016 03:07:16 +0000 (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.

Yes, this is the threshold of how many top-scoring URLs you wish to generate into a new fetch list and subsequently fetch. When you use the crawl script, -topN is calculated as $numSlaves * 5. By default, we assume that you are running on one machine (local mode), therefore the numSlaves variable is set to 1.

> In my case, all the urls in my db have a score of zero, except the ones
> injected.

This is a bit strange. I would not expect them to have absolutely zero... are you sure it is not marginally above zero? Which scoring plugin/mechanism are you currently using?

> How can I cause scores to be computed and stored?

Scores for each and every CrawlDatum are computed automatically out of the box.

> I am using the standard crawl script.

OK

> Do I need to enable the various webgraph lines in the script?

Not unless you wish to use the WebGraph scoring implementation...

Lewis
--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
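As a rough intuition for how the default OPIC scoring plugin computes these scores: each fetched page passes its score ("cash") on to its outlinks. The sketch below is a heavily simplified illustration of that idea, not the actual ScoringFilter API:

```python
def distribute_cash(page_score, outlinks):
    # Simplified OPIC idea: a page's score is split evenly among its
    # outlinks. Newly discovered URLs accumulate these contributions,
    # which is why non-injected URLs should end up with non-zero scores.
    if not outlinks:
        return {}
    share = page_score / len(outlinks)
    return {url: share for url in outlinks}

contributions = distribute_cash(1.0, ["http://a.example/", "http://b.example/"])
print(contributions)
```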