RE: Crawled data not inserting in the tables
Hi Talat,

I am afraid that I do not understand. We have set the "ttl" value to 0, which is the default value. We don't have any portions of data that need to be deleted. For now, I am using a single-node cluster, so for us the default gc_grace_seconds="0" would be a valid value. Have I missed anything? My settings are as follows. Any suggestions would be greatly appreciated.

<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160" placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
    <family name="p"/>
    <family name="f"/>
    <family name="sc" type="super"/>
    <family name="mtdt" type="super"/>
    <family name="il" type="super"/>
    <family name="ol" type="super"/>
  </keyspace>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage" keyspace="projectKeyspace">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <field name="prevModifiedTime" family="f" qualifier="pmod"/>
    <field name="batchId" family="f" qualifier="bid"/>
    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>
    <!-- score fields -->
    <field name="score" family="f" qualifier="s"/>
    <!-- super columns -->
    <field name="headers" family="sc" qualifier="h"/>
    <field name="inlinks" family="sc" qualifier="il"/>
    <field name="outlinks" family="sc" qualifier="ol"/>
    <field name="metadata" family="sc" qualifier="mtdt"/>
    <field name="markers" family="sc" qualifier="mk"/>
    <field name="parseStatus" family="sc" qualifier="pas"/>
    <field name="protocolStatus" family="sc" qualifier="prs"/>
  </class>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.Host" keyspace="projectKeyspace">
    <field name="metadata" family="mtdt" qualifier="mtdt"/>
    <field name="inlinks" family="il" qualifier="il"/>
    <field name="outlinks" family="ol" qualifier="ol"/>
  </class>
</gora-orm>

Thanks,
Kartik

From: Talat Uyarer [mailto:ta...@uyarer.com]
Sent: Thursday, September 25, 2014 5:04 PM
To: user@gora.apache.org
Cc: u...@nutch.apache.org
Subject: Re: Crawled data not inserting in the tables

Hi Kartik,

The 'problem' is with your mapping settings in gora-cassandra-mapping.xml. Please see the documentation [0], specifically relating to the values for 'gc_grace_seconds' and also 'ttl'. This will fix the problem.

Talat

[0] http://gora.apache.org/current/gora-cassandra.html

--
This message, and any attachments, is for the intended recipient(s) only, may contain information that is privileged, confidential and/or proprietary and subject to important terms and conditions available at http://www.bankofamerica.com/emaildisclaimer. If you are not the intended recipient, please delete this message.
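One mapping problem that can be ruled out mechanically is a field pointing at a column family that the keyspace never declares. The following is a minimal sketch of such a check, using only the Python standard library. The sample mapping is a trimmed, hypothetical excerpt in the same shape as the mapping in Kartik's message, and the function name `undeclared_families` is an illustrative helper, not part of Gora or Nutch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down mapping used only for illustration.
SAMPLE_MAPPING = """
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160">
    <family name="p"/>
    <family name="f"/>
    <family name="sc" type="super"/>
  </keyspace>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"
         keyspace="projectKeyspace">
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="headers" family="sc" qualifier="h"/>
  </class>
</gora-orm>
"""


def undeclared_families(mapping_xml):
    """Map each class name to the families its fields use but its keyspace lacks."""
    root = ET.fromstring(mapping_xml)
    # Column families declared under each <keyspace>, indexed by keyspace name.
    declared = {
        ks.get("name"): {fam.get("name") for fam in ks.findall("family")}
        for ks in root.findall("keyspace")
    }
    problems = {}
    for cls in root.findall("class"):
        known = declared.get(cls.get("keyspace"), set())
        used = {field.get("family", "") for field in cls.findall("field")}
        missing = sorted(used - known)
        if missing:
            problems[cls.get("name")] = missing
    return problems


if __name__ == "__main__":
    # An empty dict means every field refers to a declared column family.
    print(undeclared_families(SAMPLE_MAPPING))
```

An empty result only shows that the mapping is internally consistent. It says nothing about whether the keyspace on the cluster was actually created with those families.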
Re: Crawled data not inserting in the tables
Hi Kartik,

If TTL hasn't been set, or if it has been set to 0, then Gora is not using any TTL [1] and all your data should be persisted without any problems. Maybe this has something to do with the URL generating/fetching process? Could you determine during which step (generate/fetch/parse) the data is changing?

Thanks!
Renato M.

[1] https://github.com/apache/gora/blob/master/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/HectorUtils.java#L72
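One way to answer Renato's question is to record the set of row keys present in column family f after each job and compare consecutive snapshots. The sketch below covers only the comparison. How the keys are read from Cassandra is left to the caller, and the helper names and the sample keys are hypothetical.

```python
def diff_steps(snapshots):
    """Compare consecutive snapshots of row keys.

    snapshots is an ordered list of (step_name, iterable_of_row_keys).
    Returns one entry per transition, listing keys lost and gained.
    """
    report = []
    previous_name, previous_keys = None, None
    for name, keys in snapshots:
        keys = set(keys)
        if previous_keys is not None:
            report.append({
                "from": previous_name,
                "to": name,
                "lost": sorted(previous_keys - keys),
                "gained": sorted(keys - previous_keys),
            })
        previous_name, previous_keys = name, keys
    return report


def first_step_with_loss(snapshots):
    """Return the name of the first step after which keys went missing, or None."""
    for entry in diff_steps(snapshots):
        if entry["lost"]:
            return entry["to"]
    return None


if __name__ == "__main__":
    # Hypothetical key snapshots taken after each job.
    example = [
        ("inject", ["key-a", "key-b", "key-c"]),
        ("generate", ["key-a", "key-b", "key-c"]),
        ("fetch", ["key-a"]),
    ]
    print(first_step_with_loss(example))
```

If the key count is already short right after the inject job, the later jobs are not the place to look.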
Re: Crawled data not inserting in the tables
Can you also make sure that the cluster name and the fully qualified address and port agree between the mapping and gora.properties?

Thanks
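The host and port part of this check can be scripted. The sketch below compares the `host` attribute of a keyspace in the mapping with the `gora.cassandrastore.servers` property quoted later in the thread. It assumes the property may hold a comma-separated list, which is an assumption rather than something stated in the thread, and the function names are hypothetical. The cluster name is not compared, because the gora.properties shown in the thread contains no cluster-name entry.

```python
import xml.etree.ElementTree as ET


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def hosts_agree(mapping_xml, properties_text, keyspace):
    """Return (mapping_hosts, property_hosts, agree) for the named keyspace."""
    root = ET.fromstring(mapping_xml)
    mapping_hosts = {
        ks.get("host")
        for ks in root.findall("keyspace")
        if ks.get("name") == keyspace
    }
    servers = parse_properties(properties_text).get("gora.cassandrastore.servers", "")
    # Assumption: several servers may be listed, separated by commas.
    property_hosts = {s.strip() for s in servers.split(",") if s.strip()}
    agree = bool(mapping_hosts) and mapping_hosts <= property_hosts
    return mapping_hosts, property_hosts, agree


if __name__ == "__main__":
    mapping = (
        '<gora-orm><keyspace name="projectKeyspace" cluster="MultiTest" '
        'host="192.161.23.161:9160"/></gora-orm>'
    )
    properties = "gora.cassandrastore.servers=192.161.23.161:9160\n"
    print(hosts_agree(mapping, properties, "projectKeyspace"))
```

A mismatch here would mean Nutch writes to a different node or port than the one being queried, which would look like missing rows.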
Re: Crawled data not inserting in the tables
Hi,

So did you get this sorted out? I am unsure if you achieved persistence of data.

Thanks
Lewis

On Tuesday, September 30, 2014, Krishnanand, Kartik kartik.krishnan...@bankofamerica.com wrote:

Hi Lewis,

Thank you for replying. I apologize in advance for asking what might well be a stupid question. We are using the Crawler/InjectorJob/GeneratorJob/FetcherJob/ParserJob source code from the Nutch codebase without any modifications and calling the binary directly.

@Lewis: I used the datastax library directly to query the keyspace for that host and port combination. I was able to execute CQL queries programmatically and return the result sets. Pinging the hosts returns valid packets.

My gora.properties:

gora.datastore.autocreateschema=true
gora.CassandraStore.autocreateschema=true
gora.cassandrastore.servers=192.161.23.161:9160
gora.cassandrastore.username=username
gora.cassandrastore.password=password

They match the gora-cassandra-mapping.xml data. We are using Nutch 2.2.x for our purpose.
Re: Crawled data not inserting in the tables
Hi,

Did you get this sorted out?

Thanks
Lewis

On Thu, Sep 25, 2014 at 4:56 PM, Krishnanand, Kartik kartik.krishnan...@bankofamerica.com wrote:

Hi, Gora gurus,

I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora Cassandra mapping to store the crawled data. I can confirm that none of the 12 URLs is filtered out and that all are injected, but after running the generate, fetch and parse jobs there are only 3 entries in "column family" f. I am not sure what I am doing wrong. The logs have not yielded anything relevant. What should I be looking at? Any advice would be greatly appreciated.

Thanks,
Kartik
Crawled data not inserting in the tables
Hi, Gora gurus,

I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora Cassandra mapping to store the crawled data. I can confirm that none of the 12 URLs is filtered out and that all are injected, but after running the generate, fetch and parse jobs there are only 3 entries in column family f. I am not sure what I am doing wrong. The logs have not yielded anything relevant. What should I be looking at? Any advice would be greatly appreciated.

Thanks,
Kartik
Re: Crawled data not inserting in the tables
Hi Kartik,

The 'problem' is with your mapping settings in gora-cassandra-mapping.xml. Please see the documentation [0], specifically relating to the values for 'gc_grace_seconds' and also 'ttl'. This will fix the problem.

Talat

[0] http://gora.apache.org/current/gora-cassandra.html