Hi Kartik,

If the TTL hasn't been set, or has been set to 0, then Gora does not apply any TTL [1] and all your data should be persisted without problems. Maybe this has something to do with the URL generating/fetching process? Could you determine during which job (generate, fetch, or parse) the data changes?

Thanks!
Renato M.

[1] https://github.com/apache/gora/blob/master/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/HectorUtils.java#L72

2014-09-30 10:00 GMT+02:00 Krishnanand, Kartik <[email protected]>:

> Hi, Talat
>
> I am afraid that I do not understand. We have set the “ttl” value to 0,
> which is the default value. We don’t have any need portions of data that
> needs to be deleted. For now, I am using a single node cluster, for us the
> gc_grace_seconds=”0” default value would be a valid value.
>
> Have I missed out anything? My settings are as follows. Any suggestions
> would be greatly appreciated.
>
> <gora-orm>
>
>   <keyspace name="projectKeyspace" cluster="MultiTest"
>             host="192.161.23.161:9160"
>             placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
>     <family name="p"/>
>     <family name="f"/>
>     <family name="sc" type="super"/>
>
>     <family name="mtdt" type="super"/>
>     <family name="il" type="super"/>
>     <family name="ol" type="super"/>
>   </keyspace>
>
>   <class keyClass="java.lang.String"
>          name="org.apache.nutch.storage.WebPage" keyspace="projectKeyspace ">
>
>     <!-- fetch fields -->
>     <field name="baseUrl" family="f" qualifier="bas"/>
>     <field name="status" family="f" qualifier="st"/>
>     <field name="prevFetchTime" family="f" qualifier="pts"/>
>     <field name="fetchTime" family="f" qualifier="ts"/>
>     <field name="fetchInterval" family="f" qualifier="fi"/>
>     <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
>     <field name="reprUrl" family="f" qualifier="rpr"/>
>     <field name="content" family="f" qualifier="cnt"/>
>     <field name="contentType" family="f" qualifier="typ"/>
>     <field name="modifiedTime" family="f" qualifier="mod"/>
>     <field name="prevModifiedTime" family="f" qualifier="pmod"/>
>     <field name="batchId" family="f" qualifier="bid"/>
>
>     <!-- parse fields -->
>     <field name="title" family="p" qualifier="t"/>
>     <field name="text" family="p" qualifier="c"/>
>     <field name="signature" family="p" qualifier="sig"/>
>     <field name="prevSignature" family="p" qualifier="psig"/>
>
>     <!-- score fields -->
>     <field name="score" family="f" qualifier="s"/>
>
>     <!-- super columns -->
>     <field name="headers" family="sc" qualifier="h"/>
>     <field name="inlinks" family="sc" qualifier="il"/>
>     <field name="outlinks" family="sc" qualifier="ol"/>
>     <field name="metadata" family="sc" qualifier="mtdt"/>
>     <field name="markers" family="sc" qualifier="mk"/>
>     <field name="parseStatus" family="sc" qualifier="pas"/>
>     <field name="protocolStatus" family="sc" qualifier="prs"/>
>   </class>
>
>   <class keyClass="java.lang.String"
>          name="org.apache.nutch.storage.Host" keyspace="projectKeyspace ">
>     <field name="metadata" family="mtdt" qualifier="mtdt"/>
>     <field name="inlinks" family="il" qualifier="il"/>
>     <field name="outlinks" family="ol" qualifier="ol"/>
>   </class>
>
> </gora-orm>
>
> Thanks,
>
> Kartik
>
> *From:* Talat Uyarer [mailto:[email protected]]
> *Sent:* Thursday, September 25, 2014 5:04 PM
> *To:* [email protected]
> *Cc:* [email protected]
> *Subject:* Re: Crawled data not inserting in the tables
>
> Hi Kartik,
>
> The 'problem' is with your mapping settings in gora-cassandra-mapping.xml.
> Please see the documentation [0], specifically relating to the values for
> 'gc_grace_seconds' and also 'ttl'. This will fix the problem
>
> Talat
>
> [0] http://gora.apache.org/current/gora-cassandra.html
>
> Hi, Gora gurus,
>
> I am trying to crawl URLS starting with 12 seed URLs. I am using the GORA
> Cassandra mapping to store the crawled data.
>
> I can confirm that all 12 URLs are not being filtered and are injected,
> but after running the generate, fetch and parse jobs. There are only 3
> entries in “column family” f.
>
> I am not sure what I am doing wrong. The logs have not yielded anything
> relevant. What should I be looking at?
>
> Any advice would be gratefully appreciated.
>
> Thanks,
>
> Kartik
> ------------------------------
> This message, and any attachments, is for the intended recipient(s) only,
> may contain information that is privileged, confidential and/or proprietary
> and subject to important terms and conditions available at
> http://www.bankofamerica.com/emaildisclaimer. If you are not the intended
> recipient, please delete this message.
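
For readers following the TTL discussion in this thread, the rule Renato describes (an unset or zero TTL means the column never expires) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: it is not the code from HectorUtils, the class `TtlSketch` and the method `effectiveTtl` are hypothetical names, and treating negative values the same as zero is an assumption made here for simplicity.

```java
import java.util.OptionalInt;

/** Hypothetical helper for illustration only; not part of Gora or Hector. */
final class TtlSketch {

    /**
     * Interprets the raw value of a "ttl" mapping attribute.
     * A missing, blank, or non-positive value means "no expiry";
     * a positive value is taken as the column lifetime in seconds.
     */
    static OptionalInt effectiveTtl(String ttlAttribute) {
        if (ttlAttribute == null || ttlAttribute.trim().isEmpty()) {
            return OptionalInt.empty(); // attribute absent: no TTL applied
        }
        int ttl = Integer.parseInt(ttlAttribute.trim());
        // Assumption: zero (the default) and negative values disable expiry.
        return ttl > 0 ? OptionalInt.of(ttl) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        for (String raw : new String[] {null, "0", "60"}) {
            OptionalInt ttl = effectiveTtl(raw);
            System.out.println("ttl=" + raw + " -> "
                + (ttl.isPresent()
                    ? "expires after " + ttl.getAsInt() + "s"
                    : "never expires"));
        }
    }
}
```

Under this reading, a mapping like the one quoted above, which declares no "ttl" attribute on any field, would not cause rows to expire, which is why the missing rows are more likely to come from the generate, fetch, or parse jobs.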

