Hi,

So did you get this sorted out? I am unsure if you achieved persistence of data.

Thanks
Lewis

On Tuesday, September 30, 2014, Krishnanand, Kartik <[email protected]> wrote:

> Hi, Lewis
>
> Thank you for replying. I apologize in advance for asking what might well
> be a stupid question. We are using the
> Crawler/InjectorJob/GeneratorJob/FetcherJob/ParserJob source code from the
> Nutch codebase without any modifications and calling the binary directly.
>
> @Lewis: I used the datastax library directly to query the keyspace for
> that host and port combination. I was able to execute CQL queries
> programmatically and return the result sets. Pinging the hosts returns
> valid packets. My gora.properties:
>
> gora.datastore.autocreateschema=true
> gora.CassandraStore.autocreateschema=true
> gora.cassandrastore.servers=192.161.23.161:9160
> gora.cassandrastore.username=<username>
> gora.cassandrastore.password=<password>
>
> They match with gora-cassandra-mapping.xml data.
>
> We are using Nutch 2.2.x for our purpose.
>
> From: Lewis John Mcgibbney [mailto:[email protected]]
> Sent: Tuesday, September 30, 2014 8:19 AM
> To: [email protected]
> Cc: Nutch Users; Kothuvatiparambil, Viju; Krishnanand, Kartik
> Subject: Re: Crawled data not inserting in the tables
>
> Can you also make sure that the cluster name and fully qualified address
> and port agree between the mapping and gora.properties?
>
> Thanks
>
> On Tuesday, September 30, 2014, Renato Marroquín Mogrovejo
> <[email protected]> wrote:
>
> Hi Kartik,
>
> If TTL hasn't been set or if it has been set to 0, then Gora is not using
> any TTL [1] and all your data should be persisted without any problems.
> Maybe this has something to do with the URL generating/fetching process?
> Could you determine during which process the data is changing?
> (generate/fetch/parse)
>
> Thanks!
>
> Renato M.
>
> [1] https://github.com/apache/gora/blob/master/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/HectorUtils.java#L72
>
> 2014-09-30 10:00 GMT+02:00 Krishnanand, Kartik <[email protected]>:
>
> Hi, Talat
>
> I am afraid that I do not understand. We have set the “ttl” value to 0,
> which is the default value. We don’t have any portions of data that
> need to be deleted. For now, I am using a single-node cluster, so for us the
> gc_grace_seconds=”0” default value would be a valid value.
>
> Have I missed anything? My settings are as follows. Any suggestions
> would be greatly appreciated.
>
> <gora-orm>
>
>   <keyspace name="projectKeyspace" cluster="MultiTest"
>             host="192.161.23.161:9160"
>             placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
>     <family name="p"/>
>     <family name="f"/>
>     <family name="sc" type="super"/>
>
>     <family name="mtdt" type="super"/>
>     <family name="il" type="super"/>
>     <family name="ol" type="super"/>
>   </keyspace>
>
>   <class keyClass="java.lang.String"
>          name="org.apache.nutch.storage.WebPage" keyspace="projectKeyspace ">
>
>     <!-- fetch fields -->
>     <field name="baseUrl" family="f" qualifier="bas"/>
>     <field name="status" family="f" qualifier="st"/>
>     <field name="prevFetchTime" family="f" qualifier="pts"/>
>     <field name="fetchTime" family="f" qualifier="ts"/>
>     <field name="fetchInterval" family="f" qualifier="fi"/>
>     <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
>     <field name="reprUrl" family="f" qualifier="rpr"/>
>     <field name="content" family="f" qualifier="cnt"/>
>     <field name="contentType" family="f" qualifier="typ"/>
>     <field name="modifiedTime" family="f" qualifier="mod"/>
>     <field name="prevModifiedTime" family="f" qualifier="pmod"/>
>     <field name="batchId" family="f" qualifier="bid"/>
>
>     <!-- parse fields -->
>     <field name="title" family="p" qualifier="t"/>
>     <field name="text" family="p" qualifier="c"/>
>     <field name="signature" family="p" qualifier="sig"/>
>     <field name="prevSignature" family="p" qualifier="psig"/>
>
>     <!-- score fields -->
>     <field name="score" family="f" qualifier="s"/>
>
>     <!-- super columns -->
>     <field name="headers" family="sc" qualifier="h"/>
>     <field name="inlinks" family="sc" qualifier="il"/>
>     <field name="outlinks" family="sc" qualifier="ol"/>
>     <field name="metadata" family="sc" qualifier="mtdt"/>
>     <field name="markers" family="sc" qualifier="mk"/>
>     <field name="parseStatus" family="sc" qualifier="pas"/>
>     <field name="protocolStatus" family="sc" qualifier="prs"/>
>   </class>
>
>   <class keyClass="java.lang.String"
>          name="org.apache.nutch.storage.Host" keyspace="projectKeyspace ">
>     <field name="metadata" family="mtdt" qualifier="mtdt"/>
>     <field name="inlinks" family="il" qualifier="il"/>
>     <field name="outlinks" family="ol" qualifier="ol"/>
>   </class>
>
> </gora-orm>
>
> Thanks,
>
> Kartik
>
> From: Talat Uyarer [mailto:[email protected]]
> Sent: Thursday, September 25, 2014 5:04 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Crawled data not inserting in the tables
>
> Hi Kartik,
>
> The 'problem' is with your mapping settings in gora-cassandra-mapping.xml.
> Please see the documentation [0], specifically relating to the values for
> 'gc_grace_seconds' and also 'ttl'. This will fix the problem.
>
> Talat
>
> [0] http://gora.apache.org/current/gora-cassandra.html
>
> Hi, Gora gurus,
>
> I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora
> Cassandra mapping to store the crawled data.
>
> I can confirm that none of the 12 URLs are being filtered and all are
> injected, but after running the generate, fetch and parse jobs, there are
> only 3 entries in “column family” f.
>
> I am not sure what I am doing wrong. The logs have not yielded anything
> relevant. What should I be looking at?
>
> Any advice would be gratefully appreciated.
>
> Thanks,
>
> Kartik
> ------------------------------
> This message, and any attachments, is for the intended recipient(s) only,
> may contain information that is privileged, confidential and/or proprietary
> and subject to important terms and conditions available at
> http://www.bankofamerica.com/emaildisclaimer. If you are not the intended
> recipient, please delete this message.
>
> --
> Lewis

--
Lewis
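
The suggestion in the thread to make sure the address and port agree between the mapping and gora.properties can be checked mechanically. Below is a minimal sketch in Python, assuming the two files have the layout quoted above. The function names `parse_properties` and `check_gora_config` and the trimmed sample strings are hypothetical and not part of Gora or Nutch. The sketch compares `gora.cassandrastore.servers` with the `host` attribute of each `<keyspace>`, and also checks that every `<class keyspace=...>` matches a declared keyspace name character for character. Whether Gora itself trims whitespace in these attributes is not confirmed here, so a reported mismatch is a hint to look closer, not a diagnosis.

```python
import xml.etree.ElementTree as ET


def parse_properties(text):
    """Parse simple key=value lines (no escapes or line continuations)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_gora_config(properties_text, mapping_xml):
    """Return a list of human-readable mismatches between the two files."""
    problems = []
    servers = parse_properties(properties_text).get("gora.cassandrastore.servers")
    root = ET.fromstring(mapping_xml)

    keyspaces = root.findall("keyspace")
    declared = {ks.get("name") for ks in keyspaces}

    # The host attribute of each keyspace should equal the servers property.
    for ks in keyspaces:
        if ks.get("host") != servers:
            problems.append(
                f"keyspace {ks.get('name')!r}: host {ks.get('host')!r} "
                f"!= servers {servers!r}"
            )

    # Each class should reference a declared keyspace, compared exactly
    # (so stray whitespace inside the attribute value is reported).
    for cls in root.findall("class"):
        ref = cls.get("keyspace")
        if ref not in declared:
            problems.append(
                f"class {cls.get('name')} references unknown keyspace {ref!r}"
            )
    return problems


# Trimmed-down copies of the settings quoted in the thread (illustrative only).
SAMPLE_PROPERTIES = """\
gora.datastore.autocreateschema=true
gora.cassandrastore.servers=192.161.23.161:9160
"""

SAMPLE_MAPPING = """\
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160">
    <family name="f"/>
  </keyspace>
  <class keyClass="java.lang.String"
         name="org.apache.nutch.storage.WebPage" keyspace="projectKeyspace ">
    <field name="status" family="f" qualifier="st"/>
  </class>
</gora-orm>
"""

if __name__ == "__main__":
    for problem in check_gora_config(SAMPLE_PROPERTIES, SAMPLE_MAPPING):
        print(problem)
```

Run against the real gora.properties and gora-cassandra-mapping.xml contents, an empty result means the host and keyspace names agree. With the sample above, the trailing space in the class-level `keyspace` attribute is reported because it does not exactly equal the declared keyspace name.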

