Can you also make sure that the cluster name and the fully qualified address
and port agree between the mapping file and gora.properties?
Thanks
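
A minimal sketch of that consistency check, in Python, under stated
assumptions: the property key gora.cassandrastore.servers is assumed to hold
the host:port list in gora.properties (adjust it if your setup uses a
different key), and the function names are hypothetical helpers, not part of
Gora or Nutch. It compares each keyspace "host" in the mapping against that
property and also flags any class whose "keyspace" attribute does not exactly
match a declared keyspace name (for example because of stray whitespace). The
cluster name is not compared here because the thread does not say which
property carries it.

```python
import xml.etree.ElementTree as ET


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_mapping(mapping_xml, properties_text,
                  servers_key="gora.cassandrastore.servers"):
    """Return a list of mismatch descriptions (empty if none were found).

    servers_key is an assumption about gora.properties, not a confirmed API.
    """
    problems = []
    root = ET.fromstring(mapping_xml)
    props = parse_properties(properties_text)
    servers = props.get(servers_key)

    declared = set()
    for ks in root.findall("keyspace"):
        name = ks.get("name", "")
        declared.add(name)
        host = ks.get("host", "")
        # Only compare hosts when the property is actually present.
        if servers is not None:
            known = [s.strip() for s in servers.split(",")]
            if host not in known:
                problems.append(
                    f"keyspace {name!r}: host {host!r} not in "
                    f"{servers_key}={servers!r}")

    for cls in root.findall("class"):
        ref = cls.get("keyspace", "")
        # Exact string comparison, so a trailing space counts as a mismatch.
        if ref not in declared:
            problems.append(
                f"class {cls.get('name')!r}: keyspace {ref!r} is not declared")
    return problems
```

Calling check_mapping() with the contents of gora-cassandra-mapping.xml and
gora.properties returns a list of strings; an empty list means this sketch
found no disagreement.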

On Tuesday, September 30, 2014, Renato Marroquín Mogrovejo <
[email protected]> wrote:

> Hi Kartik,
>
> If the TTL hasn't been set, or if it has been set to 0, then Gora does not
> apply any TTL [1] and all your data should be persisted without any problems.
> Maybe this has something to do with the URL generating/fetching process?
> Could you determine during which step (generate/fetch/parse) the data is
> changing?
> Thanks!
>
>
> Renato M.
>
> [1]
> https://github.com/apache/gora/blob/master/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/HectorUtils.java#L72
>
> 2014-09-30 10:00 GMT+02:00 Krishnanand, Kartik <[email protected]>:
>
>>  Hi, Talat
>>
>>
>>
>> I am afraid that I do not understand.  We have set the "ttl" value to 0,
>> which is the default value. We don't have any portions of data that need
>> to be deleted. For now, I am using a single-node cluster, so for us the
>> default gc_grace_seconds="0" would be a valid value.
>>
>>
>>
>> Have I missed anything? My settings are as follows. Any suggestions
>> would be greatly appreciated.
>>
>>
>>
>> <gora-orm>
>>
>>     <keyspace name="projectKeyspace" cluster="MultiTest"
>>               host="192.161.23.161:9160"
>>               placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
>>         <family name="p"/>
>>         <family name="f"/>
>>         <family name="sc" type="super"/>
>>
>>         <family name="mtdt" type="super"/>
>>         <family name="il" type="super"/>
>>         <family name="ol" type="super"/>
>>     </keyspace>
>>
>>     <class keyClass="java.lang.String"
>>            name="org.apache.nutch.storage.WebPage"
>>            keyspace="projectKeyspace ">
>>
>>         <!-- fetch fields -->
>>         <field name="baseUrl" family="f" qualifier="bas"/>
>>         <field name="status" family="f" qualifier="st"/>
>>         <field name="prevFetchTime" family="f" qualifier="pts"/>
>>         <field name="fetchTime" family="f" qualifier="ts"/>
>>         <field name="fetchInterval" family="f" qualifier="fi"/>
>>         <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
>>         <field name="reprUrl" family="f" qualifier="rpr"/>
>>         <field name="content" family="f" qualifier="cnt"/>
>>         <field name="contentType" family="f" qualifier="typ"/>
>>         <field name="modifiedTime" family="f" qualifier="mod"/>
>>         <field name="prevModifiedTime" family="f" qualifier="pmod"/>
>>         <field name="batchId" family="f" qualifier="bid"/>
>>
>>         <!-- parse fields -->
>>         <field name="title" family="p" qualifier="t"/>
>>         <field name="text" family="p" qualifier="c"/>
>>         <field name="signature" family="p" qualifier="sig"/>
>>         <field name="prevSignature" family="p" qualifier="psig"/>
>>
>>         <!-- score fields -->
>>         <field name="score" family="f" qualifier="s"/>
>>
>>         <!-- super columns -->
>>         <field name="headers" family="sc" qualifier="h"/>
>>         <field name="inlinks" family="sc" qualifier="il"/>
>>         <field name="outlinks" family="sc" qualifier="ol"/>
>>         <field name="metadata" family="sc" qualifier="mtdt"/>
>>         <field name="markers" family="sc" qualifier="mk"/>
>>         <field name="parseStatus" family="sc" qualifier="pas"/>
>>         <field name="protocolStatus" family="sc" qualifier="prs"/>
>>     </class>
>>
>>     <class keyClass="java.lang.String"
>>            name="org.apache.nutch.storage.Host"
>>            keyspace="projectKeyspace ">
>>         <field name="metadata" family="mtdt" qualifier="mtdt"/>
>>         <field name="inlinks" family="il" qualifier="il"/>
>>         <field name="outlinks" family="ol" qualifier="ol"/>
>>     </class>
>>
>> </gora-orm>
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Kartik
>>
>>
>>
>> *From:* Talat Uyarer [mailto:[email protected]]
>> *Sent:* Thursday, September 25, 2014 5:04 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: Crawled data not inserting in the tables
>>
>>
>>
>> Hi Kartik,
>>
>> The 'problem' is with your mapping settings in
>> gora-cassandra-mapping.xml. Please see the documentation [0], specifically
>> the parts relating to the values for 'gc_grace_seconds' and 'ttl'. This
>> will fix the problem.
>>
>> Talat
>>
>> [0] http://gora.apache.org/current/gora-cassandra.html
>>
>> Hi, Gora gurus,
>>
>>
>>
>> I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora
>> Cassandra mapping to store the crawled data.
>>
>>
>>
>> I can confirm that none of the 12 URLs are being filtered and that all of
>> them are injected, but after running the generate, fetch, and parse jobs,
>> there are only 3 entries in column family "f".
>>
>>
>>
>> I am not sure what I am doing wrong. The logs have not yielded anything
>> relevant. What should I be looking at?
>>
>>
>>
>> Any advice would be greatly appreciated.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Kartik
>>  ------------------------------
>>
>> This message, and any attachments, is for the intended recipient(s) only,
>> may contain information that is privileged, confidential and/or proprietary
>> and subject to important terms and conditions available at
>> http://www.bankofamerica.com/emaildisclaimer. If you are not the
>> intended recipient, please delete this message.
>>   ------------------------------
>>
>
>

-- 
*Lewis*
