Hi, Talat
I am afraid I do not understand. We have set the “ttl” value to 0, which is
the default. We do not have any portions of data that need to be deleted. For
now I am using a single-node cluster, so the default gc_grace_seconds="0"
should be a valid value for us.
Have I missed anything? My settings are as follows. Any suggestions would be
greatly appreciated.
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest"
            host="192.161.23.161:9160"
            placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
    <family name="p"/>
    <family name="f"/>
    <family name="sc" type="super"/>
    <family name="mtdt" type="super"/>
    <family name="il" type="super"/>
    <family name="ol" type="super"/>
  </keyspace>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"
         keyspace="projectKeyspace ">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <field name="prevModifiedTime" family="f" qualifier="pmod"/>
    <field name="batchId" family="f" qualifier="bid"/>
    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>
    <!-- score fields -->
    <field name="score" family="f" qualifier="s"/>
    <!-- super columns -->
    <field name="headers" family="sc" qualifier="h"/>
    <field name="inlinks" family="sc" qualifier="il"/>
    <field name="outlinks" family="sc" qualifier="ol"/>
    <field name="metadata" family="sc" qualifier="mtdt"/>
    <field name="markers" family="sc" qualifier="mk"/>
    <field name="parseStatus" family="sc" qualifier="pas"/>
    <field name="protocolStatus" family="sc" qualifier="prs"/>
  </class>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.Host"
         keyspace="projectKeyspace ">
    <field name="metadata" family="mtdt" qualifier="mtdt"/>
    <field name="inlinks" family="il" qualifier="il"/>
    <field name="outlinks" family="ol" qualifier="ol"/>
  </class>
</gora-orm>
Thanks,
Kartik
From: Talat Uyarer [mailto:[email protected]]
Sent: Thursday, September 25, 2014 5:04 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Crawled data not inserting in the tables
Hi Kartik,
The 'problem' is with your mapping settings in gora-cassandra-mapping.xml.
Please see the documentation [0], specifically relating to the values for
'gc_grace_seconds' and also 'ttl'. This will fix the problem.
Talat
[0] http://gora.apache.org/current/gora-cassandra.html
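For readers following the thread, here is a minimal sketch of where these two attributes might sit in gora-cassandra-mapping.xml. It assumes the placement described in the documentation [0]: 'gc_grace_seconds' on each <family> element and 'ttl' on each <field> element. The keyspace, cluster and host are copied from Kartik's mapping. The values of 0 are only the ones he reports using, shown as placeholders and not as a recommended fix.

```xml
<!-- Illustrative sketch only: attribute placement is assumed from the Gora
     Cassandra documentation; the values are placeholders, not a recommendation. -->
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160">
    <!-- gc_grace_seconds is assumed to be declared per column family -->
    <family name="f" gc_grace_seconds="0"/>
  </keyspace>
  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"
         keyspace="projectKeyspace">
    <!-- ttl is assumed to be declared per field -->
    <field name="baseUrl" family="f" qualifier="bas" ttl="0"/>
  </class>
</gora-orm>
```

The mapping posted above declares neither attribute explicitly, so whatever defaults the Gora version in use applies would take effect.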
Hi, Gora gurus,
I am trying to crawl URLs starting from 12 seed URLs. I am using the Gora
Cassandra mapping to store the crawled data.
I can confirm that none of the 12 URLs is being filtered and that all of them
are injected, but after running the generate, fetch and parse jobs there are
only 3 entries in column family “f”.
I am not sure what I am doing wrong. The logs have not yielded anything
relevant. What should I be looking at?
Any advice would be greatly appreciated.
Thanks,
Kartik
________________________________
This message, and any attachments, is for the intended recipient(s) only, may
contain information that is privileged, confidential and/or proprietary and
subject to important terms and conditions available at
http://www.bankofamerica.com/emaildisclaimer. If you are not the intended
recipient, please delete this message.