Hi Talat,

I am afraid that I do not understand. We have set the “ttl” value to 0, which 
is the default value. We do not have any portions of data that need to be 
deleted. For now, I am using a single-node cluster, so for us the default 
gc_grace_seconds="0" would be a valid value.

Have I missed anything? My settings are as follows. Any suggestions would 
be greatly appreciated.

<gora-orm>

    <keyspace name="projectKeyspace" cluster="MultiTest" 
host="192.161.23.161:9160" 
placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
        <family name="p" />
        <family name="f"/>
        <family name="sc" type="super"/>

        <family name="mtdt" type="super"/>
        <family name="il" type="super"/>
        <family name="ol" type="super"/>
    </keyspace>

    <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage" 
keyspace="projectKeyspace ">

        <!-- fetch fields -->
        <field name="baseUrl" family="f" qualifier="bas"/>
        <field name="status" family="f" qualifier="st"/>
        <field name="prevFetchTime" family="f" qualifier="pts"/>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="fetchInterval" family="f" qualifier="fi"/>
        <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
        <field name="reprUrl" family="f" qualifier="rpr"/>
        <field name="content" family="f" qualifier="cnt"/>
        <field name="contentType" family="f" qualifier="typ"/>
        <field name="modifiedTime" family="f" qualifier="mod"/>
        <field name="prevModifiedTime" family="f" qualifier="pmod"/>
        <field name="batchId" family="f" qualifier="bid"/>

        <!-- parse fields -->
        <field name="title" family="p" qualifier="t"/>
        <field name="text" family="p" qualifier="c"/>
        <field name="signature" family="p" qualifier="sig"/>
        <field name="prevSignature" family="p" qualifier="psig"/>

        <!-- score fields -->
        <field name="score" family="f" qualifier="s"/>

        <!-- super columns -->
        <field name="headers" family="sc" qualifier="h"/>
        <field name="inlinks" family="sc" qualifier="il"/>
        <field name="outlinks" family="sc" qualifier="ol"/>
        <field name="metadata" family="sc" qualifier="mtdt"/>
        <field name="markers" family="sc" qualifier="mk"/>
        <field name="parseStatus" family="sc" qualifier="pas"/>
        <field name="protocolStatus" family="sc" qualifier="prs"/>
    </class>


    <class keyClass="java.lang.String" name="org.apache.nutch.storage.Host" 
keyspace="projectKeyspace ">
        <field name="metadata" family="mtdt" qualifier="mtdt"/>
        <field name="inlinks" family="il" qualifier="il"/>
        <field name="outlinks" family="ol" qualifier="ol"/>
    </class>

</gora-orm>

Thanks,

Kartik

From: Talat Uyarer [mailto:[email protected]]
Sent: Thursday, September 25, 2014 5:04 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Crawled data not inserting in the tables


Hi Kartik,

The 'problem' is with your mapping settings in gora-cassandra-mapping.xml. 
Please see the documentation [0], specifically relating to the values for 
'gc_grace_seconds' and also 'ttl'. This will fix the problem.
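
A minimal sketch of where these two attributes would sit in the mapping file, 
assuming the placement described in the documentation [0] (gc_grace_seconds on 
a family element, ttl on a field element). The names are taken from the mapping 
in this thread, and the values shown are placeholders, not recommendations.

```xml
<gora-orm>
    <keyspace name="projectKeyspace" cluster="MultiTest" host="192.161.23.161:9160">
        <!-- Assumption: gc_grace_seconds is set per column family; "0" is a placeholder -->
        <family name="f" gc_grace_seconds="0"/>
    </keyspace>

    <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"
           keyspace="projectKeyspace">
        <!-- Assumption: ttl is set per field; "0" is a placeholder -->
        <field name="baseUrl" family="f" qualifier="bas" ttl="0"/>
    </class>
</gora-orm>
```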

Talat

[0] http://gora.apache.org/current/gora-cassandra.html

Hi, Gora gurus,

I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora 
Cassandra mapping to store the crawled data.

I can confirm that none of the 12 URLs is filtered out and that all of them are 
injected, but after running the generate, fetch and parse jobs there are only 
3 entries in column family “f”.

I am not sure what I am doing wrong. The logs have not yielded anything 
relevant. What should I be looking at?

Any advice would be greatly appreciated.

Thanks,

Kartik
________________________________
This message, and any attachments, is for the intended recipient(s) only, may 
contain information that is privileged, confidential and/or proprietary and 
subject to important terms and conditions available at 
http://www.bankofamerica.com/emaildisclaimer. If you are not the intended 
recipient, please delete this message.
