RE: Crawled data not inserting in the tables

2014-09-30 Thread Krishnanand, Kartik
Hi, Talat

I am afraid that I do not understand.  We have set the "ttl" value to 0, which 
is the default value. We don't have any portions of data that need to be 
deleted.  For now, I am using a single-node cluster, so for us the default 
gc_grace_seconds="0" would be a valid value.

Have I missed anything? My settings are as follows. Any suggestions would 
be greatly appreciated.

<gora-orm>

  <keyspace name="projectKeyspace" cluster="MultiTest"
            host="192.161.23.161:9160"
            placement_strategy="org.apache.cassandra.locator.NetworkTopologyStrategy">
    <family name="p"/>
    <family name="f"/>
    <family name="sc" type="super"/>

    <family name="mtdt" type="super"/>
    <family name="il" type="super"/>
    <family name="ol" type="super"/>
  </keyspace>

  <class keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"
         keyspace="projectKeyspace">

    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <field name="prevModifiedTime" family="f" qualifier="pmod"/>
    <field name="batchId" family="f" qualifier="bid"/>

    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>

    <!-- score fields -->
    <field name="score" family="f" qualifier="s"/>

    <!-- super columns -->
    <field name="headers" family="sc" qualifier="h"/>
    <field name="inlinks" family="sc" qualifier="il"/>
    <field name="outlinks" family="sc" qualifier="ol"/>
    <field name="metadata" family="sc" qualifier="mtdt"/>
    <field name="markers" family="sc" qualifier="mk"/>
    <field name="parseStatus" family="sc" qualifier="pas"/>
    <field name="protocolStatus" family="sc" qualifier="prs"/>
  </class>

  <class keyClass="java.lang.String" name="org.apache.nutch.storage.Host"
         keyspace="projectKeyspace">
    <field name="metadata" family="mtdt" qualifier="mtdt"/>
    <field name="inlinks" family="il" qualifier="il"/>
    <field name="outlinks" family="ol" qualifier="ol"/>
  </class>

</gora-orm>

Thanks,

Kartik

This message, and any attachments, is for the intended recipient(s) only, may 
contain information that is privileged, confidential and/or proprietary and 
subject to important terms and conditions available at 
http://www.bankofamerica.com/emaildisclaimer. If you are not the intended 
recipient, please delete this message.



Re: Crawled data not inserting in the tables

2014-09-30 Thread Renato Marroquín Mogrovejo
Hi Kartik,

If TTL hasn't been set or if it has been set to 0, then Gora is not using
any TTL[1] and all your data should be persisted without any problems.
Maybe this has something to do with the URL generating/fetching process?
Could you determine during which step (generate/fetch/parse) the data is
changing?
Thanks!


Renato M.

[1]
https://github.com/apache/gora/blob/master/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/HectorUtils.java#L72
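
One way to confirm that no TTL is configured is to scan the mapping file for `ttl` attributes. The following is a minimal Python sketch and not part of Gora or Nutch. It assumes that `ttl` is written as an attribute on `field` elements in gora-cassandra-mapping.xml; the function name and the sample mapping (including the value 10) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mapping: one field carries a ttl, one does not.
SAMPLE = """
<gora-orm>
  <class keyClass="java.lang.String"
         name="org.apache.nutch.storage.WebPage"
         keyspace="projectKeyspace">
    <field name="baseUrl" family="f" qualifier="bas" ttl="10"/>
    <field name="status" family="f" qualifier="st"/>
  </class>
</gora-orm>
"""

def fields_with_ttl(mapping_xml):
    """Return {(class name, field name): ttl} for fields with a positive ttl."""
    root = ET.fromstring(mapping_xml)
    expiring = {}
    for cls in root.iter("class"):
        for field in cls.iter("field"):
            # A missing ttl attribute is treated as 0, i.e. no expiry.
            ttl = int(field.get("ttl", "0"))
            if ttl > 0:
                expiring[(cls.get("name"), field.get("name"))] = ttl
    return expiring

if __name__ == "__main__":
    print(fields_with_ttl(SAMPLE))
```

An empty result for the real mapping would be consistent with the point above that TTL is not in play, which shifts attention to the generate/fetch/parse steps.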


Re: Crawled data not inserting in the tables

2014-09-30 Thread Lewis John Mcgibbney
Can you also make sure that the cluster name and the fully qualified address
and port agree between the mapping file and gora.properties?
Thanks
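
A quick way to run this comparison is sketched below in Python. It is only an illustration: the helper names are hypothetical, and it compares just the `host` attribute of each `keyspace` element against the `gora.cassandrastore.servers` property, since those are the two settings that appear in this thread. No cluster-name property is assumed.

```python
import xml.etree.ElementTree as ET

# Sample inputs modeled on the settings quoted in this thread.
MAPPING = """
<gora-orm>
  <keyspace name="projectKeyspace" cluster="MultiTest"
            host="192.161.23.161:9160"/>
</gora-orm>
"""
PROPS = "gora.cassandrastore.servers=192.161.23.161:9160\n"

def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

def mapping_hosts(mapping_xml):
    """Return {keyspace name: host attribute} from a gora-orm mapping."""
    root = ET.fromstring(mapping_xml)
    return {ks.get("name"): ks.get("host") for ks in root.iter("keyspace")}

def hosts_agree(mapping_xml, properties_text):
    """True if every keyspace host equals gora.cassandrastore.servers."""
    servers = load_properties(properties_text).get("gora.cassandrastore.servers")
    return all(host == servers for host in mapping_hosts(mapping_xml).values())

if __name__ == "__main__":
    print(hosts_agree(MAPPING, PROPS))
```

Note that a mapping with no `keyspace` elements makes `hosts_agree` return True, so it is worth printing `mapping_hosts` as well when checking a real file.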


Re: Crawled data not inserting in the tables

2014-09-30 Thread Lewis John Mcgibbney
Hi,
So did you get this sorted out?
I am unsure if you achieved persistence of data.
Thanks
Lewis

On Tuesday, September 30, 2014, Krishnanand, Kartik 
kartik.krishnan...@bankofamerica.com wrote:

 Hi, Lewis

 Thank you for replying.  I apologize in advance for asking what might well
 be a stupid question.  We are using the
 Crawler/InjectorJob/GeneratorJob/FetcherJob/ParserJob source code from the
 Nutch codebase without any modifications and calling the binary directly.

 @Lewis: I used the datastax library directly to query the keyspace for
 that host and port combination. I was able to execute CQL queries
 programmatically and return the result sets. Pinging the hosts returns
 valid packets.  My gora.properties:

 gora.datastore.autocreateschema=true
 gora.CassandraStore.autocreateschema=true
 gora.cassandrastore.servers=192.161.23.161:9160
 gora.cassandrastore.username=username
 gora.cassandrastore.password=password

 They match the gora-cassandra-mapping.xml data.

 We are using Nutch 2.2.x for our purpose.

Re: Crawled data not inserting in the tables

2014-09-27 Thread Lewis John Mcgibbney
Hi,
Did you get this sorted out?
Thanks
Lewis

-- 
Lewis


Crawled data not inserting in the tables

2014-09-25 Thread Krishnanand, Kartik
Hi, Gora gurus,

I am trying to crawl URLs starting with 12 seed URLs. I am using the Gora 
Cassandra mapping to store the crawled data.

I can confirm that none of the 12 URLs is being filtered and that all of them 
are injected, but after running the generate, fetch and parse jobs there are 
only 3 entries in column family f.

I am not sure what I am doing wrong. The logs have not yielded anything 
relevant. What should I be looking at?

Any advice would be greatly appreciated.

Thanks,

Kartik



Re: Crawled data not inserting in the tables

2014-09-25 Thread Talat Uyarer
Hi Kartik,

The 'problem' is with your mapping settings in gora-cassandra-mapping.xml.
Please see the documentation [0], specifically the parts relating to the
values for 'gc_grace_seconds' and 'ttl'. This will fix the problem.

Talat

[0] http://gora.apache.org/current/gora-cassandra.html
