RE: Read latency issue

2012-10-03 Thread Roshni Rajagopal

Hi Arindam,
There were some changes in CQL3 for composite key storage, and you may be 
using CQL2 by default. You could try a non-composite key, or supply all the 
components of the key in the search, and see if you get different results.
Regards,
Roshni

 

 From: aba...@247-inc.com
 To: user@cassandra.apache.org
 Subject: RE: Read latency issue
 Date: Wed, 3 Oct 2012 17:53:46 +
 
 
 Thanks for your responses.
 
 Just to be clear our table declaration looks something like this:
 CREATE TABLE sessionevents (
   atag text,
   col2 uuid,
   col3 text,
   col4 uuid,
   col5 text,
   col6 text,
   col7 blob,
   col8 text,
   col9 timestamp,
   col10 uuid,
   col11 int,
   col12 uuid,
   PRIMARY KEY (atag, col2, col3, col4)
 )
 
 My understanding was that the (full) row key in this case would be the 'atag' 
 values. The column names would then be composites like 
 (col2_value:col3_value:col4_value:col5), (col2_value:col3_value:col4_value:col6), 
 (col2_value:col3_value:col4_value:col7) ... (col2_value:col3_value:col4_value:col12). 
 The columns would be sorted first by col2 values, then by col3 values, etc.
 
 Hence with a query like select * from sessionevents where atag=foo, we are 
 specifying the entire row key, and Cassandra would return all the columns for 
 that row.
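For illustration, the component-by-component ordering described above can be sketched in Python by modelling each composite column name as a tuple; the values below are made-up placeholders, not the real schema's data:

```python
# Composite column names modelled as tuples: Cassandra's composite
# comparator orders them component by component, which is exactly
# Python's tuple ordering. All values below are made-up placeholders.
columns = {
    ("uuid-b", "x", "uuid-1", "col5"): "v1",
    ("uuid-a", "y", "uuid-2", "col6"): "v2",
    ("uuid-a", "x", "uuid-1", "col7"): "v3",
}

# Sorting the names reproduces the on-disk column order: first by the
# col2 component, then col3, then col4, then the trailing column name.
ordered = sorted(columns)
print(ordered)
```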
 
  Using read consistency of ONE reduces the read latency by ~20ms, compared 
  to using QUORUM.
 It would only have read from the local node. (I think, may be confusing 
 secondary index reads here).
 For read consistency ONE, reading only from one node is my expectation as 
 well, and hence I'm seeing the reduced read latency compared to read 
 consistency QUORUM. Does that not sound right?
 Btw, with read consistency ONE, we found the reading only happens from one 
 node, but not necessarily the local node, even if the data is present on the 
 local node. To check this, we turned on DEBUG logs on all the Cassandra hosts 
 in the ring. We are using replication factor=3 on a 4 node ring, hence mostly 
 the data is present locally. However, we noticed that the coordinator host, on 
 receiving the same request multiple times (i.e. with the same row key), would 
 sometimes return the data locally, but sometimes would contact another host 
 in the ring to fetch the data.
 
 Thanks,
 Arindam
 
 -Original Message-
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Wednesday, October 03, 2012 12:32 AM
 To: user@cassandra.apache.org
 Subject: Re: Read latency issue
 
  Running a query like select * from table_name where atag=foo, 
  where 'atag' is the first column of the composite key, from either JDBC or 
  Hector (equivalent code), results in read times of 200-300ms from a remote 
  host on the same network. 
 
 If you send a query to select columns from a row and do not fully specify the 
 row key cassandra has to do a row scan. 
 
 If you want fast performance specify the full row key. 
 
  Using read consistency of ONE reduces the read latency by ~20ms, compared 
  to using QUORUM.
 It would only have read from the local node. (I think, may be confusing 
 secondary index reads here).
  
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 3/10/2012, at 2:17 AM, Roshni Rajagopal roshni_rajago...@hotmail.com 
 wrote:
 
  Arindam,
  
  Did you also try the cassandra stress tool & compare results?
  
  I haven't done a performance test yet; the only ones published on 
  the internet are of YCSB on an older version of Apache Cassandra, and it 
  doesn't seem to be actively supported or updated: 
  http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf.
  
  The numbers you report sound very slow for a read of a row by key, which 
  should be the fastest kind of read. I hope someone can help investigate or 
  share numbers from their tests.
  
   
  
  Regards,
  Roshni
   
  
   From: dean.hil...@nrel.gov
   To: user@cassandra.apache.org
   Date: Tue, 2 Oct 2012 06:41:09 -0600
   Subject: Re: Read latency issue
   
   Interesting results. With PlayOrm, we did a 6 node test of reading 100 
   rows from 1,000,000 using PlayOrm Scalable SQL. It only took 60ms. Maybe 
   we have better hardware though??? We are using 7200 RPM drives so nothing 
   fancy on the disk side of things. More nodes gives higher throughput, 
   though, as reading from more disks will be faster. Anyway, you may want 
   to play with more nodes and re-run. If you run a test with PlayOrm, I 
   would love to know the results there as well.
   
   Later,
   Dean
   
   From: Arindam Barua aba...@247-inc.com
   Reply-To: user@cassandra.apache.org
   Date: Monday, October 1, 2012 4:57 PM
   To: user@cassandra.apache.org
   Subject: Read latency issue
   

RE: Data Modeling: Comments with Voting

2012-10-01 Thread Roshni Rajagopal

Hi, 
To explain my suggestions, my thoughts were: 
a) You need to store entity-type (master) information about a comment, like date 
created, comment text, commented by, etc. I can't think of any other master 
information for a comment, but in general one starts with entities in a standard 
static column family. If you store an entity in a dynamic, denormalized form, 
then if any master data changes you would need to iterate across all rows and 
update it, which is expensive in Cassandra. Here the comment text is editable. 
b) So when a comment is created it goes to the static column family. Also an 
entry is made in the dynamic sort_by_time_list column family, with the time 
created as the column name. I didn't suggest that (a) and (c) be clubbed, so that 
master information remains in one place. The other approach would be to store a 
comment as JSON in the column value; however, if you need to update the comment 
text, it would be hard to identify the comment column and update it. 
c) When a comment gets a vote, the counter column family is incremented to track 
the number of votes for the comment. Also, to sort by number of votes, after 
incrementing the counter you write the current number of votes and the comment 
id into column family (d). But I see now that you also need to delete the old 
(number of votes, comment id) column and add a new column with the current 
number of votes and comment id. It would then be sorted by number of votes.
If there are many ways to sort, it's better to do it in the application to avoid 
having a new column family for each type of sort; however, I'm not certain which 
approach would perform better over time and volume. Sorting can be complex - see 
aaron's blog post: http://thelastpickle.com/2012/08/18/Sorting-Lists-For-Humans/ 
 
Welcome any feedback on my suggestions.
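To make steps (c) and (d) concrete, here is a minimal in-memory sketch, with plain Python structures standing in for the counter and sort_by_votes_list column families (the names are illustrative, not a driver API):

```python
# In-memory stand-ins for the column families described above.
vote_counts = {}       # counter CF: comment_id -> number of votes
sort_by_votes = set()  # 'sort_by_votes_list' row: {(votes, comment_id)}

def upvote(comment_id):
    old = vote_counts.get(comment_id, 0)
    new = old + 1
    vote_counts[comment_id] = new             # (c) increment the counter
    sort_by_votes.discard((old, comment_id))  # (d) delete the old column...
    sort_by_votes.add((new, comment_id))      # ...and add the new one

for _ in range(3):
    upvote("comment-1")
upvote("comment-2")

# Reading the row in reverse comparator order gives comments by votes.
leaderboard = [cid for _, cid in sorted(sort_by_votes, reverse=True)]
print(leaderboard)
```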


From: aa...@thelastpickle.com
Subject: Re: Data Modeling: Comments with Voting
Date: Tue, 2 Oct 2012 10:39:42 +1300
To: user@cassandra.apache.org

You cannot (and probably do not want to) sort continually when the voting is 
going on. 
You can store the votes using CounterColumnTypes in column values. When someone 
votes you then (somehow) queue a job that will read the vote counts for the 
post / comment, pivot and sort on the vote count, and then write the updated 
leader board to cassandra. 
Alternatively if you have a small number of comments for a post just read all 
the votes and sort them as part of the read. 
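The read-time alternative in the last paragraph can be sketched as follows; the vote row is hypothetical data, not driver code:

```python
# Hypothetical vote row for one post: comment_id -> vote count.
votes = {"comment 1": 5, "comment 2": 1, "comment 3": 0, "comment 4": 10}

def comments_by_votes(vote_row):
    # Pivot on the count, highest first; comment id breaks ties.
    return sorted(vote_row, key=lambda cid: (-vote_row[cid], cid))

print(comments_by_votes(votes))
```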
Cheers  

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com



On 30/09/2012, at 8:25 AM, Drew Kutcharian d...@venarc.com wrote:

Thanks Roshni,
I'm not sure how #d will work when users are actually voting on a comment. What 
happens when two users vote on the same comment simultaneously? How do you 
update the entries in the #d column family to prevent duplicates?
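A rough in-memory sketch of the race being asked about, assuming both clients do a read-then-write against the #d row (illustrative only):

```python
# Two clients vote on the same comment concurrently. Both read the
# same old count before either writes, so the update based on that
# stale read loses one increment. Plain variables model the CFs.
count = 5
index = {(5, "c1")}  # the #d row: (votes, comment_id) columns

read_a = count  # client A reads the current count
read_b = count  # client B reads it too, before A has written

# Client A applies its vote.
count = read_a + 1
index.discard((read_a, "c1"))
index.add((read_a + 1, "c1"))

# Client B applies its vote, unaware of A's update.
count = read_b + 1
index.discard((read_b, "c1"))
index.add((read_b + 1, "c1"))

print(count, sorted(index))  # a lost update: 6 instead of 7
```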
 Also #a and #c can be combined together using TimeUUID as comment ids.
- Drew



On Sep 27, 2012, at 2:13 AM, Roshni Rajagopal roshni_rajago...@hotmail.com 
wrote:





Hi Drew,
I think you have 4 requirements. Here are my suggestions.
a) Store comments: have a static column family for comments with master data 
like created date, created by, length, etc. 
b) When a person votes for a comment, increment a vote counter: have a counter 
column family for incrementing the votes for each comment. 
c) Display comments sorted by date created: have a column family with a dummy 
row id 'sort_by_time_list'; column names can be date created (TimeUUID), and 
the column value can be the comment id. 
d) Display comments sorted by number of votes: have a column family with a dummy 
row id 'sort_by_votes_list'; column names can be a composite of number of votes 
and comment id (as more than one comment can have the same votes).

Regards,
Roshni

 Date: Wed, 26 Sep 2012 17:36:13 -0700
 From: k...@mustardgrain.com
 To: user@cassandra.apache.org
 CC: d...@venarc.com
 Subject: Re: Data Modeling: Comments with Voting
 
 Depending on your needs, you could simply duplicate the comments in two 
 separate CFs with the column names including time in one and the vote in 
 the other. If you allow for updates to the comments, that would pose 
 some issues you'd need to solve at the app level.
 
 On 9/26/12 4:28 PM, Drew Kutcharian wrote:
  Hi Guys,
 
  Wondering what would be the best way to model a flat (no sub comments, i.e. 
  twitter) comments list with support for voting (where I can sort by create 
  time or votes) in Cassandra?
 
  To demonstrate:
 
  Sorted by create time:
  - comment 1 (5 votes)
  - comment 2 (1 votes)
  - comment 3 (no votes)
  - comment 4 (10 votes)
 
  Sorted by votes:
  - comment 4 (10 votes)
  - comment 1 (5 votes)
  - comment 2 (1 votes)
  - comment 3 (no votes)
 
  It's the sorted-by-votes that I'm having a bit of a trouble with. I'm 
  looking for a roll-your-own approach and prefer not to use secondary 
  indexes and CQL sorting.
 
  Thanks,
 
  Drew
 
 
  

  

RE: Data Modeling: Comments with Voting

2012-09-27 Thread Roshni Rajagopal

Hi Drew,
I think you have 4 requirements. Here are my suggestions.
a) Store comments: have a static column family for comments with master data 
like created date, created by, length, etc. 
b) When a person votes for a comment, increment a vote counter: have a counter 
column family for incrementing the votes for each comment. 
c) Display comments sorted by date created: have a column family with a dummy 
row id 'sort_by_time_list'; column names can be date created (TimeUUID), and 
the column value can be the comment id. 
d) Display comments sorted by number of votes: have a column family with a dummy 
row id 'sort_by_votes_list'; column names can be a composite of number of votes 
and comment id (as more than one comment can have the same votes).

Regards,
Roshni


RE: 1.1.5 Missing Insert! Strange Problem

2012-09-26 Thread Roshni Rajagopal

By any chance, is a TTL (time to live) set on the columns?

Date: Tue, 25 Sep 2012 19:56:19 -0700
Subject: 1.1.5 Missing Insert! Strange Problem
From: gouda...@gmail.com
To: user@cassandra.apache.org

Hi All,
I have a 4 node cluster setup in 2 zones with NetworkTopology strategy and 
strategy options for writing a copy to each zone, so the effective load on each 
machine is 50%.

Symptom: I have a column family that has gc grace seconds of 10 days (the 
default). On the 17th there was an insert done to this column family, and from 
our application logs I can see that the client got a successful response back 
with write consistency of ONE. I can verify the existence of the key that was 
inserted in the commit logs of both replicas; however, it seems this record was 
never inserted. I used list to get all the column family rows, which were about 
800-ish, and examined them to see if the record could possibly have been deleted 
by our application. list should have shown it to me, since I have not gone beyond 
gc grace seconds if this record was deleted during the past days. I could not find it. 

Things happened: During the same time as this insert was happening, I was 
performing a rolling upgrade of Cassandra from 1.1.3 to 1.1.5 by taking one 
node down at a time, performing the package upgrade, restarting the service, 
and going to the next node. I could see from system.log that some mutations 
were replayed during those restarts, so I suppose the memtables were not 
flushed before restart. 


Could this procedure cause the row insert to disappear? How could I troubleshoot 
this? I am running out of ideas.
Your help is greatly appreciated.


Cheers,
Arya

RE: Cassandra Counters

2012-09-25 Thread Roshni Rajagopal

Thanks for the reply, and sorry for being bull-headed.
Once you're past the stage where you've decided it's distributed, and NoSQL, and 
Cassandra out of all the NoSQL options, you still need to count something, and 
you can do it in different ways in Cassandra. In all of them you want to use 
Cassandra's best features of availability, tunable consistency, partition 
tolerance, etc.
Given this, what are the performance tradeoffs of using counters vs. a standard 
column family for counting? Because as I see it, if the number in a counter 
column family becomes wrong, it will not be 'eventually consistent' - you will 
need intervention to correct it. So the key aspect is how much faster a counter 
column family would be, and at what numbers we start seeing a difference.




Date: Tue, 25 Sep 2012 07:57:08 +0200
Subject: Re: Cassandra Counters
From: oleksandr.pet...@gmail.com
To: user@cassandra.apache.org

Maybe I'm missing the point, but counting in a standard column family would be 
a little overkill. 
I assume that distributed counting here was more of a map/reduce approach, 
where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a lot. We're 
doing some more complex counting (e.g. based on sets of rules) like that. Of 
course, that would perform _way_ slower than counting beforehand. On the other 
side, you will always have a consistent result for a consistent dataset.

On the other hand, if you use things like AMQP or Storm (sorry to put up my 
sentence together like that, as tools are mostly either orthogonal or 
complementary, but I hope you get my point), you could build a topology that 
makes fault-tolerant writes independently of your original write. Of course, it 
would still have a consistency tradeoff, mostly because of race conditions and 
different network latencies etc.  

So I would say that building a data model in a distributed system often depends 
more on your problem than on the common patterns, because everything has a 
tradeoff. 
Want an immediate result? Modify your counter while writing the row.
Can you sacrifice speed, but want more counting opportunities? Go with offline 
distributed counting.
Want kind of both? Dispatch a message and react upon it, keeping the processing 
logic and writes decoupled from the main application, allowing you to care less 
about speed.

However, I may have missed the point somewhere (early morning, you know), so I 
may be wrong in any given statement.
Cheers

On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal 
roshni_rajago...@hotmail.com wrote:





Thanks Milind,
Has anyone implemented counting in a standard column family in Cassandra, where 
you can have increments and decrements to the count? Any comparisons in 
performance with counter column families? 

Regards,
Roshni

Date: Mon, 24 Sep 2012 11:02:51 -0700
Subject: RE: Cassandra Counters
From: milindpar...@gmail.com

To: user@cassandra.apache.org

IMO

You would use Cassandra Counters (or other variation of distributed counting) 
in case of having determined that a centralized version of counting is not 
going to work.

You'd determine the non-feasibility of centralized counting by figuring the 
speed at which you need to sustain writes and reads and reconciling that with 
your hard disk seek times (essentially).
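As a back-of-the-envelope version of that feasibility check (the workload and seek-time numbers below are assumptions, not measurements):

```python
# If every counter update costs roughly one random disk seek, a single
# centralized counter tops out near 1000 / seek_ms updates per second.
seek_ms = 10.0                            # assumed seek time, 7200 RPM class
max_updates_per_sec = 1000.0 / seek_ms    # ~100 seek-bound updates/sec
required_updates_per_sec = 5000           # hypothetical write load

centralized_ok = required_updates_per_sec <= max_updates_per_sec
print(max_updates_per_sec, centralized_ok)
```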

Once you have proved that you can't do centralized counting, the second layer 
of arsenal comes into play; which is distributed counting.

With distributed counting, the CAP theorem comes into play. In Cassandra, 
Availability and Partition tolerance trump Consistency. 

So yes, you sacrifice strong consistency for availability and partition 
tolerance; you get eventual consistency.

On Sep 24, 2012 10:28 AM, Roshni Rajagopal roshni_rajago...@hotmail.com 
wrote:






Hi folks,
   I looked at my mail below and I'm rambling a bit, so I'll try to re-state my 
queries pointwise. 
a) What are the performance tradeoffs on reads & writes between creating a 
standard column family and manually doing the counts by a lookup on a key, 
versus using counters? 

b) What's the current state of counter limitations in the latest version of 
Apache Cassandra?
c) With there being a possibility of counter values getting out of sync, would 
counters not be recommended where strong consistency is desired? The normal 
benefits of Cassandra's tunable consistency would not be applicable, as 
re-tries may cause overstating. So the normal use case is high performance, 
where consistency is not paramount.


Regards,
Roshni


From: roshni_rajago...@hotmail.com
To: user@cassandra.apache.org


Subject: Cassandra Counters
Date: Mon, 24 Sep 2012 16:21:55 +0530





Hi,
I'm trying to understand if counters are a good fit for my use case. I've watched 
http://blip.tv/datastax/counters-in-cassandra-5497678 many times over now...

and still need help!
Suppose I have a list of items, to which I can add or delete a set of items at 
a time, and I want a count of the items, without considering changing the 
database

RE: Cassandra Counters

2012-09-24 Thread Roshni Rajagopal

Hi folks,
   I looked at my mail below and I'm rambling a bit, so I'll try to re-state my 
queries pointwise. 
a) What are the performance tradeoffs on reads & writes between creating a 
standard column family and manually doing the counts by a lookup on a key, 
versus using counters? 
b) What's the current state of counter limitations in the latest version of 
Apache Cassandra?
c) With there being a possibility of counter values getting out of sync, would 
counters not be recommended where strong consistency is desired? The normal 
benefits of Cassandra's tunable consistency would not be applicable, as 
re-tries may cause overstating. So the normal use case is high performance, 
where consistency is not paramount.
Regards,
Roshni


From: roshni_rajago...@hotmail.com
To: user@cassandra.apache.org
Subject: Cassandra Counters
Date: Mon, 24 Sep 2012 16:21:55 +0530





Hi,
I'm trying to understand if counters are a good fit for my use case. I've watched 
http://blip.tv/datastax/counters-in-cassandra-5497678 many times over now... and 
still need help!
Suppose I have a list of items, to which I can add or delete a set of items at 
a time, and I want a count of the items, without considering changing the 
database or additional components like ZooKeeper. I have 2 options: the first 
is a counter column family, and the second is a standard one.











 
 
  1. List_Counter_CF

            | TotalItems
     -------+-----------
     ListId | 50

  2. List_Std_CF

            | TimeUUID1 | TimeUUID2 | TimeUUID3 | TimeUUID4 | TimeUUID5
     -------+-----------+-----------+-----------+-----------+----------
     ListId | 3         | 70        | -20       | 3         | -6
 


And in the second I can add a new column with every set of items added or 
deleted. Over time this row may grow wide. To display the final count, I'd need 
to read the row, slice through all the columns, and add them up.
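The read path for this second option can be sketched like this, using the List_Std_CF values above (each column value is a signed delta):

```python
# One wide row from List_Std_CF: column name -> signed item delta.
row = {
    "TimeUUID1": 3,
    "TimeUUID2": 70,
    "TimeUUID3": -20,
    "TimeUUID4": 3,
    "TimeUUID5": -6,
}

# Slice the whole row and add the deltas in application code.
total_items = sum(row.values())
print(total_items)
```
The result matches the 50 stored in List_Counter_CF, but the cost of the read grows with the number of delta columns.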
In both cases the writes should be fast; in fact the standard column family 
should be faster, as there's no read before write. And for a CL ONE write the 
latency should be the same. For reads, the first option is very good: just read 
one column for a key.
For the second, the read involves reading the row and adding each column value 
via application code. I don't think there's a way to do math via CQL yet. There 
should be no hot-spotting if the key is sharded well. I could even maintain 
the count derived from List_Std_CF in a separate standard column family holding 
the final number, but I could do that as a separate process immediately after 
the write to List_Std_CF completes, so that it's not blocking. I understand 
Cassandra is faster for writes than reads, but how slow would reading by row 
key be? Is there any number for how many columns it takes before performance 
starts deteriorating, or how much worse the performance would be? 
The advantage I see is that I can use the same consistency rules as for the 
rest of the column families: if quorum for reads & writes, then you get strongly 
consistent values. In the case of counters I see that on timeout exceptions, 
because the first replica is down or not responding, there's a chance of the 
values getting messed up, and re-trying can mess them up further. It's not 
idempotent the way a standard column family design can be.
If it gets messed up, it would need an administrator's help (is there a document 
on how we could resolve counter values going wrong?)
I believe the rest of the limitations still hold good - has anything changed in 
recent versions? In my opinion, they are not as major as the consistency 
question.
- removing a counter & then modifying the value: behaviour is undetermined
- special process for counter column family sstable loss (need to remove all files)
- no TTL support
- no secondary indexes

In short, I can recommend counters for analytics, or when dealing with data 
where the exact numbers are not important, or when it's ok to take some time to 
fix a mismatch and the performance requirements are most important. However, 
where the numbers must match, it's better to use a standard column family and a 
manual implementation.
Please share your thoughts on this.
Regards,
Roshni
  

RE: Cassandra Counters

2012-09-24 Thread Roshni Rajagopal

Thanks Milind,
Has anyone implemented counting in a standard column family in Cassandra, where 
you can have increments and decrements to the count? Any comparisons in 
performance with counter column families? 
Regards,
Roshni


Data Model - Consistency question

2012-09-19 Thread Roshni Rajagopal

Hi Folks,
In the relational world, if I needed to model a students/courses relationship, I 
may have done:
- a students master table
- a courses master table
- a bridge table, students-courses, which gives me the ids of students and the 
courses they are taking. This can answer both 'which students take course A' 
and 'which courses are taken by student B'.
In the Cassandra world, I may design it like this:
- a static student column family
- a static course column family
- a student-course column family with student id as key and a dynamic list of 
course ids, to answer 'which courses are taken by student B'
- a course-student column family with course id as key and a dynamic list of 
student ids, to answer 'which students take course A'
A screen which displays some student entity details as well as all the courses 
she is taking will need to refer to 2 column families.
Suppose an application inserts a new row in the student column family and a new 
row in the student-course column family. As transactions or consistency across 
column families are not guaranteed, there is a chance that the client receives 
information from the student-course column family that a student is attending a 
course, but the student does not exist in the student column family. 
If we use strong consistency from the reads + writes combination, will this 
scenario not occur? And if we don't, can this scenario occur? 
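A tiny in-memory sketch of the anomaly, with two dictionaries standing in for the two column families and no transaction spanning them:

```python
# Two separate writes, one per column family, with no cross-CF
# transaction. A reader arriving between them sees a course entry
# that references a student row which does not exist yet.
students = {}        # student column family
student_course = {}  # student-course column family

# The application writes the student-course row first...
student_course.setdefault("s1", []).append("course-A")

# ...and a read lands here, before the student row is written.
dangling = "s1" in student_course and "s1" not in students

students["s1"] = {"name": "Ada"}  # the student row arrives later

print(dangling)  # the anomaly is about write ordering across CFs,
                 # which quorum reads/writes alone cannot fix
```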





Regards,
Roshni

Solr Use Cases

2012-09-19 Thread Roshni Rajagopal

Hi,
I'm new to Solr, and I hear that Solr is a great tool for improving search 
performance. I'm unsure whether Solr or DSE Search is a must for all Cassandra 
deployments.
1. For performance - I thought Cassandra had great read & write performance; 
when should Solr be used? Taking the following use cases for Cassandra from the 
DataStax FAQ page, in which cases would Solr be useful, and would it be for all?
- Time series data management
- High-velocity device data ingestion and analysis
- Media streaming (e.g., music, movies)
- Social media input and analysis
- Online web retail (e.g., shopping carts, user transactions)
- Web log management / analysis
- Web click-stream analysis
- Real-time data analytics
- Online gaming (e.g., real-time messaging)
- Write-intensive transaction systems
- Buyer event analytics
- Risk analysis and management
2. What changes does Solr bring to Cassandra data modeling? We have some 
guidelines & best practices around Cassandra data modeling. Is Solr so powerful 
that it does not matter how data is modelled in Cassandra? Are there different 
best practices for Cassandra data modeling when Solr is in the picture? Is this 
something we should keep in mind while modeling for Cassandra today - that it 
should be good to be used via Solr in future?
3. Does Solr come with any drawbacks, like not being real time? 
I can & should read the manual, but it would be great if someone could explain 
at a high level. 
Thank you!

Regards,
Roshni

Data Model

2012-09-13 Thread Roshni Rajagopal

I want to learn how we can model a mix of static and dynamic columns in a 
family.
Consider a course_students column family which gives a list of students for a 
course, with:
row key: Course Id
columns: Name, Teach_Nm, StudID1, StudID2, StudID3
values: Maths, Prof. Abc, 20, 21, 25 (where 20, 21, 25 are IDs of students)
We have fixed columns like Course Name and Teacher Name, and a dynamic number 
of columns like 'StudID1', 'StudID2', etc. My thought was that we could look 
for 'StudID' and get all the columns with the student ids in Hector. But the 
question was how we would determine the number for the column; e.g. to add 
StudID3 we need to read the row, identify that 2 students are there, and see 
that this is the third one.


So we can remove the number in the column name altogether and keep columns 
like Course Name, Teacher Name, Student:20, Student:21, Student:25, where the 
second part is the actual student id. However, here we run into the second 
issue: we cannot have some columns of one composite format and some of 
another when we use static column families - all columns would need to be in 
the format UTF8:integer. We may want to treat it as a composite column key 
and not use a delimiter - to get sorting, to validate the types of the parts 
of the key, and to avoid having to search for the delimiter and separate the 
2 components manually.
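To make the composite-vs-delimiter point concrete, here is a small Python sketch, where a plain dict of tuple keys stands in for a Cassandra row's sorted column map. This is illustrative only; the course, teacher and student values are the made-up ones from the example above.

```python
# Sketch (assumption: a Python dict of tuple keys stands in for a row's
# sorted column map). Composite column names are modeled as tuples, which
# sort component by component - no delimiter to parse, and each component
# keeps its own type.

row = {}

# Static columns: one-component names.
row[("CourseName",)] = "Maths"
row[("TeacherName",)] = "Prof. Abc"

# Dynamic columns: ("Student", <int id>) composites; the id stays an int.
for student_id in (25, 20, 21):
    row[("Student", student_id)] = None

# Cassandra returns columns in comparator order; sorted() mimics that.
columns = sorted(row)

# A "slice" for all students is a contiguous range of the sorted names.
students = [name[1] for name in columns if name[0] == "Student"]
print(students)  # -> [20, 21, 25], sorted numerically, not as text
```

With a delimited string like "Student:20" the ids would sort as text ("Student:100" before "Student:20") and every read would have to split on the delimiter by hand.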


A third option is to put only data in the column name for students, like 
Course Name, Teacher Name, 20, 21, 25 - but it would be difficult to identify 
that columns with names 20, 21, 25 actually stand for student IDs - a bit 
unreadable.


I hope this is not confusing, and would like to hear your thoughts on this. 
The question is around when you de-normalize and want to have some static 
info like name, and a dynamic list - what's the best way to model this?
Regards,
Roshni

Re: Data Modelling Suggestions

2012-08-24 Thread Roshni Rajagopal
Thank you Aaron & Guillermo,

I find composite columns very confusing :(
To reconfirm,

 1.  We can only search for a column range using the first component of the 
composite column.
 2.  After specifying a range for the first component, we cannot further 
filter on the second component. I found this link 
http://doanduyhai.wordpress.com/2012/07/05/apache-cassandra-tricks-and-traps/ 
which seems to suggest filtering is possible by the second component in 
addition to the first, and I tried the same example, but I couldn't get it to 
work. Does anyone have an example? Suppose I have data like this in my column 
names:

Timestamp1: 123, Timestamp2: 456, Timestamp3: 777, Timestamp4: 654 - get a 
range of columns from (start) component1 = timestamp1, component2 = 123 to 
(end) component1 = timestamp3, component2 = 123 - this should give me only 
one column.
I'm finding that only the first component is used… is this understanding 
correct?
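The contiguous-range behaviour being asked about here (and confirmed by Aaron below) can be sketched in Python, with a sorted list of (timestamp, value) tuples standing in for the composite column names. Timestamps are simplified to the ints 1-4; this is only an illustration of the slicing semantics, not driver code.

```python
# Sketch (assumption: bisect over a sorted column list stands in for a
# Cassandra slice query; timestamps are ints 1..4 for readability).
import bisect

# Column names are (timestamp, value) composites, kept in comparator order.
columns = sorted([(1, 123), (2, 456), (3, 777), (4, 654)])

# A slice is ONE contiguous range: everything between start and end.
start, end = (1, 123), (3, 123)
lo = bisect.bisect_left(columns, start)
hi = bisect.bisect_right(columns, end)
result = columns[lo:hi]

# The second component does NOT act as a filter inside the range:
print(result)  # -> [(1, 123), (2, 456)] -- (2, 456) is included even
               #    though its second component is not 123
```

So the query in the example returns two columns, not one: the end bound (timestamp3, 123) only decides where the range stops, it does not filter the columns in between by their second component.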


We see a lot of examples of time-series modelling with TimeUUIDs as column 
names. But how does the updating or deletion of columns happen there - how 
are the columns found, to know which ones to delete or modify? Does one 
always need a separate column family to handle updating/deletion for time 
series, or is it usually handled by setting a TTL for data outside the 
archival period, or does time-series modelling usually not involve any 
manipulation of past records?
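One way to picture the TTL option mentioned above is a toy sketch where each column carries its own expiry time. Plain ints stand in for TimeUUIDs, and the archival period of 100 "seconds" is made up for the example.

```python
# Sketch (assumption: a dict with per-column expiry times stands in for
# TTL'd columns; timestamps are plain ints, not TimeUUIDs).
ARCHIVE_PERIOD = 100  # keep 100 "seconds" of data (illustrative value)

def insert(row, now, value):
    # Each write records when the column should expire, like a TTL.
    row[now] = (value, now + ARCHIVE_PERIOD)

def live_columns(row, now):
    # Cassandra drops expired columns on read/compaction; this filter
    # mimics that: only columns whose expiry is in the future survive.
    return {t: v for t, (v, expires) in row.items() if expires > now}

row = {}
insert(row, 10, "event-a")
insert(row, 50, "event-b")
insert(row, 140, "event-c")

print(sorted(live_columns(row, 145)))  # -> [50, 140]; t=10 has expired
```

The appeal of the TTL approach is exactly that no past record is ever looked up or deleted explicitly - old columns simply age out.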

Regards,
Roshni



From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Data Modelling Suggestions

I was trying to find Hector examples where we search for the second column in 
a composite column, but I couldn't find any good one. I'm not sure if it's 
possible… if you do have an example please share.
It's not. When slicing columns you can only return one contiguous range.

Anyway I would prefer storing the item-ids as column names in the main column 
family and having a second CF for the order-by-date query only with the pair 
timestamp_itemid. That way you can add later other query strategies without 
messing with how you store the item
+1
Have the orders somewhere, and build a time ordered custom index to show them 
in order.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/08/2012, at 6:28 AM, Guillermo Winkler gwink...@inconcertcc.com wrote:

I think you need another CF as index.

user_itemid - timestamped column_name

Otherwise you can't guess what's the timestamp to use in the column name.

Anyway I would prefer storing the item-ids as column names in the main column 
family and having a second CF for the order-by-date query only with the pair 
timestamp_itemid. That way you can add later other query strategies without 
messing with how you store the item information.
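The two-CF pattern Guillermo describes can be sketched in Python, with two dicts standing in for the items CF and the time-ordered index CF. The item ids and timestamps are made up; the point is that an update or delete never has to search inside a composite name - it looks the item up by id and then removes the one stale index column.

```python
# Sketch (assumption: two dicts stand in for the two column families --
# an items CF keyed by item id, plus a time-ordered index CF whose
# column names are (timestamp, item_id) composites).
items = {}        # item_id -> {"qty": ..., "added": ...}
time_index = {}   # (timestamp, item_id) -> None, kept in comparator order

def add_or_update(item_id, qty, now):
    old = items.get(item_id)
    if old is not None:
        # Update in place; drop the stale index column first.
        del time_index[(old["added"], item_id)]
    items[item_id] = {"qty": qty, "added": now}
    time_index[(now, item_id)] = None

def delete(item_id):
    old = items.pop(item_id)
    del time_index[(old["added"], item_id)]

add_or_update("nutella", 1, now=10)
add_or_update("flour", 2, now=20)
add_or_update("nutella", 5, now=30)   # re-add bumps it to the newest slot
delete("flour")

print(sorted(time_index))  # -> [(30, 'nutella')] -- one entry per item
```

Reading sorted(time_index) gives the order-by-date view, and other query strategies can be added later as further index CFs without touching the items CF.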

Maybe you can solve it with a secondary index by timestamp too.

Guille


On Thu, Aug 23, 2012 at 7:26 AM, Roshni Rajagopal 
roshni.rajago...@wal-mart.com wrote:
Hi,

Need some help on a data modelling question. We're using Hector & DataStax 
Enterprise 2.1.


I want to associate a list of items for a user. It should be sorted on the time 
added. And items can be updated (quantity of the item can be changed), and 
items can be deleted.
I can model it like this so that it's denormalized and I get all my 
information in one go from one row, sorted by time added. I can use composite 
columns.

Row key: User Id
Column Name: TimeUUID:item ID: Item Name: Item Description: Item Price: Item Qty
Column Value : Null

Now, how do I handle manipulations?

 1.  Add new item: easy, just a new column.
 2.  Add existing item or modify qty: I want to get to the correct column to 
update. Can I search by the second column in the composite column (equals 
condition) and update the column name itself to reflect the new TimeUUID and 
qty? Or would it be better to just add it as a new column, always use the 
latest column for an item in the application code, and delete duplicates in 
the background?
 3.  Delete item: can I search by the second column in the composite column 
to find the correct column to delete?

I was trying to find Hector examples where we search for the second column in 
a composite column, but I couldn't find any good one. I'm not sure if it's 
possible… if you do have an example please share.

Regards,
Roshni


This email and any files transmitted with it are confidential and intended 
solely for the individual or entity to whom they are addressed. If you have 
received this email in error destroy it immediately. *** Walmart Confidential 
***



Data Modeling- another question

2012-08-24 Thread Roshni Rajagopal
Hi,

Suppose I have a column family to associate a user with a dynamic list of 
items. I want to store 5-10 key pieces of information about the item; there 
are no specific sorting requirements.
I have two options

A) use composite columns
UserId1 : {
  itemid1:Name = Betty Crocker,
  itemid1:Descr = Cake,
  itemid1:Qty = 5,
  itemid2:Name = Nutella,
  itemid2:Descr = Choc spread,
  itemid2:Qty = 15
}

B) use a JSON value with the data
UserId1 : {
  itemid1 = {name: Betty Crocker, descr: Cake, qty: 5},
  itemid2 = {name: Nutella, descr: Choc spread, qty: 15}
}

Which do you suggest would be better?
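One trade-off between the two options is the write path for updates; a rough Python sketch, where dicts stand in for the row and json stands in for option B's serialized blob (the item values are the made-up ones from the question):

```python
# Sketch (assumption: dicts stand in for rows; only the update path for
# one field, Qty, is compared between the two layouts).
import json

# Option A: composite columns -- updating Qty overwrites ONE column.
row_a = {("itemid1", "Name"): "Betty Crocker",
         ("itemid1", "Descr"): "Cake",
         ("itemid1", "Qty"): 5}
row_a[("itemid1", "Qty")] = 7          # blind write, no read needed

# Option B: one JSON blob per item -- updating Qty means read, decode,
# mutate, re-encode, and write the whole value back.
row_b = {"itemid1": json.dumps({"name": "Betty Crocker",
                                "descr": "Cake", "qty": 5})}
doc = json.loads(row_b["itemid1"])
doc["qty"] = 7
row_b["itemid1"] = json.dumps(doc)

print(row_a[("itemid1", "Qty")], json.loads(row_b["itemid1"])["qty"])
```

Option A allows a blind single-column write per field, while option B forces a read-modify-write of the whole blob (and risks lost updates under concurrency); option B in exchange keeps each item in one opaque value.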


Regards,
Roshni



Re: Secondary index partially created

2012-08-24 Thread Roshni Rajagopal
What does 'list my_column_family' in the CLI show on all the nodes?
Perhaps the syntax you're using isn't correct? You should be getting the
same data on all the nodes irrespective of which node's CLI you use.
The replication factor is for redundancy - having copies of the data on
different nodes to help if nodes go down. Even if you had a replication
factor of 1 you should still get the same data from all nodes.



On 24/08/12 11:05 PM, Richard Crowley r...@rcrowley.org wrote:

On Thu, Aug 23, 2012 at 6:54 PM, Richard Crowley r...@rcrowley.org wrote:
 I have a three-node cluster running Cassandra 1.0.10.  In this cluster
 is a keyspace with RF=3.  I *updated* a column family via Astyanax to
 add a column definition with an index on that column.  Then I ran a
 backfill to populate the column in every row.  Then I tried to query
 the index from Java and it failed but so did cassandra-cli:

 get my_column_family where my_column = 'my_value';

 Two out of the three nodes are unable to query the new index and throw
 this error:

 InvalidRequestException(why:No indexed columns present in index
 clause with operator EQ)

 The third is able to query the new index happily but doesn't find any
 results, even when I expect it to.

This morning the one node that's able to query the index is also able
to produce the expected results.  I'm a dummy and didn't use science
so I don't know if the `nodetool compact` I ran across the cluster had
anything to do with it.  Regardless, it did not change the situation
in any other way.


 `describe cluster;` in cassandra-cli confirms that all three nodes
 have the same schema and `show schema;` confirms that schema includes
 the new column definition and its index.

 The my_column_family.my_index-hd-* files only exist on that one node
 that can query the index.

 I ran `nodetool repair` on each node and waited for `nodetool
 compactionstats` to report zero pending tasks.  Ditto for `nodetool
 compact`.  The nodes that failed still fail.  The node that succeeded
 still succeed.

 Can anyone shed some light?  How do I convince it to let me query the
 index from any node?  How do I get it to find results?

 Thanks,

 Richard



Data Modelling Suggestions

2012-08-23 Thread Roshni Rajagopal
Hi,

Need some help on a data modelling question. We're using Hector & DataStax 
Enterprise 2.1.


I want to associate a list of items for a user. It should be sorted on the time 
added. And items can be updated (quantity of the item can be changed), and 
items can be deleted.
I can model it like this so that it's denormalized and I get all my 
information in one go from one row, sorted by time added. I can use composite 
columns.

Row key: User Id
Column Name: TimeUUID:item ID: Item Name: Item Description: Item Price: Item Qty
Column Value : Null

Now, how do I handle manipulations?

 1.  Add new item: easy, just a new column.
 2.  Add existing item or modify qty: I want to get to the correct column to 
update. Can I search by the second column in the composite column (equals 
condition) and update the column name itself to reflect the new TimeUUID and 
qty? Or would it be better to just add it as a new column, always use the 
latest column for an item in the application code, and delete duplicates in 
the background?
 3.  Delete item: can I search by the second column in the composite column 
to find the correct column to delete?

I was trying to find Hector examples where we search for the second column in 
a composite column, but I couldn't find any good one. I'm not sure if it's 
possible… if you do have an example please share.

Regards,
Roshni




Re: Decision Making- YCSB

2012-08-10 Thread Roshni Rajagopal
  Thanks Edward and Mohit.

   We do have an in-house tool, but it tests pretty much the same thing as 
YCSB - read and write performance given a number of threads and type of 
operations as input.
The good thing here is that we own the code and we can modify it easily. YCSB 
does not seem to be very well supported.

When you say you modify the tests for your use case, what exactly do you 
modify? Could you give me an example of a use-case-driven approach?

Regards,
Roshni

From: Mohit Anchlia mohitanch...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Decision Making- YCSB

I agree with Edward. We always develop our own stress tool that tests each use 
case of interest. Every use case is different in certain ways that can only be 
tested using custom stress tool.

On Fri, Aug 10, 2012 at 7:25 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
There are many YCSB forks on GitHub that get optimized for specific
databases, but the default one is decent across the defaults. Cassandra
has its own internal stress tool that we like better.

The shortcomings are that generic tools and generic workloads are
generic and thus not real-world. But other than that, being able to
tweak the workload percentages and change the read patterns
(latest/random/etc.) does a decent job of stressing normal and worst-case
scenarios on the read path. Still, I would try to build my own real-world
use case as a tool to evaluate a solution before making a choice.

Edward

On Thu, Aug 9, 2012 at 8:58 PM, Roshni Rajagopal 
roshni.rajago...@wal-mart.com wrote:
 Hi Folks,

 I'm coming up with a set of decision criteria on when to choose traditional 
 RDBMS vs various NoSQL options.
 So one aspect is the  application requirements around Consistency, 
 Availability, Partition Tolerance, Scalability, Data Modeling etc. These can 
 be decided at a theoretical level.

 Once we are sure we need NoSQL, to effectively benchmark the performance 
 around use-cases or application workloads, we need a standard method.
 Some tools are specific to a database, like Cassandra's stress tool. The 
 only tool I could find which seems to compare across NoSQL databases, can be 
 extended, and is freely available is YCSB.

 Is YCSB updated for latest versions of cassandra and hbase? Does it work for 
 Datastax enterprise? Is it regularly updated for new versions of NoSQL 
 databases, or is this something we would need to take up as a development 
 effort?

 Are there any shortcomings to using YCSB - and would it be preferable to 
 develop our own tool for performance benchmarking of NoSQL systems? Do share 
 your thoughts.


 Regards,
 Roshni




Decision Making- YCSB

2012-08-09 Thread Roshni Rajagopal
Hi Folks,

I'm coming up with a set of decision criteria on when to choose traditional 
RDBMS vs various NoSQL options.
So one aspect is the  application requirements around Consistency, 
Availability, Partition Tolerance, Scalability, Data Modeling etc. These can be 
decided at a theoretical level.

Once we are sure we need NoSQL, to effectively benchmark the performance around 
use-cases or application workloads, we need a standard method.
Some tools are specific to a database, like Cassandra's stress tool. The only 
tool I could find which seems to compare across NoSQL databases, can be 
extended, and is freely available is YCSB.

Is YCSB updated for latest versions of cassandra and hbase? Does it work for 
Datastax enterprise? Is it regularly updated for new versions of NoSQL 
databases, or is this something we would need to take up as a development 
effort?

Are there any shortcomings to using YCSB - and would it be preferable to 
develop our own tool for performance benchmarking of NoSQL systems? Do share 
your thoughts.


Regards,
Roshni



Re: Project Management

2012-08-07 Thread Roshni Rajagopal
Hi Baskar,

The key aspect here is that you have to think of your queries and
denormalize. Here are my suggestions based on my understanding so far.

You seem to have 2 queries:
A) what all users do I have
B) what organizations do the users belong to

The first can be a static column family - these are similar to RDBMS
'master data' or 'dimensions' in the DWH world.
So you can have a Users_CF column family where the row key is the primary
key - userid, say. If using email id as the primary key, choose something
which will never change (the natural key vs surrogate key debate).

The second query is where the real power of the data model comes in. You
would not have a separate organizations table with a foreign key to the
users table.
You would have a column family, say Organizations_Users_CF, with a row key
corresponding to your 'where clause' needs - here, organization name. And
then you can have a dynamic list of user names corresponding to each
organization as column names. One organization can have 3 users (3 cols),
another can have 10 (10 cols).
Note it would automatically be sorted by username when you retrieve a row,
because the comparator is BytesType by default, which works for text sorting.
If you want some other sort criterion, like say last time logged in, keep
that as the column name, with the username as the column value. Column names
can also store some useful information, like a value in itself.
Sorting is a design-time decision.
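The sorting behaviour described above can be sketched in Python, with dicts standing in for rows and sorted() standing in for comparator order. The org names, usernames and login timestamps are all made up for the illustration.

```python
# Sketch (assumption: a dict-of-column-maps stands in for the
# Organizations_Users_CF described above; all names are illustrative).
org_users = {
    "acme":   {"carol": None, "alice": None, "bob": None},
    "globex": {"dave": None},
}

# Retrieving a row returns columns in comparator order -> usernames sorted.
print(sorted(org_users["acme"]))  # -> ['alice', 'bob', 'carol']

# To sort by last login instead, make the column NAME (last_login, user)
# and keep the username in the name's second component.
by_login = {(1700, "carol"): None, (1650, "alice"): None, (1710, "bob"): None}
print([u for _, u in sorted(by_login)])  # -> ['alice', 'carol', 'bob']
```

The second row illustrates "sorting is a design-time decision": changing the sort order means changing what goes into the column name, not adding an ORDER BY at query time.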


I think there have been numerous posts advising against using secondary
indexes, so try to keep the key of the col family as what you would be
searching for, as far as possible.

If you have a different query, you can create a new column family - it's OK
to denormalize and have a separate column family per query.
 

Regards,
Roshni

On 06/08/12 9:42 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Cassandra modeling is well documented on the web and a bit too complex
to be explained in one mail.

I advise you to read a lot before you make modeling choices.

You may start with these links :

http://www.datastax.com/docs/1.1/ddl/about-data-model#comparing-the-cassandra-data-model-to-a-relational-database
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/

and this link seems interesting, but I haven't read it yet (about indexes):

http://www.anuff.com/2011/02/indexing-in-cassandra.html

I hope you'll find your answers within this documentation.

Alain


2012/8/6 Baskar Sikkayan baskar@gmail.com:
 Hi,
   Just wanted to learn Cassandra and am trying to convert an RDBMS design to
 Cassandra.
 Consider that my app is being deployed in multiple data centers.

 DB Design :

A) CF : USER
   1) email_id - primary key
   2) fullname
   3) organization - (I didn't create a separate table for
 organization)

B) CF : ORG_USER

  1) organization - Primary Key
  2) email_id

  Here, my intention is to get the users belonging to an organization.
  I can make organization in the user table a secondary index, but I have
 heard that this may hit performance.
  Could you please clarify which is the better approach?


 Thanks,
 Baskar.S

This email and any files transmitted with it are confidential and intended 
solely for the individual or entity to whom they are addressed. If you have 
received this email in error destroy it immediately. *** Walmart Confidential 
***


Re: Changing comparator

2012-08-06 Thread Roshni Rajagopal
Christof,

I am not convinced you need to change your comparator. BytesType works for
most sorting, even text.
Did you mean validator - for a column's value? The comparator is for column
ordering (ORDER BY in SQL).

I believe you can just convert the text you want to search for to bytes
and then put it in the where clause:

Bytes type should just be done via BytesTypeSerializer (a no-op
really) as a value ... where value=raw bytes here
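The client-side encoding step being suggested can be sketched in Python. Only the conversion is shown; the actual query call depends on your Hector/CQL driver and is not reproduced here, and the string 'foo' is the example from the thread below.

```python
# Sketch (assumption: only the text-to-bytes step a client would perform
# before querying a BytesType column; no driver call is shown).
text = "foo"

raw = text.encode("utf-8")   # what a no-op bytes serializer would send
print(raw.hex())             # -> '666f6f', the hex form a BytesType
                             #    comparator/validator deals in
```

This is also why cqlsh complains "cannot parse 'foo' as hex bytes" against a BytesType schema: it expects the hex form, not the text.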


Quoted from 
https://groups.google.com/forum/?fromgroups#!topic/hector-users/BpaemK95sPo


Disclaimer - I have not used this. It just seems an unnecessary thing to
do - converting your validator/comparator from the default BytesType. My
understanding is that the comparator for the column family can be BytesType
unless you want a specific ORDER BY like TimeUUID.
And validators can be left as BytesType unless you want some specific
validation that the value you are storing is a number or a time etc.

Regards,
Roshni

On 03/08/12 5:36 PM, Christof Roduner chris...@scandit.com wrote:

Hi Roshni,

Thanks for your reply. As far as I know, ASSUME is only for cqlsh and
not for CQL in general. (We can of course achieve the same by
programmatically setting the encoding. It would be just simpler to let
the CQL driver take care of it...)

Regards,
Christof


On 8/3/2012 11:31 AM, Roshni Rajagopal wrote:
 Christof ,

 can't you just use ASSUME for the CQL session?

 http://www.datastax.com/docs/1.0/references/cql/ASSUME


 Regards,
 Roshni



 On 03/08/12 2:26 PM, Christof Roduner chris...@scandit.com wrote:

 Hi,

 I know that changing a CF's comparator is not officially supported.
 However, there is a post by Jonathan Ellis that implies it can be done
 (www.mail-archive.com/user@cassandra.apache.org/msg09502.html).

 I assume that we'd have to change entries in the system.schema_* column
 families.

 Has anyone successfully done this?

 We want to change the comparator from BytesType to UTF8Type to make the
 move to CQL easier (cannot parse 'foo' as hex bytes). Our CFs were
 created back in the Cassandra 0.6.x days and are too large to be easily
 copied to new CFs with a new schema.

 Many thanks in advance.

 Christof





Re: Unsuccessful attempt to add a second node to a ring.

2012-08-03 Thread Roshni Rajagopal
Hi Jakub,

 Were you able to resolve the issue?
For a multi-data-center setup I do believe some steps are different. You may 
need to set NetworkTopologyStrategy as your replication strategy rather than 
SimpleStrategy, set up a snitch, and mention the rack/DC configurations in a 
config file.
You can refer to the steps for a multi-data-center installation.

Regards,
Roshni



From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Unsuccessful attempt to add a second node to a ring.

I found a similar thread from March : 
http://www.mail-archive.com/user@cassandra.apache.org/msg21007.html

For me clearing the data and starting from the beginning didn't help.

It's interesting because on my dev environment I was able to add another node 
without any problems.

The only difference is that the second node now is in a different data center. 
(but I'm not using any different settings, SimpleSnitch)
7000,9160,7199 ports were open between those 2 nodes.

How else can I check if the communication between those 2 nodes is working?
In the logs I see that:
DEBUG [WRITE-NODE1/node1.ip] 2012-07-31 13:50:39,642 OutboundTcpConnection.java 
(line 206) attempting to connect to NODE1/node1.ip

So I assume that the communication is somehow established?


--
regards,
Jakub Glapa


On Wed, Aug 1, 2012 at 11:36 AM, Jakub Glapa jakub.gl...@gmail.com wrote:
yes it's the same



--
regards,
pozdrawiam,
Jakub Glapa


On Wed, Aug 1, 2012 at 11:24 AM, Roshni Rajagopal 
roshni.rajago...@wal-mart.com wrote:
Ok, sorry it may not be required,
I was thinking of a configuration I had done on my local laptop, where I had 
aliased my IP address.
In that case the directories and jmx port needed to be different.

Cluster name is same right?


From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Unsuccessful attempt to add a second node to a ring.

Hi Roshni,
no they are the same, my changes in cassandra.yaml were only in the 
listen_address, rpc_address, seeds and initial_token field.
The rest is exactly the same as on node1.

That's how the file looks on node2:



cluster_name: 'Test Cluster'
initial_token: 85070591730234615865843651857942052864
hinted_handoff_enabled: true
hinted_handoff_throttle_delay_in_ms: 1
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authority: org.apache.cassandra.auth.AllowAllAuthority
partitioner: org.apache.cassandra.dht.RandomPartitioner
data_file_directories:
- /data/servers/cassandra_sbe_edtool/cassandra_data/data
commitlog_directory: /data/servers/cassandra_sbe_edtool/cassandra_data/commitlog
saved_caches_directory: 
/data/servers/cassandra_sbe_edtool/cassandra_data/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
  - seeds: NODE1
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 32
memtable_flush_queue_size: 4
sliced_buffer_size_in_kb: 64
storage_port: 7000
ssl_storage_port: 7001
listen_address: NODE2
rpc_address: NODE2
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: false
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 1
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 60
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
encryption_options:
internode_encryption: none
keystore: conf/.keystore
keystore_password: cassandra
truststore: conf/.truststore
truststore_password: cassandra




--
regards,
pozdrawiam,
Jakub Glapa


On Wed, Aug 1, 2012 at 10:29 AM, Roshni Rajagopal 
roshni.rajago...@wal

Re: Changing comparator

2012-08-03 Thread Roshni Rajagopal
Christof ,

can't you just use ASSUME for the CQL session?

http://www.datastax.com/docs/1.0/references/cql/ASSUME


Regards,
Roshni



On 03/08/12 2:26 PM, Christof Roduner chris...@scandit.com wrote:

Hi,

I know that changing a CF's comparator is not officially supported.
However, there is a post by Jonathan Ellis that implies it can be done
(www.mail-archive.com/user@cassandra.apache.org/msg09502.html).

I assume that we'd have to change entries in the system.schema_* column
families.

Has anyone successfully done this?

We want to change the comparator from BytesType to UTF8Type to make the
move to CQL easier (cannot parse 'foo' as hex bytes). Our CFs were
created back in the Cassandra 0.6.x days and are too large to be easily
copied to new CFs with a new schema.

Many thanks in advance.

Christof



Re: Unsuccessful attempt to add a second node to a ring.

2012-08-01 Thread Roshni Rajagopal
Jakub,

Have you set the
Data, commitlog, saved cache directories to different ones in each yaml file 
for each node?

Regards,
Roshni


From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Unsuccessful attempt to add a second node to a ring.

Hi Everybody!

I'm trying to add a second node to an already operating one node cluster.

Some specs:
- cassandra 1.0.7
- both nodes have a routable listen_address and rpc_address.
- Ports are open: (from node2) telnet node1 7000 is successful
- Seeds parameter on node2 points to node 1.

[node1] nodetool -h localhost ring
Address DC  RackStatus State   LoadOwns
Token
node1.ip datacenter1 rack1   Up Normal  74.33 KB100.00% 0

- initial token on node2 was specified

I see something like that in the logs on node2:

DEBUG [main] 2012-07-31 13:50:38,640 CollationController.java (line 76) 
collectTimeOrderedData
 INFO [main] 2012-07-31 13:50:38,641 StorageService.java (line 667) JOINING: 
waiting for ring and schema information
DEBUG [WRITE-NODE1/node1.ip] 2012-07-31 13:50:39,642 OutboundTcpConnection.java 
(line 206) attempting to connect to NODE1/node1.ip
DEBUG [ScheduledTasks:1] 2012-07-31 13:50:40,639 LoadBroadcaster.java (line 86) 
Disseminating load info ...
 INFO [main] 2012-07-31 13:51:08,641 StorageService.java (line 667) JOINING: 
schema complete, ready to bootstrap
DEBUG [main] 2012-07-31 13:51:08,642 StorageService.java (line 554) ... got 
ring + schema info
 INFO [main] 2012-07-31 13:51:08,642 StorageService.java (line 667) JOINING: 
getting bootstrap token
DEBUG [main] 2012-07-31 13:51:08,644 BootStrapper.java (line 138) token 
manually specified as 85070591730234615865843651857942052864
DEBUG [main] 2012-07-31 13:51:08,645 Table.java (line 387) applying mutation of 
row 4c


but it doesn't join the ring:

[node2] nodetool -h localhost ring
Address DC  RackStatus State   LoadOwns
Token
node2.ip   datacenter1 rack1   Up Normal  13.49 KB100.00% 
85070591730234615865843651857942052864



I'm attaching the full log from node2 startup in debug mode.



PS.
When I didn't specify the initial token on node2 I ended up with an exception 
like this:
Exception encountered during startup: No other nodes seen! Unable to 
bootstrap. If you intended to start a single-node cluster, you should make 
sure your broadcast_address (or listen_address) is listed as a seed.
Otherwise, you need to determine why the seed being contacted has no 
knowledge of the rest of the cluster. Usually, this can be solved by giving 
all nodes the same seed list.


I'm not sure how to proceed now. I found a couple of posts with problems like 
that but they weren't very useful.

--
regards,
Jakub Glapa



Re: Unsuccessful attempt to add a second node to a ring.

2012-08-01 Thread Roshni Rajagopal
Ok, sorry it may not be required,
I was thinking of a configuration I had done on my local laptop, where I had 
aliased my IP address.
In that case the directories and jmx port needed to be different.

Cluster name is same right?


From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Unsuccessful attempt to add a second node to a ring.

Hi Roshni,
no, they are the same; my changes in cassandra.yaml were only to the
listen_address, rpc_address, seeds and initial_token fields.
The rest is exactly the same as on node1.

That's how the file looks on node2:



cluster_name: 'Test Cluster'
initial_token: 85070591730234615865843651857942052864
hinted_handoff_enabled: true
hinted_handoff_throttle_delay_in_ms: 1
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authority: org.apache.cassandra.auth.AllowAllAuthority
partitioner: org.apache.cassandra.dht.RandomPartitioner
data_file_directories:
- /data/servers/cassandra_sbe_edtool/cassandra_data/data
commitlog_directory: /data/servers/cassandra_sbe_edtool/cassandra_data/commitlog
saved_caches_directory: /data/servers/cassandra_sbe_edtool/cassandra_data/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
  parameters:
  - seeds: NODE1
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 32
memtable_flush_queue_size: 4
sliced_buffer_size_in_kb: 64
storage_port: 7000
ssl_storage_port: 7001
listen_address: NODE2
rpc_address: NODE2
rpc_port: 9160
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: false
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 1
endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 60
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra




--
regards,
Jakub Glapa


On Wed, Aug 1, 2012 at 10:29 AM, Roshni Rajagopal 
roshni.rajago...@wal-mart.com wrote:
Jakub,

Have you set the
Data, commitlog, saved cache directories to different ones in each yaml file 
for each node?

Regards,
Roshni


From: Jakub Glapa jakub.gl...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Unsuccessful attempt to add a second node to a ring.

Hi Everybody!

I'm trying to add a second node to an already operating one node cluster.

Some specs:
- cassandra 1.0.7
- both nodes have a routable listen_address and rpc_address.
- Ports are open: (from node2) telnet node1 7000 is successful
- Seeds parameter on node2 points to node 1.

[node1] nodetool -h localhost ring
Address DC  RackStatus State   LoadOwns
Token
node1.ip datacenter1 rack1   Up Normal  74.33 KB100.00% 0

- initial token on node2 was specified

I see something like that in the logs on node2:

DEBUG [main] 2012-07-31 13:50:38,640 CollationController.java (line 76) 
collectTimeOrderedData
 INFO [main] 2012-07-31 13:50:38,641 StorageService.java (line 667) JOINING: 
waiting for ring and schema information
DEBUG [WRITE-NODE1/node1.ip] 2012-07-31 13:50:39,642 OutboundTcpConnection.java 
(line 206) attempting to connect to NODE1/node1.ip
DEBUG [ScheduledTasks:1] 2012-07-31 13:50:40,639 LoadBroadcaster.java (line 86) 
Disseminating load info ...
 INFO [main] 2012-07-31 13:51:08,641 StorageService.java (line 667) JOINING: 
schema complete, ready to bootstrap
DEBUG [main] 2012-07-31 13:51:08,642 StorageService.java (line 554) ... got 
ring + schema info
 INFO [main] 2012-07-31 13:51:08,642

Re: Does Cassandra support operations in a transaction?

2012-08-01 Thread Roshni Rajagopal
Hi Ivan,

Cassandra supports 'tunable consistency'. If you always read and write at
QUORUM (or LOCAL_QUORUM for multiple data centers), you can guarantee that the
results will be consistent: the data from a quorum of replicas is compared and
the latest is returned, so no data will be out of date. This comes at a cost in
performance; it is fastest to just read and write once rather than check a
quorum of nodes.
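The rule behind this can be stated numerically: a read is guaranteed to see the latest write whenever the number of replicas read plus the number written exceeds the replication factor, because the two replica sets must overlap. A minimal sketch (function names are mine):

```python
# R + W > RF guarantees the read and write replica sets overlap,
# so at least one replica consulted on read holds the latest write.
def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    return read_replicas + write_replicas > replication_factor

rf = 3
quorum = rf // 2 + 1                               # 2 of 3 replicas
print(is_strongly_consistent(quorum, quorum, rf))  # True: QUORUM reads and writes
print(is_strongly_consistent(1, 1, rf))            # False: ONE/ONE may read stale data
```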

What you choose depends on your application's needs. Is it OK if some
users receive out-of-date data (it isn't earth-shattering if someone doesn't
know what you're eating right now), or is it a banking transaction system where
all entities must be consistently updated?

Design in Cassandra prioritizes denormalization. You cannot rely on
referential integrity to keep 2 tables (column families in Cassandra) in sync,
because the database has no foreign keys to enforce it. The application needs
to ensure that the data in all column families is accurate and not out of
sync, because data elements may be duplicated in different column families.


You also cannot update 2 different entities and ensure that changes to both are
applied and only then made visible to others.


Regards,


From: Jeffrey Kesselman jef...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Does Cassandra support operations in a transaction?

Short story is that few, if any, of the NoSQL systems support transactions
natively. That's one of the big compromises they make. What they call eventual
consistency is actually eventual durability in ACID terms.

Consistency, as meant by the C in ACID, is not guaranteed at all.

On Wed, Aug 1, 2012 at 6:21 AM, Ivan Jiang wiwi1...@gmail.com wrote:
Hi,
I am new to Cassandra; I wonder whether it is possible to call Cassandra
within one transaction, as in a relational DB.

Thanks in advance.

Best Regards,
Ivan Jiang



--
It's always darkest just before you are eaten by a grue.



Re: Schema question : Query to support Find which all of these 500 email ids have been registered

2012-07-26 Thread Roshni Rajagopal
In general I believe wide rows (many columns) are preferable to skinny rows
(many rows), so that you can get all the information in one go;
a row can store up to 2 billion columns.

However, on what basis would you store the 500 email ids in 1 row? What
can be the row key?
For example, if the query you want to answer with this column family is 'how
many email addresses are registered in this application?', then the
application id can be the row key, and the 500 email ids can be stored as
columns. Each other application would be another row. Since you want to
search by application, this may be the best approach.

If your information doesn't fit neatly into the model above, you can go for
an email id as the row key, with the list of applications as columns.



Reading 500 rows does not seem a big task; I doubt it would be a
performance issue given Cassandra's powers.
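To make the second layout concrete, here is a toy in-memory sketch (plain Python dicts standing in for rows and columns; all names and data are illustrative): each email id is a row key, and answering the question amounts to a 500-key multiget.

```python
# Row key = email id; columns = the applications it is registered with.
rows = {
    "a@example.com": {"app1": ""},
    "b@example.com": {"app1": "", "app2": ""},
}

def registered_among(email_ids):
    # Equivalent of a multiget: one key lookup per email id passed in.
    return [e for e in email_ids if e in rows]

print(registered_among(["a@example.com", "x@example.com"]))
# ['a@example.com']
```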

On 27/07/12 11:12 AM, Aklin_81 asdk...@gmail.com wrote:

I need to find out what all email ids among a list of 500 ids passed
in a single query, have been registered on my app. (Total registered
email ids may be in millions). What is the best way to store this kind
of data?

Should I store each email id in a separate row? But then I would have
to read 500 rows at a single time! Or if I use a single row, or fewer rows,
they would get too heavy.

Btw, would it be really bad if I read 500 rows at a single time? They'll be
just one-column rows, and the columns are never modified once written.



Re: Cassandra and Tableau

2012-07-15 Thread Roshni Rajagopal
Hi Robin,

I'm from an analytics background and was working in traditional BI tools like
OBIEE and Business Objects, so I am very interested in your evaluation of a
good analytics toolset combination.
Do share your learnings.

At a high level, as I understand it, Cassandra can be used as the backend for a
transactional system (with the tunable consistency adjusted according to
requirements), because it is real time.
Hadoop, however, is not for real-time scenarios. It is primarily for analytics:
non-real-time processing on huge datasets. The actual information is stored in
the HDFS file system. You can even use Hadoop as a replacement for 'ETL'
processing.
With a Hadoop cluster you can directly use a statistical programming language
like 'R' to extract information.
For traditional BI folks like me that's a new piece to learn: no user-friendly
GUI!

For a better GUI, there is a new breed of tools like Pentaho, Jaspersoft,
Karmasphere and Datasphere. I'm not exactly sure how all of these work;
I know Pentaho works with a Hadoop + Hive combination. Here's a nice
presentation which explains it, from their 'Chief Geek'; I found the whole
series of presentations good.

Pentaho – hadoop knowledge series

http://www.pentaho.com/resources/videos/29/Hadoop-and-Business-Intelligence/



Regards,
Roshni


From: Robin Verlangen ro...@us2.nl
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Cassandra and Tableau

Thank you Aaron and Brian. We're currently investigating several options. 
Hadoop + Hive combo also seems a good choice as our input files are flat. I'll 
keep you up-to-date about our final decision.

- Robin

2012/7/6 aaron morton aa...@thelastpickle.com
Here are two links I've noticed in my travels, have not looked into what they 
offer.

http://www.pentaho.com/big-data/nosql/cassandra/

http://www.jaspersoft.com/bigdata

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 7/07/2012, at 3:03 AM, Brian O'Neill wrote:

Robin,

We have the same issue right now.  We use Tableau for all of our
reporting needs, but we couldn't find any acceptable bridge between it
and Cassandra.

We ended up using cassandra-triggers to replicate the data to Oracle.
https://github.com/hmsonline/cassandra-triggers/

Let us know if you get things setup with a direct connection.
We'd be *very* interested in helping out if you find a way to do it.

-brian


On Fri, Jul 6, 2012 at 5:31 AM, Robin Verlangen ro...@us2.nl wrote:
Hi there,

Is there anyone out there who's using Tableau in combination with a
Cassandra cluster? There seems to be no standard solution to connect, at
least I couldn't find one. Does anyone know how to tackle this problem?


With kind regards,

Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E ro...@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.




--
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/




--
With kind regards,

Robin Verlangen
Software engineer

W http://www.robinverlangen.nl
E ro...@us2.nl




Starting cassandra with -D option

2012-06-21 Thread Roshni Rajagopal
Hi Folks,

We wanted to have a single Cassandra installation and use it to start Cassandra
on other nodes by passing the Cassandra configuration directories as a
parameter. The idea is to avoid having copies of the Cassandra code on each
node, and having to start each node from that node's bin/cassandra.


As per http://www.datastax.com/docs/1.0/references/cassandra,
We have an option -D through which we can supply some parameters to Cassandra.
Has anyone tried this?
I'm getting the error below.

walmarts-MacBook-Pro-2:Node1-Cassandra1.1.0 walmart$  bin/cassandra  
-Dcassandra.config=file:///Users/walmart/Downloads/Cassandra/Node2-Cassandra1.1.0/conf
walmarts-MacBook-Pro-2:Node1-Cassandra1.1.0 walmart$  INFO 15:38:01,763 Logging 
initialized
 INFO 15:38:01,766 JVM vendor/version: Java HotSpot(TM) 64-Bit Server 
VM/1.6.0_31
 INFO 15:38:01,766 Heap size: 1052770304/1052770304
 INFO 15:38:01,766 Classpath: 
bin/../conf:bin/../build/classes/main:bin/../build/classes/thrift:bin/../lib/antlr-3.2.jar:bin/../lib/apache-cassandra-1.1.0.jar:bin/../lib/apache-cassandra-clientutil-1.1.0.jar:bin/../lib/apache-cassandra-thrift-1.1.0.jar:bin/../lib/avro-1.4.0-fixes.jar:bin/../lib/avro-1.4.0-sources-fixes.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/compress-lzf-0.8.4.jar:bin/../lib/concurrentlinkedhashmap-lru-1.2.jar:bin/../lib/guava-r08.jar:bin/../lib/high-scale-lib-1.1.2.jar:bin/../lib/jackson-core-asl-1.9.2.jar:bin/../lib/jackson-mapper-asl-1.9.2.jar:bin/../lib/jamm-0.2.5.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-0.7.0.jar:bin/../lib/log4j-1.2.16.jar:bin/../lib/metrics-core-2.0.3.jar:bin/../lib/mx4j-tools-3.0.1.jar:bin/../lib/servlet-api-2.5-20081211.jar:bin/../lib/slf4j-api-1.6.1.jar:bin/../lib/slf4j-log4j12-1.6.1.jar:bin/../lib/snakeyaml-1.6.jar:bin/../lib/snappy-java-1.0.4.1.jar:bin/../lib/snaptree-0.1.jar:bin/../lib/jamm-0.2.5.jar
 INFO 15:38:01,768 JNA not found. Native methods will be disabled.
 INFO 15:38:01,826 Loading settings from 
file:/Users/walmart/Downloads/Cassandra/Node2-Cassandra1.1.0/conf
ERROR 15:38:01,873 Fatal configuration error error
Can't construct a java object for 
tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=No single 
argument constructor found for class org.apache.cassandra.config.Config
 in reader, line 1, column 1:
cassandra.yaml

The other option would be to modify cassandra.in.sh.
Has anyone tried this?

Regards,
Roshni



Re: Setting column to null

2012-06-11 Thread Roshni Rajagopal
Leonid,


Are you using some client for doing these operations?

Hector is a Java client which provides APIs for adding/deleting columns in a
Cassandra column family.
I don't think you really need to write your wrapper in this format; you are
restricting the number of columns it can use, etc. I suggest your code
accept user input for the column family name, operation and keys, and
accordingly call the appropriate Hector API for adding/deleting data.


Regards,
Roshni


On 11/06/12 7:20 PM, Leonid Ilyevsky lilyev...@mooncapital.com wrote:

Thanks, I understand what you are telling me. Obviously deleting the
column is the proper way to do this in Cassandra.
What I was looking for, is some convenient wrapper on top of that which
will do it for me. Here is my scenario.

I have a function that takes a record to be saved in Cassandra (an array of
objects, or a Map<String, Object>). Let's say it can have up to 8 columns. I
prepare a statement like this:

Insert into table values(?, ?, ?, ?, ?, ?, ?, ?)

If I could somehow pass null when I execute it, it would be enough to
prepare that statement once and execute it multiple times. I would then
expect that when some element is null, the corresponding column is not
inserted (for a new key) or is deleted (for an existing key).
The way it is now, in my code I have to examine which columns are present
and which are not; depending on that I have to generate a customized
statement, and it is going to be different for the case of an existing key
versus the case of a new key.
Isn't this too much hassle?

Related question. I assumed that a prepared statement in Cassandra is there
for the same reason as in an RDBMS, that is, for efficiency. In the above
scenario, how expensive is it to execute a specialized statement for every
record compared to a prepared statement executed multiple times?

If I need to execute those specialized statements, should I still use a
prepared statement, or should I just generate a string with everything in
ASCII format?
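One way to avoid the per-shape statements described above is to generate the statement text from the non-null entries of the record, so nulls are simply omitted rather than bound. A rough sketch (table and column names are made up; as discussed in the thread, removing a previously set column would still need a separate DELETE):

```python
# Build an INSERT that names only the non-null columns of a record,
# so a single code path handles records of any shape.
def build_insert(table, record):
    cols = [c for c, v in record.items() if v is not None]
    cql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(cols), ", ".join("?" for _ in cols))
    return cql, [record[c] for c in cols]

cql, params = build_insert("mytable", {"key": 1, "a": "x", "b": None})
print(cql)     # INSERT INTO mytable (key, a) VALUES (?, ?)
print(params)  # [1, 'x']
```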

-Original Message-
From: Roshni Rajagopal [mailto:roshni.rajago...@wal-mart.com]
Sent: Monday, June 11, 2012 12:58 AM
To: user@cassandra.apache.org
Subject: Re: Setting column to null

Would you want to view data like this: there was a key, which had this
column, but now it does not have any value as of this time.

Unless you specifically want this information, I believe you should just
delete the column, rather than have an alternate value for NULL or create
a composite column.

Because in cassandra that's the way deletion is dealt with; putting NULLs
is the way we deal with it in RDBMS, because we have a fixed number of
columns which always have to have some value, even if it's NULL, and we
have to have the same set of columns for every row.
In Cassandra, we can delete the column, and in most scenarios that's
what we should do, unless we specifically want to preserve some history
that this column was turned null at this time… Each row can have different
columns.

Regards,
Roshni

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Setting column to null

Your best bet is to define the column as a composite column where one
part represents is null and the other part is the data.

On Friday, June 8, 2012, shashwat shriparv dwivedishash...@gmail.com wrote:
 What you can do is you can define some specific variable like
NULLDATA some thing like that to update in columns that does have value


 On Fri, Jun 8, 2012 at 11:58 PM, aaron morton aa...@thelastpickle.com wrote:

 You don't need to set columns to null, delete the column instead.
 Cheers
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 8/06/2012, at 9:34 AM, Leonid Ilyevsky wrote:

 Is it possible to explicitly set a column value to null?

 I see that if insert statement does not include a specific column, that
column comes up as null (assuming we are creating a record with new
unique key).
 But if we want to update a record, how we set it to null?

 Another situation is when I use prepared cql3 statement (in Java) and
send parameters when I execute it. If I want to leave some column
unassigned, I need a special statement without that column.
 What I would like is, prepare one statement including all columns, and
then be able to set some of them to null. I tried to set corresponding
ByteBuffer parameter to null, obviously got an exception.
 

Re: Setting column to null

2012-06-10 Thread Roshni Rajagopal
Would you want to view data like this: there was a key, which had this column,
but now it does not have any value as of this time.

Unless you specifically want this information, I believe you should just delete
the column, rather than have an alternate value for NULL or create a composite
column.

Because in cassandra that's the way deletion is dealt with; putting NULLs is
the way we deal with it in RDBMS, because we have a fixed number of columns
which always have to have some value, even if it's NULL, and we have to have
the same set of columns for every row.
In Cassandra, we can delete the column, and in most scenarios that's what we
should do, unless we specifically want to preserve some history that this
column was turned null at this time… Each row can have different columns.

Regards,
Roshni

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@cassandra.apache.org
To: user@cassandra.apache.org
Subject: Re: Setting column to null

Your best bet is to define the column as a composite column where one part 
represents is null and the other part is the data.

On Friday, June 8, 2012, shashwat shriparv dwivedishash...@gmail.com wrote:
 What you can do is you can define some specific variable like NULLDATA some 
 thing like that to update in columns that does have value


 On Fri, Jun 8, 2012 at 11:58 PM, aaron morton aa...@thelastpickle.com wrote:

 You don't need to set columns to null, delete the column instead.
 Cheers
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 On 8/06/2012, at 9:34 AM, Leonid Ilyevsky wrote:

 Is it possible to explicitly set a column value to null?

 I see that if insert statement does not include a specific column, that 
 column comes up as null (assuming we are creating a record with new unique 
 key).
 But if we want to update a record, how we set it to null?

 Another situation is when I use prepared cql3 statement (in Java) and send 
 parameters when I execute it. If I want to leave some column unassigned, I 
 need a special statement without that column.
 What I would like is, prepare one statement including all columns, and then 
 be able to set some of them to null. I tried to set corresponding ByteBuffer 
 parameter to null, obviously got an exception.
 
 This email, along with any attachments, is confidential and may be legally 
 privileged or otherwise protected from disclosure. Any unauthorized 
 dissemination, copying or use of the contents of this email is strictly 
 prohibited and may be in violation of law. If you are not the intended 
 recipient, any disclosure, copying, forwarding or distribution of this email 
 is strictly prohibited and this email and any attachments should be deleted 
 immediately. This email and any attachments do not constitute an offer to 
 sell or a solicitation of an offer to purchase any interest in any investment 
 vehicle sponsored by Moon Capital Management LP (“Moon Capital”). Moon 
 Capital does not provide legal, accounting or tax advice. Any statement 
 regarding legal, accounting or tax matters was not intended or written to be 
 relied upon by any person as advice. Moon Capital does not waive 
 confidentiality or privilege as a result of this email.



 --


 ∞

 Shashwat Shriparv





Re: Problem in getting data from a 2 node cluster of Cassandra

2012-06-08 Thread Roshni Rajagopal
Hi Prakrati,

In an ideal situation, no data should be lost when a node is added. How are
you getting the statistics below?
The output below looks like it's from some code using Hector or Thrift. Is the
code to get statistics from a 1-node cluster or a 2-node cluster exactly the
same, with the only change being a node added or removed?
Could you verify the number of rows and columns in the column family using the
CLI or CQL?

Regards,
Roshni




From: Prakrati Agrawal prakrati.agra...@mu-sigma.com
Reply-To: user@cassandra.apache.org
Date: Friday 8 June 2012 11:50 AM
To: user@cassandra.apache.org
Subject: Problem in getting data from a 2 node cluster of Cassandra

Dear all

I originally had a 1-node cluster. Then I added one more node to it, with the
initial token configured appropriately. Now when I run my queries I am not
getting all my data, i.e. all columns.
 Output on 2 nodes
Time taken to retrieve columns 43707 of key range is 1276
Time taken to retrieve columns 2084199 of all tickers is 54334
Time taken to count is 230776
Total number of rows in the database are 183
Total number of columns in the database are 7903753
Output on 1 node
Time taken to retrieve columns 43707 of key range is 767
Time taken to retrieve columns 382 of all tickers is 52793
Time taken to count is 268135
Total number of rows in the database are 396
Total number of columns in the database are 16316426
Please help me. Where is my data going or how should I retrieve it. I have 
consistency level specified as ONE and I did not specify any replication factor.



Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com



This email message may contain proprietary, private and confidential 
information. The information transmitted is intended only for the person(s) or 
entities to which it is addressed. Any review, retransmission, dissemination or 
other use of, or taking of any action in reliance upon, this information by 
persons or entities other than the intended recipient is prohibited and may be 
illegal. If you received this in error, please contact the sender and delete 
the message from your system.

Mu Sigma takes all reasonable steps to ensure that its electronic 
communications are free from viruses. However, given Internet accessibility, 
the Company cannot accept liability for any virus introduced by this e-mail or 
any attachment and you are advised to use up-to-date virus checking software.



Re: How to include two nodes in Java code using Hector

2012-06-06 Thread Roshni Rajagopal


In Hector, when you create a cluster using the API, you specify an IP address
and a cluster name. Thereafter, which node serves a request, and how many nodes
need to be contacted to read/write the data, depends on the cluster
configuration: your replication strategy and factor, the consistency levels for
the column family, how many nodes are in the ring, and so on. So you don't need
to connect to each node individually via the Hector client. Once you connect to
the cluster and keyspace, via the IP address of any node in the cluster, your
Hector calls to read/write data will automatically figure out the node-level
details and carry out the task. You won't get 50% of the data; you will get all
of it.


Also, when you remove a node, your data will be unavailable ONLY if it is not
available as a replica on some other node.
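This routing can be pictured with a toy two-node ring (RandomPartitioner-style tokens; all names and functions here are illustrative, not the Hector API): whichever node the client contacts acts as coordinator and forwards the request to the token's owner, so the client still sees all the data.

```python
# (start token, node) pairs for a two-node ring splitting 0 .. 2**127 - 1.
RING = [(0, "node1"), (2 ** 127 // 2, "node2")]

def owner(key_token):
    # A key belongs to the node with the largest start token <= its token.
    node = RING[0][1]
    for start, name in RING:
        if key_token >= start:
            node = name
    return node

def coordinator_read(contacted_node, key_token):
    # Any contacted node can coordinate; it fetches from the owner.
    return "coordinator %s read key from %s" % (contacted_node, owner(key_token))

print(coordinator_read("node1", 42))        # key owned by node1
print(coordinator_read("node1", 2 ** 126))  # key owned by node2, still served
```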


Regards,


From: Prakrati Agrawal prakrati.agra...@mu-sigma.com
Reply-To: user@cassandra.apache.org
Date: Tue, 5 Jun 2012 20:05:21 -0700
To: user@cassandra.apache.org
Subject: RE: How to include two nodes in Java code using Hector

But the data is distributed across the nodes (meaning 50% of the data is on one
node and 50% on the other), so I need to specify the node IP address somewhere
in the code. But where to specify it is what I am clueless about. Please help
me.

Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com

From: Harshvardhan Ojha [mailto:harshvardhan.o...@makemytrip.com]
Sent: Tuesday, June 05, 2012 5:51 PM
To: user@cassandra.apache.org
Subject: RE: How to include two nodes in Java code using Hector

Use Consistency Level =2.

Regards
Harsh

From: Prakrati Agrawal [mailto:prakrati.agra...@mu-sigma.com]
Sent: Tuesday, June 05, 2012 4:08 PM
To: user@cassandra.apache.org
Subject: How to include two nodes in Java code using Hector

Dear all

I am using a two node Cassandra cluster. How do I code in Java using Hector to 
get data from both the nodes. Please help

Thanks and Regards

Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com






This email and any files transmitted with it are confidential and intended 
solely for the individual or entity to whom they are addressed. If you have 
received this email in error destroy it immediately. *** Walmart Confidential 
***


Re: Can not find auto bootstrap property in cassandra.yaml for Cassandra 1.1.0

2012-06-04 Thread Roshni Rajagopal
Hi Prakrati,

In 1.1.0 you don't need to set this; it's the default. I'm also on 1.1.0 and I 
didn't need to set it.
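For reference, in 1.1.x the property is simply absent from the shipped cassandra.yaml and is treated as true; if you want to set it explicitly (for example, to disable it), you can add the line yourself:

```yaml
# cassandra.yaml -- not present in the 1.1.x default file; the implicit
# default is true, so add this only to override it
auto_bootstrap: true
```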


Regards,
Roshni

From: Prakrati Agrawal prakrati.agra...@mu-sigma.com
Reply-To: user@cassandra.apache.org
Date: Sun, 3 Jun 2012 22:58:24 -0700
To: user@cassandra.apache.org
Subject: Can not find auto bootstrap property in cassandra.yaml for Cassandra 
1.1.0

Dear all

I am trying to add a new node to the Cassandra cluster. All the documentation 
available on the net says to set the auto bootstrap property in cassandra.yaml 
to true, but I cannot find that property in the file. Please help me.

Thanks and Regards

Prakrati Agrawal | Developer - Big Data(ID)| 9731648376 | www.mu-sigma.com





Re: Adding a new node to Cassandra cluster

2012-06-04 Thread Roshni Rajagopal
Prakrati,

I believe that even though you specify one node in your code, internally the 
request may go to any node, perhaps more than one, based on your replication 
factor and consistency level settings.
You can try this by connecting to one node, writing to it, and then reading the 
same data from another node. You can also see this replication happening via 
the CLI.
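One way to experiment with this from Hector is to set the default consistency levels on the keyspace; a sketch, assuming hector-core on the classpath (cluster, host and keyspace names are placeholders):

```java
// Sketch only: names and host address are placeholders.
import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class ConsistencyDemo {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", "node1:9160");
        // Require a quorum of replicas for both reads and writes, so the
        // interaction with the replication factor becomes visible.
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
        ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, ccl);
    }
}
```

With QUORUM on both sides, a write acknowledged by one node will be visible when you read the same key back through another node.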

Regards,
Roshni


From: R. Verlangen ro...@us2.nl
Reply-To: user@cassandra.apache.org
Date: Mon, 4 Jun 2012 02:30:40 -0700
To: user@cassandra.apache.org
Subject: Re: Adding a new node to Cassandra cluster

You might consider using a higher level client (like Hector indeed). If you 
don't want this you will have to write your own connection pool. For start take 
a look at Hector. But keep in mind that you might be reinventing the wheel.

2012/6/4 Prakrati Agrawal prakrati.agra...@mu-sigma.com
Hi,

I am using Thrift API and I am not able to find anything on the internet about 
how to configure it for multiple nodes. I am not using any proper client like 
Hector.

Prakrati Agrawal | Developer - Big Data (ID) | 9731648376 | www.mu-sigma.com

From: R. Verlangen [mailto:ro...@us2.nl]
Sent: Monday, June 04, 2012 2:44 PM
To: user@cassandra.apache.org
Subject: Re: Adding a new node to Cassandra cluster

Hi there,

When you speak to one node it will internally redirect the request to the 
proper node (local / external), but you won't be able to fail over if the local 
host crashes.
For adding another node to the connection pool you should take a look at the 
documentation of your java client.

Good luck!

2012/6/4 Prakrati Agrawal prakrati.agra...@mu-sigma.com
Dear all

I successfully added a new node to my cluster, so now it's a 2-node cluster. 
But how do I mention it in my Java code? When I retrieve data, it retrieves 
only from the one node that I specify as localhost. How do I specify more than 
one node instead of just localhost?

Please help me

Thanks and Regards

Prakrati Agrawal | Developer - Big Data (ID) | 9731648376 | www.mu-sigma.com






--
With kind regards,

Robin Verlangen
Software engineer

W www.robinverlangen.nl
E ro...@us2.nl

Disclaimer: The information contained in this message and attachments is 
intended solely for the attention and use of the named addressee and may be 
confidential. If you are not the intended recipient, you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this e-mail. If you have received this 
message in error, please contact the sender immediately and irrevocably delete 
this message and any copies.






--
With kind regards,

Robin Verlangen
Software engineer

W www.robinverlangen.nl
E ro...@us2.nl


Replication factor via hector

2012-06-04 Thread Roshni Rajagopal
Hi ,

I'm trying to see the effect of different replication factors and consistency 
levels for a keyspace on a 4-node Cassandra cluster.

I'm doing this using hector client.
I could not find an API to set the replication factor for a keyspace, though I 
could find ways to modify the consistency level.

Is it possible to change replication factor using hector or does it have to be 
done using CLI?

Regards,
Roshni
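A sketch of how this could be done through Hector's keyspace-definition API, assuming hector-core on the classpath (cluster, host and keyspace names are placeholders):

```java
// Sketch only: names and host address are placeholders.
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;

public class ChangeReplicationFactor {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("MyCluster", "node1:9160");
        // Fetch the current definition, bump the replication factor,
        // and push the updated definition back to the cluster.
        KeyspaceDefinition current = cluster.describeKeyspace("MyKeyspace");
        ThriftKsDef updated = new ThriftKsDef(current);
        updated.setReplicationFactor(3);
        cluster.updateKeyspace(updated);
        // After raising the factor, run 'nodetool repair' so the new
        // replicas actually receive the existing data.
    }
}
```

The CLI route would be the equivalent `update keyspace` command followed by a repair on each node.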



Re: no snappyjava in java.library.path (JDK 1.7 issue?)

2012-05-15 Thread Roshni Rajagopal
Hi Stephen,

Cassandra's wiki says Cassandra requires "the most stable version of Java 1.6 
you can deploy."

http://wiki.apache.org/cassandra/GettingStarted

Regards,

Roshni

From: Stephen McKamey step...@mckamey.com
Reply-To: user@cassandra.apache.org
Date: Tue, 15 May 2012 13:40:43 -0700
To: Cassandra user@cassandra.apache.org
Subject: Re: no snappyjava in java.library.path (JDK 1.7 issue?)

Reverting to JDK 1.6 appears to fix the issue. Is JDK 1.7 not yet supported by 
Cassandra?

java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-11M3635)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)

On Tue, May 15, 2012 at 12:55 PM, Stephen McKamey 
step...@mckamey.com wrote:
Worth noting is I'm on Mac OS X 10.7.4 and I recently upgraded to the latest 
JDK (really hoping this isn't the issue):

java version "1.7.0_04"
Java(TM) SE Runtime Environment (build 1.7.0_04-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)
