Re: AW: AW: How to control location of data?

2012-01-11 Thread Andreas Rudolph
Hi!

 ... 
 So if you use Cassandra, the actual storage location of the data should not 
 matter to the application. It will be available anywhere in the cluster as 
 long as it is stored on a reachable node.
I suspected as much: Cassandra does not provide a mechanism to strictly 
constrain which nodes in a cluster hold the data for a specific key space, 
because Cassandra is not designed for that purpose.

Thank you very much for your effort and detailed explanation.

  
 




Re: How to control location of data?

2012-01-11 Thread Andreas Rudolph
Hi!

 ...
 Again, it's probably a bad idea. 
I agree with that now.

Thank you.

 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 11/01/2012, at 4:56 AM, Roland Gude wrote:
 
  
 Each node in the cluster is assigned a token (this can be done automatically, 
 but usually should not be).
 The token of a node is the start token of the partition it is responsible 
 for (and the token of the next node is the end token of the current token's 
 partition).
  
 Assume you have the following nodes/tokens (which are usually numbers, but 
 for the example I will use letters):
  
 N1/A
 N2/D
 N3/M
 N4/X
  
 This means that N1 is responsible (primary) for [A-D),
 N2 for [D-M),
 N3 for [M-X),
 and N4 for [X-A).
  
 If you have a replication factor of 1, data will go on the nodes like this:
  
 B - N1
 E - N2
 X - N4
  
 And so on.
 If you have a higher replication factor, the placement strategy decides 
 which nodes will take replicas of which partition (becoming secondary nodes 
 for that partition).
 The simple strategy will just put the replica on the next node in the ring.
 So the same example as above, but with an RF of 2 and the simple strategy:
  
 B - N1 and N2
 E - N2 and N3
 X - N4 and N1
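
The two placements above can be reproduced with a small Python sketch. It follows this example's convention that a node's token is the start of its range, and it uses letters as tokens. `RING` and `replicas_for` are names made up for illustration, not Cassandra APIs. As far as I know, Cassandra itself treats a node's token as the end of the range it owns, so take the direction as illustrative only.

```python
from bisect import bisect_right

# Nodes and tokens from the example; letters stand in for numeric tokens.
# The list must be sorted by token.
RING = [("A", "N1"), ("D", "N2"), ("M", "N3"), ("X", "N4")]


def replicas_for(key, ring=RING, replication_factor=1):
    """Return the nodes storing `key`, primary first (simple strategy)."""
    tokens = [token for token, _ in ring]
    # Primary: the node with the greatest token <= key, i.e. the node whose
    # range [token, next_token) contains the key. A key smaller than every
    # token gives index -1, which wraps around to the last node ([X-A)).
    primary = bisect_right(tokens, key) - 1
    count = min(replication_factor, len(ring))
    # Simple strategy: further replicas go to the next nodes on the ring.
    return [ring[(primary + i) % len(ring)][1] for i in range(count)]


print(replicas_for("B"))                        # RF 1
print(replicas_for("X", replication_factor=2))  # RF 2, wraps around the ring
```

With a replication factor of 1 this returns only the primary node; with 2 it adds the next node on the ring, wrapping from N4 back to N1.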
  
 Other strategies can factor in constraints like “put data in another 
 datacenter” or “put data in another rack”.
  
 Even though the terms primary and secondary suggest some difference in 
 quality or consistency, this is not the case. If a node is responsible for a 
 piece of data, it will store it.
  
  
 But placement of the replicas is usually only relevant for availability 
 reasons (e.g. disaster recovery).
 Actual location should mean nothing to most applications as you can ask any 
 node for the data you want and it will provide it to you (fetching it from 
 the responsible nodes).
 This should be sufficient in almost all cases.
  
 So in the above example again, you can ask N3 “what data is available” and 
 it will tell you: B, E and X, or you could ask it “give me X” and it will 
 fetch it from N4 or N1 or both of them depending on consistency 
 configuration and return the data to you.
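
To illustrate the last point, here is a rough Python sketch of how many replicas a coordinator waits for at a given consistency level. ONE, QUORUM and ALL are Cassandra consistency levels; `replicas_to_contact` is a hypothetical helper. A real coordinator chooses which replicas to ask by other criteria (for example proximity), not simply by list order.

```python
def replicas_to_contact(replicas, consistency):
    """Return the replicas a coordinator must hear from before answering.

    `replicas` is the list of nodes responsible for the key; taking them in
    list order is a simplification made for this sketch.
    """
    required = {
        "ONE": 1,                          # any single replica
        "QUORUM": len(replicas) // 2 + 1,  # a majority of the replicas
        "ALL": len(replicas),              # every replica
    }[consistency]
    return replicas[:required]


# X is stored on N4 and N1 in the RF 2 example.
print(replicas_to_contact(["N4", "N1"], "ONE"))
print(replicas_to_contact(["N4", "N1"], "QUORUM"))
```

With only two replicas, a quorum already means both of them, which is why "N4 or N1 or both" depends on the consistency configuration.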
  
  
 So if you use Cassandra, the actual storage location of the data should not 
 matter to the application. It will be available anywhere in the cluster as 
 long as it is stored on a reachable node.
  

Re: AW: How to control location of data?

2012-01-10 Thread Andreas Rudolph
Hi!

Thank you for your last reply. I'm still wondering if I got you right...

 ... 
 A partitioner decides into which partition a piece of data belongs
Does your statement imply that the partitioner does not make any decisions at 
all about the (physical) storage location? Or, put another way: what do you 
mean by partition?

To quote http://wiki.apache.org/cassandra/ArchitectureInternals: ... 
AbstractReplicationStrategy controls what nodes get secondary, tertiary, etc. 
replicas of each key range. Primary replica is always determined by the token 
ring (...)

 ... 
 You can select different placement strategies and partitioners for different 
 keyspaces, thereby choosing known data to be stored on known hosts.
 This is, however, discouraged for various reasons, e.g. you need a lot of 
 knowledge about your data to keep the cluster balanced. What is your use case 
 for this requirement? There is probably a more suitable solution.
  
What we want is to partition the cluster with respect to key spaces.
That is, we want to establish an association between nodes and key spaces so 
that a node of the cluster holds data from a key space if and only if that node 
is a *member* of that key space.

To our knowledge, Cassandra has no built-in way to specify such a 
membership relation. Therefore we thought of implementing our own replica 
placement strategy, until we began to suspect that the partitioner would have 
to be replaced as well to accomplish the task.

Do you have any ideas?


 From: Andreas Rudolph [mailto:andreas.rudo...@spontech-spine.com] 
 Sent: Tuesday, 10 January 2012 09:53
 To: user@cassandra.apache.org
 Subject: How to control location of data?
  
 Hi!
  
 We're evaluating Cassandra for our storage needs. One of the key benefits we 
 see is the online replication of the data, that is, an easy way to share data 
 across nodes. But we need to control precisely which node group stores 
 specific parts of a key space (columns/column families). Now we're having 
 trouble understanding the documentation. Could anyone help us find some 
 answers to our questions?
 
 ·  What does the term replica mean: If a key is stored on exactly three 
 nodes in a cluster, is it correct then to say that there are three replicas 
 of that key or are there just two replicas (copies) and one original?
 ·  What is the relation between the Cassandra concepts Partitioner and 
 Replica Placement Strategy? According to the documentation on the DataStax 
 web site and the architecture internals page on the Cassandra Wiki, the first 
 storage location of a key (and its associated data) is determined by the 
 Partitioner, whereas additional storage locations are defined by the Replica 
 Placement Strategy. I'm wondering if I could completely redefine how nodes 
 are selected to store a key just by implementing my own subclass of 
 AbstractReplicationStrategy and configuring that subclass into the key space.
 ·  How can I prevent the Partitioner from being consulted at all to determine 
 which node stores a key first?
 ·  Is a key space always distributed across the whole cluster? Is it possible 
 to configure Cassandra in such a way that more or less freely chosen parts of 
 a key space (columns) are stored on arbitrarily chosen nodes?
  
 Any tips would be much appreciated :-)