Re: Replica data distributing between racks
On Wed, May 4, 2011 at 10:09 AM, Konstantin Naryshkin wrote:
> The way that I understand it (and that seems to be consistent with what
> was said in this discussion) is that each DC has its own data space.
> Using your simplified 0-9 system:
>
>   Token  DC1   DC2
>   0      D1R1  D2R2
>   1      D1R1  D2R1
>   2      D1R1  D2R1
>   3      D1R1  D2R1
>   4      D1R1  D2R1
>   5      D1R2  D2R1
>   6      D1R2  D2R2
>   7      D1R2  D2R2
>   8      D1R2  D2R2
>   9      D1R2  D2R2
>
> Each node is responsible for half of the ring in its own DC.

Okay, that makes sense from a primary distribution perspective, but how do
the nodes magically know where to send the data? When using NTS, if there
are two nodes with overlapping tokens, does NTS choose the "closest" node
to place the primary on? If that is the case, then it makes sense.

As far as the replication distribution goes: with a replica going to each
data center ({DC1:1,DC2:1}), does NTS take the token and find the "closest"
node in the opposite data center? So for token 7 in D1 replicating to D2,
it will look for a node with a token range closest to that? In this
scenario it would go to D2R2? That makes sense as far as why the
replication was hot spotting before, where my tokens were N,M,O,P where N
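The placement rule Konstantin describes can be sanity-checked with a short
script. A minimal sketch, not Cassandra source: it puts DC1's nodes at
tokens 0 and 5 and DC2's at 1 and 6 (the "+1 bump" layout discussed in this
thread), and uses the ownership convention the table reads with, i.e. a
node owns keys from its own token up to the next token in its DC. Cassandra
internally assigns a node the range (previous token, token], but the
balance works out the same.

    dcs = {
        "DC1": {0: "D1R1", 5: "D1R2"},
        "DC2": {1: "D2R1", 6: "D2R2"},
    }

    def owner(key, ring):
        # The node with the largest token <= key; wrap to the highest
        # token if the key precedes every token in this DC.
        tokens = sorted(ring)
        eligible = [t for t in tokens if t <= key]
        return ring[eligible[-1] if eligible else tokens[-1]]

    print("Token  DC1   DC2")
    for key in range(10):
        print(f"{key:<6} {owner(key, dcs['DC1']):<5} {owner(key, dcs['DC2'])}")

Running it prints exactly the table above, including token 7 landing on
D2R2 in the second data center.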
Re: Replica data distributing between racks
The way that I understand it (and that seems to be consistent with what was
said in this discussion) is that each DC has its own data space. Using your
simplified 0-9 system:

  Token  DC1   DC2
  0      D1R1  D2R2
  1      D1R1  D2R1
  2      D1R1  D2R1
  3      D1R1  D2R1
  4      D1R1  D2R1
  5      D1R2  D2R1
  6      D1R2  D2R2
  7      D1R2  D2R2
  8      D1R2  D2R2
  9      D1R2  D2R2

Each node is responsible for half of the ring in its own DC.

----- Original Message -----
From: "Eric tamme"
To: user@cassandra.apache.org
Sent: Wednesday, May 4, 2011 1:58:19 PM
Subject: Re: Replica data distributing between racks

> Jonathan is suggesting the approach Jeremiah was using.
>
> Calculate the tokens for the nodes in each DC independently, and then add
> 1 to the tokens if there are two nodes with the same tokens.
>
> In your case with 2 DCs with 2 nodes each:
>
> In DC 1
> node 1 = 0
> node 2 = 85070591730234615865843651857942052864
>
> In DC 2
> node 1 = 1
> node 2 = 85070591730234615865843651857942052865
>
> This will evenly distribute the keys in each DC, which is what the
> NetworkTopologyStrategy is trying to do.

Okay - I appreciate the direct solution, but I am still really confused. I
think I am missing something conceptual here... it just isn't "clicking".

If I have 4 nodes in two data centers, each in its own rack: DC1R1, DC1R2,
DC2R1, DC2R2, with tokens:

DC1R1: N
DC1R2: M
DC2R1: N+1
DC2R2: M+1

who is responsible for what in primary distribution and in replication? Is
DC1R2 responsible for M to M+1 (aka 1 token, M)??? That doesn't make any
sense... or am I supposed to make primary distribution uneven so that the
uneven replication then balances it?

I am trying to conceptualize this... I drew up a graph of the range
responsibility based on this token assignment, using a simplified token
range of 0-9: http://dl.dropbox.com/u/19254184/tokens.jpg

I must be missing something, I just don't know what. Please, if someone can
explain or point me to resources that clearly explain this.

Thanks for everyone's time
-Eric
Re: Replica data distributing between racks
> Jonathan is suggesting the approach Jeremiah was using.
>
> Calculate the tokens for the nodes in each DC independently, and then add
> 1 to the tokens if there are two nodes with the same tokens.
>
> In your case with 2 DCs with 2 nodes each:
>
> In DC 1
> node 1 = 0
> node 2 = 85070591730234615865843651857942052864
>
> In DC 2
> node 1 = 1
> node 2 = 85070591730234615865843651857942052865
>
> This will evenly distribute the keys in each DC, which is what the
> NetworkTopologyStrategy is trying to do.

Okay - I appreciate the direct solution, but I am still really confused. I
think I am missing something conceptual here... it just isn't "clicking".

If I have 4 nodes in two data centers, each in its own rack: DC1R1, DC1R2,
DC2R1, DC2R2, with tokens:

DC1R1: N
DC1R2: M
DC2R1: N+1
DC2R2: M+1

who is responsible for what in primary distribution and in replication? Is
DC1R2 responsible for M to M+1 (aka 1 token, M)??? That doesn't make any
sense... or am I supposed to make primary distribution uneven so that the
uneven replication then balances it?

I am trying to conceptualize this... I drew up a graph of the range
responsibility based on this token assignment, using a simplified token
range of 0-9: http://dl.dropbox.com/u/19254184/tokens.jpg

I must be missing something, I just don't know what. Please, if someone can
explain or point me to resources that clearly explain this.

Thanks for everyone's time
-Eric
Re: Replica data distributing between racks
Eric,

Jonathan is suggesting the approach Jeremiah was using.

Calculate the tokens for the nodes in each DC independently, and then add 1
to the tokens if there are two nodes with the same tokens.

In your case with 2 DCs with 2 nodes each:

In DC 1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

In DC 2
node 1 = 1
node 2 = 85070591730234615865843651857942052865

This will evenly distribute the keys in each DC, which is what the
NetworkTopologyStrategy is trying to do.

You can make this change using nodetool move. Hope that helps.

Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 4 May 2011, at 08:20, Eric tamme wrote:

> On Tue, May 3, 2011 at 4:08 PM, Jonathan Ellis wrote:
>> On Tue, May 3, 2011 at 2:46 PM, aaron morton wrote:
>>> Jonathan,
>>> I think you are saying each DC should have its own (logical) token
>>> ring.
>>
>> Right. (Only with NTS, although you'd usually end up with a similar
>> effect if you alternate DC locations for nodes in a ONTS cluster.)
>>
>>> But currently two endpoints cannot have the same token regardless of
>>> the DC they are in.
>>
>> Also right.
>>
>>> Or should people just bump the tokens in extra DCs to avoid the
>>> collision?
>>
>> Yes.
>
> I am sorry, but I do not understand fully. I would appreciate it if
> someone could explain with more verbosity for me.
>
> I do not understand why data insertion is even, but replication is not.
>
> I do not understand how to solve the problem. What does "bumping" tokens
> entail - is that going to change my insertion distribution? I had no idea
> you could create different logical token rings... and I am not sure what
> that exactly means, or that I even want to do it. Is there a clear
> solution to "fixing" the problem I laid out, and getting replication data
> evenly distributed between racks in each DC?
>
> Sorry again for needing more verbosity - I am learning as I go with this
> stuff. I appreciate everyone's help.
>
> -Eric
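The token arithmetic Aaron describes is mechanical enough to script. A
minimal sketch, assuming the usual balanced RandomPartitioner formula
(i * 2**127 / node_count) with the DC's index as the collision bump; the
numbers it prints match Aaron's:

    RING = 2 ** 127  # RandomPartitioner token space

    def tokens_for_dc(node_count, dc_offset):
        # Evenly spaced tokens for one DC's logical ring, bumped by
        # dc_offset so no two nodes in the cluster share a token.
        return [i * RING // node_count + dc_offset for i in range(node_count)]

    for offset, dc in enumerate(["DC 1", "DC 2"]):
        for n, token in enumerate(tokens_for_dc(2, offset), start=1):
            print(f"{dc} node {n} = {token}")

Each resulting token would then be applied with nodetool move, as Aaron
notes.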
Re: Replica data distributing between racks
On Tue, May 3, 2011 at 4:08 PM, Jonathan Ellis wrote:
> On Tue, May 3, 2011 at 2:46 PM, aaron morton wrote:
>> Jonathan,
>> I think you are saying each DC should have its own (logical) token
>> ring.
>
> Right. (Only with NTS, although you'd usually end up with a similar
> effect if you alternate DC locations for nodes in a ONTS cluster.)
>
>> But currently two endpoints cannot have the same token regardless of
>> the DC they are in.
>
> Also right.
>
>> Or should people just bump the tokens in extra DCs to avoid the
>> collision?
>
> Yes.

I am sorry, but I do not understand fully. I would appreciate it if someone
could explain with more verbosity for me.

I do not understand why data insertion is even, but replication is not.

I do not understand how to solve the problem. What does "bumping" tokens
entail - is that going to change my insertion distribution? I had no idea
you could create different logical token rings... and I am not sure what
that exactly means, or that I even want to do it. Is there a clear solution
to "fixing" the problem I laid out, and getting replication data evenly
distributed between racks in each DC?

Sorry again for needing more verbosity - I am learning as I go with this
stuff. I appreciate everyone's help.

-Eric
Re: Replica data distributing between racks
On Tue, May 3, 2011 at 2:46 PM, aaron morton wrote:
> Jonathan,
> I think you are saying each DC should have its own (logical) token
> ring.

Right. (Only with NTS, although you'd usually end up with a similar effect
if you alternate DC locations for nodes in a ONTS cluster.)

> But currently two endpoints cannot have the same token regardless of
> the DC they are in.

Also right.

> Or should people just bump the tokens in extra DCs to avoid the
> collision?

Yes.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: Replica data distributing between racks
Jonathan,
I think you are saying each DC should have its own (logical) token ring,
which makes sense as the only way to balance the load in each DC. I think
most people (including me) assumed there was a single token ring for the
entire cluster.

But currently two endpoints cannot have the same token regardless of the
DC they are in. Or should people just bump the tokens in extra DCs to
avoid the collision?

Cheers
Aaron

On 4 May 2011, at 03:03, Eric tamme wrote:

> On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis wrote:
>> Right, when you are computing balanced RP tokens for NTS you need to
>> compute the tokens for each DC independently.
>
> I am confused ... sorry. Are you saying that ... I need to change how my
> keys are calculated to fix this problem? Or are you talking about the
> implementation of how replication selects a token?
>
> -Eric
Re: Replica data distributing between racks
On Tue, May 3, 2011 at 10:13 AM, Jonathan Ellis wrote:
> Right, when you are computing balanced RP tokens for NTS you need to
> compute the tokens for each DC independently.

I am confused ... sorry. Are you saying that ... I need to change how my
keys are calculated to fix this problem? Or are you talking about the
implementation of how replication selects a token?

-Eric
RE: Replica data distributing between racks
So we are currently running a 10 node ring in one DC, and we are going to
be adding 5 more nodes in another DC. To keep the rings in each DC
balanced, should I really calculate the tokens independently and just make
sure none of them are the same? Something like:

DC1 (RF 5):
 1: 0
 2: 17014118346046923173168730371588410572
 3: 34028236692093846346337460743176821144
 4: 51042355038140769519506191114765231716
 5: 68056473384187692692674921486353642288
 6: 85070591730234615865843651857942052860
 7: 102084710076281539039012382229530463432
 8: 119098828422328462212181112601118874004
 9: 136112946768375385385349842972707284576
10: 153127065114422308558518573344295695148

DC2 (RF 3):
 1: 1 (one off from DC1 node 1)
 2: 34028236692093846346337460743176821145 (one off from DC1 node 3)
 3: 68056473384187692692674921486353642290 (two off from DC1 node 5)
 4: 102084710076281539039012382229530463435 (three off from DC1 node 7)
 5: 136112946768375385385349842972707284580 (four off from DC1 node 9)

Originally I was thinking I should spread the DC2 nodes evenly in between
every other DC1 node. Or does it not matter where they are with respect to
the DC1 nodes, as long as they fall somewhere after every other DC1 node,
so that the order is DC1-1, DC2-1, DC1-2, DC1-3, DC2-2, DC1-4, DC1-5...?
(A sketch checking the balance of these assignments follows this message.)

-----Original Message-----
From: Jonathan Ellis [mailto:jbel...@gmail.com]
Sent: Tuesday, May 03, 2011 9:14 AM
To: user@cassandra.apache.org
Subject: Re: Replica data distributing between racks

Right, when you are computing balanced RP tokens for NTS you need to
compute the tokens for each DC independently.

On Tue, May 3, 2011 at 6:23 AM, aaron morton wrote:
> I've been digging into this and was able to reproduce something; not
> sure if it's a fault, and I can't work on it any more tonight.
>
> To reproduce:
> - 2 node cluster on my MacBook
> - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g.
> node 1 with 85070591730234615865843651857942052864 and node 2 with
> 127605887595351923798765477786913079296
> - set cassandra-topology.properties to put the nodes in DC1 on RAC1 and
> RAC2
> - create a keyspace using NTS and strategy_options = [{DC1:1}]
>
> Inserted 10 rows; they were distributed as
> - node 1 - 9 rows
> - node 2 - 1 row
>
> I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It
> often says the closest token to a key is node 1 because in effect...
>
> - node 1 is responsible for 0 to 85070591730234615865843651857942052864
> - node 2 is responsible for 85070591730234615865843651857942052864 to
> 127605887595351923798765477786913079296
> - AND node 1 does the wrap around from
> 127605887595351923798765477786913079296 to 0, as keys that would insert
> past the last token in the ring array wrap to 0 because insertMin is
> false.
>
> Thoughts ?
>
> Aaron
>
> On 3 May 2011, at 10:29, Eric tamme wrote:
>
>> On Mon, May 2, 2011 at 5:59 PM, aaron morton wrote:
>>> My bad, I missed the way TokenMetadata.ringIterator() and
>>> firstTokenIndex() work.
>>>
>>> Eric, can you show the output from nodetool ring ?
>>
>> Sorry if the previous paste was way too unformatted, here is a
>> pastie.org link with nicer formatting of nodetool ring output than
>> plain text email allows.
>>
>> http://pastie.org/private/50khpakpffjhsmgf66oetg

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
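Here is the sketch referenced above. It computes each node's share of its
own DC's logical ring from the proposed assignments (the bump values here
are illustrative; any small unique bump avoids collisions):

    RING = 2 ** 127

    dc1 = [i * RING // 10 for i in range(10)]          # the ten DC1 tokens
    dc2 = [t + k + 1 for k, t in enumerate(dc1[::2])]  # every other DC1 token, bumped

    def shares(tokens):
        # Fraction of the DC-local ring each node covers; a node with
        # token T owns (previous token, T], wrapping at the ring edge.
        ts = sorted(tokens)
        return [((t - ts[i - 1]) % RING) / RING for i, t in enumerate(ts)]

    print([f"{s:.3f}" for s in shares(dc1)])  # ten nodes at ~0.100 each
    print([f"{s:.3f}" for s in shares(dc2)])  # five nodes at ~0.200 each

If this reading of NTS is right, balance within a DC depends only on the
spacing of that DC's own tokens, so where the DC2 tokens fall relative to
DC1's should not matter for distribution.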
Re: Replica data distributing between racks
Right, when you are computing balanced RP tokens for NTS you need to
compute the tokens for each DC independently.

On Tue, May 3, 2011 at 6:23 AM, aaron morton wrote:
> I've been digging into this and was able to reproduce something; not
> sure if it's a fault, and I can't work on it any more tonight.
>
> To reproduce:
> - 2 node cluster on my MacBook
> - set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g.
> node 1 with 85070591730234615865843651857942052864 and node 2 with
> 127605887595351923798765477786913079296
> - set cassandra-topology.properties to put the nodes in DC1 on RAC1 and
> RAC2
> - create a keyspace using NTS and strategy_options = [{DC1:1}]
>
> Inserted 10 rows; they were distributed as
> - node 1 - 9 rows
> - node 2 - 1 row
>
> I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It
> often says the closest token to a key is node 1 because in effect...
>
> - node 1 is responsible for 0 to 85070591730234615865843651857942052864
> - node 2 is responsible for 85070591730234615865843651857942052864 to
> 127605887595351923798765477786913079296
> - AND node 1 does the wrap around from
> 127605887595351923798765477786913079296 to 0, as keys that would insert
> past the last token in the ring array wrap to 0 because insertMin is
> false.
>
> Thoughts ?
>
> Aaron
>
> On 3 May 2011, at 10:29, Eric tamme wrote:
>
>> On Mon, May 2, 2011 at 5:59 PM, aaron morton wrote:
>>> My bad, I missed the way TokenMetadata.ringIterator() and
>>> firstTokenIndex() work.
>>>
>>> Eric, can you show the output from nodetool ring ?
>>
>> Sorry if the previous paste was way too unformatted, here is a
>> pastie.org link with nicer formatting of nodetool ring output than
>> plain text email allows.
>>
>> http://pastie.org/private/50khpakpffjhsmgf66oetg

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: Replica data distributing between racks
I've been digging into this and was able to reproduce something; not sure
if it's a fault, and I can't work on it any more tonight.

To reproduce:
- 2 node cluster on my MacBook
- set the tokens as if they were nodes 3 and 4 in a 4 node cluster, e.g.
node 1 with 85070591730234615865843651857942052864 and node 2 with
127605887595351923798765477786913079296
- set cassandra-topology.properties to put the nodes in DC1 on RAC1 and RAC2
- create a keyspace using NTS and strategy_options = [{DC1:1}]

Inserted 10 rows; they were distributed as
- node 1 - 9 rows
- node 2 - 1 row

I *think* the problem has to do with TokenMetadata.firstTokenIndex(). It
often says the closest token to a key is node 1 because in effect...

- node 1 is responsible for 0 to 85070591730234615865843651857942052864
- node 2 is responsible for 85070591730234615865843651857942052864 to
127605887595351923798765477786913079296
- AND node 1 does the wrap around from
127605887595351923798765477786913079296 to 0, as keys that would insert
past the last token in the ring array wrap to 0 because insertMin is false.

(See the sketch after this message for the skew these ranges imply.)

Thoughts ?

Aaron

On 3 May 2011, at 10:29, Eric tamme wrote:

> On Mon, May 2, 2011 at 5:59 PM, aaron morton wrote:
>> My bad, I missed the way TokenMetadata.ringIterator() and
>> firstTokenIndex() work.
>>
>> Eric, can you show the output from nodetool ring ?
>
> Sorry if the previous paste was way too unformatted, here is a pastie.org
> link with nicer formatting of nodetool ring output than plain text email
> allows.
>
> http://pastie.org/private/50khpakpffjhsmgf66oetg
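The skew implied by the ranges in Aaron's message is easy to simulate. A
rough sketch, not the actual TokenMetadata logic, assuming keys hash
uniformly over the RandomPartitioner space: node 1 ends up covering three
quarters of the full ring, which with only 10 rows can easily show up as
9 to 1.

    import random

    RING = 2 ** 127
    NODE1 = RING // 2      # 85070591730234615865843651857942052864
    NODE2 = 3 * RING // 4  # 127605887595351923798765477786913079296

    random.seed(42)
    counts = {NODE1: 0, NODE2: 0}
    for _ in range(100_000):
        key = random.randrange(RING)
        # Node 2 owns (NODE1, NODE2]; node 1 owns everything else,
        # including the wrap-around from NODE2 back to 0 described above.
        counts[NODE2 if NODE1 < key <= NODE2 else NODE1] += 1

    print(f"node 1: {counts[NODE1] / 1000:.1f}%")  # ~75%
    print(f"node 2: {counts[NODE2] / 1000:.1f}%")  # ~25%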
Re: Replica data distributing between racks
On Mon, May 2, 2011 at 5:59 PM, aaron morton wrote:
> My bad, I missed the way TokenMetadata.ringIterator() and
> firstTokenIndex() work.
>
> Eric, can you show the output from nodetool ring ?

Sorry if the previous paste was way too unformatted, here is a pastie.org
link with nicer formatting of nodetool ring output than plain text email
allows.

http://pastie.org/private/50khpakpffjhsmgf66oetg
Re: Replica data distributing between racks
On Mon, May 2, 2011 at 5:59 PM, aaron morton wrote:
> My bad, I missed the way TokenMetadata.ringIterator() and
> firstTokenIndex() work.
>
> Eric, can you show the output from nodetool ring ?

Here is output from nodetool ring - IP addresses changed, obviously.

Address             Status State   Load       Owns    Token
                                                      127605887595351923798765477786913079296
:0::111:0:0:0:      Up     Normal  195.28 GB  25.00%  0
:0::111:0:0:0:aaab  Up     Normal  47.12 GB   25.00%  42535295865117307932921825928971026432
:0::112:0:0:0:      Up     Normal  189.96 GB  25.00%  85070591730234615865843651857942052864
:0::112:0:0:0:aaab  Up     Normal  42.82 GB   25.00%  127605887595351923798765477786913079296
Re: Replica data distributing between racks
My bad, I missed the way TokenMetadata.ringIterator() and firstTokenIndex()
work.

Eric, can you show the output from nodetool ring ?

Aaron

On 3 May 2011, at 07:30, Eric tamme wrote:

> On Mon, May 2, 2011 at 3:22 PM, Jonathan Ellis wrote:
>> On Mon, May 2, 2011 at 2:18 PM, aaron morton wrote:
>>> When the NTS selects replicas in a DC it orders the tokens available
>>> in the DC, then (in the first pass) iterates through them placing a
>>> replica in each unique rack. e.g. if the RF in each DC was 2, the
>>> replicas would be put on 2 unique racks if possible. So the lowest
>>> token in the DC will *always* get a write.
>>
>> It's supposed to start w/ the node closest to the token in each DC, so
>> that shouldn't be the case unless you are using BOP/OPP instead of RP.
>
> I am using the RandomPartitioner as shown below:
>
> Cluster Information:
>   Snitch: org.apache.cassandra.locator.PropertyFileSnitch
>   Partitioner: org.apache.cassandra.dht.RandomPartitioner
>
> So as far as "closeness"... how does that get factored in when using a
> PropertyFileSnitch? Is one rack closer than the other? In reality, for
> each data center there are two nodes in the same rack on the same
> switch, but I set the topology file up to have 2 racks per data center
> specifically so I would get distribution.
>
> -Eric
Re: Replica data distributing between racks
On Mon, May 2, 2011 at 3:22 PM, Jonathan Ellis wrote:
> On Mon, May 2, 2011 at 2:18 PM, aaron morton wrote:
>> When the NTS selects replicas in a DC it orders the tokens available in
>> the DC, then (in the first pass) iterates through them placing a
>> replica in each unique rack. e.g. if the RF in each DC was 2, the
>> replicas would be put on 2 unique racks if possible. So the lowest
>> token in the DC will *always* get a write.
>
> It's supposed to start w/ the node closest to the token in each DC, so
> that shouldn't be the case unless you are using BOP/OPP instead of RP.

I am using the RandomPartitioner as shown below:

Cluster Information:
  Snitch: org.apache.cassandra.locator.PropertyFileSnitch
  Partitioner: org.apache.cassandra.dht.RandomPartitioner

So as far as "closeness"... how does that get factored in when using a
PropertyFileSnitch? Is one rack closer than the other? In reality, for
each data center there are two nodes in the same rack on the same switch,
but I set the topology file up to have 2 racks per data center
specifically so I would get distribution.

-Eric
Re: Replica data distributing between racks
On Mon, May 2, 2011 at 2:18 PM, aaron morton wrote:
> When the NTS selects replicas in a DC it orders the tokens available in
> the DC, then (in the first pass) iterates through them placing a replica
> in each unique rack. e.g. if the RF in each DC was 2, the replicas would
> be put on 2 unique racks if possible. So the lowest token in the DC will
> *always* get a write.

It's supposed to start w/ the node closest to the token in each DC, so
that shouldn't be the case unless you are using BOP/OPP instead of RP.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: Replica data distributing between racks
That appears to be working correctly, but does not sound great.

When the NTS selects replicas in a DC it orders the tokens available in
the DC, then (in the first pass) iterates through them placing a replica
in each unique rack. e.g. if the RF in each DC was 2, the replicas would
be put on 2 unique racks if possible. So the lowest token in the DC will
*always* get a write. (A sketch of this first pass follows at the end of
this message.)

It's not possible to load balance between the racks as there is no state
shared between requests. A possible alternative would be to find the
nearest token to the key and start allocating replicas from there. But as
each DC contains only a part (say half) of the token range, the likelihood
is that half of the keys would match to either end of the DC's range, so
it would not be a great solution.

I think what you are trying to achieve is not possible. Do you have the
capacity to run RF 2 in each DC? That would at least even things out.

Aaron

On 3 May 2011, at 06:40, Eric tamme wrote:

> I am experiencing an issue where replication is not being distributed
> between racks when using PropertyFileSnitch in conjunction with
> NetworkTopologyStrategy.
>
> I am running 0.7.3 from a tar.gz on cassandra.apache.org.
>
> I have 4 nodes, 2 data centers, and 2 racks in each data center. Each
> rack has 1 node.
>
> I have even token distribution so that each node gets 25%:
>
> 0
> 42535295865117307932921825928971026432
> 85070591730234615865843651857942052864
> 127605887595351923798765477786913079296
>
> My cassandra-topology.properties is as follows:
>
> # Cassandra Node IP=Data Center:Rack
> \:0\:\:\:\:fffe=NY1:RAC1
> \:0\:\:\:\:=NY1:RAC2
>
> \:0\:\:\:\:fffe=LA1:RAC1
> \:0\:\:\:\:=LA1:RAC2
>
> # default for unknown nodes
> default=NY1:RAC1
>
> My keyspace replication strategy is as follows:
>
> Keyspace: SipTrace:
>   Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
>     Options: [LA1:1,NY1:1]
>
> So each data center should get 1 copy of the data, and this does happen.
> The problem is that the replicated copies get pinned to the first host
> configured in the properties file, from what I can discern, and DO NOT
> distribute between racks. So I have 2 nodes that have a 4 to 1 ratio of
> data compared to the other 2 nodes. This is a problem!
>
> Can anyone please tell me if I have misconfigured this? Or how I can get
> replica data to distribute evenly between racks within a data center? I
> was led to believe that Cassandra will try to distribute replica data
> between racks automatically under this setup.
>
> Thank you for your help in advance!
>
> -Eric
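Here is the sketch referenced in Aaron's reply. It is one reading of the
first-pass behaviour he describes, not the actual NetworkTopologyStrategy
source: the DC's nodes are walked in token order for every key, taking one
replica per unique rack, so with one replica per DC the lowest token always
wins.

    # Eric's setup within one DC: (token, rack); tokens illustrative.
    DC_NODES = [
        (0, "RAC1"),
        (85070591730234615865843651857942052864, "RAC2"),
    ]

    def place_replicas(rf):
        # Walk nodes in token order (the same order for every key),
        # placing one replica on each rack not yet used.
        replicas, seen_racks = [], set()
        for token, rack in sorted(DC_NODES):
            if rack not in seen_racks:
                replicas.append(token)
                seen_racks.add(rack)
            if len(replicas) == rf:
                break
        return replicas

    print(place_replicas(1))  # always [0]: replicas pin to one node
    print(place_replicas(2))  # both racks used, which is why RF 2 evens out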