Re: RFC: Cassandra Virtual Nodes

2012-04-10 Thread Sam Overton
There is now a parent ticket for this issue in JIRA: https://issues.apache.org/jira/browse/CASSANDRA-4119 Comments and contributions are still welcome! Cheers, Sam On 16 March 2012 23:38, Sam Overton wrote: > Hello cassandra-dev, > > This is a long email. It concerns a significant change to Ca

Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Zhu Han
On Sat, Mar 24, 2012 at 7:55 AM, Peter Schuller wrote: > > No I don't think you did, in fact, depending on the size of your SSTable > a > > contiguous range (or the entire SSTable) may or may not be affected by a > > cleanup/move or any type of topology change. There is lots of room for > > optim

Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Peter Schuller
> No I don't think you did, in fact, depending on the size of your SSTable a > contiguous range (or the entire SSTable) may or may not be affected by a > cleanup/move or any type of topology change. There is lots of room for > optimization here. After loading the indexes we actually know start/end

Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Ben Coverston
> The SSTable indices should still be scanned for size tiered compaction. > Do I miss anything here? > > No I don't think you did, in fact, depending on the size of your SSTable a contiguous range (or the entire SSTable) may or may not be affected by a cleanup/move or any type of topology change. T

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller wrote: > > You would have to iterate through all sstables on the system to repair > one > > vnode, yes: but building the tree for just one range of the data means > that > > huge portions of the sstables files can be skipped. It should scale down >

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Peter Schuller
> You would have to iterate through all sstables on the system to repair one > vnode, yes: but building the tree for just one range of the data means that > huge portions of the sstables files can be skipped. It should scale down > linearly as the number of vnodes increases (ie, with 100 vnodes, it

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Stu Hood
> > Does the new scheme still require the node to re-iterate all sstables to > build the merkle tree or streaming data for partition level > repair and move? You would have to iterate through all sstables on the system to repair one vnode, yes: but building the tree for just one range of the data

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low wrote: > On 22 March 2012 05:48, Zhu Han wrote: > > > I second it. > > > > Is there some goals we missed which can not be achieved by assigning > > multiple tokens to a single node? > > This is exactly the proposed solution. The discussion is about h

Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Richard Low
On 22 March 2012 05:48, Zhu Han wrote: > I second it. > > Is there some goals we missed which can not be achieved by assigning > multiple tokens to a single node? This is exactly the proposed solution. The discussion is about how to implement this, and the methods of choosing tokens and replica

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Zhu Han
e achieved by assigning multiple tokens to a single node? > > -Jeremiah Jordan > > > From: Rick Branson [rbran...@datastax.com] > Sent: Monday, March 19, 2012 5:16 PM > To: dev@cassandra.apache.org > Subject: Re: RFC: Cassandra Virtual Nodes > >

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
A friend pointed out to me privately that I came across pretty harsh in this thread. While I stand by my technical concerns, I do want to acknowledge that Sam's proposal here indicates a strong grasp of the principles involved, and a deeper level of thought into the issues than I think anyone else

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Vijay
>>> I envision vnodes as Cassandra master being a shared cache, memtables, and manager for what we today consider a Cassandra instance. It might be kind of problematic when you are moving the nodes: you want the data associated with the node to move too; otherwise it will be a pain to clean up after

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
I just see vnodes as a way to make the problem smaller, and by making the problem smaller the overall system is more agile. That is, rather than 1 node streaming 100 GB, 4 nodes each stream 25 GB. Moves by hand are not so bad because they take 1/4 of the time. The simplest vnode implementation is VMware.

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Peter Schuller
> Software wise it is the same deal. Each node streams off only disk 4 > to the new node. I think an implication on software is that if you want to make specific selections of partitions to move, you are effectively incompatible with deterministically generating the mapping of partition to respons

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie wrote: > Hi Edward > >> 1) No more raid 0. If a machine is responsible for 4 vnodes they >> should correspond to for JBOD. > > So each vnode corresponds to a disk?  I suppose we could have a > separate data directory per disk, but I think this should be

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Jonathan Ellis
On Wed, Mar 21, 2012 at 8:50 AM, Eric Evans wrote: > I must admit I find this a little disheartening.  The discussion has > barely started.  No one has had a chance to discuss implementation > specifics so that the rest of us could understand *how* disruptive it > would be (a necessary requirement

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Tom Wilkie
Hi Edward > 1) No more raid 0. If a machine is responsible for 4 vnodes they > should correspond to for JBOD. So each vnode corresponds to a disk? I suppose we could have a separate data directory per disk, but I think this should be a separate, subsequent change. However, do note that making t

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Chris Goffinet
I'm going to agree with Eric on this one. Twitter has wanted some sort of vnode support for quite some time. We even were willing to do all the work. I have reservations about that now. We have been silent due to the community and how this is more like an exclusive Datastax project than an Apache

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Edward Capriolo
On Wed, Mar 21, 2012 at 9:50 AM, Eric Evans wrote: > On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis wrote: >> It's reasonable that we can attach different levels of importance to >> these things.  Taking a step back, I have two main points: >> >> 1) vnodes add enormous complexity to *many* parts

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Eric Evans
On Tue, Mar 20, 2012 at 9:53 PM, Jonathan Ellis wrote: > It's reasonable that we can attach different levels of importance to > these things.  Taking a step back, I have two main points: > > 1) vnodes add enormous complexity to *many* parts of Cassandra.  I'm > skeptical of the cost:benefit ratio

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
It's reasonable that we can attach different levels of importance to these things. Taking a step back, I have two main points: 1) vnodes add enormous complexity to *many* parts of Cassandra. I'm skeptical of the cost:benefit ratio here. 1a) The benefit is lower in my mind because many of the pr

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Peter Schuller
> Each node would have a lower and an upper token, which would form a range > that would be actively distributed via gossip. Read and replication > requests would only be routed to a replica when the key of these operations > matched the replica's token range in the gossip tables. Each node would >
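A minimal sketch of the "active token range" routing rule quoted above, not code from the thread or from Cassandra. The names `ActiveRange`, `eligible_replicas`, and `gossip_table` are hypothetical, and treating the lower bound as exclusive and the upper bound as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveRange:
    """Hypothetical (lower, upper] token range a node advertises via gossip."""
    lower: int  # exclusive bound (assumption)
    upper: int  # inclusive bound (assumption)

    def contains(self, token: int) -> bool:
        if self.lower < self.upper:
            return self.lower < token <= self.upper
        # The range wraps around the end of the ring. With lower == upper
        # this branch is always true, i.e. the range covers the whole ring.
        return token > self.lower or token <= self.upper


def eligible_replicas(token: int, gossip_table: dict) -> list:
    """Return the nodes whose advertised active range covers the token."""
    return [node for node, rng in gossip_table.items() if rng.contains(token)]


# Illustrative gossip state; node names and token values are made up.
gossip_table = {
    "a": ActiveRange(0, 100),
    "b": ActiveRange(50, 150),
    "c": ActiveRange(200, 10),  # wraps past the end of the ring
}
```

Under this sketch, a request whose key hashes to token 75 would be routed only to "a" and "b", and a node could narrow or widen its advertised range without changing its token.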

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:55, Jonathan Ellis wrote: > Here's how I see Sam's list: > > * Even load balancing when growing and shrinking the cluster > > Nice to have, but post-bootstrap load balancing works well in practice > (and is improved by TRP). Post-bootstrap load balancing without vnodes necessa

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Richard Low
On 20 March 2012 14:50, Rick Branson wrote: > To support a form of DF, I think some tweaking of the replica placement could > achieve this effect quite well. We could introduce a variable into replica > placement, which I'm going to incorrectly call DF for the purposes of > illustration. The k

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 19 March 2012 23:41, Peter Schuller wrote: >>> Using this ring bucket in the CRUSH topology, (with the hash function >>> being the identity function) would give the exact same distribution >>> properties as the virtual node strategy that I suggested previously, >>> but of course with much bette

RE: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jeremiah Jordan
whole cluster, not just your neighbor. etc. etc. -Jeremiah Jordan From: Rick Branson [rbran...@datastax.com] Sent: Monday, March 19, 2012 5:16 PM To: dev@cassandra.apache.org Subject: Re: RFC: Cassandra Virtual Nodes I think if we could go back and r

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
On Tue, Mar 20, 2012 at 9:08 AM, Eric Evans wrote: > On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis wrote: >> I like this idea.  It feels like a good 80/20 solution -- 80% of the >> benefits, 20% of the effort.  More like 5% of the effort.  I can't >> even enumerate all the places full vnode sup

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 13:37, Eric Evans wrote: > On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton wrote: >> On 20 March 2012 04:35, Vijay wrote: >>> May be, what i mean is little more simple than that... We can consider >>> every node having a multiple conservative ranges and moving those ranges >>> for

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Rick Branson
> > I like this idea. It feels like a good 80/20 solution -- 80% of the > > benefits, 20% of the effort. More like 5% of the effort. I can't > > even enumerate all the places full vnode support would change, but an > > "active token range" concept would be relatively limited in scope. > > > It on

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 8:39 AM, Jonathan Ellis wrote: > I like this idea.  It feels like a good 80/20 solution -- 80% of the > benefits, 20% of the effort.  More like 5% of the effort.  I can't > even enumerate all the places full vnode support would change, but an > "active token range" concept

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Jonathan Ellis
I like this idea. It feels like a good 80/20 solution -- 80% of the benefits, 20% of the effort. More like 5% of the effort. I can't even enumerate all the places full vnode support would change, but an "active token range" concept would be relatively limited in scope. Full vnodes feels a lot m

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Eric Evans
On Tue, Mar 20, 2012 at 6:40 AM, Sam Overton wrote: > On 20 March 2012 04:35, Vijay wrote: >> On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote: >> >>> I'm guessing you're referring to Rick's proposal about ranges per node? >>> >> >> May be, what i mean is little more simple than that... We can

Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Sam Overton
On 20 March 2012 04:35, Vijay wrote: > On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote: > >> I'm guessing you're referring to Rick's proposal about ranges per node? >> > > May be, what i mean is little more simple than that... We can consider > every node having a multiple conservative ranges a

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
On Mon, Mar 19, 2012 at 8:24 PM, Eric Evans wrote: > I'm guessing you're referring to Rick's proposal about ranges per node? > Maybe; what I mean is a little simpler than that... We can consider every node having multiple conservative ranges and moving those ranges for bootstrap etc., instea

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Eric Evans
On Mon, Mar 19, 2012 at 9:37 PM, Vijay wrote: > I also did create a ticket > https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the > reason why I would like to see vnodes in cassandra. > It can also potentially reduce the SSTable seeks which a node has to do to > query data in Size

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
On Mon, Mar 19, 2012 at 4:45 PM, Peter Schuller wrote: > > As a side note: vnodes fail to provide solutions to node-based limitations > > that seem to me to cause a substantial portion of operational issues such > > as impact of node restarts / upgrades, GC and compaction induced latency. I > > Ac

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Vijay
I also did create a ticket https://issues.apache.org/jira/browse/CASSANDRA-3768 with some of the reasons why I would like to see vnodes in Cassandra. It can also potentially reduce the SSTable seeks which a node has to do to query data in SizeTieredCompaction if extended to the filesystem. But 110% a

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
(I may comment on other things more later) > As a side note: vnodes fail to provide solutions to node-based limitations > that seem to me to cause a substantial portion of operational issues such > as impact of node restarts / upgrades, GC and compaction induced latency. I Actually, it does. At l

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
>> Using this ring bucket in the CRUSH topology, (with the hash function >> being the identity function) would give the exact same distribution >> properties as the virtual node strategy that I suggested previously, >> but of course with much better topology awareness. > > I will have to re-read yo

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
> a) a virtual node partitioning scheme (to support heterogeneity and > management simplicity) > b) topology aware replication > c) topology aware routing I would add (d) limiting the distribution factor to decrease the probability of data loss/multiple failures within a replica set. > First of a

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Rick Branson
I think if we could go back and rebuild Cassandra from scratch, vnodes would likely be implemented from the beginning. However, I'm concerned that implementing them now could be a big distraction from more productive uses of all of our time and introduce major potential stability issues into what i

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:24 PM, Sam Overton wrote: >> For OPP the problem of load balancing is more profound. Now you need >> vnodes per keyspace because you cannot expect each keyspace to have >> the same distribution. With three keyspaces you are not sure as to >> which one is causing the ho

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
> For OPP the problem of load balancing is more profound. Now you need > vnodes per keyspace because you cannot expect each keyspace to have > the same distribution. With three keyspaces you are not sure as to > which one is causing the hotness. I think OPP should just go away. That's a good po

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Edward Capriolo
On Mon, Mar 19, 2012 at 4:15 PM, Sam Overton wrote: > On 19 March 2012 09:23, Radim Kolar wrote: >> >>> >>> Hi Radim, >>> >>> The number of virtual nodes for each host would be configurable by the >>> user, in much the same way that initial_token is configurable now. A host >>> taking a larger nu

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
On 19 March 2012 09:23, Radim Kolar wrote: > >> >> Hi Radim, >> >> The number of virtual nodes for each host would be configurable by the >> user, in much the same way that initial_token is configurable now. A host >> taking a larger number of virtual nodes (tokens) would have >> proportionately >

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Sam Overton
Hi Peter, It's great to hear that others have come to some of the same conclusions! I think a CRUSH-like strategy for topologically aware replication/routing/locality is a great idea. I think I can see three mostly orthogonal sets of functionality that we're concerned with: a) a virtual node par

Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Radim Kolar
Hi Radim, The number of virtual nodes for each host would be configurable by the user, in much the same way that initial_token is configurable now. A host taking a larger number of virtual nodes (tokens) would have proportionately more of the data. This is how we anticipate support for heterog
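To illustrate the point about a host with more virtual nodes (tokens) holding proportionately more data, here is a minimal sketch. It assumes tokens are drawn uniformly at random from a 2**64 ring and that each token owns the range back to the previous token; the token space, the random placement, and the names `assign_tokens` and `ownership` are illustrative assumptions, not the allocation scheme proposed in the thread.

```python
import random

RING_SIZE = 2 ** 64  # hypothetical token space, chosen for illustration


def assign_tokens(vnodes_per_host: dict, seed: int = 0) -> list:
    """Give each host the configured number of random tokens; return the sorted ring."""
    rng = random.Random(seed)
    ring = []
    for host, count in vnodes_per_host.items():
        for _ in range(count):
            ring.append((rng.randrange(RING_SIZE), host))
    ring.sort()
    return ring


def ownership(ring: list) -> dict:
    """Fraction of the ring owned by each host, where a token owns (previous, token]."""
    owned = {}
    prev = ring[-1][0] - RING_SIZE  # wrap around from the last token to the first
    for token, host in ring:
        owned[host] = owned.get(host, 0) + (token - prev)
        prev = token
    return {host: span / RING_SIZE for host, span in owned.items()}


# A host configured with twice as many vnodes should own roughly twice the ring.
fractions = ownership(assign_tokens({"small": 256, "big": 512}))
```

With many tokens per host, the owned fractions land close to the configured ratio (about one third and two thirds here), which is how a larger machine could be given a larger share of the data under these assumptions.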

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
Point of clarification: My use of the term "bucket" is completely unrelated to the term "bucket" used in the CRUSH paper. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
> *The summary is*: we believe virtual nodes are the way forward. We would > like to add virtual nodes to Cassandra and we are asking for comments, > criticism and collaboration! I am very happy to see some momentum on this, and I would like to go even further than what you propose. The main reaso

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Eric Evans
On Sat, Mar 17, 2012 at 3:22 PM, Zhu Han wrote: > On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote: >> This is a long email. It concerns a significant change to Cassandra, so >> deserves a thorough introduction. >> >> *The summary is*: we believe virtual nodes are the way forward. We would >> l

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Edward Capriolo
I agree having smaller regions would help the rebalancing situation both with RP and BOP. However, I am not sure if dividing tables across disks will give any better performance. You will have more seeking spindles and can possibly subdivide token ranges into separate files. But fs cache will ge

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Eric Evans
On Sat, Mar 17, 2012 at 11:15 AM, Radim Kolar wrote: > I don't like that every node will have same portion of data. > > 1. We are using nodes with different HW sizes (number of disks) > 2.  especially with ordered partitioner there tends to be hotspots and you > must assign smaller portion of data

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Zhu Han
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton wrote: > Hello cassandra-dev, > > This is a long email. It concerns a significant change to Cassandra, so > deserves a thorough introduction. > > *The summary is*: we believe virtual nodes are the way forward. We would > like to add virtual nodes to Ca

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Sam Overton
On 17 March 2012 11:15, Radim Kolar wrote: > I don't like that every node will have same portion of data. > > 1. We are using nodes with different HW sizes (number of disks) > 2. especially with ordered partitioner there tends to be hotspots and you > must assign smaller portion of data to nodes

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Radim Kolar
I don't like that every node will have the same portion of data. 1. We are using nodes with different HW sizes (number of disks). 2. Especially with the ordered partitioner there tend to be hotspots, and you must assign a smaller portion of data to nodes holding hotspots.

RFC: Cassandra Virtual Nodes

2012-03-16 Thread Sam Overton
Hello cassandra-dev, This is a long email. It concerns a significant change to Cassandra, so deserves a thorough introduction. *The summary is*: we believe virtual nodes are the way forward. We would like to add virtual nodes to Cassandra and we are asking for comments, criticism and collaboratio