Re: VNode Streaming Math

2016-10-12 Thread Vladimir Yudovin
Hi,



Calculation in general is very simple: each node keeps replication_factor / number_of_nodes of the data (the replicas are spread over all nodes). For example, if you have 100 nodes and a replication factor of three, each node keeps 0.03 (3%) of the total table size.
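As a quick sketch of that arithmetic (illustrative numbers only, assuming a perfectly balanced ring):

```python
# Rough sketch: fraction of the total data each node stores when
# replicas are spread evenly across the ring (idealized balance).
def fraction_per_node(num_nodes: int, replication_factor: int) -> float:
    return replication_factor / num_nodes

# 100 nodes, RF = 3 -> each node holds ~3% of the table
print(fraction_per_node(100, 3))  # -> 0.03
```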



But you can take an even simpler approach: each node keeps more or less the same amount of data. If you add a new node to the cluster, it should receive roughly the same data volume as is stored on any other node.
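Applying that to the numbers in the question below (100 nodes, RF = 3, ~10 GB per node), a back-of-the-envelope estimate might look like this. It assumes perfectly even token ownership, which a real cluster with vnodes only approximates; the key point is that with many vnodes the down node's ranges are replicated across many distinct peers, so the streaming load spreads out, whereas with few (or single) tokens only the RF neighbors of each range can contribute.

```python
# Back-of-the-envelope rebuild math, assuming an evenly balanced ring.
# The replacement node must end up holding the same volume as its peers,
# and with many vnodes every surviving node owns some replica of the
# down node's ranges, so each can stream a roughly equal share.
def rebuild_stream_estimate(num_nodes: int, data_per_node_gb: float):
    total_streamed = data_per_node_gb   # node must receive its full share
    peers = num_nodes - 1               # survivors that can contribute
    per_peer = total_streamed / peers   # ideal even spread across peers
    return total_streamed, per_peer

total, per_peer = rebuild_stream_estimate(100, 10.0)
print(f"replacement receives ~{total} GB, ~{per_peer:.2f} GB from each peer")
```

So the rebuilt node receives about 10 GB in total, roughly 0.1 GB from each of the 99 survivors in the ideal case; with fewer vnodes the same 10 GB would come from fewer peers in larger chunks.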

Best regards,
Vladimir Yudovin
Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.

On Wed, 12 Oct 2016 12:13:25 -0400, Anubhav Kale <anubhav.k...@microsoft.com> wrote:

Hello,

 

Suppose I have a 100-node ring with num_tokens=32 (thus, 32 vnodes per physical machine). Assume this cluster has just one keyspace with one table. There are 10 SSTables on each node, and the size on disk is 10 GB per node. For simplicity, assume each SSTable is 1 GB.

Now, a node went down and I need to rebuild it. Can you please explain the math around how many SSTable files (and how much data) each node would stream to this node? How does that math change as the number of vnodes changes?

I am looking for rough calculations to understand this process better. I am guessing I might have missed some variables here (amount of data per token range?), so please let me know that too!

Thanks much!

VNode Streaming Math

2016-10-12 Thread Anubhav Kale
Hello,

Suppose I have a 100-node ring with num_tokens=32 (thus, 32 vnodes per physical machine). Assume this cluster has just one keyspace with one table. There are 10 SSTables on each node, and the size on disk is 10 GB per node. For simplicity, assume each SSTable is 1 GB.

Now, a node went down and I need to rebuild it. Can you please explain the math around how many SSTable files (and how much data) each node would stream to this node? How does that math change as the number of vnodes changes?

I am looking for rough calculations to understand this process better. I am guessing I might have missed some variables here (amount of data per token range?), so please let me know that too!

Thanks much!