A few random thoughts:

1) 90 nodes / 900 TB in a cluster isn’t that big. A petabyte per cluster is a 
manageable size. 

2) The 2 TB guidance is old and irrelevant for most people. What you really 
care about is how fast you can replace a failed machine.

You’d likely be OK going significantly larger than that if you use a few 
vnodes, since that’ll help you rebuild faster (you’ll stream from more sources 
on rebuild).

If you don’t want to use vnodes, buy big machines and run multiple Cassandra 
instances on each one. It’s not hard to run 3-4 TB per instance and 12-16 TB 
of SSD per machine.

3) Transient replication in 4.0 could be worth trying out, depending on your 
risk tolerance. Running 2 full replicas and 1 transient replica may save you 
about 30% of storage.

4) Note that you’re not factoring in compression, and some of the recent zstd 
work may go a long way if your sensor data is similar / compressible.
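
To make the points above concrete, here is a rough back-of-the-envelope sketch in Python that redoes the sizing from the original message and lets you vary replica count, compression and per-node density. The function names, the 2.0 compression ratio and the idea of counting a transient replica as zero stored bytes are assumptions for illustration only, not measurements. Real transient replicas still hold unrepaired data, and your compression ratio depends entirely on your data.

```python
import math

# Inputs taken from the original message.
SENSORS = 30_000_000
READINGS_PER_HOUR = 6          # one reading every 10 minutes
BYTES_PER_READING = 100
RETENTION_YEARS = 2
TB = 10**12                    # decimal terabyte


def raw_tb():
    """Uncompressed, unreplicated data set size in TB."""
    readings_per_year = READINGS_PER_HOUR * 24 * 365   # 52,560 (the thread rounds to 50,000)
    total_bytes = SENSORS * readings_per_year * BYTES_PER_READING * RETENTION_YEARS
    return total_bytes / TB


def cluster_tb(full_replicas, compression_ratio=1.0):
    """Total on-disk size in TB.

    full_replicas: number of replicas that store the full data set.
        Assumption: a transient replica is counted as storing nothing,
        which is optimistic (it holds unrepaired data in practice).
    compression_ratio: uncompressed size / compressed size (hypothetical).
    """
    return raw_tb() * full_replicas / compression_ratio


def nodes_needed(total_tb, tb_per_node):
    """Number of nodes (or instances) at a given density, rounded up."""
    return math.ceil(total_tb / tb_per_node)


if __name__ == "__main__":
    scenarios = [
        ("RF=3, no compression", cluster_tb(3)),
        ("2 full + 1 transient, no compression", cluster_tb(2)),
        ("RF=3, assumed 2x compression", cluster_tb(3, compression_ratio=2.0)),
    ]
    for label, total in scenarios:
        for density in (2, 4, 10):
            print(f"{label}: {total:.0f} TB -> {nodes_needed(total, density)} nodes at {density} TB/node")
```

Without rounding the per-sensor figure down to 50 000 readings a year, the totals come out about 5% higher than in the original message (roughly 946 TB instead of 900 TB at RF=3), which doesn't change the conclusion.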

> On Sep 28, 2019, at 1:23 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
> 
> Hello users
> 
> I'm facing a very challenging exercise: sizing a cluster for a huge 
> dataset.
> 
> Use-case = IoT
> 
> Number of sensors: 30 million
> Frequency of data: every 10 minutes
> Estimated size of a data point: 100 bytes (including clustering columns)
> Data retention: 2 years
> Replication factor: 3 (pretty standard)
> 
> A very quick math gives me:
> 
> 6 data points / hour * 24 * 365 ~50 000 data points/ year/ sensor
> 
> In terms of size, that is 50 000 x 100 bytes = 5 MB worth of data / year / sensor
> 
> Now the big problem is that we have 30 million sensors, so the disk
> requirements add up pretty fast: 5 MB * 30 000 000 = 5 TB * 30 = 150 TB
> worth of data / year
> 
> We want to store data for 2 years => 300 TB
> 
> We have RF=3 ==> 900 TB !!!!
> 
> Now, according to the commonly recommended density (with SSD), one should
> not exceed 2 TB of data per node, which gives us a rough sizing of a
> 450-node cluster !!!
> 
> Even if we push the limit up to 10 TB using TWCS (has anyone tried this
> ?), we would still need 90 beefy nodes to support this.
> 
> Any thoughts/ideas to reduce the node count or increase density and
> keep the cluster manageable ?
> 
> Regards
> 
> Duy Hai DOAN
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 
