Re: Challenge with initial data load with TWCS
We used to do either:

- CQLSSTableWriter, explicitly breaking between windows (then nodetool refresh or sstableloader to push the SSTables into the system), or
- the normal write path for a single window at a time, explicitly calling flush between windows. You can't have current data writing while you do your historical load with this method.

> On Sep 28, 2019, at 1:31 PM, DuyHai Doan wrote:
>
> Hello users
>
> TWCS works great for permanent state. It creates SSTables of roughly
> fixed size if your insertion rate is pretty constant.
>
> Now the big deal is about the initial load.
>
> Let's say we configure TWCS with window unit = day and window size =
> 1: we would have 1 SSTable per day, and with TTL = 365 days all data
> would expire after 1 year.
>
> Now, since the cluster is still empty, we need to load 1 year's worth
> of data. If we use TWCS and the loading takes 7 days, we would have 7
> SSTables, each of them aggregating 365/7 days' worth of the annual data.
> Ideally we would like TWCS to split these data into 365 distinct SSTables.
>
> So my question is: how do we manage this scenario? How do we perform an
> initial load for a table using TWCS and make compaction split the data
> nicely based on the source data timestamp rather than the insertion
> timestamp?
>
> Regards
>
> Duy Hai DOAN
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
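The second approach above - one window at a time, flushing between windows - can be sketched roughly as follows. This is only an illustration of the bucketing idea: `write_row` and `flush` are hypothetical stand-ins for your actual insert path and a `nodetool flush` invocation.

```python
from collections import defaultdict
from datetime import datetime, timezone


def day_window(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its day window."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def bucket_by_window(rows):
    """Group (timestamp, payload) rows by day window, oldest window first."""
    buckets = defaultdict(list)
    for ts, payload in rows:
        buckets[day_window(ts)].append((ts, payload))
    return dict(sorted(buckets.items()))


def load_in_windows(rows, write_row, flush):
    """Write the historical data one window at a time, flushing between
    windows so each window's data lands in its own memtable flush (and
    hence its own SSTable), rather than mixing timestamps across windows."""
    for window, window_rows in bucket_by_window(rows).items():
        for ts, payload in window_rows:
            write_row(ts, payload)
        flush()  # e.g. shell out to `nodetool flush <keyspace> <table>`
```

The key point is the ordering plus the flush boundary: because TWCS buckets SSTables by the data's maximum timestamp, keeping each flush confined to a single window keeps the on-disk layout aligned with the windows.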
Re: Cluster sizing for huge dataset
A few random thoughts here:

1) 90 nodes / 900 TB in a cluster isn't that big. A petabyte per cluster is a manageable size.

2) The 2 TB guidance is old and irrelevant for most people; what you really care about is how fast you can replace a failed machine. You'd likely be OK going significantly larger than that if you use a few vnodes, since that'll help you rebuild faster (you'll stream from more sources on rebuild). If you don't want to use vnodes, buy big machines and run multiple Cassandra instances on each - it's not hard to run 3-4 TB per instance and 12-16 TB of SSD per machine.

3) Transient replication in 4.0 could potentially be worth trying out, depending on your risk tolerance. Doing 2 full and one transient replica may save you 30% storage.

4) Note that you're not factoring in compression, and some of the recent zstd work may go a long way if your sensor data is similar / compressible.

> On Sep 28, 2019, at 1:23 PM, DuyHai Doan wrote:
>
> Hello users
>
> I'm facing a very challenging exercise: sizing a cluster with a huge
> dataset.
>
> Use-case = IoT
>
> Number of sensors: 30 million
> Frequency of data: every 10 minutes
> Estimated size of a data point: 100 bytes (including clustering columns)
> Data retention: 2 years
> Replication factor: 3 (pretty standard)
>
> Some very quick math gives me:
>
> 6 data points / hour * 24 * 365 ~ 50 000 data points / year / sensor
>
> In terms of size, that is 50 000 x 100 bytes = 5 MB worth of data / year / sensor
>
> Now the big problem is that we have 30 million sensors, so the disk
> requirements add up pretty fast: 5 MB * 30 000 000 = 5 TB * 30 = 150 TB
> worth of data / year
>
> We want to store data for 2 years => 300 TB
>
> With RF=3 ==> 900 TB
>
> Now, according to the commonly recommended density (with SSD), one should
> not exceed 2 TB of data per node, which gives us a rough sizing of a
> 450-node cluster!
>
> Even if we push the limit up to 10 TB using TWCS (has anyone tried
> this?), we would still need 90 beefy nodes to support this.
> Any thoughts/ideas to reduce the node count or increase density and
> keep the cluster manageable?
>
> Regards
>
> Duy Hai DOAN
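For a rough sense of points 2 and 3 above, the arithmetic works out as follows. This is only a back-of-the-envelope sketch: treating the transient replica's steady-state footprint as negligible is an idealization (transient replicas do hold unrepaired data), and the per-instance and per-machine figures are taken directly from the reply's guidance.

```python
import math

DATASET_TB = 300                     # unique data over 2 years, from the original post

rf3_tb = DATASET_TB * 3              # 3 full replicas
transient_tb = DATASET_TB * 2        # 2 full + 1 transient, transient idealized as ~empty
savings = 1 - transient_tb / rf3_tb  # ~1/3, in line with the quoted ~30%

per_instance_tb = 3.5                # mid-range of the 3-4 TB per instance guidance
instances = math.ceil(rf3_tb / per_instance_tb)
machines = math.ceil(instances / 4)  # 4 instances per big machine (12-16 TB SSD each)
```

So even without transient replication, packing multiple instances per machine brings the physical machine count down to roughly 65 for the full 900 TB.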
Challenge with initial data load with TWCS
Hello users

TWCS works great for permanent state. It creates SSTables of roughly fixed size if your insertion rate is pretty constant.

Now the big deal is about the initial load.

Let's say we configure TWCS with window unit = day and window size = 1: we would have 1 SSTable per day, and with TTL = 365 days all data would expire after 1 year.

Now, since the cluster is still empty, we need to load 1 year's worth of data. If we use TWCS and the loading takes 7 days, we would have 7 SSTables, each of them aggregating 365/7 days' worth of the annual data. Ideally we would like TWCS to split these data into 365 distinct SSTables.

So my question is: how do we manage this scenario? How do we perform an initial load for a table using TWCS and make compaction split the data nicely based on the source data timestamp rather than the insertion timestamp?

Regards

Duy Hai DOAN
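For reference, the configuration described (day windows, one-year TTL) corresponds to a table definition along these lines. The keyspace, table, and column names are invented for illustration; only the compaction and TTL options come from the scenario above.

```sql
-- Hypothetical schema; keyspace/table/column names are placeholders.
CREATE TABLE iot.sensor_data (
    sensor_id   uuid,
    ts          timestamp,
    value       blob,
    PRIMARY KEY ((sensor_id), ts)
) WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 1
  }
  AND default_time_to_live = 31536000;  -- 365 days, in seconds
```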
Cluster sizing for huge dataset
Hello users

I'm facing a very challenging exercise: sizing a cluster with a huge dataset.

Use-case = IoT

Number of sensors: 30 million
Frequency of data: every 10 minutes
Estimated size of a data point: 100 bytes (including clustering columns)
Data retention: 2 years
Replication factor: 3 (pretty standard)

Some very quick math gives me:

6 data points / hour * 24 * 365 ~ 50 000 data points / year / sensor

In terms of size, that is 50 000 x 100 bytes = 5 MB worth of data / year / sensor

Now the big problem is that we have 30 million sensors, so the disk requirements add up pretty fast: 5 MB * 30 000 000 = 5 TB * 30 = 150 TB worth of data / year

We want to store data for 2 years => 300 TB

With RF=3 ==> 900 TB

Now, according to the commonly recommended density (with SSD), one should not exceed 2 TB of data per node, which gives us a rough sizing of a 450-node cluster!

Even if we push the limit up to 10 TB using TWCS (has anyone tried this?), we would still need 90 beefy nodes to support this.

Any thoughts/ideas to reduce the node count or increase density and keep the cluster manageable?

Regards

Duy Hai DOAN
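Rechecking the arithmetic above: the post's figures are rounded (50 000 points, 5 MB, 150 TB), and the exact numbers come out slightly higher, which nudges the node counts up a bit.

```python
import math

# One data point every 10 minutes = 6 per hour, for a full year.
points_per_year = 6 * 24 * 365            # 52 560, rounded to ~50 000 in the post
bytes_per_point = 100
mb_per_sensor_year = points_per_year * bytes_per_point / 1e6   # ~5.26 MB

sensors = 30_000_000
tb_per_year = mb_per_sensor_year * sensors / 1e6               # ~158 TB/year raw

raw_tb = tb_per_year * 2                  # 2-year retention
total_tb = raw_tb * 3                     # RF = 3 => ~946 TB replicated

nodes_at_2tb = math.ceil(total_tb / 2)    # old 2 TB/node guidance
nodes_at_10tb = math.ceil(total_tb / 10)  # denser nodes with TWCS
```

With the exact figures the 450-node estimate becomes ~474 and the 90-node estimate becomes ~95; the post's rounded numbers are in the right ballpark.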