Re: Challenge with initial data load with TWCS

2019-09-28 Thread Jeff Jirsa



We used to do either:

- Use CQLSSTableWriter and explicitly break between windows, then use nodetool refresh 
or sstableloader to push the files into the cluster (see the sketch below), or

- Use the normal write path for a single window at a time, explicitly calling 
flush between windows. With this method you can't have current data being written 
while you do the historical load.
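
For the first approach, here is a minimal sketch of what the per-window loop can look like. The keyspace, table, columns, and paths (iot.sensor_data, /bulk/...) are invented for illustration and are not from this thread; the two points that matter are one writer and output directory per time window, and USING TIMESTAMP set to the source event time so the rows land in the historical windows rather than the load-day window.

    import java.io.File;
    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    public class WindowedBulkLoad {
        // Hypothetical schema; the live table would use the same TWCS settings.
        static final String SCHEMA =
            "CREATE TABLE iot.sensor_data ("
          + "  sensor_id uuid, ts timestamp, value double,"
          + "  PRIMARY KEY (sensor_id, ts))"
          + " WITH compaction = {'class': 'TimeWindowCompactionStrategy',"
          + "   'compaction_window_unit': 'DAYS', 'compaction_window_size': '1'}";

        // USING TIMESTAMP carries the source event time (microseconds) and the
        // remaining TTL, so expiry also lines up with the original schedule.
        static final String INSERT =
            "INSERT INTO iot.sensor_data (sensor_id, ts, value)"
          + " VALUES (?, ?, ?) USING TIMESTAMP ? AND TTL ?";

        public static void main(String[] args) throws Exception {
            for (int day = 0; day < 365; day++) {
                // sstableloader infers keyspace/table from the last two path elements
                File dir = new File("/bulk/day-" + day + "/iot/sensor_data");
                dir.mkdirs();

                CQLSSTableWriter writer = CQLSSTableWriter.builder()
                        .inDirectory(dir)
                        .forTable(SCHEMA)
                        .using(INSERT)
                        .build();

                // ... iterate over this day's historical rows here, e.g.:
                // writer.addRow(sensorId, eventTime, value, eventTimeMicros, ttlSeconds);

                writer.close();   // this window's SSTables are complete
                // then: sstableloader /bulk/day-<n>/iot/sensor_data
                // (or copy into the data directory and run nodetool refresh)
            }
        }
    }

The second approach is the same loop without CQLSSTableWriter: write one window's worth of rows through the normal write path, run nodetool flush, and only then start the next window.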



> On Sep 28, 2019, at 1:31 PM, DuyHai Doan  wrote:
> 
> Hello users
> 
> TWCS works great in steady state: it creates SSTables of roughly
> fixed size if your insertion rate is fairly constant.
> 
> The big challenge is the initial load.
> 
> Let's say we configure TWCS with window unit = day and window size =
> 1. We would then have 1 SSTable per day, and with TTL = 365 days all data
> would expire after 1 year.
> 
> Now, since the cluster is still empty, we need to load 1 year's worth
> of data. If we use TWCS and the loading takes 7 days, we would end up with 7
> SSTables, each of them aggregating about 365/7 days' worth of the annual data. Ideally
> we would like TWCS to split this data into 365 distinct SSTables.
> 
> So my question is: how do we manage this scenario? How do we perform an
> initial load for a table using TWCS and make compaction split
> the data nicely based on the source data timestamp rather than the insertion
> timestamp?
> 
> Regards
> 
> Duy Hai DOAN
> 




Re: Cluster sizing for huge dataset

2019-09-28 Thread Jeff Jirsa
A few random thoughts here

1) 90 nodes / 900 TB in a cluster isn't that big; a petabyte per cluster is a 
manageable size.

2) The 2 TB guidance is old and irrelevant for most people; what you really care 
about is how fast you can replace a failed machine.

You'd likely be OK going significantly larger than that if you use a few 
vnodes, since that will help you rebuild faster (you'll stream from more sources on 
rebuild).

If you don't want to use vnodes, buy big machines and run multiple Cassandra 
instances on each: it's not hard to run 3-4 TB per instance and 12-16 TB of SSD per 
machine.

3) Transient replication in 4.0 could potentially be worth trying out, 
depending on your risk tolerance. Doing two full and one transient replica may 
save you about 30% of storage (see the sketch below).

4) Note that you're not factoring in compression, and some of the recent zstd 
work may go a long way if your sensor data is similar / compressible (also in the 
sketch below).
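
A minimal sketch of items 3 and 4, using the Java driver only to make it self-contained. The keyspace, table, and datacenter names (iot, sensor_data, DC1) are invented; transient replication is an experimental feature that has to be switched on in cassandra.yaml first, and '3/1' means three replicas of which one is transient, i.e. the two-full-plus-one-transient layout mentioned above.

    import com.datastax.oss.driver.api.core.CqlSession;

    public class StorageTuningSketch {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // 3) Transient replication (Cassandra 4.0, experimental):
                //    '3/1' = 3 replicas total, 1 of them transient.
                session.execute(
                    "ALTER KEYSPACE iot WITH replication = "
                  + "{'class': 'NetworkTopologyStrategy', 'DC1': '3/1'}");

                // 4) Zstd SSTable compression (new in Cassandra 4.0).
                session.execute(
                    "ALTER TABLE iot.sensor_data WITH compression = "
                  + "{'class': 'ZstdCompressor'}");
            }
        }
    }

The rough intuition for the 30% figure: a transient replica only holds data until repair has propagated it to the full replicas, so steady-state storage drops from three full copies toward two.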

> On Sep 28, 2019, at 1:23 PM, DuyHai Doan  wrote:
> 
> Hello users
> 
> I'm facing a very challenging exercise: sizing a cluster for a huge dataset.
> 
> Use-case = IoT
> 
> Number of sensors: 30 million
> Frequency of data: every 10 minutes
> Estimated size of a data point: 100 bytes (including clustering columns)
> Data retention: 2 years
> Replication factor: 3 (pretty standard)
> 
> Some very quick math gives me:
> 
> 6 data points / hour * 24 * 365 ~ 50 000 data points / year / sensor
> 
> In terms of size, that is 50 000 x 100 bytes = 5 MB worth of data / year / sensor
> 
> Now the big problem is that we have 30 million sensors, so the disk
> requirements add up pretty fast: 5 MB * 30 000 000 = 150 000 000 MB = 150 TB
> worth of data / year
> 
> We want to store data for 2 years => 300 TB
> 
> With RF = 3 ==> 900 TB
> 
> Now, according to the commonly recommended density (with SSDs), one should
> not exceed 2 TB of data per node, which gives us a rough sizing of a
> 450-node cluster!
> 
> Even if we push the limit up to 10 TB per node using TWCS (has anyone tried
> this?), we would still need 90 beefy nodes to support this.
> 
> Any thoughts/ideas on how to reduce the node count or increase density while
> keeping the cluster manageable?
> 
> Regards
> 
> Duy Hai DOAN
> 




Challenge with initial data load with TWCS

2019-09-28 Thread DuyHai Doan
Hello users

TWCS works great in steady state: it creates SSTables of roughly
fixed size if your insertion rate is fairly constant.

The big challenge is the initial load.

Let's say we configure TWCS with window unit = day and window size =
1. We would then have 1 SSTable per day, and with TTL = 365 days all data
would expire after 1 year.
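
For concreteness, a minimal sketch of such a table definition; the keyspace, table, and column names are invented, and only the 1-day window and 365-day TTL come from the description above.

    import com.datastax.oss.driver.api.core.CqlSession;

    public class CreateTwcsTable {
        public static void main(String[] args) {
            // 1-day TWCS windows and a 365-day default TTL, as described above.
            String ddl =
                "CREATE TABLE IF NOT EXISTS iot.sensor_data ("
              + "  sensor_id uuid, ts timestamp, value double,"
              + "  PRIMARY KEY (sensor_id, ts))"
              + " WITH compaction = {'class': 'TimeWindowCompactionStrategy',"
              + "   'compaction_window_unit': 'DAYS', 'compaction_window_size': '1'}"
              + " AND default_time_to_live = 31536000";   // 365 * 24 * 3600 seconds
            try (CqlSession session = CqlSession.builder().build()) {
                session.execute(ddl);
            }
        }
    }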

Now, since the cluster is still empty, we need to load 1 year's worth
of data. If we use TWCS and the loading takes 7 days, we would end up with 7
SSTables, each of them aggregating about 365/7 days' worth of the annual data. Ideally
we would like TWCS to split this data into 365 distinct SSTables.

So my question is: how do we manage this scenario? How do we perform an
initial load for a table using TWCS and make compaction split
the data nicely based on the source data timestamp rather than the insertion
timestamp?

Regards

Duy Hai DOAN




Cluster sizing for huge dataset

2019-09-28 Thread DuyHai Doan
Hello users

I'm facing a very challenging exercise: sizing a cluster for a huge dataset.

Use-case = IoT

Number of sensors: 30 million
Frequency of data: every 10 minutes
Estimated size of a data point: 100 bytes (including clustering columns)
Data retention: 2 years
Replication factor: 3 (pretty standard)

Some very quick math gives me:

6 data points / hour * 24 * 365 ~ 50 000 data points / year / sensor

In terms of size, that is 50 000 x 100 bytes = 5 MB worth of data / year / sensor

Now the big problem is that we have 30 million sensors, so the disk
requirements add up pretty fast: 5 MB * 30 000 000 = 150 000 000 MB = 150 TB
worth of data / year

We want to store data for 2 years => 300 TB

With RF = 3 ==> 900 TB
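
The same arithmetic as a runnable sanity check (a sketch; the inputs are only the figures above, and no compression is assumed):

    public class SizingCheck {
        public static void main(String[] args) {
            long sensors = 30_000_000L;
            long pointsPerYear = 6L * 24 * 365;     // 52 560 points / sensor / year
            long bytesPerPoint = 100;

            double tbPerYearRaw = sensors * pointsPerYear * bytesPerPoint / 1e12;
            double tbOnDisk = tbPerYearRaw * 2 /* years */ * 3 /* RF */;

            // Prints roughly 158 TB/year raw and ~946 TB on disk; the ~900 TB above
            // comes from rounding 52 560 down to 50 000 points per year.
            System.out.printf("%.0f TB/year raw, %.0f TB total (2 years, RF=3)%n",
                              tbPerYearRaw, tbOnDisk);
        }
    }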

Now, according to the commonly recommended density (with SSDs), one should
not exceed 2 TB of data per node, which gives us a rough sizing of a
450-node cluster!

Even if we push the limit up to 10 TB per node using TWCS (has anyone tried
this?), we would still need 90 beefy nodes to support this.

Any thoughts/ideas on how to reduce the node count or increase density while
keeping the cluster manageable?

Regards

Duy Hai DOAN
