Thanks for the response, Alain. I am using STCS and would like to take action, as we will be hitting 50% disk usage pretty soon. Would adding nodes be the right way to bring the data per node down? If not, could you or someone on the list please suggest the right way to go about it?
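For reference, here is roughly what I was planning for the expansion; please correct me if this is off. It is only a sketch, assuming vnodes are enabled (num_tokens set), so a new node bootstraps its token ranges automatically:

    # Check current data size per node (the Load column)
    nodetool status

    # Start the new node with the same cluster_name and seed list in
    # cassandra.yaml and let it bootstrap. Once it shows as UN in
    # nodetool status, run this on every pre-existing node to drop the
    # ranges it no longer owns and actually reclaim the disk space:
    nodetool cleanup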
Thanks

Sent from my iPhone

> On Apr 14, 2016, at 5:17 PM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> Hi,
>
>> I seek advice on data size per node. Each of my nodes has close to 1 TB of
>> data. I am not seeing any issues as of now but wanted to run it by you guys
>> in case this data size is pushing the limits in any manner and I should be
>> working on reducing the data size per node.
>
> There is no real limit to the data size other than 50% of the machine disk
> space using STCS and 80% if you are using LCS. Those are 'soft' limits, as it
> will depend mainly on your biggest sstable size and the number of concurrent
> compactions, but to stay away from trouble it is better to keep things under
> control, below the limits mentioned above.
>
>> I will be migrating to incremental repairs shortly, and a full repair as of
>> now takes 20 hr/node. I am not seeing any issues with the nodes for now.
>
> As you noticed, you need to keep in mind that the larger the dataset is, the
> longer operations will take: repairs, but also bootstrapping or replacing a
> node, removing a node, and any operation that requires streaming or reading
> data. Repair time can indeed be mitigated by using incremental repairs.
>
>> I am running a 9 node C* 2.1.12 cluster.
>
> It should be quite safe to give incremental repair a try, as many bugs have
> been fixed in this version:
>
> FIX 2.1.12 - A lot of sstables using range repairs due to anticompaction -
> incremental only
> https://issues.apache.org/jira/browse/CASSANDRA-10422
>
> FIX 2.1.12 - Repair hangs when a replica is down - incremental only
> https://issues.apache.org/jira/browse/CASSANDRA-10288
>
> If you are using DTCS, be aware of
> https://issues.apache.org/jira/browse/CASSANDRA-11113
>
> If using LCS, watch the sstable and pending compaction counts closely.
>
> As a general comment, I would say that Cassandra has evolved to be able to
> handle huge datasets (memory structures off-heap, increased heap sizes
> using G1GC, JBOD, vnodes, ...). Today Cassandra works just fine with big
> datasets. I have seen clusters with 4+ TB nodes and others using a few GB
> per node. It all depends on your requirements and your machine specs. If
> fast operations are absolutely necessary, keep it small. If you want to use
> the entire disk space (50/80% of total disk space max), go ahead as long as
> other resources are fine (CPU, memory, disk throughput, ...).
>
> C*heers,
>
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> 2016-04-14 10:57 GMT+02:00 Aiman Parvaiz <ai...@flipagram.com>:
>> Hi all,
>> I am running a 9 node C* 2.1.12 cluster. I seek advice on data size per
>> node. Each of my nodes has close to 1 TB of data. I am not seeing any
>> issues as of now but wanted to run it by you guys in case this data size
>> is pushing the limits in any manner and I should be working on reducing
>> the data size per node. I will be migrating to incremental repairs
>> shortly, and a full repair as of now takes 20 hr/node. I am not seeing
>> any issues with the nodes for now.
>>
>> Thanks
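P.S. On the incremental repair migration: this is the per-node sequence I am planning, based on the 2.1 docs, so corrections are welcome. It assumes each node can be taken down briefly, and sstables.txt is just a file I would build listing that node's *-Data.db files:

    # 1. Stop new compactions and run one last full repair on the node
    nodetool disableautocompaction
    nodetool repair

    # 2. Stop Cassandra on the node, then mark its existing sstables as
    #    repaired so future incremental repairs skip them
    sstablerepairedset --really-set --is-repaired -f sstables.txt

    # 3. Restart Cassandra and switch to incremental repairs
    #    (-par, since incremental repair cannot be sequential in 2.1)
    nodetool repair -inc -par

I will also keep an eye on nodetool compactionstats while doing this, per your note about watching pending compactions.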