Re: Totally unbalanced cluster

Jon Haddad Thu, 04 May 2017 09:11:01 -0700

Adding nodes with NTS is easier, in my opinion.  You don’t need to worry about 
replica placement, if you do it right.


> On May 4, 2017, at 7:43 AM, Cogumelos Maravilha <[email protected]> 
> wrote:
> 
> Hi Alain, thanks for your quick reply.
> 
> 
> Regarding SimpleStrategy, perhaps you are right, but it's so easy to add 
> nodes with it.
> 
> I'm using vnodes with the default of 256 tokens. The information that I 
> posted is from a regular nodetool status <keyspace>.
> 
> My partition key is a sequential bigint, but nodetool cfstats shows that the 
> number of keys is not balanced (data from 3 nodes):
> 
> Number of keys (estimate): 442779640
> 
> Number of keys (estimate): 736380940
> 
> Number of keys (estimate): 451097313
> 
> Should I use nodetool rebuild?
> 
> Running:
> 
> nodetool getendpoints mykeyspace data 9213395123941039285
> 
> 10.1.1.52
> 10.1.1.185
> 
> nodetool getendpoints mykeyspace data 9213395123941039286
> 
> 10.1.1.161
> 10.1.1.19
> 
> All nodes are working hard because my TTL is 18 days and daily data 
> ingestion is around 120,000,000 records:
> 
> nodetool compactionstats -H
> pending tasks: 3
> - mykeyspace.data: 3
> 
> id                                   compaction type     keyspace   table completed  total      unit  progress
> c49599b1-308d-11e7-ba5b-67e232f1bee1 Remove deleted data mykeyspace data  133.89 GiB 158.33 GiB bytes 84.56%
> c49599b0-308d-11e7-ba5b-67e232f1bee1 Remove deleted data mykeyspace data  136.2 GiB  278.96 GiB bytes 48.83%
> 
> Active compaction remaining time :   0h00m00s
> 
> 
> nodetool compactionstats -H
> pending tasks: 2
> - mykeyspace.data: 2
> 
> id                                   compaction type keyspace   table completed total      unit  progress
> b6e8ce80-30d4-11e7-a2be-9b830f114108 Compaction      mykeyspace data  4.05 GiB  133.02 GiB bytes 3.04%
> Active compaction remaining time :   2h17m34s
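
The retention figures quoted above can be turned into a rough steady-state estimate. The following is a minimal sketch, not a measurement: it assumes ingestion is spread evenly over time, that replicas are spread evenly over the 7 nodes listed in the status output, and it ignores rows that have expired but have not yet been compacted away. The function names are illustrative only.

```python
# Rough steady-state arithmetic for the figures quoted in this thread.
# Assumptions (not from the source): even ingestion, even replica placement,
# and no expired-but-not-yet-purged data on disk.

SECONDS_PER_DAY = 86_400


def ttl_days(ttl_seconds):
    """Convert a TTL in seconds (default_time_to_live) to days."""
    return ttl_seconds / SECONDS_PER_DAY


def live_records(daily_records, ttl_seconds):
    """Records that are still alive once the table reaches steady state."""
    return daily_records * ttl_days(ttl_seconds)


def replicas_per_node(daily_records, ttl_seconds, replication_factor, node_count):
    """Average number of row replicas each node would hold if perfectly balanced."""
    total_replicas = live_records(daily_records, ttl_seconds) * replication_factor
    return total_replicas / node_count


if __name__ == "__main__":
    ttl = 1_555_200        # default_time_to_live from the table definition
    daily = 120_000_000    # approximate daily ingestion
    print("TTL in days:", ttl_days(ttl))
    print("Live records:", int(live_records(daily, ttl)))
    print("Replicas per node:", round(replicas_per_node(daily, ttl, 2, 7)))
```

Comparing the per-node figure with each node's 'Number of keys (estimate)' gives only a coarse indication of how far a node is from the balanced case, since that estimate is itself approximate.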
> 
> In this C* version nodetool repair is incremental by default. Since the 
> repair runs on every node at a different hour and I don't want to keep 
> snapshots, I'm clearing them twice a day (I'm not sure whether a snapshot is 
> created with -pr).
> 
> The cleanup entry has already been removed; it was there because the last 
> node was added a few days ago.
> 
> I'm using garbagecollect to force the cleanup since I'm running out of space.
> 
> 
> Regards.
> 
> 
> 
> On 05/04/2017 12:50 PM, Alain RODRIGUEZ wrote:
>> Hi,
>> 
>> CREATE KEYSPACE mykeyspace WITH replication = {'class':
>> 'SimpleStrategy', 'replication_factor': '2'}  AND durable_writes = false;
>> 
>> SimpleStrategy is never recommended for production clusters, as it does 
>> not recognise racks or datacenters, which can induce availability issues 
>> and unpredictable latency when using those. I would not even use it for 
>> testing purposes; I see no point in most cases.
>> 
>> Even if this should be changed, carefully but as soon as possible imho, it 
>> is probably not related to your main issue at hand.
>> 
>> If nodes are imbalanced, there are 3 main questions that come to my mind:
>> 
>> 1. Are the tokens well distributed among the available nodes?
>> 2. Is the data correctly balanced on the token ring (i.e. are the 'id' 
>>    values of the 'mykeyspace.data' table well spread between the nodes)?
>> 3. Are the compaction processes running smoothly on every node?
>> 
>> Point 1 depends on whether you are using vnodes or not and on the number of 
>> vnodes ('num_tokens' in cassandra.yaml).
>> If not using vnodes, you have to manually set the positions of the nodes and 
>> move them around when adding more nodes so things remain balanced.
>> If using vnodes, make sure to use a high enough number of vnodes so the 
>> distribution is 'good enough' (more than 32 in most cases; the default is 
>> 256, which leads to quite balanced rings but brings other issues).
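
For the case without vnodes, the manual positions mentioned above are usually computed as evenly spaced tokens. The following is a minimal sketch, assuming the Murmur3Partitioner (the thread does not say which partitioner is in use), whose token range is -2**63 to 2**63 - 1. The function name is illustrative only.

```python
# Sketch: evenly spaced initial tokens for a ring that does not use vnodes.
# Assumption (not from the source): Murmur3Partitioner, with tokens in
# the range [-2**63, 2**63 - 1].

RING_SIZE = 2**64
MIN_TOKEN = -(2**63)


def balanced_tokens(node_count):
    """Return one evenly spaced token per node, starting at the ring minimum."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = RING_SIZE // node_count
    return [MIN_TOKEN + i * step for i in range(node_count)]


if __name__ == "__main__":
    for index, token in enumerate(balanced_tokens(7)):
        print(f"node {index}: initial_token: {token}")
```

Computing the list for 7 nodes and again for 8 shows that almost every position changes, which is why existing nodes have to be moved when a node is added to a ring without vnodes.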
>> 
>> UN  10.1.1.161  398.39 GiB  256          28.9%
>> UN  10.1.1.19   765.32 GiB  256          29.9%
>> UN  10.1.1.52   574.24 GiB  256          28.2%
>> UN  10.1.1.213  817.56 GiB  256          28.2%
>> UN  10.1.1.85   638.82 GiB  256          28.2%
>> UN  10.1.1.245  408.95 GiB  256          28.7%
>> UN  10.1.1.185  574.63 GiB  256          27.9%
>> 
>> You can get the token ownership information by running 'nodetool status 
>> <mykeyspace>'. Adding the keyspace name to the command gives you the real 
>> ownership. Also, RF = 2 means the total ownership should be 200%, 
>> ideally evenly balanced. I am not sure about the command you ran here. Also, 
>> as general advice, let us know the command you ran and what you expect us 
>> to see in the output.
>> 
>> Still, the tokens seem to be well distributed, and I guess you are using the 
>> default 'num_tokens': 256. So I believe you are not having this issue. But 
>> the delta between the data held on each node is up to x2 (400 GB on some 
>> nodes, 800 GB on some others).
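
The two checks described above (total ownership against RF x 100%, and the spread in load between nodes) can be reproduced from the status rows quoted in this thread. This is a minimal sketch: the rows are copied by hand from the output, and the helper names are illustrative only.

```python
# Rows copied from the 'nodetool status' output quoted in this thread:
# (address, load in GiB, tokens, ownership in percent)
STATUS_ROWS = [
    ("10.1.1.161", 398.39, 256, 28.9),
    ("10.1.1.19", 765.32, 256, 29.9),
    ("10.1.1.52", 574.24, 256, 28.2),
    ("10.1.1.213", 817.56, 256, 28.2),
    ("10.1.1.85", 638.82, 256, 28.2),
    ("10.1.1.245", 408.95, 256, 28.7),
    ("10.1.1.185", 574.63, 256, 27.9),
]
REPLICATION_FACTOR = 2


def ownership_total(rows):
    """Sum of the ownership column, to compare with replication factor x 100."""
    return sum(row[3] for row in rows)


def load_extremes(rows):
    """Return the (address, load) pairs of the lightest and heaviest nodes."""
    lightest = min(rows, key=lambda row: row[1])
    heaviest = max(rows, key=lambda row: row[1])
    return (lightest[0], lightest[1]), (heaviest[0], heaviest[1])


def load_spread(rows):
    """Ratio between the heaviest and the lightest node's load."""
    (_, low), (_, high) = load_extremes(rows)
    return high / low


if __name__ == "__main__":
    expected = REPLICATION_FACTOR * 100
    print(f"ownership total: {ownership_total(STATUS_ROWS):.1f}% (expected {expected}%)")
    print(f"heaviest / lightest load: {load_spread(STATUS_ROWS):.2f}x")
```

Ownership that is close to even while load differs by a factor of about two points away from token placement and toward the data distribution or the compaction state of individual nodes.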
>> 
>> Point 2 highly depends on the workload. Are your partitions evenly 
>> distributed among the nodes? It depends on your primary key. Using a UUID 
>> as the partition key is often a good idea, but it depends on your needs as 
>> well, of course. You could look at the distribution on the distinct nodes 
>> through: 'nodetool cfstats'.
>> 
>> Point 3: even if the tokens are perfectly distributed and the primary key 
>> perfectly randomized, a node can have a disk issue, or any other problem 
>> that makes its compactions fall behind. This would lead that node to hold 
>> more data and, in some cases, to not evict tombstones properly, increasing 
>> the disk space used. Other than that, a big SSTable being compacted on a 
>> node can make the size of that node grow quite suddenly (that's why 20 to 
>> 50% of the disk should always be free, depending on the compaction 
>> strategy in use and the number of concurrent compactions). Here, running 
>> 'nodetool compactionstats -H' on all the nodes would probably help you 
>> troubleshoot.
>> 
>> About crontab
>>  
>> 08 05   * * *   root    nodetool repair -pr
>> 11 11   * * *   root    fstrim -a
>> 04 12   * * *   root    nodetool clearsnapshot
>> 33 13   * * 2   root    nodetool cleanup
>> 35 15   * * *   root    nodetool garbagecollect
>> 46 19   * * *   root    nodetool clearsnapshot
>> 50 23   * * *   root    nodetool flush
>> 
>> I don't understand what you are trying to achieve with some of the commands:
>> 
>> nodetool repair -pr
>> 
>> Repairing the cluster regularly is good in most cases, but as defaults 
>> change between versions, I would specify whether the repair is supposed to 
>> be 'incremental' or 'full', and whether it is supposed to be 'sequential' or 
>> 'parallel', for example. Also, as the dataset grows, some issues will appear 
>> with repairs. Just search for 'repairs cassandra' on Google or any search 
>> engine you are using and you will see that repair is a complex topic. Look 
>> for videos and you will find a lot of information about it in nice talks 
>> like these 2 from the last summit:
>> 
>> https://www.youtube.com/watch?v=FrF8wQuXXks
>> https://www.youtube.com/watch?v=1Sz_K8UID6E
>> 
>> Also some nice tools exist to help with repairs:
>> 
>> The Reaper (originally made at Spotify, now maintained by The Last Pickle): 
>> https://github.com/thelastpickle/cassandra-reaper
>> 'cassandra_range_repair.py': 
>> https://github.com/BrianGallew/cassandra_range_repair
>> 
>> 11 11   * * *   root    fstrim -a
>> 
>> I am not really sure about this one, but it seems fine as long as 'fstrim' 
>> does not create performance issues while it is running.
>> 
>> 04 12   * * *   root    nodetool clearsnapshot
>> 
>> This will automatically erase any snapshot you might want to keep. It might 
>> be good to specify what snapshot you want to remove and name it. Some 
>> snapshots will be created and not removed when using a sequential repair. So 
>> I believe clearing specific snapshots is a good idea to save disk space.
>> 
>> 33 13   * * 2   root    nodetool cleanup
>> 
>> This is to be run on all the nodes after adding a new node. It will just 
>> remove data from the existing nodes that 'gave' some token ranges to the 
>> new node. To do so it will compact all the SSTables. It doesn't seem to be 
>> a good idea to 'cron' that.
>> 
>> 35 15   * * *   root    nodetool garbagecollect
>> 
>> This is also a heavy operation that you should not need on a regular basis: 
>> http://cassandra.apache.org/doc/latest/tools/nodetool/garbagecollect.html
>> What problem are you trying to solve here? Your data uses TTLs and TWCS, so 
>> expired SSTables should be going away without any issue.
>> 
>> 46 19   * * *   root    nodetool clearsnapshot
>> 
>> Again? What for?
>> 
>> 50 23   * * *   root    nodetool flush
>> 
>> This will force all tables to be flushed at the same time, no matter their 
>> sizes or any other consideration. It is not to be used unless you are doing 
>> some testing or debugging, or are on your way to shutting down the node.
>> 
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - [email protected]
>> France
>> 
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>> 
>> 2017-05-04 11:38 GMT+01:00 Cogumelos Maravilha <[email protected]>:
>> Hi all,
>> 
>> I'm using C* 3.10.
>> 
>> CREATE KEYSPACE mykeyspace WITH replication = {'class':
>> 'SimpleStrategy', 'replication_factor': '2'}  AND durable_writes = false;
>> 
>> CREATE TABLE mykeyspace.data (
>>     id bigint PRIMARY KEY,
>>     kafka text
>> ) WITH bloom_filter_fp_chance = 0.5
>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>     AND comment = ''
>>     AND compaction = {'class':
>> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
>> 'compaction_window_size': '10', 'compaction_window_unit': 'HOURS',
>> 'max_threshold': '32', 'min_threshold': '6'}
>>     AND compression = {'chunk_length_in_kb': '64', 'class':
>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>     AND crc_check_chance = 0.0
>>     AND dclocal_read_repair_chance = 0.1
>>     AND default_time_to_live = 1555200
>>     AND gc_grace_seconds = 10800
>>     AND max_index_interval = 2048
>>     AND memtable_flush_period_in_ms = 0
>>     AND min_index_interval = 128
>>     AND read_repair_chance = 0.0
>>     AND speculative_retry = '99PERCENTILE';
>> 
>> UN  10.1.1.161  398.39 GiB  256          28.9%
>> UN  10.1.1.19   765.32 GiB  256          29.9%
>> UN  10.1.1.52   574.24 GiB  256          28.2%
>> UN  10.1.1.213  817.56 GiB  256          28.2%
>> UN  10.1.1.85   638.82 GiB  256          28.2%
>> UN  10.1.1.245  408.95 GiB  256          28.7%
>> UN  10.1.1.185  574.63 GiB  256          27.9%
>> 
>> The crontab on all nodes (only the times differ):
>> 
>> 08 05   * * *   root    nodetool repair -pr
>> 11 11   * * *   root    fstrim -a
>> 04 12   * * *   root    nodetool clearsnapshot
>> 33 13   * * 2   root    nodetool cleanup
>> 35 15   * * *   root    nodetool garbagecollect
>> 46 19   * * *   root    nodetool clearsnapshot
>> 50 23   * * *   root    nodetool flush
>> 
>> How can I fix this?
>> 
>> Thanks in advance.
>
