Re: Truncate data from a single node

2017-07-11 Thread Patrick McFadin
Hey Kevin,

I wouldn't worry that much about a truncate operation. It can quietly destroy
all your data very efficiently. One thing you should know is that a
snapshot is automatically created when you issue a truncate. Yes, an
undelete in case you screw up. Just don't be surprised when you find it.

Deleting SSTables is also valid. If you are using something like TWCS you
can pick some files that are older and grouped together. Altering the
keyspace to a different RF won't account for which keys are present in each
SSTable. You could determine the keys in each file, but at this point it's
getting much more complicated.

Find some old SSTables for the table in question and delete them. Much
easier.
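A minimal sketch of that workflow from the command line, with placeholder
keyspace/table names (auto_snapshot defaults to true in cassandra.yaml, so the
snapshot step happens unless it has been turned off):

# truncate via cqlsh; with auto_snapshot enabled, each node snapshots the
# table before the data is removed
cqlsh -e "TRUNCATE my_ks.my_table;"

# see the snapshot that truncate left behind and the disk space it still pins
nodetool listsnapshots

# once you're sure you don't need the undelete, reclaim the space
nodetool clearsnapshot my_ks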

Patrick

On Tue, Jul 11, 2017 at 8:09 PM, Kevin O'Connor wrote:

> This might be an interesting question - but is there a way to truncate
> data from just a single node or two as a test instead of truncating from
> the entire cluster? We have time series data we don't really care if we're
> missing gaps in, but it's taking up a huge amount of space and we're
> looking to clear some. I'm worried if we run a truncate on this huge CF
> it'll end up locking up the cluster, but I don't care so much if it just
> kills a single node.
>
> Is doing something like deleting SSTables from disk possible? If I alter
> this keyspace from an RF of 2 down to 1 and then delete them, they won't be
> able to be repaired if I'm thinking this through right.
>
> Thanks!
>


Truncate data from a single node

2017-07-11 Thread Kevin O'Connor
This might be an interesting question - but is there a way to truncate data
from just a single node or two as a test instead of truncating from the
entire cluster? We have time series data we don't really care if we're
missing gaps in, but it's taking up a huge amount of space and we're
looking to clear some. I'm worried if we run a truncate on this huge CF
it'll end up locking up the cluster, but I don't care so much if it just
kills a single node.

Is doing something like deleting SSTables from disk possible? If I alter
this keyspace from an RF of 2 down to 1 and then delete them, they won't be
able to be repaired if I'm thinking this through right.

Thanks!


Re: reduced num_token = improved performance ??

2017-07-11 Thread Justin Cameron
Hi,

Using fewer vnodes means you'll have a higher chance of hot spots in your
cluster. Hot spots in Cassandra are nodes that, by random chance, are
responsible for a higher percentage of the token space than others. This
means they will receive more data and also more traffic/load than other
nodes in the cluster.

CASSANDRA-7032 goes a long way towards addressing this issue by allocating
vnode tokens more intelligently, rather than just randomly assigning them.
If you're using a version of Cassandra that contains this feature (3.0+),
you can use a smaller number of vnodes in your cluster.

A high number of vnodes won't affect performance for most Cassandra
workloads, but if you're running tasks that need to do token-range scans
(such as Spark), there is usually a significant performance hit.

If you're on C* 3.0+ and are using Spark (or similar workloads - cassandra
lucene index plugin is also affected) then I'd recommend using fewer vnodes
- 16 would be ok. You'll probably still see some variance in token-space
ownership between nodes, but the trade-off for better Spark performance
will likely be worth it.
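For reference, a rough sketch of the relevant cassandra.yaml settings on a new
3.0+ node (the keyspace name is a placeholder; num_tokens can't be changed on a
node that already holds data, and allocate_tokens_for_keyspace is the option
CASSANDRA-7032 introduced):

# cassandra.yaml, new nodes only
num_tokens: 16
# drive token allocation from the replication settings of your main keyspace
allocate_tokens_for_keyspace: my_keyspace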

Justin

On Wed, 12 Jul 2017 at 00:34 ZAIDI, ASAD A  wrote:

> Hi Folks,
>
>
>
> Pardon me if I’m missing something obvious. I’m still using
> apache-cassandra 2.2 and planning to upgrade to 3.x.
>
> I came across this JIRA [
> https://issues.apache.org/jira/browse/CASSANDRA-7032], which suggests that
> reducing num_tokens may improve the general performance of Cassandra, e.g.
> num_tokens=16 instead of 256 may help!
>
>
>
> Can you please suggest whether having fewer num_tokens would provide real
> performance benefits, or whether it comes with any downsides that we should
> also consider? I would much appreciate your insights.
>
>
>
> Thank you
>
> Asad
>
-- 


Justin Cameron
Senior Software Engineer

This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
and Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information.  If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.


Re: c* updates not getting reflected.

2017-07-11 Thread Carlos Rolo
What consistency are you using on those queries?
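If it turns out to be ONE on both reads and writes, that alone could explain it
with RF > 1. A rough sketch of how to check and experiment from cqlsh
(placeholder values; whether LOCAL_QUORUM is appropriate depends on your RF and
topology):

-- show the session's current consistency level (cqlsh defaults to ONE)
CONSISTENCY;
-- read and write at quorum within the local DC for this session
CONSISTENCY LOCAL_QUORUM;

-- IF EXISTS turns the update into a lightweight transaction; the [applied]
-- column in the result shows whether an existing row was actually updated
UPDATE ks1.cf1 SET status = 0 WHERE pid = 1 AND cid = 1 IF EXISTS;

Lightweight transactions add a Paxos round, so they are better as a debugging
aid here than as something to wrap every write in.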

On 11 Jul 2017 19:09, "techpyaasa ."  wrote:

> Hi,
>
> We have a table with following schema:
>
> CREATE TABLE ks1.cf1 ( pid bigint, cid bigint, resp_json text, status int,
> PRIMARY KEY (pid, cid) ) WITH CLUSTERING ORDER BY (cid ASC) with LCS
> compaction strategy.
>
> We make very frequent updates to this table with query like
>
> UPDATE ks1.cf1 SET status = 0 where pid=1 and cid=1;
> UPDATE ks1.cf1 SET resp_json='' where pid=1 and cid=1;
>
>
> Now we are seeing a strange issue where sometimes the status or resp_json
> column value does not appear updated when we read it back with a SELECT query.
>
> We are not seeing any exceptions during the UPDATE query executions.
> Also, is there any way to make sure that the last UPDATE was a success?
>
> We are using C* 2.1.17 and the DataStax Java driver 2.1.18.
>
> Can someone point out what the issue is, or has anybody faced such a strange
> issue?
>
> Any help is appreciated.
>
> Thanks in advance
> TechPyaasa
>


Re: "nodetool repair -dc"

2017-07-11 Thread Anuj Wadehra
Hi,

I have not used DC-local repair specifically, but generally repair syncs all
local tokens of the node with the other replicas (full repair), or a subset of
local tokens (-pr and subrange). A full repair with the -dc option should only
sync data for the tokens present on the node where the command is run with the
other replicas in the local DC.

You should run full repair on all nodes of the DC unless the RF of all
keyspaces in the local DC equals the number of nodes in the DC. E.g. if you
have 3 nodes in DC1 and the RF is DC1:3, repairing a single node should sync
all data within the DC. This doesn't hold true if you have 5 nodes and no node
holds 100% of the data.

Running full repair on all nodes in a DC may lead to repairing every piece of
data RF times. Inefficient! And you can't use -pr with the -dc option; even if
it were allowed, it wouldn't repair the entire ring, since a DC owns only a
subset of the token ring.
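To make the comparison concrete, a rough sketch of the two invocations
(keyspace and datacenter names are placeholders; as noted above, -pr and -dc
don't combine):

# repair only the ranges this node is primary for (run on every node,
# each range gets repaired once)
nodetool repair -pr my_ks

# repair all of this node's data against replicas in one DC only
# (run on every node of that DC unless a single node owns all the data)
nodetool repair -dc DC1 my_ks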
Thanks,
Anuj
 
 
On Tue, 11 Jul 2017 at 20:08, vasu gunja wrote:

Hi,

My question is specific to the -dc option.

Do we need to run this on all nodes that belong to that DC, or only on one of
the nodes in that DC, which will then repair all nodes?

On Sat, Jul 8, 2017 at 10:56 PM, Varun Gupta  wrote:

I do not see the need to run repair, as long as the cluster was in a healthy
state when adding the new nodes.
On Fri, Jul 7, 2017 at 8:37 AM, vasu gunja  wrote:

Hi,

I have a question regarding the "nodetool repair -dc" option. Recently we added
multiple nodes to one DC; we want to perform repair only on the current DC.

Here is my question: do we need to perform "nodetool repair -dc" on all nodes
belonging to that DC, or on only one node of that DC?


thanks,
V



  


c* updates not getting reflected.

2017-07-11 Thread techpyaasa .
Hi,

We have a table with following schema:

CREATE TABLE ks1.cf1 ( pid bigint, cid bigint, resp_json text, status int,
PRIMARY KEY (pid, cid) ) WITH CLUSTERING ORDER BY (cid ASC) with LCS
compaction strategy.

We make very frequent updates to this table with query like

UPDATE ks1.cf1 SET status = 0 where pid=1 and cid=1;
UPDATE ks1.cf1 SET resp_json='' where pid=1 and cid=1;


Now we are seeing a strange issue where sometimes the status or resp_json
column value does not appear updated when we read it back with a SELECT query.

We are not seeing any exceptions during the UPDATE query executions.
Also, is there any way to make sure that the last UPDATE was a success?

We are using C* 2.1.17 and the DataStax Java driver 2.1.18.

Can someone point out what the issue is, or has anybody faced such a strange issue?

Any help is appreciated.

Thanks in advance
TechPyaasa


Re: "nodetool repair -dc"

2017-07-11 Thread vasu gunja
Hi ,

My question is specific to the -dc option.

Do we need to run this on all nodes that belong to that DC?
Or only on one of the nodes in that DC, which will then repair all
nodes?


On Sat, Jul 8, 2017 at 10:56 PM, Varun Gupta  wrote:

> I do not see the need to run repair, as long as the cluster was in a healthy
> state when adding the new nodes.
>
> On Fri, Jul 7, 2017 at 8:37 AM, vasu gunja  wrote:
>
>> Hi ,
>>
>> I have a question regarding the "nodetool repair -dc" option. Recently we
>> added multiple nodes to one DC; we want to perform repair only on the
>> current DC.
>>
>> Here is my question.
>>
>> Do we need to perform "nodetool repair -dc" on all nodes belonging to that
>> DC?
>> or on only one node of that DC?
>>
>>
>>
>> thanks,
>> V
>>
>
>


reduced num_token = improved performance ??

2017-07-11 Thread ZAIDI, ASAD A
Hi Folks,

Pardon me if I’m missing something obvious. I’m still using apache-cassandra
2.2 and planning to upgrade to 3.x.

I came across this JIRA [https://issues.apache.org/jira/browse/CASSANDRA-7032],
which suggests that reducing num_tokens may improve the general performance of
Cassandra, e.g. num_tokens=16 instead of 256 may help!

Can you please suggest whether having fewer num_tokens would provide real
performance benefits, or whether it comes with any downsides that we should
also consider? I would much appreciate your insights.

Thank you
Asad


Re: Unbalanced cluster

2017-07-11 Thread Jonathan Haddad
Awesome utility Avi! Thanks for sharing.
On Tue, Jul 11, 2017 at 10:57 AM Avi Kivity  wrote:

> There is now a readme with some examples and a build file.
>
> On 07/11/2017 11:53 AM, Avi Kivity wrote:
>
> Yeah, posting a github link carries an implied undertaking to write a
> README file and make it easily buildable. I'll see what I can do.
>
>
>
>
> On 07/11/2017 06:25 AM, Nate McCall wrote:
>
> You wouldn't have a build file lying around for that, would you?
>
> On Tue, Jul 11, 2017 at 3:23 PM, Nate McCall wrote:
>
>> On Tue, Jul 11, 2017 at 3:20 AM, Avi Kivity  wrote:
>>
>>>
>>>
>>>
>>> [1] https://github.com/avikivity/shardsim
>>
>>
>> Avi, that's super handy - thanks for posting.
>>
>
>
>
> --
> -
> Nate McCall
> Wellington, NZ
> @zznate
>
> CTO
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
>


Re: Unbalanced cluster

2017-07-11 Thread Avi Kivity

There is now a readme with some examples and a build file.


On 07/11/2017 11:53 AM, Avi Kivity wrote:


Yeah, posting a github link carries an implied undertaking to write a 
README file and make it easily buildable. I'll see what I can do.





On 07/11/2017 06:25 AM, Nate McCall wrote:

You wouldn't have a build file lying around for that, would you?

On Tue, Jul 11, 2017 at 3:23 PM, Nate McCall wrote:


On Tue, Jul 11, 2017 at 3:20 AM, Avi Kivity wrote:




[1] https://github.com/avikivity/shardsim



Avi, that's super handy - thanks for posting.




--
-
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com






Re: Unbalanced cluster

2017-07-11 Thread Avi Kivity
Yeah, posting a github link carries an implied undertaking to write a 
README file and make it easily buildable. I'll see what I can do.





On 07/11/2017 06:25 AM, Nate McCall wrote:

You wouldn't have a build file lying around for that, would you?

On Tue, Jul 11, 2017 at 3:23 PM, Nate McCall wrote:


On Tue, Jul 11, 2017 at 3:20 AM, Avi Kivity wrote:




[1] https://github.com/avikivity/shardsim



Avi, that's super handy - thanks for posting.




--
-
Nate McCall
Wellington, NZ
@zznate

CTO
Apache Cassandra Consulting
http://www.thelastpickle.com




Re: Unbalanced cluster

2017-07-11 Thread Avi Kivity
It is ScyllaDB specific. Scylla divides data not only among nodes, but 
also internally within a node among cores (=shards in our terminology). 
In the past we had problems with shards being over- and under-utilized 
(just like your cluster), so this simulator was developed to validate 
the solution.
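So for plain Cassandra you can leave --shards at 1. A quick sketch of using the
tool to compare vnode counts for a cluster like the one in this thread (the
numbers are placeholders; the actual overcommit figures come from the tool's
output, as in the example quoted below):

# roughly the current layout (this thread's nodes carry 18-32 vnodes each)
./shardsim --vnodes 32 --nodes 33 --shards 1

# the same cluster with the default 256 vnodes per node, for comparison
./shardsim --vnodes 256 --nodes 33 --shards 1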



On 07/11/2017 10:27 AM, Loic Lambiel wrote:

Thanks for the hint and the tool!

By the way, what does the --shards parameter mean?

Thanks

Loic

On 07/10/2017 05:20 PM, Avi Kivity wrote:

32 tokens is too few for 33 nodes. I have a sharding simulator [1] and
it shows


$ ./shardsim --vnodes 32 --nodes 33 --shards 1
33 nodes, 32 vnodes, 1 shards
maximum node overcommit:  1.42642
maximum shard overcommit: 1.426417


So 40% overcommit over the average. Since some nodes can be
undercommitted, this easily explains the 2X difference (40% overcommit +
30% undercommit = 2X).


Newer versions of Cassandra have better token selection and will suffer
less from this.



[1] https://github.com/avikivity/shardsim


On 07/10/2017 04:02 PM, Loic Lambiel wrote:

Hi,

One of our clusters is becoming somewhat unbalanced, at least some of the
nodes:

(output edited to remove unnecessary information)
--  Address        Load     Tokens  Owns (effective)  Rack
UN  192.168.1.22   2.99 TB  32      10.6%             RACK1
UN  192.168.1.23   3.35 TB  32      11.7%             RACK1
UN  192.168.1.20   3.22 TB  32      11.3%             RACK1
UN  192.168.1.21   3.21 TB  32      11.2%             RACK1
UN  192.168.1.18   2.87 TB  32      10.3%             RACK1
UN  192.168.1.19   3.49 TB  32      12.0%             RACK1
UN  192.168.1.16   5.32 TB  32      12.9%             RACK1
UN  192.168.1.17   3.77 TB  32      12.0%             RACK1
UN  192.168.1.26   4.46 TB  32      11.2%             RACK1
UN  192.168.1.24   3.24 TB  32      11.4%             RACK1
UN  192.168.1.25   3.31 TB  32      11.2%             RACK1
UN  192.168.1.134  2.75 TB  18      7.2%              RACK1
UN  192.168.1.135  2.52 TB  18      6.0%              RACK1
UN  192.168.1.132  1.85 TB  18      6.8%              RACK1
UN  192.168.1.133  2.41 TB  18      5.7%              RACK1
UN  192.168.1.130  2.95 TB  18      7.1%              RACK1
UN  192.168.1.131  2.82 TB  18      6.7%              RACK1
UN  192.168.1.128  3.04 TB  18      7.1%              RACK1
UN  192.168.1.129  2.47 TB  18      7.2%              RACK1
UN  192.168.1.14   5.63 TB  32      13.4%             RACK1
UN  192.168.1.15   2.95 TB  32      10.4%             RACK1
UN  192.168.1.12   3.83 TB  32      12.4%             RACK1
UN  192.168.1.13   2.71 TB  32      9.5%              RACK1
UN  192.168.1.10   3.51 TB  32      11.9%             RACK1
UN  192.168.1.11   2.96 TB  32      10.3%             RACK1
UN  192.168.1.126  2.48 TB  18      6.7%              RACK1
UN  192.168.1.127  2.23 TB  18      5.5%              RACK1
UN  192.168.1.124  2.05 TB  18      5.5%              RACK1
UN  192.168.1.125  2.33 TB  18      5.8%              RACK1
UN  192.168.1.122  1.99 TB  18      5.1%              RACK1
UN  192.168.1.123  2.44 TB  18      5.7%              RACK1
UN  192.168.1.120  3.58 TB  28      11.4%             RACK1
UN  192.168.1.121  2.33 TB  18      6.8%              RACK1

Notice the node 192.168.1.14 owns 13.4%  / 5.63TB while node
192.168.1.13 owns only 9.5% / 2.71TB, which is almost twice the load.
They both have 32 tokens.

The cluster is running:

* Cassandra 2.1.16 (initially bootstrapped running 2.1.2, with vnodes
enabled)
* RF=3 with single DC and single rack. LCS as the compaction strategy,
JBOD storage
* Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
* Node cleanup performed on all nodes

Almost all of the cluster load comes from a single CF:

CREATE TABLE blobstore.block (
  inode uuid,
  version timeuuid,
  block bigint,
  offset bigint,
  chunksize int,
  payload blob,
  PRIMARY KEY ((inode, version, block), offset)
) WITH CLUSTERING ORDER BY (offset ASC)
  AND bloom_filter_fp_chance = 0.01
  AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
  AND comment = ''
  AND compaction = {'tombstone_threshold': '0.1',
'tombstone_compaction_interval': '60', 'unchecked_tombstone_compaction':
'false', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
  AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
  AND dclocal_read_repair_chance = 0.1
  AND default_time_to_live = 0
  AND gc_grace_seconds = 172000
  AND max_index_interval = 2048
  AND memtable_flush_period_in_ms = 0
  AND min_index_interval = 128
  AND read_repair_chance = 0.0
  AND speculative_retry = '99.0PERCENTILE';

The payload column is almost the same size in each record.

I understand that an unbalanced cluster may be the result of a bad

Re: Cassandra crashed with OOM, and the system.log and debug.log doesn't match.

2017-07-11 Thread qiang zhang
Thanks for your explanation!

> It's taking a full minute to sync your memtable to disk. This is either an
indication that your disk is broken, or your JVM is pausing for GC.
The disk is OK. The long JVM pauses happen many times; I didn't
disable the paging file on Windows, maybe that's the reason.

> That's a bit thin - depending on data model and data volume, you may be
able to construct a read that fills up your 3G heap, and causes you to OOM
with a single read. How much data is involved?  What does 'nodetool
tablestats' look like, and finally, how many reads/seconds are you doing on
this workload?
The total data size in this database is about 45GB, the read operation is
like this:
SELECT value FROM kairosdb.data_points WHERE key =
44617461506572665f31315cf125b80d6b6169726f735f646f75626c656974656d5f6e616d653d6974656d5f333a
AND column1 >= 4d6d2ac0 AND column1 <= 4ddb07c1 THRIFT LIMIT (partitions=1,
cells_per_partition=1)
I think this operation can limit the maximum size of the result data; is the
memory usage of this operation also limited?

I've already removed all the data and started another test using the same
case, so the 'nodetool tablestats' information is not available now. I
remember that the number of SSTable files for that column family was not too
high (less than 100).

I'm doing 1 query in about 300 seconds, and doing that once every
30 minutes, while at the same time I'm inserting 1 data point every second.

I'm hoping to find a way to prevent the OOM from happening and keep the
Cassandra node alive for a longer time. Are there any suggestions?
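For what it's worth, a rough sketch of the heap side of that (assuming
Cassandra 3.11's conf/jvm.options; the sizes below are placeholders that depend
on available RAM, and the Windows paging-file point above still applies):

# conf/jvm.options: set the heap explicitly instead of relying on the
# auto-calculated value; a 3G heap is easy to exhaust under heavy reads
-Xms8G
-Xmx8G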

Thanks.


2017-07-11 0:18 GMT+08:00 Jeff Jirsa :

>
>
> On 2017-07-10 02:07 (-0700), 张强  wrote:
> > Hi experts, I have a single Cassandra 3.11.0 node working with KairosDB (a
> > time series database). After running 4 days with a stable workload, the
> > database client started to get "request errors", but there are not a lot of
> > error or warning messages in the Cassandra log file. The client started to
> > receive error messages at about 7-7 21:03:00, and KairosDB kept retrying
> > after that time, but there aren't many logs in the Cassandra log file.
> > I noticed the abnormal status at about 7-8 16:00:00, then I typed a
> > "nodetool tablestats" command to get some information. The command got an
> > error, and at that time the Cassandra process started to crash and
> > generated a dump file.
> > After the C* shutdown, I took the logs to see what happened, and I found
> > something strange inside the logs.
> >
> > 1. In the system.log, two lines show that there were no logs between
> > 2017-07-07 21:03:50 and 2017-07-08 16:07:33. I think that is a pretty long
> > period without any logs, and in the gc.log file there are a lot of entries
> > showing long GCs, which should have been logged in system.log.
> > INFO  [ReadStage-1] 2017-07-07 21:03:50,824 NoSpamLogger.java:91 -
> Maximum
> > memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB
>
> Failing to allocate during read stage is a good indication that you're out
> of memory - either the heap is too small, or it's a direct memory
> allocation failure, or something, but that log line probably shouldn't be
> at INFO, because it seems like it's probably hiding a larger problem.
>
> > WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2017-07-08 16:07:33,347
> > NoSpamLogger.java:94 - Out of 1 commit log syncs over the past 0.00s with
> > average duration of 60367.73ms, 1 have exceeded the configured commit
> > interval by an average of 50367.73ms
>
> It's taking a full minute to sync your memtable to disk. This is either an
> indication that your disk is broken, or your JVM is pausing for GC.
>
> >
> > 2. In the system.log, there is a log showing a very long GC, and then
> > C* started to close.
> > WARN  [ScheduledTasks:1] 2017-07-08 16:07:46,846 NoSpamLogger.java:94 -
> > Some operations timed out, details available at debug level (debug.log)
> > WARN  [Service Thread] 2017-07-08 16:10:36,114 GCInspector.java:282 -
> > ConcurrentMarkSweep GC in 688850ms.  CMS Old Gen: 2114938312 ->
> 469583832;
> > Par Eden Space: 837584 -> 305319752; Par Survivor Space: 41943040 ->
> > 25784008
> > ..
> > ERROR [Thrift:22] 2017-07-08 16:10:56,322 CassandraDaemon.java:228 -
> > Exception in thread Thread[Thrift:22,5,main]
> > java.lang.OutOfMemoryError: Java heap space
>
> You ran out of heap. We try to clean up and kill things when this happens,
> but by definition, the JVM is in an undefined state, and we may not be able
> to shut things down properly.
>
> >
> > 3. In the debug.log, the last INFO level log is at 2017-07-07 14:43:59,
> the
> > log is:
> > INFO  [IndexSummaryManager:1] 2017-07-07 14:43:59,967
> > IndexSummaryRedistribution.java:75 - Redistributing index summaries
> > After that, there are DEBUG level logs until 2017-07-07 21:11:34, but no
> > more INFO level or other level logs in that log file, while there are
> still
> > many logs in the system.log after 

Re: Unbalanced cluster

2017-07-11 Thread Loic Lambiel
Thanks for the hint and the tool!

By the way, what does the --shards parameter mean?

Thanks

Loic

On 07/10/2017 05:20 PM, Avi Kivity wrote:
> 32 tokens is too few for 33 nodes. I have a sharding simulator [1] and
> it shows
> 
> 
> $ ./shardsim --vnodes 32 --nodes 33 --shards 1
> 33 nodes, 32 vnodes, 1 shards
> maximum node overcommit:  1.42642
> maximum shard overcommit: 1.426417
> 
> 
> So 40% overcommit over the average. Since some nodes can be
> undercommitted, this easily explains the 2X difference (40% overcommit +
> 30% undercommit = 2X).
> 
> 
> Newer versions of Cassandra have better token selection and will suffer
> less from this.
> 
> 
> 
> [1] https://github.com/avikivity/shardsim
> 
> 
> On 07/10/2017 04:02 PM, Loic Lambiel wrote:
>> Hi,
>>
>> One of our clusters is becoming somewhat unbalanced, at least some of the
>> nodes:
>>
>> (output edited to remove unnecessary information)
>> --  Address        Load     Tokens  Owns (effective)  Rack
>> UN  192.168.1.22   2.99 TB  32      10.6%             RACK1
>> UN  192.168.1.23   3.35 TB  32      11.7%             RACK1
>> UN  192.168.1.20   3.22 TB  32      11.3%             RACK1
>> UN  192.168.1.21   3.21 TB  32      11.2%             RACK1
>> UN  192.168.1.18   2.87 TB  32      10.3%             RACK1
>> UN  192.168.1.19   3.49 TB  32      12.0%             RACK1
>> UN  192.168.1.16   5.32 TB  32      12.9%             RACK1
>> UN  192.168.1.17   3.77 TB  32      12.0%             RACK1
>> UN  192.168.1.26   4.46 TB  32      11.2%             RACK1
>> UN  192.168.1.24   3.24 TB  32      11.4%             RACK1
>> UN  192.168.1.25   3.31 TB  32      11.2%             RACK1
>> UN  192.168.1.134  2.75 TB  18      7.2%              RACK1
>> UN  192.168.1.135  2.52 TB  18      6.0%              RACK1
>> UN  192.168.1.132  1.85 TB  18      6.8%              RACK1
>> UN  192.168.1.133  2.41 TB  18      5.7%              RACK1
>> UN  192.168.1.130  2.95 TB  18      7.1%              RACK1
>> UN  192.168.1.131  2.82 TB  18      6.7%              RACK1
>> UN  192.168.1.128  3.04 TB  18      7.1%              RACK1
>> UN  192.168.1.129  2.47 TB  18      7.2%              RACK1
>> UN  192.168.1.14   5.63 TB  32      13.4%             RACK1
>> UN  192.168.1.15   2.95 TB  32      10.4%             RACK1
>> UN  192.168.1.12   3.83 TB  32      12.4%             RACK1
>> UN  192.168.1.13   2.71 TB  32      9.5%              RACK1
>> UN  192.168.1.10   3.51 TB  32      11.9%             RACK1
>> UN  192.168.1.11   2.96 TB  32      10.3%             RACK1
>> UN  192.168.1.126  2.48 TB  18      6.7%              RACK1
>> UN  192.168.1.127  2.23 TB  18      5.5%              RACK1
>> UN  192.168.1.124  2.05 TB  18      5.5%              RACK1
>> UN  192.168.1.125  2.33 TB  18      5.8%              RACK1
>> UN  192.168.1.122  1.99 TB  18      5.1%              RACK1
>> UN  192.168.1.123  2.44 TB  18      5.7%              RACK1
>> UN  192.168.1.120  3.58 TB  28      11.4%             RACK1
>> UN  192.168.1.121  2.33 TB  18      6.8%              RACK1
>>
>> Notice the node 192.168.1.14 owns 13.4%  / 5.63TB while node
>> 192.168.1.13 owns only 9.5% / 2.71TB, which is almost twice the load.
>> They both have 32 tokens.
>>
>> The cluster is running:
>>
>> * Cassandra 2.1.16 (initially bootstrapped running 2.1.2, with vnodes
>> enabled)
>> * RF=3 with single DC and single rack. LCS as the compaction strategy,
>> JBOD storage
>> * Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
>> * Node cleanup performed on all nodes
>>
>> Almost all of the cluster load comes from a single CF:
>>
>> CREATE TABLE blobstore.block (
>>  inode uuid,
>>  version timeuuid,
>>  block bigint,
>>  offset bigint,
>>  chunksize int,
>>  payload blob,
>>  PRIMARY KEY ((inode, version, block), offset)
>> ) WITH CLUSTERING ORDER BY (offset ASC)
>>  AND bloom_filter_fp_chance = 0.01
>>  AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>  AND comment = ''
>>  AND compaction = {'tombstone_threshold': '0.1',
>> 'tombstone_compaction_interval': '60', 'unchecked_tombstone_compaction':
>> 'false', 'class':
>> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>>  AND compression = {'sstable_compression':
>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>  AND dclocal_read_repair_chance = 0.1
>>  AND default_time_to_live = 0
>>  AND gc_grace_seconds = 172000
>>  AND max_index_interval = 2048
>>  AND memtable_flush_period_in_ms = 0
>>  AND min_index_interval = 128
>>  AND read_repair_chance = 0.0
>>  AND speculative_retry = '99.0PERCENTILE';
>>
>> The payload column is almost the same size in each record.
>>
>> I understand that an unbalanced cluster may be the result of a bad
>> Primary key, which I believe isn't the case here.
>>
>> Any clue on what