I'm pretty stumped by this, so here is some more detail in case it helps. This is what the suspicious partition looks like in the `sstabledump` output (some PII etc. redacted):

```
{
  "partition" : {
    "key" : [ "some_user_id_value", "user_id", "demo-test" ],
    "position" : 210
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 1132,
      "clustering" : [ "2019-01-22 15:27:45.000Z" ],
      "liveness_info" : { "tstamp" : "2019-01-22T15:31:12.415081Z" },
      "cells" : [ { "some": "data" } ]
    }
  ]
}
```
And here is what every other partition looks like:

```
{
  "partition" : {
    "key" : [ "some_other_user_id", "user_id", "some_site_id" ],
    "position" : 1133
  },
  "rows" : [
    {
      "type" : "row",
      "position" : 1234,
      "clustering" : [ "2019-01-22 17:59:35.547Z" ],
      "liveness_info" : {
        "tstamp" : "2019-01-22T17:59:35.708Z",
        "ttl" : 86400,
        "expires_at" : "2019-01-23T17:59:35Z",
        "expired" : true
      },
      "cells" : [
        {
          "name" : "activity_data",
          "deletion_info" : { "local_delete_time" : "2019-01-22T17:59:35Z" }
        }
      ]
    }
  ]
}
```

As expected, almost all of the data, except this one suspicious partition, has a TTL and is already expired. But if a partition isn't expired and I can see it in the SSTable, why wouldn't I see it when executing a CQL query against the CF? And why would this one SSTable prevent so many other SSTables from getting cleaned up?

On Tue, Apr 30, 2019 at 12:34 PM Mike Torra <mto...@salesforce.com> wrote:

> Hello -
>
> I have a 48 node C* cluster spread across 4 AWS regions with RF=3. A few
> months ago I started noticing disk usage on some nodes increasing
> consistently. At first I solved the problem by destroying the nodes and
> rebuilding them, but the problem returns.
>
> I did some more investigation recently, and this is what I found:
> - I narrowed the problem down to a CF that uses TWCS, simply by looking at
> disk space usage
> - in each region, 3 nodes have this problem of growing disk space (matching
> the replication factor)
> - on each node, I tracked the problem down to a particular SSTable using
> `sstableexpiredblockers`
> - in that SSTable, using `sstabledump`, I found a row that does not have a
> ttl like the other rows, and appears to be from someone else on the team
> testing something and forgetting to include a ttl
> - all other rows show "expired: true" except this one, hence my suspicion
> - when I query for that particular partition key, I get no results
> - I tried deleting the row anyway, but that didn't seem to change anything
> - I also tried `nodetool scrub`, but that didn't help either
>
> Would this rogue row without a ttl explain the problem? If so, why? If
> not, does anyone have any other ideas? Why does the row show up in
> `sstabledump` but not when I query for it?
>
> I appreciate any help or suggestions!
>
> - Mike
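The blocking behavior asked about above can be sketched roughly like this. This is a simplified model, not Cassandra's actual source, and the names below are illustrative: TWCS only drops a fully expired SSTable when no overlapping, non-expired SSTable holds older data that the expired one might be shadowing. A single partition written without a TTL keeps its SSTable permanently non-expired, so it can block the drop of every later expired SSTable:

```python
# Sketch (assumed/simplified, not Cassandra source) of the TWCS
# "fully expired SSTable" drop check that sstableexpiredblockers reports on.
from dataclasses import dataclass

@dataclass
class SSTable:
    name: str
    min_timestamp: int   # oldest write in the table (illustrative units)
    max_timestamp: int   # newest write in the table
    fully_expired: bool  # every cell TTL'd out and past gc_grace

def droppable(candidate: SSTable, others: list[SSTable]) -> bool:
    """A fully expired SSTable is safe to drop only if no live (non-expired)
    SSTable contains writes older than the candidate's newest write --
    otherwise dropping the candidate could resurrect shadowed data."""
    if not candidate.fully_expired:
        return False
    return all(o.fully_expired or o.min_timestamp > candidate.max_timestamp
               for o in others)

blocker = SSTable("mc-1-big", min_timestamp=100, max_timestamp=150,
                  fully_expired=False)  # holds the one partition with no ttl
expired = SSTable("mc-2-big", min_timestamp=200, max_timestamp=300,
                  fully_expired=True)

print(droppable(expired, [blocker]))  # prints False: the no-ttl table blocks it
print(droppable(expired, []))         # prints True: nothing overlaps, safe to drop
```

Under this model the one non-expiring SSTable never leaves the "others" set, so every newer, fully expired window stays on disk, which matches the steadily growing disk usage described in the thread.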
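On why the DELETE "didn't seem to change anything": a delete in Cassandra is just another write (a tombstone) with a newer timestamp, so reads start returning nothing, but the old cell stays on disk until a compaction merges the two SSTables, and TWCS deliberately avoids compacting across time windows. A rough sketch of that last-write-wins reconciliation (illustrative names, not Cassandra source):

```python
# Simplified last-write-wins reconciliation, assumed/illustrative only.
def reconcile(cells):
    """The cell with the highest timestamp decides what a read returns.
    A winning tombstone hides the value without removing the older copy
    from its SSTable on disk."""
    winner = max(cells, key=lambda c: c["tstamp"])
    return None if winner.get("tombstone") else winner["value"]

# Cell from the old SSTable (the no-ttl row) and the tombstone written by
# the later DELETE, which lands in a different, newer SSTable.
old_cell  = {"tstamp": 100, "value": "no-ttl data"}
tombstone = {"tstamp": 200, "tombstone": True}

print(reconcile([old_cell, tombstone]))  # prints None: reads see the delete
# Both entries still occupy disk until a compaction merges their SSTables,
# which TWCS rarely does across windows.
```

This would also explain seeing the row in `sstabledump` while a CQL query returns nothing: the dump shows the raw contents of one SSTable, while the query reconciles all SSTables (and any tombstone) before answering.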