New node has high network and disk usage.

2016-01-06 Thread Vickrum Loi
Hi,

We recently added a new node to our cluster to replace one that died
(hardware failure, we believe). For the next two weeks it had high disk
and network activity. We replaced the server, but it's happened again.
We've looked into memory allowances, disk performance, number of
connections, and all the nodetool stats, but can't find the cause of the
issue.

`nodetool tpstats`[0] shows a lot of active and pending threads (almost
all in ReadStage) compared to the rest of the cluster, but that's likely a
symptom, not a cause.
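
(For anyone comparing against their own cluster: [0] is plain `nodetool
tpstats` output from each node. A rough way to keep sampling it over time
is something like the below, assuming nodetool is on the PATH and talking
to the local node; the 10-second interval is arbitrary.)

# sample thread-pool activity and dropped-message counts every 10s
while true; do
    date
    nodetool tpstats
    sleep 10
done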

`nodetool status`[1] shows the cluster isn't quite balanced. The bad node
(D) has less data.
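
(The figures in [1] are from plain `nodetool status`. Passing a keyspace
name should show effective ownership under that keyspace's replication
settings, which is a fairer balance check; production_analytics below is
just one of our keyspaces, used as an example.)

# effective token ownership per node for a specific keyspace
nodetool status production_analytics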

Disk activity[2] and network activity[3] on this node are far higher than
on the rest.
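
(Something along these lines on the bad node is a rough way to break that
activity down between client reads, compaction and streaming; the 5-second
interval is arbitrary and `sar` needs the sysstat package installed.)

iostat -x 5                # per-device utilisation, await, throughput
sar -n DEV 5               # per-interface network throughput
nodetool netstats          # any active streams to/from other nodes
nodetool compactionstats   # whether compaction accounts for the disk I/O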

The only other difference between this node and the rest of the cluster is
that it's on the ext4 filesystem, whereas the rest are on ext3, but we've
done plenty of testing there and can't see how that would affect
performance on this node so much.
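
(For anyone repeating the filesystem comparison, the obvious things to
diff between the ext3 and ext4 nodes are the mount options and the I/O
scheduler, along these lines; sda is a placeholder for the actual data
device.)

mount | grep -E 'ext3|ext4'            # mount options on the data filesystem
cat /sys/block/sda/queue/scheduler     # I/O scheduler in use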

Nothing of note in system.log.

What should our next step be in trying to diagnose this issue?

Best wishes,
Vic

[0] `nodetool tpstats` output:

Good node:
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0       46311521         0                 0
RequestResponseStage              0         0       23817366         0                 0
MutationStage                     0         0       47389269         0                 0
ReadRepairStage                   0         0          11108         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0        5259908         0                 0
CacheCleanupExecutor              0         0              0         0                 0
MigrationStage                    0         0             30         0                 0
MemoryMeter                       0         0          16563         0                 0
FlushWriter                       0         0          39637         0                26
ValidationExecutor                0         0          19013         0                 0
InternalResponseStage             0         0              9         0                 0
AntiEntropyStage                  0         0          38026         0                 0
MemtablePostFlusher               0         0          81740         0                 0
MiscStage                         0         0          19196         0                 0
PendingRangeCalculator            0         0             23         0                 0
CompactionExecutor                0         0          61629         0                 0
commitlog_archiver                0         0              0         0                 0
HintedHandoff                     0         0             63         0                 0

Message type   Dropped
RANGE_SLICE  0
READ_REPAIR  0
PAGED_RANGE  0
BINARY   0
READ   640
MUTATION 0
_TRACE   0
REQUEST_RESPONSE 0
COUNTER_MUTATION 0

Bad node:
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                        32       113          52216         0                 0
RequestResponseStage              0         0           4167         0                 0
MutationStage                     0         0         127559         0                 0
ReadRepairStage                   0         0            125         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0           9965         0                 0
CacheCleanupExecutor              0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemoryMeter                       0         0             24         0                 0
FlushWriter                       0         0             27         0                 1
ValidationExecutor                0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
AntiEntropyStage                  0         0              0         0                 0
MemtablePostFlusher               0         0             96         0                 0
MiscStage                         0         0              0         0                 0
PendingRangeCalculator            0         0             10         0                 0
CompactionExecutor                1         1             73         0                 0
commitlog_archiver                0         0              0         0                 0
HintedHandoff                     0         0             15         0                 0

Message type   Dropped
RANGE_SLICE130
READ_REPAIR  1
PAGED_RANGE   

Re: New node has high network and disk usage.

2016-01-06 Thread Vickrum Loi
I should probably have mentioned that we're on Cassandra 2.0.10.


Re: New node has high network and disk usage.

2016-01-06 Thread Vickrum Loi
# nodetool compactionstats
pending tasks: 22
   compaction type        keyspace                 table                         completed          total    unit   progress
        Compaction   production_analytics     interactions                       240410213   161172668724   bytes      0.15%
        Compaction   production_decisions     decisions.decisions_q_idx          120815385      226295183   bytes     53.39%
Active compaction remaining time :   2h39m58s

Worth mentioning that compactions haven't been running on this node
particularly often. The node's been performing badly regardless of whether
it's compacting or not.
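
(A low-effort way to confirm that is to watch the pending-task count for a
while and see whether it lines up with the bad periods; the 30-second
interval is arbitrary.)

watch -n 30 'nodetool compactionstats | head -n 3'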

On 6 January 2016 at 16:35, Jeff Ferland <j...@tubularlabs.com> wrote:

> What’s your output of `nodetool compactionstats`?
>