How large is your row? You may meet reading wide row problem.
-Simon
From: Laxmikant Upadhyay
Date: 2018-09-05 01:01
To: user
Subject: High IO and poor read performance on 3.11.2 cassandra cluster
We have 3 node cassandra cluster (3.11.2) in single dc.
We have written 450 million records on the table with LCS. The write latency is
fine. After write we perform read and update operations.
When we run read+update operations on newly inserted 1 million records (on top
of 450 m records) then the read latency and io usage is under control. However
when we perform read+update on old 1 million records which are part of 450
million records we observe high read latency (The performance goes down by 4
times in comparison 1st case ). We have not observed major gc pauses.
system information:
cpu core : 24
disc type : ssd . we are using raid with deadline schedular
disk space:
df -h :
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 1.9T 393G 1.5T 22% /var/lib/cassandra
memory:
free -g
total used free shared buff/cache available
Mem: 62 30 0 0 32 31
Swap: 8 0 8
==========================================
schema
desc table ks.xyz;
CREATE TABLE ks.xyz (
key text,
column1 text,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.1
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
==============================================================================
Below is some system stats snippet when read operations was running:
iotop -o : Observation : the disk read goes up to 5.5 G/s
Total DISK READ : 3.86 G/s | Total DISK WRITE : 1252.88 K/s
Actual DISK READ: 3.92 G/s | Actual DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
10715 be/4 cassandr 375.89 M/s 99.79 K/s 0.00 % 29.15 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10714 be/4 cassandr 358.56 M/s 107.18 K/s 0.00 % 27.06 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10712 be/4 cassandr 351.86 M/s 147.83 K/s 0.00 % 25.02 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10718 be/4 cassandr 359.82 M/s 110.87 K/s 0.00 % 24.49 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10711 be/4 cassandr 333.03 M/s 125.66 K/s 0.00 % 23.37 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10716 be/4 cassandr 330.80 M/s 103.48 K/s 0.00 % 23.02 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10717 be/4 cassandr 319.49 M/s 118.27 K/s 0.00 % 22.11 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10713 be/4 cassandr 300.62 M/s 118.27 K/s 0.00 % 21.65 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10719 be/4 cassandr 294.98 M/s 81.31 K/s 0.00 % 21.60 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10720 be/4 cassandr 289.00 M/s 73.92 K/s 0.00 % 21.45 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10742 be/4 cassandr 240.98 M/s 81.31 K/s 0.00 % 17.68 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10743 be/4 cassandr 224.43 M/s 36.96 K/s 0.00 % 17.57 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10744 be/4 cassandr 113.29 M/s 14.78 K/s 0.00 % 10.22 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
10745 be/4 cassandr 61.63 M/s 33.26 K/s 0.00 % 4.20 % java
-Dorg.xerial.snappy.tempdir=/var/tmp
-Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
==================================================================================================
iostats -x 5
avg-cpu: %user %nice %system %iowait %steal %idle
2.35 0.00 7.09 11.26 0.00 79.30
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sdb 205.60 0.00 5304.60 3.00 3707651.20 2450.40 1398.03
11.37 2.14 2.14 0.67 0.18 93.60
avg-cpu: %user %nice %system %iowait %steal %idle
2.99 0.00 7.49 10.95 0.00 78.56
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
avgqu-sz await r_await w_await svctm %util
sdb 228.00 0.00 5639.40 4.80 3947922.40 1348.00 1399.41
12.65 2.24 2.25 0.46 0.17 94.00
==================================================================================================
nodetool info
ID : ddab003d-41d1-413e-ace2-87ccd34229d5
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 391.58 GiB
Generation No : 1536065356
Uptime (seconds) : 4714
Heap Memory (MB) : 5172.52 / 13312.00
Off Heap Memory (MB) : 2288.28
Data Center : DC1
Rack : RAC1
Exceptions : 0
Key Cache : entries 1604008, size 167.04 MiB, capacity 256 MiB,
3615 hits, 1801574 requests, 0.002 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0
requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0
requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 7680, size 480 MiB, capacity 480 MiB, 1399440
misses, 2338352 requests, 0.402 recent hit rate, 2760.976 microseconds miss
latency
Percent Repaired : 1.6357850897532832E-8%
Token : (invoke with -T/--tokens to see all 256 tokens)
=============================================================================
nodetool tablehistograms ks xyz : Observation : not a wide row
ks/xyz histograms
Percentile SSTables Write Latency Read Latency Partition Size
Cell Count
(micros) (micros) (bytes)
50% 1.00 17.08 5839.59 1109
17
75% 1.00 20.50 7007.51 1331
24
95% 2.00 29.52 12108.97 1331
24
98% 2.00 35.43 14530.76 1331
24
99% 2.00 42.51 17436.92 1331
24
Min 1.00 5.72 105.78 150
2
Max 2.00 315.85 464228.84 1916
35
nodetool tablehistograms ks xyz
ks/xyz histograms
Percentile SSTables Write Latency Read Latency Partition Size
Cell Count
(micros) (micros) (bytes)
50% 1.00 17.08 5839.59 1109
17
75% 1.00 20.50 7007.51 1331
24
95% 2.00 29.52 12108.97 1331
24
98% 2.00 35.43 14530.76 1331
24
99% 2.00 42.51 17436.92 1331
24
Min 1.00 4.77 105.78 150
2
Max 2.00 315.85 464228.84 1916
35
======================================================================
nodetool tpstats
Pool Name Active Pending Completed Blocked
All time blocked
ReadStage 3 1 1603986 0
0
MiscStage 0 0 0 0
0
CompactionExecutor 0 0 6670 0
0
MutationStage 0 0 5044371 0
0
MemtableReclaimMemory 0 0 57 0
0
PendingRangeCalculator 0 0 3 0
0
GossipStage 0 0 14299 0
0
SecondaryIndexManagement 0 0 0 0
0
HintsDispatcher 0 0 0 0
0
RequestResponseStage 0 1 3867575 0
0
Native-Transport-Requests 3 0 2319802 0
0
ReadRepairStage 0 0 13281 0
0
CounterMutationStage 0 0 0 0
0
MigrationStage 0 0 2 0
0
MemtablePostFlush 0 0 65 0
0
PerDiskMemtableFlushWriter_0 0 0 57 0
0
ValidationExecutor 0 0 0 0
0
Sampler 0 0 0 0
0
MemtableFlushWriter 0 0 57 0
0
InternalResponseStage 0 0 0 0
0
ViewMutationStage 0 0 0 0
0
AntiEntropyStage 0 0 0 0
0
CacheCleanupExecutor 0 0 0 0
0
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 0
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
=====================================================================
notetool compactionstats
pending tasks: 0
==============================================================================
debug.log Snippent
WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057 NoSpamLogger.java:94
- Out of 29 commit log syncs over the past 263.94s with average duration of
15.17ms, 2 have exceeded the configured commit interval by an average of 20.32ms
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015
AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a
fresh one
DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:1308 -
Flushing largest CFS(Keyspace='system', ColumnFamily='batches') to free up
room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this:
0.13/0.00
DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:918 -
Enqueuing flush of batches: 128.567MiB (13%) on-heap, 0.000KiB (0%) off-heap
DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685
Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB serialized
bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range = (null, null]
DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695
Memtable.java:485 - Completed flushing
/var/lib/cassandra/data/system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db
(0.000KiB) for commitlog position CommitLogPosition(segmentId=1536065319618,
position=7958044)
DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695
ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest
0.000KiB, smallest 8589934592.000GiB
DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710 MonitoringTask.java:173 - 2
operations were slow in the last 4999 msecs:
<SELECT * FROM ks1.abc WHERE key = 12345 LIMIT 5000>, time 575 msec - slow
timeout 500 msec/cross-node
<SELECT * FROM ks.xyz WHERE key = key1 LIMIT 5000>, time 645 msec - slow
timeout 500 msec/cross-node
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475
AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a
fresh one
DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675
AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a
fresh one
DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976 ColumnFamilyStore.java:1308 -
Flushing largest CFS(Keyspace='ks', ColumnFamily='xyz') to free up room. Used
total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00
DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977 ColumnFamilyStore.java:918 -
Enqueuing flush of xyz: 121.374MiB (12%) on-heap, 0.000KiB (0%) off-heap
Observation : frequent "operations were slow in the last " (for select
queries) and "WARN: commit log syncs over the past"
===================================================
notetool tablestats -H ks.xyz;
Total number of tables: 89
----------------
Keyspace : ks
Read Count: 1439722
Read Latency: 1.8982509581710914 ms
Write Count: 4222811
Write Latency: 0.016324778684151386 ms
Pending Flushes: 0
Table: xyz
SSTable count: 1036
SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0, 0]
Space used (live): 187.09 GiB
Space used (total): 187.09 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 783.93 MiB
SSTable Compression Ratio: 0.3238726404414842
Number of partitions (estimate): 447095605
Memtable cell count: 306194
Memtable data size: 20.59 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 7
Local read count: 1440322
Local read latency: 6.785 ms
Local write count: 1408204
Local write latency: 0.021 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 19
Bloom filter false ratio: 0.00003
Bloom filter space used: 418.2 MiB
Bloom filter off heap memory used: 418.19 MiB
Index summary off heap memory used: 307.75 MiB
Compression metadata off heap memory used: 57.99 MiB
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 1916
Compacted partition mean bytes: 1003
Average live cells per slice (last five minutes): 20.0
Maximum live cells per slice (last five minutes): 20
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0 bytes
--
regards,
Laxmikant Upadhyay