Could you decrease chunk_length_in_kb to 16 or 8 and repeat the test. On Wed, Sep 5, 2018, 5:51 AM [email protected] <[email protected]> wrote:
> How large is your row? You may meet reading wide row problem. > > -Simon > > *From:* Laxmikant Upadhyay <[email protected]> > *Date:* 2018-09-05 01:01 > *To:* user <[email protected]> > *Subject:* High IO and poor read performance on 3.11.2 cassandra cluster > We have 3 node cassandra cluster (3.11.2) in single dc. > > We have written 450 million records on the table with LCS. The write > latency is fine. After write we perform read and update operations. > > When we run read+update operations on newly inserted 1 million records (on > top of 450 m records) then the read latency and io usage is under control. > However when we perform read+update on old 1 million records which are part > of 450 million records we observe high read latency (The performance goes > down by 4 times in comparison 1st case ). We have not observed major gc > pauses. > > *system information:* > *cpu core :* 24 > *disc type : *ssd . we are using raid with deadline schedular > *disk space:* > df -h : > Filesystem Size Used Avail Use% Mounted on > /dev/sdb1 1.9T 393G 1.5T 22% /var/lib/cassandra > *memory:* > free -g > total used free shared buff/cache > available > Mem: 62 30 0 0 32 > 31 > Swap: 8 0 8 > > ========================================== > > *schema* > > desc table ks.xyz; > > CREATE TABLE ks.xyz ( > key text, > column1 text, > value text, > PRIMARY KEY (key, column1) > ) WITH COMPACT STORAGE > AND CLUSTERING ORDER BY (column1 ASC) > AND bloom_filter_fp_chance = 0.1 > AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} > AND comment = '' > AND compaction = {'class': > 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'} > AND compression = {'chunk_length_in_kb': '64', 'class': > 'org.apache.cassandra.io.compress.LZ4Compressor'} > AND crc_check_chance = 1.0 > AND dclocal_read_repair_chance = 0.0 > AND default_time_to_live = 0 > AND gc_grace_seconds = 864000 > AND max_index_interval = 2048 > AND memtable_flush_period_in_ms = 0 > AND min_index_interval = 128 > AND read_repair_chance = 0.0 > AND speculative_retry = '99PERCENTILE'; > > ============================================================================== > Below is some system stats snippet when read operations was running: > > *iotop -o * : Observation : the disk read goes up to 5.5 G/s > > Total DISK READ : *3.86 G/s* | Total DISK WRITE : 1252.88 K/s > Actual DISK READ: * 3.92 G/s* | Actual DISK WRITE: 0.00 B/s > TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND > 10715 be/4 cassandr 375.89 M/s 99.79 K/s 0.00 % 29.15 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10714 be/4 cassandr 358.56 M/s 107.18 K/s 0.00 % 27.06 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10712 be/4 cassandr 351.86 M/s 147.83 K/s 0.00 % 25.02 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10718 be/4 cassandr 359.82 M/s 110.87 K/s 0.00 % 24.49 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10711 be/4 cassandr 333.03 M/s 125.66 K/s 0.00 % 23.37 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10716 be/4 cassandr 330.80 M/s 103.48 K/s 0.00 % 23.02 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10717 be/4 cassandr 319.49 M/s 118.27 K/s 0.00 % 22.11 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10713 be/4 cassandr 300.62 M/s 118.27 K/s 0.00 % 21.65 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10719 be/4 cassandr 294.98 M/s 81.31 K/s 0.00 % 21.60 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10720 be/4 cassandr 289.00 M/s 73.92 K/s 0.00 % 21.45 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10742 be/4 cassandr 240.98 M/s 81.31 K/s 0.00 % 17.68 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10743 be/4 cassandr 224.43 M/s 36.96 K/s 0.00 % 17.57 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10744 be/4 cassandr 113.29 M/s 14.78 K/s 0.00 % 10.22 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > 10745 be/4 cassandr 61.63 M/s 33.26 K/s 0.00 % 4.20 % java > -Dorg.xerial.snappy.tempdir=/var/tmp > -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/ > > > ================================================================================================== > *iostats -x 5* > > avg-cpu: %user %nice %system %iowait %steal %idle > 2.35 0.00 7.09 *11.26 * 0.00 79.30 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz > avgqu-sz await r_await w_await svctm %util > sdb 205.60 0.00 5304.60 3.00 3707651.20 2450.40 > 1398.03 11.37 2.14 2.14 0.67 0.18 93.60 > > > avg-cpu: %user %nice %system %iowait %steal %idle > 2.99 0.00 7.49 *10.95* 0.00 78.56 > > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz > avgqu-sz await r_await w_await svctm %util > sdb 228.00 0.00 5639.40 4.80 3947922.40 1348.00 > 1399.41 12.65 2.24 2.25 0.46 0.17 94.00 > > > ================================================================================================== > *nodetool info* > ID : ddab003d-41d1-413e-ace2-87ccd34229d5 > Gossip active : true > Thrift active : false > Native Transport active: true > Load : 391.58 GiB > Generation No : 1536065356 > Uptime (seconds) : 4714 > Heap Memory (MB) : 5172.52 / 13312.00 > Off Heap Memory (MB) : 2288.28 > Data Center : DC1 > Rack : RAC1 > Exceptions : 0 > Key Cache : entries 1604008, size 167.04 MiB, capacity 256 > MiB, 3615 hits, 1801574 requests, 0.002 recent hit rate, 14400 save period > in seconds > Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 > hits, 0 requests, NaN recent hit rate, 0 save period in seconds > Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, > 0 requests, NaN recent hit rate, 7200 save period in seconds > Chunk Cache : entries 7680, size 480 MiB, capacity 480 MiB, > 1399440 misses, 2338352 requests, 0.402 recent hit rate, 2760.976 > microseconds miss latency > Percent Repaired : 1.6357850897532832E-8% > Token : (invoke with -T/--tokens to see all 256 tokens) > > > ============================================================================= > > *nodetool tablehistograms ks xyz : *Observation : not a wide row > ks/xyz histograms > Percentile SSTables Write Latency Read Latency Partition > Size Cell Count > (micros) (micros) (bytes) > 50% 1.00 17.08 5839.59 > 1109 17 > 75% 1.00 20.50 *7007.51 * > 1331 24 > 95% 2.00 29.52 *12108.97 * > 1331 24 > 98% 2.00 35.43 14530.76 > 1331 24 > 99% 2.00 42.51 17436.92 > 1331 24 > Min 1.00 5.72 105.78 > 150 2 > Max 2.00 315.85 464228.84 > 1916 35 > > *nodetool tablehistograms ks xyz* > ks/xyz histograms > Percentile SSTables Write Latency Read Latency Partition > Size Cell Count > (micros) (micros) (bytes) > 50% 1.00 17.08 5839.59 > 1109 17 > 75% 1.00 20.50 7007.51 > 1331 24 > 95% 2.00 29.52 12108.97 > 1331 24 > 98% 2.00 35.43 14530.76 > 1331 24 > 99% 2.00 42.51 17436.92 > 1331 24 > Min 1.00 4.77 105.78 > 150 2 > Max 2.00 315.85 464228.84 > 1916 35 > > > > ====================================================================== > > *nodetool tpstats* > > Pool Name Active Pending Completed > Blocked All time blocked > ReadStage 3 1 1603986 > 0 0 > MiscStage 0 0 0 > 0 0 > CompactionExecutor 0 0 6670 > 0 0 > MutationStage 0 0 5044371 > 0 0 > MemtableReclaimMemory 0 0 57 > 0 0 > PendingRangeCalculator 0 0 3 > 0 0 > GossipStage 0 0 14299 > 0 0 > SecondaryIndexManagement 0 0 0 > 0 0 > HintsDispatcher 0 0 0 > 0 0 > RequestResponseStage 0 1 3867575 > 0 0 > Native-Transport-Requests 3 0 2319802 > 0 0 > ReadRepairStage 0 0 13281 > 0 0 > CounterMutationStage 0 0 0 > 0 0 > MigrationStage 0 0 2 > 0 0 > MemtablePostFlush 0 0 65 > 0 0 > PerDiskMemtableFlushWriter_0 0 0 57 > 0 0 > ValidationExecutor 0 0 0 > 0 0 > Sampler 0 0 0 > 0 0 > MemtableFlushWriter 0 0 57 > 0 0 > InternalResponseStage 0 0 0 > 0 0 > ViewMutationStage 0 0 0 > 0 0 > AntiEntropyStage 0 0 0 > 0 0 > CacheCleanupExecutor 0 0 0 > 0 0 > > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > HINT 0 > MUTATION 0 > COUNTER_MUTATION 0 > BATCH_STORE 0 > BATCH_REMOVE 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > > ===================================================================== > > *notetool compactionstats* > pending tasks: 0 > > > ============================================================================== > *debug.log Snippent* > > > WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057 > NoSpamLogger.java:94 - Out of 29 commit log syncs over the past 263.94s > with average duration of 15.17ms, 2 have exceeded the configured commit > interval by an average of 20.32ms > DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015 > AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating > a fresh one > DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 > ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='system', > ColumnFamily='batches') to free up room. Used total: 0.33/0.00, live: > 0.33/0.00, flushing: 0.00/0.00, this: 0.13/0.00 > DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:918 > - Enqueuing flush of batches: 128.567MiB (13%) on-heap, 0.000KiB (0%) > off-heap > DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685 > Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB > serialized bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range = > (null, null] > DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695 > Memtable.java:485 - Completed flushing > /var/lib/cassandra/data/system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db > (0.000KiB) for commitlog position > CommitLogPosition(segmentId=1536065319618, position=7958044) > DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695 > ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest > 0.000KiB, smallest 8589934592.000GiB > DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710 MonitoringTask.java:173 - > 2 operations were slow in the last 4999 msecs: > <SELECT * FROM ks1.abc WHERE key = 12345 LIMIT 5000>, time 575 msec - slow > timeout 500 msec/cross-node > <SELECT * FROM ks.xyz WHERE key = key1 LIMIT 5000>, time 645 msec - slow > timeout 500 msec/cross-node > DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475 > AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating > a fresh one > DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675 > AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating > a fresh one > DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976 > ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='ks', > ColumnFamily='xyz') to free up room. Used total: 0.33/0.00, live: > 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00 > DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977 ColumnFamilyStore.java:918 > - Enqueuing flush of xyz: 121.374MiB (12%) on-heap, 0.000KiB (0%) off-heap > > *Observation :* frequent "operations were slow in the last " (for select > queries) and "WARN: commit log syncs over the past" > =================================================== > > *notetool tablestats -H ks.xyz <http://ks.xyz>;* > Total number of tables: 89 > ---------------- > Keyspace : ks > Read Count: 1439722 > Read Latency: 1.8982509581710914 ms > Write Count: 4222811 > Write Latency: 0.016324778684151386 ms > Pending Flushes: 0 > Table: xyz > SSTable count: 1036 > SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0, > 0] > Space used (live): 187.09 GiB > Space used (total): 187.09 GiB > Space used by snapshots (total): 0 bytes > Off heap memory used (total): 783.93 MiB > SSTable Compression Ratio: 0.3238726404414842 > Number of partitions (estimate): 447095605 > Memtable cell count: 306194 > Memtable data size: 20.59 MiB > Memtable off heap memory used: 0 bytes > Memtable switch count: 7 > Local read count: 1440322 > Local read latency: 6.785 ms > Local write count: 1408204 > Local write latency: 0.021 ms > Pending flushes: 0 > Percent repaired: 0.0 > Bloom filter false positives: 19 > Bloom filter false ratio: 0.00003 > Bloom filter space used: 418.2 MiB > Bloom filter off heap memory used: 418.19 MiB > Index summary off heap memory used: 307.75 MiB > Compression metadata off heap memory used: 57.99 MiB > Compacted partition minimum bytes: 150 > Compacted partition maximum bytes: 1916 > Compacted partition mean bytes: 1003 > Average live cells per slice (last five minutes): 20.0 > Maximum live cells per slice (last five minutes): 20 > Average tombstones per slice (last five minutes): 1.0 > Maximum tombstones per slice (last five minutes): 1 > Dropped Mutations: 0 bytes > > -- > > regards, > Laxmikant Upadhyay > >
