By default Cassandra is set to generate a heap dump on OOM. It can be a bit
tricky to figure out exactly what's going on, but it's the best evidence you
can work with.
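
On a stock install those flags come from conf/cassandra-env.sh; it should
contain something along these lines (paraphrased from memory, so verify
against your own env file):

    # dump the heap on OOM so there's evidence to analyze afterwards
    JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"

    # cassandra-env.sh also lets you choose where the .hprof lands; make
    # sure that disk has room for a dump the size of your heap
    if [ "x$CASSANDRA_HEAPDUMP_DIR" != "x" ]; then
        JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=$CASSANDRA_HEAPDUMP_DIR"
    fi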

On Tue, Aug 7, 2018 at 6:30 AM Laszlo Szabo <laszlo.viktor.sz...@gmail.com>
wrote:

> Hi,
>
> Thanks for the fast response!
>
> We are not using any materialized views, but there are several indexes.  I
> don't have a recent heap dump, and it will be about 24 hours before I can
> generate an interesting one, but most of the memory was allocated to byte
> buffers, so it's not entirely helpful.
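>
> (If it helps in the meantime, direct buffer usage can be watched without a
> heap dump: the JDK exposes it over JMX as the
> java.nio:type=BufferPool,name=direct MBean, and if Native Memory Tracking
> was enabled at startup with -XX:NativeMemoryTracking=summary it can also
> be polled from the shell, e.g.
>
>     jcmd <pid> VM.native_memory summary
>
> though I haven't tried either on our nodes yet.)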
>
> nodetool cfstats is also below.
>
> I also see a lot of flushing happening, but the flushes are so small that
> they don't seem to free any meaningful amount of memory.  Here are the
> messages I see:
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,459
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='gpsmessages') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,459 ColumnFamilyStore.java:915
>> - Enqueuing flush of gpsmessages: 0.000KiB (0%) on-heap, 0.014KiB (0%)
>> off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,460
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='user_history') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,461 ColumnFamilyStore.java:915
>> - Enqueuing flush of user_history: 0.000KiB (0%) on-heap, 0.011KiB (0%)
>> off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,465
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='tweets') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,465 ColumnFamilyStore.java:915
>> - Enqueuing flush of tweets: 0.000KiB (0%) on-heap, 0.188KiB (0%) off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='user_history') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470 ColumnFamilyStore.java:915
>> - Enqueuing flush of user_history: 0.000KiB (0%) on-heap, 0.024KiB (0%)
>> off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='tweets') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470 ColumnFamilyStore.java:915
>> - Enqueuing flush of tweets: 0.000KiB (0%) on-heap, 0.188KiB (0%) off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,472
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='gpsmessages') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,472 ColumnFamilyStore.java:915
>> - Enqueuing flush of gpsmessages: 0.000KiB (0%) on-heap, 0.013KiB (0%)
>> off-heap
>
> Stack traces from errors are below.
>
>
>> java.io.IOException: Broken pipe
>
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>> ~[na:1.8.0_181]
>
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>> ~[na:1.8.0_181]
>
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>> ~[na:1.8.0_181]
>
>         at sun.nio.ch.IOUtil.write(IOUtil.java:51) ~[na:1.8.0_181]
>
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>> ~[na:1.8.0_181]
>
>         at
>> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.doFlush(BufferedDataOutputStreamPlus.java:323)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.flush(BufferedDataOutputStreamPlus.java:331)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:409)
>> [apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380)
>> [apache-cassandra-3.11.1.jar:3.11.1]
>
>         at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
>
> ERROR [MutationStage-226] 2018-08-06 07:16:08,236
>> JVMStabilityInspector.java:142 - JVM state determined to be unstable.
>> Exiting forcefully due to:
>
> java.lang.OutOfMemoryError: Direct buffer memory
>
>         at java.nio.Bits.reserveMemory(Bits.java:694) ~[na:1.8.0_181]
>
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>> ~[na:1.8.0_181]
>
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>> ~[na:1.8.0_181]
>
>         at
>> org.apache.cassandra.utils.memory.SlabAllocator.getRegion(SlabAllocator.java:139)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:104)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:57)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.utils.memory.ContextAllocator.clone(ContextAllocator.java:47)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.utils.memory.MemtableBufferAllocator.clone(MemtableBufferAllocator.java:40)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at org.apache.cassandra.db.Memtable.put(Memtable.java:269)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1332)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:618)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:425)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:222)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:68)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> ~[na:1.8.0_181]
>
>         at
>> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
>> ~[apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
>> [apache-cassandra-3.11.1.jar:3.11.1]
>
>         at
>> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
>> [apache-cassandra-3.11.1.jar:3.11.1]
>
>         at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,459
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='gpsmessages') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,459 ColumnFamilyStore.java:915
>> - Enqueuing flush of gpsmessages: 0.000KiB (0%) on-heap, 0.014KiB (0%)
>> off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,460
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='user_history') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,461 ColumnFamilyStore.java:915
>> - Enqueuing flush of user_history: 0.000KiB (0%) on-heap, 0.011KiB (0%)
>> off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,465
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='tweets') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,465 ColumnFamilyStore.java:915
>> - Enqueuing flush of tweets: 0.000KiB (0%) on-heap, 0.188KiB (0%) off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='user_history') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470 ColumnFamilyStore.java:915
>> - Enqueuing flush of user_history: 0.000KiB (0%) on-heap, 0.024KiB (0%)
>> off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='tweets') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,470 ColumnFamilyStore.java:915
>> - Enqueuing flush of tweets: 0.000KiB (0%) on-heap, 0.188KiB (0%) off-heap
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,472
>> ColumnFamilyStore.java:1305 - Flushing largest CFS(Keyspace='userinfo',
>> ColumnFamily='gpsmessages') to free up room. Used total: 0.54/0.05, live:
>> 0.00/0.00, flushing: 0.40/0.04, this: 0.00/0.00
>
> DEBUG [SlabPoolCleaner] 2018-08-06 07:16:08,472 ColumnFamilyStore.java:915
>> - Enqueuing flush of gpsmessages: 0.000KiB (0%) on-heap, 0.013KiB (0%)
>> off-heap
>
>
> Total number of tables: 40
> ----------------
> Keyspace : userinfo
> Read Count: 143301
> Read Latency: 14.945587623254548 ms.
> Write Count: 2754603904
> Write Latency: 0.020883145284324698 ms.
> Pending Flushes: 0
> Table (index):
> gpsmessages.gpsmessages_addresscount_idx
> SSTable count: 9
> Space used (live): 19043463189
> Space used (total): 19043463189
> Space used by snapshots (total): 0
> Off heap memory used (total): 6259448
> SSTable Compression Ratio: 0.3704785164266614
> Number of partitions (estimate): 1025
> Memtable cell count: 309066
> Memtable data size: 13602774
> Memtable off heap memory used: 0
> Memtable switch count: 0
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 46025778
> Local write latency: 0.034 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 2504
> Bloom filter off heap memory used: 2432
> Index summary off heap memory used: 320
> Compression metadata off heap memory used: 6256696
> Compacted partition minimum bytes: 43
> Compacted partition maximum bytes: 44285675122
> Compacted partition mean bytes: 30405277
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
>
> Table (index):
> gpsmessages.gpsmessages_addresses_idx
> SSTable count: 18
> Space used (live): 409514565570
> Space used (total): 409514565570
> Space used by snapshots (total): 0
> Off heap memory used (total): 153405673
> SSTable Compression Ratio: 0.4447731157134059
> Number of partitions (estimate): 6013125
> Memtable cell count: 1110334
> Memtable data size: 67480140
> Memtable off heap memory used: 0
> Memtable switch count: 0
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 147639252
> Local write latency: 0.015 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 34175400
> Bloom filter off heap memory used: 34175256
> Index summary off heap memory used: 7432177
> Compression metadata off heap memory used: 111798240
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 322381140
> Compacted partition mean bytes: 36692
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
>
> Table (index):
> gpsmessages.addressreceivedtime_idx
> SSTable count: 10
> Space used (live): 52738155477
> Space used (total): 52738155477
> Space used by snapshots (total): 0
> Off heap memory used (total): 1909362628
> SSTable Compression Ratio: 0.4106961621795128
> Number of partitions (estimate): 1338730016
> Memtable cell count: 308990
> Memtable data size: 13410867
> Memtable off heap memory used: 0
> Memtable switch count: 0
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 46012614
> Local write latency: 0.012 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 1687550888
> Bloom filter off heap memory used: 1687550808
> Index summary off heap memory used: 213249180
> Compression metadata off heap memory used: 8562640
> Compacted partition minimum bytes: 36
> Compacted partition maximum bytes: 2759
> Compacted partition mean bytes: 54
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
>
> Table: gpsmessages
> SSTable count: 13
> Space used (live): 337974446627
> Space used (total): 337974446627
> Space used by snapshots (total): 0
> Off heap memory used (total): 77833540
> SSTable Compression Ratio: 0.5300637241381126
> Number of partitions (estimate): 22034
> Memtable cell count: 308904
> Memtable data size: 72074512
> Memtable off heap memory used: 0
> Memtable switch count: 110
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 45996652
> Local write latency: 0.281 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 67904
> Bloom filter off heap memory used: 67800
> Index summary off heap memory used: 11756
> Compression metadata off heap memory used: 77753984
> Compacted partition minimum bytes: 73
> Compacted partition maximum bytes: 1155149911
> Compacted partition mean bytes: 13158224
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 13699
>
> Table: user_history
> SSTable count: 17
> Space used (live): 116361158882
> Space used (total): 116361158882
> Space used by snapshots (total): 0
> Off heap memory used (total): 29562319
> SSTable Compression Ratio: 0.5683114352331539
> Number of partitions (estimate): 1337206
> Memtable cell count: 773277
> Memtable data size: 40623368
> Memtable off heap memory used: 0
> Memtable switch count: 57
> Local read count: 209
> Local read latency: NaN ms
> Local write count: 145853733
> Local write latency: 0.020 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 3844416
> Bloom filter off heap memory used: 3844280
> Index summary off heap memory used: 800991
> Compression metadata off heap memory used: 24917048
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 464228842
> Compacted partition mean bytes: 72182
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 66702
>
> Table: users
> SSTable count: 3
> Space used (live): 89945186
> Space used (total): 89945186
> Space used by snapshots (total): 0
> Off heap memory used (total): 2092053
> SSTable Compression Ratio: 0.5712127629253333
> Number of partitions (estimate): 1365645
> Memtable cell count: 3556
> Memtable data size: 150903
> Memtable off heap memory used: 0
> Memtable switch count: 42
> Local read count: 143087
> Local read latency: 6.094 ms
> Local write count: 250971
> Local write latency: 0.024 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 1709848
> Bloom filter off heap memory used: 1709824
> Index summary off heap memory used: 372125
> Compression metadata off heap memory used: 10104
> Compacted partition minimum bytes: 36
> Compacted partition maximum bytes: 310
> Compacted partition mean bytes: 66
> Average live cells per slice (last five minutes): 1.0
> Maximum live cells per slice (last five minutes): 1
> Average tombstones per slice (last five minutes): 1.0
> Maximum tombstones per slice (last five minutes): 1
> Dropped Mutations: 114
>
> Table: tweets
> SSTable count: 18
> Space used (live): 1809145656486
> Space used (total): 1809145656486
> Space used by snapshots (total): 0
> Off heap memory used (total): 435915908
> SSTable Compression Ratio: 0.5726200929451171
> Number of partitions (estimate): 26217889
> Memtable cell count: 710146
> Memtable data size: 31793929
> Memtable off heap memory used: 0
> Memtable switch count: 399
> Local read count: 5
> Local read latency: NaN ms
> Local write count: 2322829524
> Local write latency: 0.019 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 35019224
> Bloom filter off heap memory used: 35019080
> Index summary off heap memory used: 16454076
> Compression metadata off heap memory used: 384442752
> Compacted partition minimum bytes: 104
> Compacted partition maximum bytes: 3379391
> Compacted partition mean bytes: 124766
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 697696
>
> ----------------
>
>
> On Mon, Aug 6, 2018 at 8:57 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
>>
>>
>> Upgrading to 3.11.3 may fix it (there were some memory recycling bugs
>> fixed recently), but analyzing the heap will be the best option.
>>
>> If you can print out the heap histogram and stack trace, or open a heap
>> dump in YourKit or VisualVM or MAT and show us what's at the top of the
>> reclaimed objects, we may be able to figure out what's going on.
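>>
>> For the histogram, the stock JDK tools against the Cassandra pid are
>> enough; roughly (pid and dump path are placeholders):
>>
>>     jcmd <pid> GC.class_histogram | head -40
>>     # or the older equivalent:
>>     jmap -histo:live <pid> | head -40
>>     # and to grab a dump on demand instead of waiting for the OOM:
>>     jmap -dump:live,format=b,file=/tmp/cassandra.hprof <pid>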
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Aug 6, 2018, at 5:42 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> Are you using materialized views or secondary indices?
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Aug 6, 2018, at 3:49 PM, Laszlo Szabo <laszlo.viktor.sz...@gmail.com>
>> wrote:
>>
>> Hello All,
>>
>> I'm having JVM unstable / OOM errors when attempting to auto bootstrap a
>> 9th node to an existing 8-node cluster (256 tokens).  Each machine has 24
>> cores, 148GB RAM, and 10TB of disk (2TB used).  Under normal operation the 8 nodes
>> have JVM memory configured with Xms35G and Xmx35G, and handle 2-4 billion
>> inserts per day.  There are never updates, deletes, or sparsely populated
>> rows.
>>
>> For the bootstrap node, I've tried memory values from 35GB to 135GB in
>> 10GB increments. I've tried using both memtable_allocation_types
>> (heap_buffers and offheap_buffers).  I've not tried modifying the
>> memtable_cleanup_threshold but instead have tried memtable_flush_writers
>> from 2 to 8.  I've tried memtable_(off)heap_space_in_mb from 20000 to
>> 60000.  I've tried both CMS and G1 garbage collection with various
>> settings.
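>>
>> One knob I haven't touched yet is the direct allocation limit itself.  If
>> the eventual OOM turns out to be "Direct buffer memory" rather than Java
>> heap, my understanding is the direct cap defaults to roughly the heap
>> size when unset, and it can be set explicitly in jvm.options, e.g.
>> (untested):
>>
>>     -XX:MaxDirectMemorySize=64G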
>>
>> Typically, after streaming ~2TB of data, CPU load will hit a
>> maximum, and the "nodetool info" heap memory will, over the course of an
>> hour, approach the maximum.  At that point, CPU load will drop to a single
>> thread with minimal activity until the system becomes unstable and
>> eventually the OOM error occurs.
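>>
>> (The heap numbers above are from polling the joining node with, e.g.:
>>
>>     nodetool info | grep -i heap
>>
>> which reports used/max heap in MB.)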
>>
>> An excerpt of the system log is below, and what I consistently see is the
>> MemtableFlushWriter and the MemtableReclaimMemory pending queues grow as
>> the memory becomes depleted, but the number of completed seems to stop
>> changing a few minutes after the CPU load spikes.
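>>
>> The same pending/completed counters are visible without DEBUG logging by
>> polling the thread pools directly, e.g.:
>>
>>     watch -n 10 nodetool tpstats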
>>
>> One other data point: there seems to be a huge number of mutations that
>> arrive after most of the stream has completed.  concurrent_writes is set
>> at 256, with the queue getting as high as 200K before dropping down.
>>
>> Any suggestions for YAML or JVM changes?  jvm.options is currently the
>> default with the memory set to the max; the current YAML file is below.
>>
>> Thanks!
>>
>>
>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,329 StatusLogger.java:51 -
>>> MutationStage                     1         2      191498052         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,331 StatusLogger.java:51 -
>>> ViewMutationStage                 0         0              0         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,338 StatusLogger.java:51 -
>>> PerDiskMemtableFlushWriter_0         0         0           5865         0
>>>                0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,343 StatusLogger.java:51 -
>>> ReadStage                         0         0              0         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,347 StatusLogger.java:51 -
>>> ValidationExecutor                0         0              0         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,360 StatusLogger.java:51 -
>>> RequestResponseStage              0         0              8         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,380 StatusLogger.java:51 -
>>> Sampler                           0         0              0         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,382 StatusLogger.java:51 -
>>> MemtableFlushWriter               8     74293           4716         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,388 StatusLogger.java:51 -
>>> ReadRepairStage                   0         0              0         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,389 StatusLogger.java:51 -
>>> CounterMutationStage              0         0              0         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,404 StatusLogger.java:51 -
>>> MiscStage                         0         0              0         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,407 StatusLogger.java:51 -
>>> CompactionExecutor                8        13            493         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,410 StatusLogger.java:51 -
>>> InternalResponseStage             0         0             16         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,413 StatusLogger.java:51 -
>>> MemtableReclaimMemory             1      6066            356         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,421 StatusLogger.java:51 -
>>> AntiEntropyStage                  0         0              0         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,430 StatusLogger.java:51 -
>>> CacheCleanupExecutor              0         0              0         0
>>>            0
>>> INFO  [ScheduledTasks:1] 2018-08-06 17:49:26,431 StatusLogger.java:51 -
>>> PendingRangeCalculator            0         0              9         0
>>>            0
>>> INFO  [Service Thread] 2018-08-06 17:49:26,436 StatusLogger.java:61 -
>>> CompactionManager                 8        19
>>
>>
>>
>>
>>  Current YAML:
>>
>> num_tokens: 256
>>
>> hinted_handoff_enabled: true
>>
>> hinted_handoff_throttle_in_kb: 10240
>>
>> max_hints_delivery_threads: 8
>>
>> hints_flush_period_in_ms: 10000
>>
>> max_hints_file_size_in_mb: 128
>>
>> batchlog_replay_throttle_in_kb: 10240
>>
>> authenticator: AllowAllAuthenticator
>>
>> authorizer: AllowAllAuthorizer
>>
>> role_manager: CassandraRoleManager
>>
>> roles_validity_in_ms: 2000
>>
>> permissions_validity_in_ms: 2000
>>
>> credentials_validity_in_ms: 2000
>>
>> partitioner: org.apache.cassandra.dht.Murmur3Partitioner
>>
>> data_file_directories:
>>
>>     - /data/cassandra/data
>>
>> commitlog_directory: /data/cassandra/commitlog
>>
>> cdc_enabled: false
>>
>> disk_failure_policy: stop
>>
>> commit_failure_policy: stop
>>
>> prepared_statements_cache_size_mb:
>>
>> thrift_prepared_statements_cache_size_mb:
>>
>> key_cache_size_in_mb:
>>
>> key_cache_save_period: 14400
>>
>> row_cache_size_in_mb: 0
>>
>> row_cache_save_period: 0
>>
>> counter_cache_size_in_mb:
>>
>> counter_cache_save_period: 7200
>>
>> saved_caches_directory: /data/cassandra/saved_caches
>>
>> commitlog_sync: periodic
>>
>> commitlog_sync_period_in_ms: 10000
>>
>> commitlog_segment_size_in_mb: 32
>>
>> seed_provider:
>>
>>     - class_name: org.apache.cassandra.locator.SimpleSeedProvider
>>
>>       parameters:
>>
>>           - seeds: "10.1.1.11,10.1.1.12,10.1.1.13"
>>
>> concurrent_reads: 128
>>
>> concurrent_writes: 256
>>
>> concurrent_counter_writes: 96
>>
>> concurrent_materialized_view_writes: 32
>>
>> disk_optimization_strategy: spinning
>>
>> memtable_heap_space_in_mb: 61440
>>
>> memtable_offheap_space_in_mb: 61440
>>
>> memtable_allocation_type: heap_buffers
>>
>> commitlog_total_space_in_mb: 81920
>>
>> memtable_flush_writers: 8
>>
>>
>>
--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
