[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105217#comment-14105217 ]

Benedict commented on CASSANDRA-7743:
-------------------------------------

[~normanm] IMO the netty behaviour is surprising and likely to bite other projects as well. It can be worked around once you realise it is there, but only with careful code analysis: it is hard to be certain you aren't allocating and releasing on different threads. I think it would be useful for netty to log a warning if a new threadlocal memory pool is initialised on _returning_ a bytebuf, as this might well be indicative of pathological behaviour (you'd expect a thread to have allocated at least once before releasing if it is likely to allocate again). It might even be nice to explicitly define which threads are permitted to pool memory, so that accidental allocations cannot quietly build up pools on worker threads. That wasn't the problem for us here, but I could see us accidentally introducing a bug like that pretty easily in future.

> Possible C* OOM issue during long running test
> ----------------------------------------------
>
>                 Key: CASSANDRA-7743
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Google Compute Engine, n1-standard-1
>            Reporter: Pierre Laporte
>            Assignee: Benedict
>             Fix For: 2.1 rc6
>
>
> During a long running test, we ended up with a lot of
> "java.lang.OutOfMemoryError: Direct buffer memory" errors on the Cassandra
> instances.
> Here is an example of stacktrace from system.log :
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_25]
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25]
>         at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
> The test consisted of a 3-nodes cluster of n1-standard-1 GCE instances (1
> vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance
> running the test.
> After ~2.5 days, several requests start to fail and we see the previous
> stacktraces in the system.log file.
> The output from linux ‘free’ and ‘meminfo’ suggests that there is still memory
> available.
> {code}
> $ free -m
>              total       used       free     shared    buffers     cached
> Mem:          3702       3532        169          0        161        854
> -/+ buffers/cache:       2516       1185
> Swap:            0          0          0
> $ head -n 4 /proc/meminfo
> MemTotal:        3791292 kB
> MemFree:          173568 kB
> Buffers:          165608 kB
> Cached:           874752 kB
> {code}
> These errors do not affect all the queries we run. The cluster is still
> responsive but is unable to display tracing information using cqlsh :
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  10.240.98.27    925.17 KB  256     100.0%            41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
> UN  10.240.137.253  1.1 MB     256     100.0%            c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
> UN  10.240.72.183   896.57 KB  256
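Since `free` and `/proc/meminfo` only show OS-level memory, they cannot confirm or rule out this kind of OOM: "Direct buffer memory" errors are raised against the JVM-internal `-XX:MaxDirectMemorySize` limit. One way to see the JVM's own view of direct-buffer usage is the standard `BufferPoolMXBean` (a diagnostic sketch, not part of any patch on this ticket):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Prints the JVM's internal accounting for the "direct" and "mapped"
// buffer pools; the "direct" pool is what java.nio.Bits.reserveMemory
// checks before throwing "Direct buffer memory" OOMs.
public class DirectMemoryCheck {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

Watching the "direct" pool's `used` value climb toward the max over the 2.5-day run would have made the leak visible long before the first OOM.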
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105211#comment-14105211 ]

Norman Maurer commented on CASSANDRA-7743:
------------------------------------------

[~benedict] So no netty issue at all?
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100197#comment-14100197 ]

Benedict commented on CASSANDRA-7743:
-------------------------------------

Did you see the actual error, or have more info than meminfo? Because that is not at all conclusive by itself.
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100080#comment-14100080 ]

Kishan Karunaratne commented on CASSANDRA-7743:
-----------------------------------------------

I'm running rc5 + the patch, and the issue still shows up. I patched rc5 with the one file and ran "ant realclean jar" to compile. I hope this command didn't re-pull from git.
{code}
$ free -m
             total       used       free     shared    buffers     cached
Mem:          3702       2667       1035          0          1        144
-/+ buffers/cache:       2520       1181
Swap:            0          0          0
$ head -n 4 /proc/meminfo
MemTotal:        3791292 kB
MemFree:         1060548 kB
Buffers:            1280 kB
Cached:           148968 kB
{code}
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098977#comment-14098977 ]

T Jake Luciani commented on CASSANDRA-7743:
-------------------------------------------

Looks good. +1
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096651#comment-14096651 ]

Benedict commented on CASSANDRA-7743:
-------------------------------------

bq. well it will be released after a while if not used.

How long? It shouldn't ever be used, and it looks like it accumulates gigabytes in total over the course of a few days (around 16-32Mb per thread).

bq. just pass in 0 for "int tinyCacheSize, int smallCacheSize, int normalCacheSize".

Won't that obviate most of the benefit of the pooled buffers? I plan to simply prevent our deallocating on the other threads.
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096647#comment-14096647 ]

Norman Maurer commented on CASSANDRA-7743:
------------------------------------------

[~benedict] Well, it will be released after a while if not used. But I think for your use-case it would be best to disable the cache, which can be done via the PooledByteBufAllocator constructor: just pass in 0 for "int tinyCacheSize, int smallCacheSize, int normalCacheSize".
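Norman's suggestion might look roughly like the following against Netty 4.0.x (a sketch only; the arena counts are placeholder values, and the page size and max order shown are the 4.0.x defaults rather than anything prescribed on this ticket):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class NoCacheAllocatorSketch {
    public static void main(String[] args) {
        // Passing 0 for tinyCacheSize, smallCacheSize and normalCacheSize
        // disables the per-thread buffer caches, so a thread that only
        // releases buffers never builds up a cache of its own.
        PooledByteBufAllocator allocator = new PooledByteBufAllocator(
                true,    // preferDirect
                2, 2,    // nHeapArena, nDirectArena (placeholder values)
                8192,    // pageSize (4.0.x default)
                11,      // maxOrder (4.0.x default)
                0, 0, 0  // tiny/small/normal cache sizes: caches disabled
        );

        ByteBuf buf = allocator.directBuffer(256);
        try {
            // ... use buf ...
        } finally {
            // With caches disabled, releasing from a different thread than
            // the allocating one no longer strands memory in a thread-local cache.
            buf.release();
        }
    }
}
```

As Benedict notes in the follow-up, this trades away some of the benefit of pooling, which is why Cassandra instead chose to avoid deallocating on other threads.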
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096638#comment-14096638 ] Benedict commented on CASSANDRA-7743: - We're conflating two pools maybe :) I mean the "pool" of memory the thread can allocate from. So, to confirm I have this right: if you have two threads A and B, A only allocating and B only releasing, you would get memory accumulating up to the max pool size in B, and A always allocating new memory?
> Possible C* OOM issue during long running test
> --
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Google Compute Engine, n1-standard-1
> Reporter: Pierre Laporte
> Assignee: Benedict
> Fix For: 2.1.0
>
> During a long running test, we ended up with a lot of "java.lang.OutOfMemoryError: Direct buffer memory" errors on the Cassandra instances.
> Here is an example of stacktrace from system.log :
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_25]
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25]
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
> The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance running the test.
> After ~2.5 days, several requests start to fail and we see the previous stacktraces in the system.log file.
> The output from linux ‘free’ and ‘meminfo’ suggest that there is still memory available.
> {code}
> $ free -m
>              total       used       free     shared    buffers     cached
> Mem:          3702       3532        169          0        161        854
> -/+ buffers/cache:       2516       1185
> Swap:            0          0          0
> $ head -n 4 /proc/meminfo
> MemTotal:        3791292 kB
> MemFree:          173568 kB
> Buffers:          165608 kB
> Cached:           874752 kB
> {code}
> These errors do not affect all the queries we run. The cluster is still responsive but is unable to display tracing information using cqlsh :
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load  Tokens  Owns (effective)  Host ID
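The A/B scenario in the question above can be modelled with a toy per-thread cache (a stdlib-only sketch, not Netty's actual implementation): if release() always files the buffer under the *current* thread's thread-local cache, then a thread that only releases accumulates everything, while the allocating thread's cache never gets anything back and it keeps allocating fresh memory.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal model of a per-thread buffer cache with the behaviour being
// discussed: release() caches on the releasing thread, unconditionally.
public class CrossThreadCacheDemo {
    static final ThreadLocal<Queue<byte[]>> CACHE =
            ThreadLocal.withInitial(ArrayDeque::new);

    static byte[] allocate(int size) {
        byte[] cached = CACHE.get().poll();   // reuse from *this* thread's cache, if any
        return cached != null ? cached : new byte[size];
    }

    static void release(byte[] buf) {
        CACHE.get().add(buf);                 // cached on the *releasing* thread
    }

    // Runs `ops` allocate-on-A / release-on-B round trips and returns the
    // cache sizes afterwards: {allocatorCache, releaserCache}.
    static int[] run(int ops) throws InterruptedException {
        BlockingQueue<byte[]> handoff = new ArrayBlockingQueue<>(1);
        int[] sizes = new int[2];
        Thread a = new Thread(() -> {
            try {
                for (int i = 0; i < ops; i++) handoff.put(allocate(1024));
            } catch (InterruptedException e) { throw new RuntimeException(e); }
            sizes[0] = CACHE.get().size();
        });
        Thread b = new Thread(() -> {
            try {
                for (int i = 0; i < ops; i++) release(handoff.take());
            } catch (InterruptedException e) { throw new RuntimeException(e); }
            sizes[1] = CACHE.get().size();
        });
        a.start(); b.start(); a.join(); b.join();
        return sizes;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] sizes = run(1000);
        // A's cache never grows (its poll always misses, so it allocates fresh
        // memory every time); B's cache accumulates every released buffer.
        System.out.println("allocator cache: " + sizes[0]
                + ", releaser cache: " + sizes[1]);
        // → allocator cache: 0, releaser cache: 1000
    }
}
```

In the real allocator the releaser's cache is bounded by the pool limit, but the allocating side still sees a 100% miss rate, which is the pathology under discussion.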
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096634#comment-14096634 ] Norman Maurer commented on CASSANDRA-7743: -- [~benedict] Yeah, it adds to the cache of the "releasing" thread, that's right... I thought you were talking about returning it to the pool.
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096631#comment-14096631 ] Benedict commented on CASSANDRA-7743: - I haven't got to that stage yet; I'm just analysing the code right now. That's why I asked for your input - I was hoping you could disabuse me if I'm completely wrong. I don't 100% understand the control flow, as it doesn't make much sense (to me) to be adding it to a different cache. However, if you look in PooledByteBuf.deallocate(), it calls PoolArena.free() to release the memory, which in turn calls parent.threadCache.get().add() to cache its memory; obviously threadCache.get() is grabbing the threadlocal cache for the releasing thread, not the source PoolThreadCache. Also worth noting: I'm not convinced that, even if I'm correct, this fully explains the behaviour. We should only release on a different thread if an exception occurs during processing anyway, so I'm still digging for a more satisfactory full explanation.
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096625#comment-14096625 ] Norman Maurer commented on CASSANDRA-7743: -- [~benedict] hmm.. it should always get returned to the pool that it was allocated from. Could you provide me with an easy way to reproduce?
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096619#comment-14096619 ] Benedict commented on CASSANDRA-7743: - Hmm. So, looking at this a little more closely, I think this may effectively be a netty bug after all. It looks like no matter what pool/thread a pooled bytebuf is allocated on, it gets returned to the pool of the thread that _releases_ it. This means it simply accumulates indefinitely (up to the pool limit, which defaults to 32Mb) in the SEPWorkers, since they never themselves _allocate_, only release. [~norman] is that analysis correct? If so, it looks like this behaviour is somewhat unexpected and not ideal. However we can work around it for now.
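One shape the workaround can take (a stdlib-only sketch, not Netty API - the single-thread executor merely stands in for the owning event loop): instead of releasing on whichever worker finished with the buffer, hand the release back to the owning thread, so any cached memory stays on the thread that will actually reuse it.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Workaround pattern: route release() back to the buffer's owning thread so
// the worker's thread-local cache never grows.
public class ReleaseOnOwnerDemo {
    static final ThreadLocal<Queue<byte[]>> CACHE =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Returns {workerCacheSize, ownerCacheSize} after `ops` buffers are processed.
    static int[] run(int ops) throws InterruptedException {
        ExecutorService owner = Executors.newSingleThreadExecutor(); // stand-in event loop
        int[] sizes = new int[2];
        for (int i = 0; i < ops; i++) {
            byte[] buf = new byte[1024];   // a buffer the owner handed to this worker
            // Instead of CACHE.get().add(buf) here (which would pool it on the
            // worker thread), schedule the release on the owning thread:
            owner.execute(() -> CACHE.get().add(buf));
        }
        sizes[0] = CACHE.get().size();     // the worker itself cached nothing
        owner.execute(() -> sizes[1] = CACHE.get().size());
        owner.shutdown();
        owner.awaitTermination(5, TimeUnit.SECONDS);
        return sizes;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] s = run(100);
        System.out.println("worker cache: " + s[0] + ", owner cache: " + s[1]);
        // → worker cache: 0, owner cache: 100
    }
}
```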
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095780#comment-14095780 ] Benedict commented on CASSANDRA-7743: - It looks like the problem is caused by a number of changes in 2.1 composing to yield especially bad behaviour. We use pooled buffers in netty, but we also introduced an SEPWorker pool that has many threads (more than the number that actually service any single pool), and all threads may eventually service work on the netty executor side. This gives us ~130 threads periodically performing this work, and each of them apparently allocates a buffer at some point. These buffers are unfortunately allocated from a threadlocal pool, which starts at 16Mb, so each thread retains at least 16Mb of largely useless memory. The best fix will be to stop the SEPWorker tasks from allocating any buffers, but [~tjake] has pointed out we can also tweak some settings to mitigate the negative impact of this kind of problem as well. I'll look into a patch tomorrow.
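Those two figures compose badly on these nodes. A quick back-of-the-envelope check, using only the numbers from the comment above, shows why a 3.75 GB instance exhausts its direct memory:

```java
// Retained direct memory implied by the figures above (both are the
// comment's estimates, not measured values).
public class RetainedMemoryEstimate {
    public static void main(String[] args) {
        int threads = 130;   // ~SEPWorker + netty threads that each allocate at least once
        int poolMb  = 16;    // initial threadlocal pool size per thread
        int totalMb = threads * poolMb;
        System.out.println(totalMb + " MB retained");
        // → 2080 MB retained, on a node with ~3840 MB of RAM total
    }
}
```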
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095698#comment-14095698 ] Pierre Laporte commented on CASSANDRA-7743: --- [~tjake] Sure, I just started a new test with this option.
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095535#comment-14095535 ] T Jake Luciani commented on CASSANDRA-7743: --- Can we run this with -Dio.netty.leakDetectionLevel=PARANOID ?
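In Cassandra the flag can be passed through the JVM options assembled at startup. A minimal sketch, assuming the stock 2.1 `conf/cassandra-env.sh` layout where `JVM_OPTS` is built up before launch:

```shell
# Append to conf/cassandra-env.sh. PARANOID samples every buffer and
# records an allocation stack trace per ByteBuf, so it is slow --
# suitable for this kind of long-running leak hunt, not production.
JVM_OPTS="$JVM_OPTS -Dio.netty.leakDetectionLevel=PARANOID"
```

With this set, netty prints LEAK warnings (with the recorded allocation sites) when a pooled buffer is garbage-collected without having been released.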
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095524#comment-14095524 ] T Jake Luciani commented on CASSANDRA-7743: --- It sounds like the safest bet may be to not use the pooled allocator at all.
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095411#comment-14095411 ] Benedict commented on CASSANDRA-7743: --- No, but I don't think it's likely to be related, since the buffers would still be collected when unreferenced. We'd likely see LEAK DETECTOR warnings from netty, at which time the associated resources would also be freed, so we'd be somewhat unlikely to see the bug. No harm in trying, of course, but it sounds like it takes a few days to reproduce.
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095406#comment-14095406 ] Sylvain Lebresne commented on CASSANDRA-7743: --- Has this been tried/reproduced on the current 2.1 branch, notably post CASSANDRA-7735?
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094030#comment-14094030 ] Pierre Laporte commented on CASSANDRA-7743: --- Sure, I have uploaded one here: https://drive.google.com/file/d/0BxvGkaXP3ayeMDlRTWJ2MVhvT0E/edit?usp=sharing
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093976#comment-14093976 ] Benedict commented on CASSANDRA-7743: --- Could we get some heap dumps? Sounds to me like it's possibly a netty bug, or a ref-counting bug coupled with a leaked/held reference somewhere. We need to see where these ByteBuffer references are being retained and why.
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093965#comment-14093965 ] Pierre Laporte commented on CASSANDRA-7743: --- [~benedict] Actually, the nodes are running with memtable_allocation_type: heap_buffers. [~jbellis] The test failed on a bigger instance too. I just realized that setting -XX:MaxDirectMemorySize=-1 is useless, since it is the default value. Now I am doubting that -1 really means "unlimited"... Restarting a new test with -XX:MaxDirectMemorySize=1G to see if things change.
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093206#comment-14093206 ] Benedict commented on CASSANDRA-7743: --- Are you running with memtable_allocation_type: offheap_buffers? If so, switch to offheap_objects. If not, it's surprising to be hitting that limit with netty buffers, as we don't allocate them anywhere else. Either way, the fact that this is failing inside netty is surprising, since this is prior to the fix for CASSANDRA-7695, so we shouldn't in principle be allocating direct buffers with netty.
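For reference, the allocation mode Benedict is asking about is set in cassandra.yaml. A sketch of the relevant fragment for 2.1 (the explanatory comments are mine):

```yaml
# memtable_allocation_type governs where memtable cell data lives:
#   heap_buffers    - on-heap NIO buffers (what this cluster is running)
#   offheap_buffers - cell values moved into direct (off-heap) NIO buffers,
#                     which count against -XX:MaxDirectMemorySize
#   offheap_objects - cell values in native memory, bypassing NIO buffers
memtable_allocation_type: offheap_objects
```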
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093010#comment-14093010 ] Pierre Laporte commented on CASSANDRA-7743: --- [~enigmacurry] Eclipse MAT shows 300k instances of java.nio.ByteBuffer[], but they retain only ~26 MB. MAT only accounts for in-heap data, though. [~jbellis] OK, I am going to start two new tests: one on n1-standard-1 with -XX:MaxDirectMemorySize=-1, and another on n1-standard-2 without this setting.
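A quick sketch of why a heap analyzer reports so little here: the memory backing a direct ByteBuffer is allocated outside the Java heap, so a heap dump (which is all MAT sees) contains only the small wrapper object, not the buffer contents. This is an illustrative example, not code from the ticket; the class name is made up.

```java
import java.nio.ByteBuffer;

public class HeapVsDirect {
    public static void main(String[] args) {
        // The 1 MB backing this buffer lives off-heap, so a heap dump
        // (what Eclipse MAT analyzes) only contains the small
        // java.nio.DirectByteBuffer wrapper object, not the 1 MB itself.
        ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20);
        System.out.println(direct.isDirect());   // true
        System.out.println(direct.capacity());   // 1048576
    }
}
```

This is consistent with MAT reporting ~26 MB retained for 300k ByteBuffer[] instances while the process is actually exhausting the direct-memory limit.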
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092904#comment-14092904 ] Jonathan Ellis commented on CASSANDRA-7743: --- This means you need a larger MaxDirectMemorySize, but we've avoided allocateDirect in favor of Unsafe in the past, in part because of this problem. /cc [~benedict]
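The direct-memory pool bounded by -XX:MaxDirectMemorySize can be observed from inside the JVM via the standard BufferPoolMXBean, which makes it possible to watch usage approach the limit before the OutOfMemoryError fires. A minimal sketch (illustrative; the class and method names are made up, the MXBean API is standard since Java 7):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectMemoryCheck {
    // Returns the bytes currently used by the JVM's "direct" buffer pool,
    // or -1 if the pool is not found.
    static long directBytesUsed() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        // Allocate 1 MB off-heap; it is counted against MaxDirectMemorySize
        // and shows up in the "direct" pool, not in the Java heap.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        long after = directBytesUsed();
        System.out.println("direct bytes before: " + before + ", after: " + after);
    }
}
```

Polling this value (or the equivalent `java.nio:type=BufferPool,name=direct` JMX bean) during a long-running test would show whether direct usage grows without bound, as the OOM stacktraces here suggest.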
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092892#comment-14092892 ] Ryan McGuire commented on CASSANDRA-7743: - I'd recommend running [MAT|http://www.eclipse.org/mat/] on one of the core files to examine what exactly is eating up the RAM. Although I'm not sure whether it helps with "Direct buffer memory" errors, as I've only used it to debug things before we went off-heap.