[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-21 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105217#comment-14105217
 ] 

Benedict commented on CASSANDRA-7743:
-

[~normanm] IMO the netty behaviour is surprising and likely to bite other 
projects as well. It can be worked around once you realise it's there, but 
only with careful code analysis: it's hard to be certain you aren't 
allocating/releasing on other threads. It would be useful, I think, for netty 
to log a warning if a new threadlocal memory pool is initialised on 
_returning_ a bytebuf, as this might well be indicative of pathological 
behaviour (you'd expect a thread to have allocated at least once before 
releasing if it is likely to allocate again). It might even be nice to 
explicitly define which threads are permitted to pool memory, so that you 
cannot accidentally build up pools on worker threads through stray 
allocations without noticing. This wasn't a problem for us here, but I could 
see us accidentally introducing a bug like that pretty easily in future.
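To make the failure mode concrete, here is a minimal stdlib model (NOT netty's actual implementation) of a per-thread buffer cache that lazily initialises on release(). With thread A only allocating and thread B only releasing, B's cache accumulates buffers that are never reused, while A allocates fresh memory every time:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of a per-thread buffer cache that caches on release():
// a thread that only releases accumulates buffers it will never reuse,
// while the allocating thread keeps asking the system for new memory.
public class ThreadLocalPoolModel {
    static final AtomicLong freshAllocations = new AtomicLong();
    // Each thread gets its own free-list, created lazily on first use --
    // including first use via release(), which is the surprising part.
    static final ThreadLocal<ArrayDeque<ByteBuffer>> cache =
            ThreadLocal.withInitial(ArrayDeque::new);

    static ByteBuffer allocate(int size) {
        ByteBuffer cached = cache.get().poll();
        if (cached != null) return cached;
        freshAllocations.incrementAndGet();
        return ByteBuffer.allocate(size);   // fresh memory every time
    }

    static void release(ByteBuffer buf) {
        cache.get().push(buf);              // cached on the RELEASING thread
    }

    public static void main(String[] args) throws Exception {
        SynchronousQueue<ByteBuffer> handoff = new SynchronousQueue<>();
        Thread a = new Thread(() -> {       // thread A: allocate only
            try {
                for (int i = 0; i < 100; i++) handoff.put(allocate(4096));
            } catch (InterruptedException ignored) { }
        });
        Thread b = new Thread(() -> {       // thread B: release only
            try {
                for (int i = 0; i < 100; i++) release(handoff.take());
                System.out.println("B's dead cache: " + cache.get().size());
            } catch (InterruptedException ignored) { }
        });
        a.start(); b.start(); a.join(); b.join();
        System.out.println("fresh allocations: " + freshAllocations.get());
    }
}
```

Every one of the 100 buffers ends up stranded in B's cache and none of A's allocations are ever served from a cache, which is the accumulation pattern described above, scaled down.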

> Possible C* OOM issue during long running test
> --
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Google Compute Engine, n1-standard-1
>Reporter: Pierre Laporte
>Assignee: Benedict
> Fix For: 2.1 rc6
>
>
> During a long running test, we ended up with a lot of 
> "java.lang.OutOfMemoryError: Direct buffer memory" errors on the Cassandra 
> instances.
> Here is an example stack trace from system.log:
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - 
> Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
> at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) 
> ~[na:1.7.0_25]
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) 
> ~[na:1.7.0_25]
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
> The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 
> vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and an n1-standard-2 instance 
> running the test.
> After ~2.5 days, several requests start to fail and we see the previous 
> stack traces in the system.log file.
> The output from Linux {{free}} and {{meminfo}} suggests that there is still 
> memory available.
> {code}
> $ free -m
>              total       used       free     shared    buffers     cached
> Mem:          3702       3532        169          0        161        854
> -/+ buffers/cache:       2516       1185
> Swap:            0          0          0
> $ head -n 4 /proc/meminfo
> MemTotal:        3791292 kB
> MemFree:          173568 kB
> Buffers:          165608 kB
> Cached:           874752 kB
> {code}
> These errors do not affect all the queries we run. The cluster is still 
> responsive but is unable to display tracing information using cqlsh:
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  10.240.98.279   25.17 KB   256     100.0%            41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
> UN  10.240.137.253  1.1 MB     256     100.0%            c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
> UN  10.240.72.183   896.57 KB  256
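A note on the symptom above: {{free}} and {{meminfo}} show OS memory still available because the "Direct buffer memory" OOM is exhaustion of the JVM's own direct-buffer quota (-XX:MaxDirectMemorySize, which defaults to roughly the max heap size), not of system RAM. That quota can be inspected at runtime with a small stdlib probe:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Reports the JVM's NIO buffer pools. Typically two are exposed:
// "direct" (allocateDirect / netty direct buffers) and "mapped"
// (memory-mapped files). When "direct" usage approaches the quota,
// further allocateDirect calls fail exactly as in the stack trace above.
public class DirectMemoryProbe {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

Watching the "direct" line grow over days while OS-level numbers stay healthy would distinguish this leak from ordinary memory pressure.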

[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-21 Thread Norman Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105211#comment-14105211
 ] 

Norman Maurer commented on CASSANDRA-7743:
--

[~benedict] so no netty issue at all?


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100197#comment-14100197
 ] 

Benedict commented on CASSANDRA-7743:
-

Did you see the actual error, or have more info than meminfo? Because that is 
not at all conclusive by itself.


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-17 Thread Kishan Karunaratne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100080#comment-14100080
 ] 

Kishan Karunaratne commented on CASSANDRA-7743:
---

I'm running rc5 + the patch, and the issue still shows up. 
I patched rc5 with the one file, and ran "ant realclean jar" to compile. I hope 
this command didn't re-pull from git.

$ free -m
             total       used       free     shared    buffers     cached
Mem:          3702       2667       1035          0          1        144
-/+ buffers/cache:       2520       1181
Swap:            0          0          0

$ head -n 4 /proc/meminfo
MemTotal:3791292 kB
MemFree: 1060548 kB
Buffers:1280 kB
Cached:   148968 kB


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-15 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098977#comment-14098977
 ] 

T Jake Luciani commented on CASSANDRA-7743:
---

Looks good +1


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096651#comment-14096651
 ] 

Benedict commented on CASSANDRA-7743:
-

bq. well it will be released after a while if not used.

How long? It shouldn't ever be used, and it looks like it accumulates gigabytes 
in total over the course of a few days (around 16-32MB per thread).

bq. just pass in 0 for "int tinyCacheSize, int smallCacheSize, int 
normalCacheSize".

Won't that obviate most of the benefit of the pooled buffers? 

I plan to simply prevent our deallocating on the other threads.
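That workaround can be sketched with the stdlib (this is a model of the idea, not Cassandra's actual patch): rather than calling release() on whatever worker thread finishes with the buffer, hand it back to the thread that allocated it, so per-thread caches only ever grow on threads that also allocate.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Model of "never release on a foreign thread": the worker schedules the
// release back onto the owning (allocating) thread's executor, so any
// thread-local recycling happens where the allocation happened.
public class ReleaseOnOwnerThread {
    public static void main(String[] args) throws Exception {
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        AtomicReference<String> releasedOn = new AtomicReference<>();

        // In the real code this task would call buf.release(); here we
        // just record which thread the release actually ran on.
        Runnable release = () ->
                releasedOn.set(Thread.currentThread().getName());

        // The worker finishes with the buffer but does NOT release it
        // directly; it ships the release back to the owning thread.
        Thread worker = new Thread(() -> ioThread.execute(release), "worker");
        worker.start();
        worker.join();

        ioThread.shutdown();
        ioThread.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("released on: " + releasedOn.get());
    }
}
```

The release runs on the executor's thread, never on "worker", which is exactly the invariant that keeps release-only threads from ever initialising a pool.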


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Norman Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096647#comment-14096647
 ] 

Norman Maurer commented on CASSANDRA-7743:
--

[~benedict] well, it will be released after a while if not used. But I think for 
your use-case it would be best to disable the cache, which can be done via the 
PooledByteBufAllocator constructor: just pass in 0 for "int tinyCacheSize, int 
smallCacheSize, int normalCacheSize".
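For reference, a sketch of that construction against the Netty 4.0 API (assumes netty-all on the classpath; the arena counts here are illustrative, not tuned values, and the page size / maxOrder mirror Netty 4.0's documented defaults):

```java
import io.netty.buffer.PooledByteBufAllocator;

public class NoCacheAllocator {
    // Zeroing the three cache-size parameters disables the per-thread
    // caches Norman refers to, while keeping arena-based pooling.
    public static PooledByteBufAllocator create() {
        return new PooledByteBufAllocator(
                true,   // preferDirect
                2,      // nHeapArena   (illustrative)
                2,      // nDirectArena (illustrative)
                8192,   // pageSize  -- Netty 4.0 default
                11,     // maxOrder  -- Netty 4.0 default (16 MiB chunks)
                0,      // tinyCacheSize   <- disabled
                0,      // smallCacheSize  <- disabled
                0);     // normalCacheSize <- disabled
    }
}
```

As Benedict notes in the follow-up, this trades away the per-thread cache's allocation speed-up in exchange for never stranding cached memory on release-only threads.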


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096638#comment-14096638
 ] 

Benedict commented on CASSANDRA-7743:
-------------------------------------

We're conflating two pools, maybe :)

I mean the "pool" of memory the thread can allocate from. So, to confirm I
have this right: with two threads A and B, where A only allocates and B only
releases, memory would accumulate in B up to the max pool size, while A keeps
allocating new memory?
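A minimal, stdlib-only sketch of the pattern being asked about (the
`ThreadLocalCache`-style stand-in below is hypothetical and is *not* netty
code; it only mimics the suspected per-thread `PoolThreadCache` behaviour):
thread A only allocates, thread B only releases, and every freed buffer lands
in B's thread-local cache, so A never sees it again.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class CrossThreadRelease {
    // Stand-in for a per-thread buffer cache: free() caches the buffer on
    // whichever thread calls it, mirroring the suspected
    // PoolArena.free() -> parent.threadCache.get().add() path.
    static final ThreadLocal<Deque<byte[]>> CACHE =
            ThreadLocal.withInitial(ArrayDeque::new);

    static byte[] allocate(int size) {
        Deque<byte[]> cache = CACHE.get();
        // A thread can only reuse buffers from its *own* cache.
        return cache.isEmpty() ? new byte[size] : cache.pop();
    }

    static void free(byte[] buf) {
        // The releasing thread's cache receives the buffer.
        CACHE.get().push(buf);
    }

    /** Returns how many buffers ended up cached on the release-only thread. */
    public static int simulate(int iterations) throws InterruptedException {
        BlockingQueue<byte[]> handoff = new LinkedBlockingQueue<>();
        AtomicInteger cachedOnReleaser = new AtomicInteger();
        Thread releaser = new Thread(() -> {
            try {
                for (int i = 0; i < iterations; i++)
                    free(handoff.take());              // B: only ever releases
            } catch (InterruptedException ignored) { }
            cachedOnReleaser.set(CACHE.get().size());  // read B's cache on B
        });
        releaser.start();
        for (int i = 0; i < iterations; i++)
            handoff.put(allocate(1024));               // A: only ever allocates
        releaser.join();
        // A's own cache never grew, so every allocation was fresh memory.
        return cachedOnReleaser.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(simulate(100)); // prints 100: all buffers stuck on B
    }
}
```

Under this model, B's cache grows without ever being drained by A, which is
exactly the accumulation the question describes.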


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Norman Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096634#comment-14096634
 ] 

Norman Maurer commented on CASSANDRA-7743:
------------------------------------------

[~benedict] Yeah, it adds to the cache of the "releasing" thread, that is
right. I thought you were talking about returning it to the pool.


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096631#comment-14096631
 ] 

Benedict commented on CASSANDRA-7743:
-------------------------------------

I haven't got to that stage yet; I'm just analysing the code right now. That's
why I asked for your input: I was hoping you could disabuse me if I'm
completely wrong. I don't 100% understand the control flow, as it doesn't make
much sense (to me) to be adding it to a different cache. However, if you look
at PooledByteBuf.deallocate(), it calls PoolArena.free() to release the
memory, which in turn calls parent.threadCache.get().add() to cache it;
threadCache.get() there grabs the threadlocal cache of the thread doing the
releasing, not the source PoolThreadCache.

Also worth noting: I'm not convinced that, even if I'm correct, this fully
explains the behaviour. We should only release on a different thread when an
exception occurs during processing, so I'm still digging for a more
satisfactory full explanation.


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Norman Maurer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096625#comment-14096625
 ] 

Norman Maurer commented on CASSANDRA-7743:
------------------------------------------

[~benedict] Hmm, it should always get returned to the pool that it was
allocated from. Could you provide me with an easy way to reproduce this?


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096619#comment-14096619
 ] 

Benedict commented on CASSANDRA-7743:
-------------------------------------

Hmm. Looking at this a little more closely, I think this may effectively be a
netty bug after all. It looks like no matter what pool/thread a pooled bytebuf
is allocated on, it gets returned to the pool of the thread that _releases_
it. This means it simply accumulates indefinitely (up to the pool limit, which
defaults to 32 MB) in the SEPWorkers, since they never themselves _allocate_,
only release.

[~norman] is that analysis correct? If so, this behaviour is somewhat
unexpected and not ideal. However, we can work around it for now.
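A sketch of why a release-only thread accumulates memory up to the pool limit
and no further, assuming the 32 MB per-thread cap quoted in this comment (the
class and the cap constant are illustrative stand-ins, not netty internals):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedThreadCache {
    // Hypothetical stand-in for the per-thread pool limit mentioned above;
    // the 32 MB figure is taken from the comment, not read out of netty.
    static final long MAX_CACHE_BYTES = 32L * 1024 * 1024;

    final Deque<byte[]> cache = new ArrayDeque<>();
    long cachedBytes;       // retained by this thread's cache
    long trulyFreedBytes;   // released for real once the cap is hit

    void release(byte[] buf) {
        if (cachedBytes + buf.length <= MAX_CACHE_BYTES) {
            cache.push(buf);                // retained: never returns to the allocator
            cachedBytes += buf.length;
        } else {
            trulyFreedBytes += buf.length;  // past the cap, memory is actually freed
        }
    }
}
```

On a release-only SEPWorker this cache fills to the cap and stays there;
multiply by the number of such threads to estimate the total direct memory
retained across the process.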


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095780#comment-14095780
 ] 

Benedict commented on CASSANDRA-7743:
-------------------------------------

It looks like the problem is caused by a number of changes in 2.1 composing to
yield especially bad behaviour. We use pooled buffers in netty, but we also
introduced an SEPWorker pool that has many threads (more than the number that
actually service any single pool), and all of these threads may eventually
service work on the netty executor side. This gives us ~130 threads
periodically performing this work, and each of them apparently allocates a
buffer at some point. These buffers are unfortunately allocated from a
threadlocal pool, which starts at 16 MB, so each thread retains at least 16 MB
of largely useless memory.

The best fix will be to stop the SEPWorker tasks from allocating any buffers,
but [~tjake] has pointed out we can also tweak some settings to mitigate the
negative impact of this kind of problem.

I'll look into a patch tomorrow.
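Rough arithmetic for the figures in this comment (the thread count and
per-thread pool size are taken from the text above; the sketch just multiplies
them out): roughly 2 GB of direct buffer memory retained on a 3.75 GB node
would explain the OOM even while `free` still reports available memory.

```java
public class DirectMemoryEstimate {
    /** threads * per-thread pool size, in bytes. */
    public static long retainedBytes(int threads, long perThreadPoolBytes) {
        return threads * perThreadPoolBytes;
    }

    public static void main(String[] args) {
        // ~130 SEPWorker-side threads, each retaining a 16 MB starting pool:
        long total = retainedBytes(130, 16L * 1024 * 1024);
        System.out.println(total / (1024 * 1024) + " MiB"); // prints "2080 MiB"
    }
}
```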


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Pierre Laporte (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095698#comment-14095698
 ] 

Pierre Laporte commented on CASSANDRA-7743:
-------------------------------------------

[~tjake] Sure, I just started a new test with this option

> Possible C* OOM issue during long running test
> --
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Google Compute Engine, n1-standard-1
>Reporter: Pierre Laporte
> Fix For: 2.1.0
>
>
> During a long running test, we ended up with a lot of 
> "java.lang.OutOfMemoryError: Direct buffer memory" errors on the Cassandra 
> instances.
> Here is an example of stacktrace from system.log :
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - 
> Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.7.0_25]
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) 
> ~[na:1.7.0_25]
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
> The test consisted of a 3-nodes cluster of n1-standard-1 GCE instances (1 
> vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance 
> running the test.
> After ~2.5 days, several requests start to fail and we see the previous 
> stacktraces in the system.log file.
> The output from linux ‘free’ and ‘meminfo’ suggest that there is still memory 
> available.
> {code}
> $ free -m
> total  used   free sharedbuffers cached
> Mem:  3702   3532169  0161854
> -/+ buffers/cache:   2516   1185
> Swap:0  0  0
> $ head -n 4 /proc/meminfo
> MemTotal:3791292 kB
> MemFree:  173568 kB
> Buffers:  165608 kB
> Cached:   874752 kB
> {code}
> These errors do not affect all the queries we run. The cluster is still 
> responsive but is unable to display tracing information using cqlsh :
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address Load   Tokens  Owns (effective)  Host ID  
>  Rack
> UN  10.240.98.27925.17 KB  256 100.0%
> 41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
> UN  10.240.137.253  1.1 MB 256 100.0%
> c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
> UN  10.240.72.183   896
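The release-side pool initialization described in the comment above can be illustrated with a toy model (a deliberately simplified stand-in, not netty's actual PoolThreadCache code): the per-thread pool is created lazily on first touch, and releasing touches it, so a thread that only ever *frees* buffers still acquires a pool of its own.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Toy thread-local buffer pool (NOT netty's implementation) illustrating the
// hazard: the pool is created lazily on first touch, and release() touches it,
// so a thread that only frees buffers still ends up owning a pool whose
// contents it may never reuse.
public class ThreadLocalPoolDemo {
    static final AtomicInteger POOLS_CREATED = new AtomicInteger();
    static final ThreadLocal<ArrayDeque<byte[]>> POOL = ThreadLocal.withInitial(() -> {
        POOLS_CREATED.incrementAndGet();          // count every per-thread pool
        return new ArrayDeque<>();
    });

    static byte[] allocate() {
        byte[] buf = POOL.get().poll();           // reuse from this thread's pool
        return buf != null ? buf : new byte[4096];
    }

    static void release(byte[] buf) {
        POOL.get().offer(buf);                    // pools on the RELEASING thread
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] buf = allocate();                  // IO thread allocates: pool #1
        Thread worker = new Thread(() -> release(buf));
        worker.start();                           // worker only frees: pool #2 appears
        worker.join();
        // Two pools now exist although only one thread ever allocates; the
        // worker's pooled buffer is stranded unless that worker allocates later.
        System.out.println("pools created: " + POOLS_CREATED.get());
    }
}
```

Run standalone this prints `pools created: 2` — the analogue of quietly initializing a full threadlocal memory pool on a worker thread at release time.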

[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095535#comment-14095535
 ] 

T Jake Luciani commented on CASSANDRA-7743:
---

Can we run this with -Dio.netty.leakDetectionLevel=PARANOID?
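One way to wire that in (a sketch, assuming the usual conf/cassandra-env.sh layout; the property is netty's standard leak-detection switch):

```shell
# Sketch for conf/cassandra-env.sh (path assumed): turn on netty's buffer leak
# detector. PARANOID samples every allocation and is expensive, so use it only
# for the reproduction run, not in production.
JVM_OPTS="$JVM_OPTS -Dio.netty.leakDetectionLevel=PARANOID"
```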


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095524#comment-14095524
 ] 

T Jake Luciani commented on CASSANDRA-7743:
---

It sounds like the safest bet may be not to use the pooled allocator at all.


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095411#comment-14095411
 ] 

Benedict commented on CASSANDRA-7743:
-

No, but I don't think it's likely to be related: the buffers would still be 
collected once unreferenced, so we'd likely see LEAK DETECTOR warnings from 
netty, at which point the associated resources would also be freed. That makes 
it somewhat unlikely we'd see this bug.

No harm in trying, of course, but it sounds like it takes a few days to 
reproduce.


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-13 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095406#comment-14095406
 ] 

Sylvain Lebresne commented on CASSANDRA-7743:
-

Has this been tried/reproduced on the current 2.1 branch, notably post 
CASSANDRA-7735?


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-12 Thread Pierre Laporte (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094030#comment-14094030
 ] 

Pierre Laporte commented on CASSANDRA-7743:
---

Sure, I have uploaded one here : 
https://drive.google.com/file/d/0BxvGkaXP3ayeMDlRTWJ2MVhvT0E/edit?usp=sharing


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093976#comment-14093976
 ] 

Benedict commented on CASSANDRA-7743:
-

Could we get some heap dumps? Sounds to me like it's possibly a netty bug, or a 
ref counting bug coupled with a leaked/held reference somewhere. We need to see 
where these ByteBuffer references are being retained and why.


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-12 Thread Pierre Laporte (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093965#comment-14093965
 ] 

Pierre Laporte commented on CASSANDRA-7743:
---

[~benedict] Actually, the nodes are running with memtable_allocation_type: 
heap_buffers.

[~jbellis] The test failed on a bigger instance too.  I just realized that 
setting -XX:MaxDirectMemorySize=-1 is useless, since that is the default value.  
Now I doubt that -1 really means "unlimited"...  Restarting a new test with 
-XX:MaxDirectMemorySize=1G to see if things change.
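Independent of the flag experiments, the JDK exposes the direct-buffer pool that -XX:MaxDirectMemorySize caps via standard JMX, which is handy for watching usage grow during the test. A minimal sketch (plain JDK 7+ APIs, nothing Cassandra-specific):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Print the NIO buffer pools' stats; the "direct" pool is what
// -XX:MaxDirectMemorySize limits and what the OOM above exhausted.
public class DirectMemoryProbe {
    public static void main(String[] args) {
        // Keep a reference so the 1 MiB buffer is live while we measure.
        ByteBuffer keep = ByteBuffer.allocateDirect(1 << 20);
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
        assert keep.isDirect();
    }
}
```

Sampling this periodically (or via nodetool's JMX connection) would show whether direct usage climbs monotonically over the 2.5 days.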

> Possible C* OOM issue during long running test
> --
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Google Compute Engine, n1-standard-1
>Reporter: Pierre Laporte
>
> During a long running test, we ended up with a lot of 
> "java.lang.OutOfMemoryError: Direct buffer memory" errors on the Cassandra 
> instances.
> Here is an example of stacktrace from system.log :
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - 
> Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) 
> ~[na:1.7.0_25]
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) 
> ~[na:1.7.0_25]
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) 
> ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>  ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
> at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
> The test consisted of a 3-nodes cluster of n1-standard-1 GCE instances (1 
> vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance 
> running the test.
> After ~2.5 days, several requests start to fail and we see the previous 
> stacktraces in the system.log file.
> The output from linux ‘free’ and ‘meminfo’ suggest that there is still memory 
> available.
> {code}
> $ free -m
> total  used   free sharedbuffers cached
> Mem:  3702   3532169  0161854
> -/+ buffers/cache:   2516   1185
> Swap:0  0  0
> $ head -n 4 /proc/meminfo
> MemTotal:3791292 kB
> MemFree:  173568 kB
> Buffers:  165608 kB
> Cached:   874752 kB
> {code}
> These errors do not affect all the queries we run. The cluster is still 
> responsive but is unable to display tracing information using cqlsh:
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address Load   Tokens  Owns (effective

[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093206#comment-14093206
 ] 

Benedict commented on CASSANDRA-7743:
-

Are you running with memtable_allocation_type: offheap_buffers? If so, switch 
to offheap_objects.

If not, it's surprising to be hitting that limit with netty buffers, as we 
don't allocate direct buffers anywhere else. Either way, the fact that this is 
failing inside netty is surprising: this is prior to the fix for 
CASSANDRA-7695, so in principle we shouldn't be allocating direct buffers with 
netty at all.
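The thread-local pooling pathology discussed on this ticket can be sketched in plain Java. This is a hypothetical, simplified stand-in (not netty or Cassandra code; all names are illustrative): a per-thread buffer cache, as in netty's PooledByteBufAllocator, silently creates a brand-new pool the first time *any* thread touches it, including a thread that only ever releases buffers allocated elsewhere, so memory migrates into pools that may never be drained.

```java
import java.util.ArrayDeque;

// Illustrative sketch only: a ThreadLocal buffer cache where release()
// caches on the RELEASING thread's pool, mimicking the surprising netty
// behaviour described on this ticket.
public class ThreadLocalPoolDemo {
    static final ThreadLocal<ArrayDeque<byte[]>> CACHE =
            ThreadLocal.withInitial(ArrayDeque::new);   // one pool per thread

    static byte[] allocate(int size) {
        byte[] buf = CACHE.get().poll();                // reuse if cached here
        return (buf != null && buf.length >= size) ? buf : new byte[size];
    }

    static void release(byte[] buf) {
        CACHE.get().push(buf);   // pools on whichever thread calls release()
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] buf = allocate(1024);                     // main's pool stays empty
        Thread worker = new Thread(() -> release(buf));  // cached on the worker
        worker.start();
        worker.join();
        // The bytes now live in the worker's pool; if that thread never
        // allocates again, the memory is effectively leaked.
        System.out.println("main cache size = " + CACHE.get().size());
    }
}
```

Run on a single thread the cache behaves as intended; the leak only appears when allocation and release happen on different threads, which is why it is hard to spot without careful code analysis.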


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-11 Thread Pierre Laporte (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093010#comment-14093010
 ] 

Pierre Laporte commented on CASSANDRA-7743:
---

[~enigmacurry] Eclipse MAT shows 300k instances of java.nio.ByteBuffer[], but 
they retain only ~26 MB; MAT only accounts for in-heap data.

[~jbellis] OK, I am going to start two new tests: one on n1-standard-1 with 
-XX:MaxDirectMemorySize=-1 and another one on n1-standard-2 without this 
setting.
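Since MAT only sees the on-heap ByteBuffer shells, the actual off-heap bytes can be watched from inside the JVM via the standard `BufferPoolMXBean` (a platform JMX bean, not anything Cassandra-specific). A minimal sketch of reading the "direct" pool, which tracks exactly the budget that `Bits.reserveMemory` is exhausting in the stack trace above:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Sketch: report how much memory the JVM's tracked direct-buffer pool is
// using. The "direct" pool counts allocateDirect allocations against
// -XX:MaxDirectMemorySize; "mapped" covers memory-mapped files.
public class DirectPoolStats {
    public static void main(String[] args) {
        java.nio.ByteBuffer.allocateDirect(1 << 20); // make the pool non-empty
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.println("direct buffers: count=" + pool.getCount()
                        + " used=" + pool.getMemoryUsed() + " bytes");
            }
        }
    }
}
```

The same bean is exposed over remote JMX (`java.nio:type=BufferPool,name=direct`), so it can be sampled on a running node during a long test without a heap dump.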


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092904#comment-14092904
 ] 

Jonathan Ellis commented on CASSANDRA-7743:
---

This means you need a larger MaxDirectMemorySize, but we've avoided 
allocateDirect in favor of Unsafe in the past, in part because of this problem. 
/cc [~benedict]
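The distinction raised here can be shown with a small sketch (illustrative only, not Cassandra's allocator code): `ByteBuffer.allocateDirect` is charged against `-XX:MaxDirectMemorySize` and throws the "Direct buffer memory" OOM seen above once the budget is exhausted, whereas `sun.misc.Unsafe.allocateMemory` is essentially raw malloc and bypasses that accounting entirely, at the cost of manual freeing.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: the two off-heap allocation routes discussed on this ticket.
public class UnsafeVsDirect {
    public static void main(String[] args) throws Exception {
        // Obtain the Unsafe singleton via reflection (JDK-internal API).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long addr = unsafe.allocateMemory(1 << 20); // NOT budget-tracked
        unsafe.putByte(addr, (byte) 42);
        System.out.println("unsafe byte = " + unsafe.getByte(addr));
        unsafe.freeMemory(addr);                    // must free manually

        // Tracked route: counted against -XX:MaxDirectMemorySize.
        java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(1 << 20);
        System.out.println("tracked direct capacity = " + direct.capacity());
    }
}
```

The trade-off is why a library taking the Unsafe route avoids this particular OOM but also loses the JVM's safety net: leaks are invisible to the direct-memory budget and show up only as process RSS growth.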


[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

2014-08-11 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092892#comment-14092892
 ] 

Ryan McGuire commented on CASSANDRA-7743:
-

I'd recommend running [MAT|http://www.eclipse.org/mat/] on one of the core 
files to examine what exactly is eating up the RAM. That said, I'm not sure 
whether it helps with "Direct buffer memory", as I've only used it to debug 
things before we went off-heap.
