[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-03-01 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781793#comment-16781793
 ] 

Benedict commented on CASSANDRA-15006:
--

{quote}Do you have any idea what is the source of these "objects with arbitrary 
lifetimes"?
{quote}
Yes, sorry if I wasn't clear.  The {{ChunkCache}} (roughly Cassandra 3.x's 
internal equivalent of the Linux page cache, but also holding post-decompression 
'pages') uses Cassandra's {{BufferPool}}, which is designed for allocations that 
are freed in _near to_ the same order in which they were allocated.  The 
{{ChunkCache}} is LRU, however, so its contents can remain resident potentially 
forever, breaking this assumption.

The {{BufferPool}} allocates in units of 128KiB, meaning it only makes memory 
available for reuse once all 128KiB of a unit have been freed.  It looks like you 
have a 64KiB compression chunk size (which is the default for 3.x), so this will 
typically only require pairs of allocations to be freed together.  However, that 
is enough to leave many dangling, partially used 128KiB units whose unused 
portion cannot be reused for the time being.
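
To illustrate the failure mode, here is a minimal sketch (not the actual 
{{BufferPool}} code; the class and method names below are made up): a 128KiB 
unit only becomes recyclable once every 64KiB chunk carved from it has been 
freed, so a single chunk pinned by an LRU cache strands the other half 
indefinitely.

{code:java}
import java.nio.ByteBuffer;

// Hypothetical stand-in for a pooled 128KiB unit: reusable only once every
// slice carved out of it has been returned.
class MacroChunkSketch
{
    static final int UNIT_SIZE = 128 * 1024;
    static final int CHUNK_SIZE = 64 * 1024;   // 3.x default compression chunk size

    private final ByteBuffer unit = ByteBuffer.allocateDirect(UNIT_SIZE);
    private int outstanding = 0;

    ByteBuffer allocate()
    {
        ByteBuffer slice = unit.slice();
        slice.limit(CHUNK_SIZE);
        unit.position(unit.position() + CHUNK_SIZE);
        outstanding++;
        return slice;
    }

    void free(ByteBuffer chunk)
    {
        outstanding--;
    }

    boolean recyclable()
    {
        // The whole 128KiB unit only becomes reusable when every chunk is back.
        return outstanding == 0;
    }

    public static void main(String[] args)
    {
        MacroChunkSketch unit = new MacroChunkSketch();
        ByteBuffer shortLived = unit.allocate();  // freed soon after a read
        ByteBuffer cached = unit.allocate();      // retained by the LRU ChunkCache
        unit.free(shortLived);
        System.out.println(unit.recyclable());    // false: the cached chunk pins the whole unit
    }
}
{code}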

It's up to you how to address this: lower the configuration settings for these 
properties, raise your memory limits, or downgrade C*.  Memory should not grow 
unboundedly, only to some fraction above the normal chunk cache / buffer pool 
limits.  Certainly no more than twice (in the worst case each 128KiB unit 
retains a single live 64KiB chunk), and I would anticipate no more than about 
30% or so (but my maths is rusty, so I won't try to calculate a guess based on 
any assumed distribution).

> Possible java.nio.DirectByteBuffer leak
> ---
>
> Key: CASSANDRA-15006
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15006
> Project: Cassandra
>  Issue Type: Bug
> Environment: cassandra: 3.11.3
> jre: openjdk version "1.8.0_181"
> heap size: 2GB
> memory limit: 3GB (cgroup)
> I started one of the nodes with "-Djdk.nio.maxCachedBufferSize=262144" but 
> that did not seem to make any difference.
>Reporter: Jonas Borgström
>Priority: Major
> Attachments: CASSANDRA-15006-reference-chains.png, 
> Screenshot_2019-02-04 Grafana - Cassandra.png, Screenshot_2019-02-14 Grafana 
> - Cassandra(1).png, Screenshot_2019-02-14 Grafana - Cassandra.png, 
> Screenshot_2019-02-15 Grafana - Cassandra.png, Screenshot_2019-02-22 Grafana 
> - Cassandra.png, Screenshot_2019-02-25 Grafana - Cassandra.png, 
> cassandra.yaml, cmdline.txt
>
>
> While testing a 3 node 3.11.3 cluster I noticed that the nodes were suddenly 
> killed by the Linux OOM killer after running without issues for 4-5 weeks.
> After enabling more metrics and leaving the nodes running for 12 days it sure 
> looks like the
> "java.nio:type=BufferPool,name=direct" Mbean shows a very linear growth 
> (approx 15MiB/24h, see attached screenshot). Is this expected to keep growing 
> linearly after 12 days with a constant load?
>  
> In my setup the growth/leak is about 15MiB/day so I guess in most setups it 
> would take quite a few days until it becomes noticeable. I'm able to see the 
> same type of slow growth in other production clusters even though the graph 
> data is more noisy.






[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-03-01 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781779#comment-16781779
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

Thanks [~benedict]! Awesome work, your analysis sounds very reasonable!

I checked the logs and unfortunately these servers only keep logs for 5 days, so 
the logs from the node startups are long since lost.

But I did find a bunch of "INFO Maximum memory usage reached (531628032), cannot 
allocate chunk of 1048576" log entries, pretty much one every hour on the hour, 
which probably corresponds with the hourly Cassandra snapshots taken on each 
node.

Do you have any idea what is the source of these "objects with arbitrary 
lifetimes"? And why does it (at least in my tests) appear to increase linearly 
forever? If they are somehow related to repairs, I would assume that they would 
not increase from one repair to the next.

Also, regarding your proposed workaround for 3.11.x of lowering the chunk cache 
and buffer pool settings: would that "fix" the problem, or simply buy some more 
time until the process runs out of memory?

I guess that instead of lowering these two settings, simply raising the 
configured memory limit from 3GiB to 4 or 5GiB without changing the heap size 
setting would work equally well?

I have no problem raising my (rather low) memory limit if I know I will end up 
with a setup that will not run out of memory no matter how long it runs.

Again, thanks for your help!




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-03-01 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781711#comment-16781711
 ] 

Benedict commented on CASSANDRA-15006:
--

Thanks [~jborgstrom].

After some painful VisualVM usage (OQL is powerful but horrible), it looks like 
my initial thoughts were on the money:
 # sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && 
isOwnerOfMemory(it)'), 'it.capacity') 
 ** Total DirectByteBuffer capacity where the buffer is not a slice of another 
buffer and is not backed by a file descriptor
 ** 25th: 5.8236923E8
 ** 29th: 6.39534433E8
 # sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && 
isOwnerOfMemory(it) && isInNettyPool(it)'), 'it.capacity')
 ** Total DirectByteBuffer capacity where the buffer is in Netty's pool
 ** 25th: 3.3554432E7
 ** 29th: 3.3554432E7
 # sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && 
isOwnerOfMemory(it) && !isInChunkCache(it) && isHintsBuffer(it)'), 
'it.capacity')
 ** Total DirectByteBuffer capacity where the buffer is used for Hints
 ** 25th: 3.3554432E7
 ** 29th: 3.3554432E7
 # sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && 
isOwnerOfMemory(it) && isMaybeMacroChunk(it)'), 'it.capacity')
 ** Total DirectByteBuffer capacity where the buffer is very likely a 
{{BufferPool}} macro chunk
 ** 25th: 5.14756608E8
 ** 29th: 5.33704704E8
 # sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && 
isOwnerOfMemory(it) && isInChunkCache(it)'), 'it.capacity')
 ** Total DirectByteBuffer capacity where the buffer is in the chunk cache, but 
is not managed by the BufferPool
 ** 25th: 0
 ** 29th: 3.8076416E7
 # sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && 
isOwnerOfMemory(it) && !isMaybeMacroChunk(it) && !isInNettyPoolOrChunkCache(it) 
&& !isHintsBuffer(it)'), 'it.capacity')
 ** Total DirectByteBuffer capacity that is not explained by one of the above, 
and is not used for hints (which uses a stable 32MiB)
 ** 25th: 503758.0
 ** 29th: 69.0
 # sum(heap.objects('java.nio.DirectByteBuffer', 'true', '!isFileBacked(it) && 
!isOwnerOfMemory(it) && isInChunkCache(it)'), 'it.capacity')
 ** Total DirectByteBuffer capacity where the buffer is in the chunk cache, and 
_is_ managed by the BufferPool
 ** 25th: 4.72383488E8
 ** 29th: 4.10779648E8

So, basically, the ChunkCache is beginning to allocate memory directly because 
the BufferPool has run out of space.  It has run out of space because it was 
never intended to be used for objects with arbitrary lifetimes.
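
A sketch of the allocation path this implies (hypothetical names, not the actual 
Cassandra code): once the pool's memory budget is spent, requests fall through 
to plain unpooled direct allocations, which is consistent with the hourly 
"Maximum memory usage reached" log entries you found.

{code:java}
import java.nio.ByteBuffer;

// Hypothetical sketch of a pool that falls back to unpooled direct allocation
// once its memory budget is exhausted.
class PoolSketch
{
    private final long maxPoolBytes;
    private long pooledBytes;

    PoolSketch(long maxPoolBytes)
    {
        this.maxPoolBytes = maxPoolBytes;
    }

    synchronized ByteBuffer get(int size)
    {
        if (pooledBytes + size > maxPoolBytes)
        {
            // Budget exhausted: hand out a buffer outside the pool.  It is
            // reclaimed by GC rather than recycled, so it shows up as extra
            // direct memory the pool's own accounting never gets back.
            System.out.printf("Maximum memory usage reached (%d), cannot allocate chunk of %d%n",
                              maxPoolBytes, size);
            return ByteBuffer.allocateDirect(size);
        }
        pooledBytes += size;
        return ByteBuffer.allocateDirect(size); // stand-in for a slice of a pooled 128KiB unit
    }
}
{code}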

This was already on my radar as something to address, but I expect it won't be 
addressed for a couple of months, and I don't know which versions will be 
targeted for a fix.  The 3.0.x line should not have this problem, so if you have 
yet to go live, I would recommend using 3.0.x.  Otherwise, lower your chunk 
cache and buffer pool settings.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-03-01 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781543#comment-16781543
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

[~benedict] I've now emailed you links to the heap dumps I have.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-25 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777017#comment-16777017
 ] 

Benedict commented on CASSANDRA-15006:
--

Hi [~jborgstrom],

It's _possible_ there is a memory leak related to repair; however, seeing your 
logs and a heap dump remains the next best step.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-25 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777009#comment-16777009
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

Yay, I think I finally found a way to get the direct memory usage to stop 
growing and level out!

I've just attached an updated screenshot, and as you can see the linear growth 
of "java.nio:type=BufferPool,name=direct/MemoryUsed" has finally stopped!

It appears that as soon as I disabled the hourly "nodetool repair --full" last 
Friday, direct memory usage first dropped 10-20% and then leveled out.

But I don't really understand how running "repair --full" every hour can cause 
the direct memory usage to grow linearly. Each "repair --full" invocation only 
takes a couple of minutes, so the GC would have plenty of time to reclaim any 
resources used long before the next "repair --full" job starts.

Could this perhaps be related to CASSANDRA-14096?

 

Also, I did not run "repair --full" every hour when I triggered the OOM error 
the first time. That setup only ran an incremental repair every 5 days. I only 
later started running repairs more frequently to try to reproduce the problem 
faster.

 




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-22 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775223#comment-16775223
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

My test cluster has now been running for one more week since my last report 
with the same CQL read/write load and one full repair every hour.

There we can see that the "java.nio:type=BufferPool,name=direct/MemoryUsed" 
graph still shows very linear growth (10-15MB/day).

Last Saturday there was a short outage that caused one of the nodes to restart. 
That node now reports a value that is consistently about 300MB lower than the 
other nodes, but besides that it is increasing at the same rate as the other 
two.

The next thing I'm going to try is to change the "nodetool repair --full" 
frequency from once per hour to once every 5 days to see if that has any effect 
on the rate of the leak.

 

Btw, do we know of anyone else who is monitoring the 
"java.nio:type=BufferPool,name=direct" JMX MBean and is able to graph it? It 
would be very interesting to see how it looks for other clusters.

 




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-15 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769406#comment-16769406
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

Hi [~benedict]. Sorry I just missed your reply when I wrote my own reply 
yesterday.
h4. About the graphs

All graphed values are from standard Java/Cassandra JMX metrics. The name of 
each graph should explain exactly which MBean is used and which attribute of 
that MBean is plotted.

So, for example, the "java.nio:type=BufferPool,name=direct/MemoryUsed" graph 
plots the "MemoryUsed" attribute of the JMX MBean with objectName 
"java.nio:type=BufferPool,name=direct". Let me know if I should explain any of 
the other graphs in more detail, but they all follow the same naming pattern.
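
For reference, the same numbers can be read in-process through the standard 
{{BufferPoolMXBean}} API; the snippet below is only meant to show where these 
MBean attributes come from, not how the Grafana exporter is actually wired up.

{code:java}
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DumpBufferPools
{
    public static void main(String[] args)
    {
        // Backs the java.nio:type=BufferPool,name=direct|mapped MBeans.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
        {
            System.out.printf("%s: count=%d, memoryUsed=%d, totalCapacity=%d%n",
                              pool.getName(), pool.getCount(),
                              pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
{code}
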
h5. Memory limit

I run Cassandra on Kubernetes. Kubernetes lets you define hard memory limits 
that are enforced using cgroups. When I initially ran into this issue I had a 
hard memory limit of 3GiB, but in order to see if this would eventually level 
out I changed that to 4GiB when I started the current set of nodes.

How did you determine that the process has "much more than this (3GiB)" 
committed to it?

If we can assume that all "DirectByteBufferR" usage refers to mapped memory and 
not actual off-heap allocations, I think the process still stays under the 4GiB 
for now. It just does not level out the way I expect.
h5. Log files and configuration

I'm afraid the log entries from the node startup have already been deleted, but 
I've uploaded my cassandra.yaml and the command line used to start Java.
h5. Table truncation

I did truncate a large table yesterday, and as expected the 
"java.nio:type=BufferPool,name=mapped/MemoryUsed" graph shows that the amount 
of mapped memory immediately decreased from 6GiB to 0.6GiB, corresponding well 
with the amount of data left on the system. This value also seems mostly 
unchanged now, 24 hours later, probably because not much new data has been 
written since yesterday.

The "java.nio:type=BufferPool,name=direct/MemoryUsed" graph also immediately 
dropped from 800+MiB to around 500MiB. Interestingly, almost 24 hours later this 
value has made its way back up to over 750MiB without that much more data being 
written to Cassandra.

I forgot one detail. In addition to a fairly standard CQL load (mostly selects 
and some inserts) I've also been running "while true; do nodetool repair --full 
&& sleep 1h; done" since day 1.
h5. Heap Dump

I've saved a number of heap dumps during the test (for example, both before and 
after I truncated the table), so I'll try to work out a way to make them 
downloadable.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-14 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768386#comment-16768386
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

Ok, I just tried truncating the largest table. This lowered the data load on 
each node from about 6GiB to 0.5GiB.

I've attached a new screenshot that shows that this resulted in a dramatic 
reduction of the size of both the direct and mapped java.nio allocations.

It looks like long-lived SSTables can "accumulate" more and more 
DirectByteBuffer allocations over time (in addition to their chunk cache 
usage). These "accumulating" allocations are not freed until the corresponding 
SSTable file is unloaded (table truncation, compaction, etc.).

Am I missing something?

 

 




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-14 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768374#comment-16768374
 ] 

Benedict commented on CASSANDRA-15006:
--

Hi [~jborgstrom],

{{DirectByteBufferR}} simply means it is a read-only byte buffer.  This might be 
mapped.  Unfortunately, given the conflation of terms around {{BufferPool}}, it 
is hard to understand what your graphs mean.  Could you explicitly define them 
for me?  What does each graph title directly map to, and how is it being 
produced?
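
For what it's worth, the class name is easy to reproduce in isolation: on 
OpenJDK 8, a read-only view of a direct buffer and a READ_ONLY file mapping both 
come back as {{java.nio.DirectByteBufferR}}.  A quick check (not 
Cassandra-specific; the {{/tmp/example}} path is just a placeholder for any 
readable file):

{code:java}
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ReadOnlyBufferCheck
{
    public static void main(String[] args) throws Exception
    {
        // A read-only view of a direct buffer prints java.nio.DirectByteBufferR.
        System.out.println(ByteBuffer.allocateDirect(16).asReadOnlyBuffer().getClass().getName());

        // A READ_ONLY mmap of a file is also a DirectByteBufferR.
        try (RandomAccessFile f = new RandomAccessFile("/tmp/example", "r"))
        {
            System.out.println(f.getChannel()
                                .map(FileChannel.MapMode.READ_ONLY, 0, f.length())
                                .getClass().getName());
        }
    }
}
{code}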

I would not bother truncating any table.  Ideally, we would get a heap dump 
posted somewhere privately for us to download and analyse.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-14 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16768279#comment-16768279
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

I've just uploaded an updated Grafana screenshot that shows that the direct 
(off-heap) allocations are still increasing linearly after 22 days.

The reference-chains screenshot is from a tool called JXRay, and I believe the 
top 5.8GiB entry is for something called java.nio.DirectByteBufferR rather than 
java.nio.DirectByteBuffer. I'm no Java developer, but I believe that represents 
mmapped memory and not off-heap allocated memory, which matches well with the 
top right graph (java.nio:type=BufferPool,name=mapped).

Since I'm continuously both adding new data and accessing all existing data, I 
guess it is expected that the amount of mapped memory increases to match the 
size of the ever-increasing SSTables on disk. But I guess that is fine, since 
the Linux OOM killer should not kill the Java process for using too much 
memory-mapped memory.

But the OOM killer will kill the Java process if too much off-heap memory is 
used. Perhaps Cassandra for some reason needs to allocate a bit of 
direct/off-heap memory for every chunk of the memory-mapped regions it accesses?

 

Next I'll try to truncate the largest table to see what kind of effect that 
will have on the java.nio.DirectByteBuffer usage.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-08 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763463#comment-16763463
 ] 

Benedict commented on CASSANDRA-15006:
--

So, in CASSANDRA-11993 the BufferPool began being abused for jobs it wasn't 
intended for.  Its main user is now the chunk cache, so the lifetime of its 
buffers is considerably longer than was intended, and without any necessary 
correlation to the invocation of {{free()}} for allocations from a given chunk.  
This is almost certainly a bug, as a BufferPool chunk may be mostly unused yet 
remain allocated while a single ChunkCache entry still needs it.

But it's unclear from the data you've posted if this has anything to do with 
your significant memory usage.

I'm not used to the tooling you've posted images from, but it looks like 
there's 5.8GiB of buffers in total, and only around 500MiB of buffers that are 
reachable from the chunk cache or the buffer pool?  It looks like we want to 
figure out first what (presumably global) variable that {{ByteBuffer[]}} 
corresponds to.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-05 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760685#comment-16760685
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

I just attached a screenshot of the java.nio.DirectByteBuffers reference chains




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-05 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760655#comment-16760655
 ] 

Alex Petrov commented on CASSANDRA-15006:
-

[~jborgstrom] could you take a look at what retains those byte buffers? With a 
heap inspection tool, this can be seen by analysing GC roots.




[jira] [Commented] (CASSANDRA-15006) Possible java.nio.DirectByteBuffer leak

2019-02-05 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760636#comment-16760636
 ] 

Jonas Borgström commented on CASSANDRA-15006:
-

I took a heap dump of one of the nodes this morning and compared it against an 
older one using JXRay. I'm no expert, but it looks like the memory allocated by 
"org.apache.cassandra.utils.memory.BufferPool" keeps increasing. Since the 
ChunkCache seems capped at heap/4, what else is using the BufferPool?
