[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

2020-09-19 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198670#comment-17198670
 ] 

Benedict Elliott Smith edited comment on CASSANDRA-15214 at 9/19/20, 7:21 AM:
--

{quote}So it is possible that errors from Netty internal are not bubbled up
{quote}
As I have said, they are not, and that is the reason this ticket was filed.  I 
ascertained this at a time when I was intimately familiar with Netty’s 
workings, which I am not any longer.  I may have made a mistake, or Netty may 
have been updated to a version where this changed, but let's operate under the 
assumption I was correct at the time of filing this ticket, until proven 
otherwise.

The non-propagation of OOM by inspectThrowable was probably chosen precisely 
because propagating it was thought to achieve nothing besides logging against 
Netty's internal loggers (and failing to shut down would leave the channel in a 
worse state, as we would not have finished tidying up as a result of the 
exception), but I agree we should have left a TODO directly in the code.


was (Author: benedict):
As I have said, they do not - unless you are confident I am wrong. That is the 
reason this ticket was filed, and I ascertained this at a time when I was 
intimately familiar with Netty’s workings.  I may have made a mistake, or Netty 
may have been updated to a version where this changed, but please do not 
operate under that assumption. The non-propagation of OOM by inspectThrowable 
is not relevant, and probably was used precisely because propagating it 
achieves nothing but logging against Netty's internal loggers.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict Elliott Smith
>Assignee: Yifan Cai
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
> Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.
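
A minimal sketch of the "single thread spawned at startup" idea from the description above, offered as an illustration only. The class name FatalErrorPropagator and its methods are hypothetical and do not exist in Cassandra or Netty. It assumes that code which catches a Throwable it should not swallow hands it to report(), and that the handler passed to start() is whatever the process already uses for uncaught exceptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical helper; not part of Cassandra or Netty. */
final class FatalErrorPropagator {
    // Throwables handed over by threads whose frameworks would otherwise swallow them.
    private static final BlockingQueue<Throwable> PENDING = new LinkedBlockingQueue<>();

    /** Starts the single propagation thread; intended to be called once at startup. */
    static Thread start(Thread.UncaughtExceptionHandler handler) {
        Thread t = new Thread(() -> {
            Throwable fatal;
            try {
                fatal = PENDING.take(); // block until something is reported
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // Rethrow on this thread, where nothing catches it, so that the
            // uncaught exception handler observes it. The thread ends afterwards.
            if (fatal instanceof Error) throw (Error) fatal;
            if (fatal instanceof RuntimeException) throw (RuntimeException) fatal;
            throw new RuntimeException(fatal);
        }, "fatal-error-propagator");
        t.setDaemon(true);
        t.setUncaughtExceptionHandler(handler);
        t.start();
        return t;
    }

    /** Called from a catch-all block that must not swallow the given throwable. */
    static void report(Throwable t) {
        PENDING.offer(t);
    }
}
```

The thread terminates after the first rethrow, which seems acceptable if the handler is expected to stop the process. A real implementation would also have to avoid allocating on the reporting path while the heap is exhausted, which this sketch does not address.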



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

2020-09-19 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198670#comment-17198670
 ] 

Benedict Elliott Smith edited comment on CASSANDRA-15214 at 9/19/20, 7:15 AM:
--

As I have said, they do not - unless you are confident I am wrong. That is the 
reason this ticket was filed, and I ascertained this at a time when I was 
intimately familiar with Netty’s workings.  I may have made a mistake, or Netty 
may have been updated to a version where this changed, but please do not 
operate under that assumption. The non-propagation of OOM by inspectThrowable 
is not relevant, and probably was used precisely because propagating it 
achieves nothing but logging against Netty's internal loggers.


was (Author: benedict):
As I have said, they do not - unless you are confident I am wrong. That is the 
reason this ticket was filed, and I ascertained this at a time when I was 
intimately familiar with Netty’s workings. The non-propagation of OOM by 
inspectThrowable is irrelevant.




[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

2020-05-04 Thread Joey Lynch (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099546#comment-17099546
 ] 

Joey Lynch edited comment on CASSANDRA-15214 at 5/5/20, 5:57 AM:
-

Quick update on this from the jvmquake side: we are now building [architecture 
specific artifacts|https://github.com/Netflix-Skunkworks/jvmquake/releases] 
that will work with any JVM newer than Java 8; they link only against the 
platform specific libc (we're also now testing on Java 8 and 11, on both zulu 
and openjdk JVMs). I think this means it would be plausible to include the 
{{libjvmquake-linux-x86_64.so}} in {{libs}} and then have a switch on 
{{uname -s -m}} to decide whether to pick it up. Right now we're only building 
for Linux amd64, but if there is interest I can generate more architectures 
(Linux arm probably makes sense, and I could do OS X). I also still like the 
idea of having agents/available and agents/enabled folders like Apache does for 
modules; users can just symlink agents from one to the other to include them 
(and we can symlink jamm and jvmquake by default).

[~yifanc] I agree that the OutOfMemory conditions that do not result in a "true" 
JVM OOM (meaning one that would cause a heapdump via 
{{HeapDumpOnOutOfMemoryError}}), such as direct buffer allocations, will not 
get caught by jvmquake. My testing confirms your findings, although the 
jvmquake GC instability algorithm will still trigger in various real world 
scenarios I've run into.

I feel like the right move might be to walk back a small bit of CASSANDRA-13006 
where we stopped forcibly killing the JVM ourselves and let the JVM do it. 
Specifically if the OOM message contains "Direct buffer memory" we could do 
what jvmquake does and force the JVM into a "normal" OOM by [allocating large 
long 
arrays|https://github.com/Netflix-Skunkworks/jvmquake/blob/master/src/jvmquake.c#L103].
 This will then trigger a proper OOM and get us heap dumping. It's relatively 
easy to ignore the "sacrificial" long array in a heap dump and we could log 
clearly what is happening.
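
A rough Java sketch of the escalation described above, for illustration only. The class and method names are hypothetical, the 128 MiB array size is an arbitrary choice, and the message check is done case-insensitively on the assumption that the exact wording may differ between JDK versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper; not existing Cassandra code. */
final class DirectBufferOomEscalator {
    /** True if the throwable looks like direct buffer exhaustion rather than heap exhaustion. */
    static boolean isDirectBufferOom(Throwable t) {
        if (!(t instanceof OutOfMemoryError) || t.getMessage() == null) {
            return false;
        }
        // Assumption: matching on the message text is good enough to tell the cases apart.
        return t.getMessage().toLowerCase(Locale.ROOT).contains("direct buffer memory");
    }

    /** Deliberately exhausts the heap; this method never returns normally. */
    static void forceHeapOom() {
        List<long[]> sacrificial = new ArrayList<>();
        while (true) {
            // Each array is 2^24 longs (128 MiB); keeping references prevents collection.
            sacrificial.add(new long[1 << 24]);
        }
    }

    /** Entry point for code that has just caught a Throwable. */
    static void onError(Throwable t) {
        if (isDirectBufferOom(t)) {
            System.err.println("Direct buffer OOM detected; allocating sacrificial long[] "
                               + "arrays to force a heap OOM");
            forceHeapOom();
        }
    }
}
```

The sacrificial arrays will show up in any resulting heap dump as large long[] instances held by a single list, so they should be straightforward to identify and discount.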


was (Author: jolynch):
Quick update on this from the jvmquake side we are now building [architecture 
specific artifacts|https://github.com/Netflix-Skunkworks/jvmquake/releases] 
that will work with any JVM newer than Java 8, they link only against the 
platform specific libc (we're also now testing on Java 8 and 11, on both zulu 
and openjdk JVMs). I think this means it would be plausible to include the 
{{libjvmquake-linux-x86_64.so}} in {{libs}} and then have a switch on uname -s 
-m to determine to pick it up or not. Right now we're only building for linux 
amd64 but if there is interest I can generate more architectures (linux arm 
probably makes sense, and could do osx). I also still like the idea of having a 
agents/available and agents/enabled folder like apache does for modules, users 
can just symlink agents from one to the other to include them (and we can 
symlink jamm and jvmquake by default).

[~yifanc] I agree that the OutOfMemory conditions that do not result in "true" 
JVM OOM (meaning that it would cause a heapdump via {{HeapDumpOnOutOfMemory}}) 
will not get caught by jvmquake, my testing confirms your findings, although 
the jvmquake GC instability algorithm will still trigger in various real world 
scenarios I've run into.

I feel like the right move mightly be to walk back a small bit of 
CASSANDRA-13006 where we stopped forcibly killing the JVM ourselves and let the 
JVM do it. Specifically if the OOM message contains "Direct buffer memory" we 
could do what jvmquake does and force the JVM into a "normal" OOM by 
[allocating large long 
arrays|https://github.com/Netflix-Skunkworks/jvmquake/blob/master/src/jvmquake.c#L103].
 This will then trigger a proper OOM and get us heap dumping. It's relatively 
easy to ignore the "sacrificial" long array in a heap dump and we could log 
clearly what is happening.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict Elliott Smith
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
> Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.

[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-05 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900394#comment-16900394
 ] 

Benedict edited comment on CASSANDRA-15214 at 8/5/19 8:45 PM:
--

Sorry, I completely forgot to respond to this ticket so thanks for bumping it 
[~djoshi3]

From my POV, including a C JVMTI agent is absolutely fine, [~jolynch].  We'd 
have to take a closer look at jvmkill and jvmquake, and do our own brief audit 
of the version we include to ensure it seems to behave reasonably.  But I 
don't see any problem with utilising non-Java functionality.


was (Author: benedict):
Sorry, I completely forgot to respond to this ticket so thanks for bumping it 
[~djoshi3]

From my POV, including a C JVMTI agent is absolutely fine.  We'd have to take 
a closer look at jvmkill and jvmquake, and do our own brief audit of the 
version we include to ensure it seems to behave reasonably.  But I don't see 
any problem with utilising non-Java functionality.

> OOMs caught and not rethrown
> 
>
> Key: CASSANDRA-15214
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client, Messaging/Internode
>Reporter: Benedict
>Priority: Normal
> Fix For: 4.0
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, 
> so presently there is no way to ensure that an OOM reaches the JVM handler to 
> trigger a crash/heapdump.
> It may be that the simplest most consistent way to do this would be to have a 
> single thread spawned at startup that waits for any exceptions we must 
> propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed 
> future proof approach, it may be worth paying the cost of a single thread.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)




[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

2019-08-05 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900379#comment-16900379
 ] 

Dinesh Joshi edited comment on CASSANDRA-15214 at 8/5/19 8:07 PM:
--

I think this issue might be related to 
https://bugs.openjdk.java.net/browse/JDK-8027434. Other projects that use the 
JVM have run into a similar issue and the usual solution is to use 
[jvmkill|https://github.com/airlift/jvmkill]. The issue at hand is that when a 
JVM runs out of memory (heap or otherwise), it enters an undefined state. In 
this situation, I would not expect the handlers to work as expected. I think we 
should either use jvmkill or 
[jvmquake|https://github.com/Netflix-Skunkworks/jvmquake] to solve this issue 
as it has proven to be reliable and Netflix, Facebook and other large JVM users 
are actively using it.


was (Author: djoshi3):
I think this issue might be related to 
https://bugs.openjdk.java.net/browse/JDK-8027434. Other projects that use the 
JVM have run into a similar issue and the usual solution is to use 
[jvmkill|https://github.com/airlift/jvmkill]. The issue at hand is when a JVM 
has run out of memory (heap or otherwise), it enters an undefined state. In 
this situation, I would not expect the handlers to work as expected either. I 
think we should either use jvmkill or 
[jvmquake|https://github.com/Netflix-Skunkworks/jvmquake] to solve this issue 
as it has proven to be reliable and Netflix, Facebook and other large JVM users 
are actively using it.




[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

2019-07-21 Thread Tomas Shestakov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889767#comment-16889767
 ] 

Tomas Shestakov edited comment on CASSANDRA-15214 at 7/21/19 5:08 PM:
--

There are two options for handling *OOM* in Java versions >= 8u92 
[https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html]

-XX:+ExitOnOutOfMemoryError

-XX:+CrashOnOutOfMemoryError
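
As an illustration of how a process might verify at startup that one of these flags is actually in effect, here is a small Java sketch. The class and method names are hypothetical. It only inspects the arguments reported by the standard RuntimeMXBean, and it assumes the flags are written exactly as shown above.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical startup check; not existing Cassandra code. */
final class OomFlagCheck {
    /** True if the argument list enables exit-on-OOM or crash-on-OOM. */
    static boolean hasOomExitFlag(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+ExitOnOutOfMemoryError")
            || jvmArgs.contains("-XX:+CrashOnOutOfMemoryError");
    }

    /** Applies the check to the arguments the running JVM was started with. */
    static boolean currentJvmHasOomExitFlag() {
        return hasOomExitFlag(ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

A caller could log a warning when currentJvmHasOomExitFlag() returns false, which would at least make a missing flag visible in the logs.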


was (Author: shestakov):
There is two options to handle *OOM* in java 8u92 
[https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html]

-XX:+ExitOnOutOfMemoryError

-XX:+CrashOnOutOfMemoryError




[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

2019-07-16 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886475#comment-16886475
 ] 

Joseph Lynch edited comment on CASSANDRA-15214 at 7/16/19 9:36 PM:
---

We (Netflix) have found handling OOMs to be generally hard to do correctly 
across all the various Java codebases we have, so we built an agent that 
attaches to the JVM: [https://github.com/Netflix-Skunkworks/jvmquake]. I think 
the only reason that we couldn't just directly include that in C* is that it's 
a C JVMTI agent instead of a Java one, but perhaps we could just solve this 
with some documentation and by making it really easy to include agents (which 
is useful regardless)?  I can also spend some time and see if I can make it a 
Java agent instead of a C one.

The following is the patch for supporting easy pluggable agents for C*:
{noformat}
diff --git a/conf/cassandra-env.sh b/conf/cassandra-env.sh
index d6c48be0a3..92061db3ab 100644
--- a/conf/cassandra-env.sh
+++ b/conf/cassandra-env.sh
@@ -134,6 +134,29 @@ do
   JVM_OPTS="$JVM_OPTS $opt"
 done
 
+# Pull in any agents present in CASSANDRA_HOME
+for agent_file in ${CASSANDRA_HOME}/agents/*.jar; do
+  if [ -e "${agent_file}" ]; then
+    base_file="${agent_file%.jar}"
+    if [ -s "${base_file}.options" ]; then
+      options=`cat ${base_file}.options`
+      agent_file="${agent_file}=${options}"
+    fi
+    JVM_OPTS="$JVM_OPTS -javaagent:${agent_file}"
+  fi
+done
+
+for agent_file in ${CASSANDRA_HOME}/agents/*.so; do
+  if [ -e "${agent_file}" ]; then
+    base_file="${agent_file%.so}"
+    if [ -s "${base_file}.options" ]; then
+      options=`cat ${base_file}.options`
+      agent_file="${agent_file}=${options}"
+    fi
+    JVM_OPTS="$JVM_OPTS -agentpath:${agent_file}"
+  fi
+done
{noformat}
Then we can just drop agents into the {{CASSANDRA_HOME/agents}} folder and they 
are loaded automatically by Cassandra. From a security perspective this is 
identical to "drop a jar".


was (Author: jolynch):
We've (Netlfix) found handling OOMs to be generally hard to do correctly in all 
the various Java codebases we have so we built an agent solution which attaches 
to the JVM in [https://github.com/Netflix-Skunkworks/jvmquake]. I think the 
only reason that we couldn't just directly include that in C* is because it's a 
C JVMTI agent instead of a Java one, but perhaps we could just solve this with 
some documentation and making it really easy to include agents (which is useful 
regardless)?

The following is the patch for supporting easy pluggable agents for C*:
{noformat}
diff --git a/conf/cassandra-env.sh b/conf/cassandra-env.sh
index d6c48be0a3..92061db3ab 100644
--- a/conf/cassandra-env.sh
+++ b/conf/cassandra-env.sh
@@ -134,6 +134,29 @@ do
   JVM_OPTS="$JVM_OPTS $opt"
 done
 
+# Pull in any agents present in CASSANDRA_HOME
+for agent_file in ${CASSANDRA_HOME}/agents/*.jar; do
+  if [ -e "${agent_file}" ]; then
+    base_file="${agent_file%.jar}"
+    if [ -s "${base_file}.options" ]; then
+      options=`cat ${base_file}.options`
+      agent_file="${agent_file}=${options}"
+    fi
+    JVM_OPTS="$JVM_OPTS -javaagent:${agent_file}"
+  fi
+done
+
+for agent_file in ${CASSANDRA_HOME}/agents/*.so; do
+  if [ -e "${agent_file}" ]; then
+    base_file="${agent_file%.so}"
+    if [ -s "${base_file}.options" ]; then
+      options=`cat ${base_file}.options`
+      agent_file="${agent_file}=${options}"
+    fi
+    JVM_OPTS="$JVM_OPTS -agentpath:${agent_file}"
+  fi
+done
{noformat}
Then we can just drop agents into the {{CASSANDRA_HOME/agents}} folder and they 
are loaded automatically by Cassandra. From a security perspective this is 
identical to "drop a jar".
