Re: OOM not always detected by Mesos Slave
I found no such file in this case.

On Wed, Nov 12, 2014 at 8:53 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: I find the OOM logging from the kernel in /var/log/kern.log.
Re: OOM not always detected by Mesos Slave
In reply to your original issue: it is possible to influence the kernel OOM killer's decision about which process to kill to free memory. An OOM score is computed for each process and depends on age (the killer tends to pick shorter-lived processes) and usage (it tends to pick larger memory users), i.e., it generally favors killing something other than the executor. This score could be adjusted to even more strongly prefer not killing the executor by setting an OOM adjustment; see https://issues.apache.org/jira/browse/MESOS-416, which discusses this setting for the master and slave. A sketch of the interface follows below.
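For illustration only (the pid is a placeholder; oom_score_adj exists on kernels since roughly 2.6.36, older kernels expose oom_adj instead):

    # Make the executor much less attractive to the OOM killer.
    # oom_score_adj ranges from -1000 (never kill) to 1000; lowering it
    # below 0 requires CAP_SYS_RESOURCE.
    echo -900 > /proc/<executor-pid>/oom_score_adj

    # On older kernels the equivalent knob is oom_adj (-17 disables OOM kill):
    echo -15 > /proc/<executor-pid>/oom_adj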
We could then check for an OOM, even if the executor exits 0, and report accordingly. Does that address your original question?

Ian
Re: OOM not always detected by Mesos Slave
Created: [MESOS-2105] Reliably report OOM even if the executor exits normally https://issues.apache.org/jira/browse/MESOS-2105

On Thu, Nov 13, 2014 at 12:07 PM, Whitney Sorenson wsoren...@hubspot.com wrote: Yeah, I think so. Ultimately, what my users and I are looking for is consistency in the reporting of TASK_FAILED when an OOM is involved. If any OOM happens, I'd rather the entire process tree always be taken out and that it be reliably reported as such.
Re: OOM not always detected by Mesos Slave
I missed the call-to-action here regarding adding logs. I have some logs from a recent occurrence (this seems to happen quite frequently). However, in this case I can't find a corresponding message anywhere on the system that refers to a kernel OOM (is there a place to check besides /var/log/messages or /var/log/dmesg?).

One problem we have with sizing JVM-based tasks is appropriately estimating max thread counts.

https://gist.github.com/wsorenson/d2e49b96e84af86c9492
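A rough way to calibrate that thread-count estimate (the pid and jar name are placeholders; the numbers are illustrative):

    # Count the threads in a running JVM to see what the task actually uses:
    ls /proc/<jvm-pid>/task | wc -l

    # Each thread reserves a stack; cap it explicitly so the per-thread
    # cost is known when sizing the task:
    java -Xss512k -jar app.jar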
Re: OOM not always detected by Mesos Slave
I find the OOM logging from the kernel in /var/log/kern.log.
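For instance (log locations vary by distro: Debian/Ubuntu write kernel messages to kern.log, RHEL-family systems to /var/log/messages):

    # Search the usual kernel log locations for OOM-killer activity:
    grep -i "invoked oom-killer" /var/log/kern.log /var/log/messages

    # Or read the in-memory kernel ring buffer directly:
    dmesg | grep -i oom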
Re: OOM not always detected by Mesos Slave
+Ian

Sorry for the delay. When your cgroup OOMs, a few things will occur:

(1) The kernel will notify mesos-slave about the OOM event.
(2) The kernel's OOM killer will pick a process in your cgroup to kill.
(3) Once notified, mesos-slave will begin destroying the cgroup.
(4) Once the executor terminates, any tasks that were non-terminal on the executor will have status updates sent with the OOM message.

This does not all happen atomically, so it is possible that the kernel kills your task process and your executor sends a status update before the slave completes the destruction of the cgroup. (A sketch of the cgroup-side interface behind (1) follows below.)

Userspace OOM handling is supported, and we tried using it in the past, but it is not reliable:
https://issues.apache.org/jira/browse/MESOS-662
http://lwn.net/Articles/317814/
http://lwn.net/Articles/552789/
http://lwn.net/Articles/590960/
http://lwn.net/Articles/591990/

Since you have the luxury of avoiding the OOM killer (JVM flags with padding), I would recommend leveraging that for now. Do you have the logs for your issue? My guess is that it took time for us to destroy the cgroup (possibly due to freezer issues), so there was plenty of time for your executor to send the status update to the slave.
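A minimal sketch of that interface, assuming a cgroup v1 memory controller mounted at the usual path and a placeholder container id:

    # The slave registers for OOM events through cgroup.event_control;
    # the OOM state of a container is visible in its memory cgroup:
    cat /sys/fs/cgroup/memory/mesos/<container-id>/memory.oom_control
    #   oom_kill_disable 0
    #   under_oom 0

    # How many times allocations have hit the limit:
    cat /sys/fs/cgroup/memory/mesos/<container-id>/memory.failcnt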
On Sat, Sep 6, 2014 at 6:56 AM, Whitney Sorenson wsoren...@hubspot.com wrote: We already pad the JVM and make room for our executor, and we try to get users to give the correct allowances. However, to be fair, your answer to my question about how Mesos is handling OOMs is to suggest we avoid them. I think we're always going to experience some cgroup OOMs, and we'd be better off if we had a consistent way of handling them.
Re: OOM not always detected by Mesos Slave
There is some overhead for the JVM itself, which should be added to the total memory usage of the task. So you can't give the task the same amount of memory as you pass to java's -Xmx parameter; the task needs headroom above the heap.
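As a rough illustration of that headroom (all numbers are invented for the example; real overhead depends on JVM version and workload):

    # A cgroup limit equal to -Xmx alone will OOM; budget the non-heap parts:
    #   heap (-Xmx)                           768 MB
    #   permgen/metaspace                     ~64 MB
    #   thread stacks (100 threads x 512 KB)  ~50 MB
    #   direct buffers, code cache, JVM own   ~64 MB
    #   => request roughly 1 GB from Mesos, not 768 MB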
Re: OOM not always detected by Mesos Slave
Looks like you're using the JVM; can you set all of your JVM flags to limit memory consumption? This would favor an OutOfMemoryError instead of OOMing the cgroup.
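For example (standard HotSpot flags with illustrative values and a placeholder jar; java7 uses -XX:MaxPermSize where java8+ uses -XX:MaxMetaspaceSize):

    # Bound each memory pool explicitly so the JVM throws OutOfMemoryError
    # before the cgroup limit is reached:
    java -Xmx768m -Xss512k -XX:MaxPermSize=64m -XX:MaxDirectMemorySize=64m -jar app.jar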
OOM not always detected by Mesos Slave
Recently, I've seen at least one case where a process inside a task's cgroup exceeded the memory limit and was killed directly. The executor recognized that the process was killed and sent a TASK_FAILED. However, it seems far more common to see the executor process itself destroyed, after which the mesos slave (I'm making some assumptions here about how it all works) sends a TASK_FAILED which includes information about the memory usage.

Is there something we can do to make this behavior more consistent? Alternatively, can we provide some functionality to hook into, so we don't need to duplicate the work of the mesos slave in order to provide the same information in the TASK_FAILED message? I think users would like to know definitively that the task OOM'd; in the case where the underlying process is killed, it may take a lot of digging to find the underlying cause if you aren't looking for it.

-Whitney

Here are the relevant lines from /var/log/messages in case something else is amiss:

Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067321] Task in /mesos/2dda5398-6aa6-49bb-8904-37548eae837e killed as a result of limit of /mesos/2dda5398-6aa6-49bb-8904-37548eae837e
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.067334] memory: usage 917420kB, limit 917504kB, failcnt 106672
Aug 27 23:24:07 ip-10-237-165-119 kernel: [2604343.066947] java7 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
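(A sketch of the first case above: an executor can infer a kernel kill from the child's wait status, using the shell's 128+signal convention; 'my_task' is a placeholder command.)

    my_task &
    task_pid=$!
    wait "$task_pid"
    status=$?
    if [ "$status" -gt 128 ]; then
        # 128 + 9 (SIGKILL) is what the kernel OOM killer produces
        echo "task killed by signal $((status - 128)); check kernel logs" >&2
    fi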