Re: memory limit exceeded ==> KILL instead of TERM (first)
SIGKILL can't be caught.

2016-02-12 11:29 GMT+01:00 haosdent:
> >Is there a specific reason why the slave does not first send a TERM
> >signal, and if that does not help after a certain timeout, send a KILL
> >signal?
> >That would give us a chance to cleanup consul registrations (and other
> >cleanup).
> I think maybe this flow is more complex? How about you register a KILL
> signal listener to cleanup consul registration?
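Kamil's point can be demonstrated directly: a process may install a handler for SIGTERM, but the kernel refuses to install any handler for SIGKILL. A minimal Python sketch (the Consul deregistration is a placeholder comment, not a real API call):

```python
import os
import signal

cleanup_ran = []

def on_sigterm(signum, frame):
    # Placeholder for real cleanup, e.g. deregistering from Consul.
    cleanup_ran.append(signum)

# SIGTERM can be caught: the handler runs and the process survives.
signal.signal(signal.SIGTERM, on_sigterm)
os.kill(os.getpid(), signal.SIGTERM)
print("survived SIGTERM, cleanup ran:", bool(cleanup_ran))

# SIGKILL cannot: signal.signal() raises OSError for it.
try:
    signal.signal(signal.SIGKILL, on_sigterm)
except OSError as err:
    print("cannot catch SIGKILL:", err)
```

So a "KILL signal listener" is not possible; cleanup hooks only work if the supervisor sends a catchable signal first.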
memory limit exceeded ==> KILL instead of TERM (first)
Hi,

we have a Mesos (0.27) cluster running with (here relevant) slave options:
  --cgroups_enable_cfs=true
  --cgroups_limit_swap=true
  --isolation=cgroups/cpu,cgroups/mem

What we see happening is that people are running Tasks (Java applications) and specify a memory resource limit that is too low, which causes these tasks to be terminated, see logs below. That's all fine, after all you should specify reasonable memory limits. It looks like the slave sends a KILL signal when the limit is reached, so the application has no chance to do recovery termination, which (in our case) results in consul registrations not being cleaned up.

Is there a specific reason why the slave does not first send a TERM signal and, if that does not help after a certain timeout, send a KILL signal? That would give us a chance to clean up consul registrations (and other cleanup).

kind regards,
Harry

I0212 09:27:49.238371 11062 containerizer.cpp:1460] Container bed2585a-c361-4c66-afd9-69e70e748ae2 has reached its limit for resource mem(*):160 and will be terminated
I0212 09:27:49.238418 11062 containerizer.cpp:1227] Destroying container 'bed2585a-c361-4c66-afd9-69e70e748ae2'
I0212 09:27:49.240932 11062 cgroups.cpp:2427] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2
I0212 09:27:49.345171 11062 cgroups.cpp:1409] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2 after 104.21376ms
I0212 09:27:49.347303 11062 cgroups.cpp:2445] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2
I0212 09:27:49.349453 11062 cgroups.cpp:1438] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2 after 2.123008ms
I0212 09:27:49.359627 11062 slave.cpp:3481] executor(1)@10.239.204.142:43950 exited
I0212 09:27:49.381942 11062 containerizer.cpp:1443] Executor for container 'bed2585a-c361-4c66-afd9-69e70e748ae2' has exited
I0212 09:27:49.389766 11062 provisioner.cpp:306] Ignoring destroy request for unknown container bed2585a-c361-4c66-afd9-69e70e748ae2
I0212 09:27:49.389853 11062 slave.cpp:3816] Executor 'fulltest02.6cd29bd8-d162-11e5-a4df-005056aa67df' of framework 7baec9af-018f-4a4c-822a-117d61187471-0001 terminated with signal Killed
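The escalation Harry asks about — TERM first, KILL after a grace period — is easy to express outside of Mesos; this is a hedged sketch of the general pattern, not of what Mesos itself does (the grace period and stubborn child process are illustrative only):

```python
import subprocess
import sys

def graceful_kill(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL if needed."""
    proc.terminate()  # SIGTERM: the child may trap this and clean up
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL: cannot be caught or ignored
        return proc.wait()

# A stubborn child that ignores SIGTERM, forcing the escalation:
child = subprocess.Popen(
    [sys.executable, "-c",
     "import signal, time;"
     "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
     "print('ready', flush=True); time.sleep(60)"],
    stdout=subprocess.PIPE, text=True)
child.stdout.readline()  # wait until the child has installed its handler
rc = graceful_kill(child, grace_seconds=1)
print(rc)  # negative signal number on POSIX; -9 means killed by SIGKILL
```

A well-behaved child that exits on SIGTERM would return within the grace period and never see SIGKILL.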
Re: memory limit exceeded ==> KILL instead of TERM (first)
I'm not familiar with why SIGKILL is sent directly without SIGTERM, but is it possible to have your consul registry cleaned up when the task is killed, by adding consul health checks?

On Fri, Feb 12, 2016 at 6:12 PM, Harry Metske wrote:
> Is there a specific reason why the slave does not first send a TERM
> signal, and if that does not help after a certain timeout, send a KILL
> signal?
> That would give us a chance to cleanup consul registrations (and other
> cleanup).
Re: memory limit exceeded ==> KILL instead of TERM (first)
>Is there a specific reason why the slave does not first send a TERM
>signal, and if that does not help after a certain timeout, send a KILL
>signal?
>That would give us a chance to cleanup consul registrations (and other
>cleanup).

I think maybe this flow is more complex? How about you register a KILL signal listener to cleanup consul registration?

--
Best Regards,
Haosdent Huang
Re: memory limit exceeded ==> KILL instead of TERM (first)
>I'm not familiar with why SIGKILL is sent directly without SIGTERM

We send KILL in both posix_launcher and linux_launcher:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/launcher.cpp#L170
https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1566

>SIGKILL can't be caught.

It seems you could not clean up consul registrations when receiving a kill in the MesosContainerizer. Did you try the DockerContainerizer? I think "docker stop" would send TERM first.

--
Best Regards,
Haosdent Huang
Re: memory limit exceeded ==> KILL instead of TERM (first)
> Is there a specific reason why the slave does not first send a TERM
> signal, and if that does not help after a certain timeout, send a KILL
> signal? That would give us a chance to cleanup consul registrations
> (and other cleanup).

First of all, it's wrong that you want to handle the memory limit in your app. Things like this are outside of its scope. Your app can be lost because of many different system or hardware failures that you just can't catch. You need to let it crash and design your architecture with this in mind. Secondly, the Mesos SIGKILL is consistent with the Linux OOM killer, and it does the right thing:
https://github.com/torvalds/linux/blob/4e5448a31d73d0e944b7adb9049438a09bc332cb/mm/oom_kill.c#L586

Best regards,
Kamil
Re: memory limit exceeded ==> KILL instead of TERM (first)
We don't want to use Docker (yet) in this environment, so the DockerContainerizer is not an option. After thinking a bit longer, I tend to agree with Kamil and will let the problem be handled differently.

Thanks for the amazingly fast responses!

kind regards,
Harry

On 12 February 2016 at 12:28, Kamil Chmielewski wrote:
> First of all it's wrong that you want to handle memory limit in your app.
> You need to let it crash and design your architecture with this in mind.
Re: Managing Persistency via Frameworks (HDFS, Cassandra)
Hi Tommy,

thanks a lot for sharing. And yes, that is what I figured. For PoC/testing environments the frameworks work just fine.

--
Andreas

On Tue, Feb 9, 2016 at 1:01 PM, tommy xiao wrote:
> Hi Andreas,
>
> I have recommended that my customers build an HDFS resource pool outside the
> mesos cluster, due to general concerns. But in a development or staging
> environment, using mesos to manage your hdfs cluster is the ideal approach.
> When the mesos community has more production cases, we can then upgrade the
> development cluster to a production cluster easily.
>
> 2016-02-09 14:50 GMT+08:00 Andreas Fritzler:
>> Hi Klaus,
>>
>> thanks for your reply. I am aware of the frameworks provided by mesosphere
>> and I already tried them out in a POC setup. From looking at the HDFS
>> documentation [1] however, the framework seems to be still in beta:
>>
>> "HDFS is available at the beta level and not recommended for Mesosphere
>> DCOS production systems."
>>
>> I think what my questions are boiling down to is the following: should I
>> use a Mesos framework to manage persistency within my Mesos cluster, or
>> should I do it outside with other means - e.g. using Ambari to set up a
>> shared HDFS etc.
>>
>> If I were to use those frameworks, what is your experience regarding the
>> life cycle management? Scaling out instances, upgrading to newer versions
>> etc.
>>
>> Regards,
>> Andreas
>>
>> [1] https://docs.mesosphere.com/manage-service/hdfs/
>>
>> On Tue, Feb 9, 2016 at 1:05 AM, Klaus Ma wrote:
>>> Hi Andreas,
>>>
>>> I think Mesosphere has done some work on your questions, would you check
>>> the related repos at https://github.com/mesosphere ?
>>>
>>> On Mon, Feb 8, 2016 at 9:43 PM Andreas Fritzler wrote:
>>>> Hi,
>>>>
>>>> I have a couple of questions around the persistency topic within a
>>>> Mesos cluster:
>>>>
>>>> 1. Any takes on the quality of the HDFS [1] and the Cassandra [2]
>>>> frameworks? Does anybody have any experience running those frameworks
>>>> in production?
>>>> 2. How well are those frameworks performing if I want to use them to
>>>> separate tenants on one Mesos cluster? (HDFS is not dockerized yet?)
>>>> 3. How about scaling out/down existing framework instances? Is that
>>>> even possible? Couldn't find anything in the docs/github.
>>>> 4. Upgrading a running instance: wondering how that is managed in
>>>> those frameworks. There is an open issue for the HDFS [3] part. For
>>>> Cassandra the scheduler update seems to be smooth, however changing
>>>> the underlying Cassandra version seems to be tricky [4].
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> [1] https://github.com/mesosphere/hdfs
>>>> [2] https://github.com/mesosphere/cassandra-mesos
>>>> [3] https://github.com/mesosphere/hdfs/issues/23
>>>> [4] https://github.com/mesosphere/cassandra-mesos/issues/137
Re: memory limit exceeded ==> KILL instead of TERM (first)
In larger deployments, with many applications, you may not always be able to ask good memory practices of app developers. We've found that reporting *why* a job was killed, with details of container utilization, is an effective way of helping app developers get better at mem mgmt.

The alternative, just having jobs die, incentivizes bad behavior. For example, a hurried job owner may just double the memory of the executor, trading slack for stability.
Re: Docker Containerizer: custom name possible?
>is there some way to inject an externally defined string into the container
>name *before* Mesos launches the container, so that ever after for the life
>of that container the container name contains that string?

So far there is no way to inject a custom string into the name of a docker container launched by Mesos.

--
Best Regards,
Haosdent Huang
Re: Docker Containerizer: custom name possible?
> On Thu, 11 Feb 2016 19:09:29 +0800, tommy xiao said:
TX> if you have more concerns on the request, please file a issue to
TX> discussion.

> On Thu, 11 Feb 2016 19:19:39 +0800, haosdent said:
HD> If you want inject inside container, the name stored in
HD> MESOS_CONTAINER_NAME. If you want inject outside, you could get it
HD> by /state endpoint. The container name is combined by
HD> DOCKER_NAME_PREFIX + slaveId + DOCKER_NAME_SEPERATOR + containerId.

Thanks for your responses. Haosdent, is there some way to inject an externally defined string into the container name *before* Mesos launches the container, so that ever after for the life of that container the container name contains that string?

I appreciate your tolerance of my newbieness.

Thanks,

Ed
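Per haosdent's description, the containerizer composes the name itself from fixed parts, which is why there is no hook to inject a custom string. A sketch of that composition; the prefix and separator values here are assumptions for illustration, not verified against the Mesos source:

```python
# Assumed constant values, for illustration only.
DOCKER_NAME_PREFIX = "mesos-"
DOCKER_NAME_SEPARATOR = "."

def docker_container_name(slave_id, container_id):
    """Name as described above: DOCKER_NAME_PREFIX + slaveId +
    DOCKER_NAME_SEPERATOR + containerId."""
    return DOCKER_NAME_PREFIX + slave_id + DOCKER_NAME_SEPARATOR + container_id

# Using IDs of the shape seen elsewhere in this thread:
name = docker_container_name("20150806-001422-1801655306-5050-22041-S65",
                             "bed2585a-c361-4c66-afd9-69e70e748ae2")
print(name)
```

Matching a running container back to its Mesos task therefore means parsing the slave and container IDs out of the name, or querying the agent's /state endpoint.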
Re: memory limit exceeded ==> KILL instead of TERM (first)
+1 to what Kamil said. That is exactly the reason why we designed it that way. Also, the *why* is included in the status update message.

@vinodkone

> On Feb 12, 2016, at 6:08 AM, David J. Palaitis wrote:
>
> We've found that reporting *why* a job was killed, with details of
> container utilization, is an effective way of helping app developers get
> better at mem mgmt.
Re: memory limit exceeded ==> KILL instead of TERM (first)
David,

that's exactly the scenario I am afraid of: developers specifying way too large memory requirements, just to make sure their tasks don't get killed. Any suggestions on how to report this *why* to the developers? As far as I know, the only place where you find the reason is in the logfile of the slave; the UI only tells you the task failed, not the reason.

(We could put some logfile monitoring in place picking up these messages of course, but if there are better ways, we are always interested.)

kind regards,
Harry

On 12 February 2016 at 15:08, David J. Palaitis wrote:
> We've found that reporting *why* a job was killed, with details of
> container utilization, is an effective way of helping app developers get
> better at mem mgmt.
Re: memory limit exceeded ==> KILL instead of TERM (first)
hey Harry,

As Vinod said, the mesos-slave/agent will issue a status update about the OOM condition. This will be received by the scheduler of the framework. In the storm-mesos framework we just log the messages (see below), but you might consider somehow exposing these messages directly to the app owners:

Received status update: {"task_id":"TASK_ID","slave_id":"20150806-001422-1801655306-5050-22041-S65","state":"TASK_FAILED","message":"Memory limit exceeded: Requested: 2200MB Maximum Used: 2200MB\n\nMEMORY STATISTICS: \ncache 20480\nrss 1811943424\nmapped_file 0\npgpgin 8777434\npgpgout 8805691\nswap 96878592\ninactive_anon 644186112\nactive_anon 1357594624\ninactive_file 20480\nactive_file 0\nunevictable 0\nhierarchical_memory_limit 2306867200\nhierarchical_memsw_limit 9223372036854775807\ntotal_cache 20480\ntotal_rss 1811943424\ntotal_mapped_file 0\ntotal_pgpgin 8777434\ntotal_pgpgout 8805691\ntotal_swap 96878592\ntotal_inactive_anon 644186112\ntotal_active_anon 1355497472\ntotal_inactive_file 20480\ntotal_active_file 0\ntotal_unevictable 0"}

- Erik
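One way to expose the reason to app owners is to parse the TASK_FAILED update in the scheduler. A sketch assuming the message format shown in Erik's sample above (the field layout is inferred from that single sample, so treat it as an assumption):

```python
import json

def parse_oom_update(update_json):
    """Split the 'Memory limit exceeded' summary from the cgroup
    memory statistics embedded in the status update message."""
    message = json.loads(update_json)["message"]
    summary, _, stats_block = message.partition("MEMORY STATISTICS:")
    stats = {}
    for line in stats_block.strip().splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value.strip())
    return summary.strip(), stats

# Abbreviated sample in the format logged above:
update = ('{"task_id":"TASK_ID","state":"TASK_FAILED",'
          '"message":"Memory limit exceeded: Requested: 2200MB '
          'Maximum Used: 2200MB\\n\\nMEMORY STATISTICS: '
          '\\ncache 20480\\nrss 1811943424\\nswap 96878592"}')
summary, stats = parse_oom_update(update)
print(summary)       # the "Memory limit exceeded: ..." headline
print(stats["rss"])  # resident set size in bytes at kill time
```

A scheduler could forward the summary line to a dashboard or chat channel rather than burying it in the slave's logfile.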
Re: memory limit exceeded ==> KILL instead of TERM (first)
2016-02-12 19:41 GMT+01:00 Erik Weathers:
> As Vinod said, the mesos-slave/agent will issue a status update about the
> OOM condition. This will be received by the scheduler of the framework.

Marathon also presents this information. Developers will see it on the Debug tab, in the Last Task Failure section.

Best Regards,
Kamil
Re: memory limit exceeded ==> KILL instead of TERM (first)
>> we could put some logfile monitoring in place picking up these messages

of course, that's about what we came up with.

>> the mesos-slave/agent will issue a status update about the OOM condition

ok, definitely missed that one - this will help a lot. thanks @vinod

On Fri, Feb 12, 2016 at 2:41 PM, Harry Metske wrote:

> Yup, I just noticed it's there :-)
>
> tx,
> Harry
>
> On 12 February 2016 at 20:38, Kamil Chmielewski wrote:
>
>> 2016-02-12 19:41 GMT+01:00 Erik Weathers:
>>
>>> hey Harry,
>>>
>>> As Vinod said, the mesos-slave/agent will issue a status update about the OOM condition. This will be received by the scheduler of the framework. In the storm-mesos framework we just log the messages, but you might consider somehow exposing these messages directly to the app owners:
>>>
>>> Received status update: [snip - full status-update message quoted upthread]
>>
>> Marathon also presents this information. Developers will see it on the Debug tab, in the Last Task Failure section.
>>
>> Best Regards,
>> Kamil
Precision of scalar resources
tl;dr: If you use resource values with more than three decimal digits of precision (e.g., you are launching a task that uses 2.5001 CPUs), please speak up!

Mesos uses floating point to represent scalar resource values, such as the number of CPUs in a resource offer or dynamic reservation. The master does resource math in floating point, which leads to a few problems:

* due to roundoff error, frameworks can receive offers that have unexpected resource values (e.g., MESOS-3990)
* various internal assertions in the master can fail due to roundoff error (e.g., MESOS-3552)

In the long term, we can solve these problems by switching to a fixed-point representation for scalar values. However, that will require a long deprecation cycle. In the short term, we should make floating point behavior more reliable. To do that, I propose:

(1) Resource values will support AT MOST three decimal digits of precision. Additional precision in resource values will be discarded (via rounding).

(2) The master will internally use a fixed-point representation to avoid unpredictable roundoff behavior.

For more details, please see the design doc here:
https://docs.google.com/document/d/14qLxjZsfIpfynbx0USLJR0GELSq8hdZJUWw6kaY_DXc

-- comments welcome!

Thanks,
Neil
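The two proposals can be sketched together: round external values to an integral number of "millis" (three decimal digits), and do all internal math on those integers, which is exact. This is an illustrative sketch, not the actual Mesos implementation; the helper names are made up.

```python
MILLIS = 1000  # three decimal digits of precision

def to_fixed(value):
    """Round a scalar resource value to at most three decimal digits and
    return it as an integral number of 'millis' (e.g. millicpus).
    Extra precision is discarded via rounding, per proposal (1)."""
    return int(round(value * MILLIS))

def from_fixed(millis):
    """Convert back to the external floating-point representation."""
    return millis / MILLIS

# Per proposal (2): arithmetic on the integral form is exact, so repeated
# allocate/recover cycles cannot accumulate floating-point roundoff the
# way `0.1 + 0.2 != 0.3` does with raw doubles.
offered = to_fixed(2.5001)                       # extra precision discarded
remaining = offered - to_fixed(0.1) - to_fixed(0.2)
```

The `0.1 + 0.2` case is exactly the kind of drift behind bugs like MESOS-3990: in doubles the sum is not `0.3`, but in millis `100 + 200` is exactly `300`.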
Re: Updated agent resources with every offer.
Say your task asks for 1 cpu and 1gb disk. After the task terminates, mesos immediately offers back the 1 cpu and 1gb disk. That makes sense for cpu, but not so much for disk: the mesos slave overcommits the disk in that sense, mainly to allow task owners access to sandbox data after task termination. The asynchronous gc thread garbage-collects the sandbox if there is disk space pressure on the host.

@vinodkone

> On Feb 12, 2016, at 5:26 PM, Arkal Arjun Rao wrote:
>
> That can be modified with the right values for gc_delay.
>
> I'm running a very basic test where I accept a request, write a file to the sandbox, sleep for 100s, then exit. After exit, I probe the next offer.
>
> Having not specified any value for disk_watch_interval and assuming it is the default 60s, the new offer should have disk = (original value - size of file I wrote to sandbox), right? Am I missing something here?
>
> Arjun
>
> On Fri, Feb 12, 2016 at 5:05 PM, Chong Chen wrote:
>
>> Hi,
>>
>> I think the garbage collector of the Mesos agent will remove the directory of the finished task.
>>
>> Thanks!
>>
>> *From:* Arkal Arjun Rao [mailto:aa...@ucsc.edu]
>> *Sent:* Friday, February 12, 2016 4:22 PM
>> *To:* user@mesos.apache.org
>> *Subject:* Re: Updated agent resources with every offer.
>>
>> Hi Vinod,
>>
>> Thanks for the reply. I think I understand what you mean. Could you clarify these follow-up questions?
>>
>> 1. So if I did write to the sandbox, mesos would know and send the correct offer?
>>
>> 2. And if so, and this might be hacky: if I bind-mounted my docker folder (where all cached images are stored) into a sandbox directory, do you think Mesos would register the correct state of the disk in the offer? (Suppose I were to spawn a possibly persistent job that requests 0 cores, 0 memory and 0gb disk, and use its sandbox.)
>>
>> Thanks again,
>> Arjun
>>
>> On Fri, Feb 12, 2016 at 4:08 PM, Vinod Kone wrote:
>>
>>> If your job is writing stuff outside the sandbox, it is up to your framework to do that resource accounting. It is really tricky for Mesos to do that. For example, the second job might be launched even before the first one finishes.
>>
>> On Fri, Feb 12, 2016 at 3:46 PM, Arkal Arjun Rao wrote:
>>
>>> Hi All,
>>>
>>> I'm new to Mesos and I'm working on a framework that strongly considers the disk value in an offer before making a decision. My jobs don't run in the agent's sandbox and may use docker to pull images from my dockerhub and run containers on input data downloaded from S3.
>>>
>>> My jobs clean up after themselves but do not delete the cached docker images after they complete, so a later job can use them directly without the delay of downloading the image again. I cannot predict how much a job will leave behind.
>>>
>>> Leaving behind files after the job means that the disk space available for the next job is less than the disk value the current job had when it started. However, the offer made to the master does not appear to update the disk parameter before making the new offer. Is there any way to get the executor driver to update the value passed in the disk field of resource offers?
>>>
>>> Here's a Stack Overflow post with more details:
>>> http://stackoverflow.com/questions/35354841/setup-mesos-to-provide-up-to-date-disk-in-offers
>>>
>>> Thanks in advance,
>>> Arjun Arkal Rao
>>>
>>> PhD Candidate,
>>> Haussler Lab,
>>> UC Santa Cruz,
>>> USA

--
Arjun Arkal Rao
PhD Student,
Haussler Lab,
UC Santa Cruz,
USA

aa...@ucsc.edu
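The pressure-driven gc Vinod describes can be sketched as follows. Purely illustrative: the real agent is driven by flags such as `--gc_delay` and `--disk_watch_interval`, not this exact logic, and the function signature is made up.

```python
def sandboxes_to_gc(sandboxes, free_bytes, headroom_bytes):
    """Pick the oldest completed sandboxes to delete until the projected
    free space reaches `headroom_bytes`.

    `sandboxes` is a list of (mtime, size_bytes, path) tuples for
    terminated tasks; `free_bytes` is current free space on the work
    dir's filesystem.  Sandboxes are kept (so owners can inspect them)
    until disk pressure forces collection, oldest first.
    """
    doomed = []
    for mtime, size, path in sorted(sandboxes):  # oldest mtime first
        if free_bytes >= headroom_bytes:
            break  # pressure relieved; keep the remaining sandboxes
        doomed.append(path)
        free_bytes += size  # space we would reclaim by deleting it
    return doomed
```

This also explains the behavior Arjun observed: a terminated task's disk is offered back immediately, even though its sandbox may survive on disk until pressure triggers collection.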
Re: Updated agent resources with every offer.
That can be modified with the right values for gc_delay.

I'm running a very basic test where I accept a request, write a file to the sandbox, sleep for 100s, then exit. After exit, I probe the next offer.

Having not specified any value for disk_watch_interval and assuming it is the default 60s, the new offer should have disk = (original value - size of file I wrote to sandbox), right? Am I missing something here?

Arjun

On Fri, Feb 12, 2016 at 5:05 PM, Chong Chen wrote:

> Hi,
>
> I think the garbage collector of the Mesos agent will remove the directory of the finished task.
>
> Thanks!
>
> *From:* Arkal Arjun Rao [mailto:aa...@ucsc.edu]
> *Sent:* Friday, February 12, 2016 4:22 PM
> *To:* user@mesos.apache.org
> *Subject:* Re: Updated agent resources with every offer.
>
> Hi Vinod,
>
> Thanks for the reply. I think I understand what you mean. Could you clarify these follow-up questions?
>
> 1. So if I did write to the sandbox, mesos would know and send the correct offer?
>
> 2. And if so, and this might be hacky: if I bind-mounted my docker folder (where all cached images are stored) into a sandbox directory, do you think Mesos would register the correct state of the disk in the offer? (Suppose I were to spawn a possibly persistent job that requests 0 cores, 0 memory and 0gb disk, and use its sandbox.)
>
> Thanks again,
> Arjun
>
> On Fri, Feb 12, 2016 at 4:08 PM, Vinod Kone wrote:
>
>> If your job is writing stuff outside the sandbox, it is up to your framework to do that resource accounting. It is really tricky for Mesos to do that. For example, the second job might be launched even before the first one finishes.
>
> On Fri, Feb 12, 2016 at 3:46 PM, Arkal Arjun Rao wrote:
>
>> Hi All,
>>
>> I'm new to Mesos and I'm working on a framework that strongly considers the disk value in an offer before making a decision. My jobs don't run in the agent's sandbox and may use docker to pull images from my dockerhub and run containers on input data downloaded from S3.
>>
>> My jobs clean up after themselves but do not delete the cached docker images after they complete, so a later job can use them directly without the delay of downloading the image again. I cannot predict how much a job will leave behind.
>>
>> Leaving behind files after the job means that the disk space available for the next job is less than the disk value the current job had when it started. However, the offer made to the master does not appear to update the disk parameter before making the new offer. Is there any way to get the executor driver to update the value passed in the disk field of resource offers?
>>
>> Here's a Stack Overflow post with more details:
>> http://stackoverflow.com/questions/35354841/setup-mesos-to-provide-up-to-date-disk-in-offers
>>
>> Thanks in advance,
>> Arjun Arkal Rao
>>
>> PhD Candidate,
>> Haussler Lab,
>> UC Santa Cruz,
>> USA

--
Arjun Arkal Rao
PhD Student,
Haussler Lab,
UC Santa Cruz,
USA

aa...@ucsc.edu
Re: Updated agent resources with every offer.
If your job is writing stuff outside the sandbox, it is up to your framework to do that resource accounting. It is really tricky for Mesos to do that. For example, the second job might be launched even before the first one finishes.

On Fri, Feb 12, 2016 at 3:46 PM, Arkal Arjun Rao wrote:

> Hi All,
>
> I'm new to Mesos and I'm working on a framework that strongly considers the disk value in an offer before making a decision. My jobs don't run in the agent's sandbox and may use docker to pull images from my dockerhub and run containers on input data downloaded from S3.
>
> My jobs clean up after themselves but do not delete the cached docker images after they complete, so a later job can use them directly without the delay of downloading the image again. I cannot predict how much a job will leave behind.
>
> Leaving behind files after the job means that the disk space available for the next job is less than the disk value the current job had when it started. However, the offer made to the master does not appear to update the disk parameter before making the new offer. Is there any way to get the executor driver to update the value passed in the disk field of resource offers?
>
> Here's a Stack Overflow post with more details:
> http://stackoverflow.com/questions/35354841/setup-mesos-to-provide-up-to-date-disk-in-offers
>
> Thanks in advance,
> Arjun Arkal Rao
>
> PhD Candidate,
> Haussler Lab,
> UC Santa Cruz,
> USA
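The framework-side accounting Vinod suggests could look roughly like this: keep a per-agent ledger of bytes known to be left behind (e.g. cached docker images) and subtract it from the disk in each offer before scheduling. The class and method names are hypothetical, not a Mesos API.

```python
class DiskLedger:
    """Framework-side accounting for disk a task leaves behind outside
    its sandbox (e.g. cached docker images).

    Mesos cannot track writes outside the sandbox, so the framework
    records them itself.  Illustrative sketch only; names are made up.
    """

    def __init__(self):
        self._leftover_mb = {}  # agent_id -> MB known to be consumed

    def task_finished(self, agent_id, bytes_left_behind):
        """Record what a finished task left on the agent's disk."""
        mb = bytes_left_behind / 1024 ** 2
        self._leftover_mb[agent_id] = self._leftover_mb.get(agent_id, 0) + mb

    def usable_disk(self, agent_id, offered_mb):
        """Disk the framework should actually plan on from an offer,
        after discounting known leftovers on that agent."""
        return max(0, offered_mb - self._leftover_mb.get(agent_id, 0))
```

One caveat from the thread applies here too: because a second task can launch before the first finishes, the ledger is only an estimate until every outstanding task on the agent has reported in.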