Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Kamil Chmielewski
SIGKILL can't be caught.

2016-02-12 11:29 GMT+01:00 haosdent :

> >Is there a specific reason why the slave does not first send a TERM
> signal, and if that does not help after a certain timeout, send a KILL
> signal?
> >That would give us a chance to cleanup consul registrations (and other
> cleanup).
> I think that flow might be more complex. How about registering a KILL signal
> listener to clean up the consul registration?


memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Harry Metske
Hi,

we have a Mesos (0.27) cluster running with (here relevant) slave options:
--cgroups_enable_cfs=true
--cgroups_limit_swap=true
--isolation=cgroups/cpu,cgroups/mem

What we see happening is that people run tasks (Java applications) and
specify a memory resource limit that is too low, which causes these tasks
to be terminated, see the logs below.
That's all fine, after all you should specify reasonable memory limits.
It looks like the slave sends a KILL signal when the limit is reached, so
the application has no chance to do recovery termination, which (in our
case) results in consul registrations not being cleaned up.
Is there a specific reason why the slave does not first send a TERM signal,
and if that does not help after a certain timeout, send a KILL signal?
That would give us a chance to clean up consul registrations (and other
cleanup).

kind regards,
Harry


I0212 09:27:49.238371 11062 containerizer.cpp:1460] Container
bed2585a-c361-4c66-afd9-69e70e748ae2 has reached its limit for resource
mem(*):160 and will be terminated

I0212 09:27:49.238418 11062 containerizer.cpp:1227] Destroying container
'bed2585a-c361-4c66-afd9-69e70e748ae2'

I0212 09:27:49.240932 11062 cgroups.cpp:2427] Freezing cgroup
/sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2

I0212 09:27:49.345171 11062 cgroups.cpp:1409] Successfully froze cgroup
/sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2 after
104.21376ms

I0212 09:27:49.347303 11062 cgroups.cpp:2445] Thawing cgroup
/sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2

I0212 09:27:49.349453 11062 cgroups.cpp:1438] Successfullly thawed cgroup
/sys/fs/cgroup/freezer/mesos/bed2585a-c361-4c66-afd9-69e70e748ae2 after
2.123008ms

I0212 09:27:49.359627 11062 slave.cpp:3481] executor(1)@10.239.204.142:43950
exited

I0212 09:27:49.381942 11062 containerizer.cpp:1443] Executor for container
'bed2585a-c361-4c66-afd9-69e70e748ae2' has exited

I0212 09:27:49.389766 11062 provisioner.cpp:306] Ignoring destroy request
for unknown container bed2585a-c361-4c66-afd9-69e70e748ae2

I0212 09:27:49.389853 11062 slave.cpp:3816] Executor
'fulltest02.6cd29bd8-d162-11e5-a4df-005056aa67df' of framework
7baec9af-018f-4a4c-822a-117d61187471-0001 terminated with signal Killed


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Shuai Lin
I'm not familiar with why SIGKILL is sent directly without SIGTERM, but
would it be possible to have your consul registry cleaned up when a task is
killed by adding consul health checks?
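To sketch what that suggestion could look like: a registration that carries an
HTTP health check, so that Consul itself drops the entry once the task stops
answering, no signal handling required. The service name, port, check URL and
timings below are illustrative assumptions; the JSON fields follow Consul's
agent HTTP API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Register a service together with a health check.  If the task is later
    // SIGKILLed, the check goes critical and Consul deregisters the instance
    // on its own after DeregisterCriticalServiceAfter elapses.
    public final class RegisterWithHealthCheck {
        public static void main(String[] args) throws Exception {
            String body = "{"
                + "\"ID\": \"my-service-1\","
                + "\"Name\": \"my-service\","
                + "\"Port\": 8080,"
                + "\"Check\": {"
                +   "\"HTTP\": \"http://localhost:8080/health\","
                +   "\"Interval\": \"10s\","
                +   "\"DeregisterCriticalServiceAfter\": \"1m\""
                + "}"
                + "}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8500/v1/agent/service/register"))
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("consul replied: " + response.statusCode());
        }
    }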



Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread haosdent
> Is there a specific reason why the slave does not first send a TERM
> signal, and if that does not help after a certain timeout, send a KILL
> signal?
> That would give us a chance to clean up consul registrations (and other
> cleanup).

I think that flow might be more complex. How about registering a KILL signal
listener to clean up the consul registration?



-- 
Best Regards,
Haosdent Huang
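For context, the usual place for this kind of cleanup in a Java task is a JVM
shutdown hook, which runs on SIGTERM or a normal exit but, as Kamil points out
elsewhere in the thread, never on SIGKILL. That is exactly why the
TERM-before-KILL question matters. A minimal sketch follows; the ConsulClient
below is a stand-in stub, not a real consul library.

    // Clean up an external registration on SIGTERM or normal exit.
    public final class CleanupOnTerm {

        /** Hypothetical placeholder for whatever consul client the app really uses. */
        static final class ConsulClient {
            void register(String service)   { System.out.println("registered " + service); }
            void deregister(String service) { System.out.println("deregistered " + service); }
        }

        public static void main(String[] args) throws Exception {
            ConsulClient consul = new ConsulClient();
            consul.register("my-service");

            // A shutdown hook runs on SIGTERM or a normal exit.  It does NOT run
            // on SIGKILL, which cannot be caught, so if the slave jumps straight
            // to KILL the registration is left behind.
            Runtime.getRuntime().addShutdownHook(
                new Thread(() -> consul.deregister("my-service")));

            Thread.sleep(Long.MAX_VALUE); // placeholder for the real workload
        }
    }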


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread haosdent
> I'm not familiar with why SIGKILL is sent directly without SIGTERM
We send KILL in both posix_launcher and linux_launcher:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/launcher.cpp#L170
https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1566

> SIGKILL can't be caught.
It seems the consul registrations cannot be cleaned up when the task is
killed under the MesosContainerizer. Have you tried the DockerContainerizer?
I think "docker stop" sends TERM first.



-- 
Best Regards,
Haosdent Huang


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Kamil Chmielewski
On Fri, Feb 12, 2016 at 6:12 PM, Harry Metske wrote:

> Is there a specific reason why the slave does not first send a TERM
> signal, and if that does not help after a certain timeout, send a KILL
> signal?
> That would give us a chance to clean up consul registrations (and other
> cleanup).


First of all, it's wrong to want to handle the memory limit in your app.
Things like this are outside of its scope. Your app can be lost because of
many different system or hardware failures that you just can't catch. You
need to let it crash and design your architecture with this in mind.
Secondly, the Mesos SIGKILL is consistent with the Linux OOM killer, and it
does the right thing:
https://github.com/torvalds/linux/blob/4e5448a31d73d0e944b7adb9049438a09bc332cb/mm/oom_kill.c#L586

Best regards,
Kamil


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Harry Metske
We don't want to use Docker (yet) in this environment, so the
DockerContainerizer is not an option.
After thinking about it a bit longer, I tend to agree with Kamil and will
let the problem be handled differently.

Thanks for the amazing fast responses!

kind regards,
Harry




Re: Managing Persistency via Frameworks (HDFS, Cassandra)

2016-02-12 Thread Andreas Fritzler
Hi Tommy,

thanks a lot for sharing. And yes, that is what I figured. For PoC/Testing
environments the frameworks work just fine.

-- Andreas

On Tue, Feb 9, 2016 at 1:01 PM, tommy xiao  wrote:

> Hi Andreas,
>
> In general I have recommended that my customers build an HDFS resource pool
> outside the Mesos cluster. But in a development or staging environment,
> using Mesos to manage your HDFS cluster is ideal. Once the Mesos community
> provides more production cases, we can then upgrade the development cluster
> to a production cluster easily.
>
>
> 2016-02-09 14:50 GMT+08:00 Andreas Fritzler :
>
>> Hi Klaus,
>>
>> thanks for your reply. I am aware of the frameworks provided by
>> mesosphere and I already tried them out in a POC setup. From looking at the
>> HDFS documentation [1] however, the framework seems to be still in beta.
>>
>> "HDFS is available at the beta level and not recommended for Mesosphere
>> DCOS production systems."
>>
>> I think what my questions are boiling down to is the following: should I
>> use a Mesos framework to manage persistency within my Mesos cluster or
>> should I do it outside with other means - e.g. using Ambari to setup a
>> shared HDFS etc.
>>
>> If I would use those frameworks, how is your experience regarding the
>> life cycle management? Scaling out instances, upgrading to newer versions
>> etc.
>>
>> Regards,
>> Andreas
>>
>> [1] https://docs.mesosphere.com/manage-service/hdfs/
>>
>> On Tue, Feb 9, 2016 at 1:05 AM, Klaus Ma  wrote:
>>
>>> Hi Andreas,
>>>
>>> I think Mesosphere has done some work on your questions, would you check
>>> related repos at https://github.com/mesosphere ?
>>>
>>>
>>> On Mon, Feb 8, 2016 at 9:43 PM Andreas Fritzler <
>>> andreas.fritz...@gmail.com> wrote:
>>>
 Hi,

 I have a couple of questions around the persistency topic within a
 Mesos cluster:

 1. Any takes on the quality of the HDFS [1] and the Cassandra [2]
 frameworks? Does anybody have any experiences in running those frameworks
 in production?

 2. How well are those frameworks performing if I want to use them to
 separate tenants on one Mesos cluster? (HDFS is not dockerized yet?)

 3. How about scaling out/down existing framework instances? Is that
 even possible? Couldn't find anything in the docs/github.

 4. Upgrading a running instance: wondering how that is managed in those
 frameworks. There is an open issue for the HDFS [3] part. For cassandra the
 scheduler update seems to be smooth, however changing the underlying
 Cassandra version seems to be tricky [4].

 Regards,
 Andreas

 [1] https://github.com/mesosphere/hdfs
 [2] https://github.com/mesosphere/cassandra-mesos
 [3] https://github.com/mesosphere/hdfs/issues/23
 [4] https://github.com/mesosphere/cassandra-mesos/issues/137

>>> --
>>>
>>> Regards,
>>> 
>>> Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
>>> IBM Platform Development & Support, STG, IBM GCG
>>> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>>>
>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread David J. Palaitis
In larger deployments, with many applications, you may not always be able
to expect good memory practices from app developers. We've found that
reporting *why* a job was killed, with details of container utilization, is
an effective way of helping app developers get better at memory management.

The alternative, just having jobs die, incentivizes bad behavior. For
example, a hurried job owner may just double the memory of the executor,
trading slack for stability.



Re: Docker Containerizer: custom name possible?

2016-02-12 Thread haosdent
> is there some way to inject an
> externally defined string into the container name *before* Mesos
> launches the container, so that ever after for the life of that
> container the container name contains that string?

So far there is no way to inject a custom string into the name of a Docker
container launched by Mesos.





-- 
Best Regards,
Haosdent Huang


Re: Docker Containerizer: custom name possible?

2016-02-12 Thread Edward Burns
> On Thu, 11 Feb 2016 19:09:29 +0800, tommy xiao  said:

TX> if you have more concerns on the request, please file an issue for
TX> discussion.

> On Thu, 11 Feb 2016 19:19:39 +0800, haosdent  said:

HD> If you want inject inside container, the name stored in
HD> MESOS_CONTAINER_NAME. If you want inject outside, you could get it
HD> by /state endpoint. The container name is combined by
HD> DOCKER_NAME_PREFIX + slaveId + DOCKER_NAME_SEPERATOR + containerId.

Thanks for your responses.  Haosdent, is there some way to inject an
externally defined string into the container name *before* Mesos
launches the container, so that ever after for the life of that
container the container name contains that string?

I appreciate your tolerance of my newbieness.

Thanks,

Ed
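For the in-container half of the quoted answer, a tiny sketch of picking the
generated name up from the environment, assuming (as haosdent describes above)
that the Docker containerizer exposes it as MESOS_CONTAINER_NAME:

    public final class PrintContainerName {
        public static void main(String[] args) {
            // Per the quoted explanation, the name is DOCKER_NAME_PREFIX + slaveId
            // + DOCKER_NAME_SEPERATOR + containerId, and is made available to the
            // task through this environment variable.
            String name = System.getenv("MESOS_CONTAINER_NAME");
            System.out.println(name != null ? name : "MESOS_CONTAINER_NAME not set");
        }
    }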


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Vinod Kone
+1 to what Kamil said. That is exactly the reason why we designed it that way.

Also, the why is included in the status update message. 

@vinodkone



Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Harry Metske
David,

that's exactly the scenario I am afraid of: developers specifying way too
large memory requirements just to make sure their tasks don't get killed.
Any suggestions on how to report this *why* to the developers? As far as I
know, the only place where you find the reason is in the logfile of the
slave; the UI only tells you that the task failed, not the reason.

(We could put some logfile monitoring in place to pick up these messages of
course, but if there are better ways, we are always interested.)

kind regards,
Harry


On 12 February 2016 at 15:08, David J. Palaitis 
wrote:

> In larger deployments, with many applications, you may not always be able
> to ask good memory practices from app developers. We've found that
> reporting *why* a job was killed, with details of container utilization, is
> an effective way of helping app developers get better at mem mgmt.
>
> The alternative, just having jobs die, incentives bad behaviors. For
> example, a hurried job owner may just double memory of the executor,
> trading slack for stability.
>
> On Fri, Feb 12, 2016 at 6:36 AM Harry Metske 
> wrote:
>
>> We don't want to use Docker (yet) in this environment, so DockerContainerizer
>> is not an option.
>> After thinking a bit longer, I tend to agree with Kamil and let the
>> problem be handled differently.
>>
>> Thanks for the amazing fast responses!
>>
>> kind regards,
>> Harry
>>
>>
>> On 12 February 2016 at 12:28, Kamil Chmielewski 
>> wrote:
>>
>>> On Fri, Feb 12, 2016 at 6:12 PM, Harry Metske 
>> wrote:
>>
>>>
>>> Is there a specific reason why the slave does not first send a TERM
>>> signal, and if that does not help after a certain timeout, send a KILL
>>> signal?
>>> That would give us a chance to cleanup consul registrations (and
>>> other cleanup).
>>>
>>>
>>> First of all it's wrong that you want to handle memory limit in your
>>> app. Things like this are outside of its scope. Your app can be lost
>>> because many different system or hardware failures that you just can't
>>> caught. You need to let it crash and design your architecture with this in
>>> mind.
>>> Secondly Mesos SIGKILL is consistent with linux OOM killer and it do the
>>> right thing
>>> https://github.com/torvalds/linux/blob/4e5448a31d73d0e944b7adb9049438a09bc332cb/mm/oom_kill.c#L586
>>>
>>> Best regards,
>>> Kamil
>>>
>>
>>


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Erik Weathers
hey Harry,

As Vinod said, the mesos-slave/agent will issue a status update about the
OOM condition.  This will be received by the scheduler of the framework.
In the storm-mesos framework we just log the messages (see below), but you
might consider somehow exposing these messages directly to the app owners:

Received status update:
{"task_id":"TASK_ID","slave_id":"20150806-001422-1801655306-5050-22041-S65","state":"TASK_FAILED","message":"Memory
limit exceeded: Requested: 2200MB Maximum Used: 2200MB\n\nMEMORY
STATISTICS: \ncache 20480\nrss 1811943424\nmapped_file 0\npgpgin
8777434\npgpgout 8805691\nswap 96878592\ninactive_anon
644186112\nactive_anon 1357594624\ninactive_file 20480\nactive_file
0\nunevictable 0\nhierarchical_memory_limit
2306867200\nhierarchical_memsw_limit 9223372036854775807\ntotal_cache
20480\ntotal_rss 1811943424\ntotal_mapped_file 0\ntotal_pgpgin
8777434\ntotal_pgpgout 8805691\ntotal_swap 96878592\ntotal_inactive_anon
644186112\ntotal_active_anon 1355497472\ntotal_inactive_file
20480\ntotal_active_file 0\ntotal_unevictable 0"}

- Erik
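A rough sketch of surfacing that from a Java scheduler, assuming the standard
org.apache.mesos Java bindings; the notifyOwner call and the match on the
message prefix are illustrative placeholders, not a recommended interface.

    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;

    // Call this from your Scheduler implementation's
    // statusUpdate(SchedulerDriver, TaskStatus) callback.
    public final class OomReporter {

        public void onStatusUpdate(TaskStatus status) {
            // The OOM kill arrives as TASK_FAILED with the "Memory limit exceeded"
            // message shown above; matching on the prefix is crude but illustrates
            // routing the reason to the task owner instead of a slave log file.
            if (status.getState() == TaskState.TASK_FAILED
                    && status.hasMessage()
                    && status.getMessage().startsWith("Memory limit exceeded")) {
                notifyOwner(status.getTaskId().getValue(), status.getMessage());
            }
        }

        private void notifyOwner(String taskId, String details) {
            // Placeholder: push this to mail, chat, a dashboard, etc.
            System.err.println("Task " + taskId + " was OOM-killed:\n" + details);
        }
    }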



Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread Kamil Chmielewski
>

Marathon also presents this information. Developers will see it on the
Debug tab in the Last Task Failure section.

Best Regards,
Kamil


Re: memory limit exceeded ==> KILL instead of TERM (first)

2016-02-12 Thread David J. Palaitis
>> we could put some logfile monitoring in place picking up these messages
of course

that's about what we came up with.

>> the mesos-slave/agent will issue a status update about the OOM
condition.

ok, definitely missed that one - this will help a lot. thanks @vinod


On Fri, Feb 12, 2016 at 2:41 PM, Harry Metske 
wrote:

> Yup, I just noticed it's there :-)
>
> tx,
> Harry
>
>


Precision of scalar resources

2016-02-12 Thread Neil Conway
tl;dr:

If you use resource values with more than three decimal digits of
precision (e.g., you are launching a task that uses 2.5001 CPUs),
please speak up!



Mesos uses floating point to represent scalar resource values, such as
the number of CPUs in a resource offer or dynamic reservation. The
master does resource math in floating point, which leads to a few
problems:

* due to roundoff error, frameworks can receive offers that have
unexpected resource values (e.g., MESOS-3990)
* various internal assertions in the master can fail due to roundoff
error (e.g., MESOS-3552).

In the long term, we can solve these problems by switching to a
fixed-point representation for scalar values. However, that will
require a long deprecation cycle.

In the short term, we should make floating point behavior more
reliable. To do that, I propose:

(1) Resource values will support AT MOST three decimal digits of
precision. Additional precision in resource values will be discarded
(via rounding).

(2) The master will internally use a fixed-point representation to
avoid unpredictable roundoff behavior.
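Not Mesos code, just a small illustration of the two points: naive double
arithmetic drifts, while scaling values to integral milli-units (three decimal
digits) keeps the math exact and discards extra precision by rounding.

    public final class ScalarPrecision {
        // Round a scalar resource value to at most three decimal digits by
        // converting it to integral "milli-units", the fixed-point form above.
        static long toMillis(double value) {
            return Math.round(value * 1000.0);
        }

        public static void main(String[] args) {
            double offered = 0.1 + 0.2;                   // 0.30000000000000004
            System.out.println(offered == 0.3);           // false: roundoff error

            long a = toMillis(0.1);                       // 100
            long b = toMillis(0.2);                       // 200
            System.out.println(a + b == toMillis(0.3));   // true: exact in fixed point

            System.out.println(toMillis(2.5001));         // 2500: extra digits dropped
        }
    }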

For more details, please see the design doc here:
https://docs.google.com/document/d/14qLxjZsfIpfynbx0USLJR0GELSq8hdZJUWw6kaY_DXc
-- comments welcome!

Thanks,
Neil


Re: Updated agent resources with every offer.

2016-02-12 Thread Vinod Kone
Say your task asks for 1 CPU and 1GB of disk. After the task terminates,
Mesos immediately offers back the 1 CPU and 1GB of disk. That makes sense
for CPU, but not so much for disk.

The Mesos slave overcommits the disk in that sense, mainly to allow task
owners access to sandbox data after task termination. The asynchronous gc
thread garbage collects the sandbox if there is disk space pressure on the
host.


@vinodkone



Re: Updated agent resources with every offer.

2016-02-12 Thread Arkal Arjun Rao
That can be modified with the right values for gc_delay.

I'm running a very basic test where I accept a request, write a file to the
sandbox, sleep for 100s, then exit. After the exit, I probe the next offer.

Having not specified any value for disk_watch_interval and assuming it is
the default 60s, the new offer should have disk = (original value - size of
the file I wrote to the sandbox), right? Am I missing something here?

Arjun

On Fri, Feb 12, 2016 at 5:05 PM, Chong Chen  wrote:

> Hi,
>
> I think the garbage collector of Mesos agent will remove the directory of
> the finished task.
>
> Thanks!
>
>
>
> *From:* Arkal Arjun Rao [mailto:aa...@ucsc.edu]
> *Sent:* Friday, February 12, 2016 4:22 PM
> *To:* user@mesos.apache.org
> *Subject:* Re: Updated agent resources with every offer.
>
>
>
> Hi Vinod,
>
>
>
> Thanks for the reply. I think I understand what you mean. Could you
> clarify these follow-up questions?
>
>
>
> 1. So if I did write to the sandbox, mesos would know and send the correct
> offer?
>
> 2. And if so, and this might be hacky, if i bind mounted my docker folder
> (where all cached images are stored) into a sandbox directory, do you think
> Mesos will register the correct state of the disk in the offer? (Suppose I
> were to spawn a possibly persistent job that requests 0 cores, 0 memory and
> 0gb and use it's sandbox)
>
>
>
> Thanks again,
>
> Arjun
>
>
>
>



-- 
Arjun Arkal Rao

PhD Student,
Haussler Lab,
UC Santa Cruz,
USA

aa...@ucsc.edu


Re: Updated agent resources with every offer.

2016-02-12 Thread Vinod Kone
If your job is writing stuff outside the sandbox it is up to your framework
to do that resource accounting. It is really tricky for Mesos to do that.
For example, the second job might be launched even before the first one
finishes.
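A sketch of what that framework-side accounting could look like for the
image-cache case in this thread: measure what earlier jobs left behind and
subtract it from the disk the offer advertises before scheduling. The cache
path is an assumption for illustration, and disk values are treated in MB to
match how the offer expresses them.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public final class DiskAccounting {

        /** Bytes currently used under a directory that jobs write to outside the sandbox. */
        static long bytesUsed(Path dir) throws IOException {
            if (!Files.exists(dir)) {
                return 0L;
            }
            try (Stream<Path> files = Files.walk(dir)) {
                return files.filter(Files::isRegularFile)
                            .mapToLong(p -> p.toFile().length())
                            .sum();
            }
        }

        /** Disk the framework should treat as actually free on this agent. */
        static double usableDiskMb(double offeredDiskMb, Path cacheDir) throws IOException {
            double cachedMb = bytesUsed(cacheDir) / (1024.0 * 1024.0);
            return Math.max(0.0, offeredDiskMb - cachedMb);
        }

        public static void main(String[] args) throws IOException {
            Path cache = Paths.get("/var/lib/docker");  // hypothetical cache location
            System.out.println(usableDiskMb(10240.0, cache) + " MB usable of 10240 MB offered");
        }
    }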

On Fri, Feb 12, 2016 at 3:46 PM, Arkal Arjun Rao  wrote:

> Hi All,
>
> I'm new to Mesos and I'm working on a  framework that strongly considers
> the disk value in an offer before making a decision. My jobs don't run in
> the agent's sandbox and may use docker to pull images from my dockerhub and
> run containers on input data downloaded from S3.
>
> My jobs clean up after themselves but do not delete the cached docker
> images after they complete so a later job can use them directly without the
> delay of downloading the image again. I cannot predict how much a job will
> leave behind.
>
> Leaving behind files after the job means that the disk space available for
> the next job is less than the disk value the current job had when it
> started. However the offer made to the master does not appear to update the
> disk parameter before making the new offer. Is there any way to get the
> executor driver to update the value passed in the disk field of resource
> offers?
>
> Here's a Stack overflow with more details
> http://stackoverflow.com/questions/35354841/setup-mesos-to-provide-up-to-date-disk-in-offers
>
> Thanks in advance,
> Arjun Arkal Rao
>
> PhD Candidate,
> Haussler Lab,
> UC Santa Cruz,
> USA
>
>