[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954102#comment-14954102 ]

Jie Yu commented on MESOS-2035:
-------------------------------

commit 3c96155a4618000a0896bd42f7ca1e2a363b48fd
Author: Jie Yu
Date:   Thu Sep 24 18:42:34 2015 -0700

    Added TaskStatus::Reason to containerizer Termination message.

    Review: https://reviews.apache.org/r/38746

> Add reason to containerizer proto Termination
> ---------------------------------------------
>
>                 Key: MESOS-2035
>                 URL: https://issues.apache.org/jira/browse/MESOS-2035
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>    Affects Versions: 0.21.0
>            Reporter: Dominic Hamon
>            Assignee: Jie Yu
>              Labels: twitter
>             Fix For: 0.26.0
>
>
> When an isolator kills a task, the reason is unknown. As part of MESOS-1830,
> the reason is set to a general one, but ideally the termination reason
> would pass through to the status update.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908917#comment-14908917 ]

Jie Yu commented on MESOS-2035:
-------------------------------

I summarized the current semantics and proposed new semantics in the following doc:
https://docs.google.com/document/d/1klGDAu5yBVf-CGWLqvELLIfxLfRaisGkhi6Gn7952-4/edit?usp=sharing

>            Assignee: Jie Yu
>              Labels: mesosphere
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876421#comment-14876421 ]

Vinod Kone commented on MESOS-2035:
-----------------------------------

[~js84] ping! Are you working on this? If not, I would like someone else to take over.

>            Assignee: Joerg Schad
>              Labels: mesosphere
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743985#comment-14743985 ]

Vinod Kone commented on MESOS-2035:
-----------------------------------

We have been seeing this in production since we enabled disk isolation. Let's get this fixed ASAP.

cc [~jieyu]

>            Assignee: Joerg Schad
>              Labels: mesosphere
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631199#comment-14631199 ]

Mike Michel commented on MESOS-2035:
------------------------------------

It would be very helpful if the information were written to the sandbox too. The stderr log is already used to write Mesos info to the sandbox when a container is started:

I0717 12:30:01.219111 54012 exec.cpp:132] Version: 0.22.1
Starting task mike-website_frontend_apache.bd016cba-2c6e-11e5-bc1c-02016fccc167
I0717 12:30:01.226969 54028 exec.cpp:206] Executor registered on slave 20150626-195146-1694738624-5050-2

Is it possible to extend this with the information for a failed start? The slave already has this info in its own log file:

failed to start: Failed to 'docker pull mikemichel/notexist': exit status = exited with status 1 stderr = time=2015-07-15T01:48:57+02:00 level=fatal msg=Error pulling image (latest) from mikemichel/notexist, HTTP code 400

This way the info would be available in the Mesos UI.

>            Assignee: Joerg Schad
>              Labels: mesosphere
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603143#comment-14603143 ]

Joerg Schad commented on MESOS-2035:
------------------------------------

https://reviews.apache.org/r/35927/

>            Assignee: Joerg Schad
>              Labels: mesosphere
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601841#comment-14601841 ]

Joerg Schad commented on MESOS-2035:
------------------------------------

Review Chain
Contributor: Joerg Schad
Reviewer: AlexR
Contributor: Jie Yu

>            Assignee: Joerg Schad
>            Priority: Critical
>              Labels: mesosphere
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598060#comment-14598060 ]

Niklas Quarfot Nielsen commented on MESOS-2035:
-----------------------------------------------

[~js84] Do you still want to be on this ticket?

>            Assignee: Joerg Schad
>            Priority: Critical
>              Labels: mesosphere
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582561#comment-14582561 ]

Niklas Quarfot Nielsen commented on MESOS-2035:
-----------------------------------------------

[~js84] How far did you get with this? We need this for the QoS Controller implementation in the slave :)

>            Assignee: Joerg Schad
>            Priority: Critical
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579680#comment-14579680 ]

Jie Yu commented on MESOS-2035:
-------------------------------

[~nnielsen] Can you help with the implementation? You may also want to sync with [~js84] since this ticket is currently assigned to him. I can shepherd this and do reviews.

>            Assignee: Joerg Schad
>            Priority: Critical
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579658#comment-14579658 ]

Niklas Quarfot Nielsen commented on MESOS-2035:
-----------------------------------------------

[~vinodkone] You are listed as shepherd on this; what is the current state of this effort?

[~jieyu] SGTM - I can help (implementation, reviews) if we go this route.

>            Assignee: Joerg Schad
>            Priority: Critical
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579844#comment-14579844 ]

Timothy Chen commented on MESOS-2035:
-------------------------------------

Hi Jie, the approach of adding the field SGTM. There are going to be several different reasons why a containerizer failed to launch (REASON_DOCKER_PULL_FAILED, REASON_FETCH_FAILED), but I think the message part can also help add more details.

>            Assignee: Joerg Schad
>            Priority: Critical
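
The pairing described in the comment above, a coarse reason plus a free-form message, might look roughly like the following C++ sketch. Everything here is hypothetical: `LaunchFailureReason`, `LaunchFailure`, `reasonName`, and `describe` are illustrative names, and the two reason values are taken from the comment, not from the Mesos headers.

```cpp
#include <string>

// Hypothetical launch-failure reasons, named after the examples in the
// comment above. They are not claimed to exist in Mesos under these names.
enum class LaunchFailureReason {
  DOCKER_PULL_FAILED,
  FETCH_FAILED,
};

// Hypothetical stand-in for the reason and message carried by a termination.
struct LaunchFailure {
  LaunchFailureReason reason;  // Coarse, machine-readable category.
  std::string message;         // Free-form detail, e.g. the command's stderr.
};

// Returns the textual name of a reason.
inline std::string reasonName(LaunchFailureReason reason) {
  switch (reason) {
    case LaunchFailureReason::DOCKER_PULL_FAILED:
      return "REASON_DOCKER_PULL_FAILED";
    case LaunchFailureReason::FETCH_FAILED:
      return "REASON_FETCH_FAILED";
  }
  return "REASON_UNKNOWN";
}

// Combines the reason name with the message; the message is appended only
// when it is non-empty.
inline std::string describe(const LaunchFailure& failure) {
  const std::string name = reasonName(failure.reason);
  return failure.message.empty() ? name : name + ": " + failure.message;
}
```

A scheduler could branch on the reason alone, while an operator reading the status update would still see the detail carried in the message.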
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578025#comment-14578025 ]

Jie Yu commented on MESOS-2035:
-------------------------------

This problem popped up again while we were implementing oversubscription. See MESOS-2653 and https://reviews.apache.org/r/34720/ for details.

Here is my proposal for solving this issue:

1) We add a TaskStatus::Reason field to the containerizer::Termination protobuf (and deprecate the 'killed' field).

2) In the slave's per-executor data structure (struct Executor), we maintain an optional 'reason' field. When the slave destroys a container (e.g., due to registration timeout, failure to set resource limits, failure to launch the container, a QoS controller kill, etc.), it saves the 'reason' field in struct Executor.

3) The containerizer is responsible for setting the 'reason' field inside containerizer::Termination (e.g., REASON_MEMORY_LIMIT, REASON_DISK_LIMIT, etc.).

4) In sendExecutorTerminatedStatusUpdate, we look at both reasons (one from the slave's executor data structure and one from the Termination protobuf). The current proposal is to prefer the reason from the Termination protobuf. In the future, when we allow multiple reasons to be sent (MESOS-2657), we can send both to the scheduler.

>            Assignee: Joerg Schad
>            Priority: Minor
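
The four steps of the proposal above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the Mesos implementation: `Reason`, `Termination`, `Executor`, and `selectReason` are simplified hypothetical stand-ins for `TaskStatus::Reason`, the `containerizer::Termination` protobuf, the slave's `struct Executor`, and the selection logic in `sendExecutorTerminatedStatusUpdate`. The `has_reason` flags imitate optional protobuf fields, and the general fallback value is an assumption.

```cpp
// Hypothetical subset of reasons; the names follow the examples in the
// proposal, not the actual Mesos enum.
enum class Reason {
  EXECUTOR_TERMINATED,   // Assumed general fallback when nothing more specific is known.
  REGISTRATION_TIMEOUT,  // Example of a reason recorded by the slave.
  MEMORY_LIMIT,          // Examples of reasons set by the containerizer.
  DISK_LIMIT,
};

// Steps 1 and 3: the termination message carries an optional reason that
// the containerizer fills in.
struct Termination {
  bool has_reason = false;
  Reason reason = Reason::EXECUTOR_TERMINATED;
};

// Step 2: the slave's per-executor state keeps an optional reason, saved
// when the slave itself destroys the container.
struct Executor {
  bool has_reason = false;
  Reason reason = Reason::EXECUTOR_TERMINATED;
};

// Step 4: prefer the reason from the termination message, then the one
// saved by the slave, and fall back to the general reason otherwise.
inline Reason selectReason(const Executor& executor,
                           const Termination& termination) {
  if (termination.has_reason) {
    return termination.reason;
  }
  if (executor.has_reason) {
    return executor.reason;
  }
  return Reason::EXECUTOR_TERMINATED;
}
```

With this ordering, a container destroyed by an isolator for exceeding its disk limit would report the disk-limit reason even if the slave had also recorded a reason of its own for the same executor.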
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578027#comment-14578027 ]

Jie Yu commented on MESOS-2035:
-------------------------------

cc [~tnachen] [~idownes]

>            Assignee: Joerg Schad
>            Priority: Critical
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533682#comment-14533682 ]

Jay Buffington commented on MESOS-2035:
---------------------------------------

Part of this work should also be to remove the Termination.killed field, since Reason is a more useful and more generic version of what that field was intended to accomplish.

>            Priority: Minor
[jira] [Commented] (MESOS-2035) Add reason to containerizer proto Termination
    [ https://issues.apache.org/jira/browse/MESOS-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532750#comment-14532750 ]

Jay Buffington commented on MESOS-2035:
---------------------------------------

Review for fix is at https://reviews.apache.org/r/33249/

>            Assignee: Jay Buffington
>            Priority: Minor