[jira] [Commented] (MESOS-1416) mesos-0.19.0 build directory is read-only
[ https://issues.apache.org/jira/browse/MESOS-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163405#comment-14163405 ] Da Ma commented on MESOS-1416:

Hi team, would you share the steps to reproduce this issue? I'm new to Mesos :). Thanks, Da Ma

mesos-0.19.0 build directory is read-only
Key: MESOS-1416 URL: https://issues.apache.org/jira/browse/MESOS-1416 Project: Mesos Issue Type: Bug Components: build Environment: Ubuntu 13.10 Reporter: Vinson Lee Priority: Blocker

The build creates a read-only mesos-0.19.0 directory. This blocks Jenkins builds because the workspace cannot be automatically cleaned by the git plugin.

{noformat}
[...]
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/gzip.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/fatal.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/linkedhashmap.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/protobuf.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/foreach.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/memory.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/hashset.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/format.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/error.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/uuid.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/net.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/numify.hpp
warning: failed to remove mesos-0.19.0/3rdparty/libprocess/3rdparty/stout/include/stout/flags/flags.hpp
[...]
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
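A workaround on the cleanup side is to restore write permission before wiping the workspace. The sketch below is a hypothetical helper (`make_writable` is not part of Mesos or the git plugin); it reproduces the symptom on a throwaway tree and then removes it:

```python
import os
import shutil
import stat
import tempfile

def make_writable(root):
    """Restore the user-write bit on every entry under `root` so the
    tree can be removed by a cleanup job (hypothetical helper)."""
    os.chmod(root, os.stat(root).st_mode | stat.S_IWUSR)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            os.chmod(path, os.stat(path).st_mode | stat.S_IWUSR)

# Reproduce the symptom: a read-only directory blocks removal of its contents.
root = tempfile.mkdtemp()
sub = os.path.join(root, "include")
os.mkdir(sub)
open(os.path.join(sub, "gzip.hpp"), "w").close()
os.chmod(sub, 0o555)   # read-only directory, like the generated dist tree

make_writable(root)
shutil.rmtree(root)    # succeeds once the write bits are restored
```

The shell equivalent would be `chmod -R u+w mesos-0.19.0` before the clean step.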
[jira] [Updated] (MESOS-1871) Sending SIGTERM to a task command may render it orphaned
[ https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-1871:

Description: {{CommandExecutor}} launches tasks wrapping them in {{sh -c}}. That means signals are sent to the top process, that is {{sh -c}}, and not to the task directly. Though {{SIGTERM}} is propagated by {{sh -c}} down the process tree, if the task is unresponsive to {{SIGTERM}}, {{sh -c}} terminates reporting success to the {{CommandExecutor}}, leaving the task detached from its parent process and still running. Because the {{CommandExecutor}} thinks the command terminated normally, its OS process exits normally and may not trigger the containerizer's escalation, which destroys the cgroups. Here is the test related to this issue: [https://gist.github.com/rukletsov/3f19ecc7389fa51e65c0].

was: the same text, with an additional closing sentence: "As expected, it fails on Linux, but surprisingly, it works on Mac OS 10.9.4."

Sending SIGTERM to a task command may render it orphaned
Key: MESOS-1871 URL: https://issues.apache.org/jira/browse/MESOS-1871 Project: Mesos Issue Type: Bug Components: slave Reporter: Alexander Rukletsov
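The failure mode described above can be reproduced outside Mesos. The following is a minimal Python sketch, not the CommandExecutor's actual code path, assuming a POSIX system with `sh` and `sleep`: the command is wrapped in `sh -c`, the task ignores {{SIGTERM}} via `trap`, and the signal sent to the top pid achieves nothing, so escalation must target the whole process group:

```python
import os
import signal
import subprocess
import time

# Mimic the wrapping: the command runs under `sh -c`, and the "task"
# ignores SIGTERM (trap '' TERM).
proc = subprocess.Popen(
    ["sh", "-c", "trap '' TERM; sleep 30"],
    preexec_fn=os.setsid,   # own session, as the posix launcher does
)
time.sleep(0.5)             # give the shell time to install the trap

os.kill(proc.pid, signal.SIGTERM)        # signal only the top process
time.sleep(0.5)
survived_sigterm = proc.poll() is None   # True: SIGTERM had no effect

# Escalation must signal the whole group, not just the top pid.
os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
proc.wait()
```

After the `killpg`, `proc.returncode` is `-signal.SIGKILL`, confirming the tree only dies when the group is signalled.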
[jira] [Updated] (MESOS-1871) Sending SIGTERM to a task command may render it orphaned
[ https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-1871:

Description: {{CommandExecutor}} launches tasks wrapping them in {{sh -c}}. That means signals are sent to the top process, that is {{sh -c}}, and not to the task directly. Though {{SIGTERM}} is propagated by {{sh -c}} down the process tree, if the task is unresponsive to {{SIGTERM}}, {{sh -c}} terminates reporting success to the {{CommandExecutor}}, leaving the task detached from its parent process and still running. Because the {{CommandExecutor}} thinks the command terminated normally, its OS process exits normally and may not trigger the containerizer's escalation, which destroys the cgroups. Here is the test related to the first part: [https://gist.github.com/rukletsov/68259dfb02421813f9e6]. Here is the test related to the second part: [https://gist.github.com/rukletsov/3f19ecc7389fa51e65c0].

was: the same text, with only the single test link at the end: [https://gist.github.com/rukletsov/3f19ecc7389fa51e65c0].
[jira] [Commented] (MESOS-1871) Sending SIGTERM to a task command may render it orphaned
[ https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163607#comment-14163607 ] Alexander Rukletsov commented on MESOS-1871:

It looks like this issue consists of two parts.

1. If CommandExecutor starts a task via {{sh -c}}, we reap the wrong process. Instead of reaping {{sh -c}}, it makes sense to monitor and reap the actual task process, or the whole process tree rooted at {{sh -c}}, i.e. call {{reaped()}} only when all processes in the tree have terminated. Otherwise, as illustrated by the test in the description, {{reaped()}} happily disables escalation, leaving the task process orphaned in the system.

2. In case we manage to enter the {{escalated()}} callback, we should ensure all children of {{sh -c}} receive {{SIGKILL}}. I'm not sure the current implementation via {{os::killtree}} provides such a guarantee. As proposed by [~idownes], POSIX process groups might be a solution: reap the whole group. However, it would still be nice to obtain the OS pid of the task process, so that status update messages refer to the task process and not to the wrapper {{sh -c}}.
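The process-group idea can be sketched in Python: launch the wrapper in its own session, kill the whole group, and declare the tree reaped only when no member of the group remains. `group_terminated` is a hypothetical helper, not Mesos code; it uses the POSIX convention that signal 0 merely probes for existence:

```python
import os
import signal
import subprocess
import time

def group_terminated(pgid):
    """Signal 0 probes for existence: ProcessLookupError means no
    process in the group remains, i.e. the whole tree is gone."""
    try:
        os.killpg(pgid, 0)
        return False
    except ProcessLookupError:
        return True

# A wrapper with a background grandchild, all in one session/group.
proc = subprocess.Popen(["sh", "-c", "sleep 30 & sleep 30"],
                        preexec_fn=os.setsid)
time.sleep(0.5)
pgid = os.getpgid(proc.pid)

os.killpg(pgid, signal.SIGKILL)  # kill the whole group, not just `sh -c`
proc.wait()                      # reap the direct child

for _ in range(50):              # wait until every group member is reaped
    try:
        os.waitpid(-1, os.WNOHANG)   # reap strays if we happen to be init
    except ChildProcessError:
        pass
    if group_terminated(pgid):
        break
    time.sleep(0.1)
```

This is the "call {{reaped()}} only when the whole tree has terminated" behavior from point 1, expressed in terms of process groups.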
[jira] [Assigned] (MESOS-1871) Sending SIGTERM to a task command may render it orphaned
[ https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-1871: Assignee: Alexander Rukletsov
[jira] [Commented] (MESOS-156) Create framework that provides a high level resource request language
[ https://issues.apache.org/jira/browse/MESOS-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163620#comment-14163620 ] Jay Buffington commented on MESOS-156: -- It looks like this was opened before Aurora and Marathon were open sourced. I suspect these frameworks meet your needs. Can this Jira be closed? Create framework that provides a high level resource request language - Key: MESOS-156 URL: https://issues.apache.org/jira/browse/MESOS-156 Project: Mesos Issue Type: Story Components: framework Reporter: Andy Konwinski Original Estimate: 2m Remaining Estimate: 2m One of the primary points of confusion about Mesos is the mechanism it provides frameworks to acquire new resources (e.g. cpu, ram, etc.). Currently, frameworks receive callbacks with resource offers which they can accept (entirely or only a portion) or reject. When they accept them, they provide a task to be executed on those resources. Many engineers we have spoken to have said that they would find it more intuitive to provide their executable up front with a description of which and how many resources they want, and then have mesos do the scheduling. I propose that Mesos should ship with a framework that can very easily be installed and run by new users, and this framework should accept Launch Job Requests expressed via some language that describes the resource requirements and where to find the executable for the tasks in the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1874) Offer network interfaces as resources
Jay Buffington created MESOS-1874: Summary: Offer network interfaces as resources Key: MESOS-1874 URL: https://issues.apache.org/jira/browse/MESOS-1874 Project: Mesos Issue Type: Improvement Reporter: Jay Buffington

I have a use case where I want two tasks to bind to the same port on the same slave, but on different interfaces. Ports are offered as a resource, but it is assumed that the task will bind to all interfaces (0.0.0.0). If task A is allocated port 31201 and only binds to 127.0.0.1:31201, task B cannot be offered that port and bind to 10.1.2.3:31201 on the same host, even though 10.1.2.3:31201 is unused.
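The underlying socket behavior can be shown with two plain TCP sockets (a sketch, assuming the usual BSD-socket bind semantics): a listener bound to 127.0.0.1 also occupies the port for the wildcard address, which is why treating a port as a single host-wide resource is the conservative choice today, while a second *specific* address on the same port would in fact be bindable:

```python
import socket

# Listener on the loopback interface only; the kernel picks a free port.
a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.bind(("127.0.0.1", 0))
port = a.getsockname()[1]
a.listen(1)

# Binding the wildcard address to the same port conflicts, because
# 0.0.0.0 covers 127.0.0.1 as well.
b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    b.bind(("0.0.0.0", port))
    wildcard_bind_ok = True
except OSError:
    wildcard_bind_ok = False
finally:
    b.close()
a.close()
```

Binding another concrete interface address (e.g. 10.1.2.3) to the same port would succeed, which is exactly the capacity the reporter wants Mesos to be able to offer.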
[jira] [Commented] (MESOS-1871) Sending SIGTERM to a task command may render it orphaned
[ https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163661#comment-14163661 ] Timothy St. Clair commented on MESOS-1871:

Doesn't the executor get isolated by its container? If this is not the case, then my world view is incorrect :-/
[jira] [Commented] (MESOS-1046) Use of leading underscore in names (global symbols and defines)
[ https://issues.apache.org/jira/browse/MESOS-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163710#comment-14163710 ] Dominic Hamon commented on MESOS-1046:

We can replace the include guards with the guidance from http://google-styleguide.googlecode.com/svn/trunk/cppguide.html#The__define_Guard. Continuations are a more intrusive change. The underscore scheme works really well for indicating continuations, but we do use two or more continuations in places. I'm loath to suggest numbering ({{launch}}, {{launch1}}, {{launch2}}) as I find that difficult to parse. Perhaps we could break up the underscores with a character like 'c' for continuation: {{launch}}, {{c_launch}}, {{c_c_launch}}?

Use of leading underscore in names (global symbols and defines)
Key: MESOS-1046 URL: https://issues.apache.org/jira/browse/MESOS-1046 Project: Mesos Issue Type: Improvement Components: technical debt Affects Versions: 0.19.0 Reporter: Till Toenshoff Priority: Minor Labels: c, c++, libprocess, mesos, standards, stout

Even though this appears to be a very common standard breach, I thought it would still be nice to play entirely by the rules. If I understand correctly, then according to the 1999 C standard as well as the 2003 C++ standard, identifiers with a leading underscore followed by a capital letter and, maybe even more importantly, identifiers containing double underscores are reserved for the implementation of those standards. This applies both to global-namespace symbols and to defines. We are currently using double underscores in our include guards, and it may be wise to fix that and any other collision with the standards relating to the use of underscores. A nice compilation of the related standard quotes can be found at http://stackoverflow.com/a/228797/91282
[jira] [Commented] (MESOS-1416) mesos-0.19.0 build directory is read-only
[ https://issues.apache.org/jira/browse/MESOS-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163732#comment-14163732 ] Timothy St. Clair commented on MESOS-1416:

I don't believe this should be a problem on master; if not, please let us know.
[jira] [Commented] (MESOS-1871) Sending SIGTERM to a task command may render it orphaned
[ https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163768#comment-14163768 ] Alexander Rukletsov commented on MESOS-1871:

I think what happens is that the task process escapes its process tree and is not killed by {{PosixLauncher}}. Here is an orphaned process after launching the first test:

{code}
alex@alex-hh.local: ~ $ ps aux | grep handler
alex  5641  0.0  0.0  2432784   624 s003  S+  6:52PM  0:00.00 grep handler
alex  5620  0.0  0.0  2447700   688   ??  S   6:52PM  0:00.00 sh -c ( handler() { echo SIGTERM; }; trap 'handler TERM' SIGTERM; echo $$; echo $(which sleep); while true; do date; sleep 1; done; exit 0 )
alex@alex-hh.local: ~ $ ps -p 5620 -o ppid=
1
{code}
[jira] [Commented] (MESOS-1871) Sending SIGTERM to a task command may render it orphaned
[ https://issues.apache.org/jira/browse/MESOS-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163803#comment-14163803 ] Ian Downes commented on MESOS-1871:

I looked at the code: os::killtree()'s behavior is incorrect.

1. The posix launcher puts the executor into its own session with setsid.
2. The posix launcher calls os::killtree(pid, SIGKILL, true, true), where the trues are for killing all processes in the group and session.
3. os::killtree() *returns early* if it can't find the *process* with pid (which is the scenario you're describing), so it doesn't actually continue to kill everything in the process group/session.

I modified the code early this year and perpetuated the existing bug. I'll file a ticket on this.
[jira] [Created] (MESOS-1875) os::killtree() incorrectly returns early if pid has terminated
Ian Downes created MESOS-1875: Summary: os::killtree() incorrectly returns early if pid has terminated Key: MESOS-1875 URL: https://issues.apache.org/jira/browse/MESOS-1875 Project: Mesos Issue Type: Bug Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.19.0, 0.19.1, 0.20.0, 0.20.1 Reporter: Ian Downes

If groups == true and/or sessions == true then os::killtree() should continue to signal all processes in the process group and/or session, even if the leading pid has terminated.
[jira] [Updated] (MESOS-1875) os::killtree() incorrectly returns early if pid has terminated
[ https://issues.apache.org/jira/browse/MESOS-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-1875:

Description: If groups == true and/or sessions == true then os::killtree() should continue to signal all processes in the process group and/or session, even if the leading pid has terminated. (was: If groups == true and/or sessions == true then os::kill tree should continue to signal all processes in the process group and/or session, even if the leading pid has terminated.)
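The early-return scenario is easy to demonstrate in a Python sketch (this is not os::killtree() itself): the leading process exits immediately after forking a background child into the same group, so any tree-killer that bails out when the leading pid is gone would strand the child, while signalling the group still works:

```python
import os
import signal
import subprocess

# The leader (`sh -c`) forks a background child into the session/group
# it created via setsid, then exits immediately.
proc = subprocess.Popen(["sh", "-c", "sleep 30 &"], preexec_fn=os.setsid)
pgid = proc.pid          # after setsid, the child is its own group leader
proc.wait()              # by now the leading pid has terminated...

try:
    os.kill(proc.pid, 0)             # probe: is the leader still there?
    leader_exists = True
except ProcessLookupError:
    leader_exists = False            # gone, the early-return trigger

# ...but the group still has a live member (the background sleep),
# so signalling the group succeeds where signalling the pid would not.
os.killpg(pgid, signal.SIGKILL)
group_signalled = True
```

This is the behavior the fix describes: when groups/sessions are requested, keep signalling them even though the leading pid has already terminated.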
[jira] [Commented] (MESOS-156) Create framework that provides a high level resource request language
[ https://issues.apache.org/jira/browse/MESOS-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163866#comment-14163866 ] Andy Konwinski commented on MESOS-156: Sure!
[jira] [Comment Edited] (MESOS-1416) mesos-0.19.0 build directory is read-only
[ https://issues.apache.org/jira/browse/MESOS-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163732#comment-14163732 ] Timothy St. Clair edited comment on MESOS-1416 at 10/8/14 6:33 PM:

I don't believe this should be a problem on master, if not, please let us know.

was (Author: tstclair): I don't believe this should be a problem no master, if not, please let us know.
[jira] [Commented] (MESOS-1848) DRFAllocatorTest.DRFAllocatorProcess is flaky
[ https://issues.apache.org/jira/browse/MESOS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163973#comment-14163973 ] Till Toenshoff commented on MESOS-1848:

Turned out the described symptoms were caused by a custom sasl installation I did on that VM. After removing all traces of it and rebuilding against a proper one, everything went back to normal. That does not really help pin the problem to an exact cause, but it did the job for me.

DRFAllocatorTest.DRFAllocatorProcess is flaky
Key: MESOS-1848 URL: https://issues.apache.org/jira/browse/MESOS-1848 Project: Mesos Issue Type: Bug Components: test Environment: Fedora 20 Reporter: Vinod Kone

Observed this on CI. This is pretty strange because the authentication of both the framework and slave timed out at the very beginning, even though we don't manipulate clocks.

{code}
[ RUN ] DRFAllocatorTest.DRFAllocatorProcess
Using temporary directory '/tmp/DRFAllocatorTest_DRFAllocatorProcess_igiR9X'
I0929 20:11:12.801327 16997 leveldb.cpp:176] Opened db in 489720ns
I0929 20:11:12.801627 16997 leveldb.cpp:183] Compacted db in 168280ns
I0929 20:11:12.801784 16997 leveldb.cpp:198] Created db iterator in 5820ns
I0929 20:11:12.801898 16997 leveldb.cpp:204] Seeked to beginning of db in 1285ns
I0929 20:11:12.802039 16997 leveldb.cpp:273] Iterated through 0 keys in the db in 792ns
I0929 20:11:12.802160 16997 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned
I0929 20:11:12.802441 17012 recover.cpp:425] Starting replica recovery
I0929 20:11:12.802623 17012 recover.cpp:451] Replica is in EMPTY status
I0929 20:11:12.803251 17012 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
I0929 20:11:12.803427 17012 recover.cpp:188] Received a recover response from a replica in EMPTY status
I0929 20:11:12.803632 17012 recover.cpp:542] Updating replica status to STARTING
I0929 20:11:12.803911 17012 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 33999ns
I0929 20:11:12.804033 17012 replica.cpp:320] Persisted replica status to STARTING
I0929 20:11:12.804245 17012 recover.cpp:451] Replica is in STARTING status
I0929 20:11:12.804592 17012 replica.cpp:638] Replica in STARTING status received a broadcasted recover request
I0929 20:11:12.804775 17012 recover.cpp:188] Received a recover response from a replica in STARTING status
I0929 20:11:12.804952 17012 recover.cpp:542] Updating replica status to VOTING
I0929 20:11:12.805115 17012 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 15990ns
I0929 20:11:12.805234 17012 replica.cpp:320] Persisted replica status to VOTING
I0929 20:11:12.805366 17012 recover.cpp:556] Successfully joined the Paxos group
I0929 20:11:12.805539 17012 recover.cpp:440] Recover process terminated
I0929 20:11:12.809062 17017 master.cpp:312] Master 20140929-201112-2759502016-47295-16997 (fedora-20) started on 192.168.122.164:47295
I0929 20:11:12.809432 17017 master.cpp:358] Master only allowing authenticated frameworks to register
I0929 20:11:12.809546 17017 master.cpp:363] Master only allowing authenticated slaves to register
I0929 20:11:12.810169 17017 credentials.hpp:36] Loading credentials for authentication from '/tmp/DRFAllocatorTest_DRFAllocatorProcess_igiR9X/credentials'
I0929 20:11:12.810510 17017 master.cpp:392] Authorization enabled
I0929 20:11:12.811841 17016 master.cpp:120] No whitelist given. Advertising offers for all slaves
I0929 20:11:12.812099 17013 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@192.168.122.164:47295
I0929 20:11:12.813006 17017 master.cpp:1241] The newly elected leader is master@192.168.122.164:47295 with id 20140929-201112-2759502016-47295-16997
I0929 20:11:12.813164 17017 master.cpp:1254] Elected as the leading master!
I0929 20:11:12.813279 17017 master.cpp:1072] Recovering from registrar
I0929 20:11:12.813487 17013 registrar.cpp:312] Recovering registrar
I0929 20:11:12.813824 17013 log.cpp:656] Attempting to start the writer
I0929 20:11:12.814256 17013 replica.cpp:474] Replica received implicit promise request with proposal 1
I0929 20:11:12.814419 17013 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 25049ns
I0929 20:11:12.814581 17013 replica.cpp:342] Persisted promised to 1
I0929 20:11:12.814909 17013 coordinator.cpp:230] Coordinator attemping to fill missing position
I0929 20:11:12.815340 17013 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
I0929 20:11:12.815497 17013 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 19855ns
I0929 20:11:12.815636 17013
{code}
[jira] [Commented] (MESOS-1847) mesos-ec2 launch: tries to rsync before ssh is available
[ https://issues.apache.org/jira/browse/MESOS-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164094#comment-14164094 ] Killian Murphy commented on MESOS-1847: --- I had the same issue. Adding --wait 600 worked for me; adding --wait 180 did not. Testing by sshing into the created VM after the failure, it looks like sshd takes about 7-8 minutes to be ready for login. The only way to recover for me was to destroy the cluster and recreate it with the additional --wait option. Here's the failure: killian@nore ~/development/mesos/mesos-0.20.1/ec2: ./mesos_ec2.py -k kdefault -i ~/AWS/id_rsa-kdefault -s 1 launch k_mesos Setting up security groups... Checking for running cluster... Launching instances... Launched slaves, regid = r-87bd89ac Launched master, regid = r-65bf8b4e Waiting for instances to start up... Waiting 60 more seconds... Deploying files to master... ssh: connect to host ec2-54-237-156-217.compute-1.amazonaws.com port 22: Connection refused rsync: connection unexpectedly closed (0 bytes received so far) [sender] rsync error: unexplained error (code 255) at /SourceCache/rsync/rsync-42/rsync/io.c(452) [sender=2.6.9] Traceback (most recent call last): File "./mesos_ec2.py", line 571, in <module> main() File "./mesos_ec2.py", line 480, in main setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True) File "./mesos_ec2.py", line 334, in setup_cluster deploy_files(conn, "deploy." + opts.os, opts, master_nodes, slave_nodes, zoo_nodes) File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'rsync -rv -e 'ssh -o StrictHostKeyChecking=no -i /Users/killian/AWS/id_rsa-kdefault' '/var/folders/8t/hp2txtm56h3byl8q5cdd33bmgp/T/tmp5VZqO3/' 'r...@ec2-54-237-156-217.compute-1.amazonaws.com:/'' returned non-zero exit status 255 mesos-ec2 launch: tries to rsync before ssh is available Key: MESOS-1847 URL: https://issues.apache.org/jira/browse/MESOS-1847 Project: Mesos Issue Type: Bug Components: ec2 Reporter: Kevin Matzen If you don't specify a wait time that is long enough, then wait_for_cluster will return once the instances have launched, but ssh will not necessarily be available. deploy_files will execute rsync and then possibly fail. ssh should be tested before continuing on to the file deployment stage. It's not really clear to me why opts.wait is even a thing when you can simply test for availability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
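A sketch of the check proposed above — poll until the SSH port accepts connections before starting the rsync step, instead of sleeping a fixed opts.wait. The function name and parameters are illustrative, not part of mesos_ec2.py:

```python
import socket
import time

def wait_for_ssh(host, port=22, timeout=600, interval=10):
    """Poll until a TCP connection to host:port succeeds.

    Returns True once the port accepts connections, False if the
    timeout expires first. A production check might additionally run
    a trivial ssh command, since sshd can still refuse logins briefly
    after the port opens.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

With this, deploy_files would only run once wait_for_ssh(master_host) returns True, turning the wait option into an upper bound rather than a fixed sleep.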
[jira] [Created] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.
Benjamin Mahler created MESOS-1876: -- Summary: Remove deprecated 'slave_id' field in ReregisterSlaveMessage. Key: MESOS-1876 URL: https://issues.apache.org/jira/browse/MESOS-1876 Project: Mesos Issue Type: Task Components: technical debt Reporter: Benjamin Mahler This is to follow through on removing the deprecated field that we've been phasing out. In 0.21.0, this field will no longer be read: {code} message ReregisterSlaveMessage { // TODO(bmahler): slave_id is deprecated. // 0.21.0: Now an optional field. Always written, never read. // 0.22.0: Remove this field. optional SlaveID slave_id = 1; required SlaveInfo slave = 2; repeated ExecutorInfo executor_infos = 4; repeated Task tasks = 3; repeated Archive.Framework completed_frameworks = 5; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.
[ https://issues.apache.org/jira/browse/MESOS-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1876: --- Priority: Trivial (was: Major) Remove deprecated 'slave_id' field in ReregisterSlaveMessage. - Key: MESOS-1876 URL: https://issues.apache.org/jira/browse/MESOS-1876 Project: Mesos Issue Type: Task Components: technical debt Reporter: Benjamin Mahler Priority: Trivial This is to follow through on removing the deprecated field that we've been phasing out. In 0.21.0, this field will no longer be read: {code} message ReregisterSlaveMessage { // TODO(bmahler): slave_id is deprecated. // 0.21.0: Now an optional field. Always written, never read. // 0.22.0: Remove this field. optional SlaveID slave_id = 1; required SlaveInfo slave = 2; repeated ExecutorInfo executor_infos = 4; repeated Task tasks = 3; repeated Archive.Framework completed_frameworks = 5; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1877) Unify stout include style
Cody Maloney created MESOS-1877: --- Summary: Unify stout include style Key: MESOS-1877 URL: https://issues.apache.org/jira/browse/MESOS-1877 Project: Mesos Issue Type: Bug Components: stout Reporter: Cody Maloney Priority: Minor Some of the files in stout use relative includes (stringify.hpp, for example), while others use absolute includes (result.hpp) to reach files that live inside stout. They should all use one format. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1878) Access to sandbox on slave from master UI does not show the sandbox contents
Anindya Sinha created MESOS-1878: Summary: Access to sandbox on slave from master UI does not show the sandbox contents Key: MESOS-1878 URL: https://issues.apache.org/jira/browse/MESOS-1878 Project: Mesos Issue Type: Bug Components: webui Reporter: Anindya Sinha Priority: Minor From the master UI, clicking Sandbox to go to the slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but the actual contents of the sandbox are not displayed below it. It looks like the issue is that the following GET to the corresponding slave fails: http://slave1:4891/files/browse.json?jsonp=angular.callbacks._9&path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on the mesos slave does not show this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1879) Handle a temporary one-way slave -- master socket closure.
Benjamin Mahler created MESOS-1879: -- Summary: Handle a temporary one-way slave -- master socket closure. Key: MESOS-1879 URL: https://issues.apache.org/jira/browse/MESOS-1879 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Priority: Minor In the same spirit as MESOS-1668, we want to correctly handle a scenario where the slave -- master socket closes, and a new socket can be immediately re-established. If this occurs, the ping / pongs will resume but there may be dropped messages sent by the slave, and so a re-registration would be a good safety net. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1878) Access to sandbox on slave from master UI does not show the sandbox contents
[ https://issues.apache.org/jira/browse/MESOS-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anindya Sinha updated MESOS-1878: - Description: From master UI, clicking Sandbox to go to slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but not the actual contents of the sandbox that is displayed below. Looks like the issue is it fails in the following GET from the corresponding slave: http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on mesos slave does not show this behavior. was: From master UI, clicking Sandbox to go to slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but not the actual contents of the sandbox that is displayed below. Looks like the issue is it fails in the following GET from the corresponding slave: http://slave1:4891/files/browse.json?jsonp=angular.callbacks._9path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on mesos slave does not show this behavior. Access to sandbox on slave from master UI does not show the sandbox contents Key: MESOS-1878 URL: https://issues.apache.org/jira/browse/MESOS-1878 Project: Mesos Issue Type: Bug Components: webui Reporter: Anindya Sinha Priority: Minor From master UI, clicking Sandbox to go to slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but not the actual contents of the sandbox that is displayed below. 
Looks like the issue is it fails in the following GET from the corresponding slave: http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on mesos slave does not show this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1880) config options page on mesos.apache.org out of date and missing versioning info.
Jay Buffington created MESOS-1880: - Summary: config options page on mesos.apache.org out of date and missing versioning info. Key: MESOS-1880 URL: https://issues.apache.org/jira/browse/MESOS-1880 Project: Mesos Issue Type: Improvement Reporter: Jay Buffington Assignee: Dave Lester http://mesos.apache.org/documentation/latest/configuration/ is old. For example, the slave options don't list --containerizers, which was introduced in 0.20.0. Also, I think there should be a note that the list on that page is for a particular version. mesos-slave --help is the best way to get all the options for the particular version you're running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1880) config options page on mesos.apache.org out of date and missing versioning info.
[ https://issues.apache.org/jira/browse/MESOS-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1880: -- Issue Type: Documentation (was: Improvement) config options page on mesos.apache.org out of date and missing versioning info. Key: MESOS-1880 URL: https://issues.apache.org/jira/browse/MESOS-1880 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Jay Buffington Assignee: Dave Lester Labels: newbie http://mesos.apache.org/documentation/latest/configuration/ is old. For example slave options doesn't list --containerizers which was introduced in 0.20.0. Also, I think there should be a note that the list on that page is for a particular version. mesos-slave --help is the best way to get all the options for the particular version you're running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1880) config options page on mesos.apache.org out of date and missing versioning info.
[ https://issues.apache.org/jira/browse/MESOS-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1880: -- Labels: newbie (was: ) config options page on mesos.apache.org out of date and missing versioning info. Key: MESOS-1880 URL: https://issues.apache.org/jira/browse/MESOS-1880 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Jay Buffington Assignee: Dave Lester Labels: newbie http://mesos.apache.org/documentation/latest/configuration/ is old. For example slave options doesn't list --containerizers which was introduced in 0.20.0. Also, I think there should be a note that the list on that page is for a particular version. mesos-slave --help is the best way to get all the options for the particular version you're running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1880) config options page on mesos.apache.org out of date and missing versioning info.
[ https://issues.apache.org/jira/browse/MESOS-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1880: -- Component/s: documentation config options page on mesos.apache.org out of date and missing versioning info. Key: MESOS-1880 URL: https://issues.apache.org/jira/browse/MESOS-1880 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Jay Buffington Assignee: Dave Lester Labels: newbie http://mesos.apache.org/documentation/latest/configuration/ is old. For example slave options doesn't list --containerizers which was introduced in 0.20.0. Also, I think there should be a note that the list on that page is for a particular version. mesos-slave --help is the best way to get all the options for the particular version you're running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1878) Access to sandbox on slave from master UI does not show the sandbox contents
[ https://issues.apache.org/jira/browse/MESOS-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anindya Sinha updated MESOS-1878: - Description: From master UI, clicking Sandbox to go to slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but not the actual contents of the sandbox that is displayed below. Looks like the issue is it fails in the following GET from the corresponding slave: http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on mesos slave does not show this behavior. Update: The issue has been introduced by the following 2 commits: ca2e8ef MESOS-1857 Fixed path::join() on older libstdc++ which lack back(). b08fccf Switched path::join() to be variadic Note that the commit ca2e8ef fixes a build issue (on older libstd++) on top of the commit b08fccf. was: From master UI, clicking Sandbox to go to slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but not the actual contents of the sandbox that is displayed below. Looks like the issue is it fails in the following GET from the corresponding slave: http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on mesos slave does not show this behavior. 
Access to sandbox on slave from master UI does not show the sandbox contents Key: MESOS-1878 URL: https://issues.apache.org/jira/browse/MESOS-1878 Project: Mesos Issue Type: Bug Components: webui Reporter: Anindya Sinha Priority: Minor From master UI, clicking Sandbox to go to slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but not the actual contents of the sandbox that is displayed below. Looks like the issue is it fails in the following GET from the corresponding slave: http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on mesos slave does not show this behavior. Update: The issue has been introduced by the following 2 commits: ca2e8ef MESOS-1857 Fixed path::join() on older libstdc++ which lack back(). b08fccf Switched path::join() to be variadic Note that the commit ca2e8ef fixes a build issue (on older libstd++) on top of the commit b08fccf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1878) Access to sandbox on slave from master UI does not show the sandbox contents
[ https://issues.apache.org/jira/browse/MESOS-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1878: --- Target Version/s: 0.21.0 Affects Version/s: 0.21.0 Assignee: Cody Maloney [~cmaloney] can you take a look at this? Access to sandbox on slave from master UI does not show the sandbox contents Key: MESOS-1878 URL: https://issues.apache.org/jira/browse/MESOS-1878 Project: Mesos Issue Type: Bug Components: webui Affects Versions: 0.21.0 Reporter: Anindya Sinha Assignee: Cody Maloney Priority: Minor From master UI, clicking Sandbox to go to slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but not the actual contents of the sandbox that is displayed below. Looks like the issue is it fails in the following GET from the corresponding slave: http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on mesos slave does not show this behavior. Update: The issue has been introduced by the following 2 commits: ca2e8ef MESOS-1857 Fixed path::join() on older libstdc++ which lack back(). b08fccf Switched path::join() to be variadic Note that the commit ca2e8ef fixes a build issue (on older libstd++) on top of the commit b08fccf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1869) UpdateFramework message might reach the slave before Reregistered message and get dropped
[ https://issues.apache.org/jira/browse/MESOS-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164526#comment-14164526 ] Benjamin Mahler commented on MESOS-1869: Fixed as part of MESOS-1696: https://reviews.apache.org/r/26206/ UpdateFramework message might reach the slave before Reregistered message and get dropped - Key: MESOS-1869 URL: https://issues.apache.org/jira/browse/MESOS-1869 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Benjamin Mahler In reregisterSlave() we send 'SlaveReregisteredMessage' before we link the slave pid, which means a temporary socket will be created and used. Subsequently, after linking, we send the UpdateFrameworkMessage, which creates and uses a persistent socket. This might lead to out-of-order delivery, resulting in UpdateFrameworkMessage reaching the slave before the SlaveReregisteredMessage and getting dropped because the slave is not yet (re-)registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1830) Expose master stats differentiating between master-generated and slave-generated LOST tasks
[ https://issues.apache.org/jira/browse/MESOS-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164536#comment-14164536 ] Vinod Kone commented on MESOS-1830: --- I added a proposal for how it could look in the attached review. Please take a look. Feedback welcome on the review or here. Expose master stats differentiating between master-generated and slave-generated LOST tasks --- Key: MESOS-1830 URL: https://issues.apache.org/jira/browse/MESOS-1830 Project: Mesos Issue Type: Story Components: master Reporter: Bill Farner Priority: Minor The master exports a monotonically-increasing counter of tasks transitioned to TASK_LOST. This loses fidelity of the source of the lost task. A first step in exposing the source of lost tasks might be to just differentiate between TASK_LOST transitions initiated by the master vs the slave (and maybe bad input from the scheduler). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1469) No output from review bot on timeout
[ https://issues.apache.org/jira/browse/MESOS-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1469: --- Component/s: reviewbot No output from review bot on timeout Key: MESOS-1469 URL: https://issues.apache.org/jira/browse/MESOS-1469 Project: Mesos Issue Type: Bug Components: build, reviewbot Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor When the mesos review build times out, likely due to a long-running failing test, we have no output to debug. We should find a way to stream the output from the build instead of waiting for the build to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1234) Mesos ReviewBot should look at old reviews first
[ https://issues.apache.org/jira/browse/MESOS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1234: --- Component/s: reviewbot Mesos ReviewBot should look at old reviews first Key: MESOS-1234 URL: https://issues.apache.org/jira/browse/MESOS-1234 Project: Mesos Issue Type: Improvement Components: reviewbot Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.19.0 Currently the ReviewBot looks at newest reviews first starving out old reviews if there are enough new/updated reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1712) Automate disallowing of commits mixing mesos/libprocess/stout
[ https://issues.apache.org/jira/browse/MESOS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1712: --- Component/s: reviewbot Automate disallowing of commits mixing mesos/libprocess/stout - Key: MESOS-1712 URL: https://issues.apache.org/jira/browse/MESOS-1712 Project: Mesos Issue Type: Bug Components: reviewbot Reporter: Vinod Kone For various reasons, we don't want to mix mesos/libprocess/stout changes into a single commit. Typically, it is up to the reviewee/reviewer to catch this. It would be nice to automate this via the pre-commit hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
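A sketch of what such a hook could check: classify each staged path by project and reject the commit if more than one project is touched. This is illustrative Python, not the project's actual hook; the prefixes follow the 3rdparty layout visible in the source tree (stout is vendored inside libprocess):

```python
"""Illustrative git pre-commit hook rejecting commits that mix
mesos, libprocess, and stout changes."""
import subprocess
import sys

def classify(path):
    # stout lives inside libprocess, so test the deeper prefix first.
    if path.startswith("3rdparty/libprocess/3rdparty/stout/"):
        return "stout"
    if path.startswith("3rdparty/libprocess/"):
        return "libprocess"
    return "mesos"

def projects_touched(paths):
    """Return the sorted set of projects the given paths belong to."""
    return sorted({classify(p) for p in paths})

def main():
    # Paths staged for the commit being created.
    staged = subprocess.check_output(
        ["git", "diff", "--cached", "--name-only"]).decode().split()
    projects = projects_touched(staged)
    if len(projects) > 1:
        sys.exit("Commit mixes changes to: " + ", ".join(projects))

# Installed as .git/hooks/pre-commit, the script would simply call main().
```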
[jira] [Created] (MESOS-1881) Reviewbot should not apply reviews that are submitted.
Benjamin Mahler created MESOS-1881: -- Summary: Reviewbot should not apply reviews that are submitted. Key: MESOS-1881 URL: https://issues.apache.org/jira/browse/MESOS-1881 Project: Mesos Issue Type: Bug Components: reviewbot Reporter: Benjamin Mahler Priority: Trivial If a review contains a dependent review that is already submitted, reviewbot will still try to apply it and it will fail. We should skip dependent reviews that are marked as submitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1882) Add a --dry-run to verify-reviews.py.
Benjamin Mahler created MESOS-1882: -- Summary: Add a --dry-run to verify-reviews.py. Key: MESOS-1882 URL: https://issues.apache.org/jira/browse/MESOS-1882 Project: Mesos Issue Type: Improvement Components: reviewbot Reporter: Benjamin Mahler Priority: Minor To improve the ease of making changes to verify-reviews.py, we should add the ability to pass a {{\-\-dry\-run}} flag. This will print all commands to be executed. Additional improvements that we may want to break out of this ticket: # Rename verify-reviews.py to verify_reviews.py to allow importing. # Make verify-reviews.py only execute when run as a {{\_\_main\_\_}}, if imported it should merely make the library methods / classes available, so that one can use the library from an interpreter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
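A minimal sketch of the flag (illustrative, not the actual verify-reviews.py code): route every shell command through one helper that always prints the command and, in dry-run mode, skips execution. The second numbered point above — only executing when run as __main__ — falls out of the same structure. The "echo hello" command is a placeholder, not a real verification step:

```python
import argparse
import subprocess

def shell(command, dry_run=False):
    """Print the command; execute it only when not in dry-run mode."""
    print(command)
    if not dry_run:
        subprocess.check_call(command, shell=True)

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true",
                        help="print commands instead of executing them")
    args = parser.parse_args(argv)
    # A real run would chain the apply / build / test commands here.
    shell("echo hello", dry_run=args.dry_run)

if __name__ == "__main__":
    main()
```

Because all work happens inside main(), importing the module from an interpreter exposes shell() and main() without side effects.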
[jira] [Updated] (MESOS-1873) Don't pass task-related arguments to mesos-executor
[ https://issues.apache.org/jira/browse/MESOS-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] R.B. Boyer updated MESOS-1873: -- Attachment: mesos_executor_overshare.v2.diff Attaching a second attempt at the patch (mesos_executor_overshare.v2.diff), this time I have reliably reproduced the fix in a test environment by recompiling the library and swapping it out on a running system. Don't pass task-related arguments to mesos-executor --- Key: MESOS-1873 URL: https://issues.apache.org/jira/browse/MESOS-1873 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.20.1 Environment: Linux 3.13.0-35-generic x86_64 Ubuntu-Precise Reporter: R.B. Boyer Attachments: mesos_executor_overshare.v2.diff Attempting to launch a task using the command executor with {{shell=false}} and passing arguments fails strangely. {noformat:title=CommandInfo proto} command { value: /my_program user: app shell: false arguments: my_program arguments: --start arguments: 2014-10-06 arguments: --end arguments: 2014-10-07 } {noformat} Dies with: {noformat:title=stderr} Failed to load unknown flag 'end' Usage: my_program [...] Supported options: --[no-]help Prints this help message (default: false) --[no-]override Whether or not to override the command the executor should run when the task is launched. Only this flag is expected to be on the command line and all arguments after the flag will be used as the subsequent 'argv' to be used with 'execvp' (default: false) {noformat} This is coming from a failed attempt to have the slave launch {{mesos-executor}}. This is due to an adverse interaction between new {{CommandInfo}} features and this blurb from {{src/slave/slave.cpp}}: {code} // Copy the CommandInfo to get the URIs and environment, but // update it to invoke 'mesos-executor' (unless we couldn't // resolve 'mesos-executor' via 'realpath', in which case just // echo the error and exit). 
executor.mutable_command()->MergeFrom(task.command());

Result<string> path = os::realpath(
    path::join(flags.launcher_dir, "mesos-executor"));

if (path.isSome()) {
  executor.mutable_command()->set_value(path.get());
} else {
  executor.mutable_command()->set_value(
      "echo '" + (path.isError() ? path.error() : "No such file or directory") +
      "'; exit 1");
}
{code} This is failing to: * clear the {{arguments}} field * probably explicitly restore {{shell=true}} * clear {{container}} ? * clear {{user}} ? I was able to quickly fix this locally by making a man-in-the-middle program at {{/usr/local/libexec/mesos/mesos-executor}} that stripped all args before exec-ing the real {{mesos-executor}} binary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1873) Don't pass task-related arguments to mesos-executor
[ https://issues.apache.org/jira/browse/MESOS-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] R.B. Boyer updated MESOS-1873: -- Attachment: (was: mesos_executor_overshare.diff) Don't pass task-related arguments to mesos-executor --- Key: MESOS-1873 URL: https://issues.apache.org/jira/browse/MESOS-1873 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.20.1 Environment: Linux 3.13.0-35-generic x86_64 Ubuntu-Precise Reporter: R.B. Boyer Attachments: mesos_executor_overshare.v2.diff Attempting to launch a task using the command executor with {{shell=false}} and passing arguments fails strangely. {noformat:title=CommandInfo proto} command { value: /my_program user: app shell: false arguments: my_program arguments: --start arguments: 2014-10-06 arguments: --end arguments: 2014-10-07 } {noformat} Dies with: {noformat:title=stderr} Failed to load unknown flag 'end' Usage: my_program [...] Supported options: --[no-]help Prints this help message (default: false) --[no-]override Whether or not to override the command the executor should run when the task is launched. Only this flag is expected to be on the command line and all arguments after the flag will be used as the subsequent 'argv' to be used with 'execvp' (default: false) {noformat} This is coming from a failed attempt to have the slave launch {{mesos-executor}}. This is due to an adverse interaction between new {{CommandInfo}} features and this blurb from {{src/slave/slave.cpp}}: {code}
// Copy the CommandInfo to get the URIs and environment, but
// update it to invoke 'mesos-executor' (unless we couldn't
// resolve 'mesos-executor' via 'realpath', in which case just
// echo the error and exit).
executor.mutable_command()->MergeFrom(task.command());

Result<string> path = os::realpath(
    path::join(flags.launcher_dir, "mesos-executor"));

if (path.isSome()) {
  executor.mutable_command()->set_value(path.get());
} else {
  executor.mutable_command()->set_value(
      "echo '" + (path.isError() ? path.error() : "No such file or directory") +
      "'; exit 1");
}
{code} This is failing to: * clear the {{arguments}} field * probably explicitly restore {{shell=true}} * clear {{container}} ? * clear {{user}} ? I was able to quickly fix this locally by making a man-in-the-middle program at {{/usr/local/libexec/mesos/mesos-executor}} that stripped all args before exec-ing the real {{mesos-executor}} binary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1883) Possible race between reregistration, launching tasks, and rescinding offers
Dominic Hamon created MESOS-1883: Summary: Possible race between reregistration, launching tasks, and rescinding offers Key: MESOS-1883 URL: https://issues.apache.org/jira/browse/MESOS-1883 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Reporter: Dominic Hamon Priority: Minor When a framework reregisters, we rescind any offers we have sent; however, the framework may attempt to launch tasks before the rescind message is received. This leads to a number of lost tasks due to invalid offers. Should we send offers before a framework is registered? -- This message was sent by Atlassian JIRA (v6.3.4#6332)