[jira] [Commented] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters
[ https://issues.apache.org/jira/browse/MESOS-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824829#comment-16824829 ]

Jorge Machado commented on MESOS-9740:
--

We are running Mesos 1.7.1 and have one slave on Ubuntu 18.04 which has the master version compiled. We are only using the Mesos containerizer, not the Docker one. It works fine for us.

> Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents
> from reregistering with 1.8+ masters
> ---
>
> Key: MESOS-9740
> URL: https://issues.apache.org/jira/browse/MESOS-9740
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Joseph Wu
> Assignee: Benno Evers
> Priority: Blocker
> Labels: foundations, mesosphere
>
> As part of MESOS-6874, the master now validates protobuf unions passed as
> part of an {{ExecutorInfo::ContainerInfo}}. This prevents a task from
> specifying, for example, a {{ContainerInfo::MESOS}}, but filling out the
> {{docker}} field (which is then ignored by the agent).
> However, if a task was already launched with an invalid protobuf union, the
> same validation will happen when the agent tries to reregister with the
> master. In this case, if the master is upgraded to validate protobuf unions,
> the agent reregistration will be rejected.
> {code}
> master.cpp:7201] Dropping re-registration of agent at
> slave(1)@172.31.47.126:5051 because it sent an invalid re-registration:
> Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the
> field `docker` set.
> {code}
> This bug was found when upgrading a 1.7.x test cluster to 1.8.0. When
> MESOS-6874 was committed, I had assumed the invalid protobufs would be rare.
> However, on the test cluster, 13/17 agents had at least one invalid
> ContainerInfo when reregistering.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9741) Test `SlaveRecoveryTest.AgentReconfigurationWithRunningTask` is flaky.
Greg Mann created MESOS-9741:

Summary: Test `SlaveRecoveryTest.AgentReconfigurationWithRunningTask` is flaky.
Key: MESOS-9741
URL: https://issues.apache.org/jira/browse/MESOS-9741
Project: Mesos
Issue Type: Bug
Environment: Ubuntu 14.04, SSL-enabled
Reporter: Greg Mann

Observed on internal CI, Ubuntu 14.04, SSL-enabled
[jira] [Created] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters
Joseph Wu created MESOS-9740:

Summary: Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters
Key: MESOS-9740
URL: https://issues.apache.org/jira/browse/MESOS-9740
Project: Mesos
Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Joseph Wu
Assignee: Benno Evers

As part of MESOS-6874, the master now validates protobuf unions passed as part of an {{ExecutorInfo::ContainerInfo}}. This prevents a task from specifying, for example, a {{ContainerInfo::MESOS}}, but filling out the {{docker}} field (which is then ignored by the agent).

However, if a task was already launched with an invalid protobuf union, the same validation will happen when the agent tries to reregister with the master. In this case, if the master is upgraded to validate protobuf unions, the agent reregistration will be rejected.

{code}
master.cpp:7201] Dropping re-registration of agent at slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the field `docker` set.
{code}

This bug was found when upgrading a 1.7.x test cluster to 1.8.0. When MESOS-6874 was committed, I had assumed the invalid protobufs would be rare. However, on the test cluster, 13/17 agents had at least one invalid ContainerInfo when reregistering.
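The union check described above can be illustrated with a minimal sketch. This is not the Mesos implementation (which is C++ and operates on generated protobuf classes); the `ContainerInfo` dataclass, the `UNION_FIELDS` table and `validate_union` are hypothetical names invented for illustration, and only the wording of the error string is taken from the log line in the report.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the `mesos.ContainerInfo` protobuf message.
# The real message is generated by protoc and has more fields.
@dataclass
class ContainerInfo:
    type: str                      # "MESOS" or "DOCKER"
    mesos: Optional[dict] = None   # member field for Type == MESOS
    docker: Optional[dict] = None  # member field for Type == DOCKER


# Assumed mapping from each union type to the one member field it allows.
UNION_FIELDS = {"MESOS": "mesos", "DOCKER": "docker"}


def validate_union(info: ContainerInfo) -> Optional[str]:
    """Return an error string if a field belonging to another type is set."""
    for type_name, field_name in UNION_FIELDS.items():
        if type_name != info.type and getattr(info, field_name) is not None:
            return (
                f"Protobuf union `mesos.ContainerInfo` with `Type == {info.type}` "
                f"should not have the field `{field_name}` set."
            )
    return None


if __name__ == "__main__":
    # A task launched before the upgrade with a stray `docker` field.
    stale = ContainerInfo(type="MESOS", docker={"image": "busybox"})
    print(validate_union(stale))
```

The point of the report is that such a check runs not only when a task is launched but also on whatever the agent sends at reregistration, so a `ContainerInfo` like `stale` that was accepted by a 1.7.x master is enough to get the whole reregistration dropped.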
[jira] [Created] (MESOS-9739) When recovered agent marked gone, retain agent ID
Greg Mann created MESOS-9739:

Summary: When recovered agent marked gone, retain agent ID
Key: MESOS-9739
URL: https://issues.apache.org/jira/browse/MESOS-9739
Project: Mesos
Issue Type: Improvement
Reporter: Greg Mann

When a recovered agent is marked gone, we could retain its agent ID so that if it attempts to reregister, we could send task status updates for its tasks.
[jira] [Comment Edited] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources
[ https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823580#comment-16823580 ]

Greg Mann edited comment on MESOS-9619 at 4/23/19 11:11 PM:

1.8.x branch:
{code}
commit 356ff6e2805657e7df66896895728e2aabac2029
Author: Greg Mann
Date: Fri Apr 19 00:34:15 2019 -0700

    Enabled construction of `ResourceQuantities` from `Resources`.

    This patch adds a new static method which enables the construction of
    `ResourceQuantities` from `Resources`. Namely, this permits the inclusion
    of sets and ranges in the input resources used to construct
    `ResourceQuantities`.

    Review: https://reviews.apache.org/r/70507
{code}
{code}
commit ac6ab14f93ac19226b744c0e432c279a2e0ff2f7
Author: Greg Mann
Date: Sat Apr 20 11:48:39 2019 -0700

    Ensured that task groups do not specify overlapping ranges or sets.

    This patch adds validation to the master to ensure that task groups do
    not include resources with overlapping set- or range-valued resources,
    as this can crash the allocator.

    Review: https://reviews.apache.org/r/70472/
{code}

was (Author: greggomann):
Backports forthcoming

> Mesos Master Crashes with Launch Group when using Port Resources
>
> Key: MESOS-9619
> URL: https://issues.apache.org/jira/browse/MESOS-9619
> Project: Mesos
> Issue Type: Bug
> Components: allocation
> Affects Versions: 1.4.3, 1.7.1
> Environment: Testing in both Mesos 1.4.3 and Mesos 1.7.1
> Reporter: Nimi Wariboko Jr.
> Assignee: Greg Mann
> Priority: Critical
> Labels: foundations, master, mesosphere
> Attachments: mesos-master.log, mesos-master.snippet.log
>
> Original Issue:
> [https://lists.apache.org/thread.html/979c8799d128ad0c436b53f2788568212f97ccf324933524f1b4d189@%3Cuser.mesos.apache.org%3E]
> When the ports resource is removed, Mesos functions normally (I'm able to
> launch the task as many times as possible, while it always fails continually).
> Attached is a snippet of the mesos master log from OFFER to crash.
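The second commit above adds validation that task groups do not carry overlapping range-valued resources. A minimal sketch of that kind of check follows, assuming port ranges are inclusive `(begin, end)` pairs; `has_overlap` and `validate_task_group` are hypothetical helpers, not functions from the Mesos code base, and set-valued resources are left out.

```python
from typing import Iterable, List, Tuple

# Inclusive [begin, end] range, e.g. a `ports` range (assumed representation).
Range = Tuple[int, int]


def has_overlap(ranges: Iterable[Range]) -> bool:
    """Return True if any two inclusive ranges share at least one value."""
    ordered = sorted(ranges)
    # After sorting by begin, any overlapping pair implies that some
    # adjacent pair overlaps, so comparing neighbours is sufficient.
    for (_, prev_end), (begin, _) in zip(ordered, ordered[1:]):
        if begin <= prev_end:
            return True
    return False


def validate_task_group(tasks: List[List[Range]]) -> bool:
    """Accept a task group only if no port is claimed twice across its tasks.

    `tasks` holds one list of port ranges per task (hypothetical layout).
    """
    all_ranges = [r for task in tasks for r in task]
    return not has_overlap(all_ranges)


if __name__ == "__main__":
    # Two tasks in one group both asking for port 8080.
    print(validate_task_group([[(8080, 8080)], [(8080, 8080)]]))
```

Rejecting such a group at validation time is what keeps the overlapping request from reaching the allocator, where per the commit message it can cause a crash.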
[jira] [Comment Edited] (MESOS-2842) Master crashes when framework changes principal on re-registration
[ https://issues.apache.org/jira/browse/MESOS-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824584#comment-16824584 ]

Gastón Kleiman edited comment on MESOS-2842 at 4/23/19 10:11 PM:
-

{noformat}
commit 89daa08529e85f97acfe02c10b51d8c553a0c225
Author: Andrei Sekretenko
Date: Tue Apr 23 11:32:48 2019 -0700

    Added validation that the principal stays the same on resubscription.

    Review: https://reviews.apache.org/r/70379/

commit 64d00cdfefc3ea5939efe60eaf6a2df8e7e5f4eb
Author: Andrei Sekretenko
Date: Tue Apr 23 11:32:42 2019 -0700

    Added tests to check that framework cannot change its principal.

    Review: https://reviews.apache.org/r/70377/

commit 2707eda4fa0bba58db63a9ec59574ac4de970fdc
Author: Andrei Sekretenko
Date: Tue Apr 23 11:32:31 2019 -0700

    Deduplicated common validation code in Master::subscribe()'s.

    Review: https://reviews.apache.org/r/70408/
{noformat}

was (Author: gkleiman):
{noformat}
commit 89daa08529e85f97acfe02c10b51d8c553a0c225
Author: Andrei Sekretenko
Date: Tue Apr 23 11:32:48 2019 -0700

    Added validation that the principal stays the same on resubscription.

    Review: https://reviews.apache.org/r/70379/

commit 64d00cdfefc3ea5939efe60eaf6a2df8e7e5f4eb
Author: Andrei Sekretenko
Date: Tue Apr 23 11:32:42 2019 -0700

    Added tests to check that framework cannot change its principal.

    Review: https://reviews.apache.org/r/70377/

commit 2707eda4fa0bba58db63a9ec59574ac4de970fdc
Author: Andrei Sekretenko
Date: Tue Apr 23 11:32:31 2019 -0700

    Deduplicated common validation code in Master::subscribe()'s.

    Review: https://reviews.apache.org/r/70408/
{noformat}

> Master crashes when framework changes principal on re-registration
> --
>
> Key: MESOS-2842
> URL: https://issues.apache.org/jira/browse/MESOS-2842
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Andrei Sekretenko
> Priority: Critical
> Labels: foundations, security
> Fix For: 1.9.0
>
> The master should be updated to avoid crashing when a framework re-registers
> with a different principal.
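The fix described in these commits is to reject a resubscription whose principal differs from the one the framework originally used, rather than let the master crash. A minimal sketch of that idea follows; the `Master` class, its `subscribe` method and the error text are all hypothetical and only illustrate the check, not the actual `Master::subscribe()` code.

```python
from typing import Dict, Optional


class Master:
    """Hypothetical, heavily simplified model of the master's framework table."""

    def __init__(self) -> None:
        # framework ID -> principal recorded at first subscription
        self.principals: Dict[str, Optional[str]] = {}

    def subscribe(self, framework_id: str, principal: Optional[str]) -> Optional[str]:
        """Return None on success, or an error string if the call is rejected."""
        if framework_id in self.principals:
            known = self.principals[framework_id]
            if known != principal:
                # Illustrative message only; not the wording Mesos uses.
                return (
                    f"framework {framework_id} resubscribed with principal "
                    f"{principal!r}, expected {known!r}"
                )
            return None
        self.principals[framework_id] = principal
        return None


if __name__ == "__main__":
    master = Master()
    master.subscribe("fw-1", "alice")
    print(master.subscribe("fw-1", "bob"))
```

Returning a validation error here turns what the issue describes as a master crash into a rejected call that the framework can observe.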
[jira] [Created] (MESOS-9738) Add per-framework metrics for offer round trip time.
Meng Zhu created MESOS-9738:
---

Summary: Add per-framework metrics for offer round trip time.
Key: MESOS-9738
URL: https://issues.apache.org/jira/browse/MESOS-9738
Project: Mesos
Issue Type: Bug
Components: allocation
Reporter: Meng Zhu

This would provide more insight into framework responsiveness and help detect worrisome behaviors such as offer timeouts and offer hoarding. One tricky thing is that we need to take Mesos's own queuing delay into consideration.
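One way to picture the proposed metric is a tracker that records when each offer is sent and computes the elapsed time when the framework responds. The sketch below is only an illustration under stated assumptions: `OfferRoundTripTracker` and its methods are hypothetical, timestamps are passed in explicitly, and subtracting a caller-supplied `queuing_delay` is just one possible way to account for the master's own queuing, which the ticket leaves open.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class OfferRoundTripTracker:
    """Hypothetical per-framework recorder of offer round trip times (seconds)."""

    def __init__(self) -> None:
        # offer ID -> (framework ID, time the offer was sent)
        self._sent: Dict[str, Tuple[str, float]] = {}
        # framework ID -> observed round trip times
        self.samples: Dict[str, List[float]] = defaultdict(list)

    def offer_sent(self, offer_id: str, framework_id: str, now: float) -> None:
        self._sent[offer_id] = (framework_id, now)

    def offer_returned(self, offer_id: str, now: float, queuing_delay: float = 0.0) -> float:
        """Record the response to an offer (accept or decline).

        `queuing_delay` is the time the response spent queued inside the
        master (an assumed input); it is subtracted so that the sample
        reflects the framework's own latency.
        """
        framework_id, sent_at = self._sent.pop(offer_id)
        round_trip = (now - sent_at) - queuing_delay
        self.samples[framework_id].append(round_trip)
        return round_trip

    def outstanding(self, framework_id: str) -> int:
        """Offers sent to a framework that have not been answered yet."""
        return sum(1 for fw, _ in self._sent.values() if fw == framework_id)


if __name__ == "__main__":
    tracker = OfferRoundTripTracker()
    tracker.offer_sent("offer-1", "fw-1", now=10.0)
    print(tracker.offer_returned("offer-1", now=12.5, queuing_delay=0.5))
```

A count of long-outstanding offers per framework, as in `outstanding`, is the kind of signal that would surface the offer hoarding mentioned in the ticket.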
[jira] [Created] (MESOS-9737) Avoid allocating memory during fork-exec in subprocess.hpp.
Gilbert Song created MESOS-9737:
---

Summary: Avoid allocating memory during fork-exec in subprocess.hpp.
Key: MESOS-9737
URL: https://issues.apache.org/jira/browse/MESOS-9737
Project: Mesos
Issue Type: Improvement
Components: containerization
Reporter: Gilbert Song

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/posix/subprocess.hpp#L137

`os::strerror` calls during fork-exec should be avoided; otherwise potential issues are not debuggable. Consider using fmtlib for error code conversion.
[jira] [Commented] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824441#comment-16824441 ]

Andrei Sekretenko commented on MESOS-7258:
--

https://reviews.apache.org/r/70534/
https://reviews.apache.org/r/70533/
https://reviews.apache.org/r/70532/
https://reviews.apache.org/r/70531/
https://reviews.apache.org/r/70530/

> Provide scheduler calls to subscribe to additional roles and unsubscribe from
> roles.
>
> Key: MESOS-7258
> URL: https://issues.apache.org/jira/browse/MESOS-7258
> Project: Mesos
> Issue Type: Improvement
> Components: master, scheduler api
> Reporter: Benjamin Mahler
> Assignee: Andrei Sekretenko
> Priority: Major
> Labels: multitenancy, resource-management
>
> The current support for schedulers to subscribe to additional roles or
> unsubscribe from some of their roles requires that the scheduler obtain a new
> subscription with the master, which invalidates the event stream.
> A more lightweight mechanism would be to provide calls for the scheduler to
> subscribe to additional roles or unsubscribe from some roles such that the
> existing event stream remains open and offers to the new roles arrive on the
> existing event stream. E.g.
> SUBSCRIBE_TO_ROLE
> UNSUBSCRIBE_FROM_ROLE
> One open question pertains to the terminology here, whether we would want to
> avoid using "subscribe" in this context. An alternative would be:
> UPDATE_FRAMEWORK_INFO
> which provides a generic mechanism for a framework to perform framework info
> updates without obtaining a new event stream.
> In addition, it would be easier to use if it returned 200 on success and an
> error response if invalid, rather than returning 202.
> *NOTE*: Not specific to this issue, but we need to figure out how to allow
> the framework to not leak reservations, e.g. MESOS-7651.
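The proposal in the quoted issue can be sketched as a call that changes a framework's roles in place, keeps the existing event stream open, and answers synchronously with 200 or an error instead of 202. Everything in the sketch below is hypothetical (the `Subscription` class, the `update_framework_info` function and the 400 status for invalid input); it models the proposed behaviour and is not the Mesos scheduler API.

```python
from typing import Iterable, Set, Tuple


class Subscription:
    """Hypothetical in-memory model of one scheduler's subscription."""

    def __init__(self, roles: Iterable[str]) -> None:
        self.roles: Set[str] = set(roles)
        self.stream_open = True  # the existing event stream


def update_framework_info(sub: Subscription, new_roles: Iterable[str]) -> Tuple[int, str]:
    """Replace the subscribed roles without touching the event stream.

    Returns (200, "OK") on success or (400, reason) if the request is
    invalid, mirroring the synchronous response the issue asks for.
    """
    roles = list(new_roles)
    if not roles or any(not role for role in roles):
        return 400, "roles must be a non-empty list of non-empty names"
    sub.roles = set(roles)
    return 200, "OK"


if __name__ == "__main__":
    sub = Subscription(["analytics"])
    print(update_framework_info(sub, ["analytics", "batch"]))
    print(sub.stream_open)
```

In this model, offers for a newly added role would simply start arriving on the stream that is already open, which is the lightweight behaviour the issue describes.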
[jira] [Created] (MESOS-9736) Error building libgrpc++ on Mac from a source tarball
Benno Evers created MESOS-9736:
--

Summary: Error building libgrpc++ on Mac from a source tarball
Key: MESOS-9736
URL: https://issues.apache.org/jira/browse/MESOS-9736
Project: Mesos
Issue Type: Bug
Reporter: Benno Evers

The following error was reported by [~tillt] trying to build the `1.8.0-rc2` release candidate on a macOS machine:

{noformat}
make[2]: *** No rule to make target `../3rdparty/grpc-1.10.0/libs/opt/libgrpc++.a', needed by `libmesos.la'. Stop.
{noformat}

Looking into the issue, the following theory was offered for the cause of the problem:

{quote}
I have the hunch that this isnt an macOS thing but instead a problem in our build setup which does (not intentionally) try to do certain things in parallel.
{quote}