[jira] [Commented] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters

2019-04-23 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824829#comment-16824829
 ] 

Jorge Machado commented on MESOS-9740:
--

We are running Mesos 1.7.1 and have one slave on Ubuntu 18.04 which has the 
master version compiled. We are only using the Mesos containerizer, not the 
Docker one. It works fine for us.

> Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents 
> from reregistering with 1.8+ masters
> ---
>
> Key: MESOS-9740
> URL: https://issues.apache.org/jira/browse/MESOS-9740
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Joseph Wu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: foundations, mesosphere
>
> As part of MESOS-6874, the master now validates protobuf unions passed as 
> part of an {{ExecutorInfo::ContainerInfo}}.  This prevents a task from 
> specifying, for example, a {{ContainerInfo::MESOS}}, but filling out the 
> {{docker}} field (which is then ignored by the agent).
> However, if a task was already launched with an invalid protobuf union, the 
> same validation will happen when the agent tries to reregister with the 
> master.  In this case, if the master is upgraded to validate protobuf unions, 
> the agent reregistration will be rejected.
> {code}
> master.cpp:7201] Dropping re-registration of agent at 
> slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: 
> Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the 
> field `docker` set.
> {code}
> This bug was found when upgrading a 1.7.x test cluster to 1.8.0.  When 
> MESOS-6874 was committed, I had assumed the invalid protobufs would be rare.  
> However, on the test cluster, 13/17 agents had at least one invalid 
> ContainerInfo when reregistering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9741) Test `SlaveRecoveryTest.AgentReconfigurationWithRunningTask` is flaky.

2019-04-23 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9741:


 Summary: Test 
`SlaveRecoveryTest.AgentReconfigurationWithRunningTask` is flaky.
 Key: MESOS-9741
 URL: https://issues.apache.org/jira/browse/MESOS-9741
 Project: Mesos
  Issue Type: Bug
 Environment: Ubuntu 14.04, SSL-enabled
Reporter: Greg Mann


Observed on internal CI, Ubuntu 14.04, SSL-enabled





[jira] [Created] (MESOS-9740) Invalid protobuf unions in ExecutorInfo::ContainerInfo will prevent agents from reregistering with 1.8+ masters

2019-04-23 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9740:


 Summary: Invalid protobuf unions in ExecutorInfo::ContainerInfo 
will prevent agents from reregistering with 1.8+ masters
 Key: MESOS-9740
 URL: https://issues.apache.org/jira/browse/MESOS-9740
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Joseph Wu
Assignee: Benno Evers


As part of MESOS-6874, the master now validates protobuf unions passed as part 
of an {{ExecutorInfo::ContainerInfo}}.  This prevents a task from specifying, 
for example, a {{ContainerInfo::MESOS}}, but filling out the {{docker}} field 
(which is then ignored by the agent).

However, if a task was already launched with an invalid protobuf union, the 
same validation will happen when the agent tries to reregister with the master. 
 In this case, if the master is upgraded to validate protobuf unions, the agent 
reregistration will be rejected.

{code}
master.cpp:7201] Dropping re-registration of agent at 
slave(1)@172.31.47.126:5051 because it sent an invalid re-registration: 
Protobuf union `mesos.ContainerInfo` with `Type == MESOS` should not have the 
field `docker` set.
{code}

This bug was found when upgrading a 1.7.x test cluster to 1.8.0.  When 
MESOS-6874 was committed, I had assumed the invalid protobufs would be rare.  
However, on the test cluster, 13/17 agents had at least one invalid 
ContainerInfo when reregistering.
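
To make the failure mode concrete, here is a minimal sketch of the kind of union check described above. It does not use the real generated protobuf classes or the actual master validation code: `ContainerInfo`, `DockerInfo`, `MesosInfo` and `validateUnion` below are simplified, hypothetical stand-ins, and only the error text for the `MESOS` case is taken from the log line quoted above.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-ins for the protobuf messages. The real
// `mesos::ContainerInfo` is generated code with more fields.
struct DockerInfo { std::string image; };
struct MesosInfo { std::string image; };

struct ContainerInfo {
  enum class Type { DOCKER, MESOS };
  Type type;
  std::optional<DockerInfo> docker;  // Should only be set for Type::DOCKER.
  std::optional<MesosInfo> mesos;    // Should only be set for Type::MESOS.
};

// Returns an error message if a member that does not match `type` is set,
// or `std::nullopt` if the union is consistent.
std::optional<std::string> validateUnion(const ContainerInfo& info) {
  if (info.type == ContainerInfo::Type::MESOS && info.docker.has_value()) {
    return std::string(
        "Protobuf union `mesos.ContainerInfo` with `Type == MESOS` "
        "should not have the field `docker` set.");
  }
  if (info.type == ContainerInfo::Type::DOCKER && info.mesos.has_value()) {
    return std::string(
        "Protobuf union `mesos.ContainerInfo` with `Type == DOCKER` "
        "should not have the field `mesos` set.");
  }
  return std::nullopt;
}
```

A task launched before the upgrade with `type == MESOS` and a stray `docker` member would fail this check during reregistration, even though the agent had been ignoring the extra field all along.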





[jira] [Created] (MESOS-9739) When recovered agent marked gone, retain agent ID

2019-04-23 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9739:


 Summary: When recovered agent marked gone, retain agent ID
 Key: MESOS-9739
 URL: https://issues.apache.org/jira/browse/MESOS-9739
 Project: Mesos
  Issue Type: Improvement
Reporter: Greg Mann


When a recovered agent is marked gone, we could retain its agent ID so that if 
it attempts to reregister, we could send task status updates for its tasks.





[jira] [Comment Edited] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources

2019-04-23 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823580#comment-16823580
 ] 

Greg Mann edited comment on MESOS-9619 at 4/23/19 11:11 PM:


1.8.x branch:
{code}
commit 356ff6e2805657e7df66896895728e2aabac2029
Author: Greg Mann 
Date:   Fri Apr 19 00:34:15 2019 -0700

Enabled construction of `ResourceQuantities` from `Resources`.

This patch adds a new static method which enables the
construction of `ResourceQuantities` from `Resources`.
Namely, this permits the inclusion of sets and ranges in the
input resources used to construct `ResourceQuantities`.

Review: https://reviews.apache.org/r/70507
{code}
{code}
commit ac6ab14f93ac19226b744c0e432c279a2e0ff2f7
Author: Greg Mann 
Date:   Sat Apr 20 11:48:39 2019 -0700

Ensured that task groups do not specify overlapping ranges or sets.

This patch adds validation to the master to ensure that task
groups do not include resources with overlapping set- or
range-valued resources, as this can crash the allocator.

Review: https://reviews.apache.org/r/70472/
{code}
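
For readers who have not looked at the patch, the check in the second commit can be pictured roughly as follows. This is only a sketch under simplifying assumptions: `Range`, `overlaps` and `hasOverlap` are hypothetical names, ranges are treated as inclusive, and the real validation works on `Resources` objects rather than plain vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one range of a range-valued resource such as
// `ports`. Both bounds are inclusive.
struct Range {
  std::uint64_t begin;
  std::uint64_t end;
};

// Two inclusive ranges overlap if each one starts before the other ends.
bool overlaps(const Range& a, const Range& b) {
  assert(a.begin <= a.end && b.begin <= b.end);
  return a.begin <= b.end && b.begin <= a.end;
}

// Each inner vector holds the ranges requested by one task in the group.
// Returns true if any two different tasks request overlapping ranges.
bool hasOverlap(const std::vector<std::vector<Range>>& tasks) {
  for (std::size_t i = 0; i < tasks.size(); ++i) {
    for (std::size_t j = i + 1; j < tasks.size(); ++j) {
      for (const Range& a : tasks[i]) {
        for (const Range& b : tasks[j]) {
          if (overlaps(a, b)) {
            return true;
          }
        }
      }
    }
  }
  return false;
}
```

The pairwise comparison is quadratic in the number of tasks, which seems acceptable for a sketch since task groups are normally small. A group in which two tasks both ask for port 31005 would be rejected here instead of reaching the allocator.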


was (Author: greggomann):
Backports forthcoming

> Mesos Master Crashes with Launch Group when using Port Resources
> 
>
> Key: MESOS-9619
> URL: https://issues.apache.org/jira/browse/MESOS-9619
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.4.3, 1.7.1
> Environment:  
> Testing in both Mesos 1.4.3 and Mesos 1.7.1
>Reporter: Nimi Wariboko Jr.
>Assignee: Greg Mann
>Priority: Critical
>  Labels: foundations, master, mesosphere
> Attachments: mesos-master.log, mesos-master.snippet.log
>
>
> Original Issue: 
> [https://lists.apache.org/thread.html/979c8799d128ad0c436b53f2788568212f97ccf324933524f1b4d189@%3Cuser.mesos.apache.org%3E]
>  When the ports resource is removed, Mesos functions normally (I'm able to 
> launch the task as many times as I like, although it always fails).
> Attached is a snippet of the mesos master log from OFFER to crash.





[jira] [Comment Edited] (MESOS-2842) Master crashes when framework changes principal on re-registration

2019-04-23 Thread JIRA


[ 
https://issues.apache.org/jira/browse/MESOS-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824584#comment-16824584
 ] 

Gastón Kleiman edited comment on MESOS-2842 at 4/23/19 10:11 PM:
-

{noformat}
commit 89daa08529e85f97acfe02c10b51d8c553a0c225
Author: Andrei Sekretenko 
Date:   Tue Apr 23 11:32:48 2019 -0700

Added validation that the principal stays the same on resubscription.

Review: https://reviews.apache.org/r/70379/

commit 64d00cdfefc3ea5939efe60eaf6a2df8e7e5f4eb
Author: Andrei Sekretenko 
Date:   Tue Apr 23 11:32:42 2019 -0700

Added tests to check that framework cannot change its principal.

Review: https://reviews.apache.org/r/70377/

commit 2707eda4fa0bba58db63a9ec59574ac4de970fdc
Author: Andrei Sekretenko 
Date:   Tue Apr 23 11:32:31 2019 -0700

Deduplicated common validation code in Master::subscribe()'s.

Review: https://reviews.apache.org/r/70408/
{noformat}


was (Author: gkleiman):
{noformat}
commit 89daa08529e85f97acfe02c10b51d8c553a0c225
Author: Andrei Sekretenko 
Date:   Tue Apr 23 11:32:48 2019 -0700

Added validation that the principal stays the same on resubscription.

Review: https://reviews.apache.org/r/70379/

commit 64d00cdfefc3ea5939efe60eaf6a2df8e7e5f4eb
Author: Andrei Sekretenko 
Date:   Tue Apr 23 11:32:42 2019 -0700

Added tests to check that framework cannot change its principal.

Review: https://reviews.apache.org/r/70377/

commit 2707eda4fa0bba58db63a9ec59574ac4de970fdc
Author: Andrei Sekretenko 
Date:   Tue Apr 23 11:32:31 2019 -0700

Deduplicated common validation code in Master::subscribe()'s.

Review: https://reviews.apache.org/r/70408/
{noformat}

> Master crashes when framework changes principal on re-registration
> --
>
> Key: MESOS-2842
> URL: https://issues.apache.org/jira/browse/MESOS-2842
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Sekretenko
>Priority: Critical
>  Labels: foundations, security
> Fix For: 1.9.0
>
>
> The master should be updated to avoid crashing when a framework re-registers 
> with a different principal.





[jira] [Created] (MESOS-9738) Add per-framework metrics for offer round trip time.

2019-04-23 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9738:
---

 Summary: Add per-framework metrics for offer round trip time.
 Key: MESOS-9738
 URL: https://issues.apache.org/jira/browse/MESOS-9738
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu


This would provide more insight into framework responsiveness and help detect 
worrisome behaviors such as offer timeouts and offer hoarding.

One tricky thing is that we need to take Mesos's own queuing delay into 
consideration.
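
One possible way to separate the two delays is sketched below. It assumes the master could record three timestamps per offer. The struct, its field names and both helper functions are hypothetical and do not correspond to existing Mesos metrics or code.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical timestamps recorded by the master for a single offer.
struct OfferTimestamps {
  Clock::time_point sent;       // Offer was sent to the framework.
  Clock::time_point enqueued;   // Framework's response entered the master's queue.
  Clock::time_point processed;  // Master started handling the response.
};

// Round trip attributable to the framework: it stops at the moment the
// response is enqueued, so the master's own queuing delay is excluded.
Clock::duration frameworkRoundTrip(const OfferTimestamps& t) {
  assert(t.sent <= t.enqueued);
  return t.enqueued - t.sent;
}

// Time the response spent waiting inside the master, reported separately
// so that a busy master is not mistaken for a slow framework.
Clock::duration queuingDelay(const OfferTimestamps& t) {
  assert(t.enqueued <= t.processed);
  return t.processed - t.enqueued;
}
```

With these definitions, an offer that is answered after 3 seconds but handled only 2 seconds later would count as a 3 second round trip for the framework and a 2 second queuing delay for the master.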





[jira] [Created] (MESOS-9737) Avoid allocating memory during fork-exec in subprocess.hpp.

2019-04-23 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9737:
---

 Summary: Avoid allocating memory during fork-exec in 
subprocess.hpp.
 Key: MESOS-9737
 URL: https://issues.apache.org/jira/browse/MESOS-9737
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song


https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/posix/subprocess.hpp#L137

`os::strerror` calls during fork-exec should be avoided; otherwise potential 
issues are not debuggable.
Consider using fmtlib for error code conversion.
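
As an illustration of what allocation-free error reporting could look like, the sketch below formats an error code into a caller-provided buffer by hand. Note that it swaps in manual integer formatting rather than fmtlib, and that `formatInt` is a hypothetical helper, not something that exists in libprocess or stout.

```cpp
#include <cassert>
#include <cstddef>

// Writes the decimal form of `value` into `buffer` (NUL-terminated) using
// only stack memory, so it performs no heap allocation. Returns the number
// of characters written, excluding the terminator, or 0 if `buffer` is null
// or too small.
std::size_t formatInt(int value, char* buffer, std::size_t size) {
  if (buffer == nullptr) {
    return 0;
  }

  // Three decimal digits per byte is a safe upper bound, plus the sign.
  char digits[3 * sizeof(int) + 2];
  std::size_t n = 0;

  // Work in unsigned space so that the most negative int does not overflow.
  unsigned int magnitude = value < 0
      ? 0u - static_cast<unsigned int>(value)
      : static_cast<unsigned int>(value);

  // Collect digits in reverse order, least significant first.
  do {
    digits[n++] = static_cast<char>('0' + magnitude % 10);
    magnitude /= 10;
  } while (magnitude != 0);

  if (value < 0) {
    digits[n++] = '-';
  }

  if (n + 1 > size) {
    return 0;
  }

  // Reverse into the output buffer and terminate it.
  for (std::size_t i = 0; i < n; ++i) {
    buffer[i] = digits[n - 1 - i];
  }
  buffer[n] = '\0';
  return n;
}
```

In the child process, the resulting characters could then be passed to `write(2)` on a stack buffer, which avoids building a `std::string` between fork and exec.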





[jira] [Commented] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.

2019-04-23 Thread Andrei Sekretenko (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824441#comment-16824441
 ] 

Andrei Sekretenko commented on MESOS-7258:
--

https://reviews.apache.org/r/70534/

https://reviews.apache.org/r/70533/

https://reviews.apache.org/r/70532/

https://reviews.apache.org/r/70531/

https://reviews.apache.org/r/70530/

> Provide scheduler calls to subscribe to additional roles and unsubscribe from 
> roles.
> 
>
> Key: MESOS-7258
> URL: https://issues.apache.org/jira/browse/MESOS-7258
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, scheduler api
>Reporter: Benjamin Mahler
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: multitenancy, resource-management
>
> The current support for schedulers to subscribe to additional roles or 
> unsubscribe from some of their roles requires that the scheduler obtain a new 
> subscription with the master which invalidates the event stream.
> A more lightweight mechanism would be to provide calls for the scheduler to 
> subscribe to additional roles or unsubscribe from some roles such that the 
> existing event stream remains open and offers to the new roles arrive on the 
> existing event stream. E.g.
> SUBSCRIBE_TO_ROLE
>  UNSUBSCRIBE_FROM_ROLE
> One open question pertains to the terminology here, whether we would want to 
> avoid using "subscribe" in this context. An alternative would be:
> UPDATE_FRAMEWORK_INFO
> Which provides a generic mechanism for a framework to perform framework info 
> updates without obtaining a new event stream.
> In addition, it would be easier to use if it returned 200 on success and an 
> error response if invalid, rather than returning 202.
> *NOTE*: Not specific to this issue, but we need to figure out how to allow 
> the framework to not leak reservations, e.g. MESOS-7651.





[jira] [Created] (MESOS-9736) Error building libgrpc++ on Mac from a source tarball

2019-04-23 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9736:
--

 Summary: Error building libgrpc++ on Mac from a source tarball
 Key: MESOS-9736
 URL: https://issues.apache.org/jira/browse/MESOS-9736
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The following error was reported by [~tillt] trying to build the `1.8.0-rc2` 
release candidate on a macOS machine:

{noformat}
make[2]: *** No rule to make target 
`../3rdparty/grpc-1.10.0/libs/opt/libgrpc++.a', needed by `libmesos.la'.  Stop.
{noformat}

Looking into the issue, the following theory was offered for the cause of 
the problem:
{quote}
I have the hunch that this isn't a macOS thing but instead a problem in our 
build setup which does (not intentionally) try to do certain things in parallel.
{quote}


