Re: Review Request 70907: Added recovery of agent drain information.

2019-06-26 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70907/
---

(Updated June 26, 2019, 2:35 p.m.)


Review request for mesos, Greg Mann and Joseph Wu.


Changes
---

Rebase


Bugs: MESOS-9822
https://issues.apache.org/jira/browse/MESOS-9822


Repository: mesos


Description
---

With this patch the agent will, after executor reregistration finished,
replay any active drain information so remaining tasks are drained as
well. We need to wait until executors had a chance to register so they
are not terminated should we try to send kill task request before the
executor has registered.


Diffs (updated)
-

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
  src/slave/state.hpp 45836e5805a5038c0b3b5563ea995c3fa90ab808 
  src/slave/state.cpp e0a850e03fb0d726bf7b02bbd6b298d81afae399 
  src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 


Diff: https://reviews.apache.org/r/70907/diff/4/

Changes: https://reviews.apache.org/r/70907/diff/3-4/


Testing
---

`make check`


Thanks,

Benjamin Bannier



Re: Review Request 70936: Adjusted task status updates during draining.

2019-06-26 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/
---

(Updated June 26, 2019, 2:35 p.m.)


Review request for mesos, Greg Mann and Joseph Wu.


Changes
---

Rebase; address comments from Greg


Bugs: MESOS-9823
https://issues.apache.org/jira/browse/MESOS-9823


Repository: mesos


Description
---

When a task is reported as killed to the agent during active agent
draining we now decorate the reported status with
`REASON_AGENT_DRAINING` if no other status was previously present. If
the draining marks the agent as gone via the `mark_gone` draining flag
we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
state.

This patch leaves some ambiguity in what triggered the kill since the
agent-executor protocol does not transport reasons; instead
the reason is here only inferred after the killed task has
been observed. This should usually be fine since due to the inherit race
between e.g., any user- and drain-triggered kill a user cannot
distinguish racy reasons.


Diffs (updated)
-

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
  src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 


Diff: https://reviews.apache.org/r/70936/diff/2/

Changes: https://reviews.apache.org/r/70936/diff/1-2/


Testing
---

`make check`


Thanks,

Benjamin Bannier



Re: Review Request 70748: Changed semantics of TLS certificate verification flags.

2019-06-26 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70748/
---

(Updated June 26, 2019, 10:26 a.m.)


Review request for mesos, Alexander Rukletsov, Jan-Philip Gehrcke, Joseph Wu, 
and Till Toenshoff.


Changes
---

Adjusted info messages to new semantics.


Bugs: MESOS-9810
https://issues.apache.org/jira/browse/MESOS-9810


Repository: mesos


Description
---

This commit slightly updates the semants of the
`LIBPROCESS_SSL_VERIFY_CERT` and `LIBPROCESS_SSL_REQUIRE_CERT`
environment variables. The former now only applies to connections
in client mode and the latter now only applies to connections in
server mode.

In particular, in TLS server mode we now *only* verify client
certificates when `LIBPROCESS_SSL_REQUIRE_CERT` is set to `true`,
regardless of the value of `LIBPROCESS_SSL_VERIFY_CERT`.

In addtion, when in SSL client mode and  `LIBPROCESS_SSL_VERIFY_CERT`
has been set to `true`, enforce that the server actually presents a
certificate that can be verified. Note that this is expected to be
not a behavioural change in practice, since the TLS specification
already states that a server MUST always send a certificate unless an
anonymous cipher is used, and most TLS ciphersuites are configured to
exclude anonymous ciphers.


Diffs (updated)
-

  3rdparty/libprocess/src/openssl.hpp 17bec246e516261f8d772f1647c17f092fae82d1 
  3rdparty/libprocess/src/openssl.cpp e7dbd67913fa8e7fbbf60dee428e7e38895f86ce 
  3rdparty/libprocess/src/posix/libevent/libevent_ssl_socket.cpp 
29a1bf71c1df9d80370455a6269ecea0ec4193b0 
  3rdparty/libprocess/src/tests/ssl_tests.cpp 
6b8496aeeed79ae1bd39d7013f4f403b248fdd4c 


Diff: https://reviews.apache.org/r/70748/diff/8/

Changes: https://reviews.apache.org/r/70748/diff/7-8/


Testing (updated)
---

See end of this chain.


Thanks,

Benno Evers



Re: Review Request 70936: Adjusted task status updates during draining.

2019-06-26 Thread Benjamin Bannier


> On June 25, 2019, 11:48 p.m., Greg Mann wrote:
> > src/tests/slave_tests.cpp
> > Lines 12109 (patched)
> > 
> >
> > Maybe we should unconditionally set the reason to 
> > REASON_AGENT_DRAINING? It doesn't make sense to me that the reason received 
> > by the scheduler will depend on what stage the task launch was in when the 
> > drain message was received?

I was wondering about this as well. The crucial bit to communicate here is that 
the task terminated unexpectedly which we would provide in any case.

I feel providing frameworks slightly more granular information on where in the 
task's lifecycle this happened might be interesting as e.g., a task killed 
while running could have had application-level side effects while one killed 
during launch wouldn't. Ultimately the framework still wouldn't know where in 
its application-level lifecycle a running task was killed (and we wouldn't be 
able to provide that) and the framework would need to deal with this outside of 
Mesos, but I also do not see a reason to not make the lower level detail on 
whether the task was launched and of which Mesos is aware visible to the 
framework.

What do you think?


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216143
---


On June 26, 2019, 2:35 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> ---
> 
> (Updated June 26, 2019, 2:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
> https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 70897: Supported suppressedRoles in updateFramework() in V0 Java bindings.

2019-06-26 Thread Andrei Sekretenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70897/
---

(Updated June 26, 2019, 3:32 p.m.)


Review request for mesos and Benjamin Mahler.


Bugs: MESOS-9793
https://issues.apache.org/jira/browse/MESOS-9793


Repository: mesos


Description
---

Supported suppressedRoles in updateFramework() in V0 Java bindings.


Diffs (updated)
-

  src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp 
a21aca23cbef27b3c54bf1ae5834cbb457608130 
  src/java/src/org/apache/mesos/MesosSchedulerDriver.java 
55ebc8772d9183286c908b4dba342109f28394f4 
  src/java/src/org/apache/mesos/SchedulerDriver.java 
ee5a9e24956299d84ff03a0fc5a22e641470a03f 


Diff: https://reviews.apache.org/r/70897/diff/3/

Changes: https://reviews.apache.org/r/70897/diff/2-3/


Testing
---

`./bin/mesos-tests.sh --gtest_filter="ExamplesTest.JavaFramework" 
--gtest_break_on_failure --gtest_repeat=10` with a patch from 
https://reviews.apache.org/r/70815/


Thanks,

Andrei Sekretenko



Re: Review Request 70894: Provided ability to change suppressed roles via V0 updateFramework().

2019-06-26 Thread Andrei Sekretenko


> On June 25, 2019, 7:13 p.m., Benjamin Mahler wrote:
> > include/mesos/scheduler.hpp
> > Lines 290-293 (patched)
> > 
> >
> > // NOTE: if the framework is not connected to the master, the suppressed
> > // roles set stored by the driver will be cleared and the next 
> > re-registration
> > // will send an empty suppressed roles list.
> 
> Andrei Sekretenko wrote:
> Thanks! Fully agree that I need a better wording here.
> 
> I'm afraid these two phrases might mislead someone someday:
> - "next re-registration": it might be not clear to the reader if the 
> ongoing re-registrtation (SUBSCRIBE sent, but no FrameworkReregistered 
> received) is the "next" one or not.
> - "will send an empty list": if, for example, 
> updateFramework()/suppress() is called immediately after reviveOffers(), the 
> set which the driver will set will be what the caller wants - but it will not 
> be empty.
> 
> So I've tried to reword the comment (and remove implementation details) 
> without using these phrases.

typo: s/the set which the driver will **set**/the set which the driver will 
**send**/


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70894/#review216124
---


On June 26, 2019, 3:28 p.m., Andrei Sekretenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70894/
> ---
> 
> (Updated June 26, 2019, 3:28 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9793
> https://issues.apache.org/jira/browse/MESOS-9793
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch adds a list of suppressed roles to the arguments of
> scheduler driver's updateFramework() method to make it possible
> for V0 frameworks to selectively suppress/revive offers.
> 
> To keep re-registration conststent with updateFramework(), the required
> set of suppressed roles is now stored by the driver and used when it
> re-registers.
> 
> To keep reviveOffers() consistent with re-registration, reviveOffers()
> now clears the stored set and, when issued in a disconnected state of
> the driver, forces it to perform UPDATE_FRAMEWORK call upon
> re-connection.
> 
> NOTE: suppressOffers() has never been consistent with re-registration
> in this regard. Making it constent with re-registration might affect
> the behavior of existing frameworks, and will be performed in a separate
> patch.
> 
> 
> Diffs
> -
> 
>   include/mesos/scheduler.hpp 8c6774885ceb3b0644c32842a17fba39e2c3e472 
>   src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp 
> a21aca23cbef27b3c54bf1ae5834cbb457608130 
>   src/java/src/org/apache/mesos/SchedulerDriver.java 
> ee5a9e24956299d84ff03a0fc5a22e641470a03f 
>   src/python/interface/src/mesos/interface/__init__.py 
> f9642a0452e3161f3fd1b6a5d7ce0658d08c0b87 
>   src/sched/sched.cpp e6cc534264e3ff4edaa703a660afb2f896388802 
>   src/tests/master/update_framework_tests.cpp 
> 0d466e286adfd829bbe72cff9202860f7fcb043f 
> 
> 
> Diff: https://reviews.apache.org/r/70894/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> `./bin/mesos-tests.sh --gtest_filter="UpdateFrameworkV0*" 
> --gtest_break_on_failure --gtest_repeat=1000`
> 
> +a new test from the depending patch
> 
> 
> Thanks,
> 
> Andrei Sekretenko
> 
>



Re: Review Request 70884: Added optional 'host' string member to UPID.

2019-06-26 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70884/
---

(Updated June 26, 2019, 1:37 p.m.)


Review request for mesos, Benjamin Mahler, Joseph Wu, and Till Toenshoff.


Changes
---

Added benchmark results.


Bugs: MESOS-9858
https://issues.apache.org/jira/browse/MESOS-9858


Repository: mesos


Description
---

This allows client code to access the original hostname
that was used to specify a libprocess address.


Diffs
-

  3rdparty/libprocess/include/process/pid.hpp 
9f09ab46fa3ceaeac09b0fbf9f532728c4ed2d7a 
  3rdparty/libprocess/src/pid.cpp fdc61b5ab6c75b33ce33de7edd11e9302550f300 
  3rdparty/libprocess/src/process.cpp 799666f03d6a78708aa9336c2dd04bc9b5023aa0 


Diff: https://reviews.apache.org/r/70884/diff/3/


Testing (updated)
---

Results from running the `Process_BENCHMARK_ThroughputPerformance` with 
`--gtest_repeat=10`. There seems to be a slight degradation of performance on 
average, but the benchmark runtime seems to be dominated by other factors, as 
indicated by the large variance and presence of outliers on both sides.


# Master
2519083.849973
2401163.451671
2322734.358626
2285763.292828
1883268.801707
2179436.004374
2071825.301714
2105588.897964
2067228.706673


# Including UPID changes
2305142.571762
2194902.761484
2139113.533129
2024036.049603
1765327.993699
2172592.534808
2053217.096855
2103817.232509
1480080.596818
1808977.015807


Thanks,

Benno Evers



Re: Review Request 70895: Added a test for suppressing roles via V0 updateFramework().

2019-06-26 Thread Andrei Sekretenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70895/
---

(Updated June 26, 2019, 3:33 p.m.)


Review request for mesos and Benjamin Mahler.


Summary (updated)
-

Added a test for suppressing roles via V0 updateFramework().


Bugs: MESOS-9793
https://issues.apache.org/jira/browse/MESOS-9793


Repository: mesos


Description (updated)
---

Added a test for suppressing roles via V0 updateFramework().


Diffs (updated)
-

  src/tests/master/update_framework_tests.cpp 
0d466e286adfd829bbe72cff9202860f7fcb043f 


Diff: https://reviews.apache.org/r/70895/diff/3/

Changes: https://reviews.apache.org/r/70895/diff/2-3/


Testing
---

`./bin/mesos-tests.sh --gtest_filter="UpdateFrameworkV0*SuppressedRoles" 
--gtest_break_on_failure --gtest_repeat=1000`


Thanks,

Andrei Sekretenko



Re: Review Request 70895: Added a test for supperssing roles via V0 updateFramework().

2019-06-26 Thread Andrei Sekretenko


> On June 25, 2019, 7:21 p.m., Benjamin Mahler wrote:
> > src/tests/master/update_framework_tests.cpp
> > Lines 992-996 (patched)
> > 
> >
> > The framework update already occurred, and it may be in the allocator's 
> > queue, but that's fine. Why is this settling?

You are right, we don't need it now - I removed it from the patch.

Is it true that the update will always be performed atomically in the allocator 
(i.e. no deferred calls will be initiated by the allocator to complete the 
update)?

If this changes, adding the slave will race against the update... but most 
likely, we will have a whole bunch of other problems then;)


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70895/#review216126
---


On June 25, 2019, 3:30 p.m., Andrei Sekretenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70895/
> ---
> 
> (Updated June 25, 2019, 3:30 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9793
> https://issues.apache.org/jira/browse/MESOS-9793
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a test for supperssing roles via V0 updateFramework().
> 
> 
> Diffs
> -
> 
>   src/tests/master/update_framework_tests.cpp 
> 0d466e286adfd829bbe72cff9202860f7fcb043f 
> 
> 
> Diff: https://reviews.apache.org/r/70895/diff/2/
> 
> 
> Testing
> ---
> 
> `./bin/mesos-tests.sh --gtest_filter="UpdateFrameworkV0*SuppressedRoles" 
> --gtest_break_on_failure --gtest_repeat=1000`
> 
> 
> Thanks,
> 
> Andrei Sekretenko
> 
>



Re: Review Request 70894: Provided ability to change suppressed roles via V0 updateFramework().

2019-06-26 Thread Andrei Sekretenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70894/
---

(Updated June 26, 2019, 3:28 p.m.)


Review request for mesos and Benjamin Mahler.


Bugs: MESOS-9793
https://issues.apache.org/jira/browse/MESOS-9793


Repository: mesos


Description
---

This patch adds a list of suppressed roles to the arguments of
scheduler driver's updateFramework() method to make it possible
for V0 frameworks to selectively suppress/revive offers.

To keep re-registration conststent with updateFramework(), the required
set of suppressed roles is now stored by the driver and used when it
re-registers.

To keep reviveOffers() consistent with re-registration, reviveOffers()
now clears the stored set and, when issued in a disconnected state of
the driver, forces it to perform UPDATE_FRAMEWORK call upon
re-connection.

NOTE: suppressOffers() has never been consistent with re-registration
in this regard. Making it constent with re-registration might affect
the behavior of existing frameworks, and will be performed in a separate
patch.


Diffs (updated)
-

  include/mesos/scheduler.hpp 8c6774885ceb3b0644c32842a17fba39e2c3e472 
  src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp 
a21aca23cbef27b3c54bf1ae5834cbb457608130 
  src/java/src/org/apache/mesos/SchedulerDriver.java 
ee5a9e24956299d84ff03a0fc5a22e641470a03f 
  src/python/interface/src/mesos/interface/__init__.py 
f9642a0452e3161f3fd1b6a5d7ce0658d08c0b87 
  src/sched/sched.cpp e6cc534264e3ff4edaa703a660afb2f896388802 
  src/tests/master/update_framework_tests.cpp 
0d466e286adfd829bbe72cff9202860f7fcb043f 


Diff: https://reviews.apache.org/r/70894/diff/3/

Changes: https://reviews.apache.org/r/70894/diff/2-3/


Testing
---

make check
`./bin/mesos-tests.sh --gtest_filter="UpdateFrameworkV0*" 
--gtest_break_on_failure --gtest_repeat=1000`

+a new test from the depending patch


Thanks,

Andrei Sekretenko



Re: Review Request 70894: Provided ability to change suppressed roles via V0 updateFramework().

2019-06-26 Thread Andrei Sekretenko


> On June 25, 2019, 7:13 p.m., Benjamin Mahler wrote:
> > include/mesos/scheduler.hpp
> > Lines 290-293 (patched)
> > 
> >
> > // NOTE: if the framework is not connected to the master, the suppressed
> > // roles set stored by the driver will be cleared and the next 
> > re-registration
> > // will send an empty suppressed roles list.

Thanks! Fully agree that I need a better wording here.

I'm afraid these two phrases might mislead someone someday:
- "next re-registration": it might be not clear to the reader if the ongoing 
re-registrtation (SUBSCRIBE sent, but no FrameworkReregistered received) is the 
"next" one or not.
- "will send an empty list": if, for example, updateFramework()/suppress() is 
called immediately after reviveOffers(), the set which the driver will set will 
be what the caller wants - but it will not be empty.

So I've tried to reword the comment (and remove implementation details) without 
using these phrases.


> On June 25, 2019, 7:13 p.m., Benjamin Mahler wrote:
> > include/mesos/scheduler.hpp
> > Lines 352 (patched)
> > 
> >
> > Why the default here?

My impression at the moment of writing that was that updateFramework() will not 
be frequently called with non-empty suppressedRoles - however, now this 
impression looks totally wrong.

I've removed this default. 
Combination of default and virtual method might be confusing, and, more 
importantly, this default does not help readers be aware of the fact that 
updateFramework() always sets suppressed roles.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70894/#review216124
---


On June 25, 2019, 3:28 p.m., Andrei Sekretenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70894/
> ---
> 
> (Updated June 25, 2019, 3:28 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9793
> https://issues.apache.org/jira/browse/MESOS-9793
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch adds a list of suppressed roles to the arguments of
> scheduler driver's updateFramework() method to make it possible
> for V0 frameworks to selectively suppress/revive offers.
> 
> To keep re-registration conststent with updateFramework(), the required
> set of suppressed roles is now stored by the driver and used when it
> re-registers.
> 
> To keep reviveOffers() consistent with re-registration, reviveOffers()
> now clears the stored set and, when issued in a disconnected state of
> the driver, forces it to perform UPDATE_FRAMEWORK call upon
> re-connection.
> 
> NOTE: suppressOffers() has never been consistent with re-registration
> in this regard. Making it constent with re-registration might affect
> the behavior of existing frameworks, and will be performed in a separate
> patch.
> 
> 
> Diffs
> -
> 
>   include/mesos/scheduler.hpp 8c6774885ceb3b0644c32842a17fba39e2c3e472 
>   src/java/src/org/apache/mesos/SchedulerDriver.java 
> ee5a9e24956299d84ff03a0fc5a22e641470a03f 
>   src/python/interface/src/mesos/interface/__init__.py 
> f9642a0452e3161f3fd1b6a5d7ce0658d08c0b87 
>   src/sched/sched.cpp e6cc534264e3ff4edaa703a660afb2f896388802 
> 
> 
> Diff: https://reviews.apache.org/r/70894/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> `./bin/mesos-tests.sh --gtest_filter="UpdateFrameworkV0*" 
> --gtest_break_on_failure --gtest_repeat=1000`
> 
> +a new test from the depending patch
> 
> 
> Thanks,
> 
> Andrei Sekretenko
> 
>



Re: Review Request 70949: Added master minimum capability `QUOTA_V2`.

2019-06-26 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70949/#review216156
---



Ditto comment below w.r.t. the description, can you make the description more 
precise about the overall approach with this minimum capability and the 
registry?


docs/downgrades.md
Lines 57-77 (patched)


Per offline discussion, can this be more precise about the overall approach 
for this minimum capability in the registry?



include/mesos/mesos.proto
Lines 928 (patched)


can handle the new quota API, which supports setting limits (introduced in 
Mesos 1.9)



src/master/constants.cpp
Lines 28 (patched)


Is the master v2 quota capable yet in this patch?

I guess the agent doesn't care about this capability, but it seems nicer if 
we add it after the master can actually handle v2 style quota requests?


- Benjamin Mahler


On June 26, 2019, 5:58 a.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70949/
> ---
> 
> (Updated June 26, 2019, 5:58 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9601
> https://issues.apache.org/jira/browse/MESOS-9601
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This adds a new enum for the revamped quota feature
> in the master. When quota is configured in Mesos 1.9
> or higher, the `QUOTA_V2` minimum capability will be
> persisted into the registry. This will prevent any master
> downgrades until all quotas configured in Mesos 1.9 or higher
> are set back to the default (no guarantees and no limits).
> 
> 
> Diffs
> -
> 
>   docs/downgrades.md 3807254143d19f36079498c1a01adf744e21e8a2 
>   include/mesos/mesos.proto d2750a634bf58428a152e70aedfc06ff70b3aa84 
>   src/common/protobuf_utils.hpp f6ea9230d38079b24060922872ee93d9f038b98e 
>   src/master/constants.cpp 13b3467825c624fd3cb1652fbfec1a9631ca37e6 
>   src/tests/master_tests.cpp d7b4613980cb13d2009426b3a07690aac659cda7 
> 
> 
> Diff: https://reviews.apache.org/r/70949/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Re: Review Request 70936: Adjusted task status updates during draining.

2019-06-26 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216160
---




src/slave/slave.cpp
Lines 5732-5734 (patched)


I think I got this part incorrect; we should unconditionally set a 
`TASK_GONE_BY_OPERATOR` state.


- Benjamin Bannier


On June 26, 2019, 2:35 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> ---
> 
> (Updated June 26, 2019, 2:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
> https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 70923: Added a registry operation for the DRAIN_AGENT call.

2019-06-26 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70923/
---

(Updated June 26, 2019, 1:44 p.m.)


Review request for mesos, Benjamin Bannier, Benjamin Mahler, Greg Mann, and 
Vinod Kone.


Changes
---

Rebased on newer protobufs, which added the `State` field into the Registry.

Also split out the modifications of unreachable operations into another review: 
https://reviews.apache.org/r/70956/


Bugs: MESOS-9814
https://issues.apache.org/jira/browse/MESOS-9814


Repository: mesos


Description (updated)
---

This adds the associated registry operation and fields for the
DRAIN_AGENT master call.  This call affects admitted or unreachable
agents, but this commit only deals with admitted agents.

Because this feature is not downgrade compatible, a minimum capability
is added when the draining feature is used.


Diffs (updated)
-

  src/master/registry.proto 239789322c45c6a8346332f50661a98b1851b685 
  src/master/registry_operations.hpp 5a3010d46c4eddb0d660eb7368ec836903477a9b 
  src/master/registry_operations.cpp c417c4d6c4272eeab15f5313ca5160d4a0aa4da4 
  src/tests/registrar_tests.cpp 81979d78f1348791e28f1afea4ca2999de6362b8 


Diff: https://reviews.apache.org/r/70923/diff/3/

Changes: https://reviews.apache.org/r/70923/diff/2-3/


Testing
---

make check


Thanks,

Joseph Wu



Review Request 70956: Modified registry operations for unreachable draining agents.

2019-06-26 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70956/
---

Review request for mesos, Benjamin Bannier, Benjamin Mahler, Greg Mann, and 
Vinod Kone.


Bugs: MESOS-9814
https://issues.apache.org/jira/browse/MESOS-9814


Repository: mesos


Description
---

When agents transition from reachable to unreachable, or vice
versa, the draining config is now copied too, along with the
'deactivated' boolean.


Diffs
-

  src/master/registry.proto 239789322c45c6a8346332f50661a98b1851b685 
  src/master/registry_operations.cpp c417c4d6c4272eeab15f5313ca5160d4a0aa4da4 
  src/tests/registrar_tests.cpp 81979d78f1348791e28f1afea4ca2999de6362b8 


Diff: https://reviews.apache.org/r/70956/diff/1/


Testing
---

make check


Thanks,

Joseph Wu



Re: Review Request 70950: Added a registry field for `QuotaConfig`.

2019-06-26 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70950/#review216157
---



Can you also expand the commit description with a more precise comment about 
how these two fields will be used?


src/master/registry.proto
Lines 136-139 (original), 136-147 (patched)


Ditto comment from the last review. After discussing this offline and 
understanding the overall approach, it's clear that someone looking at these 
two fields will not understand the overall approach taken.

Can you add a guiding comment here about these two fields? Might as well 
group them together:

```
  // GUIDING COMMENT
  repeated Quota quotas = 5 [deprecated = true];
  repeated quota.QuotaConfig quota_configs = 11;
```



src/master/registry.proto
Lines 140-141 (patched)


Can you file a ticket that tracks this removal for the 2.0 timeframe and 
link it here?


- Benjamin Mahler


On June 26, 2019, 1:37 a.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70950/
> ---
> 
> (Updated June 26, 2019, 1:37 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9601
> https://issues.apache.org/jira/browse/MESOS-9601
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A new field called `quota_configs` is added to persist the
> quota configurations of the cluster. This replaces the old
> `quotas` field which is deprecated and will be removed
> in Mesos 2.0.
> 
> 
> Diffs
> -
> 
>   src/master/registry.proto 239789322c45c6a8346332f50661a98b1851b685 
> 
> 
> Diff: https://reviews.apache.org/r/70950/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Re: Review Request 70951: Updated registry operation `UpdateQuota` to persist `QuotaConfig`.

2019-06-26 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70951/#review216159
---




src/master/master.cpp
Lines 1740-1741 (patched)


Why not just do a .contains here?

Or CHECK_CONTAINS?



src/master/quota.hpp
Line 64 (original), 64 (patched)


What is this comment saying? This constructor is for the UPDATE_QUOTA 
master call path? Why say that?

There's also no other constructor, and this comment seems to suggest that 
there are others for other `XXX_QUOTA` things?



src/master/quota.cpp
Lines 57-63 (original), 58-67 (patched)


Storing a mutable reference to the configs seems to help readability here?

Also, Maybe find_if is clearer? And the if condition confused me a bit, 
maybe it's clearer to do it by default quota first, then whether it's found?

```
...& configs;

int configIndex = find_if(
configs.begin(),
configs.end(), 
[&](const QuotaConfig& config) {
  return role == config.role();
}) - configs.begin();

if (Quota(config) == DEFAULT_QUOTA)) {
  // Erase if present, otherwise no-op.
  if (configIndex < configs.size()) {
configs.DeleteSubrange(index, 1);
  }
} else {
  // Modify if present, otherwise insert.
  if (configIndex < configs.size()) {
*configs.Mutable(index) = config;
  } else {
*configs.Add() = config;
  }
}
```

Ditto for the infos logic below.



src/master/quota.cpp
Line 75 (original), 102 (patched)


s/std//



src/master/quota.cpp
Lines 83-84 (original), 105-113 (patched)


Ditto find_if?



src/master/quota.cpp
Lines 86-89 (original), 115-127 (patched)


Ditto here, feels more readable if thinking about the inner cases first:

// if capability needs to be present
  // make it present
// else: capability needs to be absent
  // make it absent



src/master/quota.cpp
Line 91 (original), 129 (patched)


This is supposed to return true when there's been a mutation, but it's 
always returning true?

The case where it's a no-op should return false.



src/tests/registrar_tests.cpp
Lines 927-948 (patched)


Pull this towards the top of the test body, before the 
twoResourceGuarantees stuff is declared?



src/tests/registrar_tests.cpp
Lines 944-946 (original), 965-967 (patched)


Shouldn't we be sanity checking the registry protobuf at this point? This 
can be done by hitting the registrar's endpoint.

Ditto in the other cases.



src/tests/registrar_tests.cpp
Lines 954-960 (original), 975-986 (patched)


I'm finding this test rather challenging to read, can we make it more 
readable than all the micro-checks by showing clearly what the expected 
registry looks like in json? E.g.

```
Registry expected = fromJSON(R"~(
{
  "configs": [ ... ]
  "capabilities": [ ... ]
}
)~");

EXPECT_EQ(expected, actual);
```

That will make it a lot less of a struggle to read


- Benjamin Mahler


On June 26, 2019, 6:03 a.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70951/
> ---
> 
> (Updated June 26, 2019, 6:03 a.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9601
> https://issues.apache.org/jira/browse/MESOS-9601
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The new operations will mutate the `quota_configs` field in
> the registry to persist `QuotaConfigs` configured by the new
> `UPDATE_QUOTA` call as well as the legacy `SET_QUOTA` and
> `REMOVE_QUOTA` calls.
> 
> The operation removes any entries in the legacy `quotas` field
> with the same role name. In addition, it also adds/removes the
> minimum capability `QUOTA_V2` accordingly: if `quota_configs`
> is empty the capability will be removed otherwise it will
> be added.
> 
> This operation replaces the `REMOVE_QUOTA` operation.
> 
> Also fixed affected tests.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 826362d1a9ed034a49934b1810d6e1e6cffd90cd 
>   src/master/quota.hpp 959e312b861fc0b59519a9ab5cc51061b55b534c 
>   src/master/quota.cpp 

Re: Review Request 70936: Adjusted task status updates during draining.

2019-06-26 Thread Benjamin Bannier


> On June 26, 2019, 10:16 p.m., Benjamin Bannier wrote:
> > src/slave/slave.cpp
> > Lines 5732-5734 (patched)
> > 
> >
> > I think I got this part incorrect; we should unconditionally set a 
> > `TASK_GONE_BY_OPERATOR` state.

Incorrect, we should preserve the original state if the draining does not 
`mark_gone`. Dropping.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216160
---


On June 26, 2019, 2:35 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> ---
> 
> (Updated June 26, 2019, 2:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
> https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Review Request 70957: Added registry operation for marking an agent as drained.

2019-06-26 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70957/
---

Review request for mesos, Benjamin Bannier, Benjamin Mahler, Greg Mann, and 
Vinod Kone.


Bugs: MESOS-9814
https://issues.apache.org/jira/browse/MESOS-9814


Repository: mesos


Description
---

This adds an operation purely for transitioning an agent from the
DRAINING state to the DRAINED state.  The master is expected to
validate that the targetted agent is DRAINING, otherwise, the
write will fail (which results in the master aborting).


Diffs
-

  src/master/registry_operations.hpp 5a3010d46c4eddb0d660eb7368ec836903477a9b 
  src/master/registry_operations.cpp c417c4d6c4272eeab15f5313ca5160d4a0aa4da4 
  src/tests/registrar_tests.cpp 81979d78f1348791e28f1afea4ca2999de6362b8 


Diff: https://reviews.apache.org/r/70957/diff/1/


Testing
---

make check


Thanks,

Joseph Wu



Re: Review Request 70924: Added registry operations for DE/RE-ACTIVATE_AGENT calls.

2019-06-26 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70924/
---

(Updated June 26, 2019, 1:46 p.m.)


Review request for mesos, Benjamin Bannier, Benjamin Mahler, Greg Mann, and 
Vinod Kone.


Changes
---

Rebased on newer protos and some changes to commits earlier.


Bugs: MESOS-9814
https://issues.apache.org/jira/browse/MESOS-9814


Repository: mesos


Description
---

This adds the associated registry operation and fields for the
DEACTIVATE_AGENT and REACTIVATE_AGENT master calls.

Like the DRAIN_AGENT call, the deactivation state also persists when
agents become unreachable/reachable, which is already handled by the
DRAIN_AGENT operation implementation.

Likewise, this feature is not downgrade compatible, and a minimum
capability is added when the deactivation feature is used.
If all draining or deactivated agents are reactivated, the minimum
capability is removed.


Diffs (updated)
-

  src/master/registry_operations.hpp 5a3010d46c4eddb0d660eb7368ec836903477a9b 
  src/master/registry_operations.cpp c417c4d6c4272eeab15f5313ca5160d4a0aa4da4 
  src/tests/registrar_tests.cpp 81979d78f1348791e28f1afea4ca2999de6362b8 


Diff: https://reviews.apache.org/r/70924/diff/2/

Changes: https://reviews.apache.org/r/70924/diff/1-2/


Testing
---

make check


Thanks,

Joseph Wu



Re: Review Request 70895: Added a test for suppressing roles via V0 updateFramework().

2019-06-26 Thread Benjamin Mahler


> On June 25, 2019, 7:21 p.m., Benjamin Mahler wrote:
> > src/tests/master/update_framework_tests.cpp
> > Lines 992-996 (patched)
> > 
> >
> > The framework update already occurred, and it may be in the allocator's 
> > queue, but that's fine. Why is this settling?
> 
> Andrei Sekretenko wrote:
> You are right, we don't need it now - I removed it from the patch.
> 
> Is it true that the update will always be performed atomically in the 
> allocator (i.e. no deferred calls will be initiated by the allocator to 
> complete the update)?
> 
> If this changes, adding the slave will race against the update... but 
> most likely, we will have a whole bunch of other problems then;)

> Is it true that the update will always be performed atomically in the 
> allocator (i.e. no deferred calls will be initiated by the allocator to 
> complete the update)?

It's true today, and no known ways this will become async.

> If this changes, adding the slave will race against the update... but most 
> likely, we will have a whole bunch of other problems then;)

Yes, that sounds like something we would want to surface, since it's going to 
cause other bugs.


- Benjamin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70895/#review216126
---


On June 26, 2019, 3:33 p.m., Andrei Sekretenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70895/
> ---
> 
> (Updated June 26, 2019, 3:33 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9793
> https://issues.apache.org/jira/browse/MESOS-9793
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a test for suppressing roles via V0 updateFramework().
> 
> 
> Diffs
> -
> 
>   src/tests/master/update_framework_tests.cpp 
> 0d466e286adfd829bbe72cff9202860f7fcb043f 
> 
> 
> Diff: https://reviews.apache.org/r/70895/diff/3/
> 
> 
> Testing
> ---
> 
> `./bin/mesos-tests.sh --gtest_filter="UpdateFrameworkV0*SuppressedRoles" 
> --gtest_break_on_failure --gtest_repeat=1000`
> 
> 
> Thanks,
> 
> Andrei Sekretenko
> 
>



Re: Review Request 70949: Added master minimum capability `QUOTA_V2`.

2019-06-26 Thread Meng Zhu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70949/
---

(Updated June 26, 2019, 9:24 p.m.)


Review request for mesos and Benjamin Mahler.


Changes
---

Addressed Ben's comment.


Bugs: MESOS-9601
https://issues.apache.org/jira/browse/MESOS-9601


Repository: mesos


Description (updated)
---

This adds a new enum for the revamped quota feature
in the master. When quota is configured in Mesos 1.9
or higher, the quota configurations will be persisted
into the `quota_configs` field in the registry. And
the `QUOTA_V2` minimum capability will be added to the
registry as well. This will prevent any master downgrades
until `quota_configs` becomes empty. This can be done by
setting the quota of the roles listed in `quota_configs`
back to the default (no guarantees and no limits).


Diffs (updated)
-

  docs/downgrades.md 3807254143d19f36079498c1a01adf744e21e8a2 
  include/mesos/mesos.proto eb1b09cf9f9c7c102d713170538c2ba210edb351 
  src/common/protobuf_utils.hpp f6ea9230d38079b24060922872ee93d9f038b98e 
  src/master/constants.cpp 13b3467825c624fd3cb1652fbfec1a9631ca37e6 
  src/tests/master_tests.cpp d7b4613980cb13d2009426b3a07690aac659cda7 


Diff: https://reviews.apache.org/r/70949/diff/3/

Changes: https://reviews.apache.org/r/70949/diff/2-3/


Testing
---

make check


Thanks,

Meng Zhu



Re: Review Request 70949: Added master minimum capability `QUOTA_V2`.

2019-06-26 Thread Meng Zhu


> On June 26, 2019, 12:18 p.m., Benjamin Mahler wrote:
> > Ditto comment below w.r.t. the description, can you make the description 
> > more precise about the overall approach with this minimum capability and 
> > the registry?

Done.


> On June 26, 2019, 12:18 p.m., Benjamin Mahler wrote:
> > src/master/constants.cpp
> > Lines 28 (patched)
> > 
> >
> > Is the master v2 quota capable yet in this patch?
> > 
> > I guess the agent doesn't care about this capability, but it seems 
> > nicer if we add it after the master can actually handle v2 style quota 
> > requests?

I think the `QUOTA_V2` here is essentially the ability to decode/recover from 
the quota registry. From this regard, the master gets that ability from this 
very patch.

But to answer your question, no, the master at the moment can't handle the 
`UPDATE_QUOTA` call. (But handle the call requires the registry change, right 
:)) Anyway, I am OK with postponing the patch and commit it when we wire up the 
call.


- Meng


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70949/#review216156
---


On June 25, 2019, 10:58 p.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70949/
> ---
> 
> (Updated June 25, 2019, 10:58 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9601
> https://issues.apache.org/jira/browse/MESOS-9601
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This adds a new enum for the revamped quota feature
> in the master. When quota is configured in Mesos 1.9
> or higher, the `QUOTA_V2` minimum capability will be
> persisted into the registry. This will prevent any master
> downgrades until all quotas configured in Mesos 1.9 or higher
> are set back to the default (no guarantees and no limits).
> 
> 
> Diffs
> -
> 
>   docs/downgrades.md 3807254143d19f36079498c1a01adf744e21e8a2 
>   include/mesos/mesos.proto d2750a634bf58428a152e70aedfc06ff70b3aa84 
>   src/common/protobuf_utils.hpp f6ea9230d38079b24060922872ee93d9f038b98e 
>   src/master/constants.cpp 13b3467825c624fd3cb1652fbfec1a9631ca37e6 
>   src/tests/master_tests.cpp d7b4613980cb13d2009426b3a07690aac659cda7 
> 
> 
> Diff: https://reviews.apache.org/r/70949/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Re: Review Request 70950: Added a registry field for `QuotaConfig`.

2019-06-26 Thread Meng Zhu


> On June 26, 2019, 12:22 p.m., Benjamin Mahler wrote:
> > Can you also expand the commit description with a more precise comment 
> > about how these two fields will be used?

Done.


> On June 26, 2019, 12:22 p.m., Benjamin Mahler wrote:
> > src/master/registry.proto
> > Lines 140-141 (patched)
> > 
> >
> > Can you file a ticket that tracks this removal for the 2.0 timeframe 
> > and link it here?

MESOS-9866


- Meng


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70950/#review216157
---


On June 25, 2019, 6:37 p.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70950/
> ---
> 
> (Updated June 25, 2019, 6:37 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9601
> https://issues.apache.org/jira/browse/MESOS-9601
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A new field called `quota_configs` is added to persist the
> quota configurations of the cluster. This replaces the old
> `quotas` field which is deprecated and will be removed
> in Mesos 2.0.
> 
> 
> Diffs
> -
> 
>   src/master/registry.proto 239789322c45c6a8346332f50661a98b1851b685 
> 
> 
> Diff: https://reviews.apache.org/r/70950/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Re: Review Request 70951: Updated registry operation `UpdateQuota` to persist `QuotaConfig`.

2019-06-26 Thread Meng Zhu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70951/
---

(Updated June 26, 2019, 9:50 p.m.)


Review request for mesos and Benjamin Mahler.


Changes
---

Addressed Ben's comments.


Bugs: MESOS-9601
https://issues.apache.org/jira/browse/MESOS-9601


Repository: mesos


Description
---

The new operations will mutate the `quota_configs` field in
the registry to persist `QuotaConfigs` configured by the new
`UPDATE_QUOTA` call as well as the legacy `SET_QUOTA` and
`REMOVE_QUOTA` calls.

The operation removes any entries in the legacy `quotas` field
with the same role name. In addition, it also adds/removes the
minimum capability `QUOTA_V2` accordingly: if `quota_configs`
is empty the capability will be removed otherwise it will
be added.

This operation replaces the `REMOVE_QUOTA` operation.

Also fixed affected tests.


Diffs (updated)
-

  src/master/master.cpp 826362d1a9ed034a49934b1810d6e1e6cffd90cd 
  src/master/quota.hpp 959e312b861fc0b59519a9ab5cc51061b55b534c 
  src/master/quota.cpp a7ee845d5916e859218b0168c7b682f77549aafc 
  src/master/quota_handler.cpp 476e87eb67e07a2ec6c7b258667cf94ff9c4f8eb 
  src/tests/registrar_tests.cpp 81979d78f1348791e28f1afea4ca2999de6362b8 


Diff: https://reviews.apache.org/r/70951/diff/3/

Changes: https://reviews.apache.org/r/70951/diff/2-3/


Testing
---

make check


Thanks,

Meng Zhu



Re: Review Request 70951: Updated registry operation `UpdateQuota` to persist `QuotaConfig`.

2019-06-26 Thread Meng Zhu


> On June 26, 2019, 1:21 p.m., Benjamin Mahler wrote:
> > src/master/quota.hpp
> > Line 64 (original), 64 (patched)
> > 
> >
> > What is this comment saying? This constructor is for the UPDATE_QUOTA 
> > master call path? Why say that?
> > 
> > There's also no other constructor, and this comment seems to suggest 
> > that there are others for other `XXX_QUOTA` things?

ah, left over from previous iteration. Removed.


> On June 26, 2019, 1:21 p.m., Benjamin Mahler wrote:
> > src/tests/registrar_tests.cpp
> > Lines 944-946 (original), 965-967 (patched)
> > 
> >
> > Shouldn't we be sanity checking the registry protobuf at this point? 
> > This can be done by hitting the registrar's endpoint.
> > 
> > Ditto in the other cases.

We could do that. But I think checking the recovered state does that indirectly 
and that is consistent with other tests. Let's keep it as is?


> On June 26, 2019, 1:21 p.m., Benjamin Mahler wrote:
> > src/tests/registrar_tests.cpp
> > Lines 954-960 (original), 975-986 (patched)
> > 
> >
> > I'm finding this test rather challenging to read, can we make it more 
> > readable than all the micro-checks by showing clearly what the expected 
> > registry looks like in json? E.g.
> > 
> > ```
> > Registry expected = fromJSON(R"~(
> > {
> >   "configs": [ ... ]
> >   "capabilities": [ ... ]
> > }
> > )~");
> > 
> > EXPECT_EQ(expected, actual);
> > ```
> > 
> > That will make it a lot less of a struggle to read

Used message differencer to avoid comparing field by field.


- Meng


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70951/#review216159
---


On June 26, 2019, 9:50 p.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70951/
> ---
> 
> (Updated June 26, 2019, 9:50 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-9601
> https://issues.apache.org/jira/browse/MESOS-9601
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The new operations will mutate the `quota_configs` field in
> the registry to persist `QuotaConfigs` configured by the new
> `UPDATE_QUOTA` call as well as the legacy `SET_QUOTA` and
> `REMOVE_QUOTA` calls.
> 
> The operation removes any entries in the legacy `quotas` field
> with the same role name. In addition, it also adds/removes the
> minimum capability `QUOTA_V2` accordingly: if `quota_configs`
> is empty the capability will be removed otherwise it will
> be added.
> 
> This operation replaces the `REMOVE_QUOTA` operation.
> 
> Also fixed affected tests.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 826362d1a9ed034a49934b1810d6e1e6cffd90cd 
>   src/master/quota.hpp 959e312b861fc0b59519a9ab5cc51061b55b534c 
>   src/master/quota.cpp a7ee845d5916e859218b0168c7b682f77549aafc 
>   src/master/quota_handler.cpp 476e87eb67e07a2ec6c7b258667cf94ff9c4f8eb 
>   src/tests/registrar_tests.cpp 81979d78f1348791e28f1afea4ca2999de6362b8 
> 
> 
> Diff: https://reviews.apache.org/r/70951/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Review Request 70959: Cleared agent drain state when draining is finished.

2019-06-26 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70959/
---

Review request for mesos, Greg Mann and Joseph Wu.


Bugs: MESOS-9860
https://issues.apache.org/jira/browse/MESOS-9860


Repository: mesos


Description
---

Once a draining agent has neither frameworks with pending tasks nor any
executors with either queued or launched tasks it has finished draining.
This patch adds handling of that case which clears both the in-memory
and persisted drain configuration.


Diffs
-

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 


Diff: https://reviews.apache.org/r/70959/diff/1/


Testing
---

`make check`

tested as part of https://reviews.apache.org/r/70960/


Thanks,

Benjamin Bannier



Re: Review Request 70907: Added recovery of agent drain information.

2019-06-26 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70907/
---

(Updated June 27, 2019, 1:55 a.m.)


Review request for mesos, Greg Mann and Joseph Wu.


Changes
---

Rename a variable missed in previous rebase


Bugs: MESOS-9822
https://issues.apache.org/jira/browse/MESOS-9822


Repository: mesos


Description
---

With this patch the agent will, after executor reregistration finished,
replay any active drain information so remaining tasks are drained as
well. We need to wait until executors had a chance to register so they
are not terminated should we try to send kill task request before the
executor has registered.


Diffs (updated)
-

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
  src/slave/state.hpp 45836e5805a5038c0b3b5563ea995c3fa90ab808 
  src/slave/state.cpp e0a850e03fb0d726bf7b02bbd6b298d81afae399 
  src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 


Diff: https://reviews.apache.org/r/70907/diff/5/

Changes: https://reviews.apache.org/r/70907/diff/4-5/


Testing
---

`make check`


Thanks,

Benjamin Bannier



Review Request 70960: Added test for agent to leave draining state on its own.

2019-06-26 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70960/
---

Review request for mesos, Greg Mann and Joseph Wu.


Bugs: MESOS-9860
https://issues.apache.org/jira/browse/MESOS-9860


Repository: mesos


Description
---

This patch adds a test which confirms that the agent leaves a draining
state on its own once all frameworks on the agent have no more pending
tasks and all their executors have neither launched or queued tasks.

The test uses the fact that the agent rejects task launches while
draining.


Diffs
-

  src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 


Diff: https://reviews.apache.org/r/70960/diff/1/


Testing
---

`make check`


Thanks,

Benjamin Bannier



Review Request 70958: Changed agent to fail task launches received during draining.

2019-06-26 Thread Benjamin Bannier

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70958/
---

Review request for mesos, Greg Mann and Joseph Wu.


Bugs: MESOS-9862
https://issues.apache.org/jira/browse/MESOS-9862


Repository: mesos


Description
---

With this patch the agent will now reject task launches while draining.
While we do not expect the master to send task launches to draining
agents it is still worthwhile to ensure no new tasks can be launched
while draining. This invariant simplifies e.g., the handling of drain
requests since we know that once the agent has entered a draining state
we only need to terminate existing tasks and no new tasks can appear.


Diffs
-

  src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 


Diff: https://reviews.apache.org/r/70958/diff/1/


Testing
---

`make check`

tested as part of https://reviews.apache.org/r/70960/


Thanks,

Benjamin Bannier



Re: Review Request 70936: Adjusted task status updates during draining.

2019-06-26 Thread Greg Mann


> On June 25, 2019, 9:48 p.m., Greg Mann wrote:
> > src/tests/slave_tests.cpp
> > Lines 12109 (patched)
> > 
> >
> > Maybe we should unconditionally set the reason to 
> > REASON_AGENT_DRAINING? It doesn't make sense to me that the reason received 
> > by the scheduler will depend on what stage the task launch was in when the 
> > drain message was received?
> 
> Benjamin Bannier wrote:
> I was wondering about this as well. The crucial bit to communicate here 
> is that the task terminated unexpectedly which we would provide in any case.
> 
> I feel providing frameworks slightly more granular information on where 
> in the task's lifecycle this happened might be interesting as e.g., a task 
> killed while running could have had application-level side effects while one 
> killed during launch wouldn't. Ultimately the framework still wouldn't know 
> where in its application-level lifecycle a running task was killed (and we 
> wouldn't be able to provide that) and the framework would need to deal with 
> this outside of Mesos, but I also do not see a reason to not make the lower 
> level detail on whether the task was launched and of which Mesos is aware 
> visible to the framework.
> 
> What do you think?

In the case that the agent is draining, and then the agent sets the 
REASON_TASK_KILLED_DURING_LAUNCH when killing a task, it seems that we must 
lose one of the two reasons, so we need to choose which one to retain.

I think that the cost of retaining the REASON_TASK_KILLED_DURING_LAUNCH is that 
the signal provided to schedulers by REASON_AGENT_DRAINING becomes much less 
consistent, and thus less useful. When the scheduler can't rely on all killed 
tasks from a draining agent to contain REASON_AGENT_DRAINING, then it can't 
always respond correctly to the killed task. In the case of stateful tasks, for 
example, the scheduler may want to relaunch a new instance immediately if the 
agent is being drained, rather than waiting for an offer containing persistent 
volumes which may never come.

My guess is that there are more cases in which REASON_AGENT_DRAINING will help 
schedulers take meaningful action, rather than 
REASON_TASK_KILLED_DURING_LAUNCH. The latter could be useful perhaps for 
schedulers which avoid performing cleanup for tasks which were never 
successfully launched? Maybe there are other use cases I'm not thinking of, but 
my intuition is that REASON_AGENT_DRAINED is more useful, so I'd vote for 
unconditionally setting the reason to REASON_AGENT_DRAINING.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70936/#review216143
---


On June 26, 2019, 12:35 p.m., Benjamin Bannier wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70936/
> ---
> 
> (Updated June 26, 2019, 12:35 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Joseph Wu.
> 
> 
> Bugs: MESOS-9823
> https://issues.apache.org/jira/browse/MESOS-9823
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When a task is reported as killed to the agent during active agent
> draining we now decorate the reported status with
> `REASON_AGENT_DRAINING` if no other status was previously present. If
> the draining marks the agent as gone via the `mark_gone` draining flag
> we additionally report `TASK_GONE_BY_OPERATOR` instead of the original
> state.
> 
> This patch leaves some ambiguity in what triggered the kill since the
> agent-executor protocol does not transport reasons; instead
> the reason is here only inferred after the killed task has
> been observed. This should usually be fine since due to the inherit race
> between e.g., any user- and drain-triggered kill a user cannot
> distinguish racy reasons.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 30039b0857a4d85b4b96fa95d7f8724d57cdec6e 
>   src/tests/slave_tests.cpp 7ba2df9fef9966cd09e0af3f4ef3bb260b6167f6 
> 
> 
> Diff: https://reviews.apache.org/r/70936/diff/2/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>



Re: Review Request 70949: Added master minimum capability `QUOTA_V2`.

2019-06-26 Thread Meng Zhu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70949/
---

(Updated June 25, 2019, 10:58 p.m.)


Review request for mesos and Benjamin Mahler.


Changes
---

Renamed the capability to `QUOTA_V2`.


Summary (updated)
-

Added master minimum capability `QUOTA_V2`.


Bugs: MESOS-9601
https://issues.apache.org/jira/browse/MESOS-9601


Repository: mesos


Description (updated)
---

This adds a new enum for the revamped quota feature
in the master. When quota is configured in Mesos 1.9
or higher, the `QUOTA_V2` minimum capability will be
persisted into the registry. This will prevent any master
downgrades until all quotas configured in Mesos 1.9 or higher
are set back to the default (no guarantees and no limits).


Diffs (updated)
-

  docs/downgrades.md 3807254143d19f36079498c1a01adf744e21e8a2 
  include/mesos/mesos.proto d2750a634bf58428a152e70aedfc06ff70b3aa84 
  src/common/protobuf_utils.hpp f6ea9230d38079b24060922872ee93d9f038b98e 
  src/master/constants.cpp 13b3467825c624fd3cb1652fbfec1a9631ca37e6 
  src/tests/master_tests.cpp d7b4613980cb13d2009426b3a07690aac659cda7 


Diff: https://reviews.apache.org/r/70949/diff/2/

Changes: https://reviews.apache.org/r/70949/diff/1-2/


Testing
---

make check


Thanks,

Meng Zhu



Re: Review Request 70951: Updated registry operation `UpdateQuota` to persist `QuotaConfig`.

2019-06-26 Thread Meng Zhu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70951/
---

(Updated June 25, 2019, 11:03 p.m.)


Review request for mesos and Benjamin Mahler.


Summary (updated)
-

Updated registry operation `UpdateQuota` to persist `QuotaConfig`.


Bugs: MESOS-9601
https://issues.apache.org/jira/browse/MESOS-9601


Repository: mesos


Description (updated)
---

The new operations will mutate the `quota_configs` field in
the registry to persist `QuotaConfigs` configured by the new
`UPDATE_QUOTA` call as well as the legacy `SET_QUOTA` and
`REMOVE_QUOTA` calls.

The operation removes any entries in the legacy `quotas` field
with the same role name. In addition, it also adds/removes the
minimum capability `QUOTA_V2` accordingly: if `quota_configs`
is empty the capability will be removed otherwise it will
be added.

This operation replaces the `REMOVE_QUOTA` operation.

Also fixed affected tests.


Diffs (updated)
-

  src/master/master.cpp 826362d1a9ed034a49934b1810d6e1e6cffd90cd 
  src/master/quota.hpp 959e312b861fc0b59519a9ab5cc51061b55b534c 
  src/master/quota.cpp a7ee845d5916e859218b0168c7b682f77549aafc 
  src/master/quota_handler.cpp 476e87eb67e07a2ec6c7b258667cf94ff9c4f8eb 
  src/tests/registrar_tests.cpp 81979d78f1348791e28f1afea4ca2999de6362b8 


Diff: https://reviews.apache.org/r/70951/diff/2/

Changes: https://reviews.apache.org/r/70951/diff/1-2/


Testing
---

make check


Thanks,

Meng Zhu