[mesos] 04/04: Minor cleanups in role_tests.cpp.

2019-07-15 Thread bmahler
This is an automated email from the ASF dual-hosted git repository.

bmahler pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit dcbee1dcdb6dd5915252530fa6b4ffad900fc80f
Author: Benjamin Mahler 
AuthorDate: Mon Jul 15 17:10:25 2019 -0400

Minor cleanups in role_tests.cpp.

Review: https://reviews.apache.org/r/71078
---
 src/tests/role_tests.cpp | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/tests/role_tests.cpp b/src/tests/role_tests.cpp
index 7a6c8a6..d6cc31b 100644
--- a/src/tests/role_tests.cpp
+++ b/src/tests/role_tests.cpp
@@ -287,7 +287,7 @@ TEST_F(RoleTest, ImplicitRoleStaticReservation)
 
 
 // This test checks that the "/roles" endpoint returns the expected
-// information when there are no active roles.
+// information when there are no known roles.
 TEST_F_TEMP_DISABLED_ON_WINDOWS(RoleTest, EndpointEmpty)
 {
  Try<Owned<cluster::Master>> master = StartMaster();
@@ -320,9 +320,9 @@ TEST_F_TEMP_DISABLED_ON_WINDOWS(RoleTest, EndpointEmpty)
 
 
 // This test checks that the "/roles" endpoint returns the expected
-// information when there are configured weights and explicit roles,
-// but no registered frameworks.
-TEST_F(RoleTest, EndpointNoFrameworks)
+// information when the role whitelist is used but no frameworks
+// are present.
+TEST_F(RoleTest, EndpointWithWhitelistNoFrameworks)
 {
   master::Flags masterFlags = CreateMasterFlags();
   masterFlags.roles = "role1,role2";



[mesos] branch master updated (c076c8c -> dcbee1d)

2019-07-15 Thread bmahler
This is an automated email from the ASF dual-hosted git repository.

bmahler pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git.


from c076c8c  Added test for agent to leave draining state on its own.
 new 382e9de  Fixed /roles and GET_ROLES to expose all known roles.
 new d4cb3a3  Added a test to ensure that roles with only reservations are 
exposed.
 new f99e181  Added a test to ensure that ancestor roles are exposed in 
/roles.
 new dcbee1d  Minor cleanups in role_tests.cpp.

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/master/http.cpp |   8 +-
 src/master/master.cpp   | 156 +++--
 src/master/master.hpp   | 144 +--
 src/master/readonly_handler.cpp |  41 -
 src/tests/role_tests.cpp| 185 ++--
 5 files changed, 358 insertions(+), 176 deletions(-)



[mesos] 01/04: Fixed /roles and GET_ROLES to expose all known roles.

2019-07-15 Thread bmahler
This is an automated email from the ASF dual-hosted git repository.

bmahler pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 382e9de30ab28edb59e7f443e908468c065968ab
Author: Benjamin Mahler 
AuthorDate: Mon Jul 15 14:24:08 2019 -0400

Fixed /roles and GET_ROLES to expose all known roles.

Previously, per MESOS-9888 and MESOS-9890, the /roles and GET_ROLES
APIs only exposed roles that had frameworks associated with them
(either because a framework was subscribed to the role or had
allocations to it), or that had a configured weight and/or quota.

This approach omits some important cases:

  (1) Roles that have only reservations associated with them.
  (2) Roles that have only a parent relationship to other roles.

This patch exposes a function that returns all "known" roles based
on the criteria we care about:

  (1) Roles with configured weight or quota.
  (2) Roles with reservations.
  (3) Roles with frameworks subscribed or allocated resources.
  (4) Ancestor roles of (1), (2), or (3).
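
To illustrate (4): role names are hierarchical, so a single
reservation to a nested role also makes each of its ancestors
known. A rough sketch, illustrative only (it assumes
`roles::ancestors` yields parent roles ordered from nearest
parent to root, and uses the `foreach` macro from stout):

  // A reservation to "eng/dev/web" makes "eng/dev" and "eng"
  // known roles as well.
  std::set<std::string> known = {"eng/dev/web"};
  foreach (const std::string& ancestor, roles::ancestors("eng/dev/web")) {
    known.insert(ancestor);  // Inserts "eng/dev", then "eng".
  }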

Also, the resource breakdowns are pulled out of the Role struct
and placed in a function that returns the breakdowns for all known
roles. This was done because there is currently no Role struct
entry for every known role.

Review: https://reviews.apache.org/r/71073
---
 src/master/http.cpp |   8 ++-
 src/master/master.cpp   | 156 
 src/master/master.hpp   | 144 ++---
 src/master/readonly_handler.cpp |  41 +--
 4 files changed, 178 insertions(+), 171 deletions(-)

diff --git a/src/master/http.cpp b/src/master/http.cpp
index cd0f40c..b077dd7 100644
--- a/src/master/http.cpp
+++ b/src/master/http.cpp
@@ -2684,7 +2684,7 @@ Future<Response> Master::Http::getRoles(
 .then(defer(master->self(),
 [this, contentType](const Owned<ObjectApprovers>& approvers)
   -> Response {
-  const vector<string> filteredRoles = master->filterRoles(approvers);
+  const vector<string> knownRoles = master->knownRoles();
 
   mesos::master::Response response;
   response.set_type(mesos::master::Response::GET_ROLES);
@@ -2692,7 +2692,11 @@ Future Master::Http::getRoles(
   mesos::master::Response::GetRoles* getRoles =
 response.mutable_get_roles();
 
-  foreach (const string& name, filteredRoles) {
+  foreach (const string& name, knownRoles) {
+if (!approvers->approved(name)) {
+  continue;
+}
+
 mesos::Role role;
 
 if (master->weights.contains(name)) {
diff --git a/src/master/master.cpp b/src/master/master.cpp
index 5247377..2a59f89 100644
--- a/src/master/master.cpp
+++ b/src/master/master.cpp
@@ -3515,46 +3515,59 @@ void Master::suppress(
 }
 
 
-vector<string> Master::filterRoles(
-const Owned<ObjectApprovers>& approvers) const
-{
-  JSON::Object object;
-
-  // Compute the role names to return results for. When an explicit
-  // role whitelist has been configured, we use that list of names.
-  // When using implicit roles, the right behavior is a bit more
-  // subtle. There are no constraints on possible role names, so we
-  // instead list all the "interesting" roles: all roles with one or
-  // more registered frameworks, and all roles with a non-default
-  // weight or quota.
-  //
+vector<string> Master::knownRoles() const
+{
   // NOTE: we use a `std::set` to store the role names to ensure a
   // deterministic output order.
  set<string> roleList;
+
+  auto insertAncestors = [&roleList](const string& role) {
+foreach (const string& ancestor, roles::ancestors(role)) {
+  bool inserted = roleList.insert(ancestor).second;
+
+  // We can break here as an optimization since the ancestor
+  // will have had its ancestors inserted already.
+  if (!inserted) break;
+}
+  };
+
   if (roleWhitelist.isSome()) {
-const hashset<string>& whitelist = roleWhitelist.get();
-roleList.insert(whitelist.begin(), whitelist.end());
+foreach (const string& role, *this->roleWhitelist) {
+  roleList.insert(role);
+  insertAncestors(role);
+}
   } else {
-hashset<string> roles = this->roles.keys();
-roleList.insert(roles.begin(), roles.end());
+// In terms of building a complete set of known roles, we have to visit:
+//   (1) all entries of `roles` (which means there are frameworks
+//   subscribed to a role or have allocations to a role)
+//   (2) all reservation roles
+//   (3) all roles with configured weights or quotas
+//   (4) all ancestor roles of (1), (2), and (3).
 
-hashset<string> weights = this->weights.keys();
-roleList.insert(weights.begin(), weights.end());
+foreachkey (const string& role, this->roles) {
+  roleList.insert(role);
+  insertAncestors(role);
+}
 
-hashset<string> quotas = this->quotas.keys();
-roleList.insert(quotas.begin(), quotas.end());
-  }
+

[mesos] 02/04: Added a test to ensure that roles with only reservations are exposed.

2019-07-15 Thread bmahler
This is an automated email from the ASF dual-hosted git repository.

bmahler pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit d4cb3a34ce00aa6177f6d206c552e309ad250598
Author: Benjamin Mahler 
AuthorDate: Mon Jul 15 16:47:02 2019 -0400

Added a test to ensure that roles with only reservations are exposed.

This adds a test for MESOS-9888, to ensure that if a role has only
reservations associated with it, it is exposed in /roles.

Review: https://reviews.apache.org/r/71074
---
 src/tests/role_tests.cpp | 78 +++-
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/src/tests/role_tests.cpp b/src/tests/role_tests.cpp
index 5066e41..5a6a01a 100644
--- a/src/tests/role_tests.cpp
+++ b/src/tests/role_tests.cpp
@@ -34,6 +34,8 @@
 #include "tests/mesos.hpp"
 #include "tests/resources_utils.hpp"
 
+#include "tests/master/mock_master_api_subscriber.hpp"
+
 using mesos::internal::master::Master;
 using mesos::internal::slave::Slave;
 
@@ -870,7 +872,81 @@ TEST_F_TEMP_DISABLED_ON_WINDOWS(RoleTest, EndpointImplicitRolesQuotas)
 
   EXPECT_EQ(*expected, *parse)
 << "expected " << stringify(*expected)
-<< " vs actual " << stringify(*parse);}
+<< " vs actual " << stringify(*parse);
+}
+
+
+// This test ensures that roles with only reservations are
+// included in the /roles endpoint.
+TEST_F_TEMP_DISABLED_ON_WINDOWS(RoleTest, EndpointImplicitRolesReservations)
+{
+  Try<Owned<cluster::Master>> master = StartMaster();
+  ASSERT_SOME(master);
+
+  v1::MockMasterAPISubscriber subscriber;
+
+  AWAIT_READY(subscriber.subscribe(master.get()->pid));
+
+  Future<Nothing> agentAdded;
+  EXPECT_CALL(subscriber, agentAdded(_))
+.WillOnce(FutureSatisfy(&agentAdded));
+
+  Owned<MasterDetector> detector = master.get()->createDetector();
+
+  slave::Flags agentFlags = CreateSlaveFlags();
+  agentFlags.resources = "cpus(role):1;mem(role):10";
+
+  Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), agentFlags);
+
+  AWAIT_READY(agentAdded);
+
+  // Check that the /roles endpoint contains the role.
+  {
+Future<Response> response = process::http::get(
+master.get()->pid,
+"roles",
+None(),
+createBasicAuthHeaders(DEFAULT_CREDENTIAL));
+
+AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);
+
+Try<JSON::Value> parse = JSON::parse(response->body);
+ASSERT_SOME(parse);
+
+Try<JSON::Value> expected = JSON::parse(
+"{"
+"  \"roles\": ["
+"{"
+"  \"frameworks\": [],"
+"  \"name\": \"role\","
+"  \"resources\": {},"
+"  \"allocated\": {},"
+"  \"offered\": {},"
+"  \"reserved\": {"
+"\"cpus\": 1.0,"
+"\"mem\":  10.0"
+"  },"
+"  \"quota\": {"
+"\"consumed\": {"
+"  \"cpus\": 1.0,"
+"  \"mem\": 10.0"
+"},"
+"\"guarantee\": {},"
+"\"limit\": {},"
+"\"role\": \"role\""
+"  },"
+"  \"weight\": 1.0"
+"}"
+"  ]"
+"}");
+
+ASSERT_SOME(expected);
+
+EXPECT_EQ(*expected, *parse)
+  << "expected " << stringify(*expected)
+  << " vs actual " << stringify(*parse);
+  }
+}
 
 
 // This test ensures that master adds/removes all roles of



[mesos] 03/04: Added a test to ensure that ancestor roles are exposed in /roles.

2019-07-15 Thread bmahler
This is an automated email from the ASF dual-hosted git repository.

bmahler pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit f99e181d34a78ef304f9072a6685f33be99ea07f
Author: Benjamin Mahler 
AuthorDate: Mon Jul 15 17:07:58 2019 -0400

Added a test to ensure that ancestor roles are exposed in /roles.

This adds a test for MESOS-9890, to ensure that ancestor roles with
no objects directly associated with them get exposed in /roles.

Review: https://reviews.apache.org/r/71077
---
 src/tests/role_tests.cpp | 99 
 1 file changed, 99 insertions(+)

diff --git a/src/tests/role_tests.cpp b/src/tests/role_tests.cpp
index 5a6a01a..7a6c8a6 100644
--- a/src/tests/role_tests.cpp
+++ b/src/tests/role_tests.cpp
@@ -949,6 +949,105 @@ TEST_F_TEMP_DISABLED_ON_WINDOWS(RoleTest, EndpointImplicitRolesReservations)
 }
 
 
+// This test ensures that ancestor roles are exposed when
+// there are no direct objects associated with them.
+//
+// TODO(bmahler): This currently only tests the reservation
+// case, but we should also test the allocation, framework
+// subscription, and quota/weight configuration cases.
+TEST_F_TEMP_DISABLED_ON_WINDOWS(RoleTest, EndpointImplicitRolesAncestors)
+{
+  Try<Owned<cluster::Master>> master = StartMaster();
+  ASSERT_SOME(master);
+
+  v1::MockMasterAPISubscriber subscriber;
+
+  AWAIT_READY(subscriber.subscribe(master.get()->pid));
+
+  Future<Nothing> agentAdded;
+  EXPECT_CALL(subscriber, agentAdded(_))
+.WillOnce(FutureSatisfy(&agentAdded));
+
+  Owned<MasterDetector> detector = master.get()->createDetector();
+
+  slave::Flags agentFlags = CreateSlaveFlags();
+  agentFlags.resources = "cpus(ancestor/child):1;mem(ancestor/child):10;";
+
+  Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), agentFlags);
+
+  AWAIT_READY(agentAdded);
+
+  // Check that the /roles endpoint contains the role and
+  // its ancestor.
+  {
+Future<Response> response = process::http::get(
+master.get()->pid,
+"roles",
+None(),
+createBasicAuthHeaders(DEFAULT_CREDENTIAL));
+
+AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);
+
+Try<JSON::Value> parse = JSON::parse(response->body);
+ASSERT_SOME(parse);
+
+Try<JSON::Value> expected = JSON::parse(
+"{"
+"  \"roles\": ["
+"{"
+"  \"frameworks\": [],"
+"  \"name\": \"ancestor\","
+"  \"resources\": {},"
+"  \"allocated\": {},"
+"  \"offered\": {},"
+"  \"reserved\": {"
+"\"cpus\": 1.0,"
+"\"mem\":  10.0"
+"  },"
+"  \"quota\": {"
+"\"consumed\": {"
+"  \"cpus\": 1.0,"
+"  \"mem\": 10.0"
+"},"
+"\"guarantee\": {},"
+"\"limit\": {},"
+"\"role\": \"ancestor\""
+"  },"
+"  \"weight\": 1.0"
+"},"
+"{"
+"  \"frameworks\": [],"
+"  \"name\": \"ancestor/child\","
+"  \"resources\": {},"
+"  \"allocated\": {},"
+"  \"offered\": {},"
+"  \"reserved\": {"
+"\"cpus\": 1.0,"
+"\"mem\":  10.0"
+"  },"
+"  \"quota\": {"
+"\"consumed\": {"
+"  \"cpus\": 1.0,"
+"  \"mem\": 10.0"
+"},"
+"\"guarantee\": {},"
+"\"limit\": {},"
+"\"role\": \"ancestor/child\""
+"  },"
+"  \"weight\": 1.0"
+"}"
+"  ]"
+"}");
+
+ASSERT_SOME(expected);
+
+EXPECT_EQ(*expected, *parse)
+  << "expected " << stringify(*expected)
+  << " vs actual " << stringify(*parse);
+  }
+}
+
+
 // This test ensures that master adds/removes all roles of
 // a multi-role framework when it registers/terminates.
 TEST_F_TEMP_DISABLED_ON_WINDOWS(



[mesos] 05/14: Updated an equality operator.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 27f0cd3519bafaf058e8347d482475e776d494e1
Author: Greg Mann 
AuthorDate: Mon Jul 15 10:25:47 2019 -0700

Updated an equality operator.

This patch updates the equality operator for the `Task`
message to include two missing conditions. An equality
operator for `HealthCheck` is also added to make this
possible.
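
For reference, `MessageDifferencer::Equals` performs a deep,
field-by-field comparison, which is what makes it a convenient basis
for protobuf equality operators. A minimal sketch of the resulting
behavior (illustrative only, not part of the patch):

  #include <iostream>
  #include <mesos/mesos.pb.h>
  #include <mesos/type_utils.hpp>

  int main()
  {
    mesos::HealthCheck a, b;
    a.set_delay_seconds(10);
    b.set_delay_seconds(10);
    std::cout << (a == b) << std::endl;  // 1: all set fields match.

    b.set_delay_seconds(15);
    std::cout << (a == b) << std::endl;  // 0: delay_seconds differs.
    return 0;
  }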

Review: https://reviews.apache.org/r/70900/
---
 include/mesos/type_utils.hpp |  1 +
 src/common/type_utils.cpp| 10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/mesos/type_utils.hpp b/include/mesos/type_utils.hpp
index ed9190b..b9e6164 100644
--- a/include/mesos/type_utils.hpp
+++ b/include/mesos/type_utils.hpp
@@ -62,6 +62,7 @@ bool operator==(
 bool operator==(const DiscoveryInfo& left, const DiscoveryInfo& right);
 bool operator==(const Environment& left, const Environment& right);
 bool operator==(const ExecutorInfo& left, const ExecutorInfo& right);
+bool operator==(const HealthCheck& left, const HealthCheck& right);
 bool operator==(const Label& left, const Label& right);
 bool operator==(const Labels& left, const Labels& right);
 bool operator==(const MasterInfo& left, const MasterInfo& right);
diff --git a/src/common/type_utils.cpp b/src/common/type_utils.cpp
index a7eb0e9..16d6657 100644
--- a/src/common/type_utils.cpp
+++ b/src/common/type_utils.cpp
@@ -400,6 +400,12 @@ bool operator!=(const ExecutorInfo& left, const ExecutorInfo& right)
 }
 
 
+bool operator==(const HealthCheck& left, const HealthCheck& right)
+{
+  return google::protobuf::util::MessageDifferencer::Equals(left, right);
+}
+
+
 bool operator==(const MasterInfo& left, const MasterInfo& right)
 {
   return left.id() == right.id() &&
@@ -575,7 +581,9 @@ bool operator==(const Task& left, const Task& right)
 left.status_update_uuid() == right.status_update_uuid() &&
 left.labels() == right.labels() &&
 left.discovery() == right.discovery() &&
-left.user() == right.user();
+left.user() == right.user() &&
+left.container() == right.container() &&
+left.health_check() == right.health_check();
 }
 
 



[mesos] branch master updated (a32fd27 -> c076c8c)

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git.


from a32fd27  Updated 3 unit tests by changing IO switchboard to local mode.
 new 3c959eb  Added minimal agent handler for 'DrainSlaveMessage'.
 new 7d08b66  Added the DrainConfig to agent API outputs.
 new 04d25af  Added test for DrainConfig in agent API outputs.
 new ef19f29  Refactored the agent's task-killing code.
 new 27f0cd3  Updated an equality operator.
 new 3bb8287  Added kill policy to the 'Task' message.
 new e1c7985  Killed all tasks on the agent when draining.
 new 505928a  Added tests for task killing when draining the agent.
 new 1a32b31  Fixed pid checkpointing for `TestContainerizer`.
 new 54fb43e  Added recovery of agent drain information.
 new 1889268  Adjusted task status updates during draining.
 new a7044bd  Changed agent to fail task launches received during draining.
 new 654faf9  Cleared agent drain state when draining is finished.
 new c076c8c  Added test for agent to leave draining state on its own.

The 14 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 include/mesos/agent/agent.proto|   2 +
 include/mesos/mesos.proto  |   4 +
 include/mesos/type_utils.hpp   |   8 +
 include/mesos/v1/agent/agent.proto |   2 +
 include/mesos/v1/mesos.proto   |   4 +
 src/common/protobuf_utils.cpp  |   4 +
 src/common/type_utils.cpp  |  17 +-
 src/slave/http.cpp |  11 +
 src/slave/paths.cpp|   9 +
 src/slave/paths.hpp|   6 +
 src/slave/slave.cpp| 366 ++
 src/slave/slave.hpp|  31 +-
 src/slave/state.cpp|  16 +
 src/slave/state.hpp|   3 +
 src/tests/containerizer.cpp|  12 +
 src/tests/slave_tests.cpp  | 756 +
 16 files changed, 1173 insertions(+), 78 deletions(-)



[mesos] 08/14: Added tests for task killing when draining the agent.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 505928a3f51555bd3e45f2fc9787fdf890b28bfb
Author: Greg Mann 
AuthorDate: Mon Jul 15 10:25:56 2019 -0700

Added tests for task killing when draining the agent.

Review: https://reviews.apache.org/r/70904/
---
 src/tests/slave_tests.cpp | 335 ++
 1 file changed, 335 insertions(+)

diff --git a/src/tests/slave_tests.cpp b/src/tests/slave_tests.cpp
index 8098a1a..147967d 100644
--- a/src/tests/slave_tests.cpp
+++ b/src/tests/slave_tests.cpp
@@ -94,6 +94,8 @@
 #include "tests/resources_utils.hpp"
 #include "tests/utils.hpp"
 
+#include "tests/containerizer/mock_containerizer.hpp"
+
 using namespace mesos::internal::slave;
 
 #ifdef USE_SSL_SOCKET
@@ -11881,6 +11883,339 @@ TEST_F(SlaveTest, DrainInfoInAPIOutputs)
   }
 }
 
+
+// When an agent receives a `DrainSlaveMessage`, it should kill running tasks.
+TEST_F(SlaveTest, DrainAgentKillsRunningTask)
+{
+  Clock::pause();
+
  Try<Owned<cluster::Master>> master = StartMaster();
+  ASSERT_SOME(master);
+
  Future<UpdateSlaveMessage> updateSlaveMessage =
FUTURE_PROTOBUF(UpdateSlaveMessage(), _, _);
+
+  StandaloneMasterDetector detector(master.get()->pid);
+
+  slave::Flags slaveFlags = CreateSlaveFlags();
+
  Try<Owned<cluster::Slave>> slave = StartSlave(&detector, slaveFlags);
+  ASSERT_SOME(slave);
+
+  Clock::advance(slaveFlags.registration_backoff_factor);
+
+  AWAIT_READY(updateSlaveMessage);
+
  auto scheduler = std::make_shared<v1::MockHTTPScheduler>();
+
+  EXPECT_CALL(*scheduler, connected(_))
+.WillOnce(v1::scheduler::SendSubscribe(v1::DEFAULT_FRAMEWORK_INFO));
+
  Future<v1::scheduler::Event::Subscribed> subscribed;
  EXPECT_CALL(*scheduler, subscribed(_, _))
.WillOnce(FutureArg<1>(&subscribed));
+
+  EXPECT_CALL(*scheduler, heartbeat(_))
+.WillRepeatedly(Return()); // Ignore heartbeats.
+
  Future<v1::scheduler::Event::Offers> offers;
  EXPECT_CALL(*scheduler, offers(_, _))
.WillOnce(FutureArg<1>(&offers))
+.WillRepeatedly(Return()); // Ignore subsequent offers.
+
+  v1::scheduler::TestMesos mesos(
+  master.get()->pid,
+  ContentType::PROTOBUF,
+  scheduler);
+
+  AWAIT_READY(subscribed);
+
+  v1::FrameworkID frameworkId(subscribed->framework_id());
+
+  AWAIT_READY(offers);
+  ASSERT_FALSE(offers->offers().empty());
+
+  const v1::Offer& offer = offers->offers(0);
+  const v1::AgentID& agentId = offer.agent_id();
+
  Future<v1::scheduler::Event::Update> startingUpdate;
  Future<v1::scheduler::Event::Update> runningUpdate;
  EXPECT_CALL(*scheduler, update(_, _))
.WillOnce(DoAll(
FutureArg<1>(&startingUpdate),
v1::scheduler::SendAcknowledge(frameworkId, agentId)))
.WillOnce(DoAll(
FutureArg<1>(&runningUpdate),
v1::scheduler::SendAcknowledge(frameworkId, agentId)));
+
+  v1::Resources resources =
+v1::Resources::parse("cpus:0.1;mem:32;disk:32").get();
+
+  v1::TaskInfo taskInfo =
+v1::createTask(agentId, resources, SLEEP_COMMAND(1000));
+
+  v1::Offer::Operation launch = v1::LAUNCH({taskInfo});
+
+  mesos.send(
+  v1::createCallAccept(
+  frameworkId,
+  offer,
+  {launch}));
+
+  AWAIT_READY(startingUpdate);
+  EXPECT_EQ(v1::TASK_STARTING, startingUpdate->status().state());
+
+  AWAIT_READY(runningUpdate);
+  EXPECT_EQ(v1::TASK_RUNNING, runningUpdate->status().state());
+
  Future<v1::scheduler::Event::Update> killedUpdate;
  EXPECT_CALL(*scheduler, update(_, _))
.WillOnce(FutureArg<1>(&killedUpdate));
+
+  // Simulate the master sending a `DrainSlaveMessage` to the agent.
+
+  // Immediately kill the task forcefully.
+  DurationInfo maxGracePeriod;
+  maxGracePeriod.set_nanoseconds(0);
+
+  DrainConfig drainConfig;
+  drainConfig.set_mark_gone(true);
+  drainConfig.mutable_max_grace_period()->CopyFrom(maxGracePeriod);
+
+  DrainSlaveMessage drainSlaveMessage;
+  drainSlaveMessage.mutable_config()->CopyFrom(drainConfig);
+
+  process::post(master.get()->pid, slave.get()->pid, drainSlaveMessage);
+
+  AWAIT_READY(killedUpdate);
+
+  EXPECT_EQ(v1::TASK_KILLED, killedUpdate->status().state());
+}
+
+
+// When the agent receives a `DrainSlaveMessage`, it should kill queued tasks.
+TEST_F(SlaveTest, DrainAgentKillsQueuedTask)
+{
+  Clock::pause();
+
+  Try> master = StartMaster();
+  ASSERT_SOME(master);
+
+  Future updateSlaveMessage =
+FUTURE_PROTOBUF(UpdateSlaveMessage(), _, _);
+
+  MockContainerizer mockContainerizer;
+  StandaloneMasterDetector detector(master.get()->pid);
+  slave::Flags slaveFlags = CreateSlaveFlags();
+
+  EXPECT_CALL(mockContainerizer, recover(_))
+.WillOnce(Return(Nothing()));
+
  EXPECT_CALL(mockContainerizer, containers())
.WillOnce(Return(hashset<ContainerID>()));
+
  Try<Owned<cluster::Slave>> slave = StartSlave(
  &detector,
  &mockContainerizer,
  slaveFlags);
+  ASSERT_SOME(slave);
+
+  Clock::advance(slaveFlags.registration_backoff_factor);
+
+  AWAIT_READY(updateSlaveMessage);
+
  auto scheduler = std::make_shared<v1::MockHTTPScheduler>();
+
+  EXPECT_CALL(*scheduler, connected(_))
+.WillOnce(v1::scheduler::SendSubscribe(v1::DEFAULT_FRAMEWORK_INFO));
+
  Future<v1::scheduler::Event::Subscribed> subscribed;
+  EXPECT_CALL(*scheduler, 

[mesos] 02/14: Added the DrainConfig to agent API outputs.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 7d08b667e446840dc31538d9d40705e3d8fb12a0
Author: Greg Mann 
AuthorDate: Mon Jul 15 10:25:35 2019 -0700

Added the DrainConfig to agent API outputs.

Review: https://reviews.apache.org/r/70835/
---
 include/mesos/agent/agent.proto|  2 ++
 include/mesos/v1/agent/agent.proto |  2 ++
 src/slave/http.cpp | 11 +++
 3 files changed, 15 insertions(+)

diff --git a/include/mesos/agent/agent.proto b/include/mesos/agent/agent.proto
index 83eb7bb..3cb622d 100644
--- a/include/mesos/agent/agent.proto
+++ b/include/mesos/agent/agent.proto
@@ -569,6 +569,8 @@ message Response {
   // Contains the agent's information.
   message GetAgent {
 optional SlaveInfo slave_info = 1;
+
+optional DrainConfig drain_config = 2;
   }
 
   // Lists information about all resource providers known to the agent
diff --git a/include/mesos/v1/agent/agent.proto b/include/mesos/v1/agent/agent.proto
index f6574cb..4324ad6 100644
--- a/include/mesos/v1/agent/agent.proto
+++ b/include/mesos/v1/agent/agent.proto
@@ -569,6 +569,8 @@ message Response {
   // Contains the agent's information.
   message GetAgent {
 optional AgentInfo agent_info = 1;
+
+optional DrainConfig drain_config = 2;
   }
 
   // Lists information about all resource providers known to the agent
diff --git a/src/slave/http.cpp b/src/slave/http.cpp
index 69e6d74..321dca7 100644
--- a/src/slave/http.cpp
+++ b/src/slave/http.cpp
@@ -1331,6 +1331,12 @@ Future<Response> Http::state(
   writer->field("domain", slave->info.domain());
 }
 
+if (slave->drainConfig.isSome()) {
+  writer->field(
+  "drain_config",
+  JSON::Protobuf(slave->drainConfig.get()));
+}
+
 const Resources& totalResources = slave->totalResources;
 
 writer->field("resources", totalResources);
@@ -1842,6 +1848,11 @@ Future<Response> Http::getAgent(
 
   response.mutable_get_agent()->mutable_slave_info()->CopyFrom(slave->info);
 
+  if (slave->drainConfig.isSome()) {
+response.mutable_get_agent()->mutable_drain_config()->CopyFrom(
+slave->drainConfig.get());
+  }
+
   return OK(serialize(acceptType, evolve(response)),
 stringify(acceptType));
 }



[mesos] 07/14: Killed all tasks on the agent when draining.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit e1c7985e96d84693f3e41d3a50da5f5ea11b6cd8
Author: Greg Mann 
AuthorDate: Mon Jul 15 10:25:51 2019 -0700

Killed all tasks on the agent when draining.

This patch updates the agent's `DrainSlaveMessage` handler
to kill all tasks on the agent when the message is received.

Review: https://reviews.apache.org/r/70903/
---
 include/mesos/type_utils.hpp |  6 +
 src/slave/slave.cpp  | 62 
 2 files changed, 68 insertions(+)

diff --git a/include/mesos/type_utils.hpp b/include/mesos/type_utils.hpp
index 2fd8a62..98a2995 100644
--- a/include/mesos/type_utils.hpp
+++ b/include/mesos/type_utils.hpp
@@ -338,6 +338,12 @@ inline bool operator<(const ContainerID& left, const ContainerID& right)
 }
 
 
+inline bool operator<(const DurationInfo& left, const DurationInfo& right)
+{
+  return left.nanoseconds() < right.nanoseconds();
+}
+
+
 inline bool operator<(const ExecutorID& left, const ExecutorID& right)
 {
   return left.value() < right.value();
diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index 741c1f6..19b4769 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -999,6 +999,68 @@ void Slave::drain(
 << "Failed to checkpoint DrainConfig";
 
   drainConfig = drainSlaveMessage.config();
+
+  const Option<DurationInfo> maxGracePeriod =
+drainConfig->has_max_grace_period()
+  ? drainConfig->max_grace_period()
+  : Option<DurationInfo>::none();
+
+  auto calculateKillPolicy =
+[&](const Option<KillPolicy>& killPolicy) -> Option<KillPolicy> {
+  if (maxGracePeriod.isNone()) {
+return None();
+  }
+
+  KillPolicy killPolicyOverride;
+  killPolicyOverride.mutable_grace_period()->CopyFrom(maxGracePeriod.get());
+
+  // Task kill policy is not set or unknown.
+  if (killPolicy.isNone() || !killPolicy->has_grace_period()) {
+return killPolicyOverride;
+  }
+
+  // Task kill policy is greater than the override.
+  if (maxGracePeriod.get() < killPolicy->grace_period()) {
+return killPolicyOverride;
+  }
+
+  return None();
+};
+
+  // Frameworks may be removed within `kill()` or `killPendingTask()` below,
+  // so we must copy them and their members before looping.
+  foreachvalue (Framework* framework, utils::copy(frameworks)) {
+typedef hashmap<TaskID, TaskInfo> TaskMap;
+foreachvalue (const TaskMap& tasks, utils::copy(framework->pendingTasks)) {
+  foreachvalue (const TaskInfo& task, tasks) {
+killPendingTask(framework->id(), framework, task.task_id());
+  }
+}
+
+foreachvalue (Executor* executor, utils::copy(framework->executors)) {
+  foreachvalue (Task* task, executor->launchedTasks) {
+kill(framework->id(),
+ framework,
+ executor,
+ task->task_id(),
+ calculateKillPolicy(
+task->has_kill_policy()
+  ? task->kill_policy()
+  : Option<KillPolicy>::none()));
+  }
+
+  foreachvalue (const TaskInfo& task, utils::copy(executor->queuedTasks)) {
+kill(framework->id(),
+ framework,
+ executor,
+ task.task_id(),
+ calculateKillPolicy(
+task.has_kill_policy()
+  ? task.kill_policy()
+  : Option<KillPolicy>::none()));
+  }
+}
+  }
 }
 
 



[mesos] 13/14: Cleared agent drain state when draining is finished.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 654faf9244b0016f8a17623aca7812923b3a313a
Author: Benjamin Bannier 
AuthorDate: Mon Jul 15 10:26:23 2019 -0700

Cleared agent drain state when draining is finished.

Once a draining agent has neither frameworks with pending tasks nor
any executors with queued or launched tasks, it has finished draining.
This patch adds handling of that case, which clears both the in-memory
and persisted drain configuration.

Review: https://reviews.apache.org/r/70959/
---
 src/slave/slave.cpp | 31 +++
 src/slave/slave.hpp |  4 
 2 files changed, 35 insertions(+)

diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index eecd71e..2477975 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -7067,6 +7067,8 @@ void Slave::removeFramework(Framework* framework)
   // Pass ownership of the framework pointer.
   completedFrameworks.set(framework->id(), Owned(framework));
 
+  updateDrainStatus();
+
   if (state == TERMINATING && frameworks.empty()) {
 terminate(self());
   }
@@ -8944,6 +8946,8 @@ void Slave::removeOperation(Operation* operation)
 
   checkpointResourceState(
   totalResources.filter(mesos::needCheckpointing), false);
+
+  updateDrainStatus();
 }
 
 
@@ -9768,6 +9772,33 @@ void Slave::initializeResourceProviderManager(
 }
 
 
+void Slave::updateDrainStatus()
+{
+  if (drainConfig.isNone()) {
+return;
+  }
+
+  bool drained = operations.empty() && frameworks.empty();
+
+  if (!drained) {
+return;
+  }
+
+  LOG(INFO) << "Agent finished draining";
+
+  const string drainConfigPath = paths::getDrainConfigPath(metaDir, info.id());
+
+  Try<Nothing> rm = os::rm(drainConfigPath);
+
+  if (rm.isError()) {
+EXIT(EXIT_FAILURE) << "Could not remove persisted drain configuration "
+   << "'" << drainConfigPath << "': " << rm.error();
+  }
+
+  drainConfig = None();
+}
+
+
 Framework::Framework(
 Slave* _slave,
 const Flags& slaveFlags,
diff --git a/src/slave/slave.hpp b/src/slave/slave.hpp
index 58bdd2a..58a5608 100644
--- a/src/slave/slave.hpp
+++ b/src/slave/slave.hpp
@@ -910,6 +910,10 @@ private:
   // If the agent is currently draining, contains the configuration used to
   // drain the agent. If NONE, the agent is not currently draining.
  Option<DrainConfig> drainConfig;
+
+  // Check whether draining is finished and possibly remove
+  // both in-memory and persisted drain configuration.
+  void updateDrainStatus();
 };
 
 



[mesos] 04/14: Refactored the agent's task-killing code.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit ef19f297be6c192f4d2cea0f9ed413a1dfaaf882
Author: Greg Mann 
AuthorDate: Mon Jul 15 10:25:42 2019 -0700

Refactored the agent's task-killing code.

This patch factors the code responsible for killing tasks
out into two helper functions. This will facilitate the
calling of this common code by the agent-draining handler.

Review: https://reviews.apache.org/r/70899/
---
 src/slave/slave.cpp | 133 +++-
 src/slave/slave.hpp |  19 +++-
 2 files changed, 97 insertions(+), 55 deletions(-)

diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index fc688dc..741c1f6 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -3670,10 +3670,6 @@ void Slave::killTask(
 return;
   }
 
-  CHECK(framework->state == Framework::RUNNING ||
-framework->state == Framework::TERMINATING)
-<< framework->state;
-
   // We don't send a status update here because a terminating
   // framework cannot send acknowledgements.
   if (framework->state == Framework::TERMINATING) {
@@ -3683,54 +3679,10 @@ void Slave::killTask(
 return;
   }
 
-  // If the task is pending, we send a TASK_KILLED immediately.
-  // This will trigger a synchronous removal of the pending task,
-  // which prevents it from being launched.
-  if (framework->isPending(taskId)) {
-LOG(WARNING) << "Killing task " << taskId
- << " of framework " << frameworkId
- << " before it was launched";
-
-Option<TaskGroupInfo> taskGroup =
-  framework->getTaskGroupForPendingTask(taskId);
-
-vector<StatusUpdate> updates;
-if (taskGroup.isSome()) {
-  foreach (const TaskInfo& task, taskGroup->tasks()) {
-updates.push_back(protobuf::createStatusUpdate(
-frameworkId,
-info.id(),
-task.task_id(),
-TASK_KILLED,
-TaskStatus::SOURCE_SLAVE,
-id::UUID::random(),
-"A task within the task group was killed before"
-" delivery to the executor",
-TaskStatus::REASON_TASK_KILLED_DURING_LAUNCH,
-CHECK_NOTNONE(
-framework->getExecutorIdForPendingTask(task.task_id()))));
-  }
-} else {
-  updates.push_back(protobuf::createStatusUpdate(
-  frameworkId,
-  info.id(),
-  taskId,
-  TASK_KILLED,
-  TaskStatus::SOURCE_SLAVE,
-  id::UUID::random(),
-  "Killed before delivery to the executor",
-  TaskStatus::REASON_TASK_KILLED_DURING_LAUNCH,
-  CHECK_NOTNONE(
-  framework->getExecutorIdForPendingTask(taskId))));
-}
+  CHECK(framework->state == Framework::RUNNING) << framework->state;
 
-foreach (const StatusUpdate& update, updates) {
-  // NOTE: Sending a terminal update (TASK_KILLED) synchronously
-  // removes the task/task group from 'framework->pendingTasks'
-  // and 'framework->pendingTaskGroups', so that it will not be
-  // launched.
-  statusUpdate(update, UPID());
-}
+  if (framework->isPending(taskId)) {
+killPendingTask(frameworkId, framework, taskId);
 
 return;
   }
@@ -3763,6 +3715,80 @@ void Slave::killTask(
 return;
   }
 
+  kill(frameworkId,
+   framework,
+   executor,
+   taskId,
+   (killTaskMessage.has_kill_policy()
+  ? killTaskMessage.kill_policy()
+  : Option<KillPolicy>::none()));
+}
+
+
+void Slave::killPendingTask(
+const FrameworkID& frameworkId,
+Framework* framework,
+const TaskID& taskId)
+{
+  LOG(WARNING) << "Killing task " << taskId
+   << " of framework " << frameworkId
+   << " before it was launched";
+
+  Option<TaskGroupInfo> taskGroup =
+framework->getTaskGroupForPendingTask(taskId);
+
+  vector<StatusUpdate> updates;
+  if (taskGroup.isSome()) {
+foreach (const TaskInfo& task, taskGroup->tasks()) {
+  updates.push_back(protobuf::createStatusUpdate(
+  frameworkId,
+  info.id(),
+  task.task_id(),
+  TASK_KILLED,
+  TaskStatus::SOURCE_SLAVE,
+  id::UUID::random(),
+  "A task within the task group was killed before"
+  " delivery to the executor",
+  TaskStatus::REASON_TASK_KILLED_DURING_LAUNCH,
+  CHECK_NOTNONE(
+  framework->getExecutorIdForPendingTask(task.task_id()))));
+}
+  } else {
+updates.push_back(protobuf::createStatusUpdate(
+frameworkId,
+info.id(),
+taskId,
+TASK_KILLED,
+TaskStatus::SOURCE_SLAVE,
+id::UUID::random(),
+"Killed before delivery to the executor",
+TaskStatus::REASON_TASK_KILLED_DURING_LAUNCH,
+CHECK_NOTNONE(
+framework->getExecutorIdForPendingTask(taskId))));
+  }
+
+  foreach (const StatusUpdate& update, updates) {
+// NOTE: Sending a 

[mesos] 12/14: Changed agent to fail task launches received during draining.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit a7044bdcd91e467173bb44263658ac5c8df08d8c
Author: Benjamin Bannier 
AuthorDate: Mon Jul 15 10:26:18 2019 -0700

Changed agent to fail task launches received during draining.

With this patch the agent will now reject task launches while draining.
While we do not expect the master to send task launches to draining
agents it is still worthwhile to ensure no new tasks can be launched
while draining. This invariant simplifies e.g., the handling of drain
requests since we know that once the agent has entered a draining state
we only need to terminate existing tasks and no new tasks can appear.
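
From a framework's perspective, the observable effect is a terminal
update for the rejected launch. A sketch of the status a scheduler
would receive (values taken from the handler in the diff below):

  // state:   TASK_DROPPED
  // reason:  TaskStatus::REASON_SLAVE_DRAINING
  // message: "Task was received while agent was already draining"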

Review: https://reviews.apache.org/r/70958/
---
 src/slave/slave.cpp | 74 +
 1 file changed, 52 insertions(+), 22 deletions(-)

diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index 37385bd..eecd71e 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -2736,17 +2736,51 @@ void Slave::__run(
 CHECK(framework->removePendingTask(_task.task_id()));
   }
 
-  // Check task invariants.
+  // Check task launch invariants.
   //
   // TODO(bbannier): Instead of copy-pasting identical code to deal
   // with cases where tasks need to be terminated, consolidate code
   // below to decouple checking from terminating.
+  Option<string> kill = None();
+
+  // Fail the launch if the agent is draining.
+  if (drainConfig.isSome()) {
+LOG(WARNING) << "Ignoring running " << taskOrTaskGroup(task, taskGroup)
+ << " of framework " << frameworkId
+ << " because the agent is draining";
+
+kill = "Task was received while agent was already draining";
+  }
+
+  if (kill.isSome()) {
+sendTaskDroppedUpdate(TaskStatus::REASON_SLAVE_DRAINING, *kill);
+
+// Refer to the comment after 'framework->removePendingTask' above
+// for why we need this.
+if (framework->idle()) {
+  removeFramework(framework);
+}
+
+if (launchExecutor.isSome() && launchExecutor.get()) {
+  // Master expects a new executor to be launched for this task(s).
+  // To keep the master executor entries updated, the agent needs to send
+  // `ExitedExecutorMessage` even though no executor launched.
+  sendExitedExecutorMessage(frameworkId, executorInfo.executor_id());
+
+  // See the declaration of `taskLaunchSequences` regarding its lifecycle
+  // management.
+  framework->taskLaunchSequences.erase(executorInfo.executor_id());
+}
+
+return;
+  }
+
+  CHECK_NONE(kill);
 
   // If the master sent resource versions, perform a best-effort check
   // that they are consistent with the resources the task uses.
   //
   // TODO(bbannier): Also check executor resources.
-  bool kill = false;
   if (!resourceVersionUuids.empty()) {
    hashset<Option<ResourceProviderID>> usedResourceProviderIds;
 foreach (const TaskInfo& _task, tasks) {
@@ -2767,7 +2801,7 @@ void Slave::__run(
 CHECK(receivedResourceVersions.contains(None()));
 
 if (resourceVersion != receivedResourceVersions.at(None())) {
-  kill = true;
+  kill = "Task assumes outdated resource state";
 }
   } else {
 ResourceProvider* resourceProvider =
@@ -2776,16 +2810,14 @@ void Slave::__run(
 if (resourceProvider == nullptr ||
 resourceProvider->resourceVersion !=
   receivedResourceVersions.at(resourceProviderId.get())) {
-  kill = true;
+  kill = "Task assumes outdated resource state";
 }
   }
 }
   }
 
-  if (kill) {
-sendTaskDroppedUpdate(
-TaskStatus::REASON_INVALID_OFFERS,
-"Task assumes outdated resource state");
+  if (kill.isSome()) {
+sendTaskDroppedUpdate(TaskStatus::REASON_INVALID_OFFERS, *kill);
 
 // Refer to the comment after 'framework->removePendingTask' above
 // for why we need this.
@@ -2813,7 +2845,7 @@ void Slave::__run(
 return result;
   };
 
-  CHECK_EQ(kill, false);
+  CHECK_NONE(kill);
 
   // NOTE: If the task/task group or executor uses resources that are
   // checkpointed on the slave (e.g. persistent volumes), we should
@@ -2834,17 +2866,16 @@ void Slave::__run(
  << " for task " << _task
  << " of framework " << frameworkId;
 
-kill = true;
+kill =
+  "The checkpointed resources being used by the task or task group are "
+  "unknown to the agent";
 break;
   }
 }
   }
 
-  if (kill) {
-sendTaskDroppedUpdate(
-TaskStatus::REASON_RESOURCES_UNKNOWN,
-"The checkpointed resources being used by the task or task group are "
-"unknown to the agent");
+  if (kill.isSome()) {
+sendTaskDroppedUpdate(TaskStatus::REASON_RESOURCES_UNKNOWN, *kill);
 
 // Refer to the comment after 

[mesos] 09/14: Fixed pid checkpointing for `TestContainerizer`.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 1a32b31496600e3ad5d54d8c4e8497e7ef420b18
Author: Benjamin Bannier 
AuthorDate: Mon Jul 15 10:26:01 2019 -0700

Fixed pid checkpointing for `TestContainerizer`.

In order for a `MockExecutor` to be able to reregister after an agent
restart, a persisted pid is required. This patch adds checkpointing of
the pid.

Review: https://reviews.apache.org/r/70906/
---
 src/tests/containerizer.cpp | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/tests/containerizer.cpp b/src/tests/containerizer.cpp
index fab7e81..3ac992f 100644
--- a/src/tests/containerizer.cpp
+++ b/src/tests/containerizer.cpp
@@ -163,6 +163,18 @@ public:
   ContentType::PROTOBUF, executor, fullEnvironment));
 }
 
+// Checkpoint the forked pid if requested by the agent.
+if (pidCheckpointPath.isSome()) {
+  Try<Nothing> checkpointed = slave::state::checkpoint(
+  pidCheckpointPath.get(), stringify(::getpid()));
+
+  if (checkpointed.isError()) {
+LOG(ERROR) << "Failed to checkpoint container's forked pid to '"
+   << pidCheckpointPath.get() << "': " << checkpointed.error();
+return Failure("Could not checkpoint container's pid");
+  }
+}
+
 return slave::Containerizer::LaunchResult::SUCCESS;
   }
 



[mesos] 01/14: Added minimal agent handler for 'DrainSlaveMessage'.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 3c959eb769ec4a39721947e7ec173dd9eefc6af4
Author: Greg Mann 
AuthorDate: Mon Jul 15 10:25:30 2019 -0700

Added minimal agent handler for 'DrainSlaveMessage'.

This patch adds a minimal handler to the agent for the
`DrainSlaveMessage`. This handler will later be extended
to implement the full functionality.

Review: https://reviews.apache.org/r/70834/
---
 src/slave/paths.cpp |  9 +
 src/slave/paths.hpp |  6 ++
 src/slave/slave.cpp | 20 
 src/slave/slave.hpp |  8 
 4 files changed, 43 insertions(+)

diff --git a/src/slave/paths.cpp b/src/slave/paths.cpp
index 1163c88..28a7cf9 100644
--- a/src/slave/paths.cpp
+++ b/src/slave/paths.cpp
@@ -55,6 +55,7 @@ namespace paths {
 // File names.
 const char BOOT_ID_FILE[] = "boot_id";
 const char SLAVE_INFO_FILE[] = "slave.info";
+const char DRAIN_CONFIG_FILE[] = "drain.config";
 const char FRAMEWORK_PID_FILE[] = "framework.pid";
 const char FRAMEWORK_INFO_FILE[] = "framework.info";
 const char LIBPROCESS_PID_FILE[] = "libprocess.pid";
@@ -658,6 +659,14 @@ string getResourcesTargetPath(
 }
 
 
+string getDrainConfigPath(
+const string& metaDir,
+const SlaveID& slaveId)
+{
+  return path::join(getSlavePath(metaDir, slaveId), DRAIN_CONFIG_FILE);
+}
+
+
 Try> getPersistentVolumePaths(
 const std::string& workDir)
 {
diff --git a/src/slave/paths.hpp b/src/slave/paths.hpp
index ad76826..e077587 100644
--- a/src/slave/paths.hpp
+++ b/src/slave/paths.hpp
@@ -76,6 +76,7 @@ namespace paths {
 //   |   |   |-- latest (symlink)
//   |   |   |-- <slave ID>
 //   |   |   |-- slave.info
+//   |   |   |-- drain.config
 //   |   |   |-- operations
//   |   |   |   |-- <operation ID>
 //   |   |   |   |-- operation.updates
@@ -422,6 +423,11 @@ std::string getResourcesTargetPath(
 const std::string& rootDir);
 
 
+std::string getDrainConfigPath(
+const std::string& metaDir,
+const SlaveID& slaveId);
+
+
 Try> getPersistentVolumePaths(
 const std::string& workDir);
 
diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index 30039b0..fc688dc 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -749,6 +749,8 @@ void Slave::initialize()
   ::shutdown,
   ::message);
 
+  install(::drain);
+
   install(
   ::ping,
   ::connected);
@@ -982,6 +984,24 @@ void Slave::shutdown(const UPID& from, const string& message)
 }
 
 
+void Slave::drain(
+const UPID& from,
+DrainSlaveMessage&& drainSlaveMessage)
+{
+  LOG(INFO)
+<< "Checkpointing DrainConfig. Previous drain config was "
+<< (drainConfig.isSome() ? stringify(drainConfig.get()) : "NONE")
+<< ", new drain config is " << drainSlaveMessage.config();
+
+  CHECK_SOME(state::checkpoint(
+  paths::getDrainConfigPath(metaDir, info.id()),
+  drainSlaveMessage.config()))
+<< "Failed to checkpoint DrainConfig";
+
+  drainConfig = drainSlaveMessage.config();
+}
+
+
 void Slave::fileAttached(
 const Future& result,
 const string& path,
diff --git a/src/slave/slave.hpp b/src/slave/slave.hpp
index 6954f53..dbcceed 100644
--- a/src/slave/slave.hpp
+++ b/src/slave/slave.hpp
@@ -376,6 +376,10 @@ public:
   const process::UPID& from,
   const AcknowledgeOperationStatusMessage& acknowledgement);
 
+  void drain(
+  const process::UPID& from,
+  DrainSlaveMessage&& drainSlaveMessage);
+
   void executorLaunched(
   const FrameworkID& frameworkId,
   const ExecutorID& executorId,
@@ -885,6 +889,10 @@ private:
 
   // Operations that are checkpointed by the agent.
   hashmap checkpointedOperations;
+
+  // If the agent is currently draining, contains the configuration used to
+  // drain the agent. If NONE, the agent is not currently draining.
+  Option drainConfig;
 };
 
 



[mesos] 14/14: Added test for agent to leave draining state on its own.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit c076c8ce286abfb34de3d962a7ca1601d9494919
Author: Benjamin Bannier 
AuthorDate: Mon Jul 15 10:26:28 2019 -0700

Added test for agent to leave draining state on its own.

This patch adds a test which confirms that the agent leaves a draining
state on its own once all frameworks on the agent have no more pending
tasks and all their executors have neither launched nor queued tasks.

The test uses the fact that the agent rejects task launches while
draining.

Review: https://reviews.apache.org/r/70960/
---
 src/tests/slave_tests.cpp | 188 ++
 1 file changed, 188 insertions(+)

diff --git a/src/tests/slave_tests.cpp b/src/tests/slave_tests.cpp
index 95f7780..1ed59ca 100644
--- a/src/tests/slave_tests.cpp
+++ b/src/tests/slave_tests.cpp
@@ -12223,6 +12223,194 @@ TEST_F(SlaveTest, DrainAgentKillsPendingTask)
 }
 
 
+// This test validates that a draining agent fails further task launch
+// attempts to protect its internal draining invariants, and that the
+// agent leaves the draining state on its own once all tasks have
+// terminated and their status updates have been acknowledged.
+TEST_F(SlaveTest, DrainingAgentRejectLaunch)
+{
+  Clock::pause();
+
+  master::Flags masterFlags = CreateMasterFlags();
  Try<Owned<cluster::Master>> master = StartMaster(masterFlags);
+  ASSERT_SOME(master);
+
  Future<UpdateSlaveMessage> updateSlaveMessage =
FUTURE_PROTOBUF(UpdateSlaveMessage(), _, _);
+
+  StandaloneMasterDetector detector(master.get()->pid);
+  slave::Flags slaveFlags = CreateSlaveFlags();
+
  Try<Owned<cluster::Slave>> slave = StartSlave(&detector, slaveFlags);
+  ASSERT_SOME(slave);
+
+  Clock::advance(slaveFlags.registration_backoff_factor);
+
+  AWAIT_READY(updateSlaveMessage);
+
+  // Register a scheduler to launch tasks.
  auto scheduler = std::make_shared<v1::MockHTTPScheduler>();
+
+  EXPECT_CALL(*scheduler, connected(_))
+.WillOnce(v1::scheduler::SendSubscribe(v1::DEFAULT_FRAMEWORK_INFO));
+
  Future<v1::scheduler::Event::Subscribed> subscribed;
  EXPECT_CALL(*scheduler, subscribed(_, _))
.WillOnce(FutureArg<1>(&subscribed));
+
  Future<v1::scheduler::Event::Offers> offers1;
  EXPECT_CALL(*scheduler, offers(_, _))
.WillOnce(FutureArg<1>(&offers1))
+.WillRepeatedly(Return()); // Ignore subsequent offers.
+
+  EXPECT_CALL(*scheduler, heartbeat(_))
+.WillRepeatedly(Return()); // Ignore heartbeats.
+
+  v1::scheduler::TestMesos mesos(
+  master.get()->pid, ContentType::PROTOBUF, scheduler);
+
+  AWAIT_READY(subscribed);
+
+  v1::FrameworkID frameworkId(subscribed->framework_id());
+
+  AWAIT_READY(offers1);
+  ASSERT_FALSE(offers1->offers().empty());
+
+  v1::Offer offer = offers1->offers(0);
+  v1::AgentID agentId = offer.agent_id();
+
+  // Launch a task. When the agent is put into draining state this task will be
+  // killed, but we will leave the draining state open even after the task is
+  // killed by not acknowledging the terminal task status update.
+  v1::Resources resources =
+v1::Resources::parse("cpus:0.1;mem:32;disk:32").get();
+
+  v1::TaskInfo taskInfo1 =
+v1::createTask(agentId, resources, SLEEP_COMMAND(1000), None());
+
+  // We do not acknowledge the KILLED update to control
+  // when the agent finishes draining.
  Future<v1::scheduler::Event::Update> runningUpdate1;
  Future<v1::scheduler::Event::Update> killedUpdate1;
  EXPECT_CALL(*scheduler, update(_, _))
.WillOnce(v1::scheduler::SendAcknowledge(frameworkId, agentId)) // Starting.
.WillOnce(DoAll(
v1::scheduler::SendAcknowledge(frameworkId, agentId),
FutureArg<1>(&runningUpdate1)))
.WillOnce(FutureArg<1>(&killedUpdate1));
+
+  mesos.send(
+  v1::createCallAccept(frameworkId, offer, {v1::LAUNCH({taskInfo1})}));
+
+  AWAIT_READY(runningUpdate1);
+
+  // Simulate the master sending a `DrainSlaveMessage` to the agent.
+  DurationInfo maxGracePeriod;
+  maxGracePeriod.set_nanoseconds(0);
+
+  DrainConfig drainConfig;
+  drainConfig.set_mark_gone(false);
+  drainConfig.mutable_max_grace_period()->CopyFrom(maxGracePeriod);
+
+  DrainSlaveMessage drainSlaveMessage;
+  drainSlaveMessage.mutable_config()->CopyFrom(drainConfig);
+
+  // Explicitly wait for the executor to be terminated.
+  Future<Nothing> executorTerminated =
+FUTURE_DISPATCH(_, &Slave::executorTerminated);
+
+  process::post(master.get()->pid, slave.get()->pid, drainSlaveMessage);
+
+  // Wait until we have received the terminal task status update
+  // (which we did not acknowledge) before continuing. The agent will
+  // subsequentially be left in a draining state.
+  AWAIT_READY(killedUpdate1);
+  ASSERT_EQ(v1::TASK_KILLED, killedUpdate1->status().state());
+
  Future<v1::scheduler::Event::Offers> offers2;
  EXPECT_CALL(*scheduler, offers(_, _))
.WillOnce(FutureArg<1>(&offers2))
+.WillRepeatedly(Return()); // Ignore subsequent offers.
+
+  // Resume the clock so the containerizer can detect the terminated executor.
+  Clock::resume();
+  AWAIT_READY(executorTerminated);
+  Clock::pause();
+  Clock::settle();
+
+  

[mesos] 11/14: Adjusted task status updates during draining.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 18892687ffd98be35fc0f2012df5aae9c99a034e
Author: Benjamin Bannier 
AuthorDate: Mon Jul 15 10:26:10 2019 -0700

Adjusted task status updates during draining.

When a task is reported as killed to the agent during active agent
draining we now decorate the reported status with
`REASON_AGENT_DRAINING` unconditionally. If the draining marks the agent
as gone via the `mark_gone` draining flag we additionally report
`TASK_GONE_BY_OPERATOR` instead of the original state.

This patch leaves some ambiguity in what triggered the kill, since the
agent-executor protocol does not transport reasons; instead, the reason
is only inferred here after the killed task has been observed. This
should usually be fine: due to the inherent race between, e.g., a
user-triggered and a drain-triggered kill, a user cannot distinguish
racy reasons anyway.
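
Concretely, the decorated updates look as follows (a sketch based on
the switch statement in the diff below; the tests use the equivalent
v1 naming, REASON_AGENT_DRAINING):

  // mark_gone == false:
  //   state:  TASK_KILLED
  //   reason: TaskStatus::REASON_SLAVE_DRAINING
  //
  // mark_gone == true:
  //   state:  TASK_GONE_BY_OPERATOR
  //   reason: TaskStatus::REASON_SLAVE_DRAINING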

Review: https://reviews.apache.org/r/70936/
---
 src/slave/slave.cpp   | 34 ++
 src/tests/slave_tests.cpp | 44 +++-
 2 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index 3878ab8..37385bd 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -5698,6 +5698,40 @@ void Slave::statusUpdate(StatusUpdate update, const Option<UPID>& pid)
   update.mutable_status()->set_source(
   pid == UPID() ? TaskStatus::SOURCE_SLAVE : TaskStatus::SOURCE_EXECUTOR);
 
+  // If the agent is draining we provide additional
+  // information for KILLING or KILLED states.
+  if (drainConfig.isSome()) {
+switch (update.status().state()) {
+  case TASK_STAGING:
+  case TASK_STARTING:
+  case TASK_RUNNING:
+  case TASK_FAILED:
+  case TASK_FINISHED:
+  case TASK_ERROR:
+  case TASK_LOST:
+  case TASK_DROPPED:
+  case TASK_UNREACHABLE:
+  case TASK_GONE:
+  case TASK_GONE_BY_OPERATOR:
+  case TASK_UNKNOWN: {
+break;
+  }
+  case TASK_KILLING:
+  case TASK_KILLED: {
+// We unconditionally overwrite any previous reason to provide a
+// consistent signal that this task went away during draining.
+update.mutable_status()->set_reason(TaskStatus::REASON_SLAVE_DRAINING);
+
+// If the draining marks the agent as gone report tasks as
+// gone by operator.
+if (drainConfig->mark_gone()) {
+  update.mutable_status()->set_state(TASK_GONE_BY_OPERATOR);
+}
+break;
+  }
+}
+  }
+
   // Set TaskStatus.executor_id if not already set; overwrite existing
   // value if already set.
   if (update.has_executor_id()) {
diff --git a/src/tests/slave_tests.cpp b/src/tests/slave_tests.cpp
index 5f8e53c..95f7780 100644
--- a/src/tests/slave_tests.cpp
+++ b/src/tests/slave_tests.cpp
@@ -11989,7 +11989,9 @@ TEST_F(SlaveTest, DrainAgentKillsRunningTask)
 
   AWAIT_READY(killedUpdate);
 
-  EXPECT_EQ(v1::TASK_KILLED, killedUpdate->status().state());
+  EXPECT_EQ(v1::TASK_GONE_BY_OPERATOR, killedUpdate->status().state());
+  EXPECT_EQ(
+  v1::TaskStatus::REASON_AGENT_DRAINING, killedUpdate->status().reason());
 }
 
 
@@ -12108,7 +12110,9 @@ TEST_F(SlaveTest, DrainAgentKillsQueuedTask)
 
   AWAIT_READY(killedUpdate);
 
-  EXPECT_EQ(v1::TASK_KILLED, killedUpdate->status().state());
+  EXPECT_EQ(v1::TASK_GONE_BY_OPERATOR, killedUpdate->status().state());
+  EXPECT_EQ(
+  v1::TaskStatus::REASON_AGENT_DRAINING, killedUpdate->status().reason());
 }
 
 
@@ -12213,7 +12217,9 @@ TEST_F(SlaveTest, DrainAgentKillsPendingTask)
 
   AWAIT_READY(killedUpdate);
 
-  EXPECT_EQ(v1::TASK_KILLED, killedUpdate->status().state());
+  EXPECT_EQ(v1::TASK_GONE_BY_OPERATOR, killedUpdate->status().state());
+  EXPECT_EQ(
+  v1::TaskStatus::REASON_AGENT_DRAINING, killedUpdate->status().reason());
 }
 
 
@@ -12228,6 +12234,10 @@ TEST_F(SlaveTest, CheckpointedDrainInfo)
 
   slave::Flags slaveFlags = CreateSlaveFlags();
 
+  // Make the executor reregistration timeout less than the agent's
+  // registration backoff factor to avoid resent status updates.
+  slaveFlags.executor_reregistration_timeout = Milliseconds(2);
+
   ExecutorID executorId = DEFAULT_EXECUTOR_ID;
   MockExecutor exec(executorId);
  TestContainerizer containerizer(&exec);
@@ -12253,7 +12263,9 @@ TEST_F(SlaveTest, CheckpointedDrainInfo)
   MesosSchedulerDriver driver(
      &sched, frameworkInfo, master.get()->pid, DEFAULT_CREDENTIAL);
 
-  EXPECT_CALL(sched, registered(_, _, _));
+  Future<Nothing> frameworkId;
+  EXPECT_CALL(sched, registered(_, _, _))
+.WillOnce(FutureSatisfy(&frameworkId));
 
  Future<vector<Offer>> offers;
   EXPECT_CALL(sched, resourceOffers(_, _))
@@ -12262,6 +12274,8 @@ TEST_F(SlaveTest, CheckpointedDrainInfo)
 
   driver.start();
 
+  AWAIT_READY(frameworkId);
+
   AWAIT_READY(offers);
   

[mesos] 10/14: Added recovery of agent drain information.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 54fb43e7fb5fb1884e8f0ba087ae3db8cfc8d498
Author: Benjamin Bannier 
AuthorDate: Mon Jul 15 10:26:02 2019 -0700

Added recovery of agent drain information.

With this patch the agent will, after executor reregistration has
finished, replay any active drain information so that remaining tasks
are drained as well. We need to wait until executors have had a chance
to reregister so that they are not terminated, as could happen if a
kill-task request were sent before the executor has registered.

Review: https://reviews.apache.org/r/70907/
---
 src/slave/slave.cpp   |  12 +
 src/slave/state.cpp   |  16 +++
 src/slave/state.hpp   |   3 ++
 src/tests/slave_tests.cpp | 117 ++
 4 files changed, 148 insertions(+)

diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index 19b4769..3878ab8 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -5630,6 +5630,16 @@ void Slave::reregisterExecutorTimeout()
 }
   }
 
+  // Replay any active draining.
+  if (drainConfig.isSome()) {
+DrainSlaveMessage drainSlaveMessage;
+*drainSlaveMessage.mutable_config() = *drainConfig;
+
+LOG(INFO) << "Replaying in-process agent draining";
+
+drain(self(), std::move(drainSlaveMessage));
+  }
+
   // Signal the end of recovery.
   // TODO(greggomann): Allow the agent to complete recovery before the executor
   // re-registration timeout has elapsed. See MESOS-7539
@@ -7512,6 +7522,8 @@ Future<Nothing> Slave::recover(const Try<state::State>& state)
 // we can not reuse the id, we will either crash or erase it again.
 info.mutable_id()->CopyFrom(slaveState->info->id());
 
+drainConfig = slaveState->drainConfig;
+
 // Check for SlaveInfo compatibility.
 Try _compatible =
   compatible(slaveState->info.get(), info);
diff --git a/src/slave/state.cpp b/src/slave/state.cpp
index e0a850e..cd3fac7 100644
--- a/src/slave/state.cpp
+++ b/src/slave/state.cpp
@@ -203,6 +203,22 @@ Try<SlaveState> SlaveState::recover(
 state.errors += framework->errors;
   }
 
+  // Recover any drain state.
+  const string drainConfigPath = paths::getDrainConfigPath(rootDir, slaveId);
+  if (os::exists(drainConfigPath)) {
+Result<DrainConfig> drainConfig = state::read<DrainConfig>(drainConfigPath);
+if (drainConfig.isError()) {
+  string message = "Failed to read agent state file '"
+   + drainConfigPath + "': " + drainConfig.error();
+
+  LOG(WARNING) << message;
+  state.errors++;
+}
+if (drainConfig.isSome()) {
+  state.drainConfig = *drainConfig;
+}
+  }
+
   const string resourceStatePath = paths::getResourceStatePath(rootDir);
   if (os::exists(resourceStatePath)) {
 Result resourceState =
diff --git a/src/slave/state.hpp b/src/slave/state.hpp
index 45836e5..6d6ae01 100644
--- a/src/slave/state.hpp
+++ b/src/slave/state.hpp
@@ -391,6 +391,9 @@ struct SlaveState
   // state didn't support checkpointing operations.
  Option<std::vector<Operation>> operations;
 
+  // The drain state of the agent, if any.
+  Option<DrainConfig> drainConfig;
+
   unsigned int errors;
 };
 
diff --git a/src/tests/slave_tests.cpp b/src/tests/slave_tests.cpp
index 147967d..5f8e53c 100644
--- a/src/tests/slave_tests.cpp
+++ b/src/tests/slave_tests.cpp
@@ -12216,6 +12216,123 @@ TEST_F(SlaveTest, DrainAgentKillsPendingTask)
   EXPECT_EQ(v1::TASK_KILLED, killedUpdate->status().state());
 }
 
+
+// This test verifies that if the agent recovers into a draining
+// state, any tasks running after the restart are killed.
+TEST_F(SlaveTest, CheckpointedDrainInfo)
+{
+  Clock::pause();
+
+  master::Flags masterFlags = CreateMasterFlags();
+  Try<Owned<cluster::Master>> master = StartMaster(masterFlags);
+
+  slave::Flags slaveFlags = CreateSlaveFlags();
+
+  ExecutorID executorId = DEFAULT_EXECUTOR_ID;
+  MockExecutor exec(executorId);
+  TestContainerizer containerizer(&exec);
+
+  Future<UpdateSlaveMessage> updateSlaveMessage =
+FUTURE_PROTOBUF(UpdateSlaveMessage(), _, _);
+
+  StandaloneMasterDetector detector(master.get()->pid);
+  Try<Owned<cluster::Slave>> slave =
+StartSlave(&detector, &containerizer, slaveFlags);
+  ASSERT_SOME(slave);
+
+  // Advance the clock to trigger the agent registration.
+  Clock::advance(slaveFlags.registration_backoff_factor);
+
+  AWAIT_READY(updateSlaveMessage);
+
+  // Start a framework.
+  FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
+  frameworkInfo.set_checkpoint(true);
+
+  MockScheduler sched;
+  MesosSchedulerDriver driver(
+  &sched, frameworkInfo, master.get()->pid, DEFAULT_CREDENTIAL);
+
+  EXPECT_CALL(sched, registered(_, _, _));
+
+  Future<vector<Offer>> offers;
+  EXPECT_CALL(sched, resourceOffers(_, _))
+.WillOnce(FutureArg<1>(&offers))
+.WillRepeatedly(Return()); // Ignore subsequent offers.
+
+  driver.start();
+
+  AWAIT_READY(offers);
+  ASSERT_FALSE(offers->empty());
+
+  SlaveID slaveId = offers.get()[0].slave_id();
+  TaskInfo task = createTask(
+  slaveId,
+  

[mesos] 06/14: Added kill policy to the 'Task' message.

2019-07-15 Thread grag
This is an automated email from the ASF dual-hosted git repository.

grag pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 3bb8287378bfdbef74288212302d0f2628d51b23
Author: Greg Mann 
AuthorDate: Mon Jul 15 10:25:49 2019 -0700

Added kill policy to the 'Task' message.

Review: https://reviews.apache.org/r/70901/
---
 include/mesos/mesos.proto | 4 
 include/mesos/type_utils.hpp  | 1 +
 include/mesos/v1/mesos.proto  | 4 
 src/common/protobuf_utils.cpp | 4 
 src/common/type_utils.cpp | 9 -
 5 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/mesos/mesos.proto b/include/mesos/mesos.proto
index e0a2391..324f686 100644
--- a/include/mesos/mesos.proto
+++ b/include/mesos/mesos.proto
@@ -2338,6 +2338,10 @@ message Task {
 
   // TODO(greggomann): Add the task's `CheckInfo`. See MESOS-8780.
 
+  // The kill policy used for this task when it is killed. It's possible for
+  // this policy to be overridden by the scheduler when killing the task.
+  optional KillPolicy kill_policy = 16;
+
   // Specific user under which task is running.
   optional string user = 14;
 }
diff --git a/include/mesos/type_utils.hpp b/include/mesos/type_utils.hpp
index b9e6164..2fd8a62 100644
--- a/include/mesos/type_utils.hpp
+++ b/include/mesos/type_utils.hpp
@@ -63,6 +63,7 @@ bool operator==(const DiscoveryInfo& left, const DiscoveryInfo& right);
 bool operator==(const Environment& left, const Environment& right);
 bool operator==(const ExecutorInfo& left, const ExecutorInfo& right);
 bool operator==(const HealthCheck& left, const HealthCheck& right);
+bool operator==(const KillPolicy& left, const KillPolicy& right);
 bool operator==(const Label& left, const Label& right);
 bool operator==(const Labels& left, const Labels& right);
 bool operator==(const MasterInfo& left, const MasterInfo& right);
diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
index af29a14..aa9c525 100644
--- a/include/mesos/v1/mesos.proto
+++ b/include/mesos/v1/mesos.proto
@@ -2327,6 +2327,10 @@ message Task {
 
   // TODO(greggomann): Add the task's `CheckInfo`. See MESOS-8780.
 
+  // The kill policy used for this task when it is killed. It's possible for
+  // this policy to be overridden by the scheduler when killing the task.
+  optional KillPolicy kill_policy = 16;
+
   // Specific user under which task is running.
   optional string user = 14;
 }
diff --git a/src/common/protobuf_utils.cpp b/src/common/protobuf_utils.cpp
index 0112fcb..c91d543 100644
--- a/src/common/protobuf_utils.cpp
+++ b/src/common/protobuf_utils.cpp
@@ -416,6 +416,10 @@ Task createTask(
 t.mutable_health_check()->CopyFrom(task.health_check());
   }
 
+  if (task.has_kill_policy()) {
+t.mutable_kill_policy()->CopyFrom(task.kill_policy());
+  }
+
   // Copy `user` if set.
   if (task.has_command() && task.command().has_user()) {
 t.set_user(task.command().user());
diff --git a/src/common/type_utils.cpp b/src/common/type_utils.cpp
index 16d6657..5bf7113 100644
--- a/src/common/type_utils.cpp
+++ b/src/common/type_utils.cpp
@@ -406,6 +406,12 @@ bool operator==(const HealthCheck& left, const HealthCheck& right)
 }
 
 
+bool operator==(const KillPolicy& left, const KillPolicy& right)
+{
+  return google::protobuf::util::MessageDifferencer::Equals(left, right);
+}
+
+
 bool operator==(const MasterInfo& left, const MasterInfo& right)
 {
   return left.id() == right.id() &&
@@ -583,7 +589,8 @@ bool operator==(const Task& left, const Task& right)
 left.discovery() == right.discovery() &&
 left.user() == right.user() &&
 left.container() == right.container() &&
-left.health_check() == right.health_check();
+left.health_check() == right.health_check() &&
+left.kill_policy() == right.kill_policy();
 }