[mesos] 02/02: Updated `upgrades.md` for the configurable shared memory project.
This is an automated email from the ASF dual-hosted git repository.

qianzhang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 5dfa256ac63775b3942b68fdc99f6a58345f1ab8
Author: Qian Zhang
AuthorDate: Tue Aug 27 10:16:52 2019 +0800

    Updated `upgrades.md` for the configurable shared memory project.
---
 docs/upgrades.md | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/docs/upgrades.md b/docs/upgrades.md
index 2be13fb..63eb1bb 100644
--- a/docs/upgrades.md
+++ b/docs/upgrades.md
@@ -51,17 +51,21 @@ We categorize the changes as follows:
 A Linux NNP isolator
 A hostname_validation_scheme
 C TLS certificate verification behaviour
+ C Configurable IPC namespace and /dev/shm
 A docker_ignore_runtime
+ A disallow_sharing_agent_ipc_namespace
+ A default_container_shm_size
+ A LinuxInfo.ipc_mode and LinuxInfo.shm_size
@@ -532,6 +536,8 @@ We categorize the changes as follows:
 would have been successfull. Users that rely on incoming connection requests presenting valid TLS certificates should make sure that the `LIBPROCESS_SSL_REQUIRE_CERT` option is set to true.
+
+* The Mesos containerizer now supports configurable IPC namespace and /dev/shm. Container can be configured to have a private IPC namespace and /dev/shm or share them from its parent via the field `LinuxInfo.ipc_mode`, and the size of its private /dev/shm is also configurable via the field `LinuxInfo.shm_size`. Operators can control whether it is allowed to share host's IPC namespace and /dev/shm with top level containers via the agent flag `--disallow_sharing_agent_ipc_namespace`, and s [...]

 ## Upgrading from 1.7.x to 1.8.x ##
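The upgrade note in this commit describes a decision with two inputs: the container's requested IPC mode and the agent flag `--disallow_sharing_agent_ipc_namespace`. A minimal sketch of that decision follows, under loud assumptions: `IpcMode`, `ContainerRequest`, `Decision`, and `resolveIpc` are hypothetical names invented for illustration and are not Mesos APIs; only `LinuxInfo.ipc_mode`, `LinuxInfo.shm_size`, and the flag name come from the note. The part of the note that is truncated in the email is not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the choice expressed via `LinuxInfo.ipc_mode`:
// a private IPC namespace and /dev/shm, or sharing them from the parent.
enum class IpcMode { PRIVATE, SHARE_PARENT };

// Hypothetical request type; not a Mesos message.
struct ContainerRequest {
  bool topLevel;     // For a top-level container the "parent" is the agent host.
  IpcMode ipcMode;
  uint32_t shmSize;  // Models `LinuxInfo.shm_size`; 0 means "not set" here.
};

// Hypothetical result type.
struct Decision {
  bool allowed;
  std::string reason;
};

// Applies the rule stated in the note: the agent flag only restricts
// top-level containers that ask to share the host's IPC namespace.
Decision resolveIpc(
    const ContainerRequest& request,
    bool disallowSharingAgentIpcNamespace)
{
  if (request.ipcMode == IpcMode::SHARE_PARENT &&
      request.topLevel &&
      disallowSharingAgentIpcNamespace) {
    return {false, "sharing the agent's IPC namespace is disallowed"};
  }

  if (request.ipcMode == IpcMode::PRIVATE) {
    return {true, "private IPC namespace and /dev/shm"};
  }

  return {true, "IPC namespace and /dev/shm shared from parent"};
}
```

In this toy model a nested container that shares from its parent container is unaffected by the flag, which matches the note's wording that the flag concerns top-level containers.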
[mesos] branch master updated (50dcd56 -> 5dfa256)
This is an automated email from the ASF dual-hosted git repository.

qianzhang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git.

    from 50dcd56  Added agent reactivations to the existing agent draining tests.
     new 9a5b298  Added MESOS-9795 to the 1.9.0 release highlights.
     new 5dfa256  Updated `upgrades.md` for the configurable shared memory project.

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 CHANGELOG        | 23 +--
 docs/upgrades.md |  6 ++
 2 files changed, 19 insertions(+), 10 deletions(-)
[mesos] 01/02: Added MESOS-9795 to the 1.9.0 release highlights.
This is an automated email from the ASF dual-hosted git repository.

qianzhang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 9a5b2986a74006cb68e2262b4b2d5f7e22058a27
Author: Qian Zhang
AuthorDate: Tue Aug 27 09:29:21 2019 +0800

    Added MESOS-9795 to the 1.9.0 release highlights.

    The style of the Containerization section in the 1.9.0 release
    highlights was also updated to be consistent with other sections.
---
 CHANGELOG | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/CHANGELOG b/CHANGELOG
index 58cf418..a5bb8d5 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -14,19 +14,22 @@ This release contains the following highlights:

 * Containerization:

-* [MESOS-9760] - A new `--docker_ignore_runtime` flag has been
-  added. This causes the agent to ignore any runtime configuration
-  present in Docker images.
+* A new `--docker_ignore_runtime` flag has been added. This causes the agent
+  to ignore any runtime configuration present in Docker images. (MESOS-9760)

-* [MESOS-9770] - Add no-new-privileges isolator. An additional
-  Linux isolator has been added to support enabling the no_new_privs
-  process control flag.
+* Add no-new-privileges isolator. A new Linux isolator has been added to
+  support enabling the no_new_privs process control flag. (MESOS-9770)

-* [MESOS-9771] - The Mesos containerizer now masks sensitive paths
-  in `/proc` for containers that do not share the host's PID namespace.
+* The Mesos containerizer now masks sensitive paths in `/proc` for
+  containers that do not share the host's PID namespace. (MESOS-9771)

-* [MESOS-9900] - The Mesos containerizer now includes ephemeral
-  overlayfs storage in the task disk quota as well as sandbox storage.
+* The Mesos containerizer now supports configurable IPC namespace and
+  /dev/shm. Container can be configured to have a private IPC namespace
+  and /dev/shm or share them from its parent, and the size of its private
+  /dev/shm is also configurable. (MESOS-9795)
+
+* The Mesos containerizer now includes ephemeral overlayfs storage in the
+  task disk quota as well as sandbox storage. (MESOS-9900)

 Additional API Changes:
[mesos] 02/05: Refactored master draining test setup.
This is an automated email from the ASF dual-hosted git repository. josephwu pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 4f078398f7010d982a1c4ee95a1e3f628813e6fe Author: Joseph Wu AuthorDate: Mon Jul 29 19:43:31 2019 -0700 Refactored master draining test setup. Tests of this feature will generally require a master, agent, framework, and a single task to be launched at the beginning of the test. This moves this common code into the test SetUp. This also changes the `post(...)` helper to return the http::Response object instead of parsing it. The response for DRAIN_AGENT calls does not return an object, so the tests were not checking the response before. Review: https://reviews.apache.org/r/71315 --- src/tests/master_draining_tests.cpp | 494 +--- 1 file changed, 175 insertions(+), 319 deletions(-) diff --git a/src/tests/master_draining_tests.cpp b/src/tests/master_draining_tests.cpp index 16d0c85..eae809f 100644 --- a/src/tests/master_draining_tests.cpp +++ b/src/tests/master_draining_tests.cpp @@ -14,6 +14,7 @@ // See the License for the specific language governing permissions and // limitations under the License. +#include #include #include @@ -73,6 +74,130 @@ class MasterDrainingTest public WithParamInterface { public: + // Creates a master, agent, framework, and launches one sleep task. + void SetUp() override + { +MesosTest::SetUp(); + +Clock::pause(); + +// Create the master. +masterFlags = CreateMasterFlags(); +Try> _master = StartMaster(masterFlags); +ASSERT_SOME(_master); +master = _master.get(); + +Future slaveRegisteredMessage = + FUTURE_PROTOBUF(SlaveRegisteredMessage(), _, _); + +// Create the agent. +agentFlags = CreateSlaveFlags(); +detector = master.get()->createDetector(); +Try> _slave = StartSlave(detector.get(), agentFlags); +ASSERT_SOME(_slave); +slave = _slave.get(); + +Clock::advance(agentFlags.registration_backoff_factor); +AWAIT_READY(slaveRegisteredMessage); + +// Create the framework. 
+scheduler = std::make_shared(); + +frameworkInfo = v1::DEFAULT_FRAMEWORK_INFO; +frameworkInfo.set_checkpoint(true); +frameworkInfo.add_capabilities()->set_type( +v1::FrameworkInfo::Capability::PARTITION_AWARE); + +EXPECT_CALL(*scheduler, connected(_)) + .WillOnce(v1::scheduler::SendSubscribe(frameworkInfo)); + +Future subscribed; +EXPECT_CALL(*scheduler, subscribed(_, _)) + .WillOnce(FutureArg<1>()); + +EXPECT_CALL(*scheduler, heartbeat(_)) + .WillRepeatedly(Return()); // Ignore heartbeats. + +Future offers; +EXPECT_CALL(*scheduler, offers(_, _)) + .WillOnce(FutureArg<1>()) + .WillRepeatedly(Return()); + +mesos = std::make_shared( +master.get()->pid, ContentType::PROTOBUF, scheduler); + +AWAIT_READY(subscribed); +frameworkId = subscribed->framework_id(); + +// Launch a sleep task. +AWAIT_READY(offers); +ASSERT_FALSE(offers->offers().empty()); + +const v1::Offer& offer = offers->offers(0); +agentId = offer.agent_id(); + +Try resources = + v1::Resources::parse("cpus:0.1;mem:64;disk:64"); + +ASSERT_SOME(resources); + +taskInfo = v1::createTask(agentId, resources.get(), SLEEP_COMMAND(1000)); + +testing::Sequence updateSequence; +Future startingUpdate; +Future runningUpdate; + +// Make sure the agent receives these two acknowledgements. 
+Future startingAck = + FUTURE_PROTOBUF(StatusUpdateAcknowledgementMessage(), _, _); +Future runningAck = + FUTURE_PROTOBUF(StatusUpdateAcknowledgementMessage(), _, _); + +EXPECT_CALL( +*scheduler, +update(_, AllOf( +TaskStatusUpdateTaskIdEq(taskInfo.task_id()), +TaskStatusUpdateStateEq(v1::TASK_STARTING + .InSequence(updateSequence) + .WillOnce(DoAll( + FutureArg<1>(), + v1::scheduler::SendAcknowledge(frameworkId, agentId))); + +EXPECT_CALL( +*scheduler, +update(_, AllOf( + TaskStatusUpdateTaskIdEq(taskInfo.task_id()), + TaskStatusUpdateStateEq(v1::TASK_RUNNING + .InSequence(updateSequence) + .WillOnce(DoAll( + FutureArg<1>(), + v1::scheduler::SendAcknowledge(frameworkId, agentId))); + +mesos->send( +v1::createCallAccept( +frameworkId, +offer, +{v1::LAUNCH({taskInfo})})); + +AWAIT_READY(startingUpdate); +AWAIT_READY(startingAck); +AWAIT_READY(runningUpdate); +AWAIT_READY(runningAck); + } + + void TearDown() override + { +mesos.reset(); +scheduler.reset(); +slave.reset(); +detector.reset(); +master.reset(); + +Clock::resume(); + +MesosTest::TearDown(); + } + master::Flags CreateMasterFlags() override { // Turn off
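The commit message above explains that the `post(...)` helper now returns the raw HTTP response because a DRAIN_AGENT call has no response object to parse, so earlier tests never checked its outcome. A minimal sketch of that idea follows. Everything in it is a hypothetical stand-in: `Response` and this string-keyed `post` are invented for illustration and do not reproduce the Mesos or libprocess signatures.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for an HTTP response.
struct Response {
  int status;
  std::string body;
};

// Hypothetical fake endpoint. A drain call succeeds with an empty body,
// so a helper that only returned a parsed body would give the caller
// nothing to assert on. Returning the whole response exposes the status.
Response post(const std::string& callType)
{
  if (callType == "DRAIN_AGENT") {
    return {200, ""};
  }

  if (callType == "GET_AGENTS") {
    return {200, "{\"agents\":[]}"};
  }

  return {400, "unknown call"};
}
```

With this shape, a test can assert on the status of every call and parse the body only for calls that are expected to return one.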
[mesos] branch master updated (c104977 -> 50dcd56)
This is an automated email from the ASF dual-hosted git repository.

josephwu pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git.

    from c104977  Updated site's middleman versions.
     new 5124b29  Moved master-side agent draining tests into a separate file.
     new 4f07839  Refactored master draining test setup.
     new 1e36619  Added draining tests for empty agents.
     new 5c57128  Added draining test for momentarily disconnected agents.
     new 50dcd56  Added agent reactivations to the existing agent draining tests.

The 5 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 src/Makefile.am                     |    3 +-
 src/tests/CMakeLists.txt            |    1 +
 src/tests/api_tests.cpp             |  541 ---
 src/tests/master_draining_tests.cpp | 1018 +++
 4 files changed, 1021 insertions(+), 542 deletions(-)
 create mode 100644 src/tests/master_draining_tests.cpp
[mesos] 04/05: Added draining test for momentarily disconnected agents.
This is an automated email from the ASF dual-hosted git repository. josephwu pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 5c5712869876cad50a34af29cdcbfac9b1e9eb45 Author: Joseph Wu AuthorDate: Mon Aug 19 12:11:18 2019 -0700 Added draining test for momentarily disconnected agents. This exercises the agent draining code when the agent is disconnected from the master at the time of starting draining. Draining is expected to proceed once the agent reregisters. Review: https://reviews.apache.org/r/71317 --- src/tests/master_draining_tests.cpp | 201 1 file changed, 201 insertions(+) diff --git a/src/tests/master_draining_tests.cpp b/src/tests/master_draining_tests.cpp index 674f5b5..235bf1b 100644 --- a/src/tests/master_draining_tests.cpp +++ b/src/tests/master_draining_tests.cpp @@ -254,6 +254,99 @@ TEST_P(MasterAlreadyDrainedTest, DrainAgentMarkGone) } +// When an operator submits a DRAIN_AGENT call with an agent that has +// momentarily disconnected, the call should succeed, and the agent should +// be drained when it returns to the cluster. +TEST_P(MasterAlreadyDrainedTest, DrainAgentDisconnected) +{ + // Simulate an agent crash, so that it disconnects from the master. + slave->terminate(); + slave.reset(); + + ContentType contentType = GetParam(); + + // Ensure that the agent is disconnected (not active). 
+ { +v1::master::Call call; +call.set_type(v1::master::Call::GET_AGENTS); + +Future response = + post(master->pid, call, contentType); +AWAIT_ASSERT_RESPONSE_STATUS_EQ(http::OK().status, response); + +Try getAgents = + deserialize(contentType, response->body); +ASSERT_SOME(getAgents); + +ASSERT_EQ(v1::master::Response::GET_AGENTS, getAgents->type()); +ASSERT_EQ(getAgents->get_agents().agents_size(), 1); + +const v1::master::Response::GetAgents::Agent& agent = +getAgents->get_agents().agents(0); + +EXPECT_EQ(agent.active(), false); +EXPECT_EQ(agent.deactivated(), false); + } + + // Start draining the disconnected agent. + { +v1::master::Call::DrainAgent drainAgent; +drainAgent.mutable_agent_id()->CopyFrom(agentId); + +v1::master::Call call; +call.set_type(v1::master::Call::DRAIN_AGENT); +call.mutable_drain_agent()->CopyFrom(drainAgent); + +AWAIT_EXPECT_RESPONSE_STATUS_EQ( +http::OK().status, +post(master->pid, call, contentType)); + } + + // Bring the agent back. + Future slaveReregisteredMessage = +FUTURE_PROTOBUF(SlaveReregisteredMessage(), _, _); + + Future drainSlaveMesage = +FUTURE_PROTOBUF(DrainSlaveMessage(), _, _); + + Try> recoveredSlave = +StartSlave(detector.get(), agentFlags); + ASSERT_SOME(recoveredSlave); + + Clock::advance(agentFlags.executor_reregistration_timeout); + Clock::settle(); + Clock::advance(agentFlags.registration_backoff_factor); + Clock::settle(); + AWAIT_READY(slaveReregisteredMessage); + + // The agent should be told to drain once it reregisters. + AWAIT_READY(drainSlaveMesage); + + // Ensure that the agent is marked as DRAINED in the master now. 
+ { +v1::master::Call call; +call.set_type(v1::master::Call::GET_AGENTS); + +Future response = + post(master->pid, call, contentType); +AWAIT_ASSERT_RESPONSE_STATUS_EQ(http::OK().status, response); + +Try getAgents = + deserialize(contentType, response->body); +ASSERT_SOME(getAgents); + +ASSERT_EQ(v1::master::Response::GET_AGENTS, getAgents->type()); +ASSERT_EQ(getAgents->get_agents().agents_size(), 1); + +const v1::master::Response::GetAgents::Agent& agent = +getAgents->get_agents().agents(0); + +EXPECT_EQ(agent.deactivated(), true); +EXPECT_EQ(mesos::v1::DRAINED, agent.drain_info().state()); + } +} + + // When an operator submits a DRAIN_AGENT call for an agent that has gone // unreachable, the call should succeed, and the agent should be drained // if/when it returns to the cluster. @@ -627,6 +720,114 @@ TEST_P(MasterDrainingTest, DrainAgentMarkGone) } +// When an operator submits a DRAIN_AGENT call with an agent that has +// momentarily disconnected, the call should succeed, and the agent should +// be drained when it returns to the cluster. +TEST_P(MasterDrainingTest, DrainAgentDisconnected) +{ + // Simulate an agent crash, so that it disconnects from the master. + slave->terminate(); + slave.reset(); + + ContentType contentType = GetParam(); + + // Ensure that the agent is disconnected (not active). + { +v1::master::Call call; +call.set_type(v1::master::Call::GET_AGENTS); + +Future response = + post(master->pid, call, contentType); +AWAIT_ASSERT_RESPONSE_STATUS_EQ(http::OK().status, response); + +Try getAgents = + deserialize(contentType, response->body); +ASSERT_SOME(getAgents); + +ASSERT_EQ(v1::master::Response::GET_AGENTS, getAgents->type());
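The behaviour this test exercises can be summarised as: a drain request for a disconnected agent is accepted and remembered, and the drain message is delivered once the agent reregisters. The toy model below sketches that flow under stated assumptions. `ToyAgent` and `ToyMaster` are hypothetical types written for this illustration; they are not the Mesos master's data structures, and the real master tracks considerably more state.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-agent bookkeeping.
struct ToyAgent {
  bool connected = true;
  bool draining = false;  // A drain has been requested for this agent.
};

// Hypothetical master that records which drain messages it has sent.
struct ToyMaster {
  std::map<std::string, ToyAgent> agents;
  std::vector<std::string> drainMessagesSent;  // Agent ids, in send order.

  // Accepts the request whether or not the agent is connected, but only
  // sends the drain message right away if the agent can receive it.
  bool drainAgent(const std::string& id)
  {
    std::map<std::string, ToyAgent>::iterator it = agents.find(id);
    if (it == agents.end()) {
      return false;  // Unknown agent.
    }

    it->second.draining = true;
    if (it->second.connected) {
      drainMessagesSent.push_back(id);
    }
    return true;
  }

  void disconnect(const std::string& id) { agents[id].connected = false; }

  // On reregistration, deliver the pending drain request, if any.
  void reregister(const std::string& id)
  {
    ToyAgent& agent = agents[id];
    agent.connected = true;
    if (agent.draining) {
      drainMessagesSent.push_back(id);
    }
  }
};
```

The point of the sketch is the ordering: the operator call succeeds first, and the message to the agent follows only after reregistration, which is the sequence the test waits on.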
[mesos] 01/05: Moved master-side agent draining tests into a separate file.
This is an automated email from the ASF dual-hosted git repository. josephwu pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 5124b290ddc368e2e7cc3d56173fb4b3137af620 Author: Joseph Wu AuthorDate: Wed Jul 24 15:45:22 2019 -0700 Moved master-side agent draining tests into a separate file. The test bodies were not changed, besides renaming the test class. Review: https://reviews.apache.org/r/71314 --- src/Makefile.am | 3 +- src/tests/CMakeLists.txt| 1 + src/tests/api_tests.cpp | 541 - src/tests/master_draining_tests.cpp | 662 4 files changed, 665 insertions(+), 542 deletions(-) diff --git a/src/Makefile.am b/src/Makefile.am index a89cd61..577acfd 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -2608,7 +2608,8 @@ mesos_tests_SOURCES = \ tests/master_allocator_tests.cpp \ tests/master_authorization_tests.cpp \ tests/master_benchmarks.cpp \ - tests/master_contender_detector_tests.cpp \ + tests/master_contender_detector_tests.cpp\ + tests/master_draining_tests.cpp \ tests/master_load_tests.cpp \ tests/master_maintenance_tests.cpp \ tests/master_quota_tests.cpp \ diff --git a/src/tests/CMakeLists.txt b/src/tests/CMakeLists.txt index 04c552a..1e53b39 100644 --- a/src/tests/CMakeLists.txt +++ b/src/tests/CMakeLists.txt @@ -105,6 +105,7 @@ set(MESOS_TESTS_SRC hook_tests.cpp http_authentication_tests.cpp http_fault_tolerance_tests.cpp + master_draining_tests.cpp master_load_tests.cpp master_maintenance_tests.cpp master_slave_reconciliation_tests.cpp diff --git a/src/tests/api_tests.cpp b/src/tests/api_tests.cpp index a735a20..bd207ea 100644 --- a/src/tests/api_tests.cpp +++ b/src/tests/api_tests.cpp @@ -5470,547 +5470,6 @@ TEST_P(MasterAPITest, OperationUpdatesUponUnreachable) } -// When an operator submits a DRAIN_AGENT call, the agent should kill all -// running tasks. 
-TEST_P(MasterAPITest, DrainAgent) -{ - Clock::pause(); - - master::Flags masterFlags = CreateMasterFlags(); - Try> master = StartMaster(masterFlags); - ASSERT_SOME(master); - - Future slaveRegisteredMessage = -FUTURE_PROTOBUF(SlaveRegisteredMessage(), _, _); - - slave::Flags agentFlags = CreateSlaveFlags(); - Owned detector = master.get()->createDetector(); - Try> slave = StartSlave(detector.get(), agentFlags); - ASSERT_SOME(slave); - - Clock::advance(agentFlags.registration_backoff_factor); - - AWAIT_READY(slaveRegisteredMessage); - - auto scheduler = std::make_shared(); - - v1::FrameworkInfo frameworkInfo = v1::DEFAULT_FRAMEWORK_INFO; - frameworkInfo.add_capabilities()->set_type( - v1::FrameworkInfo::Capability::PARTITION_AWARE); - - EXPECT_CALL(*scheduler, connected(_)) -.WillOnce(v1::scheduler::SendSubscribe(frameworkInfo)); - - Future subscribed; - EXPECT_CALL(*scheduler, subscribed(_, _)) -.WillOnce(FutureArg<1>()); - - EXPECT_CALL(*scheduler, heartbeat(_)) -.WillRepeatedly(Return()); // Ignore heartbeats. 
- - Future offers; - EXPECT_CALL(*scheduler, offers(_, _)) -.WillOnce(FutureArg<1>()) -.WillRepeatedly(Return()); - - auto mesos = std::make_shared( - master.get()->pid, ContentType::PROTOBUF, scheduler); - - AWAIT_READY(subscribed); - v1::FrameworkID frameworkId(subscribed->framework_id()); - - AWAIT_READY(offers); - ASSERT_FALSE(offers->offers().empty()); - - const v1::Offer& offer = offers->offers(0); - const v1::AgentID& agentId = offer.agent_id(); - - Try resources = -v1::Resources::parse("cpus:0.1;mem:64;disk:64"); - - ASSERT_SOME(resources); - - v1::TaskInfo taskInfo = -v1::createTask(agentId, resources.get(), SLEEP_COMMAND(1000)); - - testing::Sequence updateSequence; - Future startingUpdate; - Future runningUpdate; - - EXPECT_CALL( - *scheduler, - update(_, AllOf( - TaskStatusUpdateTaskIdEq(taskInfo.task_id()), - TaskStatusUpdateStateEq(v1::TASK_STARTING -.InSequence(updateSequence) -.WillOnce(DoAll( -FutureArg<1>(), -v1::scheduler::SendAcknowledge(frameworkId, agentId))); - - EXPECT_CALL( - *scheduler, - update(_, AllOf( -TaskStatusUpdateTaskIdEq(taskInfo.task_id()), -TaskStatusUpdateStateEq(v1::TASK_RUNNING -.InSequence(updateSequence) -.WillOnce(DoAll( -FutureArg<1>(), -v1::scheduler::SendAcknowledge(frameworkId, agentId))) -.WillRepeatedly(Return()); - - mesos->send( - v1::createCallAccept( - frameworkId, - offer, - {v1::LAUNCH({taskInfo})})); - - AWAIT_READY(startingUpdate); -
[mesos] 05/05: Added agent reactivations to the existing agent draining tests.
This is an automated email from the ASF dual-hosted git repository. josephwu pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 50dcd56a42ee03d354f39cb029befe9e60e7f0bf Author: Joseph Wu AuthorDate: Mon Aug 19 14:35:34 2019 -0700 Added agent reactivations to the existing agent draining tests. This adds an extra step to a couple of the agent draining tests, which calls REACTIVATE_AGENT at the end. Review: https://reviews.apache.org/r/71318 --- src/tests/master_draining_tests.cpp | 93 + 1 file changed, 93 insertions(+) diff --git a/src/tests/master_draining_tests.cpp b/src/tests/master_draining_tests.cpp index 235bf1b..f1a00df 100644 --- a/src/tests/master_draining_tests.cpp +++ b/src/tests/master_draining_tests.cpp @@ -563,11 +563,18 @@ TEST_P(MasterDrainingTest, DrainAgent) FutureArg<1>(), v1::scheduler::SendAcknowledge(frameworkId, agentId))); + Future killedAck = +FUTURE_PROTOBUF(StatusUpdateAcknowledgementMessage(), _, _); + Future registrarApplyDrained; + Future registrarApplyReactivated; EXPECT_CALL(*master->registrar, apply(_)) .WillOnce(DoDefault()) .WillOnce(DoAll( FutureSatisfy(), +Invoke(master->registrar.get(), ::unmocked_apply))) +.WillOnce(DoAll( +FutureSatisfy(), Invoke(master->registrar.get(), ::unmocked_apply))); ContentType contentType = GetParam(); @@ -587,6 +594,7 @@ TEST_P(MasterDrainingTest, DrainAgent) } AWAIT_READY(killedUpdate); + AWAIT_READY(killedAck); AWAIT_READY(registrarApplyDrained); // Ensure that the update acknowledgement has been processed. @@ -676,6 +684,33 @@ TEST_P(MasterDrainingTest, DrainAgent) ASSERT_SOME(stateDrainStartTime); EXPECT_LT(0, stateDrainStartTime->as()); } + + // Reactivate the agent and expect to get the agent in an offer. 
+ Future offers; + EXPECT_CALL(*scheduler, offers(_, _)) +.WillOnce(FutureArg<1>()); + + { +v1::master::Call::ReactivateAgent reactivateAgent; +reactivateAgent.mutable_agent_id()->CopyFrom(agentId); + +v1::master::Call call; +call.set_type(v1::master::Call::REACTIVATE_AGENT); +call.mutable_reactivate_agent()->CopyFrom(reactivateAgent); + +AWAIT_EXPECT_RESPONSE_STATUS_EQ( +http::OK().status, +post(master->pid, call, contentType)); + } + + AWAIT_READY(registrarApplyReactivated); + + Clock::advance(masterFlags.allocation_interval); + Clock::settle(); + + AWAIT_READY(offers); + ASSERT_FALSE(offers->offers().empty()); + EXPECT_EQ(agentId, offers->offers(0).agent_id()); } @@ -788,6 +823,9 @@ TEST_P(MasterDrainingTest, DrainAgentDisconnected) FutureArg<1>(), v1::scheduler::SendAcknowledge(frameworkId, agentId))); + Future killedAck = +FUTURE_PROTOBUF(StatusUpdateAcknowledgementMessage(), _, _); + Try> recoveredSlave = StartSlave(detector.get(), agentFlags); ASSERT_SOME(recoveredSlave); @@ -802,6 +840,7 @@ TEST_P(MasterDrainingTest, DrainAgentDisconnected) // The agent should be told to drain once it reregisters. AWAIT_READY(drainSlaveMesage); AWAIT_READY(killedUpdate); + AWAIT_READY(killedAck); // Ensure that the agent is marked as DRAINED in the master now. { @@ -825,6 +864,31 @@ TEST_P(MasterDrainingTest, DrainAgentDisconnected) EXPECT_EQ(agent.deactivated(), true); EXPECT_EQ(mesos::v1::DRAINED, agent.drain_info().state()); } + + // Reactivate the agent and expect to get the agent in an offer. 
+ Future offers; + EXPECT_CALL(*scheduler, offers(_, _)) +.WillOnce(FutureArg<1>()); + + { +v1::master::Call::ReactivateAgent reactivateAgent; +reactivateAgent.mutable_agent_id()->CopyFrom(agentId); + +v1::master::Call call; +call.set_type(v1::master::Call::REACTIVATE_AGENT); +call.mutable_reactivate_agent()->CopyFrom(reactivateAgent); + +AWAIT_EXPECT_RESPONSE_STATUS_EQ( +http::OK().status, +post(master->pid, call, contentType)); + } + + Clock::advance(masterFlags.allocation_interval); + Clock::settle(); + + AWAIT_READY(offers); + ASSERT_FALSE(offers->offers().empty()); + EXPECT_EQ(agentId, offers->offers(0).agent_id()); } @@ -870,6 +934,9 @@ TEST_P(MasterDrainingTest, DrainAgentUnreachable) FutureArg<1>(), v1::scheduler::SendAcknowledge(frameworkId, agentId))); + Future killedAck = +FUTURE_PROTOBUF(StatusUpdateAcknowledgementMessage(), _, _); + // Simulate an agent crash, so that it disconnects from the master. slave->terminate(); slave.reset(); @@ -918,6 +985,32 @@ TEST_P(MasterDrainingTest, DrainAgentUnreachable) AWAIT_READY(drainSlaveMesage); AWAIT_READY(runningUpdate); AWAIT_READY(killedUpdate); + AWAIT_READY(killedAck); + + // Reactivate the agent and expect to get the agent in an offer. + Future offers; + EXPECT_CALL(*scheduler, offers(_, _)) +.WillOnce(FutureArg<1>()); + + { +
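The added test steps check that, after REACTIVATE_AGENT, the agent shows up in an offer again. A minimal sketch of the underlying idea, assuming a toy allocator that simply skips deactivated agents: `ToyAgent`, `offerableAgents`, and `reactivate` are hypothetical helpers made up for this illustration and do not correspond to Mesos allocator code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified view of an agent as an allocator might see it.
struct ToyAgent {
  std::string id;
  bool deactivated;
};

// Returns the ids of agents whose resources may be offered.
std::vector<std::string> offerableAgents(const std::vector<ToyAgent>& agents)
{
  std::vector<std::string> ids;
  for (std::size_t i = 0; i < agents.size(); ++i) {
    if (!agents[i].deactivated) {
      ids.push_back(agents[i].id);
    }
  }
  return ids;
}

// Models a reactivation: clears the deactivated bit for the given agent.
// Returns false if no agent with that id exists.
bool reactivate(std::vector<ToyAgent>& agents, const std::string& id)
{
  for (std::size_t i = 0; i < agents.size(); ++i) {
    if (agents[i].id == id) {
      agents[i].deactivated = false;
      return true;
    }
  }
  return false;
}
```

In the real tests the offer only arrives after the clock is advanced by the allocation interval; the sketch leaves timing out and models only the eligibility check.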
[mesos] 03/05: Added draining tests for empty agents.
This is an automated email from the ASF dual-hosted git repository. josephwu pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 1e3661982eba6da71a5ca8178472ef762d9fc780 Author: Joseph Wu AuthorDate: Wed Aug 7 09:02:01 2019 -0700 Added draining tests for empty agents. This splits the existing agent draining tests into two variants: 1) where the agent has nothing running, and 2) where the agent has one task running. Review: https://reviews.apache.org/r/71316 --- src/tests/master_draining_tests.cpp | 294 ++-- 1 file changed, 250 insertions(+), 44 deletions(-) diff --git a/src/tests/master_draining_tests.cpp b/src/tests/master_draining_tests.cpp index eae809f..674f5b5 100644 --- a/src/tests/master_draining_tests.cpp +++ b/src/tests/master_draining_tests.cpp @@ -42,6 +42,8 @@ #include "common/protobuf_utils.hpp" #include "common/resources_utils.hpp" +#include "master/registry_operations.hpp" + #include "messages/messages.hpp" #include "tests/cluster.hpp" @@ -69,12 +71,12 @@ namespace mesos { namespace internal { namespace tests { -class MasterDrainingTest +class MasterAlreadyDrainedTest : public MesosTest, public WithParamInterface { public: - // Creates a master, agent, framework, and launches one sleep task. + // Creates a master and agent. void SetUp() override { MesosTest::SetUp(); @@ -99,6 +101,251 @@ public: Clock::advance(agentFlags.registration_backoff_factor); AWAIT_READY(slaveRegisteredMessage); +agentId = evolve(slaveRegisteredMessage->slave_id()); + } + + void TearDown() override + { +slave.reset(); +detector.reset(); +master.reset(); + +Clock::resume(); + +MesosTest::TearDown(); + } + + master::Flags CreateMasterFlags() override + { +// Turn off periodic allocations to avoid the race between +// `HierarchicalAllocator::updateAvailable()` and periodic allocations. 
+master::Flags flags = MesosTest::CreateMasterFlags(); +flags.allocation_interval = Seconds(1000); +return flags; + } + + // Helper function to post a request to "/api/v1" master endpoint and return + // the response. + Future post( + const process::PID& pid, + const v1::master::Call& call, + const ContentType& contentType, + const Credential& credential = DEFAULT_CREDENTIAL) + { +http::Headers headers = createBasicAuthHeaders(credential); +headers["Accept"] = stringify(contentType); + +return http::post( +pid, +"api/v1", +headers, +serialize(contentType, call), +stringify(contentType)); + } + +protected: + master::Flags masterFlags; + Owned master; + Owned detector; + + slave::Flags agentFlags; + Owned slave; + v1::AgentID agentId; +}; + + +// These tests are parameterized by the content type of the HTTP request. +INSTANTIATE_TEST_CASE_P( +ContentType, +MasterAlreadyDrainedTest, +::testing::Values(ContentType::PROTOBUF, ContentType::JSON)); + + +// When an operator submits a DRAIN_AGENT call, the agent with nothing running +// should be immediately transitioned to the DRAINED state. 
+TEST_P(MasterAlreadyDrainedTest, DrainAgent) +{ + Future registrarApplyDrained; + EXPECT_CALL(*master->registrar, apply(_)) +.WillOnce(DoDefault()) +.WillOnce(DoAll( +FutureSatisfy(), +Invoke(master->registrar.get(), ::unmocked_apply))); + + ContentType contentType = GetParam(); + + { +v1::master::Call::DrainAgent drainAgent; +drainAgent.mutable_agent_id()->CopyFrom(agentId); +drainAgent.mutable_max_grace_period()->set_seconds(10); + +v1::master::Call call; +call.set_type(v1::master::Call::DRAIN_AGENT); +call.mutable_drain_agent()->CopyFrom(drainAgent); + +AWAIT_EXPECT_RESPONSE_STATUS_EQ( +http::OK().status, +post(master->pid, call, contentType)); + } + + AWAIT_READY(registrarApplyDrained); + + mesos::v1::DrainInfo drainInfo; + drainInfo.set_state(mesos::v1::DRAINED); + drainInfo.mutable_config()->set_mark_gone(false); + drainInfo.mutable_config()->mutable_max_grace_period() +->set_nanoseconds(Seconds(10).ns()); + + // Ensure that the agent's drain info is reflected in the master's + // GET_AGENTS response. + { +v1::master::Call call; +call.set_type(v1::master::Call::GET_AGENTS); + +Future response = + post(master->pid, call, contentType); +AWAIT_ASSERT_RESPONSE_STATUS_EQ(http::OK().status, response); + +Try getAgents = + deserialize(contentType, response->body); +ASSERT_SOME(getAgents); + +ASSERT_EQ(v1::master::Response::GET_AGENTS, getAgents->type()); +ASSERT_EQ(getAgents->get_agents().agents_size(), 1); + +const v1::master::Response::GetAgents::Agent& agent = +getAgents->get_agents().agents(0); + +EXPECT_EQ(agent.deactivated(), true); + +EXPECT_EQ(agent.drain_info(), drainInfo); +EXPECT_LT(0,
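The two test variants described in this commit differ in one respect: an agent with nothing running goes straight to DRAINED, while an agent with a running task has to wait for that task to terminate. The sketch below models just that distinction. It is a deliberately small toy: `DrainState`, `ToyAgentState`, `drain`, and `taskTerminated` are hypothetical names, and the model ignores details such as status update acknowledgements that the real tests wait on.

```cpp
#include <cassert>

// Hypothetical mirror of the drain states named in the tests.
enum class DrainState { NONE, DRAINING, DRAINED };

// Hypothetical per-agent state.
struct ToyAgentState {
  int runningTasks = 0;
  bool deactivated = false;
  DrainState drainState = DrainState::NONE;
};

// Starts draining. An agent with nothing running has nothing to wait
// for, so in this model it is marked DRAINED immediately.
void drain(ToyAgentState& agent)
{
  agent.deactivated = true;
  agent.drainState = (agent.runningTasks == 0)
    ? DrainState::DRAINED
    : DrainState::DRAINING;
}

// Records one task reaching a terminal state and completes the drain
// once no tasks remain.
void taskTerminated(ToyAgentState& agent)
{
  if (agent.runningTasks > 0) {
    --agent.runningTasks;
  }

  if (agent.drainState == DrainState::DRAINING && agent.runningTasks == 0) {
    agent.drainState = DrainState::DRAINED;
  }
}
```

In both variants the agent ends up deactivated as soon as the drain starts, which is why the tests check `deactivated()` alongside the drain state.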
[mesos] branch master updated: Updated site's middleman versions.
This is an automated email from the ASF dual-hosted git repository.

bbannier pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

The following commit(s) were added to refs/heads/master by this push:
     new c104977  Updated site's middleman versions.
c104977 is described below

commit c104977894e2abb36aa0a78456c54fb74a20543e
Author: Benjamin Bannier
AuthorDate: Mon Aug 26 18:18:07 2019 +0200

    Updated site's middleman versions.

    Review: https://reviews.apache.org/r/71368/
---
 site/Gemfile      |  8
 site/Gemfile.lock | 20 ++--
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/site/Gemfile b/site/Gemfile
index c492030..c0df4e1 100644
--- a/site/Gemfile
+++ b/site/Gemfile
@@ -1,9 +1,9 @@
 source 'https://rubygems.org'

-gem 'middleman', '3.4.0'
-gem 'middleman-livereload', '3.4.6'
-gem 'middleman-syntax', '3.0.0'
-gem 'middleman-blog', '3.5.3'
+gem 'middleman', '~>3'
+gem 'middleman-livereload', '~>3'
+gem 'middleman-syntax', '~>3'
+gem 'middleman-blog', '~>3'

 # Middleman has an undeclared dependency on `tzinfo-data` for
 # generating timestamps.
diff --git a/site/Gemfile.lock b/site/Gemfile.lock
index 63c48e7..87d825c 100644
--- a/site/Gemfile.lock
+++ b/site/Gemfile.lock
@@ -52,14 +52,14 @@ GEM
     listen (3.0.8)
       rb-fsevent (~> 0.9, >= 0.9.4)
       rb-inotify (~> 0.9, >= 0.9.7)
-    middleman (3.4.0)
+    middleman (3.4.1)
       coffee-script (~> 2.2)
       compass (>= 1.0.0, < 2.0.0)
       compass-import-once (= 1.0.5)
       execjs (~> 2.0)
       haml (>= 4.0.5)
       kramdown (~> 1.2)
-      middleman-core (= 3.4.0)
+      middleman-core (= 3.4.1)
       middleman-sprockets (>= 3.1.2)
       sass (>= 3.4.0, < 4.0)
       uglifier (~> 2.5)
@@ -67,7 +67,7 @@ GEM
       addressable (~> 2.3.5)
       middleman-core (~> 3.2)
       tzinfo (>= 0.3.0)
-    middleman-core (3.4.0)
+    middleman-core (3.4.1)
       activesupport (~> 4.1)
       bundler (~> 1.1)
       capybara (~> 2.4.4)
@@ -88,9 +88,9 @@ GEM
       sprockets (~> 2.12.1)
       sprockets-helpers (~> 1.1.0)
       sprockets-sass (~> 1.3.0)
-    middleman-syntax (3.0.0)
+    middleman-syntax (3.2.0)
       middleman-core (>= 3.2)
-      rouge (~> 2.0)
+      rouge (~> 3.2)
     mime-types (3.2.2)
       mime-types-data (~> 3.2015)
     mime-types-data (3.2019.0331)
@@ -116,7 +116,7 @@ GEM
       ffi (~> 1.0)
     rdiscount (2.2.0.1)
     ref (2.0.0)
-    rouge (2.2.1)
+    rouge (3.9.0)
     sass (3.4.25)
     sprockets (2.12.5)
       hike (~> 1.2)
@@ -151,10 +151,10 @@ PLATFORMS

 DEPENDENCIES
   htmlentities
-  middleman (= 3.4.0)
-  middleman-blog (= 3.5.3)
-  middleman-livereload (= 3.4.6)
-  middleman-syntax (= 3.0.0)
+  middleman (~> 3)
+  middleman-blog (~> 3)
+  middleman-livereload (~> 3)
+  middleman-syntax (~> 3)
   rake
   rdiscount (= 2.2.0.1)
   therubyracer
[mesos] 01/03: Added MESOS-9887 to the 1.8.2 CHANGELOG.
This is an automated email from the ASF dual-hosted git repository.

abudnik pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

commit 967f105cea4bc31780bfe76bd2d62ad71ffae221
Author: Andrei Budnik
AuthorDate: Mon Aug 26 15:02:40 2019 +0200

    Added MESOS-9887 to the 1.8.2 CHANGELOG.
---
 CHANGELOG | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CHANGELOG b/CHANGELOG
index fe08b76..a215e5c 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -49,6 +49,7 @@ Release Notes - Mesos - Version 1.8.2 (WIP)
  * [MESOS-9785] - Frameworks recovered from reregistered agents are not reported to master `/api/v1` subscribers.
  * [MESOS-9836] - Docker containerizer overwrites `/mesos/slave` cgroups.
  * [MESOS-9868] - NetworkInfo from the agent /state endpoint is not correct.
+ * [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor.
  * [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed.
  * [MESOS-9925] - Default executor takes a couple of seconds to start and subscribe Mesos agent.
[mesos] 03/03: Added MESOS-9887 to the 1.6.3 CHANGELOG.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 24e989e66507932809c7d852a4a62720de7cb27b Author: Andrei Budnik AuthorDate: Mon Aug 26 14:44:54 2019 +0200 Added MESOS-9887 to the 1.6.3 CHANGELOG. --- CHANGELOG | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG b/CHANGELOG index 3cd7661..58cf418 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -986,6 +986,7 @@ Release Notes - Mesos - Version 1.6.3 (WIP) * [MESOS-9856] - REVIVE call with specified role(s) clears filters for all roles of a framework. * [MESOS-9868] - NetworkInfo from the agent /state endpoint is not correct. * [MESOS-9870] - Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master. + * [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor. * [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed. ** Improvement
[mesos] branch master updated (f0be237 -> 24e989e)
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git. from f0be237 Fixed out-of-order processing of terminal status updates in agent. new 967f105 Added MESOS-9887 to the 1.8.2 CHANGELOG. new 6b2d101 Added MESOS-9887 to the 1.7.3 CHANGELOG. new 24e989e Added MESOS-9887 to the 1.6.3 CHANGELOG. The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: CHANGELOG | 3 +++ 1 file changed, 3 insertions(+)
[mesos] 02/03: Added MESOS-9887 to the 1.7.3 CHANGELOG.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 6b2d101770ae8853d021b8cc5d0f5ae587302a54 Author: Andrei Budnik AuthorDate: Mon Aug 26 14:58:45 2019 +0200 Added MESOS-9887 to the 1.7.3 CHANGELOG. --- CHANGELOG | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG b/CHANGELOG index a215e5c..3cd7661 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -504,6 +504,7 @@ Release Notes - Mesos - Version 1.7.3 (WIP) * [MESOS-9856] - REVIVE call with specified role(s) clears filters for all roles of a framework. * [MESOS-9868] - NetworkInfo from the agent /state endpoint is not correct. * [MESOS-9870] - Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master. + * [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor. * [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed. * [MESOS-9925] - Default executor takes a couple of seconds to start and subscribe Mesos agent.
[mesos] 02/03: Fixed out-of-order processing of terminal status updates in agent.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.8.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit 4bbb0376cd584a4160a2c5c2f0ac4f3ecaa5e622 Author: Andrei Budnik AuthorDate: Tue Aug 20 19:24:44 2019 +0200 Fixed out-of-order processing of terminal status updates in agent. Previously, Mesos agent could send TASK_FAILED status update on executor termination while processing of TASK_FINISHED status update was in progress. Processing of task status updates involves sending requests to the containerizer, which might finish processing of these requests out-of-order, e.g. `MesosContainerizer::status`. Also, the agent does not overwrite status of the terminal status update once it's stored in the `terminatedTasks`. Hence, there was a race condition between two terminal status updates. Note that V1 Executors are not affected by this problem because they wait for an acknowledgement of the terminal status update by the agent before terminating. This patch introduces a new data structure `pendingStatusUpdates`, which holds a list of status updates that are being processed. This data structure allows validating the order of processing of status updates by the agent. Review: https://reviews.apache.org/r/71343 --- src/slave/slave.cpp | 62 ++--- src/slave/slave.hpp | 6 ++ 2 files changed, 65 insertions(+), 3 deletions(-) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index 50a7d68..8d8cef3 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5727,6 +5727,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option& pid) metrics.valid_status_updates++; + executor->addPendingTaskStatus(status); + // Before sending update, we need to retrieve the container status // if the task reached the executor. 
For tasks that are queued, we // do not need to send the container status and we must @@ -5938,6 +5940,17 @@ void Slave::___statusUpdate( VLOG(1) << "Task status update manager successfully handled status update " << update; + const TaskStatus& status = update.status(); + + Executor* executor = nullptr; + Framework* framework = getFramework(update.framework_id()); + if (framework != nullptr) { +executor = framework->getExecutor(status.task_id()); +if (executor != nullptr) { + executor->removePendingTaskStatus(status); +} + } + if (pid == UPID()) { return; } @@ -5945,7 +5958,7 @@ void Slave::___statusUpdate( StatusUpdateAcknowledgementMessage message; message.mutable_framework_id()->MergeFrom(update.framework_id()); message.mutable_slave_id()->MergeFrom(update.slave_id()); - message.mutable_task_id()->MergeFrom(update.status().task_id()); + message.mutable_task_id()->MergeFrom(status.task_id()); message.set_uuid(update.uuid()); // Task status update manager successfully handled the status update. @@ -5957,14 +5970,12 @@ void Slave::___statusUpdate( send(pid.get(), message); } else { // Acknowledge the HTTP based executor. -Framework* framework = getFramework(update.framework_id()); if (framework == nullptr) { LOG(WARNING) << "Ignoring sending acknowledgement for status update " << update << " of unknown framework"; return; } -Executor* executor = framework->getExecutor(update.status().task_id()); if (executor == nullptr) { // Refer to the comments in 'statusUpdate()' on when this can // happen. 
@@ -10520,6 +10531,33 @@ void Executor::recoverTask(const TaskState& state, bool recheckpointTask) } +void Executor::addPendingTaskStatus(const TaskStatus& status) +{ + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + pendingStatusUpdates[status.task_id()][uuid] = status; +} + + +void Executor::removePendingTaskStatus(const TaskStatus& status) +{ + const TaskID& taskId = status.task_id(); + + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + + if (!pendingStatusUpdates.contains(taskId) || + !pendingStatusUpdates[taskId].contains(uuid)) { +LOG(WARNING) << "Unknown pending status update (uuid: " << uuid << ")"; +return; + } + + pendingStatusUpdates[taskId].erase(uuid); + + if (pendingStatusUpdates[taskId].empty()) { +pendingStatusUpdates.erase(taskId); + } +} + + Try Executor::updateTaskState(const TaskStatus& status) { bool terminal = protobuf::isTerminalState(status.state()); @@ -10543,6 +10581,24 @@ Try Executor::updateTaskState(const TaskStatus& status) task = launchedTasks.at(status.task_id()); if (terminal) { + if (pendingStatusUpdates.contains(status.task_id())) { +auto statusUpdates = pendingStatusUpdates[status.task_id()].values(); + +auto firstTerminal = std::find_if( +statusUpdates.begin(), +statusUpdates.end(), +
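The bookkeeping described in the commit message above can be illustrated with a minimal, self-contained sketch. It is not the Mesos implementation: `TaskStatus`, `TaskState`, `PendingStatusUpdates`, and `isFirstPendingTerminal` are hypothetical stand-ins, plain strings replace `TaskID` and `id::UUID`, and a `std::vector` per task is assumed in order to keep insertion order. The diff above is truncated before the actual terminal-update check, so the check shown here is only one plausible way to use such a structure.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos protobuf types.
enum class TaskState { RUNNING, FINISHED, FAILED };

struct TaskStatus {
  std::string taskId;
  std::string uuid;
  TaskState state;
};

inline bool isTerminal(TaskState state) {
  return state == TaskState::FINISHED || state == TaskState::FAILED;
}

// Tracks status updates that are still being processed, per task,
// in the order in which processing started.
class PendingStatusUpdates {
public:
  // Called when processing of an update begins.
  void add(const TaskStatus& status) {
    assert(!status.uuid.empty());
    pending_[status.taskId].push_back(status);
  }

  // Called once an update has been handled. Returns false if the
  // update was not known to be pending.
  bool remove(const TaskStatus& status) {
    auto task = pending_.find(status.taskId);
    if (task == pending_.end()) {
      return false;
    }

    auto& updates = task->second;
    auto it = std::find_if(
        updates.begin(), updates.end(),
        [&](const TaskStatus& s) { return s.uuid == status.uuid; });
    if (it == updates.end()) {
      return false;
    }

    updates.erase(it);
    if (updates.empty()) {
      pending_.erase(task);
    }
    return true;
  }

  // Returns true unless an earlier terminal update for the same task
  // is still pending (assumed policy: the first terminal update wins).
  bool isFirstPendingTerminal(const TaskStatus& status) const {
    auto task = pending_.find(status.taskId);
    if (task == pending_.end()) {
      return true;
    }

    const auto& updates = task->second;
    auto first = std::find_if(
        updates.begin(), updates.end(),
        [](const TaskStatus& s) { return isTerminal(s.state); });
    return first == updates.end() || first->uuid == status.uuid;
  }

  bool empty() const { return pending_.empty(); }

private:
  std::map<std::string, std::vector<TaskStatus>> pending_;
};
```

With this sketch, a `TASK_FAILED` generated on executor termination while `TASK_FINISHED` is still pending would be recognized as a second terminal update and could be rejected instead of racing with the first.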
[mesos] 01/03: Added missing `return` statement in `Slave::statusUpdate`.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.8.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit 14abb82925cdbce746238bc20dc7b8c279a96a67 Author: Andrei Budnik AuthorDate: Fri Aug 23 14:36:18 2019 +0200 Added missing `return` statement in `Slave::statusUpdate`. Previously, if `statusUpdate` was called for a pending task, it would forward the status update and then continue executing `statusUpdate`, which then checks if there is an executor that is aware of this task. Given that a pending task is not known to any executor, it would always handle it by forwarding status update one more time. This patch adds missing `return` statement, which fixes the issue. Review: https://reviews.apache.org/r/71361 --- src/slave/slave.cpp | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index bf87be0..50a7d68 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5659,6 +5659,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option<UPID>& pid) taskStatusUpdateManager->update(update, info.id()) .onAny(defer(self(), &Slave::___statusUpdate, lambda::_1, update, pid)); + +return; } Executor* executor = framework->getExecutor(status.task_id());
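The effect of the missing `return` can be reproduced in a small, self-contained model. This is only a sketch under simplifying assumptions: `Agent`, `forward`, and the `earlyReturn` switch are hypothetical names invented for illustration, and task IDs are plain strings rather than Mesos types.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified model of the agent's control flow.
struct Agent {
  std::set<std::string> pendingTasks;   // tasks not yet known to any executor
  std::set<std::string> executorTasks;  // tasks known to some executor
  std::vector<std::string> forwarded;   // log of forwarded status updates

  void forward(const std::string& taskId) { forwarded.push_back(taskId); }

  // `earlyReturn` toggles between the old (false) and patched (true) flow.
  void statusUpdate(const std::string& taskId, bool earlyReturn) {
    assert(!taskId.empty());

    if (pendingTasks.count(taskId) > 0) {
      forward(taskId);
      if (earlyReturn) {
        return;  // The statement added by the patch.
      }
    }

    if (executorTasks.count(taskId) == 0) {
      // No executor knows this task, so the update is forwarded as-is.
      // Without the early return, a pending task always ends up here
      // and is forwarded a second time.
      forward(taskId);
      return;
    }

    forward(taskId);  // Regular path for tasks known to an executor.
  }
};
```

In this model, calling `statusUpdate` for a pending task with `earlyReturn` set to `false` records two forwarded updates, while the patched flow records one.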
[mesos] branch 1.8.x updated (f3aa802 -> adc958f)
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a change to branch 1.8.x in repository https://gitbox.apache.org/repos/asf/mesos.git. from f3aa802 Added MESOS-9836 to the 1.8.2 CHANGELOG. new 14abb82 Added missing `return` statement in `Slave::statusUpdate`. new 4bbb037 Fixed out-of-order processing of terminal status updates in agent. new adc958f Added MESOS-9887 to the 1.8.2 CHANGELOG. The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: CHANGELOG | 1 + src/slave/slave.cpp | 64 ++--- src/slave/slave.hpp | 6 + 3 files changed, 68 insertions(+), 3 deletions(-)
[mesos] 03/03: Added MESOS-9887 to the 1.8.2 CHANGELOG.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.8.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit adc958f553c3728aab5529de56b0ddc30c0f9b68 Author: Andrei Budnik AuthorDate: Mon Aug 26 15:02:40 2019 +0200 Added MESOS-9887 to the 1.8.2 CHANGELOG. --- CHANGELOG | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG b/CHANGELOG index b3fca25..ff89605 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -6,6 +6,7 @@ Release Notes - Mesos - Version 1.8.2 (WIP) * [MESOS-9785] - Frameworks recovered from reregistered agents are not reported to master `/api/v1` subscribers. * [MESOS-9836] - Docker containerizer overwrites `/mesos/slave` cgroups. * [MESOS-9868] - NetworkInfo from the agent /state endpoint is not correct. + * [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor. * [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed. * [MESOS-9925] - Default executor takes a couple of seconds to start and subscribe Mesos agent.
[mesos] 02/03: Fixed out-of-order processing of terminal status updates in agent.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.7.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit b7dcc984476904d6d17f7bf699295dfa9ac8a66e Author: Andrei Budnik AuthorDate: Tue Aug 20 19:24:44 2019 +0200 Fixed out-of-order processing of terminal status updates in agent. Previously, Mesos agent could send TASK_FAILED status update on executor termination while processing of TASK_FINISHED status update was in progress. Processing of task status updates involves sending requests to the containerizer, which might finish processing of these requests out-of-order, e.g. `MesosContainerizer::status`. Also, the agent does not overwrite status of the terminal status update once it's stored in the `terminatedTasks`. Hence, there was a race condition between two terminal status updates. Note that V1 Executors are not affected by this problem because they wait for an acknowledgement of the terminal status update by the agent before terminating. This patch introduces a new data structure `pendingStatusUpdates`, which holds a list of status updates that are being processed. This data structure allows validating the order of processing of status updates by the agent. Review: https://reviews.apache.org/r/71343 --- src/slave/slave.cpp | 62 ++--- src/slave/slave.hpp | 6 ++ 2 files changed, 65 insertions(+), 3 deletions(-) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index edfe3d0..f10aac2 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5486,6 +5486,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option& pid) metrics.valid_status_updates++; + executor->addPendingTaskStatus(status); + // Before sending update, we need to retrieve the container status // if the task reached the executor. 
For tasks that are queued, we // do not need to send the container status and we must @@ -5697,6 +5699,17 @@ void Slave::___statusUpdate( VLOG(1) << "Task status update manager successfully handled status update " << update; + const TaskStatus& status = update.status(); + + Executor* executor = nullptr; + Framework* framework = getFramework(update.framework_id()); + if (framework != nullptr) { +executor = framework->getExecutor(status.task_id()); +if (executor != nullptr) { + executor->removePendingTaskStatus(status); +} + } + if (pid == UPID()) { return; } @@ -5704,7 +5717,7 @@ void Slave::___statusUpdate( StatusUpdateAcknowledgementMessage message; message.mutable_framework_id()->MergeFrom(update.framework_id()); message.mutable_slave_id()->MergeFrom(update.slave_id()); - message.mutable_task_id()->MergeFrom(update.status().task_id()); + message.mutable_task_id()->MergeFrom(status.task_id()); message.set_uuid(update.uuid()); // Task status update manager successfully handled the status update. @@ -5716,14 +5729,12 @@ void Slave::___statusUpdate( send(pid.get(), message); } else { // Acknowledge the HTTP based executor. -Framework* framework = getFramework(update.framework_id()); if (framework == nullptr) { LOG(WARNING) << "Ignoring sending acknowledgement for status update " << update << " of unknown framework"; return; } -Executor* executor = framework->getExecutor(update.status().task_id()); if (executor == nullptr) { // Refer to the comments in 'statusUpdate()' on when this can // happen. 
@@ -9861,6 +9872,33 @@ void Executor::recoverTask(const TaskState& state, bool recheckpointTask) } +void Executor::addPendingTaskStatus(const TaskStatus& status) +{ + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + pendingStatusUpdates[status.task_id()][uuid] = status; +} + + +void Executor::removePendingTaskStatus(const TaskStatus& status) +{ + const TaskID& taskId = status.task_id(); + + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + + if (!pendingStatusUpdates.contains(taskId) || + !pendingStatusUpdates[taskId].contains(uuid)) { +LOG(WARNING) << "Unknown pending status update (uuid: " << uuid << ")"; +return; + } + + pendingStatusUpdates[taskId].erase(uuid); + + if (pendingStatusUpdates[taskId].empty()) { +pendingStatusUpdates.erase(taskId); + } +} + + Try Executor::updateTaskState(const TaskStatus& status) { bool terminal = protobuf::isTerminalState(status.state()); @@ -9884,6 +9922,24 @@ Try Executor::updateTaskState(const TaskStatus& status) task = launchedTasks.at(status.task_id()); if (terminal) { + if (pendingStatusUpdates.contains(status.task_id())) { +auto statusUpdates = pendingStatusUpdates[status.task_id()].values(); + +auto firstTerminal = std::find_if( +statusUpdates.begin(), +statusUpdates.end(), +
[mesos] 01/03: Added missing `return` statement in `Slave::statusUpdate`.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.7.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit 2d62e8ae0ef94f78c9b32be258a08d1e6e2382df Author: Andrei Budnik AuthorDate: Fri Aug 23 14:36:18 2019 +0200 Added missing `return` statement in `Slave::statusUpdate`. Previously, if `statusUpdate` was called for a pending task, it would forward the status update and then continue executing `statusUpdate`, which then checks if there is an executor that is aware of this task. Given that a pending task is not known to any executor, it would always handle it by forwarding status update one more time. This patch adds missing `return` statement, which fixes the issue. Review: https://reviews.apache.org/r/71361 --- src/slave/slave.cpp | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index 1c33579..edfe3d0 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5418,6 +5418,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option<UPID>& pid) taskStatusUpdateManager->update(update, info.id()) .onAny(defer(self(), &Slave::___statusUpdate, lambda::_1, update, pid)); + +return; } Executor* executor = framework->getExecutor(status.task_id());
[mesos] 03/03: Added MESOS-9887 to the 1.7.3 CHANGELOG.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.7.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit 80d42b9a2c9223665a82bbaaf3cbc222a094e2ef Author: Andrei Budnik AuthorDate: Mon Aug 26 14:58:45 2019 +0200 Added MESOS-9887 to the 1.7.3 CHANGELOG. --- CHANGELOG | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG b/CHANGELOG index 06c88db..1178228 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -29,6 +29,7 @@ Release Notes - Mesos - Version 1.7.3 (WIP) * [MESOS-9856] - REVIVE call with specified role(s) clears filters for all roles of a framework. * [MESOS-9868] - NetworkInfo from the agent /state endpoint is not correct. * [MESOS-9870] - Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master. + * [MESOS-9887] - Race condition between two terminal task status updates for Docker/Command executor. * [MESOS-9893] - `volume/secret` isolator should cleanup the stored secret from runtime directory when the container is destroyed. * [MESOS-9925] - Default executor takes a couple of seconds to start and subscribe Mesos agent.
[mesos] 01/03: Added missing `return` statement in `Slave::statusUpdate`.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.6.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit cc79f22fb07cfad8f248150d5a3040f846998c3a Author: Andrei Budnik AuthorDate: Fri Aug 23 14:36:18 2019 +0200 Added missing `return` statement in `Slave::statusUpdate`. Previously, if `statusUpdate` was called for a pending task, it would forward the status update and then continue executing `statusUpdate`, which then checks if there is an executor that is aware of this task. Given that a pending task is not known to any executor, it would always handle it by forwarding status update one more time. This patch adds missing `return` statement, which fixes the issue. Review: https://reviews.apache.org/r/71361 --- src/slave/slave.cpp | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index 2a90e96..176d3fb 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5388,6 +5388,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option<UPID>& pid) taskStatusUpdateManager->update(update, info.id()) .onAny(defer(self(), &Slave::___statusUpdate, lambda::_1, update, pid)); + +return; } Executor* executor = framework->getExecutor(status.task_id());
[mesos] 02/03: Fixed out-of-order processing of terminal status updates in agent.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch 1.6.x in repository https://gitbox.apache.org/repos/asf/mesos.git commit 3ad802ebbe34565a2fa995d834ba4928c20e5e62 Author: Andrei Budnik AuthorDate: Tue Aug 20 19:24:44 2019 +0200 Fixed out-of-order processing of terminal status updates in agent. Previously, Mesos agent could send TASK_FAILED status update on executor termination while processing of TASK_FINISHED status update was in progress. Processing of task status updates involves sending requests to the containerizer, which might finish processing of these requests out-of-order, e.g. `MesosContainerizer::status`. Also, the agent does not overwrite status of the terminal status update once it's stored in the `terminatedTasks`. Hence, there was a race condition between two terminal status updates. Note that V1 Executors are not affected by this problem because they wait for an acknowledgement of the terminal status update by the agent before terminating. This patch introduces a new data structure `pendingStatusUpdates`, which holds a list of status updates that are being processed. This data structure allows validating the order of processing of status updates by the agent. Review: https://reviews.apache.org/r/71343 --- src/slave/slave.cpp | 62 ++--- src/slave/slave.hpp | 6 ++ 2 files changed, 65 insertions(+), 3 deletions(-) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index 176d3fb..0861ac2 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5456,6 +5456,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option& pid) metrics.valid_status_updates++; + executor->addPendingTaskStatus(status); + // Before sending update, we need to retrieve the container status // if the task reached the executor. 
For tasks that are queued, we // do not need to send the container status and we must @@ -5667,6 +5669,17 @@ void Slave::___statusUpdate( VLOG(1) << "Task status update manager successfully handled status update " << update; + const TaskStatus& status = update.status(); + + Executor* executor = nullptr; + Framework* framework = getFramework(update.framework_id()); + if (framework != nullptr) { +executor = framework->getExecutor(status.task_id()); +if (executor != nullptr) { + executor->removePendingTaskStatus(status); +} + } + if (pid == UPID()) { return; } @@ -5674,7 +5687,7 @@ void Slave::___statusUpdate( StatusUpdateAcknowledgementMessage message; message.mutable_framework_id()->MergeFrom(update.framework_id()); message.mutable_slave_id()->MergeFrom(update.slave_id()); - message.mutable_task_id()->MergeFrom(update.status().task_id()); + message.mutable_task_id()->MergeFrom(status.task_id()); message.set_uuid(update.uuid()); // Task status update manager successfully handled the status update. @@ -5686,14 +5699,12 @@ void Slave::___statusUpdate( send(pid.get(), message); } else { // Acknowledge the HTTP based executor. -Framework* framework = getFramework(update.framework_id()); if (framework == nullptr) { LOG(WARNING) << "Ignoring sending acknowledgement for status update " << update << " of unknown framework"; return; } -Executor* executor = framework->getExecutor(update.status().task_id()); if (executor == nullptr) { // Refer to the comments in 'statusUpdate()' on when this can // happen. 
@@ -9759,6 +9770,33 @@ void Executor::recoverTask(const TaskState& state, bool recheckpointTask) } +void Executor::addPendingTaskStatus(const TaskStatus& status) +{ + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + pendingStatusUpdates[status.task_id()][uuid] = status; +} + + +void Executor::removePendingTaskStatus(const TaskStatus& status) +{ + const TaskID& taskId = status.task_id(); + + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + + if (!pendingStatusUpdates.contains(taskId) || + !pendingStatusUpdates[taskId].contains(uuid)) { +LOG(WARNING) << "Unknown pending status update (uuid: " << uuid << ")"; +return; + } + + pendingStatusUpdates[taskId].erase(uuid); + + if (pendingStatusUpdates[taskId].empty()) { +pendingStatusUpdates.erase(taskId); + } +} + + Try Executor::updateTaskState(const TaskStatus& status) { bool terminal = protobuf::isTerminalState(status.state()); @@ -9782,6 +9820,24 @@ Try Executor::updateTaskState(const TaskStatus& status) task = launchedTasks.at(status.task_id()); if (terminal) { + if (pendingStatusUpdates.contains(status.task_id())) { +auto statusUpdates = pendingStatusUpdates[status.task_id()].values(); + +auto firstTerminal = std::find_if( +statusUpdates.begin(), +statusUpdates.end(), +
[mesos] branch 1.6.x updated (9badb3b -> d77029f)
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a change to branch 1.6.x in repository https://gitbox.apache.org/repos/asf/mesos.git. from 9badb3b Added MESOS-9836 to the 1.6.3 CHANGELOG. new cc79f22 Added missing `return` statement in `Slave::statusUpdate`. new 3ad802e Fixed out-of-order processing of terminal status updates in agent. new d77029f Added MESOS-9887 to the 1.6.3 CHANGELOG. The 3 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: CHANGELOG | 1 + src/slave/slave.cpp | 64 ++--- src/slave/slave.hpp | 6 + 3 files changed, 68 insertions(+), 3 deletions(-)
[mesos] branch master updated (48c20bf -> f0be237)
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git. from 48c20bf Updated site's dependencies. new 8aae23e Added missing `return` statement in `Slave::statusUpdate`. new f0be237 Fixed out-of-order processing of terminal status updates in agent. The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: src/slave/slave.cpp | 64 ++--- src/slave/slave.hpp | 6 + 2 files changed, 67 insertions(+), 3 deletions(-)
[mesos] 01/02: Added missing `return` statement in `Slave::statusUpdate`.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit 8aae23ec7cd4bc50532df0b1d1ea6ec23ce078f8 Author: Andrei Budnik AuthorDate: Fri Aug 23 14:36:18 2019 +0200 Added missing `return` statement in `Slave::statusUpdate`. Previously, if `statusUpdate` was called for a pending task, it would forward the status update and then continue executing `statusUpdate`, which then checks if there is an executor that is aware of this task. Given that a pending task is not known to any executor, it would always handle it by forwarding status update one more time. This patch adds missing `return` statement, which fixes the issue. Review: https://reviews.apache.org/r/71361 --- src/slave/slave.cpp | 2 ++ 1 file changed, 2 insertions(+) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index 882040d..45f1584 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5879,6 +5879,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option<UPID>& pid) taskStatusUpdateManager->update(update, info.id()) .onAny(defer(self(), &Slave::___statusUpdate, lambda::_1, update, pid)); + +return; } Executor* executor = framework->getExecutor(status.task_id());
[mesos] 02/02: Fixed out-of-order processing of terminal status updates in agent.
This is an automated email from the ASF dual-hosted git repository. abudnik pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/mesos.git commit f0be23765531b05661ed7f1b124faf96744aa80b Author: Andrei Budnik AuthorDate: Tue Aug 20 19:24:44 2019 +0200 Fixed out-of-order processing of terminal status updates in agent. Previously, Mesos agent could send TASK_FAILED status update on executor termination while processing of TASK_FINISHED status update was in progress. Processing of task status updates involves sending requests to the containerizer, which might finish processing of these requests out-of-order, e.g. `MesosContainerizer::status`. Also, the agent does not overwrite status of the terminal status update once it's stored in the `terminatedTasks`. Hence, there was a race condition between two terminal status updates. Note that V1 Executors are not affected by this problem because they wait for an acknowledgement of the terminal status update by the agent before terminating. This patch introduces a new data structure `pendingStatusUpdates`, which holds a list of status updates that are being processed. This data structure allows validating the order of processing of status updates by the agent. Review: https://reviews.apache.org/r/71343 --- src/slave/slave.cpp | 62 ++--- src/slave/slave.hpp | 6 ++ 2 files changed, 65 insertions(+), 3 deletions(-) diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp index 45f1584..4e93656 100644 --- a/src/slave/slave.cpp +++ b/src/slave/slave.cpp @@ -5947,6 +5947,8 @@ void Slave::statusUpdate(StatusUpdate update, const Option& pid) metrics.valid_status_updates++; + executor->addPendingTaskStatus(status); + // Before sending update, we need to retrieve the container status // if the task reached the executor. 
For tasks that are queued, we // do not need to send the container status and we must @@ -6158,6 +6160,17 @@ void Slave::___statusUpdate( VLOG(1) << "Task status update manager successfully handled status update " << update; + const TaskStatus& status = update.status(); + + Executor* executor = nullptr; + Framework* framework = getFramework(update.framework_id()); + if (framework != nullptr) { +executor = framework->getExecutor(status.task_id()); +if (executor != nullptr) { + executor->removePendingTaskStatus(status); +} + } + if (pid == UPID()) { return; } @@ -6165,7 +6178,7 @@ void Slave::___statusUpdate( StatusUpdateAcknowledgementMessage message; message.mutable_framework_id()->MergeFrom(update.framework_id()); message.mutable_slave_id()->MergeFrom(update.slave_id()); - message.mutable_task_id()->MergeFrom(update.status().task_id()); + message.mutable_task_id()->MergeFrom(status.task_id()); message.set_uuid(update.uuid()); // Task status update manager successfully handled the status update. @@ -6177,14 +6190,12 @@ void Slave::___statusUpdate( send(pid.get(), message); } else { // Acknowledge the HTTP based executor. -Framework* framework = getFramework(update.framework_id()); if (framework == nullptr) { LOG(WARNING) << "Ignoring sending acknowledgement for status update " << update << " of unknown framework"; return; } -Executor* executor = framework->getExecutor(update.status().task_id()); if (executor == nullptr) { // Refer to the comments in 'statusUpdate()' on when this can // happen. 
@@ -10795,6 +10806,33 @@ void Executor::recoverTask(const TaskState& state, bool recheckpointTask) } +void Executor::addPendingTaskStatus(const TaskStatus& status) +{ + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + pendingStatusUpdates[status.task_id()][uuid] = status; +} + + +void Executor::removePendingTaskStatus(const TaskStatus& status) +{ + const TaskID& taskId = status.task_id(); + + auto uuid = id::UUID::fromBytes(status.uuid()).get(); + + if (!pendingStatusUpdates.contains(taskId) || + !pendingStatusUpdates[taskId].contains(uuid)) { +LOG(WARNING) << "Unknown pending status update (uuid: " << uuid << ")"; +return; + } + + pendingStatusUpdates[taskId].erase(uuid); + + if (pendingStatusUpdates[taskId].empty()) { +pendingStatusUpdates.erase(taskId); + } +} + + Try Executor::updateTaskState(const TaskStatus& status) { bool terminal = protobuf::isTerminalState(status.state()); @@ -10818,6 +10856,24 @@ Try Executor::updateTaskState(const TaskStatus& status) task = launchedTasks.at(status.task_id()); if (terminal) { + if (pendingStatusUpdates.contains(status.task_id())) { +auto statusUpdates = pendingStatusUpdates[status.task_id()].values(); + +auto firstTerminal = std::find_if( +statusUpdates.begin(), +statusUpdates.end(), +
[mesos] branch master updated: Updated site's dependencies.
This is an automated email from the ASF dual-hosted git repository.

bbannier pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git

The following commit(s) were added to refs/heads/master by this push:
     new 48c20bf  Updated site's dependencies.
48c20bf is described below

commit 48c20bf257da60eaf714017efec0d4a80c203c04
Author: Benjamin Bannier
AuthorDate: Mon Aug 26 10:06:18 2019 +0200

    Updated site's dependencies.

    This bumps e.g., `nokogiri` to a version not affected by CVE-2019-5477
    anymore (not that it would have any impact on our use of it).

    Review: https://reviews.apache.org/r/71367/
---
 site/Gemfile.lock | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/site/Gemfile.lock b/site/Gemfile.lock
index 343d3e6..63c48e7 100644
--- a/site/Gemfile.lock
+++ b/site/Gemfile.lock
@@ -1,7 +1,7 @@
 GEM
   remote: https://rubygems.org/
   specs:
-    activesupport (4.2.10)
+    activesupport (4.2.11.1)
       i18n (~> 0.7)
       minitest (~> 5.1)
       thread_safe (~> 0.3, >= 0.3.4)
@@ -13,7 +13,7 @@ GEM
       rack (>= 1.0.0)
       rack-test (>= 0.5.4)
       xpath (~> 2.0)
-    chunky_png (1.3.10)
+    chunky_png (1.3.11)
     coffee-script (2.4.1)
       coffee-script-source
       execjs
@@ -36,8 +36,8 @@ GEM
     erubis (2.7.0)
     eventmachine (1.2.7)
     execjs (2.7.0)
-    ffi (1.9.25)
-    haml (5.0.4)
+    ffi (1.11.1)
+    haml (5.1.2)
       temple (>= 0.8.0)
       tilt
     hike (1.2.3)
@@ -46,7 +46,7 @@ GEM
     htmlentities (4.3.4)
     http_parser.rb (0.6.0)
     i18n (0.7.0)
-    json (2.1.0)
+    json (2.2.0)
     kramdown (1.17.0)
     libv8 (3.16.14.19)
     listen (3.0.8)
@@ -93,11 +93,11 @@ GEM
       rouge (~> 2.0)
     mime-types (3.2.2)
       mime-types-data (~> 3.2015)
-    mime-types-data (3.2018.0812)
+    mime-types-data (3.2019.0331)
     mini_portile2 (2.4.0)
     minitest (5.11.3)
     multi_json (1.13.1)
-    nokogiri (1.10.2)
+    nokogiri (1.10.4)
       mini_portile2 (~> 2.4.0)
     padrino-helpers (0.12.9)
       i18n (~> 0.6, >= 0.6.7)
@@ -110,10 +110,10 @@ GEM
       rack
     rack-test (1.1.0)
       rack (>= 1.0, < 3)
-    rake (12.3.1)
+    rake (12.3.3)
     rb-fsevent (0.10.3)
-    rb-inotify (0.9.10)
-      ffi (>= 0.5.0, < 2)
+    rb-inotify (0.10.0)
+      ffi (~> 1.0)
     rdiscount (2.2.0.1)
     ref (2.0.0)
     rouge (2.2.1)
@@ -128,16 +128,16 @@ GEM
     sprockets-sass (1.3.1)
       sprockets (~> 2.0)
       tilt (~> 1.1)
-    temple (0.8.0)
+    temple (0.8.1)
     therubyracer (0.12.3)
       libv8 (~> 3.16.14.15)
       ref
-    thor (0.20.0)
+    thor (0.20.3)
     thread_safe (0.3.6)
     tilt (1.4.1)
     tzinfo (1.2.5)
       thread_safe (~> 0.1)
-    tzinfo-data (1.2018.5)
+    tzinfo-data (1.2019.2)
       tzinfo (>= 1.0.0)
     uber (0.0.15)
     uglifier (2.7.2)
@@ -161,4 +161,4 @@ DEPENDENCIES
   tzinfo-data

 BUNDLED WITH
-   1.16.1
+   1.17.2