[jira] [Updated] (MESOS-3272) CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer is flaky.
[ https://issues.apache.org/jira/browse/MESOS-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-3272: --- Shepherd: Benjamin Mahler Thanks [~qiujian], I will shepherd the fix. > CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer is flaky. > > > Key: MESOS-3272 > URL: https://issues.apache.org/jira/browse/MESOS-3272 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Paul Brett >Assignee: Jian Qiu > Attachments: build.log > > > Test aborts when configured with python, libevent and SSL on Ubuntu12. > [ RUN ] > CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer > *** Aborted at 1439667937 (unix time) try "date -d @1439667937" if you are > using GNU date *** > PC: @ 0x7feba972a753 (unknown) > *** SIGSEGV (@0x0) received by PID 4359 (TID 0x7febabf897c0) from PID 0; > stack trace: *** > @ 0x7feba8f7dcb0 (unknown) > @ 0x7feba972a753 (unknown) > @ 0x7febaaa69328 process::dispatch<>() > @ 0x7febaaa5e9a7 cgroups::freezer::thaw() > @ 0xba64ff > mesos::internal::tests::CgroupsAnyHierarchyWithCpuMemoryTest_ROOT_CGROUPS_FreezeNonFreezer_Test::TestBody() > @ 0xc199a3 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xc0f947 testing::Test::Run() > @ 0xc0f9ee testing::TestInfo::Run() > @ 0xc0faf5 testing::TestCase::Run() > @ 0xc0fda8 testing::internal::UnitTestImpl::RunAllTests() > @ 0xc10064 testing::UnitTest::Run() > @ 0x4b3273 main > @ 0x7feba8bd176d (unknown) > @ 0x4bf1f1 (unknown) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-3273: --- Target Version/s: (was: 0.26.0) > EventCall Test Framework is flaky > - > > Key: MESOS-3273 > URL: https://issues.apache.org/jira/browse/MESOS-3273 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 0.24.0 > Environment: > https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull >Reporter: Vinod Kone > Labels: flaky-test, tech-debt, twitter > > Observed this on ASF CI. h/t [~haosd...@gmail.com] > Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. > {code} > [ RUN ] ExamplesTest.EventCallFramework > Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' > I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the > driver is aborted! > Shutting down > Sending SIGTERM to process tree at pid 26061 > Killing the following process trees: > [ > ] > Shutting down > Sending SIGTERM to process tree at pid 26062 > Shutting down > Killing the following process trees: > [ > ] > Sending SIGTERM to process tree at pid 26063 > Killing the following process trees: > [ > ] > Shutting down > Sending SIGTERM to process tree at pid 26098 > Killing the following process trees: > [ > ] > Shutting down > Sending SIGTERM to process tree at pid 26099 > Killing the following process trees: > [ > ] > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on > 172.17.2.10:60249 for 16 cpus > I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR > I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 > I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms > I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms > I0813 19:55:17.176496 26100 
leveldb.cpp:198] Created db iterator in 77155ns > I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in > 8429ns > I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the > db in 4219ns > I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery > I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status > I0813 19:55:17.181970 26126 master.cpp:378] Master > 20150813-195517-167907756-60249-26100 (297daca2d01a) started on > 172.17.2.10:60249 > I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: > --acls="permissive: false > register_frameworks { > principals { > type: SOME > values: "test-principal" > } > roles { > type: SOME > values: "*" > } > } > run_tasks { > principals { > type: SOME > values: "test-principal" > } > users { > type: SOME > values: "mesos" > } > } > " --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="false" > --authenticators="crammd5" > --credentials="/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials" > --framework_sorter="drf" --help="false" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" > --registry_strict="false" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.24.0/src/webui" --work_dir="/tmp/mesos-II8Gua" > --zk_session_timeout="10secs" > I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated > frameworks to register > I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated > slaves to 
register > I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for > authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' > W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials > file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. > It is recommended that your credentials file is NOT accessible by others. > I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received > a broadcasted recover request > I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' > authenticator > I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL > I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover respons
[jira] [Updated] (MESOS-4029) ContentType/SchedulerTest is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4029: --- Shepherd: Benjamin Mahler Assignee: Anand Mazumdar Summary: ContentType/SchedulerTest is flaky. (was: ContentType/SchedulerTest seems flaky.) Took a look. For the traces with the mock expectation crashing (traces containing UntypedInvokeWith), the issue appears to be that our stack object dependencies in the tests are not safely ordered. Specifically, we pass a pointer to the {{Queue}} ([here|https://github.com/apache/mesos/blob/0.26.0-rc2/src/tests/scheduler_tests.cpp#L147]) into the expectations on {{Callbacks}} above. During destruction of the test, the {{Mesos}} class will destruct **after** the {{Queue}} is already destructed. If a non-HEARTBEAT event arrives in this window, the expectation will try to dereference the destructed {{Queue}} object. > ContentType/SchedulerTest is flaky. > --- > > Key: MESOS-4029 > URL: https://issues.apache.org/jira/browse/MESOS-4029 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.26.0 >Reporter: Till Toenshoff >Assignee: Anand Mazumdar > Labels: flaky, flaky-test > > SSL build, [Ubuntu > 14.04|https://github.com/tillt/mesos-vagrant-ci/blob/master/ubuntu14/setup.sh], > non-root test run. 
> {noformat} > [--] 22 tests from ContentType/SchedulerTest > [ RUN ] ContentType/SchedulerTest.Subscribe/0 > [ OK ] ContentType/SchedulerTest.Subscribe/0 (48 ms) > *** Aborted at 1448928007 (unix time) try "date -d @1448928007" if you are > using GNU date *** > [ RUN ] ContentType/SchedulerTest.Subscribe/1 > PC: @ 0x1451b8e > testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith() > *** SIGSEGV (@0x10030) received by PID 21320 (TID 0x2b549e5d4700) from > PID 48; stack trace: *** > @ 0x2b54c95940b7 os::Linux::chained_handler() > @ 0x2b54c9598219 JVM_handle_linux_signal > @ 0x2b5496300340 (unknown) > @ 0x1451b8e > testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith() > @ 0xe2ea6d > _ZN7testing8internal18FunctionMockerBaseIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeIS6_SaIS6_E10InvokeWithERKSt5tupleIJSC_EE > @ 0xe2b1bc testing::internal::FunctionMocker<>::Invoke() > @ 0x1118aed > mesos::internal::tests::SchedulerTest::Callbacks::received() > @ 0x111c453 > _ZNKSt7_Mem_fnIMN5mesos8internal5tests13SchedulerTest9CallbacksEFvRKSt5queueINS0_2v19scheduler5EventESt5dequeIS8_SaIS8_EclIJSE_EvEEvRS4_DpOT_ > @ 0x111c001 > _ZNSt5_BindIFSt7_Mem_fnIMN5mesos8internal5tests13SchedulerTest9CallbacksEFvRKSt5queueINS1_2v19scheduler5EventESt5dequeIS9_SaIS9_ESt17reference_wrapperIS5_ESt12_PlaceholderILi16__callIvJSF_EJLm0ELm1T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE > @ 0x111b90d > _ZNSt5_BindIFSt7_Mem_fnIMN5mesos8internal5tests13SchedulerTest9CallbacksEFvRKSt5queueINS1_2v19scheduler5EventESt5dequeIS9_SaIS9_ESt17reference_wrapperIS5_ESt12_PlaceholderILi1clIJSF_EvEET0_DpOT_ > @ 0x111ae09 std::_Function_handler<>::_M_invoke() > @ 0x2b5493c6da09 std::function<>::operator()() > @ 0x2b5493c688ee process::AsyncExecutorProcess::execute<>() > @ 0x2b5493c6db2a > 
_ZZN7process8dispatchI7NothingNS_20AsyncExecutorProcessERKSt8functionIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeIS8_SaIS8_ESC_PvSG_SC_SJ_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSO_FSL_T1_T2_T3_ET4_T5_T6_ENKUlPNS_11ProcessBaseEE_clES11_ > @ 0x2b5493c765a4 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingNS0_20AsyncExecutorProcessERKSt8functionIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeISC_SaISC_ESG_PvSK_SG_SN_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSS_FSP_T1_T2_T3_ET4_T5_T6_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x2b54946b1201 std::function<>::operator()() > @ 0x2b549469960f process::ProcessBase::visit() > @ 0x2b549469d480 process::DispatchEvent::visit() > @ 0x9dc0ba process::ProcessBase::serve() > @ 0x2b54946958cc process::ProcessManager::resume() > @ 0x2b5494692a9c > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x2b549469ccac > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x2b549469cc5c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x2b549469cbee > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invo
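The destruction-order hazard described in the MESOS-4029 analysis above can be sketched without C++/gMock. This is a toy model with illustrative names, not Mesos test code: the "queue" is torn down first (in C++, stack objects are destroyed in reverse declaration order), while a captured reference to it can still be invoked by a late event.

```python
# Toy model of the MESOS-4029 teardown race. Names are illustrative only.
class EventQueue:
    def __init__(self):
        self.items = []
        self.alive = True

    def put(self, event):
        # Stand-in for dereferencing the already-destructed {{Queue}}.
        if not self.alive:
            raise RuntimeError("use-after-destruction")
        self.items.append(event)

    def destruct(self):
        self.alive = False

events = EventQueue()
enqueue = events.put    # what the mock expectation's action captures
events.destruct()       # queue declared last => destroyed first in teardown
try:
    enqueue("OFFERS")   # late non-HEARTBEAT event in the teardown window
    crashed = False
except RuntimeError:
    crashed = True
assert crashed
```

The fix implied by the analysis is an ordering one: declare the queue before the mock (and the {{Mesos}} object) so it outlives everything that still holds a pointer to it.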
[jira] [Created] (MESOS-4064) Add ContainerInfo to internal Task protobuf.
Benjamin Mahler created MESOS-4064: -- Summary: Add ContainerInfo to internal Task protobuf. Key: MESOS-4064 URL: https://issues.apache.org/jira/browse/MESOS-4064 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler In what seems like an oversight, when ContainerInfo was added to TaskInfo, it was not added to our internal Task protobuf. Also, unlike the agent, it appears that the master does not use protobuf::createTask. We should try to remove the manual construction in the master in favor of construction through protobuf::createTask.
[jira] [Created] (MESOS-4066) Expose when agent is recovering in the agent's /state.json endpoint.
Benjamin Mahler created MESOS-4066: -- Summary: Expose when agent is recovering in the agent's /state.json endpoint. Key: MESOS-4066 URL: https://issues.apache.org/jira/browse/MESOS-4066 Project: Mesos Issue Type: Task Components: slave Reporter: Benjamin Mahler Currently when a user is hitting /state.json on the agent, it may return partial state if the agent has failed over and is recovering. There is currently no clear way to tell if this is the case when looking at a response, so the user may incorrectly interpret the agent as being empty of tasks. We could consider exposing the 'state' enum of the agent in the endpoint: {code} enum State { RECOVERING, // Slave is doing recovery. DISCONNECTED, // Slave is not connected to the master. RUNNING, // Slave has (re-)registered. TERMINATING, // Slave is shutting down. } state; {code} This may be a bit tricky to maintain in terms of backwards-compatibility of the endpoint if we were to alter this enum. Exposing this would allow users to be more informed about the state of the agent.
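To make the MESOS-4066 proposal concrete, here is a hedged sketch of how a client could guard against misreading a recovering agent as empty. The top-level "state" key and the response shape are assumptions for illustration; the ticket only proposes exposing the enum, it does not fix a field name.

```python
import json

def tasks_view(state_json):
    """Return the task list from a hypothetical /state.json response,
    or None when the agent reports it is still recovering (partial state)."""
    doc = json.loads(state_json)
    if doc.get("state") == "RECOVERING":
        # Partial state: do NOT conclude the agent is empty of tasks.
        return None
    return [t for f in doc.get("frameworks", []) for t in f.get("tasks", [])]

assert tasks_view('{"state": "RECOVERING", "frameworks": []}') is None
assert tasks_view('{"state": "RUNNING", "frameworks": []}') == []
```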
[jira] [Commented] (MESOS-4024) HealthCheckTest.CheckCommandTimeout flaky
[ https://issues.apache.org/jira/browse/MESOS-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039701#comment-15039701 ] Benjamin Mahler commented on MESOS-4024: [~haosd...@gmail.com] please do not link to Jenkins, the data gets garbage collected. Do you have logs for this that you can paste in the ticket? > HealthCheckTest.CheckCommandTimeout flaky > - > > Key: MESOS-4024 > URL: https://issues.apache.org/jira/browse/MESOS-4024 > Project: Mesos > Issue Type: Bug >Reporter: haosdent >Assignee: haosdent > Attachments: HealthCheckTest_CheckCommandTimeout.log > > > https://builds.apache.org/job/Mesos/1288/COMPILER=gcc,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleText
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042325#comment-15042325 ] Benjamin Mahler commented on MESOS-3851: The fix is committed, it would be great to re-enable the CHECKs in order to detect this issue should it still be present: {noformat} commit 4201c2c3e5849a01d0a63769404bad03792ae5de Author: Anand Mazumdar Date: Fri Dec 4 14:15:26 2015 -0800 Linked against the executor in the agent to ensure ordered message delivery. Previously, we did not `link` against the executor `PID` while (re)-registering. This might lead to libprocess creating ephemeral sockets everytime a `send` was invoked. This was leading to races where messages might appear on the Executor out of order. This change does a `link` on the executor PID to ensure ordered message delivery. Review: https://reviews.apache.org/r/40660 {noformat} > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. 
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > 
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-
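The ordering guarantee that the MESOS-3851 fix relies on can be modeled abstractly. This is a toy illustration, not libprocess code: with a persistent `link`, all messages to the executor share one FIFO connection; without it, each `send` may use a fresh ephemeral socket, so `RunTaskMessage` can overtake `ExecutorRegisteredMessage` and trip the {{CHECK_SOME(executorInfo)}} described above.

```python
import random

def deliver(messages, linked, rng):
    """Toy delivery model: a linked (persistent) connection preserves send
    order; independent per-send sockets permit any interleaving."""
    received = list(messages)
    if not linked:
        rng.shuffle(received)  # unordered: any interleaving is possible
    return received

sends = ["ExecutorRegisteredMessage", "RunTaskMessage"]
# Linked: order is guaranteed, so registration always precedes the task.
assert deliver(sends, linked=True, rng=random.Random(1)) == sends
# Unlinked: RunTaskMessage may arrive first, the race seen in the crash.
```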
[jira] [Commented] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover
[ https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047385#comment-15047385 ] Benjamin Mahler commented on MESOS-4048: This ticket is independent from MESOS-4049 in that it is discussing the current inconsistent approaches to agent partition handling (case 1 and 2 above). When we were implementing master recovery, we wanted to use health checking to determine when an agent should be removed, but there were some implementation difficulties that led to the addition of {{--slave_reregistration_timer}} instead. This approach is a bit scary because we may remove healthy agents that for some reason (e.g. ZK connectivity issues) could not re-register with the master after master failover. This was why we put in place some safety nets ({{--recovery_slave_removal_limit}} and we were able to re-use the removal rate limiting). The point of this ticket is to look into removing {{--slave_reregistration_timer}} entirely and have the master perform the same health check based partition detection that it does in the steady state. So, MESOS-4049 is about what we do *when* an agent is unhealthy (e.g. partitioned). This ticket is about *how* we determine that an agent is unhealthy (e.g. partitioned). Specifically, we want to determine it in a consistent way rather than having one approach in steady state and a different approach after master failover. Make sense? > Consider unifying slave timeout behavior between steady state and master > failover > - > > Key: MESOS-4048 > URL: https://issues.apache.org/jira/browse/MESOS-4048 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway >Assignee: Anindya Sinha >Priority: Minor > Labels: mesosphere > > Currently, there are two timeouts that control what happens when an agent is > partitioned from the master: > 1. 
{{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the > master waits before declaring a slave to be dead in the "steady state" > 2. {{slave_reregister_timeout}} controls how long the master waits for a > slave to reregister after master failover. > It is unclear whether these two cases really merit being treated differently > -- it might be simpler for operators to configure a single timeout that > controls how long the master waits before declaring that a slave is dead.
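For concreteness, with the default flag values visible in the master startup flags quoted earlier in this digest (--slave_ping_timeout="15secs", --max_slave_ping_timeouts="5", --slave_reregister_timeout="10mins"), the two windows differ by almost an order of magnitude. Note the steady-state window is effectively the product of the two ping flags (one ping timeout per consecutive missed ping):

```python
# Defaults taken from the master flags quoted earlier in this digest.
slave_ping_timeout = 15           # seconds per ping
max_slave_ping_timeouts = 5       # consecutive missed pings tolerated

# Case 1: steady-state removal after ~75s of missed health checks.
steady_state_removal = slave_ping_timeout * max_slave_ping_timeouts

# Case 2: after master failover, agents get a 10min re-registration window.
failover_removal = 10 * 60        # slave_reregister_timeout, in seconds

assert steady_state_removal == 75
assert failover_removal == 600
```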
[jira] [Comment Edited] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover
[ https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047385#comment-15047385 ] Benjamin Mahler edited comment on MESOS-4048 at 12/8/15 8:06 PM: - This ticket is independent from MESOS-4049 in that it is discussing the current inconsistent approaches to agent partition detection (case 1 and 2 above). When we were implementing master recovery, we wanted to use health checking to determine when an agent is unhealthy, but there were some implementation difficulties that led to the addition of {{\-\-slave_reregistration_timer}} instead. This approach is a bit scary because we may remove healthy agents that for some reason (e.g. ZK connectivity issues) could not re-register with the master after master failover. This was why we put in place some safety nets ({{\-\-recovery_slave_removal_limit}} and we were able to re-use the removal rate limiting). The point of this ticket is to look into removing {{\-\-slave_reregistration_timer}} entirely and have the master perform the same health check based partition detection that it does in the steady state. So, MESOS-4049 is about what we do *when* an agent is unhealthy. This ticket is about *how* we determine that an agent is unhealthy. Specifically, we want to determine it in a consistent way rather than having one approach in steady state and a different approach after master failover. Make sense? was (Author: bmahler): This ticket is independent from MESOS-4049 in that it is discussing the current inconsistent approaches to agent partition handling (case 1 and 2 above). When we were implementing master recovery, we wanted to use health checking to determine when an agent should be removed, but there were some implementation difficulties that led to the addition of {{--slave_reregistration_timer}} instead. This approach is a bit scary because we may remove healthy agents that for some reason (e.g. 
ZK connectivity issues) could not re-register with the master after master failover. This was why we put in place some safety nets ({{--recovery_slave_removal_limit}} and we were able to re-use used the removal rate limiting). The point of this ticket is to look into removing {{--slave_reregistration_timer}} entirely and have the master perform the same health check based partition detection that it does in the steady state. So, MESOS-4049 is about what we do *when* an agent is unhealthy (e.g. partitioned). This ticket is about *how* we determine that an agent is unhealthy (e.g. partitioned). Specifically, we want to determine it in a consistent way rather than having one approach in steady state and a different approach after master failover. Make sense? > Consider unifying slave timeout behavior between steady state and master > failover > - > > Key: MESOS-4048 > URL: https://issues.apache.org/jira/browse/MESOS-4048 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway >Assignee: Anindya Sinha >Priority: Minor > Labels: mesosphere > > Currently, there are two timeouts that control what happens when an agent is > partitioned from the master: > 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the > master waits before declaring a slave to be dead in the "steady state" > 2. {{slave_reregister_timeout}} controls how long the master waits for a > slave to reregister after master failover. > It is unclear whether these two cases really merit being treated differently > -- it might be simpler for operators to configure a single timeout that > controls how long the master waits before declaring that a slave is dead.
[jira] [Created] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
Benjamin Mahler created MESOS-4106: -- Summary: The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures. Key: MESOS-4106 URL: https://issues.apache.org/jira/browse/MESOS-4106 Project: Mesos Issue Type: Bug Affects Versions: 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0, 0.22.2, 0.22.1, 0.21.2, 0.21.1, 0.20.1, 0.20.0 Reporter: Benjamin Mahler Priority: Blocker This was reported by [~tan] experimenting with health checks. Many tasks were launched with the following health check, taken from the container stdout/stderr: {code} Launching health check process: /usr/local/libexec/mesos/mesos-health-check --executor=(1)@127.0.0.1:39629 --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} --task_id=sleepy-2 {code} This should have led to all tasks getting killed due to {{\-\-consecutive_failures}} being set, however, only some tasks get killed, while others remain running. It turns out that the health check binary does a {{send}} and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits. We work around this in the command executor with a manual sleep, which has been around since the svn days. See [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
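The send-then-exit race described in MESOS-4106 can be modeled with a small sketch. This is a toy stand-in for libprocess's asynchronous send, not Mesos code: `send` only enqueues and a background thread flushes later, so a process that exits immediately after `send` can lose the message, which is what the command executor's manual sleep guards against.

```python
import threading
import time

class AsyncSender:
    """Toy model: send() enqueues; a background timer flushes the queue
    to `delivered` later, like an async I/O thread writing to the socket."""
    def __init__(self):
        self.pending = []
        self.delivered = []

    def send(self, msg):
        self.pending.append(msg)
        threading.Timer(0.2, self._flush).start()  # flush happens later

    def _flush(self):
        self.delivered.extend(self.pending)
        self.pending.clear()

s = AsyncSender()
s.send("TaskHealthStatus(kill_task=True)")
assert s.delivered == []   # not on the wire yet; exiting now drops it
time.sleep(0.6)            # the manual-sleep workaround before exiting
assert s.delivered == ["TaskHealthStatus(kill_task=True)"]
```

A sleep is of course a crude workaround; the robust alternative is to block until the messaging layer confirms the message was flushed before exiting.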
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049651#comment-15049651 ] Benjamin Mahler commented on MESOS-4106: This is also possibly the reason for MESOS-1613. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set, however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
[jira] [Assigned] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-4106: -- Assignee: Benjamin Mahler [~haosd...@gmail.com]: From my testing so far, yes. I will send a fix and re-enable the test from MESOS-1613. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set, however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049850#comment-15049850 ] Benjamin Mahler commented on MESOS-1613: For posterity, I also wasn't able to reproduce this by just running in repetition. However, when I ran one {{openssl speed}} for each core on my laptop in order to induce load, I could reproduce easily. We probably want to direct folks to try this when they are having trouble reproducing something flaky from CI. I will post a fix through MESOS-4106. > HealthCheckTest.ConsecutiveFailures is flaky > > > Key: MESOS-1613 > URL: https://issues.apache.org/jira/browse/MESOS-1613 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: Ubuntu 10.04 GCC >Reporter: Vinod Kone >Assignee: Timothy Chen > Labels: flaky, mesosphere > > {code} > [ RUN ] HealthCheckTest.ConsecutiveFailures > Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' > I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms > I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms > I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns > I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in > 2125ns > I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the > db in 10747ns > I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery > I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status > I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to > STARTING > I0717 04:39:59.301985 5031 
master.cpp:288] Master > 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 > I0717 04:39:59.302026 5031 master.cpp:325] Master only allowing > authenticated frameworks to register > I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing > authenticated slaves to register > I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' > I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled > I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:40280 > I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising > offers for all slaves > I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.325097ms > I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to > STARTING > I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is > master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 > I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! 
> I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar > I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar > I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status > I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING > I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.204157ms > I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to > VOTING > I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos > group > I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated > I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer > I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0717 04:39:59.313451 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 2.271822ms > I0717 04:39:59.313627 5033 replica.cpp:342] Persisted promised to 1 > I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to > leve
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049889#comment-15049889 ] Benjamin Mahler commented on MESOS-4106: Yeah I had the same thought when I was looking at MESOS-243, but now we also have process::finalize that could be the mechanism for cleanly shutting down before {{exit}} calls. I'll file a ticket to express this issue more generally (MESOS-243 was the original but is specific to the executor driver). > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set, however, only some tasks get killed, > while other remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. 
See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049899#comment-15049899 ] Benjamin Mahler commented on MESOS-4106: Yeah, I'll reference MESOS-4111 now that we have it, I'll also reference it in the existing command executor sleep. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set, however, only some tasks get killed, > while other remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4111) Provide a means for libprocess users to exit while ensuring messages are flushed.
Benjamin Mahler created MESOS-4111: -- Summary: Provide a means for libprocess users to exit while ensuring messages are flushed. Key: MESOS-4111 URL: https://issues.apache.org/jira/browse/MESOS-4111 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler Priority: Minor Currently after a {{send}} there is no way to ensure that the message is flushed on the socket before terminating. We work around this by inserting {{os::sleep}} calls (see MESOS-243, MESOS-4106). There are a number of approaches to this: (1) Return a Future from send that notifies when the message is flushed from the system. (2) Call process::finalize before exiting. This would require that process::finalize flushes all of the outstanding data on any active sockets, which may block. Regardless of the approach, there needs to be a timer if we want to guarantee termination.
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051371#comment-15051371 ] Benjamin Mahler commented on MESOS-4106: I'm not sure we should say sleeping provides a "very weak guarantee", there is indeed *no guarantee* with a sleep that the message is sent. The approach you've suggested with querying with a timeout still provides no form of guarantee, unless you are going to wait indefinitely or use the timeout mentioned to trigger a retry rather than an exit (what did you intend to happen after the timeout?). This approach is guaranteeing application-level delivery, and we generally just use an "acknowledgement" message with retries to do this, rather than a separate query. However, since the executor resides on the same machine, and executor failover is not supported, we're unlikely to bother implementing acknowledgements with retries here. We only need to wait for the data to be sent on the socket (this gives a "weak guarantee": e.g. if there are no socket errors (note that both ends of the socket are within the same machine), and the executor remains up, the message will eventually be processed by the executor). MESOS-4111 discusses the general issue of being able to exit after ensuring that messages are processed in libprocess. In the case of the long-standing command executor sleep, we needed to handle agent failure. So we are already using acknowledgements there, and can use them to {{stop()}} cleanly. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. 
> > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > Fix For: 0.27.0 > > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set, however, only some tasks get killed, > while other remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4109) HTTPConnectionTest.ClosingResponse is flaky
[ https://issues.apache.org/jira/browse/MESOS-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-4109: -- Assignee: Benjamin Mahler Thanks for filing! I introduced this, I'll fix it shortly. > HTTPConnectionTest.ClosingResponse is flaky > --- > > Key: MESOS-4109 > URL: https://issues.apache.org/jira/browse/MESOS-4109 > Project: Mesos > Issue Type: Bug > Components: libprocess, test >Affects Versions: 0.26.0 > Environment: ASF Ubuntu 14 > {{--enable-ssl --enable-libevent}} >Reporter: Joseph Wu >Assignee: Benjamin Mahler >Priority: Minor > Labels: flaky, flaky-test, newbie, test > > Output of the test: > {code} > [ RUN ] HTTPConnectionTest.ClosingResponse > I1210 01:20:27.048532 26671 process.cpp:3077] Handling HTTP event for process > '(22)' with path: '/(22)/get' > ../../../3rdparty/libprocess/src/tests/http_tests.cpp:919: Failure > Actual function call count doesn't match EXPECT_CALL(*http.process, get(_))... > Expected: to be called twice >Actual: called once - unsatisfied and active > [ FAILED ] HTTPConnectionTest.ClosingResponse (43 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-920) Set GLOG_drop_log_memory=false in environment prior to logging initialization.
[ https://issues.apache.org/jira/browse/MESOS-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-920: -- Description: We've observed issues where the masters are slow to respond. Two perf traces collected while the masters were slow to respond: {noformat} 25.84% [kernel][k] default_send_IPI_mask_sequence_phys 20.44% [kernel][k] native_write_msr_safe 4.54% [kernel][k] _raw_spin_lock 2.95% libc-2.5.so [.] _int_malloc 1.82% libc-2.5.so [.] malloc 1.55% [kernel][k] apic_timer_interrupt 1.36% libc-2.5.so [.] _int_free {noformat} {noformat} 29.03% [kernel][k] default_send_IPI_mask_sequence_phys 9.64% [kernel][k] _raw_spin_lock 7.38% [kernel][k] native_write_msr_safe 2.43% libc-2.5.so [.] _int_malloc 2.05% libc-2.5.so [.] _int_free 1.67% [kernel][k] apic_timer_interrupt 1.58% libc-2.5.so [.] malloc {noformat} These have been found to be attributed to the posix_fadvise calls made by glog. We can disable these via the environment: {noformat} GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log contents. " "Logs can grow very quickly and they are rarely read before they " "need to be evicted from memory. Instead, drop them from memory " "as soon as they are flushed to disk."); {noformat} {code} if (FLAGS_drop_log_memory) { if (file_length_ >= logging::kPageSize) { // don't evict the most recent page uint32 len = file_length_ & ~(logging::kPageSize - 1); posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED); } } {code} We should set GLOG_drop_log_memory=false prior to making our call to google::InitGoogleLogging, to avoid others running into this issue. was: We've observed performance scaling issues attributed to the posix_fadvise calls made by glog. This can currently only disabled via the environment: GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log contents. " "Logs can grow very quickly and they are rarely read before they " "need to be evicted from memory. 
Instead, drop them from memory " "as soon as they are flushed to disk."); if (FLAGS_drop_log_memory) { if (file_length_ >= logging::kPageSize) { // don't evict the most recent page uint32 len = file_length_ & ~(logging::kPageSize - 1); posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED); } } We should set GLOG_drop_log_memory=false prior to making our call to google::InitGoogleLogging. > Set GLOG_drop_log_memory=false in environment prior to logging initialization. > -- > > Key: MESOS-920 > URL: https://issues.apache.org/jira/browse/MESOS-920 > Project: Mesos > Issue Type: Improvement > Components: technical debt >Affects Versions: 0.15.0, 0.16.0 >Reporter: Benjamin Mahler > > We've observed issues where the masters are slow to respond. Two perf traces > collected while the masters were slow to respond: > {noformat} > 25.84% [kernel][k] default_send_IPI_mask_sequence_phys > 20.44% [kernel][k] native_write_msr_safe > 4.54% [kernel][k] _raw_spin_lock > 2.95% libc-2.5.so [.] _int_malloc > 1.82% libc-2.5.so [.] malloc > 1.55% [kernel][k] apic_timer_interrupt > 1.36% libc-2.5.so [.] _int_free > {noformat} > {noformat} > 29.03% [kernel][k] default_send_IPI_mask_sequence_phys > 9.64% [kernel][k] _raw_spin_lock > 7.38% [kernel][k] native_write_msr_safe > 2.43% libc-2.5.so [.] _int_malloc > 2.05% libc-2.5.so [.] _int_free > 1.67% [kernel][k] apic_timer_interrupt > 1.58% libc-2.5.so [.] malloc > {noformat} > These have been found to be attributed to the posix_fadvise calls made by > glog. We can disable these via the environment: > {noformat} > GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log > contents. " > "Logs can grow very quickly and they are rarely read before > they " > "need to be evicted from memory. 
Instead, drop them from > memory " > "as soon as they are flushed to disk."); > {noformat} > {code} > if (FLAGS_drop_log_memory) { > if (file_length_ >= logging::kPageSize) { > // don't evict the most recent page > uint32 len = file_length_ & ~(logging::kPageSize - 1); > posix_
[jira] [Created] (MESOS-4258) Generate xml test reports in the jenkins build.
Benjamin Mahler created MESOS-4258: -- Summary: Generate xml test reports in the jenkins build. Key: MESOS-4258 URL: https://issues.apache.org/jira/browse/MESOS-4258 Project: Mesos Issue Type: Task Components: test Reporter: Benjamin Mahler Google test has a flag for generating reports: {{--gtest_output=xml:report.xml}} Jenkins can display these reports via the xUnit plugin, which has support for google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin This lets us quickly see which test failed, as well as the time that each test took to run. We should wire this up. One difficulty is that 'make distclean' complains because the .xml files are left over (we could update distclean to wipe any .xml files within the test locations): {noformat} ERROR: files left in build directory after distclean: ./3rdparty/libprocess/3rdparty/report.xml ./3rdparty/libprocess/report.xml ./src/report.xml make[1]: *** [distcleancheck] Error 1 {noformat}
[jira] [Updated] (MESOS-3831) Document operator HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-3831: --- Shepherd: Neil Conway Assignee: Benjamin Mahler Sprint: Mesosphere Sprint 26 > Document operator HTTP endpoints > > > Key: MESOS-3831 > URL: https://issues.apache.org/jira/browse/MESOS-3831 > Project: Mesos > Issue Type: Documentation > Components: documentation >Reporter: Neil Conway >Assignee: Benjamin Mahler >Priority: Minor > Labels: documentation, mesosphere, newbie > > These are not exhaustively documented; they probably should be. > Some endpoints have docs: e.g., {{/reserve}} and {{/unreserve}} are described > in the reservation doc page. But it would be good to have a single page that > lists all the endpoints and their semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4274) libprocess build fail with libhttp-parser >= 2.0
[ https://issues.apache.org/jira/browse/MESOS-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4274: --- Shepherd: Benjamin Mahler Assignee: Jocelyn De La Rosa > libprocess build fail with libhttp-parser >= 2.0 > > > Key: MESOS-4274 > URL: https://issues.apache.org/jira/browse/MESOS-4274 > Project: Mesos > Issue Type: Bug > Components: build, libprocess >Affects Versions: 0.26.0 > Environment: debian 8 with package {{libhttp-parser-dev}} installed > and libprocess configured with {{--disable-bundled}} >Reporter: Jocelyn De La Rosa >Assignee: Jocelyn De La Rosa >Priority: Minor > Labels: build-failure, compile-error, easyfix > Fix For: 0.27.0 > > > Since mesos 0.26 libprocess does not compile if the libhttp-parser version is > >= 2.0. > I traced back the issue to the commit {{d347bf56c807d}} that added URL to the > {{http::Request}} but forgot to modify the {{#if HTTP_PARSER_VERSION MAJORS > >=2}} parts in {{3rdparty/libprocess/src/decoder.hpp}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4006) add a resource offers metric
[ https://issues.apache.org/jira/browse/MESOS-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081959#comment-15081959 ] Benjamin Mahler commented on MESOS-4006: [~devroot] the master's metrics include {{"outstanding_offers"}} which is a gauge of how many offers are currently made. That is, they have been sent to the framework but the framework has not yet replied. Is this not what you're looking for? > add a resource offers metric > > > Key: MESOS-4006 > URL: https://issues.apache.org/jira/browse/MESOS-4006 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: David Robinson > > Mesos doesn't provide a metric for monitoring offers being made, therefore > it's difficult to determine whether Mesos isn't making offers or if a > framework isn't receiving them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4042) LevelDBStateTest suite fails in virtual box shared folder.
[ https://issues.apache.org/jira/browse/MESOS-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4042: --- Summary: LevelDBStateTest suite fails in virtual box shared folder. (was: Complete LevelDBStateTest suite fails in optimized build) [~bbannier] I've updated the summary to reflect that this is a virtualbox shared folder issue. Should we close this now that it's tracking the virtualbox issue? > LevelDBStateTest suite fails in virtual box shared folder. > -- > > Key: MESOS-4042 > URL: https://issues.apache.org/jira/browse/MESOS-4042 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Bannier > > Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a > ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared > folder fails with > {code} > [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch > ../../src/tests/state_tests.cpp:90: Failure > (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: > Invalid argument > [ FAILED ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms) > [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch > ../../src/tests/state_tests.cpp:120: Failure > (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: > Invalid argument > [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms) > [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch > ../../src/tests/state_tests.cpp:156: Failure > (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: > Invalid argument > [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms) > [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch > ../../src/tests/state_tests.cpp:198: Failure > (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: > Invalid argument > [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms) > [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge > ../../src/tests/state_tests.cpp:233: 
Failure > (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: > Invalid argument > [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms) > [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch > ../../src/tests/state_tests.cpp:264: Failure > (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: > Invalid argument > [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms) > [ RUN ] LevelDBStateTest.Names > ../../src/tests/state_tests.cpp:304: Failure > (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: > Invalid argument > [ FAILED ] LevelDBStateTest.Names (10 ms) > {code} > The identical error occurs for a non-optimized build. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4119) Add support for enabling --3way to apply-reviews.py.
[ https://issues.apache.org/jira/browse/MESOS-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4119: --- Description: Currently if {{git apply}} fails during apply-reviews, then the change must be rebased and re-uploaded to reviewboard in order for apply-reviews to succeed. However, it is often the case that {{git apply --3way}} will succeed since the blob information is included in the diff. Even if it doesn't succeed it will leave conflict markers, which allows the committer to do a manual conflict resolution if desired, or abort if conflict resolution is too difficult. +1 [~hartem]! Updated the description, since I manually edit apply-reviews to use {{--3way}} all the time. :) > Add support for enabling --3way to apply-reviews.py. > > > Key: MESOS-4119 > URL: https://issues.apache.org/jira/browse/MESOS-4119 > Project: Mesos > Issue Type: Task >Reporter: Artem Harutyunyan > Labels: beginner, mesosphere, newbie > > Currently if {{git apply}} fails during apply-reviews, then the change must > be rebased and re-uploaded to reviewboard in order for apply-reviews to > succeed. > However, it is often the case that {{git apply --3way}} will succeed since > the blob information is included in the diff. Even if it doesn't succeed it > will leave conflict markers, which allows the committer to do a manual > conflict resolution if desired, or abort if conflict resolution is too > difficult. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4181) Change port range logging to different logging level.
[ https://issues.apache.org/jira/browse/MESOS-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082067#comment-15082067 ] Benjamin Mahler commented on MESOS-4181: It's trivial to move any LOG(INFO) to VLOG(1).. doesn't necessarily mean we should do it! :) [~js84] [~bernd-mesos] did we completely lose the INFO logging about which resources are being recovered? Or was this a case of double logging? The other location you listed is not the only other place we log resources, from a quick glance: https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L3243-L3248 https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L6132-L6135 https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L6161-L6163 https://github.com/apache/mesos/blob/0.26.0/src/master/allocator/mesos/hierarchical.cpp#L400-L403 https://github.com/apache/mesos/blob/0.26.0/src/master/allocator/mesos/hierarchical.cpp#L344-L346 https://github.com/apache/mesos/blob/0.26.0/src/master/allocator/mesos/hierarchical.cpp#L513-L516 I don't see why we chose to move only this one instance to VLOG(1). Also, was there any reason that we didn't just improve the text representation for single item ranges? That is, {{\[1050-1050, 1092-1092, 1094-1094\]}} can be more succinctly represented as {{\[1050, 1092, 1094\]}}. Improving the representation will help all of the resource logging. > Change port range logging to different logging level. > - > > Key: MESOS-4181 > URL: https://issues.apache.org/jira/browse/MESOS-4181 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.25.0 >Reporter: Cody Maloney >Assignee: Joerg Schad > Labels: mesosphere, newbie > > Transforming from mesos' internal port range representation -> text is > non-linear in the number of bytest output. 
We end up with a massive amount of > log data like the following: > {noformat} > Dec 15 23:54:08 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: > I1215 23:51:58.891165 15925 hierarchical.hpp:1103] Recovered cpus(*):1e-05; > mem(*):10; ports(*):[5565-5565] (total: ports(*):[1025-2180, 2182-3887, > 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; > disk(*):32541, allocated: cpus(*):0.01815; ports(*):[1050-1050, 1092-1092, > 1094-1094, 1129-1129, 1132-1132, 1140-1140, 1177-1178, 1180-1180, 1192-1192, > 1205-1205, 1221-1221, 1308-1308, 1311-1311, 1323-1323, 1326-1326, 1335-1335, > 1365-1365, 1404-1404, 1412-1412, 1436-1436, 1455-1455, 1459-1459, 1472-1472, > 1477-1477, 1482-1482, 1491-1491, 1510-1510, 1551-1551, 1553-1553, 1559-1559, > 1573-1573, 1590-1590, 1592-1592, 1619-1619, 1635-1636, 1678-1678, 1738-1738, > 1742-1742, 1752-1752, 1770-1770, 1780-1782, 1790-1790, 1792-1792, 1799-1799, > 1804-1804, 1844-1844, 1852-1852, 1867-1867, 1899-1899, 1936-1936, 1945-1945, > 1954-1954, 2046-2046, 2055-2055, 2063-2063, 2070-2070, 2089-2089, 2104-2104, > 2117-2117, 2132-2132, 2173-2173, 2178-2178, 2188-2188, 2200-2200, 2218-2218, > 2223-2223, 2244-2244, 2248-2248, 2250-2250, 2270-2270, 2286-2286, 2302-2302, > 2332-2332, 2377-2377, 2397-2397, 2423-2423, 2435-2435, 2442-2442, 2448-2448, > 2477-2477, 2482-2482, 2522-2522, 2586-2586, 2594-2594, 2600-2600, 2602-2602, > 2643-2643, 2648-2648, 2659-2659, 2691-2691, 2716-2716, 2739-2739, 2794-2794, > 2802-2802, 2823-2823, 2831-2831, 2840-2840, 2848-2848, 2876-2876, 2894-2895, > 2900-2900, 2904-2904, 2912-2912, 2983-2983, 2991-2991, 2999-2999, 3011-3011, > 3025-3025, 3036-3036, 3041-3041, 3051-3051, 3074-3074, 3097-3097, 3107-3107, > 3121-3121, 3171-3171, 3176-3176, 3195-3195, 3197-3197, 3210-3210, 3221-3221, > 3234-3234, 3245-3245, 3250-3251, 3255-3255, 3270-3270, 3293-3293, 3298-3298, > 3312-3312, 3318-3318, 3325-3325, 3368-3368, 3379-3379, 3391-3391, 3412-3412, > 3414-3414, 3420-3420, 3492-3492, 
3501-3501, 3538-3538, 3579-3579, 3631-3631, > 3680-3680, 3684-3684, 3695-3695, 3699-3699, 3738-3738, 3758-3758, 3793-3793, > 3808-3808, 3817-3817, 3854-3854, 3856-3856, 3900-3900, 3906-3906, 3909-3909, > 3912-3912, 3946-3946, 3956-3956, 3959-3959, 3963-3963, 3974- > Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: > 3974, 3981-3981, 3985-3985, 4134-4134, 4178-4178, 4206-4206, 4223-4223, > 4239-4239, 4245-4245, 4251-4251, 4262-4263, 4271-4271, 4308-4308, 4323-4323, > 4329-4329, 4368-4368, 4385-4385, 4404-4404, 4419-4419, 4430-4430, 4448-4448, > 4464-4464, 4481-4481, 4494-4494, 4499-4499, 4510-4510, 4534-4534, 4543-4543, > 4555-4555, 4561-4562, 4577-4577, 4601-4601, 4675-4675, 4722-4722, 4739-4739, > 4748-4748, 4752-4752, 4764-4764, 4771-4771, 4787-4787, 4827-4827, 4830-4830, > 4837-4837, 4848-4848, 4853-4853, 4879-4879, 4883-4883, 4897-4897, 4902-4902, > 4911-4911, 4940-4940,
[jira] [Commented] (MESOS-4075) Continue test suite execution across crashing tests.
[ https://issues.apache.org/jira/browse/MESOS-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082126#comment-15082126 ] Benjamin Mahler commented on MESOS-4075: The tests shouldn't be crashing, though; why don't we focus on fixing crashes instead? For example, we currently have memory management issues that cause test failures to unnecessarily crash the test binary. Consider [this snippet|https://github.com/apache/mesos/blob/0.26.0/src/tests/slave_tests.cpp#L256-L259], where we pass a pointer to a stack-allocated object into the slave / test abstractions; this means that if an assertion fails in the test, the code may crash! > Continue test suite execution across crashing tests. > > > Key: MESOS-4075 > URL: https://issues.apache.org/jira/browse/MESOS-4075 > Project: Mesos > Issue Type: Improvement > Components: test >Affects Versions: 0.26.0 >Reporter: Bernd Mathiske > Labels: mesosphere > > Currently, mesos-tests.sh exits when a test crashes. This is inconvenient > when trying to find out all tests that fail. > mesos-tests.sh should rate a test that crashes as failed and continue the > same way as if the test merely returned with a failure result and exited > properly.
[jira] [Commented] (MESOS-4102) Quota doesn't allocate resources on slave joining
[ https://issues.apache.org/jira/browse/MESOS-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082141#comment-15082141 ] Benjamin Mahler commented on MESOS-4102: Linked in tickets with discussions about batch vs. event driven allocation. > Quota doesn't allocate resources on slave joining > - > > Key: MESOS-4102 > URL: https://issues.apache.org/jira/browse/MESOS-4102 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway >Assignee: Alexander Rukletsov > Labels: mesosphere, quota > Attachments: quota_absent_framework_test-1.patch > > > See attached patch. {{framework1}} is not allocated any resources, despite > the fact that the resources on {{agent2}} can safely be allocated to it > without risk of violating {{quota1}}. If I understand the intended quota > behavior correctly, this doesn't seem intended. > Note that if the framework is added _after_ the slaves are added, the > resources on {{agent2}} are allocated to {{framework1}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4152) discarding a Future from process::Queue loses elements
[ https://issues.apache.org/jira/browse/MESOS-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082185#comment-15082185 ] Benjamin Mahler commented on MESOS-4152: To be accurate, it doesn't prevent the caller from *requesting a discard*. Preventing callers from requesting a discard (e.g. at compile time by introducing a Future w/o {{discard()}}, or at run-time by providing a means to check for discard support) is orthogonal to this issue. I say that because, even if discard was a valid way to use process::Queue, when the caller requests a discard it must not assume the discard takes place. The caller must always use the transition state of the future to determine whether the future was discarded. > discarding a Future from process::Queue loses elements > -- > > Key: MESOS-4152 > URL: https://issues.apache.org/jira/browse/MESOS-4152 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: James Peach > > If you discard the Future you get from {{process::Queue}} the next element > inserted into the queue will be lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4302) Offer filter timeouts are ignored if the allocator is slow or backlogged.
Benjamin Mahler created MESOS-4302: -- Summary: Offer filter timeouts are ignored if the allocator is slow or backlogged. Key: MESOS-4302 URL: https://issues.apache.org/jira/browse/MESOS-4302 Project: Mesos Issue Type: Bug Components: allocation Reporter: Benjamin Mahler Priority: Critical Currently, when the allocator recovers resources from an offer, it creates a filter timeout based on the time at which the call is processed. This means that if it takes longer than the filter duration for the allocator to perform an allocation for the relevant agent, then the filter is never applied. This leads to pathological behavior: if the framework sets a filter duration that is smaller than the wall clock time it takes for us to perform the next allocation, then the filters will have no effect. This can mean that low-share frameworks may continue receiving offers that they have no intent to use, without other frameworks ever receiving these offers. The workaround for this is for frameworks to set high filter durations and possibly revive offers when they need more resources; however, we should fix this issue in the allocator (i.e., derive the timeout deadlines and expiry from allocation times). This seems to warrant cherry-picking into bug fix releases for future versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4302) Offer filter timeouts are ignored if the allocator is slow or backlogged.
[ https://issues.apache.org/jira/browse/MESOS-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4302: --- Description: Currently, when the allocator recovers resources from an offer, it creates a filter timeout based on time at which the call is processed. This means that if it takes longer than the filter duration for the allocator to perform an allocation for the relevant agent, then the filter is never applied. This leads to pathological behavior: if the framework sets a filter duration that is smaller than the wall clock time it takes for us to perform the next allocation, then the filters will have no effect. This can mean that low share frameworks may continue receiving offers that they have no intent to use, without other frameworks ever receiving these offers. The workaround for this is for frameworks to set high filter durations, and possibly reviving offers when they need more resources, however, we should fix this issue in the allocator. (i.e. derive the timeout deadlines and expiry based on allocation times). This seems to warrant cherry-picking into bug fix releases. was: Currently, when the allocator recovers resources from an offer, it creates a filter timeout based on time at which the call is processed. This means that if it takes longer than the filter duration for the allocator to perform an allocation for the relevant agent, then the filter is never applied. This leads to pathological behavior: if the framework sets a filter duration that is smaller than the wall clock time it takes for us to perform the next allocation, then the filters will have no effect. This can mean that low share frameworks may continue receiving offers that they have no intent to use, without other frameworks ever receiving these offers. The workaround for this is for frameworks to set high filter durations, and possibly reviving offers when they need more resources, however, we should fix this issue in the allocator. 
(i.e. derive the timeout deadlines and expiry based on allocation times). This seems to warrant cherry-picking into bug fix releases for future versions. > Offer filter timeouts are ignored if the allocator is slow or backlogged. > - > > Key: MESOS-4302 > URL: https://issues.apache.org/jira/browse/MESOS-4302 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Mahler >Priority: Critical > Labels: mesosphere > > Currently, when the allocator recovers resources from an offer, it creates a > filter timeout based on time at which the call is processed. > This means that if it takes longer than the filter duration for the allocator > to perform an allocation for the relevant agent, then the filter is never > applied. > This leads to pathological behavior: if the framework sets a filter duration > that is smaller than the wall clock time it takes for us to perform the next > allocation, then the filters will have no effect. This can mean that low > share frameworks may continue receiving offers that they have no intent to > use, without other frameworks ever receiving these offers. > The workaround for this is for frameworks to set high filter durations, and > possibly reviving offers when they need more resources, however, we should > fix this issue in the allocator. (i.e. derive the timeout deadlines and > expiry based on allocation times). > This seems to warrant cherry-picking into bug fix releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4302) Offer filter timeouts are ignored if the allocator is slow or backlogged.
[ https://issues.apache.org/jira/browse/MESOS-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088592#comment-15088592 ] Benjamin Mahler commented on MESOS-4302: Hm.. I didn't understand the use case or what setting it specifically to 100 milliseconds will accomplish. Is it that you don't want filtering at all? (then just set it to 0 seconds rather than 100 milliseconds) > Offer filter timeouts are ignored if the allocator is slow or backlogged. > - > > Key: MESOS-4302 > URL: https://issues.apache.org/jira/browse/MESOS-4302 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Mahler >Assignee: Alexander Rukletsov >Priority: Critical > Labels: mesosphere > > Currently, when the allocator recovers resources from an offer, it creates a > filter timeout based on time at which the call is processed. > This means that if it takes longer than the filter duration for the allocator > to perform an allocation for the relevant agent, then the filter is never > applied. > This leads to pathological behavior: if the framework sets a filter duration > that is smaller than the wall clock time it takes for us to perform the next > allocation, then the filters will have no effect. This can mean that low > share frameworks may continue receiving offers that they have no intent to > use, without other frameworks ever receiving these offers. > The workaround for this is for frameworks to set high filter durations, and > possibly reviving offers when they need more resources, however, we should > fix this issue in the allocator. (i.e. derive the timeout deadlines and > expiry based on allocation times). > This seems to warrant cherry-picking into bug fix releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4258) Generate xml test reports in the jenkins build.
[ https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4258: --- Shepherd: Benjamin Mahler > Generate xml test reports in the jenkins build. > --- > > Key: MESOS-4258 > URL: https://issues.apache.org/jira/browse/MESOS-4258 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Benjamin Mahler >Assignee: Shuai Lin > Labels: newbie > > Google test has a flag for generating reports: > {{--gtest_output=xml:report.xml}} > Jenkins can display these reports via the xUnit plugin, which has support for > google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin > This lets us quickly see which test failed, as well as the time that each > test took to run. > We should wire this up. One difficulty is that 'make distclean' complains > because the .xml files are left over (we could update distclean to wipe any > .xml files within the test locations): > {noformat} > ERROR: files left in build directory after distclean: > ./3rdparty/libprocess/3rdparty/report.xml > ./3rdparty/libprocess/report.xml > ./src/report.xml > make[1]: *** [distcleancheck] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.
[ https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090303#comment-15090303 ] Benjamin Mahler commented on MESOS-4258: Your patch is committed, so now the report files are generated. The next part is to process the reports in jenkins. I think we'll want to use '[docker cp|https://docs.docker.com/engine/reference/commandline/cp/]' to copy out the report files from the container to the jenkins workspace. This likely means removing {{--rm}} from our {{docker run}} invocation and placing the rm command within the EXIT trap. [~lins05] can you do this next part as well? > Generate xml test reports in the jenkins build. > --- > > Key: MESOS-4258 > URL: https://issues.apache.org/jira/browse/MESOS-4258 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Benjamin Mahler >Assignee: Shuai Lin > Labels: newbie > > Google test has a flag for generating reports: > {{--gtest_output=xml:report.xml}} > Jenkins can display these reports via the xUnit plugin, which has support for > google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin > This lets us quickly see which test failed, as well as the time that each > test took to run. > We should wire this up. One difficulty is that 'make distclean' complains > because the .xml files are left over (we could update distclean to wipe any > .xml files within the test locations): > {noformat} > ERROR: files left in build directory after distclean: > ./3rdparty/libprocess/3rdparty/report.xml > ./3rdparty/libprocess/report.xml > ./src/report.xml > make[1]: *** [distcleancheck] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-486) TaskInfo should include a 'source' in order to enable getting resource monitoring statistics.
[ https://issues.apache.org/jira/browse/MESOS-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092945#comment-15092945 ] Benjamin Mahler commented on MESOS-486: --- {{ExecutorInfo.source}} was added before we started to use labels to provide customization. In retrospect, I imagine that we would have provided {{ExecutorInfo.labels}} instead of {{ExecutorInfo.source}}, which would have satisfied a wider set of monitoring use cases, as well as other customization needs outside of monitoring. My thought is that we should deprecate {{ExecutorInfo.source}} and add {{ExecutorInfo.labels}}, rather than introduce {{TaskInfo.source}}. However, it's still not clear to me how to achieve standardized labels. For example, if an operator requires that all frameworks use a {{"source"}} label for monitoring purposes, it's more difficult to get the users/frameworks setting this label consistently. > TaskInfo should include a 'source' in order to enable getting resource > monitoring statistics. > - > > Key: MESOS-486 > URL: https://issues.apache.org/jira/browse/MESOS-486 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.12.0 >Reporter: Benjamin Hindman > Labels: twitter > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4424) Initial support for GPU resources.
Benjamin Mahler created MESOS-4424: -- Summary: Initial support for GPU resources. Key: MESOS-4424 URL: https://issues.apache.org/jira/browse/MESOS-4424 Project: Mesos Issue Type: Epic Components: isolation Reporter: Benjamin Mahler Mesos already has generic mechanisms for expressing / isolating resources, and we'd like to expose GPUs as resources that can be consumed and isolated. However, GPUs present unique challenges: * Users may rely on vendor-specific libraries to interact with the device (e.g. CUDA, HSA, etc.), while others may rely on portable libraries like OpenCL or OpenGL. These libraries need to be available from within the container. * GPU hardware has many attributes that may impose scheduling constraints (e.g. core count, total memory, topology (via PCI-E, NVLINK, etc), driver versions, etc). * Obtaining utilization information requires vendor-specific approaches. * Isolated sharing of a GPU device requires vendor-specific approaches. As such, the focus is on supporting a narrow initial use case: homogeneous device-level GPU support: * Fractional sharing of GPU devices across containers will not be supported initially, unlike CPU cores. * Heterogeneity will be supported via other means for now (e.g. using agent attributes to differentiate hardware profiles, using portable libraries like OpenCL, etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2262) Adding GPGPU resource into Mesos, so we can know if any GPU/Heterogeneous resource are available from slave
[ https://issues.apache.org/jira/browse/MESOS-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106106#comment-15106106 ] Benjamin Mahler commented on MESOS-2262: I've created an epic (MESOS-4424) to track initial support of GPU resources and added the watchers from this ticket. A design doc will be circulated for community feedback soon, looking forward to seeing feedback from folks interested! > Adding GPGPU resource into Mesos, so we can know if any GPU/Heterogeneous > resource are available from slave > --- > > Key: MESOS-2262 > URL: https://issues.apache.org/jira/browse/MESOS-2262 > Project: Mesos > Issue Type: Task > Components: slave > Environment: OpenCL support env, such as OS X, Linux, Windows.. >Reporter: chester kuo >Assignee: chester kuo >Priority: Minor > > Extending Mesos to support Heterogeneous resource such as GPGPU/FPGA..etc as > computing resources in the data-center, OpenCL will be first target to add > into Mesos (support by all major GPU vendor) , I will reserve to support > others such as CUDA in the future. > In this feature, slave will be supported to do resources discover including > but not limited to, > (1) Heterogeneous Computing programming model : "OpenCL". "CUDA", "HSA" > (2) Computing global memory (MB) > (3) Computing run time version , such as "1.2" , "2.0" > (4) Computing compute unit (double) > (5) Computing device type : GPGPU, CPU, Accelerator device. > (6) Computing (number of devices): (double) > The Heterogeneous resource isolation will be supported in the framework > instead of in the slave devices side, the major reason here is , the > ecosystem , such as OpenCL operate on top of private device driver own by > vendors, only runtime library (OpenCL) is user-space application, so its hard > for us to do like Linux cgroup to have CPU/memory resource isolation. As a > result we may use run time library to do device isolation and memory > allocation. 
> (PS, if anyone knows how to do it for a GPGPU driver, please drop me a note) > Meanwhile, some run-time libraries (such as OpenCL) support running on top of the > CPU, so we need to use the isolator API to report this once it is allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-920) Set GLOG_drop_log_memory=false in environment prior to logging initialization.
[ https://issues.apache.org/jira/browse/MESOS-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-920: -- Shepherd: Benjamin Mahler > Set GLOG_drop_log_memory=false in environment prior to logging initialization. > -- > > Key: MESOS-920 > URL: https://issues.apache.org/jira/browse/MESOS-920 > Project: Mesos > Issue Type: Improvement > Components: technical debt >Affects Versions: 0.15.0, 0.16.0 >Reporter: Benjamin Mahler >Assignee: Kapil Arya > > We've observed issues where the masters are slow to respond. Two perf traces > collected while the masters were slow to respond: > {noformat} > 25.84% [kernel][k] default_send_IPI_mask_sequence_phys > 20.44% [kernel][k] native_write_msr_safe > 4.54% [kernel][k] _raw_spin_lock > 2.95% libc-2.5.so [.] _int_malloc > 1.82% libc-2.5.so [.] malloc > 1.55% [kernel][k] apic_timer_interrupt > 1.36% libc-2.5.so [.] _int_free > {noformat} > {noformat} > 29.03% [kernel][k] default_send_IPI_mask_sequence_phys > 9.64% [kernel][k] _raw_spin_lock > 7.38% [kernel][k] native_write_msr_safe > 2.43% libc-2.5.so [.] _int_malloc > 2.05% libc-2.5.so [.] _int_free > 1.67% [kernel][k] apic_timer_interrupt > 1.58% libc-2.5.so [.] malloc > {noformat} > These have been found to be attributed to the posix_fadvise calls made by > glog. We can disable these via the environment: > {noformat} > GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log > contents. " > "Logs can grow very quickly and they are rarely read before > they " > "need to be evicted from memory. 
Instead, drop them from > memory " > "as soon as they are flushed to disk."); > {noformat} > {code} > if (FLAGS_drop_log_memory) { > if (file_length_ >= logging::kPageSize) { > // don't evict the most recent page > uint32 len = file_length_ & ~(logging::kPageSize - 1); > posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED); > } > } > {code} > We should set GLOG_drop_log_memory=false prior to making our call to > google::InitGoogleLogging, to avoid others running into this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4455) Error on scrolling the page horizontally on website
[ https://issues.apache.org/jira/browse/MESOS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-4455: -- Assignee: Benjamin Mahler > Error on scrolling the page horizontally on website > --- > > Key: MESOS-4455 > URL: https://issues.apache.org/jira/browse/MESOS-4455 > Project: Mesos > Issue Type: Bug > Components: project website >Reporter: Disha Singh >Assignee: Benjamin Mahler >Priority: Minor > Labels: newbie > > When the page http://mesos.apache.org/documentation/latest/architecture/ > is scrolled horizontally, the nav bar at the top breaks, creating a > bad look. > Also, this is occurring only because of the unadjusted size of the picture > architecture3.jpg. > This makes two-finger scrolling unpleasant. > One of these things can be done: > 1. Adjust the image's size. > 2. Fix the nav bar in its position by adding ":fixed" in the CSS class > block itself to prevent any issues even in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4455) Error on scrolling the page horizontally on website
[ https://issues.apache.org/jira/browse/MESOS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-4455: -- Assignee: (was: Benjamin Mahler) > Error on scrolling the page horizontally on website > --- > > Key: MESOS-4455 > URL: https://issues.apache.org/jira/browse/MESOS-4455 > Project: Mesos > Issue Type: Bug > Components: project website >Reporter: Disha Singh >Priority: Minor > Labels: newbie > > When the page http://mesos.apache.org/documentation/latest/architecture/ > is scrolled horizontally, the nav bar at the top breaks, creating a > bad look. > Also, this is occurring only because of the unadjusted size of the picture > architecture3.jpg. > This makes two-finger scrolling unpleasant. > One of these things can be done: > 1. Adjust the image's size. > 2. Fix the nav bar in its position by adding ":fixed" in the CSS class > block itself to prevent any issues even in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4031) slave crashed in cgroupsStatistics()
[ https://issues.apache.org/jira/browse/MESOS-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4031: --- Component/s: (was: libprocess) docker > slave crashed in cgroupstatistics() > --- > > Key: MESOS-4031 > URL: https://issues.apache.org/jira/browse/MESOS-4031 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.24.0 > Environment: Debian jessie >Reporter: Steven >Assignee: Timothy Chen > Labels: mesosphere > Fix For: 0.27.0 > > > Hi all, > I have built a mesos cluster with three slaves. Any slave may sporadically > crash when I get the summary through mesos master ui. Here is the stack > trace. > {code} > slave.sh[13336]: I1201 11:54:12.827975 13338 slave.cpp:3926] Current disk > usage 79.71%. Max allowed age: 17.279577136390834hrs > slave.sh[13336]: I1201 11:55:12.829792 13342 slave.cpp:3926] Current disk > usage 79.71%. Max allowed age: 17.279577136390834hrs > slave.sh[13336]: I1201 11:55:38.389614 13342 http.cpp:189] HTTP GET for > /slave(1)/state from 192.168.100.1:64870 with User-Agent='Mozilla/5.0 (X11; > Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0' > docker[8409]: time="2015-12-01T11:55:38.934148017+08:00" level=info msg="GET > /v1.20/containers/mesos-b25be32d-41e1-4e14-9b84-d33d733cef51-S3.79c206a6-d6b5-487b-9390-e09292c5b53a/json" > docker[8409]: time="2015-12-01T11:55:38.941489332+08:00" level=info msg="GET > /v1.20/containers/mesos-b25be32d-41e1-4e14-9b84-d33d733cef51-S3.1e01a4b3-a76e-4bf6-8ce0-a4a937faf236/json" > slave.sh[13336]: ABORT: > (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:110): > Result::get() but state == NONE*** Aborted at 1448942139 (unix time) try > "date -d @1448942139" if you are using GNU date *** > slave.sh[13336]: PC: @ 0x7f295218a107 (unknown) > slave.sh[13336]: *** SIGABRT (@0x3419) received by PID 13337 (TID > 0x7f2948992700) from PID 13337; stack trace: *** > slave.sh[13336]: @ 0x7f2952a2e8d0 
(unknown) > slave.sh[13336]: @ 0x7f295218a107 (unknown) > slave.sh[13336]: @ 0x7f295218b4e8 (unknown) > slave.sh[13336]: @ 0x43dc59 _Abort() > slave.sh[13336]: @ 0x43dc87 _Abort() > slave.sh[13336]: @ 0x7f2955e31c86 Result<>::get() > slave.sh[13336]: @ 0x7f295637f017 > mesos::internal::slave::DockerContainerizerProcess::cgroupsStatistics() > slave.sh[13336]: @ 0x7f295637dfea > _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUliE_clEi > slave.sh[13336]: @ 0x7f295637e549 > _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUlRKN6Docker9ContainerEE0_clES9_ > slave.sh[13336]: @ 0x7f295638453b > ZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS1_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEINS_6FutureINS1_18ResourceStatisticsEEESB_EEvENKUlSB_E_clESB_ENKUlvE_clEv > slave.sh[13336]: @ 0x7f295638751d > FN7process6FutureIN5mesos18ResourceStatisticsEEEvEZZNKS0_9_DeferredIZNS2_8internal5slave26DockerContainerizerProcess5usageERKNS2_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEIS4_SG_EEvENKUlSG_E_clESG_EUlvE_E9_M_invoke > slave.sh[13336]: @ 0x7f29563b53e7 std::function<>::operator()() > slave.sh[13336]: @ 0x7f29563aa5dc > _ZZN7process8dispatchIN5mesos18ResourceStatisticsEEENS_6FutureIT_EERKNS_4UPIDERKSt8functionIFS5_vEEENKUlPNS_11ProcessBaseEE_clESF_ > slave.sh[13336]: @ 0x7f29563bd667 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos18ResourceStatisticsEEENS0_6FutureIT_EERKNS0_4UPIDERKSt8functionIFS9_vEEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > slave.sh[13336]: @ 0x7f2956b893c3 std::function<>::operator()() > slave.sh[13336]: @ 0x7f2956b72ab0 process::ProcessBase::visit() > slave.sh[13336]: @ 0x7f2956b7588e process::DispatchEvent::visit() > slave.sh[13336]: @ 0x7f2955d7f972 process::ProcessBase::serve() > slave.sh[13336]: @ 0x7f2956b6ef8e process::ProcessManager::resume() > slave.sh[13336]: @ 0x7f2956b63555 process::internal::schedule() 
> slave.sh[13336]: @ 0x7f2956bc0839 > _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE > slave.sh[13336]: @ 0x7f2956bc0781 std::_Bind_simple<>::operator()() > slave.sh[13336]: @ 0x7f2956bc06fe std::thread::_Impl<>::_M_run() > slave.sh[13336]: @ 0x7f29527ca970 (unknown) > slave.sh[13336]: @ 0x7f2952a270a4 start_thread > slave.sh[13336]: @ 0x7f295223b04d (unknown) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.
[ https://issues.apache.org/jira/browse/MESOS-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1802: -- Assignee: (was: Timothy Chen) > HealthCheckTest.HealthStatusChange is flaky on jenkins. > --- > > Key: MESOS-1802 > URL: https://issues.apache.org/jira/browse/MESOS-1802 > Project: Mesos > Issue Type: Bug > Components: test, tests >Affects Versions: 0.26.0 >Reporter: Benjamin Mahler > Labels: flaky, mesosphere > > https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull > {noformat} > [ RUN ] HealthCheckTest.HealthStatusChange > Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2' > I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms > I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns > I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns > I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in > 642ns > I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the > db in 343ns > I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery > I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status > I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to > STARTING > I0916 22:56:14.036603 21046 master.cpp:286] Master > 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on > 67.195.81.186:47865 > I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing > authenticated frameworks to register > I0916 22:56:14.036648 21046 
master.cpp:337] Master only allowing > authenticated slaves to register > I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials' > I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 480322ns > I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to > STARTING > I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled > I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status > I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising > offers for all slaves > I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] > Initializing hierarchical allocator process with master : > master@67.195.81.186:47865 > I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is > master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026 > I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master! 
> I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar > I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar > I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING > I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 330251ns > I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to > VOTING > I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos > group > I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated > I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer > I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 92623ns > I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1 > I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0916 22:56:14.039676 21047 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 144215ns > I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0 > I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request > for position 0 > I0916 22:56:14.040267 21047 leveldb.cpp:438] Reading position from leveldb >
[jira] [Created] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.
Benjamin Mahler created MESOS-1668: -- Summary: Handle a temporary one-way master --> slave socket closure. Key: MESOS-1668 URL: https://issues.apache.org/jira/browse/MESOS-1668 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler Priority: Minor In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs: → Master and Slave connected operating normally. → Temporary one-way network failure, master→slave link breaks. → Master marks slave as disconnected. → Network restored and health checking continues normally, slave is not removed as a result. Slave does not attempt to re-register since it is receiving pings once again. → Slave remains disconnected according to the master, and the slave does not try to re-register. Bad! We were originally thinking of using a failover timeout in the master to remove these slaves that don't re-register. However, it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation. Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1470) Add cluster maintenance documentation.
[ https://issues.apache.org/jira/browse/MESOS-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1470:
---
Target Version/s: (was: 0.20.0)

> Add cluster maintenance documentation.
> --
>
> Key: MESOS-1470
> URL: https://issues.apache.org/jira/browse/MESOS-1470
> Project: Mesos
> Issue Type: Documentation
> Components: documentation
> Affects Versions: 0.19.0
> Reporter: Benjamin Mahler
>
> Now that the master has replicated state on disk, we should add
> documentation that guides operators through common maintenance work:
> * Swapping a master in the ensemble.
> * Growing the master ensemble.
> * Shrinking the master ensemble.
> This would help craft similar documentation for users of the replicated log.
> We should also add documentation covering existing slave maintenance:
> * Best practices for rolling upgrades.
> * How to shut down a slave.
> This latter category will be incorporated with [~alexandra.sava]'s
> maintenance work!
--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1461) Add task reconciliation to the Python API.
[ https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1461:
---
Target Version/s: 0.21.0 (was: 0.20.0)

> Add task reconciliation to the Python API.
> --
>
> Key: MESOS-1461
> URL: https://issues.apache.org/jira/browse/MESOS-1461
> Project: Mesos
> Issue Type: Task
> Components: python api
> Affects Versions: 0.19.0
> Reporter: Benjamin Mahler
>
> Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but
> was never added to the Python API.
> This may be obviated by the lower-level API.
--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.
[ https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1517:
---
Target Version/s: (was: 0.20.0)

> Maintain a queue of messages that arrive before the master recovers.
>
> Key: MESOS-1517
> URL: https://issues.apache.org/jira/browse/MESOS-1517
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Reporter: Benjamin Mahler
> Labels: reliability
>
> Currently when the master is recovering, we drop all incoming messages. If
> slaves and frameworks knew about the leading master only once it has
> recovered, then we would only expect to see messages after we've recovered.
> We previously considered enqueuing all messages through the recovery future,
> but this has the downside of forcing all messages to go through the master's
> queue twice:
> {code}
> // TODO(bmahler): Consider instead re-enqueing *all* messages
> // through recover(). What are the performance implications of
> // the additional queueing delay and the accumulated backlog
> // of messages post-recovery?
> if (!recovered.get().isReady()) {
>   VLOG(1) << "Dropping '" << event.message->name << "' message since "
>           << "not recovered yet";
>   ++metrics.dropped_messages;
>   return;
> }
> {code}
> However, an easy solution to this problem is to maintain an explicit queue of
> incoming messages that gets flushed once we finish recovery. This ensures
> that all messages post-recovery are processed normally.
--
This message was sent by Atlassian JIRA (v6.2#6252)
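The "explicit queue" alternative in MESOS-1517 can be illustrated with a small self-contained model (the class and method names here are hypothetical, not the actual master implementation): buffer messages while recovering, then flush the backlog exactly once when recovery completes.

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical model of the MESOS-1517 proposal: instead of dropping
// messages that arrive before recovery finishes, keep them in an explicit
// queue and flush it once recovery completes, preserving arrival order.
class RecoveringMaster {
public:
  void receive(const std::string& message) {
    if (!recovered_) {
      pending_.push(message);  // Buffer instead of dropping.
      return;
    }
    handle(message);  // Post-recovery: process immediately.
  }

  void onRecovered() {
    recovered_ = true;
    while (!pending_.empty()) {  // Flush the backlog in arrival order.
      handle(pending_.front());
      pending_.pop();
    }
  }

  const std::vector<std::string>& handled() const { return handled_; }

private:
  void handle(const std::string& message) { handled_.push_back(message); }

  bool recovered_ = false;
  std::queue<std::string> pending_;
  std::vector<std::string> handled_;
};
```

Unlike re-enqueuing through the recovery future, each buffered message passes through the processing path only once, avoiding the doubled queueing delay the TODO comment worries about.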
[jira] [Resolved] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1653. Resolution: Fixed Fix Version/s: 0.20.0 Assignee: Timothy Chen Fix was included here: {noformat} commit 656b0e075c79e03cf6937bbe7302424768729aa2 Author: Timothy Chen Date: Wed Aug 6 11:34:03 2014 -0700 Re-enabled HealthCheckTest.ConsecutiveFailures test. The test originally was flaky because the time to process the number of consecutive checks configured exceeds the task itself, so the task finished but the number of expected task health check didn't match. Review: https://reviews.apache.org/r/23772 {noformat} > HealthCheckTest.GracePeriod is flaky. > - > > Key: MESOS-1653 > URL: https://issues.apache.org/jira/browse/MESOS-1653 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: Timothy Chen > Fix For: 0.20.0 > > > {noformat} > [--] 3 tests from HealthCheckTest > [ RUN ] HealthCheckTest.GracePeriod > Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' > I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms > I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms > I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns > I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in > 2317ns > I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the > db in 1367ns > I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery > I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status > I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0729 17:10:10.502980 1212 
recover.cpp:542] Updating replica status to > STARTING > I0729 17:10:10.508482 1213 master.cpp:289] Master > 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 > I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing > authenticated frameworks to register > I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing > authenticated slaves to register > I0729 17:10:10.508656 1213 credentials.hpp:36] Loading credentials for > authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' > I0729 17:10:10.509407 1213 master.cpp:360] Authorization enabled > I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:54701 > I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising > offers for all slaves > I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is > master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 > I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! 
> I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar > I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar > I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 12.946461ms > I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to > STARTING > I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status > I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING > I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 11.537102ms > I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to > VOTING > I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos > group > I0729 17:10:10.528565 1212 recover.cpp:440] Recover process terminated > I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer > I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 8.830863ms > I0729 17:10:10.537869 1212 replica.cpp:342] Persisted promised to 1 > I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal
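The commit message above describes a race: the task can finish before the configured number of consecutive failed health checks has been observed. The counting logic it refers to can be sketched as a simplified stand-alone model (not the actual Mesos health checker):

```cpp
#include <vector>

// Hypothetical model of consecutive-failure health checking: the task is
// killed only after `maxConsecutiveFailures` failed checks in a row, and a
// single healthy check resets the counter. If the task exits before enough
// checks have run, the kill threshold is never reached -- the timing the
// flaky test implicitly depended on.
bool shouldKill(const std::vector<bool>& checkHealthy,
                int maxConsecutiveFailures) {
  int consecutive = 0;
  for (bool healthy : checkHealthy) {
    consecutive = healthy ? 0 : consecutive + 1;  // Reset on success.
    if (consecutive >= maxConsecutiveFailures) {
      return true;  // Threshold of back-to-back failures reached.
    }
  }
  return false;  // Task may have finished before enough checks ran.
}
```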
[jira] [Updated] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1613: --- Fix Version/s: (was: 0.20.0) > HealthCheckTest.ConsecutiveFailures is flaky > > > Key: MESOS-1613 > URL: https://issues.apache.org/jira/browse/MESOS-1613 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: Ubuntu 10.04 GCC >Reporter: Vinod Kone >Assignee: Timothy Chen > > {code} > [ RUN ] HealthCheckTest.ConsecutiveFailures > Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' > I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms > I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms > I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns > I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in > 2125ns > I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the > db in 10747ns > I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery > I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status > I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to > STARTING > I0717 04:39:59.301985 5031 master.cpp:288] Master > 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 > I0717 04:39:59.302026 5031 master.cpp:325] Master only allowing > authenticated frameworks to register > I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing > authenticated slaves to register > I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for > authentication from > 
'/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' > I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled > I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:40280 > I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising > offers for all slaves > I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.325097ms > I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to > STARTING > I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is > master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 > I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! > I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar > I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar > I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status > I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING > I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.204157ms > I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to > VOTING > I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos > group > I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated > I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer > I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0717 04:39:59.313451 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 2.271822ms > I0717 04:39:59.313627 5033 replica.cpp:342] Persisted 
promised to 1 > I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 5.004323ms > I0717 04:39:59.323493 5031 replica.cpp:676] Persisted action at 0 > I0717 04:39:59.323799 5031 replica.cpp:508] Replica received write request > for position 0 > I0717 04:39:59.323837 5031 leveldb.cpp:438] Reading position from leveldb > took 21901ns > I0717 04:39:59.329038 5031 leveldb.cpp:343] Persisting action (14 bytes) to > leveldb took 5.175998ms > I0717 04:39:59.329244 5031 repl
[jira] [Reopened] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reopened MESOS-1613: Looks like it's still flaky: {noformat} Changes Summary Made ephemeral ports a resource and killed private resources. (details) Do not send ephemeral_ports resource to frameworks. (details) Create static mesos library. (details) Re-enabled HealthCheckTest.ConsecutiveFailures test. (details) Merge resourcesRecovered and resourcesUnused. (details) Added executor metrics for slave. (details) [ RUN ] HealthCheckTest.ConsecutiveFailures Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu' I0806 15:06:59.268267 9210 leveldb.cpp:176] Opened db in 29.926087ms I0806 15:06:59.275971 9210 leveldb.cpp:183] Compacted db in 7.40006ms I0806 15:06:59.276254 9210 leveldb.cpp:198] Created db iterator in 7678ns I0806 15:06:59.276741 9210 leveldb.cpp:204] Seeked to beginning of db in 2076ns I0806 15:06:59.277034 9210 leveldb.cpp:273] Iterated through 0 keys in the db in 1908ns I0806 15:06:59.277307 9210 replica.cpp:741] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0806 15:06:59.277868 9233 recover.cpp:425] Starting replica recovery I0806 15:06:59.277946 9233 recover.cpp:451] Replica is in EMPTY status I0806 15:06:59.278240 9233 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0806 15:06:59.278296 9233 recover.cpp:188] Received a recover response from a replica in EMPTY status I0806 15:06:59.278391 9233 recover.cpp:542] Updating replica status to STARTING I0806 15:06:59.282282 9234 master.cpp:287] Master 20140806-150659-16842879-60888-9210 (lucid) started on 127.0.1.1:60888 I0806 15:06:59.282316 9234 master.cpp:324] Master only allowing authenticated frameworks to register I0806 15:06:59.282322 9234 master.cpp:329] Master only allowing authenticated slaves to register I0806 15:06:59.282330 9234 credentials.hpp:36] Loading credentials for 
authentication from '/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu/credentials' I0806 15:06:59.282508 9234 master.cpp:358] Authorization enabled I0806 15:06:59.283121 9234 hierarchical_allocator_process.hpp:296] Initializing hierarchical allocator process with master : master@127.0.1.1:60888 I0806 15:06:59.283174 9234 master.cpp:121] No whitelist given. Advertising offers for all slaves I0806 15:06:59.283413 9234 master.cpp:1127] The newly elected leader is master@127.0.1.1:60888 with id 20140806-150659-16842879-60888-9210 I0806 15:06:59.283429 9234 master.cpp:1140] Elected as the leading master! I0806 15:06:59.283435 9234 master.cpp:958] Recovering from registrar I0806 15:06:59.283491 9234 registrar.cpp:313] Recovering registrar I0806 15:06:59.284046 9233 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.600113ms I0806 15:06:59.284080 9233 replica.cpp:320] Persisted replica status to STARTING I0806 15:06:59.284226 9233 recover.cpp:451] Replica is in STARTING status I0806 15:06:59.284580 9233 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0806 15:06:59.284643 9233 recover.cpp:188] Received a recover response from a replica in STARTING status I0806 15:06:59.284747 9233 recover.cpp:542] Updating replica status to VOTING I0806 15:06:59.289934 9233 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.119357ms I0806 15:06:59.290256 9233 replica.cpp:320] Persisted replica status to VOTING I0806 15:06:59.290876 9237 recover.cpp:556] Successfully joined the Paxos group I0806 15:06:59.291131 9232 recover.cpp:440] Recover process terminated I0806 15:06:59.300732 9236 log.cpp:656] Attempting to start the writer I0806 15:06:59.301061 9236 replica.cpp:474] Replica received implicit promise request with proposal 1 I0806 15:06:59.306172 9236 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.070193ms I0806 15:06:59.306229 9236 replica.cpp:342] Persisted promised to 1 I0806 15:06:59.306747 9236 
coordinator.cpp:230] Coordinator attemping to fill missing position I0806 15:06:59.307143 9236 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0806 15:06:59.309715 9236 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 2.521311ms I0806 15:06:59.310199 9236 replica.cpp:676] Persisted action at 0 I0806 15:06:59.320276 9234 replica.cpp:508] Replica received write request for position 0 I0806 15:06:59.320335 9234 leveldb.cpp:438] Reading position from leveldb took 26656ns I0806 15:06:59.325726 9234 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 5.358479ms I0806 15:06:59.325781 9234 replica.cpp:676] Persisted action at 0 I0806 15:06:59.325999 9234 replica.cpp:655] Replica received learned notice for position 0 I0806 15:06:59.328487 9234 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 2.458504ms I0806
[jira] [Updated] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.
[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1668:
---
Placing this under reconciliation because, although extremely rare, it can lead to some inconsistent state between the master and slave for an arbitrary amount of time. For example, the launchTask message may be dropped as a result of the socket closure between master → slave in the scenario above.

> Handle a temporary one-way master --> slave socket closure.
> ---
>
> Key: MESOS-1668
> URL: https://issues.apache.org/jira/browse/MESOS-1668
> Project: Mesos
> Issue Type: Bug
> Components: master, slave
> Reporter: Benjamin Mahler
> Priority: Minor
> Labels: reliability
>
> In MESOS-1529, we realized that it's possible for a slave to remain
> disconnected in the master if the following occurs:
> → Master and slave are connected and operating normally.
> → A temporary one-way network failure breaks the master→slave link.
> → Master marks the slave as disconnected.
> → The network is restored and health checking continues normally, so the
> slave is not removed. The slave does not attempt to re-register since it is
> receiving pings once again.
> → The slave remains disconnected according to the master, and the slave
> does not try to re-register. Bad!
> We were originally thinking of using a failover timeout in the master to
> remove slaves that don't re-register. However, that can be dangerous when
> ZooKeeper issues are preventing the slave from re-registering with the
> master; we do not want to remove a ton of slaves in this situation.
> Rather, when the slave is health checking correctly but does not
> re-register within a timeout, we could send a registration request from the
> master to the slave, telling the slave that it must re-register. This
> message could also be used when receiving status updates (or other
> messages) from slaves that are disconnected in the master.
--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088425#comment-14088425 ] Benjamin Mahler commented on MESOS-1613: So far only Twitter CI is exposing this flakiness. I've pasted the full logs in the comment above, are you looking for logging from the command executor? We may want to investigate wiring up the tests to expose them in the output to make this easier for you to debug. > HealthCheckTest.ConsecutiveFailures is flaky > > > Key: MESOS-1613 > URL: https://issues.apache.org/jira/browse/MESOS-1613 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: Ubuntu 10.04 GCC >Reporter: Vinod Kone >Assignee: Timothy Chen > > {code} > [ RUN ] HealthCheckTest.ConsecutiveFailures > Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' > I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms > I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms > I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns > I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in > 2125ns > I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the > db in 10747ns > I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery > I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status > I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to > STARTING > I0717 04:39:59.301985 5031 master.cpp:288] Master > 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 > I0717 
04:39:59.302026 5031 master.cpp:325] Master only allowing > authenticated frameworks to register > I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing > authenticated slaves to register > I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' > I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled > I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:40280 > I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising > offers for all slaves > I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.325097ms > I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to > STARTING > I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is > master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 > I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! 
> I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar > I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar > I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status > I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING > I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.204157ms > I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to > VOTING > I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos > group > I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated > I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer > I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0717 04:39:59.313451 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 2.271822ms > I0717 04:39:59.313627 5033 replica.cpp:342] Persisted promised to 1 > I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 5.004323ms > I0717 04:39:59.323493 5031 replica.cpp:676] Persisted action at 0 > I0717 04:39:59.323799 5031 replica.
[jira] [Updated] (MESOS-887) Scheduler Driver should use exited() to detect disconnection with Master.
[ https://issues.apache.org/jira/browse/MESOS-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-887:
--
Labels: framework reliability (was: )

> Scheduler Driver should use exited() to detect disconnection with Master.
> -
>
> Key: MESOS-887
> URL: https://issues.apache.org/jira/browse/MESOS-887
> Project: Mesos
> Issue Type: Improvement
> Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.14.2, 0.16.0, 0.15.0
> Reporter: Benjamin Mahler
> Labels: framework, reliability
>
> The Scheduler Driver already links with the master, but it does not use the
> built-in exited() notification from libprocess to detect socket closure.
> Using it would react faster than waiting for ZooKeeper to detect a
> leadership change, and would minimize the number of dropped messages
> leaving the driver.
--
This message was sent by Atlassian JIRA (v6.2#6252)
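A minimal sketch of the idea, modeling a libprocess-style exited() hook without depending on libprocess itself (the class name, callback, and string PID here are hypothetical, not the real scheduler driver):

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of MESOS-887: treat socket closure (libprocess's
// exited() notification for a linked peer) as an immediate disconnection
// event, instead of waiting for the ZooKeeper detector to report a change.
class LinkedSchedulerDriver {
public:
  explicit LinkedSchedulerDriver(std::function<void()> onDisconnected)
    : onDisconnected_(std::move(onDisconnected)) {}

  // In libprocess this would be the exited(const UPID&) override that
  // fires when the link to the master's socket closes.
  void exited(const std::string& masterPid) {
    (void)masterPid;    // Identifies which peer went away; unused here.
    connected_ = false;
    onDisconnected_();  // Fast path: react before ZooKeeper notices.
  }

  bool connected() const { return connected_; }

private:
  bool connected_ = true;
  std::function<void()> onDisconnected_;
};
```

The benefit named in the ticket follows directly: once `connected()` is false, the driver can stop sending (and hence dropping) messages immediately rather than during the ZooKeeper detection window.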
[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090065#comment-14090065 ] Benjamin Mahler commented on MESOS-1613: [~tnachen] it's failing on ASF CI as well, can you triage or disable in the interim? > HealthCheckTest.ConsecutiveFailures is flaky > > > Key: MESOS-1613 > URL: https://issues.apache.org/jira/browse/MESOS-1613 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: Ubuntu 10.04 GCC >Reporter: Vinod Kone >Assignee: Timothy Chen > > {code} > [ RUN ] HealthCheckTest.ConsecutiveFailures > Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' > I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms > I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms > I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns > I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in > 2125ns > I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the > db in 10747ns > I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery > I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status > I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to > STARTING > I0717 04:39:59.301985 5031 master.cpp:288] Master > 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 > I0717 04:39:59.302026 5031 master.cpp:325] Master only allowing > authenticated frameworks to register > I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing > authenticated slaves to 
register > I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' > I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled > I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:40280 > I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising > offers for all slaves > I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.325097ms > I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to > STARTING > I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is > master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 > I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! > I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar > I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar > I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status > I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING > I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.204157ms > I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to > VOTING > I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos > group > I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated > I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer > I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0717 04:39:59.313451 5033 leveldb.cpp:306] 
Persisting metadata (8 bytes) to > leveldb took 2.271822ms > I0717 04:39:59.313627 5033 replica.cpp:342] Persisted promised to 1 > I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 5.004323ms > I0717 04:39:59.323493 5031 replica.cpp:676] Persisted action at 0 > I0717 04:39:59.323799 5031 replica.cpp:508] Replica received write request > for position 0 > I0717 04:39:59.323837 5031 leveldb.cpp:438] Reading position from leveldb > took 21901ns > I0717 04:39:59.329038 5031 leve
[jira] [Comment Edited] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090065#comment-14090065 ] Benjamin Mahler edited comment on MESOS-1613 at 8/7/14 11:56 PM: - [~tnachen] it's failing on ASF CI as well, can you triage or disable in the interim? E.g. https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2299/consoleFull was (Author: bmahler): [~tnachen] it's failing on ASF CI as well, can you triage or disable in the interim? > HealthCheckTest.ConsecutiveFailures is flaky > > > Key: MESOS-1613 > URL: https://issues.apache.org/jira/browse/MESOS-1613 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: Ubuntu 10.04 GCC >Reporter: Vinod Kone >Assignee: Timothy Chen > > {code} > [ RUN ] HealthCheckTest.ConsecutiveFailures > Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' > I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms > I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms > I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns > I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in > 2125ns > I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the > db in 10747ns > I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery > I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status > I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to > STARTING > I0717 04:39:59.301985 5031 master.cpp:288] Master > 
20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 > I0717 04:39:59.302026 5031 master.cpp:325] Master only allowing > authenticated frameworks to register > I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing > authenticated slaves to register > I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' > I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled > I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:40280 > I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising > offers for all slaves > I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.325097ms > I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to > STARTING > I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is > master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 > I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! 
> I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar > I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar > I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status > I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING > I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.204157ms > I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to > VOTING > I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos > group > I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated > I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer > I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0717 04:39:59.313451 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 2.271822ms > I0717 04:39:59.313627 5033 replica.cpp:342] Persisted promised to 1 > I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 5.004323ms >
[jira] [Commented] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.
[ https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093505#comment-14093505 ] Benjamin Mahler commented on MESOS-1620: Review chain for this one, did some cleanups along the way: https://reviews.apache.org/r/24582/ https://reviews.apache.org/r/24583/ https://reviews.apache.org/r/24576/ https://reviews.apache.org/r/24515/ https://reviews.apache.org/r/24516/ > Reconciliation does not send back tasks pending validation / authorization. > --- > > Key: MESOS-1620 > URL: https://issues.apache.org/jira/browse/MESOS-1620 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send > back TASK_STAGING for those tasks that are pending in the Master (validation > / authorization still in progress). > For both implicit and explicit task reconciliation, the master could reply > with TASK_STAGING for these tasks, as this provides additional information to > the framework. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.
[ https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1620: --- Shepherd: Vinod Kone (was: Dominic Hamon) > Reconciliation does not send back tasks pending validation / authorization. > --- > > Key: MESOS-1620 > URL: https://issues.apache.org/jira/browse/MESOS-1620 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send > back TASK_STAGING for those tasks that are pending in the Master (validation > / authorization still in progress). > For both implicit and explicit task reconciliation, the master could reply > with TASK_STAGING for these tasks, as this provides additional information to > the framework. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1695) The stats.json endpoint on the slave exposes "registered" as a string.
Benjamin Mahler created MESOS-1695: -- Summary: The stats.json endpoint on the slave exposes "registered" as a string. Key: MESOS-1695 URL: https://issues.apache.org/jira/browse/MESOS-1695 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Assignee: Vinod Kone Priority: Minor The slave is currently exposing a string value for the "registered" statistic; this should be a number:
{code}
slave:5051/stats.json
{
  "recovery_errors": 0,
  "registered": "1",
  "slave/executors_registering": 0,
  ...
}
{code}
Should be a pretty straightforward fix; it looks like this first originated back in 2013:
{code}
commit b8291304e1523eb67ea8dc5f195cdb0d8e7d8348
Author: Vinod Kone
Date: Wed Jul 3 12:37:36 2013 -0700

    Added a "registered" key/value pair to slave's stats.json.

    Review: https://reviews.apache.org/r/12256

diff --git a/src/slave/http.cpp b/src/slave/http.cpp
index dc2955f..dd51516 100644
--- a/src/slave/http.cpp
+++ b/src/slave/http.cpp
@@ -281,6 +281,8 @@ Future Slave::Http::stats(const Request& request)
   object.values["lost_tasks"] = slave.stats.tasks[TASK_LOST];
   object.values["valid_status_updates"] = slave.stats.validStatusUpdates;
   object.values["invalid_status_updates"] = slave.stats.invalidStatusUpdates;
+  object.values["registered"] = slave.master ? "1" : "0";
+
   return OK(object, request.query.get("jsonp"));
 }
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1696) Improve reconciliation between master and slave.
Benjamin Mahler created MESOS-1696: -- Summary: Improve reconciliation between master and slave. Key: MESOS-1696 URL: https://issues.apache.org/jira/browse/MESOS-1696 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler As we update the Master to keep tasks in memory until they are both terminal and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as follows:
{code}
Master   Slave
{}       {}
{Tn}     {}    // Master receives Task T, non-terminal. Forwards to slave.
{Tn}     {Tn}  // Slave receives Task T, non-terminal.
{Tn}     {Tt}  // Task becomes terminal on slave. Update forwarded.
{Tt}     {Tt}  // Master receives update, forwards to framework.
{}       {Tt}  // Master receives ack, forwards to slave.
{}       {}    // Slave receives ack.
{code}
In the current form of reconciliation, the slave sends to the master all tasks that are not both terminal and acknowledged. At any point in the above lifecycle, the slave's re-registration message can reach the master. Note the following properties: *(1)* The master may have a non-terminal task, not present in the slave's re-registration message. *(2)* The master may have a non-terminal task, present in the slave's re-registration message but in a different state. *(3)* The slave's re-registration message may contain a terminal unacknowledged task unknown to the master. In the current master / slave [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146] code, the master assumes that case (1) is because a launch task message was dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when the task reaches the slave correctly, so this can lead to inconsistency! After chatting with [~vinodkone], we're considering updating the reconciliation to occur as follows: → Slave sends all tasks that are not both terminal and acknowledged, during re-registration. This is the same as before. 
→ If the master sees tasks that are missing on the slave, the master sends a reconcile message to the slave for those tasks. → The slave will reply to reconcile messages with the latest state, or TASK_LOST if the task is not known to it. This should preferably be done in a retried manner, unless we update socket-closure handling on the slave to force a re-registration. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.
Benjamin Mahler created MESOS-1700: -- Summary: ThreadLocal does not release pthread keys or log properly. Key: MESOS-1700 URL: https://issues.apache.org/jira/browse/MESOS-1700 Project: Mesos Issue Type: Bug Components: stout Reporter: Benjamin Mahler Assignee: Benjamin Mahler The ThreadLocal abstraction in stout does not release the allocated pthread keys upon destruction: https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22 It also does not log the errors correctly. Fortunately this does not impact mesos at the current time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.
[ https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1700: --- Sprint: Q3 Sprint 3 > ThreadLocal does not release pthread keys or log properly. > -- > > Key: MESOS-1700 > URL: https://issues.apache.org/jira/browse/MESOS-1700 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > The ThreadLocal abstraction in stout does not release the allocated > pthread keys upon destruction: > https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22 > It also does not log the errors correctly. Fortunately this does not impact > mesos at the current time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.
[ https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096175#comment-14096175 ] Benjamin Mahler commented on MESOS-1700: https://reviews.apache.org/r/24669/ > ThreadLocal does not release pthread keys or log properly. > -- > > Key: MESOS-1700 > URL: https://issues.apache.org/jira/browse/MESOS-1700 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > The ThreadLocal abstraction in stout does not release the allocated > pthread keys upon destruction: > https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22 > It also does not log the errors correctly. Fortunately this does not impact > mesos at the current time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MESOS-1006) Invalid free when in ProcessIsolator Usage when executing a short task
[ https://issues.apache.org/jira/browse/MESOS-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1006. Resolution: Fixed Fix Version/s: 0.18.0 Assignee: Benjamin Hindman This was fixed as part of MESOS-963. > Invalid free when in ProcessIsolator Usage when executing a short task > -- > > Key: MESOS-1006 > URL: https://issues.apache.org/jira/browse/MESOS-1006 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.17.0 > Environment: MacOS 10.9.1 >Reporter: Florian Douetteau >Assignee: Benjamin Hindman > Fix For: 0.18.0 > > > Executing: > mesos execute --command=/bin/ls --master=127.0.0.1:5050 --name=test > Slave Log: > I0214 11:46:34.732306 294408192 status_update_manager.cpp:367] Forwarding > status update TASK_RUNNING (UUID: 956f2268-f677-4ff0-86d2-95e338b11447) for > task test of framework 201402141053-16777343-5050-50707-0010 to > master@127.0.0.1:5050 > I0214 11:46:34.732408 296017920 slave.cpp:1882] Sending acknowledgement for > status update TASK_RUNNING (UUID: 956f2268-f677-4ff0-86d2-95e338b11447) for > task test of framework 201402141053-16777343-5050-50707-0010 to > executor(1)@127.0.0.1:57686 > I0214 11:46:34.772078 294408192 status_update_manager.cpp:392] Received > status update acknowledgement (UUID: 956f2268-f677-4ff0-86d2-95e338b11447) > for task test of framework 201402141053-16777343-5050-50707-0010 > mesos-slave(52021,0x1119cb000) malloc: *** error for object 0x79702e72657672: > pointer being freed was not allocated > *** set a breakpoint in malloc_error_break to debug > *** Aborted at 1392407195 (unix time) try "date -d @1392407195" if you are > using GNU date *** > PC: @ 0x7fff8f602866 __pthread_kill > *** SIGABRT (@0x7fff8f602866) received by PID 52021 (TID 0x1119cb000) stack > trace: *** > @ 0x7fff907a35aa _sigtramp > @0x0 (unknown) > @ 0x7fff8d43ebba abort > @ 0x7fff8d3ba093 free > @0x1101a11c2 std::__1::vector<>::erase() > @0x110197dcb os::process() > @0x11019eaad os::processes() > 
@0x110198957 os::children() > @0x1101939f3 mesos::internal::slave::ProcessIsolator::usage() > @0x1101115d3 > _ZZN7process8dispatchIN5mesos18ResourceStatisticsENS1_8internal5slave8IsolatorERKNS1_11FrameworkIDERKNS1_10ExecutorIDES6_S9_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSG_FSE_T1_T2_ET3_T4_ENKS0_IS2_S5_S8_SB_S6_S9_EUlPNS_11ProcessBaseEE_clESS_ > @0x11036cd13 process::ProcessBase::visit() > @0x1103640d2 process::ProcessManager::resume() > @0x110363c2f process::schedule() > @ 0x7fff8b269899 _pthread_body > @ 0x7fff8b26972a _pthread_start > @ 0x7fff8b26dfc9 thread_start > Abort trap: 6 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1454) Command executor should have nonzero resources
[ https://issues.apache.org/jira/browse/MESOS-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1454: --- Fix Version/s: 0.20.0 > Command executor should have nonzero resources > -- > > Key: MESOS-1454 > URL: https://issues.apache.org/jira/browse/MESOS-1454 > Project: Mesos > Issue Type: Bug >Reporter: Ian Downes >Assignee: Ian Downes > Fix For: 0.20.0 > > > The command executor is used when TaskInfo does not provide an ExecutorInfo. > It is not given any dedicated resources but the executor will be launched > with the first task's resources. However, assuming a single task (or a final > task), when that task terminates the container will be updated to zero > resources - see containerizer->update in Slave::statusUpdate(). > Possible solutions: > - Require a task to specify an executor and resources for it. > - Add sufficient allowance for the command executor beyond the task's > resources. This leads to an overcommit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1705) SubprocessTest.Status sometimes flakes out
[ https://issues.apache.org/jira/browse/MESOS-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099021#comment-14099021 ] Benjamin Mahler commented on MESOS-1705: This would be because we recently turned on google-logging stacktraces in the libprocess tests. Note that the test is still passing but the child process seems to have received the SIGTERM after the fork but before the exec, which is fine but produces this unfortunate stack trace. [~vinodkone] perhaps we should drop the SIGTERM stacktracing, like we do within the mesos logging initialization. > SubprocessTest.Status sometimes flakes out > -- > > Key: MESOS-1705 > URL: https://issues.apache.org/jira/browse/MESOS-1705 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 0.20.0 >Reporter: Timothy St. Clair >Priority: Minor > Labels: build > > It's a pretty rare event, but happened more than once. > [ RUN ] SubprocessTest.Status > *** Aborted at 1408023909 (unix time) try "date -d @1408023909" if you are > using GNU date *** > PC: @ 0x35700094b1 (unknown) > *** SIGTERM (@0x3e841d8) received by PID 16872 (TID 0x7fa9ea426780) from > PID 16856; stack trace: *** > @ 0x3570435cb0 (unknown) > @ 0x35700094b1 (unknown) > @ 0x3570009d9f (unknown) > @ 0x357000e726 (unknown) > @ 0x3570015185 (unknown) > @ 0x5ead42 process::childMain() > @ 0x5ece8d std::_Function_handler<>::_M_invoke() > @ 0x5eac9c process::defaultClone() > @ 0x5ebbd4 process::subprocess() > @ 0x55a229 process::subprocess() > @ 0x55a846 process::subprocess() > @ 0x54224c SubprocessTest_Status_Test::TestBody() > @ 0x7fa9ea460323 (unknown) > @ 0x7fa9ea455b67 (unknown) > @ 0x7fa9ea455c0e (unknown) > @ 0x7fa9ea455d15 (unknown) > @ 0x7fa9ea4593a8 (unknown) > @ 0x7fa9ea459647 (unknown) > @ 0x422466 main > @ 0x3570421d65 (unknown) > @ 0x4260bd (unknown) > [ OK ] SubprocessTest.Status (153 ms) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.
Benjamin Mahler created MESOS-1714: -- Summary: The C++ 'Resources' abstraction should keep the underlying resources flattened. Key: MESOS-1714 URL: https://issues.apache.org/jira/browse/MESOS-1714 Project: Mesos Issue Type: Bug Components: c++ api Reporter: Benjamin Mahler Currently, the C++ Resources class does not ensure that the underlying Resources protobufs are kept flat. This is an issue because some of the methods, e.g. [Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269], assume the resources are flat. There is code that constructs unflattened resources, e.g. [Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353]. We could prevent this type of construction, however it is perfectly fine if we ensure the C++ 'Resources' class performs flattening. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1715) The slave does not send pending tasks during re-registration.
Benjamin Mahler created MESOS-1715: -- Summary: The slave does not send pending tasks during re-registration. Key: MESOS-1715 URL: https://issues.apache.org/jira/browse/MESOS-1715 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler In what looks like an oversight, the pending tasks in the slave (Framework::pending) are not sent in the re-registration message. This can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.
Benjamin Mahler created MESOS-1716: -- Summary: The slave does not add pending tasks as part of the staging tasks metric. Key: MESOS-1716 URL: https://issues.apache.org/jira/browse/MESOS-1716 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Priority: Trivial The slave does not represent pending tasks in the "tasks_staging" metric. This should be a trivial fix. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.
Benjamin Mahler created MESOS-1717: -- Summary: The slave does not show pending tasks in the JSON endpoints. Key: MESOS-1717 URL: https://issues.apache.org/jira/browse/MESOS-1717 Project: Mesos Issue Type: Bug Components: json api, slave Reporter: Benjamin Mahler The slave does not show pending tasks in the /state.json endpoint. This is a bit tricky to add since we rely on knowing the executor directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1718) Command executor can overcommit the slave.
Benjamin Mahler created MESOS-1718: -- Summary: Command executor can overcommit the slave. Key: MESOS-1718 URL: https://issues.apache.org/jira/browse/MESOS-1718 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Currently we give a small amount of resources to the command executor, in addition to resources used by the command task: https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
{code}
ExecutorInfo Slave::getExecutorInfo(
    const FrameworkID& frameworkId,
    const TaskInfo& task)
{
  ...
  // Add an allowance for the command executor. This does lead to a
  // small overcommit of resources.
  executor.mutable_resources()->MergeFrom(
      Resources::parse(
          "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
          "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
  ...
}
{code}
This leads to an overcommit of the slave. Ideally, for command tasks we can "transfer" all of the task resources to the executor at the slave / isolation level. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.
[ https://issues.apache.org/jira/browse/MESOS-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1716: -- Assignee: Benjamin Mahler > The slave does not add pending tasks as part of the staging tasks metric. > - > > Key: MESOS-1716 > URL: https://issues.apache.org/jira/browse/MESOS-1716 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Trivial > > The slave does not represent pending tasks in the "tasks_staging" metric. > This should be a trivial fix. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MESOS-1715) The slave does not send pending tasks during re-registration.
[ https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1715: -- Assignee: Benjamin Mahler > The slave does not send pending tasks during re-registration. > - > > Key: MESOS-1715 > URL: https://issues.apache.org/jira/browse/MESOS-1715 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > In what looks like an oversight, the pending tasks in the slave > (Framework::pending) are not sent in the re-registration message. > This can lead to spurious TASK_LOST notifications being generated by the > master when it falsely thinks the tasks are not present on the slave. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1720) Slave should send exited executor message when the executor is never launched.
Benjamin Mahler created MESOS-1720: -- Summary: Slave should send exited executor message when the executor is never launched. Key: MESOS-1720 URL: https://issues.apache.org/jira/browse/MESOS-1720 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler When the slave sends TASK_LOST before launching an executor for a task, the slave does not send an exited executor message to the master. Since the master receives no exited executor message, it still thinks the executor's resources are consumed on the slave. One possible fix for this would be to send the exited executor message to the master in these cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1721) Prevent overcommit of the slave for ports and ephemeral ports.
Benjamin Mahler created MESOS-1721: -- Summary: Prevent overcommit of the slave for ports and ephemeral ports. Key: MESOS-1721 URL: https://issues.apache.org/jira/browse/MESOS-1721 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Assignee: Benjamin Mahler It's possible for the slave to be overcommitted (e.g. MESOS-1668). In the case of "named" resources like ports and ephemeral_ports, this is problematic as the resources needed by the tasks are in use. This ticket is to present the idea of rejecting tasks when the slave is overcommitted on ports or ephemeral_ports. In order to ensure the master reconciles state with the slave, we can also trigger a re-registration. For cpu / memory, this is less crucial, so preventing overcommit for these will be punted for later. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101757#comment-14101757 ] Benjamin Mahler commented on MESOS-1466: We're going to proceed with a mitigation of this by rejecting tasks once the slave is overcommitted: https://issues.apache.org/jira/browse/MESOS-1721 However, we would also like to ensure that this kind of race is not possible. One solution is to use master acknowledgments for executor exits: (1) When an executor terminates (or the executor could not be launched: MESOS-1720), we send an exited executor message. (2) The master acknowledges these messages. (3) The slave will not accept tasks for unacknowledged terminal executors (this must include those executors that could not be launched, per MESOS-1720). The result of this is that a new executor cannot be launched until the master is aware of the old executor exiting. > Race between executor exited event and launch task can cause overcommit of > resources > > > Key: MESOS-1466 > URL: https://issues.apache.org/jira/browse/MESOS-1466 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Vinod Kone >Assignee: Benjamin Mahler > Labels: reliability > > The following sequence of events can cause an overcommit > --> Launch task is called for a task whose executor is already running > --> Executor's resources are not accounted for on the master > --> Executor exits and the event is enqueued behind launch tasks on the master > --> Master sends the task to the slave which needs to commit resources > for the task and the (new) executor. > --> Master processes the executor exited event and re-offers the executor's > resources causing an overcommit of resources. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.
[ https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1715: --- Summary: The slave does not send pending tasks / executors during re-registration. (was: The slave does not send pending tasks during re-registration.) > The slave does not send pending tasks / executors during re-registration. > - > > Key: MESOS-1715 > URL: https://issues.apache.org/jira/browse/MESOS-1715 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > In what looks like an oversight, the pending tasks in the slave > (Framework::pending) are not sent in the re-registration message. > This can lead to spurious TASK_LOST notifications being generated by the > master when it falsely thinks the tasks are not present on the slave. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.
[ https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1715: --- Description: In what looks like an oversight, the pending tasks and executors in the slave (Framework::pending) are not sent in the re-registration message. For tasks, this can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. For executors, this can lead to under-accounting in the master, causing an overcommit on the slave. was: In what looks like an oversight, the pending tasks in the slave (Framework::pending) are not sent in the re-registration message. This can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. > The slave does not send pending tasks / executors during re-registration. > - > > Key: MESOS-1715 > URL: https://issues.apache.org/jira/browse/MESOS-1715 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > In what looks like an oversight, the pending tasks and executors in the slave > (Framework::pending) are not sent in the re-registration message. > For tasks, this can lead to spurious TASK_LOST notifications being generated > by the master when it falsely thinks the tasks are not present on the slave. > For executors, this can lead to under-accounting in the master, causing an > overcommit on the slave. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1734) Reduce compile time replacing macro expansions with variadic templates
[ https://issues.apache.org/jira/browse/MESOS-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111420#comment-14111420 ] Benjamin Mahler commented on MESOS-1734: Hi [~preillyme], we can't yet assume C++11 support: https://issues.apache.org/jira/browse/MESOS-750 [~dhamon] would have a better idea of when we'll move to C++11 as a requirement. > Reduce compile time replacing macro expansions with variadic templates > -- > > Key: MESOS-1734 > URL: https://issues.apache.org/jira/browse/MESOS-1734 > Project: Mesos > Issue Type: Improvement >Reporter: Patrick Reilly >Assignee: Patrick Reilly >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1735) Better Startup Failure For Duplicate Master
[ https://issues.apache.org/jira/browse/MESOS-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118544#comment-14118544 ] Benjamin Mahler commented on MESOS-1735: We could use the EXIT approach from stout/exit.hpp here to avoid the abort / stacktrace and to include a helpful message. > Better Startup Failure For Duplicate Master > --- > > Key: MESOS-1735 > URL: https://issues.apache.org/jira/browse/MESOS-1735 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.20.0 > Environment: Ubuntu 12.04 >Reporter: Ken Sipe > > The error message is cryptic when starting a mesos-master when a mesos-master > is already running. The error message is: > mesos-master --ip=192.168.74.174 --work_dir=~/mesos > WARNING: Logging before InitGoogleLogging() is written to STDERR > F0826 20:24:56.940961 3057 process.cpp:1632] Failed to initialize, bind: > Address already in use [98] > *** Check failure stack trace: *** > Aborted (core dumped) > This can be a new person's first experience. It isn't clear to them that the > process is already running. And they are lost as to what to do next. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.
[ https://issues.apache.org/jira/browse/MESOS-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120110#comment-14120110 ] Benjamin Mahler commented on MESOS-1716: https://reviews.apache.org/r/25302/ https://reviews.apache.org/r/25303/ > The slave does not add pending tasks as part of the staging tasks metric. > - > > Key: MESOS-1716 > URL: https://issues.apache.org/jira/browse/MESOS-1716 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Trivial > > The slave does not represent pending tasks in the "tasks_staging" metric. > This should be a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.
[ https://issues.apache.org/jira/browse/MESOS-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120138#comment-14120138 ] Benjamin Mahler commented on MESOS-1714: For now, this review avoids constructing an unflattened Resources object: https://reviews.apache.org/r/25306/ > The C++ 'Resources' abstraction should keep the underlying resources > flattened. > --- > > Key: MESOS-1714 > URL: https://issues.apache.org/jira/browse/MESOS-1714 > Project: Mesos > Issue Type: Bug > Components: c++ api >Reporter: Benjamin Mahler > > Currently, the C++ Resources class does not ensure that the underlying > Resources protobufs are kept flat. > This is an issue because some of the methods, e.g. > [Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269], > assume the resources are flat. > There is code that constructs unflattened resources, e.g. > [Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353]. > We could prevent this type of construction, however it is perfectly fine if > we ensure the C++ 'Resources' class performs flattening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
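The flattening invariant described above can be sketched as follows — a simplified Python model covering scalar resources only, keyed by (name, role); the real C++ Resources class also handles ranges, sets, and protobuf types, so this is an illustration of the invariant, not Mesos code:

```python
from collections import defaultdict

def flatten(resources):
    """Collapse duplicate (name, role) scalar entries by summing their
    values, so that lookups like get() can assume at most one entry per
    key. Each resource is a (name, role, value) tuple in this sketch."""
    totals = defaultdict(float)
    order = []  # Preserve first-seen order of keys.
    for name, role, value in resources:
        key = (name, role)
        if key not in totals:
            order.append(key)
        totals[key] += value
    return [(name, role, totals[(name, role)]) for name, role in order]
```

With the invariant enforced on construction, code like Slave::launchExecutor can build resources piecewise without risking an unflattened representation leaking into get().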
[jira] [Closed] (MESOS-733) Speedup slave recovery tests
[ https://issues.apache.org/jira/browse/MESOS-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler closed MESOS-733. - Resolution: Incomplete Closing this in favor of an epic to track testing speedups. > Speedup slave recovery tests > > > Key: MESOS-733 > URL: https://issues.apache.org/jira/browse/MESOS-733 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone > Labels: twitter > > Several of the tests are slow now that they do offer checking. I suspect this > is due to the "Clock" semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1757) Speed up the tests.
Benjamin Mahler created MESOS-1757: -- Summary: Speed up the tests. Key: MESOS-1757 URL: https://issues.apache.org/jira/browse/MESOS-1757 Project: Mesos Issue Type: Epic Components: technical debt, test Reporter: Benjamin Mahler The full test suite is exceeding the 7 minute mark (440 seconds on my machine); this epic is to track techniques to improve this: # The reaper takes a full second to reap an exited process (MESOS-1199), which adds a second to each slave recovery test, and possibly more for things that rely on Subprocess. # The command executor sleeps for a second when shutting down (MESOS-442), which adds a second to every test that uses the command executor. # Now that the master and the slave have to perform sync'ed disk writes, consider using a ramdisk to speed up the disk writes. Additional options that hopefully will not be necessary: # Use automake's [parallel test harness|http://www.gnu.org/software/automake/manual/html_node/Parallel-Test-Harness.html] to compile tests separately and run tests in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1758) Freezer failure leads to lost task during container destruction.
Benjamin Mahler created MESOS-1758: -- Summary: Freezer failure leads to lost task during container destruction. Key: MESOS-1758 URL: https://issues.apache.org/jira/browse/MESOS-1758 Project: Mesos Issue Type: Bug Components: containerization Reporter: Benjamin Mahler In the past we've seen numerous issues around the freezer. Lately, on the 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup: (1) An oom occurs. (2) No indication of oom in the kernel logs. (3) The slave is unable to freeze the cgroup. (4) The task is marked as lost.
{noformat}
I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 15488MB Maximum Used: 15488MB
MEMORY STATISTICS:
cache 7958691840
rss 8281653248
mapped_file 9474048
pgpgin 4487861
pgpgout 522933
pgfault 2533780
pgmajfault 11
inactive_anon 0
active_anon 8281653248
inactive_file 7631708160
active_file 326852608
unevictable 0
hierarchical_memory_limit 16240345088
total_cache 7958691840
total_rss 8281653248
total_mapped_file 9474048
total_pgpgin 4487861
total_pgpgout 522933
total_pgfault 2533780
total_pgmajfault 11
total_inactive_anon 0
total_active_anon 8281653248
total_inactive_file 7631728640
total_active_file 326852608
total_unevictable 0
I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource mem(*):1.62403e+10 and will be terminated
I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 'bbb9732a-d600-4c1b-b326-846338c608c3'
I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.710848ms
I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.588224ms
I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 2.15296ms
I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.643008ms
I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed age: 5.630238827780799days
I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.511168ms
I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for '/slave(1)/stats.json'
E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of framework '201104070004-002563-' failed: Failed to destroy container: discarded future
I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-002563- from @0.0.0.0:0
I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660
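The freeze/thaw retry pattern visible in the log above (freeze for ~10s, thaw, retry) can be sketched as follows — a hedged Python model operating on a freezer.state file path with illustrative timeouts, not the actual cgroups.cpp implementation:

```python
import time

def try_freeze(state_path, attempt_timeout=10.0, poll=0.1):
    """One freeze attempt: write FROZEN to freezer.state, then poll until
    the state reads back FROZEN or the timeout expires. On a real cgroup
    the kernel may report FREEZING indefinitely, which is the failure mode
    seen above; on a plain file (as in a test) it succeeds immediately."""
    with open(state_path, 'w') as f:
        f.write('FROZEN\n')
    deadline = time.time() + attempt_timeout
    while time.time() < deadline:
        with open(state_path) as f:
            if f.read().strip() == 'FROZEN':
                return True
        time.sleep(poll)
    return False

def destroy_with_retries(state_path, attempts=3):
    """Freeze with a bounded number of retries, thawing between attempts,
    mirroring the Freezing/Thawing cycle in the log."""
    for _ in range(attempts):
        if try_freeze(state_path):
            return True
        # Thaw before retrying so tasks are not left half-frozen.
        with open(state_path, 'w') as f:
            f.write('THAWED\n')
    return False
```

In the log, the destroy was eventually discarded after repeated failed attempts, which is what surfaces as "Failed to destroy container: discarded future" and the TASK_LOST.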
[jira] [Commented] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.
[ https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122338#comment-14122338 ] Benjamin Mahler commented on MESOS-1715: For now I've fixed it to send the pending tasks, since that is important for reconciliation: https://reviews.apache.org/r/25371/ https://reviews.apache.org/r/25372/ https://reviews.apache.org/r/25373/ I'll pull out a ticket for the executors. > The slave does not send pending tasks / executors during re-registration. > - > > Key: MESOS-1715 > URL: https://issues.apache.org/jira/browse/MESOS-1715 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > In what looks like an oversight, the pending tasks and executors in the slave > (Framework::pending) are not sent in the re-registration message. > For tasks, this can lead to spurious TASK_LOST notifications being generated > by the master when it falsely thinks the tasks are not present on the slave. > For executors, this can lead to under-accounting in the master, causing an > overcommit on the slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.
[ https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1715: --- Shepherd: Vinod Kone (was: Yan Xu) > The slave does not send pending tasks / executors during re-registration. > - > > Key: MESOS-1715 > URL: https://issues.apache.org/jira/browse/MESOS-1715 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > In what looks like an oversight, the pending tasks and executors in the slave > (Framework::pending) are not sent in the re-registration message. > For tasks, this can lead to spurious TASK_LOST notifications being generated > by the master when it falsely thinks the tasks are not present on the slave. > For executors, this can lead to under-accounting in the master, causing an > overcommit on the slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-186) Resource offers should be rescinded after some configurable timeout
[ https://issues.apache.org/jira/browse/MESOS-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-186. --- Resolution: Fixed Fix Version/s: 0.21.0
{noformat}
commit 707bf3b1d6f042ee92e7a291d3f74a20ae2d494b
Author: Kapil Arya
Date: Fri Sep 5 11:15:15 2014 -0700

Added optional --offer_timeout to rescind unused offers.

The ability to set an offer timeout helps prevent unfair resource allocations in the face of frameworks that hoard offers, or that accidentally drop offers.

When optimistic offers are added, hoarding will not affect the fairness for other frameworks.

Review: https://reviews.apache.org/r/22066
{noformat}
> Resource offers should be rescinded after some configurable timeout
> ---
>
> Key: MESOS-186
> URL: https://issues.apache.org/jira/browse/MESOS-186
> Project: Mesos
> Issue Type: Improvement
> Components: framework
> Reporter: Benjamin Hindman
> Assignee: Timothy Chen
> Fix For: 0.21.0
>
> Problem: a framework has a bug and holds on to resource offers by accident for 24 hours.
> One suggestion: resource offers should be rescinded after some configurable timeout.
> Possible issue: this might interfere with frameworks that are hoarding. But one possible solution here is to add another API call which checks the status of resource offers (i.e., "remindAboutOffer"). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
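The --offer_timeout behavior can be modelled with a small sketch. This is a hypothetical tracker, not the actual allocator code: it records when each offer was issued and reports which outstanding offers have exceeded the timeout and should be rescinded. The injectable clock exists only to make the sketch testable:

```python
import time

class OfferTracker:
    """Tracks outstanding offers and reports which have outlived the
    configured offer timeout (candidates for rescinding). Illustrative
    model of the behaviour added by --offer_timeout."""

    def __init__(self, offer_timeout, clock=time.time):
        self.offer_timeout = offer_timeout
        self.clock = clock
        self.offers = {}  # offer_id -> time the offer was issued.

    def issue(self, offer_id):
        self.offers[offer_id] = self.clock()

    def remove(self, offer_id):
        """Called when an offer is accepted, declined, or rescinded."""
        self.offers.pop(offer_id, None)

    def expired(self):
        """Offer ids held longer than the timeout."""
        now = self.clock()
        return [oid for oid, t in self.offers.items()
                if now - t >= self.offer_timeout]
```

The master would periodically call expired() and rescind those offers, freeing the resources for re-allocation to other frameworks.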
[jira] [Updated] (MESOS-1476) Provide endpoints for deactivating / activating slaves.
[ https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1476: --- Sprint: (was: Mesos Q3 Sprint 5) > Provide endpoints for deactivating / activating slaves. > --- > > Key: MESOS-1476 > URL: https://issues.apache.org/jira/browse/MESOS-1476 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Benjamin Mahler > Labels: gsoc2014 > > When performing maintenance operations on slaves, it is important to allow > these slaves to be drained of their tasks. > The first essential primitive of draining slaves is to prevent them from > running more tasks. This can be achieved by "deactivating" them: stop sending > their resource offers to frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1476) Provide endpoints for deactivating / activating slaves.
[ https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1476: -- Assignee: (was: Alexandra Sava) Un-assigning for now since there is no longer a need for this with the updated maintenance design in MESOS-1474. > Provide endpoints for deactivating / activating slaves. > --- > > Key: MESOS-1476 > URL: https://issues.apache.org/jira/browse/MESOS-1476 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Benjamin Mahler > Labels: gsoc2014 > > When performing maintenance operations on slaves, it is important to allow > these slaves to be drained of their tasks. > The first essential primitive of draining slaves is to prevent them from > running more tasks. This can be achieved by "deactivating" them: stop sending > their resource offers to frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1592) Design inverse resource offer support
[ https://issues.apache.org/jira/browse/MESOS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126421#comment-14126421 ] Benjamin Mahler commented on MESOS-1592: Moving this to reviewable as inverse offers were designed as part of the maintenance work: MESOS-1474. We are currently considering how persistent resources will interact with inverse offers and the other maintenance primitives. > Design inverse resource offer support > - > > Key: MESOS-1592 > URL: https://issues.apache.org/jira/browse/MESOS-1592 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Alexandra Sava > > An "inverse" resource offer means that Mesos is requesting resources back > from the framework, possibly within some time interval. > This can be leveraged initially to provide more automated cluster > maintenance, by offering schedulers the opportunity to move tasks to > compensate for planned maintenance. Operators can set a time limit on how > long to wait for schedulers to relocate tasks before the tasks are forcibly > terminated. > Inverse resource offers have many other potential uses, as it opens the > opportunity for the allocator to attempt to move tasks in the cluster through > the co-operation of the framework, possibly providing better > over-subscription, fairness, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.
[ https://issues.apache.org/jira/browse/MESOS-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1717: -- Assignee: (was: Benjamin Mahler) Punting for now since it's not that important and the fix is non-trivial given how the slave structures the JSON models. > The slave does not show pending tasks in the JSON endpoints. > > > Key: MESOS-1717 > URL: https://issues.apache.org/jira/browse/MESOS-1717 > Project: Mesos > Issue Type: Bug > Components: json api, slave >Reporter: Benjamin Mahler > > The slave does not show pending tasks in the /state.json endpoint. > This is a bit tricky to add since we rely on knowing the executor directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.
[ https://issues.apache.org/jira/browse/MESOS-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1717: --- Sprint: Q3 Sprint 4 (was: Q3 Sprint 4, Mesos Q3 Sprint 5) > The slave does not show pending tasks in the JSON endpoints. > > > Key: MESOS-1717 > URL: https://issues.apache.org/jira/browse/MESOS-1717 > Project: Mesos > Issue Type: Bug > Components: json api, slave >Reporter: Benjamin Mahler > > The slave does not show pending tasks in the /state.json endpoint. > This is a bit tricky to add since we rely on knowing the executor directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.
Benjamin Mahler created MESOS-1786: -- Summary: FaultToleranceTest.ReconcilePendingTasks is flaky. Key: MESOS-1786 URL: https://issues.apache.org/jira/browse/MESOS-1786 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler {noformat} [ RUN ] FaultToleranceTest.ReconcilePendingTasks Using temporary directory '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm' I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 1781ns I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the db in 537ns I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from a replica in EMPTY status I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to STARTING I0910 20:18:02.321723 21652 master.cpp:286] Master 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361 I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing authenticated frameworks to register I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing authenticated slaves to register I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for authentication from '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials' I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 
5.781277ms I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to STARTING I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from a replica in STARTING status I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@127.0.1.1:60361 I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634 I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising offers for all slaves I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No resources available to allocate! I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed allocation for 0 slaves in 118308ns I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master! 
I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.887284299secs I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to VOTING I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos group I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit promise request with proposal 1 I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 9.044596ms I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1 I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to fill missing position I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 18.839475ms I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0 I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request for position 0 I0910 20:18:04.251260 21650 leveldb.cpp:438] Reading position from leveldb took 43213ns I0910 20:18:04.262251 21650 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 10.949353ms I0910 20:18:04.262346 21650 replica.cpp:676] Persisted action at 0 I0910 20:18:04.262717 21650 replica.cpp:655] Replica received learned notice for position 0 I0910 20:18:04.271878 21650 leveldb.cpp
[jira] [Updated] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1786: --- Sprint: Mesos Q3 Sprint 5 > FaultToleranceTest.ReconcilePendingTasks is flaky. > -- > > Key: MESOS-1786 > URL: https://issues.apache.org/jira/browse/MESOS-1786 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > {noformat} > [ RUN ] FaultToleranceTest.ReconcilePendingTasks > Using temporary directory > '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm' > I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms > I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms > I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns > I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in > 1781ns > I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the > db in 537ns > I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery > I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status > I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to > STARTING > I0910 20:18:02.321723 21652 master.cpp:286] Master > 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361 > I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing > authenticated frameworks to register > I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing > authenticated slaves to register > I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for > authentication from > 
'/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials' > I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled > I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.781277ms > I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to > STARTING > I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status > I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING > I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] > Initializing hierarchical allocator process with master : > master@127.0.1.1:60361 > I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is > master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634 > I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising > offers for all slaves > I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No > resources available to allocate! > I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed > allocation for 0 slaves in 118308ns > I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master! 
> I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar > I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar > I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.887284299secs > I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to > VOTING > I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos > group > I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated > I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer > I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 9.044596ms > I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1 > I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 18.839475ms > I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0 > I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request > for position 0 > I0910 20:18:04.251260 21650 leve
[jira] [Updated] (MESOS-1696) Improve reconciliation between master and slave.
[ https://issues.apache.org/jira/browse/MESOS-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1696: --- Description:
As we update the Master to keep tasks in memory until they are both terminal and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as follows:
{code}
Master  Slave
{}      {}
{Tn}    {}      // Master receives Task T, non-terminal. Forwards to slave.
{Tn}    {Tn}    // Slave receives Task T, non-terminal.
{Tn}    {Tt}    // Task becomes terminal on slave. Update forwarded.
{Tt}    {Tt}    // Master receives update, forwards to framework.
{}      {Tt}    // Master receives ack, forwards to slave.
{}      {}      // Slave receives ack.
{code}
In the current form of reconciliation, the slave sends to the master all tasks that are not both terminal and acknowledged. At any point in the above lifecycle, the slave's re-registration message can reach the master. Note the following properties:
*(1)* The master may have a non-terminal task, not present in the slave's re-registration message.
*(2)* The master may have a non-terminal task, present in the slave's re-registration message but in a different state.
*(3)* The slave's re-registration message may contain a terminal unacknowledged task unknown to the master.
In the current master / slave [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146] code, the master assumes that case (1) is because a launch task message was dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when the task reaches the slave correctly, so this can lead to inconsistency!
After chatting with [~vinodkone], we're considering updating the reconciliation to occur as follows:
→ Slave sends all tasks that are not both terminal and acknowledged, during re-registration. This is the same as before.
→ If the master sees tasks that are missing in the slave, the master sends the tasks that need to be reconciled to the slave for the tasks. This can be piggy-backed on the re-registration message.
→ The slave will send TASK_LOST if the task is not known to it. Preferably in a retried manner, unless we update socket closure on the slave to force a re-registration.

was:
As we update the Master to keep tasks in memory until they are both terminal and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as follows:
{code}
Master  Slave
{}      {}
{Tn}    {}      // Master receives Task T, non-terminal. Forwards to slave.
{Tn}    {Tn}    // Slave receives Task T, non-terminal.
{Tn}    {Tt}    // Task becomes terminal on slave. Update forwarded.
{Tt}    {Tt}    // Master receives update, forwards to framework.
{}      {Tt}    // Master receives ack, forwards to slave.
{}      {}      // Slave receives ack.
{code}
In the current form of reconciliation, the slave sends to the master all tasks that are not both terminal and acknowledged. At any point in the above lifecycle, the slave's re-registration message can reach the master. Note the following properties:
*(1)* The master may have a non-terminal task, not present in the slave's re-registration message.
*(2)* The master may have a non-terminal task, present in the slave's re-registration message but in a different state.
*(3)* The slave's re-registration message may contain a terminal unacknowledged task unknown to the master.
In the current master / slave [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146] code, the master assumes that case (1) is because a launch task message was dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when the task reaches the slave correctly, so this can lead to inconsistency!
After chatting with [~vinodkone], we're considering updating the reconciliation to occur as follows:
→ Slave sends all tasks that are not both terminal and acknowledged, during re-registration. This is the same as before.
→ If the master sees tasks that are missing in the slave, the master sends a reconcile message to the slave for the tasks.
→ The slave will reply to reconcile messages with the latest state, or TASK_LOST if the task is not known to it. Preferably in a retried manner, unless we update socket closure on the slave to force a re-registration.

> Improve reconciliation between master and slave.
> ---
>
> Key: MESOS-1696
> URL: https://issues.apache.org/jira/browse/MESOS-1696
> Project: Mesos
> Issue Type: Bug
> Components: master, slave
> Reporter: Benjamin Mahler
> Assignee: Vinod Kone
>
> As we update the Master to keep tasks in memory until they are both terminal
> and acknowledge
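The proposed reconciliation protocol can be sketched as a toy model — plain-string states and dictionaries standing in for the real protobuf messages; function names and the state set are illustrative:

```python
TERMINAL = {'TASK_FINISHED', 'TASK_FAILED', 'TASK_KILLED', 'TASK_LOST'}

def slave_reregister_tasks(slave_tasks, acked):
    """Tasks the slave includes when re-registering: everything that is
    not both terminal and acknowledged."""
    return {tid: state for tid, state in slave_tasks.items()
            if not (state in TERMINAL and tid in acked)}

def master_reconcile(master_task_ids, reported, slave_tasks):
    """For tasks the master knows about but the slave did not report,
    ask the slave; the slave answers with its latest state, or
    TASK_LOST if the task is unknown to it."""
    results = {}
    for tid in master_task_ids:
        if tid not in reported:
            results[tid] = slave_tasks.get(tid, 'TASK_LOST')
    return results
```

The point of the extra round trip is that the master no longer infers TASK_LOST unilaterally for case (1); the slave, which holds the authoritative state, supplies the answer.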
[jira] [Commented] (MESOS-1410) Keep terminal unacknowledged tasks in the master's state.
[ https://issues.apache.org/jira/browse/MESOS-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131014#comment-14131014 ] Benjamin Mahler commented on MESOS-1410: https://reviews.apache.org/r/25565/ https://reviews.apache.org/r/25566/ https://reviews.apache.org/r/25567/ https://reviews.apache.org/r/25568/ > Keep terminal unacknowledged tasks in the master's state. > - > > Key: MESOS-1410 > URL: https://issues.apache.org/jira/browse/MESOS-1410 > Project: Mesos > Issue Type: Task >Affects Versions: 0.19.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 0.21.0 > > > Once we are sending acknowledgments through the master as per MESOS-1409, we > need to keep terminal tasks that are *unacknowledged* in the Master's memory. > This will allow us to identify these tasks to frameworks when we haven't yet > forwarded them an update. Without this, we're susceptible to MESOS-1389. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-750) Require compilers that support c++11
[ https://issues.apache.org/jira/browse/MESOS-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131912#comment-14131912 ] Benjamin Mahler commented on MESOS-750: --- Any plan to make the supported compiler versions explicit in the documentation? http://mesos.apache.org/gettingstarted/ > Require compilers that support c++11 > > > Key: MESOS-750 > URL: https://issues.apache.org/jira/browse/MESOS-750 > Project: Mesos > Issue Type: Improvement > Components: technical debt >Reporter: Benjamin Mahler >Assignee: Dominic Hamon > Fix For: 0.21.0 > > > Requiring C++11 support will provide substantial benefits to Mesos. > Most notably, the lack of lambda support has resulted in a proliferation of > continuation style functions scattered throughout the code. Having lambdas > will allow us to reduce this clutter and simplify the code. > This will require carefully documenting how to get Mesos compiling on various > systems to make this transition easy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-750) Require compilers that support c++11
[ https://issues.apache.org/jira/browse/MESOS-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131976#comment-14131976 ] Benjamin Mahler commented on MESOS-750: --- That sounds good, I'm more thinking about the case where a developer decides to write some code, and they're using, say, gcc 4.8.x. Since we've worked backwards from gcc 4.4 to the configure script, they won't know whether they used something unsupported in 4.4. Reviewbot would be a nice way to catch it, but I don't think there are any gcc 4.4 apache jenkins slaves currently. =/ We also have some legacy stuff that deals with specific older compiler versions: https://github.com/apache/mesos/blob/0.20.0/src/slave/constants.hpp#L31 Would we be bumping the minimum compiler version frequently? > Require compilers that support c++11 > > > Key: MESOS-750 > URL: https://issues.apache.org/jira/browse/MESOS-750 > Project: Mesos > Issue Type: Improvement > Components: technical debt >Reporter: Benjamin Mahler >Assignee: Dominic Hamon > Fix For: 0.21.0 > > > Requiring C++11 support will provide substantial benefits to Mesos. > Most notably, the lack of lambda support has resulted in a proliferation of > continuation style functions scattered throughout the code. Having lambdas > will allow us to reduce this clutter and simplify the code. > This will require carefully documenting how to get Mesos compiling on various > systems to make this transition easy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1783) MasterTest.LaunchDuplicateOfferTest is flaky
[ https://issues.apache.org/jira/browse/MESOS-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1783: --- Fix Version/s: 0.21.0 > MasterTest.LaunchDuplicateOfferTest is flaky > > > Key: MESOS-1783 > URL: https://issues.apache.org/jira/browse/MESOS-1783 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: ubuntu-14.04-gcc Jenkins VM >Reporter: Yan Xu >Assignee: Niklas Quarfot Nielsen > Fix For: 0.21.0 > > > {noformat:title=} > [ RUN ] MasterTest.LaunchDuplicateOfferTest > Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg' > I0909 22:46:59.212977 21883 leveldb.cpp:176] Opened db in 20.307533ms > I0909 22:46:59.219717 21883 leveldb.cpp:183] Compacted db in 6.470397ms > I0909 22:46:59.219925 21883 leveldb.cpp:198] Created db iterator in 5571ns > I0909 22:46:59.220100 21883 leveldb.cpp:204] Seeked to beginning of db in > 1365ns > I0909 22:46:59.220268 21883 leveldb.cpp:273] Iterated through 0 keys in the > db in 658ns > I0909 22:46:59.220448 21883 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0909 22:46:59.220855 21903 recover.cpp:425] Starting replica recovery > I0909 22:46:59.221103 21903 recover.cpp:451] Replica is in EMPTY status > I0909 22:46:59.221626 21903 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0909 22:46:59.221914 21903 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0909 22:46:59.04 21903 recover.cpp:542] Updating replica status to > STARTING > I0909 22:46:59.232590 21900 master.cpp:286] Master > 20140909-224659-16842879-44263-21883 (trusty) started on 127.0.1.1:44263 > I0909 22:46:59.233278 21900 master.cpp:332] Master only allowing > authenticated frameworks to register > I0909 22:46:59.233543 21900 master.cpp:337] Master only allowing > authenticated slaves to register > I0909 22:46:59.233934 21900 
credentials.hpp:36] Loading credentials for > authentication from > '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg/credentials' > I0909 22:46:59.236431 21900 master.cpp:366] Authorization enabled > I0909 22:46:59.237522 21898 hierarchical_allocator_process.hpp:299] > Initializing hierarchical allocator process with master : > master@127.0.1.1:44263 > I0909 22:46:59.237877 21904 master.cpp:120] No whitelist given. Advertising > offers for all slaves > I0909 22:46:59.238723 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 16.245391ms > I0909 22:46:59.238916 21903 replica.cpp:320] Persisted replica status to > STARTING > I0909 22:46:59.239203 21903 recover.cpp:451] Replica is in STARTING status > I0909 22:46:59.239724 21903 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0909 22:46:59.239967 21903 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0909 22:46:59.240304 21903 recover.cpp:542] Updating replica status to VOTING > I0909 22:46:59.240684 21900 master.cpp:1212] The newly elected leader is > master@127.0.1.1:44263 with id 20140909-224659-16842879-44263-21883 > I0909 22:46:59.240846 21900 master.cpp:1225] Elected as the leading master! 
> I0909 22:46:59.241149 21900 master.cpp:1043] Recovering from registrar > I0909 22:46:59.241509 21898 registrar.cpp:313] Recovering registrar > I0909 22:46:59.248440 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 7.864221ms > I0909 22:46:59.248644 21903 replica.cpp:320] Persisted replica status to > VOTING > I0909 22:46:59.248846 21903 recover.cpp:556] Successfully joined the Paxos > group > I0909 22:46:59.249330 21897 log.cpp:656] Attempting to start the writer > I0909 22:46:59.249809 21897 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0909 22:46:59.250075 21903 recover.cpp:440] Recover process terminated > I0909 22:46:59.258286 21897 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 8.292514ms > I0909 22:46:59.258489 21897 replica.cpp:342] Persisted promised to 1 > I0909 22:46:59.258848 21897 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0909 22:46:59.259454 21897 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0909 22:46:59.267755 21897 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 8.109338ms > I0909 22:46:59.267916 21897 replica.cpp:676] Persisted action at 0 > I0909 22:46:59.270128 21902 replica.cpp:508] Replica received write request > for position 0 > I0909 22:46:59.270294 21902 leveldb.cpp:438] Reading position from leveldb > took 27443ns > I0909 22:46:59.277220 21902 leveldb.cpp:343] Persisting action (14 bytes) to > leveldb
[jira] [Commented] (MESOS-1783) MasterTest.LaunchDuplicateOfferTest is flaky
[ https://issues.apache.org/jira/browse/MESOS-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132235#comment-14132235 ] Benjamin Mahler commented on MESOS-1783: {noformat} commit d6c1ef6842b70af068ba14896693266ed6067724 Author: Niklas Nielsen Date: Fri Sep 12 14:40:54 2014 -0700 Fixed flaky MasterTest.LaunchDuplicateOfferTest. A couple of races could occur in the "launch tasks on multiple offers" tests where recovered resources from purposely-failed invocations turned into a subsequent resource offer and oversaturated the expect's. Review: https://reviews.apache.org/r/25588 {noformat} > MasterTest.LaunchDuplicateOfferTest is flaky > > > Key: MESOS-1783 > URL: https://issues.apache.org/jira/browse/MESOS-1783 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: ubuntu-14.04-gcc Jenkins VM >Reporter: Yan Xu >Assignee: Niklas Quarfot Nielsen > Fix For: 0.21.0 > > > {noformat:title=} > [ RUN ] MasterTest.LaunchDuplicateOfferTest > Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg' > I0909 22:46:59.212977 21883 leveldb.cpp:176] Opened db in 20.307533ms > I0909 22:46:59.219717 21883 leveldb.cpp:183] Compacted db in 6.470397ms > I0909 22:46:59.219925 21883 leveldb.cpp:198] Created db iterator in 5571ns > I0909 22:46:59.220100 21883 leveldb.cpp:204] Seeked to beginning of db in > 1365ns > I0909 22:46:59.220268 21883 leveldb.cpp:273] Iterated through 0 keys in the > db in 658ns > I0909 22:46:59.220448 21883 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0909 22:46:59.220855 21903 recover.cpp:425] Starting replica recovery > I0909 22:46:59.221103 21903 recover.cpp:451] Replica is in EMPTY status > I0909 22:46:59.221626 21903 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0909 22:46:59.221914 21903 recover.cpp:188] Received a recover response from > a replica in EMPTY status > 
I0909 22:46:59.04 21903 recover.cpp:542] Updating replica status to > STARTING > I0909 22:46:59.232590 21900 master.cpp:286] Master > 20140909-224659-16842879-44263-21883 (trusty) started on 127.0.1.1:44263 > I0909 22:46:59.233278 21900 master.cpp:332] Master only allowing > authenticated frameworks to register > I0909 22:46:59.233543 21900 master.cpp:337] Master only allowing > authenticated slaves to register > I0909 22:46:59.233934 21900 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg/credentials' > I0909 22:46:59.236431 21900 master.cpp:366] Authorization enabled > I0909 22:46:59.237522 21898 hierarchical_allocator_process.hpp:299] > Initializing hierarchical allocator process with master : > master@127.0.1.1:44263 > I0909 22:46:59.237877 21904 master.cpp:120] No whitelist given. Advertising > offers for all slaves > I0909 22:46:59.238723 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 16.245391ms > I0909 22:46:59.238916 21903 replica.cpp:320] Persisted replica status to > STARTING > I0909 22:46:59.239203 21903 recover.cpp:451] Replica is in STARTING status > I0909 22:46:59.239724 21903 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0909 22:46:59.239967 21903 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0909 22:46:59.240304 21903 recover.cpp:542] Updating replica status to VOTING > I0909 22:46:59.240684 21900 master.cpp:1212] The newly elected leader is > master@127.0.1.1:44263 with id 20140909-224659-16842879-44263-21883 > I0909 22:46:59.240846 21900 master.cpp:1225] Elected as the leading master! 
> I0909 22:46:59.241149 21900 master.cpp:1043] Recovering from registrar > I0909 22:46:59.241509 21898 registrar.cpp:313] Recovering registrar > I0909 22:46:59.248440 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 7.864221ms > I0909 22:46:59.248644 21903 replica.cpp:320] Persisted replica status to > VOTING > I0909 22:46:59.248846 21903 recover.cpp:556] Successfully joined the Paxos > group > I0909 22:46:59.249330 21897 log.cpp:656] Attempting to start the writer > I0909 22:46:59.249809 21897 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0909 22:46:59.250075 21903 recover.cpp:440] Recover process terminated > I0909 22:46:59.258286 21897 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 8.292514ms > I0909 22:46:59.258489 21897 replica.cpp:342] Persisted promised to 1 > I0909 22:46:59.258848 21897 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0909 22:46:59.259454 21897 replica.cpp:375] Replica received explicit > promise req
[jira] [Commented] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132346#comment-14132346 ] Benjamin Mahler commented on MESOS-1786: https://reviews.apache.org/r/25604/ > FaultToleranceTest.ReconcilePendingTasks is flaky. > -- > > Key: MESOS-1786 > URL: https://issues.apache.org/jira/browse/MESOS-1786 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > {noformat} > [ RUN ] FaultToleranceTest.ReconcilePendingTasks > Using temporary directory > '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm' > I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms > I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms > I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns > I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in > 1781ns > I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the > db in 537ns > I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery > I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status > I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to > STARTING > I0910 20:18:02.321723 21652 master.cpp:286] Master > 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361 > I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing > authenticated frameworks to register > I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing > authenticated slaves to register > I0910 20:18:02.322568 21652 credentials.hpp:36] 
Loading credentials for > authentication from > '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials' > I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled > I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.781277ms > I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to > STARTING > I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status > I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING > I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] > Initializing hierarchical allocator process with master : > master@127.0.1.1:60361 > I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is > master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634 > I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising > offers for all slaves > I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No > resources available to allocate! > I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed > allocation for 0 slaves in 118308ns > I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master! 
> I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar > I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar > I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.887284299secs > I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to > VOTING > I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos > group > I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated > I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer > I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 9.044596ms > I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1 > I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to > leveldb took 18.839475ms > I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0 > I0910 20:18:04.251194 21650 replica.cpp:508] Replica received wr