[jira] [Updated] (MESOS-3272) CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer is flaky.

2015-11-30 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-3272:
---
Shepherd: Benjamin Mahler

Thanks [~qiujian], I will shepherd the fix.

> CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer is flaky.
> 
>
> Key: MESOS-3272
> URL: https://issues.apache.org/jira/browse/MESOS-3272
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Paul Brett
>Assignee: Jian Qiu
> Attachments: build.log
>
>
> Test aborts when configured with python, libevent and SSL on Ubuntu12.
> [ RUN  ] 
> CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer
> *** Aborted at 1439667937 (unix time) try "date -d @1439667937" if you are 
> using GNU date ***
> PC: @ 0x7feba972a753 (unknown)
> *** SIGSEGV (@0x0) received by PID 4359 (TID 0x7febabf897c0) from PID 0; 
> stack trace: ***
> @ 0x7feba8f7dcb0 (unknown)
> @ 0x7feba972a753 (unknown)
> @ 0x7febaaa69328 process::dispatch<>()
> @ 0x7febaaa5e9a7 cgroups::freezer::thaw()
> @   0xba64ff 
> mesos::internal::tests::CgroupsAnyHierarchyWithCpuMemoryTest_ROOT_CGROUPS_FreezeNonFreezer_Test::TestBody()
> @   0xc199a3 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc0f947 testing::Test::Run()
> @   0xc0f9ee testing::TestInfo::Run()
> @   0xc0faf5 testing::TestCase::Run()
> @   0xc0fda8 testing::internal::UnitTestImpl::RunAllTests()
> @   0xc10064 testing::UnitTest::Run()
> @   0x4b3273 main
> @ 0x7feba8bd176d (unknown)
> @   0x4bf1f1 (unknown)





[jira] [Updated] (MESOS-3273) EventCall Test Framework is flaky

2015-11-30 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-3273:
---
Target Version/s:   (was: 0.26.0)

> EventCall Test Framework is flaky
> -
>
> Key: MESOS-3273
> URL: https://issues.apache.org/jira/browse/MESOS-3273
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 0.24.0
> Environment: 
> https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull
>Reporter: Vinod Kone
>  Labels: flaky-test, tech-debt, twitter
>
> Observed this on ASF CI. h/t [~haosd...@gmail.com]
> Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master.
> {code}
> [ RUN  ] ExamplesTest.EventCallFramework
> Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx'
> I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the 
> driver is aborted!
> Shutting down
> Sending SIGTERM to process tree at pid 26061
> Killing the following process trees:
> [ 
> ]
> Shutting down
> Sending SIGTERM to process tree at pid 26062
> Shutting down
> Killing the following process trees:
> [ 
> ]
> Sending SIGTERM to process tree at pid 26063
> Killing the following process trees:
> [ 
> ]
> Shutting down
> Sending SIGTERM to process tree at pid 26098
> Killing the following process trees:
> [ 
> ]
> Shutting down
> Sending SIGTERM to process tree at pid 26099
> Killing the following process trees:
> [ 
> ]
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 
> 172.17.2.10:60249 for 16 cpus
> I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR
> I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0
> I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms
> I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms
> I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns
> I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in 
> 8429ns
> I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 4219ns
> I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery
> I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status
> I0813 19:55:17.181970 26126 master.cpp:378] Master 
> 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 
> 172.17.2.10:60249
> I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: 
> --acls="permissive: false
> register_frameworks {
>   principals {
> type: SOME
> values: "test-principal"
>   }
>   roles {
> type: SOME
> values: "*"
>   }
> }
> run_tasks {
>   principals {
> type: SOME
> values: "test-principal"
>   }
>   users {
> type: SOME
> values: "mesos"
>   }
> }
> " --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="false" 
> --authenticators="crammd5" 
> --credentials="/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials" 
> --framework_sorter="drf" --help="false" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" 
> --registry_strict="false" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.24.0/src/webui" --work_dir="/tmp/mesos-II8Gua" 
> --zk_session_timeout="10secs"
> I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated 
> frameworks to register
> I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated 
> slaves to register
> I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials'
> W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials 
> file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. 
> It is recommended that your credentials file is NOT accessible by others.
> I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' 
> authenticator
> I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL
> I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover respons

[jira] [Updated] (MESOS-4029) ContentType/SchedulerTest is flaky.

2015-11-30 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4029:
---
Shepherd: Benjamin Mahler
Assignee: Anand Mazumdar
 Summary: ContentType/SchedulerTest is flaky.  (was: 
ContentType/SchedulerTest seems flaky.)

Took a look. For the traces where the mock expectation crashes (traces 
containing UntypedInvokeWith), the issue appears to be that the stack objects 
in the tests are not destructed in a safe order.

Specifically, we pass a pointer to the {{Queue}} 
([here|https://github.com/apache/mesos/blob/0.26.0-rc2/src/tests/scheduler_tests.cpp#L147])
 into the expectations on {{Callbacks}} above. During destruction of the test, 
the {{Mesos}} object is destructed *after* the {{Queue}} has already been 
destructed. If a non-HEARTBEAT event arrives in this window, the expectation 
will try to dereference the destructed {{Queue}} object.
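
As a minimal, self-contained illustration of the destruction-order hazard 
(plain C++ with hypothetical names; this is *not* the actual test code):

{code}
#include <iostream>
#include <queue>

// Stand-in for the mocked callbacks: an expectation captures a raw pointer,
// analogous to passing &events into the expectations above.
struct Callbacks
{
  std::queue<int>* events = nullptr;
  void received(int e) { events->push(e); }  // dereferences the captured pointer
};

int main()
{
  Callbacks callbacks;      // declared first => destructed last
  std::queue<int> events;   // declared last  => destructed first
  callbacks.events = &events;

  callbacks.received(42);   // fine while both objects are alive
  std::cout << events.front() << std::endl;

  // If an asynchronous component (like the library object in the test) were
  // to call callbacks.received() after `events` is destroyed but before
  // `callbacks` is, it would dereference a dangling pointer -- the crash
  // described above.
  return 0;
}
{code}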

> ContentType/SchedulerTest is flaky.
> ---
>
> Key: MESOS-4029
> URL: https://issues.apache.org/jira/browse/MESOS-4029
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
>Reporter: Till Toenshoff
>Assignee: Anand Mazumdar
>  Labels: flaky, flaky-test
>
> SSL build, [Ubuntu 
> 14.04|https://github.com/tillt/mesos-vagrant-ci/blob/master/ubuntu14/setup.sh],
>  non-root test run.
> {noformat}
> [--] 22 tests from ContentType/SchedulerTest
> [ RUN  ] ContentType/SchedulerTest.Subscribe/0
> [   OK ] ContentType/SchedulerTest.Subscribe/0 (48 ms)
> *** Aborted at 1448928007 (unix time) try "date -d @1448928007" if you are 
> using GNU date ***
> [ RUN  ] ContentType/SchedulerTest.Subscribe/1
> PC: @  0x1451b8e 
> testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
> *** SIGSEGV (@0x10030) received by PID 21320 (TID 0x2b549e5d4700) from 
> PID 48; stack trace: ***
> @ 0x2b54c95940b7 os::Linux::chained_handler()
> @ 0x2b54c9598219 JVM_handle_linux_signal
> @ 0x2b5496300340 (unknown)
> @  0x1451b8e 
> testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
> @   0xe2ea6d 
> _ZN7testing8internal18FunctionMockerBaseIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeIS6_SaIS6_E10InvokeWithERKSt5tupleIJSC_EE
> @   0xe2b1bc testing::internal::FunctionMocker<>::Invoke()
> @  0x1118aed 
> mesos::internal::tests::SchedulerTest::Callbacks::received()
> @  0x111c453 
> _ZNKSt7_Mem_fnIMN5mesos8internal5tests13SchedulerTest9CallbacksEFvRKSt5queueINS0_2v19scheduler5EventESt5dequeIS8_SaIS8_EclIJSE_EvEEvRS4_DpOT_
> @  0x111c001 
> _ZNSt5_BindIFSt7_Mem_fnIMN5mesos8internal5tests13SchedulerTest9CallbacksEFvRKSt5queueINS1_2v19scheduler5EventESt5dequeIS9_SaIS9_ESt17reference_wrapperIS5_ESt12_PlaceholderILi16__callIvJSF_EJLm0ELm1T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @  0x111b90d 
> _ZNSt5_BindIFSt7_Mem_fnIMN5mesos8internal5tests13SchedulerTest9CallbacksEFvRKSt5queueINS1_2v19scheduler5EventESt5dequeIS9_SaIS9_ESt17reference_wrapperIS5_ESt12_PlaceholderILi1clIJSF_EvEET0_DpOT_
> @  0x111ae09 std::_Function_handler<>::_M_invoke()
> @ 0x2b5493c6da09 std::function<>::operator()()
> @ 0x2b5493c688ee process::AsyncExecutorProcess::execute<>()
> @ 0x2b5493c6db2a 
> _ZZN7process8dispatchI7NothingNS_20AsyncExecutorProcessERKSt8functionIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeIS8_SaIS8_ESC_PvSG_SC_SJ_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSO_FSL_T1_T2_T3_ET4_T5_T6_ENKUlPNS_11ProcessBaseEE_clES11_
> @ 0x2b5493c765a4 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingNS0_20AsyncExecutorProcessERKSt8functionIFvRKSt5queueIN5mesos2v19scheduler5EventESt5dequeISC_SaISC_ESG_PvSK_SG_SN_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSS_FSP_T1_T2_T3_ET4_T5_T6_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2b54946b1201 std::function<>::operator()()
> @ 0x2b549469960f process::ProcessBase::visit()
> @ 0x2b549469d480 process::DispatchEvent::visit()
> @   0x9dc0ba process::ProcessBase::serve()
> @ 0x2b54946958cc process::ProcessManager::resume()
> @ 0x2b5494692a9c 
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x2b549469ccac 
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x2b549469cc5c 
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x2b549469cbee 
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invo

[jira] [Created] (MESOS-4064) Add ContainerInfo to internal Task protobuf.

2015-12-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4064:
--

 Summary: Add ContainerInfo to internal Task protobuf.
 Key: MESOS-4064
 URL: https://issues.apache.org/jira/browse/MESOS-4064
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler


In what seems like an oversight, when ContainerInfo was added to TaskInfo, it 
was not added to our internal Task protobuf.

Also, unlike the agent, it appears that the master does not use 
protobuf::createTask. We should try to remove the manual construction in the 
master in favor of construction through protobuf::createTask.





[jira] [Created] (MESOS-4066) Expose when agent is recovering in the agent's /state.json endpoint.

2015-12-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4066:
--

 Summary: Expose when agent is recovering in the agent's 
/state.json endpoint.
 Key: MESOS-4066
 URL: https://issues.apache.org/jira/browse/MESOS-4066
 Project: Mesos
  Issue Type: Task
  Components: slave
Reporter: Benjamin Mahler


Currently, when a user hits /state.json on the agent, it may return partial 
state if the agent has failed over and is still recovering. There is currently 
no clear way to tell from the response that this is the case, so the user may 
incorrectly conclude that the agent has no tasks.

We could consider exposing the 'state' enum of the agent in the endpoint:

{code}
  enum State
  {
    RECOVERING,   // Slave is doing recovery.
    DISCONNECTED, // Slave is not connected to the master.
    RUNNING,      // Slave has (re-)registered.
    TERMINATING,  // Slave is shutting down.
  } state;
{code}

This may be a bit tricky to maintain in terms of backwards-compatibility of 
the endpoint if we were to alter this enum later.

Exposing this would allow users to be more informed about the state of the 
agent.
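
For illustration only (a self-contained sketch in plain C++ with hypothetical 
names, not the actual endpoint code), the mapping could be as simple as 
stringifying the enum into the response:

{code}
#include <iostream>
#include <string>

enum class State { RECOVERING, DISCONNECTED, RUNNING, TERMINATING };

std::string stringify(State state)
{
  switch (state) {
    case State::RECOVERING:   return "RECOVERING";
    case State::DISCONNECTED: return "DISCONNECTED";
    case State::RUNNING:      return "RUNNING";
    case State::TERMINATING:  return "TERMINATING";
  }
  return "UNKNOWN";
}

int main()
{
  // A client could then check for "RECOVERING" before trusting the task list.
  std::cout << "{\"state\": \"" << stringify(State::RECOVERING) << "\"}"
            << std::endl;
  return 0;
}
{code}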





[jira] [Commented] (MESOS-4024) HealthCheckTest.CheckCommandTimeout flaky

2015-12-03 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039701#comment-15039701
 ] 

Benjamin Mahler commented on MESOS-4024:


[~haosd...@gmail.com] please do not link to Jenkins, the data gets garbage 
collected.

Do you have logs for this that you can paste in the ticket?

> HealthCheckTest.CheckCommandTimeout flaky
> -
>
> Key: MESOS-4024
> URL: https://issues.apache.org/jira/browse/MESOS-4024
> Project: Mesos
>  Issue Type: Bug
>Reporter: haosdent
>Assignee: haosdent
> Attachments: HealthCheckTest_CheckCommandTimeout.log
>
>
> https://builds.apache.org/job/Mesos/1288/COMPILER=gcc,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleText





[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-12-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042325#comment-15042325
 ] 

Benjamin Mahler commented on MESOS-3851:


The fix is committed; it would be great to re-enable the CHECKs in order to 
detect this issue should it still be present:

{noformat}
commit 4201c2c3e5849a01d0a63769404bad03792ae5de
Author: Anand Mazumdar 
Date:   Fri Dec 4 14:15:26 2015 -0800

Linked against the executor in the agent to ensure ordered message delivery.

Previously, we did not `link` against the executor `PID` while
(re)-registering. This might lead to libprocess creating ephemeral
sockets everytime a `send` was invoked. This was leading to races
where messages might appear on the Executor out of order. This change
does a `link` on the executor PID to ensure ordered message delivery.

Review: https://reviews.apache.org/r/40660
{noformat}

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to 
> support rootfs. There seem to be some tests showing frequent crashes due to 
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> https://issues.apache.org/jira/browse/MESOS-

[jira] [Commented] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover

2015-12-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047385#comment-15047385
 ] 

Benjamin Mahler commented on MESOS-4048:


This ticket is independent of MESOS-4049 in that it is discussing the current 
inconsistent approaches to agent partition handling (cases 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to 
determine when an agent should be removed, but there were some implementation 
difficulties that led to the addition of {{--slave_reregistration_timer}} 
instead. This approach is a bit scary because we may remove healthy agents that 
for some reason (e.g. ZK connectivity issues) could not re-register with the 
master after master failover. This was why we put in place some safety nets 
({{--recovery_slave_removal_limit}}, and we were also able to re-use the 
removal rate limiting).

The point of this ticket is to look into removing 
{{--slave_reregistration_timer}} entirely and have the master perform the same 
health check based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy (e.g. 
partitioned). This ticket is about *how* we determine that an agent is 
unhealthy (e.g. partitioned). Specifically, we want to determine it in a 
consistent way rather than having one approach in steady state and a different 
approach after master failover.

Make sense?

> Consider unifying slave timeout behavior between steady state and master 
> failover
> -
>
> Key: MESOS-4048
> URL: https://issues.apache.org/jira/browse/MESOS-4048
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Neil Conway
>Assignee: Anindya Sinha
>Priority: Minor
>  Labels: mesosphere
>
> Currently, there are two timeouts that control what happens when an agent is 
> partitioned from the master:
> 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the 
> master waits before declaring a slave to be dead in the "steady state"
> 2. {{slave_reregister_timeout}} controls how long the master waits for a 
> slave to reregister after master failover.
> It is unclear whether these two cases really merit being treated differently 
> -- it might be simpler for operators to configure a single timeout that 
> controls how long the master waits before declaring that a slave is dead.





[jira] [Comment Edited] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover

2015-12-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047385#comment-15047385
 ] 

Benjamin Mahler edited comment on MESOS-4048 at 12/8/15 8:06 PM:
-

This ticket is independent of MESOS-4049 in that it is discussing the current 
inconsistent approaches to agent partition detection (cases 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to 
determine when an agent is unhealthy, but there were some implementation 
difficulties that led to the addition of {{\-\-slave_reregistration_timer}} 
instead. This approach is a bit scary because we may remove healthy agents that 
for some reason (e.g. ZK connectivity issues) could not re-register with the 
master after master failover. This was why we put in place some safety nets 
({{\-\-recovery_slave_removal_limit}}, and we were also able to re-use the 
removal rate limiting).

The point of this ticket is to look into removing 
{{\-\-slave_reregistration_timer}} entirely and have the master perform the 
same health check based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy. This ticket is 
about *how* we determine that an agent is unhealthy. Specifically, we want to 
determine it in a consistent way rather than having one approach in steady 
state and a different approach after master failover.

Make sense?


was (Author: bmahler):
This ticket is independent of MESOS-4049 in that it is discussing the current 
inconsistent approaches to agent partition handling (cases 1 and 2 above).

When we were implementing master recovery, we wanted to use health checking to 
determine when an agent should be removed, but there were some implementation 
difficulties that led to the addition of {{--slave_reregistration_timer}} 
instead. This approach is a bit scary because we may remove healthy agents that 
for some reason (e.g. ZK connectivity issues) could not re-register with the 
master after master failover. This was why we put in place some safety nets 
({{--recovery_slave_removal_limit}}, and we were also able to re-use the 
removal rate limiting).

The point of this ticket is to look into removing 
{{--slave_reregistration_timer}} entirely and have the master perform the same 
health check based partition detection that it does in the steady state.

So, MESOS-4049 is about what we do *when* an agent is unhealthy (e.g. 
partitioned). This ticket is about *how* we determine that an agent is 
unhealthy (e.g. partitioned). Specifically, we want to determine it in a 
consistent way rather than having one approach in steady state and a different 
approach after master failover.

Make sense?

> Consider unifying slave timeout behavior between steady state and master 
> failover
> -
>
> Key: MESOS-4048
> URL: https://issues.apache.org/jira/browse/MESOS-4048
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Neil Conway
>Assignee: Anindya Sinha
>Priority: Minor
>  Labels: mesosphere
>
> Currently, there are two timeouts that control what happens when an agent is 
> partitioned from the master:
> 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the 
> master waits before declaring a slave to be dead in the "steady state"
> 2. {{slave_reregister_timeout}} controls how long the master waits for a 
> slave to reregister after master failover.
> It is unclear whether these two cases really merit being treated differently 
> -- it might be simpler for operators to configure a single timeout that 
> controls how long the master waits before declaring that a slave is dead.





[jira] [Created] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

2015-12-09 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4106:
--

 Summary: The health checker may fail to inform the executor to 
kill an unhealthy task after max_consecutive_failures.
 Key: MESOS-4106
 URL: https://issues.apache.org/jira/browse/MESOS-4106
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0, 0.22.2, 0.22.1, 
0.21.2, 0.21.1, 0.20.1, 0.20.0
Reporter: Benjamin Mahler
Priority: Blocker


This was reported by [~tan] experimenting with health checks. Many tasks were 
launched with the following health check, taken from the container 
stdout/stderr:

{code}
Launching health check process: /usr/local/libexec/mesos/mesos-health-check 
--executor=(1)@127.0.0.1:39629 
--health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
 --task_id=sleepy-2
{code}

This should have led to all tasks getting killed due to 
{{\-\-consecutive_failures}} being set; however, only some tasks get killed, 
while others remain running.

It turns out that the health check binary does a {{send}} and promptly exits. 
Unfortunately, this may lead to a message drop since libprocess may not have 
sent this message over the socket by the time the process exits.

We work around this in the command executor with a manual sleep, which has been 
around since the svn days. See 
[here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
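
For illustration (a minimal, self-contained sketch in plain C++ with made-up 
names; not the health-check or executor code): the message is handed to an 
asynchronous I/O layer, so exiting immediately can drop it, and the crude 
workaround is to sleep briefly before exiting.

{code}
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Simulates an asynchronous transport: the actual "socket write" happens
// later, on another thread, much like a send() that only enqueues a message.
void asyncSend(const std::string& message)
{
  std::thread([message]() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::cout << "flushed: " << message << std::endl;
  }).detach();
}

int main()
{
  asyncSend("kill unhealthy task");

  // Without this sleep the process may exit before the I/O thread runs and
  // the message is silently dropped -- the failure mode described above.
  std::this_thread::sleep_for(std::chrono::seconds(1));
  return 0;
}
{code}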





[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

2015-12-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049651#comment-15049651
 ] 

Benjamin Mahler commented on MESOS-4106:


This is also possibly the reason for MESOS-1613.

> The health checker may fail to inform the executor to kill an unhealthy task 
> after max_consecutive_failures.
> 
>
> Key: MESOS-4106
> URL: https://issues.apache.org/jira/browse/MESOS-4106
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 
> 0.23.1, 0.24.0, 0.24.1, 0.25.0
>Reporter: Benjamin Mahler
>Priority: Blocker
>
> This was reported by [~tan] experimenting with health checks. Many tasks were 
> launched with the following health check, taken from the container 
> stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check 
> --executor=(1)@127.0.0.1:39629 
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
>  --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to 
> {{\-\-consecutive_failures}} being set; however, only some tasks get killed, 
> while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. 
> Unfortunately, this may lead to a message drop since libprocess may not have 
> sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has 
> been around since the svn days. See 
> [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].





[jira] [Assigned] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

2015-12-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-4106:
--

Assignee: Benjamin Mahler

[~haosd...@gmail.com]: From my testing so far, yes. I will send a fix and 
re-enable the test from MESOS-1613.

> The health checker may fail to inform the executor to kill an unhealthy task 
> after max_consecutive_failures.
> 
>
> Key: MESOS-4106
> URL: https://issues.apache.org/jira/browse/MESOS-4106
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 
> 0.23.1, 0.24.0, 0.24.1, 0.25.0
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> This was reported by [~tan] experimenting with health checks. Many tasks were 
> launched with the following health check, taken from the container 
> stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check 
> --executor=(1)@127.0.0.1:39629 
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
>  --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to 
> {{\-\-consecutive_failures}} being set; however, only some tasks get killed, 
> while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. 
> Unfortunately, this may lead to a message drop since libprocess may not have 
> sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has 
> been around since the svn days. See 
> [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].





[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2015-12-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049850#comment-15049850
 ] 

Benjamin Mahler commented on MESOS-1613:


For posterity, I also wasn't able to reproduce this by just running the test 
repeatedly. However, when I ran one {{openssl speed}} per core on my laptop in 
order to induce load, I could reproduce it easily. We probably want to direct 
folks to try this when they are having trouble reproducing something flaky 
from CI.

I will post a fix through MESOS-4106.

> HealthCheckTest.ConsecutiveFailures is flaky
> 
>
> Key: MESOS-1613
> URL: https://issues.apache.org/jira/browse/MESOS-1613
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: Ubuntu 10.04 GCC
>Reporter: Vinod Kone
>Assignee: Timothy Chen
>  Labels: flaky, mesosphere
>
> {code}
> [ RUN  ] HealthCheckTest.ConsecutiveFailures
> Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
> I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
> I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
> I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
> I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
> 2125ns
> I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 10747ns
> I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
> I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
> I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
> STARTING
> I0717 04:39:59.301985  5031 master.cpp:288] Master 
> 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
> I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
> authenticated frameworks to register
> I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
> authenticated slaves to register
> I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
> I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
> I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:40280
> I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
> offers for all slaves
> I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.325097ms
> I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
> STARTING
> I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
> master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
> I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
> I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
> I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
> I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
> I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
> I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.204157ms
> I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
> VOTING
> I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
> group
> I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
> I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
> I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.271822ms
> I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
> I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
> leve

[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

2015-12-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049889#comment-15049889
 ] 

Benjamin Mahler commented on MESOS-4106:


Yeah I had the same thought when I was looking at MESOS-243, but now we also 
have process::finalize that could be the mechanism for cleanly shutting down 
before {{exit}} calls. I'll file a ticket to express this issue more generally 
(MESOS-243 was the original but is specific to the executor driver).

> The health checker may fail to inform the executor to kill an unhealthy task 
> after max_consecutive_failures.
> 
>
> Key: MESOS-4106
> URL: https://issues.apache.org/jira/browse/MESOS-4106
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 
> 0.23.1, 0.24.0, 0.24.1, 0.25.0
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> This was reported by [~tan] experimenting with health checks. Many tasks were 
> launched with the following health check, taken from the container 
> stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check 
> --executor=(1)@127.0.0.1:39629 
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
>  --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to 
> {{\-\-consecutive_failures}} being set; however, only some tasks get killed, 
> while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. 
> Unfortunately, this may lead to a message drop since libprocess may not have 
> sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has 
> been around since the svn days. See 
> [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].





[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

2015-12-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049899#comment-15049899
 ] 

Benjamin Mahler commented on MESOS-4106:


Yeah, I'll reference MESOS-4111 now that we have it; I'll also reference it in 
the existing command executor sleep.

> The health checker may fail to inform the executor to kill an unhealthy task 
> after max_consecutive_failures.
> 
>
> Key: MESOS-4106
> URL: https://issues.apache.org/jira/browse/MESOS-4106
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 
> 0.23.1, 0.24.0, 0.24.1, 0.25.0
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> This was reported by [~tan] experimenting with health checks. Many tasks were 
> launched with the following health check, taken from the container 
> stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check 
> --executor=(1)@127.0.0.1:39629 
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
>  --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to 
> {{\-\-consecutive_failures}} being set; however, only some tasks get killed, 
> while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. 
> Unfortunately, this may lead to a message drop since libprocess may not have 
> sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has 
> been around since the svn days. See 
> [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].





[jira] [Created] (MESOS-4111) Provide a means for libprocess users to exit while ensuring messages are flushed.

2015-12-09 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4111:
--

 Summary: Provide a means for libprocess users to exit while 
ensuring messages are flushed.
 Key: MESOS-4111
 URL: https://issues.apache.org/jira/browse/MESOS-4111
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler
Priority: Minor


Currently, after a {{send}} there is no way to ensure that the message has 
been flushed to the socket before terminating. We work around this by 
inserting {{os::sleep}} calls (see MESOS-243, MESOS-4106).

There are a number of approaches to this:

(1) Return a Future from send that notifies when the message is flushed from 
the system.

(2) Call process::finalize before exiting. This would require that 
process::finalize flushes all of the outstanding data on any active sockets, 
which may block.

Regardless of the approach, there needs to be a timer if we want to guarantee 
termination.
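
As a rough sketch of approach (1) with a bounded wait (plain standard C++, not 
libprocess; the real API and names would differ): the sender gets a future 
that is satisfied once the message is flushed, and waits on it with a timeout 
so that termination is still guaranteed.

{code}
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main()
{
  std::promise<void> flushed;                    // completed by the I/O layer
  std::future<void> done = flushed.get_future(); // what send() would return

  // Simulate the transport flushing the outgoing message asynchronously.
  std::thread io([&flushed]() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    flushed.set_value();
  });

  // Bounded wait: we still terminate after the timeout, per the note above.
  if (done.wait_for(std::chrono::seconds(5)) == std::future_status::ready) {
    std::cout << "message flushed; safe to exit" << std::endl;
  } else {
    std::cout << "timed out; exiting anyway" << std::endl;
  }

  io.join();
  return 0;
}
{code}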





[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

2015-12-10 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051371#comment-15051371
 ] 

Benjamin Mahler commented on MESOS-4106:


I'm not sure we should say sleeping provides a "very weak guarantee"; there is 
indeed *no guarantee* with a sleep that the message is sent.

The approach you've suggested of querying with a timeout still provides no 
guarantee, unless you are going to wait indefinitely or use the timeout to 
trigger a retry rather than an exit (what did you intend to happen after the 
timeout?). This approach amounts to guaranteeing application-level delivery, 
and we generally just use an "acknowledgement" message with retries to do 
that, rather than a separate query.

However, since the executor resides on the same machine, and executor failover 
is not supported, we're unlikely to bother implementing acknowledgements with 
retries here. We only need to wait for the data to be sent on the socket (this 
gives a "weak guarantee": e.g. if there are no socket errors (note that both 
ends of the socket are within the same machine), and the executor remains up, 
the message will eventually be processed by the executor). MESOS-4111 discusses 
the general issue of being able to exit after ensuring that messages are 
processed in libprocess.

In the case of the long-standing command executor sleep, we needed to handle 
agent failure. So we are already using acknowledgements there, and can use them 
to {{stop()}} cleanly.

> The health checker may fail to inform the executor to kill an unhealthy task 
> after max_consecutive_failures.
> 
>
> Key: MESOS-4106
> URL: https://issues.apache.org/jira/browse/MESOS-4106
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 
> 0.23.1, 0.24.0, 0.24.1, 0.25.0
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
> Fix For: 0.27.0
>
>
> This was reported by [~tan] experimenting with health checks. Many tasks were 
> launched with the following health check, taken from the container 
> stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check 
> --executor=(1)@127.0.0.1:39629 
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
>  --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to 
> {{\-\-consecutive_failures}} being set; however, only some tasks get killed, 
> while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. 
> Unfortunately, this may lead to a message drop since libprocess may not have 
> sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has 
> been around since the svn days. See 
> [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].





[jira] [Assigned] (MESOS-4109) HTTPConnectionTest.ClosingResponse is flaky

2015-12-11 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-4109:
--

Assignee: Benjamin Mahler

Thanks for filing! I introduced this; I'll fix it shortly.

> HTTPConnectionTest.ClosingResponse is flaky
> ---
>
> Key: MESOS-4109
> URL: https://issues.apache.org/jira/browse/MESOS-4109
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
>Affects Versions: 0.26.0
> Environment: ASF Ubuntu 14 
> {{--enable-ssl --enable-libevent}}
>Reporter: Joseph Wu
>Assignee: Benjamin Mahler
>Priority: Minor
>  Labels: flaky, flaky-test, newbie, test
>
> Output of the test:
> {code}
> [ RUN  ] HTTPConnectionTest.ClosingResponse
> I1210 01:20:27.048532 26671 process.cpp:3077] Handling HTTP event for process 
> '(22)' with path: '/(22)/get'
> ../../../3rdparty/libprocess/src/tests/http_tests.cpp:919: Failure
> Actual function call count doesn't match EXPECT_CALL(*http.process, get(_))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> [  FAILED  ] HTTPConnectionTest.ClosingResponse (43 ms)
> {code}





[jira] [Updated] (MESOS-920) Set GLOG_drop_log_memory=false in environment prior to logging initialization.

2015-12-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-920:
--
Description: 
We've observed issues where the masters are slow to respond. Two perf traces 
collected while the masters were in this state:

{noformat}
 25.84%  [kernel][k] default_send_IPI_mask_sequence_phys
 20.44%  [kernel][k] native_write_msr_safe
  4.54%  [kernel][k] _raw_spin_lock
  2.95%  libc-2.5.so [.] _int_malloc
  1.82%  libc-2.5.so [.] malloc
  1.55%  [kernel][k] apic_timer_interrupt
  1.36%  libc-2.5.so [.] _int_free
{noformat}

{noformat}
 29.03%  [kernel][k] default_send_IPI_mask_sequence_phys
  9.64%  [kernel][k] _raw_spin_lock
  7.38%  [kernel][k] native_write_msr_safe
  2.43%  libc-2.5.so [.] _int_malloc
  2.05%  libc-2.5.so [.] _int_free
  1.67%  [kernel][k] apic_timer_interrupt
  1.58%  libc-2.5.so [.] malloc
{noformat}

These have been attributed to the posix_fadvise calls made by glog, which we 
can disable via the environment:

{noformat}
GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log contents. "
                 "Logs can grow very quickly and they are rarely read before they "
                 "need to be evicted from memory. Instead, drop them from memory "
                 "as soon as they are flushed to disk.");
{noformat}

{code}
if (FLAGS_drop_log_memory) {
  if (file_length_ >= logging::kPageSize) {
    // don't evict the most recent page
    uint32 len = file_length_ & ~(logging::kPageSize - 1);
    posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED);
  }
}
{code}

We should set GLOG_drop_log_memory=false prior to making our call to 
google::InitGoogleLogging, to avoid others running into this issue.
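
A minimal sketch of that change (assuming POSIX {{setenv}} and the stock glog 
API; the actual Mesos logging code may differ), deferring to any value the 
operator has already set:

{code}
#include <cstdlib>

#include <glog/logging.h>

void initializeLogging(const char* argv0)
{
  // Disable glog's posix_fadvise(POSIX_FADV_DONTNEED) behavior unless the
  // operator has explicitly set GLOG_drop_log_memory themselves.
  ::setenv("GLOG_drop_log_memory", "false", /* overwrite */ 0);

  google::InitGoogleLogging(argv0);
}

int main(int argc, char** argv)
{
  initializeLogging(argv[0]);
  LOG(INFO) << "Logging initialized with GLOG_drop_log_memory=false";
  return 0;
}
{code}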

  was:
We've observed performance scaling issues attributed to the posix_fadvise calls 
made by glog. This can currently only be disabled via the environment:

GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log 
contents. "
 "Logs can grow very quickly and they are rarely read before 
they "
 "need to be evicted from memory. Instead, drop them from 
memory "
 "as soon as they are flushed to disk.");


if (FLAGS_drop_log_memory) {
  if (file_length_ >= logging::kPageSize) {
// don't evict the most recent page
uint32 len = file_length_ & ~(logging::kPageSize - 1);
posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED);
  }
}

We should set GLOG_drop_log_memory=false prior to making our call to 
google::InitGoogleLogging.


> Set GLOG_drop_log_memory=false in environment prior to logging initialization.
> --
>
> Key: MESOS-920
> URL: https://issues.apache.org/jira/browse/MESOS-920
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt
>Affects Versions: 0.15.0, 0.16.0
>Reporter: Benjamin Mahler
>
> We've observed issues where the masters are slow to respond. Two perf traces 
> collected while the masters were slow to respond:
> {noformat}
>  25.84%  [kernel][k] default_send_IPI_mask_sequence_phys
>  20.44%  [kernel][k] native_write_msr_safe
>   4.54%  [kernel][k] _raw_spin_lock
>   2.95%  libc-2.5.so [.] _int_malloc
>   1.82%  libc-2.5.so [.] malloc
>   1.55%  [kernel][k] apic_timer_interrupt
>   1.36%  libc-2.5.so [.] _int_free
> {noformat}
> {noformat}
>  29.03%  [kernel][k] default_send_IPI_mask_sequence_phys
>   9.64%  [kernel][k] _raw_spin_lock
>   7.38%  [kernel][k] native_write_msr_safe
>   2.43%  libc-2.5.so [.] _int_malloc
>   2.05%  libc-2.5.so [.] _int_free
>   1.67%  [kernel][k] apic_timer_interrupt
>   1.58%  libc-2.5.so [.] malloc
> {noformat}
> These have been found to be attributed to the posix_fadvise calls made by 
> glog. We can disable these via the environment:
> {noformat}
> GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log contents. "
>  "Logs can grow very quickly and they are rarely read before they "
>  "need to be evicted from memory. Instead, drop them from memory "
>  "as soon as they are flushed to disk.");
> {noformat}
> {code}
> if (FLAGS_drop_log_memory) {
>   if (file_length_ >= logging::kPageSize) {
> // don't evict the most recent page
> uint32 len = file_length_ & ~(logging::kPageSize - 1);
> posix_

[jira] [Created] (MESOS-4258) Generate xml test reports in the jenkins build.

2015-12-29 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4258:
--

 Summary: Generate xml test reports in the jenkins build.
 Key: MESOS-4258
 URL: https://issues.apache.org/jira/browse/MESOS-4258
 Project: Mesos
  Issue Type: Task
  Components: test
Reporter: Benjamin Mahler


Google test has a flag for generating reports:
{{--gtest_output=xml:report.xml}}

Jenkins can display these reports via the xUnit plugin, which has support for 
google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin

This lets us quickly see which test failed, as well as the time that each test 
took to run.

We should wire this up. One difficulty is that 'make distclean' complains 
because the .xml files are left over (we could update distclean to wipe any 
.xml files within the test locations):

{noformat}
ERROR: files left in build directory after distclean:
./3rdparty/libprocess/3rdparty/report.xml
./3rdparty/libprocess/report.xml
./src/report.xml
make[1]: *** [distcleancheck] Error 1
{noformat}





[jira] [Updated] (MESOS-3831) Document operator HTTP endpoints

2016-01-04 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-3831:
---
Shepherd: Neil Conway
Assignee: Benjamin Mahler
  Sprint: Mesosphere Sprint 26

> Document operator HTTP endpoints
> 
>
> Key: MESOS-3831
> URL: https://issues.apache.org/jira/browse/MESOS-3831
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Benjamin Mahler
>Priority: Minor
>  Labels: documentation, mesosphere, newbie
>
> These are not exhaustively documented; they probably should be.
> Some endpoints have docs: e.g., {{/reserve}} and {{/unreserve}} are described 
> in the reservation doc page. But it would be good to have a single page that 
> lists all the endpoints and their semantics.





[jira] [Updated] (MESOS-4274) libprocess build fail with libhttp-parser >= 2.0

2016-01-04 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4274:
---
Shepherd: Benjamin Mahler
Assignee: Jocelyn De La Rosa

> libprocess build fail with libhttp-parser >= 2.0
> 
>
> Key: MESOS-4274
> URL: https://issues.apache.org/jira/browse/MESOS-4274
> Project: Mesos
>  Issue Type: Bug
>  Components: build, libprocess
>Affects Versions: 0.26.0
> Environment: debian 8 with package {{libhttp-parser-dev}} installed 
> and libprocess configured with {{--disable-bundled}}
>Reporter: Jocelyn De La Rosa
>Assignee: Jocelyn De La Rosa
>Priority: Minor
>  Labels: build-failure, compile-error, easyfix
> Fix For: 0.27.0
>
>
> Since mesos 0.26 libprocess does not compile if the libhttp-parser version is 
> >= 2.0.
> I traced back the issue to the commit {{d347bf56c807d}} that added URL to the 
> {{http::Request}} but forgot to modify the {{#if HTTP_PARSER_VERSION MAJORS 
> >=2}}  parts in {{3rdparty/libprocess/src/decoder.hpp}} 





[jira] [Commented] (MESOS-4006) add a resource offers metric

2016-01-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081959#comment-15081959
 ] 

Benjamin Mahler commented on MESOS-4006:


[~devroot] the master's metrics include {{"outstanding_offers"}}, which is a 
gauge of how many offers are currently outstanding. That is, they have been 
sent to a framework but the framework has not yet replied. Is this not what 
you're looking for?

> add a resource offers metric
> 
>
> Key: MESOS-4006
> URL: https://issues.apache.org/jira/browse/MESOS-4006
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: David Robinson
>
> Mesos doesn't provide a metric for monitoring offers being made, therefore 
> it's difficult to determine whether Mesos isn't making offers or if a 
> framework isn't receiving them.





[jira] [Updated] (MESOS-4042) LevelDBStateTest suite fails in virtual box shared folder.

2016-01-04 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4042:
---
Summary: LevelDBStateTest suite fails in virtual box shared folder.  (was: 
Complete LevelDBStateTest suite fails in optimized build)

[~bbannier] I've updated the summary to reflect that this is a virtualbox 
shared folder issue. Should we close this now that it's tracking the virtualbox 
issue?

> LevelDBStateTest suite fails in virtual box shared folder.
> --
>
> Key: MESOS-4042
> URL: https://issues.apache.org/jira/browse/MESOS-4042
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>
> Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a 
> ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared 
> folder fails with
> {code}
> [ RUN  ] LevelDBStateTest.FetchAndStoreAndFetch
> ../../src/tests/state_tests.cpp:90: Failure
> (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: 
> Invalid argument
> [  FAILED  ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms)
> [ RUN  ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch
> ../../src/tests/state_tests.cpp:120: Failure
> (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: 
> Invalid argument
> [  FAILED  ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms)
> [ RUN  ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch
> ../../src/tests/state_tests.cpp:156: Failure
> (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: 
> Invalid argument
> [  FAILED  ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms)
> [ RUN  ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch
> ../../src/tests/state_tests.cpp:198: Failure
> (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: 
> Invalid argument
> [  FAILED  ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms)
> [ RUN  ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge
> ../../src/tests/state_tests.cpp:233: Failure
> (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: 
> Invalid argument
> [  FAILED  ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms)
> [ RUN  ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch
> ../../src/tests/state_tests.cpp:264: Failure
> (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: 
> Invalid argument
> [  FAILED  ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms)
> [ RUN  ] LevelDBStateTest.Names
> ../../src/tests/state_tests.cpp:304: Failure
> (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: 
> Invalid argument
> [  FAILED  ] LevelDBStateTest.Names (10 ms)
> {code}
> The identical error occurs for a non-optimized build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4119) Add support for enabling --3way to apply-reviews.py.

2016-01-04 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4119:
---
Description: 
Currently, if {{git apply}} fails during apply-reviews, then the change must be 
rebased and re-uploaded to ReviewBoard in order for apply-reviews to succeed.

However, it is often the case that {{git apply --3way}} will succeed, since the 
blob information is included in the diff. Even if it doesn't succeed, it will 
leave conflict markers, which allows the committer to do a manual conflict 
resolution if desired, or to abort if conflict resolution is too difficult.

+1 [~hartem]! Updated the description, since I manually edit apply-reviews to 
use {{--3way}} all the time. :)

> Add support for enabling --3way to apply-reviews.py.
> 
>
> Key: MESOS-4119
> URL: https://issues.apache.org/jira/browse/MESOS-4119
> Project: Mesos
>  Issue Type: Task
>Reporter: Artem Harutyunyan
>  Labels: beginner, mesosphere, newbie
>
> Currently, if {{git apply}} fails during apply-reviews, then the change must 
> be rebased and re-uploaded to ReviewBoard in order for apply-reviews to 
> succeed.
> However, it is often the case that {{git apply --3way}} will succeed, since 
> the blob information is included in the diff. Even if it doesn't succeed, it 
> will leave conflict markers, which allows the committer to do a manual 
> conflict resolution if desired, or to abort if conflict resolution is too 
> difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4181) Change port range logging to different logging level.

2016-01-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082067#comment-15082067
 ] 

Benjamin Mahler commented on MESOS-4181:


It's trivial to move any LOG(INFO) to VLOG(1), but that doesn't necessarily mean 
we should do it! :)
[~js84] [~bernd-mesos] Did we completely lose the INFO logging about which 
resources are being recovered? Or was this a case of double logging?

The other location you listed is not the only other place we log resources; from 
a quick glance:
https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L3243-L3248
https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L6132-L6135
https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L6161-L6163
https://github.com/apache/mesos/blob/0.26.0/src/master/allocator/mesos/hierarchical.cpp#L400-L403
https://github.com/apache/mesos/blob/0.26.0/src/master/allocator/mesos/hierarchical.cpp#L344-L346
https://github.com/apache/mesos/blob/0.26.0/src/master/allocator/mesos/hierarchical.cpp#L513-L516

I don't see why we chose to move only this one instance to VLOG(1). Also, was 
there a reason we didn't just improve the text representation of single-item 
ranges? That is, {{\[1050-1050, 1092-1092, 1094-1094\]}} can be represented more 
succinctly as {{\[1050, 1092, 1094\]}}. Improving the representation would help 
all of the resource logging.
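
To make that concrete, a minimal sketch of such a representation (the function 
name and range type are assumptions, not the actual Mesos stringify code):

{code}
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: render closed port ranges, collapsing single-item
// ranges like [1050-1050] into [1050].
std::string stringifyRanges(
    const std::vector<std::pair<uint64_t, uint64_t>>& ranges)
{
  std::ostringstream out;
  out << "[";
  for (size_t i = 0; i < ranges.size(); ++i) {
    if (i > 0) {
      out << ", ";
    }
    if (ranges[i].first == ranges[i].second) {
      out << ranges[i].first;                            // e.g. "1050"
    } else {
      out << ranges[i].first << "-" << ranges[i].second; // e.g. "1025-2180"
    }
  }
  out << "]";
  return out.str();
}
{code}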

> Change port range logging to different logging level.
> -
>
> Key: MESOS-4181
> URL: https://issues.apache.org/jira/browse/MESOS-4181
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Cody Maloney
>Assignee: Joerg Schad
>  Labels: mesosphere, newbie
>
> Transforming from Mesos' internal port range representation to text is 
> non-linear in the number of bytes output. We end up with a massive amount of 
> log data like the following:
> {noformat}
> Dec 15 23:54:08 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 
> I1215 23:51:58.891165 15925 hierarchical.hpp:1103] Recovered cpus(*):1e-05; 
> mem(*):10; ports(*):[5565-5565] (total: ports(*):[1025-2180, 2182-3887, 
> 3889-5049, 5052-8079, 8082-8180, 8182-32000]; cpus(*):4; mem(*):14019; 
> disk(*):32541, allocated: cpus(*):0.01815; ports(*):[1050-1050, 1092-1092, 
> 1094-1094, 1129-1129, 1132-1132, 1140-1140, 1177-1178, 1180-1180, 1192-1192, 
> 1205-1205, 1221-1221, 1308-1308, 1311-1311, 1323-1323, 1326-1326, 1335-1335, 
> 1365-1365, 1404-1404, 1412-1412, 1436-1436, 1455-1455, 1459-1459, 1472-1472, 
> 1477-1477, 1482-1482, 1491-1491, 1510-1510, 1551-1551, 1553-1553, 1559-1559, 
> 1573-1573, 1590-1590, 1592-1592, 1619-1619, 1635-1636, 1678-1678, 1738-1738, 
> 1742-1742, 1752-1752, 1770-1770, 1780-1782, 1790-1790, 1792-1792, 1799-1799, 
> 1804-1804, 1844-1844, 1852-1852, 1867-1867, 1899-1899, 1936-1936, 1945-1945, 
> 1954-1954, 2046-2046, 2055-2055, 2063-2063, 2070-2070, 2089-2089, 2104-2104, 
> 2117-2117, 2132-2132, 2173-2173, 2178-2178, 2188-2188, 2200-2200, 2218-2218, 
> 2223-2223, 2244-2244, 2248-2248, 2250-2250, 2270-2270, 2286-2286, 2302-2302, 
> 2332-2332, 2377-2377, 2397-2397, 2423-2423, 2435-2435, 2442-2442, 2448-2448, 
> 2477-2477, 2482-2482, 2522-2522, 2586-2586, 2594-2594, 2600-2600, 2602-2602, 
> 2643-2643, 2648-2648, 2659-2659, 2691-2691, 2716-2716, 2739-2739, 2794-2794, 
> 2802-2802, 2823-2823, 2831-2831, 2840-2840, 2848-2848, 2876-2876, 2894-2895, 
> 2900-2900, 2904-2904, 2912-2912, 2983-2983, 2991-2991, 2999-2999, 3011-3011, 
> 3025-3025, 3036-3036, 3041-3041, 3051-3051, 3074-3074, 3097-3097, 3107-3107, 
> 3121-3121, 3171-3171, 3176-3176, 3195-3195, 3197-3197, 3210-3210, 3221-3221, 
> 3234-3234, 3245-3245, 3250-3251, 3255-3255, 3270-3270, 3293-3293, 3298-3298, 
> 3312-3312, 3318-3318, 3325-3325, 3368-3368, 3379-3379, 3391-3391, 3412-3412, 
> 3414-3414, 3420-3420, 3492-3492, 3501-3501, 3538-3538, 3579-3579, 3631-3631, 
> 3680-3680, 3684-3684, 3695-3695, 3699-3699, 3738-3738, 3758-3758, 3793-3793, 
> 3808-3808, 3817-3817, 3854-3854, 3856-3856, 3900-3900, 3906-3906, 3909-3909, 
> 3912-3912, 3946-3946, 3956-3956, 3959-3959, 3963-3963, 3974-
> Dec 15 23:54:09 ip-10-0-7-60.us-west-2.compute.internal mesos-master[15919]: 
> 3974, 3981-3981, 3985-3985, 4134-4134, 4178-4178, 4206-4206, 4223-4223, 
> 4239-4239, 4245-4245, 4251-4251, 4262-4263, 4271-4271, 4308-4308, 4323-4323, 
> 4329-4329, 4368-4368, 4385-4385, 4404-4404, 4419-4419, 4430-4430, 4448-4448, 
> 4464-4464, 4481-4481, 4494-4494, 4499-4499, 4510-4510, 4534-4534, 4543-4543, 
> 4555-4555, 4561-4562, 4577-4577, 4601-4601, 4675-4675, 4722-4722, 4739-4739, 
> 4748-4748, 4752-4752, 4764-4764, 4771-4771, 4787-4787, 4827-4827, 4830-4830, 
> 4837-4837, 4848-4848, 4853-4853, 4879-4879, 4883-4883, 4897-4897, 4902-4902, 
> 4911-4911, 4940-4940,

[jira] [Commented] (MESOS-4075) Continue test suite execution across crashing tests.

2016-01-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082126#comment-15082126
 ] 

Benjamin Mahler commented on MESOS-4075:


The tests shouldn't be crashing, though; why don't we focus on fixing the crashes 
instead? For example, we currently have memory management issues that cause test 
failures to unnecessarily crash the test binary. Consider [this 
snippet|https://github.com/apache/mesos/blob/0.26.0/src/tests/slave_tests.cpp#L256-L259],
 where we pass a pointer to a stack-allocated object into the slave / test 
abstractions; this means that if an assertion fails in the test, the code may 
crash!
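
As a minimal, self-contained illustration of that failure mode (the names below 
are invented for the sketch, not the actual slave_tests.cpp code): a fatal 
googletest assertion returns from the test body early, destroying stack-allocated 
state that a longer-lived component still points at.

{code}
#include <gtest/gtest.h>

// Hypothetical component standing in for the slave/test abstractions: it keeps
// a raw pointer to state it does not own.
struct Component
{
  int* counter = nullptr;
  void poke() { ++(*counter); }   // Dereferences whatever it was handed.
};

Component* component = nullptr;   // Outlives the test body (e.g. a running actor).

TEST(StackPointerExample, AssertionFailureLeavesDanglingPointer)
{
  int counter = 0;                // Stack-allocated.
  Component local;
  local.counter = &counter;
  component = &local;             // Something longer-lived now points at the stack.

  ASSERT_EQ(1, counter);          // Fails and returns from the test body early;
                                  // `counter` and `local` are destroyed while
                                  // `component` still points at them, so any
                                  // later poke() crashes the whole binary
                                  // instead of just failing this one test.
}
{code}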

> Continue test suite execution across crashing tests.
> 
>
> Key: MESOS-4075
> URL: https://issues.apache.org/jira/browse/MESOS-4075
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 0.26.0
>Reporter: Bernd Mathiske
>  Labels: mesosphere
>
> Currently, mesos-tests.sh exits when a test crashes. This is inconvenient 
> when trying to find out all of the tests that fail. 
> mesos-tests.sh should treat a test that crashes as failed and continue the 
> same way as if the test had merely returned a failure result and exited 
> properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4102) Quota doesn't allocate resources on slave joining

2016-01-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082141#comment-15082141
 ] 

Benjamin Mahler commented on MESOS-4102:


Linked in tickets with discussions about batch vs. event driven allocation.

> Quota doesn't allocate resources on slave joining
> -
>
> Key: MESOS-4102
> URL: https://issues.apache.org/jira/browse/MESOS-4102
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Neil Conway
>Assignee: Alexander Rukletsov
>  Labels: mesosphere, quota
> Attachments: quota_absent_framework_test-1.patch
>
>
> See attached patch. {{framework1}} is not allocated any resources, despite 
> the fact that the resources on {{agent2}} can safely be allocated to it 
> without risk of violating {{quota1}}. If I understand the intended quota 
> behavior correctly, this doesn't seem intended.
> Note that if the framework is added _after_ the slaves are added, the 
> resources on {{agent2}} are allocated to {{framework1}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4152) discarding a Future from process::Queue loses elements

2016-01-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082185#comment-15082185
 ] 

Benjamin Mahler commented on MESOS-4152:


To be accurate, it doesn't prevent the caller from *requesting a discard*. 
Preventing callers from requesting a discard (e.g. at compile time by 
introducing a Future without {{discard()}}, or at run time by providing a means 
to check for discard support) is orthogonal to this issue. I say that because, 
even if discard were a valid way to use process::Queue, when the caller requests 
a discard it must not assume the discard takes place. The caller must always 
check the state the future actually transitions to in order to determine whether 
it was discarded.
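
For example, a sketch of that caller-side contract (using the libprocess 
{{Future}}/{{Queue}} interfaces as I understand them; treat the exact calls as 
assumptions rather than a reference):

{code}
#include <process/future.hpp>
#include <process/queue.hpp>

using process::Future;
using process::Queue;

void example()
{
  Queue<int> queue;

  Future<int> item = queue.get();

  item.discard();   // Only *requests* a discard; it may or may not take effect.

  queue.put(42);

  // Check the state the future actually transitioned to, rather than assuming
  // the discard happened.
  item.onAny([](const Future<int>& f) {
    if (f.isReady()) {
      // The element arrived; the discard request did not take effect.
    } else if (f.isDiscarded()) {
      // The discard took effect (and, per this bug, the next element put into
      // the queue may be lost).
    }
  });
}
{code}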

> discarding a Future from process::Queue loses elements
> --
>
> Key: MESOS-4152
> URL: https://issues.apache.org/jira/browse/MESOS-4152
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: James Peach
>
> If you discard the Future you get from {{process::Queue}} the next element 
> inserted into the queue will be lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4302) Offer filter timeouts are ignored if the allocator is slow or backlogged.

2016-01-06 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4302:
--

 Summary: Offer filter timeouts are ignored if the allocator is 
slow or backlogged.
 Key: MESOS-4302
 URL: https://issues.apache.org/jira/browse/MESOS-4302
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Benjamin Mahler
Priority: Critical


Currently, when the allocator recovers resources from an offer, it creates a 
filter timeout based on the time at which the call is processed.

This means that if it takes longer than the filter duration for the allocator 
to perform an allocation for the relevant agent, then the filter is never 
applied.

This leads to pathological behavior: if the framework sets a filter duration 
that is smaller than the wall clock time it takes for us to perform the next 
allocation, then the filters will have no effect. This can mean that low-share 
frameworks may continue receiving offers that they have no intent to use, 
without other frameworks ever receiving these offers.

The workaround for this is for frameworks to set high filter durations and 
possibly revive offers when they need more resources; however, we should fix 
this issue in the allocator (i.e. derive the timeout deadlines and expiry 
based on allocation times).

This seems to warrant cherry-picking into bug fix releases for future versions.
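
One way to read that suggestion, as a rough standalone sketch in plain 
std::chrono rather than the actual allocator code:

{code}
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical filter: it should suppress offers for at least `duration` worth
// of allocation runs, not just `duration` of wall-clock time after the decline.
struct OfferFilter
{
  Clock::time_point expiry;

  bool isExpired(const Clock::time_point& allocationTime) const
  {
    return allocationTime >= expiry;
  }
};

// Current behavior as described: the expiry is anchored to the time the
// decline/recover call happens to be processed, so if the next allocation for
// the agent runs later than `duration`, the filter never applies.
OfferFilter filterFromCallTime(const Clock::duration& duration)
{
  return OfferFilter{Clock::now() + duration};
}

// Suggested direction: derive the expiry from allocation times, e.g. anchor it
// to the allocation run in which the recovered resources are next considered.
OfferFilter filterFromAllocationTime(
    const Clock::time_point& nextAllocation,
    const Clock::duration& duration)
{
  return OfferFilter{nextAllocation + duration};
}
{code}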



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4302) Offer filter timeouts are ignored if the allocator is slow or backlogged.

2016-01-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4302:
---
Description: 
Currently, when the allocator recovers resources from an offer, it creates a 
filter timeout based on time at which the call is processed.

This means that if it takes longer than the filter duration for the allocator 
to perform an allocation for the relevant agent, then the filter is never 
applied.

This leads to pathological behavior: if the framework sets a filter duration 
that is smaller than the wall clock time it takes for us to perform the next 
allocation, then the filters will have no effect. This can mean that low share 
frameworks may continue receiving offers that they have no intent to use, 
without other frameworks ever receiving these offers.

The workaround for this is for frameworks to set high filter durations, and 
possibly reviving offers when they need more resources, however, we should fix 
this issue in the allocator. (i.e. derive the timeout deadlines and expiry 
based on allocation times).

This seems to warrant cherry-picking into bug fix releases.

  was:
Currently, when the allocator recovers resources from an offer, it creates a 
filter timeout based on time at which the call is processed.

This means that if it takes longer than the filter duration for the allocator 
to perform an allocation for the relevant agent, then the filter is never 
applied.

This leads to pathological behavior: if the framework sets a filter duration 
that is smaller than the wall clock time it takes for us to perform the next 
allocation, then the filters will have no effect. This can mean that low share 
frameworks may continue receiving offers that they have no intent to use, 
without other frameworks ever receiving these offers.

The workaround for this is for frameworks to set high filter durations, and 
possibly reviving offers when they need more resources, however, we should fix 
this issue in the allocator. (i.e. derive the timeout deadlines and expiry 
based on allocation times).

This seems to warrant cherry-picking into bug fix releases for future versions.


> Offer filter timeouts are ignored if the allocator is slow or backlogged.
> -
>
> Key: MESOS-4302
> URL: https://issues.apache.org/jira/browse/MESOS-4302
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Priority: Critical
>  Labels: mesosphere
>
> Currently, when the allocator recovers resources from an offer, it creates a 
> filter timeout based on time at which the call is processed.
> This means that if it takes longer than the filter duration for the allocator 
> to perform an allocation for the relevant agent, then the filter is never 
> applied.
> This leads to pathological behavior: if the framework sets a filter duration 
> that is smaller than the wall clock time it takes for us to perform the next 
> allocation, then the filters will have no effect. This can mean that low 
> share frameworks may continue receiving offers that they have no intent to 
> use, without other frameworks ever receiving these offers.
> The workaround for this is for frameworks to set high filter durations, and 
> possibly reviving offers when they need more resources, however, we should 
> fix this issue in the allocator. (i.e. derive the timeout deadlines and 
> expiry based on allocation times).
> This seems to warrant cherry-picking into bug fix releases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4302) Offer filter timeouts are ignored if the allocator is slow or backlogged.

2016-01-07 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088592#comment-15088592
 ] 

Benjamin Mahler commented on MESOS-4302:


Hm, I don't understand the use case or what setting it specifically to 100 
milliseconds would accomplish. Is it that you don't want filtering at all? (If 
so, just set it to 0 seconds rather than 100 milliseconds.)

> Offer filter timeouts are ignored if the allocator is slow or backlogged.
> -
>
> Key: MESOS-4302
> URL: https://issues.apache.org/jira/browse/MESOS-4302
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Alexander Rukletsov
>Priority: Critical
>  Labels: mesosphere
>
> Currently, when the allocator recovers resources from an offer, it creates a 
> filter timeout based on time at which the call is processed.
> This means that if it takes longer than the filter duration for the allocator 
> to perform an allocation for the relevant agent, then the filter is never 
> applied.
> This leads to pathological behavior: if the framework sets a filter duration 
> that is smaller than the wall clock time it takes for us to perform the next 
> allocation, then the filters will have no effect. This can mean that low 
> share frameworks may continue receiving offers that they have no intent to 
> use, without other frameworks ever receiving these offers.
> The workaround for this is for frameworks to set high filter durations, and 
> possibly reviving offers when they need more resources, however, we should 
> fix this issue in the allocator. (i.e. derive the timeout deadlines and 
> expiry based on allocation times).
> This seems to warrant cherry-picking into bug fix releases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4258) Generate xml test reports in the jenkins build.

2016-01-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4258:
---
Shepherd: Benjamin Mahler

> Generate xml test reports in the jenkins build.
> ---
>
> Key: MESOS-4258
> URL: https://issues.apache.org/jira/browse/MESOS-4258
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Shuai Lin
>  Labels: newbie
>
> Google test has a flag for generating reports:
> {{--gtest_output=xml:report.xml}}
> Jenkins can display these reports via the xUnit plugin, which has support for 
> google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin
> This lets us quickly see which test failed, as well as the time that each 
> test took to run.
> We should wire this up. One difficulty is that 'make distclean' complains 
> because the .xml files are left over (we could update distclean to wipe any 
> .xml files within the test locations):
> {noformat}
> ERROR: files left in build directory after distclean:
> ./3rdparty/libprocess/3rdparty/report.xml
> ./3rdparty/libprocess/report.xml
> ./src/report.xml
> make[1]: *** [distcleancheck] Error 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.

2016-01-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090303#comment-15090303
 ] 

Benjamin Mahler commented on MESOS-4258:


Your patch is committed, so now the report files are generated. The next part 
is to process the reports in Jenkins. I think we'll want to use '[docker 
cp|https://docs.docker.com/engine/reference/commandline/cp/]' to copy the 
report files out of the container into the Jenkins workspace. This likely means 
removing {{--rm}} from our {{docker run}} invocation and placing the rm command 
within the EXIT trap. [~lins05] Can you do this next part as well?

> Generate xml test reports in the jenkins build.
> ---
>
> Key: MESOS-4258
> URL: https://issues.apache.org/jira/browse/MESOS-4258
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Shuai Lin
>  Labels: newbie
>
> Google test has a flag for generating reports:
> {{--gtest_output=xml:report.xml}}
> Jenkins can display these reports via the xUnit plugin, which has support for 
> google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin
> This lets us quickly see which test failed, as well as the time that each 
> test took to run.
> We should wire this up. One difficulty is that 'make distclean' complains 
> because the .xml files are left over (we could update distclean to wipe any 
> .xml files within the test locations):
> {noformat}
> ERROR: files left in build directory after distclean:
> ./3rdparty/libprocess/3rdparty/report.xml
> ./3rdparty/libprocess/report.xml
> ./src/report.xml
> make[1]: *** [distcleancheck] Error 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-486) TaskInfo should include a 'source' in order to enable getting resource monitoring statistics.

2016-01-11 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092945#comment-15092945
 ] 

Benjamin Mahler commented on MESOS-486:
---

{{ExecutorInfo.source}} was added before we started to use labels to provide 
customization. In retrospect, I imagine that we would have provided 
{{ExecutorInfo.labels}} instead of {{ExecutorInfo.source}}, which would have 
satisfied a wider set of monitoring use cases, as well as other customization 
needs outside of monitoring. My thought is that we should deprecate 
{{ExecutorInfo.source}} and add {{ExecutorInfo.labels}}, rather than introduce 
{{TaskInfo.source}}.

However, it's still not clear to me how to achieve standardized labels. For 
example, if an operator requires that all frameworks use a {{"source"}} label 
for monitoring purposes, it's more difficult to get users/frameworks to set 
this label consistently.
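
For context, this is the labels pattern that exists on {{TaskInfo}} today; the 
proposal above would extend the same repeated key/value structure to 
{{ExecutorInfo}}. A minimal sketch, where the "source" key is just an example 
convention, not something Mesos standardizes:

{code}
#include <mesos/mesos.hpp>

// Labels as they exist on TaskInfo today; the proposal above would add the
// same structure to ExecutorInfo. The "source" key below is only illustrative.
void tagTask(mesos::TaskInfo* task)
{
  mesos::Label* label = task->mutable_labels()->add_labels();
  label->set_key("source");
  label->set_value("my-monitoring-source");
}
{code}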

> TaskInfo should include a 'source' in order to enable getting resource 
> monitoring statistics.
> -
>
> Key: MESOS-486
> URL: https://issues.apache.org/jira/browse/MESOS-486
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Benjamin Hindman
>  Labels: twitter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4424) Initial support for GPU resources.

2016-01-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-4424:
--

 Summary: Initial support for GPU resources.
 Key: MESOS-4424
 URL: https://issues.apache.org/jira/browse/MESOS-4424
 Project: Mesos
  Issue Type: Epic
  Components: isolation
Reporter: Benjamin Mahler


Mesos already has generic mechanisms for expressing / isolating resources, and 
we'd like to expose GPUs as resources that can be consumed and isolated. 
However, GPUs present unique challenges:
* Users may rely on vendor-specific libraries to interact with the device (e.g. 
CUDA, HSA, etc.), while others may rely on portable libraries like OpenCL or 
OpenGL. These libraries need to be available from within the container.
* GPU hardware has many attributes that may impose scheduling constraints (e.g. 
core count, total memory, topology (via PCI-E, NVLINK, etc), driver versions, 
etc).
* Obtaining utilization information requires vendor-specific approaches.
* Isolated sharing of a GPU device requires vendor-specific approaches.

As such, the focus is on supporting a narrow initial use case: homogeneous 
device-level GPU support:
* Fractional sharing of GPU devices across containers will not be supported 
initially, unlike CPU cores.
* Heterogeneity will be supported via other means for now (e.g. using agent 
attributes to differentiate hardware profiles, using portable libraries like 
OpenCL, etc).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2262) Adding GPGPU resource into Mesos, so we can know if any GPU/Heterogeneous resource are available from slave

2016-01-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106106#comment-15106106
 ] 

Benjamin Mahler commented on MESOS-2262:


I've created an epic (MESOS-4424) to track initial support of GPU resources and 
added the watchers from this ticket. A design doc will be circulated for 
community feedback soon, looking forward to seeing feedback from folks 
interested!

> Adding GPGPU resource into Mesos, so we can know if any GPU/Heterogeneous 
> resource are available from slave
> ---
>
> Key: MESOS-2262
> URL: https://issues.apache.org/jira/browse/MESOS-2262
> Project: Mesos
>  Issue Type: Task
>  Components: slave
> Environment: OpenCL support env, such as OS X, Linux, Windows..
>Reporter: chester kuo
>Assignee: chester kuo
>Priority: Minor
>
> Extend Mesos to support heterogeneous resources such as GPGPUs and FPGAs as 
> computing resources in the data center. OpenCL will be the first target to add 
> to Mesos (it is supported by all major GPU vendors); support for others such 
> as CUDA is reserved for the future.
> In this feature, the slave will support resource discovery, including but not 
> limited to:
> (1) Heterogeneous computing programming model: "OpenCL", "CUDA", "HSA"
> (2) Computing global memory (MB)
> (3) Computing runtime version, such as "1.2", "2.0"
> (4) Computing compute units (double)
> (5) Computing device type: GPGPU, CPU, accelerator device
> (6) Computing (number of devices): (double)
> Heterogeneous resource isolation will be supported in the framework instead of 
> on the slave/device side. The major reason is the ecosystem: stacks such as 
> OpenCL operate on top of private device drivers owned by the vendors, and only 
> the runtime library (OpenCL) is a user-space application, so it's hard for us 
> to do CPU/memory-style resource isolation as with Linux cgroups. As a result, 
> we may use the runtime library to do device isolation and memory allocation.
> (PS: if anyone knows how to do this at the GPGPU driver level, please drop me 
> a note.)
> Meanwhile, some runtime libraries (such as OpenCL) support running on top of 
> the CPU, so we need to use the isolator API to signal this once it is 
> allocated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-920) Set GLOG_drop_log_memory=false in environment prior to logging initialization.

2016-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-920:
--
Shepherd: Benjamin Mahler

> Set GLOG_drop_log_memory=false in environment prior to logging initialization.
> --
>
> Key: MESOS-920
> URL: https://issues.apache.org/jira/browse/MESOS-920
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt
>Affects Versions: 0.15.0, 0.16.0
>Reporter: Benjamin Mahler
>Assignee: Kapil Arya
>
> We've observed issues where the masters are slow to respond. Two perf traces 
> collected while the masters were slow to respond:
> {noformat}
>  25.84%  [kernel][k] default_send_IPI_mask_sequence_phys
>  20.44%  [kernel][k] native_write_msr_safe
>   4.54%  [kernel][k] _raw_spin_lock
>   2.95%  libc-2.5.so [.] _int_malloc
>   1.82%  libc-2.5.so [.] malloc
>   1.55%  [kernel][k] apic_timer_interrupt
>   1.36%  libc-2.5.so [.] _int_free
> {noformat}
> {noformat}
>  29.03%  [kernel][k] default_send_IPI_mask_sequence_phys
>   9.64%  [kernel][k] _raw_spin_lock
>   7.38%  [kernel][k] native_write_msr_safe
>   2.43%  libc-2.5.so [.] _int_malloc
>   2.05%  libc-2.5.so [.] _int_free
>   1.67%  [kernel][k] apic_timer_interrupt
>   1.58%  libc-2.5.so [.] malloc
> {noformat}
> These have been attributed to the posix_fadvise calls made by glog. We can 
> disable these via the environment:
> {noformat}
> GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log 
> contents. "
>  "Logs can grow very quickly and they are rarely read before 
> they "
>  "need to be evicted from memory. Instead, drop them from 
> memory "
>  "as soon as they are flushed to disk.");
> {noformat}
> {code}
> if (FLAGS_drop_log_memory) {
>   if (file_length_ >= logging::kPageSize) {
> // don't evict the most recent page
> uint32 len = file_length_ & ~(logging::kPageSize - 1);
> posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED);
>   }
> }
> {code}
> We should set GLOG_drop_log_memory=false prior to making our call to 
> google::InitGoogleLogging, to avoid others running into this issue.
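
A minimal sketch of the suggested ordering; plain {{setenv()}} is used here to 
keep it self-contained, whereas the actual Mesos change would presumably go 
through its own environment helpers:

{code}
#include <cstdlib>

#include <glog/logging.h>

int main(int argc, char** argv)
{
  // Per the ticket, glog picks up GLOG_-prefixed environment variables when
  // logging is initialized, so this must be set before InitGoogleLogging().
  ::setenv("GLOG_drop_log_memory", "false", 1 /* overwrite */);

  google::InitGoogleLogging(argv[0]);

  LOG(INFO) << "Logging initialized with drop_log_memory disabled.";
  return 0;
}
{code}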



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4455) Error on scrolling the page horizontally on website

2016-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-4455:
--

Assignee: Benjamin Mahler

> Error on scrolling the page horizontally on website
> ---
>
> Key: MESOS-4455
> URL: https://issues.apache.org/jira/browse/MESOS-4455
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: Disha Singh
>Assignee: Benjamin Mahler
>Priority: Minor
>  Labels: newbie
>
> When the page http://mesos.apache.org/documentation/latest/architecture/
>  is scrolled horizontally, the nav bar at the top breaks, creating a bad 
> look.
> Also, this occurs only because of the unadjusted size of the picture 
> architecture3.jpg.
> This makes two-finger scrolling unpleasant. 
> One of these things can be done:
> 1. Adjust the image's size.
> 2. Fix the nav bar in its position by adding ":fixed" in the CSS class 
> block itself to prevent any issues even in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4455) Error on scrolling the page horizontally on website

2016-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-4455:
--

Assignee: (was: Benjamin Mahler)

> Error on scrolling the page horizontally on website
> ---
>
> Key: MESOS-4455
> URL: https://issues.apache.org/jira/browse/MESOS-4455
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: Disha Singh
>Priority: Minor
>  Labels: newbie
>
> When the page http://mesos.apache.org/documentation/latest/architecture/
>  is scrolled horizontally, the nav bar at the top breaks, creating a bad 
> look.
> Also, this occurs only because of the unadjusted size of the picture 
> architecture3.jpg.
> This makes two-finger scrolling unpleasant. 
> One of these things can be done:
> 1. Adjust the image's size.
> 2. Fix the nav bar in its position by adding ":fixed" in the CSS class 
> block itself to prevent any issues even in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4031) slave crashed in cgroupstatistics()

2016-01-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4031:
---
Component/s: (was: libprocess)
 docker

> slave crashed in cgroupstatistics()
> ---
>
> Key: MESOS-4031
> URL: https://issues.apache.org/jira/browse/MESOS-4031
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.24.0
> Environment: Debian jessie
>Reporter: Steven
>Assignee: Timothy Chen
>  Labels: mesosphere
> Fix For: 0.27.0
>
>
> Hi all, 
> I have built a Mesos cluster with three slaves. Any slave may sporadically 
> crash when I fetch the summary through the Mesos master UI. Here is the stack 
> trace. 
> {code}
>  slave.sh[13336]: I1201 11:54:12.827975 13338 slave.cpp:3926] Current disk 
> usage 79.71%. Max allowed age: 17.279577136390834hrs
>  slave.sh[13336]: I1201 11:55:12.829792 13342 slave.cpp:3926] Current disk 
> usage 79.71%. Max allowed age: 17.279577136390834hrs
>  slave.sh[13336]: I1201 11:55:38.389614 13342 http.cpp:189] HTTP GET for 
> /slave(1)/state from 192.168.100.1:64870 with User-Agent='Mozilla/5.0 (X11; 
> Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0'
>  docker[8409]: time="2015-12-01T11:55:38.934148017+08:00" level=info msg="GET 
> /v1.20/containers/mesos-b25be32d-41e1-4e14-9b84-d33d733cef51-S3.79c206a6-d6b5-487b-9390-e09292c5b53a/json"
>  docker[8409]: time="2015-12-01T11:55:38.941489332+08:00" level=info msg="GET 
> /v1.20/containers/mesos-b25be32d-41e1-4e14-9b84-d33d733cef51-S3.1e01a4b3-a76e-4bf6-8ce0-a4a937faf236/json"
>  slave.sh[13336]: ABORT: 
> (../../3rdparty/libprocess/3rdparty/stout/include/stout/result.hpp:110): 
> Result::get() but state == NONE*** Aborted at 1448942139 (unix time) try 
> "date -d @1448942139" if you are using GNU date ***
>  slave.sh[13336]: PC: @ 0x7f295218a107 (unknown)
>  slave.sh[13336]: *** SIGABRT (@0x3419) received by PID 13337 (TID 
> 0x7f2948992700) from PID 13337; stack trace: ***
>  slave.sh[13336]: @ 0x7f2952a2e8d0 (unknown)
>  slave.sh[13336]: @ 0x7f295218a107 (unknown)
>  slave.sh[13336]: @ 0x7f295218b4e8 (unknown)
>  slave.sh[13336]: @   0x43dc59 _Abort()
>  slave.sh[13336]: @   0x43dc87 _Abort()
>  slave.sh[13336]: @ 0x7f2955e31c86 Result<>::get()
>  slave.sh[13336]: @ 0x7f295637f017 
> mesos::internal::slave::DockerContainerizerProcess::cgroupsStatistics()
>  slave.sh[13336]: @ 0x7f295637dfea 
> _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUliE_clEi
>  slave.sh[13336]: @ 0x7f295637e549 
> _ZZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS_11ContainerIDEENKUlRKN6Docker9ContainerEE0_clES9_
>  slave.sh[13336]: @ 0x7f295638453b
> ZN5mesos8internal5slave26DockerContainerizerProcess5usageERKNS1_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEINS_6FutureINS1_18ResourceStatisticsEEESB_EEvENKUlSB_E_clESB_ENKUlvE_clEv
>  slave.sh[13336]: @ 0x7f295638751d
> FN7process6FutureIN5mesos18ResourceStatisticsEEEvEZZNKS0_9_DeferredIZNS2_8internal5slave26DockerContainerizerProcess5usageERKNS2_11ContainerIDEEUlRKN6Docker9ContainerEE0_EcvSt8functionIFT_T0_EEIS4_SG_EEvENKUlSG_E_clESG_EUlvE_E9_M_invoke
>  slave.sh[13336]: @ 0x7f29563b53e7 std::function<>::operator()()
>  slave.sh[13336]: @ 0x7f29563aa5dc 
> _ZZN7process8dispatchIN5mesos18ResourceStatisticsEEENS_6FutureIT_EERKNS_4UPIDERKSt8functionIFS5_vEEENKUlPNS_11ProcessBaseEE_clESF_
>  slave.sh[13336]: @ 0x7f29563bd667 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos18ResourceStatisticsEEENS0_6FutureIT_EERKNS0_4UPIDERKSt8functionIFS9_vEEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>  slave.sh[13336]: @ 0x7f2956b893c3 std::function<>::operator()()
>  slave.sh[13336]: @ 0x7f2956b72ab0 process::ProcessBase::visit()
>  slave.sh[13336]: @ 0x7f2956b7588e process::DispatchEvent::visit()
>  slave.sh[13336]: @ 0x7f2955d7f972 process::ProcessBase::serve()
>  slave.sh[13336]: @ 0x7f2956b6ef8e process::ProcessManager::resume()
>  slave.sh[13336]: @ 0x7f2956b63555 process::internal::schedule()
>  slave.sh[13336]: @ 0x7f2956bc0839 
> _ZNSt12_Bind_simpleIFPFvvEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
>  slave.sh[13336]: @ 0x7f2956bc0781 std::_Bind_simple<>::operator()()
>  slave.sh[13336]: @ 0x7f2956bc06fe std::thread::_Impl<>::_M_run()
>  slave.sh[13336]: @ 0x7f29527ca970 (unknown)
>  slave.sh[13336]: @ 0x7f2952a270a4 start_thread
>  slave.sh[13336]: @ 0x7f295223b04d (unknown)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.

2016-01-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1802:
--

Assignee: (was: Timothy Chen)

> HealthCheckTest.HealthStatusChange is flaky on jenkins.
> ---
>
> Key: MESOS-1802
> URL: https://issues.apache.org/jira/browse/MESOS-1802
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
>Affects Versions: 0.26.0
>Reporter: Benjamin Mahler
>  Labels: flaky, mesosphere
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull
> {noformat}
> [ RUN  ] HealthCheckTest.HealthStatusChange
> Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2'
> I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms
> I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns
> I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns
> I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 642ns
> I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 343ns
> I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery
> I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status
> I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to 
> STARTING
> I0916 22:56:14.036603 21046 master.cpp:286] Master 
> 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 
> 67.195.81.186:47865
> I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials'
> I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 480322ns
> I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to 
> STARTING
> I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled
> I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status
> I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@67.195.81.186:47865
> I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is 
> master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026
> I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master!
> I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar
> I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar
> I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING
> I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 330251ns
> I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to 
> VOTING
> I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos 
> group
> I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated
> I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer
> I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 92623ns
> I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1
> I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0916 22:56:14.039676 21047 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 144215ns
> I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0
> I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request 
> for position 0
> I0916 22:56:14.040267 21047 leveldb.cpp:438] Reading position from leveldb 
> 

[jira] [Created] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.

2014-08-04 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1668:
--

 Summary: Handle a temporary one-way master --> slave socket 
closure.
 Key: MESOS-1668
 URL: https://issues.apache.org/jira/browse/MESOS-1668
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Priority: Minor


In MESOS-1529, we realized that it's possible for a slave to remain 
disconnected in the master if the following occurs:

→ Master and slave connected, operating normally.
→ Temporary one-way network failure; the master→slave link breaks.
→ Master marks the slave as disconnected.
→ Network is restored and health checking continues normally, so the slave is 
not removed. The slave does not attempt to re-register since it is receiving 
pings once again.
→ The slave remains disconnected according to the master, and the slave does not 
try to re-register. Bad!

We were originally thinking of using a failover timeout in the master to remove 
these slaves that don't re-register. However, it can be dangerous when 
ZooKeeper issues are preventing the slave from re-registering with the master; 
we do not want to remove a ton of slaves in this situation.

Rather, when the slave is health checking correctly but does not re-register 
within a timeout, we could send a registration request from the master to the 
slave, telling the slave that it must re-register. This message could also be 
used when receiving status updates (or other messages) from slaves that are 
disconnected in the master.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1470) Add cluster maintenance documentation.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1470:
---

Target Version/s:   (was: 0.20.0)

> Add cluster maintenance documentation.
> --
>
> Key: MESOS-1470
> URL: https://issues.apache.org/jira/browse/MESOS-1470
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>
> Now that the master has replicated state on disk, we should add 
> documentation that guides operators through common maintenance work:
> * Swapping a master in the ensemble.
> * Growing the master ensemble.
> * Shrinking the master ensemble.
> This would help craft similar documentation for users of the replicated log.
> We should also add documentation for existing slave maintenance tasks:
> * Best practices for rolling upgrades.
> * How to shut down a slave.
> This latter category will be incorporated with [~alexandra.sava]'s 
> maintenance work!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1461) Add task reconciliation to the Python API.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1461:
---

Target Version/s: 0.21.0  (was: 0.20.0)

> Add task reconciliation to the Python API.
> --
>
> Key: MESOS-1461
> URL: https://issues.apache.org/jira/browse/MESOS-1461
> Project: Mesos
>  Issue Type: Task
>  Components: python api
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>
> Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but 
> was never added to the Python API.
> This may be obviated by the lower level API.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1517:
---

Target Version/s:   (was: 0.20.0)

> Maintain a queue of messages that arrive before the master recovers.
> 
>
> Key: MESOS-1517
> URL: https://issues.apache.org/jira/browse/MESOS-1517
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>  Labels: reliability
>
> Currently when the master is recovering, we drop all incoming messages. If 
> slaves and frameworks knew about the leading master only once it has 
> recovered, then we would only expect to see messages after we've recovered.
> We previously considered enqueuing all messages through the recovery future, 
> but this has the downside of forcing all messages to go through the master's 
> queue twice:
> {code}
>   // TODO(bmahler): Consider instead re-enqueing *all* messages
>   // through recover(). What are the performance implications of
>   // the additional queueing delay and the accumulated backlog
>   // of messages post-recovery?
>   if (!recovered.get().isReady()) {
> VLOG(1) << "Dropping '" << event.message->name << "' message since "
> << "not recovered yet";
> ++metrics.dropped_messages;
> return;
>   }
> {code}
> However, an easy solution to this problem is to maintain an explicit queue of 
> incoming messages that gets flushed once we finish recovery. This ensures 
> that all messages post-recovery are processed normally.
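
A rough sketch of that alternative (invented names, not the actual Master code):

{code}
#include <queue>
#include <string>
#include <utility>

// Buffer incoming messages while recovery is in progress and replay them once
// recovery completes, so post-recovery messages are never queued twice.
class RecoveringReceiver
{
public:
  void receive(std::string message)
  {
    if (!recovered_) {
      pending_.push(std::move(message));   // Buffer instead of dropping.
      return;
    }
    handle(std::move(message));
  }

  void onRecovered()
  {
    recovered_ = true;
    while (!pending_.empty()) {            // Flush the backlog once, in order.
      handle(std::move(pending_.front()));
      pending_.pop();
    }
  }

private:
  void handle(std::string message) { /* normal processing */ (void) message; }

  bool recovered_ = false;
  std::queue<std::string> pending_;
};
{code}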



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1653.


   Resolution: Fixed
Fix Version/s: 0.20.0
 Assignee: Timothy Chen

Fix was included here:

{noformat}
commit 656b0e075c79e03cf6937bbe7302424768729aa2
Author: Timothy Chen 
Date:   Wed Aug 6 11:34:03 2014 -0700

Re-enabled HealthCheckTest.ConsecutiveFailures test.

The test originally was flaky because the time to process the number
of consecutive checks configured exceeds the task itself, so the task
finished but the number of expected task health check didn't match.

Review: https://reviews.apache.org/r/23772
{noformat}

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Timothy Chen
> Fix For: 0.20.0
>
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
> I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
> I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
> 2317ns
> I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 1367ns
> I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
> I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
> I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
> STARTING
> I0729 17:10:10.508482  1213 master.cpp:289] Master 
> 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
> I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
> authenticated frameworks to register
> I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
> authenticated slaves to register
> I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
> I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
> I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:54701
> I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
> offers for all slaves
> I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
> master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
> I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
> I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
> I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
> I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 12.946461ms
> I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
> STARTING
> I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
> I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
> I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 11.537102ms
> I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
> VOTING
> I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
> group
> I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
> I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
> I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.830863ms
> I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
> I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 

[jira] [Updated] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1613:
---

Fix Version/s: (was: 0.20.0)

> HealthCheckTest.ConsecutiveFailures is flaky
> 
>
> Key: MESOS-1613
> URL: https://issues.apache.org/jira/browse/MESOS-1613
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: Ubuntu 10.04 GCC
>Reporter: Vinod Kone
>Assignee: Timothy Chen
>
> {code}
> [ RUN  ] HealthCheckTest.ConsecutiveFailures
> Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
> I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
> I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
> I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
> I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
> 2125ns
> I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 10747ns
> I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
> I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
> I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
> STARTING
> I0717 04:39:59.301985  5031 master.cpp:288] Master 
> 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
> I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
> authenticated frameworks to register
> I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
> authenticated slaves to register
> I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
> I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
> I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:40280
> I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
> offers for all slaves
> I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.325097ms
> I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
> STARTING
> I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
> master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
> I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
> I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
> I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
> I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
> I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
> I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.204157ms
> I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
> VOTING
> I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
> group
> I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
> I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
> I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.271822ms
> I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
> I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 5.004323ms
> I0717 04:39:59.323493  5031 replica.cpp:676] Persisted action at 0
> I0717 04:39:59.323799  5031 replica.cpp:508] Replica received write request 
> for position 0
> I0717 04:39:59.323837  5031 leveldb.cpp:438] Reading position from leveldb 
> took 21901ns
> I0717 04:39:59.329038  5031 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 5.175998ms
> I0717 04:39:59.329244  5031 repl

[jira] [Reopened] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reopened MESOS-1613:



Looks like it's still flaky:

{noformat}
Changes

Summary

Made ephemeral ports a resource and killed private resources. (details)
Do not send ephemeral_ports resource to frameworks. (details)
Create static mesos library. (details)
Re-enabled HealthCheckTest.ConsecutiveFailures test. (details)
Merge resourcesRecovered and resourcesUnused. (details)
Added executor metrics for slave. (details)

[ RUN  ] HealthCheckTest.ConsecutiveFailures
Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu'
I0806 15:06:59.268267  9210 leveldb.cpp:176] Opened db in 29.926087ms
I0806 15:06:59.275971  9210 leveldb.cpp:183] Compacted db in 7.40006ms
I0806 15:06:59.276254  9210 leveldb.cpp:198] Created db iterator in 7678ns
I0806 15:06:59.276741  9210 leveldb.cpp:204] Seeked to beginning of db in 2076ns
I0806 15:06:59.277034  9210 leveldb.cpp:273] Iterated through 0 keys in the db 
in 1908ns
I0806 15:06:59.277307  9210 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0806 15:06:59.277868  9233 recover.cpp:425] Starting replica recovery
I0806 15:06:59.277946  9233 recover.cpp:451] Replica is in EMPTY status
I0806 15:06:59.278240  9233 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0806 15:06:59.278296  9233 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0806 15:06:59.278391  9233 recover.cpp:542] Updating replica status to STARTING
I0806 15:06:59.282282  9234 master.cpp:287] Master 
20140806-150659-16842879-60888-9210 (lucid) started on 127.0.1.1:60888
I0806 15:06:59.282316  9234 master.cpp:324] Master only allowing authenticated 
frameworks to register
I0806 15:06:59.282322  9234 master.cpp:329] Master only allowing authenticated 
slaves to register
I0806 15:06:59.282330  9234 credentials.hpp:36] Loading credentials for 
authentication from 
'/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu/credentials'
I0806 15:06:59.282508  9234 master.cpp:358] Authorization enabled
I0806 15:06:59.283121  9234 hierarchical_allocator_process.hpp:296] 
Initializing hierarchical allocator process with master : master@127.0.1.1:60888
I0806 15:06:59.283174  9234 master.cpp:121] No whitelist given. Advertising 
offers for all slaves
I0806 15:06:59.283413  9234 master.cpp:1127] The newly elected leader is 
master@127.0.1.1:60888 with id 20140806-150659-16842879-60888-9210
I0806 15:06:59.283429  9234 master.cpp:1140] Elected as the leading master!
I0806 15:06:59.283435  9234 master.cpp:958] Recovering from registrar
I0806 15:06:59.283491  9234 registrar.cpp:313] Recovering registrar
I0806 15:06:59.284046  9233 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.600113ms
I0806 15:06:59.284080  9233 replica.cpp:320] Persisted replica status to 
STARTING
I0806 15:06:59.284226  9233 recover.cpp:451] Replica is in STARTING status
I0806 15:06:59.284580  9233 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0806 15:06:59.284643  9233 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0806 15:06:59.284747  9233 recover.cpp:542] Updating replica status to VOTING
I0806 15:06:59.289934  9233 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.119357ms
I0806 15:06:59.290256  9233 replica.cpp:320] Persisted replica status to VOTING
I0806 15:06:59.290876  9237 recover.cpp:556] Successfully joined the Paxos group
I0806 15:06:59.291131  9232 recover.cpp:440] Recover process terminated
I0806 15:06:59.300732  9236 log.cpp:656] Attempting to start the writer
I0806 15:06:59.301061  9236 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0806 15:06:59.306172  9236 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.070193ms
I0806 15:06:59.306229  9236 replica.cpp:342] Persisted promised to 1
I0806 15:06:59.306747  9236 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0806 15:06:59.307143  9236 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0806 15:06:59.309715  9236 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 2.521311ms
I0806 15:06:59.310199  9236 replica.cpp:676] Persisted action at 0
I0806 15:06:59.320276  9234 replica.cpp:508] Replica received write request for 
position 0
I0806 15:06:59.320335  9234 leveldb.cpp:438] Reading position from leveldb took 
26656ns
I0806 15:06:59.325726  9234 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 5.358479ms
I0806 15:06:59.325781  9234 replica.cpp:676] Persisted action at 0
I0806 15:06:59.325999  9234 replica.cpp:655] Replica received learned notice 
for position 0
I0806 15:06:59.328487  9234 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 2.458504ms
I0806 

[jira] [Updated] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1668:
---


Placing this under reconciliation because, although extremely rare, it can lead 
to inconsistent state between the master and slave for an arbitrary amount of 
time, for example if the launchTask message is dropped as a result of the 
master → slave socket closure in the scenario above.

> Handle a temporary one-way master --> slave socket closure.
> ---
>
> Key: MESOS-1668
> URL: https://issues.apache.org/jira/browse/MESOS-1668
> Project: Mesos
>  Issue Type: Bug
>  Components: master, slave
>Reporter: Benjamin Mahler
>Priority: Minor
>  Labels: reliability
>
> In MESOS-1529, we realized that it's possible for a slave to remain 
> disconnected in the master if the following occurs:
> → Master and Slave connected operating normally.
> → Temporary one-way network failure, master→slave link breaks.
> → Master marks slave as disconnected.
> → Network restored and health checking continues normally, slave is not 
> removed as a result. Slave does not attempt to re-register since it is 
> receiving pings once again.
> → Slave remains disconnected according to the master, and the slave does not 
> try to re-register. Bad!
> We were originally thinking of using a failover timeout in the master to 
> remove these slaves that don't re-register. However, it can be dangerous when 
> ZooKeeper issues are preventing the slave from re-registering with the 
> master; we do not want to remove a ton of slaves in this situation.
> Rather, when the slave is health checking correctly but does not re-register 
> within a timeout, we could send a registration request from the master to the 
> slave, telling the slave that it must re-register. This message could also be 
> used when receiving status updates (or other messages) from slaves that are 
> disconnected in the master.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-06 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088425#comment-14088425
 ] 

Benjamin Mahler commented on MESOS-1613:


So far only Twitter CI is exposing this flakiness. I've pasted the full logs in 
the comment above; are you looking for logging from the command executor? We 
may want to investigate wiring the tests up to expose that logging in the 
output to make this easier for you to debug.

> HealthCheckTest.ConsecutiveFailures is flaky
> 
>
> Key: MESOS-1613
> URL: https://issues.apache.org/jira/browse/MESOS-1613
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: Ubuntu 10.04 GCC
>Reporter: Vinod Kone
>Assignee: Timothy Chen
>
> {code}
> [ RUN  ] HealthCheckTest.ConsecutiveFailures
> Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
> I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
> I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
> I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
> I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
> 2125ns
> I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 10747ns
> I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
> I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
> I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
> STARTING
> I0717 04:39:59.301985  5031 master.cpp:288] Master 
> 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
> I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
> authenticated frameworks to register
> I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
> authenticated slaves to register
> I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
> I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
> I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:40280
> I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
> offers for all slaves
> I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.325097ms
> I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
> STARTING
> I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
> master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
> I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
> I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
> I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
> I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
> I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
> I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.204157ms
> I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
> VOTING
> I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
> group
> I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
> I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
> I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.271822ms
> I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
> I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 5.004323ms
> I0717 04:39:59.323493  5031 replica.cpp:676] Persisted action at 0
> I0717 04:39:59.323799  5031 replica.

[jira] [Updated] (MESOS-887) Scheduler Driver should use exited() to detect disconnection with Master.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-887:
--

Labels: framework reliability  (was: )

> Scheduler Driver should use exited() to detect disconnection with Master.
> -
>
> Key: MESOS-887
> URL: https://issues.apache.org/jira/browse/MESOS-887
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.14.2, 0.16.0, 0.15.0
>Reporter: Benjamin Mahler
>  Labels: framework, reliability
>
> The Scheduler Driver already links with the master, but it does not use the 
> built in exited() notification from libprocess to detect socket closure.
> This would fast-track the delay from zookeeper detecting a leadership change, 
> and would minimize the number of dropped messages leaving the driver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-07 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090065#comment-14090065
 ] 

Benjamin Mahler commented on MESOS-1613:


[~tnachen] it's failing on ASF CI as well, can you triage or disable in the 
interim?

> HealthCheckTest.ConsecutiveFailures is flaky
> 
>
> Key: MESOS-1613
> URL: https://issues.apache.org/jira/browse/MESOS-1613
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: Ubuntu 10.04 GCC
>Reporter: Vinod Kone
>Assignee: Timothy Chen
>
> {code}
> [ RUN  ] HealthCheckTest.ConsecutiveFailures
> Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
> I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
> I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
> I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
> I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
> 2125ns
> I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 10747ns
> I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
> I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
> I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
> STARTING
> I0717 04:39:59.301985  5031 master.cpp:288] Master 
> 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
> I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
> authenticated frameworks to register
> I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
> authenticated slaves to register
> I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
> I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
> I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:40280
> I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
> offers for all slaves
> I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.325097ms
> I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
> STARTING
> I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
> master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
> I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
> I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
> I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
> I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
> I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
> I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.204157ms
> I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
> VOTING
> I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
> group
> I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
> I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
> I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.271822ms
> I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
> I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 5.004323ms
> I0717 04:39:59.323493  5031 replica.cpp:676] Persisted action at 0
> I0717 04:39:59.323799  5031 replica.cpp:508] Replica received write request 
> for position 0
> I0717 04:39:59.323837  5031 leveldb.cpp:438] Reading position from leveldb 
> took 21901ns
> I0717 04:39:59.329038  5031 leve

[jira] [Comment Edited] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-07 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090065#comment-14090065
 ] 

Benjamin Mahler edited comment on MESOS-1613 at 8/7/14 11:56 PM:
-

[~tnachen] it's failing on ASF CI as well, can you triage or disable in the 
interim?

E.g.
https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2299/consoleFull


was (Author: bmahler):
[~tnachen] it's failing on ASF CI as well, can you triage or disable in the 
interim?

> HealthCheckTest.ConsecutiveFailures is flaky
> 
>
> Key: MESOS-1613
> URL: https://issues.apache.org/jira/browse/MESOS-1613
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: Ubuntu 10.04 GCC
>Reporter: Vinod Kone
>Assignee: Timothy Chen
>
> {code}
> [ RUN  ] HealthCheckTest.ConsecutiveFailures
> Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
> I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
> I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
> I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
> I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
> 2125ns
> I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 10747ns
> I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
> I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
> I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
> STARTING
> I0717 04:39:59.301985  5031 master.cpp:288] Master 
> 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
> I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
> authenticated frameworks to register
> I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
> authenticated slaves to register
> I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
> I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
> I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:40280
> I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
> offers for all slaves
> I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.325097ms
> I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
> STARTING
> I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
> master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
> I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
> I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
> I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
> I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
> I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
> I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.204157ms
> I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
> VOTING
> I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
> group
> I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
> I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
> I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 2.271822ms
> I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
> I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 5.004323ms
>

[jira] [Commented] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.

2014-08-11 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093505#comment-14093505
 ] 

Benjamin Mahler commented on MESOS-1620:


Review chain for this one; I did some cleanups along the way:

https://reviews.apache.org/r/24582/
https://reviews.apache.org/r/24583/
https://reviews.apache.org/r/24576/
https://reviews.apache.org/r/24515/
https://reviews.apache.org/r/24516/

> Reconciliation does not send back tasks pending validation / authorization.
> ---
>
> Key: MESOS-1620
> URL: https://issues.apache.org/jira/browse/MESOS-1620
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send 
> back TASK_STAGING for those tasks that are pending in the Master (validation 
> / authorization still in progress).
> For both implicit and explicit task reconciliation, the master could reply 
> with TASK_STAGING for these tasks, as this provides additional information to 
> the framework.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.

2014-08-11 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1620:
---

Shepherd: Vinod Kone  (was: Dominic Hamon)

> Reconciliation does not send back tasks pending validation / authorization.
> ---
>
> Key: MESOS-1620
> URL: https://issues.apache.org/jira/browse/MESOS-1620
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send 
> back TASK_STAGING for those tasks that are pending in the Master (validation 
> / authorization still in progress).
> For both implicit and explicit task reconciliation, the master could reply 
> with TASK_STAGING for these tasks, as this provides additional information to 
> the framework.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1695) The stats.json endpoint on the slave exposes "registered" as a string.

2014-08-12 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1695:
--

 Summary: The stats.json endpoint on the slave exposes "registered" 
as a string.
 Key: MESOS-1695
 URL: https://issues.apache.org/jira/browse/MESOS-1695
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler
Assignee: Vinod Kone
Priority: Minor


The slave is currently exposing a string value for the "registered" statistic; 
this should be a number:

{code}
slave:5051/stats.json
{
  "recovery_errors": 0,
  "registered": "1",
  "slave/executors_registering": 0,
  ...
}
{code}

This should be a pretty straightforward fix; it looks like this first originated 
back in 2013:

{code}
commit b8291304e1523eb67ea8dc5f195cdb0d8e7d8348
Author: Vinod Kone 
Date:   Wed Jul 3 12:37:36 2013 -0700

Added a "registered" key/value pair to slave's stats.json.

Review: https://reviews.apache.org/r/12256

diff --git a/src/slave/http.cpp b/src/slave/http.cpp
index dc2955f..dd51516 100644
--- a/src/slave/http.cpp
+++ b/src/slave/http.cpp
@@ -281,6 +281,8 @@ Future<Response> Slave::Http::stats(const Request& request)
   object.values["lost_tasks"] = slave.stats.tasks[TASK_LOST];
   object.values["valid_status_updates"] = slave.stats.validStatusUpdates;
   object.values["invalid_status_updates"] = slave.stats.invalidStatusUpdates;
+  object.values["registered"] = slave.master ? "1" : "0";
+

   return OK(object, request.query.get("jsonp"));
 }
{code}
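
To make the intended fix concrete, here is a tiny standalone illustration (not 
the Mesos/stout JSON code, and the function name is made up) of emitting 
"registered" as a JSON number rather than a quoted string:

{code}
#include <iostream>
#include <sstream>
#include <string>

// Illustration only: shows the desired output shape. The real fix would go
// through the stout JSON library used by Slave::Http::stats().
std::string statsJson(bool registered)
{
  std::ostringstream out;
  out << "{"
      << "\"recovery_errors\": 0, "
      // Emit a bare number; the bug is emitting the quoted strings "1" / "0".
      << "\"registered\": " << (registered ? 1 : 0)
      << "}";
  return out.str();
}

int main()
{
  // Prints: {"recovery_errors": 0, "registered": 1}
  std::cout << statsJson(true) << std::endl;
  return 0;
}
{code}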



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1696) Improve reconciliation between master and slave.

2014-08-12 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1696:
--

 Summary: Improve reconciliation between master and slave.
 Key: MESOS-1696
 URL: https://issues.apache.org/jira/browse/MESOS-1696
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler


As we update the Master to keep tasks in memory until they are both terminal 
and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as 
follows:

{code}
Master   Slave
 {}   {}
{Tn}  {}  // Master receives Task T, non-terminal. Forwards to 
slave.
{Tn} {Tn} // Slave receives Task T, non-terminal.
{Tn} {Tt} // Task becomes terminal on slave. Update forwarded.
{Tt} {Tt} // Master receives update, forwards to framework.
 {}  {Tt} // Master receives ack, forwards to slave.
 {}   {}  // Slave receives ack.
{code}

In the current form of reconciliation, the slave sends to the master all tasks 
that are not both terminal and acknowledged. At any point in the above 
lifecycle, the slave's re-registration message can reach the master.

Note the following properties:

*(1)* The master may have a non-terminal task, not present in the slave's 
re-registration message.
*(2)* The master may have a non-terminal task, present in the slave's 
re-registration message but in a different state.
*(3)* The slave's re-registration message may contain a terminal unacknowledged 
task unknown to the master.

In the current master / slave 
[reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146]
 code, the master assumes that case (1) is because a launch task message was 
dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when 
the task reaches the slave correctly, so this can lead to inconsistency!

After chatting with [~vinodkone], we're considering updating the reconciliation 
to occur as follows:


→ Slave sends all tasks that are not both terminal and acknowledged, during 
re-registration. This is the same as before.

→ If the master sees tasks that are missing in the slave, the master sends a 
reconcile message to the slave for the tasks.

→ The slave will reply to reconcile messages with the latest task state, or 
TASK_LOST if the task is not known to it, preferably in a retried manner, 
unless we update socket-closure handling on the slave to force a 
re-registration (see the sketch below).
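
A rough standalone sketch of the proposed flow, with made-up type names purely 
for illustration (the real implementation would use Mesos protobuf messages, 
actors, and retries):

{code}
#include <map>
#include <set>
#include <string>

// Hypothetical types for illustration only.
enum TaskState { TASK_STAGING, TASK_RUNNING, TASK_FINISHED, TASK_LOST };

// (1) On re-registration the slave sends all tasks that are not both
//     terminal and acknowledged: task id -> latest known state.
typedef std::map<std::string, TaskState> ReregistrationTasks;

// (2) The master asks the slave about tasks it knows of but the slave omitted.
typedef std::set<std::string> SlaveReconcileRequest;

// (3) The slave replies with its latest state, or TASK_LOST if the task is
//     unknown to it.
TaskState reconcileOnSlave(
    const ReregistrationTasks& slaveTasks,
    const std::string& taskId)
{
  ReregistrationTasks::const_iterator it = slaveTasks.find(taskId);
  return it != slaveTasks.end() ? it->second : TASK_LOST;
}
{code}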



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.

2014-08-13 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1700:
--

 Summary: ThreadLocal does not release pthread keys or log properly.
 Key: MESOS-1700
 URL: https://issues.apache.org/jira/browse/MESOS-1700
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


The ThreadLocal abstraction in stout does not release the allocated pthread 
keys upon destruction:

https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22

It also does not log the errors correctly. Fortunately this does not impact 
mesos at the current time.
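
For context, a minimal RAII sketch of the missing cleanup and logging; this is 
illustrative only and not stout's ThreadLocal implementation:

{code}
#include <pthread.h>

#include <cstring>
#include <iostream>

// Illustration only: release the pthread key in the destructor and log the
// error code (pthread functions return the error number; they do not set
// errno).
template <typename T>
class ThreadLocal
{
public:
  ThreadLocal()
  {
    int result = pthread_key_create(&key, NULL);
    if (result != 0) {
      std::cerr << "Failed to create pthread key: "
                << std::strerror(result) << std::endl;
    }
  }

  ~ThreadLocal()
  {
    int result = pthread_key_delete(key);  // The missing release.
    if (result != 0) {
      std::cerr << "Failed to delete pthread key: "
                << std::strerror(result) << std::endl;
    }
  }

  ThreadLocal<T>& operator=(T* t)
  {
    pthread_setspecific(key, t);
    return *this;
  }

  operator T*() const
  {
    return static_cast<T*>(pthread_getspecific(key));
  }

private:
  pthread_key_t key;
};
{code}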



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.

2014-08-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1700:
---

Sprint: Q3 Sprint 3

> ThreadLocal does not release pthread keys or log properly.
> --
>
> Key: MESOS-1700
> URL: https://issues.apache.org/jira/browse/MESOS-1700
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> The ThreadLocal abstraction in stout does not release the allocated 
> pthread keys upon destruction:
> https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22
> It also does not log the errors correctly. Fortunately this does not impact 
> mesos at the current time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.

2014-08-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096175#comment-14096175
 ] 

Benjamin Mahler commented on MESOS-1700:


https://reviews.apache.org/r/24669/

> ThreadLocal does not release pthread keys or log properly.
> --
>
> Key: MESOS-1700
> URL: https://issues.apache.org/jira/browse/MESOS-1700
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> The ThreadLocal abstraction in stout does not release the allocated 
> pthread keys upon destruction:
> https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22
> It also does not log the errors correctly. Fortunately this does not impact 
> mesos at the current time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MESOS-1006) Invalid free when in ProcessIsolator Usage when executing a short task

2014-08-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1006.


   Resolution: Fixed
Fix Version/s: 0.18.0
 Assignee: Benjamin Hindman

This was fixed as part of MESOS-963.

> Invalid free when in ProcessIsolator Usage when executing a short task
> --
>
> Key: MESOS-1006
> URL: https://issues.apache.org/jira/browse/MESOS-1006
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.17.0
> Environment: MacOS 10.9.1
>Reporter: Florian Douetteau
>Assignee: Benjamin Hindman
> Fix For: 0.18.0
>
>
> Executing: 
> mesos execute --command=/bin/ls --master=127.0.0.1:5050 --name=test
> Slave Log: 
> I0214 11:46:34.732306 294408192 status_update_manager.cpp:367] Forwarding 
> status update TASK_RUNNING (UUID: 956f2268-f677-4ff0-86d2-95e338b11447) for 
> task test of framework 201402141053-16777343-5050-50707-0010 to 
> master@127.0.0.1:5050
> I0214 11:46:34.732408 296017920 slave.cpp:1882] Sending acknowledgement for 
> status update TASK_RUNNING (UUID: 956f2268-f677-4ff0-86d2-95e338b11447) for 
> task test of framework 201402141053-16777343-5050-50707-0010 to 
> executor(1)@127.0.0.1:57686
> I0214 11:46:34.772078 294408192 status_update_manager.cpp:392] Received 
> status update acknowledgement (UUID: 956f2268-f677-4ff0-86d2-95e338b11447) 
> for task test of framework 201402141053-16777343-5050-50707-0010
> mesos-slave(52021,0x1119cb000) malloc: *** error for object 0x79702e72657672: 
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> *** Aborted at 1392407195 (unix time) try "date -d @1392407195" if you are 
> using GNU date ***
> PC: @ 0x7fff8f602866 __pthread_kill
> *** SIGABRT (@0x7fff8f602866) received by PID 52021 (TID 0x1119cb000) stack 
> trace: ***
> @ 0x7fff907a35aa _sigtramp
> @0x0 (unknown)
> @ 0x7fff8d43ebba abort
> @ 0x7fff8d3ba093 free
> @0x1101a11c2 std::__1::vector<>::erase()
> @0x110197dcb os::process()
> @0x11019eaad os::processes()
> @0x110198957 os::children()
> @0x1101939f3 mesos::internal::slave::ProcessIsolator::usage()
> @0x1101115d3 
> _ZZN7process8dispatchIN5mesos18ResourceStatisticsENS1_8internal5slave8IsolatorERKNS1_11FrameworkIDERKNS1_10ExecutorIDES6_S9_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSG_FSE_T1_T2_ET3_T4_ENKS0_IS2_S5_S8_SB_S6_S9_EUlPNS_11ProcessBaseEE_clESS_
> @0x11036cd13 process::ProcessBase::visit()
> @0x1103640d2 process::ProcessManager::resume()
> @0x110363c2f process::schedule()
> @ 0x7fff8b269899 _pthread_body
> @ 0x7fff8b26972a _pthread_start
> @ 0x7fff8b26dfc9 thread_start
> Abort trap: 6



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1454) Command executor should have nonzero resources

2014-08-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1454:
---

Fix Version/s: 0.20.0

> Command executor should have nonzero resources
> --
>
> Key: MESOS-1454
> URL: https://issues.apache.org/jira/browse/MESOS-1454
> Project: Mesos
>  Issue Type: Bug
>Reporter: Ian Downes
>Assignee: Ian Downes
> Fix For: 0.20.0
>
>
> The command executor is used when TaskInfo does not provide an ExecutorInfo. 
> It is not given any dedicated resources but the executor will be launched 
> with the first task's resources. However, assuming a single task (or a final 
> task), when that task terminates the container will be updated to zero 
> resources - see containerizer->update in Slave::statusUpdate().
> Possible solutions:
> - Require a task to specify an executor and resources for it.
> - Add sufficient allowance for the command executor beyond the task's 
> resources. This leads to an overcommit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1705) SubprocessTest.Status sometimes flakes out

2014-08-15 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099021#comment-14099021
 ] 

Benjamin Mahler commented on MESOS-1705:


This would be because we recently turned on google-logging stacktraces in the 
libprocess tests. Note that the test is still passing but the child process 
seems to have received the SIGTERM after the fork but before the exec, which is 
fine but produces this unfortunate stack trace.

[~vinodkone] perhaps we should drop the SIGTERM stacktracing, like we do within 
the mesos logging initialization.

> SubprocessTest.Status sometimes flakes out
> --
>
> Key: MESOS-1705
> URL: https://issues.apache.org/jira/browse/MESOS-1705
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.20.0
>Reporter: Timothy St. Clair
>Priority: Minor
>  Labels: build
>
> It's a pretty rare event, but it has happened more than once.
> [ RUN  ] SubprocessTest.Status
> *** Aborted at 1408023909 (unix time) try "date -d @1408023909" if you are 
> using GNU date ***
> PC: @   0x35700094b1 (unknown)
> *** SIGTERM (@0x3e841d8) received by PID 16872 (TID 0x7fa9ea426780) from 
> PID 16856; stack trace: ***
> @   0x3570435cb0 (unknown)
> @   0x35700094b1 (unknown)
> @   0x3570009d9f (unknown)
> @   0x357000e726 (unknown)
> @   0x3570015185 (unknown)
> @   0x5ead42 process::childMain()
> @   0x5ece8d std::_Function_handler<>::_M_invoke()
> @   0x5eac9c process::defaultClone()
> @   0x5ebbd4 process::subprocess()
> @   0x55a229 process::subprocess()
> @   0x55a846 process::subprocess()
> @   0x54224c SubprocessTest_Status_Test::TestBody()
> @ 0x7fa9ea460323 (unknown)
> @ 0x7fa9ea455b67 (unknown)
> @ 0x7fa9ea455c0e (unknown)
> @ 0x7fa9ea455d15 (unknown)
> @ 0x7fa9ea4593a8 (unknown)
> @ 0x7fa9ea459647 (unknown)
> @   0x422466 main
> @   0x3570421d65 (unknown)
> @   0x4260bd (unknown)
> [   OK ] SubprocessTest.Status (153 ms)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1714:
--

 Summary: The C++ 'Resources' abstraction should keep the 
underlying resources flattened.
 Key: MESOS-1714
 URL: https://issues.apache.org/jira/browse/MESOS-1714
 Project: Mesos
  Issue Type: Bug
  Components: c++ api
Reporter: Benjamin Mahler


Currently, the C++ Resources class does not ensure that the underlying 
Resources protobufs are kept flat.

This is an issue because some of the methods, e.g. 
[Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269],
 assume the resources are flat.

There is code that constructs unflattened resources, e.g. 
[Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353].
 We could prevent this type of construction; however, it is perfectly fine if we 
ensure the C++ 'Resources' class performs flattening.
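
To make "flattened" concrete, here is a standalone sketch (not the 
mesos::Resources class) where resources with the same name are merged into a 
single entry on insertion, so lookups can assume at most one entry per name:

{code}
#include <map>
#include <string>

// Illustration only: a "flat" collection of scalar resources keeps at most
// one entry per resource name, summing duplicates when they are added.
struct Resource
{
  std::string name;  // e.g. "cpus", "mem"
  double scalar;
};

class FlatResources
{
public:
  void add(const Resource& r)
  {
    // Merge into any existing entry instead of appending a duplicate.
    resources[r.name] += r.scalar;
  }

  // With the invariant above, get() can safely assume a single entry.
  double get(const std::string& name) const
  {
    std::map<std::string, double>::const_iterator it = resources.find(name);
    return it != resources.end() ? it->second : 0.0;
  }

private:
  std::map<std::string, double> resources;
};
{code}

Adding "cpus:1" twice then yields a single "cpus:2" entry rather than two 
"cpus:1" entries, which is the invariant the lookup methods rely on.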



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1715) The slave does not send pending tasks during re-registration.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1715:
--

 Summary: The slave does not send pending tasks during 
re-registration.
 Key: MESOS-1715
 URL: https://issues.apache.org/jira/browse/MESOS-1715
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


In what looks like an oversight, the pending tasks in the slave 
(Framework::pending) are not sent in the re-registration message.

This can lead to spurious TASK_LOST notifications being generated by the master 
when it falsely thinks the tasks are not present on the slave.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1716:
--

 Summary: The slave does not add pending tasks as part of the 
staging tasks metric.
 Key: MESOS-1716
 URL: https://issues.apache.org/jira/browse/MESOS-1716
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler
Priority: Trivial


The slave does not represent pending tasks in the "tasks_staging" metric.

This should be a trivial fix.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1717:
--

 Summary: The slave does not show pending tasks in the JSON 
endpoints.
 Key: MESOS-1717
 URL: https://issues.apache.org/jira/browse/MESOS-1717
 Project: Mesos
  Issue Type: Bug
  Components: json api, slave
Reporter: Benjamin Mahler


The slave does not show pending tasks in the /state.json endpoint.

This is a bit tricky to add since we rely on knowing the executor directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1718) Command executor can overcommit the slave.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1718:
--

 Summary: Command executor can overcommit the slave.
 Key: MESOS-1718
 URL: https://issues.apache.org/jira/browse/MESOS-1718
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


Currently we give a small amount of resources to the command executor, in 
addition to resources used by the command task:

https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
{code: title=}
ExecutorInfo Slave::getExecutorInfo(
const FrameworkID& frameworkId,
const TaskInfo& task)
{
  ...
// Add an allowance for the command executor. This does lead to a
// small overcommit of resources.
executor.mutable_resources()->MergeFrom(
Resources::parse(
  "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
  "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
  ...
}
{code}

This leads to an overcommit of the slave. Ideally, for command tasks we can 
"transfer" all of the task resources to the executor at the slave / isolation 
level.
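
To see the accounting difference, a trivial standalone illustration (the 
numbers and function names are made up) of the current allowance versus the 
proposed "transfer":

{code}
#include <cassert>

// Illustration only: resource accounting for a command task, in CPUs.

// Current behavior: the executor gets a small allowance on top of the task,
// so the container can exceed what was offered for the task.
double currentUsage(double taskCpus, double defaultExecutorCpus)
{
  return taskCpus + defaultExecutorCpus;
}

// "Transfer" idea: credit the executor with the task's own resources, so the
// container never exceeds the task's offer.
double transferUsage(double taskCpus)
{
  return taskCpus;
}

int main()
{
  const double offered = 1.0;                    // cpus offered for the task
  assert(currentUsage(offered, 0.1) > offered);  // overcommit by the allowance
  assert(transferUsage(offered) == offered);     // no overcommit
  return 0;
}
{code}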



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.

2014-08-18 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1716:
--

Assignee: Benjamin Mahler

> The slave does not add pending tasks as part of the staging tasks metric.
> -
>
> Key: MESOS-1716
> URL: https://issues.apache.org/jira/browse/MESOS-1716
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Trivial
>
> The slave does not represent pending tasks in the "tasks_staging" metric.
> This should be a trivial fix.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MESOS-1715) The slave does not send pending tasks during re-registration.

2014-08-18 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1715:
--

Assignee: Benjamin Mahler

> The slave does not send pending tasks during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> In what looks like an oversight, the pending tasks in the slave 
> (Framework::pending) are not sent in the re-registration message.
> This can lead to spurious TASK_LOST notifications being generated by the 
> master when it falsely thinks the tasks are not present on the slave.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1720) Slave should send exited executor message when the executor is never launched.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1720:
--

 Summary: Slave should send exited executor message when the 
executor is never launched.
 Key: MESOS-1720
 URL: https://issues.apache.org/jira/browse/MESOS-1720
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler


When the slave sends TASK_LOST before launching an executor for a task, the 
slave does not send an exited executor message to the master.

Since the master receives no exited executor message, it still thinks the 
executor's resources are consumed on the slave.

One possible fix for this would be to send the exited executor message to the 
master in these cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1721) Prevent overcommit of the slave for ports and ephemeral ports.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1721:
--

 Summary: Prevent overcommit of the slave for ports and ephemeral 
ports.
 Key: MESOS-1721
 URL: https://issues.apache.org/jira/browse/MESOS-1721
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


It's possible for the slave to be overcommitted (e.g. MESOS-1668). In the case 
of "named" resources like ports and ephemeral_ports, this is problematic because 
the resources needed by newly launched tasks may already be in use.

This ticket is to present the idea of rejecting tasks when the slave is 
overcommitted on ports or ephemeral_ports. In order to ensure the master 
reconciles state with the slave, we can also trigger a re-registration.

For cpu / memory, this is less crucial, so preventing overcommit for these will 
be punted for later.
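
A rough sketch of what such a rejection check could look like, with 
hypothetical names and plain ints standing in for port ranges (this is not the 
Mesos resource code):

{code}
#include <set>

// Hypothetical sketch only: before launching a task, verify that the "named"
// resources it needs (here, individual ports) are still free on the slave;
// otherwise reject the task and let reconciliation / re-registration fix up
// the master's view.
bool canLaunch(const std::set<int>& freePorts,
               const std::set<int>& requestedPorts)
{
  for (std::set<int>::const_iterator it = requestedPorts.begin();
       it != requestedPorts.end(); ++it) {
    if (freePorts.count(*it) == 0) {
      return false;  // Overcommitted on ports: reject rather than launch.
    }
  }
  return true;
}
{code}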



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2014-08-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101757#comment-14101757
 ] 

Benjamin Mahler commented on MESOS-1466:


We're going to proceed with a mitigation of this by rejecting tasks once the 
slave is overcommitted:
https://issues.apache.org/jira/browse/MESOS-1721

However, we would also like to ensure that this kind of race is not possible. 
One solution is to use master acknowledgments for executor exits:

(1) When an executor terminates (or the executor could not be launched: 
MESOS-1720), we send an exited executor message.
(2) The master acknowledges these messages.
(3) The slave will not accept tasks for unacknowledged terminal executors (this 
must include those executors that could not be launched, per MESOS-1720).

The result of this is that a new executor cannot be launched until the master 
is aware of the old executor exiting.
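
A rough illustration of step (3), with hypothetical names (this is not Mesos 
code): the slave tracks executors whose exit has not yet been acknowledged by 
the master and refuses new tasks for them:

{code}
#include <set>
#include <string>

// Hypothetical sketch only: track executors whose 'exited' message has not
// yet been acknowledged by the master, including executors that were never
// launched (per MESOS-1720).
class ExitedExecutorTracker
{
public:
  // Called when the slave sends an exited-executor message (or fails to
  // launch the executor in the first place).
  void markExited(const std::string& executorId)
  {
    unacknowledged.insert(executorId);
  }

  // Called when the master's acknowledgement arrives.
  void acknowledge(const std::string& executorId)
  {
    unacknowledged.erase(executorId);
  }

  // Step (3): refuse tasks for executors whose exit is unacknowledged, so a
  // new executor cannot be launched until the master has caught up.
  bool canAcceptTaskFor(const std::string& executorId) const
  {
    return unacknowledged.count(executorId) == 0;
  }

private:
  std::set<std::string> unacknowledged;
};
{code}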

> Race between executor exited event and launch task can cause overcommit of 
> resources
> 
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>  Labels: reliability
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave which needs to commit for resources 
> for task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.

2014-08-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1715:
---

Summary: The slave does not send pending tasks / executors during 
re-registration.  (was: The slave does not send pending tasks during 
re-registration.)

> The slave does not send pending tasks / executors during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> In what looks like an oversight, the pending tasks in the slave 
> (Framework::pending) are not sent in the re-registration message.
> This can lead to spurious TASK_LOST notifications being generated by the 
> master when it falsely thinks the tasks are not present on the slave.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.

2014-08-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1715:
---

Description: 
In what looks like an oversight, the pending tasks and executors in the slave 
(Framework::pending) are not sent in the re-registration message.

For tasks, this can lead to spurious TASK_LOST notifications being generated by 
the master when it falsely thinks the tasks are not present on the slave.

For executors, this can lead to under-accounting in the master, causing an 
overcommit on the slave.

  was:
In what looks like an oversight, the pending tasks in the slave 
(Framework::pending) are not sent in the re-registration message.

This can lead to spurious TASK_LOST notifications being generated by the master 
when it falsely thinks the tasks are not present on the slave.


> The slave does not send pending tasks / executors during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> In what looks like an oversight, the pending tasks and executors in the slave 
> (Framework::pending) are not sent in the re-registration message.
> For tasks, this can lead to spurious TASK_LOST notifications being generated 
> by the master when it falsely thinks the tasks are not present on the slave.
> For executors, this can lead to under-accounting in the master, causing an 
> overcommit on the slave.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1734) Reduce compile time replacing macro expansions with variadic templates

2014-08-26 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111420#comment-14111420
 ] 

Benjamin Mahler commented on MESOS-1734:


Hi [~preillyme], we can't yet assume C++11 support:
https://issues.apache.org/jira/browse/MESOS-750

[~dhamon] would have a better idea of when we'll move to C++11 as a requirement.

> Reduce compile time replacing macro expansions with variadic templates
> --
>
> Key: MESOS-1734
> URL: https://issues.apache.org/jira/browse/MESOS-1734
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Patrick Reilly
>Assignee: Patrick Reilly
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1735) Better Startup Failure For Duplicate Master

2014-09-02 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118544#comment-14118544
 ] 

Benjamin Mahler commented on MESOS-1735:


We could use the EXIT approach from stout/exit.hpp here to avoid the abort / 
stacktrace and to include a helpful message.
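
For illustration, a standalone sketch of that idea (not stout's actual 
exit.hpp API); the bind-failure wording and variable names in the usage 
comment are hypothetical:

{code}
#include <cstdlib>
#include <iostream>
#include <sstream>

// Illustration only: stream a helpful message and terminate with a clean
// exit() instead of aborting with a stack trace and core dump.
struct Exit
{
  explicit Exit(int _status) : status(_status) {}

  ~Exit()
  {
    std::cerr << stream.str() << std::endl;
    std::exit(status);
  }

  template <typename T>
  Exit& operator<<(const T& t)
  {
    stream << t;
    return *this;
  }

  int status;
  std::ostringstream stream;
};

// Usage at the failed bind, e.g.:
//   Exit(EXIT_FAILURE) << "Failed to bind to " << ip << ":" << port
//                      << ": address already in use. Is another"
//                      << " mesos-master already running?";
{code}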

> Better Startup Failure For Duplicate Master
> ---
>
> Key: MESOS-1735
> URL: https://issues.apache.org/jira/browse/MESOS-1735
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.20.0
> Environment: Ubuntu 12.04
>Reporter: Ken Sipe
>
> The error message is cryptic when starting a mesos-master while another 
> mesos-master is already running. The message shown is:
> mesos-master --ip=192.168.74.174 --work_dir=~/mesos
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F0826 20:24:56.940961  3057 process.cpp:1632] Failed to initialize, bind: 
> Address already in use [98]
> *** Check failure stack trace: ***
> Aborted (core dumped)
> This can be a new user's first experience. It isn't clear to them that the 
> process is already running, and they are left unsure of what to do next.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.

2014-09-03 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120110#comment-14120110
 ] 

Benjamin Mahler commented on MESOS-1716:


https://reviews.apache.org/r/25302/
https://reviews.apache.org/r/25303/

> The slave does not add pending tasks as part of the staging tasks metric.
> -
>
> Key: MESOS-1716
> URL: https://issues.apache.org/jira/browse/MESOS-1716
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Trivial
>
> The slave does not represent pending tasks in the "tasks_staging" metric.
> This should be a trivial fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.

2014-09-03 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120138#comment-14120138
 ] 

Benjamin Mahler commented on MESOS-1714:


For now, this review avoids constructing an unflattened Resources object:
https://reviews.apache.org/r/25306/

> The C++ 'Resources' abstraction should keep the underlying resources 
> flattened.
> ---
>
> Key: MESOS-1714
> URL: https://issues.apache.org/jira/browse/MESOS-1714
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api
>Reporter: Benjamin Mahler
>
> Currently, the C++ Resources class does not ensure that the underlying 
> Resources protobufs are kept flat.
> This is an issue because some of the methods, e.g. 
> [Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269],
>  assume the resources are flat.
> There is code that constructs unflattened resources, e.g. 
> [Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353].
>  We could prevent this type of construction; however, it is perfectly fine if 
> we ensure the C++ 'Resources' class performs flattening.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MESOS-733) Speedup slave recovery tests

2014-09-03 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler closed MESOS-733.
-
Resolution: Incomplete

Closing this in favor of an epic to track testing speedups.

> Speedup slave recovery tests
> 
>
> Key: MESOS-733
> URL: https://issues.apache.org/jira/browse/MESOS-733
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>  Labels: twitter
>
> Several of the tests are slow now that they do offer checking. I suspect this 
> is due to the "Clock" semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1757) Speed up the tests.

2014-09-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1757:
--

 Summary: Speed up the tests.
 Key: MESOS-1757
 URL: https://issues.apache.org/jira/browse/MESOS-1757
 Project: Mesos
  Issue Type: Epic
  Components: technical debt, test
Reporter: Benjamin Mahler


The full test suite is exceeding the 7-minute mark (440 seconds on my machine); 
this epic is to track techniques to improve this:

# The reaper takes a full second to reap an exited process (MESOS-1199); this 
adds a second to each slave recovery test, and possibly more for things that 
rely on Subprocess.
# The command executor sleeps for a second when shutting down (MESOS-442); this 
adds a second to every test that uses the command executor.
# Now that the master and the slave have to perform sync'ed disk writes, 
consider using a ramdisk to speed up the disk writes.

Additional options that hopefully will not be necessary:

# Use automake's [parallel test 
harness|http://www.gnu.org/software/automake/manual/html_node/Parallel-Test-Harness.html]
 to compile tests separately and run tests in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1758) Freezer failure leads to lost task during container destruction.

2014-09-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1758:
--

 Summary: Freezer failure leads to lost task during container 
destruction.
 Key: MESOS-1758
 URL: https://issues.apache.org/jira/browse/MESOS-1758
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Benjamin Mahler


In the past we've seen numerous issues around the freezer. Lately, on the 
2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:

(1) An oom occurs.
(2) No indication of oom in the kernel logs.
(3) The slave is unable to freeze the cgroup.
(4) The task is marked as lost.

{noformat}
I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 
15488MB Maximum Used: 15488MB

MEMORY STATISTICS:
cache 7958691840
rss 8281653248
mapped_file 9474048
pgpgin 4487861
pgpgout 522933
pgfault 2533780
pgmajfault 11
inactive_anon 0
active_anon 8281653248
inactive_file 7631708160
active_file 326852608
unevictable 0
hierarchical_memory_limit 16240345088
total_cache 7958691840
total_rss 8281653248
total_mapped_file 9474048
total_pgpgin 4487861
total_pgpgout 522933
total_pgfault 2533780
total_pgmajfault 11
total_inactive_anon 0
total_active_anon 8281653248
total_inactive_file 7631728640
total_active_file 326852608
total_unevictable 0
I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container 
bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource 
mem(*):1.62403e+10 and will be terminated
I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 
'bbb9732a-d600-4c1b-b326-846338c608c3'
I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.710848ms
I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.588224ms
I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
2.15296ms
I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.643008ms
I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed 
age: 5.630238827780799days
I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.511168ms
I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for 
'/slave(1)/stats.json'
E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of 
framework '201104070004-002563-' failed: Failed to destroy container: 
discarded future
I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST 
(UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 
201104070004-002563- from @0.0.0.0:0
I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 
0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 
100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container 
bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status 
update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660

[jira] [Commented] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.

2014-09-04 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122338#comment-14122338
 ] 

Benjamin Mahler commented on MESOS-1715:


For now I've fixed it to send the pending tasks, since that is important for 
reconciliation:

https://reviews.apache.org/r/25371/
https://reviews.apache.org/r/25372/
https://reviews.apache.org/r/25373/

I'll pull out a ticket for the executors.

> The slave does not send pending tasks / executors during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> In what looks like an oversight, the pending tasks and executors in the slave 
> (Framework::pending) are not sent in the re-registration message.
> For tasks, this can lead to spurious TASK_LOST notifications being generated 
> by the master when it falsely thinks the tasks are not present on the slave.
> For executors, this can lead to under-accounting in the master, causing an 
> overcommit on the slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.

2014-09-04 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1715:
---
Shepherd: Vinod Kone  (was: Yan Xu)

> The slave does not send pending tasks / executors during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> In what looks like an oversight, the pending tasks and executors in the slave 
> (Framework::pending) are not sent in the re-registration message.
> For tasks, this can lead to spurious TASK_LOST notifications being generated 
> by the master when it falsely thinks the tasks are not present on the slave.
> For executors, this can lead to under-accounting in the master, causing an 
> overcommit on the slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-186) Resource offers should be rescinded after some configurable timeout

2014-09-05 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-186.
---
   Resolution: Fixed
Fix Version/s: 0.21.0

{noformat}
commit 707bf3b1d6f042ee92e7a291d3f74a20ae2d494b
Author: Kapil Arya 
Date:   Fri Sep 5 11:15:15 2014 -0700

Added optional --offer_timeout to rescind unused offers.

The ability to set an offer timeout helps prevent unfair resource
allocations in the face of frameworks that hoard offers, or that
accidentally drop offers.

When optimistic offers are added, hoarding will not affect the
fairness for other frameworks.

Review: https://reviews.apache.org/r/22066
{noformat}

> Resource offers should be rescinded after some configurable timeout
> ---
>
> Key: MESOS-186
> URL: https://issues.apache.org/jira/browse/MESOS-186
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework
>Reporter: Benjamin Hindman
>Assignee: Timothy Chen
> Fix For: 0.21.0
>
>
> Problem: a framework has a bug and holds on to resource offers by accident 
> for 24 hours.
> One suggestion: resource offers should be rescinded after some configurable 
> timeout.
> Possible issue: this might interfere with frameworks that are hoarding. But 
> one possible solution here is to add another API call which checks the status 
> of resource offers (i.e., "remindAboutOffer").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1476) Provide endpoints for deactivating / activating slaves.

2014-09-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1476:
---
Sprint:   (was: Mesos Q3 Sprint 5)

> Provide endpoints for deactivating / activating slaves.
> ---
>
> Key: MESOS-1476
> URL: https://issues.apache.org/jira/browse/MESOS-1476
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>  Labels: gsoc2014
>
> When performing maintenance operations on slaves, it is important to allow 
> these slaves to be drained of their tasks.
> The first essential primitive of draining slaves is to prevent them from 
> running more tasks. This can be achieved by "deactivating" them: stop sending 
> their resource offers to frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1476) Provide endpoints for deactivating / activating slaves.

2014-09-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1476:
--

Assignee: (was: Alexandra Sava)

Un-assigning for now since there is no longer a need for this with the updated 
maintenance design in MESOS-1474.

> Provide endpoints for deactivating / activating slaves.
> ---
>
> Key: MESOS-1476
> URL: https://issues.apache.org/jira/browse/MESOS-1476
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>  Labels: gsoc2014
>
> When performing maintenance operations on slaves, it is important to allow 
> these slaves to be drained of their tasks.
> The first essential primitive of draining slaves is to prevent them from 
> running more tasks. This can be achieved by "deactivating" them: stop sending 
> their resource offers to frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1592) Design inverse resource offer support

2014-09-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126421#comment-14126421
 ] 

Benjamin Mahler commented on MESOS-1592:


Moving this to reviewable as inverse offers were designed as part of the 
maintenance work: MESOS-1474.

We are currently considering how persistent resources will interact with 
inverse offers and the other maintenance primitives.

> Design inverse resource offer support
> -
>
> Key: MESOS-1592
> URL: https://issues.apache.org/jira/browse/MESOS-1592
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Alexandra Sava
>
> An "inverse" resource offer means that Mesos is requesting resources back 
> from the framework, possibly within some time interval.
> This can be leveraged initially to provide more automated cluster 
> maintenance, by offering schedulers the opportunity to move tasks to 
> compensate for planned maintenance. Operators can set a time limit on how 
> long to wait for schedulers to relocate tasks before the tasks are forcibly 
> terminated.
> Inverse resource offers have many other potential uses, as it opens the 
> opportunity for the allocator to attempt to move tasks in the cluster through 
> the co-operation of the framework, possibly providing better 
> over-subscription, fairness, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.

2014-09-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1717:
--

Assignee: (was: Benjamin Mahler)

Punting for now since it's not that important and the fix is non-trivial given 
how the slave structures the JSON models.

> The slave does not show pending tasks in the JSON endpoints.
> 
>
> Key: MESOS-1717
> URL: https://issues.apache.org/jira/browse/MESOS-1717
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Reporter: Benjamin Mahler
>
> The slave does not show pending tasks in the /state.json endpoint.
> This is a bit tricky to add since we rely on knowing the executor directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.

2014-09-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1717:
---
Sprint: Q3 Sprint 4  (was: Q3 Sprint 4, Mesos Q3 Sprint 5)

> The slave does not show pending tasks in the JSON endpoints.
> 
>
> Key: MESOS-1717
> URL: https://issues.apache.org/jira/browse/MESOS-1717
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Reporter: Benjamin Mahler
>
> The slave does not show pending tasks in the /state.json endpoint.
> This is a bit tricky to add since we rely on knowing the executor directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.

2014-09-10 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1786:
--

 Summary: FaultToleranceTest.ReconcilePendingTasks is flaky.
 Key: MESOS-1786
 URL: https://issues.apache.org/jira/browse/MESOS-1786
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


{noformat}
[ RUN  ] FaultToleranceTest.ReconcilePendingTasks
Using temporary directory '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm'
I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms
I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms
I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns
I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 1781ns
I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the db 
in 537ns
I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery
I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status
I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to STARTING
I0910 20:18:02.321723 21652 master.cpp:286] Master 
20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361
I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing authenticated 
frameworks to register
I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing authenticated 
slaves to register
I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for 
authentication from 
'/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials'
I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled
I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.781277ms
I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to 
STARTING
I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status
I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING
I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : master@127.0.1.1:60361
I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is 
master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634
I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No 
resources available to allocate!
I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed 
allocation for 0 slaves in 118308ns
I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master!
I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar
I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar
I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 1.887284299secs
I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to VOTING
I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos group
I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated
I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer
I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 9.044596ms
I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1
I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 18.839475ms
I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0
I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request for 
position 0
I0910 20:18:04.251260 21650 leveldb.cpp:438] Reading position from leveldb took 
43213ns
I0910 20:18:04.262251 21650 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 10.949353ms
I0910 20:18:04.262346 21650 replica.cpp:676] Persisted action at 0
I0910 20:18:04.262717 21650 replica.cpp:655] Replica received learned notice 
for position 0
I0910 20:18:04.271878 21650 leveldb.cpp

[jira] [Updated] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.

2014-09-10 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1786:
---
Sprint: Mesos Q3 Sprint 5

> FaultToleranceTest.ReconcilePendingTasks is flaky.
> --
>
> Key: MESOS-1786
> URL: https://issues.apache.org/jira/browse/MESOS-1786
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> {noformat}
> [ RUN  ] FaultToleranceTest.ReconcilePendingTasks
> Using temporary directory 
> '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm'
> I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms
> I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms
> I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns
> I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 
> 1781ns
> I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 537ns
> I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery
> I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status
> I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to 
> STARTING
> I0910 20:18:02.321723 21652 master.cpp:286] Master 
> 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361
> I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials'
> I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled
> I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.781277ms
> I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to 
> STARTING
> I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status
> I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING
> I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:60361
> I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is 
> master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634
> I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No 
> resources available to allocate!
> I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed 
> allocation for 0 slaves in 118308ns
> I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master!
> I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar
> I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar
> I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 1.887284299secs
> I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to 
> VOTING
> I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos 
> group
> I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated
> I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer
> I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 9.044596ms
> I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1
> I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 18.839475ms
> I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0
> I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request 
> for position 0
> I0910 20:18:04.251260 21650 leve

[jira] [Updated] (MESOS-1696) Improve reconciliation between master and slave.

2014-09-11 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1696:
---
Description: 
As we update the Master to keep tasks in memory until they are both terminal 
and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as 
follows:

{code}
Master   Slave
 {}   {}
{Tn}  {}  // Master receives Task T, non-terminal. Forwards to 
slave.
{Tn} {Tn} // Slave receives Task T, non-terminal.
{Tn} {Tt} // Task becomes terminal on slave. Update forwarded.
{Tt} {Tt} // Master receives update, forwards to framework.
 {}  {Tt} // Master receives ack, forwards to slave.
 {}   {}  // Slave receives ack.
{code}

In the current form of reconciliation, the slave sends to the master all tasks 
that are not both terminal and acknowledged. At any point in the above 
lifecycle, the slave's re-registration message can reach the master.

Note the following properties:

*(1)* The master may have a non-terminal task, not present in the slave's 
re-registration message.
*(2)* The master may have a non-terminal task, present in the slave's 
re-registration message but in a different state.
*(3)* The slave's re-registration message may contain a terminal unacknowledged 
task unknown to the master.

In the current master / slave 
[reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146]
 code, the master assumes that case (1) is because a launch task message was 
dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when 
the task reaches the slave correctly, so this can lead to inconsistency!

After chatting with [~vinodkone], we're considering updating the reconciliation 
to occur as follows:


→ Slave sends all tasks that are not both terminal and acknowledged, during 
re-registration. This is the same as before.

→ If the master sees tasks that are missing on the slave, the master sends 
those tasks to the slave to be reconciled. This can be piggy-backed on the 
re-registration message.

→ The slave will send TASK_LOST if the task is not known to it, preferably in a 
retried manner, unless we update socket closure handling on the slave to force 
a re-registration.

  was:
As we update the Master to keep tasks in memory until they are both terminal 
and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as 
follows:

{code}
Master   Slave
 {}   {}
{Tn}  {}  // Master receives Task T, non-terminal. Forwards to 
slave.
{Tn} {Tn} // Slave receives Task T, non-terminal.
{Tn} {Tt} // Task becomes terminal on slave. Update forwarded.
{Tt} {Tt} // Master receives update, forwards to framework.
 {}  {Tt} // Master receives ack, forwards to slave.
 {}   {}  // Slave receives ack.
{code}

In the current form of reconciliation, the slave sends to the master all tasks 
that are not both terminal and acknowledged. At any point in the above 
lifecycle, the slave's re-registration message can reach the master.

Note the following properties:

*(1)* The master may have a non-terminal task, not present in the slave's 
re-registration message.
*(2)* The master may have a non-terminal task, present in the slave's 
re-registration message but in a different state.
*(3)* The slave's re-registration message may contain a terminal unacknowledged 
task unknown to the master.

In the current master / slave 
[reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146]
 code, the master assumes that case (1) is because a launch task message was 
dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when 
the task reaches the slave correctly, so this can lead to inconsistency!

After chatting with [~vinodkone], we're considering updating the reconciliation 
to occur as follows:


→ Slave sends all tasks that are not both terminal and acknowledged, during 
re-registration. This is the same as before.

→ If the master sees tasks that are missing in the slave, the master sends a 
reconcile message to the slave for the tasks.

→ The slave will reply to reconcile messages with the latest state, or 
TASK_LOST if the task is not known to it. Preferably in a retried manner, 
unless we update socket closure on the slave to force a re-registration.
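A hypothetical sketch (standard containers, invented names; not the actual master code) of the master-side check in the updated flow: any task the master knows about that is absent from the slave's re-registration is sent back to the slave, which replies TASK_LOST for tasks it does not recognize.

{code}
#include <map>
#include <set>
#include <string>
#include <vector>

// masterTasks: task id -> latest state known to the master.
// reregisteredTasks: ids of tasks the slave reported (not terminal + acked).
std::vector<std::string> tasksToReconcile(
    const std::map<std::string, std::string>& masterTasks,
    const std::set<std::string>& reregisteredTasks)
{
  std::vector<std::string> missing;
  for (const auto& task : masterTasks) {
    if (reregisteredTasks.count(task.first) == 0) {
      missing.push_back(task.first);  // Ask the slave about this task.
    }
  }
  return missing;
}
{code}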


> Improve reconciliation between master and slave.
> 
>
> Key: MESOS-1696
> URL: https://issues.apache.org/jira/browse/MESOS-1696
> Project: Mesos
>  Issue Type: Bug
>  Components: master, slave
>Reporter: Benjamin Mahler
>Assignee: Vinod Kone
>
> As we update the Master to keep tasks in memory until they are both terminal 
> and acknowledge

[jira] [Commented] (MESOS-1410) Keep terminal unacknowledged tasks in the master's state.

2014-09-11 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131014#comment-14131014
 ] 

Benjamin Mahler commented on MESOS-1410:


https://reviews.apache.org/r/25565/
https://reviews.apache.org/r/25566/
https://reviews.apache.org/r/25567/
https://reviews.apache.org/r/25568/

> Keep terminal unacknowledged tasks in the master's state.
> -
>
> Key: MESOS-1410
> URL: https://issues.apache.org/jira/browse/MESOS-1410
> Project: Mesos
>  Issue Type: Task
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 0.21.0
>
>
> Once we are sending acknowledgments through the master as per MESOS-1409, we 
> need to keep terminal tasks that are *unacknowledged* in the Master's memory.
> This will allow us to identify these tasks to frameworks when we haven't yet 
> forwarded them an update. Without this, we're susceptible to MESOS-1389.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-750) Require compilers that support c++11

2014-09-12 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131912#comment-14131912
 ] 

Benjamin Mahler commented on MESOS-750:
---

Any plan to make the supported compiler versions explicit in the documentation?

http://mesos.apache.org/gettingstarted/

> Require compilers that support c++11
> 
>
> Key: MESOS-750
> URL: https://issues.apache.org/jira/browse/MESOS-750
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt
>Reporter: Benjamin Mahler
>Assignee: Dominic Hamon
> Fix For: 0.21.0
>
>
> Requiring C++11 support will provide substantial benefits to Mesos.
> Most notably, the lack of lambda support has resulted in a proliferation of 
> continuation style functions scattered throughout the code. Having lambdas 
> will allow us to reduce this clutter and simplify the code.
> This will require carefully documenting how to get Mesos compiling on various 
> systems to make this transition easy.
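As a small illustration (generic C++, not Mesos code) of the clutter reduction the description refers to: the same result handled by a named continuation function versus an inline C++11 lambda.

{code}
#include <functional>
#include <iostream>

// Pre-C++11 style: the continuation has to be a separately named function.
void printResult(int value)
{
  std::cout << value << std::endl;
}

void computeThen(const std::function<void(int)>& continuation)
{
  continuation(40 + 2);  // Pretend this value arrived asynchronously.
}

int main()
{
  computeThen(printResult);                                   // Named continuation.
  computeThen([](int value) { std::cout << value << "\n"; }); // Inline C++11 lambda.
  return 0;
}
{code}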



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-750) Require compilers that support c++11

2014-09-12 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131976#comment-14131976
 ] 

Benjamin Mahler commented on MESOS-750:
---

That sounds good; I'm thinking more about the case where a developer decides to 
write some code and is using, say, gcc 4.8.x.

Since we've worked backwards from gcc 4.4 to the configure script, they won't 
know whether they used something unsupported in 4.4. Reviewbot would be a nice 
way to catch it, but I don't think there are any gcc 4.4 Apache Jenkins slaves 
currently. =/

We also have some legacy stuff that deals with specific older compiler 
versions: 
https://github.com/apache/mesos/blob/0.20.0/src/slave/constants.hpp#L31

Would we be bumping the minimum compiler version frequently?

> Require compilers that support c++11
> 
>
> Key: MESOS-750
> URL: https://issues.apache.org/jira/browse/MESOS-750
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt
>Reporter: Benjamin Mahler
>Assignee: Dominic Hamon
> Fix For: 0.21.0
>
>
> Requiring C++11 support will provide substantial benefits to Mesos.
> Most notably, the lack of lambda support has resulted in a proliferation of 
> continuation style functions scattered throughout the code. Having lambdas 
> will allow us to reduce this clutter and simplify the code.
> This will require carefully documenting how to get Mesos compiling on various 
> systems to make this transition easy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1783) MasterTest.LaunchDuplicateOfferTest is flaky

2014-09-12 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1783:
---
Fix Version/s: 0.21.0

> MasterTest.LaunchDuplicateOfferTest is flaky
> 
>
> Key: MESOS-1783
> URL: https://issues.apache.org/jira/browse/MESOS-1783
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: ubuntu-14.04-gcc Jenkins VM
>Reporter: Yan Xu
>Assignee: Niklas Quarfot Nielsen
> Fix For: 0.21.0
>
>
> {noformat:title=}
> [ RUN  ] MasterTest.LaunchDuplicateOfferTest
> Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg'
> I0909 22:46:59.212977 21883 leveldb.cpp:176] Opened db in 20.307533ms
> I0909 22:46:59.219717 21883 leveldb.cpp:183] Compacted db in 6.470397ms
> I0909 22:46:59.219925 21883 leveldb.cpp:198] Created db iterator in 5571ns
> I0909 22:46:59.220100 21883 leveldb.cpp:204] Seeked to beginning of db in 
> 1365ns
> I0909 22:46:59.220268 21883 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 658ns
> I0909 22:46:59.220448 21883 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0909 22:46:59.220855 21903 recover.cpp:425] Starting replica recovery
> I0909 22:46:59.221103 21903 recover.cpp:451] Replica is in EMPTY status
> I0909 22:46:59.221626 21903 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0909 22:46:59.221914 21903 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0909 22:46:59.04 21903 recover.cpp:542] Updating replica status to 
> STARTING
> I0909 22:46:59.232590 21900 master.cpp:286] Master 
> 20140909-224659-16842879-44263-21883 (trusty) started on 127.0.1.1:44263
> I0909 22:46:59.233278 21900 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0909 22:46:59.233543 21900 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0909 22:46:59.233934 21900 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg/credentials'
> I0909 22:46:59.236431 21900 master.cpp:366] Authorization enabled
> I0909 22:46:59.237522 21898 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:44263
> I0909 22:46:59.237877 21904 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0909 22:46:59.238723 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 16.245391ms
> I0909 22:46:59.238916 21903 replica.cpp:320] Persisted replica status to 
> STARTING
> I0909 22:46:59.239203 21903 recover.cpp:451] Replica is in STARTING status
> I0909 22:46:59.239724 21903 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0909 22:46:59.239967 21903 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0909 22:46:59.240304 21903 recover.cpp:542] Updating replica status to VOTING
> I0909 22:46:59.240684 21900 master.cpp:1212] The newly elected leader is 
> master@127.0.1.1:44263 with id 20140909-224659-16842879-44263-21883
> I0909 22:46:59.240846 21900 master.cpp:1225] Elected as the leading master!
> I0909 22:46:59.241149 21900 master.cpp:1043] Recovering from registrar
> I0909 22:46:59.241509 21898 registrar.cpp:313] Recovering registrar
> I0909 22:46:59.248440 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 7.864221ms
> I0909 22:46:59.248644 21903 replica.cpp:320] Persisted replica status to 
> VOTING
> I0909 22:46:59.248846 21903 recover.cpp:556] Successfully joined the Paxos 
> group
> I0909 22:46:59.249330 21897 log.cpp:656] Attempting to start the writer
> I0909 22:46:59.249809 21897 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0909 22:46:59.250075 21903 recover.cpp:440] Recover process terminated
> I0909 22:46:59.258286 21897 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.292514ms
> I0909 22:46:59.258489 21897 replica.cpp:342] Persisted promised to 1
> I0909 22:46:59.258848 21897 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0909 22:46:59.259454 21897 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0909 22:46:59.267755 21897 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 8.109338ms
> I0909 22:46:59.267916 21897 replica.cpp:676] Persisted action at 0
> I0909 22:46:59.270128 21902 replica.cpp:508] Replica received write request 
> for position 0
> I0909 22:46:59.270294 21902 leveldb.cpp:438] Reading position from leveldb 
> took 27443ns
> I0909 22:46:59.277220 21902 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb 

[jira] [Commented] (MESOS-1783) MasterTest.LaunchDuplicateOfferTest is flaky

2014-09-12 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132235#comment-14132235
 ] 

Benjamin Mahler commented on MESOS-1783:


{noformat}
commit d6c1ef6842b70af068ba14896693266ed6067724
Author: Niklas Nielsen 
Date:   Fri Sep 12 14:40:54 2014 -0700

Fixed flaky MasterTest.LaunchDuplicateOfferTest.

A couple of races could occur in the "launch tasks on multiple offers"
tests where recovered resources from purposely-failed invocations turned
into a subsequent resource offer and oversaturated the expect's.

Review: https://reviews.apache.org/r/25588
{noformat}
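As a sketch of the general pattern (hypothetical mock and test, not the actual fix): following a one-shot expectation with a catch-all keeps later offers, e.g. from recovered resources, from oversaturating it.

{code}
#include <gmock/gmock.h>
#include <gtest/gtest.h>

#include <string>
#include <vector>

using testing::_;
using testing::Return;

// Hypothetical scheduler interface and mock, for illustration only.
class SchedulerInterface
{
public:
  virtual ~SchedulerInterface() {}
  virtual void resourceOffers(const std::vector<std::string>& offers) = 0;
};

class MockScheduler : public SchedulerInterface
{
public:
  MOCK_METHOD1(resourceOffers, void(const std::vector<std::string>&));
};

TEST(OfferExpectationSketch, IgnoresSubsequentOffers)
{
  MockScheduler sched;

  EXPECT_CALL(sched, resourceOffers(_))
    .WillOnce(Return())         // The offer the test actually cares about.
    .WillRepeatedly(Return());  // Ignore any later offers (recovered resources).

  sched.resourceOffers({"offer-1"});
  sched.resourceOffers({"offer-2"});  // Would oversaturate a bare WillOnce.
}

int main(int argc, char** argv)
{
  testing::InitGoogleMock(&argc, argv);
  return RUN_ALL_TESTS();
}
{code}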

> MasterTest.LaunchDuplicateOfferTest is flaky
> 
>
> Key: MESOS-1783
> URL: https://issues.apache.org/jira/browse/MESOS-1783
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: ubuntu-14.04-gcc Jenkins VM
>Reporter: Yan Xu
>Assignee: Niklas Quarfot Nielsen
> Fix For: 0.21.0
>
>
> {noformat:title=}
> [ RUN  ] MasterTest.LaunchDuplicateOfferTest
> Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg'
> I0909 22:46:59.212977 21883 leveldb.cpp:176] Opened db in 20.307533ms
> I0909 22:46:59.219717 21883 leveldb.cpp:183] Compacted db in 6.470397ms
> I0909 22:46:59.219925 21883 leveldb.cpp:198] Created db iterator in 5571ns
> I0909 22:46:59.220100 21883 leveldb.cpp:204] Seeked to beginning of db in 
> 1365ns
> I0909 22:46:59.220268 21883 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 658ns
> I0909 22:46:59.220448 21883 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0909 22:46:59.220855 21903 recover.cpp:425] Starting replica recovery
> I0909 22:46:59.221103 21903 recover.cpp:451] Replica is in EMPTY status
> I0909 22:46:59.221626 21903 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0909 22:46:59.221914 21903 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0909 22:46:59.04 21903 recover.cpp:542] Updating replica status to 
> STARTING
> I0909 22:46:59.232590 21900 master.cpp:286] Master 
> 20140909-224659-16842879-44263-21883 (trusty) started on 127.0.1.1:44263
> I0909 22:46:59.233278 21900 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0909 22:46:59.233543 21900 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0909 22:46:59.233934 21900 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg/credentials'
> I0909 22:46:59.236431 21900 master.cpp:366] Authorization enabled
> I0909 22:46:59.237522 21898 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:44263
> I0909 22:46:59.237877 21904 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0909 22:46:59.238723 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 16.245391ms
> I0909 22:46:59.238916 21903 replica.cpp:320] Persisted replica status to 
> STARTING
> I0909 22:46:59.239203 21903 recover.cpp:451] Replica is in STARTING status
> I0909 22:46:59.239724 21903 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0909 22:46:59.239967 21903 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0909 22:46:59.240304 21903 recover.cpp:542] Updating replica status to VOTING
> I0909 22:46:59.240684 21900 master.cpp:1212] The newly elected leader is 
> master@127.0.1.1:44263 with id 20140909-224659-16842879-44263-21883
> I0909 22:46:59.240846 21900 master.cpp:1225] Elected as the leading master!
> I0909 22:46:59.241149 21900 master.cpp:1043] Recovering from registrar
> I0909 22:46:59.241509 21898 registrar.cpp:313] Recovering registrar
> I0909 22:46:59.248440 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 7.864221ms
> I0909 22:46:59.248644 21903 replica.cpp:320] Persisted replica status to 
> VOTING
> I0909 22:46:59.248846 21903 recover.cpp:556] Successfully joined the Paxos 
> group
> I0909 22:46:59.249330 21897 log.cpp:656] Attempting to start the writer
> I0909 22:46:59.249809 21897 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0909 22:46:59.250075 21903 recover.cpp:440] Recover process terminated
> I0909 22:46:59.258286 21897 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.292514ms
> I0909 22:46:59.258489 21897 replica.cpp:342] Persisted promised to 1
> I0909 22:46:59.258848 21897 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0909 22:46:59.259454 21897 replica.cpp:375] Replica received explicit 
> promise req

[jira] [Commented] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.

2014-09-12 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132346#comment-14132346
 ] 

Benjamin Mahler commented on MESOS-1786:


https://reviews.apache.org/r/25604/

> FaultToleranceTest.ReconcilePendingTasks is flaky.
> --
>
> Key: MESOS-1786
> URL: https://issues.apache.org/jira/browse/MESOS-1786
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> {noformat}
> [ RUN  ] FaultToleranceTest.ReconcilePendingTasks
> Using temporary directory 
> '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm'
> I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms
> I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms
> I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns
> I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 
> 1781ns
> I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 537ns
> I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery
> I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status
> I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to 
> STARTING
> I0910 20:18:02.321723 21652 master.cpp:286] Master 
> 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361
> I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials'
> I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled
> I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 5.781277ms
> I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to 
> STARTING
> I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status
> I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING
> I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:60361
> I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is 
> master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634
> I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No 
> resources available to allocate!
> I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed 
> allocation for 0 slaves in 118308ns
> I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master!
> I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar
> I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar
> I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 1.887284299secs
> I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to 
> VOTING
> I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos 
> group
> I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated
> I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer
> I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 9.044596ms
> I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1
> I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 18.839475ms
> I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0
> I0910 20:18:04.251194 21650 replica.cpp:508] Replica received wr
