[jira] [Updated] (MESOS-3165) Persist and recover quota to/from Registry
[ https://issues.apache.org/jira/browse/MESOS-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-3165:
---------------------------------------
    Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Persist and recover quota to/from Registry
> ------------------------------------------
>
>                 Key: MESOS-3165
>                 URL: https://issues.apache.org/jira/browse/MESOS-3165
>             Project: Mesos
>          Issue Type: Task
>          Components: master, replicated log
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>              Labels: mesosphere
>
> To persist quotas across failovers, the Master should save them in the
> registry. To support this, we shall:
> * Introduce a Quota state variable in registry.proto;
> * Extend the Operation interface so that it supports a 'Quota' accumulator
>   (see src/master/registrar.hpp);
> * Introduce AddQuota / RemoveQuota operations;
> * Recover quotas from the registry on failover into the Master's
>   internal::master::Role struct;
> * Extend RegistrarTest with quota-specific tests.
> NOTE: The Registry variable can be rather big for production clusters (see
> MESOS-2075). While it should be fine for an MVP to add quota information to
> the registry, we should consider storing Quota separately, since it does not
> need to be in sync with agent updates. However, adding more variables is
> currently not supported by the registrar.
> While the Agents are reregistering (note they may fail to do so), the
> information about what part of the quota is allocated is only partially
> available to the Master. In other words, the state of the quota allocation
> is reconstructed as Agents reregister. During this period, some roles may
> appear to be under quota from the perspective of the newly elected Master.
> The same problem exists on the allocator side: it may think the cluster is
> under quota and may eagerly try to satisfy quotas before enough Agents
> reregister, which may result in resources being allocated to frameworks
> beyond their quota.
> To address this issue, and to avoid panicking and generating under-quota
> alerts, the Master should give a certain amount of time for the majority
> (e.g. 80%) of the Agents to reregister before reporting any quota status
> and notifying the allocator about granted quotas.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-1791) Introduce Master / Offer Resource Reservations aka Quota
[ https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-1791:
---------------------------------------
    Shepherd: Joris Van Remoortere

> Introduce Master / Offer Resource Reservations aka Quota
> --------------------------------------------------------
>
>                 Key: MESOS-1791
>                 URL: https://issues.apache.org/jira/browse/MESOS-1791
>             Project: Mesos
>          Issue Type: Epic
>          Components: allocation, master, replicated log
>            Reporter: Tom Arnfeld
>            Assignee: Alexander Rukletsov
>              Labels: mesosphere
>
> Currently Mesos supports the ability to reserve resources (for a given
> role) on a per-slave basis, as introduced in MESOS-505. This allows you to
> almost statically partition off a set of resources on a set of machines, to
> guarantee that certain types of frameworks get some resources.
> This is very useful, though it is also very useful to be able to control
> these reservations through the master (instead of per-slave) for when I
> don't care which nodes I get, as long as I get X cpu and Y RAM, or Z sets
> of (X, Y).
> I'm not sure what structure this could take, but apparently it has already
> been discussed. Would this be a CLI flag? Could there be an (authenticated)
> web interface to control these reservations?
[jira] [Updated] (MESOS-3717) Master recovery in presence of quota
[ https://issues.apache.org/jira/browse/MESOS-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-3717:
---------------------------------------
    Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Master recovery in presence of quota
> ------------------------------------
>
>                 Key: MESOS-3717
>                 URL: https://issues.apache.org/jira/browse/MESOS-3717
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>              Labels: mesosphere
>
> Quota complicates master failover in several ways. The new master should
> determine whether it is possible to satisfy the total quota and notify an
> operator in case it is not (imagine simultaneous failovers of multiple
> agents). The new master should also hint to the allocator how many agents
> might reconnect in the future, to help it decide how to satisfy quota
> before the majority of agents reconnect.
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998699#comment-14998699 ] Till Toenshoff commented on MESOS-3851: --- I will be committing the workaround patch Tim has provided https://reviews.apache.org/r/40107/ (thanks a bunch [~tnachen]!) shortly after running a final check on it. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before 
{{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce google::LogMessage::Flush() > @
[jira] [Updated] (MESOS-3581) License headers show up all over doxygen documentation.
[ https://issues.apache.org/jira/browse/MESOS-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Toenshoff updated MESOS-3581:
----------------------------------
    Target Version/s:   (was: 0.26.0)

> License headers show up all over doxygen documentation.
> -------------------------------------------------------
>
>                 Key: MESOS-3581
>                 URL: https://issues.apache.org/jira/browse/MESOS-3581
>             Project: Mesos
>          Issue Type: Documentation
>          Components: documentation
>    Affects Versions: 0.24.1
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Minor
>              Labels: mesosphere
>
> Currently license headers are commented in something resembling Javadoc
> style,
> {code}
> /**
>  * Licensed ...
> {code}
> Since we use Javadoc-style comment blocks for doxygen documentation, all
> license headers appear in the generated documentation, potentially and
> likely hiding the actual documentation.
> Using {{/*}} to start the comment blocks would be enough to hide them from
> doxygen, but would likely also result in a largish (though mostly
> uninteresting) patch.
[jira] [Issue Comment Deleted] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway updated MESOS-3870:
-------------------------------
    Comment: was deleted

(was: You mean "volatile"? The variable is read and written inside a
"synchronized" block, which will do the necessary synchronization (memory
barriers) to ensure that other CPUs see the appropriate values (provided they
also use synchronized blocks when examining the variable). There are a few
places that read "ProcessBase.state" without holding the mutex (e.g.,
ProcessManager::resume()) -- that is probably unsafe and should be fixed.
(Note that "volatile" is not sufficient/appropriate for ensuring reasonable
semantics for concurrent access to shared state without mutual exclusion,
anyway...))

> Prevent out-of-order libprocess message delivery
> ------------------------------------------------
>
>                 Key: MESOS-3870
>                 URL: https://issues.apache.org/jira/browse/MESOS-3870
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Neil Conway
>            Priority: Minor
>              Labels: mesosphere
>
> I was under the impression that {{send()}} provided in-order, unreliable
> message delivery. So if P1 sends <m1, m2> to P2, P2 might see <>, <m1>,
> <m2>, or <m1, m2> -- but not <m2, m1>.
> I suspect much of the code makes a similar assumption. However, it appears
> that this behavior is not guaranteed. slave.cpp:2217 has the following
> comment:
> {noformat}
> // TODO(jieyu): Here we assume that CheckpointResourcesMessages are
> // ordered (i.e., slave receives them in the same order master sends
> // them). This should be true in most of the cases because TCP
> // enforces in order delivery per connection. However, the ordering
> // is technically not guaranteed because master creates multiple
> // connections to the slave in some cases (e.g., persistent socket
> // to slave breaks and master uses ephemeral socket). This could
> // potentially be solved by using a version number and rejecting
> // stale messages according to the version number.
> {noformat}
> We can improve this situation by _either_: (1) fixing libprocess to
> guarantee ordered message delivery, e.g., by adding a sequence number, or
> (2) clarifying that ordered message delivery is not guaranteed, and ideally
> providing a tool to force messages to be delivered out-of-order.
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998837#comment-14998837 ]

haosdent commented on MESOS-3870:
---------------------------------
Ohoh, got it. Thank you for the explanation.
[jira] [Comment Edited] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998700#comment-14998700 ]

haosdent edited comment on MESOS-3870 at 11/10/15 3:09 PM:
-----------------------------------------------------------
Suppose a Process is enqueued to the runq twice when it receives two events
(or is that impossible? I could not find any code that prevents it from being
enqueued multiple times). The Process is then dequeued in different worker
threads, neither of which is running yet. In worker thread 1, the Process
pops event A but is not yet running. In worker thread 2, the Process pops
event B and starts running. Is this scenario possible?

was (Author: haosd...@gmail.com):
Suppose a Process is enqueued to the runq twice when it receives two events
(or is that impossible? I could not find any code that prevents it from being
enqueued multiple times). The dequeues happen in different worker threads,
neither of which is running yet. In worker thread 1, the Process dequeues
event A but is not yet running. In worker thread 2, the Process dequeues
event B and starts running. Is this scenario possible?
[jira] [Created] (MESOS-3873) Enhance allocator interface with the recovery() method
Alexander Rukletsov created MESOS-3873:
---------------------------------------

             Summary: Enhance allocator interface with the recovery() method
                 Key: MESOS-3873
                 URL: https://issues.apache.org/jira/browse/MESOS-3873
             Project: Mesos
          Issue Type: Task
          Components: allocation
            Reporter: Alexander Rukletsov
            Assignee: Alexander Rukletsov

There are some scenarios (e.g. when quota is set for some roles) in which it
makes sense to notify the allocator about master recovery. Introduce a method
in the allocator interface that allows for this.
[jira] [Updated] (MESOS-3862) Authorize quota requests
[ https://issues.apache.org/jira/browse/MESOS-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-3862:
---------------------------------------
    Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Authorize quota requests
> ------------------------
>
>                 Key: MESOS-3862
>                 URL: https://issues.apache.org/jira/browse/MESOS-3862
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>            Reporter: Jan Schlicht
>            Assignee: Jan Schlicht
>              Labels: acl, mesosphere, security
>
> When a quota is requested, the request should be authorized against the
> role it is for. This ticket will authorize quota requests with ACLs. The
> existing authorization support that was implemented in MESOS-1342 will be
> extended to add a `request_quotas` ACL.
[jira] [Updated] (MESOS-3720) Tests for Quota support in master
[ https://issues.apache.org/jira/browse/MESOS-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-3720:
---------------------------------------
    Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Tests for Quota support in master
> ---------------------------------
>
>                 Key: MESOS-3720
>                 URL: https://issues.apache.org/jira/browse/MESOS-3720
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Alexander Rukletsov
>            Assignee: Alexander Rukletsov
>              Labels: mesosphere
>
> Allocator-agnostic tests for quota support in the master. They can be
> divided into several groups:
> * Request validation;
> * Satisfiability validation;
> * Master failover;
> * Persisting in the registry;
> * Functionality and quota guarantees.
[jira] [Updated] (MESOS-3802) Clear the suppressed flag when deactivating a framework
[ https://issues.apache.org/jira/browse/MESOS-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Toenshoff updated MESOS-3802:
----------------------------------
    Target Version/s:   (was: 0.26.0)

> Clear the suppressed flag when deactivating a framework
> -------------------------------------------------------
>
>                 Key: MESOS-3802
>                 URL: https://issues.apache.org/jira/browse/MESOS-3802
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.26.0
>            Reporter: Guangya Liu
>            Assignee: Guangya Liu
>
> When a framework is deactivated, the suppressed flag is not cleared. As a
> result, the framework cannot get resources immediately after it is
> reactivated. We should clear this flag when deactivating the framework.
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998737#comment-14998737 ]

Neil Conway commented on MESOS-3870:
------------------------------------
I don't see how: the routine acquires ProcessBase.mutex before examining
ProcessBase.state.
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998805#comment-14998805 ]

Neil Conway commented on MESOS-3870:
------------------------------------
You mean "volatile"? The variable is read and written inside a "synchronized"
block, which will do the necessary synchronization (memory barriers) to
ensure that other CPUs see the appropriate values (provided they also use
synchronized blocks when examining the variable). There are a few places that
read "ProcessBase.state" without holding the mutex (e.g.,
ProcessManager::resume()) -- that is probably unsafe and should be fixed.
(Note that "volatile" is not sufficient/appropriate for ensuring reasonable
semantics for concurrent access to shared state without mutual exclusion,
anyway...)
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998884#comment-14998884 ]

haosdent commented on MESOS-3870:
---------------------------------
I think there is another case that makes the same Process run in different
threads. Suppose ProcessBase pops event A in thread 1 and changes
ProcessBase.state to BLOCKED in
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2463 ,
and has not yet consumed event B. Then event B arrives and enqueues the same
Process to ProcessManager.runq in
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L3008 .
Then ProcessManager dequeues it in thread 2, pops event B, and runs it while
thread 1 has not yet run event A's consume function. Is this possible?
[jira] [Commented] (MESOS-3065) Add authorization for persistent volume
[ https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998770#comment-14998770 ]

Greg Mann commented on MESOS-3065:
----------------------------------
MESOS-3065 should implement authorization for the Create/Destroy HTTP
endpoints, which are being added in MESOS-2455.

> Add authorization for persistent volume
> ---------------------------------------
>
>                 Key: MESOS-3065
>                 URL: https://issues.apache.org/jira/browse/MESOS-3065
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Michael Park
>            Assignee: Greg Mann
>              Labels: mesosphere, persistent-volumes
>
> Persistent volumes should be authorized with the {{principal}} of the
> reserving entity (framework or master). The idea is to introduce
> {{Create}} and {{Destroy}} into the ACL.
> {code}
> message Create {
>   // Subjects.
>   required Entity principals = 1;
>   // Objects? Perhaps the kind of volume? allowed permissions?
> }
> message Destroy {
>   // Subjects.
>   required Entity principals = 1;
>   // Objects.
>   required Entity creator_principals = 2;
> }
> {code}
> When a framework/operator creates a persistent volume, "create" ACLs are
> checked to see if the framework (FrameworkInfo.principal) or the operator
> (Credential.user) is authorized to create persistent volumes. If not
> authorized, the create operation is rejected.
> When a framework/operator destroys a persistent volume, "destroy" ACLs are
> checked to see if the framework (FrameworkInfo.principal) or the operator
> (Credential.user) is authorized to destroy a persistent volume created by
> a framework or operator (Resource.DiskInfo.principal). If not authorized,
> the destroy operation is rejected.
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998700#comment-14998700 ] haosdent commented on MESOS-3870: - Suppose a Process enqueue to runq twice(Or impossible, seems I could not find any code avoid it enqueue multi times) when it receive two events. And the dequeue in different work threads, and not yet running. In work thread 1, Process dequeue event A and not yet running. In work thread 2, Process dequeue event B and start running. Is this scenario possible? > Prevent out-of-order libprocess message delivery > > > Key: MESOS-3870 > URL: https://issues.apache.org/jira/browse/MESOS-3870 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere > > I was under the impression that {{send()}} provided in-order, unreliable > message delivery. So if P1 sendsto P2, P2 might see <>, , , > or — but not . > I suspect much of the code makes a similar assumption. However, it appears > that this behavior is not guaranteed. slave.cpp:2217 has the following > comment: > {noformat} > // TODO(jieyu): Here we assume that CheckpointResourcesMessages are > // ordered (i.e., slave receives them in the same order master sends > // them). This should be true in most of the cases because TCP > // enforces in order delivery per connection. However, the ordering > // is technically not guaranteed because master creates multiple > // connections to the slave in some cases (e.g., persistent socket > // to slave breaks and master uses ephemeral socket). This could > // potentially be solved by using a version number and rejecting > // stale messages according to the version number. 
> {noformat} > We can improve this situation by _either_: (1) fixing libprocess to guarantee > ordered message delivery, e.g., by adding a sequence number, or (2) > clarifying that ordered message delivery is not guaranteed, and ideally > providing a tool to force messages to be delivered out-of-order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
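The version-number idea in the TODO above can be sketched as a small receiver-side filter: track the highest sequence number applied so far and drop anything older. This is an illustrative C++ sketch only; `OrderedReceiver` is a hypothetical name, not an actual libprocess or Mesos API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical receiver-side filter for option (1): each message
// carries a monotonically increasing sequence number assigned by the
// sender, and the receiver rejects anything at or below the last
// number it applied.
class OrderedReceiver {
 public:
  // Returns true if the message should be applied; false means the
  // message is stale (or a duplicate) and should be dropped.
  bool accept(uint64_t sequence) {
    if (sequence <= lastApplied_) {
      return false;
    }
    lastApplied_ = sequence;
    return true;
  }

 private:
  uint64_t lastApplied_ = 0;
};
```

Note that this trades ordering for loss: a message that arrives after a newer one is dropped rather than reordered, which is exactly the "rejecting stale messages" behavior the TODO describes for CheckpointResourcesMessages.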
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998757#comment-14998757 ] haosdent commented on MESOS-3870: - Yes, but ProcessBase.state is not declared volatile. I am not sure whether a thread could still read a stale value while it is being changed by another thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3717) Master recovery in presence of quota
[ https://issues.apache.org/jira/browse/MESOS-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3717: --- Issue Type: Task (was: Bug) > Master recovery in presence of quota > > > Key: MESOS-3717 > URL: https://issues.apache.org/jira/browse/MESOS-3717 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: mesosphere > > Quota complicates master failover in several ways. The new master should > determine if it is possible to satisfy the total quota and notify an operator > in case it's not (imagine simultaneous failovers of multiple agents). The new > master should hint the allocator how many agents might reconnect in the > future to help it decide how to satisfy quota before the majority of agents > reconnect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3874) Implement recovery in the Hierarchical allocator
Alexander Rukletsov created MESOS-3874: -- Summary: Implement recovery in the Hierarchical allocator Key: MESOS-3874 URL: https://issues.apache.org/jira/browse/MESOS-3874 Project: Mesos Issue Type: Task Components: allocation Reporter: Alexander Rukletsov Assignee: Alexander Rukletsov The built-in Hierarchical allocator should implement recovery in the presence of quota. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998686#comment-14998686 ] haosdent commented on MESOS-3870: - I think ProcessManager could dequeue the same Process in different worker threads?
{noformat}
ProcessBase* ProcessManager::dequeue()
{
  // TODO(benh): Remove a process from this thread's runq. If there
  // are no processes to run, and this is not a dedicated thread, then
  // steal one from another threads runq.

  ProcessBase* process = NULL;

  synchronized (runq_mutex) {
    if (!runq.empty()) {
      process = runq.front();
      runq.pop_front();
      // Increment the running count of processes in order to support
      // the Clock::settle() operation (this must be done atomically
      // with removing the process from the runq).
      running.fetch_add(1);
    }
  }

  return process;
}
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3199) Validate Quota Requests.
[ https://issues.apache.org/jira/browse/MESOS-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3199: --- Shepherd: Joris Van Remoortere (was: Bernd Mathiske) > Validate Quota Requests. > > > Key: MESOS-3199 > URL: https://issues.apache.org/jira/browse/MESOS-3199 > Project: Mesos > Issue Type: Task >Reporter: Joerg Schad >Assignee: Joerg Schad > Labels: mesosphere > > We need to validate quota requests for both syntactic and semantic > correctness. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3073) Introduce HTTP endpoints for Quota
[ https://issues.apache.org/jira/browse/MESOS-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3073: --- Shepherd: Joris Van Remoortere (was: Benjamin Hindman) > Introduce HTTP endpoints for Quota > -- > > Key: MESOS-3073 > URL: https://issues.apache.org/jira/browse/MESOS-3073 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad > Labels: mesosphere > > We need to implement the HTTP endpoints for Quota as outlined in the Design > Doc: > (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3763) Need for http::put request method
[ https://issues.apache.org/jira/browse/MESOS-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3763: --- Shepherd: Joris Van Remoortere (was: Bernd Mathiske) > Need for http::put request method > - > > Key: MESOS-3763 > URL: https://issues.apache.org/jira/browse/MESOS-3763 > Project: Mesos > Issue Type: Task >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Minor > Labels: mesosphere > > We decided to create a more RESTful API for managing quota requests. > Therefore we also want to use the HTTP PUT method, and hence need to enable > libprocess/http to send PUT requests in addition to GET and POST requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3718) Implement Quota support in allocator
[ https://issues.apache.org/jira/browse/MESOS-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3718: --- Shepherd: Joris Van Remoortere (was: Benjamin Hindman) > Implement Quota support in allocator > > > Key: MESOS-3718 > URL: https://issues.apache.org/jira/browse/MESOS-3718 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: mesosphere > > The built-in Hierarchical DRF allocator should support Quota. This includes > (but is not limited to): adding, updating, removing and satisfying quota; > avoiding both overcommitting resources and handing them to non-quota'ed roles > in the presence of master failover. > A [design doc for Quota support in > Allocator|https://issues.apache.org/jira/browse/MESOS-2937] provides an > overview of the feature set required to be implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3418) Factor out V1 API test helper functions
[ https://issues.apache.org/jira/browse/MESOS-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-3418: -- Target Version/s: 0.27.0 (was: 0.26.0) > Factor out V1 API test helper functions > --- > > Key: MESOS-3418 > URL: https://issues.apache.org/jira/browse/MESOS-3418 > Project: Mesos > Issue Type: Improvement >Reporter: Joris Van Remoortere >Assignee: Guangya Liu > Labels: beginner, mesosphere, newbie, v1_api > > We currently have some helper functionality for V1 API tests. This is copied > in a few test files. > Factor this out into a common place once the API is stabilized.
> {code}
> // Helper class for using EXPECT_CALL since the Mesos scheduler API
> // is callback based.
> class Callbacks
> {
> public:
>   MOCK_METHOD0(connected, void(void));
>   MOCK_METHOD0(disconnected, void(void));
>   MOCK_METHOD1(received, void(const std::queue<Event>&));
> };
> {code}
> {code}
> // Enqueues all received events into a libprocess queue.
> // TODO(jmlvanre): Factor this common code out of tests into V1
> // helper.
> ACTION_P(Enqueue, queue)
> {
>   std::queue<Event> events = arg0;
>   while (!events.empty()) {
>     // Note that we currently drop HEARTBEATs because most of these tests
>     // are not designed to deal with heartbeats.
>     // TODO(vinod): Implement DROP_HTTP_CALLS that can filter heartbeats.
>     if (events.front().type() == Event::HEARTBEAT) {
>       VLOG(1) << "Ignoring HEARTBEAT event";
>     } else {
>       queue->put(events.front());
>     }
>     events.pop();
>   }
> }
> {code}
> We can also update the helpers in {{/tests/mesos.hpp}} to support the V1 API. This would let us get rid of lines like:
> {code}
> v1::TaskInfo taskInfo = evolve(createTask(devolve(offer), "", DEFAULT_EXECUTOR_ID));
> {code}
> in favor of:
> {code}
> v1::TaskInfo taskInfo = createTask(offer, "", DEFAULT_EXECUTOR_ID);
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998720#comment-14998720 ] Neil Conway commented on MESOS-3870: A process can't be enqueued onto the runq twice. This is prevented because a process is only added to the runq when it receives an event and the process is in state "BLOCKED"; once a process is on the runq, its state is changed to "READY", so it won't be readded again in the future. (https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2998-L3017) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
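The BLOCKED-to-READY guard Neil describes can be reduced to a compare-and-swap: only the thread that wins the state transition gets to enqueue the process. This is an illustrative C++ sketch of the pattern under that assumption, not the actual libprocess implementation (which also holds locks around the transition).

```cpp
#include <atomic>
#include <cassert>

// Illustrative reduction of the enqueue guard: a process may be put on
// the run queue only by the thread that successfully transitions its
// state from BLOCKED to READY, so two racing events cannot enqueue the
// same process twice.
enum class State { BLOCKED, READY, RUNNING };

struct Process {
  std::atomic<State> state{State::BLOCKED};
};

// Returns true iff the caller won the BLOCKED -> READY transition and
// is therefore responsible for putting the process on the runq.
bool tryEnqueue(Process& p) {
  State expected = State::BLOCKED;
  return p.state.compare_exchange_strong(expected, State::READY);
}
```

With this shape, a second event arriving while the process is READY or RUNNING simply fails the compare-and-swap and relies on the already-enqueued (or running) process to drain its event queue.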
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998730#comment-14998730 ] haosdent commented on MESOS-3870: - Yes, but I think it could still be enqueued twice, because ProcessBase.state may be stale in different CPU core caches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
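On the visibility concern raised above: in C++, a field that is only read and written while holding the same mutex needs neither volatile nor atomics, because the unlock/lock pair establishes a happens-before edge (this is what a `synchronized` block built on a mutex provides). Whether every access to ProcessBase.state is so protected is exactly the question being debated; the standalone sketch below only illustrates the memory-model rule itself.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// `value` is neither atomic nor volatile, yet the write is guaranteed
// visible to the reader because both sides lock the same mutex, which
// creates a happens-before relationship under the C++ memory model.
int value = 0;
std::mutex valueMutex;

void writer() {
  std::lock_guard<std::mutex> lock(valueMutex);
  value = 42;
}

int readAfterWriter() {
  std::thread t(writer);
  t.join();  // join() also synchronizes; the mutex alone would suffice
             // if the threads overlapped.
  std::lock_guard<std::mutex> lock(valueMutex);
  return value;
}
```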
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998731#comment-14998731 ] Till Toenshoff commented on MESOS-3851: --- The following commit fixes the crash. We may still want to find the root cause of the race condition, hence I will not close this ticket but will remove the target version to unblock 0.26.0.
{noformat}
commit b6d4b28a4c9ca717ad8be5bbc27e40c005fc51ad
Author: Timothy Chen
Date:   Tue Nov 10 15:46:17 2015 +0100

    Removed unused checks in command executor.

    Review: https://reviews.apache.org/r/40107
{noformat}
> Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900, i.e. updating CommandExecutor to > support rootfs, some tests show frequent crashes due to > assert violations. 
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > 
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status
[jira] [Comment Edited] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998884#comment-14998884 ] haosdent edited comment on MESOS-3870 at 11/10/15 4:49 PM: --- I think there is another case that makes the same Process run in different threads. Suppose in thread 1 ProcessBase pops event A and changes ProcessBase.state to BLOCKED in https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2463 but has not yet consumed event A. Then event B arrives and enqueues the same Process onto ProcessManager.runq in https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L3008 . Then ProcessManager dequeues it in thread 2, pops event B, and runs event B while thread 1 has not yet run the consume function for event A. Is this possible? was (Author: haosd...@gmail.com): I think there is another case that makes the same Process run in different threads. Suppose in thread 1 ProcessBase pops event A and changes ProcessBase.state to BLOCKED in https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2463 but has not yet consumed event B. Then event B arrives and enqueues the same Process onto ProcessManager.runq in https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L3008 . Then ProcessManager dequeues it in thread 2, pops event B, and runs event B while thread 1 has not yet run the consume function for event A. Is this possible? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3865) Failover and recovery in presence of Quota
[ https://issues.apache.org/jira/browse/MESOS-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3865: --- Shepherd: Joris Van Remoortere (was: Benjamin Hindman) > Failover and recovery in presence of Quota > -- > > Key: MESOS-3865 > URL: https://issues.apache.org/jira/browse/MESOS-3865 > Project: Mesos > Issue Type: Epic > Components: allocation, master >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: mesosphere > > Quota complicates master failover and recovery in several ways. The new > master should determine whether it is possible to satisfy the total quota and > notify an operator in case it's not (imagine simultaneous failovers of > multiple agents). The new master should hint the allocator how many agents > might reconnect in the future to help it decide how to satisfy quota before > the majority of agents reconnect. > The allocator interface should be updated with some sort of recovery > information, which will allow it to react properly (e.g. seize offers and > hold off resources for some time). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-809) External control of the ip that Mesos components publish to zookeeper
[ https://issues.apache.org/jira/browse/MESOS-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998931#comment-14998931 ] Anindya Sinha commented on MESOS-809: - The Mesos master/slave (libprocess) binds to the ip:port indicated via the environment vars LIBPROCESS_IP and LIBPROCESS_PORT (or via the --ip, --port command line args). If these are private IPs, the node is not reachable from outside (e.g. from schedulers), so we need a publicly accessible IP:Port such that the master/slave is reachable from another node. In this case, the publicly accessible IP:Port should be specified via the environment variables LIBPROCESS_ADVERTISE_IP and LIBPROCESS_ADVERTISE_PORT (or, on the master, via the command line args --advertise_ip, --advertise_port). Note that MESOS-3809 shall add these command line args to the mesos slave as well; till then, you can use the environment vars. Hope this helps. > External control of the ip that Mesos components publish to zookeeper > - > > Key: MESOS-809 > URL: https://issues.apache.org/jira/browse/MESOS-809 > Project: Mesos > Issue Type: Improvement > Components: framework, master, slave >Affects Versions: 0.14.2 >Reporter: Khalid Goudeaux >Assignee: Anindya Sinha >Priority: Minor > Fix For: 0.24.0 > > > With tools like Docker making containers more manageable, it's tempting to > use containers for all software installation. The CoreOS project is an > example of this. > When an application is run inside a container it sees a different ip/hostname > from the host system running the container. That ip is only valid from inside > that host, no other machine can see it. > From inside a container, the Mesos master and slave publish that private ip > to zookeeper and as a result they can't find each other if they're on > different machines. The --ip option can't help because the public ip isn't > available for binding from within a container. 
> Essentially, from inside the container, mesos processes don't know the ip > they're available at (they may not know the port either). > It would be nice to bootstrap the processes with the correct ip for them to > publish to zookeeper. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3062) Add authorization for dynamic reservation
[ https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999169#comment-14999169 ] Gabriel Hartmann commented on MESOS-3062: - Is it possible in this scheme that a Framework could see Offers it couldn't accept? Or does the work here imply that if a resource was reserved with a given role/principal pair and ACLs, it would only be re-offered to Frameworks authorized under the same role/principal pair? > Add authorization for dynamic reservation > - > > Key: MESOS-3062 > URL: https://issues.apache.org/jira/browse/MESOS-3062 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Michael Park >Assignee: Greg Mann > Labels: mesosphere, persistent-volumes > > Dynamic reservations should be authorized with the {{principal}} of the > reserving entity (framework or master). The idea is to introduce {{Reserve}} > and {{Unreserve}} into the ACL.
> {code}
> message Reserve {
>   // Subjects.
>   required Entity principals = 1;
>
>   // Objects. MVP: Only possible values = ANY, NONE
>   required Entity resources = 2;
> }
>
> message Unreserve {
>   // Subjects.
>   required Entity principals = 1;
>
>   // Objects.
>   required Entity reserver_principals = 2;
> }
> {code}
> When a framework/operator reserves resources, "reserve" ACLs are checked to > see if the framework ({{FrameworkInfo.principal}}) or the operator > ({{Credential.user}}) is authorized to reserve the specified resources. If > not authorized, the reserve operation is rejected. > When a framework/operator unreserves resources, "unreserve" ACLs are checked > to see if the framework ({{FrameworkInfo.principal}}) or the operator > ({{Credential.user}}) is authorized to unreserve the resources reserved by a > framework or operator ({{Resource.ReservationInfo.principal}}). If not > authorized, the unreserve operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
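As a toy model of the unreserve check described above (illustrative C++ only; `UnreserveACL` and the other names are invented for this sketch and are not Mesos code), the authorization question reduces to matching the requesting principal against subjects and the reserving principal against objects:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy "unreserve" ACL entry: which principals (subjects) may unreserve
// resources reserved by which reserver principals (objects). The
// special entry "ANY" matches everything, mirroring the Entity
// semantics sketched in the ticket.
struct UnreserveACL {
  std::vector<std::string> principals;          // Subjects.
  std::vector<std::string> reserverPrincipals;  // Objects.
};

// True if `s` matches one of the entries (or an "ANY" wildcard).
bool matches(const std::vector<std::string>& entries, const std::string& s) {
  for (const std::string& e : entries) {
    if (e == s || e == "ANY") {
      return true;
    }
  }
  return false;
}

// Deny-by-default check: the unreserve is allowed only if some ACL
// entry matches both the requesting principal and the principal that
// originally made the reservation.
bool authorizedToUnreserve(
    const std::vector<UnreserveACL>& acls,
    const std::string& principal,
    const std::string& reserverPrincipal) {
  for (const UnreserveACL& acl : acls) {
    if (matches(acl.principals, principal) &&
        matches(acl.reserverPrincipals, reserverPrincipal)) {
      return true;
    }
  }
  return false;
}
```

Under this model, an operator entry like `{"ops"} -> {"ANY"}` can unreserve anything, while a framework entry like `{"fw1"} -> {"fw1"}` can only unreserve its own reservations.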
[jira] [Issue Comment Deleted] (MESOS-809) External control of the ip that Mesos components publish to zookeeper
[ https://issues.apache.org/jira/browse/MESOS-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anindya Sinha updated MESOS-809: Comment: was deleted (was: Mesos master/slave (libprocess) binds to ip:port indicated via environment vars LIBPROCESS_IP and LIBPROCESS_PORT (or via the --ip, --port in command line args). If they are private IPs, then this node is not reachable from outside such as schedulers so we need a publically accessible IP:Port such that the master/slave is reachable from another node. In this case, the publically accessible IP:Port should be specified via the environment variables LIBPROCESS_ADVERTISE_IP and LIBPROCESS_ADVERTISE_PORT (or on the master can be specified via the command line args --advertise_ip, --advertise_port). Note that MESOS-3809 shall add these command line args to mesos slave as well till then, you can use the environment vars. Hope this helps. ) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998971#comment-14998971 ] haosdent commented on MESOS-3870: - ... Looks not possible. Only after consuming event A does ProcessBase.state become BLOCKED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3876) Per-Framework Dynamic Reservation
Gabriel Hartmann created MESOS-3876: --- Summary: Per-Framework Dynamic Reservation Key: MESOS-3876 URL: https://issues.apache.org/jira/browse/MESOS-3876 Project: Mesos Issue Type: Task Reporter: Gabriel Hartmann An instance of a Framework should be able to reserve resources in such a way that it is the only party which receives Offers once they are reserved. It should not have to resort to dynamic generation of Roles, as this exposes the ability to change Weights as well. This avoids any possibility that resources that an instance of a Framework expects ownership of are used by some other instance. It also simplifies required Framework logic as each instance doesn't have to deal with filtering out reserved Resources not intended for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3878) Log responses for HTTP requests
Alexander Rukletsov created MESOS-3878: -- Summary: Log responses for HTTP requests Key: MESOS-3878 URL: https://issues.apache.org/jira/browse/MESOS-3878 Project: Mesos Issue Type: Task Components: libprocess Reporter: Alexander Rukletsov When an HTTP request comes in, we log it twice: in libprocess using {{VLOG}} and in Mesos route handlers using {{LOG(INFO)}} (see MESOS-2519). However, we do not log the response, neither a successful one, nor even an error. In order to simplify debugging, I suggest we at least add symmetric logging for *all* responses at the libprocess level using the same logging level as is used now for incoming requests. We may want to additionally log messages for error responses (e.g. {{BadRequest}}, {{Conflict}}) in Mesos with {{LOG(ERROR)}} level, providing additional information like time taken to process the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3880) Propose a guideline for log messages
Alexander Rukletsov created MESOS-3880: -- Summary: Propose a guideline for log messages Key: MESOS-3880 URL: https://issues.apache.org/jira/browse/MESOS-3880 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Alexander Rukletsov We are rather inconsistent in the way we write log messages. It would be helpful to come up with a style and document various aspects of logs, including but not limited to: * Usage of backticks and/or single quotes to quote interpolated variables; * Usage of backticks and/or single quotes to quote types and other names; * Usage of tenses and other grammatical forms; * Proper way of nesting [error] messages; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3881) Implement `stout/os/pstree.hpp` on Windows
Alex Clemmer created MESOS-3881: --- Summary: Implement `stout/os/pstree.hpp` on Windows Key: MESOS-3881 URL: https://issues.apache.org/jira/browse/MESOS-3881 Project: Mesos Issue Type: Bug Components: stout Reporter: Alex Clemmer Assignee: Alex Clemmer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3877) Add operator documentation for quota
Alexander Rukletsov created MESOS-3877: -- Summary: Add operator documentation for quota Key: MESOS-3877 URL: https://issues.apache.org/jira/browse/MESOS-3877 Project: Mesos Issue Type: Task Components: documentation Reporter: Alexander Rukletsov Assignee: Alexander Rukletsov Add an operator guide for quota which describes basic usage of the endpoints and a few basic and advanced use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3879) Incorrect and inconsistent include order for and .
[ https://issues.apache.org/jira/browse/MESOS-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-3879: --- Story Points: 1 > Incorrect and inconsistent include order for and > . > - > > Key: MESOS-3879 > URL: https://issues.apache.org/jira/browse/MESOS-3879 > Project: Mesos > Issue Type: Bug >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Minor > > We currently have an inconsistent (and mostly incorrect) include order for > and (see below). Some files include them > (incorrectly) between the C and C++ standard headers, while others correctly > include them afterwards. According to the [Google Styleguide| > https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes] > the second include order is correct. > {code:title=external_containerizer_test.cpp} > #include > #include > #include > {code} > {code:title=launcher.hpp} > #include > #include > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3879) Incorrect and inconsistent include order for and .
Joerg Schad created MESOS-3879: -- Summary: Incorrect and inconsistent include order for and . Key: MESOS-3879 URL: https://issues.apache.org/jira/browse/MESOS-3879 Project: Mesos Issue Type: Bug Reporter: Joerg Schad Assignee: Joerg Schad Priority: Minor We currently have an inconsistent (and mostly incorrect) include order for and (see below). Some files include them (incorrectly) between the C and C++ standard headers, while others correctly include them afterwards. According to the [Google Styleguide| https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes] the second include order is correct. {code:title=external_containerizer_test.cpp} #include #include #include {code} {code:title=launcher.hpp} #include #include {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3878) Log responses for HTTP requests
[ https://issues.apache.org/jira/browse/MESOS-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3878: --- Description: When an HTTP request comes in, we log it twice: in the libprocess using {{VLOG}} and in Mesos route handlers using {{LOG(INFO)}} (see MESOS-2519). However, we do not log the response, neither a successful one, nor even an error. In order to simplify debugging, I suggest we add symmetric logging for *all* responses at the libprocess level using the same logging level as it is used now for incoming requests. We may want to additionally log messages for error responses (e.g. {{BadRequest}}, {{Conflict}} in Mesos with {{LOG(ERROR)}} level, providing additional information like time took to process the request. was: When an HTTP request comes in, we log it twice: in the libprocess using {{VLOG}} and in Mesos route handlers using {{LOG(INFO)}} (see MESOS-2519). However, we do not log the response, neither a successful one, nor even an error. In order to simplify debugging, I suggest we at least add symmetric logging for *all* responses at the libprocess level using the same logging level as it is used now for incoming requests. We may want to additionally log messages for error responses (e.g. {{BadRequest}}, {{Conflict}} in Mesos with {{LOG(ERROR)}} level, providing additional information like time took to process the request. > Log responses for HTTP requests > --- > > Key: MESOS-3878 > URL: https://issues.apache.org/jira/browse/MESOS-3878 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Alexander Rukletsov > Labels: mesosphere, newbie++ > > When an HTTP request comes in, we log it twice: in the libprocess using > {{VLOG}} and in Mesos route handlers using {{LOG(INFO)}} (see MESOS-2519). > However, we do not log the response, neither a successful one, nor even an > error. 
> In order to simplify debugging, I suggest we add symmetric logging for *all* > responses at the libprocess level using the same logging level as is used > now for incoming requests. We may want to additionally log messages for error > responses (e.g. {{BadRequest}}, {{Conflict}}) in Mesos with {{LOG(ERROR)}} > level, providing additional information like time taken to process the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
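One possible shape for such symmetric logging is a wrapper around route handlers, sketched here with simplified stand-in types rather than the real libprocess `http::Request`/`http::Response` and glog macros:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Illustrative sketch (not the libprocess API): wrap a route handler so the
// response is logged symmetrically to the request, with error responses
// (status >= 400) promoted to a higher severity.
struct Request { std::string method; std::string path; };
struct Response { int status; std::string body; };

using Handler = std::function<Response(const Request&)>;

Handler logged(const Handler& handler) {
  return [handler](const Request& request) {
    std::clog << "HTTP " << request.method << " " << request.path << "\n";
    Response response = handler(request);
    // Mirror the request log for every response; use a distinct prefix for
    // errors (a stand-in for LOG(ERROR) vs. VLOG in glog).
    const char* level = response.status >= 400 ? "ERROR" : "INFO";
    std::clog << level << ": " << response.status << " for "
              << request.method << " " << request.path << "\n";
    return response;
  };
}
```

Recording a timestamp before invoking the handler would also let the wrapper report the processing time the description asks for.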
[jira] [Updated] (MESOS-3283) Improve allocation performance especially with large number of slaves and frameworks.
[ https://issues.apache.org/jira/browse/MESOS-3283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3283: --- Assignee: (was: Marco Massenzio) > Improve allocation performance especially with large number of slaves and > frameworks. > - > > Key: MESOS-3283 > URL: https://issues.apache.org/jira/browse/MESOS-3283 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.23.0 >Reporter: Mandeep Chadha > Labels: mesosphere, tech-debt > > Improve batch allocations performance especially with large number of slaves > and frameworks. > e.g. these are the allocation timings for 10K slaves and varying number of > frameworks. > Using 1 slaves and 1 frameworks > Added 1 slaves in 14.50836112secs > Updated 1 slaves in 18.665093703secs > [ OK ] > SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/12 (34983 > ms) > [ RUN ] > SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13 > Using 1 slaves and 50 frameworks > Added 1 slaves in 51.534229549secs > Updated 1 slaves in 57.131554303secs > [ OK ] > SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13 (110449 > ms) > [ RUN ] > SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14 > Using 1 slaves and 100 frameworks > Added 1 slaves in 1.5891310434mins > Updated 1 slaves in 1.80562078148333mins > [ OK ] > SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14 (205467 > ms) > [ RUN ] > SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/15 > Using 1 slaves and 200 frameworks > Added 1 slaves in 3.0750647275mins > Updated 1 slaves in 3.85846762096667mins -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3035) As a Developer I would like a standard way to run a Subprocess in libprocess
[ https://issues.apache.org/jira/browse/MESOS-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3035: --- Shepherd: Michael Park (was: Joris Van Remoortere) > As a Developer I would like a standard way to run a Subprocess in libprocess > > > Key: MESOS-3035 > URL: https://issues.apache.org/jira/browse/MESOS-3035 > Project: Mesos > Issue Type: Story > Components: libprocess >Reporter: Marco Massenzio >Assignee: Marco Massenzio > > As part of MESOS-2830 and MESOS-2902 I have been researching the ability to > run a {{Subprocess}} and capture the {{stdout / stderr}} along with the exit > status code. > {{process::subprocess()}} offers much of the functionality, but in a way that > still requires a lot of handiwork on the developer's part; we would like to > further abstract away the ability to just pass a string, an optional set of > command-line arguments and then collect the output of the command (bonus: > without blocking). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3062) Add authorization for dynamic reservation
[ https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999351#comment-14999351 ] Gabriel Hartmann commented on MESOS-3062: - Thanks Greg. I was hoping we were almost going to get to per-framework dynamic reservation with this, but I guess not. > Add authorization for dynamic reservation > - > > Key: MESOS-3062 > URL: https://issues.apache.org/jira/browse/MESOS-3062 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Michael Park >Assignee: Greg Mann > Labels: mesosphere, persistent-volumes > > Dynamic reservations should be authorized with the {{principal}} of the > reserving entity (framework or master). The idea is to introduce {{Reserve}} > and {{Unreserve}} into the ACL. > {code} > message Reserve { > // Subjects. > required Entity principals = 1; > // Objects. MVP: Only possible values = ANY, NONE > required Entity resources = 2; > } > message Unreserve { > // Subjects. > required Entity principals = 1; > // Objects. > required Entity reserver_principals = 2; > } > {code} > When a framework/operator reserves resources, "reserve" ACLs are checked to > see if the framework ({{FrameworkInfo.principal}}) or the operator > ({{Credential.user}}) is authorized to reserve the specified resources. If > not authorized, the reserve operation is rejected. > When a framework/operator unreserves resources, "unreserve" ACLs are checked > to see if the framework ({{FrameworkInfo.principal}}) or the operator > ({{Credential.user}}) is authorized to unreserve the resources reserved by a > framework or operator ({{Resource.ReservationInfo.principal}}). If not > authorized, the unreserve operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3882) Libprocess: Implement process::Clock::finalize
Joseph Wu created MESOS-3882: Summary: Libprocess: Implement process::Clock::finalize Key: MESOS-3882 URL: https://issues.apache.org/jira/browse/MESOS-3882 Project: Mesos Issue Type: Task Components: libprocess, test Reporter: Joseph Wu Assignee: Joseph Wu Tracks this [TODO|https://github.com/apache/mesos/blob/aa0cd7ed4edf1184cbc592b5caa2429a8373e813/3rdparty/libprocess/src/process.cpp#L974-L975]. The {{Clock}} is initialized with a callback that, among other things, will dereference the global {{process_manager}} object. When libprocess is shutting down, the {{process_manager}} is cleaned up. Between cleanup and termination of libprocess, there is some chance that a {{Timer}} will time out and result in dereferencing {{process_manager}}. *Proposal* * Implement {{Clock::finalize}}. This would clear: ** existing timers ** process-specific clocks ** ticks * Change {{process::finalize}}. *# Resume the clock. (The clock is only paused during some tests.) When the clock is not paused, the callback does not dereference {{process_manager}}. *# Clean up {{process_manager}}. This terminates all the processes that would potentially interact with {{Clock}}. *# Call {{Clock::finalize}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
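The proposed ordering can be modeled with a toy example. `Clock`, `Runtime`, and the member names below are stand-ins for libprocess internals, not the actual API; the point is only the sequence: resume the clock, clean up the process manager, then clear timers.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative model of the proposed shutdown ordering for MESOS-3882.
struct Clock {
  bool paused = false;
  std::map<int, std::string> timers;  // deadline -> callback name

  void resume() { paused = false; }

  // Proposed Clock::finalize: drop timers (and, in the real proposal,
  // process-specific clocks and ticks) so nothing left can fire and
  // dereference a destroyed process_manager.
  void finalize() { timers.clear(); }
};

struct Runtime {
  Clock clock;
  bool process_manager_alive = true;
  std::vector<std::string> steps;

  void finalize() {
    clock.resume();                    // 1. While unpaused, the tick callback
    steps.push_back("resume");         //    does not touch process_manager.
    process_manager_alive = false;     // 2. Terminate processes that could
    steps.push_back("cleanup");        //    still interact with the Clock.
    clock.finalize();                  // 3. Now safe to clear timers/ticks.
    steps.push_back("clock.finalize");
  }
};
```

If step 3 ran before step 2, a live process could still register a timer against a clock that has already been torn down, which is the race the TODO describes.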
[jira] [Updated] (MESOS-3035) As a Developer I would like a standard way to run a Subprocess in libprocess
[ https://issues.apache.org/jira/browse/MESOS-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3035: --- Labels: mesosphere tech-debt (was: ) > As a Developer I would like a standard way to run a Subprocess in libprocess > > > Key: MESOS-3035 > URL: https://issues.apache.org/jira/browse/MESOS-3035 > Project: Mesos > Issue Type: Story > Components: libprocess >Reporter: Marco Massenzio >Assignee: Marco Massenzio > Labels: mesosphere, tech-debt > > As part of MESOS-2830 and MESOS-2902 I have been researching the ability to > run a {{Subprocess}} and capture the {{stdout / stderr}} along with the exit > status code. > {{process::subprocess()}} offers much of the functionality, but in a way that > still requires a lot of handiwork on the developer's part; we would like to > further abstract away the ability to just pass a string, an optional set of > command-line arguments and then collect the output of the command (bonus: > without blocking). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3062) Add authorization for dynamic reservation
[ https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999288#comment-14999288 ] Greg Mann commented on MESOS-3062: -- These patches don't affect which offers are made to which frameworks, nor which frameworks can accept which offers; a framework should still be able to utilize all the resources offered to it. Reserved resources will be offered to, and can be used by, any framework registered with the appropriate role, regardless of which principal did the reserving. This work provides authorization for the {{Reserve}} and {{Unreserve}} offer operations. So while a framework can still accept all the offers it receives, these patches do mean that a framework could receive offers containing resources which it doesn't have permission to reserve. A framework could also receive offers containing dynamically-reserved resources which it doesn't have the permission to unreserve. > Add authorization for dynamic reservation > - > > Key: MESOS-3062 > URL: https://issues.apache.org/jira/browse/MESOS-3062 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Michael Park >Assignee: Greg Mann > Labels: mesosphere, persistent-volumes > > Dynamic reservations should be authorized with the {{principal}} of the > reserving entity (framework or master). The idea is to introduce {{Reserve}} > and {{Unreserve}} into the ACL. > {code} > message Reserve { > // Subjects. > required Entity principals = 1; > // Objects. MVP: Only possible values = ANY, NONE > required Entity resources = 2; > } > message Unreserve { > // Subjects. > required Entity principals = 1; > // Objects. > required Entity reserver_principals = 2; > } > {code} > When a framework/operator reserves resources, "reserve" ACLs are checked to > see if the framework ({{FrameworkInfo.principal}}) or the operator > ({{Credential.user}}) is authorized to reserve the specified resources. If > not authorized, the reserve operation is rejected. 
> When a framework/operator unreserves resources, "unreserve" ACLs are checked > to see if the framework ({{FrameworkInfo.principal}}) or the operator > ({{Credential.user}}) is authorized to unreserve the resources reserved by a > framework or operator ({{Resource.ReservationInfo.principal}}). If not > authorized, the unreserve operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
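The two checks described above could be sketched like this. The types are simplified stand-ins for the Mesos ACL protobufs, and the real authorizer interface differs; this only illustrates which principals are compared in each case.

```cpp
#include <set>
#include <string>

// Simplified stand-in for the Reserve ACL: subjects are principals; the
// objects are ANY/NONE in the MVP, modeled here as a single boolean.
struct ReserveACL {
  std::set<std::string> principals;   // subjects allowed to reserve
  bool any_resources = true;          // MVP: ANY (true) or NONE (false)
};

// Simplified stand-in for the Unreserve ACL: subjects are principals; the
// objects are the principals whose reservations they may undo.
struct UnreserveACL {
  std::set<std::string> principals;
  std::set<std::string> reserver_principals;
};

// Check for a Reserve operation: is this FrameworkInfo.principal (or
// Credential.user for an operator) allowed to reserve?
bool authorizeReserve(const ReserveACL& acl, const std::string& principal) {
  return acl.any_resources && acl.principals.count(principal) > 0;
}

// Check for an Unreserve operation: the requesting principal must be a
// subject, and the original reserver (Resource.ReservationInfo.principal)
// must be among the objects.
bool authorizeUnreserve(
    const UnreserveACL& acl,
    const std::string& principal,
    const std::string& reserver) {
  return acl.principals.count(principal) > 0 &&
         acl.reserver_principals.count(reserver) > 0;
}
```

Note the asymmetry: Reserve is checked against resources (ANY/NONE in the MVP), while Unreserve is checked against the principal that made the original reservation.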
[jira] [Updated] (MESOS-3220) Offer ability to kill tasks from the API
[ https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3220: --- Description: We are investigating adding a {{dcos task kill}} command to our DCOS (and Mesos) command line interface. Currently the ability to kill tasks is only offered via the scheduler API so it would be useful to have some ability to kill tasks directly. This would complement the Maintenance Primitives, in that it would enable the operator to terminate those tasks which, for whatever reasons, do not respond to Inverse Offers events. was: We are investigating adding a `dcos task kill` command to our DCOS (and Mesos) command line interface. Currently the ability to kill tasks is only offered via the scheduler API so it would be useful to have some ability to kill tasks directly. This is a blocker for the DCOS CLI! > Offer ability to kill tasks from the API > > > Key: MESOS-3220 > URL: https://issues.apache.org/jira/browse/MESOS-3220 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Sunil Shah >Assignee: Marco Massenzio >Priority: Blocker > Labels: mesosphere > > We are investigating adding a {{dcos task kill}} command to our DCOS (and > Mesos) command line interface. Currently the ability to kill tasks is only > offered via the scheduler API so it would be useful to have some ability to > kill tasks directly. > This would complement the Maintenance Primitives, in that it would enable the > operator to terminate those tasks which, for whatever reasons, do not respond > to Inverse Offers events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3876) Per-Framework Dynamic Reservation
[ https://issues.apache.org/jira/browse/MESOS-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-3876: -- Assignee: Guangya Liu > Per-Framework Dynamic Reservation > - > > Key: MESOS-3876 > URL: https://issues.apache.org/jira/browse/MESOS-3876 > Project: Mesos > Issue Type: Task >Reporter: Gabriel Hartmann >Assignee: Guangya Liu > > An instance of a Framework should be able to reserve resources in such a way > that it is the only party which receives Offers once they are reserved. It > should not have to resort to dynamic generation of Roles, as this exposes the > ability to change Weights as well. > This avoids any possibility that resources that an instance of a Framework > expects ownership of are used by some other instance. It also simplifies > required Framework logic as each instance doesn't have to deal with filtering > out reserved Resources not intended for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3863) Investigate the requirements of programmatically re-initializing libprocess
[ https://issues.apache.org/jira/browse/MESOS-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-3863: - Description: This issue is for investigating what needs to be added/changed in {{process::finalize}} such that {{process::initialize}} will start on a clean slate. Additional issues will be created once done. Also see [the parent issue|MESOS-3820]. {{process::finalize}} should cover the following components: * {{__s__}} (the server socket) ** {{delete}} should be sufficient. This closes the socket and thereby prevents any further interaction from it. * {{process_manager}} ** Related prior work: [MESOS-3158] ** Cleans up the garbage collector, help, logging, profiler, statistics, route processes (including [this one|https://github.com/apache/mesos/blob/3bda55da1d0b580a1b7de43babfdc0d30fbc87ea/3rdparty/libprocess/src/process.cpp#L963], which currently leaks a pointer). ** Cleans up any other {{spawn}}'d process. ** Manages the {{EventLoop}}. * {{Clock}} ** The goal here is to clear any timers so that nothing can dereference {{process_manager}} while we're finalizing/finalized. It's probably not important to execute any remaining timers, since we're "shutting down" libprocess. This means: *** The clock should be {{paused}} and {{settled}} before the cleanup of {{process_manager}}. *** Processes, which might interact with the {{Clock}}, should be cleaned up next. *** A new {{Clock::finalize}} method would then clear timers, process-specific clocks, and {{tick}}s; and then {{resume}} the clock. * {{__address__}} (the advertised IP and port) ** Needs to be cleared after {{process_manager}} has been cleaned up. Processes use this to communicate events. If cleared prematurely, {{TerminateEvents}} will not be sent correctly, leading to infinite waits. * {{socket_manager}} ** The idea here is to close all sockets and deallocate any existing {{HttpProxy}} or {{Encoder}} objects. 
** All sockets are created via {{__s__}}, so cleaning up the server socket prior will prevent any new activity. * {{mime}} ** This is effectively a static map. ** It should be possible to statically initialize it. * Synchronization atomics {{initialized}} & {{initializing}}. ** Once cleanup is done, these should be reset. *Summary*: * Implement {{Clock::finalize}}. [MESOS-3882] * Implement {{~SocketManager}}. * Clean up {{mime}}. * Wrap everything up in {{process::finalize}}. was: This issue is for investigating what needs to be added/changed in {{process::finalize}} such that {{process::initialize}} will start on a clean slate. Additional issues will be created once done. Also see [the parent issue|MESOS-3820]. {{process::finalize}} should cover the following components: * {{__s__}} (the server socket) ** {{delete}} should be sufficient. This closes the socket and thereby prevents any further interaction from it. * {{process_manager}} ** Related prior work: [MESOS-3158] ** Cleans up the garbage collector, help, logging, profiler, statistics, route processes (including [this one|https://github.com/apache/mesos/blob/3bda55da1d0b580a1b7de43babfdc0d30fbc87ea/3rdparty/libprocess/src/process.cpp#L963], which currently leaks a pointer). ** Cleans up any other {{spawn}} 'd process. ** Manages the {{EventLoop}}. * {{Clock}} ** The goal here is to clear any timers so that nothing can deference {{process_manager}} while we're finalizing/finalized. It's probably not important to execute any remaining timers, since we're "shutting down" libprocess. This means: *** The clock should be {{paused}} and {{settled}} before the clean up of {{process_manager}}. *** Processes, which might interact with the {{Clock}}, should be cleaned up next. *** A new {{Clock::finalize}} method would then clear timers, process-specific clocks, and {{tick}} s; and then {{resume}} the clock. 
* {{__address__}} (the advertised IP and port) ** Needs to be cleared after {{process_manager}} has been cleaned up. Processes use this to communicate events. If cleared prematurely, {{TerminateEvents}} will not be sent correctly, leading to infinite waits. * {{socket_manager}} ** The idea here is to close all sockets and deallocate any existing {{HttpProxy}} or {{Encoder}} objects. ** All sockets are created via {{__s__}}, so cleaning up the server socket prior will prevent any new activity. * {{mime}} ** This is effectively a static map. ** It should be possible to statically initialize it. * Synchronization atomics {{initialized}} & {{initializing}}. ** Once cleanup is done, these should be reset. *Summary*: * Implement {{Clock::finalize}}. * Implement {{~SocketManager}}. * Clean up {{mime}}. * Wrap everything up in {{process::finalize}}. > Investigate the requirements of programmatically re-initializing libprocess > --- > > Key: MESOS-3863 >
[jira] [Updated] (MESOS-3879) Incorrect and inconsistent include order for and .
[ https://issues.apache.org/jira/browse/MESOS-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-3879: --- Sprint: Mesosphere Sprint 22 Labels: mesosphere (was: ) > Incorrect and inconsistent include order for and > . > - > > Key: MESOS-3879 > URL: https://issues.apache.org/jira/browse/MESOS-3879 > Project: Mesos > Issue Type: Bug >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Minor > Labels: mesosphere > > We currently have an inconsistent (and mostly incorrect) include order for > and (see below). Some files include them > (incorrectly) between the C and C++ standard headers, while others correctly > include them afterwards. According to the [Google Styleguide| > https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes] > the second include order is correct. > {code:title=external_containerizer_test.cpp} > #include > #include > #include > {code} > {code:title=launcher.hpp} > #include > #include > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999494#comment-14999494 ] Vinod Kone commented on MESOS-3851: --- Doesn't look like this is related to the new HTTP executor logic as this race seem to happen even in non-http-executor based tests. Also the changes in slave doesn't seem related. Either this race has always existed but only now got exposed due to the CHECK in the command executor or there are some recent libprocess related changes that are the cause. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before 
{{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: >
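If the race is indeed a {{RunTaskMessage}} overtaking {{ExecutorRegisteredMessage}}, one conventional mitigation is to queue early tasks until registration completes instead of CHECK-failing. The sketch below is a standalone illustration, not the actual CommandExecutor code:

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical executor that tolerates a task launch arriving before the
// registration acknowledgement.
class Executor {
public:
  // Called when ExecutorRegisteredMessage finally arrives: drain any tasks
  // that were queued while unregistered.
  void registered() {
    registered_ = true;
    while (!pending_.empty()) {
      run(pending_.front());
      pending_.pop();
    }
  }

  // Called on RunTaskMessage; may legitimately arrive first.
  void launchTask(const std::string& task) {
    if (!registered_) {
      pending_.push(task);  // arrived early: hold instead of CHECK-failing
      return;
    }
    run(task);
  }

  const std::vector<std::string>& launched() const { return launched_; }

private:
  void run(const std::string& task) { launched_.push_back(task); }

  bool registered_ = false;
  std::queue<std::string> pending_;
  std::vector<std::string> launched_;
};
```

This trades the crash for slightly delayed task launches; it does not fix the underlying ordering guarantee, which is what MESOS-3870 is about.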
[jira] [Updated] (MESOS-3220) Offer ability to kill tasks from the API
[ https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3220: --- Component/s: (was: python api) master > Offer ability to kill tasks from the API > > > Key: MESOS-3220 > URL: https://issues.apache.org/jira/browse/MESOS-3220 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Sunil Shah >Assignee: Marco Massenzio >Priority: Blocker > Labels: mesosphere > > We are investigating adding a `dcos task kill` command to our DCOS (and > Mesos) command line interface. Currently the ability to kill tasks is only > offered via the scheduler API so it would be useful to have some ability to > kill tasks directly. > This is a blocker for the DCOS CLI! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3868) Make apply-review.sh use apply-reviews.py
Artem Harutyunyan created MESOS-3868: Summary: Make apply-review.sh use apply-reviews.py Key: MESOS-3868 URL: https://issues.apache.org/jira/browse/MESOS-3868 Project: Mesos Issue Type: Bug Reporter: Artem Harutyunyan Assignee: Artem Harutyunyan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1478) Replace Master/Slave terminology
[ https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998219#comment-14998219 ] Erik Weathers commented on MESOS-1478: -- If I may ask, can someone please explain where the discussion & conclusion about the choice of the new name happened? I saw an email chain about *whether* to do the rename (which was inconclusive), and then when I attended MesosCon 2015 in Seattle, it was announced "from on high" that the new name was "agent". Was this discussed in some ad hoc informal forum? Decided internally to Mesosphere? > Replace Master/Slave terminology > > > Key: MESOS-1478 > URL: https://issues.apache.org/jira/browse/MESOS-1478 > Project: Mesos > Issue Type: Epic >Reporter: Clark Breyman >Assignee: Benjamin Hindman >Priority: Minor > Labels: mesosphere > > Inspired by the comments on this PR: > https://github.com/django/django/pull/2692 > TL;DR - Computers sharing work should be a good thing. Using the language of > human bondage and suffering is inappropriate in this context. It also has the > potential to alienate users and community members. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-1478) Replace Master/Slave terminology
[ https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998219#comment-14998219 ] Erik Weathers edited comment on MESOS-1478 at 11/10/15 8:24 AM: If I may ask, can someone please explain where the discussion & conclusion about the choice of the new name for "slave" happened? I saw an email chain about *whether* to do the rename (which was inconclusive), and then when I attended MesosCon 2015 in Seattle, it was announced "from on high" that the new name was "agent". Was this discussed in some ad hoc informal forum? Decided internally to Mesosphere? was (Author: erikdw): If I may ask, can someone please explain where the discussion & conclusion about the choice of the new name happened? I saw an email chain about *whether* to do the rename (which was inconclusive), and then when I attended MesosCon 2015 in Seattle, it was announced "from on high" that the new name was "agent". Was this discussed in some ad hoc informal forum? Decided internally to Mesosphere? > Replace Master/Slave terminology > > > Key: MESOS-1478 > URL: https://issues.apache.org/jira/browse/MESOS-1478 > Project: Mesos > Issue Type: Epic >Reporter: Clark Breyman >Assignee: Benjamin Hindman >Priority: Minor > Labels: mesosphere > > Inspired by the comments on this PR: > https://github.com/django/django/pull/2692 > TL;DR - Computers sharing work should be a good thing. Using the language of > human bondage and suffering is inappropriate in this context. It also has the > potential to alienate users and community members. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2455) Add operator endpoints to create/destroy persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998168#comment-14998168 ] Neil Conway commented on MESOS-2455: Hi Dan -- I'm working on this at the moment. I should have patches ready for review shortly. > Add operator endpoints to create/destroy persistent volumes. > > > Key: MESOS-2455 > URL: https://issues.apache.org/jira/browse/MESOS-2455 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Neil Conway >Priority: Critical > Labels: mesosphere, persistent-volumes > > Persistent volumes will not be released automatically. > So we probably need an endpoint for operators to forcefully release > persistent volumes. We probably need to add principal to Persistence struct > and use ACLs to control who can release what. > Additionally, it would be useful to have an endpoint for operators to create > persistent volumes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3871) Document libprocess message delivery semantics
Neil Conway created MESOS-3871: -- Summary: Document libprocess message delivery semantics Key: MESOS-3871 URL: https://issues.apache.org/jira/browse/MESOS-3871 Project: Mesos Issue Type: Documentation Components: documentation, libprocess Reporter: Neil Conway Priority: Minor What are the semantics of {{send()}} in libprocess? Specifically, does libprocess guarantee that messages will not be dropped, reordered, or duplicated? These are important properties to understand when building software on top of libprocess. Clearly message drops are allowed. Message reordering _appears_ to be allowed, although it should only happen in corner cases (see MESOS-3870). Duplicate message delivery probably can't happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998287#comment-14998287 ] Bernd Mathiske commented on MESOS-3851: --- [~marco-mesos]: Guess what I have been looking at as of yesterday :-) [~anandmazumdar] has analyzed this well. There is very likely some race between executor registration and task launching. That would completely explain the CHECK that fails. There is another, less likely explanation: faulty data, marshaling, transmission, or unmarshaling. My focus is on understanding how the code allows for said race, and once I understand it, I will try to cause the race by inserting sleep(someSeconds) somewhere suitable. Without that, there is no reliable way of reproducing the bug. It never happens when I run this, not even on CentOS 7.1. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. 
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > 
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: >
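The race described in this report, where a {{RunTaskMessage}} overtakes the {{ExecutorRegisteredMessage}}, can be illustrated with a small sketch. This is not Mesos code; the class and method names are invented, and it only shows the defensive pattern of buffering tasks until registration completes instead of failing a hard CHECK:

```python
class BufferingExecutor:
    """Illustrative sketch (not Mesos code): defer launchTask calls that
    arrive before registration completes, and flush them once
    registered() has run, so an out-of-order RunTaskMessage cannot
    trip a CHECK on missing executor info."""

    def __init__(self):
        self.executor_info = None
        self.pending_tasks = []
        self.launched = []

    def registered(self, executor_info):
        self.executor_info = executor_info
        # Flush tasks that raced ahead of the registration message.
        for task in self.pending_tasks:
            self._launch(task)
        self.pending_tasks = []

    def launch_task(self, task):
        if self.executor_info is None:
            # RunTaskMessage arrived before ExecutorRegisteredMessage:
            # buffer it instead of crashing.
            self.pending_tasks.append(task)
        else:
            self._launch(task)

    def _launch(self, task):
        self.launched.append(task)
```

With this shape, the message order seen in the failing log (launch before registered) is handled gracefully rather than fatally.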
[jira] [Created] (MESOS-3869) Better error reporting for bad user when launching containers
Isabel Jimenez created MESOS-3869: - Summary: Better error reporting for bad user when launching containers Key: MESOS-3869 URL: https://issues.apache.org/jira/browse/MESOS-3869 Project: Mesos Issue Type: Improvement Components: docker, slave Reporter: Isabel Jimenez Assignee: Isabel Jimenez When launching containers with a non-existent user, the scheduler receives the following error: "Abnormal executor termination". This error should provide more information; right now, to get more details you have to check the sandbox log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3870: --- Labels: mesosphere (was: ) > Prevent out-of-order libprocess message delivery > > > Key: MESOS-3870 > URL: https://issues.apache.org/jira/browse/MESOS-3870 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere > > I was under the impression that {{send()}} provided in-order, unreliable > message delivery. So if P1 sends m1 followed by m2 to P2, P2 might see > nothing, only m1, only m2, or m1 followed by m2, but not m2 followed by m1. > I suspect much of the code makes a similar assumption. However, it appears > that this behavior is not guaranteed. slave.cpp:2217 has the following > comment: > {noformat} > // TODO(jieyu): Here we assume that CheckpointResourcesMessages are > // ordered (i.e., slave receives them in the same order master sends > // them). This should be true in most of the cases because TCP > // enforces in order delivery per connection. However, the ordering > // is technically not guaranteed because master creates multiple > // connections to the slave in some cases (e.g., persistent socket > // to slave breaks and master uses ephemeral socket). This could > // potentially be solved by using a version number and rejecting > // stale messages according to the version number. > {noformat} > We can improve this situation by _either_: (1) fixing libprocess to guarantee > ordered message delivery, e.g., by adding a sequence number, or (2) > clarifying that ordered message delivery is not guaranteed, and ideally > providing a tool to force messages to be delivered out-of-order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
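The version-number idea from the TODO quoted above can be sketched as a receiver-side guard. This is an illustrative Python sketch, not the libprocess or slave implementation:

```python
class VersionedReceiver:
    """Sketch of rejecting stale messages by version number: a message
    is applied only if its version is newer than anything already
    applied, so messages reordered across reconnects cannot roll
    state backwards."""

    def __init__(self):
        self.last_version = 0
        self.applied = []

    def receive(self, version, message):
        if version <= self.last_version:
            # Stale or duplicate: reject rather than apply out of order.
            return False
        self.last_version = version
        self.applied.append(message)
        return True
```

Note this trades reordering for drops: a late-arriving older message is discarded, which is acceptable for state-bearing messages like CheckpointResourcesMessage where the newest version subsumes older ones.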
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998288#comment-14998288 ] Bernd Mathiske commented on MESOS-3851: --- That said we could tag 0.26.0 without the change in CommandExecutor. This would leave Fetcher tests flaky, but at least CommandExecutor could launch tasks with some probability even if a race occurred. > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before 
{{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce
[jira] [Updated] (MESOS-3872) Investigate adding color to `support/post-reviews.py` on Windows
[ https://issues.apache.org/jira/browse/MESOS-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Clemmer updated MESOS-3872: Description: >From the comments: # TODO(hausdorff): We have disabled colors for the diffs on Windows, as piping them through `subprocess` causes us to emit ANSI escape codes, which the command prompt doesn't recognize. Presumably we are being routed through some TTY that causes git to not emit the colors using `cmd`'s color codes API (which is entirely different from ANSI. See [1] for more information and MESOS-3872. # # [1] http://stackoverflow.com/questions/5921556/in-git-bash-on-windows-7-colors-display-as-code-when-running-cucumber-or-rspec > Investigate adding color to `support/post-reviews.py` on Windows > > > Key: MESOS-3872 > URL: https://issues.apache.org/jira/browse/MESOS-3872 > Project: Mesos > Issue Type: Bug > Components: general >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > > From the comments: > # TODO(hausdorff): We have disabled colors for the diffs on Windows, as > piping them through `subprocess` causes us to emit ANSI escape codes, which > the command prompt doesn't recognize. Presumably we are being routed through > some TTY that causes git to not emit the colors using `cmd`'s color codes API > (which is entirely different from ANSI. See [1] for more information and > MESOS-3872. > # > # [1] > http://stackoverflow.com/questions/5921556/in-git-bash-on-windows-7-colors-display-as-code-when-running-cucumber-or-rspec -- This message was sent by Atlassian JIRA (v6.3.4#6332)
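A common way to approach the problem in this ticket is to decide up front whether ANSI colors will actually render before emitting them. The helper below is a hypothetical sketch and is not part of {{support/post-reviews.py}}; the {{ANSICON}} environment check is one heuristic for an ANSI-capable Windows terminal, since plain {{cmd.exe}} prints the escape codes literally:

```python
import os
import sys


def use_color(stream=sys.stdout):
    # Hypothetical helper (not from support/post-reviews.py): only emit
    # ANSI escape codes when the stream is an interactive terminal.
    if not stream.isatty():
        return False
    if os.name == 'nt':
        # Plain cmd.exe does not interpret ANSI escapes; require an
        # ANSI-capable wrapper (e.g. ANSICON) to advertise itself.
        return 'ANSICON' in os.environ
    return True
```

A caller would then pass `--color=always`/`--color=never` to git (or strip escapes) based on this check, instead of piping colored output through `subprocess` unconditionally.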
[jira] [Created] (MESOS-3872) Investigate adding color to `support/post-reviews.py` on Windows
Alex Clemmer created MESOS-3872: --- Summary: Investigate adding color to `support/post-reviews.py` on Windows Key: MESOS-3872 URL: https://issues.apache.org/jira/browse/MESOS-3872 Project: Mesos Issue Type: Bug Components: general Reporter: Alex Clemmer Assignee: Alex Clemmer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3024) HTTP endpoint authN is enabled merely by specifying --credentials
[ https://issues.apache.org/jira/browse/MESOS-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999702#comment-14999702 ] Marco Massenzio commented on MESOS-3024: BTW - shutting down the framework works too: {noformat} I 00:48:02.558192 2789 http.cpp:336] HTTP POST for /master//api/v1/scheduler from 192.168.33.1:52509 with User-Agent='python-requests/2.7.0 CPython/2.7.10 Darwin/15.0.0' I 00:48:02.558320 2789 master.cpp:5571] Removing framework 0878d422-0e83-4b15-8a26-f04a6e3d829f- (Example HTTP Framework) I 00:48:02.558527 2789 hierarchical.hpp:599] Deactivated framework 0878d422-0e83-4b15-8a26-f04a6e3d829f- I 00:48:02.558600 2789 hierarchical.hpp:1103] Recovered ports(*):[9000-1]; ephemeral_ports(*):[32768-57344]; cpus(*):1; mem(*):496; disk(*):35164 (total: ports(*):[9000-1]; ephemeral_ports(*):[32768-57344]; cpus(*):1; mem(*):496; disk(*):35164, allocated: ) on slave e08833af-00af-44c6-abd1-bc666b1949c0-S0 from framework 0878d422-0e83-4b15-8a26-f04a6e3d829f- I 00:48:02.558624 2789 hierarchical.hpp:552] Removed framework 0878d422-0e83-4b15-8a26-f04a6e3d829f- {noformat} no authentication provided on this call either. > HTTP endpoint authN is enabled merely by specifying --credentials > - > > Key: MESOS-3024 > URL: https://issues.apache.org/jira/browse/MESOS-3024 > Project: Mesos > Issue Type: Bug > Components: master, security >Reporter: Adam B >Assignee: Marco Massenzio > Labels: authentication, http, mesosphere > > If I set `--credentials` on the master, framework and slave authentication > are allowed, but not required. On the other hand, http authentication is now > required for authenticated endpoints (currently only `/shutdown`). That means > that I cannot enable framework or slave authentication without also enabling > http endpoint authentication. This is undesirable. > Framework and slave authentication have separate flags (`\--authenticate` and > `\--authenticate_slaves`) to require authentication for each. 
It would be > great if there was also such a flag for framework authentication. Or maybe we > get rid of these flags altogether and rely on ACLs to determine which > unauthenticated principals are even allowed to authenticate for each > endpoint/action. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3024) HTTP endpoint authN is enabled merely by specifying --credentials
[ https://issues.apache.org/jira/browse/MESOS-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999696#comment-14999696 ] Marco Massenzio commented on MESOS-3024: I am unclear about this: {quote} It would be great if there was also such a flag for framework authentication. {quote} Is this a typo? ({{--authenticate}} does exactly that) Looking at [master/http.cpp|https://github.com/apache/mesos/blob/master/src/master/http.cpp#L375]: {code} if (master->flags.authenticate_frameworks) { return Unauthorized( "Mesos master", "HTTP schedulers are not supported when authentication is required"); } {code} It seems to me that the HTTP API requires authentication for *all* request types; and that is required only when {{--authenticate}} is set on the master: when [master sets the {{credentials}} flag|https://github.com/apache/mesos/blob/master/src/master/master.cpp#L425] the former is not touched. To test this, I launched a Master with {{--credentials}} but no {{--authenticate}} and then registered a framework via the HTTP API and also received an offer - it all worked just fine. I am assuming here that I'm missing something fundamental, can folks please clarify what the issue is? Thanks! > HTTP endpoint authN is enabled merely by specifying --credentials > - > > Key: MESOS-3024 > URL: https://issues.apache.org/jira/browse/MESOS-3024 > Project: Mesos > Issue Type: Bug > Components: master, security >Reporter: Adam B >Assignee: Marco Massenzio > Labels: authentication, http, mesosphere > > If I set `--credentials` on the master, framework and slave authentication > are allowed, but not required. On the other hand, http authentication is now > required for authenticated endpoints (currently only `/shutdown`). That means > that I cannot enable framework or slave authentication without also enabling > http endpoint authentication. This is undesirable. 
> Framework and slave authentication have separate flags (`\--authenticate` and > `\--authenticate_slaves`) to require authentication for each. It would be > great if there was also such a flag for framework authentication. Or maybe we > get rid of these flags altogether and rely on ACLs to determine which > unauthenticated principals are even allowed to authenticate for each > endpoint/action. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
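The distinction this ticket asks for, between credentials merely being available and authentication actually being required per endpoint, can be sketched with illustrative flag names (these are not the real master flags):

```python
def framework_authn_possible(flags):
    # Supplying --credentials only makes authentication possible;
    # it should not by itself require anything.
    return flags.get('credentials') is not None


def http_authn_required(flags):
    # Sketch of the requested decoupling (illustrative flag name):
    # HTTP endpoint authentication is gated by its own explicit flag,
    # not implied by --credentials.
    return bool(flags.get('authenticate_http', False))
```

Under this scheme, a master started with credentials but without the explicit flag would still allow unauthenticated access to HTTP endpoints, which is the behavior the reporter expected.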
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998453#comment-14998453 ] haosdent commented on MESOS-3851: - In the error log, Executor::registered and Executor::launchTask have different thread ids. {noformat} I1110 00:36:30.616987 5169 exec.cpp:306] Executor::launchTask took 160701ns I1110 00:36:30.621285 5163 exec.cpp:222] Executor::registered took 399555ns {noformat} But in local tests, these always have the same thread id. {noformat} I1110 19:34:46.304114 8953 exec.cpp:222] Executor::registered took 182100ns I1110 19:34:46.304416 8953 exec.cpp:306] Executor::launchTask took 47975ns {noformat} {noformat} I1110 19:34:47.439801 9027 exec.cpp:222] Executor::registered took 257152ns I1110 19:34:47.440234 9027 exec.cpp:306] Executor::launchTask took 111249ns {noformat} {noformat} I1110 19:34:47.943961 9097 exec.cpp:222] Executor::registered took 271225ns I1110 19:34:47.944284 9097 exec.cpp:306] Executor::launchTask took 45141ns {noformat} > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. 
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > 
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. > Link to complete log:
[jira] [Comment Edited] (MESOS-3024) HTTP endpoint authN is enabled merely by specifying --credentials
[ https://issues.apache.org/jira/browse/MESOS-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999696#comment-14999696 ] Marco Massenzio edited comment on MESOS-3024 at 11/11/15 12:47 AM: --- I am unclear about this: {quote} It would be great if there was also such a flag for framework authentication. {quote} Is this a typo? ({{--authenticate}} does exactly that) Looking at [master/http.cpp|https://github.com/apache/mesos/blob/master/src/master/http.cpp#L375]: {code} if (master->flags.authenticate_frameworks) { return Unauthorized( "Mesos master", "HTTP schedulers are not supported when authentication is required"); } {code} It seems to me that the HTTP API requires authentication for *all* request types; and that is required only when {{--authenticate}} is set on the master: when [master sets the {{credentials}} flag|https://github.com/apache/mesos/blob/master/src/master/master.cpp#L425] the former is not touched. To test this, I launched a Master with {{-- credentials}} but no {{-- authenticate}} and then registered a framework via the HTTP API and also received an offer - it all worked just fine. I am assuming here that I'm missing something fundamental, can folks please clarify what the issue is? Thanks! was (Author: marco-mesos): I am unclear about this: {quote} It would be great if there was also such a flag for framework authentication. {quote} Is this a typo? 
({{--authenticate}} does exactly that) Looking at [master/http.cpp|https://github.com/apache/mesos/blob/master/src/master/http.cpp#L375]: {code} if (master->flags.authenticate_frameworks) { return Unauthorized( "Mesos master", "HTTP schedulers are not supported when authentication is required"); } {code} It seems to me that the HTTP API requires authentication for *all* request types; and that is required only when {{--authenticate}} is set on the master: when [master sets the {{credentials}} flag|https://github.com/apache/mesos/blob/master/src/master/master.cpp#L425] the former is not touched. To test this, I launched a Master with {{--credentials}} but no {{--authenticate}} and then registered a framework via the HTTP API and also received an offer - it all worked just fine. I am assuming here that I'm missing something fundamental, can folks please clarify what the issue is? Thanks! > HTTP endpoint authN is enabled merely by specifying --credentials > - > > Key: MESOS-3024 > URL: https://issues.apache.org/jira/browse/MESOS-3024 > Project: Mesos > Issue Type: Bug > Components: master, security >Reporter: Adam B >Assignee: Marco Massenzio > Labels: authentication, http, mesosphere > > If I set `--credentials` on the master, framework and slave authentication > are allowed, but not required. On the other hand, http authentication is now > required for authenticated endpoints (currently only `/shutdown`). That means > that I cannot enable framework or slave authentication without also enabling > http endpoint authentication. This is undesirable. > Framework and slave authentication have separate flags (`\--authenticate` and > `\--authenticate_slaves`) to require authentication for each. It would be > great if there was also such a flag for framework authentication. Or maybe we > get rid of these flags altogether and rely on ACLs to determine which > unauthenticated principals are even allowed to authenticate for each > endpoint/action. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3834) slave upgrade framework checkpoint incompatibility
[ https://issues.apache.org/jira/browse/MESOS-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1477#comment-1477 ] James Peach commented on MESOS-3834: https://reviews.apache.org/r/40177/ [~vi...@twitter.com] or [~karya], could you shepherd this bug? > slave upgrade framework checkpoint incompatibility > --- > > Key: MESOS-3834 > URL: https://issues.apache.org/jira/browse/MESOS-3834 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.24.1 >Reporter: James Peach >Assignee: James Peach > > We are upgrading from 0.22 to 0.25 and experienced the following crash in the > 0.24 slave: > {code} > F1104 05:20:49.162701 1153 slave.cpp:4175] Check failed: > frameworkInfo.has_id() > *** Check failure stack trace: *** > @ 0x7fef9c294650 google::LogMessage::Fail() > @ 0x7fef9c29459f google::LogMessage::SendToLog() > @ 0x7fef9c293fb0 google::LogMessage::Flush() > @ 0x7fef9c296ce4 google::LogMessageFatal::~LogMessageFatal() > @ 0x7fef9b9a5492 mesos::internal::slave::Slave::recoverFramework() > @ 0x7fef9b9a3314 mesos::internal::slave::Slave::recover() > @ 0x7fef9b9d069c > _ZZN7process8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS4_5state5StateEES9_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSG_FSE_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESP_ > @ 0x7fef9ba039f4 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS8_5state5StateEESD_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSK_FSI_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > {code} > As near as I can tell, what happened was this: > - 0.22 wrote {{framework.info}} without the FrameworkID > - 0.23 had a compatibility check so it was ok with it > - 0.24 removed the compatibility check in MESOS-2259 > - the framework checkpoint doesn't get rewritten during recovery so when the > 0.24 slave starts it reads the 0.22 version > - 0.24 asserts -- This message was sent by Atlassian JIRA (v6.3.4#6332)
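One possible shape for restoring the compatibility check removed in MESOS-2259 is to tolerate an old checkpoint instead of CHECK-failing. The sketch below is not the actual slave code; the fallback to the checkpoint directory name (which encodes the FrameworkID on disk) is an assumption for illustration:

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for the checkpointed FrameworkInfo: a 0.22 slave
// may have written it without a FrameworkID.
struct FrameworkInfo {
  std::string id;  // Empty => pre-0.23 checkpoint layout.
  bool has_id() const { return !id.empty(); }
};

// Instead of `CHECK(frameworkInfo.has_id())` in recoverFramework(),
// a compatibility shim could fall back to an ID recorded elsewhere,
// e.g. the framework's checkpoint directory name.
std::string recoverFrameworkId(const FrameworkInfo& info,
                               const std::string& checkpointDirName) {
  if (info.has_id()) {
    return info.id;
  }
  // Tolerate old checkpoints rather than aborting the slave.
  return checkpointDirName;
}
```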
[jira] [Updated] (MESOS-3581) License headers show up all over doxygen documentation.
[ https://issues.apache.org/jira/browse/MESOS-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3581: Sprint: Mesosphere Sprint 22 > License headers show up all over doxygen documentation. > --- > > Key: MESOS-3581 > URL: https://issues.apache.org/jira/browse/MESOS-3581 > Project: Mesos > Issue Type: Documentation > Components: documentation >Affects Versions: 0.24.1 >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier >Priority: Minor > Labels: mesosphere > > Currently license headers are commented in something resembling Javadoc style, > {code} > /** > * Licensed ... > {code} > Since we use Javadoc-style comment blocks for doxygen documentation all > license headers appear in the generated documentation, potentially and likely > hiding the actual documentation. > Using {{/*}} to start the comment blocks would be enough to hide them from > doxygen, but would likely also result in a largish (though mostly > uninteresting) patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
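The fix described above is mechanical: only the comment opener changes. A small sketch of the two styles (license text elided, trailing function added only so the example compiles):

```cpp
/**
 * Licensed ...
 * (Javadoc-style opener: doxygen parses this block, so the license
 * text leaks into the generated documentation.)
 */

/*
 * Licensed ...
 * (Plain C-style opener: doxygen skips this block, keeping the
 * license header out of the docs.)
 */

/** Real documentation comments keep the Javadoc style. */
int answer() { return 42; }
```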
[jira] [Updated] (MESOS-3551) Replace use of strerror with thread-safe alternatives strerror_r / strerror_l.
[ https://issues.apache.org/jira/browse/MESOS-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3551: Sprint: Mesosphere Sprint 22 > Replace use of strerror with thread-safe alternatives strerror_r / strerror_l. > -- > > Key: MESOS-3551 > URL: https://issues.apache.org/jira/browse/MESOS-3551 > Project: Mesos > Issue Type: Bug > Components: libprocess, stout >Reporter: Benjamin Mahler >Assignee: Benjamin Bannier > Labels: mesosphere, newbie, tech-debt > > {{strerror()}} is not required to be thread safe by POSIX and is listed as > unsafe on Linux: > http://pubs.opengroup.org/onlinepubs/9699919799/ > http://man7.org/linux/man-pages/man3/strerror.3.html > I don't believe we've seen any issues reported due to this. We should replace > occurrences of strerror accordingly, possibly offering a wrapper in stout to > simplify callsites. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
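A stout-style wrapper could hide the annoying divergence between the XSI variant of {{strerror_r}} (returns {{int}}, fills the buffer) and the GNU variant (returns {{char*}}, may ignore the buffer). This is a sketch of one way to do it, not necessarily the wrapper stout ends up with; the function name {{os_strerror}} is assumed:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Thread-safe replacement for strerror(): copy the message into a
// caller-owned buffer via strerror_r.
std::string os_strerror(int errnum) {
  char buffer[256];
#if defined(__GLIBC__) && defined(_GNU_SOURCE)
  // GNU variant: returns a char* which may point into `buffer`
  // or to an internal immutable string.
  return strerror_r(errnum, buffer, sizeof(buffer));
#else
  // XSI variant: returns 0 on success and fills `buffer`.
  if (strerror_r(errnum, buffer, sizeof(buffer)) != 0) {
    return "Unknown error " + std::to_string(errnum);
  }
  return buffer;
#endif
}
```

Callsites then change from {{strerror(errno)}} to {{os_strerror(errno)}} with no other modifications.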
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998677#comment-14998677 ] Neil Conway commented on MESOS-3870: This should be accounted for by the fact that each process has a queue of input events that are consumed in-order (see the "events" deque in ProcessBase). i.e., although we can have many worker threads, a given process is only running in at most one thread at a time and each process' input events are consumed in the order in which they were delivered. > Prevent out-of-order libprocess message delivery > > > Key: MESOS-3870 > URL: https://issues.apache.org/jira/browse/MESOS-3870 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere > > I was under the impression that {{send()}} provided in-order, unreliable > message delivery. So if P1 sends <M1, M2> to P2, P2 might see <>, <M1>, <M2>, > or <M1, M2> — but not <M2, M1>. > I suspect much of the code makes a similar assumption. However, it appears > that this behavior is not guaranteed. slave.cpp:2217 has the following > comment: > {noformat} > // TODO(jieyu): Here we assume that CheckpointResourcesMessages are > // ordered (i.e., slave receives them in the same order master sends > // them). This should be true in most of the cases because TCP > // enforces in order delivery per connection. However, the ordering > // is technically not guaranteed because master creates multiple > // connections to the slave in some cases (e.g., persistent socket > // to slave breaks and master uses ephemeral socket). This could > // potentially be solved by using a version number and rejecting > // stale messages according to the version number. 
> {noformat} > We can improve this situation by _either_: (1) fixing libprocess to guarantee > ordered message delivery, e.g., by adding a sequence number, or (2) > clarifying that ordered message delivery is not guaranteed, and ideally > providing a tool to force messages to be delivered out-of-order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
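Option (1) from the ticket can be sketched as follows: tag each message with a per-sender sequence number and have the receiver drop anything at or below what it has already delivered. The types here ({{Message}}, {{OrderedReceiver}}) are illustrative, not actual libprocess types:

```cpp
#include <cstdint>

// A message tagged with a per-sender sequence number (starting at 1).
struct Message {
  uint64_t sequence;
  // ... payload ...
};

class OrderedReceiver {
public:
  // Returns true if the message should be delivered; stale
  // (reordered or duplicated) messages are rejected.
  bool accept(const Message& message) {
    if (message.sequence <= lastDelivered) {
      return false;  // Out of order or duplicate: drop.
    }
    lastDelivered = message.sequence;
    return true;
  }

private:
  uint64_t lastDelivered = 0;
};
```

Note this rejects late arrivals rather than reordering them; that matches the "rejecting stale messages according to the version number" idea in the quoted TODO, and is simpler than buffering for true in-order delivery.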
[jira] [Commented] (MESOS-2353) Improve performance of the master's state.json endpoint for large clusters.
[ https://issues.apache.org/jira/browse/MESOS-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998634#comment-14998634 ] Felix Bechstein commented on MESOS-2353: We patched the master to hold only 100 completed tasks per framework and 10 completed frameworks. It reduced the state size to ~2MB, but the master was still using all its CPU to generate it. We then blocked fetches of /master/state from our developers' browsers with iptables and the load was gone. > Improve performance of the master's state.json endpoint for large clusters. > --- > > Key: MESOS-2353 > URL: https://issues.apache.org/jira/browse/MESOS-2353 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Benjamin Mahler > Labels: newbie, scalability, twitter > > The master's state.json endpoint consistently takes a long time to compute > the JSON result, for large clusters: > {noformat} > $ time curl -s -o /dev/null localhost:5050/master/state.json > Mon Jan 26 22:38:50 UTC 2015 > real 0m13.174s > user 0m0.003s > sys 0m0.022s > {noformat} > This can cause the master to get backlogged if there are many state.json > requests in flight. > Looking at {{perf}} data, it seems most of the time is spent doing memory > allocation / de-allocation. This ticket will try to capture any low hanging > fruit to speed this up. Possibly we can leverage moves if they are not > already being used by the compiler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
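Beyond moves, one low-hanging mitigation for the "many state.json requests in flight" case is to memoize the serialized response for a short TTL, so a burst of requests pays the rendering cost once. This is an illustrative sketch, not the master's actual code:

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>

// Caches the rendered state JSON for `ttl`; concurrent requests
// within the window reuse one serialization.
class CachedState {
public:
  CachedState(std::function<std::string()> render,
              std::chrono::milliseconds ttl)
    : render_(std::move(render)), ttl_(ttl) {}

  const std::string& get(std::chrono::steady_clock::time_point now) {
    // Re-render on first use or once the cached copy expires.
    // (An empty render result would re-render; fine for a sketch.)
    if (cached_.empty() || now - renderedAt_ > ttl_) {
      cached_ = render_();
      renderedAt_ = now;
    }
    return cached_;
  }

private:
  std::function<std::string()> render_;
  std::chrono::milliseconds ttl_;
  std::string cached_;
  std::chrono::steady_clock::time_point renderedAt_;
};
```

The trade-off is staleness bounded by the TTL, which is usually acceptable for a monitoring endpoint polled by browsers.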
[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes
[ https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3839: Story Points: 2 > Update documentation for FetcherCache mtime-related changes > --- > > Key: MESOS-3839 > URL: https://issues.apache.org/jira/browse/MESOS-3839 > Project: Mesos > Issue Type: Documentation > Components: fetcher, slave >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes
[ https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3839: Sprint: Mesosphere Sprint 23 > Update documentation for FetcherCache mtime-related changes > --- > > Key: MESOS-3839 > URL: https://issues.apache.org/jira/browse/MESOS-3839 > Project: Mesos > Issue Type: Documentation > Components: fetcher, slave >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3856) Add mtime-related fetcher tests
[ https://issues.apache.org/jira/browse/MESOS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3856: Story Points: 2 > Add mtime-related fetcher tests > --- > > Key: MESOS-3856 > URL: https://issues.apache.org/jira/browse/MESOS-3856 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes
[ https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3839: Story Points: 1 (was: 2) > Update documentation for FetcherCache mtime-related changes > --- > > Key: MESOS-3839 > URL: https://issues.apache.org/jira/browse/MESOS-3839 > Project: Mesos > Issue Type: Documentation > Components: fetcher, slave >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes
[ https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-3839: -- Sprint: Mesosphere Sprint 22 (was: Mesosphere Sprint 23) > Update documentation for FetcherCache mtime-related changes > --- > > Key: MESOS-3839 > URL: https://issues.apache.org/jira/browse/MESOS-3839 > Project: Mesos > Issue Type: Documentation > Components: fetcher, slave >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2376) Allow libprocess ip and port to be configured
[ https://issues.apache.org/jira/browse/MESOS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998645#comment-14998645 ] Dimitri commented on MESOS-2376: [~hfaran] What do you mean? I am trying to set the Mesos binding IP to something more secure than 0.0.0.0. I am running Mesos inside a Docker container; should I bind in the Docker environment or on the host? I have tried both, and neither worked. I haven't tried setting LIBPROCESS_PORT since I am just trying to change the interface. I am quite surprised by the lack of documentation for this feature, which to me makes Mesos unusable in production. > Allow libprocess ip and port to be configured > - > > Key: MESOS-2376 > URL: https://issues.apache.org/jira/browse/MESOS-2376 > Project: Mesos > Issue Type: Improvement > Components: java api >Reporter: Dario Rexin >Priority: Minor > > Currently if we want to configure the ip libprocess uses for communication, > we have to set the env var LIBPROCESS_IP, or LIBPROCESS_PORT for the port. > For the Java API this means that the variable has to be set before the JVM > is started, because setting env vars from within Java is not possible / > non-trivial. Therefore it would be great to be able to pass them in to the > constructor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
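The ordering constraint in the ticket is the crux: libprocess reads LIBPROCESS_IP / LIBPROCESS_PORT from the environment at initialization. In C++ that is workable with {{setenv()}} before constructing the driver; the JVM cannot do this after startup, which is why the ticket asks for constructor parameters instead. A minimal sketch of the C++ workaround (the helper name is assumed):

```cpp
#include <cstdlib>
#include <string>

// Must run before libprocess initializes, i.e. before the
// MesosSchedulerDriver (or any libprocess user) is constructed.
void configureLibprocess(const std::string& ip, const std::string& port) {
  setenv("LIBPROCESS_IP", ip.c_str(), 1 /* overwrite */);
  setenv("LIBPROCESS_PORT", port.c_str(), 1 /* overwrite */);
  // ... construct the driver here, after the environment is set ...
}
```

For the Docker case discussed above, the IP must be one the container can actually bind, so binding inside the container's network namespace (or using host networking) is what this variable controls.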
[jira] [Updated] (MESOS-3856) Add mtime-related fetcher tests
[ https://issues.apache.org/jira/browse/MESOS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-3856: Sprint: Mesosphere Sprint 22 > Add mtime-related fetcher tests > --- > > Key: MESOS-3856 > URL: https://issues.apache.org/jira/browse/MESOS-3856 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3220) Offer ability to kill tasks from the API
[ https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999552#comment-14999552 ] Marco Massenzio commented on MESOS-3220: To revive this thread - a couple of clarifying points: 1. Maintenance This is meant to augment the Maintenance Primitives (MESOS-1474) and certainly *not* to replace it. In particular, this endpoint (which ought to be scriptable, for automated maintenance scripts) would enable operators to kill "recalcitrant" frameworks/tasks which, for whatever reason, do not follow the Inverse Offer mechanism; 2. Repairs There may be situations in which the task itself gets in a funky state and needs to be killed, without Mesos necessarily noticing it (i.e., we cannot rely on the {{TASK_LOST}}/{{TASK_KILLED}} conditions). Once that happens, however, the Framework will be notified (via the usual Mesos mechanisms) and can thus decide whether to re-schedule the task (possibly, somewhere else). 3. Remote termination Using tools such as the {{DCOS CLI}} we want to enable users to reach out to the Mesos Master directly (possibly bypassing the framework) and terminate a task, without requiring every framework developer to re-implement the same API (so, this would be a "common service" that Mesos offers to framework developers, that they wouldn't have to worry about). 4. Security There is obviously the expedient (if somewhat draconian) "firewalling" ability, to prevent outright access to this endpoint. At a finer-grained level, we would consider using ACLs (probably in line with what is currently being done for the Maintenance Primitives) to authorize access to this functionality. 
> Offer ability to kill tasks from the API > > > Key: MESOS-3220 > URL: https://issues.apache.org/jira/browse/MESOS-3220 > Project: Mesos > Issue Type: Improvement > Components: python api >Reporter: Sunil Shah >Assignee: Marco Massenzio >Priority: Blocker > Labels: mesosphere > > We are investigating adding a `dcos task kill` command to our DCOS (and > Mesos) command line interface. Currently the ability to kill tasks is only > offered via the scheduler API so it would be useful to have some ability to > kill tasks directly. > This is a blocker for the DCOS CLI! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3157) only perform batch resource allocations
[ https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1440#comment-1440 ] James Peach commented on MESOS-3157: No, I hope to get back to it soon though. > only perform batch resource allocations > --- > > Key: MESOS-3157 > URL: https://issues.apache.org/jira/browse/MESOS-3157 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: James Peach >Assignee: James Peach > > Our deployment environments have a lot of churn, with many short-lived > frameworks that often revive offers. Running the allocator takes a long time > (from seconds up to minutes). > In this situation, event-triggered allocation causes the event queue in the > allocator process to get very long, and the allocator effectively becomes > unresponsive (e.g., a revive-offers message takes too long to come to the head > of the queue). > We have been running a patch to remove all the event-triggered allocations > and only allocate from the batch task > {{HierarchicalAllocatorProcess::batch}}. This works great and really improves > responsiveness. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
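The batch-only approach described in the ticket amounts to coalescing: allocation-triggering events merely mark the allocator state dirty, and a periodic batch pass performs one allocation covering all of them. An illustrative stand-in, not the real {{HierarchicalAllocatorProcess}}:

```cpp
// Coalesces many allocation triggers into one batch pass.
class BatchAllocator {
public:
  // Called on framework churn, offer revives, agent changes, etc.
  // Cheap: no allocation happens here.
  void trigger() { dirty = true; }

  // Called from the periodic batch timer. Returns true if an
  // allocation pass actually ran.
  bool runBatch() {
    if (!dirty) {
      return false;  // Nothing changed; skip the expensive pass.
    }
    dirty = false;
    ++allocations;   // One pass covers all pending triggers.
    return true;
  }

  int allocations = 0;

private:
  bool dirty = false;
};
```

This bounds allocator work per batch interval regardless of event rate, which is why the event queue stops growing and revive messages stay responsive.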
[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor
[ https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998541#comment-14998541 ] Bernd Mathiske commented on MESOS-3851: --- Interesting. Thanks for spotting this! > Investigate recent crashes in Command Executor > -- > > Key: MESOS-3851 > URL: https://issues.apache.org/jira/browse/MESOS-3851 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to > support rootfs. There seem to be some tests showing frequent crashes due to > assert violations. > {{FetcherCacheTest.SimpleEviction}} failed due to the following log: > {code} > I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to > executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at > executor(1)@172.17.5.200:33871' > I1107 19:36:46.363682 1236 exec.cpp:297] > I1107 19:36:46.373569 1245 exec.cpp:210] Executor registered on slave > 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0 > @ 0x7f9f5a7db3fa google::LogMessage::Fail() > I1107 19:36:46.394081 1245 exec.cpp:222] Executor::registered took 395411ns > @ 0x7f9f5a7db359 google::LogMessage::SendToLog() > @ 0x7f9f5a7dad6a google::LogMessage::Flush() > @ 0x7f9f5a7dda9e google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @ 0x49c99d > mesos::internal::CommandExecutorProcess::launchTask() > @ 0x4b3dd7 > _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_ > @ 0x4c470c > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f9f5a761b1b std::function<>::operator()() > @ 0x7f9f5a749935 
process::ProcessBase::visit() > @ 0x7f9f5a74d700 process::DispatchEvent::visit() > @ 0x48e004 process::ProcessBase::serve() > @ 0x7f9f5a745d21 process::ProcessManager::resume() > @ 0x7f9f5a742f52 > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_ > @ 0x7f9f5a74cf2c > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE > @ 0x7f9f5a74cedc > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_ > @ 0x7f9f5a74ce6e > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE > @ 0x7f9f5a74cdc5 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv > @ 0x7f9f5a74cd5e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7f9f5624f1e0 (unknown) > @ 0x7f9f564a8df5 start_thread > @ 0x7f9f559b71ad __clone > I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container > '6553a617-6b4a-418d-9759-5681f45ff854' has exited > I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container > '6553a617-6b4a-418d-9759-5681f45ff854' > I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container > 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited > {code} > The reason seems to be a race between the executor receiving a > {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the > {{CHECK_SOME(executorInfo)}} failure. 
> Link to complete log: > https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535 > Another related failure from {{ExamplesTest.PersistentVolumeFramework}} > {code} > @ 0x7f4f71529cbd google::LogMessage::SendToLog() > I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager > successfully handled status update acknowledgement (UUID: > 721c7316-5580-4636-a83a-098e3bd4ed1f) for task > ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework > ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f- > @ 0x7f4f715296ce google::LogMessage::Flush() > @ 0x7f4f7152c402 google::LogMessageFatal::~LogMessageFatal() > @ 0x48d00a _CheckFatal::~_CheckFatal() > @
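One defensive fix for the {{RunTaskMessage}}-before-{{ExecutorRegisteredMessage}} race described above is to buffer early tasks and launch them after registration completes, instead of CHECK-failing on the unset {{executorInfo}}. The types below are illustrative stand-ins, not the actual CommandExecutor:

```cpp
#include <queue>
#include <string>

class Executor {
public:
  // Handles ExecutorRegisteredMessage: record the ID, then drain
  // any tasks that raced ahead of registration.
  void registered(const std::string& executorId) {
    registeredId = executorId;
    while (!pending.empty()) {
      launch(pending.front());
      pending.pop();
    }
  }

  // Handles RunTaskMessage.
  void runTask(const std::string& task) {
    if (registeredId.empty()) {
      pending.push(task);  // Too early: defer instead of crashing.
      return;
    }
    launch(task);
  }

  int launched = 0;

private:
  void launch(const std::string&) { ++launched; }

  std::string registeredId;
  std::queue<std::string> pending;
};
```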
[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery
[ https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998605#comment-14998605 ] haosdent commented on MESOS-3870: - I think that although sends can be ordered when there is only one connection, execution can still happen out-of-order in the receiver. ProcessManager creates a pool of worker threads (8 up to the number of CPUs) to handle input messages. Because ProcessManager dispatches work per event, the same Process could be called on different threads for different events. > Prevent out-of-order libprocess message delivery > > > Key: MESOS-3870 > URL: https://issues.apache.org/jira/browse/MESOS-3870 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere > > I was under the impression that {{send()}} provided in-order, unreliable > message delivery. So if P1 sends <M1, M2> to P2, P2 might see <>, <M1>, <M2>, > or <M1, M2> — but not <M2, M1>. > I suspect much of the code makes a similar assumption. However, it appears > that this behavior is not guaranteed. slave.cpp:2217 has the following > comment: > {noformat} > // TODO(jieyu): Here we assume that CheckpointResourcesMessages are > // ordered (i.e., slave receives them in the same order master sends > // them). This should be true in most of the cases because TCP > // enforces in order delivery per connection. However, the ordering > // is technically not guaranteed because master creates multiple > // connections to the slave in some cases (e.g., persistent socket > // to slave breaks and master uses ephemeral socket). This could > // potentially be solved by using a version number and rejecting > // stale messages according to the version number. > {noformat} > We can improve this situation by _either_: (1) fixing libprocess to guarantee > ordered message delivery, e.g., by adding a sequence number, or (2) > clarifying that ordered message delivery is not guaranteed, and ideally > providing a tool to force messages to be delivered out-of-order. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3157) only perform batch resource allocations
[ https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999879#comment-14999879 ] Klaus Ma commented on MESOS-3157: - [~jpe...@apache.org], any update on this? > only perform batch resource allocations > --- > > Key: MESOS-3157 > URL: https://issues.apache.org/jira/browse/MESOS-3157 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: James Peach >Assignee: James Peach > > Our deployment environments have a lot of churn, with many short-lived > frameworks that often revive offers. Running the allocator takes a long time > (from seconds up to minutes). > In this situation, event-triggered allocation causes the event queue in the > allocator process to get very long, and the allocator effectively becomes > unresponsive (e.g., a revive-offers message takes too long to come to the head > of the queue). > We have been running a patch to remove all the event-triggered allocations > and only allocate from the batch task > {{HierarchicalAllocatorProcess::batch}}. This works great and really improves > responsiveness. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3826) Add an optional unique identifier for resource reservations
[ https://issues.apache.org/jira/browse/MESOS-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999881#comment-14999881 ] Klaus Ma commented on MESOS-3826: - [~sargun], does that address your concern? > Add an optional unique identifier for resource reservations > --- > > Key: MESOS-3826 > URL: https://issues.apache.org/jira/browse/MESOS-3826 > Project: Mesos > Issue Type: Improvement > Components: general >Reporter: Sargun Dhillon >Assignee: Guangya Liu >Priority: Minor > Labels: mesosphere > > Thanks to the resource reservation primitives, frameworks can reserve > resources. These reservations are per role, which means multiple frameworks > can share reservations. This can get very hairy, as multiple reservations can > occur on each agent. > It would be nice to be able to optionally, uniquely identify reservations by > ID, much like persistent volumes are today. This could be done by adding a > new protobuf field, such as Resource.ReservationInfo.id, that if set upon > reservation time, would come back when the reservation is advertised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)