[jira] [Updated] (MESOS-3165) Persist and recover quota to/from Registry

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3165:
---
Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Persist and recover quota to/from Registry
> --
>
> Key: MESOS-3165
> URL: https://issues.apache.org/jira/browse/MESOS-3165
> Project: Mesos
>  Issue Type: Task
>  Components: master, replicated log
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> To persist quotas across failovers, the Master should save them in the 
> registry. To support this, we shall:
> * Introduce a Quota state variable in registry.proto;
> * Extend the Operation interface so that it supports a ‘Quota’ accumulator 
> (see src/master/registrar.hpp);
> * Introduce AddQuota / RemoveQuota operations;
> * Recover quotas from the registry on failover to the Master’s 
> internal::master::Role struct;
> * Extend RegistrarTest with quota-specific tests.
> NOTE: The Registry variable can be rather big for production clusters (see 
> MESOS-2075). While it should be fine for an MVP to add quota information to 
> the registry, we should consider storing Quota separately, as it does not 
> need to be in sync with agent updates. However, adding more variables is 
> currently not supported by the registrar.
> While the Agents are reregistering (note they may fail to do so), the 
> information about what part of the quota is allocated is only partially 
> available to the Master. In other words, the state of the quota allocation is 
> reconstructed as Agents reregister. During this period, some roles may be 
> under quota from the perspective of the newly elected Master.
> The same problem exists on the allocator side: it may think the cluster is 
> under quota and may eagerly try to satisfy quotas before enough Agents 
> reregister, which may result in resources being allocated to frameworks 
> beyond their quota. To address this issue and also to avoid panicking and 
> generating under quota alerts, the Master should give a certain amount of 
> time for the majority (e.g. 80%) of the Agents to reregister before reporting 
> any quota status and notifying the allocator about granted quotas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1791) Introduce Master / Offer Resource Reservations aka Quota

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1791:
---
Shepherd: Joris Van Remoortere

> Introduce Master / Offer Resource Reservations aka Quota
> 
>
> Key: MESOS-1791
> URL: https://issues.apache.org/jira/browse/MESOS-1791
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, master, replicated log
>Reporter: Tom Arnfeld
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Currently Mesos supports the ability to reserve resources (for a given role) 
> on a per-slave basis, as introduced in MESOS-505. This allows you to almost 
> statically partition off a set of resources on a set of machines, to 
> guarantee certain types of frameworks get some resources.
> This is very useful, though it would also be valuable to control these 
> reservations through the master (instead of per-slave), for cases where I 
> don't care which nodes I get, as long as I get X CPU and Y RAM, or Z sets 
> of (X, Y).
> I'm not sure what structure this could take, but apparently it has already 
> been discussed. Would this be a CLI flag? Could there be a (authenticated) 
> web interface to control these reservations?





[jira] [Updated] (MESOS-3717) Master recovery in presence of quota

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3717:
---
Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Master recovery in presence of quota
> 
>
> Key: MESOS-3717
> URL: https://issues.apache.org/jira/browse/MESOS-3717
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Quota complicates master failover in several ways. The new master should 
> determine whether it is possible to satisfy the total quota and notify an 
> operator in case it is not (imagine simultaneous failovers of multiple 
> agents). The new master should also hint to the allocator how many agents 
> might reconnect in the future, to help it decide how to satisfy quota 
> before the majority of agents reconnect.
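The "give agents time to reregister" heuristic described above (see also MESOS-3165) can be sketched as a simple predicate; the 80% default threshold and all names here are illustrative assumptions, not Mesos API:

```cpp
#include <cassert>

// Illustrative predicate: the newly elected master defers quota status
// reports and allocator notifications until enough of the agents known to
// the registry have reregistered.
inline bool quotaStatusReliable(
    int reregisteredAgents,
    int agentsInRegistry,
    double threshold = 0.8)
{
  if (agentsInRegistry == 0) {
    return true;  // Empty cluster: nothing to wait for.
  }
  return static_cast<double>(reregisteredAgents) / agentsInRegistry >=
         threshold;
}
```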





[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-11-10 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998699#comment-14998699
 ] 

Till Toenshoff commented on MESOS-3851:
---

I will be committing the workaround patch Tim has provided 
https://reviews.apache.org/r/40107/  (thanks a bunch [~tnachen]!) shortly after 
running a final check on it.

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> After https://reviews.apache.org/r/38900, i.e. updating the CommandExecutor 
> to support rootfs, some tests show frequent crashes due to assertion 
> violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
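A hedged model of the race described above (illustrative names only; this is not the actual CommandExecutor code, nor necessarily the fix in r/40107): an executor that buffers tasks arriving before registration instead of CHECK-failing on not-yet-initialized state.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative model of the race: a RunTaskMessage that arrives before
// ExecutorRegisteredMessage is buffered instead of tripping a CHECK.
class ExecutorModel {
 public:
  void registered() {
    registered_ = true;
    // Flush tasks that raced ahead of the registration message.
    for (const std::string& task : pending_) {
      launched_.push_back(task);
    }
    pending_.clear();
  }

  void runTask(const std::string& task) {
    if (!registered_) {
      pending_.push_back(task);  // Defer: registration has not arrived yet.
      return;
    }
    launched_.push_back(task);
  }

  const std::vector<std::string>& launched() const { return launched_; }

 private:
  bool registered_ = false;
  std::vector<std::string> pending_;
  std::vector<std::string> launched_;
};
```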
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
> @ 0x7f4f71529cbd  google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager 
> successfully handled status update acknowledgement (UUID: 
> 721c7316-5580-4636-a83a-098e3bd4ed1f) for task 
> ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework 
> ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f-
> @ 0x7f4f715296ce  google::LogMessage::Flush()
> @   

[jira] [Updated] (MESOS-3581) License headers show up all over doxygen documentation.

2015-11-10 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-3581:
--
Target Version/s:   (was: 0.26.0)

> License headers show up all over doxygen documentation.
> ---
>
> Key: MESOS-3581
> URL: https://issues.apache.org/jira/browse/MESOS-3581
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 0.24.1
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Minor
>  Labels: mesosphere
>
> Currently license headers are commented in something resembling Javadoc style,
> {code}
> /**
> * Licensed ...
> {code}
> Since we use Javadoc-style comment blocks for doxygen documentation, all 
> license headers appear in the generated documentation, likely obscuring the 
> actual documentation.
> Using {{/*}} to start the comment blocks would be enough to hide them from 
> doxygen, but would likely also result in a largish (though mostly 
> uninteresting) patch.
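For illustration (the function name and abbreviated header text are hypothetical), the fix is only in the opening token of the comment block:

```c
/* A single-star comment block is ignored by doxygen, so a license header
 * written this way stays out of the generated documentation:
 *
 * Licensed to the Apache Software Foundation (ASF) ...
 */

/**
 * A Javadoc-style block is picked up by doxygen and attached to the entity
 * that follows it, which is why license headers written this way pollute
 * the generated docs.
 */
int documented_function(void)
{
  return 42;
}
```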





[jira] [Issue Comment Deleted] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-3870:
---
Comment: was deleted

(was: You mean "volatile"? The variable is read and written inside a 
"synchronized" block, which will do the necessary synchronization (memory 
barriers) to ensure that other CPUs see the appropriate values (provided they 
also use synchronized blocks when examining the variable).

There are a few places that read "ProcessBase.state" without holding the mutex 
(e.g., ProcessManager::resume()) -- that is probably unsafe and should be fixed.

(Note that "volatile" is not sufficient/appropriate for ensuring reasonable 
semantics for concurrent access to shared state without mutual exclusion, 
anyway...))

> Prevent out-of-order libprocess message delivery
> 
>
> Key: MESOS-3870
> URL: https://issues.apache.org/jira/browse/MESOS-3870
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Priority: Minor
>  Labels: mesosphere
>
> I was under the impression that {{send()}} provided in-order, unreliable 
> message delivery. So if P1 sends {{<M1, M2>}} to P2, P2 might see {{<>}}, 
> {{<M1>}}, {{<M2>}}, or {{<M1, M2>}} — but not {{<M2, M1>}}.
> I suspect much of the code makes a similar assumption. However, it appears 
> that this behavior is not guaranteed. slave.cpp:2217 has the following 
> comment:
> {noformat}
>   // TODO(jieyu): Here we assume that CheckpointResourcesMessages are
>   // ordered (i.e., slave receives them in the same order master sends
>   // them). This should be true in most of the cases because TCP
>   // enforces in order delivery per connection. However, the ordering
>   // is technically not guaranteed because master creates multiple
>   // connections to the slave in some cases (e.g., persistent socket
>   // to slave breaks and master uses ephemeral socket). This could
>   // potentially be solved by using a version number and rejecting
>   // stale messages according to the version number.
> {noformat}
> We can improve this situation by _either_: (1) fixing libprocess to guarantee 
> ordered message delivery, e.g., by adding a sequence number, or (2) 
> clarifying that ordered message delivery is not guaranteed, and ideally 
> providing a tool to force messages to be delivered out-of-order.
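Option (1) can be sketched as a per-peer sequence-number filter (illustrative only, not libprocess code):

```cpp
#include <cassert>
#include <cstdint>

// The receiver tracks the highest sequence number delivered from a peer
// and drops anything at or below it, so a message reordered across
// reconnects (e.g. a stale CheckpointResourcesMessage) is rejected.
class OrderedReceiver {
 public:
  // Returns true if the message should be delivered, false if it is stale.
  bool accept(uint64_t sequence) {
    if (deliveredAny_ && sequence <= last_) {
      return false;  // At or below something already delivered: drop.
    }
    last_ = sequence;
    deliveredAny_ = true;
    return true;
  }

 private:
  uint64_t last_ = 0;
  bool deliveredAny_ = false;
};
```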



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998837#comment-14998837
 ] 

haosdent commented on MESOS-3870:
-

Ohoh, got it. Thank you for the explanation.






[jira] [Comment Edited] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998700#comment-14998700
 ] 

haosdent edited comment on MESOS-3870 at 11/10/15 3:09 PM:
---

Suppose a Process is enqueued to the runq twice when it receives two events 
(or is that impossible? I could not find any code preventing it from being 
enqueued multiple times).

The Process is then dequeued in different worker threads while not yet 
running: in worker thread 1, the Process pops event A but has not started 
running; in worker thread 2, the Process pops event B and starts running.

Is this scenario possible?


was (Author: haosd...@gmail.com):
Suppose a Process enqueue to runq twice(Or impossible, seems I could not find 
any code avoid it enqueue multi times) when it receive two events.

And the dequeue in different work threads, and not yet running. In work thread 
1, Process dequeue event A and not yet running. In work thread 2, Process 
dequeue event B and start running.

Is this scenario possible?






[jira] [Created] (MESOS-3873) Enhance allocator interface with the recovery() method

2015-11-10 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-3873:
--

 Summary: Enhance allocator interface with the recovery() method
 Key: MESOS-3873
 URL: https://issues.apache.org/jira/browse/MESOS-3873
 Project: Mesos
  Issue Type: Task
  Components: allocation
Reporter: Alexander Rukletsov
Assignee: Alexander Rukletsov


There are some scenarios (e.g. when quota is set for some roles) in which it 
makes sense to notify the allocator about master recovery. Introduce a method 
in the allocator interface that allows for this.
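A hedged sketch of what such a hook might look like (all names are illustrative assumptions, not the actual allocator interface):

```cpp
#include <cassert>
#include <map>
#include <string>

// A recovery hook lets a failed-over master hand the allocator the quotas
// recovered from the registry, plus a hint about how many agents are
// expected to reregister, before normal allocation resumes.
class AllocatorSketch {
 public:
  void recover(
      int expectedAgentCount,
      const std::map<std::string, double>& quotas)  // role -> CPUs
  {
    expectedAgents_ = expectedAgentCount;
    quotas_ = quotas;
    recovering_ = !quotas_.empty();  // Only pause allocation if quota is set.
  }

  bool recovering() const { return recovering_; }

 private:
  int expectedAgents_ = 0;
  std::map<std::string, double> quotas_;
  bool recovering_ = false;
};
```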





[jira] [Updated] (MESOS-3862) Authorize quota requests

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3862:
---
Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Authorize quota requests
> 
>
> Key: MESOS-3862
> URL: https://issues.apache.org/jira/browse/MESOS-3862
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: acl, mesosphere, security
>
> Quota requests should be authorized against the roles they are made for.
> This ticket adds authorization of quota requests via ACLs. The existing 
> authorization support implemented in MESOS-1342 will be extended with a 
> {{request_quotas}} ACL.





[jira] [Updated] (MESOS-3720) Tests for Quota support in master

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3720:
---
Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Tests for Quota support in master
> -
>
> Key: MESOS-3720
> URL: https://issues.apache.org/jira/browse/MESOS-3720
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Allocator-agnostic tests for quota support in the master. They can be divided 
> into several groups:
> * Request validation;
> * Satisfiability validation;
> * Master failover;
> * Persisting in the registry;
> * Functionality and quota guarantees.





[jira] [Updated] (MESOS-3802) Clear the suppressed flag when deactive a framework

2015-11-10 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-3802:
--
Target Version/s:   (was: 0.26.0)

> Clear the suppressed flag when deactive a framework
> ---
>
> Key: MESOS-3802
> URL: https://issues.apache.org/jira/browse/MESOS-3802
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.26.0
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> When a framework is deactivated, the suppressed flag is not cleared. As a 
> result, the framework cannot get resources immediately after it is 
> reactivated; we should clear this flag when deactivating the framework.





[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998737#comment-14998737
 ] 

Neil Conway commented on MESOS-3870:


I don't see how: the routine acquires ProcessBase.mutex before examining 
ProcessBase.state.






[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998805#comment-14998805
 ] 

Neil Conway commented on MESOS-3870:


You mean "volatile"? The variable is read and written inside a "synchronized" 
block, which will do the necessary synchronization (memory barriers) to ensure 
that other CPUs see the appropriate values (provided they also use synchronized 
blocks when examining the variable).

There are a few places that read "ProcessBase.state" without holding the mutex 
(e.g., ProcessManager::resume()) -- that is probably unsafe and should be fixed.

(Note that "volatile" is not sufficient/appropriate for ensuring reasonable 
semantics for concurrent access to shared state without mutual exclusion, 
anyway...)






[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998884#comment-14998884
 ] 

haosdent commented on MESOS-3870:
-

I think there is another case that could make the same Process run in 
different threads. Suppose ProcessBase pops event A in thread 1 and changes 
ProcessBase.state to BLOCKED in 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2463
 , without having consumed event B yet. Then event B arrives and enqueues the 
same Process onto ProcessManager.runq in 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L3008
 . ProcessManager then dequeues it in thread 2, pops event B, and runs event B 
while thread 1 has not yet run event A's consume function. Is this possible?






[jira] [Commented] (MESOS-3065) Add authorization for persistent volume

2015-11-10 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998770#comment-14998770
 ] 

Greg Mann commented on MESOS-3065:
--

MESOS-3065 should implement authorization for the Create/Destroy HTTP 
endpoints, which are being added in MESOS-2455.

> Add authorization for persistent volume
> ---
>
> Key: MESOS-3065
> URL: https://issues.apache.org/jira/browse/MESOS-3065
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>Assignee: Greg Mann
>  Labels: mesosphere, persistent-volumes
>
> Persistent volume should be authorized with the {{principal}} of the 
> reserving entity (framework or master). The idea is to introduce {{Create}} 
> and {{Destroy}} into the ACL.
> {code}
>   message Create {
> // Subjects.
> required Entity principals = 1;
> // Objects? Perhaps the kind of volume? allowed permissions?
>   }
>   message Destroy {
> // Subjects.
> required Entity principals = 1;
> // Objects.
> required Entity creator_principals = 2;
>   }
> {code}
> When a framework/operator creates a persistent volume, "create" ACLs are 
> checked to see if the framework (FrameworkInfo.principal) or the operator 
> (Credential.user) is authorized to create persistent volumes. If not 
> authorized, the create operation is rejected.
> When a framework/operator destroys a persistent volume, "destroy" ACLs are 
> checked to see if the framework (FrameworkInfo.principal) or the operator 
> (Credential.user) is authorized to destroy the persistent volume created by a 
> framework or operator (Resource.DiskInfo.principal). If not authorized, the 
> destroy operation is rejected.





[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998806#comment-14998806
 ] 

Neil Conway commented on MESOS-3870:


You mean "volatile"? The variable is read and written inside a "synchronized" 
block, which will do the necessary synchronization (memory barriers) to ensure 
that other CPUs see the appropriate values (provided they also use synchronized 
blocks when examining the variable).

There are a few places that read "ProcessBase.state" without holding the mutex 
(e.g., ProcessManager::resume()) -- that is probably unsafe and should be fixed.

(Note that "volatile" is not sufficient/appropriate for ensuring reasonable 
semantics for concurrent access to shared state without mutual exclusion, 
anyway...)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998700#comment-14998700
 ] 

haosdent commented on MESOS-3870:
-

Suppose a Process is enqueued onto the runq twice when it receives two events 
(or is that impossible? I could not find any code that prevents it from being 
enqueued multiple times).

Then it is dequeued on different worker threads: worker thread 1 dequeues the 
Process for event A but has not yet started running it, while worker thread 2 
dequeues it for event B and starts running.

Is this scenario possible?



[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998757#comment-14998757
 ] 

haosdent commented on MESOS-3870:
-

Yes, but ProcessBase.state is not declared volatile. I am not sure whether a 
reader could still see a stale value while it is changed in another thread.



[jira] [Updated] (MESOS-3717) Master recovery in presence of quota

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3717:
---
Issue Type: Task  (was: Bug)

> Master recovery in presence of quota
> 
>
> Key: MESOS-3717
> URL: https://issues.apache.org/jira/browse/MESOS-3717
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Quota complicates master failover in several ways. The new master should 
> determine if it is possible to satisfy the total quota and notify an operator 
> in case it's not (imagine simultaneous failovers of multiple agents). The new 
> master should hint the allocator how many agents might reconnect in the 
> future to help it decide how to satisfy quota before the majority of agents 
> reconnect.





[jira] [Created] (MESOS-3874) Implement recovery in the Hierarchical allocator

2015-11-10 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-3874:
--

 Summary: Implement recovery in the Hierarchical allocator
 Key: MESOS-3874
 URL: https://issues.apache.org/jira/browse/MESOS-3874
 Project: Mesos
  Issue Type: Task
  Components: allocation
Reporter: Alexander Rukletsov
Assignee: Alexander Rukletsov


The built-in Hierarchical allocator should implement recovery (in the 
presence of quota).





[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998686#comment-14998686
 ] 

haosdent commented on MESOS-3870:
-

I think ProcessManager could dequeue the same Process on different worker threads?
{noformat}
ProcessBase* ProcessManager::dequeue()
{
  // TODO(benh): Remove a process from this thread's runq. If there
  // are no processes to run, and this is not a dedicated thread, then
  // steal one from another threads runq.

  ProcessBase* process = NULL;

  synchronized (runq_mutex) {
    if (!runq.empty()) {
      process = runq.front();
      runq.pop_front();
      // Increment the running count of processes in order to support
      // the Clock::settle() operation (this must be done atomically
      // with removing the process from the runq).
      running.fetch_add(1);
    }
  }

  return process;
}
{noformat}



[jira] [Updated] (MESOS-3199) Validate Quota Requests.

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3199:
---
Shepherd: Joris Van Remoortere  (was: Bernd Mathiske)

> Validate Quota Requests.
> 
>
> Key: MESOS-3199
> URL: https://issues.apache.org/jira/browse/MESOS-3199
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: mesosphere
>
> We need to validate quota requests in terms of syntactical and semantical 
> correctness.





[jira] [Updated] (MESOS-3073) Introduce HTTP endpoints for Quota

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3073:
---
Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Introduce HTTP endpoints for Quota
> --
>
> Key: MESOS-3073
> URL: https://issues.apache.org/jira/browse/MESOS-3073
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: mesosphere
>
> We need to implement the HTTP endpoints for Quota as outlined in the Design 
> Doc: 
> (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I).





[jira] [Updated] (MESOS-3763) Need for http::put request method

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3763:
---
Shepherd: Joris Van Remoortere  (was: Bernd Mathiske)

> Need for http::put request method
> -
>
> Key: MESOS-3763
> URL: https://issues.apache.org/jira/browse/MESOS-3763
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Minor
>  Labels: mesosphere
>
> We decided to create a more RESTful API for managing quota requests.
> Therefore we also want to use the HTTP PUT method, and hence need to enable 
> libprocess/http to send PUT requests besides GET and POST requests.





[jira] [Updated] (MESOS-3718) Implement Quota support in allocator

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3718:
---
Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Implement Quota support in allocator
> 
>
> Key: MESOS-3718
> URL: https://issues.apache.org/jira/browse/MESOS-3718
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> The built-in Hierarchical DRF allocator should support Quota. This includes 
> (but is not limited to): adding, updating, removing, and satisfying quota; 
> avoiding both overcommitting resources and handing them to non-quota'ed roles 
> in the presence of master failover.
> A [design doc for Quota support in 
> Allocator|https://issues.apache.org/jira/browse/MESOS-2937] provides an 
> overview of a feature set required to be implemented.





[jira] [Updated] (MESOS-3418) Factor out V1 API test helper functions

2015-11-10 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-3418:
--
Target Version/s: 0.27.0  (was: 0.26.0)

> Factor out V1 API test helper functions
> ---
>
> Key: MESOS-3418
> URL: https://issues.apache.org/jira/browse/MESOS-3418
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joris Van Remoortere
>Assignee: Guangya Liu
>  Labels: beginner, mesosphere, newbie, v1_api
>
> We currently have some helper functionality for V1 API tests. This is copied 
> in a few test files.
> Factor this out into a common place once the API is stabilized.
> {code}
> // Helper class for using EXPECT_CALL since the Mesos scheduler API
>   // is callback based.
>   class Callbacks
>   {
>   public:
> MOCK_METHOD0(connected, void(void));
> MOCK_METHOD0(disconnected, void(void));
> MOCK_METHOD1(received, void(const std::queue<Event>&));
>   };
> {code}
> {code}
> // Enqueues all received events into a libprocess queue.
> // TODO(jmlvanre): Factor this common code out of tests into V1
> // helper.
> ACTION_P(Enqueue, queue)
> {
>   std::queue<Event> events = arg0;
>   while (!events.empty()) {
> // Note that we currently drop HEARTBEATs because most of these tests
> // are not designed to deal with heartbeats.
> // TODO(vinod): Implement DROP_HTTP_CALLS that can filter heartbeats.
> if (events.front().type() == Event::HEARTBEAT) {
>   VLOG(1) << "Ignoring HEARTBEAT event";
> } else {
>   queue->put(events.front());
> }
> events.pop();
>   }
> }
> {code}
> We can also update the helpers in {{/tests/mesos.hpp}} to support the V1 API. 
>  This would let us get rid of lines like:
> {code}
> v1::TaskInfo taskInfo = evolve(createTask(devolve(offer), "", 
> DEFAULT_EXECUTOR_ID));
> {code}
> In favor of:
> {code}
> v1::TaskInfo taskInfo = createTask(offer, "", DEFAULT_EXECUTOR_ID);
> {code}





[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998720#comment-14998720
 ] 

Neil Conway commented on MESOS-3870:


A process can't be enqueued onto the runq twice. This is prevented because a 
process is only added to the runq when it receives an event while in state 
"BLOCKED"; once a process is on the runq, its state is changed to "READY", so 
it won't be re-added.
(https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2998-L3017)



[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998730#comment-14998730
 ] 

haosdent commented on MESOS-3870:
-

Yes, but I think it could still be enqueued twice, because ProcessBase.state 
may be stale in different CPU cores' caches.



[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-11-10 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998731#comment-14998731
 ] 

Till Toenshoff commented on MESOS-3851:
---

The following commit fixes the crash. We may still want to find the root cause 
of the race condition, so I will not close this ticket, but I will remove the 
target version (0.26.0) to unblock the 0.26.0 release.

{noformat}
commit b6d4b28a4c9ca717ad8be5bbc27e40c005fc51ad
Author: Timothy Chen 
Date:   Tue Nov 10 15:46:17 2015 +0100

Removed unused checks in command executor.

Review: https://reviews.apache.org/r/40107
{noformat}

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> After https://reviews.apache.org/r/38900 (updating CommandExecutor to 
> support rootfs), some tests show frequent crashes due to assertion 
> violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
> @ 0x7f4f71529cbd  google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status 

[jira] [Comment Edited] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998884#comment-14998884
 ] 

haosdent edited comment on MESOS-3870 at 11/10/15 4:49 PM:
---

I think there is another case that could make the same Process run on 
different threads. Suppose ProcessBase pops event A in thread 1 and changes 
ProcessBase.state to BLOCKED in 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2463
 , but has not yet consumed event A. Then event B arrives and enqueues the 
same Process onto ProcessManager.runq in 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L3008
 . Then ProcessManager dequeues it in thread 2, pops event B, and runs event B 
while thread 1 has not yet run the consumer function for event A. Is this 
possible?





[jira] [Updated] (MESOS-3865) Failover and recovery in presence of Quota

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3865:
---
Shepherd: Joris Van Remoortere  (was: Benjamin Hindman)

> Failover and recovery in presence of Quota
> --
>
> Key: MESOS-3865
> URL: https://issues.apache.org/jira/browse/MESOS-3865
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, master
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Quota complicates master failover and recovery in several ways. The new 
> master should determine if it is possible to satisfy the total quota and 
> notify an operator in case it's not (imagine simultaneous failovers of 
> multiple agents). The new master should hint the allocator how many agents 
> might reconnect in the future to help it decide how to satisfy quota before 
> the majority of agents reconnect.
> The allocator interface should be updated with some sort of recovery 
> information, which will allow it to react properly (e.g. seize offers and 
> hold off resources for some time).





[jira] [Commented] (MESOS-809) External control of the ip that Mesos components publish to zookeeper

2015-11-10 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998931#comment-14998931
 ] 

Anindya Sinha commented on MESOS-809:
-

The Mesos master/slave (libprocess) binds to the ip:port indicated via the 
environment variables LIBPROCESS_IP and LIBPROCESS_PORT (or via the --ip and 
--port command line args). If these are private IPs, the node is not reachable 
from outside, e.g. by schedulers, so we need a publicly accessible IP:port at 
which the master/slave can be reached from other nodes.
In this case, the publicly accessible IP:port should be specified via the 
environment variables LIBPROCESS_ADVERTISE_IP and LIBPROCESS_ADVERTISE_PORT 
(on the master they can also be specified via the command line args 
--advertise_ip and --advertise_port). Note that MESOS-3809 will add these 
command line args to the mesos slave as well; until then, use the environment 
variables.
Hope this helps.
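A hedged example of the setup described above, with placeholder addresses and ports:

```shell
# Bind to the private, in-container address (placeholder values).
export LIBPROCESS_IP=172.17.0.2
export LIBPROCESS_PORT=5051

# Advertise the publicly reachable address that other nodes should use.
export LIBPROCESS_ADVERTISE_IP=203.0.113.10
export LIBPROCESS_ADVERTISE_PORT=5051

# On the master, the advertise settings can also be given as flags:
# mesos-master --ip=172.17.0.2 --advertise_ip=203.0.113.10 --advertise_port=5051
```

Other nodes will then contact the master/slave at the advertised address while the process itself stays bound inside the container.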

> External control of the ip that Mesos components publish to zookeeper
> -
>
> Key: MESOS-809
> URL: https://issues.apache.org/jira/browse/MESOS-809
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework, master, slave
>Affects Versions: 0.14.2
>Reporter: Khalid Goudeaux
>Assignee: Anindya Sinha
>Priority: Minor
> Fix For: 0.24.0
>
>
> With tools like Docker making containers more manageable, it's tempting to 
> use containers for all software installation. The CoreOS project is an 
> example of this.
> When an application is run inside a container it sees a different ip/hostname 
> from the host system running the container. That ip is only valid from inside 
> that host, no other machine can see it.
> From inside a container, the Mesos master and slave publish that private ip 
> to zookeeper and as a result they can't find each other if they're on 
> different machines. The --ip option can't help because the public ip isn't 
> available for binding from within a container.
> Essentially, from inside the container, mesos processes don't know the ip 
> they're available at (they may not know the port either).
> It would be nice to bootstrap the processes with the correct ip for them to 
> publish to zookeeper.





[jira] [Commented] (MESOS-3062) Add authorization for dynamic reservation

2015-11-10 Thread Gabriel Hartmann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999169#comment-14999169
 ] 

Gabriel Hartmann commented on MESOS-3062:
-

Is it possible in this scheme that a Framework could see Offers it couldn't 
accept?  Or does the work here imply that if a resource was reserved with a 
given role/principal pair and ACLs that it would only be re-offered to 
Frameworks authorized under the same role/principal pair?

> Add authorization for dynamic reservation
> -
>
> Key: MESOS-3062
> URL: https://issues.apache.org/jira/browse/MESOS-3062
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Michael Park
>Assignee: Greg Mann
>  Labels: mesosphere, persistent-volumes
>
> Dynamic reservations should be authorized with the {{principal}} of the 
> reserving entity (framework or master). The idea is to introduce {{Reserve}} 
> and {{Unreserve}} into the ACL.
> {code}
>   message Reserve {
> // Subjects.
> required Entity principals = 1;
> // Objects.  MVP: Only possible values = ANY, NONE
> required Entity resources = 2;
>   }
>   message Unreserve {
> // Subjects.
> required Entity principals = 1;
> // Objects.
> required Entity reserver_principals = 2;
>   }
> {code}
> When a framework/operator reserves resources, "reserve" ACLs are checked to 
> see if the framework ({{FrameworkInfo.principal}}) or the operator 
> ({{Credential.user}}) is authorized to reserve the specified resources. If 
> not authorized, the reserve operation is rejected.
> When a framework/operator unreserves resources, "unreserve" ACLs are checked 
> to see if the framework ({{FrameworkInfo.principal}}) or the operator 
> ({{Credential.user}}) is authorized to unreserve the resources reserved by a 
> framework or operator ({{Resource.ReservationInfo.principal}}). If not 
> authorized, the unreserve operation is rejected.
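The ACL check flow described above can be sketched as follows; the {{ReserveACL}} struct, the {{authorizedToReserve}} helper, and the permissive default when no ACL matches are illustrative assumptions, not the actual Mesos authorizer interface:

```cpp
#include <string>
#include <vector>

// Simplified, hypothetical model of a "reserve" ACL entry.
struct ReserveACL {
  std::vector<std::string> principals;  // Subjects; "ANY" matches everyone.
  bool allow;                           // Whether matching principals may reserve.
};

// Returns true if `principal` is authorized to reserve resources.
// Mirrors the described flow: scan the "reserve" ACLs for a matching
// subject and honor the first match; if nothing matches, fall back to a
// permissive default (the real default is a configuration matter).
bool authorizedToReserve(const std::string& principal,
                         const std::vector<ReserveACL>& acls) {
  for (const ReserveACL& acl : acls) {
    for (const std::string& subject : acl.principals) {
      if (subject == "ANY" || subject == principal) {
        return acl.allow;
      }
    }
  }
  return true;  // No matching ACL: permissive default (assumption).
}
```

The "unreserve" check would follow the same shape, with the object being the reserving principal ({{Resource.ReservationInfo.principal}}) rather than the resources.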





[jira] [Issue Comment Deleted] (MESOS-809) External control of the ip that Mesos components publish to zookeeper

2015-11-10 Thread Anindya Sinha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anindya Sinha updated MESOS-809:

Comment: was deleted

(was: The Mesos master/slave (libprocess) binds to the ip:port indicated via 
the environment variables LIBPROCESS_IP and LIBPROCESS_PORT (or via the --ip 
and --port command line arguments). If these are private IPs, the node is not 
reachable from the outside (e.g. by schedulers), so we need a publicly 
accessible IP:port at which the master/slave can be reached from other nodes.
In that case, the publicly accessible IP:port should be specified via the 
environment variables LIBPROCESS_ADVERTISE_IP and LIBPROCESS_ADVERTISE_PORT 
(on the master they can also be specified via the command line arguments 
--advertise_ip and --advertise_port). Note that MESOS-3809 will add these 
command line arguments to the Mesos slave as well; until then, you can use the 
environment variables.

Hope this helps.
)
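The bind/advertise precedence described in the deleted comment can be sketched as follows; the {{publishedIp}} helper is hypothetical, though the environment variable names are the real libprocess ones:

```cpp
#include <cstdlib>
#include <string>

// Returns the IP that should be published (e.g. to ZooKeeper): prefer
// the advertised address, falling back to the bind address. This
// helper is illustrative; libprocess reads the same values internally.
std::string publishedIp(const char* advertiseIp, const char* bindIp) {
  if (advertiseIp != nullptr && *advertiseIp != '\0') {
    return advertiseIp;  // LIBPROCESS_ADVERTISE_IP wins when set.
  }
  return bindIp == nullptr ? "" : bindIp;  // Otherwise use LIBPROCESS_IP.
}

// In practice the inputs would come from the environment, e.g.:
//   publishedIp(std::getenv("LIBPROCESS_ADVERTISE_IP"),
//               std::getenv("LIBPROCESS_IP"));
```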

> External control of the ip that Mesos components publish to zookeeper
> -
>
> Key: MESOS-809
> URL: https://issues.apache.org/jira/browse/MESOS-809
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework, master, slave
>Affects Versions: 0.14.2
>Reporter: Khalid Goudeaux
>Assignee: Anindya Sinha
>Priority: Minor
> Fix For: 0.24.0
>
>
> With tools like Docker making containers more manageable, it's tempting to 
> use containers for all software installation. The CoreOS project is an 
> example of this.
> When an application is run inside a container it sees a different ip/hostname 
> from the host system running the container. That ip is only valid from inside 
> that host, no other machine can see it.
> From inside a container, the Mesos master and slave publish that private ip 
> to zookeeper and as a result they can't find each other if they're on 
> different machines. The --ip option can't help because the public ip isn't 
> available for binding from within a container.
> Essentially, from inside the container, mesos processes don't know the ip 
> they're available at (they may not know the port either).
> It would be nice to bootstrap the processes with the correct ip for them to 
> publish to zookeeper.





[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998971#comment-14998971
 ] 

haosdent commented on MESOS-3870:
-

That does not look possible: only after consuming event A does 
ProcessBase.state become BLOCKED.

> Prevent out-of-order libprocess message delivery
> 
>
> Key: MESOS-3870
> URL: https://issues.apache.org/jira/browse/MESOS-3870
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Priority: Minor
>  Labels: mesosphere
>
> I was under the impression that {{send()}} provided in-order, unreliable 
> message delivery. So if P1 sends <M1, M2> to P2, P2 might see <> (nothing), 
> <M1>, <M2>, or <M1, M2>, but never <M2, M1>.
> I suspect much of the code makes a similar assumption. However, it appears 
> that this behavior is not guaranteed. slave.cpp:2217 has the following 
> comment:
> {noformat}
>   // TODO(jieyu): Here we assume that CheckpointResourcesMessages are
>   // ordered (i.e., slave receives them in the same order master sends
>   // them). This should be true in most of the cases because TCP
>   // enforces in order delivery per connection. However, the ordering
>   // is technically not guaranteed because master creates multiple
>   // connections to the slave in some cases (e.g., persistent socket
>   // to slave breaks and master uses ephemeral socket). This could
>   // potentially be solved by using a version number and rejecting
>   // stale messages according to the version number.
> {noformat}
> We can improve this situation by _either_: (1) fixing libprocess to guarantee 
> ordered message delivery, e.g., by adding a sequence number, or (2) 
> clarifying that ordered message delivery is not guaranteed, and ideally 
> providing a tool to force messages to be delivered out-of-order.
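The sequence-number idea from the TODO above can be sketched as follows; {{OrderedReceiver}} and its method names are illustrative, not libprocess API:

```cpp
#include <cstdint>

// Receiver-side guard against out-of-order delivery: tag each message
// with a monotonically increasing sequence number and drop anything
// that is not strictly newer than what has already been accepted.
struct OrderedReceiver {
  uint64_t lastSeen = 0;  // Highest sequence number accepted so far.

  // Returns true if the message should be processed. Rejecting
  // non-increasing sequence numbers discards duplicates and late
  // arrivals from older connections (the "ephemeral socket" case).
  bool accept(uint64_t sequence) {
    if (sequence <= lastSeen) {
      return false;  // Stale or duplicate: drop.
    }
    lastSeen = sequence;
    return true;
  }
};
```

Note that dropping stale messages (rather than reordering them) matches the CheckpointResourcesMessage use case, where only the latest state matters.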





[jira] [Created] (MESOS-3876) Per-Framework Dynamic Reservation

2015-11-10 Thread Gabriel Hartmann (JIRA)
Gabriel Hartmann created MESOS-3876:
---

 Summary: Per-Framework Dynamic Reservation
 Key: MESOS-3876
 URL: https://issues.apache.org/jira/browse/MESOS-3876
 Project: Mesos
  Issue Type: Task
Reporter: Gabriel Hartmann


An instance of a Framework should be able to reserve resources in such a way 
that it is the only party which receives Offers for them once they are 
reserved. It should not have to resort to dynamic generation of Roles, as this 
also exposes the ability to change Weights.

This avoids any possibility that resources an instance of a Framework expects 
to own are used by some other instance. It also simplifies the required 
Framework logic, since each instance doesn't have to filter out reserved 
Resources not intended for it.





[jira] [Created] (MESOS-3878) Log responses for HTTP requests

2015-11-10 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-3878:
--

 Summary: Log responses for HTTP requests
 Key: MESOS-3878
 URL: https://issues.apache.org/jira/browse/MESOS-3878
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Alexander Rukletsov


When an HTTP request comes in, we log it twice: in libprocess using 
{{VLOG}} and in the Mesos route handlers using {{LOG(INFO)}} (see MESOS-2519). 
However, we do not log the response, neither a successful one nor an error.

In order to simplify debugging, I suggest we at least add symmetric logging 
for *all* responses at the libprocess level, using the same logging level as 
is used now for incoming requests. We may additionally want to log messages 
for error responses (e.g. {{BadRequest}}, {{Conflict}}) in Mesos at the 
{{LOG(ERROR)}} level, providing additional information such as the time taken 
to process the request.
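A sketch of what such a symmetric response log line could contain, including the processing time; {{formatResponseLog}} and the exact format are assumptions for illustration, and real libprocess logging would go through glog's {{VLOG}}/{{LOG}} macros:

```cpp
#include <chrono>
#include <sstream>
#include <string>

// Builds a single log line describing a completed HTTP exchange:
// method, path, status code, and how long the request took to process.
std::string formatResponseLog(const std::string& method,
                              const std::string& path,
                              int status,
                              std::chrono::milliseconds elapsed) {
  std::ostringstream out;
  out << "HTTP response: " << method << " " << path
      << " -> " << status << " (" << elapsed.count() << "ms)";
  return out.str();
}
```

Capturing the start time when the request is logged and computing the delta at response time would give the elapsed value.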





[jira] [Created] (MESOS-3880) Propose a guideline for log messages

2015-11-10 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-3880:
--

 Summary: Propose a guideline for log messages
 Key: MESOS-3880
 URL: https://issues.apache.org/jira/browse/MESOS-3880
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Alexander Rukletsov


We are rather inconsistent in the way we write log messages. It would be 
helpful to come up with a style and document various aspects of logs, including 
but not limited to:
* Usage of backticks and/or single quotes to quote interpolated variables;
* Usage of backticks and/or single quotes to quote types and other names;
* Usage of tenses and other grammatical forms;
* Proper way of nesting [error] messages;





[jira] [Created] (MESOS-3881) Implement `stout/os/pstree.hpp` on Windows

2015-11-10 Thread Alex Clemmer (JIRA)
Alex Clemmer created MESOS-3881:
---

 Summary: Implement `stout/os/pstree.hpp` on Windows
 Key: MESOS-3881
 URL: https://issues.apache.org/jira/browse/MESOS-3881
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Alex Clemmer
Assignee: Alex Clemmer








[jira] [Created] (MESOS-3877) Add operator documentation for quota

2015-11-10 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-3877:
--

 Summary: Add operator documentation for quota
 Key: MESOS-3877
 URL: https://issues.apache.org/jira/browse/MESOS-3877
 Project: Mesos
  Issue Type: Task
  Components: documentation
Reporter: Alexander Rukletsov
Assignee: Alexander Rukletsov


Add an operator guide for quota which describes basic usage of the endpoints 
and a few basic and advanced use cases.





[jira] [Updated] (MESOS-3879) Incorrect and inconsistent include order for and .

2015-11-10 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-3879:
---
Story Points: 1

> Incorrect and inconsistent include order for  and 
> .
> -
>
> Key: MESOS-3879
> URL: https://issues.apache.org/jira/browse/MESOS-3879
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Minor
>
> We currently have an inconsistent (and mostly incorrect) include order for 
>  and  (see below). Some files include them (incorrectly) between the C and 
> C++ standard headers, while others correctly include them afterwards. 
> According to the 
> [Google Styleguide|https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes], 
> the second include order is correct.
> {code:title=external_containerizer_test.cpp}
> #include 
> #include 
> #include 
> {code}
> {code:title=launcher.hpp}
> #include 
> #include 
> {code}





[jira] [Created] (MESOS-3879) Incorrect and inconsistent include order for and .

2015-11-10 Thread Joerg Schad (JIRA)
Joerg Schad created MESOS-3879:
--

 Summary: Incorrect and inconsistent include order for 
 and .
 Key: MESOS-3879
 URL: https://issues.apache.org/jira/browse/MESOS-3879
 Project: Mesos
  Issue Type: Bug
Reporter: Joerg Schad
Assignee: Joerg Schad
Priority: Minor


We currently have an inconsistent (and mostly incorrect) include order for 
 and  (see below). Some files include them (incorrectly) between the C and 
C++ standard headers, while others correctly include them afterwards. 
According to the 
[Google Styleguide|https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes], 
the second include order is correct.


{code:title=external_containerizer_test.cpp}
#include 

#include 

#include 
{code}

{code:title=launcher.hpp}
#include 

#include 
{code}
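For reference, a sketch of the include order the Google style guide prescribes; the concrete headers below are placeholders, since the ones under discussion were lost in formatting:

```cpp
// Order: C system headers first, then C++ standard library headers,
// then other libraries' headers, then project headers.
#include <stdint.h>  // C system header.

#include <string>    // C++ standard library headers.
#include <vector>

// Third-party headers would go next, e.g. (placeholder):
// #include <gmock/gmock.h>

// Project headers would come last, e.g. (placeholder):
// #include "some/project/header.hpp"

// Trivial function so this translation unit is self-contained.
int includeOrderExample() { return 0; }
```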





[jira] [Updated] (MESOS-3878) Log responses for HTTP requests

2015-11-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3878:
---
Description: 
When an HTTP request comes in, we log it twice: in libprocess using 
{{VLOG}} and in the Mesos route handlers using {{LOG(INFO)}} (see MESOS-2519). 
However, we do not log the response, neither a successful one nor an error.

In order to simplify debugging, I suggest we add symmetric logging for *all* 
responses at the libprocess level, using the same logging level as is used 
now for incoming requests. We may additionally want to log messages for error 
responses (e.g. {{BadRequest}}, {{Conflict}}) in Mesos at the {{LOG(ERROR)}} 
level, providing additional information such as the time taken to process the 
request.

  was:
When an HTTP request comes in, we log it twice: in libprocess using 
{{VLOG}} and in the Mesos route handlers using {{LOG(INFO)}} (see MESOS-2519). 
However, we do not log the response, neither a successful one nor an error.

In order to simplify debugging, I suggest we at least add symmetric logging 
for *all* responses at the libprocess level, using the same logging level as 
is used now for incoming requests. We may additionally want to log messages 
for error responses (e.g. {{BadRequest}}, {{Conflict}}) in Mesos at the 
{{LOG(ERROR)}} level, providing additional information such as the time taken 
to process the request.


> Log responses for HTTP requests
> ---
>
> Key: MESOS-3878
> URL: https://issues.apache.org/jira/browse/MESOS-3878
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Alexander Rukletsov
>  Labels: mesosphere, newbie++
>
> When an HTTP request comes in, we log it twice: in libprocess using 
> {{VLOG}} and in the Mesos route handlers using {{LOG(INFO)}} (see 
> MESOS-2519). However, we do not log the response, neither a successful one 
> nor an error.
> In order to simplify debugging, I suggest we add symmetric logging for *all* 
> responses at the libprocess level, using the same logging level as is used 
> now for incoming requests. We may additionally want to log messages for 
> error responses (e.g. {{BadRequest}}, {{Conflict}}) in Mesos at the 
> {{LOG(ERROR)}} level, providing additional information such as the time 
> taken to process the request.





[jira] [Updated] (MESOS-3283) Improve allocation performance especially with large number of slaves and frameworks.

2015-11-10 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-3283:
---
Assignee: (was: Marco Massenzio)

> Improve allocation performance especially with large number of slaves and 
> frameworks.
> -
>
> Key: MESOS-3283
> URL: https://issues.apache.org/jira/browse/MESOS-3283
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Affects Versions: 0.23.0
>Reporter: Mandeep Chadha
>  Labels: mesosphere, tech-debt
>
> Improve batch allocation performance, especially with a large number of 
> slaves and frameworks. 
> E.g., these are the allocation timings for 10K slaves and varying numbers 
> of frameworks.
> Using 10000 slaves and 1 frameworks
> Added 10000 slaves in 14.50836112secs
> Updated 10000 slaves in 18.665093703secs
> [   OK ] 
> SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/12 (34983 
> ms)
> [ RUN  ] 
> SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13
> Using 10000 slaves and 50 frameworks
> Added 10000 slaves in 51.534229549secs
> Updated 10000 slaves in 57.131554303secs
> [   OK ] 
> SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13 (110449 
> ms)
> [ RUN  ] 
> SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14
> Using 10000 slaves and 100 frameworks
> Added 10000 slaves in 1.5891310434mins
> Updated 10000 slaves in 1.80562078148333mins
> [   OK ] 
> SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14 (205467 
> ms)
> [ RUN  ] 
> SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/15
> Using 10000 slaves and 200 frameworks
> Added 10000 slaves in 3.0750647275mins
> Updated 10000 slaves in 3.85846762096667mins





[jira] [Updated] (MESOS-3035) As a Developer I would like a standard way to run a Subprocess in libprocess

2015-11-10 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-3035:
---
Shepherd: Michael Park  (was: Joris Van Remoortere)

> As a Developer I would like a standard way to run a Subprocess in libprocess
> 
>
> Key: MESOS-3035
> URL: https://issues.apache.org/jira/browse/MESOS-3035
> Project: Mesos
>  Issue Type: Story
>  Components: libprocess
>Reporter: Marco Massenzio
>Assignee: Marco Massenzio
>
> As part of MESOS-2830 and MESOS-2902 I have been researching the ability to 
> run a {{Subprocess}} and capture the {{stdout / stderr}} along with the exit 
> status code.
> {{process::subprocess()}} offers much of the functionality, but in a way that 
> still requires a lot of handiwork on the developer's part; we would like to 
> further abstract away the ability to just pass a string, an optional set of 
> command-line arguments and then collect the output of the command (bonus: 
> without blocking).
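The desired interface can be sketched as follows; this blocking popen()-based helper only illustrates the shape of the API being asked for (pass a command string, get back output and exit status), whereas the actual proposal is an asynchronous libprocess abstraction. {{runCommand}} is a hypothetical name:

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <utility>

// Runs a command line via the shell and returns its captured stdout
// together with the status reported by pclose() (0 on clean exit).
// POSIX-only sketch; error handling is deliberately minimal.
std::pair<std::string, int> runCommand(const std::string& command) {
  std::string output;
  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return {output, -1};  // Failed to launch.
  }
  std::array<char, 256> buffer;
  while (fgets(buffer.data(), buffer.size(), pipe) != nullptr) {
    output += buffer.data();  // Accumulate stdout.
  }
  return {output, pclose(pipe)};
}
```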





[jira] [Commented] (MESOS-3062) Add authorization for dynamic reservation

2015-11-10 Thread Gabriel Hartmann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999351#comment-14999351
 ] 

Gabriel Hartmann commented on MESOS-3062:
-

Thanks Greg.  I was hoping we were almost going to get to per-framework dynamic 
reservation with this, but I guess not.

> Add authorization for dynamic reservation
> -
>
> Key: MESOS-3062
> URL: https://issues.apache.org/jira/browse/MESOS-3062
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Michael Park
>Assignee: Greg Mann
>  Labels: mesosphere, persistent-volumes
>
> Dynamic reservations should be authorized with the {{principal}} of the 
> reserving entity (framework or master). The idea is to introduce {{Reserve}} 
> and {{Unreserve}} into the ACL.
> {code}
>   message Reserve {
> // Subjects.
> required Entity principals = 1;
> // Objects.  MVP: Only possible values = ANY, NONE
> required Entity resources = 2;
>   }
>   message Unreserve {
> // Subjects.
> required Entity principals = 1;
> // Objects.
> required Entity reserver_principals = 2;
>   }
> {code}
> When a framework/operator reserves resources, "reserve" ACLs are checked to 
> see if the framework ({{FrameworkInfo.principal}}) or the operator 
> ({{Credential.user}}) is authorized to reserve the specified resources. If 
> not authorized, the reserve operation is rejected.
> When a framework/operator unreserves resources, "unreserve" ACLs are checked 
> to see if the framework ({{FrameworkInfo.principal}}) or the operator 
> ({{Credential.user}}) is authorized to unreserve the resources reserved by a 
> framework or operator ({{Resource.ReservationInfo.principal}}). If not 
> authorized, the unreserve operation is rejected.





[jira] [Created] (MESOS-3882) Libprocess: Implement process::Clock::finalize

2015-11-10 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-3882:


 Summary: Libprocess: Implement process::Clock::finalize
 Key: MESOS-3882
 URL: https://issues.apache.org/jira/browse/MESOS-3882
 Project: Mesos
  Issue Type: Task
  Components: libprocess, test
Reporter: Joseph Wu
Assignee: Joseph Wu


Tracks this 
[TODO|https://github.com/apache/mesos/blob/aa0cd7ed4edf1184cbc592b5caa2429a8373e813/3rdparty/libprocess/src/process.cpp#L974-L975].

The {{Clock}} is initialized with a callback that, among other things, will 
dereference the global {{process_manager}} object.

When libprocess is shutting down, the {{process_manager}} is cleaned up.  
Between cleanup and termination of libprocess, there is some chance that a 
{{Timer}} will time out and result in dereferencing {{process_manager}}.

*Proposal* 
* Implement {{Clock::finalize}}.  This would clear:
** existing timers
** process-specific clocks
** ticks
* Change {{process::finalize}}.
*# Resume the clock.  (The clock is only paused during some tests.)  When the 
clock is not paused, the callback does not dereference {{process_manager}}.
*# Clean up {{process_manager}}.  This terminates all the processes that would 
potentially interact with {{Clock}}.
*# Call {{Clock::finalize}}.
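The timer-clearing part of the proposal can be modeled with a toy clock; {{ToyClock}} and its members are illustrative, not the libprocess {{Clock}} API:

```cpp
#include <cstddef>
#include <functional>
#include <map>

// Toy model of the proposed Clock::finalize: discard all pending
// timers so that no callback can fire (and dereference
// process_manager) during teardown.
class ToyClock {
public:
  // Register a callback to fire at the given tick.
  void schedule(long long tick, std::function<void()> callback) {
    timers.emplace(tick, std::move(callback));
  }

  std::size_t pending() const { return timers.size(); }

  // Clears existing timers, per the proposal. The real implementation
  // would also clear process-specific clocks and ticks.
  void finalize() { timers.clear(); }

private:
  std::multimap<long long, std::function<void()>> timers;
};
```

Once {{finalize()}} has run, no scheduled callback remains to be invoked, which is exactly the property needed between the cleanup of {{process_manager}} and process termination.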





[jira] [Updated] (MESOS-3035) As a Developer I would like a standard way to run a Subprocess in libprocess

2015-11-10 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-3035:
---
Labels: mesosphere tech-debt  (was: )

> As a Developer I would like a standard way to run a Subprocess in libprocess
> 
>
> Key: MESOS-3035
> URL: https://issues.apache.org/jira/browse/MESOS-3035
> Project: Mesos
>  Issue Type: Story
>  Components: libprocess
>Reporter: Marco Massenzio
>Assignee: Marco Massenzio
>  Labels: mesosphere, tech-debt
>
> As part of MESOS-2830 and MESOS-2902 I have been researching the ability to 
> run a {{Subprocess}} and capture the {{stdout / stderr}} along with the exit 
> status code.
> {{process::subprocess()}} offers much of the functionality, but in a way that 
> still requires a lot of handiwork on the developer's part; we would like to 
> further abstract away the ability to just pass a string, an optional set of 
> command-line arguments and then collect the output of the command (bonus: 
> without blocking).





[jira] [Commented] (MESOS-3062) Add authorization for dynamic reservation

2015-11-10 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999288#comment-14999288
 ] 

Greg Mann commented on MESOS-3062:
--

These patches don't affect which offers are made to which frameworks, nor which 
frameworks can accept which offers; a framework should still be able to utilize 
all the resources offered to it. Reserved resources will be offered to, and can 
be used by, any framework registered with the appropriate role, regardless of 
which principal did the reserving.

This work provides authorization for the {{Reserve}} and {{Unreserve}} offer 
operations. So while a framework can still accept all the offers it receives, 
these patches do mean that a framework could receive offers containing 
resources which it doesn't have permission to reserve. A framework could also 
receive offers containing dynamically-reserved resources which it doesn't have 
the permission to unreserve.

> Add authorization for dynamic reservation
> -
>
> Key: MESOS-3062
> URL: https://issues.apache.org/jira/browse/MESOS-3062
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Michael Park
>Assignee: Greg Mann
>  Labels: mesosphere, persistent-volumes
>
> Dynamic reservations should be authorized with the {{principal}} of the 
> reserving entity (framework or master). The idea is to introduce {{Reserve}} 
> and {{Unreserve}} into the ACL.
> {code}
>   message Reserve {
> // Subjects.
> required Entity principals = 1;
> // Objects.  MVP: Only possible values = ANY, NONE
> required Entity resources = 2;
>   }
>   message Unreserve {
> // Subjects.
> required Entity principals = 1;
> // Objects.
> required Entity reserver_principals = 2;
>   }
> {code}
> When a framework/operator reserves resources, "reserve" ACLs are checked to 
> see if the framework ({{FrameworkInfo.principal}}) or the operator 
> ({{Credential.user}}) is authorized to reserve the specified resources. If 
> not authorized, the reserve operation is rejected.
> When a framework/operator unreserves resources, "unreserve" ACLs are checked 
> to see if the framework ({{FrameworkInfo.principal}}) or the operator 
> ({{Credential.user}}) is authorized to unreserve the resources reserved by a 
> framework or operator ({{Resource.ReservationInfo.principal}}). If not 
> authorized, the unreserve operation is rejected.





[jira] [Updated] (MESOS-3220) Offer ability to kill tasks from the API

2015-11-10 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-3220:
---
Description: 
We are investigating adding a {{dcos task kill}} command to our DCOS (and 
Mesos) command line interface. Currently the ability to kill tasks is only 
offered via the scheduler API so it would be useful to have some ability to 
kill tasks directly.

This would complement the Maintenance Primitives, in that it would enable the 
operator to terminate those tasks which, for whatever reasons, do not respond 
to Inverse Offers events.

  was:
We are investigating adding a `dcos task kill` command to our DCOS (and Mesos) 
command line interface. Currently the ability to kill tasks is only offered via 
the scheduler API so it would be useful to have some ability to kill tasks 
directly.

This is a blocker for the DCOS CLI!


> Offer ability to kill tasks from the API
> 
>
> Key: MESOS-3220
> URL: https://issues.apache.org/jira/browse/MESOS-3220
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Sunil Shah
>Assignee: Marco Massenzio
>Priority: Blocker
>  Labels: mesosphere
>
> We are investigating adding a {{dcos task kill}} command to our DCOS (and 
> Mesos) command line interface. Currently the ability to kill tasks is only 
> offered via the scheduler API so it would be useful to have some ability to 
> kill tasks directly.
> This would complement the Maintenance Primitives, in that it would enable the 
> operator to terminate those tasks which, for whatever reasons, do not respond 
> to Inverse Offers events.





[jira] [Assigned] (MESOS-3876) Per-Framework Dynamic Reservation

2015-11-10 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-3876:
--

Assignee: Guangya Liu

> Per-Framework Dynamic Reservation
> -
>
> Key: MESOS-3876
> URL: https://issues.apache.org/jira/browse/MESOS-3876
> Project: Mesos
>  Issue Type: Task
>Reporter: Gabriel Hartmann
>Assignee: Guangya Liu
>
> An instance of a Framework should be able to reserve resources in such a 
> way that it is the only party which receives Offers for them once they are 
> reserved. It should not have to resort to dynamic generation of Roles, as 
> this also exposes the ability to change Weights.
> This avoids any possibility that resources an instance of a Framework 
> expects to own are used by some other instance. It also simplifies the 
> required Framework logic, since each instance doesn't have to filter out 
> reserved Resources not intended for it.





[jira] [Updated] (MESOS-3863) Investigate the requirements of programmatically re-initializing libprocess

2015-11-10 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3863:
-
Description: 
This issue is for investigating what needs to be added/changed in 
{{process::finalize}} such that {{process::initialize}} will start on a clean 
slate.  Additional issues will be created once done.  Also see [the parent 
issue|MESOS-3820].

{{process::finalize}} should cover the following components:
* {{__s__}} (the server socket)
** {{delete}} should be sufficient.  This closes the socket and thereby 
prevents any further interaction from it.
* {{process_manager}}
** Related prior work: [MESOS-3158]
** Cleans up the garbage collector, help, logging, profiler, statistics, route 
processes (including [this 
one|https://github.com/apache/mesos/blob/3bda55da1d0b580a1b7de43babfdc0d30fbc87ea/3rdparty/libprocess/src/process.cpp#L963],
 which currently leaks a pointer).
** Cleans up any other {{spawn}} 'd process.
** Manages the {{EventLoop}}.
* {{Clock}}
** The goal here is to clear any timers so that nothing can dereference 
{{process_manager}} while we're finalizing/finalized.  It's probably not 
important to execute any remaining timers, since we're "shutting down" 
libprocess.  This means:
*** The clock should be {{paused}} and {{settled}} before the clean up of 
{{process_manager}}.
*** Processes, which might interact with the {{Clock}}, should be cleaned up 
next.
*** A new {{Clock::finalize}} method would then clear timers, process-specific 
clocks, and {{tick}}s, and then {{resume}} the clock.
* {{__address__}} (the advertised IP and port)
** Needs to be cleared after {{process_manager}} has been cleaned up.  
Processes use this to communicate events.  If cleared prematurely, 
{{TerminateEvents}} will not be sent correctly, leading to infinite waits.
* {{socket_manager}}
** The idea here is to close all sockets and deallocate any existing 
{{HttpProxy}} or {{Encoder}} objects.
** All sockets are created via {{__s__}}, so cleaning up the server socket 
prior will prevent any new activity.
* {{mime}}
** This is effectively a static map.
** It should be possible to statically initialize it.
* Synchronization atomics {{initialized}} & {{initializing}}.
** Once cleanup is done, these should be reset.

*Summary*:
* Implement {{Clock::finalize}}.  [MESOS-3882]
* Implement {{~SocketManager}}.
* Clean up {{mime}}.
* Wrap everything up in {{process::finalize}}.
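The teardown ordering argued for above can be made explicit as a sequence of steps; each stub below just records its name so the required order can be checked, and none of it is the real libprocess internals:

```cpp
#include <string>
#include <vector>

// Returns the finalize steps in the order the description argues for:
// the server socket closes first (no new activity), the clock is
// quiesced before process_manager is cleaned up, Clock::finalize runs
// after processes are gone, and __address__ is cleared only once no
// process can still need it for TerminateEvents.
std::vector<std::string> finalizeOrder() {
  std::vector<std::string> order;
  order.push_back("close server socket (__s__)");
  order.push_back("pause/settle Clock");
  order.push_back("cleanup process_manager");
  order.push_back("Clock::finalize");
  order.push_back("clear __address__");
  order.push_back("cleanup socket_manager");
  order.push_back("reset initialized/initializing");
  return order;
}
```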

  was:
This issue is for investigating what needs to be added/changed in 
{{process::finalize}} such that {{process::initialize}} will start on a clean 
slate.  Additional issues will be created once done.  Also see [the parent 
issue|MESOS-3820].

{{process::finalize}} should cover the following components:
* {{__s__}} (the server socket)
** {{delete}} should be sufficient.  This closes the socket and thereby 
prevents any further interaction from it.
* {{process_manager}}
** Related prior work: [MESOS-3158]
** Cleans up the garbage collector, help, logging, profiler, statistics, route 
processes (including [this 
one|https://github.com/apache/mesos/blob/3bda55da1d0b580a1b7de43babfdc0d30fbc87ea/3rdparty/libprocess/src/process.cpp#L963],
 which currently leaks a pointer).
** Cleans up any other {{spawn}} 'd process.
** Manages the {{EventLoop}}.
* {{Clock}}
** The goal here is to clear any timers so that nothing can dereference 
{{process_manager}} while we're finalizing/finalized.  It's probably not 
important to execute any remaining timers, since we're "shutting down" 
libprocess.  This means:
*** The clock should be {{paused}} and {{settled}} before the clean up of 
{{process_manager}}.
*** Processes, which might interact with the {{Clock}}, should be cleaned up 
next.
*** A new {{Clock::finalize}} method would then clear timers, process-specific 
clocks, and {{tick}}s, and then {{resume}} the clock.
* {{__address__}} (the advertised IP and port)
** Needs to be cleared after {{process_manager}} has been cleaned up.  
Processes use this to communicate events.  If cleared prematurely, 
{{TerminateEvents}} will not be sent correctly, leading to infinite waits.
* {{socket_manager}}
** The idea here is to close all sockets and deallocate any existing 
{{HttpProxy}} or {{Encoder}} objects.
** All sockets are created via {{__s__}}, so cleaning up the server socket 
prior will prevent any new activity.
* {{mime}}
** This is effectively a static map.
** It should be possible to statically initialize it.
* Synchronization atomics {{initialized}} & {{initializing}}.
** Once cleanup is done, these should be reset.

*Summary*:
* Implement {{Clock::finalize}}.
* Implement {{~SocketManager}}.
* Clean up {{mime}}.
* Wrap everything up in {{process::finalize}}.


> Investigate the requirements of programmatically re-initializing libprocess
> ---
>
> Key: MESOS-3863
>  

[jira] [Updated] (MESOS-3879) Incorrect and inconsistent include order for and .

2015-11-10 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-3879:
---
Sprint: Mesosphere Sprint 22
Labels: mesosphere  (was: )

> Incorrect and inconsistent include order for  and 
> .
> -
>
> Key: MESOS-3879
> URL: https://issues.apache.org/jira/browse/MESOS-3879
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Minor
>  Labels: mesosphere
>
> We currently have an inconsistent (and mostly incorrect) include order for 
>  and  (see below). Some files include them (incorrectly) between the C 
> and C++ standard headers, while others correctly include them afterwards. 
> According to the [Google Style 
> Guide|https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes] 
>  the second include order is correct.
> {code:title=external_containerizer_test.cpp}
> #include 
> #include 
> #include 
> {code}
> {code:title=launcher.hpp}
> #include 
> #include 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-11-10 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999494#comment-14999494
 ] 

Vinod Kone commented on MESOS-3851:
---

Doesn't look like this is related to the new HTTP executor logic, as this race 
seems to happen even in non-http-executor based tests. Also, the changes in the 
slave don't seem related. Either this race has always existed but only now got 
exposed due to the CHECK in the command executor, or there are some recent 
libprocess related changes that are the cause. 

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900, i.e., updating CommandExecutor to 
> support rootfs, there seem to be some tests showing frequent crashes due to 
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
> @ 0x7f4f71529cbd  google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager 
> successfully handled status update acknowledgement (UUID: 
> 

[jira] [Updated] (MESOS-3220) Offer ability to kill tasks from the API

2015-11-10 Thread Marco Massenzio (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Massenzio updated MESOS-3220:
---
Component/s: (was: python api)
 master

> Offer ability to kill tasks from the API
> 
>
> Key: MESOS-3220
> URL: https://issues.apache.org/jira/browse/MESOS-3220
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Sunil Shah
>Assignee: Marco Massenzio
>Priority: Blocker
>  Labels: mesosphere
>
> We are investigating adding a `dcos task kill` command to our DCOS (and 
> Mesos) command line interface. Currently the ability to kill tasks is only 
> offered via the scheduler API, so it would be useful to have some ability to 
> kill tasks directly.
> This is a blocker for the DCOS CLI!





[jira] [Created] (MESOS-3868) Make apply-review.sh use apply-reviews.py

2015-11-10 Thread Artem Harutyunyan (JIRA)
Artem Harutyunyan created MESOS-3868:


 Summary: Make apply-review.sh use apply-reviews.py
 Key: MESOS-3868
 URL: https://issues.apache.org/jira/browse/MESOS-3868
 Project: Mesos
  Issue Type: Bug
Reporter: Artem Harutyunyan
Assignee: Artem Harutyunyan








[jira] [Commented] (MESOS-1478) Replace Master/Slave terminology

2015-11-10 Thread Erik Weathers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998219#comment-14998219
 ] 

Erik Weathers commented on MESOS-1478:
--

If I may ask, can someone please explain where the discussion & conclusion 
about the choice of the new name happened?  I saw an email chain about 
*whether* to do the rename (which was inconclusive), and then when I attended 
MesosCon 2015 in Seattle, it was announced "from on high" that the new name was 
"agent".  Was this discussed in some ad hoc informal forum? Decided internally 
at Mesosphere?

> Replace Master/Slave terminology
> 
>
> Key: MESOS-1478
> URL: https://issues.apache.org/jira/browse/MESOS-1478
> Project: Mesos
>  Issue Type: Epic
>Reporter: Clark Breyman
>Assignee: Benjamin Hindman
>Priority: Minor
>  Labels: mesosphere
>
> Inspired by the comments on this PR:
> https://github.com/django/django/pull/2692
> TL;DR - Computers sharing work should be a good thing. Using the language of 
> human bondage and suffering is inappropriate in this context. It also has the 
> potential to alienate users and community members. 





[jira] [Comment Edited] (MESOS-1478) Replace Master/Slave terminology

2015-11-10 Thread Erik Weathers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998219#comment-14998219
 ] 

Erik Weathers edited comment on MESOS-1478 at 11/10/15 8:24 AM:


If I may ask, can someone please explain where the discussion & conclusion 
about the choice of the new name for "slave" happened?  I saw an email chain 
about *whether* to do the rename (which was inconclusive), and then when I 
attended MesosCon 2015 in Seattle, it was announced "from on high" that the new 
name was "agent".  Was this discussed in some ad hoc informal forum? Decided 
internally at Mesosphere?


was (Author: erikdw):
If I may ask, can someone please explain where the discussion & conclusion 
about the choice of the new name happened?  I saw an email chain about 
*whether* to do the rename (which was inconclusive), and then when I attended 
MesosCon 2015 in Seattle, it was announced "from on high" that the new name was 
"agent".  Was this discussed in some ad hoc informal forum? Decided internally 
to Mesosphere?

> Replace Master/Slave terminology
> 
>
> Key: MESOS-1478
> URL: https://issues.apache.org/jira/browse/MESOS-1478
> Project: Mesos
>  Issue Type: Epic
>Reporter: Clark Breyman
>Assignee: Benjamin Hindman
>Priority: Minor
>  Labels: mesosphere
>
> Inspired by the comments on this PR:
> https://github.com/django/django/pull/2692
> TL;DR - Computers sharing work should be a good thing. Using the language of 
> human bondage and suffering is inappropriate in this context. It also has the 
> potential to alienate users and community members. 





[jira] [Commented] (MESOS-2455) Add operator endpoints to create/destroy persistent volumes.

2015-11-10 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998168#comment-14998168
 ] 

Neil Conway commented on MESOS-2455:


Hi Dan -- I'm working on this at the moment. I should have patches ready for 
review shortly.

> Add operator endpoints to create/destroy persistent volumes.
> 
>
> Key: MESOS-2455
> URL: https://issues.apache.org/jira/browse/MESOS-2455
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Neil Conway
>Priority: Critical
>  Labels: mesosphere, persistent-volumes
>
> Persistent volumes will not be released automatically.
> So we probably need an endpoint for operators to forcefully release 
> persistent volumes. We probably need to add a principal to the Persistence 
> struct and use ACLs to control who can release what.
> Additionally, it would be useful to have an endpoint for operators to create 
> persistent volumes.





[jira] [Created] (MESOS-3871) Document libprocess message delivery semantics

2015-11-10 Thread Neil Conway (JIRA)
Neil Conway created MESOS-3871:
--

 Summary: Document libprocess message delivery semantics
 Key: MESOS-3871
 URL: https://issues.apache.org/jira/browse/MESOS-3871
 Project: Mesos
  Issue Type: Documentation
  Components: documentation, libprocess
Reporter: Neil Conway
Priority: Minor


What are the semantics of {{send()}} in libprocess? Specifically, does 
libprocess guarantee that messages will not be dropped, reordered, or 
duplicated? These are important properties to understand when building software 
on top of libprocess.

Clearly message drops are allowed. Message reordering _appears_ to be allowed, 
although it should only happen in corner cases (see MESOS-3870). Duplicate 
message delivery probably can't happen.





[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-11-10 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998287#comment-14998287
 ] 

Bernd Mathiske commented on MESOS-3851:
---

[~marco-mesos]: Guess what I have been looking at as of yesterday :-)
[~anandmazumdar] has analyzed this well. There is highly likely some race of 
sorts between executor registration and task launching. That would completely 
explain the CHECK that fails. There is another, less likely explanation: faulty 
data, faulty marshaling, faulty transmission, or faulty unmarshaling. My focus 
is on understanding how the code allows for said race, and once I understand 
it, I will try to cause the race by inserting sleep(someSeconds) somewhere 
suitable. Without that, there is no reliable way of reproducing the bug. It 
never happens when I run this, not even on CentOS 7.1.

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900, i.e., updating CommandExecutor to 
> support rootfs, there seem to be some tests showing frequent crashes due to 
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> 

[jira] [Created] (MESOS-3869) Better error reporting for bad user when launching containers

2015-11-10 Thread Isabel Jimenez (JIRA)
Isabel Jimenez created MESOS-3869:
-

 Summary: Better error reporting for bad user when launching 
containers
 Key: MESOS-3869
 URL: https://issues.apache.org/jira/browse/MESOS-3869
 Project: Mesos
  Issue Type: Improvement
  Components: docker, slave
Reporter: Isabel Jimenez
Assignee: Isabel Jimenez


When launching containers with a non-existing user, the scheduler receives the 
following error:
"Abnormal executor termination" 

This error should provide more information. As of right now, to get more 
details you have to check the sandbox log.





[jira] [Updated] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-3870:
---
Labels: mesosphere  (was: )

> Prevent out-of-order libprocess message delivery
> 
>
> Key: MESOS-3870
> URL: https://issues.apache.org/jira/browse/MESOS-3870
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Priority: Minor
>  Labels: mesosphere
>
> I was under the impression that {{send()}} provided in-order, unreliable 
> message delivery. So if P1 sends <m1, m2> to P2, P2 might see <>, <m1>, <m2>, 
> or <m1, m2> — but not <m2, m1>.
> I suspect much of the code makes a similar assumption. However, it appears 
> that this behavior is not guaranteed. slave.cpp:2217 has the following 
> comment:
> {noformat}
>   // TODO(jieyu): Here we assume that CheckpointResourcesMessages are
>   // ordered (i.e., slave receives them in the same order master sends
>   // them). This should be true in most of the cases because TCP
>   // enforces in order delivery per connection. However, the ordering
>   // is technically not guaranteed because master creates multiple
>   // connections to the slave in some cases (e.g., persistent socket
>   // to slave breaks and master uses ephemeral socket). This could
>   // potentially be solved by using a version number and rejecting
>   // stale messages according to the version number.
> {noformat}
> We can improve this situation by _either_: (1) fixing libprocess to guarantee 
> ordered message delivery, e.g., by adding a sequence number, or (2) 
> clarifying that ordered message delivery is not guaranteed, and ideally 
> providing a tool to force messages to be delivered out-of-order.





[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-11-10 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998288#comment-14998288
 ] 

Bernd Mathiske commented on MESOS-3851:
---

That said, we could tag 0.26.0 without the change in CommandExecutor. This 
would leave the Fetcher tests flaky, but at least CommandExecutor could launch 
tasks with some probability even if a race occurred.

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900, i.e., updating CommandExecutor to 
> support rootfs, there seem to be some tests showing frequent crashes due to 
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
> @ 0x7f4f71529cbd  google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager 
> successfully handled status update acknowledgement (UUID: 
> 721c7316-5580-4636-a83a-098e3bd4ed1f) for task 
> ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework 
> ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f-
> @ 0x7f4f715296ce  

[jira] [Updated] (MESOS-3872) Investigate adding color to `support/post-reviews.py` on Windows

2015-11-10 Thread Alex Clemmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Clemmer updated MESOS-3872:

Description: 
From the comments:

# TODO(hausdorff): We have disabled colors for the diffs on Windows, as piping 
them through `subprocess` causes us to emit ANSI escape codes, which the 
command prompt doesn't recognize. Presumably we are being routed through some 
TTY that causes git to not emit the colors using `cmd`'s color codes API (which 
is entirely different from ANSI). See [1] for more information and MESOS-3872.
#
# [1] 
http://stackoverflow.com/questions/5921556/in-git-bash-on-windows-7-colors-display-as-code-when-running-cucumber-or-rspec

> Investigate adding color to `support/post-reviews.py` on Windows
> 
>
> Key: MESOS-3872
> URL: https://issues.apache.org/jira/browse/MESOS-3872
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, windows
>
> From the comments:
> # TODO(hausdorff): We have disabled colors for the diffs on Windows, as 
> piping them through `subprocess` causes us to emit ANSI escape codes, which 
> the command prompt doesn't recognize. Presumably we are being routed through 
> some TTY that causes git to not emit the colors using `cmd`'s color codes API 
> (which is entirely different from ANSI). See [1] for more information and 
> MESOS-3872.
> #
> # [1] 
> http://stackoverflow.com/questions/5921556/in-git-bash-on-windows-7-colors-display-as-code-when-running-cucumber-or-rspec





[jira] [Created] (MESOS-3872) Investigate adding color to `support/post-reviews.py` on Windows

2015-11-10 Thread Alex Clemmer (JIRA)
Alex Clemmer created MESOS-3872:
---

 Summary: Investigate adding color to `support/post-reviews.py` on 
Windows
 Key: MESOS-3872
 URL: https://issues.apache.org/jira/browse/MESOS-3872
 Project: Mesos
  Issue Type: Bug
  Components: general
Reporter: Alex Clemmer
Assignee: Alex Clemmer








[jira] [Commented] (MESOS-3024) HTTP endpoint authN is enabled merely by specifying --credentials

2015-11-10 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999702#comment-14999702
 ] 

Marco Massenzio commented on MESOS-3024:


BTW - shutting down the framework works too:
{noformat}
I 00:48:02.558192  2789 http.cpp:336] HTTP POST for 
/master//api/v1/scheduler from 192.168.33.1:52509 with 
User-Agent='python-requests/2.7.0 CPython/2.7.10 Darwin/15.0.0'
I 00:48:02.558320  2789 master.cpp:5571] Removing framework 
0878d422-0e83-4b15-8a26-f04a6e3d829f- (Example HTTP Framework)
I 00:48:02.558527  2789 hierarchical.hpp:599] Deactivated framework 
0878d422-0e83-4b15-8a26-f04a6e3d829f-
I 00:48:02.558600  2789 hierarchical.hpp:1103] Recovered 
ports(*):[9000-1]; ephemeral_ports(*):[32768-57344]; cpus(*):1; mem(*):496; 
disk(*):35164 (total: ports(*):[9000-1]; ephemeral_ports(*):[32768-57344]; 
cpus(*):1; mem(*):496; disk(*):35164, allocated: ) on slave 
e08833af-00af-44c6-abd1-bc666b1949c0-S0 from framework 
0878d422-0e83-4b15-8a26-f04a6e3d829f-
I 00:48:02.558624  2789 hierarchical.hpp:552] Removed framework 
0878d422-0e83-4b15-8a26-f04a6e3d829f-
{noformat}
No authentication was provided on this call either.

> HTTP endpoint authN is enabled merely by specifying --credentials
> -
>
> Key: MESOS-3024
> URL: https://issues.apache.org/jira/browse/MESOS-3024
> Project: Mesos
>  Issue Type: Bug
>  Components: master, security
>Reporter: Adam B
>Assignee: Marco Massenzio
>  Labels: authentication, http, mesosphere
>
> If I set `--credentials` on the master, framework and slave authentication 
> are allowed, but not required. On the other hand, http authentication is now 
> required for authenticated endpoints (currently only `/shutdown`). That means 
> that I cannot enable framework or slave authentication without also enabling 
> http endpoint authentication. This is undesirable.
> Framework and slave authentication have separate flags (`\--authenticate` and 
> `\--authenticate_slaves`) to require authentication for each. It would be 
> great if there was also such a flag for framework authentication. Or maybe we 
> get rid of these flags altogether and rely on ACLs to determine which 
> unauthenticated principals are even allowed to authenticate for each 
> endpoint/action.





[jira] [Commented] (MESOS-3024) HTTP endpoint authN is enabled merely by specifying --credentials

2015-11-10 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999696#comment-14999696
 ] 

Marco Massenzio commented on MESOS-3024:


I am unclear about this:
{quote}
It would be great if there was also such a flag for framework authentication.
{quote}
Is this a typo? ({{--authenticate}} does exactly that)

Looking at 
[master/http.cpp|https://github.com/apache/mesos/blob/master/src/master/http.cpp#L375]:
{code}
  if (master->flags.authenticate_frameworks) {
return Unauthorized(
"Mesos master",
"HTTP schedulers are not supported when authentication is required");
  }
{code}

It seems to me that the HTTP API requires authentication for *all* request 
types, and only when {{--authenticate}} is set on the master: when the 
[master sets the {{credentials}} 
flag|https://github.com/apache/mesos/blob/master/src/master/master.cpp#L425] 
the former is not touched.

To test this, I launched a Master with {{--credentials}} but no 
{{--authenticate}} and then registered a framework via the HTTP API and also 
received an offer - it all worked just fine.

I am assuming here that I'm missing something fundamental; can folks please 
clarify what the issue is?

Thanks!

> HTTP endpoint authN is enabled merely by specifying --credentials
> -
>
> Key: MESOS-3024
> URL: https://issues.apache.org/jira/browse/MESOS-3024
> Project: Mesos
>  Issue Type: Bug
>  Components: master, security
>Reporter: Adam B
>Assignee: Marco Massenzio
>  Labels: authentication, http, mesosphere
>
> If I set `--credentials` on the master, framework and slave authentication 
> are allowed, but not required. On the other hand, http authentication is now 
> required for authenticated endpoints (currently only `/shutdown`). That means 
> that I cannot enable framework or slave authentication without also enabling 
> http endpoint authentication. This is undesirable.
> Framework and slave authentication have separate flags (`\--authenticate` and 
> `\--authenticate_slaves`) to require authentication for each. It would be 
> great if there was also such a flag for framework authentication. Or maybe we 
> get rid of these flags altogether and rely on ACLs to determine which 
> unauthenticated principals are even allowed to authenticate for each 
> endpoint/action.





[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998453#comment-14998453
 ] 

haosdent commented on MESOS-3851:
-

In the error log, registered and launchTask have different thread ids.
{noformat}
I1110 00:36:30.616987  5169 exec.cpp:306] Executor::launchTask took 160701ns
I1110 00:36:30.621285  5163 exec.cpp:222] Executor::registered took 399555ns
{noformat}

But in local tests, these always have the same thread id.

{noformat}
I1110 19:34:46.304114  8953 exec.cpp:222] Executor::registered took 182100ns
S1110 19:34:46.304416  8953 exec.cpp:306] Executor::launchTask took 47975ns
{noformat}
{noformat}
R1110 19:34:47.439801  9027 exec.cpp:222] Executor::registered took 257152ns
I1110 19:34:47.440234  9027 exec.cpp:306] Executor::launchTask took 111249ns
{noformat}
{noformat}
I1110 19:34:47.943961  9097 exec.cpp:222] Executor::registered took 271225ns
I1110 19:34:47.944284  9097 exec.cpp:306] Executor::launchTask took 45141ns
{noformat}

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to 
> support rootfs. There seem to be some tests showing frequent crashes due to 
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 

[jira] [Comment Edited] (MESOS-3024) HTTP endpoint authN is enabled merely by specifying --credentials

2015-11-10 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999696#comment-14999696
 ] 

Marco Massenzio edited comment on MESOS-3024 at 11/11/15 12:47 AM:
---

I am unclear about this:
{quote}
It would be great if there was also such a flag for framework authentication.
{quote}
Is this a typo? ({{--authenticate}} does exactly that)

Looking at 
[master/http.cpp|https://github.com/apache/mesos/blob/master/src/master/http.cpp#L375]:
{code}
  if (master->flags.authenticate_frameworks) {
return Unauthorized(
"Mesos master",
"HTTP schedulers are not supported when authentication is required");
  }
{code}

It seems to me that the HTTP API requires authentication for *all* request 
types; and that is required only when {{--authenticate}} is set on the master: 
when [master sets the {{credentials}} 
flag|https://github.com/apache/mesos/blob/master/src/master/master.cpp#L425] 
the former is not touched.

To test this, I launched a Master with {{--credentials}} but no 
{{--authenticate}} and then registered a framework via the HTTP API and also 
received an offer - it all worked just fine.

I am assuming here that I'm missing something fundamental, can folks please 
clarify what the issue is?

Thanks!


was (Author: marco-mesos):
I am unclear about this:
{quote}
It would be great if there was also such a flag for framework authentication.
{quote}
Is this a typo? ({{--authenticate}} does exactly that)

Looking at 
[master/http.cpp|https://github.com/apache/mesos/blob/master/src/master/http.cpp#L375]:
{code}
  if (master->flags.authenticate_frameworks) {
return Unauthorized(
"Mesos master",
"HTTP schedulers are not supported when authentication is required");
  }
{code}

It seems to me that the HTTP API requires authentication for *all* request 
types; and that is required only when {{--authenticate}} is set on the master: 
when [master sets the {{credentials}} 
flag|https://github.com/apache/mesos/blob/master/src/master/master.cpp#L425] 
the former is not touched.

To test this, I launched a Master with {{--credentials}} but no 
{{--authenticate}} and then registered a framework via the HTTP API and also 
received an offer - it all worked just fine.

I am assuming here that I'm missing something fundamental, can folks please 
clarify what the issue is?

Thanks!

> HTTP endpoint authN is enabled merely by specifying --credentials
> -
>
> Key: MESOS-3024
> URL: https://issues.apache.org/jira/browse/MESOS-3024
> Project: Mesos
>  Issue Type: Bug
>  Components: master, security
>Reporter: Adam B
>Assignee: Marco Massenzio
>  Labels: authentication, http, mesosphere
>
> If I set `--credentials` on the master, framework and slave authentication 
> are allowed, but not required. On the other hand, http authentication is now 
> required for authenticated endpoints (currently only `/shutdown`). That means 
> that I cannot enable framework or slave authentication without also enabling 
> http endpoint authentication. This is undesirable.
> Framework and slave authentication have separate flags (`\--authenticate` and 
> `\--authenticate_slaves`) to require authentication for each. It would be 
> great if there was also such a flag for framework authentication. Or maybe we 
> get rid of these flags altogether and rely on ACLs to determine which 
> unauthenticated principals are even allowed to authenticate for each 
> endpoint/action.





[jira] [Commented] (MESOS-3834) slave upgrade framework checkpoint incompatibility

2015-11-10 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1477#comment-1477
 ] 

James Peach commented on MESOS-3834:


https://reviews.apache.org/r/40177/

[~vi...@twitter.com] or [~karya], could you shepherd this bug?

> slave upgrade framework checkpoint incompatibility 
> ---
>
> Key: MESOS-3834
> URL: https://issues.apache.org/jira/browse/MESOS-3834
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1
>Reporter: James Peach
>Assignee: James Peach
>
> We are upgrading from 0.22 to 0.25 and experienced the following crash in the 
> 0.24 slave:
> {code}
> F1104 05:20:49.162701  1153 slave.cpp:4175] Check failed: 
> frameworkInfo.has_id()
> *** Check failure stack trace: ***
> @ 0x7fef9c294650  google::LogMessage::Fail()
> @ 0x7fef9c29459f  google::LogMessage::SendToLog()
> @ 0x7fef9c293fb0  google::LogMessage::Flush()
> @ 0x7fef9c296ce4  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fef9b9a5492  mesos::internal::slave::Slave::recoverFramework()
> @ 0x7fef9b9a3314  mesos::internal::slave::Slave::recover()
> @ 0x7fef9b9d069c  
> _ZZN7process8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS4_5state5StateEES9_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSG_FSE_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESP_
> @ 0x7fef9ba039f4  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS8_5state5StateEESD_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSK_FSI_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> {code}
> As near as I can tell, what happened was this:
> - 0.22 wrote {{framework.info}} without the FrameworkID
> - 0.23 had a compatibility check so it was ok with it
> - 0.24 removed the compatibility check in MESOS-2259
> - the framework checkpoint doesn't get rewritten during recovery so when the 
> 0.24 slave starts it reads the 0.22 version
> - 0.24 asserts
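A minimal standalone sketch (not actual Mesos code; the types and the fallback are illustrative assumptions) of the kind of compatibility guard that 0.23 had and 0.24 dropped — tolerating a pre-0.23 checkpoint that lacks the FrameworkID instead of CHECK-failing:

```cpp
#include <cassert>
#include <string>

// Stand-in for the protobuf message; only the bits needed here.
struct FrameworkInfo {
  std::string id;
  bool has_id() const { return !id.empty(); }
};

// Hypothetical recovery helper: if the checkpointed info lacks an ID
// (a 0.22-era checkpoint), fall back to the ID recovered from the
// checkpoint directory layout instead of asserting.
std::string recoverFrameworkId(const FrameworkInfo& info,
                               const std::string& directoryId)
{
  if (info.has_id()) {
    return info.id;
  }
  return directoryId;  // Pre-0.23 checkpoint: tolerate the missing field.
}
```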





[jira] [Updated] (MESOS-3581) License headers show up all over doxygen documentation.

2015-11-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3581:

Sprint: Mesosphere Sprint 22

> License headers show up all over doxygen documentation.
> ---
>
> Key: MESOS-3581
> URL: https://issues.apache.org/jira/browse/MESOS-3581
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 0.24.1
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Minor
>  Labels: mesosphere
>
> Currently license headers are commented in something resembling Javadoc style,
> {code}
> /**
> * Licensed ...
> {code}
> Since we use Javadoc-style comment blocks for doxygen documentation all 
> license headers appear in the generated documentation, potentially and likely 
> hiding the actual documentation.
> Using {{/*}} to start the comment blocks would be enough to hide them from 
> doxygen, but would likely also result in a largish (though mostly 
> uninteresting) patch.
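For illustration, the difference is only the opening token: doxygen treats {{/**}} (and {{/*!}}) blocks as documentation comments but ignores plain {{/*}} blocks:

```cpp
/**
 * Licensed to the Apache Software Foundation (ASF) ...
 * (Javadoc-style block: doxygen extracts this into the generated docs)
 */

/*
 * Licensed to the Apache Software Foundation (ASF) ...
 * (plain C-style block: doxygen ignores it)
 */
```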





[jira] [Updated] (MESOS-3551) Replace use of strerror with thread-safe alternatives strerror_r / strerror_l.

2015-11-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3551:

Sprint: Mesosphere Sprint 22

> Replace use of strerror with thread-safe alternatives strerror_r / strerror_l.
> --
>
> Key: MESOS-3551
> URL: https://issues.apache.org/jira/browse/MESOS-3551
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, stout
>Reporter: Benjamin Mahler
>Assignee: Benjamin Bannier
>  Labels: mesosphere, newbie, tech-debt
>
> {{strerror()}} is not required to be thread safe by POSIX and is listed as 
> unsafe on Linux:
> http://pubs.opengroup.org/onlinepubs/9699919799/
> http://man7.org/linux/man-pages/man3/strerror.3.html
> I don't believe we've seen any issues reported due to this. We should replace 
> occurrences of strerror accordingly, possibly offering a wrapper in stout to 
> simplify callsites.
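A hedged sketch of what such a stout wrapper might look like (the function name is an assumption; the eventual helper may differ). It papers over the split between the GNU and XSI variants of {{strerror_r}}:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical wrapper; thread-safe, unlike plain strerror().
inline std::string safe_strerror(int errnum)
{
  char buffer[256];
#if defined(__GLIBC__) && defined(_GNU_SOURCE)
  // GNU variant: returns a char* that may point into a static table
  // rather than into `buffer`.
  return strerror_r(errnum, buffer, sizeof(buffer));
#else
  // XSI variant: returns 0 on success and writes into `buffer`.
  if (strerror_r(errnum, buffer, sizeof(buffer)) != 0) {
    return "Unknown error " + std::to_string(errnum);
  }
  return buffer;
#endif
}
```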





[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998677#comment-14998677
 ] 

Neil Conway commented on MESOS-3870:


This should be accounted for by the fact that each process has a queue of input 
events that are consumed in-order (see the "events" deque in ProcessBase). 
i.e., although we can have many worker threads, a given process is only running 
in at most one thread at a time and each process' input events are consumed in 
the order in which they were delivered.
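A toy model of that property (not libprocess source; the names are invented): any number of threads may enqueue, but at most one thread at a time drains a given process's deque, so events are handled in FIFO order:

```cpp
#include <deque>
#include <mutex>
#include <vector>

class FakeProcess {
public:
  // Called from any thread. If no thread is currently serving this
  // process, the caller becomes the server and drains the queue;
  // otherwise it just enqueues and returns.
  void enqueue(int event)
  {
    std::unique_lock<std::mutex> lock(mutex);
    events.push_back(event);
    if (serving) {
      return;  // Another thread is already running this process.
    }
    serving = true;
    while (!events.empty()) {
      int next = events.front();
      events.pop_front();
      lock.unlock();
      handled.push_back(next);  // "visit": only one thread is ever here.
      lock.lock();
    }
    serving = false;
  }

  std::vector<int> handled;  // Mutated only by the serving thread.

private:
  std::mutex mutex;
  std::deque<int> events;
  bool serving = false;
};
```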

> Prevent out-of-order libprocess message delivery
> 
>
> Key: MESOS-3870
> URL: https://issues.apache.org/jira/browse/MESOS-3870
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Priority: Minor
>  Labels: mesosphere
>
> I was under the impression that {{send()}} provided in-order, unreliable 
> message delivery. So if P1 sends <M1, M2> to P2, P2 might see <>, <M1>, <M2>, 
> or <M1, M2> — but not <M2, M1>.
> I suspect much of the code makes a similar assumption. However, it appears 
> that this behavior is not guaranteed. slave.cpp:2217 has the following 
> comment:
> {noformat}
>   // TODO(jieyu): Here we assume that CheckpointResourcesMessages are
>   // ordered (i.e., slave receives them in the same order master sends
>   // them). This should be true in most of the cases because TCP
>   // enforces in order delivery per connection. However, the ordering
>   // is technically not guaranteed because master creates multiple
>   // connections to the slave in some cases (e.g., persistent socket
>   // to slave breaks and master uses ephemeral socket). This could
>   // potentially be solved by using a version number and rejecting
>   // stale messages according to the version number.
> {noformat}
> We can improve this situation by _either_: (1) fixing libprocess to guarantee 
> ordered message delivery, e.g., by adding a sequence number, or (2) 
> clarifying that ordered message delivery is not guaranteed, and ideally 
> providing a tool to force messages to be delivered out-of-order.
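A sketch of option (1) under invented names: the sender stamps each message with a monotonically increasing sequence number, and the receiver rejects anything at or below the last number it has seen from that sender (rejection, not reordering, matching the "rejecting stale messages" idea in the TODO):

```cpp
#include <cstdint>
#include <map>
#include <string>

struct Message {
  std::string from;
  uint64_t sequence;  // Assigned monotonically by the sender.
};

class Receiver {
public:
  // Returns true if the message should be delivered, false if stale.
  bool accept(const Message& m)
  {
    uint64_t& last = lastSeen[m.from];  // Zero-initialized on first use.
    if (m.sequence <= last) {
      return false;  // Duplicate or out-of-order: reject.
    }
    last = m.sequence;
    return true;
  }

private:
  std::map<std::string, uint64_t> lastSeen;
};
```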





[jira] [Commented] (MESOS-2353) Improve performance of the master's state.json endpoint for large clusters.

2015-11-10 Thread Felix Bechstein (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998634#comment-14998634
 ] 

Felix Bechstein commented on MESOS-2353:


We patched the master to hold only 100 completed tasks per framework and 10 
completed frameworks. It reduced the state size to ~2MB, but the master is 
still using all of its CPU to generate it.

We then used iptables to stop our developers' browsers from fetching 
{{/master/state}}, and the load was gone.

> Improve performance of the master's state.json endpoint for large clusters.
> ---
>
> Key: MESOS-2353
> URL: https://issues.apache.org/jira/browse/MESOS-2353
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Benjamin Mahler
>  Labels: newbie, scalability, twitter
>
> The master's state.json endpoint consistently takes a long time to compute 
> the JSON result, for large clusters:
> {noformat}
> $ time curl -s -o /dev/null localhost:5050/master/state.json
> Mon Jan 26 22:38:50 UTC 2015
> real  0m13.174s
> user  0m0.003s
> sys   0m0.022s
> {noformat}
> This can cause the master to get backlogged if there are many state.json 
> requests in flight.
> Looking at {{perf}} data, it seems most of the time is spent doing memory 
> allocation / de-allocation. This ticket will try to capture any low hanging 
> fruit to speed this up. Possibly we can leverage moves if they are not 
> already being used by the compiler.
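As an illustration of the "leverage moves" idea (toy types, not the actual JSON model used by Mesos): moving large sub-objects into the result instead of copying them, plus reserving capacity up front, removes most of the allocation churn the profile points at.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct JsonValue {
  std::string payload;  // Stand-in for a large rendered sub-object.
};

std::vector<JsonValue> buildState(std::size_t tasks)
{
  std::vector<JsonValue> result;
  result.reserve(tasks);  // One allocation instead of repeated regrowth.
  for (std::size_t i = 0; i < tasks; ++i) {
    JsonValue task{std::string(1024, 'x')};
    result.push_back(std::move(task));  // Steal the buffer; no 1KB copy.
  }
  return result;  // Moved (or elided), not copied, out of the function.
}
```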





[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes

2015-11-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3839:

Story Points: 2

> Update documentation for FetcherCache mtime-related changes
> ---
>
> Key: MESOS-3839
> URL: https://issues.apache.org/jira/browse/MESOS-3839
> Project: Mesos
>  Issue Type: Documentation
>  Components: fetcher, slave
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>






[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes

2015-11-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3839:

Sprint: Mesosphere Sprint 23

> Update documentation for FetcherCache mtime-related changes
> ---
>
> Key: MESOS-3839
> URL: https://issues.apache.org/jira/browse/MESOS-3839
> Project: Mesos
>  Issue Type: Documentation
>  Components: fetcher, slave
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>






[jira] [Updated] (MESOS-3856) Add mtime-related fetcher tests

2015-11-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3856:

Story Points: 2

> Add mtime-related fetcher tests
> ---
>
> Key: MESOS-3856
> URL: https://issues.apache.org/jira/browse/MESOS-3856
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>






[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes

2015-11-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3839:

Story Points: 1  (was: 2)

> Update documentation for FetcherCache mtime-related changes
> ---
>
> Key: MESOS-3839
> URL: https://issues.apache.org/jira/browse/MESOS-3839
> Project: Mesos
>  Issue Type: Documentation
>  Components: fetcher, slave
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>






[jira] [Updated] (MESOS-3839) Update documentation for FetcherCache mtime-related changes

2015-11-10 Thread Bernd Mathiske (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Mathiske updated MESOS-3839:
--
Sprint: Mesosphere Sprint 22  (was: Mesosphere Sprint 23)

> Update documentation for FetcherCache mtime-related changes
> ---
>
> Key: MESOS-3839
> URL: https://issues.apache.org/jira/browse/MESOS-3839
> Project: Mesos
>  Issue Type: Documentation
>  Components: fetcher, slave
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>






[jira] [Commented] (MESOS-2376) Allow libprocess ip and port to be configured

2015-11-10 Thread Dimitri (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998645#comment-14998645
 ] 

Dimitri commented on MESOS-2376:


[~hfaran] What do you mean? 
I am trying to set the Mesos binding IP to something more secure than 0.0.0.0. 
I am running Mesos inside a Docker container; should I bind inside the Docker 
environment or on the host? I have tried both, and neither worked. I haven't 
tried setting LIBPROCESS_PORT since I am only trying to change the interface.

I am quite surprised by the lack of documentation for this feature; to me, 
this makes Mesos unusable in production.

> Allow libprocess ip and port to be configured
> -
>
> Key: MESOS-2376
> URL: https://issues.apache.org/jira/browse/MESOS-2376
> Project: Mesos
>  Issue Type: Improvement
>  Components: java api
>Reporter: Dario Rexin
>Priority: Minor
>
> Currently if we want to configure the ip libprocess uses for communication, 
> we have to set the env var LIBPROCESS_IP, or LIBPROCESS_PORT for the port. 
> For the Java API this means, that the variable has to be set before the JVM 
> is started, because setting env vars from within JAVA is not possible / 
> non-trivial. Therefore it would be great to be able to pass them in to the 
> constructor.





[jira] [Updated] (MESOS-3856) Add mtime-related fetcher tests

2015-11-10 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-3856:

Sprint: Mesosphere Sprint 22

> Add mtime-related fetcher tests
> ---
>
> Key: MESOS-3856
> URL: https://issues.apache.org/jira/browse/MESOS-3856
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>






[jira] [Commented] (MESOS-3220) Offer ability to kill tasks from the API

2015-11-10 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999552#comment-14999552
 ] 

Marco Massenzio commented on MESOS-3220:


To revive this thread - a couple of clarifying points:

1. Maintenance
This is meant to augment the Maintenance Primitives (MESOS-1474) and certainly 
*not* to replace it.
In particular, this endpoint (which ought to be scriptable, for automated 
maintenance scripts) would enable operators to kill "recalcitrant" 
frameworks/tasks which, for whatever reason, do not follow the Inverse Offer 
mechanism;

2. Repairs
There may be situations in which the task itself gets in a funky state and 
needs to be killed, without Mesos necessarily noticing it (ie, we cannot rely 
on the {{TASK_LOST}}/{{TASK_KILLED}} conditions).
Once that happens, however, the Framework will be notified (via the usual Mesos 
mechanisms) and can thus decide whether to re-schedule the task (possibly, 
somewhere else).

3. Remote termination
Using tools such as the {{DCOS CLI}}, we want to enable users to reach out to the 
Mesos Master directly (possibly bypassing the framework) and terminate a task, 
without requiring every framework developer to re-implement the same API (so, 
this would be a "common service" that Mesos offers to framework developers, 
that they wouldn't have to worry about).

4. Security
There is obviously the expedient (if somewhat draconian) "firewalling" ability, 
to prevent outright access to this endpoint.
At a finer-grained level, we would consider using ACLs (probably in line with 
what is currently being done for the Maintenance Primitives) to authorize 
access to this functionality.

> Offer ability to kill tasks from the API
> 
>
> Key: MESOS-3220
> URL: https://issues.apache.org/jira/browse/MESOS-3220
> Project: Mesos
>  Issue Type: Improvement
>  Components: python api
>Reporter: Sunil Shah
>Assignee: Marco Massenzio
>Priority: Blocker
>  Labels: mesosphere
>
> We are investigating adding a `dcos task kill` command to our DCOS (and 
> Mesos) command line interface. Currently the ability to kill tasks is only 
> offered via the scheduler API so it would be useful to have some ability to 
> kill tasks directly.
> This is a blocker for the DCOS CLI!





[jira] [Commented] (MESOS-3157) only perform batch resource allocations

2015-11-10 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1440#comment-1440
 ] 

James Peach commented on MESOS-3157:


No, I hope to get back to it soon though.

> only perform batch resource allocations
> ---
>
> Key: MESOS-3157
> URL: https://issues.apache.org/jira/browse/MESOS-3157
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James Peach
>Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived 
> frameworks that often revive offers. Running the allocator takes a long time 
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the 
> allocator process to get very long, and the allocator effectively becomes 
> unresponsive (eg. a revive offers message takes too long to come to the head 
> of the queue).
> We have been running a patch to remove all the event-triggered allocations 
> and only allocate from the batch task 
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves 
> responsiveness.





[jira] [Commented] (MESOS-3851) Investigate recent crashes in Command Executor

2015-11-10 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998541#comment-14998541
 ] 

Bernd Mathiske commented on MESOS-3851:
---

Interesting. Thanks for spotting this!

> Investigate recent crashes in Command Executor
> --
>
> Key: MESOS-3851
> URL: https://issues.apache.org/jira/browse/MESOS-3851
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> Post https://reviews.apache.org/r/38900 i.e. updating CommandExecutor to 
> support rootfs. There seem to be some tests showing frequent crashes due to 
> assert violations.
> {{FetcherCacheTest.SimpleEviction}} failed due to the following log:
> {code}
> I1107 19:36:46.360908 30657 slave.cpp:1793] Sending queued task '3' to 
> executor ''3' of framework 7d94c7fb-8950-4bcf-80c1-46112292dcd6- at 
> executor(1)@172.17.5.200:33871'
> I1107 19:36:46.363682  1236 exec.cpp:297] 
> I1107 19:36:46.373569  1245 exec.cpp:210] Executor registered on slave 
> 7d94c7fb-8950-4bcf-80c1-46112292dcd6-S0
> @ 0x7f9f5a7db3fa  google::LogMessage::Fail()
> I1107 19:36:46.394081  1245 exec.cpp:222] Executor::registered took 395411ns
> @ 0x7f9f5a7db359  google::LogMessage::SendToLog()
> @ 0x7f9f5a7dad6a  google::LogMessage::Flush()
> @ 0x7f9f5a7dda9e  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @   0x49c99d  
> mesos::internal::CommandExecutorProcess::launchTask()
> @   0x4b3dd7  
> _ZZN7process8dispatchIN5mesos8internal22CommandExecutorProcessEPNS1_14ExecutorDriverERKNS1_8TaskInfoES5_S6_EEvRKNS_3PIDIT_EEMSA_FvT0_T1_ET2_T3_ENKUlPNS_11ProcessBaseEE_clESL_
> @   0x4c470c  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal22CommandExecutorProcessEPNS5_14ExecutorDriverERKNS5_8TaskInfoES9_SA_EEvRKNS0_3PIDIT_EEMSE_FvT0_T1_ET2_T3_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9f5a761b1b  std::function<>::operator()()
> @ 0x7f9f5a749935  process::ProcessBase::visit()
> @ 0x7f9f5a74d700  process::DispatchEvent::visit()
> @   0x48e004  process::ProcessBase::serve()
> @ 0x7f9f5a745d21  process::ProcessManager::resume()
> @ 0x7f9f5a742f52  
> _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
> @ 0x7f9f5a74cf2c  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7f9f5a74cedc  
> _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
> @ 0x7f9f5a74ce6e  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
> @ 0x7f9f5a74cdc5  
> _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
> @ 0x7f9f5a74cd5e  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
> @ 0x7f9f5624f1e0  (unknown)
> @ 0x7f9f564a8df5  start_thread
> @ 0x7f9f559b71ad  __clone
> I1107 19:36:46.551370 30656 containerizer.cpp:1257] Executor for container 
> '6553a617-6b4a-418d-9759-5681f45ff854' has exited
> I1107 19:36:46.551429 30656 containerizer.cpp:1074] Destroying container 
> '6553a617-6b4a-418d-9759-5681f45ff854'
> I1107 19:36:46.553869 30656 containerizer.cpp:1257] Executor for container 
> 'd2c1f924-c92a-453e-82b1-c294d09c4873' has exited
> {code}
> The reason seems to be a race between the executor receiving a 
> {{RunTaskMessage}} before {{ExecutorRegisteredMessage}} leading to the 
> {{CHECK_SOME(executorInfo)}} failure.
> Link to complete log: 
> https://issues.apache.org/jira/browse/MESOS-2831?focusedCommentId=14995535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14995535
> Another related failure from {{ExamplesTest.PersistentVolumeFramework}}
> {code}
> @ 0x7f4f71529cbd  google::LogMessage::SendToLog()
> I1107 13:15:09.949987 31573 slave.cpp:2337] Status update manager 
> successfully handled status update acknowledgement (UUID: 
> 721c7316-5580-4636-a83a-098e3bd4ed1f) for task 
> ad90531f-d3d8-43f6-96f2-c81c4548a12d of framework 
> ac4ea54a-7d19-4e41-9ee3-1a761f8e5b0f-
> @ 0x7f4f715296ce  google::LogMessage::Flush()
> @ 0x7f4f7152c402  google::LogMessageFatal::~LogMessageFatal()
> @   0x48d00a  _CheckFatal::~_CheckFatal()
> @  

[jira] [Commented] (MESOS-3870) Prevent out-of-order libprocess message delivery

2015-11-10 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998605#comment-14998605
 ] 

haosdent commented on MESOS-3870:
-

I think that although sends can be ordered when there is only one connection, 
execution can still be out-of-order on the receiver side. ProcessManager 
creates a pool of worker threads (8 up to the number of CPUs) to handle input 
messages. Because ProcessManager dispatches work per event, the same Process 
may be called from different threads for different events.

> Prevent out-of-order libprocess message delivery
> 
>
> Key: MESOS-3870
> URL: https://issues.apache.org/jira/browse/MESOS-3870
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Neil Conway
>Priority: Minor
>  Labels: mesosphere
>
> I was under the impression that {{send()}} provided in-order, unreliable 
> message delivery. So if P1 sends  to P2, P2 might see <>, , , 
> or  — but not .
> I suspect much of the code makes a similar assumption. However, it appears 
> that this behavior is not guaranteed. slave.cpp:2217 has the following 
> comment:
> {noformat}
>   // TODO(jieyu): Here we assume that CheckpointResourcesMessages are
>   // ordered (i.e., slave receives them in the same order master sends
>   // them). This should be true in most of the cases because TCP
>   // enforces in order delivery per connection. However, the ordering
>   // is technically not guaranteed because master creates multiple
>   // connections to the slave in some cases (e.g., persistent socket
>   // to slave breaks and master uses ephemeral socket). This could
>   // potentially be solved by using a version number and rejecting
>   // stale messages according to the version number.
> {noformat}
> We can improve this situation by _either_: (1) fixing libprocess to guarantee 
> ordered message delivery, e.g., by adding a sequence number, or (2) 
> clarifying that ordered message delivery is not guaranteed, and ideally 
> providing a tool to force messages to be delivered out-of-order.





[jira] [Commented] (MESOS-3157) only perform batch resource allocations

2015-11-10 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999879#comment-14999879
 ] 

Klaus Ma commented on MESOS-3157:
-

[~jpe...@apache.org], any update on this?

> only perform batch resource allocations
> ---
>
> Key: MESOS-3157
> URL: https://issues.apache.org/jira/browse/MESOS-3157
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James Peach
>Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived 
> frameworks that often revive offers. Running the allocator takes a long time 
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the 
> allocator process to get very long, and the allocator effectively becomes 
> unresponsive (e.g., a revive-offers message takes too long to reach the head 
> of the queue).
> We have been running a patch to remove all the event-triggered allocations 
> and only allocate from the batch task 
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves 
> responsiveness.





[jira] [Commented] (MESOS-3826) Add an optional unique identifier for resource reservations

2015-11-10 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14999881#comment-14999881
 ] 

Klaus Ma commented on MESOS-3826:
-

[~sargun], does that address your concern?

> Add an optional unique identifier for resource reservations
> ---
>
> Key: MESOS-3826
> URL: https://issues.apache.org/jira/browse/MESOS-3826
> Project: Mesos
>  Issue Type: Improvement
>  Components: general
>Reporter: Sargun Dhillon
>Assignee: Guangya Liu
>Priority: Minor
>  Labels: mesosphere
>
> Thanks to the resource reservation primitives, frameworks can reserve 
> resources. These reservations are per role, which means multiple frameworks 
> can share reservations. This can get very hairy, as multiple reservations can 
> occur on each agent. 
> It would be nice to be able to optionally, uniquely identify reservations by 
> ID, much like persistent volumes are today. This could be done by adding a 
> new protobuf field, such as Resource.ReservationInfo.id, that if set upon 
> reservation time, would come back when the reservation is advertised.
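> A minimal sketch of the suggested schema change, assuming the existing
> nested {{Resource.ReservationInfo}} message in mesos.proto; the field name
> and field number here are illustrative only:

```proto
message Resource {
  message ReservationInfo {
    // Existing fields (e.g., the reserving principal) elided.

    // Hypothetical addition from the description above: an optional,
    // caller-supplied identifier that, if set at reservation time, is
    // echoed back when the reservation is advertised in offers.
    optional string id = 2;
  }
}
```

> Frameworks sharing a role could then distinguish their own reservations by
> matching on this ID instead of comparing full resource objects.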


