[jira] [Commented] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.

2019-01-23 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750531#comment-16750531
 ] 

Joseph Wu commented on MESOS-9507:
--

One possible fix is to add a conditional between these two blocks:
https://github.com/apache/mesos/blob/0f8ee9555f89f0a5f139bc12c666a60164c7b09b/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L277-L287

{code}
  if (read.isNone()) {
    // This could happen if the agent died after opening the file for writing
    // but before it checkpointed anything. The volume was never mounted in
    // that case, so skip recovering it instead of failing the recovery.
    LOG(WARNING) << "Skipping docker volume recovery for this container: "
                 << "the checkpoint file exists but is empty";

    // Skip this volume. (Assuming this code sits in the per-container
    // recovery path returning Try/Future<Nothing>; inside a loop over
    // containers this would be a `continue` instead.)
    return Nothing();
  }
{code}

> Agent could not recover due to empty docker volume checkpointed files.
> --
>
> Key: MESOS-9507
> URL: https://issues.apache.org/jira/browse/MESOS-9507
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Priority: Critical
>  Labels: containerizer
>
> Agent could not recover due to empty docker volume checkpointed files. Please 
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect 
> failed: Collect failed: Failed to recover docker volumes for orphan container 
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered 
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This is caused by the agent recovering after the volume state file has been 
> created but before checkpointing finishes. At that point the docker volume 
> has not been mounted yet, so the docker volume isolator should skip 
> recovering this volume.
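
For context, parsing the contents of a checkpoint file that was created but
never written to fails with exactly this kind of "syntax error at line 1
near:" message. A minimal sketch (not the agent code), assuming stout's
{{JSON::parse}}:

{code}
// Minimal sketch: feeding an empty checkpoint file's contents to stout's
// JSON parser reproduces the class of error seen in the agent log above.
#include <iostream>
#include <string>

#include <stout/json.hpp>
#include <stout/try.hpp>

int main()
{
  // Simulates a checkpoint file that was opened for writing but never
  // actually written before the agent died.
  const std::string contents = "";

  Try<JSON::Value> json = JSON::parse(contents);
  if (json.isError()) {
    std::cerr << "JSON parse failed: " << json.error() << std::endl;
  }

  return 0;
}
{code}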



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9533) CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.

2019-01-23 Thread Gilbert Song (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-9533:
---

Assignee: Gilbert Song

> CniIsolatorTest.ROOT_CleanupAfterReboot is flaky.
> -
>
> Key: MESOS-9533
> URL: https://issues.apache.org/jira/browse/MESOS-9533
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.8.0
> Environment: centos-6 with SSL enabled
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>  Labels: flaky-test
>
> {noformat}
> Error Message
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> Stacktrace
> ../../src/tests/containerizer/cni_isolator_tests.cpp:2685
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7fffc7c05aa0, @0x7fe637918430 136-byte 
> object <80-24 29-45 E6-7F 00-00 00-00 00-00 00-00 00-00 3E-E8 00-00 00-00 
> 00-00 00-B8 0E-20 F0-55 00-00 C0-03 07-18 E6-7F 00-00 20-17 05-18 E6-7F 00-00 
> 10-50 05-18 E6-7F 00-00 50-D1 04-18 E6-7F 00-00 ... 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 F0-89 16-E9 58-2B D7-41 00-00 00-00 01-00 00-00 18-00 00-00 
> 0B-00 00-00>)
>  Expected: to be called 3 times
>Actual: called 4 times - over-saturated and active
> {noformat}
> It was from this commit 
> https://github.com/apache/mesos/commit/c338f5ada0123c0558658c6452ac3402d9fbec29
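
For reference, the report above is gmock's standard output when a saturated
{{EXPECT_CALL}} receives one call too many. A minimal, self-contained
illustration (not the actual Mesos test; the mock class and its argument type
are made up):

{code}
// A mock that is expected to be called exactly 3 times but receives a 4th
// call; running this test fails with "Mock function called more times than
// expected ... over-saturated and active".
#include <gmock/gmock.h>
#include <gtest/gtest.h>

class MockScheduler
{
public:
  MOCK_METHOD1(statusUpdate, void(int update));
};

TEST(OverSaturationExample, ExpectThreeGetFour)
{
  MockScheduler sched;

  // Allow exactly three status updates.
  EXPECT_CALL(sched, statusUpdate(testing::_))
    .Times(3);

  // The fourth call over-saturates the expectation.
  for (int i = 0; i < 4; ++i) {
    sched.statusUpdate(i);
  }
}
{code}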



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9460) Speculative operations may make master and agent resource views out of sync.

2019-01-23 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750318#comment-16750318
 ] 

Greg Mann commented on MESOS-9460:
--

Reprioritizing this as Major since, upon further investigation, I realized that 
the scope of the issue is much smaller than previously thought.

> Speculative operations may make master and agent resource views out of sync.
> 
>
> Key: MESOS-9460
> URL: https://issues.apache.org/jira/browse/MESOS-9460
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Meng Zhu
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
>
> When speculative operations (RESERVE, UNRESERVE, CREATE, DESTROY) are issued 
> via the master operator API, the master updates the allocator state in 
> {{Master::apply()}}, and then later updates its internal state in 
> {{Master::_apply}}. This means that other updates to the allocator may be 
> interleaved between these two continuations, causing the master state to be 
> out of sync with the allocator state.
> This bug could happen with the following sequence of events:
> - agent (re)registers with the master
> - multiple speculative operation calls are made to the master via the 
> operator API
> - the allocator is speculatively updated in 
> https://github.com/apache/mesos/blob/1d1af190b0eb674beecf20646d0b6ce082db4ed0/src/master/master.cpp#L11326
> - before the agent's resources get updated, the agent sends an 
> `UpdateSlaveMessage` upon receiving the (re)registered message if it has the 
> `RESOURCE_PROVIDER` capability or oversubscription is used 
> (https://github.com/apache/mesos/blob/3badf7179992e61f30f5a79da9d481dd451c7c2f/src/slave/slave.cpp#L1560-L1566
>  and 
> https://github.com/apache/mesos/blob/3badf7179992e61f30f5a79da9d481dd451c7c2f/src/slave/slave.cpp#L1643-L1648)
> - as long as the first operation issued via the operator API has already been 
> added to the {{Slave}} struct at this point, the master won't hit [this block 
> here|https://github.com/apache/mesos/blob/1d1af190b0eb674beecf20646d0b6ce082db4ed0/src/master/master.cpp#L7940-L7945]
>  and the `UpdateSlaveMessage` triggers the allocator to update the total 
> resources with STALE info from the {{Slave}} struct 
> [here|https://github.com/apache/mesos/blob/1d1af190b0eb674beecf20646d0b6ce082db4ed0/src/master/master.cpp#L8207].
>  Since the {{Slave}} struct has not yet been updated, the allocator update at 
> that point uses stale resources from {{slave->totalResources}}, so the update 
> from the previous operation is overwritten and LOST.
> - the agent finishes the operation and informs the master through 
> `UpdateOperationStatusMessage`, but for speculative operations we do not 
> update the allocator 
> https://github.com/apache/mesos/blob/3badf7179992e61f30f5a79da9d481dd451c7c2f/src/master/master.cpp#L11187-L11189
> - the resource views of the master/agent state and the allocator state are 
> now inconsistent
> This caused MESOS-7971 and likely MESOS-9458 as well. 
> To fix this issue, we should make sure that updates to the allocator state 
> and the master state are performed in a single synchronous block of code.
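
To make the interleaving concrete, here is a self-contained toy sketch (not
Mesos code; all names and numbers are illustrative) of how applying an
operation to the allocator and to the {{Slave}} struct in two separate
continuations lets a stale total overwrite the allocator's view, and how a
single synchronous update avoids it:

{code}
#include <cassert>

int main()
{
  int allocatorTotal = 100;  // allocator's view of the agent's resources
  int slaveTotal = 100;      // master's Slave struct view

  // Master::apply(): the speculative operation is applied to the allocator
  // first...
  allocatorTotal -= 10;         // allocator: 90

  // ...but before Master::_apply() runs, an UpdateSlaveMessage arrives and
  // the allocator's total is reset from the *stale* Slave struct.
  allocatorTotal = slaveTotal;  // allocator: 100 -- the update is LOST

  // Master::_apply(): the Slave struct finally catches up.
  slaveTotal -= 10;             // slave: 90, allocator: 100

  assert(allocatorTotal != slaveTotal);  // the two views now disagree

  // Proposed fix: apply the operation to both views in one synchronous
  // block, so nothing can observe (or overwrite) an intermediate state.
  allocatorTotal = slaveTotal = 100;     // reset for the second run
  allocatorTotal -= 10;
  slaveTotal -= 10;
  assert(allocatorTotal == slaveTotal);

  return 0;
}
{code}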



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9356) Make agent atomically checkpoint operations and resources

2019-01-23 Thread JIRA


[ 
https://issues.apache.org/jira/browse/MESOS-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746761#comment-16746761
 ] 

Gastón Kleiman edited comment on MESOS-9356 at 1/23/19 6:27 PM:


https://reviews.apache.org/r/69790/
https://reviews.apache.org/r/69792/
https://reviews.apache.org/r/69793/
https://reviews.apache.org/r/69794/
https://reviews.apache.org/r/69795/
https://reviews.apache.org/r/69825/


was (Author: gkleiman):
https://reviews.apache.org/r/69790/
https://reviews.apache.org/r/69792/
https://reviews.apache.org/r/69793/
https://reviews.apache.org/r/69794/
https://reviews.apache.org/r/69795/

> Make agent atomically checkpoint operations and resources
> -
>
> Key: MESOS-9356
> URL: https://issues.apache.org/jira/browse/MESOS-9356
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: agent, foundations, mesosphere, operation-feedback
>
> See 
> https://docs.google.com/document/d/1HxMBCfzU9OZ-5CxmPG3TG9FJjZ_-xDUteLz64GhnBl0/edit
>  for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9535) Master should clean up operations from downgraded agents

2019-01-23 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9535:


 Summary: Master should clean up operations from downgraded agents
 Key: MESOS-9535
 URL: https://issues.apache.org/jira/browse/MESOS-9535
 Project: Mesos
  Issue Type: Task
Reporter: Greg Mann


If a Mesos agent is upgraded to provide reliable feedback for operations on 
agent default resources and then later downgraded, the master may possess 
in-memory state related to operations requesting feedback which should be 
cleaned up. We should update the master to detect downgraded agents and clean 
up appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9534) CSI Spec v1.0 Support

2019-01-23 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9534:
--

 Summary: CSI Spec v1.0 Support
 Key: MESOS-9534
 URL: https://issues.apache.org/jira/browse/MESOS-9534
 Project: Mesos
  Issue Type: Epic
  Components: storage
Reporter: Chun-Hung Hsiao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8507) SLRP discards reservations when the agent is discarded, which could lead to leaked volumes.

2019-01-23 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749679#comment-16749679
 ] 

Chun-Hung Hsiao commented on MESOS-8507:


The current proposal I have in mind is:
First, the recovered CSI volumes will be default-reserved through the 
{{default_reservations}} field in the SLRP config (sketched after the list 
below), so only a special framework will receive these resources. Then we can 
do one of the following:
 # Introduce a special "reservation transfer" offer operation so that this 
special framework can atomically re-reserve the CSI volume to the consumer 
framework; the consumer framework can then re-reserve the volume again 
atomically if it wants to add reservation labels, and "re-create" the 
persistent volume on the CSI volume without wiping out the data. As an initial 
attempt, the required information about the reservation transfer (e.g., the 
target role) can be provided manually by the operator, but ultimately we 
should automate this.
 # Use a special authorization module to learn the reservations when the 
persistent volume is created, and then only allow the same reservations on 
recovered CSI volumes.
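
For illustration, an abbreviated sketch of where {{default_reservations}}
would go in an SLRP config; the resource provider name and role are
placeholders, and the plugin-specific {{storage}} section is omitted:

{code}
{
  "type": "org.apache.mesos.rp.local.storage",
  "name": "example_slrp",
  "default_reservations": [
    {
      "type": "DYNAMIC",
      "role": "csi-recovery-framework-role"
    }
  ]
}
{code}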

> SLRP discards reservations when the agent is discarded, which could lead to 
> leaked volumes.
> ---
>
> Key: MESOS-8507
> URL: https://issues.apache.org/jira/browse/MESOS-8507
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Major
>  Labels: storage
>
> In the current SLRP implementation the reservations for new SLRP/CSI backed 
> volumes are checkpointed under {{/slaves/latest/resource_providers}} so 
> when the agent runs into incompatible configuration changes (the kinds that 
> cannot be addressed by MESOS-1739), the operator has to remove the symlink 
> and then the reservations are gone. 
> Then the agent recovers with a new {{SlaveInfo}} and new SLRPs are created to 
> recover the CSI volumes. These CSI volumes will not have reservations and 
> thus will be offered to frameworks of any role, potentially with the data 
> already written by the previous owner. 
>  
> The framework doesn't have any control over this, nor any chance to clean up 
> before the volumes are re-offered, which is undesirable for security reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)