[jira] [Comment Edited] (MESOS-8241) Add metrics for offer operation feedback

2019-02-11 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765518#comment-16765518
 ] 

Greg Mann edited comment on MESOS-8241 at 2/12/19 6:50 AM:
---

This should include:
 * operation in each operation state
 ** counters for terminal states, showing total number of operations which the 
master has observed transition to that state
 *** note that for the OPERATION_ERROR case, there will be operations dropped 
by the master, e.g. because they are invalidated, which do not request feedback 
and thus do not have any OPERATION_ERROR updates associated with them. we 
should probably still increment the corresponding {{operation_error}} metric in 
these cases
 * gauges for non-terminal states, representing the number of operations 
currently in that state in the system. note that we should use the newer 
{{PushGauge}} for all of these.
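
The split between counters and gauges described above can be sketched roughly as follows. This is a Python illustration, not Mesos code: the class `OperationMetrics`, its method names, and the partition of states into terminal and non-terminal sets are assumptions made for the example.

```python
from collections import Counter

# Assumed partition of operation states, for illustration only.
TERMINAL = {
    "OPERATION_FINISHED",
    "OPERATION_FAILED",
    "OPERATION_ERROR",
    "OPERATION_DROPPED",
}
NON_TERMINAL = {"OPERATION_PENDING", "OPERATION_UNREACHABLE"}


class OperationMetrics:
    """Hypothetical bookkeeping: counters for terminal states, gauges otherwise."""

    def __init__(self):
        self.counters = Counter()  # totals that only ever increase
        self.gauges = Counter()    # operations currently in each state

    def transition(self, old_state, new_state):
        # Leaving a non-terminal state lowers its gauge.
        if old_state in NON_TERMINAL:
            self.gauges[old_state] -= 1
        if new_state in TERMINAL:
            self.counters[new_state] += 1
        elif new_state in NON_TERMINAL:
            self.gauges[new_state] += 1

    def dropped_without_feedback(self):
        # An operation dropped by the master with no status update is
        # still counted under the error counter, as suggested above.
        self.counters["OPERATION_ERROR"] += 1
```

Pushing each change into the gauge at transition time, rather than recomputing the value when metrics are read, is the behaviour the sketch attributes to a push-style gauge.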


was (Author: greggomann):
This should include:
 * operation in each operation state
 ** counters for terminal states, showing total number of operations which the 
master has observed transition to that state
 *** note that for the OPERATION_ERROR case, there will be operations dropped 
by the master, e.g. because they are invalidated, which do not request feedback 
and thus do not have any OPERATION_ERROR updates associated with them. we 
should probably still increment the corresponding {{operation_error}} metric in 
these cases
 ** gauges for non-terminal states, representing the number of operations 
currently in that state in the system. note that we should use the newer 
{{PushGauge}} for all of these.

> Add metrics for offer operation feedback
> 
>
> Key: MESOS-8241
> URL: https://issues.apache.org/jira/browse/MESOS-8241
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8241) Add metrics for offer operation feedback

2019-02-11 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765518#comment-16765518
 ] 

Greg Mann edited comment on MESOS-8241 at 2/12/19 6:50 AM:
---

This should include:
 * operation in each operation state
 ** counters for terminal states, showing total number of operations which the 
master has observed transition to that state
 *** note that for the OPERATION_ERROR case, there will be operations dropped 
by the master, e.g. because they are invalidated, which do not request feedback 
and thus do not have any OPERATION_ERROR updates associated with them. we 
should probably still increment the corresponding {{operation_error}} metric in 
these cases
 ** gauges for non-terminal states, representing the number of operations 
currently in that state in the system. note that we should use the newer 
{{PushGauge}} for all of these.


was (Author: greggomann):
This should include metrics for dropped operations and operation status updates.

> Add metrics for offer operation feedback
> 
>
> Key: MESOS-8241
> URL: https://issues.apache.org/jira/browse/MESOS-8241
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>






[jira] [Comment Edited] (MESOS-8241) Add metrics for offer operation feedback

2019-02-11 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765518#comment-16765518
 ] 

Greg Mann edited comment on MESOS-8241 at 2/12/19 6:38 AM:
---

This should include metrics for dropped operations and operation status updates.


was (Author: greggomann):
This should include metrics for dropped operations.

> Add metrics for offer operation feedback
> 
>
> Key: MESOS-8241
> URL: https://issues.apache.org/jira/browse/MESOS-8241
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>






[jira] [Commented] (MESOS-9473) Add end to end tests for operations on agent default resources.

2019-02-11 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765703#comment-16765703
 ] 

Greg Mann commented on MESOS-9473:
--

{code}
commit 02e4477187173fcae86ca6cb8db002f0f90fcf5f
Author: Gastón Kleiman 
Date:   Fri Feb 8 18:29:44 2019 -0800

Added tests for feedback for operations on agent default resources.

Review: https://reviews.apache.org/r/69910/
{code}
{code}
commit fa6ea019109e74a23576c3f736fda7d4faa16bc2
Author: Gastón Kleiman 
Date:   Fri Feb 8 18:29:46 2019 -0800

Removed `MasterAPITest.OperationFeedbackOnAgentDefaultResources`.

This patch removes a `MasterAPITest`, because the new test suite
`AgentOperationFeedbackTest` already covers the scenario from the
original test.

Review: https://reviews.apache.org/r/69920/
{code}
{code}
commit 156da38ef63bf1815b59abd20a719156f4b1fc6d
Author: Gastón Kleiman 
Date:   Fri Feb 8 18:29:55 2019 -0800

Added missing periods at the end of comments.

Review: https://reviews.apache.org/r/69919/
{code}
{code}
commit 34b0adc83b5eef4a7b2bb125203e5efa56497121
Author: Gastón Kleiman 
Date:   Fri Feb 8 18:29:56 2019 -0800

Added tests for reconciliation of operations on agent default resources.

Review: https://reviews.apache.org/r/69911/
{code}

> Add end to end tests for operations on agent default resources.
> ---
>
> Key: MESOS-9473
> URL: https://issues.apache.org/jira/browse/MESOS-9473
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>
> Making note of particular cases we need to test:
> * Verify that frameworks will receive OPERATION_GONE_BY_OPERATOR for 
> operations on agent default resources when an agent is marked gone
> * Verify that frameworks will receive OPERATION_GONE_BY_OPERATOR when they 
> reconcile operations on agents which have been marked gone





[jira] [Created] (MESOS-9565) Unit tests for destroying persistent volumes in SLRP.

2019-02-11 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9565:
--

 Summary: Unit tests for destroying persistent volumes in SLRP.
 Key: MESOS-9565
 URL: https://issues.apache.org/jira/browse/MESOS-9565
 Project: Mesos
  Issue Type: Task
  Components: test
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


The plan is to add/update the following unit tests to test persistent volume 
destroy:
* CreateDestroyDisk
* CreateDestroyDiskWithRecovery
* CreateDestroyPersistentMountVolume
* CreateDestroyPersistentMountVolumeWithRecovery
* CreateDestroyPersistentMountVolumeWithReboot
* CreateDestroyPersistentBlockVolume
* DestroyPersistentMountVolumeFailed
* DestroyUnpublishedPersistentVolume
* DestroyUnpublishedPersistentVolumeWithRecovery
* DestroyUnpublishedPersistentVolumeWithReboot
* RecoverPublishedPersistentVolumeFailed





[jira] [Commented] (MESOS-9562) Authorization for DESTROY and UNRESERVE is not symmetrical.

2019-02-11 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765693#comment-16765693
 ] 

Chun-Hung Hsiao commented on MESOS-9562:


For {{UNRESERVE}}, we currently support the following two use cases:
1. If all resources the {{UNRESERVE}} operation applies to have reservation 
principals, there will be one authorization request for each resource.
2. If none of the resources has any principal, there will be one single 
authorization request to verify if the subject is authorized to perform an 
{{UNRESERVE}} operation.
Equivalently, if a subject is authorized to do {{UNRESERVE}} on any reservation 
with a principal, Mesos would implicitly assume that the subject has the right 
to do {{UNRESERVE}} on a reservation without a principal as well.
We should either document this, or issue a request per resource, with or 
without a principal.
Since we're deprecating the {{value}} field in favor of the {{resource}} field, 
it seems to me that we should issue a request for each resource, regardless of 
whether it is reserved by a principal.

For {{DESTROY}}, it seems to me that defaulting to an empty string is 
undocumented behavior, and using a magic string (here the empty string) 
doesn't sound like a good idea in an API.
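
As a rough sketch of the "one request per resource, with or without a principal" option, the following Python illustration builds the authorization requests. It is not the master's code: `Resource`, `authorization_requests`, and the request dictionary layout are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    name: str
    reservation_principal: Optional[str] = None  # None: no principal was set


def authorization_requests(resources: List[Resource]) -> List[dict]:
    """Build one UNRESERVE authorization request per resource.

    A missing principal is carried as None instead of being skipped or
    replaced by a magic empty string.
    """
    return [
        {
            "action": "UNRESERVE",
            "resource": r.name,
            "principal": r.reservation_principal,
        }
        for r in resources
    ]
```

With this shape, the authorizer sees every resource and can decide explicitly how to treat reservations that have no principal.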

> Authorization for DESTROY and UNRESERVE is not symmetrical.
> ---
>
> Key: MESOS-9562
> URL: https://issues.apache.org/jira/browse/MESOS-9562
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, scheduler api
>Affects Versions: 1.7.1
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: integration, mesosphere, tech-debt
>
> For [the {{UNRESERVE}} 
> case|https://github.com/apache/mesos/blob/5d3ed364c6d1307d88e6b950ae0eef423c426673/src/master/master.cpp#L3661-L3677],
>  if the principal was not set, {{.has_principal()}} will be {{false}}, hence 
> we will not call {{authorizations.push_back()}}, and hence we will not create 
> an authz request with this resource as an object. For [the {{DESTROY}} 
> case|https://github.com/apache/mesos/blob/5d3ed364c6d1307d88e6b950ae0eef423c426673/src/master/master.cpp#L3772-L3773],
>  if the principal was not set, a default value {{""}} for string will be used 
> and hence we will create an authz request with this resource as an object. 
> We definitely need to make the behaviour consistent. I'm not sure which 
> approach is correct.





[jira] [Commented] (MESOS-9544) SLRP does not clean up destroyed persistent volumes.

2019-02-11 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765696#comment-16765696
 ] 

Chun-Hung Hsiao commented on MESOS-9544:


I've paused the development of the {{RecoverPublishedVolumeFailed}} test. I 
have some WIP patches and plan to integrate them into MESOS-8745; then we can 
use the {{GET_RESOURCE_PROVIDERS}} call to verify the resource provider 
recovery failure in the test.

> SLRP does not clean up destroyed persistent volumes.
> 
>
> Key: MESOS-9544
> URL: https://issues.apache.org/jira/browse/MESOS-9544
> Project: Mesos
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.7.0, 1.7.1
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: mesosphere, mesosphere-dss-beta, storage
>
> When a persistent volume created on a {{ROOT}} disk is destroyed, the agent 
> will clean up its data: 
> https://github.com/apache/mesos/blob/f44535bca811720fc272c9abad2bc78652d61fe3/src/slave/slave.cpp#L4397
> However, this is not the case for PVs on SLRP disks. The agent relies on the 
> SLRP to do the cleanup:
> https://github.com/apache/mesos/blob/f44535bca811720fc272c9abad2bc78652d61fe3/src/slave/slave.cpp#L4472
> But SLRP simply updates its metadata and does nothing else:
> https://github.com/apache/mesos/blob/f44535bca811720fc272c9abad2bc78652d61fe3/src/resource_provider/storage/provider.cpp#L2805
> This would lead to data leakage if the framework does not call `CREATE_DISK` 
> but just unreserves it.





[jira] [Commented] (MESOS-9544) SLRP does not clean up destroyed persistent volumes.

2019-02-11 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765690#comment-16765690
 ] 

Chun-Hung Hsiao commented on MESOS-9544:


Two more patches for testing:

https://reviews.apache.org/r/69954/ (CreateDestroyPersistentBlockVolume)
https://reviews.apache.org/r/69955/ 
(DestroyUnpublishedPersistentVolume{,WithRecovery,WithReboot})

> SLRP does not clean up destroyed persistent volumes.
> 
>
> Key: MESOS-9544
> URL: https://issues.apache.org/jira/browse/MESOS-9544
> Project: Mesos
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.7.0, 1.7.1
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>  Labels: mesosphere, mesosphere-dss-beta, storage
>
> When a persistent volume created on a {{ROOT}} disk is destroyed, the agent 
> will clean up its data: 
> https://github.com/apache/mesos/blob/f44535bca811720fc272c9abad2bc78652d61fe3/src/slave/slave.cpp#L4397
> However, this is not the case for PVs on SLRP disks. The agent relies on the 
> SLRP to do the cleanup:
> https://github.com/apache/mesos/blob/f44535bca811720fc272c9abad2bc78652d61fe3/src/slave/slave.cpp#L4472
> But SLRP simply updates its metadata and does nothing else:
> https://github.com/apache/mesos/blob/f44535bca811720fc272c9abad2bc78652d61fe3/src/resource_provider/storage/provider.cpp#L2805
> This would lead to data leakage if the framework does not call `CREATE_DISK` 
> but just unreserves it.





[jira] [Commented] (MESOS-9560) ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky

2019-02-11 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765687#comment-16765687
 ] 

Greg Mann commented on MESOS-9560:
--

Observed again with a slightly different stack trace (notice {{connected()}} 
vs. {{disconnected()}}):
{code}
PC: @ 0x7f1ed2b30fe6 mesos::v1::resource_provider::Driver::send()
*** SIGSEGV (@0x0) received by PID 15010 (TID 0x7f1ec6657700) from PID 0; stack 
trace: ***
@ 0x7f1e9b9765f2 (unknown)
@ 0x7f1e9b97ac19 (unknown)
@ 0x7f1e9b96dd28 (unknown)
@ 0x7f1ecf459390 (unknown)
@ 0x7f1ed2b30fe6 mesos::v1::resource_provider::Driver::send()
@ 0x563f6934d1ef 
mesos::internal::tests::resource_provider::MockResourceProvider<>::connectedDefault()
@ 0x563f6926cbfe 
testing::internal::FunctionMockerBase<>::UntypedPerformDefaultAction()
@ 0x563f6a788e96 
testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
@ 0x563f692990c4 
mesos::internal::tests::resource_provider::MockResourceProvider<>::connected()
@ 0x7f1ed2820110 process::AsyncExecutorProcess::execute<>()
@ 0x7f1ed282fe58 
_ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_
@ 0x7f1ed3a636c1 process::ProcessBase::consume()
@ 0x7f1ed3a85b2a process::ProcessManager::resume()
@ 0x7f1ed3a89866 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7f1ecfc3cc80 (unknown)
@ 0x7f1ecf44f6ba start_thread
@ 0x7f1ecf18541d (unknown)
{code}

> ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky
> 
>
> Key: MESOS-9560
> URL: https://issues.apache.org/jira/browse/MESOS-9560
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere, storage, test
> Attachments: consoleText.txt
>
>
> We observed a segfault in 
> {{ContentType/AgentAPITest.MarkResourceProviderGone/1}} on test teardown.
> {noformat}
> I0131 23:55:59.378453  6798 slave.cpp:923] Agent terminating
> I0131 23:55:59.378813 31143 master.cpp:1269] Agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
> (ip-172-16-10-236.ec2.internal) disconnected
> I0131 23:55:59.378831 31143 master.cpp:3272] Disconnecting agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
> (ip-172-16-10-236.ec2.internal)
> I0131 23:55:59.378846 31143 master.cpp:3291] Deactivating agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
> (ip-172-16-10-236.ec2.internal)
> I0131 23:55:59.378891 31143 hierarchical.cpp:793] Agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 deactivated
> F0131 23:55:59.378891 31149 logging.cpp:67] RAW: Pure virtual method called
> @ 0x7f633aaaebdd  google::LogMessage::Fail()
> @ 0x7f633aab6281  google::RawLog__()
> @ 0x7f6339821262  __cxa_pure_virtual
> @ 0x55671cacc113  
> testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
> @ 0x55671b532e78  
> mesos::internal::tests::resource_provider::MockResourceProvider<>::disconnected()
> @ 0x7f633978f6b0  process::AsyncExecutorProcess::execute<>()
> @ 0x7f633979f218  
> _ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_
> @ 0x7f633a9f5d01  process::ProcessBase::consume()
> @ 0x7f633aa1a08a  process::ProcessManager::resume()
> @ 0x7f633aa1db06  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7f633acc9f80  execute_native_thread_routine
> @ 0x7f6337142e25  start_thread
> @ 0x7f6336241bad  __clone
> {noformat}





[jira] [Created] (MESOS-9564) Logrotate container logger lets tasks execute arbitrary commands in the Mesos agent's namespace

2019-02-11 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-9564:


 Summary: Logrotate container logger lets tasks execute arbitrary 
commands in the Mesos agent's namespace
 Key: MESOS-9564
 URL: https://issues.apache.org/jira/browse/MESOS-9564
 Project: Mesos
  Issue Type: Bug
  Components: agent, modules
Reporter: Joseph Wu


The non-default {{LogrotateContainerLogger}} module allows tasks to configure 
sandbox log rotation (See 
http://mesos.apache.org/documentation/latest/logging/#Containers ).  The 
{{logrotate_stdout_options}} and {{logrotate_stderr_options}} in particular let 
the task specify free-form text, which is written to a configuration file 
located in the task's sandbox.  The module does not sanitize or check this 
configuration at all.

The logger itself will eventually run {{logrotate}} against the written 
configuration file, but the logger is not isolated in the same way as the task. 
 For both the Mesos and Docker containerizers, the logger binary will run in 
the same namespace as the Mesos agent.  This makes it possible to affect files 
outside of the task's mount namespace.

Two modes of attack are known to be problematic:
* Changing or adding entries to the configuration file.  Normally, the 
configuration file contains a single file to rotate:
{code}
/path/to/sandbox/stdout {
  
}
{code}
It is trivial to add text to the {{logrotate_stdout_options}} to add a new 
entry:
{code}
/path/to/sandbox/stdout {
  
}
/path/to/other/file/on/disk {
  
}
{code}
* Logrotate's {{postrotate}} option allows for execution of arbitrary commands. 
 This can again be supplied with the {{logrotate_stdout_options}} variable.
{code}
/path/to/sandbox/stdout {
  postrotate
rm -rf /
  endscript
}
{code}

Some potential fixes to consider:
* Overwrite the .logrotate.conf files each time. This would leave a third 
party only milliseconds between the write and the logrotate call in which to 
modify the config files maliciously. This would not help if the task itself had 
postrotate options in its environment variables.
* Sanitize the free-form options field in the environment variables to remove 
postrotate or injection attempts like }\n/path/to/some/file\noptions{.
* Refactor parts of the Mesos isolation code path so that the logger and IO 
switchboard binary live in the same namespaces as the container (instead of the 
agent). This would also be nice in that the logger's CPU usage would then be 
accounted for within the container's resources.
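
The sanitization option could look roughly like the sketch below. It is a Python illustration rather than the module's code: `validate_logrotate_options` is a hypothetical helper, it rejects suspicious input instead of rewriting it, and the directive list is an assumption that may be incomplete.

```python
# Directives that open or close shell script blocks in a logrotate config.
# This list is an assumption for the sketch and may be incomplete.
SCRIPT_DIRECTIVES = {
    "prerotate",
    "postrotate",
    "firstaction",
    "lastaction",
    "preremove",
    "endscript",
}


def validate_logrotate_options(options: str) -> bool:
    """Return True if the free-form options look safe to embed in one entry."""
    # Braces would let the text close the current entry and open a new one.
    if "{" in options or "}" in options:
        return False
    for line in options.splitlines():
        words = line.split()
        # Reject any line that starts a script block.
        if words and words[0].lower() in SCRIPT_DIRECTIVES:
            return False
    return True
```

A denylist like this is only as good as its list of directives; an allowlist of known-safe options would be the stricter variant of the same idea.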





[jira] [Assigned] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.

2019-02-11 Thread Qian Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-9507:
-

Assignee: Qian Zhang  (was: Andrei Budnik)

> Agent could not recover due to empty docker volume checkpointed files.
> --
>
> Key: MESOS-9507
> URL: https://issues.apache.org/jira/browse/MESOS-9507
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>Priority: Critical
>  Labels: containerizer
>
> Agent could not recover due to empty docker volume checkpointed files. Please 
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect 
> failed: Collect failed: Failed to recover docker volumes for orphan container 
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered 
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This is caused by agent recovery after the volume state file is created but 
> before checkpointing finishes. Basically the docker volume is not mounted 
> yet, so the docker volume isolator should skip recovering this volume.
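
The proposed behaviour, skipping a checkpoint that was created but never written, might be sketched like this. It is a Python illustration only; `parse_checkpoint` is a hypothetical helper, not the docker volume isolator's recovery code.

```python
import json
from typing import Optional


def parse_checkpoint(content: str) -> Optional[dict]:
    """Parse checkpointed volume state, or return None if nothing was written."""
    if not content.strip():
        # The file exists but checkpointing never finished, so the volume
        # was not mounted and there is nothing to recover.
        return None
    # Non-empty but malformed content still raises, so real corruption
    # is not silently ignored.
    return json.loads(content)
```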





[jira] [Commented] (MESOS-8241) Add metrics for offer operation feedback

2019-02-11 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765518#comment-16765518
 ] 

Greg Mann commented on MESOS-8241:
--

This should include metrics for dropped operations.

> Add metrics for offer operation feedback
> 
>
> Key: MESOS-8241
> URL: https://issues.apache.org/jira/browse/MESOS-8241
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>






[jira] [Assigned] (MESOS-9557) Operations are leaked in Framework struct when agents are removed

2019-02-11 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu reassigned MESOS-9557:


Assignee: Joseph Wu

> Operations are leaked in Framework struct when agents are removed
> -
>
> Key: MESOS-9557
> URL: https://issues.apache.org/jira/browse/MESOS-9557
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> Currently, when agents are removed from the master, their operations are not 
> removed from the {{Framework}} structs. We should ensure that this occurs in 
> all cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9315) Adding support for implicit allocation of mandatory custom resources in Mesos

2019-02-11 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9315:
--

Assignee: Clément Michaud

> Adding support for implicit allocation of mandatory custom resources in Mesos
> -
>
> Key: MESOS-9315
> URL: https://issues.apache.org/jira/browse/MESOS-9315
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Clément Michaud
>Assignee: Clément Michaud
>Priority: Minor
>  Labels: resource-management
> Attachments: mesos-community-email.txt
>
>
> I sent an email (attached) a few days ago to propose the introduction of a 
> new hook to append resources implicitly to tasks for mandatory resources. 
> This would allow Mesos to support mandatory resources like network bandwidth 
> or disk IO for instance. 
> In a nutshell, we propose to add a hook with the following signature
> {code:java}
> Result masterLaunchTaskResourceDecorator(
>   const Resources& slaveResources,
>   TaskInfo& task)
> {code}
> and call it in the master in the ACCEPT message handler.





[jira] [Commented] (MESOS-9554) Allocator might skip allocations because a single framework is incapable of receiving certain resources

2019-02-11 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765237#comment-16765237
 ] 

Benjamin Mahler commented on MESOS-9554:


Test: https://reviews.apache.org/r/69942/

> Allocator might skip allocations because a single framework is incapable of 
> receiving certain resources
> ---
>
> Key: MESOS-9554
> URL: https://issues.apache.org/jira/browse/MESOS-9554
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Bannier
>Assignee: Benjamin Mahler
>Priority: Major
>
> Currently in the hierarchical allocator allocation loops we compute 
> {{available}} resources by taking into account the capabilities of the 
> current framework. Further down in the loop we might then {{break}} out of 
> the iteration under the assumption that no other framework can receive the 
> resources in question.
> This is only correct if all considered frameworks have identical capabilities.
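
The difference between breaking out of the loop and moving on to the next framework can be illustrated with a small sketch. This is Python, not the hierarchical allocator: `allocate`, the dictionary shapes, and the capability strings used with it are assumptions for the example.

```python
def allocate(resources, frameworks):
    """Give each resource to the first framework capable of receiving it.

    `resources` maps a resource name to the capability it requires (or None);
    `frameworks` maps a framework name to its set of capabilities.
    """
    allocation = {}
    for resource, required in resources.items():
        for framework, capabilities in frameworks.items():
            if required is not None and required not in capabilities:
                # Only this framework is ruled out. A `break` here would
                # wrongly skip frameworks with different capabilities.
                continue
            allocation[resource] = framework
            break  # resource allocated; move on to the next resource
    return allocation
```

For example, with frameworks `{"f1": set(), "f2": {"GPU_RESOURCES"}}`, a resource requiring `GPU_RESOURCES` skips `f1` and still reaches `f2`.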



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9563) Improve master's 'AcknowledgeOperationStatus' validation

2019-02-11 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9563:


 Summary: Improve master's 'AcknowledgeOperationStatus' validation
 Key: MESOS-9563
 URL: https://issues.apache.org/jira/browse/MESOS-9563
 Project: Mesos
  Issue Type: Improvement
  Components: master, scheduler api
Reporter: Greg Mann


Currently, the master quickly returns a 202 ACCEPTED response to schedulers for 
the ACKNOWLEDGE_OPERATION_STATUS call, with most validation that depends on the 
master's internal state being performed afterward in the 
{{acknowledgeOperationStatus()}} handler.

The master's HTTP code could instead perform this state-dependent validation 
before returning a response, improving the UX of the scheduler API.
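
A minimal sketch of the proposed ordering, validating against state before choosing the response, is shown below. It is a Python illustration: `handle_acknowledge`, the call layout, and the 400 status used for a failed check are assumptions, not the master's actual handler or response codes.

```python
def handle_acknowledge(call, known_operation_ids):
    """Validate the call against known state, then pick the HTTP status."""
    operation_id = call.get("operation_id")
    if operation_id not in known_operation_ids:
        # The scheduler learns about the problem in the response itself,
        # instead of receiving 202 and having the call dropped later.
        return 400, "unknown operation: {}".format(operation_id)
    return 202, "accepted"
```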


