[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2018-08-10 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576641#comment-16576641
 ] 

Zhitao Li commented on MESOS-8038:
--

[~gilbert] I don't think we will use an infinite timeout. My plan is to use a value like 
10mins for this flag after the backport, then observe whether the new timeout works.

[~bmahler] I agree that we are not really fixing the root cause here. I'll link 
the patches to a new task MESOS-9148 and keep this one open instead.

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9148) Make cgroups destroy timeout configurable for Mesos containerizer

2018-08-10 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-9148:


 Summary: Make cgroups destroy timeout configurable for Mesos 
containerizer
 Key: MESOS-9148
 URL: https://issues.apache.org/jira/browse/MESOS-9148
 Project: Mesos
  Issue Type: Task
Reporter: Zhitao Li
Assignee: Zhitao Li


Previously, all containers from the Mesos containerizer used the same 1-minute 
timeout for destroying their cgroups. However, we have observed that for certain 
containers (possibly stuck in deep system calls), the cgroup hierarchy was not 
destroyed within that timeout. This is quite problematic because the 
containerizer then short-circuits the destroy routine and skips 
_isolator::cleanup_. We have observed GPU resources being leaked indefinitely 
due to such a bug (see MESOS-8038).

The proposed workaround is to add an optional agent flag that allows operators 
to override this timeout.
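
To make the proposal concrete, here is a minimal sketch of what the flag could 
look like in the agent's flag definitions, assuming a name like 
{{cgroups_destroy_timeout}} and the usual stout {{FlagsBase::add}} pattern (the 
final name, help text, and default are up for review):

{code:cpp}
// Hypothetical addition to the agent flags (sketch only; the flag name
// `cgroups_destroy_timeout` and its wiring into the containerizer are
// assumptions, not the final patch).
add(&Flags::cgroups_destroy_timeout,
    "cgroups_destroy_timeout",
    "Amount of time to wait for a container's cgroup hierarchy to be\n"
    "destroyed before the containerizer gives up and short-circuits\n"
    "the isolator cleanup path.",
    Minutes(1));  // Keep the current default of 1 minute.
{code}

Operators could then raise the timeout on affected hosts, e.g. 
{{--cgroups_destroy_timeout=10mins}}, without rebuilding the agent.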



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2018-07-26 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558938#comment-16558938
 ] 

Zhitao Li commented on MESOS-8038:
--

I just attached another full agent log with this issue.

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2018-07-25 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556308#comment-16556308
 ] 

Zhitao Li commented on MESOS-8038:
--

Some updates:

We hit another episode of this issue. Our setup is a custom framework + command 
executor running GPU-only tasks on GPU-only machines in a cluster running Mesos 
1.5.0.

One error log pattern that correlates 100% with the issue in our environment:

bq. E0724 01:14:15.203124 10883 slave.cpp:5798] Termination of executor 
'3e213d20-ed99-4196-bd26-12560423c5fd-151-1' of framework 
8e2c0e03-3147-442c-9abe-b370aad201cd- failed: Failed to kill all processes 
in the container: Timed out after 1mins

Based on [this 
TODO|https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/mesos/containerizer.cpp#L2504], 
the containerizer does not call into the nvidia/gpu isolator's cleanup method in 
this case, so the GPU resources are "leaked" until the next agent restart, which 
matches our observation.

Based on the error logs, we are looking at the following:
1. Why does the termination take longer than the 
[DESTROY_TIMEOUT|https://github.com/apache/mesos/blob/a86ff8c36532f97b6eb6b44c6f871de24afbcc4d/src/linux/cgroups.hpp#L44]?
2. Should this timeout be configurable?
3. Should the early return in the above case really happen?

[~jasonlai] and I might discuss this issue in the next containerizer WG.
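
To illustrate the mechanism behind item 3, here is a minimal, self-contained 
libprocess sketch (not the actual containerizer code; {{destroy}} below merely 
stands in for the launcher's destroy future): a {{.then()}} continuation only 
runs when the preceding future succeeds, so a destroy that fails with "Timed out 
after 1mins" never reaches the isolator-cleanup stage chained after it.

{code:cpp}
#include <iostream>

#include <process/future.hpp>

#include <stout/nothing.hpp>

using process::Failure;
using process::Future;

int main()
{
  // Stand-in for the launcher destroy that timed out after 1 minute.
  Future<Nothing> destroy = Failure("Timed out after 1mins");

  // Stand-in for the isolator cleanup chained after a successful destroy.
  Future<Nothing> cleanup = destroy
    .then([]() -> Future<Nothing> {
      std::cout << "isolator cleanup ran" << std::endl;  // Never printed.
      return Nothing();
    });

  // The failure propagates and the cleanup stage is skipped entirely,
  // which is how the GPU ends up "leaked" until the agent restarts.
  std::cout << "cleanup failed: " << cleanup.failure() << std::endl;
  return 0;
}
{code}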

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2018-06-20 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518573#comment-16518573
 ] 

Zhitao Li edited comment on MESOS-8038 at 6/20/18 8:41 PM:
---

We have this happening again in our cluster.

One suggestion I have is to change this 
[Failure|https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258] 
into a FATAL so that the agent will generate a coredump when allowed, and we can 
use gdb to analyze the data structures further. This will also make recovery 
automatic as long as the agent is configured to automatically restart upon 
crash, which is recommended in a production installation.

[~bmahler] [~vinodkone] What do you think?
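
To make the suggestion concrete, the change would look roughly like this at the 
linked allocation site (a paraphrased fragment; the variable names and message 
are illustrative, not the exact code in allocator.cpp):

{code:cpp}
// Paraphrased sketch of gpu/allocator.cpp around the linked line.
if (available.size() < requested) {
  // Today: return Failure(...), which surfaces as
  // "Collect failed: Requested 1 but only 0 available" while the agent
  // keeps running with inconsistent GPU bookkeeping.
  //
  // Proposed: crash loudly instead, so that (a) a coredump is available
  // for gdb analysis, and (b) an agent configured to auto-restart
  // recovers its GPU state from scratch.
  LOG(FATAL) << "Requested " << requested << " GPUs but only "
             << available.size() << " available";
}
{code}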


was (Author: zhitao):
We have this happening again in our cluster.

One suggestion I have is to change 
[Failure|https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258]
 into a FATAL so that agent will generate a coredump when allowed, and we can 
use gdb to analyze the data structure further. This will also make recovery 
automated as long as agent is configured to automated restart upon crash (which 
would be recommended in a production installation)

[~bmahler] [~vinodkone] What do you think?

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2018-06-20 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518573#comment-16518573
 ] 

Zhitao Li edited comment on MESOS-8038 at 6/20/18 8:35 PM:
---

We have this happening again in our cluster.

One suggestion I have is to change 
[Failure|https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258]
 into a FATAL so that agent will generate a coredump when allowed, and we can 
use gdb to analyze the data structure further. This will also make recovery 
automated as long as agent is configured to automated restart upon crash (which 
would be recommended in a production installation)

[~bmahler] [~vinodkone] What do you think?


was (Author: zhitao):
We have this happening again in our cluster.

One suggestion I have is to change 
[Failure|https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258]
 into a FATAL so that agent will generate a coredump when allowed, and we can 
use gdb to analyze the data structure further. This will also make recovery 
automated.

[~bmahler][~vinodkone] What do you think?

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2018-06-20 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518573#comment-16518573
 ] 

Zhitao Li edited comment on MESOS-8038 at 6/20/18 8:34 PM:
---

We have this happening again in our cluster.

One suggestion I have is to change 
[Failure|https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258]
 into a FATAL so that agent will generate a coredump when allowed, and we can 
use gdb to analyze the data structure further. This will also make recovery 
automated.

[~bmahler][~vinodkone] What do you think?


was (Author: zhitao):
We have this happening again in our cluster.

One suggestion I have is to change[ 
https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258
 | the Failure] into a FATAL so that agent will generate a coredump when 
allowed, and we can use gdb to analyze the data structure further. This will 
also make recovery automated.

[~bmahler][~vinodkone] What do you think?

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2018-06-20 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518573#comment-16518573
 ] 

Zhitao Li edited comment on MESOS-8038 at 6/20/18 8:34 PM:
---

We have this happening again in our cluster.

One suggestion I have is to change[ 
https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258
 | the Failure] into a FATAL so that agent will generate a coredump when 
allowed, and we can use gdb to analyze the data structure further. This will 
also make recovery automated.

[~bmahler][~vinodkone] What do you think?


was (Author: zhitao):
We have this happening again in our cluster.

One suggestion I have is to change[ this 
Failure|https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258]
 into a FATAL so that agent will generate a coredump when allowed, and we can 
use gdb to analyze the data structure further. This will also make recovery 
automated.

[~bmahler][~vinodkone] What do you think?

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2018-06-20 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518573#comment-16518573
 ] 

Zhitao Li commented on MESOS-8038:
--

We have this happening again in our cluster.

One suggestion I have is to change [this 
Failure|https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/slave/containerizer/mesos/isolators/gpu/allocator.cpp#L258] 
into a FATAL so that the agent will generate a coredump when allowed, and we can 
use gdb to analyze the data structures further. This will also make recovery 
automated.

[~bmahler] [~vinodkone] What do you think?

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8038) Launching GPU task sporadically fails.

2018-06-20 Thread Zhitao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8038:


Assignee: Zhitao Li

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave.INFO.log
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9000) Operator API event stream can miss task status updates

2018-06-15 Thread Zhitao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513816#comment-16513816
 ] 

Zhitao Li commented on MESOS-9000:
--

I believe the high-level intention was to avoid sending unnecessary duplicate 
status update messages, but I don't think we explicitly considered the 
multiple-events-queued scenario you described.

I think if we have a counter to monitor the rate of messages on the event 
stream, it sounds fine to add this.
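
For reference, such a counter is cheap to add with the libprocess metrics API; a 
minimal sketch (the metric name and the struct it lives in are illustrative, not 
existing master code):

{code:cpp}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Illustrative only: a counter the master would bump each time it sends
// a TaskUpdated event to operator API subscribers, so the event-stream
// rate can be monitored before and after the proposed change.
struct SubscriberMetrics
{
  SubscriberMetrics()
    : operator_api_task_updates("master/operator_api/task_updated_events")
  {
    process::metrics::add(operator_api_task_updates);
  }

  ~SubscriberMetrics()
  {
    process::metrics::remove(operator_api_task_updates);
  }

  process::metrics::Counter operator_api_task_updates;
};

// At the send site: ++metrics.operator_api_task_updates;
{code}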



> Operator API event stream can miss task status updates
> --
>
> Key: MESOS-9000
> URL: https://issues.apache.org/jira/browse/MESOS-9000
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> As of now, the master only sends TaskUpdated messages to subscribers when the 
> latest known task state on the agent changed:
> {noformat}
>   // src/master/master.cpp
>   if (!protobuf::isTerminalState(task->state())) {
> if (status.state() != task->state()) {
>   sendSubscribersUpdate = true;
> }
> task->set_state(latestState.getOrElse(status.state()));
>   }
> {noformat}
> The latest state is set like this:
> {noformat}
> // src/messages/messages.proto
> message StatusUpdate {
>   [...]
>   // This corresponds to the latest state of the task according to the
>   // agent. Note that this state might be different than the state in
>   // 'status' because task status update manager queues updates. In
>   // other words, 'status' corresponds to the update at top of the
>   // queue and 'latest_state' corresponds to the update at bottom of
>   // the queue.
>   optional TaskState latest_state = 7;
> }
> {noformat}
> However, the `TaskStatus` message included in an `TaskUpdated` event is the 
> event at the bottom of the queue when the update was sent.
> So we can easily get into a situation where e.g. the first TaskUpdated has 
> .status.state == TASK_STARTING and .state == TASK_RUNNING, and the second 
> update with .status.state == TASK_RUNNING and .state == TASK_RUNNING would 
> not get delivered because the latest known state did not change.
> This implies that schedulers can not reliably wait for the status information 
> corresponding to specific task state, since there is no guarantee that 
> subscribers get notified during the time when this status update will be 
> included in the status field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-06-01 Thread Zhitao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8830:


Assignee: Zhitao Li

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which do not use any 
> container image and thus run on the host filesystem) saw their persistent 
> volume data get wiped out.
> Upon revisiting the logs, we found the following suspicious lines:
> {panel:title=log}
> {noformat}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {noformat}
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> process makes me believe this is what triggered the data loss.
> We had some misconfigurations on our fleet, and I do not know whether the 
> 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-23 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486761#comment-16486761
 ] 

Zhitao Li commented on MESOS-8830:
--

[~jieyu] I put up a patch in https://reviews.apache.org/r/67264/. Please let me 
know what you think. Thanks.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> {noformat}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {noformat}
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> process makes me believe this is what triggered the data loss.

[jira] [Commented] (MESOS-8909) Scrubbing value secret from HTTP responses

2018-05-22 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484234#comment-16484234
 ] 

Zhitao Li commented on MESOS-8909:
--

[~jieyu] Yes, this is only applicable to `VALUE` type secrets (we don't care if 
a secret with `REFERENCE` type is exposed).

If we do not consider `VALUE` type secrets good enough to use in prod, maybe we 
should call that out in http://mesos.apache.org/documentation/latest/secrets/ as 
well as in the protobuf comments? I can send patches if you agree with that 
intent.

> Scrubbing value secret from HTTP responses
> --
>
> Key: MESOS-8909
> URL: https://issues.apache.org/jira/browse/MESOS-8909
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Zhitao Li
>Priority: Major
>  Labels: security
>
> Mesos supports a value based secret. However, I believe some HTTP endpoints 
> and v1 operator responses could leak this information.
> The goal here is to make sure these endpoints do not leak the information.
> We did some quick research and gather the following list in this [Google 
> doc|https://docs.google.com/document/d/1W26RUpYEB92eTQYbACIOem5B9hzXX59jeEIT9RB2X1o/edit#heading=h.gzvg4ec6wllm].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8909) Scrubbing value secret from HTTP responses

2018-05-17 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479892#comment-16479892
 ] 

Zhitao Li commented on MESOS-8909:
--

My current thoughts:
- Create a common function like `void scrubSecretValue(mesos::Secret* secret)`, 
similar to `upgradeResources()`, which uses protobuf's reflection to scrub any 
secret value (a rough sketch follows below);
- Ensure that protobuf messages in responses from the various endpoints are 
passed through this function.
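
A rough sketch of the first bullet, with a slightly more general signature (any 
{{google::protobuf::Message}}) so it can walk nested messages the way 
{{upgradeResources()}} does; the function name, the redaction placeholder, and 
the omission of map/unknown-field handling are all illustrative:

{code:cpp}
#include <google/protobuf/descriptor.h>
#include <google/protobuf/message.h>

#include <mesos/mesos.pb.h>

// Sketch only: recursively walk a protobuf message and blank out the
// `data` bytes of any mesos::Secret with a VALUE set, so that serialized
// HTTP/operator API responses no longer carry the plaintext secret.
void scrubSecretValues(google::protobuf::Message* message)
{
  using google::protobuf::Descriptor;
  using google::protobuf::FieldDescriptor;
  using google::protobuf::Reflection;

  const Descriptor* descriptor = message->GetDescriptor();
  const Reflection* reflection = message->GetReflection();

  if (descriptor->full_name() == mesos::Secret::descriptor()->full_name()) {
    // Assumes `message` is the generated type (not a dynamic message).
    auto* secret = static_cast<mesos::Secret*>(message);
    if (secret->has_value()) {
      secret->mutable_value()->set_data("<redacted>");
    }
    return;
  }

  for (int i = 0; i < descriptor->field_count(); ++i) {
    const FieldDescriptor* field = descriptor->field(i);
    if (field->cpp_type() != FieldDescriptor::CPPTYPE_MESSAGE) {
      continue;
    }

    if (field->is_repeated()) {
      const int size = reflection->FieldSize(*message, field);
      for (int j = 0; j < size; ++j) {
        scrubSecretValues(
            reflection->MutableRepeatedMessage(message, field, j));
      }
    } else if (reflection->HasField(*message, field)) {
      scrubSecretValues(reflection->MutableMessage(message, field));
    }
  }
}
{code}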

> Scrubbing value secret from HTTP responses
> --
>
> Key: MESOS-8909
> URL: https://issues.apache.org/jira/browse/MESOS-8909
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Zhitao Li
>Priority: Major
>  Labels: security
>
> Mesos supports a value based secret. However, I believe some HTTP endpoints 
> and v1 operator responses could leak this information.
> The goal here is to make sure these endpoints do not leak the information.
> We did some quick research and gather the following list in this [Google 
> doc|https://docs.google.com/document/d/1W26RUpYEB92eTQYbACIOem5B9hzXX59jeEIT9RB2X1o/edit#heading=h.gzvg4ec6wllm].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-15 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476604#comment-16476604
 ] 

Zhitao Li edited comment on MESOS-8830 at 5/15/18 11:27 PM:


[~jieyu] Unfortunately I lost the environment on this issue.

Still, I'd like to pursue the idea of `not following bind mounts (just like not 
following symlinks) when doing workdir gc`: do you mean that we should just 
filter out any FTS node which is a bind mount in 
[stout::rmdir|https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/os/posix/rmdir.hpp#L43]? 
The [FTS manual|https://www.freebsd.org/cgi/man.cgi?query=fts=3] I can find 
mentions nothing about detecting a bind mount, so I guess we need to handle that 
ourselves?
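
Since FTS itself won't tell us, and a plain {{st_dev}} comparison would miss 
bind mounts that stay on the same filesystem, we would likely have to consult 
the mount table ourselves. A standalone sketch of such a helper (illustrative 
only; the real change would live inside {{stout::rmdir}}'s FTS loop):

{code:cpp}
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Sketch: collect every mount point from /proc/self/mountinfo so that the
// rmdir traversal can skip any directory that is itself a mount point
// (this catches same-filesystem bind mounts, which an st_dev comparison
// against the traversal root would miss).
std::unordered_set<std::string> mountPoints()
{
  std::unordered_set<std::string> points;

  std::ifstream mountinfo("/proc/self/mountinfo");
  std::string line;
  while (std::getline(mountinfo, line)) {
    // The 5th whitespace-separated field of each mountinfo line is the
    // mount point (see proc(5)).
    std::istringstream fields(line);
    std::string field;
    for (int i = 0; i < 5 && (fields >> field); ++i) {}
    points.insert(field);
  }

  return points;
}

// In the FTS loop (assuming rmdir was invoked with an absolute path) one
// would then do something like:
//   if (entry->fts_info == FTS_D && points.count(entry->fts_path) > 0) {
//     fts_set(tree, entry, FTS_SKIP);  // Don't descend into the volume.
//   }
{code}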


was (Author: zhitao):
[~jieyu] Unfortunately I lost the environment on this issue.

Still, I'd like to pursue on the idea of `not follow bind mounts (just like not 
follow symlinks) when doing workdir gc`: Do you mean that we should just filter 
out any FTS node which is a bind mount in 
[stout::rmdir|https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/os/posix/rmdir.hpp#L43]?

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> {noformat}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  

[jira] [Comment Edited] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-15 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476604#comment-16476604
 ] 

Zhitao Li edited comment on MESOS-8830 at 5/15/18 11:22 PM:


[~jieyu] Unfortunately I lost the environment on this issue.

Still, I'd like to pursue on the idea of `not follow bind mounts (just like not 
follow symlinks) when doing workdir gc`: Do you mean that we should just filter 
out any FTS node which is a bind mount in 
[stout::rmdir|https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/os/posix/rmdir.hpp#L43]?


was (Author: zhitao):
[~jieyu] Unfortunately I lost the environment on this issue.

Still, I'd like to pursue on the idea of `not follow bind mounts (just like not 
follow symlinks) when doing workdir gc`: Do you mean that we should just filter 
out any FTS node in 
[stout::rmdir|https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/os/posix/rmdir.hpp#L43]?

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> {noformat}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-15 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476604#comment-16476604
 ] 

Zhitao Li commented on MESOS-8830:
--

[~jieyu] Unfortunately I lost the environment on this issue.

Still, I'd like to pursue on the idea of `not follow bind mounts (just like not 
follow symlinks) when doing workdir gc`: Do you mean that we should just filter 
out any FTS node in 
[stout::rmdir|https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/os/posix/rmdir.hpp#L43]?

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> {noformat}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {noformat}
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> 

[jira] [Created] (MESOS-8909) Scrubbing value secret from HTTP responses

2018-05-11 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8909:


 Summary: Scrubbing value secret from HTTP responses
 Key: MESOS-8909
 URL: https://issues.apache.org/jira/browse/MESOS-8909
 Project: Mesos
  Issue Type: Task
  Components: security
Reporter: Zhitao Li


Mesos supports a value based secret. However, I believe some HTTP endpoints and 
v1 operator responses could leak this information.

The goal here is to make sure these endpoints do not leak the information.

We did some quick research and gathered the following list in this [Google 
doc|https://docs.google.com/document/d/1W26RUpYEB92eTQYbACIOem5B9hzXX59jeEIT9RB2X1o/edit#heading=h.gzvg4ec6wllm].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8600) Add more permissive reconfiguration policies

2018-05-08 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468172#comment-16468172
 ] 

Zhitao Li commented on MESOS-8600:
--

Another usability improvement I'm considering is something like 
https://reviews.apache.org/r/67022/ : I feel that keeping the checkpointed 
SlaveInfo updated when the agent properly reregisters with the master will help 
operators reason about the cluster and deal with various upgrade/downgrade 
sequences.

FYI, we already run https://reviews.apache.org/r/64384 at scale without seeing 
many issues.


> Add more permissive reconfiguration policies
> 
>
> Key: MESOS-8600
> URL: https://issues.apache.org/jira/browse/MESOS-8600
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>
> With Mesos 1.5, the `reconfiguration_policy` flag was added to allow 
> reconfiguration of agents without necessarily draining all tasks.
> However, the current implementation only allows a limited set of changes, 
> with the `–reconfiguration_policy=all` setting laid out in the original 
> design doc not yet being implemented.
> This ticket is intended to track progress on implementing this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8884) Flaky `DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.

2018-05-07 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466546#comment-16466546
 ] 

Zhitao Li commented on MESOS-8884:
--

Attempt to fix: https://reviews.apache.org/r/66993/

> Flaky `DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.
> --
>
> Key: MESOS-8884
> URL: https://issues.apache.org/jira/browse/MESOS-8884
> Project: Mesos
>  Issue Type: Bug
> Environment: master-520b7298
>Reporter: Andrei Budnik
>Assignee: Zhitao Li
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_DOCKER_MaxCompletionTime-badrun.txt
>
>
> This test fails quite often in our internal CI.
> {code:java}
> ../../src/tests/containerizer/docker_containerizer_tests.cpp:663: Failure
> termination.get() is NONE
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8884) Flaky `DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.

2018-05-07 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8884:


Assignee: Zhitao Li

> Flaky `DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime`.
> --
>
> Key: MESOS-8884
> URL: https://issues.apache.org/jira/browse/MESOS-8884
> Project: Mesos
>  Issue Type: Bug
> Environment: master-520b7298
>Reporter: Andrei Budnik
>Assignee: Zhitao Li
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_DOCKER_MaxCompletionTime-badrun.txt
>
>
> This test fails quite often in our internal CI.
> {code:java}
> ../../src/tests/containerizer/docker_containerizer_tests.cpp:663: Failure
> termination.get() is NONE
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8851) Introduce a push-based gauge.

2018-05-01 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460228#comment-16460228
 ] 

Zhitao Li commented on MESOS-8851:
--

Okay, I'm not particularly picky about naming, so either is fine.

Do you have 1) a benchmark of the potential performance improvement from this, 
and 2) a list of expensive gauges we can convert? I'm interested in helping out 
on some of them (or finding someone from my company).
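
For context on what a conversion would look like, here is a rough sketch against 
the API proposed in this ticket ({{process::metrics::PushGauge}}); the header, 
metric name, and surrounding struct are assumptions on my part and may not match 
the final patch:

{code:cpp}
#include <process/metrics/metrics.hpp>
#include <process/metrics/push_gauge.hpp>

// Illustrative conversion: instead of a pull-based Gauge whose callback
// dispatches into the owning Process on every /metrics/snapshot request,
// the Process keeps a PushGauge and updates it wherever the underlying
// quantity actually changes.
struct QueueMetrics
{
  QueueMetrics() : event_queue_size("allocator/event_queue_size")
  {
    process::metrics::add(event_queue_size);
  }

  ~QueueMetrics()
  {
    process::metrics::remove(event_queue_size);
  }

  process::metrics::PushGauge event_queue_size;
};

// At the points where the queue changes:
//   ++metrics.event_queue_size;  // Enqueue.
//   --metrics.event_queue_size;  // Dequeue.
{code}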

> Introduce a push-based gauge.
> -
>
> Key: MESOS-8851
> URL: https://issues.apache.org/jira/browse/MESOS-8851
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere, metrics
> Fix For: 1.6.0
>
>
> Currently, we only have pull-based gauges which have significant performance 
> downsides.
> A push-based gauge differs from a pull-based gauge in that the client is 
> responsible for pushing the latest value into the gauge whenever it changes. 
> This can be challenging in some cases as it requires the client to have a 
> good handle on when the gauge value changes (rather than just computing the 
> current value when asked).
> It is highly recommended to use push-based gauges if possible as they provide 
> significant performance benefits over pull-based gauges. Pull-based gauge 
> suffer from delays getting processed on the event queue of a Process, as well 
> as incur computation cost on the Process each time the metrics are collected. 
> Push-based gauges, on the other hand, incur no cost to the owning Process 
> when metrics are collected, and instead incur a trivial cost when the Process 
> pushes new values in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8856) UNIMPLEMENTED macro in stout could conflict with protobuf

2018-05-01 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8856:


 Summary: UNIMPLEMENTED macro in stout could conflict with protobuf
 Key: MESOS-8856
 URL: https://issues.apache.org/jira/browse/MESOS-8856
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


When I tried to use the *UNIMPLEMENTED* macro defined in 
3rdparty/stout/include/stout/unimplemented.hpp, it conflicted with an 
`UNIMPLEMENTED` identifier in 
{build_dir}/3rdparty/protobuf-3.5.0/src/google/protobuf/stubs/status.h.

The latter is actually an enum value:


{code:cpp}

namespace google {
namespace protobuf {
namespace util {
namespace error {
// These values must match error codes defined in google/rpc/code.proto.
enum Code {
  OK = 0,
...
  UNIMPLEMENTED = 12,
...
};
}  // namespace error
{code}


The preprocessor expands the enum value as the macro, producing invalid code.
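
To spell the collision out (the macro body below is a stand-in, not stout's 
exact definition, which expands to an abort-with-message statement):

{code:cpp}
// Stand-in for stout's statement-like macro (the real body differs).
#define UNIMPLEMENTED abort()

// After preprocessing, protobuf's enumerator
//
//   UNIMPLEMENTED = 12,
//
// becomes
//
//   abort() = 12,
//
// which is not a valid enumerator and fails to compile.
{code}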

cc [~chhsia0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8851) Introduce a push-based gauge.

2018-05-01 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459789#comment-16459789
 ] 

Zhitao Li commented on MESOS-8851:
--

This is great, but would it be better called `PollGauge`?

> Introduce a push-based gauge.
> -
>
> Key: MESOS-8851
> URL: https://issues.apache.org/jira/browse/MESOS-8851
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere, metrics
> Fix For: 1.6.0
>
>
> Currently, we only have pull-based gauges which have significant performance 
> downsides.
> A push-based gauge differs from a pull-based gauge in that the client is 
> responsible for pushing the latest value into the gauge whenever it changes. 
> This can be challenging in some cases as it requires the client to have a 
> good handle on when the gauge value changes (rather than just computing the 
> current value when asked).
> It is highly recommended to use push-based gauges if possible as they provide 
> significant performance benefits over pull-based gauges. Pull-based gauge 
> suffer from delays getting processed on the event queue of a Process, as well 
> as incur computation cost on the Process each time the metrics are collected. 
> Push-based gauges, on the other hand, incur no cost to the owning Process 
> when metrics are collected, and instead incur a trivial cost when the Process 
> pushes new values in.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-04-24 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450468#comment-16450468
 ] 

Zhitao Li commented on MESOS-8830:
--

Minor correction: I mistakenly thought the problematic path is a hard link 
while it's actually a bind mount.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which do not use any 
> container image and thus run on the host filesystem) saw their persistent 
> volume data get wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {panel}
> (I can try to provide more logs, depending on how much the local archive 
> retains after rotation.)
> This happened on a 1.3.1 agent, although I suspect the issue is not specific 
> to that version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> path makes me believe this is what triggered the data loss.
> We had some misconfigurations on our 

[jira] [Created] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-04-24 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8830:


 Summary: Agent gc on old slave sandboxes could empty persistent 
volume data
 Key: MESOS-8830
 URL: https://issues.apache.org/jira/browse/MESOS-8830
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


We had an issue in which custom Cassandra executors (which do not use any 
container image and thus run on the host filesystem) saw their persistent 
volume data get wiped out.

Upon revisiting logs, we found following suspicious lines:


{panel:title=log}
I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
allowed age: 4.764742265646493days
I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
removal time 2.23508429704593days
I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
removal time 2.23508429587852days
I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
'/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
 Device or resource busy
E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
 Directory not empty
E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
 Directory not empty
E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
 Directory not empty
E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
 Directory not empty
E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
 Directory not empty
E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
 Directory not empty
E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
 Directory not empty
{panel}

(I can try to provide more logs, depending on how much the local archive 
retains after rotation.)

The path 
*/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
 is a hard link to a persistent volume. The fact that agent gc touched that 
path makes me believe this is what triggered the data loss.

We had some misconfigurations on our fleet, and I do not know whether the 
previous slave (id-ed *70279b87-553a-4213-a85b-46fdc191849d-S4*) was shut down 
cleanly yet.

My questions/suggestions:

1) If an executor was asked to shut down by a new agent (with a new id), how 
much of the persistent volume cleanup code will be executed (especially if the 
new agent does not really know this executor should be running anymore)?
2) Should we figure out a way to better protect hard links to persistent 
volumes in slave/gc.cpp (for instance, skip any path which seems dangerous), to 
prevent similar problems? A rough sketch of such a guard follows below.



--
This message 

[jira] [Created] (MESOS-8791) Convert grow_volume and shrink_volume into non-speculative operations

2018-04-16 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8791:


 Summary: Convert grow_volume and shrink_volume into 
non-speculative operations
 Key: MESOS-8791
 URL: https://issues.apache.org/jira/browse/MESOS-8791
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.6.0
Reporter: Zhitao Li


We implemented most of grow_volume and shrink_volume in 1.6. However, we were 
not able to finish the work to implement them as non-speculative offer 
operations (which was the original intention), mostly due to some blockers on 
operator-API-triggered operations.

This task tracks the work of converting them back to non-speculative (see the 
sketch below):
- master and allocator need to properly track "consumed" resources for a 
pending operation which is not triggered by a framework;
- if a non-speculative operation succeeded, master and allocator should add 
"converted" to the available resources;
- if a non-speculative operation failed, master and allocator should add 
"consumed" back to the available resources.
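
A minimal sketch of the accounting described above, with illustrative names 
rather than the actual master/allocator interfaces:

{code:cpp}
#include <mesos/resources.hpp>

using mesos::Resources;

// Book-keeping for one pending non-speculative operation.
struct PendingOperation
{
  Resources consumed;   // Held back from offers while the operation is in flight.
  Resources converted;  // What the operation yields if it succeeds.
};

// Called when the terminal status of the operation is known.
void settle(Resources& available, const PendingOperation& op, bool succeeded)
{
  if (succeeded) {
    available += op.converted;  // e.g. the resized volume.
  } else {
    available += op.consumed;   // Give back exactly what was taken.
  }
}
{code}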



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-5933) Refactor the uri::Fetcher as a binary.

2018-04-12 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-5933:


Assignee: (was: Zhitao Li)

> Refactor the uri::Fetcher as a binary.
> --
>
> Key: MESOS-5933
> URL: https://issues.apache.org/jira/browse/MESOS-5933
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Gilbert Song
>Priority: Major
>  Labels: fetcher, mesosphere
>
> By refactoring the uri::Fetcher as a binary, the fetcher can be used 
> independently. Not only mesos, but also new fetcher plugin testing, mesos cli 
> and many other new components in the future can re-use the binary to fetch 
> any URI with different schemes. Ideally, after this change, mesos cli is able 
> to re-use the uri::Fetcher binary to introduce new image pulling commands, 
> e.g., `mesos fetch -i `.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (MESOS-8725) Support max_duration for tasks

2018-04-12 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8725:
-
Comment: was deleted

(was: One minor decision I'm making is to require all tasks in the same group 
to have the same `max_duration` (either all absent, or carries the same value).

Keeping this as record here.)

> Support max_duration for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timeline. If any tasks in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an  *optional* *TimeInfo deadline* to 
> TaskInfo field, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8600) Add more permissive reconfiguration policies

2018-04-12 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436125#comment-16436125
 ] 

Zhitao Li edited comment on MESOS-8600 at 4/12/18 6:44 PM:
---

ping? [~vinodkone][~bennoe]

We are running this patch for a while in our cluster so we would like to see 
whether we can get this upstreamed to split.

We are okay if we declare this option as experimental for several minor 
versions.


was (Author: zhitao):
ping? [~vinodkone][~bennoe]
We are running this patch for a while in our cluster so we would like to see 
whether we can get this upstreamed to split.

We are okay if we declare this option as experimental.

> Add more permissive reconfiguration policies
> 
>
> Key: MESOS-8600
> URL: https://issues.apache.org/jira/browse/MESOS-8600
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>
> With Mesos 1.5, the `reconfiguration_policy` flag was added to allow 
> reconfiguration of agents without necessarily draining all tasks.
> However, the current implementation only allows a limited set of changes, 
> with the `–reconfiguration_policy=all` setting laid out in the original 
> design doc not yet being implemented.
> This ticket is intended to track progress on implementing this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8600) Add more permissive reconfiguration policies

2018-04-12 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436125#comment-16436125
 ] 

Zhitao Li commented on MESOS-8600:
--

ping? [~vinodkone][~bennoe]
We are running this patch for a while in our cluster so we would like to see 
whether we can get this upstreamed to split.

We are okay if we declare this option as experimental.

> Add more permissive reconfiguration policies
> 
>
> Key: MESOS-8600
> URL: https://issues.apache.org/jira/browse/MESOS-8600
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>
> With Mesos 1.5, the `reconfiguration_policy` flag was added to allow 
> reconfiguration of agents without necessarily draining all tasks.
> However, the current implementation only allows a limited set of changes, 
> with the `–reconfiguration_policy=all` setting laid out in the original 
> design doc not yet being implemented.
> This ticket is intended to track progress on implementing this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8768) Provide custom reason for cascaded kill in a task group

2018-04-09 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8768:


 Summary: Provide custom reason for cascaded kill in a task group
 Key: MESOS-8768
 URL: https://issues.apache.org/jira/browse/MESOS-8768
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


Currently, if a task fails in a task group, the other active tasks in the same 
group will see _TASK_KILLED_ without any custom reason. We would like to 
provide a custom reason like _*REASON_TASK_GROUP_KILLED*_ to distinguish 
whether the task was killed upon request of the scheduler or upon a cascaded 
failure.

 

Open question: does any framework depend on the value of this reason? If not, 
we can probably just change the reason without an opt-in mechanism from the 
framework (i.e., a new framework capability).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8748) Create ACL for grow and shrink volume

2018-03-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8748:


 Summary: Create ACL for grow and shrink volume
 Key: MESOS-8748
 URL: https://issues.apache.org/jira/browse/MESOS-8748
 Project: Mesos
  Issue Type: Task
  Components: security
Reporter: Zhitao Li
Assignee: Zhitao Li


As follow up work of MESOS-4965, we should make sure new operations are 
properly protected in ACL and authorizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8747) Support resizing persistent volume through operator API

2018-03-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8747:


 Summary: Support resizing persistent volume through operator API
 Key: MESOS-8747
 URL: https://issues.apache.org/jira/browse/MESOS-8747
 Project: Mesos
  Issue Type: Task
Reporter: Zhitao Li
Assignee: Zhitao Li


MESOS-4965 tracks the implementation through framework offer operation, while 
this task extends the support to operator API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8746) Support difference for hashset in stout

2018-03-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8746:


 Summary: Support difference for hashset in stout
 Key: MESOS-8746
 URL: https://issues.apache.org/jira/browse/MESOS-8746
 Project: Mesos
  Issue Type: Improvement
  Components: stout
Reporter: Zhitao Li
Assignee: Zhitao Li


Several places in the code require calculating the difference between two 
hashsets. This should be supported by stout itself; a sketch is included below.
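
A minimal sketch of what such a helper could look like, assuming stout's 
`hashset` wrapper; the operator form is illustrative:

{code:cpp}
#include <stout/hashset.hpp>

// Return the elements of `lhs` that are not in `rhs`.
template <typename Elem>
hashset<Elem> operator-(const hashset<Elem>& lhs, const hashset<Elem>& rhs)
{
  hashset<Elem> result;

  for (const Elem& element : lhs) {
    if (!rhs.contains(element)) {
      result.insert(element);
    }
  }

  return result;
}
{code}

Call sites could then write the difference directly instead of hand-rolling the 
same loop in multiple places.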



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8725) Support max_duration for tasks

2018-03-26 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414617#comment-16414617
 ] 

Zhitao Li commented on MESOS-8725:
--

One minor decision I'm making is to require all tasks in the same group to 
have the same `max_duration` (either all absent, or all carrying the same 
value).

Keeping this as record here.

> Support max_duration for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timeline. If any tasks in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an  *optional* *TimeInfo deadline* to 
> TaskInfo field, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8725) Support deadline for tasks

2018-03-23 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412053#comment-16412053
 ] 

Zhitao Li commented on MESOS-8725:
--

{quote}Can you look into whether we could/should implement this in the agent?
{quote}
To recap our conversation, I think implementing this in the executor is 
preferred:
 * it is typically the executor that sends the *TASK_KILLED* state, so this 
follows the convention;
 * it is simpler to implement in the executor because its lifecycle is as long 
as the task's, so it does not need to checkpoint/recover this information, 
unlike the agent, which could restart during the task's lifetime;
 * the agent cannot honor kill_policy the way the executor does.

 

> Support deadline for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timeline. If any tasks in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an  *optional* *TimeInfo deadline* to 
> TaskInfo field, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8725) Support deadline for tasks

2018-03-23 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412045#comment-16412045
 ] 

Zhitao Li commented on MESOS-8725:
--

The following chain is a proof of concept in command executor:

[https://reviews.apache.org/r/66258/]

[https://reviews.apache.org/r/66259/]

[https://reviews.apache.org/r/66260/]

 

I will work on other executors after API review with dev list.

> Support deadline for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timeline. If any tasks in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an  *optional* *TimeInfo deadline* to 
> TaskInfo field, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8725) Support deadline for tasks

2018-03-23 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411704#comment-16411704
 ] 

Zhitao Li commented on MESOS-8725:
--

[~jpe...@apache.org], thanks for shepherding this. I'll start with a prototype 
chain for a new `*Deadline*` message on TaskInfo and an implementation/test on 
the command executor. If the end-to-end design looks good, I'll get to the 
other two executors (docker/default).

> Support deadline for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timeline. If any tasks in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an  *optional* *TimeInfo deadline* to 
> TaskInfo field, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8725) Support deadline for tasks

2018-03-23 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8725:


Shepherd: James Peach
Assignee: Zhitao Li

> Support deadline for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timeline. If any tasks in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an  *optional* *TimeInfo deadline* to 
> TaskInfo field, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8725) Support deadline for tasks

2018-03-22 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410613#comment-16410613
 ] 

Zhitao Li commented on MESOS-8725:
--

[~jamesmulcahy], we actually started down that path; however, these are some of 
the scalability difficulties we met:
 * limited compute resources on the scheduler: many schedulers take the same 
design as the Mesos master and only run one active process, and tracking a 
timer per task there uses up precious resources;
 * network partition: if the master/agent was under a network partition, the 
scheduler could not terminate the task;
 * recovery upon scheduler restart: this was the biggest problem for us; when 
our scheduler process restarted, it needed to recover "all" running tasks from 
the database and reconstruct what to do for each task (which is also a common 
pattern among schedulers). Any additional feature introduced there makes that 
process heavier;
 * cheaper to implement in the executor: with isolation mechanisms like `pid`, 
we expect the executor to have a sufficiently long lifecycle. Therefore, 
executors do not even need to maintain a busy thread, but can simply use a 
[Timer|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/timer.hpp]
 and terminate the task (see the sketch below).

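A minimal sketch of that last point, assuming the proposed `max_duration` field 
on TaskInfo and an executor-side `kill()` helper (both illustrative, not the 
final API):

{code:cpp}
#include <mesos/mesos.hpp>

#include <process/delay.hpp>
#include <process/process.hpp>

#include <stout/duration.hpp>

using namespace mesos;

class MyExecutorProcess : public process::Process<MyExecutorProcess>
{
public:
  void launchTask(const TaskInfo& task)
  {
    // ... start the task as usual ...

    if (task.has_max_duration()) {  // Proposed field, not in mesos.proto yet.
      // One-shot timer; no busy thread or scheduler round-trip needed.
      process::delay(
          Nanoseconds(task.max_duration().nanoseconds()),
          self(),
          &MyExecutorProcess::maxDurationReached,
          task.task_id());
    }
  }

  void maxDurationReached(const TaskID& taskId)
  {
    // Honor the task's kill_policy, then send TASK_KILLED (or a dedicated
    // state/reason) for `taskId`.
    kill(taskId);
  }

private:
  void kill(const TaskID& taskId) {}  // Placeholder for the real kill path.
};
{code}
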
> Support deadline for tasks
> --
>
> Key: MESOS-8725
> URL: https://issues.apache.org/jira/browse/MESOS-8725
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Priority: Major
>
> In our environment, we run a lot of batch jobs, some of which have tight 
> timeline. If any tasks in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an  *optional* *TimeInfo deadline* to 
> TaskInfo field, and all default executors in Mesos can simply terminate the 
> task and send a proper *StatusUpdate.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8725) Support deadline for tasks

2018-03-22 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8725:


 Summary: Support deadline for tasks
 Key: MESOS-8725
 URL: https://issues.apache.org/jira/browse/MESOS-8725
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


In our environment, we run a lot of batch jobs, some of which have tight 
timelines. If any task in the job runs longer than x hours, it does not make 
sense to run it anymore.
 
For instance, a team would submit a job which builds a weekly index and repeats 
every Monday. If the job does not finish before the next Monday for whatever 
reason, there is no point in keeping any task running.
 
We believe that implementing deadline tracking distributed across our cluster 
makes more sense, as it makes the system more scalable and also keeps our 
centralized state machine simpler.
 
One idea I have right now is to add an *optional* *TimeInfo deadline* field to 
TaskInfo, and all default executors in Mesos can simply terminate the task and 
send a proper *StatusUpdate*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8600) Add more permissive reconfiguration policies

2018-03-16 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402406#comment-16402406
 ] 

Zhitao Li commented on MESOS-8600:
--

[~bennoe], can you add whoever was part of the previous conversations? I am 
quoting from your comment:

"The most specific concern that we had at the time was that we were not
sure about the best way to handle health checks on agents where
the hostname changed. (together with a general feeling
that we needed a bit more time to think through possible failure
scenarios)"

I am very interested in helping out with any discussion or work to close this 
out, since we will run the patch in [https://reviews.apache.org/r/64384].

 

> Add more permissive reconfiguration policies
> 
>
> Key: MESOS-8600
> URL: https://issues.apache.org/jira/browse/MESOS-8600
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>
> With Mesos 1.5, the `reconfiguration_policy` flag was added to allow 
> reconfiguration of agents without necessarily draining all tasks.
> However, the current implementation only allows a limited set of changes, 
> with the `–reconfiguration_policy=all` setting laid out in the original 
> design doc not yet being implemented.
> This ticket is intended to track progress on implementing this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8411) Killing a queued task can lead to the command executor never terminating.

2018-03-14 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399654#comment-16399654
 ] 

Zhitao Li commented on MESOS-8411:
--

Hi, do you think it's possible to paste the log line patterns seen when this 
issue happens? That would help people triaging issues to know whether they are 
hitting the same problem.

 

Thanks.

> Killing a queued task can lead to the command executor never terminating.
> -
>
> Key: MESOS-8411
> URL: https://issues.apache.org/jira/browse/MESOS-8411
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Benjamin Mahler
>Assignee: Meng Zhu
>Priority: Critical
> Fix For: 1.4.2, 1.6.0, 1.5.1, 1.3.3
>
>
> If a task is killed while the executor is re-registering, we will remove it 
> from queued tasks and shut down the executor if all the its initial tasks 
> could not be delivered. However, there is a case (within {{Slave::___run}}) 
> where we leave the executor running, the race is:
> # Command-executor task launched.
> # Command executor sends registration message. Agent tells containerizer to 
> update the resources before it sends the tasks to the executor.
> # Kill arrives, and we synchronously remove the task from queued tasks.
> # Containerizer finishes updating the resources, and in {{Slave::___run}} the 
> killed task is ignored.
> # Command executor stays running!
> Executors could have a timeout to handle this case, but it's not clear that 
> all executors will implement this correctly. It would be better to have a 
> defensive policy that will shut down an executor if all of its initial batch 
> of tasks were killed prior to delivery.
> In order to implement this, one approach discussed with [~vinodkone] is to 
> look at the running + terminated but unacked + completed tasks, and if empty, 
> shut the executor down in the {{Slave::___run}} path. This will require us to 
> check that the completed task cache size is set to at least 1, and this also 
> assumes that the completed tasks are not cleared based on time or during 
> agent recovery.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors

2018-03-14 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399215#comment-16399215
 ] 

Zhitao Li edited comment on MESOS-8609 at 3/14/18 8:27 PM:
---

{noformat}
commit 82c50c0e00284c131354499f74176b19d89bd21d (HEAD -> master, origin/master, 
origin/HEAD)
Author: Zhitao Li 
Date:   Wed Mar 14 09:25:01 2018 -0700

Document new `slave/recovery_time_secs` gauge.

Review: https://reviews.apache.org/r/66070

commit b8526c61403214aaa67fa941b4e8b0fd8e3328f2
Author: Zhitao Li 
Date:   Wed Mar 7 15:18:53 2018 -0800

Added a test to make sure `slave/recovery_time_secs` is reported.

Review: https://reviews.apache.org/r/65959

commit 026dafd33cd23d41818e18e31ec271fa2c13abd2
Author: Zhitao Li 
Date:   Tue Mar 6 17:43:48 2018 -0800

Added a gauge for how long agent recovery takes.

The new metric `slave/recover_time_secs` can be used to tell us how long
Mesos agent needed to finish its recovery cycle. This is an important
metric on agent machines which have a lot of completed executor
sandboxes.

Note that the metric 1) will only be available after recovery succeeded
and 2) never change its value across agent process lifecycle afterwards.

Review: https://reviews.apache.org/r/65954
{noformat}


was (Author: zhitao):
commit 82c50c0e00284c131354499f74176b19d89bd21d (HEAD -> master, origin/master, 
origin/HEAD)
Author: Zhitao Li 
Date:   Wed Mar 14 09:25:01 2018 -0700

Document new `slave/recovery_time_secs` gauge.

Review: https://reviews.apache.org/r/66070

commit b8526c61403214aaa67fa941b4e8b0fd8e3328f2
Author: Zhitao Li 
Date:   Wed Mar 7 15:18:53 2018 -0800

Added a test to make sure `slave/recovery_time_secs` is reported.

Review: https://reviews.apache.org/r/65959

commit 026dafd33cd23d41818e18e31ec271fa2c13abd2
Author: Zhitao Li 
Date:   Tue Mar 6 17:43:48 2018 -0800

Added a gauge for how long agent recovery takes.

The new metric `slave/recover_time_secs` can be used to tell us how long
Mesos agent needed to finish its recovery cycle. This is an important
metric on agent machines which have a lot of completed executor
sandboxes.

Note that the metric 1) will only be available after recovery succeeded
and 2) never change its value across agent process lifecycle afterwards.

Review: https://reviews.apache.org/r/65954


> Create a metric to indicate how long agent takes to recover executors
> -
>
> Key: MESOS-8609
> URL: https://issues.apache.org/jira/browse/MESOS-8609
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>  Labels: Metrics, agent
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors

2018-03-14 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399215#comment-16399215
 ] 

Zhitao Li edited comment on MESOS-8609 at 3/14/18 7:59 PM:
---

commit 82c50c0e00284c131354499f74176b19d89bd21d (HEAD -> master, origin/master, 
origin/HEAD)
Author: Zhitao Li 
Date:   Wed Mar 14 09:25:01 2018 -0700

Document new `slave/recovery_time_secs` gauge.

Review: https://reviews.apache.org/r/66070

commit b8526c61403214aaa67fa941b4e8b0fd8e3328f2
Author: Zhitao Li 
Date:   Wed Mar 7 15:18:53 2018 -0800

Added a test to make sure `slave/recovery_time_secs` is reported.

Review: https://reviews.apache.org/r/65959

commit 026dafd33cd23d41818e18e31ec271fa2c13abd2
Author: Zhitao Li 
Date:   Tue Mar 6 17:43:48 2018 -0800

Added a gauge for how long agent recovery takes.

The new metric `slave/recover_time_secs` can be used to tell us how long
Mesos agent needed to finish its recovery cycle. This is an important
metric on agent machines which have a lot of completed executor
sandboxes.

Note that the metric 1) will only be available after recovery succeeded
and 2) never change its value across agent process lifecycle afterwards.

Review: https://reviews.apache.org/r/65954



was (Author: zhitao):
commit 82c50c0e00284c131354499f74176b19d89bd21d (HEAD -> master, origin/master, 
origin/HEAD)

Author: Zhitao Li 

Date:   Wed Mar 14 09:25:01 2018 -0700

 

    Document new `slave/recovery_time_secs` gauge.

    

    Review: https://reviews.apache.org/r/66070

 

commit b8526c61403214aaa67fa941b4e8b0fd8e3328f2

Author: Zhitao Li 

Date:   Wed Mar 7 15:18:53 2018 -0800

 

    Added a test to make sure `slave/recovery_time_secs` is reported.

    

    Review: https://reviews.apache.org/r/65959

 

commit 026dafd33cd23d41818e18e31ec271fa2c13abd2

Author: Zhitao Li 

Date:   Tue Mar 6 17:43:48 2018 -0800

 

    Added a gauge for how long agent recovery takes.

    

    The new metric `slave/recover_time_secs` can be used to tell us how long

    Mesos agent needed to finish its recovery cycle. This is an important

    metric on agent machines which have a lot of completed executor

    sandboxes.

    

    Note that the metric 1) will only be available after recovery succeeded

    and 2) never change its value across agent process lifecycle afterwards.

    

    Review: https://reviews.apache.org/r/65954

 

> Create a metric to indicate how long agent takes to recover executors
> -
>
> Key: MESOS-8609
> URL: https://issues.apache.org/jira/browse/MESOS-8609
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>  Labels: Metrics, agent
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors

2018-03-14 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399019#comment-16399019
 ] 

Zhitao Li edited comment on MESOS-8609 at 3/14/18 6:11 PM:
---

[https://reviews.apache.org/r/65954/] (implementation)

[https://reviews.apache.org/r/65959/] (test)

[https://reviews.apache.org/r/66070/] (document)


was (Author: zhitao):
[https://reviews.apache.org/r/65954/] (prod)

[https://reviews.apache.org/r/65959/] (test)

[https://reviews.apache.org/r/66070/] (document)

> Create a metric to indicate how long agent takes to recover executors
> -
>
> Key: MESOS-8609
> URL: https://issues.apache.org/jira/browse/MESOS-8609
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>  Labels: Metrics, agent
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7461) balloon test and disk full framework test relies on possibly unavailable ports

2018-03-14 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398987#comment-16398987
 ] 

Zhitao Li commented on MESOS-7461:
--

The issue with disk_full_framework.sh using a fixed port is still there. I'll 
try to submit a fix that uses some bash tricks to select a random unused port 
(ideally in the ephemeral port range) instead (similar to 
https://stackoverflow.com/questions/6942097/finding-next-open-port). A C++ 
variant of the same idea is sketched below.

[~andschwa], do we run these tests on windows build?

> balloon test and disk full framework test relies on possibly unavailable ports
> --
>
> Key: MESOS-7461
> URL: https://issues.apache.org/jira/browse/MESOS-7461
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> balloon_framework_test.sh and disk_full_framework_test.sh both have code 
> that listens directly on port {{5432}}, but in our environment that port is 
> already reserved by something else.
> A possible fix is to write some utility to try to find an unused port, and 
> try to use it for the master. It's not perfect though, as there could still 
> be a race condition.
> Another possible fix is to move the listen "port" to a domain socket, when 
> that's supported.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7461) balloon test and disk full framework test relies on possibly unavailable ports

2018-03-14 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-7461:


Assignee: Zhitao Li

> balloon test and disk full framework test relies on possibly unavailable ports
> --
>
> Key: MESOS-7461
> URL: https://issues.apache.org/jira/browse/MESOS-7461
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> balloon_framework_test.sh and disk_full_framework_test.sh both have code 
> that listens directly on port {{5432}}, but in our environment that port is 
> already reserved by something else.
> A possible fix is to write some utility to try to find an unused port, and 
> try to use it for the master. It's not perfect though, as there could still 
> be a race condition.
> Another possible fix is to move the listen "port" to a domain socket, when 
> that's supported.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8663) Support transfer of persistent volume between roles without losing data

2018-03-12 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8663:


 Summary: Support transfer of persistent volume between roles 
without losing data
 Key: MESOS-8663
 URL: https://issues.apache.org/jira/browse/MESOS-8663
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


A persistent volume is scoped to a role right now. On the agent, the volume 
path actually includes the role string.

A possible workflow from the operator side is to transfer the volume to a new 
role (e.g., a reorg which transfers ownership of a database to a new team). 
Today the operator would need to create a new volume and delete the old one, 
during which data could be lost (unless they perform a manual data migration 
outside of Mesos' control).

This would be an operational blocker for us to try out hierarchical roles once 
they support persistent volumes.

The goal here is to see whether Mesos can provide an atomic operation which 
allows transferring the volume to a different role without having to 
delete/recreate it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388599#comment-16388599
 ] 

Zhitao Li edited comment on MESOS-6918 at 3/6/18 10:07 PM:
---

[~jamespeach], do you think it's feasible to target some of this work for 1.6? 
We are interested in using this format for our monitoring on the master/agent.

The issue we have is that we need to hardcode whether a metric is a gauge or a 
counter because our monitoring system treats them differently, and that 
hard-coded list was never maintainable.


was (Author: zhitao):
[~jamespeach], do you think it's feasible to target some of this work for 1.6? 
We are interested in reusing some functionalities here.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388599#comment-16388599
 ] 

Zhitao Li commented on MESOS-6918:
--

[~jamespeach], do you think it's feasible to target some of this work for 1.6? 
We are interested in reusing some functionalities here.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388552#comment-16388552
 ] 

Zhitao Li edited comment on MESOS-4965 at 3/6/18 9:24 PM:
--

WIP [design 
doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#]
 (mostly gather information)


was (Author: zhitao):
WIP[ design 
doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#]
 (mostly gather information)

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388552#comment-16388552
 ] 

Zhitao Li commented on MESOS-4965:
--

WIP[ design 
doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#]
 (mostly gather information)

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8641) New heartbeat on event stream could change the behavior for subscriber

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388094#comment-16388094
 ] 

Zhitao Li commented on MESOS-8641:
--

Attempt to fix: 

https://reviews.apache.org/r/65930

> New heartbeat on event stream could change the behavior for subscriber
> --
>
> Key: MESOS-8641
> URL: https://issues.apache.org/jira/browse/MESOS-8641
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.5.0
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> A new event for heartbeat is added in 
> [MESOS-7695|https://reviews.apache.org/r/61262/bugs/MESOS-7695/], but I 
> believe the implementation in [https://reviews.apache.org/r/61262/] can 
> trigger a corner case and send *_HEARTBEAT_* before _*SUBSCRIBED*_
>  
> I would consider this a behavior change for the customer and I propose we 
> change the order as I suggest in the review to preserve previous behavior 
> (since the subscriber needs to see the _*SUBSCRIBED*_ event to really know 
> how it should respond to *_HEARTBEAT_* message anyway)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8641) New heartbeat on event stream could change the behavior for subscriber

2018-03-05 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8641:


 Summary: New heartbeat on event stream could change the behavior 
for subscriber
 Key: MESOS-8641
 URL: https://issues.apache.org/jira/browse/MESOS-8641
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API
Reporter: Zhitao Li
Assignee: Zhitao Li


A new heartbeat event was added in 
[MESOS-7695|https://reviews.apache.org/r/61262/bugs/MESOS-7695/], but I believe 
the implementation in [https://reviews.apache.org/r/61262/] can trigger a 
corner case and send *_HEARTBEAT_* before _*SUBSCRIBED*_.

 

I would consider this a behavior change for the customer, and I propose we 
change the order as I suggest in the review.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8637) Persistent volume doc missed related operator API calls

2018-03-05 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8637:


 Summary: Persistent volume doc missed related operator API calls
 Key: MESOS-8637
 URL: https://issues.apache.org/jira/browse/MESOS-8637
 Project: Mesos
  Issue Type: Improvement
  Components: documentation
Reporter: Zhitao Li


CREATE_VOLUME and DESTROY_VOLUME are not mentioned in 
[http://mesos.apache.org/documentation/latest/persistent-volume/]. This could 
create confusion for users about the possible ways to create/destroy volumes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-02 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-4965:


Assignee: Zhitao Li

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors

2018-02-24 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8609:


 Summary: Create a metric to indicate how long agent takes to 
recover executors
 Key: MESOS-8609
 URL: https://issues.apache.org/jira/browse/MESOS-8609
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8609) Create a metric to indicate how long agent takes to recover executors

2018-02-24 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8609:


Assignee: Zhitao Li

> Create a metric to indicate how long agent takes to recover executors
> -
>
> Key: MESOS-8609
> URL: https://issues.apache.org/jira/browse/MESOS-8609
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8506) Add test coverage for `Resources::find` on revocable resources

2018-01-29 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8506:


Shepherd: James Peach
Assignee: Zhitao Li

> Add test coverage for `Resources::find` on revocable resources
> --
>
> Key: MESOS-8506
> URL: https://issues.apache.org/jira/browse/MESOS-8506
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Trivial
>
> In the process of fixing MESOS-8471, we want to add some tests on handling of 
> revocable resources in `Resources::find()` to avoid further regression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8506) Add test coverage for `Resources::find` on revocable resources

2018-01-29 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8506:


 Summary: Add test coverage for `Resources::find` on revocable 
resources
 Key: MESOS-8506
 URL: https://issues.apache.org/jira/browse/MESOS-8506
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: Zhitao Li


In the process of fixing MESOS-8471, we want to add some tests on handling of 
revocable resources in `Resources::find()` to avoid further regression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8471) Allow revocable_resources capability for mesos-execute

2018-01-29 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8471:


Shepherd: James Peach
Assignee: Zhitao Li

> Allow revocable_resources capability for mesos-execute
> --
>
> Key: MESOS-8471
> URL: https://issues.apache.org/jira/browse/MESOS-8471
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>
> While mesos-execute is a nice tool to quickly test certain behavior of Mesos 
> itself without an external framework, it seems there is no direct way to 
> test revocable support in it.
> A quick test with the binary suggests that if we infer *REVOCABLE_RESOURCES* 
> capability from input, this should allow revocable resources on `task` or 
> `task_group` to be launched to Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-24 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338114#comment-16338114
 ] 

Zhitao Li edited comment on MESOS-8480 at 1/24/18 7:39 PM:
---

Will this also be cherry-picked to 1.5.0, since the RC is still not finalized?


was (Author: zhitao):
Will this be also back ported to 1.5.0 since the RC is still not finalized yet?

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//cgroup}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}
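
To make the race described above concrete, here is a small standalone C++ 
sketch of the same two-step lookup (parse /proc/<pid>/cgroup for the cpuacct 
subsystem, then read cpuacct.stat under /sys/fs/cgroup). This is not the 
Mesos implementation and the helper name is made up; it only shows why a 
process migrated to the root cgroup between the two steps yields the root 
cgroup's statistics.

{code:none}
// Standalone sketch of the racy lookup (illustrative, not Mesos code).
#include <unistd.h>

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Step 1: parse the cgroup of `pid` for a subsystem (e.g. "cpuacct") from
// /proc/<pid>/cgroup, whose lines look like:
//   9:cpuacct,cpu:/docker/66fbe67b64ad...
std::string parseCgroup(pid_t pid, const std::string& subsystem)
{
  std::ifstream proc("/proc/" + std::to_string(pid) + "/cgroup");
  std::string line;
  while (std::getline(proc, line)) {
    std::istringstream in(line);
    std::string id, controllers, cgroup;
    std::getline(in, id, ':');
    std::getline(in, controllers, ':');
    std::getline(in, cgroup);
    if (("," + controllers + ",").find("," + subsystem + ",") !=
        std::string::npos) {
      // During teardown this can already be just "/".
      return cgroup;
    }
  }
  return "/";
}

int main(int argc, char** argv)
{
  pid_t pid = argc > 1 ? std::stoi(argv[1]) : getpid();

  // Step 2: read the per-cgroup statistics. If step 1 returned "/", this
  // resolves to the root cgroup's cpuacct.stat, i.e. the huge numbers above.
  const std::string path =
    "/sys/fs/cgroup/cpuacct" + parseCgroup(pid, "cpuacct") + "/cpuacct.stat";

  std::ifstream stat(path);
  std::cout << "Reading " << path << std::endl;
  std::cout << stat.rdbuf();
  return 0;
}
{code}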



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8480) Mesos returns high resource usage when killing a Docker task.

2018-01-24 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338114#comment-16338114
 ] 

Zhitao Li commented on MESOS-8480:
--

Will this also be backported to 1.5.0, since the RC is still not finalized?

> Mesos returns high resource usage when killing a Docker task.
> -
>
> Key: MESOS-8480
> URL: https://issues.apache.org/jira/browse/MESOS-8480
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>Priority: Major
> Fix For: 1.3.2, 1.4.2, 1.6.0, 1.5.1
>
> Attachments: test.cpp
>
>
> The way we get resource statistics for Docker tasks is through getting the 
> cgroup subsystem path through {{/proc//cgroup}} first (taking the 
> {{cpuacct}} subsystem as an example):
> {noformat}
> 9:cpuacct,cpu:/docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b
> {noformat}
> Then read 
> {{/sys/fs/cgroup/cpuacct//docker/66fbe67b64ad3a86c6e080e18578bc9e540e55ee0bdcae09c2e131a4264a3a3b/cpuacct.stat}}
>  to get the statistics:
> {noformat}
> user 4
> system 0
> {noformat}
> However, when a Docker container is being torn down, it seems that Docker 
> or the operating system will first move the process to the root cgroup before 
> actually killing it, making {{/proc//docker}} look like the following:
> {noformat}
> 9:cpuacct,cpu:/
> {noformat}
> This makes a racy call to 
> [{{cgroup::internal::cgroup()}}|https://github.com/apache/mesos/blob/master/src/linux/cgroups.cpp#L1935]
>  return a single '/', which in turn makes 
> [{{DockerContainerizerProcess::cgroupsStatistics()}}|https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L1991]
>  read {{/sys/fs/cgroup/cpuacct///cpuacct.stat}}, which contains the 
> statistics for the root cgroup:
> {noformat}
> user 228058750
> system 24506461
> {noformat}
> This can be reproduced by [^test.cpp] with the following command:
> {noformat}
> $ docker run --name sleep -d --rm alpine sleep 1000; ./test $(docker inspect 
> sleep | jq .[].State.Pid) & sleep 1 && docker rm -f sleep
> ...
> Reading file '/proc/44224/cgroup'
> Reading file 
> '/sys/fs/cgroup/cpuacct//docker/1d79a6c877e2af3081630aa57d23d853e6bd7d210dad28f897556bfea20bc9c1/cpuacct.stat'
> user 4
> system 0
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Reading file '/proc/44224/cgroup'
> Reading file '/sys/fs/cgroup/cpuacct///cpuacct.stat'
> user 228058750
> system 24506461
> Failed to open file '/proc/44224/cgroup'
> sleep
> [2]-  Exit 1  ./test $(docker inspect sleep | jq 
> .[].State.Pid)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8471) Allow revocable_resources capability for mesos-execute

2018-01-23 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336149#comment-16336149
 ] 

Zhitao Li commented on MESOS-8471:
--

A quick attempt is at https://reviews.apache.org/r/65294/

> Allow revocable_resources capability for mesos-execute
> --
>
> Key: MESOS-8471
> URL: https://issues.apache.org/jira/browse/MESOS-8471
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Zhitao Li
>Priority: Minor
>
> While mesos-execute is a nice tool to quickly test certain behavior of Mesos 
> itself without an external framework, it seems there is no direct way to 
> test revocable support in it.
> A quick test with the binary suggests that if we infer *REVOCABLE_RESOURCES* 
> capability from input, this should allow revocable resources on `task` or 
> `task_group` to be launched to Mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8471) Allow revocable_resources capability for mesos-execute

2018-01-21 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8471:


 Summary: Allow revocable_resources capability for mesos-execute
 Key: MESOS-8471
 URL: https://issues.apache.org/jira/browse/MESOS-8471
 Project: Mesos
  Issue Type: Improvement
  Components: cli
Reporter: Zhitao Li


While mesos-execute is a nice tool to quickly test certain behavior of Mesos 
itself without an external framework, it seems there is no direct way to test 
revocable support in it.

A quick test with the binary suggests that if we infer *REVOCABLE_RESOURCES* 
capability from input, this should allow revocable resources on `task` or 
`task_group` to be launched to Mesos.
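
As a rough sketch, "inferring the capability from input" could look like the 
following on the framework side. {{FrameworkInfo}} and its 
{{REVOCABLE_RESOURCES}} capability are part of the public Mesos API; the 
helper function and where it would be called from are illustrative, not the 
actual mesos-execute change.

{code:none}
// Sketch: add the REVOCABLE_RESOURCES framework capability whenever the
// requested resources contain anything revocable (illustrative only).
#include <mesos/mesos.hpp>
#include <mesos/resources.hpp>

using mesos::FrameworkInfo;
using mesos::Resource;
using mesos::Resources;

void maybeAddRevocableCapability(
    const Resources& requested, FrameworkInfo* framework)
{
  for (const Resource& resource : requested) {
    if (resource.has_revocable()) {
      framework->add_capabilities()->set_type(
          FrameworkInfo::Capability::REVOCABLE_RESOURCES);
      return;
    }
  }
}
{code}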



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume

2018-01-21 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333751#comment-16333751
 ] 

Zhitao Li commented on MESOS-8161:
--

The `TASK_ERROR` state was picked by the framework author without a good 
understanding of the problem.

I talked to the framework owner. The error was generated because they had code 
in the executor which may access the persistent volume after a task completes. My 
advice to them was to mount the persistent volume onto the executor instead, so 
the unmount only happens after every task as well as the executor itself has 
terminated, and no code should be accessing the persistent volume anymore.

There are enough fixes on the framework side and we can no longer reliably 
reproduce the issue. I'll close the ticket for now. If we observe similar 
behavior, I'll report it again.

 

Thanks for your time, [~jieyu] and [~gilbert].

> Potentially dangerous dangling mount when stopping task with persistent volume
> --
>
> Key: MESOS-8161
> URL: https://issues.apache.org/jira/browse/MESOS-8161
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>Priority: Major
>
> While we fixed a case in MESOS-7366 when an executor terminates, it seems 
> like a very similar case can still happen if a task with a persistent volume 
> terminates while the executor is still active, and [this unmount 
> call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
>  fails due to "device busy".
> I believe that if agent gc or other things run in the host mount 
> namespace, it is possible to lose persistent volume data because of this.
> Agent log:
> {code:none}
> I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for 
> status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for 
> task node-1__23fa9624-4608-404f-8d6f-0235559588
> 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
> executor(1)@10.70.142.140:36929
> I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888
> f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK 
> for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) 
> for task node-1__23fa9624-4608-404f-8d6f-0
> 2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update 
> TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db6
> 1f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
> I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f o
> f framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing 
> UPDATE for status update TASK_RUNNING (UUID: 
> c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6
> f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
> I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update 
> TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
> node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61
> f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
> I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for 
> status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for 
> task node-1__23fa9624-4608-404f-8d6f-0235559588
> 8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
> executor(1)@10.70.142.140:36929
> I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max 
> allowed age: 6.283560425078715days
> I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
> I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 
> 10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
> I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json 
> from 10.70.142.140:56414 with User-Agent='filebundle-agent'
> I1101 20:20:07.913359 102216 

[jira] [Updated] (MESOS-6893) Track total docker image layer size in store

2018-01-08 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6893:
-
   Priority: Minor  (was: Major)
Description: We want to give cluster operators some insight into the total size 
of docker image layers in the store so we can use it for monitoring purposes.
Component/s: containerization
 Issue Type: Improvement  (was: Task)
Summary: Track total docker image layer size in store  (was: Track 
docker layer size and access time)

> Track total docker image layer size in store
> 
>
> Key: MESOS-6893
> URL: https://issues.apache.org/jira/browse/MESOS-6893
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Minor
>
> We want to give cluster operators some insight into the total size of docker image 
> layers in the store so we can use it for monitoring purposes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.

2018-01-08 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316816#comment-16316816
 ] 

Zhitao Li commented on MESOS-4945:
--

That one is not necessarily part of this epic. I'll move it out.

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>  Labels: Mesosphere
> Fix For: 1.5.0
>
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8365) Create AuthN support for prune images API

2017-12-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8365:


 Summary: Create AuthN support for prune images API
 Key: MESOS-8365
 URL: https://issues.apache.org/jira/browse/MESOS-8365
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li
Assignee: Zhitao Li


We want to make sure there is a way to configure AuthZ for the new API added in 
MESOS-8360.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8365) Create AuthN support for prune images API

2017-12-28 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8365:
-
Target Version/s: 1.5.0

> Create AuthN support for prune images API
> -
>
> Key: MESOS-8365
> URL: https://issues.apache.org/jira/browse/MESOS-8365
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We want to make sure there is a way to configure AuthZ for the new API added in 
> MESOS-8360.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8358) Create agent endpoints for pruning images

2017-12-22 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8358:
-
Issue Type: Improvement  (was: Bug)

> Create agent endpoints for pruning images
> -
>
> Key: MESOS-8358
> URL: https://issues.apache.org/jira/browse/MESOS-8358
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> This is a follow-up on MESOS-4945: we agreed that we should create an HTTP 
> endpoint on the agent to manually trigger image gc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8358) Create agent endpoints for pruning images

2017-12-22 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8358:


 Summary: Create agent endpoints for pruning images
 Key: MESOS-8358
 URL: https://issues.apache.org/jira/browse/MESOS-8358
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li
Assignee: Zhitao Li


This is a follow-up on MESOS-4945: we agreed that we should create an HTTP 
endpoint on the agent to manually trigger image gc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8353) Duplicate task for same framework on multiple agents crashes out master after failover

2017-12-20 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8353:


 Summary: Duplicate task for same framework on multiple agents 
crashes out master after failover
 Key: MESOS-8353
 URL: https://issues.apache.org/jira/browse/MESOS-8353
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


We have seen a Mesos master crash loop after a leader failover. After more 
investigation, it appears that the same task ID was created on multiple Mesos 
agents in the cluster.

One possible logical sequence which can lead to such a problem:

1. Task T1 was launched through master M1 on agent A1 for framework F;
2. Master M1 failed over to M2;
3. Before A1 reregistered with M2, the same T1 was launched onto agent A2: M2 
did not know about the previous T1 yet, so it accepted the task and sent it to A2;
4. A1 reregistered: this probably crashed M2 (because the same task cannot be 
added twice);
5. When M3 tried to come up after M2, it crashed as well because both A1 and A2 
tried to add T1 to the framework.

(I only have logs to prove the last step right now.)

This happened on 1.4.0 masters.

Although this is probably triggered by incorrect retry logic on the framework 
side, I wonder whether the Mesos master should add extra protection to prevent 
such issues. One possible idea is to instruct one of the agents carrying tasks 
with a duplicate ID to terminate the corresponding tasks, or to simply refuse 
to reregister such agents and instruct them to shut down.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8324) Add succeeded metric to container launch in Mesos agent

2017-12-12 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8324:


 Summary: Add succeeded metric to container launch in Mesos agent
 Key: MESOS-8324
 URL: https://issues.apache.org/jira/browse/MESOS-8324
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


The only metric on the agent related to containerizer stability is 
"slave/container_launch_errors", and it does not track standalone/nested 
containers.

I propose we add a container_launch_succeeded counter to track all container 
launches in the containerizer, and also make sure the `error` counter tracks 
standalone and nested containers.
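
For illustration, such a counter could be wired up with the libprocess metrics 
library roughly as below. The metric name follows the proposal in this ticket 
and the surrounding struct is made up; this is a sketch, not the actual 
containerizer code.

{code:none}
// Sketch: a counter for successful container launches, registered with
// libprocess metrics and bumped wherever a launch completes.
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

struct ContainerizerMetrics
{
  ContainerizerMetrics()
    : container_launch_succeeded(
          "containerizer/mesos/container_launch_succeeded")
  {
    process::metrics::add(container_launch_succeeded);
  }

  ~ContainerizerMetrics()
  {
    process::metrics::remove(container_launch_succeeded);
  }

  process::metrics::Counter container_launch_succeeded;
};

// On every successful launch (including standalone and nested containers):
//   metrics.container_launch_succeeded++;
{code}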



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8323) Separate resource fetching timeout from executor_registration_timeout

2017-12-12 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8323:


 Summary: Separate resource fetching timeout from 
executor_registration_timeout
 Key: MESOS-8323
 URL: https://issues.apache.org/jira/browse/MESOS-8323
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


Containers can have images/resources of widely varying size, so it's desirable 
to have a fetch timeout (a duration) that is separate from the executor 
registration timeout.

[~bmahler], can we also agree this should be customizable per task launch 
request (which can hopefully provide a better value based on its knowledge of 
the artifact size)?
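
A minimal sketch of how such a duration flag could be declared with stout's 
flags library follows. The flag name {{fetcher_timeout}} and its default are 
assumptions for illustration; no such agent flag is defined by this ticket yet.

{code:none}
// Sketch: a hypothetical agent flag for the artifact fetch timeout,
// separate from --executor_registration_timeout. Name/default are made up.
#include <stout/duration.hpp>
#include <stout/flags.hpp>

struct TestFlags : public flags::FlagsBase
{
  TestFlags()
  {
    add(&TestFlags::fetcher_timeout,
        "fetcher_timeout",
        "Maximum time the fetcher may spend downloading a task's URIs\n"
        "before the launch is failed.",
        Minutes(5));
  }

  Duration fetcher_timeout;
};

// Usage sketch: flags.load("MESOS_", argc, argv) would accept a command line
// like --fetcher_timeout=10mins and parse it into a Duration.
{code}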



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8070) Bundled GRPC build does not build on Debian 8

2017-12-10 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285336#comment-16285336
 ] 

Zhitao Li commented on MESOS-8070:
--

[~gilbert], can we make sure this makes it into the 1.5 release? Thanks!

> Bundled GRPC build does not build on Debian 8
> -
>
> Key: MESOS-8070
> URL: https://issues.apache.org/jira/browse/MESOS-8070
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>Assignee: Chun-Hung Hsiao
> Fix For: 1.5.0
>
>
> Debian 8 includes an outdated version of libc-ares-dev, which prevents 
> bundled GRPC to build.
> I believe [~chhsia0] already has a fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8280) Mesos Containerizer GC should set 'layers' after checkpointing layer ids in provisioner.

2017-11-29 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8280:


Assignee: Zhitao Li

> Mesos Containerizer GC should set 'layers' after checkpointing layer ids in 
> provisioner.
> 
>
> Key: MESOS-8280
> URL: https://issues.apache.org/jira/browse/MESOS-8280
> Project: Mesos
>  Issue Type: Bug
>  Components: image-gc, provisioner
>Reporter: Gilbert Song
>Assignee: Zhitao Li
>Priority: Critical
>  Labels: containerizer, image-gc, provisioner
>
> {noformat}
> 1
> 22
> 33
> 44
> 1
> 22
> 33
> 44
> I1129 23:24:45.469543  6592 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/MVgVC7/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/MVgVC7/38135e3743e6dcb66bd1394b633053714333c7b7cf930bfeebfda660c06e/rootfs.overlay'
> I1129 23:24:45.473287  6592 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/MVgVC7/sha256:b56ae66c29370df48e7377c8f9baa744a3958058a766793f821dadcb144a4647
>  to rootfs 
> '/tmp/mesos/store/docker/staging/MVgVC7/b5815a31a59b66c909dbf6c670de78690d4b52649b8e283fc2bfd2594f61cca3/rootfs.overlay'
> I1129 23:24:45.582002  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/e28617c6dd2169bfe2b10017dfaa04bd7183ff840c4f78ebe73fca2a89effeb6/rootfs.overlay'
> I1129 23:24:45.589404  6595 metadata_manager.cpp:167] Successfully cached 
> image 'alpine'
> I1129 23:24:45.590204  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/be4ce2753831b8952a5b797cf45b2230e1befead6f5db0630bcb24a5f554255e/rootfs.overlay'
> I1129 23:24:45.595190  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/53b5066c5a7dff5d6f6ef0c1945572d6578c083d550d2a3d575b4cdf7460306f/rootfs.overlay'
> I1129 23:24:45.599500  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/a9eb172552348a9a49180694790b33a1097f546456d041b6e82e4d7716ddb721/rootfs.overlay'
> I1129 23:24:45.602047  6597 provisioner.cpp:506] Provisioning image rootfs 
> '/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/rootfses/b5d48445-848d-4274-a4f8-e909351ebc35'
>  for container 
> 3bbc3fd1-0138-43a9-94ba-d017d813daac.01de09c5-d8e9-412e-8825-a592d2c875e5 
> using overlay backend
> I1129 23:24:45.602751  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:1db09adb5ddd7f1a07b6d585a7db747a51c7bd17418d47e91f901bdf420abd66
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/120e218dd395ec314e7b6249f39d2853911b3d6def6ea164ae05722649f34b16/rootfs.overlay'
> I1129 23:24:45.603054  6596 overlay.cpp:168] Created symlink 
> '/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/links'
>  -> '/tmp/xAWQ8y'
> I1129 23:24:45.604398  6596 overlay.cpp:196] Provisioning image rootfs with 
> overlayfs: 
> 'lowerdir=/tmp/xAWQ8y/1:/tmp/xAWQ8y/0,upperdir=/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/upperdir,workdir=/tmp/provisioner/containers/3bbc3fd1-0138-43a9-94ba-d017d813daac/containers/01de09c5-d8e9-412e-8825-a592d2c875e5/backends/overlay/scratch/b5d48445-848d-4274-a4f8-e909351ebc35/workdir'
> I1129 23:24:45.607802  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> '/tmp/mesos/store/docker/staging/6Zbc17/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
>  to rootfs 
> '/tmp/mesos/store/docker/staging/6Zbc17/42eed7f1bf2ac3f1610c5e616d2ab1ee9c7290234240388d6297bc0f32c34229/rootfs.overlay'
> I1129 23:24:45.612139  6594 registry_puller.cpp:395] Extracting layer tar 
> ball 
> 

[jira] [Commented] (MESOS-7366) Agent sandbox gc could accidentally delete the entire persistent volume content

2017-11-01 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235004#comment-16235004
 ] 

Zhitao Li commented on MESOS-7366:
--

I filed MESOS-8161 for the other case.

> Agent sandbox gc could accidentally delete the entire persistent volume 
> content
> ---
>
> Key: MESOS-7366
> URL: https://issues.apache.org/jira/browse/MESOS-7366
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Zhitao Li
>Assignee: Jie Yu
>Priority: Blocker
> Fix For: 1.0.4, 1.1.2, 1.2.1
>
>
> When 1) a persistent volume is mounted, 2) the umount is stuck or similar, and 3) 
> the executor directory gc is invoked, the agent seems to emit a log like:
> ```
>  Failed to delete directory  /runs//volume: Device or 
> resource busy
> ```
> After this, the persistent volume directory is empty.
> This could trigger data loss on critical workloads, so we should fix this ASAP.
> The triggering environment is a custom executor w/o rootfs image.
> Please let me know if you need more signal.
> {noformat}
> I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
>  to user 'uber'
> I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources 
> cpus(cassandra-cstar-location-store, cassandra, {resource_id: 
> 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; 
> mem(cassandra-cstar-location-store, cassandra, {resource_id: 
> 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; 
> ports(cassandra-cstar-location-store, cassandra, {resource_id: 
> fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
> I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container 
> d5a56564-3e24-4c60-9919-746710b78377 for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
> I0407 15:18:22.767514 22766 linux.cpp:730] Mounting 
> '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff'
>  to 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume'
>  for persistent volume disk(cassandra-cstar-location-store, cassandra, 
> {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's 
> forked pid 6892 to 
> '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
> I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
> I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 
> 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 
> 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 
> 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
> I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount 
> '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/fra
> meworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a5656
> 4-3e24-4c60-9919-746710b78377/volume' for persistent volume 
> disk(cassandra-cstar-location-store, cassandra, {resource_id: 
> fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445
>  of container d5a56564-3e24-4c60-9919-746710b78377
> E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for 
> container 

[jira] [Created] (MESOS-8161) Potentially dangerous dangling mount when stopping task with persistent volume

2017-11-01 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8161:


 Summary: Potentially dangerous dangling mount when stopping task 
with persistent volume
 Key: MESOS-8161
 URL: https://issues.apache.org/jira/browse/MESOS-8161
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li
Priority: Critical


While we fixed a case in MESOS-7366 when an executor terminates, it seems like 
a very similar case can still happen if a task with a persistent volume 
terminates while the executor is still active, and [this unmount 
call|https://github.com/apache/mesos/blob/6f98b8d6d149c5497d16f588c683a68fccba4fc9/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L489]
 fails due to "device busy".

I believe that if agent gc or other things run in the host mount 
namespace, it is possible to lose persistent volume data because of this.

Agent log:

{code:none}
I1101 20:19:44.137109 102240 slave.cpp:3961] Sending acknowledgement for status 
update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task 
node-1__23fa9624-4608-404f-8d6f-0235559588
8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
executor(1)@10.70.142.140:36929
I1101 20:19:44.235196 102233 status_update_manager.cpp:395] Received status 
update acknowledgement (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888
f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:44.235302 102233 status_update_manager.cpp:832] Checkpointing ACK 
for status update TASK_RUNNING (UUID: ecdd32b8-8eba-40c5-92c8-3398310f142b) for 
task node-1__23fa9624-4608-404f-8d6f-0
2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:59.135591 102213 slave.cpp:3634] Handling status update 
TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db6
1f6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
I1101 20:19:59.136494 102216 status_update_manager.cpp:323] Received status 
update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f o
f framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:59.136540 102216 status_update_manager.cpp:832] Checkpointing 
UPDATE for status update TASK_RUNNING (UUID: 
c1667f59-b404-43ab-b096-b12397fb00f0) for task node-1__23fa9624-4608-404f-8d6
f-02355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:19:59.136724 102234 slave.cpp:4051] Forwarding the update TASK_RUNNING 
(UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61
f6d4-fd0f-48be-927d-14282c12301f-0014 to master@10.162.12.31:5050
I1101 20:19:59.136867 102234 slave.cpp:3961] Sending acknowledgement for status 
update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-0235559588
8f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014 to 
executor(1)@10.70.142.140:36929
I1101 20:20:02.010108 102223 http.cpp:277] HTTP GET for /slave(1)/flags from 
10.70.142.140:43046 with User-Agent='Python-urllib/2.7'
I1101 20:20:02.038574 102238 http.cpp:277] HTTP GET for /slave(1)/flags from 
10.70.142.140:43144 with User-Agent='Python-urllib/2.7'
I1101 20:20:02.246388 102237 slave.cpp:5044] Current disk usage 0.23%. Max 
allowed age: 6.283560425078715days
I1101 20:20:02.445312 102235 http.cpp:277] HTTP GET for /slave(1)/state.json 
from 10.70.142.140:44716 with User-Agent='Python-urllib/2.7'
I1101 20:20:02.448276 102215 http.cpp:277] HTTP GET for /slave(1)/flags from 
10.70.142.140:44732 with User-Agent='Python-urllib/2.7'
I1101 20:20:07.789482 102231 http.cpp:277] HTTP GET for /slave(1)/state.json 
from 10.70.142.140:56414 with User-Agent='filebundle-agent'
I1101 20:20:07.913359 102216 status_update_manager.cpp:395] Received status 
update acknowledgement (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888
f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:20:07.913455 102216 status_update_manager.cpp:832] Checkpointing ACK 
for status update TASK_RUNNING (UUID: c1667f59-b404-43ab-b096-b12397fb00f0) for 
task node-1__23fa9624-4608-404f-8d6f-0
2355595888f of framework db61f6d4-fd0f-48be-927d-14282c12301f-0014
I1101 20:20:14.135632 102231 slave.cpp:3634] Handling status update TASK_ERROR 
(UUID: 913c25be-dfb6-4ad8-874f-d8e1c789ccc0) for task 
node-1__23fa9624-4608-404f-8d6f-02355595888f of framework db61f
6d4-fd0f-48be-927d-14282c12301f-0014 from executor(1)@10.70.142.140:36929
E1101 20:20:14.136687 102211 slave.cpp:6736] Unexpected terminal task state 
TASK_ERROR
I1101 20:20:14.137081 102230 linux.cpp:627] Removing mount 
'/var/lib/mesos/slaves/db61f6d4-fd0f-48be-927d-14282c12301f-S193/frameworks/db61f6d4-fd0f-48be-927d-14282c12301f-0014/executors/node-1_ex

[jira] [Commented] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-17 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208103#comment-16208103
 ] 

Zhitao Li commented on MESOS-8090:
--

A quick attempt to fix: https://reviews.apache.org/r/63084/

> Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
> --
>
> Key: MESOS-8090
> URL: https://issues.apache.org/jira/browse/MESOS-8090
> Project: Mesos
>  Issue Type: Bug
>  Components: master, oversubscription
>Affects Versions: 1.4.0
>Reporter: Zhitao Li
>Assignee: Michael Park
>
> We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
> over-subscription-enabled agent running 1.3.1 code.
> The crash line is:
> {code:none}
> resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
> {code}
> Stack trace in gdb:
> {panel:title=My title}
> #0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x7f22f3554448 in __GI_abort () at abort.c:89
> #2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
> src/utilities.cc:147
> #3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
> #4  0x7f22f61566cd in google::LogMessage::SendToLog (this= out>) at src/logging.cc:1412
> #5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
> src/logging.cc:1281
> #6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
> (this=, __in_chrg=) at src/logging.cc:1984
> #7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
> /mesos/src/common/resources.cpp:1051
> #8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
> (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
> #9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, 
> that=...) at /mesos/src/common/resources.cpp:1993
> #10 0x7f22f527f860 in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2016
> #11 0x7f22f527f91d in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2025
> #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
> _resources=...) at /mesos/src/common/resources.cpp:1277
> #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
> (this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
> #14 0x7f22f550adc1 in 
> ProtobufProcess::_handlerM
>  (t=0x558137bbae70, method=
> (void 
> (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, 
> const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
>   const&)>, 
> data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
> at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
> #15 0x7f22f54c8791 in 
> ProtobufProcess::visit (this=0x558137bbae70, 
> event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
> #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
> (this=this@entry=0x558137bbae70, event=...) at 
> /mesos/src/master/master.cpp:1643
> #17 0x7f22f547014d in mesos::internal::master::Master::visit 
> (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
> #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
> /mesos/3rdparty/libprocess/include/process/process.hpp:87
> #19 process::ProcessManager::resume (this=, 
> process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
> #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
> /mesos/3rdparty/libprocess/src/process.cpp:2881
> #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
> #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
> #23 
> std::thread::_Impl()>
>  >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
> #24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
> pthread_create.c:309
> #26 0x7f22f360662d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {panel}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8090:
-
Description: 
We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
over-subscription-enabled agent running 1.3.1 code.

The crash line is:

{code:none}
resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
{code}

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /mesos/src/common/resources.cpp:1993
#10 0x7f22f527f860 in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2016
#11 0x7f22f527f91d in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2025
#12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
_resources=...) at /mesos/src/common/resources.cpp:1277
#13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
(this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
#14 0x7f22f550adc1 in 
ProtobufProcess::_handlerM
 (t=0x558137bbae70, method=
(void (mesos::internal::master::Master::*)(mesos::internal::master::Master 
* const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
, 
data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
#15 0x7f22f54c8791 in 
ProtobufProcess::visit (this=0x558137bbae70, 
event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
#16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
(this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643
#17 0x7f22f547014d in mesos::internal::master::Master::visit 
(this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
#18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
/mesos/3rdparty/libprocess/include/process/process.hpp:87
#19 process::ProcessManager::resume (this=, 
process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
#20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
/mesos/3rdparty/libprocess/src/process.cpp:2881
#21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
#22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
#23 
std::thread::_Impl()>
 >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
#24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
pthread_create.c:309
#26 0x7f22f360662d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{panel}



  was:
We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a 
over-subscription enabled agent running 1.3.1 code.

The crash line is:


{panel:title=My title}
resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
{panel}

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in 

[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8090:
-
Description: 
We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
over-subscription-enabled agent running 1.3.1 code.

The crash line is:


{panel:title=My title}
resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
{panel}

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /mesos/src/common/resources.cpp:1993
#10 0x7f22f527f860 in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2016
#11 0x7f22f527f91d in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2025
#12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
_resources=...) at /mesos/src/common/resources.cpp:1277
#13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
(this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
#14 0x7f22f550adc1 in 
ProtobufProcess::_handlerM
 (t=0x558137bbae70, method=
(void (mesos::internal::master::Master::*)(mesos::internal::master::Master 
* const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
, 
data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
#15 0x7f22f54c8791 in 
ProtobufProcess::visit (this=0x558137bbae70, 
event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
#16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
(this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643
#17 0x7f22f547014d in mesos::internal::master::Master::visit 
(this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
#18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
/mesos/3rdparty/libprocess/include/process/process.hpp:87
#19 process::ProcessManager::resume (this=, 
process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
#20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
/mesos/3rdparty/libprocess/src/process.cpp:2881
#21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
#22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
#23 
std::thread::_Impl()>
 >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
#24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
pthread_create.c:309
#26 0x7f22f360662d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{panel}



  was:
We are seeing a crash in 1.4.0 master when it receives {{updateSlave}} from a 
over-subscription enabled agent running 1.3.1 code.

The crash line is:

resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 

[jira] [Created] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8090:


 Summary: Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
 Key: MESOS-8090
 URL: https://issues.apache.org/jira/browse/MESOS-8090
 Project: Mesos
  Issue Type: Bug
  Components: master, oversubscription
Reporter: Zhitao Li
Assignee: Michael Park


We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
over-subscription-enabled agent running 1.3.1 code.

The crash line is:

resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19

Stack trace in gdb:

{panel:title=My title}
#0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7f22f3554448 in __GI_abort () at abort.c:89
#2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x7f22f61566cd in google::LogMessage::SendToLog (this=) 
at src/logging.cc:1412
#5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
src/logging.cc:1281
#6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
(this=, __in_chrg=) at src/logging.cc:1984
#7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
/mesos/src/common/resources.cpp:1051
#8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
(this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
#9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, that=...) 
at /mesos/src/common/resources.cpp:1993
#10 0x7f22f527f860 in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2016
#11 0x7f22f527f91d in mesos::Resources::operator+= 
(this=this@entry=0x7f22e713d400, that=...) at 
/mesos/src/common/resources.cpp:2025
#12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
_resources=...) at /mesos/src/common/resources.cpp:1277
#13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
(this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
#14 0x7f22f550adc1 in 
ProtobufProcess::_handlerM
 (t=0x558137bbae70, method=
(void (mesos::internal::master::Master::*)(mesos::internal::master::Master 
* const, const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
, 
data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
#15 0x7f22f54c8791 in 
ProtobufProcess::visit (this=0x558137bbae70, 
event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
#16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
(this=this@entry=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1643
#17 0x7f22f547014d in mesos::internal::master::Master::visit 
(this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
#18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
/mesos/3rdparty/libprocess/include/process/process.hpp:87
#19 process::ProcessManager::resume (this=, 
process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
#20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
/mesos/3rdparty/libprocess/src/process.cpp:2881
#21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
#22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
#23 
std::thread::_Impl()>
 >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
#24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
pthread_create.c:309
#26 0x7f22f360662d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{panel}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8090) Mesos 1.4.0 crashes with 1.3.x agent with oversubscription

2017-10-13 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8090:
-
Affects Version/s: 1.4.0

> Mesos 1.4.0 crashes with 1.3.x agent with oversubscription
> --
>
> Key: MESOS-8090
> URL: https://issues.apache.org/jira/browse/MESOS-8090
> Project: Mesos
>  Issue Type: Bug
>  Components: master, oversubscription
>Affects Versions: 1.4.0
>Reporter: Zhitao Li
>Assignee: Michael Park
>
> We are seeing a crash in the 1.4.0 master when it receives {{updateSlave}} from an 
> over-subscription-enabled agent running 1.3.1 code.
> The crash line is:
> resources.cpp:1050] Check failed: !resource.has_role() cpus{REV}:19
> Stack trace in gdb:
> {panel:title=My title}
> #0  0x7f22f3553067 in __GI_raise (sig=sig@entry=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x7f22f3554448 in __GI_abort () at abort.c:89
> #2  0x7f22f615cd79 in google::DumpStackTraceAndExit () at 
> src/utilities.cc:147
> #3  0x7f22f6154a4d in google::LogMessage::Fail () at src/logging.cc:1458
> #4  0x7f22f61566cd in google::LogMessage::SendToLog (this= out>) at src/logging.cc:1412
> #5  0x7f22f6154612 in google::LogMessage::Flush (this=0x18ac7) at 
> src/logging.cc:1281
> #6  0x7f22f61570b9 in google::LogMessageFatal::~LogMessageFatal 
> (this=, __in_chrg=) at src/logging.cc:1984
> #7  0x7f22f527e133 in mesos::Resources::isEmpty (resource=...) at 
> /mesos/src/common/resources.cpp:1051
> #8  0x7f22f527e1e5 in mesos::Resources::Resource_::isEmpty 
> (this=this@entry=0x7f22e713d2e0) at /mesos/src/common/resources.cpp:1173
> #9  0x7f22f527e20c in mesos::Resources::add (this=0x7f22e713d400, 
> that=...) at /mesos/src/common/resources.cpp:1993
> #10 0x7f22f527f860 in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2016
> #11 0x7f22f527f91d in mesos::Resources::operator+= 
> (this=this@entry=0x7f22e713d400, that=...) at 
> /mesos/src/common/resources.cpp:2025
> #12 0x7f22f527fa4b in mesos::Resources::Resources (this=0x7f22e713d400, 
> _resources=...) at /mesos/src/common/resources.cpp:1277
> #13 0x7f22f548b812 in mesos::internal::master::Master::updateSlave 
> (this=0x558137bbae70, message=...) at /mesos/src/master/master.cpp:6681
> #14 0x7f22f550adc1 in 
> ProtobufProcess::_handlerM
>  (t=0x558137bbae70, method=
> (void 
> (mesos::internal::master::Master::*)(mesos::internal::master::Master * const, 
> const mesos::internal::UpdateSlaveMessage &)) 0x7f22f548b6d0 
>   const&)>, 
> data="\n)\n'07ba28cc-d9fa-44fb-8d6b-f8c5c90f8a90-S1\022\030\n\004cpus\020\000\032\t\t\000\000\000\000\000\000\063@2\001*J")
> at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:799
> #15 0x7f22f54c8791 in 
> ProtobufProcess::visit (this=0x558137bbae70, 
> event=...) at /mesos/3rdparty/libprocess/include/process/protobuf.hpp:104
> #16 0x7f22f54572d4 in mesos::internal::master::Master::_visit 
> (this=this@entry=0x558137bbae70, event=...) at 
> /mesos/src/master/master.cpp:1643
> #17 0x7f22f547014d in mesos::internal::master::Master::visit 
> (this=0x558137bbae70, event=...) at /mesos/src/master/master.cpp:1575
> #18 0x7f22f60b7169 in serve (event=..., this=0x558137bbbf28) at 
> /mesos/3rdparty/libprocess/include/process/process.hpp:87
> #19 process::ProcessManager::resume (this=<optimized out>, 
> process=0x558137bbbf28) at /mesos/3rdparty/libprocess/src/process.cpp:3346
> #20 0x7f22f60bd056 in operator() (__closure=0x558137aa3218) at 
> /mesos/3rdparty/libprocess/src/process.cpp:2881
> #21 _M_invoke<> (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1700
> #22 operator() (this=0x558137aa3218) at /usr/include/c++/4.9/functional:1688
> #23 
> std::thread::_Impl()>
>  >::_M_run(void) (this=0x558137aa3200) at /usr/include/c++/4.9/thread:115
> #24 0x7f22f40b3970 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #25 0x7f22f38d1064 in start_thread (arg=0x7f22e713e700) at 
> pthread_create.c:309
> #26 0x7f22f360662d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {panel}
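
The fatal line quoted above ({{Check failed: !resource.has_role()}}) comes from a 
glog {{CHECK}} invariant in {{Resources::isEmpty}}: when the condition evaluates 
to false, glog logs the message and aborts the process, which is what produces 
the stack trace. The standalone toy below only illustrates that pattern; the 
{{Resource}} struct and the streamed value are made up, it is not the actual 
mesos::Resources validation code.

{code:cpp}
// Standalone toy illustrating the glog CHECK pattern behind the crash above.
// The Resource struct and the streamed value are made up for illustration;
// this is not the actual mesos::Resources validation code.
#include <glog/logging.h>

struct Resource
{
  bool role = false;
  bool has_role() const { return role; }
};

int main(int argc, char** argv)
{
  google::InitGoogleLogging(argv[0]);

  Resource resource;
  resource.role = true;  // e.g. a revocable "cpus" resource carrying a role.

  // Mirrors the failing invariant: when the condition is false, glog prints
  // "Check failed: !resource.has_role() ..." and aborts, which is what
  // produces the stack trace attached to this ticket.
  CHECK(!resource.has_role()) << "cpus{REV}:19";

  return 0;
}
{code}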



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8075) Add RWMutex to libprocess

2017-10-12 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8075:
-
Shepherd: Benjamin Hindman

> Add RWMutex to libprocess
> -
>
> Key: MESOS-8075
> URL: https://issues.apache.org/jira/browse/MESOS-8075
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide 
> mutual exclusion for actions that must not overlap, but allow high concurrency 
> for actions that can safely be performed at the same time.
> One use case is image garbage collection: the new API 
> {{provisioner::pruneImages}} needs to be mutually exclusive with 
> {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can 
> safely run concurrently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8079) Checkpoint and recover layers used to provision rootfs in provisioner

2017-10-12 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8079:


 Summary: Checkpoint and recover layers used to provision rootfs in 
provisioner
 Key: MESOS-8079
 URL: https://issues.apache.org/jira/browse/MESOS-8079
 Project: Mesos
  Issue Type: Task
  Components: provisioner
Reporter: Zhitao Li


This information will be necessary for the {{provisioner}} to determine all 
layers used by active containers, which we need to retain when image garbage 
collection happens.
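
As a rough illustration of what checkpointing and recovery could look like, the 
sketch below writes the layer ids backing a container's rootfs to a per-container 
file and reads them back on recovery. The directory layout, file name, and 
function names are assumptions for illustration, not the provisioner's actual 
checkpoint format.

{code:cpp}
// Rough sketch of checkpointing the layer ids backing a container's rootfs so
// they can be recovered after an agent restart. The directory layout, file
// name and function names are illustrative assumptions, not the provisioner's
// actual checkpoint format.
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

void checkpointLayers(
    const std::string& containerDir,
    const std::vector<std::string>& layers)
{
  std::filesystem::create_directories(containerDir);

  std::ofstream out(containerDir + "/layers");  // One layer id per line.
  for (const std::string& layer : layers) {
    out << layer << "\n";
  }
}

std::vector<std::string> recoverLayers(const std::string& containerDir)
{
  std::vector<std::string> layers;
  std::ifstream in(containerDir + "/layers");
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty()) {
      // Layers listed here belong to an active container and must be
      // retained when image garbage collection runs.
      layers.push_back(line);
    }
  }
  return layers;
}

int main()
{
  checkpointLayers(
      "/tmp/provisioner/containers/c1", {"sha256:aaa", "sha256:bbb"});
  return recoverLayers("/tmp/provisioner/containers/c1").size() == 2 ? 0 : 1;
}
{code}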



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8075) Add RWMutex to libprocess

2017-10-11 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li reassigned MESOS-8075:


Assignee: Zhitao Li

> Add RWMutex to libprocess
> -
>
> Key: MESOS-8075
> URL: https://issues.apache.org/jira/browse/MESOS-8075
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide 
> mutual exclusion for actions that must not overlap, but allow high concurrency 
> for actions that can safely be performed at the same time.
> One use case is image garbage collection: the new API 
> {{provisioner::pruneImages}} needs to be mutually exclusive with 
> {{provisioner::provision}}, but multiple {{provisioner::provision}} calls can 
> safely run concurrently.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8075) Add RWMutex to libprocess

2017-10-11 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8075:


 Summary: Add RWMutex to libprocess
 Key: MESOS-8075
 URL: https://issues.apache.org/jira/browse/MESOS-8075
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Zhitao Li


We want to add a new {{RWMutex}} similar to {{Mutex}}, which can provide mutual 
exclusion for actions that must not overlap, but allow high concurrency for 
actions that can safely be performed at the same time.

One use case is image garbage collection: the new API 
{{provisioner::pruneImages}} needs to be mutually exclusive with 
{{provisioner::provision}}, but multiple {{provisioner::provision}} calls can 
safely run concurrently.
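
To make the intended semantics concrete, here is a minimal standalone sketch 
using {{std::shared_mutex}} rather than the proposed libprocess primitive (which 
does not exist yet): {{provision()}} calls take a shared lock and may overlap, 
while {{pruneImages()}} takes an exclusive lock. The class and method names are 
placeholders, not the real provisioner interface, and the real {{RWMutex}} would 
return libprocess futures instead of blocking.

{code:cpp}
// Standalone sketch of the intended reader/writer semantics, using
// std::shared_mutex instead of the proposed (not yet existing) libprocess
// RWMutex. provision() calls take a shared lock and may overlap; pruneImages()
// takes an exclusive lock. Class and method names are placeholders, and the
// real primitive would return libprocess futures instead of blocking.
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

class Provisioner
{
public:
  void provision(int containerId)
  {
    // Shared lock: many provision() calls may run concurrently.
    std::shared_lock<std::shared_mutex> lock(mutex_);
    std::cout << "provisioning container " << containerId << std::endl;
  }

  void pruneImages()
  {
    // Exclusive lock: pruning waits for in-flight provision() calls and
    // blocks new ones until it finishes.
    std::unique_lock<std::shared_mutex> lock(mutex_);
    std::cout << "pruning unused image layers" << std::endl;
  }

private:
  std::shared_mutex mutex_;
};

int main()
{
  Provisioner provisioner;

  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) {
    threads.emplace_back([&provisioner, i]() { provisioner.provision(i); });
  }
  threads.emplace_back([&provisioner]() { provisioner.pruneImages(); });

  for (std::thread& t : threads) {
    t.join();
  }

  return 0;
}
{code}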



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8070) Bundled GRPC build does not build on Debian 8

2017-10-10 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8070:


 Summary: Bundled GRPC build does not build on Debian 8
 Key: MESOS-8070
 URL: https://issues.apache.org/jira/browse/MESOS-8070
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li
Assignee: Chun-Hung Hsiao


Debian 8 includes an outdated version of libc-ares-dev, which prevents the 
bundled gRPC from building.

I believe [~chhsia0] already has a fix.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.

2017-10-04 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191610#comment-16191610
 ] 

Zhitao Li commented on MESOS-6240:
--

+1

Moving the executor-to-agent API from TCP to a domain socket will also reduce 
some of the agent's potential security exposure.

Is there a design doc for this work?

> Allow executor/agent communication over non-TCP/IP stream socket.
> -
>
> Key: MESOS-6240
> URL: https://issues.apache.org/jira/browse/MESOS-6240
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
> Environment: Linux and Windows
>Reporter: Avinash Sridharan
>Assignee: Benjamin Hindman
>Priority: Critical
>  Labels: mesosphere
>
> Currently, the executor/agent communication happens specifically over TCP 
> sockets. This works fine in most cases, but for the `MesosContainerizer`, when 
> containers are running on CNI networks, this mode of communication starts 
> imposing constraints on the CNI network, since there now has to be 
> connectivity between the CNI network (on which the executor is running) and 
> the agent. Introducing paths from a CNI network to the underlying agent at 
> best creates headaches for operators and at worst introduces serious security 
> holes in the network, since it breaks the isolation between the container CNI 
> network and the host network (on which the agent is running).
> In order to simplify and strengthen deployment of Mesos containers on CNI 
> networks, we therefore need to move away from using TCP/IP sockets for 
> executor/agent communication. Since the executor and the agent are guaranteed 
> to run on the same host, the above problems can be resolved if, for the 
> `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of 
> TCP/IP sockets for the executor/agent communication.
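
For illustration, a bare-bones POSIX sketch of the kind of UNIX domain socket 
endpoint the agent could listen on for executor traffic follows; the socket path 
is a made-up placeholder and this is not Mesos or libprocess code.

{code:cpp}
// Bare-bones POSIX sketch of the kind of UNIX domain socket endpoint the agent
// could listen on for executor traffic instead of a TCP port. The socket path
// is a made-up placeholder; this is not Mesos or libprocess code.
#include <cstdio>
#include <cstring>

#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main()
{
  const char* path = "/tmp/mesos-executors.sock";  // Hypothetical path.

  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) {
    perror("socket");
    return 1;
  }

  sockaddr_un addr;
  memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

  unlink(path);  // Remove a stale socket file left by a previous run.

  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    perror("bind");
    return 1;
  }

  if (listen(fd, 16) < 0) {
    perror("listen");
    return 1;
  }

  // An agent would accept() executor connections here; no host network or CNI
  // connectivity is involved because both ends live on the same machine.
  close(fd);
  return 0;
}
{code}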



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8040) Return nested containers in `GET_CONTAINERS` API call

2017-09-28 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-8040:
-
Component/s: containerization
 Issue Type: Improvement  (was: Bug)

> Return nested containers in `GET_CONTAINERS` API call
> -
>
> Key: MESOS-8040
> URL: https://issues.apache.org/jira/browse/MESOS-8040
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Zhitao Li
>
> Right now, there is no way to directly query the agent for all nested 
> containers' ids, parent ids, and other information.
> After talking to [~jieyu], the `GET_CONTAINERS` API seems a good fit to return 
> this information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8040) Return nested containers in `GET_CONTAINERS` API call

2017-09-28 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8040:


 Summary: Return nested containers in `GET_CONTAINERS` API call
 Key: MESOS-8040
 URL: https://issues.apache.org/jira/browse/MESOS-8040
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


Right now, there is no way to directly query the agent for all nested 
containers' ids, parent ids, and other information.

After talking to [~jieyu], the `GET_CONTAINERS` API seems a good fit to return 
this information.
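
As a sketch of the additional information the call would need to carry, the 
snippet below models each entry with its own id plus an optional parent id, so a 
caller can reconstruct the nesting tree. The struct and field names are 
illustrative assumptions, not the actual v1 agent API protobufs.

{code:cpp}
// Sketch of the extra information a GET_CONTAINERS response would need to
// carry for nested containers: each entry has its own id plus an optional
// parent id. Struct and field names are illustrative assumptions, not the
// actual v1 agent API protobufs.
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct ContainerEntry
{
  std::string containerId;
  std::optional<std::string> parentContainerId;  // Set only for nested containers.
};

void print(const std::vector<ContainerEntry>& containers)
{
  for (const ContainerEntry& container : containers) {
    std::cout << container.containerId;
    if (container.parentContainerId) {
      std::cout << " (nested under " << *container.parentContainerId << ")";
    }
    std::cout << std::endl;
  }
}

int main()
{
  std::vector<ContainerEntry> containers;
  containers.push_back({"executor-1", std::nullopt});
  containers.push_back({"task-a", std::string("executor-1")});

  print(containers);
  return 0;
}
{code}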



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-28 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184562#comment-16184562
 ] 

Zhitao Li edited comment on MESOS-8018 at 9/28/17 6:01 PM:
---

[~jamespeach] If the framework *opts in* to this behavior, then the task will be 
allowed to do whatever the (default) executor can do through the agent HTTP API, 
possibly including launching a privileged task within the executor's container 
tree, if other parts of AuthZ permit that.

My rationale here is to intentionally treat the task as an extension of the 
executor. I'd argue this is simpler than forcing everyone to write their own 
executor.


was (Author: zhitao):
[~jamespeach] If the framework *opts in* to this behavior, then the task will be 
allowed to do whatever the (default) executor can do through the agent HTTP API, 
possibly including launching a privileged task within the executor's container 
tree, if other parts of AuthZ permit that.

> Allow framework to opt-in to forward executor's JWT token to the tasks
> --
>
> Key: MESOS-8018
> URL: https://issues.apache.org/jira/browse/MESOS-8018
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>
> Nested container API is an awesome feature and has enabled a lot of 
> interesting use cases. A pattern we have seen multiple times is that a task 
> (often the only one) launched by the default executor wants to further create 
> containers nested beneath itself (or the executor) to run some different 
> workload.
> Because the entire request is 1) completely local to the executor container 
> and 2) acceptable to bound within the executor's lifecycle, we'd like to allow 
> the task to use the Mesos agent API directly to create these nested 
> containers. However, this creates a problem when we want to enable HTTP 
> executor authentication, because the JWT auth tokens are only available to the 
> executor, so the task's API request will be rejected.
> Requiring the framework owner to fork or write a custom executor simply for 
> this purpose also seems a bit too heavy.
> My proposal is to allow the framework to opt in via some field so that the 
> launched task receives certain environment variables from the default 
> executor, allowing the task to act on behalf of the executor. One idea is to 
> add a new field that allows certain environment variables to be forwarded from 
> the executor to the task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-28 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184562#comment-16184562
 ] 

Zhitao Li edited comment on MESOS-8018 at 9/28/17 6:00 PM:
---

[~jamespeach] If the framework *opts in* to this behavior, then the task will be 
allowed to do whatever the (default) executor can do through the agent HTTP API, 
possibly including launching a privileged task within the executor's container 
tree, if other parts of AuthZ permit that.


was (Author: zhitao):
[~jamespeach] If the framework *opts in* to this behavior, then the task will be 
allowed to do whatever the (default) executor can do through the agent HTTP API, 
possibly including launching a privileged task within the executor's container 
tree.

> Allow framework to opt-in to forward executor's JWT token to the tasks
> --
>
> Key: MESOS-8018
> URL: https://issues.apache.org/jira/browse/MESOS-8018
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>
> Nested container API is an awesome feature and has enabled a lot of 
> interesting use cases. A pattern we have seen multiple times is that a task 
> (often the only one) launched by the default executor wants to further create 
> containers nested beneath itself (or the executor) to run some different 
> workload.
> Because the entire request is 1) completely local to the executor container 
> and 2) acceptable to bound within the executor's lifecycle, we'd like to allow 
> the task to use the Mesos agent API directly to create these nested 
> containers. However, this creates a problem when we want to enable HTTP 
> executor authentication, because the JWT auth tokens are only available to the 
> executor, so the task's API request will be rejected.
> Requiring the framework owner to fork or write a custom executor simply for 
> this purpose also seems a bit too heavy.
> My proposal is to allow the framework to opt in via some field so that the 
> launched task receives certain environment variables from the default 
> executor, allowing the task to act on behalf of the executor. One idea is to 
> add a new field that allows certain environment variables to be forwarded from 
> the executor to the task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-28 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184562#comment-16184562
 ] 

Zhitao Li commented on MESOS-8018:
--

[~jamespeach] If the framework *opts in* to this behavior, then the task will be 
allowed to do whatever the (default) executor can do through the agent HTTP API, 
possibly including launching a privileged task within the executor's container 
tree.

> Allow framework to opt-in to forward executor's JWT token to the tasks
> --
>
> Key: MESOS-8018
> URL: https://issues.apache.org/jira/browse/MESOS-8018
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>
> Nested container API is an awesome feature and has enabled a lot of 
> interesting use cases. A pattern we have seen multiple times is that a task 
> (often the only one) launched by the default executor wants to further create 
> containers nested beneath itself (or the executor) to run some different 
> workload.
> Because the entire request is 1) completely local to the executor container 
> and 2) acceptable to bound within the executor's lifecycle, we'd like to allow 
> the task to use the Mesos agent API directly to create these nested 
> containers. However, this creates a problem when we want to enable HTTP 
> executor authentication, because the JWT auth tokens are only available to the 
> executor, so the task's API request will be rejected.
> Requiring the framework owner to fork or write a custom executor simply for 
> this purpose also seems a bit too heavy.
> My proposal is to allow the framework to opt in via some field so that the 
> launched task receives certain environment variables from the default 
> executor, allowing the task to act on behalf of the executor. One idea is to 
> add a new field that allows certain environment variables to be forwarded from 
> the executor to the task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8018) Allow framework to opt-in to forward executor's JWT token to the tasks

2017-09-26 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8018:


 Summary: Allow framework to opt-in to forward executor's JWT token 
to the tasks
 Key: MESOS-8018
 URL: https://issues.apache.org/jira/browse/MESOS-8018
 Project: Mesos
  Issue Type: Improvement
Reporter: Zhitao Li


Nested container API is an awesome feature and has enabled a lot of interesting 
use cases. A pattern we have seen multiple times is that a task (often the only 
one) launched by the default executor wants to further create containers nested 
beneath itself (or the executor) to run some different workload.

Because the entire request is 1) completely local to the executor container and 
2) acceptable to bound within the executor's lifecycle, we'd like to allow the 
task to use the Mesos agent API directly to create these nested containers. 
However, this creates a problem when we want to enable HTTP executor 
authentication, because the JWT auth tokens are only available to the executor, 
so the task's API request will be rejected.

Requiring the framework owner to fork or write a custom executor simply for this 
purpose also seems a bit too heavy.

My proposal is to allow the framework to opt in via some field so that the 
launched task receives certain environment variables from the default executor, 
allowing the task to act on behalf of the executor. One idea is to add a new 
field that allows certain environment variables to be forwarded from the 
executor to the task.
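
To illustrate the opt-in forwarding idea, here is a minimal sketch in which the 
executor copies an allow-listed set of its own environment variables into the 
environment it builds for the task. The allow-list mechanism, the function name, 
and the specific variable name are assumptions for illustration, not implemented 
Mesos behavior or configuration.

{code:cpp}
// Sketch of the opt-in idea: the default executor copies an allow-listed set
// of its own environment variables into the environment it builds for the
// task. The allow-list mechanism, function name and the variable name below
// are assumptions for illustration, not implemented Mesos behavior.
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::string> forwardedEnvironment(
    const std::vector<std::string>& allowList)
{
  std::map<std::string, std::string> environment;

  for (const std::string& name : allowList) {
    if (const char* value = std::getenv(name.c_str())) {
      // Only variables the framework explicitly opted into are forwarded.
      environment[name] = value;
    }
  }

  return environment;
}

int main()
{
  // E.g. the auth token the agent handed to the executor, so the task can call
  // the agent API to launch nested containers on the executor's behalf.
  std::map<std::string, std::string> taskEnvironment =
    forwardedEnvironment({"MESOS_EXECUTOR_AUTHENTICATION_TOKEN"});

  return taskEnvironment.empty() ? 1 : 0;
}
{code}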



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-1739) Allow slave reconfiguration on restart

2017-09-22 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177221#comment-16177221
 ] 

Zhitao Li commented on MESOS-1739:
--

Ping on this too. I'm willing to work on this in the next couple of months and 
push it forward.

> Allow slave reconfiguration on restart
> --
>
> Key: MESOS-1739
> URL: https://issues.apache.org/jira/browse/MESOS-1739
> Project: Mesos
>  Issue Type: Epic
>Reporter: Patrick Reilly
>  Labels: external-volumes, mesosphere, myriad
>
> Make it so that either via a slave restart or a out of process "reconfigure" 
> ping, the attributes and resources of a slave can be updated to be a superset 
> of what they used to be.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

