[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097036#comment-17097036 ]

Vinod Kone commented on MESOS-8038:
-----------------------------------

Thanks [~cf.natali] for the repro and analysis.

The log lines you pasted in the comment don't capture everything that 
transpired; you would need to do a grep like the following to get the whole picture. 

{quote}
grep -E "task-650af3bd-3f5b-4e17-9d34-4642480b4da0|:36541|6f446173-2bba-4cc4-bc15-c956bc159d4e" mesos_agent.log
{quote}

But, anyway, I think your observations are largely correct. When a container is 
in the process of being destroyed, the agent does short-circuit and sends the 
terminal update to the master, which causes the resources to be released, 
re-offered, and potentially used by some other task. 
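To make the window concrete, here is a minimal standalone sketch of the resulting 
accounting mismatch. This is not Mesos code and all names are hypothetical: the 
master re-offers the GPU as soon as the terminal update arrives, while the 
agent-side GPU isolator only gets the device back once the asynchronous container 
destruction completes.

{code:cpp}
#include <atomic>
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main() {
  // Master's view of free GPUs on the agent, and the GPU isolator's
  // actual free set. After task 1 launches, both are 0.
  std::atomic<int> masterAvailableGpus{0};
  std::atomic<int> isolatorFreeGpus{0};

  // Container destruction for task 1 runs asynchronously; the isolator
  // only gets its GPU back once the cgroup is actually gone.
  std::future<void> destroy = std::async(std::launch::async, [&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    isolatorFreeGpus = 1;
  });

  // Short-circuit: the agent sends the terminal status update right
  // away, so the master releases and re-offers the GPU immediately.
  masterAvailableGpus = 1;

  // A framework accepts the offer and launches task 2 while the old
  // container is still being destroyed: the master thinks one GPU is
  // free, but the isolator has none to hand out.
  if (masterAvailableGpus >= 1 && isolatorFreeGpus < 1) {
    std::cout << "Collect failed: Requested 1 but only 0 available"
              << std::endl;
  }

  destroy.wait();
  return 0;
}
{code}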

I remember discussions around this behavior in the past, but I'm not sure where 
we landed in terms of the long-term solution. Right now, we err on the side of 
releasing the resources in case the cgroup gets stuck in destroying, instead of 
hoarding them. If we do decide to change this code to always wait for the cgroup 
destruction to be finished (or the update to be finished), there's a possibility 
that resources are locked forever in case of bugs (either in Mesos or the kernel) 
in the destruction path. I can't remember if we have seen this behavior in 
production clusters before. 

[~abudnik] [~greggomann] thoughts on fixing this?




> Launching GPU task sporadically fails.
> --------------------------------------
>
>                 Key: MESOS-8038
>                 URL: https://issues.apache.org/jira/browse/MESOS-8038
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, gpu
>    Affects Versions: 1.4.0
>            Reporter: Sai Teja Ranuva
>            Assignee: Zhitao Li
>            Priority: Critical
>         Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by the executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A little search in the code base points me to 
> something related to the GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.


