[ https://issues.apache.org/jira/browse/MESOS-9502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gilbert Song reassigned MESOS-9502:
-----------------------------------

        Shepherd: Gilbert Song
        Assignee: Andrei Budnik
          Sprint: Containerization R9 Sprint 37
    Story Points: 8
          Labels: containerizer  (was: )

> IOswitchboard cleanup could get stuck.
> --------------------------------------
>
>                 Key: MESOS-9502
>                 URL: https://issues.apache.org/jira/browse/MESOS-9502
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.7.0
>            Reporter: Meng Zhu
>            Assignee: Andrei Budnik
>            Priority: Critical
>              Labels: containerizer
>
> Our check container got stuck during destroy, which in turn blocks the
> parent container. It is blocked by the I/O switchboard cleanup:
> 1223 18:04:41.000000 16269 switchboard.cpp:814] Sending SIGTERM to I/O
> switchboard server (pid: 62854) since container
> 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
> is being destroyed
> ....
> 1227 04:45:38.000000 5189 switchboard.cpp:916] I/O switchboard server
> process for container
> 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
> has terminated (status=N/A)
> Note the timestamps: the two log lines are more than three days apart.
> *Root Cause:*
> Fundamentally, this is caused by a race between the *.discard()* triggered
> by the check container's TIMEOUT and the IOSB extracting the ContainerIO
> object. This race can be exposed by an overloaded or slow agent process.
> The race is triggered as follows:
> # Right after the IOSB server process starts running, the check container
> times out and the checker process returns a failure, which closes the HTTP
> connection with the agent.
> # On the agent side, if the connection breaks, the handler triggers a
> discard on the returned future, which results in containerizer->launch()'s
> future transitioning to the DISCARDED state.
> # In the containerizer, the DISCARDED state is propagated back to IOSB
> prepare(), which stops its continuation on *extracting the containerIO*
> (this implies that the object is cleaned up and its FDs (one end of the
> pipes created in IOSB) are closed in its destructor).
> # The agent starts to destroy the container because of its discarded
> launch result, and asks IOSB to clean up the container.
> # The IOSB server is still running, so the agent sends it a SIGTERM.
> # The SIGTERM handler unblocks the IOSB from redirecting (to redirect
> stdout/stderr from the container to the logger before exiting).
> # io::redirect() calls io::splice() and reads the other end of those pipes
> forever.
> This issue is *not easy to reproduce except* on a busy agent, because the
> timeout has to happen exactly *AFTER* the IOSB server is running and
> *BEFORE* IOSB extracts the containerIO.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
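
A reader that "reads the other end of those pipes forever", as in the last step of the root-cause list, is consistent with ordinary POSIX pipe behavior: a read on an empty pipe reports EOF only once every descriptor for the write end has been closed. The following is a minimal Python sketch of that general behavior, not Mesos code. The ticket does not say which descriptor stays open, so the duplicated `leaked_fd` is purely an assumption for illustration, and the names `readable_within` and `demo` are hypothetical.

```python
import os
import select


def readable_within(fd, timeout_s):
    """Return True if a read on fd would not block within timeout_s seconds."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    return bool(ready)


def demo():
    read_fd, write_fd = os.pipe()

    # Hypothetical second holder of the write end (an assumption made for
    # illustration; the ticket does not identify such a descriptor).
    leaked_fd = os.dup(write_fd)

    # Closing one copy of the write end is not enough: another copy is
    # still open, so the reader sees neither data nor EOF and would block.
    os.close(write_fd)
    blocked = not readable_within(read_fd, 0.2)

    # Once the last write-end descriptor is closed, the pipe becomes
    # readable and read() returns an empty byte string, i.e. EOF.
    os.close(leaked_fd)
    eof_ready = readable_within(read_fd, 0.2)
    data = os.read(read_fd, 4096)

    os.close(read_fd)
    return blocked, eof_ready, data


if __name__ == "__main__":
    print(demo())
```

The sketch uses select() with a short timeout instead of a blocking read, so it can show the "would hang" state without actually hanging.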