Andrew Schwartzmeyer created MESOS-8563:
-------------------------------------------

             Summary: Windows executors cannot re-register
                 Key: MESOS-8563
                 URL: https://issues.apache.org/jira/browse/MESOS-8563
             Project: Mesos
          Issue Type: Bug
          Components: agent, executor, libprocess
         Environment: Windows 10
            Reporter: Andrew Schwartzmeyer
            Assignee: Andrew Schwartzmeyer


This issue captures an important (but already resolved) bug caused by incorrect inheritance of sockets.

When agent recovery was enabled, we discovered that executors could not re-register with the new agent. They would send the re-register message and then fail silently. The agent never received the message.

This turned out to be due to incorrect inheritance semantics of sockets. On POSIX systems, {{os::cloexec}} was used to prevent file descriptors from being inherited by child processes, and on Windows the same call was expected to do this for socket handles. On Windows, we were creating {{SOCKET}} handles using the BSD-style Winsock API {{::socket}}, which by default creates _inheritable_ socket handles. The subsequent call to {{os::cloexec}} was a no-op on Windows, so socket handles leaked to all child processes, causing the described bug.

The solution was to split {{net::socket}} into a POSIX and a Windows implementation. On Windows we now use the WinSock 2 API {{WSASocket}}, which lets us create the socket upfront with {{WSA_FLAG_NO_HANDLE_INHERIT}}, preventing the leaks. This is somewhat like using {{O_CLOEXEC}} on Linux.
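
The split described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Mesos {{net::socket}} code: the names {{socket_t}}, {{kInvalidSocket}}, {{create_noninheritable_socket}} and {{is_inheritable}} are hypothetical, the Windows branch assumes {{WSAStartup}} has already been called, and the POSIX branch sets {{FD_CLOEXEC}} right after creation rather than atomically.

```cpp
// Sketch only: hypothetical helpers, not the Mesos implementation.
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET socket_t;
const socket_t kInvalidSocket = INVALID_SOCKET;
#else
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>
typedef int socket_t;
const socket_t kInvalidSocket = -1;
#endif

// Creates a socket that child processes do not inherit.
socket_t create_noninheritable_socket(int family, int type, int protocol)
{
#ifdef _WIN32
  // Assumes WSAStartup() was already called by the process.
  // WSA_FLAG_OVERLAPPED mirrors the default behavior of ::socket;
  // WSA_FLAG_NO_HANDLE_INHERIT makes the handle non-inheritable
  // from the moment it is created.
  return ::WSASocketW(
      family, type, protocol, nullptr, 0,
      WSA_FLAG_OVERLAPPED | WSA_FLAG_NO_HANDLE_INHERIT);
#else
  socket_t fd = ::socket(family, type, protocol);
  if (fd == kInvalidSocket) {
    return kInvalidSocket;
  }
  // Mark the descriptor close-on-exec so exec'd children do not get it.
  if (::fcntl(fd, F_SETFD, FD_CLOEXEC) == -1) {
    ::close(fd);
    return kInvalidSocket;
  }
  return fd;
#endif
}

// Reports whether a child process would inherit the socket.
// Returns false if the handle cannot be queried.
bool is_inheritable(socket_t s)
{
#ifdef _WIN32
  DWORD flags = 0;
  if (!::GetHandleInformation(reinterpret_cast<HANDLE>(s), &flags)) {
    return false;
  }
  return (flags & HANDLE_FLAG_INHERIT) != 0;
#else
  int flags = ::fcntl(s, F_GETFD);
  return flags != -1 && (flags & FD_CLOEXEC) == 0;
#endif
}
```

A caller would create the socket with {{create_noninheritable_socket(AF_INET, SOCK_STREAM, 0)}} and could use {{is_inheritable}} in a test to confirm that the handle is not leaked to child processes. Setting the flag at creation time on Windows avoids the window in which a freshly created handle is still inheritable.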