Qian Zhang created MESOS-10163:
----------------------------------

             Summary: Implement a new component to launch CSI plugins as 
standalone containers and make CSI gRPC calls
                 Key: MESOS-10163
                 URL: https://issues.apache.org/jira/browse/MESOS-10163
             Project: Mesos
          Issue Type: Task
            Reporter: Qian Zhang
            Assignee: Greg Mann


*Background:*

Originally we wanted the `volume/csi` isolator to leverage the existing [service 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]
 to launch CSI plugins as standalone containers. Currently the service manager 
needs to call the following agent HTTP APIs:
 # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
 # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone 
containers in its `recover` method.
 # `LAUNCH_CONTAINER` via the existing 
[ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46]
 to launch a CSI plugin as a standalone container when its `getEndpoint` method 
is called.

The problem with the above design is that the `volume/csi` isolator may need to 
clean up orphan containers during agent recovery, which is triggered by the 
containerizer (see 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275]
 for details). To clean up an orphan container that is using a CSI volume, the 
`volume/csi` isolator needs to instantiate and recover the service manager and 
get the CSI plugin’s endpoint from it (i.e., the service manager’s `getEndpoint` 
method will be called by the `volume/csi` isolator during agent recovery). As 
mentioned above, `getEndpoint` may need to call `LAUNCH_CONTAINER` to launch 
the CSI plugin as a standalone container, but since the agent is still in the 
recovering state, that HTTP call will be rejected by the agent. So we have to 
instantiate and recover the service manager *after agent recovery is done*, but 
the `volume/csi` isolator has no such information (i.e., no signal that agent 
recovery is done).

 

*Solution:*

We need to implement a new component (like `CSIVolumeManager` or a better 
name?) in the Mesos agent which is responsible for launching CSI plugins as 
standalone containers (via the existing [service 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51])
 and making CSI gRPC calls (via the existing [volume 
manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
 * We can instantiate this new component in the agent’s `main` function and 
pass it to both the containerizer and the agent (i.e., it will be a member of 
the `Slave` object); the containerizer will in turn pass it to the `volume/csi` 
isolator.
 * Since this new component relies on the service manager, which calls agent 
HTTP APIs, we need to pass the agent URL to it, like 
`process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + "/api/v1")`; 
see 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471]
 for an example.
 * When the agent registers or reregisters with the master (`Slave::registered` 
and `Slave::reregistered`), we should call this new component’s `start` method 
(see 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742]
 and 
[here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827]
 as examples), which will scan the directory given by `--csi_plugin_config_dir` 
and create a `service manager - volume manager` pair for each CSI plugin loaded 
from that directory.
 * The `volume/csi` isolator needs to call this new component’s `publishVolume` 
and `unpublishVolume` methods in its `prepare` and `cleanup` methods, 
respectively.
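
To make the agent URL bullet concrete, here is a minimal, self-contained 
sketch of how the four arguments combine. `AgentUrlSketch`, `makeAgentApiUrl` 
and `render` are hypothetical stand-ins written for this illustration; they are 
not the libprocess `process::http::URL` class. The leading `/` added in 
`render` is an assumption about how the path is joined.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the fields passed to `process::http::URL` in the
// description above. This is NOT the libprocess class.
struct AgentUrlSketch
{
  std::string scheme;
  std::string ip;
  uint16_t port;
  std::string path;
};

// Combines the agent's address and libprocess ID into the v1 API endpoint,
// mirroring `URL(scheme, agentIP, agentPort, agentLibprocessId + "/api/v1")`.
inline AgentUrlSketch makeAgentApiUrl(
    const std::string& scheme,
    const std::string& agentIP,
    uint16_t agentPort,
    const std::string& agentLibprocessId)
{
  assert(!agentLibprocessId.empty());
  return AgentUrlSketch{
      scheme, agentIP, agentPort, agentLibprocessId + "/api/v1"};
}

// Renders the sketch as text. Joining the path with a leading '/' is an
// assumption made for this illustration.
inline std::string render(const AgentUrlSketch& url)
{
  return url.scheme + "://" + url.ip + ":" + std::to_string(url.port) +
         "/" + url.path;
}
```

For example, with illustrative values, `makeAgentApiUrl("http", "10.0.0.1", 
5051, "slave(1)")` yields the path `slave(1)/api/v1`.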

When cleaning up orphan containers during agent recovery, the `volume/csi` 
isolator will just call this new component’s `unpublishVolume` method as usual. 
It is this new component’s responsibility to make the actual CSI gRPC call only 
after agent recovery is done and the agent has registered with the master 
(e.g., when this new component’s `start` method is called).
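
The deferral described above could be sketched roughly as follows. This is an 
illustration under stated assumptions, not the proposed implementation: the 
class name `CSIVolumeManagerSketch`, the `Call` callback (a stand-in for the 
real CSI gRPC call) and `pendingCount` are all hypothetical, and the sketch is 
single-threaded with no libprocess actors or futures, which a real component 
in Mesos would presumably use.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the proposed component's deferral logic only.
class CSIVolumeManagerSketch
{
public:
  // Stand-in for the actual CSI gRPC call; an assumption of this sketch.
  using Call = std::function<void()>;

  // Intended to be called from the isolator's `cleanup`. Before `start` has
  // been called, the call is queued instead of being made immediately.
  void unpublishVolume(const std::string& volumeId, Call call)
  {
    assert(!volumeId.empty());

    if (started) {
      call();
      return;
    }

    pending.emplace_back(volumeId, std::move(call));
  }

  // Intended to be called once the agent has (re)registered with the master.
  // Flushes the queued calls in arrival order. Repeated invocations are
  // no-ops, since both `Slave::registered` and `Slave::reregistered` would
  // call it.
  void start()
  {
    if (started) {
      return;
    }

    started = true;

    std::vector<std::pair<std::string, Call>> queued;
    queued.swap(pending);

    for (auto& entry : queued) {
      entry.second();
    }
  }

  // Number of calls still waiting for `start`; exposed for illustration.
  std::size_t pendingCount() const { return pending.size(); }

private:
  bool started = false;
  std::vector<std::pair<std::string, Call>> pending;
};
```

With this shape the isolator does not need to know whether agent recovery has 
finished: an `unpublishVolume` issued during recovery is held until `start`, 
and one issued afterwards runs right away.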



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
