Thanks for your detailed answer / explanation, Lennart; it's fully consistent with my code-browsing findings.

I've been struggling myself with the problem you alluded to above: identifying "foreign" mountpoints. After banging my head against the wall for a while, I ended up implementing a heuristic based on the major:minor field of the /proc/pid/mountinfo file: if the container mountpoint being considered has a major:minor ID that matches one of the major:minor IDs present in the host mount namespace, then it is likely a "foreign" mountpoint and shouldn't be unmounted.

Obviously, this would force you to extend the current systemd mountInfo parser. And there is a caveat: not all file systems use a unique / differentiated ID for every new mountpoint (e.g. the "/dev/null" fs always uses the same major:minor ID across different mount namespaces), so there could be false positives, but that doesn't represent a problem in our case.

Here is the specific code if you want to check it out:

https://github.com/nestybox/sysbox-fs/blob/master/mount/infoParser.go#L828

Please let me know if you ever find a better approach.

cheers,

/Rodny

On Wed, Feb 24, 2021 at 9:19 AM Lennart Poettering <[email protected]> wrote:

> On Fr, 19.02.21 19:17, Rodny Molina ([email protected]) wrote:
>
> > Hi,
> >
> > As part of a prototype I'm working on to run systemd within an
> > unprivileged docker container, I would like to prevent mountpoints
> > created at runtime from being unmounted during the container shutdown
> > process. I understand that systemd creates "<blah>.mount" units
> > dynamically for these mountpoints as they show up in
> > /proc/pid/mountinfo, but after reading the docs + code, I don't see a
> > way to avoid these unmounts during the shutdown.target execution.
>
> Yeah, it would be great if we could automatically determine "foreign
> owned" mounts, and then step away from them. But there's really no way
> for us to figure that out, at least to my knowledge. Ideally
> /proc/self/mountinfo would tell us about this in some field, but it
> really doesn't afaik.
>
> > Interestingly, I see that there's code
> > <https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398>
> > that skips the unmounting cycle depending on the
> > ConditionVirtualization / containerized settings, which is what I
> > need, but I'm not able to see that code being called during the
> > container shutdown -- probably I'm not understanding systemd's fsm
> > unwinding logic well enough ...
>
> There are two phases of shutdown: the regular phase where we follow
> mount unit deps, and stuff is umounted via /sbin/umount, i.e. where
> the shutdown is handled by the usual unit logic.
>
> And then there's the second phase which shutdown.c implements: it's a
> separate binary that PID 1 invokes via execve() (so that it becomes
> the new PID 1) and then pretty robustly just tries to
> umount/detach/disassemble/… whatever might be left over, without any
> understanding of dependencies.
>
> The first phase hence is the "clean" shutdown logic and the second
> phase is the "dirty" fallback logic that tries really hard to sync/put
> file systems into a clean state if the first phase fails (maybe
> because of some misplaced deps).
>
> The second phase is skipped in containers, the first one is not. The
> second phase is unnecessary in containers since the container manager
> and namespace cleanup take care of this anyway, and even if they
> didn't, the host's shutdown logic can take responsibility for all this.
>
> Now, if the kernel provided us with the info, we'd generate the
> deps for .mount units synthesized from /proc/self/mountinfo in a way
> that "foreign owned" mounts won't get unmounted in phase 1, but we
> simply can't do that automatically since we can't distinguish
> them. :-(
>
> You could manually define .mount units for all mounts you know are
> owned by the outside container manager, but that is nasty and
> fragile. The mount units would have to carefully have the right deps
> (or better: should miss the right deps) to ensure things are clean
> when shutting down.
>
> So yeah, I'd love to fix this properly, generically, but this requires
> some kernel work first, and that's not just a technical difficulty but
> given the maintainer of said interfaces also a political one.
>
> Lennart
>
> --
> Lennart Poettering, Berlin

--
/Rodny
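
P.S. A minimal sketch of the major:minor heuristic described above, in Go. This is only an illustration under stated assumptions, not the sysbox-fs code linked earlier: the names mountEntry, parseMountinfo and likelyForeign are hypothetical, and it assumes the caller can read both the host's and the container's mountinfo contents (for instance from a process that sits outside the container).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mountEntry keeps only the two mountinfo fields the heuristic needs.
// (Hypothetical type, introduced for this sketch.)
type mountEntry struct {
	devID      string // third field of a mountinfo line, "major:minor"
	mountPoint string // fifth field of a mountinfo line
}

// parseMountinfo parses text in the /proc/<pid>/mountinfo format.
// Fields are space separated; spaces inside paths are escaped by the
// kernel (\040), so splitting on whitespace is safe here.
func parseMountinfo(data string) ([]mountEntry, error) {
	var entries []mountEntry
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		f := strings.Fields(line)
		if len(f) < 5 {
			return nil, fmt.Errorf("malformed mountinfo line: %q", line)
		}
		entries = append(entries, mountEntry{devID: f[2], mountPoint: f[4]})
	}
	return entries, sc.Err()
}

// likelyForeign returns the container mountpoints whose major:minor ID
// also appears in the host mount namespace. As noted above, this can
// yield false positives for file systems that reuse the same ID.
func likelyForeign(host, container []mountEntry) []string {
	hostIDs := make(map[string]struct{}, len(host))
	for _, e := range host {
		hostIDs[e.devID] = struct{}{}
	}
	var foreign []string
	for _, e := range container {
		if _, ok := hostIDs[e.devID]; ok {
			foreign = append(foreign, e.mountPoint)
		}
	}
	return foreign
}
```

To use it, feed parseMountinfo the contents of the host's and the container's mountinfo files and pass both results to likelyForeign; the returned mountpoints are the ones a shutdown path would leave mounted under this heuristic.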
_______________________________________________ systemd-devel mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/systemd-devel
