What precisely are the semantics of directory operations on union mounts supposed to be? (Note: that's mount -o union, not unionfs, which is mount -t union.)
As some may remember, the chief goal of the namei rototilling that's now been going on ~forever was to simplify how directory operations interact with namei and the vnode interface. Things have gotten to the point where material progress on that is now possible; however, it's important to implement all the bits and corner cases correctly.

Union mounts are complicated in this regard because when the directory involved is a union mount point, some layer of the union mount needs to be chosen to invoke the filesystem-level operation; and in some cases it might need to be tried repeatedly or at more than one layer before giving up.

TRYEMULROOT is similar in that ideally (when the directory in question exists both in the emulation root and the regular root) it would behave the same way. (As the implementation is quite different that may not be practical, but I feel like we shouldn't begin by aiming low.)

The current behavior of these operations on union mount points is not necessarily relevant because after reviewing things I'm fairly certain that in at least some cases it's wrong.

Directory operations can be divided into five categories:

 - lookup (ordinary directory traversal, operations like stat, open without O_CREAT, etc.)
 - nonexclusive create (open with O_CREAT)
 - exclusive create (mkdir, symlink, open with O_CREAT|O_EXCL, etc.)
 - remove (rmdir, unlink)
 - rename

So I think these should behave as follows:

For lookup, we should start at the top of the union stack and try looking up the target name, and descend until either we find it in some layer or run out of layers. This much is pretty clear.

For nonexclusive create, we should do the same, and if we run out of layers start at the top again and, knowing that the name doesn't exist, continue like an exclusive create. This requires not unlocking the directory in between so that the proposition "the name doesn't exist" remains true.
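
To make the lookup and nonexclusive create rules concrete, here is a toy user-space model in Python. It is only a sketch and not kernel code: Layer, union_lookup, and union_create_nonexcl are made-up names, a layer is reduced to a set of names plus a readonly flag, and there is no locking. Which layer the new entry lands in is an assumption of the sketch (the topmost layer not marked readonly).

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Layer:
    """Hypothetical stand-in for one directory in the union stack."""
    name: str
    readonly: bool = False
    entries: set = field(default_factory=set)


def union_lookup(stack, target):
    """Return the topmost layer containing target, or None.

    stack[0] is the top of the union stack.
    """
    for layer in stack:
        if target in layer.entries:
            return layer
    return None


def union_create_nonexcl(stack, target):
    """Open-or-create: return (layer, created).

    The real operation has to keep the directory locked between the
    lookup pass and the create pass so that "the name doesn't exist"
    stays true. This single-threaded toy does no locking at all.
    """
    found = union_lookup(stack, target)
    if found is not None:
        return found, False
    # Ran out of layers: start at the top again and create.
    # Assumption: use the topmost layer not marked readonly.
    for layer in stack:
        if not layer.readonly:
            layer.entries.add(target)
            return layer, True
    raise OSError(errno.EROFS, "no writable layer in union stack")
```

With a stack whose top layer holds "a" and whose readonly bottom layer holds "a" and "b", looking up "a" resolves in the top layer, opening "b" finds the existing entry in the bottom layer without creating anything, and opening a new name creates it in the top layer.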

For an exclusive create, however, we need to ascertain that the name doesn't exist before we try creating anything. Various security properties depend on exclusive create actually being exclusive, and I don't think having union mounts weaken this is healthy. So I think we need to test all layers before creating anything. (It also means we need to lock all layers, not just one at a time, which we don't currently do and is currently problematic, but that's a topic for later.) Once we've ascertained that the name doesn't exist, we use the topmost read-write layer; that is, start at the top and descend skipping layers that are tagged readonly. (But: does this strictly mean readonly as in EROFS, or do we skip layers that are chmod -w for the current user?)

For remove, I think the correct thing to do is to descend until we find the topmost layer where the target name exists, if any, and then operate at that layer.

And for rename, I think the correct thing is to make like remove on the first (from-dir) argument, then find whatever layer in the second (to-dir) argument is the same volume, regardless of stack order. This can result in moving a file under another file, but that's what Plan 9 does and I guess it's ok. (I guess if there's more than one instance of the same volume in the to-dir union stack, which is not impossible with rebind mounts if we ever implement that, it should use the topmost one.)

Plan 9 has a mount flag (mount -c) that it uses to pick the layer where new objects get created, rather than going by readonly vs. read-write; we don't have that but could implement it.

Does this seem reasonable?

-- 
David A. Holland
dholl...@netbsd.org