Re: [git pull] mount API series
Al Viro writes: > On Mon, Nov 12, 2018 at 08:54:39PM +, Al Viro wrote: >> On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote: >> > Steven Whitehouse writes: >> >> > > Can you share some details of what this NULL dereference is? David and >> > > Al have been working on the changes as requested by Linus later in >> > > this thread, and they'd like to tidy up this issue too at the same >> > > time if possible. We are not asking you to actually provide a fix, in >> > > case you are too busy to do so, however it would be good to know what >> > > the issue is so that we can make sure that it is resolved in the next >> > > round of patches, >> > >> > https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/ >> >> Thought it had been dealt with, but you are right - oops is still there >> and obviously needs fixing. However, looking at that place in mainline... >> How does that thing manage to work? Look: >> /* Notice when we are propagating across user namespaces */ >> if (m->mnt_ns->user_ns != user_ns) >> type |= CL_UNPRIVILEGED; >> child = copy_tree(last_source, last_source->mnt.mnt_root, type); >> if (IS_ERR(child)) >> return PTR_ERR(child); >> child->mnt.mnt_flags &= ~MNT_LOCKED; >> mnt_set_mountpoint(m, mp, child); >> last_dest = m; >> last_source = child; [Moved this question up as it's answer is a good foundation for the rest] > FWIW, I don't believe that CL_UNPRIVELEGED approach is workable; the logics > we want is, AFAICS, "when destination is not yours, whatever you attach to > it should be all locked". Correct? Kinda sorta. There is a tree of privilege. Mount namespaces owned by a more privileged user_ns can propagate to slaves in a mount namespace owned by less privileged user_ns. Because the required permission for the making the initial mount are insufficient a mount can not propagate from a less privileged mount namespace to a more privileged mount namespace. When propagating from a more privileged mount namespace to a less privileged mount namespace we want to maintain some properties. 1) That mounts that propogate together are locked together. 2) That the mount flags readonly, nodev, nosuid, and noexec when set in a more privileged mount namespace are not changeable in a less privileged mount namespace. >> OK, we'd created the copy to be attached to the next candidate mountpoint. >> If we have e.g. a 4-element peer group, we'll start with what we'd been >> asked to mount, then call that sucker three times, getting a copy for >> each of those mountpoints. Right? Now, what happens if the 1st, 3rd and >> 4th members live in your namespace, with the second one being >> elsewhere? As I understand the question we have a bug in the mount propagation tree. All peers need to be in mount namespaces that share a user namespace. Further copy_tree fundamentally needs to copy from either a peer node to a peer node or a master node to a slave node to achieve the correct results when setting up the propagation tree. So I don't currently see how the logic in propgate_mnt could possibly get into a problematic situation. >> We have >> source_mnt: that'll go on top of the 1st mountpoint >> copy of source_mnt: that'll go on top of the 2nd mountpoint >> copy of copy of source_mnt: that'll go on top of the 3rd mountpoint >> copy of copy of copy of source_mnt: that'll go on top of the 4th one >> And AFAICS your logics there has just made sure that everything except the >> source_mnt will have MNT_LOCK_... all over the place. IOW, the effect of >> CL_UNPRIVELEGED is cumulative. >> >> How the hell does that code avoid this randomness? Note had the members of >> that peer group been in a different order, you would've gotten a different >> result. >> What am I missing here? I believe it is that members of a peer group are guaranteed to share a user namepace, or else they can't be peers. >> Oops is a separate story, and a regression in its own right; it needs to be >> fixed. But I would really like to sort out the semantics of the existing >> code in that area, so that we don't end up with patch conflicts. > > Ping? Apologies for the delay I have been on my own deep dive and I initially misunderstood the question. For some reason I thought you were implying that f2ebb3a921c1 ("smarter propagate_mnt()") changed the logic and introduced the bug. So I didn't understand why you were asking me. > If so, the natural place to deal with that would be in commit_tree() after > propagate_mnt() has created all copies, not while creating those copies. > That, after all, is where we mark all those struct mount as belonging to > namespace... Alternatively we could add a new mount propagation flag call it MNT_PRIV_BORDER that would mean that we don't need to look at mount namespaces when considering the propgation tree. We would just need to notice that the slave we
Re: [git pull] mount API series
On Mon, Nov 12, 2018 at 08:54:39PM +, Al Viro wrote: > On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote: > > Steven Whitehouse writes: > > > > Can you share some details of what this NULL dereference is? David and > > > Al have been working on the changes as requested by Linus later in > > > this thread, and they'd like to tidy up this issue too at the same > > > time if possible. We are not asking you to actually provide a fix, in > > > case you are too busy to do so, however it would be good to know what > > > the issue is so that we can make sure that it is resolved in the next > > > round of patches, > > > > https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/ > > Thought it had been dealt with, but you are right - oops is still there > and obviously needs fixing. However, looking at that place in mainline... > How does that thing manage to work? Look: > /* Notice when we are propagating across user namespaces */ > if (m->mnt_ns->user_ns != user_ns) > type |= CL_UNPRIVILEGED; > child = copy_tree(last_source, last_source->mnt.mnt_root, type); > if (IS_ERR(child)) > return PTR_ERR(child); > child->mnt.mnt_flags &= ~MNT_LOCKED; > mnt_set_mountpoint(m, mp, child); > last_dest = m; > last_source = child; > OK, we'd created the copy to be attached to the next candidate mountpoint. > If we have e.g. a 4-element peer group, we'll start with what we'd been > asked to mount, then call that sucker three times, getting a copy for > each of those mountpoints. Right? Now, what happens if the 1st, 3rd and > 4th members live in your namespace, with the second one being elsewhere? > We have > source_mnt: that'll go on top of the 1st mountpoint > copy of source_mnt: that'll go on top of the 2nd mountpoint > copy of copy of source_mnt: that'll go on top of the 3rd mountpoint > copy of copy of copy of source_mnt: that'll go on top of the 4th one > And AFAICS your logics there has just made sure that everything except the > source_mnt will have MNT_LOCK_... all over the place. IOW, the effect of > CL_UNPRIVELEGED is cumulative. > > How the hell does that code avoid this randomness? Note had the members of > that peer group been in a different order, you would've gotten a different > result. > What am I missing here? > > Oops is a separate story, and a regression in its own right; it needs to be > fixed. But I would really like to sort out the semantics of the existing > code in that area, so that we don't end up with patch conflicts. Ping? FWIW, I don't believe that CL_UNPRIVELEGED approach is workable; the logics we want is, AFAICS, "when destination is not yours, whatever you attach to it should be all locked". Correct? If so, the natural place to deal with that would be in commit_tree() after propagate_mnt() has created all copies, not while creating those copies. That, after all, is where we mark all those struct mount as belonging to namespace... Again, this is quite independent from the oops you've reported (and that, BTW, can be triggered without any userns involvement - commit_tree() itself will oops just fine if parent's ->mnt_ns is NULL, userns or no userns). I think I understand how to deal with that thing, but it's a separate story; handling of MNT_LOCK... is a problem that exists in the mainline and whatever fix we come up with for this one will need to be backportable. Al "resurfaced from long and thoroughly nasty dive through the LSM gutter and finally getting around to more pleasant stuff" Viro
Re: [git pull] mount API series
On Sun, Nov 11, 2018 at 08:07:20PM -0600, Eric W. Biederman wrote: > Steven Whitehouse writes: > > Can you share some details of what this NULL dereference is? David and > > Al have been working on the changes as requested by Linus later in > > this thread, and they'd like to tidy up this issue too at the same > > time if possible. We are not asking you to actually provide a fix, in > > case you are too busy to do so, however it would be good to know what > > the issue is so that we can make sure that it is resolved in the next > > round of patches, > > https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/ Thought it had been dealt with, but you are right - oops is still there and obviously needs fixing. However, looking at that place in mainline... How does that thing manage to work? Look: /* Notice when we are propagating across user namespaces */ if (m->mnt_ns->user_ns != user_ns) type |= CL_UNPRIVILEGED; child = copy_tree(last_source, last_source->mnt.mnt_root, type); if (IS_ERR(child)) return PTR_ERR(child); child->mnt.mnt_flags &= ~MNT_LOCKED; mnt_set_mountpoint(m, mp, child); last_dest = m; last_source = child; OK, we'd created the copy to be attached to the next candidate mountpoint. If we have e.g. a 4-element peer group, we'll start with what we'd been asked to mount, then call that sucker three times, getting a copy for each of those mountpoints. Right? Now, what happens if the 1st, 3rd and 4th members live in your namespace, with the second one being elsewhere? We have source_mnt: that'll go on top of the 1st mountpoint copy of source_mnt: that'll go on top of the 2nd mountpoint copy of copy of source_mnt: that'll go on top of the 3rd mountpoint copy of copy of copy of source_mnt: that'll go on top of the 4th one And AFAICS your logics there has just made sure that everything except the source_mnt will have MNT_LOCK_... all over the place. IOW, the effect of CL_UNPRIVELEGED is cumulative. How the hell does that code avoid this randomness? Note had the members of that peer group been in a different order, you would've gotten a different result. What am I missing here? Oops is a separate story, and a regression in its own right; it needs to be fixed. But I would really like to sort out the semantics of the existing code in that area, so that we don't end up with patch conflicts.
Re: [git pull] mount API series
Steven Whitehouse writes: > Hi, > > > On 31/10/18 15:38, Eric W. Biederman wrote: >> Al Viro writes: >> >>> mount API series from David Howells. Last cycle's objections >>> had been of the "I'd do it differently" variety and with no such >>> differently done variants having ever materialized over several >>> cycles... >> Absolutely not. >> >> My objections fundamentally is that I can find real problems when I look >> at the code. Further the changes have not been incremental changes that >> have evolved the code from one state to another but complete >> replacements of code that make code review very difficult and bisection >> completely inapplicable. >> >> I also object that this series completely fails to fix the worst but I >> have ever seen in the mount API. Whit no real intrest shown in working >> to fix it. >> >> A couple of bugs that I can see quickly. Several of which I have >> previously reported: >> >> - There is an easily triggered NULL pointer deference with open_tree >>and mount propagation. >> >> > Can you share some details of what this NULL dereference is? David and > Al have been working on the changes as requested by Linus later in > this thread, and they'd like to tidy up this issue too at the same > time if possible. We are not asking you to actually provide a fix, in > case you are too busy to do so, however it would be good to know what > the issue is so that we can make sure that it is resolved in the next > round of patches, https://lore.kernel.org/lkml/87bm7n5k1r@xmission.com/ Eric
Re: [git pull] mount API series
Hi, On 31/10/18 15:38, Eric W. Biederman wrote: Al Viro writes: mount API series from David Howells. Last cycle's objections had been of the "I'd do it differently" variety and with no such differently done variants having ever materialized over several cycles... Absolutely not. My objections fundamentally is that I can find real problems when I look at the code. Further the changes have not been incremental changes that have evolved the code from one state to another but complete replacements of code that make code review very difficult and bisection completely inapplicable. I also object that this series completely fails to fix the worst but I have ever seen in the mount API. Whit no real intrest shown in working to fix it. A couple of bugs that I can see quickly. Several of which I have previously reported: - There is an easily triggered NULL pointer deference with open_tree and mount propagation. Can you share some details of what this NULL dereference is? David and Al have been working on the changes as requested by Linus later in this thread, and they'd like to tidy up this issue too at the same time if possible. We are not asking you to actually provide a fix, in case you are too busy to do so, however it would be good to know what the issue is so that we can make sure that it is resolved in the next round of patches, Steve.
Re: [git pull] mount API series
Hi Al, On 2018/11/2 12:07, Al Viro wrote: > On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote: > >> (*) mount-api-core. These are the internal-only patches that add the >> fs_context, the legacy wrapper and the security hooks and make certain >> filesystems make use of it. > > FWIW, while rereading that series I'd spotted something very odd in erofs. > It's orthogonal to everything else, but just to make sure it doesn't get > lost: > * sbi->dev_name thing in erofs is used only for debugging printks, > basically. Just use sb->s_id[] and be done with that. > * dump struct erofs_mount_private - you don't need dev_name in > your erofs_fill_super(). Just use mount_bdev() in usual fashion. OK, these two points are the same, the original alternative patch to fixup it is to use bdevname(), However I saw what is done in drivers/usb/gadget/function/f_fs.c, therefore I fixed in as what I saw in f_fs.c. Refer: https://lists.ozlabs.org/pipermail/linux-erofs/2018-September/000548.html https://lists.ozlabs.org/pipermail/linux-erofs/2018-September/000551.html I could remove erofs_mount_private entirely if you want. :) > * what the hell are you doing with ->s_root??? Why would you > possibly want it hashed and what kind of dcache lookup could find it? > That d_rehash() looks deeply confused; what are you trying to do there? Thanks for pointing out. After I think into this piece of code, I also think that is redundant. I will fix it immediately, thanks again for pointing out. Thanks, Gao Xiang >
Re: [git pull] mount API series
Hi Al, On 2018/11/3 3:42, Al Viro wrote: > On Fri, Nov 02, 2018 at 04:07:01AM +, Al Viro wrote: >> On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote: >> >>> (*) mount-api-core. These are the internal-only patches that add the >>> fs_context, the legacy wrapper and the security hooks and make certain >>> filesystems make use of it. >> >> FWIW, while rereading that series I'd spotted something very odd in erofs. >> It's orthogonal to everything else, but just to make sure it doesn't get >> lost: >> * sbi->dev_name thing in erofs is used only for debugging printks, >> basically. Just use sb->s_id[] and be done with that. >> * dump struct erofs_mount_private - you don't need dev_name in >> your erofs_fill_super(). Just use mount_bdev() in usual fashion. >> * what the hell are you doing with ->s_root??? Why would you >> possibly want it hashed and what kind of dcache lookup could find it? >> That d_rehash() looks deeply confused; what are you trying to do there? > > ... and while we are at it, what happens to > unsigned int nameoff = le16_to_cpu(de[mid].nameoff); > unsigned int matched = min(startprfx, endprfx); > > struct qstr dname = QSTR_INIT(data + nameoff, > unlikely(mid >= ndirents - 1) ? > maxsize - nameoff : > le16_to_cpu(de[mid + 1].nameoff) - nameoff); > > /* string comparison without already matched prefix */ > int ret = dirnamecmp(name, &dname, &matched); > if le16_to_cpu(de[...].nameoff) is not monotonically increasing? I.e. > what's to prevent e.g. (unsigned)-1 ending up in dname.len? > > Corrupted fs image shouldn't oops the kernel... Yes, thanks for pointing out. :) I will add more boundary check later before moving into fs/ directory... erofs now is under dm-verity for our HUAWEI mobile phone, so it doesn't be corruptted. I will add more checks and meta checksum later after EROFS productization successfully... :) Thanks, Gao Xiang >
Re: [git pull] mount API series
On Fri, Nov 02, 2018 at 04:07:01AM +, Al Viro wrote: > On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote: > > > (*) mount-api-core. These are the internal-only patches that add the > > fs_context, the legacy wrapper and the security hooks and make certain > > filesystems make use of it. > > FWIW, while rereading that series I'd spotted something very odd in erofs. > It's orthogonal to everything else, but just to make sure it doesn't get > lost: > * sbi->dev_name thing in erofs is used only for debugging printks, > basically. Just use sb->s_id[] and be done with that. > * dump struct erofs_mount_private - you don't need dev_name in > your erofs_fill_super(). Just use mount_bdev() in usual fashion. > * what the hell are you doing with ->s_root??? Why would you > possibly want it hashed and what kind of dcache lookup could find it? > That d_rehash() looks deeply confused; what are you trying to do there? ... and while we are at it, what happens to unsigned int nameoff = le16_to_cpu(de[mid].nameoff); unsigned int matched = min(startprfx, endprfx); struct qstr dname = QSTR_INIT(data + nameoff, unlikely(mid >= ndirents - 1) ? maxsize - nameoff : le16_to_cpu(de[mid + 1].nameoff) - nameoff); /* string comparison without already matched prefix */ int ret = dirnamecmp(name, &dname, &matched); if le16_to_cpu(de[...].nameoff) is not monotonically increasing? I.e. what's to prevent e.g. (unsigned)-1 ending up in dname.len? Corrupted fs image shouldn't oops the kernel...
Re: [git pull] mount API series
On Thu, Nov 01, 2018 at 11:59:23PM +, David Howells wrote: > (*) mount-api-core. These are the internal-only patches that add the > fs_context, the legacy wrapper and the security hooks and make certain > filesystems make use of it. FWIW, while rereading that series I'd spotted something very odd in erofs. It's orthogonal to everything else, but just to make sure it doesn't get lost: * sbi->dev_name thing in erofs is used only for debugging printks, basically. Just use sb->s_id[] and be done with that. * dump struct erofs_mount_private - you don't need dev_name in your erofs_fill_super(). Just use mount_bdev() in usual fashion. * what the hell are you doing with ->s_root??? Why would you possibly want it hashed and what kind of dcache lookup could find it? That d_rehash() looks deeply confused; what are you trying to do there?
Re: [git pull] mount API series
Linus Torvalds wrote: > So if the patch series can be split up into a prep-phase that doesn't > change any user-visible semantics (including the security side), but > that uses the fs_context internally and allows the filesystems to be > converted to the new world order, than that would make merging the > early work much easier (and then my worry about the later phases would > probably be much less too). As a first go, I've rebased the patches to v4.19 (which required no other changes), folded in some small bugfixes (fix error handling in remount, fix incorrect user_ns in proc and mqueue) and split the set up. There are now three branches in my git tree: (*) mount-api-core. These are the internal-only patches that add the fs_context, the legacy wrapper and the security hooks and make certain filesystems make use of it. (*) mount-api-uapi. This is mount-api-core with the UAPI-visible patches stacked thereon. (*) mount-api. This is the original patchset. "git diff mount-api mount-api-uapi" shows no differences. Note that the commit "vfs: Implement logging through fs_context" appears in both sets. I was just going to leave it as macros that just wrap pr_notice(), but I think it might be wiser to pull it out of line (as will be required later) and make it produce messages at different levels. The git tree in question is at: https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git David
Re: [git pull] mount API series
On Thu, Nov 1, 2018 at 3:05 PM Al Viro wrote: > > Do you mind if we end up with work.mount rebased? > The usual objections re testing in -next do not apply in this case, > AFAICS... I was assuming that the work.mount branch would be entirely re-done, yes. Linus
Re: [git pull] mount API series
On Thu, Nov 01, 2018 at 11:33:31AM -0700, Linus Torvalds wrote: > Al - can I ask you to look at helping David with something like that? > You tend to be very good at generating those patch-series with > "obviously no changes" for the individual patches, but the end result > ends up being totally different from the starting point (I'm thinking > of all the locking and dentry refcounting series). I'll try. Before we go there, I'd like to get the rest of vfs.git off my hands - AFS series and misc pile. Will send pull requests shortly, then - this stuff. Do you mind if we end up with work.mount rebased? The usual objections re testing in -next do not apply in this case, AFAICS...
Re: [git pull] mount API series
On Thu, Nov 1, 2018 at 10:18 AM David Howells wrote: > > No. What we have upstream now is most definitely *not* atomic. That said, > some of the steps of the sequence are atomic in their own right: It's more that what we have traditionally is atomic wrt user space. Use space does *one* indivisible operation. There's no way for user-space to do a half-way mount and say "ok, that's it, I'll just exit now" (or skip a phase and fool the kernel into handling a half-way setup issue). So the new model does require a bit more care in tear-down etc. I'm not so worried about this, actually, but it *is* different. Afaik, some of the bugs were exactly in this half-way state bits. [ And yes, looking back at some the reports I found, it was Alan Jenkins and mainly the open-tree thing. ] > Would you be willing to take just the *internal* fs_context changes and leave > the UAPI for the next window? Hmm. I had to think about that, but the more I thought about it, the more I liked it. Yes. Depending on how that is done, that would make a lot of my worries go away. *Particularly* if it then allows us to do the per-filesystem conversion one by one. So if the patch series can be split up into a prep-phase that doesn't change any user-visible semantics (including the security side), but that uses the fs_context internally and allows the filesystems to be converted to the new world order, than that would make merging the early work much easier (and then my worry about the later phases would probably be much less too). It would be _really_ nice to see that prep-phase be done in a way where each individual patch very obviously doesn't change semantics. If it's obvious, then I'm happy to consider that pure prep-work and willing to merge it after the merge window in order to make the _next_ merge window easier. Al - can I ask you to look at helping David with something like that? You tend to be very good at generating those patch-series with "obviously no changes" for the individual patches, but the end result ends up being totally different from the starting point (I'm thinking of all the locking and dentry refcounting series). Linus
Re: [git pull] mount API series
Linus Torvalds wrote: > Exactly because it splits up what used to be an atomic sequence, No. What we have upstream now is most definitely *not* atomic. That said, some of the steps of the sequence are atomic in their own right: - Creation and initialisation of the superblock - Creation of the vfsmount - Attachment of a new vfsmount and these units have not changed. What I have done is to extract the parameter processing and move it to the front and provided a way to provide more exotic parameters more easily. This means that all parse errors are found before we start creating objects and applying parameters (which is a bug in some of our existing remount operations that this fixes). Also, as the parameters are passed one at a time, I've also inserted a validation step so that the parameter set as a whole can be checked for inter-parameter consistency before any object creation is performed. Note further that upstream, remount/reconfiguration is *not* atomic, but with these changes that becomes closer to being realisable. I've looked into how to do it for btrfs using RCU and a CoW'd struct with the settings in, but that's a much bigger, more intrusive change that doesn't need to be coupled to the mount API changes. > I worry about the intermediate states having issues (the refcounting things > we've already seen, for example), but I also worry about the fact that it > completely changes the model, and that that makes things like security hooks > fundamentally different. A lot of the recent bugs aren't from fsopen()/fsconfig()/fsmount() at all, but rather from the open_tree() syscall and, to a lesser extent, the move_mount() syscall. Would you be willing to take just the *internal* fs_context changes and leave the UAPI for the next window? I have patches that make NFS and BTRFS use it, which would then allow me to remove nearly 1000 lines of LSM code[1], but I can't reasonably ask Trond and Chris to take them unless there's some schedule to get the core dependencies in. [1] See the "security: Remove the set_mnt_opts and clone_mnt_opts ops" commit here: https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=btrfs-mount-api > The latter may not be a "bug" in the sense that it's all intentional, > but it does mean that I see *one* mount-time security hook now having > been replaced by *five* security hooks. That's not really accurate. I'm adding: (1) security_fs_context_parse_param() (2) security_fs_context_validate() (3) security_sb_get_tree() (4) security_sb_reconfigure() (5) security_sb_mountpoint() (6) security_move_mount() Hook (1) allows individual mount parameters to examined/removed and then (2) allows the whole set to be checked for consistency. This is just parameter parsing and checking. Then (3) allows super_block::security to be initialised and (4) allows an LSM's sb parameters to be reconfigured by remount (or changes to be prohibited). These use the context set up by (1) and checked by (2). Finally, (5) is used to check that a mountpoint may be used and (6) to see if a mount may be moved. Further, I'm actually removing more than one: (1) security_sb_kern_mount() (2) security_sb_remount() (3) security_sb_copy_data() (4) security_sb_set_mnt_opts() (5) security_sb_clone_mnt_opts() (6) security_sb_parse_opts_str() But four of them are only applicable to NFS and BTRFS, and can only be removed after those have been dealt with. > And that's ignoring the alloc/dup/free ones. Yes, for the record, I'm adding: (7) security_fs_context_alloc() (8) security_fs_context_dup() (9) security_fs_context_free() to allow the LSM to store data in the fs_context. > As far as I can tell, the patch-series simply added the hooks. It made > no attempt at making sure that previous hooks had sane semantics. Ummm... I can't say that what's upstream has sane semantics - that's not really the focus of these patches. What I've tried to do is to retain the behaviour where possible. > Do they? So now a system that has an old mount hook can be bypassed by > simplky using the new model instead. Not so easily. The LSM is *still* being consulted and I think both in SELinux and Smack the same rules will still be applied. There's a case in AppArmor that requires a rejigging of the FSM compiler to make it work, and I discussed this with John Johansen at the LSS in Edinburgh last week, but we can simply disallow the use of the new API with AppArmor. > I dunno. The patches are illegible in this regard (and I don't blame > the fsmount ones, I blame the security subsystem that just is full of > random indirection to random sub-security systems, which in turn just > have hash lookups for data structures set up by other operations > entirely). > > Eric was pointing out bugs as late as the weekend before the merge > window opened. That, to me, does not say "ready for the merge window". Actually, Alan Jenkins has been mostly the one pointing out the
Re: [git pull] mount API series
On Wed, Oct 31, 2018 at 04:36:01PM +, Al Viro wrote: > On Wed, Oct 31, 2018 at 10:38:17AM -0500, Eric W. Biederman wrote: > > A couple of bugs that I can see quickly. Several of which I have > > previously reported: > > > > - There is an easily triggered NULL pointer deference with open_tree > > and mount propagation. > > What the hell? If the fixes that went in do not handle something, > especially if you have testcases, where the fuck have you been and > where _are_ those testcases, while we are at it? > > Eric, this is bloody ridiculous - "I have an easily triggered NULL pointer > dereference in..., here it is" is hard to miscommunicate. Sure, any such > needs to be fixed. For crying out loud, that thing has not been hidden - > it sat in -next, it's been reposted several times... Again, would you mind telling what exactly does the above refer to and whether it is still true? I'm not asking to put details in every time you mention something of that sort, but generally one is expected to come up with those on demand, especially if the bug _is_ easily triggered. You, IIRC, had been Cc'd on the threads where open_tree breakage was dealt with. I can certainly believe that there might be something else in that area (or any other, for that matter). I *am* interested in finding and fixing that and I would rather appreciate the details of what you are seeing.
Re: [git pull] mount API series
On Thu, Nov 1, 2018 at 3:53 AM Steven Whitehouse wrote: > > When I look at the discussions I'm seeing two main issues (please > correct me if you think I'm wrong about this) which are (a) whether the > design is correct and (b) whether there are still bugs in the current > patch set. > > Which of these are you most concerned about? I'm most worried about bugs _due_ to the new design. Exactly because it splits up what used to be an atomic sequence, I worry about the intermediate states having issues (the refcounting things we've already seen, for example), but I also worry about the fact that it completely changes the model, and that that makes things like security hooks fundamentally different. The latter may not be a "bug" in the sense that it's all intentional, but it does mean that I see *one* mount-time security hook now having been replaced by *five* security hooks. And that's ignoring the alloc/dup/free ones. As far as I can tell, the patch-series simply added the hooks. It made no attempt at making sure that previous hooks had sane semantics. Do they? So now a system that has an old mount hook can be bypassed by simplky using the new model instead. I dunno. The patches are illegible in this regard (and I don't blame the fsmount ones, I blame the security subsystem that just is full of random indirection to random sub-security systems, which in turn just have hash lookups for data structures set up by other operations entirely). Eric was pointing out bugs as late as the weekend before the merge window opened. That, to me, does not say "ready for the merge window". Linus
Re: [git pull] mount API series
Hi, On 31/10/18 16:18, Linus Torvalds wrote: On Tue, Oct 30, 2018 at 10:34 PM Al Viro wrote: mount API series from David Howells. Last cycle's objections had been of the "I'd do it differently" variety and with no such differently done variants having ever materialized over several cycles... Having just lurked on the discussions about the bugs in here, I don't think this is ready. Maybe they got fixed, but if so, it was recent and it was pretty fundamental. The stated aim of the series is to make the mount API _better_, not worse. And right now it looks like a "two steps back, one theoretical step forwards" kind of better. Linus The design of the new mount API has been under discussion for some time, both on the mailing lists, and also at LSF/MM too. Al and David (and others) have put a lot of work into getting to the current position, and have specifically requested input from Eric about his concerns over past cycles. When I look at the discussions I'm seeing two main issues (please correct me if you think I'm wrong about this) which are (a) whether the design is correct and (b) whether there are still bugs in the current patch set. Which of these are you most concerned about? It seems to me that there is not a lot of point in spending a large amount of time in additional review/testing of the current patch set if the overall design is set to be rejected. If your concerns are only with the robustness/stability of the patch set, then that provides a clear route to resolving the current impasse, at least assuming that Eric is able to enumerate the issues that he has discovered. It looks like David has already provided a fix for one of the issues which Eric mentioned in his recent email. Eric it would be good if you could confirm that this addresses that particular concern, Steve.
Re: [git pull] mount API series
On Wed, Oct 31, 2018 at 7:39 PM, David Howells wrote: >> My objections fundamentally is that I can find real problems when I look >> at the code. I think the big risk with such a change is not that there are bugs, but that we get the API wrong, and have to keep supporting a broken API forever. Were any of your objections of that type? There's the argument about sharing of super blocks doing weird things, but we can't get rid of that one due to having to support the old mount(2) API forever anyway. That's not a new burden, and the new API allows fixing this incrementally by adding better models later. Thanks, Miklos
Re: [git pull] mount API series
> My objections fundamentally is that I can find real problems when I look > at the code. Eric. You have repeatedly stated that there are "thinkos, typos and bugs" in the code, but you have not been very forthcoming in actually disclosing *what* those things are. You had a go at rewriting it for yourself, with different philosophical aims, and at one point, you posted a partially reworked git branch my way. However, you changed the patches in various ways that rendered it incomparable with mine, making it very hard to actually separate bug fixes from your other changes. No useful notes were given. > Further the changes have not been incremental changes that have evolved the > code from one state to another but complete replacements of code that make > code review very difficult and bisection completely inapplicable. ... therefore I find this a little hypocritical. Further, your rewrite breaks NFS and other things, such as mis-renumbering the tokens in SELinux. Splitting the lookup-and-create operation into separate lookup-no-create and create-or-fail won't work for NFS. Despite your protestation, this *has* been discussed publically - and you even joined in the discussion. > A couple of bugs that I can see quickly. Several of which I have > previously reported: > > - There is an easily triggered NULL pointer deference with open_tree > and mount propagation. As far as I know, these have all been cleared up. I think that I've cleared up all the bugs that Alan Jenkins found (many thanks to him for that!). > - Bisection will not work with the cpuset filesystem patch. It won't? I don't recall this been mentioned before (but I may have missed it). > At least cpuset looks like it may be mountable now. It does mount now. Thanks for pointing that out. > - The setting of fc->user_ns on proc remains broken. In particular if you > create a child user namespace and attempt to mount proc it will succeed > instead of fail. You did mention this before - but you haven't been clear on how to fix it. I tried to work this out from your branch, but it's mixed in with loads of other changes. > - The mqueue filesystem has the same issue as proc. fc->user_ns is misset. I have a patch that I think should fix both of these. I'll follow up this message with it, if you could check it please. David
Re: [git pull] mount API series
On Wed, Oct 31, 2018 at 10:38:17AM -0500, Eric W. Biederman wrote: > A couple of bugs that I can see quickly. Several of which I have > previously reported: > > - There is an easily triggered NULL pointer deference with open_tree > and mount propagation. What the hell? If the fixes that went in do not handle something, especially if you have testcases, where the fuck have you been and where _are_ those testcases, while we are at it? Eric, this is bloody ridiculous - "I have an easily triggered NULL pointer dereference in..., here it is" is hard to miscommunicate. Sure, any such needs to be fixed. For crying out loud, that thing has not been hidden - it sat in -next, it's been reposted several times... I'm glad that you can see bugs. How about _showing_ them? Because I don't see anything that would prevent this reply of yours from being posted pretty much verbatim cycle after cycle, no matter what gets done.
Re: [git pull] mount API series
ebied...@xmission.com (Eric W. Biederman) writes: > I am going to stop there. I believe there are more issues in the code. > I am relieved that I am not seeing the loss of some of the security > hooks that I thought I saw last time I looked at the code. Bah. Now I see the missing security hook. There are a set of security hooks that allow security modules to parse mount options. On a good day they look like: security_mnt_opts opts; char *secdata; secdata = alloc_secdata(); security_sb_copy_data("a,mount,options,string", secdata); security_init_mnt_opts(&opts); security_parse_opts_str(secdata, &opts); security_set_mnt_opts(sb, &opts, 0, NULL); security_free_mnt_opts(&opts); In practice however things are not that explicit. With security_sb_kern_mount performing all of the mnt_opts work. However after the rewrite in the patchset. The function sb_kern_mount no longer exists and it's replacement sb_get_tree out of necessity does not call parse_opts_str. This is because the mount options can no longer be passed as a string. The legacy compatibility code also does not call sb_parse_opts_str. The result is using the existing apis all of the security module command line parsing except for (btrfs and nfs) no longer works. The changes are not structured in a way that makes any of this easy to find. Which is why I have been saying I wouldn't do it that way. It also is the case that this pattern repeats through out the patches. Replacing code with something brand new, instead of evolving what is there. That makes it easy for this kind of thing to slip through. Eric
Re: [git pull] mount API series
On Tue, Oct 30, 2018 at 10:34 PM Al Viro wrote: > > mount API series from David Howells. Last cycle's objections > had been of the "I'd do it differently" variety and with no such > differently done variants having ever materialized over several > cycles... Having just lurked on the discussions about the bugs in here, I don't think this is ready. Maybe they got fixed, but if so, it was recent and it was pretty fundamental. The stated aim of the series is to make the mount API _better_, not worse. And right now it looks like a "two steps back, one theoretical step forwards" kind of better. Linus
Re: [git pull] mount API series
Al Viro writes: > mount API series from David Howells. Last cycle's objections > had been of the "I'd do it differently" variety and with no such > differently done variants having ever materialized over several > cycles... Absolutely not. My objections fundamentally is that I can find real problems when I look at the code. Further the changes have not been incremental changes that have evolved the code from one state to another but complete replacements of code that make code review very difficult and bisection completely inapplicable. I also object that this series completely fails to fix the worst but I have ever seen in the mount API. Whit no real intrest shown in working to fix it. A couple of bugs that I can see quickly. Several of which I have previously reported: - There is an easily triggered NULL pointer deference with open_tree and mount propagation. - Bisection will not work with the cpuset filesystem patch. At least cpuset looks like it may be mountable now. - The setting of fc->user_ns on proc remains broken. In particular if you create a child user namespace and attempt to mount proc it will succeed instead of fail. - The mqueue filesystem has the same issue as proc. fc->user_ns is misset. I suspect I didn't report it well but I reported both the proc and the mqueue filesystem weeks before the merge window opened. I am going to stop there. I believe there are more issues in the code. I am relieved that I am not seeing the loss of some of the security hooks that I thought I saw last time I looked at the code. Given both that I have reported bugs that remain unfixed and it's non-evolutionary nature that makes this patchset hard to review I have no confidence in this set of patches. I think the scope has been too large for anyone to properly review the code and it should be broken into smaller pieces. That there were significant open_tree and move_mount issues being found just before the merge window opened seems a testament to that. (AKA the first 3 or so patches in the patchset and fundamental to the rest). My apologies that we have not been able communication better and that I have not been able to clearly point out all of the bugs. My apologies I have not yet been able to give more constructive feedback. Still at the end of the day there remain regressions of existing functionality in that tree. Eric
[git pull] mount API series
mount API series from David Howells. Last cycle's objections had been of the "I'd do it differently" variety and with no such differently done variants having ever materialized over several cycles... Conflicts: two trivial ones in drivers/infiniband/Kconfig (removal of select ANON_INODES) fs/f2fs/super.c (->remount signature change) and a non-trivial one in fs/proc/inode.c - there we have mainline adding /* procfs dentries and inodes don't require IO to create */ s->s_shrink.seeks = 0; to proc_fill_super() (in 4b85afbdacd2 "mm: zero-seek shrinkers") while that series moves the sucker to fs/proc/root.c. Resolved by removing the old copy from fs/proc/inode.c and adding the same lines into the new copy in fs/proc/root.c. I'd put a variant of resolution into #proposed-merge. David's cover letter follows; it's obviously over the top for commit message of the merge. Where to cut it is up to you... = Here are a set of patches to create a filesystem context prior to setting up a new mount, populating it with the parsed options/binary data, looking up/creating the superblock, querying it and then effecting the mount. This is also used for remount since much of the parsing stuff is common in many filesystems. This allows namespaces and other information to be conveyed through the mount procedure. This is done with something like: fd = fsopen("nfs", 0); fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0); fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); struct fsinfo_statfs statfs; fsinfo(fd, NULL, NULL, &statfs, sizeof(statfs)); mfd = fsmount(fd, MS_NODEV); move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); The new move_mount() syscall can also be used simply to move mounts around: move_mount(AT_FDCWD, "/mnt", AT_FDCWD, "/mnt2", 0); And, in conjunction with the open_tree() syscall, can be used to clone mounts: fd = open_tree(AT_FDCWD, "/mnt", AT_RECURSIVE | OPEN_TREE_CLONE); move_mount(mfd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH); File descriptors can be used as mountpoint references: fd = open_tree(AT_FDCWD, "/mnt", 0); move_mount(mfd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH); move_mount(mfd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH); which, in this example, will *move* the mount at /mnt to /mnt2 and thence to /mnt3. Superblocks can be picked and reconfigured: fd = fspick(AT_FDCWD, "/mnt", 0) fsconfig(fd, FSCONFIG_SET_STRING, "option", "other-val", 0); fsconfig(fd, FSCONFIG_SET_STRING, "option2", "true", 0); fsconfig(fd, FSCONFIG_SET_STRING, "option3", "1234", 0); fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0); Filesystem parameters and other attributes can also be queried from the fd returned by fsopen() or fspick(): fd = fspick(AT_FDCWD, "/mnt", 0) struct fsinfo_params params = { .request= FSINFO_ATTR_PARAMETER, .Nth= 1, .Mth= 3, }; char param_buf[4096]; fsinfo(fd, NULL, ¶ms, param_buf, sizeof(param_buf)); which will retrieve the 4th value of the 2nd parameter (0 being first) as a printable string. Parameters and attributes can also be queried by path or on an ordinary fd: struct fsinfo_params params = { .request= FSINFO_ATTR_VOLUME_NAME, }; char param_buf[4096]; fsinfo(AT_FDCWD, "/etc/passwd", ¶ms, param_buf, sizeof(param_buf)); The details of a filesystem's parser can also be queried: fd = fsopen("ext4", 0); struct fsinfo_params params = { .request= FSINFO_ATTR_PARAM_NAME, .Nth= 1, }; char param_buf[4096]; fsinfo(fd, NULL, ¶ms, param_buf, sizeof(param_buf)); which, in this instance, will retrieve the name of parameter #1. I have implemented filesystem context handling for procfs, nfs, mqueue, cpuset, kernfs, sysfs, cgroup and afs filesystems. Unconverted filesystems are handled by a legacy filesystem wrapper for the moment. Note that I didn't use netlink as that would make the core kernel depend on CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing issues. WHY DO WE WANT THIS? Firstly, there's a bunch of problems with the mount(2) syscall: (1) It's actually six or seven different interfaces rolled into one and weird combinations of flags make it do different things beyond the original specification of the syscall. (2) It produces a particularly large and diverse set of errors, which have to be mapped back to a small error code. Yes, there's dmesg - if you have it configured - but you can't necessarily see that if you're