Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-23 Thread Ritesh Harjani




On 10/23/19 1:41 AM, Al Viro wrote:

On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:

On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:

I think we have still not taken this patch. Al?



or, for that matter, any callers of filename_lookup() assuming that the
lack of ENOENT means that the last call of walk_component() (from lookup_last())
has not failed with the same ENOENT and thus the result has been observed
positive.
You've picked the easiest one to hit, but on e.g. KVM setups you can have the
host thread representing the CPU where __d_set_inode_and_type() runs get
preempted (by the host kernel), leaving others with much wider window.


I had thought so about the other call sites, but as you said
maybe this was the easiest one to hit.
Then, I kind of followed your suggested fix in below link to fix at 
least this current crash.

https://patchwork.kernel.org/patch/10909881/



Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
the same goes for any places that assumes that d_is_dir() implies that
the sucker is positive, etc.

What we have guaranteed is
* ->d_lock serializes ->d_flags/->d_inode changes
* ->d_seq is bumped before/after such changes
* positive dentry never changes ->d_inode as long as you hold
a reference (negative dentry *can* become positive right under you)

So there are 3 classes of valid users: those holding ->d_lock, those
sampling and rechecking ->d_seq and those relying upon having observed
the sucker they've pinned to be positive.


:) Thanks for simplifying like this. Agreed.





What you've been hitting is "we have it pinned, ->d_flags says it's
positive but we still observe the value of ->d_inode from before the
store to ->d_flags that has made it look positive".


Actually, your patch opens another problem there.  Suppose you make
it d_really_is_positive() and hit the same race sans reordering.
Dentry is found by __d_lookup() and is negative.  Right after we
return from __d_lookup() another thread makes it positive (a symlink)
- ->d_inode is set, d_really_is_positive() becomes true.  OK, on we
go, pick the inode and move on.  Right?  ->d_flags is still not set
by the thread that made it positive.  We return from lookup_fast()
and call step_into().  And get to
 if (likely(!d_is_symlink(path->dentry)) ||
Which checks ->d_flags and sees the value from before the sucker
became positive.  IOW, d_is_symlink() is false here.  If that
was the last path component and we'd been told to follow links,
we will end up with positive dentry of a symlink coming out of
pathname resolution.



hmm. yes, looks like it, based on your above explanation.
So even though this could avoid crash, but still we may end up with
a bogus entry with current patch.




Similar fun happens if you have mkdir racing with lookup - ENOENT
is what should've happened if lookup comes first, success - if
mkdir does.  This way we can hit ENOTDIR, due to false negative
from d_can_lookup().

IOW, d_really_is_negative() in lookup_fast() will paper over
one of oopsen, but it
* won't cover similar oopsen on other codepaths and
* will lead to bogus behaviour.

I'm not sure that blanket conversion of d_is_... to smp_load_acquire()
is the right solution; it might very well be that we need to do that
only on a small subset of call sites, lookup_fast() being one of
those.  But we do want at least to be certain that something we'd
got from lookup_fast() in non-RCU mode already has ->d_flags visible.


We may also need similar guarantees with __d_clear_type_and_inode().

So do you think we should make use of ->d_seq for verifying this?
I see both __d_set_inode_and_type & __d_clear_type_and_inode() called
under ->d_seq_begin/->d_seq_end.

Then maybe we should use ->d_seq checking at those call sites.
We cannot unconditionally use ->d_seq checking in __d_entry_type(),
since we sometimes call this function inside ->d_seq_begin
(like in lookup_fast).




I'm going through the callers right now, will post a followup once
the things get cleaner...


Thanks for looking into this.



Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-22 Thread Al Viro
On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:
> On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
> > I think we have still not taken this patch. Al?

> You've picked the easiest one to hit, but on e.g. KVM setups you can have the
> host thread representing the CPU where __d_set_inode_and_type() runs get
> preempted (by the host kernel), leaving others with much wider window.
> 
> Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
> the same goes for any places that assumes that d_is_dir() implies that
> the sucker is positive, etc.
> 
> What we have guaranteed is
>   * ->d_lock serializes ->d_flags/->d_inode changes
>   * ->d_seq is bumped before/after such changes
>   * positive dentry never changes ->d_inode as long as you hold
> a reference (negative dentry *can* become positive right under you)
> 
> So there are 3 classes of valid users: those holding ->d_lock, those
> sampling and rechecking ->d_seq and those relying upon having observed
> the sucker they've pinned to be positive.
> 
> What you've been hitting is "we have it pinned, ->d_flags says it's
> positive but we still observe the value of ->d_inode from before the
> store to ->d_flags that has made it look positive".

Actually, your patch opens another problem there.  Suppose you make
it d_really_is_positive() and hit the same race sans reordering.
Dentry is found by __d_lookup() and is negative.  Right after we
return from __d_lookup() another thread makes it positive (a symlink)
- ->d_inode is set, d_really_is_positive() becomes true.  OK, on we
go, pick the inode and move on.  Right?  ->d_flags is still not set
by the thread that made it positive.  We return from lookup_fast()
and call step_into().  And get to
if (likely(!d_is_symlink(path->dentry)) ||
Which checks ->d_flags and sees the value from before the sucker
became positive.  IOW, d_is_symlink() is false here.  If that
was the last path component and we'd been told to follow links,
we will end up with positive dentry of a symlink coming out of
pathname resolution.

Similar fun happens if you have mkdir racing with lookup - ENOENT
is what should've happened if lookup comes first, success - if
mkdir does.  This way we can hit ENOTDIR, due to false negative
from d_can_lookup().

IOW, d_really_is_negative() in lookup_fast() will paper over
one of oopsen, but it
* won't cover similar oopsen on other codepaths and
* will lead to bogus behaviour.

I'm not sure that blanket conversion of d_is_... to smp_load_acquire()
is the right solution; it might very well be that we need to do that
only on a small subset of call sites, lookup_fast() being one of
those.  But we do want at least to be certain that something we'd
got from lookup_fast() in non-RCU mode already has ->d_flags visible.

I'm going through the callers right now, will post a followup once
the things get cleaner...


Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-22 Thread Al Viro
On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:

> I'm somewhat tempted to make __d_set_inode_and_type() do smp_store_release()
> for setting ->d_flags and __d_entry_type() - smp_load_acquire(); that would
> be pretty much free for x86 (as well as sparc, s390 and itanic) and reasonably
> cheap on ppc and arm64.  How badly would e.g. SMP arm suffer from the below
> (completely untested)?

PS: if we really go that way, we'd probably want __d_is_negative(), to be
used only under ->d_lock or with ->d_seq validation.

d_is_negative() callers in fs/dcache.c (all under ->d_lock) as well as
the RCU-side one in lookup_fast() (->d_seq validated) would you that.

Another fun place is do_unlinkat() - there we want to reorder
inode = dentry->d_inode;
if (d_is_negative(dentry))
goto slashes;
As it were, the same race here can lead to unbalanced iput(), if you
hit the window just right.


Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-22 Thread Al Viro
On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
> I think we have still not taken this patch. Al?

I'm still not sure it's a good way to deal with the whole
mess, TBH ;-/  Consider e.g. walk_component(), with its

if (unlikely(d_is_negative(path.dentry))) {
path_to_nameidata(, nd);
return -ENOENT;
}

seq = 0;/* we are already out of RCU mode */
inode = d_backing_inode(path.dentry);

or mountpoint_last()
if (d_is_negative(path.dentry)) {
dput(path.dentry);
return -ENOENT;
}
path.mnt = nd->path.mnt;
return step_into(nd, , 0, d_backing_inode(path.dentry), 0);
or last_open()
if (unlikely(d_is_negative(path.dentry))) {
path_to_nameidata(, nd);
return -ENOENT; 
}

/*
 * create/update audit record if it already exists.
 */
audit_inode(nd->name, path.dentry, 0);

if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
path_to_nameidata(, nd);
return -EEXIST;
}

seq = 0;/* out of RCU mode, so the value doesn't matter */
inode = d_backing_inode(path.dentry);
or, for that matter, any callers of filename_lookup() assuming that the
lack of ENOENT means that the last call of walk_component() (from lookup_last())
has not failed with the same ENOENT and thus the result has been observed
positive.

You've picked the easiest one to hit, but on e.g. KVM setups you can have the
host thread representing the CPU where __d_set_inode_and_type() runs get
preempted (by the host kernel), leaving others with much wider window.

Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
the same goes for any places that assumes that d_is_dir() implies that
the sucker is positive, etc.

What we have guaranteed is
* ->d_lock serializes ->d_flags/->d_inode changes
* ->d_seq is bumped before/after such changes
* positive dentry never changes ->d_inode as long as you hold
a reference (negative dentry *can* become positive right under you)

So there are 3 classes of valid users: those holding ->d_lock, those
sampling and rechecking ->d_seq and those relying upon having observed
the sucker they've pinned to be positive.

What you've been hitting is "we have it pinned, ->d_flags says it's
positive but we still observe the value of ->d_inode from before the
store to ->d_flags that has made it look positive".

I'm somewhat tempted to make __d_set_inode_and_type() do smp_store_release()
for setting ->d_flags and __d_entry_type() - smp_load_acquire(); that would
be pretty much free for x86 (as well as sparc, s390 and itanic) and reasonably
cheap on ppc and arm64.  How badly would e.g. SMP arm suffer from the below
(completely untested)?

diff --git a/fs/dcache.c b/fs/dcache.c
index e88cf0554e65..35368316c582 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -319,7 +319,7 @@ static inline void __d_set_inode_and_type(struct dentry 
*dentry,
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
-   WRITE_ONCE(dentry->d_flags, flags);
+   smp_store_release(>d_flags, flags);
 }
 
 static inline void __d_clear_type_and_inode(struct dentry *dentry)
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 10090f11ab95..bee28076a3fd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -386,7 +386,7 @@ static inline bool d_mountpoint(const struct dentry *dentry)
  */
 static inline unsigned __d_entry_type(const struct dentry *dentry)
 {
-   return dentry->d_flags & DCACHE_ENTRY_TYPE;
+   return smp_load_acquire(>d_flags) & DCACHE_ENTRY_TYPE;
 }
 
 static inline bool d_is_miss(const struct dentry *dentry)


Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-22 Thread Ritesh Harjani

I think we have still not taken this patch. Al?


On 10/15/19 9:37 AM, Ritesh Harjani wrote:

ping!!

On 9/27/19 10:12 AM, Ritesh Harjani wrote:

d_is_negative can race with d_instantiate_new()
-> __d_set_inode_and_type().
For e.g. in use cases where Thread-1 is creating
symlink (doing d_instantiate_new()) & Thread-2 is doing
cat of that symlink while doing lookup_fast (via REF-walk-
one such case is, when ->permission returns -ECHILD).

During this race if __d_set_and_inode_type() does out-of-order
execution and set the dentry->d_flags before setting
dentry->inode, then it can result into following kernel panic.

This change fixes the issue by directly checking for inode.

E.g. kernel panic, since inode was NULL.
trailing_symlink() -> may_follow_link() -> inode->i_uid.
Issue signature:-
   [NIP  : trailing_symlink+80]
   [LR   : trailing_symlink+1092]
   #4 [c0198069bb70] trailing_symlink at c04bae60  
(unreliable)

   #5 [c0198069bc00] path_openat at c04bdd14
   #6 [c0198069bc90] do_filp_open at c04c0274
   #7 [c0198069bdb0] do_sys_open at c049b248
   #8 [c0198069be30] system_call at c000b388

Sequence of events:-
Thread-2(Comm: ln)    Thread-1(Comm: cat)

    dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
 flags = READ_ONCE(dentry->d_flags);
 flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
 flags |= type_flags;
 WRITE_ONCE(dentry->d_flags, flags);

    if (unlikely(d_is_negative()) // fails
   {}
    // since d_flags is already updated in
    // Thread-2 in parallel but inode
    // not yet set.
    // d_is_negative returns false

    *inode = d_backing_inode(path->dentry);
    // means inode is still NULL

 dentry->d_inode = inode;

    trailing_symlink()
    may_follow_link()
    inode = nd->link_inode;
    // nd->link_inode = NULL
    //Then it crashes while
    //doing inode->i_uid

Reported-by: Guang Yuan Wu 
Tested-by: Guang Yuan Wu 
Acked-by: Jeff Layton 
Signed-off-by: Ritesh Harjani 
---
  fs/namei.c | 16 +++-
  1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..7c5337cddebd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
  dput(dentry);
  return status;
  }
-    if (unlikely(d_is_negative(dentry))) {
+
+    /*
+ * Caution: d_is_negative() can race with
+ * __d_set_inode_and_type().
+ * For e.g. in use cases where Thread-1 is creating
+ * symlink (doing d_instantiate_new()) & Thread-2 is doing
+ * cat of that symlink and falling here (via Ref-walk) while
+ * doing lookup_fast (one such case is when ->permission
+ * returns -ECHILD).
+ * Now if __d_set_inode_and_type() does out-of-order execution
+ * i.e. it first sets the dentry->d_flags & then dentry->inode
+ * then it can result into inode being NULL, causing panic later.
+ * Hence directly check if inode is NULL here.
+ */
+    if (unlikely(d_really_is_negative(dentry))) {
  dput(dentry);
  return -ENOENT;
  }







Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-10-14 Thread Ritesh Harjani

ping!!

On 9/27/19 10:12 AM, Ritesh Harjani wrote:

d_is_negative can race with d_instantiate_new()
-> __d_set_inode_and_type().
For e.g. in use cases where Thread-1 is creating
symlink (doing d_instantiate_new()) & Thread-2 is doing
cat of that symlink while doing lookup_fast (via REF-walk-
one such case is, when ->permission returns -ECHILD).

During this race if __d_set_and_inode_type() does out-of-order
execution and set the dentry->d_flags before setting
dentry->inode, then it can result into following kernel panic.

This change fixes the issue by directly checking for inode.

E.g. kernel panic, since inode was NULL.
trailing_symlink() -> may_follow_link() -> inode->i_uid.
Issue signature:-
   [NIP  : trailing_symlink+80]
   [LR   : trailing_symlink+1092]
   #4 [c0198069bb70] trailing_symlink at c04bae60  (unreliable)
   #5 [c0198069bc00] path_openat at c04bdd14
   #6 [c0198069bc90] do_filp_open at c04c0274
   #7 [c0198069bdb0] do_sys_open at c049b248
   #8 [c0198069be30] system_call at c000b388

Sequence of events:-
Thread-2(Comm: ln) Thread-1(Comm: cat)

dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
 flags = READ_ONCE(dentry->d_flags);
 flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
 flags |= type_flags;
 WRITE_ONCE(dentry->d_flags, flags);

if (unlikely(d_is_negative()) // fails
   {}
// since d_flags is already updated in
// Thread-2 in parallel but inode
// not yet set.
// d_is_negative returns false

*inode = d_backing_inode(path->dentry);
// means inode is still NULL

 dentry->d_inode = inode;

trailing_symlink()
may_follow_link()
inode = nd->link_inode;
// nd->link_inode = NULL
//Then it crashes while
//doing inode->i_uid

Reported-by: Guang Yuan Wu 
Tested-by: Guang Yuan Wu 
Acked-by: Jeff Layton 
Signed-off-by: Ritesh Harjani 
---
  fs/namei.c | 16 +++-
  1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..7c5337cddebd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
dput(dentry);
return status;
}
-   if (unlikely(d_is_negative(dentry))) {
+
+   /*
+* Caution: d_is_negative() can race with
+* __d_set_inode_and_type().
+* For e.g. in use cases where Thread-1 is creating
+* symlink (doing d_instantiate_new()) & Thread-2 is doing
+* cat of that symlink and falling here (via Ref-walk) while
+* doing lookup_fast (one such case is when ->permission
+* returns -ECHILD).
+* Now if __d_set_inode_and_type() does out-of-order execution
+* i.e. it first sets the dentry->d_flags & then dentry->inode
+* then it can result into inode being NULL, causing panic later.
+* Hence directly check if inode is NULL here.
+*/
+   if (unlikely(d_really_is_negative(dentry))) {
dput(dentry);
return -ENOENT;
}





[PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

2019-09-26 Thread Ritesh Harjani
d_is_negative can race with d_instantiate_new()
-> __d_set_inode_and_type().
For e.g. in use cases where Thread-1 is creating
symlink (doing d_instantiate_new()) & Thread-2 is doing
cat of that symlink while doing lookup_fast (via REF-walk-
one such case is, when ->permission returns -ECHILD).

During this race if __d_set_and_inode_type() does out-of-order
execution and set the dentry->d_flags before setting
dentry->inode, then it can result into following kernel panic.

This change fixes the issue by directly checking for inode.

E.g. kernel panic, since inode was NULL.
trailing_symlink() -> may_follow_link() -> inode->i_uid.
Issue signature:-
  [NIP  : trailing_symlink+80]
  [LR   : trailing_symlink+1092]
  #4 [c0198069bb70] trailing_symlink at c04bae60  (unreliable)
  #5 [c0198069bc00] path_openat at c04bdd14
  #6 [c0198069bc90] do_filp_open at c04c0274
  #7 [c0198069bdb0] do_sys_open at c049b248
  #8 [c0198069be30] system_call at c000b388

Sequence of events:-
Thread-2(Comm: ln) Thread-1(Comm: cat)

dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
WRITE_ONCE(dentry->d_flags, flags);

if (unlikely(d_is_negative()) // fails
   {}
// since d_flags is already updated in
// Thread-2 in parallel but inode
// not yet set.
// d_is_negative returns false

*inode = d_backing_inode(path->dentry);
// means inode is still NULL

dentry->d_inode = inode;

trailing_symlink()
may_follow_link()
inode = nd->link_inode;
// nd->link_inode = NULL
//Then it crashes while
//doing inode->i_uid

Reported-by: Guang Yuan Wu 
Tested-by: Guang Yuan Wu 
Acked-by: Jeff Layton 
Signed-off-by: Ritesh Harjani 
---
 fs/namei.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..7c5337cddebd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
dput(dentry);
return status;
}
-   if (unlikely(d_is_negative(dentry))) {
+
+   /*
+* Caution: d_is_negative() can race with
+* __d_set_inode_and_type().
+* For e.g. in use cases where Thread-1 is creating
+* symlink (doing d_instantiate_new()) & Thread-2 is doing
+* cat of that symlink and falling here (via Ref-walk) while
+* doing lookup_fast (one such case is when ->permission
+* returns -ECHILD).
+* Now if __d_set_inode_and_type() does out-of-order execution
+* i.e. it first sets the dentry->d_flags & then dentry->inode
+* then it can result into inode being NULL, causing panic later.
+* Hence directly check if inode is NULL here.
+*/
+   if (unlikely(d_really_is_negative(dentry))) {
dput(dentry);
return -ENOENT;
}
-- 
2.21.0