Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())

2018-03-02 Thread Sebastian Andrzej Siewior
On 2018-02-25 07:40:05 [+], Al Viro wrote:
> FWIW, a variant of that series is in #work.dcache; it's
> almost certainly not final (I want to clean things up
> and probably reorder them as well) and it needs one hell of
> a lot of profiling and review.  It seems to work, and I don't see any
> obvious performance regressions (everything seems to be within
> normal noise), but I'd like it beaten up a lot more.

If you have something in mind, I could fire up something.

> I'll play with cleaning up and reordering tomorrow after I get
> some sleep.  In the meanwhile, the current state of that set is at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache

I've put it in -RT, let it do some upgrades and other things and nothing
bad happened so far.

Sebastian


Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())

2018-02-24 Thread Al Viro
On Sat, Feb 24, 2018 at 12:22:48AM +, Al Viro wrote:
> On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote:
> 
> > This is too subtle, and your fix to check d_lockref.count < 0 sounds
> > wrong to me. If it's really gone, maybe it has been reused and the
> > refcount is positive again, but it's something else than a dentry
> > entirely?
> > 
> > Hmm.
> > 
> No, you extended the rcu read section, so I guess your patch is fine.
> And lock_parent already has that pattern, so it's not new.
> 
> Ok, I agree, looks like lock_parent should just re-check that thing
> that it already checked earlier, but that now might be true again
> because we dropped d_lock.
> 
> IMO that's the right thing for backports; whether we keep it after
> the getting rid of trylock loops is a different question.  Note that
> the only case where we do not have __dentry_kill() prevention
> guaranteed by the caller (either by holding a reference, or by
> holding onto ->i_lock all along) is in shrink_dentry_list().
> And there we have more than enough of other subtle crap.
> 
> Moreover, there we have a good reason to treat "it had been moved"
> as "kick it off the shrink list and free if it's already dead",
> which might simplify the things.  Below is a stab at that:

FWIW, a variant of that series is in #work.dcache; it's
almost certainly not final (I want to clean things up
and probably reorder them as well) and it needs one hell of
a lot of profiling and review.  It seems to work, and I don't see any
obvious performance regressions (everything seems to be within
normal noise), but I'd like it beaten up a lot more.

Parts are from John's series, parts are rewritten as discussed
upthread.  As a result, trylock loops are gone and no new
retry loops have been added in their place - lock_parent() still
has one, but that's it.

I'll play with cleaning up and reordering tomorrow after I get
some sleep.  In the meanwhile, the current state of that set is at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache


Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())

2018-02-23 Thread Al Viro
On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote:

> This is too subtle, and your fix to check d_lockref.count < 0 sounds
> wrong to me. If it's really gone, maybe it has been reused and the
> refcount is positive again, but it's something else than a dentry
> entirely?
> 
> Hmm.
> 
> No, you extended the rcu read section, so I guess your patch is fine.
> And lock_parent already has that pattern, so it's not new.
> 
> Ok, I agree, looks like lock_parent should just re-check that thing
> that it already checked earlier, but that now might be true again
> because we dropped d_lock.

IMO that's the right thing for backports; whether we keep it after
the getting rid of trylock loops is a different question.  Note that
the only case where we do not have __dentry_kill() prevention
guaranteed by the caller (either by holding a reference, or by
holding onto ->i_lock all along) is in shrink_dentry_list().
And there we have more than enough of other subtle crap.

Moreover, there we have a good reason to treat "it had been moved"
as "kick it off the shrink list and free if it's already dead",
which might simplify the things.  Below is a stab at that:

/*
 * ONLY for shrink_dentry_list() - it returns false if it finds
 * dentry grabbed, moved or killed, which is fine there but not
 * anywhere else.  OTOH, nobody else needs to deal with dentries
 * getting killed under them.
 */
static bool shrink_lock_for_kill(struct dentry *dentry)
{
        struct inode *inode;
        struct dentry *parent;

        if (dentry->d_lockref.count)
                return false;

        inode = dentry->d_inode;
        if (inode && unlikely(!spin_trylock(&inode->i_lock))) {
                rcu_read_lock();        /* to protect inode */
                spin_unlock(&dentry->d_lock);
                spin_lock(&inode->i_lock);
                spin_lock(&dentry->d_lock);
                if (unlikely(dentry->d_lockref.count))
                        goto out;
                /* changed inode means that somebody had grabbed it */
                if (unlikely(inode != dentry->d_inode))
                        goto out;
                rcu_read_unlock();
        }

        parent = dentry->d_parent;
        if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock)))
                return true;

        rcu_read_lock();        /* to protect parent */
        spin_unlock(&dentry->d_lock);
        parent = READ_ONCE(dentry->d_parent);
        spin_lock(&parent->d_lock);
        if (unlikely(parent != dentry->d_parent)) {
                spin_unlock(&parent->d_lock);
                spin_lock(&dentry->d_lock);
                goto out;
        }
        spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
        if (likely(!dentry->d_lockref.count)) {
                rcu_read_unlock();
                return true;
        }
        spin_unlock(&parent->d_lock);
out:
        if (inode)
                spin_unlock(&inode->i_lock);
        rcu_read_unlock();
        return false;
}

static void shrink_dentry_list(struct list_head *list)
{
        struct dentry *dentry, *parent;

        while (!list_empty(list)) {
                struct inode *inode;
                dentry = list_entry(list->prev, struct dentry, d_lru);
                spin_lock(&dentry->d_lock);
                if (!shrink_lock_for_kill(dentry)) {
                        bool can_free = false;
                        d_shrink_del(dentry);
                        if (dentry->d_lockref.count < 0)
                                can_free = dentry->d_flags & DCACHE_MAY_FREE;
                        spin_unlock(&dentry->d_lock);
                        if (can_free)
                                dentry_free(dentry);
                        continue;
                }
                d_shrink_del(dentry);
                parent = dentry->d_parent;
                __dentry_kill(dentry);
                if (dentry == parent)
                        continue;
                dentry = parent;

                /* same as now */
        }
}


Re: [BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())

2018-02-23 Thread Linus Torvalds
On Fri, Feb 23, 2018 at 12:13 PM, Al Viro  wrote:
> Look:
> dentry placed on a shrink list
> we pick the fucker from the list and lock it.
> we call lock_parent() on it.
> dentry is not a root and it's not deleted, so we proceed.
> trylock fails.
> we grab rcu_read_lock()
> we drop dentry->d_lock

[ deleted the bad things ]

Should we just instead get the ref to the dentry before dropping the
lock, so that nobody else can get to dentry_kill?

This is too subtle, and your fix to check d_lockref.count < 0 sounds
wrong to me. If it's really gone, maybe it has been reused and the
refcount is positive again, but it's something else than a dentry
entirely?

Hmm.

No, you extended the rcu read section, so I guess your patch is fine.
And lock_parent already has that pattern, so it's not new.

Ok, I agree, looks like lock_parent should just re-check that thing
that it already checked earlier, but that now might be true again
because we dropped d_lock.

  Linus


[BUG] lock_parent() breakage when used from shrink_dentry_list() (was Re: [PATCH v2 6/6] fs/dcache: Avoid remaining try_lock loop in shrink_dentry_list())

2018-02-23 Thread Al Viro
On Fri, Feb 23, 2018 at 05:42:16PM +, Al Viro wrote:

>   4) the nasty one - shrink_dentry_list() evictions of zero-count dentries.
> _That_ calls for careful use of RCU, etc. - none of the others need that.  Need
> to think how to deal with that sucker; in any case, I do not believe that sharing
> said RCU use, etc. with any other cases would do anything other than obfuscating
> the rest.

Arrrgh...  Actually, there's a nasty corner case where the variant in mainline is
broken.  Look:
dentry placed on a shrink list
we pick the fucker from the list and lock it.
we call lock_parent() on it.
dentry is not a root and it's not deleted, so we proceed.
trylock fails.
we grab rcu_read_lock()
we drop dentry->d_lock
on another CPU, something does e.g. d_prune_aliases() (or finds the
sucker in hash and proceeds to unhash and dput(), etc.) - anything
that evicts that dentry.  It is marked with DCACHE_MAY_FREE and left
alone.  The parent, OTOH, is dropped and freeing gets scheduled as
soon as RCU allows.
we grab parent->d_lock
we verify that dentry->d_parent is still the same (it is)
we do rcu_read_unlock()
we grab dentry->d_lock
we return parent
At that point we are fucked - there's nothing to prevent parent from being
freed at any point.  And we assume that its ->d_lock is held and needs to
be dropped.

The call site in d_prune_aliases() avoids the same scenario, since there we
are already holding ->i_lock and another thread won't get to __dentry_kill()
until we are done with lock_parent().

Unless I'm missing something, that's a (narrow) memory corruptor.  The window is
narrow, but not impossibly so - if that other thread had been spinning on an attempt
to grab dentry->d_lock in d_prune_aliases(), it has to squeeze through
        if (!dentry->d_lockref.count) {
and then in lock_parent() called there - through
        if (IS_ROOT(dentry))
                return NULL;
        if (unlikely(dentry->d_lockref.count < 0))
                return NULL;
        if (likely(spin_trylock(&parent->d_lock)))
before the first CPU gets through
        parent = READ_ONCE(dentry->d_parent);
        spin_lock(&parent->d_lock);

The first CPU can't be preempted, but there's nothing to prevent an IRQ arriving
at that point, letting the second one win the race.

Comments?

I think the (untested) patch below is -stable fodder:

lock_parent() needs to recheck if dentry got __dentry_kill'ed under it

In case when dentry passed to lock_parent() is protected from freeing only
by the fact that it's on a shrink list and trylock of parent fails, we
could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
between unlocking dentry and locking presumed parent.  We need to recheck
that dentry is alive once we lock both it and parent *and* postpone
rcu_read_unlock() until after that point.  Otherwise we could return
a pointer to struct dentry that already is rcu-scheduled for freeing, with
->d_lock held on it; caller's subsequent attempt to unlock it can end
up with memory corruption.

Signed-off-by: Al Viro 
---
diff --git a/fs/dcache.c b/fs/dcache.c
index 7c38f39958bc..32aaab21e648 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -647,11 +647,16 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
                 spin_unlock(&parent->d_lock);
                 goto again;
         }
-        rcu_read_unlock();
-        if (parent != dentry)
+        if (parent != dentry) {
                 spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
-        else
+                if (unlikely(dentry->d_lockref.count < 0)) {
+                        spin_unlock(&parent->d_lock);
+                        parent = NULL;
+                }
+        } else {
                 parent = NULL;
+        }
+        rcu_read_unlock();
         return parent;
 }
 

