Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

2024-01-02 Thread Boqun Feng
On Sat, Dec 30, 2023 at 03:49:52PM +, David Laight wrote:
[...]
> I don't completely understand the 'acquire'/'release' semantics (they didn't
> exist when I started doing SMP kernel code in the late 1980s).
> But it looks odd that osq_unlock()'s fast path uses _release but the very
> similar code in osq_wait_next() uses _acquire.
> 

The _release in osq_unlock() is needed since unlocks need to be
RELEASE so that lock+unlock forms a critical section (i.e. no memory
accesses can escape). When osq_wait_next() is used in non-unlock cases,
the RELEASE is not required. As for the case where osq_wait_next() is
used in osq_unlock(), there is an xchg() preceding it, which provides a
full barrier, so things are fine.
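
For reference, the sort of guarantee the RELEASE provides can be seen in
the minimal message-passing litmus sketch below -- here x stands for an
access made inside the critical section and y for the field written by
the unlock; this is only an illustration, not the actual osq code:

	C osq-unlock-release-sketch

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);		/* store inside the critical section */
		smp_store_release(y, 1);	/* the _release on the unlock side */
	}

	P1(int *x, int *y)
	{
		int r0;
		int r1;

		r0 = smp_load_acquire(y);	/* the next owner observes the unlock */
		r1 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 1:r1=0) (* forbidden: the CS store must not escape *)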

/me wonders whether we can relax the _acquire in osq_wait_next() into
a _relaxed.

> Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> could just be:
>   next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
>   if (next)
>   next->locked = 1;
> 

If so, we would need to provide some sort of RELEASE semantics for
osq_unlock() in all the cases.

Regards,
Boqun

> I don't think the order of the tests for lock->tail and node->next
> matter in osq_wait_next().
> If they were swapped the 'Second most likely case' code from osq_unlock()
> could be removed.
> (The 'uncontended case' doesn't need to load the address of 'node'.)
> 
>   David
>   
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)



Re: [PATCH 04/10] rcu/nocb: Remove needless full barrier after callback advancing

2023-09-09 Thread Boqun Feng
On Sat, Sep 09, 2023 at 04:31:25AM +, Joel Fernandes wrote:
> On Fri, Sep 08, 2023 at 10:35:57PM +0200, Frederic Weisbecker wrote:
> > A full barrier is issued from nocb_gp_wait() upon callbacks advancing
> > to order grace period completion with callbacks execution.
> > 
> > However these two events are already ordered by the
> > smp_mb__after_unlock_lock() barrier within the call to
> > raw_spin_lock_rcu_node() that is necessary for callbacks advancing to
> > happen.
> > 
> > The following litmus test shows the kind of guarantee that this barrier
> > provides:
> > 
> > C smp_mb__after_unlock_lock
> > 
> > {}
> > 
> > // rcu_gp_cleanup()
> > P0(spinlock_t *rnp_lock, int *gpnum)
> > {
> > // Grace period cleanup increase gp sequence number
> > spin_lock(rnp_lock);
> > WRITE_ONCE(*gpnum, 1);
> > spin_unlock(rnp_lock);
> > }
> > 
> > // nocb_gp_wait()
> > P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int 
> > *cb_ready)
> > {
> > int r1;
> > 
> > // Call rcu_advance_cbs() from nocb_gp_wait()
> > spin_lock(nocb_lock);
> > spin_lock(rnp_lock);
> > smp_mb__after_unlock_lock();
> > r1 = READ_ONCE(*gpnum);
> > WRITE_ONCE(*cb_ready, 1);
> > spin_unlock(rnp_lock);
> > spin_unlock(nocb_lock);
> > }
> > 
> > // nocb_cb_wait()
> > P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
> > {
> > int r2;
> > 
> > // rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
> > spin_lock(nocb_lock);
> > r2 = READ_ONCE(*cb_ready);
> > spin_unlock(nocb_lock);
> > 
> > // Actual callback execution
> > WRITE_ONCE(*cb_executed, 1);
> 
> So related to this something in the docs caught my attention under "Callback
> Invocation" [1]
> 
> 
> However, if the callback function communicates to other CPUs, for example,
> doing a wakeup, then it is that function's responsibility to maintain
> ordering. For example, if the callback function wakes up a task that runs on
> some other CPU, proper ordering must in place in both the callback function
> and the task being awakened. To see why this is important, consider the top
> half of the grace-period cleanup diagram. The callback might be running on a
> CPU corresponding to the leftmost leaf rcu_node structure, and awaken a task
> that is to run on a CPU corresponding to the rightmost leaf rcu_node
> structure, and the grace-period kernel thread might not yet have reached the
> rightmost leaf. In this case, the grace period's memory ordering might not
> yet have reached that CPU, so again the callback function and the awakened
> task must supply proper ordering.
> 
> 
> I believe this text is for non-nocb but if we apply that to the nocb case,
> lets see what happens.
> 
> In the litmus, the rcu_advance_cbs() happened on P1, however the callback is
> executing on P2. That sounds very similar to the non-nocb world described in
> the text where a callback tries to wake something up on a different CPU and
> needs to take care of all the ordering.
> 
> So unless I'm missing something (quite possible), P2 must see the update to
> gpnum as well. However, per your litmus test, the only thing P2 does is
> acquire the nocb_lock. I don't see how it is guaranteed to see gpnum == 1.

Because P1 writes cb_ready under nocb_lock, and P2 reads cb_ready under
nocb_lock as well; so if P2 reads P1's write, then we know the serialized
order of locking is P1 first (i.e. the spin_lock(nocb_lock) on P2 reads
from the spin_unlock(nocb_lock) on P1). In other words:

(fact #1)

unlock(nocb_lock) // on P1
->rfe
lock(nocb_lock) // on P2

so if P1 reads P0's write on gpnum

(assumption #1)

W(gpnum)=1 // on P0
->rfe
R(gpnum)=1 // on P1

and we have

(fact #2)

R(gpnum)=1 // on P1
->(po; [UL])
unlock(nocb_lock) // on P1

combining them, you get

W(gpnum)=1 // on P0
->rfe   // assumption #1
R(gpnum)=1 // on P1
->(po; [UL])    // fact #2
unlock(nocb_lock) // on P1
->rfe   // fact #1
lock(nocb_lock) // on P2
->([LKR]; po)
M // any access on P2 after spin_lock(nocb_lock);

so
W(gpnum)=1 // on P0
->rfe ->po-unlock-lock-po
M // on P2

and po-unlock-lock-po is A-cumul, hence "->rfe ->po-unlock-lock-po" or
"rfe; po-unlock-lock-po" is a cumul-fence, hence it's a ->prop, which
means the write of gpnum on P0 propagates to P2 before any memory
accesses after spin_lock(nocb_lock)?
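
To make that concrete, here is a minimal sketch of how one could check
it with herd -- P0 and P1 are unchanged from the litmus test quoted
above; the extra read of gpnum (r3) and the exists clause are mine, not
part of the patch:

	// nocb_cb_wait(), with an extra read of gpnum
	P2(spinlock_t *nocb_lock, int *gpnum, int *cb_ready, int *cb_executed)
	{
		int r2;
		int r3;

		spin_lock(nocb_lock);
		r2 = READ_ONCE(*cb_ready);
		r3 = READ_ONCE(*gpnum);
		spin_unlock(nocb_lock);

		// Actual callback execution
		WRITE_ONCE(*cb_executed, 1);
	}

	exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0) (* should never be satisfied *)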

Of course, I haven't looked into the bigger picture here (whether the
barrier is for something else, etc.) ;-)

Regards,
Boqun

> I am curious what happens in your litmus if you read gpnum in a register and
> checked for it.
> 
> So maybe the memory barriers you are deleting need to be kept in place? Idk.
> 
> thanks,
> 
>  - Joel
> 
> 

Re: [PATCH 00/13] [RFC] Rust support

2021-04-15 Thread Boqun Feng
[Copy LKMM people, Josh, Nick and Wedson]

On Thu, Apr 15, 2021 at 08:58:16PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 14, 2021 at 08:45:51PM +0200, oj...@kernel.org wrote:
> 
> > Rust is a systems programming language that brings several key
> > advantages over C in the context of the Linux kernel:
> > 
> >   - No undefined behavior in the safe subset (when unsafe code is
> > sound), including memory safety and the absence of data races.
> 
> And yet I see not a single mention of the Rust Memory Model and how it
> aligns (or not) with the LKMM. The C11 memory model for example is a
> really poor fit for LKMM.
> 

I think Rust currently uses the C11 memory model, as per:

https://doc.rust-lang.org/nomicon/atomics.html

Another reason they picked the C11 memory model, I guess, is that LLVM
supports it by default.

But I think the Rust community still wants to have a good memory model,
and they are open to any kind of suggestion and input. I think we (LKMM
people) should really get involved, because the recent discussion on
RISC-V's atomics shows that if we don't, people might get a "broken"
design because they think the C11 memory model is good enough:

https://lore.kernel.org/lkml/YGyZPCxJYGOvqYZQ@boqun-archlinux/

And the benefits are mutual: a) the Linux Kernel Memory Model (LKMM) is
defined by combining the requirements of developers and the behavior of
hardware; it's practical and can be a very good input for the memory
model design in Rust; b) once Rust has a better memory model, whatever
compiler technologies Rust compilers use to support that memory model
can be adopted by C compilers, and we get that part for free.

At least I personally am very interested in helping Rust towards a
complete and practical memory model ;-)

Josh, I think it would be good if we can connect to the people working
on the Rust memory model; I think the right person is Ralf Jung and the
right place is https://github.com/rust-lang/unsafe-code-guidelines, but
you certainly know better than me ;-) Or maybe we can use the
Rust-for-Linux or linux-toolchains list to discuss.

[...]
> >   - Boqun Feng is working hard on the different options for
> > threading abstractions and has reviewed most of the `sync` PRs.
> 
> Boqun, I know you're familiar with LKMM, can you please talk about how
> Rust does things and how it interacts?

As Wedson said in the other email, currently there is no code requiring
synchronization between the C side and the Rust side, so we are fine for
now. But in the longer term, we need to teach the Rust memory model
about the "design patterns" used in the Linux kernel for parallel
programming.

What I have been doing so far is reviewing the patches that involve
memory orderings in the Rust-for-Linux project, trying to make sure we
don't introduce memory ordering bugs from the beginning.

Regards,
Boqun


Re: [PATCH v4 3/4] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-04-06 Thread Boqun Feng
On Wed, Mar 31, 2021 at 11:22:35PM +0800, Guo Ren wrote:
> On Mon, Mar 29, 2021 at 8:50 PM Peter Zijlstra  wrote:
> >
> > On Mon, Mar 29, 2021 at 08:01:41PM +0800, Guo Ren wrote:
> > > u32 a = 0x55aa66bb;
> > > u16 *ptr = 
> > >
> > > CPU0                     CPU1
> > > =========                =========
> > > xchg16(ptr, new)         while(1)
> > >                              WRITE_ONCE(*(ptr + 1), x);
> > >
> > > When we use lr.w/sc.w implement xchg16, it'll cause CPU0 deadlock.
> >
> > Then I think your LL/SC is broken.
> No, it's not broken LR.W/SC.W. Quote <8.3 Eventual Success of
> Store-Conditional Instructions>:
> 
> "As a consequence of the eventuality guarantee, if some harts in an
> execution environment are
> executing constrained LR/SC loops, and no other harts or devices in
> the execution environment
> execute an unconditional store or AMO to that reservation set, then at
> least one hart will
> eventually exit its constrained LR/SC loop. By contrast, if other
> harts or devices continue to
> write to that reservation set, it is not guaranteed that any hart will
> exit its LR/SC loop."
> 
> So I think it's a feature of LR/SC. How does the above code (also use
> ll.w/sc.w to implement xchg16) running on arm64?
> 
> 1: ldxr
> eor
> cbnz ... 2f
> stxr
> cbnz ... 1b   // I think it would deadlock for arm64.
> 
> "LL/SC fwd progress" which you have mentioned could guarantee stxr
> success? How hardware could do that?
> 

Actually, the "old" RISC-V standard did provide fwd progress ;-) In

https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf

Section "7.2 Load-Reserved/Store-Conditional Instructions":

"""
One advantage of CAS is that it guarantees that some hart eventually
makes progress, whereas an LR/SC atomic sequence could livelock
indefinitely on some systems. To avoid this concern, we added an
architectural guarantee of forward progress to LR/SC atomic sequences.
The restrictions on LR/SC sequence contents allows an implementation to
**capture a cache line on the LR and complete the LR/SC sequence by
holding off remote cache interventions for a bounded short time**.
"""

The guarantee is removed later due to "Earlier versions of this
specification imposed a stronger starvation-freedom guarantee. However,
the weaker livelock-freedom guarantee is sufficient to implement the C11
and C++11 languages, and is substantially easier to provide in some
microarchitectural styles."

But I take it as an example that hardware can guarantee this.

Regards,
Boqun

> >
> > That also means you really don't want to build super complex locking
> > primitives on top, because that live-lock will percolate through.
> >
> > Step 1 would be to get your architecture fixed such that it can provide
> > fwd progress guarantees for LL/SC. Otherwise there's absolutely no point
> > in building complex systems with it.
> --
> Best Regards
>  Guo Ren
> 
> ML: https://lore.kernel.org/linux-csky/


Re: [PATCH v6 1/9] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-04-06 Thread Boqun Feng
Hi,

On Wed, Mar 31, 2021 at 02:30:32PM +, guo...@kernel.org wrote:
> From: Guo Ren 
> 
> Some architectures don't have sub-word swap atomic instruction,
> they only have the full word's one.
> 
> The sub-word swap only improve the performance when:
> NR_CPUS < 16K
>  *  0- 7: locked byte
>  * 8: pending
>  *  9-15: not used
>  * 16-17: tail index
>  * 18-31: tail cpu (+1)
> 
> The 9-15 bits are wasted to use xchg16 in xchg_tail.
> 
> Please let architecture select xchg16/xchg32 to implement
> xchg_tail.
> 

If the architecture doesn't have a sub-word swap atomic, won't it
generate the same/similar code no matter which version of xchg_tail()
is used? That is, even with CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32=y,
xchg_tail() acts similarly to an xchg16() implemented with cmpxchg(),
which means we still don't have a forward progress guarantee. So this
configuration doesn't solve the problem.
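
For reference, roughly what the generic (non-xchg16) xchg_tail() boils
down to is a cmpxchg() loop like the sketch below (trimmed from the
kernel's qspinlock code); the loop can keep failing for as long as other
CPUs keep changing lock->val, which is exactly the missing forward
progress:

	static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
	{
		u32 old, new, val = atomic_read(&lock->val);

		for (;;) {
			new = (val & _Q_LOCKED_PENDING_MASK) | tail;
			/*
			 * This can fail (and retry) indefinitely if other
			 * CPUs keep updating lock->val in the meantime.
			 */
			old = atomic_cmpxchg_relaxed(&lock->val, val, new);
			if (old == val)
				break;

			val = old;
		}
		return old;
	}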

I think it's OK to introduce this config and not provide xchg16() for
RISC-V. But I don't see the point of converting other architectures to
use it.

Regards,
Boqun

> Signed-off-by: Guo Ren 
> Cc: Peter Zijlstra 
> Cc: Will Deacon 
> Cc: Ingo Molnar 
> Cc: Waiman Long 
> Cc: Arnd Bergmann 
> Cc: Anup Patel 
> ---
>  kernel/Kconfig.locks   |  3 +++
>  kernel/locking/qspinlock.c | 46 +-
>  2 files changed, 28 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 3de8fd11873b..d02f1261f73f 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -239,6 +239,9 @@ config LOCK_SPIN_ON_OWNER
>  config ARCH_USE_QUEUED_SPINLOCKS
>   bool
>  
> +config ARCH_USE_QUEUED_SPINLOCKS_XCHG32
> + bool
> +
>  config QUEUED_SPINLOCKS
>   def_bool y if ARCH_USE_QUEUED_SPINLOCKS
>   depends on SMP
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba53d56..4bfaa969bd15 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -163,26 +163,6 @@ static __always_inline void 
> clear_pending_set_locked(struct qspinlock *lock)
>   WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
>  }
>  
> -/*
> - * xchg_tail - Put in the new queue tail code word & retrieve previous one
> - * @lock : Pointer to queued spinlock structure
> - * @tail : The new queue tail code word
> - * Return: The previous queue tail code word
> - *
> - * xchg(lock, tail), which heads an address dependency
> - *
> - * p,*,* -> n,*,* ; prev = xchg(lock, node)
> - */
> -static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> -{
> - /*
> -  * We can use relaxed semantics since the caller ensures that the
> -  * MCS node is properly initialized before updating the tail.
> -  */
> - return (u32)xchg_relaxed(&lock->tail,
> -  tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
> -}
> -
>  #else /* _Q_PENDING_BITS == 8 */
>  
>  /**
> @@ -206,6 +186,30 @@ static __always_inline void 
> clear_pending_set_locked(struct qspinlock *lock)
>  {
>   atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
>  }
> +#endif /* _Q_PENDING_BITS == 8 */
> +
> +#if _Q_PENDING_BITS == 8 && !defined(CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32)
> +/*
> + * xchg_tail - Put in the new queue tail code word & retrieve previous one
> + * @lock : Pointer to queued spinlock structure
> + * @tail : The new queue tail code word
> + * Return: The previous queue tail code word
> + *
> + * xchg(lock, tail), which heads an address dependency
> + *
> + * p,*,* -> n,*,* ; prev = xchg(lock, node)
> + */
> +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> +{
> + /*
> +  * We can use relaxed semantics since the caller ensures that the
> +  * MCS node is properly initialized before updating the tail.
> +  */
> + return (u32)xchg_relaxed(&lock->tail,
> +  tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
> +}
> +
> +#else
>  
>  /**
>   * xchg_tail - Put in the new queue tail code word & retrieve previous one
> @@ -236,7 +240,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
> *lock, u32 tail)
>   }
>   return old;
>  }
> -#endif /* _Q_PENDING_BITS == 8 */
> +#endif
>  
>  /**
>   * queued_fetch_set_pending_acquire - fetch the whole lock value and set 
> pending
> -- 
> 2.17.1
> 


Re: [syzbot] WARNING: suspicious RCU usage in get_timespec64

2021-04-05 Thread Boqun Feng
On Mon, Apr 05, 2021 at 04:38:07PM -0700, Paul E. McKenney wrote:
> On Tue, Apr 06, 2021 at 07:25:44AM +0800, Boqun Feng wrote:
> > On Mon, Apr 05, 2021 at 10:27:52AM -0700, Paul E. McKenney wrote:
> > > On Mon, Apr 05, 2021 at 01:23:30PM +0800, Boqun Feng wrote:
> > > > On Sun, Apr 04, 2021 at 09:30:38PM -0700, Paul E. McKenney wrote:
> > > > > On Sun, Apr 04, 2021 at 09:01:25PM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Apr 05, 2021 at 04:08:55AM +0100, Matthew Wilcox wrote:
> > > > > > > On Sun, Apr 04, 2021 at 02:40:30PM -0700, Paul E. McKenney wrote:
> > > > > > > > On Sun, Apr 04, 2021 at 10:38:41PM +0200, Thomas Gleixner wrote:
> > > > > > > > > On Sun, Apr 04 2021 at 12:05, syzbot wrote:
> > > > > > > > > 
> > > > > > > > > Cc + ...
> > > > > > > > 
> > > > > > > > And a couple more...
> > > > > > > > 
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > syzbot found the following issue on:
> > > > > > > > > >
> > > > > > > > > > HEAD commit:5e46d1b7 reiserfs: update 
> > > > > > > > > > reiserfs_xattrs_initialized() co..
> > > > > > > > > > git tree:   upstream
> > > > > > > > > > console output: 
> > > > > > > > > > https://syzkaller.appspot.com/x/log.txt?x=1125f831d0
> > > > > > > > > > kernel config:  
> > > > > > > > > > https://syzkaller.appspot.com/x/.config?x=78ef1d159159890
> > > > > > > > > > dashboard link: 
> > > > > > > > > > https://syzkaller.appspot.com/bug?extid=88e4f02896967fe1ab0d
> > > > > > > > > >
> > > > > > > > > > Unfortunately, I don't have any reproducer for this issue 
> > > > > > > > > > yet.
> > > > > > > > > >
> > > > > > > > > > IMPORTANT: if you fix the issue, please add the following 
> > > > > > > > > > tag to the commit:
> > > > > > > > > > Reported-by: 
> > > > > > > > > > syzbot+88e4f02896967fe1a...@syzkaller.appspotmail.com
> > > > > > > > > >
> > > > > > > > > > =
> > > > > > > > > > WARNING: suspicious RCU usage
> > > > > > > > > > 5.12.0-rc5-syzkaller #0 Not tainted
> > > > > > > > > > -
> > > > > > > > > > kernel/sched/core.c:8294 Illegal context switch in 
> > > > > > > > > > RCU-sched read-side critical section!
> > > > > > > > > >
> > > > > > > > > > other info that might help us debug this:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > rcu_scheduler_active = 2, debug_locks = 0
> > > > > > > > > > 3 locks held by syz-executor.4/8418:
> > > > > > > > > >  #0: 
> > > > > > > > > > 8880751d2b28
> > > > > > > > > >  (
> > > > > > > > > > >pi_lock
> > > > > > > > > > ){-.-.}-{2:2}
> > > > > > > > > > , at: try_to_wake_up+0x98/0x14a0 kernel/sched/core.c:3345
> > > > > > > > > >  #1: 
> > > > > > > > > > 8880b9d35258
> > > > > > > > > >  (
> > > > > > > > > > >lock
> > > > > > > > > > ){-.-.}-{2:2}
> > > > > > > > > > , at: rq_lock kernel/sched/sched.h:1321 [inline]
> > > > > > > > > > , at: ttwu_queue kernel/sched/core.c:3184 [inline]
> > > > > > > > > > , at: try_to_wake_up+0x5e6/0x14a0 kernel/sched/core.c:3464
> > > > > > > > > >  #2: 8880b9d1f948 (_cpu_ptr(group->pcpu, 
> > > > > > > > > > cpu)->seq){-.-.}-{0:0}, at: psi_task_change+0x142/0x220 
> > > > 

Re: [syzbot] WARNING: suspicious RCU usage in get_timespec64

2021-04-05 Thread Boqun Feng
On Mon, Apr 05, 2021 at 10:27:52AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 05, 2021 at 01:23:30PM +0800, Boqun Feng wrote:
> > On Sun, Apr 04, 2021 at 09:30:38PM -0700, Paul E. McKenney wrote:
> > > On Sun, Apr 04, 2021 at 09:01:25PM -0700, Paul E. McKenney wrote:
> > > > On Mon, Apr 05, 2021 at 04:08:55AM +0100, Matthew Wilcox wrote:
> > > > > On Sun, Apr 04, 2021 at 02:40:30PM -0700, Paul E. McKenney wrote:
> > > > > > On Sun, Apr 04, 2021 at 10:38:41PM +0200, Thomas Gleixner wrote:
> > > > > > > On Sun, Apr 04 2021 at 12:05, syzbot wrote:
> > > > > > > 
> > > > > > > Cc + ...
> > > > > > 
> > > > > > And a couple more...
> > > > > > 
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > syzbot found the following issue on:
> > > > > > > >
> > > > > > > > HEAD commit:5e46d1b7 reiserfs: update 
> > > > > > > > reiserfs_xattrs_initialized() co..
> > > > > > > > git tree:   upstream
> > > > > > > > console output: 
> > > > > > > > https://syzkaller.appspot.com/x/log.txt?x=1125f831d0
> > > > > > > > kernel config:  
> > > > > > > > https://syzkaller.appspot.com/x/.config?x=78ef1d159159890
> > > > > > > > dashboard link: 
> > > > > > > > https://syzkaller.appspot.com/bug?extid=88e4f02896967fe1ab0d
> > > > > > > >
> > > > > > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > > > > > >
> > > > > > > > IMPORTANT: if you fix the issue, please add the following tag 
> > > > > > > > to the commit:
> > > > > > > > Reported-by: 
> > > > > > > > syzbot+88e4f02896967fe1a...@syzkaller.appspotmail.com
> > > > > > > >
> > > > > > > > =
> > > > > > > > WARNING: suspicious RCU usage
> > > > > > > > 5.12.0-rc5-syzkaller #0 Not tainted
> > > > > > > > -
> > > > > > > > kernel/sched/core.c:8294 Illegal context switch in RCU-sched 
> > > > > > > > read-side critical section!
> > > > > > > >
> > > > > > > > other info that might help us debug this:
> > > > > > > >
> > > > > > > >
> > > > > > > > rcu_scheduler_active = 2, debug_locks = 0
> > > > > > > > 3 locks held by syz-executor.4/8418:
> > > > > > > >  #0: 
> > > > > > > > 8880751d2b28
> > > > > > > >  (
> > > > > > > > >pi_lock
> > > > > > > > ){-.-.}-{2:2}
> > > > > > > > , at: try_to_wake_up+0x98/0x14a0 kernel/sched/core.c:3345
> > > > > > > >  #1: 
> > > > > > > > 8880b9d35258
> > > > > > > >  (
> > > > > > > > >lock
> > > > > > > > ){-.-.}-{2:2}
> > > > > > > > , at: rq_lock kernel/sched/sched.h:1321 [inline]
> > > > > > > > , at: ttwu_queue kernel/sched/core.c:3184 [inline]
> > > > > > > > , at: try_to_wake_up+0x5e6/0x14a0 kernel/sched/core.c:3464
> > > > > > > >  #2: 8880b9d1f948 (_cpu_ptr(group->pcpu, 
> > > > > > > > cpu)->seq){-.-.}-{0:0}, at: psi_task_change+0x142/0x220 
> > > > > > > > kernel/sched/psi.c:807
> > > > > > 
> > > > > > This looks similar to 
> > > > > > syzbot+dde0cc33951735441...@syzkaller.appspotmail.com
> > > > > > in that rcu_sleep_check() sees an RCU lock held, but the later call 
> > > > > > to
> > > > > > lockdep_print_held_locks() does not.  Did something change recently 
> > > > > > that
> > > > > > could let the ->lockdep_depth counter get out of sync with the 
> > > > > > actual
> > > > > > number of locks held?
> > > > > 
> > > > > Dmitri had a different theory here:
> > > > > 
> > > > > https://gro

Re: [syzbot] WARNING: suspicious RCU usage in get_timespec64

2021-04-04 Thread Boqun Feng
On Sun, Apr 04, 2021 at 09:30:38PM -0700, Paul E. McKenney wrote:
> On Sun, Apr 04, 2021 at 09:01:25PM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 05, 2021 at 04:08:55AM +0100, Matthew Wilcox wrote:
> > > On Sun, Apr 04, 2021 at 02:40:30PM -0700, Paul E. McKenney wrote:
> > > > On Sun, Apr 04, 2021 at 10:38:41PM +0200, Thomas Gleixner wrote:
> > > > > On Sun, Apr 04 2021 at 12:05, syzbot wrote:
> > > > > 
> > > > > Cc + ...
> > > > 
> > > > And a couple more...
> > > > 
> > > > > > Hello,
> > > > > >
> > > > > > syzbot found the following issue on:
> > > > > >
> > > > > > HEAD commit:5e46d1b7 reiserfs: update 
> > > > > > reiserfs_xattrs_initialized() co..
> > > > > > git tree:   upstream
> > > > > > console output: 
> > > > > > https://syzkaller.appspot.com/x/log.txt?x=1125f831d0
> > > > > > kernel config:  
> > > > > > https://syzkaller.appspot.com/x/.config?x=78ef1d159159890
> > > > > > dashboard link: 
> > > > > > https://syzkaller.appspot.com/bug?extid=88e4f02896967fe1ab0d
> > > > > >
> > > > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > > > >
> > > > > > IMPORTANT: if you fix the issue, please add the following tag to 
> > > > > > the commit:
> > > > > > Reported-by: syzbot+88e4f02896967fe1a...@syzkaller.appspotmail.com
> > > > > >
> > > > > > =
> > > > > > WARNING: suspicious RCU usage
> > > > > > 5.12.0-rc5-syzkaller #0 Not tainted
> > > > > > -
> > > > > > kernel/sched/core.c:8294 Illegal context switch in RCU-sched 
> > > > > > read-side critical section!
> > > > > >
> > > > > > other info that might help us debug this:
> > > > > >
> > > > > >
> > > > > > rcu_scheduler_active = 2, debug_locks = 0
> > > > > > 3 locks held by syz-executor.4/8418:
> > > > > >  #0: 
> > > > > > 8880751d2b28
> > > > > >  (
> > > > > > >pi_lock
> > > > > > ){-.-.}-{2:2}
> > > > > > , at: try_to_wake_up+0x98/0x14a0 kernel/sched/core.c:3345
> > > > > >  #1: 
> > > > > > 8880b9d35258
> > > > > >  (
> > > > > > >lock
> > > > > > ){-.-.}-{2:2}
> > > > > > , at: rq_lock kernel/sched/sched.h:1321 [inline]
> > > > > > , at: ttwu_queue kernel/sched/core.c:3184 [inline]
> > > > > > , at: try_to_wake_up+0x5e6/0x14a0 kernel/sched/core.c:3464
> > > > > >  #2: 8880b9d1f948 (_cpu_ptr(group->pcpu, 
> > > > > > cpu)->seq){-.-.}-{0:0}, at: psi_task_change+0x142/0x220 
> > > > > > kernel/sched/psi.c:807
> > > > 
> > > > This looks similar to 
> > > > syzbot+dde0cc33951735441...@syzkaller.appspotmail.com
> > > > in that rcu_sleep_check() sees an RCU lock held, but the later call to
> > > > lockdep_print_held_locks() does not.  Did something change recently that
> > > > could let the ->lockdep_depth counter get out of sync with the actual
> > > > number of locks held?
> > > 
> > > Dmitri had a different theory here:
> > > 
> > > https://groups.google.com/g/syzkaller-bugs/c/FmYvfZCZzqA/m/nc2CXUgsAgAJ
> > 
> > There is always room for more than one bug.  ;-)
> > 
> > He says "one-off false positives".  I was afraid of that...
> 
> And both the examples I have been copied on today are consistent with
> debug_locks getting zeroed (e.g., via a call to __debug_locks_off())
> in the midst of a call to rcu_sleep_check().  But I would expect to see
> a panic or another splat if that were to happen.
> 
> Dmitry's example did have an additional splat, but I would expect the
> RCU-related one to come second.  Again, there is always room for more
> than one bug.
> 
> On the other hand, there are a lot more callers to debug_locks_off()
> than there were last I looked into this.  And both of these splats
> are consistent with an interrupt in the middle of rcu_sleep_check(),
> and that interrupt's handler invoking debug_locks_off(), but without
> printing anything to the console.  Does that sequence of events ring a
> bell for anyone?
> 
> If this is the new normal, I could make RCU_LOCKDEP_WARN() recheck
> debug_lockdep_rcu_enabled() after evaluating the condition, but with
> a memory barrier immediately before the recheck.  But I am not at all
> excited by doing this on speculation.  Especially given that doing
> so might be covering up some other bug.
> 

I just checked the original console log and found:

[  356.696686][ T8418] =
[  356.696692][ T8418] WARNING: suspicious RCU usage
[  356.700193][T14782] 
[  356.704548][ T8418] 5.12.0-rc5-syzkaller #0 Not tainted
[  356.729981][ T8418] -
[  356.732473][T14782] WARNING: iou-sqp-14780/14782 still has locks held!

, so there are two warnings here, one is from lockdep_rcu_suspicious()
and the other is from print_held_locks_bug(). I think this is what
happened:

in RCU_LOCKDEP_WARN():

	if (debug_lockdep_rcu_enabled()	// this is true, and at this time debug_locks = 1

	// lockdep detects a lock bug, sets debug_locks = 0

	    && !__warned	// true
	    && (c))		// 
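
(The archived message is cut off above. A hedged sketch of the
interleaving being described -- my reconstruction, not the exact
RCU_LOCKDEP_WARN() definition of any particular kernel version:)

	static void rcu_lockdep_warn_sketch(bool cond, const char *s)
	{
		static bool warned;

		if (debug_lockdep_rcu_enabled()) {	/* debug_locks is still 1 here */
			/*
			 * Meanwhile another task (T14782's "still has locks
			 * held!" splat) flips debug_locks to 0, but we have
			 * already passed the check above ...
			 */
			if (!warned && cond) {
				/* ... so we still print the "one-off" splat. */
				warned = true;
				lockdep_rcu_suspicious(__FILE__, __LINE__, s);
			}
		}
	}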

Re: [RFC 1/2] arm64: PCI: Allow use arch-specific pci sysdata

2021-03-29 Thread Boqun Feng
Hi Arnd,

On Sat, Mar 20, 2021 at 05:09:10PM +0100, Arnd Bergmann wrote:
> On Sat, Mar 20, 2021 at 1:54 PM Arnd Bergmann  wrote:
> >  I actually still have a (not really tested) patch series to clean up
> > the pci host bridge registration, and this should make this a lot easier
> > to add on top.
> >
> > I should dig that out of my backlog and post for review.
> 
> I've uploaded my series to
> https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git
> pci-probe-rework-20210320
> 
> The purpose of this series is mostly to simplify what variations of
> host probe methods exist, towards using pci_host_probe() as the
> only method. It does provide some simplifications based on that
> that, including a way to universally have access to the pci_host_bridge
> pointer during the probe function.
> 

Thanks for the suggestion and code. I spent some time catching up. Yes,
Bjorn and you are correct, the better way is having a 'domain_nr' in the
'pci_host_bridge' and making sure every driver fills that correctly
before probe. I definitely will use this approach.

However, I may start small: I plan to introduce 'domain_nr' and only
fill the field at probe time for PCI_DOMAINS_GENERIC=y archs, and leave
other archs and drivers alone. (Honestly, I was shocked by the number of
pci_scan_root_bus_bridge() and pci_host_probe() callers that I would
need to adjust if I really wanted to unify the 'domain_nr' handling for
every arch and driver ;-)). This will fulfil my requirement for the
Hyper-V PCI controller on ARM64. And later on, we can switch each arch
to this approach one by one and keep the rest still working.
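
Concretely, the kind of thing I have in mind looks roughly like the
sketch below -- note that the 'domain_nr' field in pci_host_bridge does
not exist yet and the helper name is made up, so this is only an
illustration of the direction, not real code:

	static int hv_pci_probe_sketch(struct device *dev, u32 vmbus_domain)
	{
		struct pci_host_bridge *bridge;

		bridge = devm_pci_alloc_host_bridge(dev, 0);
		if (!bridge)
			return -ENOMEM;

		bridge->domain_nr = vmbus_domain;	/* assumed new field */
		bridge->ops = &hv_pcifront_ops;		/* Hyper-V's pci_ops */

		return pci_host_probe(bridge);
	}
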

Thoughts?

Regards,
Boqun

>  Arnd


[RFC 1/2] arm64: PCI: Allow use arch-specific pci sysdata

2021-03-19 Thread Boqun Feng
Currently, if an architecture selects CONFIG_PCI_DOMAINS_GENERIC, the
->sysdata in bus and bridge will be treated as struct pci_config_window,
which is created by generic ECAM using the data from ACPI.

However, for a virtualized PCI bus, there might not be enough data in
the OF or ACPI tables to create a pci_config_window. This is similar to
the case where CONFIG_PCI_DOMAINS_GENERIC=n, IOW, architectures use
their own structure for sysdata, so no ACPI table lookup is required.

In order to enable Hyper-V's virtual PCI (which doesn't have an ACPI
table entry for PCI) on ARM64 (which selects
CONFIG_PCI_DOMAINS_GENERIC), we introduce an arch-specific pci sysdata
(similar to the one for x86) for ARM64, and allow the core PCI code to
detect the type of sysdata at runtime. The latter is achieved by adding
a pci_ops::use_arch_sysdata field.

Originally-by: Sunil Muthuswamy 
Signed-off-by: Boqun Feng (Microsoft) 
---
 arch/arm64/include/asm/pci.h | 29 +
 arch/arm64/kernel/pci.c  | 15 ---
 include/linux/pci.h  |  3 +++
 3 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
index b33ca260e3c9..dade061a0658 100644
--- a/arch/arm64/include/asm/pci.h
+++ b/arch/arm64/include/asm/pci.h
@@ -22,6 +22,16 @@
 
 extern int isa_dma_bridge_buggy;
 
+struct pci_sysdata {
+   int domain; /* PCI domain */
+   int node;   /* NUMA Node */
+#ifdef CONFIG_ACPI
+   struct acpi_device *companion;  /* ACPI companion device */
+#endif
+#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
+   void *fwnode;   /* IRQ domain for MSI assignment */
+#endif
+};
 #ifdef CONFIG_PCI
 static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
 {
@@ -31,8 +41,27 @@ static inline int pci_get_legacy_ide_irq(struct pci_dev 
*dev, int channel)
 
 static inline int pci_proc_domain(struct pci_bus *bus)
 {
+   if (bus->ops->use_arch_sysdata)
+   return pci_domain_nr(bus);
return 1;
 }
+#ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
+static inline void *_pci_root_bus_fwnode(struct pci_bus *bus)
+{
+   struct pci_sysdata *sd = bus->sysdata;
+
+   if (bus->ops->use_arch_sysdata)
+   return sd->fwnode;
+
+   /*
+* bus->sysdata is not struct pci_sysdata, fwnode should be able to
+* be queried from of/acpi.
+*/
+   return NULL;
+}
+#define pci_root_bus_fwnode	_pci_root_bus_fwnode
+#endif /* CONFIG_PCI_MSI_IRQ_DOMAIN */
+
 #endif  /* CONFIG_PCI */
 
 #endif  /* __ASM_PCI_H */
diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 1006ed2d7c60..63d420d57e63 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -74,15 +74,24 @@ struct acpi_pci_generic_root_info {
 int acpi_pci_bus_find_domain_nr(struct pci_bus *bus)
 {
struct pci_config_window *cfg = bus->sysdata;
-   struct acpi_device *adev = to_acpi_device(cfg->parent);
-   struct acpi_pci_root *root = acpi_driver_data(adev);
+   struct pci_sysdata *sd = bus->sysdata;
+   struct acpi_device *adev;
+   struct acpi_pci_root *root;
+
+   /* struct pci_sysdata has domain nr in it */
+   if (bus->ops->use_arch_sysdata)
+   return sd->domain;
+
+   /* or pci_config_window is used as sysdata */
+   adev = to_acpi_device(cfg->parent);
+   root = acpi_driver_data(adev);
 
return root->segment;
 }
 
 int pcibios_root_bridge_prepare(struct pci_host_bridge *bridge)
 {
-   if (!acpi_disabled) {
+   if (!acpi_disabled && bridge->ops->use_arch_sysdata) {
struct pci_config_window *cfg = bridge->bus->sysdata;
struct acpi_device *adev = to_acpi_device(cfg->parent);
struct device *bus_dev = &bridge->bus->dev;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 86c799c97b77..4036aac40361 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -740,6 +740,9 @@ struct pci_ops {
void __iomem *(*map_bus)(struct pci_bus *bus, unsigned int devfn, int 
where);
int (*read)(struct pci_bus *bus, unsigned int devfn, int where, int 
size, u32 *val);
int (*write)(struct pci_bus *bus, unsigned int devfn, int where, int 
size, u32 val);
+#ifdef CONFIG_PCI_DOMAINS_GENERIC
+   int use_arch_sysdata;   /* ->sysdata is arch-specific */
+#endif
 };
 
 /*
-- 
2.30.2



[RFC 0/2] PCI: Introduce pci_ops::use_arch_sysdata

2021-03-19 Thread Boqun Feng
Hi Bjorn,

I'm currently working on virtual PCI support for Hyper-V ARM64 guests.
Similar to virtual PCI on x86 Hyper-V guests, the PCI root bus is not
probed via ACPI (or OF), it's probed from the Hyper-V VMBus, therefore
it doesn't have a config window.

Since ARM64 is a CONFIG_PCI_DOMAINS_GENERIC=y architecture, the PCI core
code always assumes the root bus has a config window. So we need to
resolve this, and we want to reuse the existing code as much as
possible. My current solution is introducing a pci_ops::use_arch_sysdata
field, and if it's true, the PCI core code treats pci_bus::sysdata as
arch-specific sysdata (rather than a pci_config_window) on
CONFIG_PCI_DOMAINS_GENERIC=y architectures. This allows us to reuse the
existing code for the Hyper-V PCI controller.

This is simply a proposal, I'm open to any suggestion.

Thanks!

Regards,
Boqun


Boqun Feng (2):
  arm64: PCI: Allow use arch-specific pci sysdata
  PCI: hv: Tell PCI core arch-specific sysdata is used

 arch/arm64/include/asm/pci.h| 29 +
 arch/arm64/kernel/pci.c | 15 ---
 drivers/pci/controller/pci-hyperv.c |  3 +++
 include/linux/pci.h |  3 +++
 4 files changed, 47 insertions(+), 3 deletions(-)

-- 
2.30.2



[RFC 2/2] PCI: hv: Tell PCI core arch-specific sysdata is used

2021-03-19 Thread Boqun Feng
Use the newly introduced ->use_arch_sysdata to tell the PCI core that
we still use the arch-specific sysdata way to set up root PCI buses on
CONFIG_PCI_DOMAINS_GENERIC=y architectures; this is preparation for
Hyper-V ARM64 guest virtual PCI support.

Signed-off-by: Boqun Feng (Microsoft) 
---
 drivers/pci/controller/pci-hyperv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c 
b/drivers/pci/controller/pci-hyperv.c
index 27a17a1e4a7c..7cfa18d8a26e 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -859,6 +859,9 @@ static int hv_pcifront_write_config(struct pci_bus *bus, 
unsigned int devfn,
 static struct pci_ops hv_pcifront_ops = {
.read  = hv_pcifront_read_config,
.write = hv_pcifront_write_config,
+#ifdef CONFIG_PCI_DOMAINS_GENERIC
+   .use_arch_sysdata = 1,
+#endif
 };
 
 /*
-- 
2.30.2



Re: [PATCH 3/4] locking/ww_mutex: Treat ww_mutex_lock() like a trylock

2021-03-18 Thread Boqun Feng
On Wed, Mar 17, 2021 at 10:54:17PM -0400, Waiman Long wrote:
> On 3/17/21 10:24 PM, Boqun Feng wrote:
> > Hi Waiman,
> > 
> > Just a question out of curiosity: how does this problem hide so long?
> > ;-) Because IIUC, both locktorture and ww_mutex_lock have been there for
> > a while, so why didn't we spot this earlier?
> > 
> > I ask just to make sure we don't introduce the problem because of some
> > subtle problems in lock(dep).
> > 
> You have to explicitly specify ww_mutex in the locktorture module parameter
> to run the test. ww_mutex is usually not the intended target of testing as
> there aren't that many places that use it. Even if someone run it, it
> probably is not on a debug kernel.
> 
> Our QA people try to run locktorture on ww_mutex and discover that.
> 

Got it. Thanks ;-)

Regards,
Boqun

> Cheers,
> Longman
> 


Re: [PATCH 3/4] locking/ww_mutex: Treat ww_mutex_lock() like a trylock

2021-03-17 Thread Boqun Feng
Hi Waiman,

Just a question out of curiosity: how did this problem stay hidden for
so long? ;-) Because IIUC, both locktorture and ww_mutex_lock have been
there for a while, so why didn't we spot this earlier?

I ask just to make sure we don't introduce the problem because of some
subtle problems in lock(dep).

Regards,
Boqun

On Tue, Mar 16, 2021 at 11:31:18AM -0400, Waiman Long wrote:
> It was found that running the ww_mutex_lock-torture test produced the
> following lockdep splat almost immediately:
> 
> [  103.892638] ==
> [  103.892639] WARNING: possible circular locking dependency detected
> [  103.892641] 5.12.0-rc3-debug+ #2 Tainted: G S  W
> [  103.892643] --
> [  103.892643] lock_torture_wr/3234 is trying to acquire lock:
> [  103.892646] c0b35b10 (torture_ww_mutex_2.base){+.+.}-{3:3}, at: 
> torture_ww_mutex_lock+0x316/0x720 [locktorture]
> [  103.892660]
> [  103.892660] but task is already holding lock:
> [  103.892661] c0b35cd0 (torture_ww_mutex_0.base){+.+.}-{3:3}, at: 
> torture_ww_mutex_lock+0x3e2/0x720 [locktorture]
> [  103.892669]
> [  103.892669] which lock already depends on the new lock.
> [  103.892669]
> [  103.892670]
> [  103.892670] the existing dependency chain (in reverse order) is:
> [  103.892671]
> [  103.892671] -> #2 (torture_ww_mutex_0.base){+.+.}-{3:3}:
> [  103.892675]lock_acquire+0x1c5/0x830
> [  103.892682]__ww_mutex_lock.constprop.15+0x1d1/0x2e50
> [  103.892687]ww_mutex_lock+0x4b/0x180
> [  103.892690]torture_ww_mutex_lock+0x316/0x720 [locktorture]
> [  103.892694]lock_torture_writer+0x142/0x3a0 [locktorture]
> [  103.892698]kthread+0x35f/0x430
> [  103.892701]ret_from_fork+0x1f/0x30
> [  103.892706]
> [  103.892706] -> #1 (torture_ww_mutex_1.base){+.+.}-{3:3}:
> [  103.892709]lock_acquire+0x1c5/0x830
> [  103.892712]__ww_mutex_lock.constprop.15+0x1d1/0x2e50
> [  103.892715]ww_mutex_lock+0x4b/0x180
> [  103.892717]torture_ww_mutex_lock+0x316/0x720 [locktorture]
> [  103.892721]lock_torture_writer+0x142/0x3a0 [locktorture]
> [  103.892725]kthread+0x35f/0x430
> [  103.892727]ret_from_fork+0x1f/0x30
> [  103.892730]
> [  103.892730] -> #0 (torture_ww_mutex_2.base){+.+.}-{3:3}:
> [  103.892733]check_prevs_add+0x3fd/0x2470
> [  103.892736]__lock_acquire+0x2602/0x3100
> [  103.892738]lock_acquire+0x1c5/0x830
> [  103.892740]__ww_mutex_lock.constprop.15+0x1d1/0x2e50
> [  103.892743]ww_mutex_lock+0x4b/0x180
> [  103.892746]torture_ww_mutex_lock+0x316/0x720 [locktorture]
> [  103.892749]lock_torture_writer+0x142/0x3a0 [locktorture]
> [  103.892753]kthread+0x35f/0x430
> [  103.892755]ret_from_fork+0x1f/0x30
> [  103.892757]
> [  103.892757] other info that might help us debug this:
> [  103.892757]
> [  103.892758] Chain exists of:
> [  103.892758]   torture_ww_mutex_2.base --> torture_ww_mutex_1.base --> 
> torture_ww_mutex_0.base
> [  103.892758]
> [  103.892763]  Possible unsafe locking scenario:
> [  103.892763]
> [  103.892764]CPU0CPU1
> [  103.892765]
> [  103.892765]   lock(torture_ww_mutex_0.base);
> [  103.892767]  
> lock(torture_ww_mutex_1.base);
> [  103.892770]  
> lock(torture_ww_mutex_0.base);
> [  103.892772]   lock(torture_ww_mutex_2.base);
> [  103.892774]
> [  103.892774]  *** DEADLOCK ***
> 
> Since ww_mutex is supposed to be deadlock-proof if used properly, such
> deadlock scenario should not happen. To avoid this false positive splat,
> treat ww_mutex_lock() like a trylock().
> 
> After applying this patch, the locktorture test can run for a long time
> without triggering the circular locking dependency splat.
> 
> Signed-off-by: Waiman Long 
> ---
>  kernel/locking/mutex.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 622ebdfcd083..bb89393cd3a2 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -946,7 +946,10 @@ __mutex_lock_common(struct mutex *lock, long state, 
> unsigned int subclass,
>   }
>  
>   preempt_disable();
> - mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
> + /*
> +  * Treat as trylock for ww_mutex.
> +  */
> + mutex_acquire_nest(&lock->dep_map, subclass, !!ww_ctx, nest_lock, ip);
>  
>   if (__mutex_trylock(lock) ||
>   mutex_optimistic_spin(lock, ww_ctx, NULL)) {
> -- 
> 2.18.1
> 


Re: [PATCH 10/13] rcu/nocb: Delete bypass_timer upon nocb_gp wakeup

2021-03-15 Thread Boqun Feng
On Wed, Mar 10, 2021 at 11:17:02PM +0100, Frederic Weisbecker wrote:
> On Tue, Mar 02, 2021 at 05:24:56PM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 23, 2021 at 01:10:08AM +0100, Frederic Weisbecker wrote:
> > > A NOCB-gp wake up can safely delete the nocb_bypass_timer. nocb_gp_wait()
> > > is going to check again the bypass state and rearm the bypass timer if
> > > necessary.
> > > 
> > > Signed-off-by: Frederic Weisbecker 
> > > Cc: Josh Triplett 
> > > Cc: Lai Jiangshan 
> > > Cc: Joel Fernandes 
> > > Cc: Neeraj Upadhyay 
> > > Cc: Boqun Feng 
> > 
> > Give that you delete this code a couple of patches later in this series,
> > why not just leave it out entirely?  ;-)
> 
> It's not exactly deleted later, it's rather merged within the
> "del_timer(_gp->nocb_timer)".
> 
> The purpose of that patch is to make it clear that we explicitly cancel
> the nocb_bypass_timer here before we do it implicitly later with the
> merge of nocb_bypass_timer into nocb_timer.
> 
> We could drop that patch, the resulting code in the end of the patchset
> will be the same of course but the behaviour detail described here might
> slip out of the reviewers attention :-)
> 

How about merging the timers first and adding those small improvements
later? I.e. move patches #12 and #13 right after #7 (IIUC, #7 is the
last requirement you need for merging the timers), and then patches
#8~#11 just follow, because IIUC, those patches are not about
correctness but more about avoiding unnecessary timer fire-ups, right?

Just my 2 cents. The overall patchset looks good to me ;-)

Feel free to add

Reviewed-by: Boqun Feng 

Regards,
Boqun

> > 
> > Thanx, Paul
> > 
> > > ---
> > >  kernel/rcu/tree_plugin.h | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index b62ad79bbda5..9da67b0d3997 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -1711,6 +1711,8 @@ static bool __wake_nocb_gp(struct rcu_data *rdp_gp,
> > >   del_timer(&rdp_gp->nocb_timer);
> > >   }
> > >  
> > > + del_timer(&rdp_gp->nocb_bypass_timer);
> > > +
> > >   if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) {
> > >   WRITE_ONCE(rdp_gp->nocb_gp_sleep, false);
> > >   needwake = true;
> 
> Thanks.


Re: [PATCH] leds: trigger: fix potential deadlock with libata

2021-03-06 Thread Boqun Feng
On Sat, Mar 06, 2021 at 09:39:54PM +0100, Marc Kleine-Budde wrote:
> Hello *,
> 
> On 02.11.2020 11:41:52, Andrea Righi wrote:
> > We have the following potential deadlock condition:
> > 
> >  
> >  WARNING: possible irq lock inversion dependency detected
> >  5.10.0-rc2+ #25 Not tainted
> >  
> >  swapper/3/0 just changed the state of lock:
> >  8880063bd618 (>lock){-...}-{2:2}, at: 
> > ata_bmdma_interrupt+0x27/0x200
> >  but this lock took another, HARDIRQ-READ-unsafe lock in the past:
> >   (>leddev_list_lock){.+.?}-{2:2}
> > 
> >  and interrupts could create inverse lock ordering between them.
> 
> [...]
> 
> > ---
> >  drivers/leds/led-triggers.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
> > index 91da90cfb11d..16d1a93a10a8 100644
> > --- a/drivers/leds/led-triggers.c
> > +++ b/drivers/leds/led-triggers.c
> > @@ -378,14 +378,15 @@ void led_trigger_event(struct led_trigger *trig,
> > enum led_brightness brightness)
> >  {
> > struct led_classdev *led_cdev;
> > +   unsigned long flags;
> >  
> > if (!trig)
> > return;
> >  
> > -   read_lock(&trig->leddev_list_lock);
> > +   read_lock_irqsave(&trig->leddev_list_lock, flags);
> > list_for_each_entry(led_cdev, &trig->led_cdevs, trig_list)
> > led_set_brightness(led_cdev, brightness);
> > -   read_unlock(&trig->leddev_list_lock);
> > +   read_unlock_irqrestore(&trig->leddev_list_lock, flags);
> >  }
> >  EXPORT_SYMBOL_GPL(led_trigger_event);
> 
> meanwhile this patch hit v5.10.x stable and caused a performance
> degradation on our use case:
> 
> It's an embedded ARM system, 4x Cortex A53, with an SPI attached CAN
> controller. CAN stands for Controller Area Network and here used to
> connect to some automotive equipment. Over CAN an ISOTP (a CAN-specific
> Transport Protocol) transfer is running. With this patch, we see CAN
> frames delayed for ~6ms, the usual gap between CAN frames is 240µs.
> 
> Reverting this patch, restores the old performance.
> 
> What is the best way to solve this dilemma? Identify the critical path
> in our use case? Is there a way we can get around the irqsave in
> led_trigger_event()?
> 

Probably we can change from rwlock to RCU here, POC code as follows,
only compile tested. Marc, could you see whether this helps the
performance on your platform? Please note that I haven't tested it in a
running kernel and I'm not that familiar with the LED subsystem, so use
it with caution ;-)

(While at it, I think maybe we missed the leddev_list_lock usage in
net/mac80211 in the patch)

Regards,
Boqun
--->8
diff --git a/drivers/leds/led-triggers.c b/drivers/leds/led-triggers.c
index 4e7b78a84149..ae68ccab6cc9 100644
--- a/drivers/leds/led-triggers.c
+++ b/drivers/leds/led-triggers.c
@@ -171,10 +171,12 @@ int led_trigger_set(struct led_classdev *led_cdev, struct 
led_trigger *trig)
 
/* Remove any existing trigger */
if (led_cdev->trigger) {
-   write_lock_irqsave(&led_cdev->trigger->leddev_list_lock, flags);
-   list_del(&led_cdev->trig_list);
-   write_unlock_irqrestore(&led_cdev->trigger->leddev_list_lock,
+   spin_lock_irqsave(&led_cdev->trigger->leddev_list_lock, flags);
+   list_del_rcu(&led_cdev->trig_list);
+   spin_unlock_irqrestore(&led_cdev->trigger->leddev_list_lock,
flags);
+   /* Wait for the readers gone */
+   synchronize_rcu();
cancel_work_sync(_cdev->set_brightness_work);
led_stop_software_blink(led_cdev);
if (led_cdev->trigger->deactivate)
@@ -186,9 +188,9 @@ int led_trigger_set(struct led_classdev *led_cdev, struct 
led_trigger *trig)
led_set_brightness(led_cdev, LED_OFF);
}
if (trig) {
-   write_lock_irqsave(&trig->leddev_list_lock, flags);
-   list_add_tail(&led_cdev->trig_list, &trig->led_cdevs);
-   write_unlock_irqrestore(&trig->leddev_list_lock, flags);
+   spin_lock_irqsave(&trig->leddev_list_lock, flags);
+   list_add_tail_rcu(&led_cdev->trig_list, &trig->led_cdevs);
+   spin_unlock_irqrestore(&trig->leddev_list_lock, flags);
led_cdev->trigger = trig;
 
if (trig->activate)
@@ -223,9 +225,12 @@ int led_trigger_set(struct led_classdev *led_cdev, struct 
led_trigger *trig)
trig->deactivate(led_cdev);
 err_activate:
 
-   write_lock_irqsave(&led_cdev->trigger->leddev_list_lock, flags);
-   list_del(&led_cdev->trig_list);
-   write_unlock_irqrestore(&led_cdev->trigger->leddev_list_lock, flags);
+   spin_lock_irqsave(&led_cdev->trigger->leddev_list_lock, flags);
+   list_del_rcu(&led_cdev->trig_list);
+   spin_unlock_irqrestore(&led_cdev->trigger->leddev_list_lock, flags);
+
+   /* XXX could use call_rcu() here? */
+   
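
(The archived diff is cut off here. For completeness, a minimal sketch
of the reader side this conversion implies -- assuming the rwlock is
replaced by a spinlock for the writers plus RCU for the readers; my
illustration, not the committed fix:)

	void led_trigger_event(struct led_trigger *trig,
			       enum led_brightness brightness)
	{
		struct led_classdev *led_cdev;

		if (!trig)
			return;

		/* Readers no longer take the lock or disable interrupts. */
		rcu_read_lock();
		list_for_each_entry_rcu(led_cdev, &trig->led_cdevs, trig_list)
			led_set_brightness(led_cdev, brightness);
		rcu_read_unlock();
	}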

Re: XDP socket rings, and LKMM litmus tests

2021-03-04 Thread Boqun Feng
On Thu, Mar 04, 2021 at 11:11:42AM -0500, Alan Stern wrote:
> On Thu, Mar 04, 2021 at 02:33:32PM +0800, Boqun Feng wrote:
> 
> > Right, I was thinking about something unrelated.. but how about the
> > following case:
> > 
> > local_v = 
> > r1 = READ_ONCE(*x); // f
> > 
> > if (r1 == 1) {
> > local_v =  // e
> > } else {
> > local_v =  // d
> > }
> > 
> > p = READ_ONCE(local_v); // g
> > 
> > r2 = READ_ONCE(*p);   // h
> > 
> > if r1 == 1, we definitely think we have:
> > 
> > f ->ctrl e ->rfi g ->addr h
> > 
> > , and if we treat ctrl;rfi as "to-r", then we have "f" happens before
> > "h". However compile can optimze the above as:
> > 
> > local_v = 
> > 
> > r1 = READ_ONCE(*x); // f
> > 
> > if (r1 != 1) {
> > local_v =  // d
> > }
> > 
> > p = READ_ONCE(local_v); // g
> > 
> > r2 = READ_ONCE(*p);   // h
> > 
> > , and when this gets executed, I don't think we have the guarantee we
> > have "f" happens before "h", because CPU can do optimistic read for "g"
> > and "h".
> 
> In your example, which accesses are supposed to be to actual memory and 
> which to registers?  Also, remember that the memory model assumes the 

Given that we use READ_ONCE() on local_v, local_v should be a memory
location but only accessed by this thread.

> hardware does not reorder loads if there is an address dependency 
> between them.
> 

Right, so "g" won't be reordered after "h".

> > Part of this is because when we take plain access into consideration, we
> > won't guarantee a read-from or other relations exists if compiler
> > optimization happens.
> > 
> > Maybe I'm missing something subtle, but just try to think through the
> > effect of making dep; rfi as "to-r".
> 
> Forget about local variables for the time being and just consider
> 
>   dep ; [Plain] ; rfi
> 
> For example:
> 
>   A: r1 = READ_ONCE(x);
>  y = r1;
>   B: r2 = READ_ONCE(y);
> 
> Should B be ordered after A?  I don't see how any CPU could hope to 
> excute B before A, but maybe I'm missing something.
> 

Agreed.

> There's another twist, connected with the fact that herd7 can't detect 
> control dependencies caused by unexecuted code.  If we have:
> 
>   A: r1 = READ_ONCE(x);
>   if (r1)
>   WRITE_ONCE(y, 5);
>   r2 = READ_ONCE(y);
>   B: WRITE_ONCE(z, r2);
> 
> then in executions where x == 0, herd7 doesn't see any control 
> dependency.  But CPUs do see control dependencies whenever there is a 
> conditional branch, whether the branch is taken or not, and so they will 
> never reorder B before A.
> 

Right, because B in this example is a write. What if B is a read that
depends on r2, like in my example? Let y be a pointer to a memory
location, initialized to a valid value (pointing to a valid memory
location); your example changed to:

A: r1 = READ_ONCE(x);
if (r1)
WRITE_ONCE(y, 5);
C: r2 = READ_ONCE(y);
B: r3 = READ_ONCE(*r2);

, then A doesn't have a control dependency to B, because A and B are a
read+read pair. So B can be ordered before A, right?

> One last thing to think about: My original assessment or Björn's problem 
> wasn't right, because the dep in (dep ; rfi) doesn't include control 
> dependencies.  Only data and address.  So I believe that the LKMM 

Ah, right. I was missing that part (ctrl is not in dep). So I guess my
example is pointless for the question we are discussing here ;-(

> wouldn't consider A to be ordered before B in this example even if x 
> was nonzero.

Yes, and similar to my example (changing B to a read).

I did try to run my example with herd, and got confused because no
matter whether I made dep; [Plain]; rfi a to-r, I got the same result
telling me a reorder can happen. Now the reason is clear: this is a
ctrl; rfi, not a dep; rfi.

Thanks so much for walking with me on this ;-)

Regards,
Boqun

> 
> Alan


Re: XDP socket rings, and LKMM litmus tests

2021-03-03 Thread Boqun Feng
On Wed, Mar 03, 2021 at 10:13:22PM -0500, Alan Stern wrote:
> On Thu, Mar 04, 2021 at 09:26:31AM +0800, Boqun Feng wrote:
> > On Wed, Mar 03, 2021 at 03:22:46PM -0500, Alan Stern wrote:
> 
> > > Which brings us back to the case of the
> > > 
> > >   dep ; rfi
> > > 
> > > dependency relation, where the accesses in the middle are plain and 
> > > non-racy.  Should the LKMM be changed to allow this?
> > > 
> > 
> > For this particular question, do we need to consider code as the follow?
> > 
> > r1 = READ_ONCE(x);  // f
> > if (r == 1) {
> > local_v =  // g
> > do_something_a();
> > }
> > else {
> > local_v = 
> > do_something_b();
> > }
> > 
> > r2 = READ_ONCE(*local_v); // e
> > 
> > , do we have the guarantee that the first READ_ONCE() happens before the
> > second one? Can compiler optimize the code as:
> > 
> > r2 = READ_ONCE(y);
> > r1 = READ_ONCE(x);
> 
> Well, it can't do that because the compiler isn't allowed to reorder
> volatile accesses (which includes READ_ONCE).  But the compiler could
> do:
> 
>   r1 = READ_ONCE(x);
>   r2 = READ_ONCE(y);
> 
> > if (r == 1) {
> > do_something_a();
> > }
> > else {
> > do_something_b();
> > }
> > 
> > ? Although we have:
> > 
> > f ->dep g ->rfi ->addr e
> 
> This would be an example of a problem Paul has described on several
> occasions, where both arms of an "if" statement store the same value
> (in this case to local_v).  This problem arises even when local
> variables are not involved.  For example:
> 
>   if (READ_ONCE(x) == 0) {
>   WRITE_ONCE(y, 1);
>   do_a();
>   } else {
>   WRITE_ONCE(y, 1);
>   do_b();
>   }
> 
> The compiler can change this to:
> 
>   r = READ_ONCE(x);
>   WRITE_ONCE(y, 1);
>   if (r == 0)
>   do_a();
>   else
>   do_b();
> 
> thus allowing the marked accesses to be reordered by the CPU and
> breaking the apparent control dependency.
> 
> So the answer to your question is: No, we don't have this guarantee,
> but the reason is because of doing the same store in both arms, not
> because of the use of local variables.
> 

Right, I was thinking about something unrelated... but how about the
following case:

local_v = 
r1 = READ_ONCE(*x); // f

if (r1 == 1) {
local_v =  // e
} else {
local_v =  // d
}

p = READ_ONCE(local_v); // g

r2 = READ_ONCE(*p);   // h

if r1 == 1, we definitely think we have:

f ->ctrl e ->rfi g ->addr h

, and if we treat ctrl;rfi as "to-r", then we have "f" happens before
"h". However the compiler can optimize the above as:

local_v = 

r1 = READ_ONCE(*x); // f

if (r1 != 1) {
local_v =  // d
}

p = READ_ONCE(local_v); // g

r2 = READ_ONCE(*p);   // h

, and when this gets executed, I don't think we have the guarantee that
"f" happens before "h", because the CPU can do optimistic reads for "g"
and "h".

Part of this is because when we take plain accesses into consideration,
we can't guarantee a reads-from or other relation exists if compiler
optimization happens.

Maybe I'm missing something subtle; I'm just trying to think through the
effect of making dep; rfi a "to-r".

Regards,
Boqun

> Alan


Re: XDP socket rings, and LKMM litmus tests

2021-03-03 Thread Boqun Feng
On Wed, Mar 03, 2021 at 03:22:46PM -0500, Alan Stern wrote:
> On Wed, Mar 03, 2021 at 09:40:22AM -0800, Paul E. McKenney wrote:
> > On Wed, Mar 03, 2021 at 12:12:21PM -0500, Alan Stern wrote:
> 
> > > Local variables absolutely should be treated just like CPU registers, if 
> > > possible.  In fact, the compiler has the option of keeping local 
> > > variables stored in registers.
> > > 
> > > (Of course, things may get complicated if anyone writes a litmus test 
> > > that uses a pointer to a local variable,  Especially if the pointer 
> > > could hold the address of a local variable in one execution and a 
> > > shared variable in another!  Or if the pointer is itself a shared 
> > > variable and is dereferenced in another thread!)
> > 
> > Good point!  I did miss this complication.  ;-)
> 
> I suspect it wouldn't be so bad if herd7 disallowed taking addresses of 
> local variables.
> 
> > As you say, when its address is taken, the "local" variable needs to be
> > treated as is it were shared.  There are exceptions where the pointed-to
> > local is still used only by its process.  Are any of these exceptions
> > problematic?
> 
> Easiest just to rule out the whole can of worms.
> 
> > > But even if local variables are treated as non-shared storage locations, 
> > > we should still handle this correctly.  Part of the problem seems to lie 
> > > in the definition of the to-r dependency relation; the relevant portion 
> > > is:
> > > 
> > >   (dep ; [Marked] ; rfi)
> > > 
> > > Here dep is the control dependency from the READ_ONCE to the 
> > > local-variable store, and the rfi refers to the following load of the 
> > > local variable.  The problem is that the store to the local variable 
> > > doesn't go in the Marked class, because it is notated as a plain C 
> > > assignment.  (And likewise for the following load.)
> > > 
> > > Should we change the model to make loads from and stores to local 
> > > variables always count as Marked?
> > 
> > As long as the initial (possibly unmarked) load would be properly
> > complained about.
> 
> Sorry, I don't understand what you mean.
> 
> >  And I cannot immediately think of a situation where
> > this approach would break that would not result in a data race being
> > flagged.  Or is this yet another failure of my imagination?
> 
> By definition, an access to a local variable cannot participate in a 
> data race because all such accesses are confined to a single thread.
> 
> However, there are other aspects to consider, in particular, the 
> ordering relations on local-variable accesses.  But if, as Luc says, 
> local variables are treated just like registers then perhaps the issue 
> doesn't arise.
> 
> > > What should have happened if the local variable were instead a shared 
> > > variable which the other thread didn't access at all?  It seems like a 
> > > weak point of the memory model that it treats these two things 
> > > differently.
> > 
> > But is this really any different than the situation where a global
> > variable is only accessed by a single thread?
> 
> Indeed; it is the _same_ situation.  Which leads to some interesting 
> questions, such as: What does READ_ONCE(r) mean when r is a local 
> variable?  Should it be allowed at all?  In what way is it different 
> from a plain read of r?
> 
> One difference is that the LKMM doesn't allow dependencies to originate 
> from a plain load.  Of course, when you're dealing with a local 
> variable, what matters is not the load from that variable but rather the 
> earlier loads which determined the value that had been stored there.  
> Which brings us back to the case of the
> 
>   dep ; rfi
> 
> dependency relation, where the accesses in the middle are plain and 
> non-racy.  Should the LKMM be changed to allow this?
> 

For this particular question, do we need to consider code like the following?

r1 = READ_ONCE(x);  // f
if (r1 == 1) {
local_v =  // g
do_something_a();
}
else {
local_v = 
do_something_b();
}

r2 = READ_ONCE(*local_v); // e

, do we have the guarantee that the first READ_ONCE() happens before the
second one? Can the compiler optimize the code as:

r2 = READ_ONCE(y);
r1 = READ_ONCE(x);

if (r1 == 1) {
do_something_a();
}
else {
do_something_b();
}

? Although we have:

f ->dep g ->rfi ->addr e

Regards,
Boqun

> There are other differences to consider.  For example:
> 
>   r = READ_ONCE(x);
>   smp_wmb();
>   WRITE_ONCE(y, 1);
> 
> If the write to r were treated as a marked store, the smp_wmb would 
> order it (and consequently the READ_ONCE) before the WRITE_ONCE.  
> However we don't want to do this when r is a local variable.  Indeed, a 
> plain store wouldn't be ordered this way because the compiler might 
> optimize the store away entirely, leaving the smp_wmb nothing to act on.
> 
> 

Re: [PATCH v8 1/6] arm64: hyperv: Add Hyper-V hypercall and register access utilities

2021-02-23 Thread Boqun Feng
On Thu, Feb 18, 2021 at 03:16:29PM -0800, Michael Kelley wrote:
[...]
> +
> +/*
> + * Get the value of a single VP register.  One version
> + * returns just 64 bits and another returns the full 128 bits.
> + * The two versions are separate to avoid complicating the
> + * calling sequence for the more frequently used 64 bit version.
> + */
> +
> +void __hv_get_vpreg_128(u32 msr,
> + struct hv_get_vp_registers_input  *input,
> + struct hv_get_vp_registers_output *res)
> +{
> + u64 status;
> +
> + input->header.partitionid = HV_PARTITION_ID_SELF;
> + input->header.vpindex = HV_VP_INDEX_SELF;
> + input->header.inputvtl = 0;
> + input->element[0].name0 = msr;
> + input->element[0].name1 = 0;
> +
> +
> + status = hv_do_hypercall(
> + HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_REP_COMP_1,
> + input, res);
> +
> + /*
> +  * Something is fundamentally broken in the hypervisor if
> +  * getting a VP register fails. There's really no way to
> +  * continue as a guest VM, so panic.
> +  */
> + BUG_ON((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS);
> +}
> +
> +u64 hv_get_vpreg(u32 msr)
> +{
> + struct hv_get_vp_registers_input*input;
> + struct hv_get_vp_registers_output   *output;
> + u64 result;
> +
> + /*
> +  * Allocate a power of 2 size so alignment to that size is
> +  * guaranteed, since the hypercall input and output areas
> +  * must not cross a page boundary.
> +  */
> + input = kzalloc(roundup_pow_of_two(sizeof(input->header) +
> + sizeof(input->element[0])), GFP_ATOMIC);
> + output = kmalloc(roundup_pow_of_two(sizeof(*output)), GFP_ATOMIC);
> +

Do we need to BUG_ON(!input || !output)? Or do we expect the page fault
(for input being NULL) or the failure of the hypercall (for output being
NULL) to tell us the allocation failed?

Hmm.. thinking a bit more about this, maybe we'd better retry the allocation
if it fails. Because say we are under memory pressure, and only have
enough memory for doing one hvcall, and one thread allocates that memory
but gets preempted by another thread trying to do another hvcall:


// thread 1
hv_get_vpreg():
  input = kzalloc(...);
  output = kmalloc(...);

// thread 2
hv_get_vpreg():
  input = kzalloc(...); // allocation fails, but actually if
                        // we wait for thread 1 to finish its
                        // hvcall, we can get enough memory.

, in this case, if thread 2 retried, it might get enough memory,
therefore there is no need to BUG_ON() on allocation failure. That said,
I don't think this is likely to happen, and there may be better
solutions for this, so maybe we can keep it as it is (assuming that
memory allocation for hvcall never fails) and improve later.
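
As a rough sketch of the retry idea (the helper name hv_alloc_hypercall_buf()
and the busy-wait policy are made up here; only the roundup_pow_of_two() /
GFP_ATOMIC sizing is taken from the patch quoted above):

#include <linux/slab.h>
#include <linux/log2.h>
#include <asm/processor.h>

/*
 * Hypothetical helper: retry the hypercall buffer allocation instead of
 * assuming it always succeeds.  Under memory pressure a concurrent
 * hvcall may free its buffers shortly, so a failed GFP_ATOMIC
 * allocation is not necessarily fatal.
 */
static void *hv_alloc_hypercall_buf(size_t size)
{
	void *buf;

	while (!(buf = kzalloc(roundup_pow_of_two(size), GFP_ATOMIC)))
		cpu_relax();	/* busy-wait: fine for a sketch, not for real code */

	return buf;
}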

Regards,
Boqun

> + __hv_get_vpreg_128(msr, input, output);
> +
> + result = output->as64.low;
> + kfree(input);
> + kfree(output);
> + return result;
> +}
> +EXPORT_SYMBOL_GPL(hv_get_vpreg);
> +
> +void hv_get_vpreg_128(u32 msr, struct hv_get_vp_registers_output *res)
> +{
> + struct hv_get_vp_registers_input*input;
> + struct hv_get_vp_registers_output   *output;
> +
> + /*
> +  * Allocate a power of 2 size so alignment to that size is
> +  * guaranteed, since the hypercall input and output areas
> +  * must not cross a page boundary.
> +  */
> + input = kzalloc(roundup_pow_of_two(sizeof(input->header) +
> + sizeof(input->element[0])), GFP_ATOMIC);
> + output = kmalloc(roundup_pow_of_two(sizeof(*output)), GFP_ATOMIC);
> +
> + __hv_get_vpreg_128(msr, input, output);
> +
> + res->as64.low = output->as64.low;
> + res->as64.high = output->as64.high;
> + kfree(input);
> + kfree(output);
> +}
[...]


Re: [PATCH 10/10] clocksource/drivers/hyper-v: Move handling of STIMER0 interrupts

2021-02-22 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:45PM -0800, Michael Kelley wrote:
> STIMER0 interrupts are most naturally modeled as per-cpu IRQs. But
> because x86/x64 doesn't have per-cpu IRQs, the core STIMER0 interrupt
> handling machinery is done in code under arch/x86 and Linux IRQs are
> not used. Adding support for ARM64 means adding equivalent code
> using per-cpu IRQs under arch/arm64.
> 
> A better model is to treat per-cpu IRQs as the normal path (which it is
> for modern architectures), and the x86/x64 path as the exception. Do this
> by incorporating standard Linux per-cpu IRQ allocation into the main
> SITMER0 driver code, and bypass it in the x86/x64 exception case. For
> x86/x64, special case code is retained under arch/x86, but no STIMER0
> interrupt handling code is needed under arch/arm64.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 
> ---
>  arch/x86/hyperv/hv_init.c  |   2 +-
>  arch/x86/include/asm/mshyperv.h|   4 -
>  arch/x86/kernel/cpu/mshyperv.c |  10 +--
>  drivers/clocksource/hyperv_timer.c | 170 
> +
>  include/asm-generic/mshyperv.h |   5 --
>  include/clocksource/hyperv_timer.h |   3 +-
>  6 files changed, 123 insertions(+), 71 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 22e9557..fe37546 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -371,7 +371,7 @@ void __init hyperv_init(void)
>* Ignore any errors in setting up stimer clockevents
>* as we can run with the LAPIC timer as a fallback.
>*/
> - (void)hv_stimer_alloc();
> + (void)hv_stimer_alloc(false);
>  
>   hv_apic_init();
>  
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 5ccbba8..941dd55 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -31,10 +31,6 @@ static inline u64 hv_get_register(unsigned int reg)
>  
>  void hyperv_vector_handler(struct pt_regs *regs);
>  
> -static inline void hv_enable_stimer0_percpu_irq(int irq) {}
> -static inline void hv_disable_stimer0_percpu_irq(int irq) {}
> -
> -
>  #if IS_ENABLED(CONFIG_HYPERV)
>  extern void *hv_hypercall_pg;
>  extern void  __percpu  **hyperv_pcpu_input_arg;
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 5679100a1..440507e 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -85,21 +85,17 @@ void hv_remove_vmbus_handler(void)
>   set_irq_regs(old_regs);
>  }
>  
> -int hv_setup_stimer0_irq(int *irq, int *vector, void (*handler)(void))
> +/* For x86/x64, override weak placeholders in hyperv_timer.c */
> +void hv_setup_stimer0_handler(void (*handler)(void))
>  {
> - *vector = HYPERV_STIMER0_VECTOR;
> - *irq = -1;   /* Unused on x86/x64 */
>   hv_stimer0_handler = handler;
> - return 0;
>  }
> -EXPORT_SYMBOL_GPL(hv_setup_stimer0_irq);
>  
> -void hv_remove_stimer0_irq(int irq)
> +void hv_remove_stimer0_handler(void)
>  {
>   /* We have no way to deallocate the interrupt gate */
>   hv_stimer0_handler = NULL;
>  }
> -EXPORT_SYMBOL_GPL(hv_remove_stimer0_irq);
>  
>  void hv_setup_kexec_handler(void (*handler)(void))
>  {
> diff --git a/drivers/clocksource/hyperv_timer.c 
> b/drivers/clocksource/hyperv_timer.c
> index edf2d43..c553b8c 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -18,6 +18,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -43,14 +46,13 @@
>   */
>  static bool direct_mode_enabled;
>  
> -static int stimer0_irq;
> -static int stimer0_vector;
> +static int stimer0_irq = -1;
> +static long __percpu *stimer0_evt;
>  static int stimer0_message_sint;
>  
>  /*
> - * ISR for when stimer0 is operating in Direct Mode.  Direct Mode
> - * does not use VMbus or any VMbus messages, so process here and not
> - * in the VMbus driver code.
> + * Common code for stimer0 interrupts coming via Direct Mode or
> + * as a VMbus message.
>   */
>  void hv_stimer0_isr(void)
>  {
> @@ -61,6 +63,16 @@ void hv_stimer0_isr(void)
>  }
>  EXPORT_SYMBOL_GPL(hv_stimer0_isr);
>  
> +/*
> + * stimer0 interrupt handler for architectures that support
> + * per-cpu interrupts, which also implies Direct Mode.
> + */
> +static irqreturn_t hv_stimer0_percpu_isr(int irq, void *dev_id)
> +{
> + hv_stimer0_isr();
> + return IRQ_HANDLED;
> +}
> +
>  static int hv_ce_set_next_event(unsigned long delta,
>   struct clock_event_device *evt)
>  {
> @@ -76,8 +88,8 @@ static int hv_ce_shutdown(struct clock_event_device *evt)
>  {
>   hv_set_register(HV_REGISTER_STIMER0_COUNT, 0);
>   hv_set_register(HV_REGISTER_STIMER0_CONFIG, 0);
> - if (direct_mode_enabled)
> - hv_disable_stimer0_percpu_irq(stimer0_irq);
> + if (direct_mode_enabled && stimer0_irq >= 0)
> +  

Re: [PATCH 09/10] clocksource/drivers/hyper-v: Set clocksource rating based on Hyper-V feature

2021-02-22 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:44PM -0800, Michael Kelley wrote:
> On x86/x64, the TSC clocksource is available in a Hyper-V VM only if
> Hyper-V provides the TSC_INVARIANT flag. The rating on the Hyper-V
> Reference TSC page clocksource is currently set so that it will not
> override the TSC clocksource in this case.  Alternatively, if the TSC
> clocksource is not available, then the Hyper-V clocksource is used.
> 
> But on ARM64, the Hyper-V Reference TSC page clocksource should
> override the ARM arch counter, since the Hyper-V clocksource provides
> scaling and offsetting during live migrations that is not provided
> for the ARM arch counter.
> 
> To get the needed behavior for both x86/x64 and ARM64, tweak the
> logic by defaulting the Hyper-V Reference TSC page clocksource
> rating to a large value that will always override.  If the Hyper-V
> TSC_INVARIANT flag is set, then reduce the rating so that it will not
> override the TSC.
> 
> While the logic for getting there is slightly different, the net
> result in the normal cases is no functional change.
> 

One question here, please see below:

> Signed-off-by: Michael Kelley 
> ---
>  drivers/clocksource/hyperv_timer.c | 23 +--
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/clocksource/hyperv_timer.c 
> b/drivers/clocksource/hyperv_timer.c
> index a2bee50..edf2d43 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -302,14 +302,6 @@ void hv_stimer_global_cleanup(void)
>   * the other that uses the TSC reference page feature as defined in the
>   * TLFS.  The MSR version is for compatibility with old versions of
>   * Hyper-V and 32-bit x86.  The TSC reference page version is preferred.
> - *
> - * The Hyper-V clocksource ratings of 250 are chosen to be below the
> - * TSC clocksource rating of 300.  In configurations where Hyper-V offers
> - * an InvariantTSC, the TSC is not marked "unstable", so the TSC clocksource
> - * is available and preferred.  With the higher rating, it will be the
> - * default.  On older hardware and Hyper-V versions, the TSC is marked
> - * "unstable", so no TSC clocksource is created and the selected Hyper-V
> - * clocksource will be the default.
>   */
>  
>  u64 (*hv_read_reference_counter)(void);
> @@ -380,7 +372,7 @@ static int hv_cs_enable(struct clocksource *cs)
>  
>  static struct clocksource hyperv_cs_tsc = {
>   .name   = "hyperv_clocksource_tsc_page",
> - .rating = 250,
> + .rating = 500,
>   .read   = read_hv_clock_tsc_cs,
>   .mask   = CLOCKSOURCE_MASK(64),
>   .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
> @@ -417,7 +409,7 @@ static u64 notrace read_hv_sched_clock_msr(void)
>  
>  static struct clocksource hyperv_cs_msr = {
>   .name   = "hyperv_clocksource_msr",
> - .rating = 250,
> + .rating = 500,

Before this patch, since the ".rating" of hyperv_cs_msr is 250, which is
smaller than the TSC clocksource rating, the TSC clocksource is better.
After this patch, in the case where HV_MSR_REFERENCE_TSC_AVAILABLE bit
is 0, we make hyperv_cs_msr better than the TSC clocksource (and we
don't lower the rating of hyperv_cs_msr if TSC_INVARIANT is not
offered), right?  Could you explain why we need the change? Or maybe I'm
missing something?
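
For reference, the comparison I have in mind (the TSC rating of 300 comes
from the comment removed above, and the clocksource core prefers whichever
registered clocksource has the highest rating):

	before this patch:  tsc (300) > hyperv_clocksource_msr (250)  ->  TSC selected
	after this patch:   hyperv_clocksource_msr (500) > tsc (300)  ->  Hyper-V MSR selected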

Regards,
Boqun

>   .read   = read_hv_clock_msr_cs,
>   .mask   = CLOCKSOURCE_MASK(64),
>   .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
> @@ -452,6 +444,17 @@ static bool __init hv_init_tsc_clocksource(void)
>   if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
>   return false;
>  
> + /*
> +  * If Hyper-V offers TSC_INVARIANT, then the virtualized TSC correctly
> +  * handles frequency and offset changes due to live migration,
> +  * pause/resume, and other VM management operations.  So lower the
> +  * Hyper-V Reference TSC rating, causing the generic TSC to be used.
> +  * TSC_INVARIANT is not offered on ARM64, so the Hyper-V Reference
> +  * TSC will be preferred over the virtualized ARM64 arch counter.
> +  */
> + if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT)
> + hyperv_cs_tsc.rating = 250;
> +
>   hv_read_reference_counter = read_hv_clock_tsc;
>   phys_addr = virt_to_phys(hv_get_tsc_page());
>  
> -- 
> 1.8.3.1
> 


Re: [PATCH 08/10] clocksource/drivers/hyper-v: Handle sched_clock differences inline

2021-02-22 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:43PM -0800, Michael Kelley wrote:
> While the Hyper-V Reference TSC code is architecture neutral, the
> pv_ops.time.sched_clock() function is implemented for x86/x64, but not
> for ARM64. Current code calls a utility function under arch/x86 (and
> coming, under arch/arm64) to handle the difference.
> 
> Change this approach to handle the difference inline based on whether
> GENERIC_SCHED_CLOCK is present.  The new approach removes code under
> arch/* since the difference is tied more to the specifics of the Linux
> implementation than to the architecture.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/include/asm/mshyperv.h| 11 ---
>  drivers/clocksource/hyperv_timer.c | 21 +
>  2 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index ed9dc56..5ccbba8 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -29,17 +29,6 @@ static inline u64 hv_get_register(unsigned int reg)
>  
>  #define hv_get_raw_timer() rdtsc_ordered()
>  
> -/*
> - * Reference to pv_ops must be inline so objtool
> - * detection of noinstr violations can work correctly.
> - */
> -static __always_inline void hv_setup_sched_clock(void *sched_clock)
> -{
> -#ifdef CONFIG_PARAVIRT
> - pv_ops.time.sched_clock = sched_clock;
> -#endif
> -}
> -
>  void hyperv_vector_handler(struct pt_regs *regs);
>  
>  static inline void hv_enable_stimer0_percpu_irq(int irq) {}
> diff --git a/drivers/clocksource/hyperv_timer.c 
> b/drivers/clocksource/hyperv_timer.c
> index 9cee6db..a2bee50 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -423,6 +423,27 @@ static u64 notrace read_hv_sched_clock_msr(void)
>   .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
>  };
>  
> +/*
> + * Reference to pv_ops must be inline so objtool
> + * detection of noinstr violations can work correctly.
> + */
> +static __always_inline void hv_setup_sched_clock(void *sched_clock)
> +{
> +#ifdef CONFIG_GENERIC_SCHED_CLOCK
> + /*
> +  * We're on an architecture with generic sched clock (not x86/x64).
> +  * The Hyper-V sched clock read function returns nanoseconds, not
> +  * the normal 100ns units of the Hyper-V synthetic clock.
> +  */
> + sched_clock_register(sched_clock, 64, NSEC_PER_SEC);
> +#else
> +#ifdef CONFIG_PARAVIRT
> + /* We're on x86/x64 *and* using PV ops */
> + pv_ops.time.sched_clock = sched_clock;
> +#endif
> +#endif
> +}
> +
>  static bool __init hv_init_tsc_clocksource(void)
>  {
>   u64 tsc_msr;
> -- 
> 1.8.3.1
> 


Re: [PATCH 07/10] clocksource/drivers/hyper-v: Handle vDSO differences inline

2021-02-21 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:42PM -0800, Michael Kelley wrote:
> While the driver for the Hyper-V Reference TSC and STIMERs is architecture
> neutral, vDSO is implemented for x86/x64, but not for ARM64.  Current code
> calls into utility functions under arch/x86 (and coming, under arch/arm64)
> to handle the difference.
> 
> Change this approach to handle the difference inline based on whether
> VDSO_CLOCK_MODE_HVCLOCK is present.  The new approach removes code under
> arch/* since the difference is tied more to the specifics of the Linux
> implementation than to the architecture.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/include/asm/mshyperv.h|  4 
>  drivers/clocksource/hyperv_timer.c | 10 --
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 4d3e0c5..ed9dc56 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -27,10 +27,6 @@ static inline u64 hv_get_register(unsigned int reg)
>   return value;
>  }
>  
> -#define hv_set_clocksource_vdso(val) \
> - ((val).vdso_clock_mode = VDSO_CLOCKMODE_HVCLOCK)
> -#define hv_enable_vdso_clocksource() \
> - vclocks_set_used(VDSO_CLOCKMODE_HVCLOCK);
>  #define hv_get_raw_timer() rdtsc_ordered()
>  
>  /*
> diff --git a/drivers/clocksource/hyperv_timer.c 
> b/drivers/clocksource/hyperv_timer.c
> index 9425308..9cee6db 100644
> --- a/drivers/clocksource/hyperv_timer.c
> +++ b/drivers/clocksource/hyperv_timer.c
> @@ -372,7 +372,9 @@ static void resume_hv_clock_tsc(struct clocksource *arg)
>  
>  static int hv_cs_enable(struct clocksource *cs)
>  {
> - hv_enable_vdso_clocksource();
> +#ifdef VDSO_CLOCKMODE_HVCLOCK
> + vclocks_set_used(VDSO_CLOCKMODE_HVCLOCK);
> +#endif
>   return 0;
>  }
>  
> @@ -385,6 +387,11 @@ static int hv_cs_enable(struct clocksource *cs)
>   .suspend= suspend_hv_clock_tsc,
>   .resume = resume_hv_clock_tsc,
>   .enable = hv_cs_enable,
> +#ifdef VDSO_CLOCKMODE_HVCLOCK
> + .vdso_clock_mode = VDSO_CLOCKMODE_HVCLOCK,
> +#else
> + .vdso_clock_mode = VDSO_CLOCKMODE_NONE,
> +#endif
>  };
>  
>  static u64 notrace read_hv_clock_msr(void)
> @@ -439,7 +446,6 @@ static bool __init hv_init_tsc_clocksource(void)
>   tsc_msr = tsc_msr | 0x1 | (u64)phys_addr;
>   hv_set_register(HV_REGISTER_REFERENCE_TSC, tsc_msr);
>  
> - hv_set_clocksource_vdso(hyperv_cs_tsc);
>   clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
>  
>   hv_sched_clock_offset = hv_read_reference_counter();
> -- 
> 1.8.3.1
> 


Re: [PATCH 06/10] Drivers: hv: vmbus: Move handling of VMbus interrupts

2021-02-21 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:41PM -0800, Michael Kelley wrote:
> VMbus interrupts are most naturally modelled as per-cpu IRQs.  But
> because x86/x64 doesn't have per-cpu IRQs, the core VMbus interrupt
> handling machinery is done in code under arch/x86 and Linux IRQs are
> not used.  Adding support for ARM64 means adding equivalent code
> using per-cpu IRQs under arch/arm64.
> 
> A better model is to treat per-cpu IRQs as the normal path (which it is
> for modern architectures), and the x86/x64 path as the exception.  Do this
> by incorporating standard Linux per-cpu IRQ allocation into the main VMbus
> driver, and bypassing it in the x86/x64 exception case. For x86/x64,
> special case code is retained under arch/x86, but no VMbus interrupt
> handling code is needed under arch/arm64.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/include/asm/mshyperv.h |  1 -
>  arch/x86/kernel/cpu/mshyperv.c  | 13 +++--
>  drivers/hv/hv.c |  8 +-
>  drivers/hv/vmbus_drv.c  | 63 
> -
>  include/asm-generic/mshyperv.h  |  7 ++---
>  5 files changed, 70 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index d12a188..4d3e0c5 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -32,7 +32,6 @@ static inline u64 hv_get_register(unsigned int reg)
>  #define hv_enable_vdso_clocksource() \
>   vclocks_set_used(VDSO_CLOCKMODE_HVCLOCK);
>  #define hv_get_raw_timer() rdtsc_ordered()
> -#define hv_get_vector() HYPERVISOR_CALLBACK_VECTOR
>  
>  /*
>   * Reference to pv_ops must be inline so objtool
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index f628e3dc..5679100a1 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -55,23 +55,18 @@
>   set_irq_regs(old_regs);
>  }
>  
> -int hv_setup_vmbus_irq(int irq, void (*handler)(void))
> +void hv_setup_vmbus_handler(void (*handler)(void))
>  {
> - /*
> -  * The 'irq' argument is ignored on x86/x64 because a hard-coded
> -  * interrupt vector is used for Hyper-V interrupts.
> -  */
>   vmbus_handler = handler;
> - return 0;
>  }
> +EXPORT_SYMBOL_GPL(hv_setup_vmbus_handler);
>  
> -void hv_remove_vmbus_irq(void)
> +void hv_remove_vmbus_handler(void)
>  {
>   /* We have no way to deallocate the interrupt gate */
>   vmbus_handler = NULL;
>  }
> -EXPORT_SYMBOL_GPL(hv_setup_vmbus_irq);
> -EXPORT_SYMBOL_GPL(hv_remove_vmbus_irq);
> +EXPORT_SYMBOL_GPL(hv_remove_vmbus_handler);
>  
>  /*
>   * Routines to do per-architecture handling of stimer0
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index afe7a62..917b29e 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "hyperv_vmbus.h"
> @@ -214,10 +215,12 @@ void hv_synic_enable_regs(unsigned int cpu)
>   hv_set_register(HV_REGISTER_SIEFP, siefp.as_uint64);
>  
>   /* Setup the shared SINT. */
> + if (vmbus_irq != -1)
> + enable_percpu_irq(vmbus_irq, 0);
>   shared_sint.as_uint64 = hv_get_register(HV_REGISTER_SINT0 +
>   VMBUS_MESSAGE_SINT);
>  
> - shared_sint.vector = hv_get_vector();
> + shared_sint.vector = vmbus_interrupt;
>   shared_sint.masked = false;
>  
>   /*
> @@ -285,6 +288,9 @@ void hv_synic_disable_regs(unsigned int cpu)
>   sctrl.as_uint64 = hv_get_register(HV_REGISTER_SCONTROL);
>   sctrl.enable = 0;
>   hv_set_register(HV_REGISTER_SCONTROL, sctrl.as_uint64);
> +
> + if (vmbus_irq != -1)
> + disable_percpu_irq(vmbus_irq);
>  }
>  
>  
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 8affe68..62721a7 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -48,8 +48,10 @@ struct vmbus_dynid {
>  
>  static void *hv_panic_page;
>  
> +static long __percpu *vmbus_evt;
> +
>  /* Values parsed from ACPI DSDT */
> -static int vmbus_irq;
> +int vmbus_irq;
>  int vmbus_interrupt;
>  
>  /*
> @@ -1354,7 +1356,13 @@ static void vmbus_isr(void)
>   tasklet_schedule(&hv_cpu->msg_dpc);
>   }
>  
> - add_interrupt_randomness(hv_get_vector(), 0);
> + add_interrupt_randomness(vmbus_interrupt, 0);
> +}
> +
> +static irqreturn_t vmbus_percpu_isr(int irq, void *dev_id)
> +{
> + vmbus_isr()

Re: [PATCH 05/10] Drivers: hv: vmbus: Handle auto EOI quirk inline

2021-02-21 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:40PM -0800, Michael Kelley wrote:
> On x86/x64, Hyper-V provides a flag to indicate auto EOI functionality,
> but it doesn't on ARM64. Handle this quirk inline instead of calling
> into code under arch/x86 (and coming, under arch/arm64).
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/include/asm/mshyperv.h |  3 ---
>  drivers/hv/hv.c | 12 +++-
>  2 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index eba637d1..d12a188 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -27,9 +27,6 @@ static inline u64 hv_get_register(unsigned int reg)
>   return value;
>  }
>  
> -#define hv_recommend_using_aeoi() \
> - (!(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED))
> -
>  #define hv_set_clocksource_vdso(val) \
>   ((val).vdso_clock_mode = VDSO_CLOCKMODE_HVCLOCK)
>  #define hv_enable_vdso_clocksource() \
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 0c1fa69..afe7a62 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -219,7 +219,17 @@ void hv_synic_enable_regs(unsigned int cpu)
>  
>   shared_sint.vector = hv_get_vector();
>   shared_sint.masked = false;
> - shared_sint.auto_eoi = hv_recommend_using_aeoi();
> +
> + /*
> +  * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
> +  * it doesn't provide a recommendation flag and AEOI must be disabled.
> +  */
> +#ifdef HV_DEPRECATING_AEOI_RECOMMENDED
> + shared_sint.auto_eoi =
> + !(ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED);
> +#else
> + shared_sint.auto_eoi = 0;
> +#endif
>   hv_set_register(HV_REGISTER_SINT0 + VMBUS_MESSAGE_SINT,
>   shared_sint.as_uint64);
>  
> -- 
> 1.8.3.1
> 


Re: [PATCH 04/10] Drivers: hv: vmbus: Move hyperv_report_panic_msg to arch neutral code

2021-02-21 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:39PM -0800, Michael Kelley wrote:
> With the new Hyper-V MSR set function, hyperv_report_panic_msg() can be
> architecture neutral, so move it out from under arch/x86 and merge into
> hv_kmsg_dump(). This move also avoids needing a separate implementation
> under arch/arm64.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/hyperv/hv_init.c  | 27 ---
>  drivers/hv/vmbus_drv.c | 24 +++-
>  include/asm-generic/mshyperv.h |  1 -
>  3 files changed, 19 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 9b2cdbe..22e9557 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -452,33 +452,6 @@ void hyperv_report_panic(struct pt_regs *regs, long err, 
> bool in_die)
>  }
>  EXPORT_SYMBOL_GPL(hyperv_report_panic);
>  
> -/**
> - * hyperv_report_panic_msg - report panic message to Hyper-V
> - * @pa: physical address of the panic page containing the message
> - * @size: size of the message in the page
> - */
> -void hyperv_report_panic_msg(phys_addr_t pa, size_t size)
> -{
> - /*
> -  * P3 to contain the physical address of the panic page & P4 to
> -  * contain the size of the panic data in that page. Rest of the
> -  * registers are no-op when the NOTIFY_MSG flag is set.
> -  */
> - wrmsrl(HV_X64_MSR_CRASH_P0, 0);
> - wrmsrl(HV_X64_MSR_CRASH_P1, 0);
> - wrmsrl(HV_X64_MSR_CRASH_P2, 0);
> - wrmsrl(HV_X64_MSR_CRASH_P3, pa);
> - wrmsrl(HV_X64_MSR_CRASH_P4, size);
> -
> - /*
> -  * Let Hyper-V know there is crash data available along with
> -  * the panic message.
> -  */
> - wrmsrl(HV_X64_MSR_CRASH_CTL,
> -(HV_CRASH_CTL_CRASH_NOTIFY | HV_CRASH_CTL_CRASH_NOTIFY_MSG));
> -}
> -EXPORT_SYMBOL_GPL(hyperv_report_panic_msg);
> -
>  bool hv_is_hyperv_initialized(void)
>  {
>   union hv_x64_msr_hypercall_contents hypercall_msr;
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 089f165..8affe68 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1365,22 +1365,36 @@ static void hv_kmsg_dump(struct kmsg_dumper *dumper,
>enum kmsg_dump_reason reason)
>  {
>   size_t bytes_written;
> - phys_addr_t panic_pa;
>  
>   /* We are only interested in panics. */
>   if ((reason != KMSG_DUMP_PANIC) || (!sysctl_record_panic_msg))
>   return;
>  
> - panic_pa = virt_to_phys(hv_panic_page);
> -
>   /*
>* Write dump contents to the page. No need to synchronize; panic should
>* be single-threaded.
>*/
>   kmsg_dump_get_buffer(dumper, false, hv_panic_page, HV_HYP_PAGE_SIZE,
>			     &bytes_written);
> - if (bytes_written)
> - hyperv_report_panic_msg(panic_pa, bytes_written);
> + if (!bytes_written)
> + return;
> + /*
> +  * P3 to contain the physical address of the panic page & P4 to
> +  * contain the size of the panic data in that page. Rest of the
> +  * registers are no-op when the NOTIFY_MSG flag is set.
> +  */
> + hv_set_register(HV_REGISTER_CRASH_P0, 0);
> + hv_set_register(HV_REGISTER_CRASH_P1, 0);
> + hv_set_register(HV_REGISTER_CRASH_P2, 0);
> + hv_set_register(HV_REGISTER_CRASH_P3, virt_to_phys(hv_panic_page));
> + hv_set_register(HV_REGISTER_CRASH_P4, bytes_written);
> +
> + /*
> +  * Let Hyper-V know there is crash data available along with
> +  * the panic message.
> +  */
> + hv_set_register(HV_REGISTER_CRASH_CTL,
> +(HV_CRASH_CTL_CRASH_NOTIFY | HV_CRASH_CTL_CRASH_NOTIFY_MSG));
>  }
>  
>  static struct kmsg_dumper hv_kmsg_dumper = {
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 10c97a9..6a8072f 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -170,7 +170,6 @@ static inline int cpumask_to_vpset(struct hv_vpset *vpset,
>  }
>  
>  void hyperv_report_panic(struct pt_regs *regs, long err, bool in_die);
> -void hyperv_report_panic_msg(phys_addr_t pa, size_t size);
>  bool hv_is_hyperv_initialized(void);
>  bool hv_is_hibernation_supported(void);
>  void hyperv_cleanup(void);
> -- 
> 1.8.3.1
> 


Re: [PATCH 03/10] Drivers: hv: Redo Hyper-V synthetic MSR get/set functions

2021-02-21 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:38PM -0800, Michael Kelley wrote:
> Current code defines a separate get and set macro for each Hyper-V
> synthetic MSR used by the VMbus driver. Furthermore, the get macro
> can't be converted to a standard function because the second argument
> is modified in place, which is somewhat bad form.
> 
> Redo this by providing a single get and a single set function that
> take a parameter specifying the MSR to be operated on. Fixup usage
> of the get function. Calling locations are no more complex than before,
> but the code under arch/x86 and the upcoming code under arch/arm64
> is significantly simplified.
> 
> Also standardize the names of Hyper-V synthetic MSRs that are
> architecture neutral. But keep the old x86-specific names as aliases
> that can be removed later when all references (particularly in KVM
> code) have been cleaned up in a separate patch series.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/hyperv/hv_init.c  |   2 +-
>  arch/x86/include/asm/hyperv-tlfs.h | 102 
> +++--
>  arch/x86/include/asm/mshyperv.h|  39 --
>  drivers/clocksource/hyperv_timer.c |  26 +-
>  drivers/hv/hv.c|  37 --
>  drivers/hv/vmbus_drv.c |   2 +-
>  include/asm-generic/mshyperv.h |   2 +-
>  7 files changed, 110 insertions(+), 100 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 2d1688e..9b2cdbe 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -58,7 +58,7 @@ static int hv_cpu_init(unsigned int cpu)
>   return -ENOMEM;
>   *input_arg = page_address(pg);
>  
> - hv_get_vp_index(msr_vp_index);
> + msr_vp_index = hv_get_register(HV_REGISTER_VP_INDEX);
>  
>   hv_vp_index[smp_processor_id()] = msr_vp_index;
>  
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index dd74066..545026e 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -131,7 +131,7 @@
>  #define HV_X64_MSR_HYPERCALL 0x4001
>  
>  /* MSR used to provide vcpu index */
> -#define HV_X64_MSR_VP_INDEX  0x4002
> +#define HV_REGISTER_VP_INDEX 0x4002
>  
>  /* MSR used to reset the guest OS. */
>  #define HV_X64_MSR_RESET 0x4003
> @@ -140,10 +140,10 @@
>  #define HV_X64_MSR_VP_RUNTIME0x4010
>  
>  /* MSR used to read the per-partition time reference counter */
> -#define HV_X64_MSR_TIME_REF_COUNT0x4020
> +#define HV_REGISTER_TIME_REF_COUNT   0x4020
>  
>  /* A partition's reference time stamp counter (TSC) page */
> -#define HV_X64_MSR_REFERENCE_TSC 0x4021
> +#define HV_REGISTER_REFERENCE_TSC0x4021
>  
>  /* MSR used to retrieve the TSC frequency */
>  #define HV_X64_MSR_TSC_FREQUENCY 0x4022
> @@ -158,50 +158,50 @@
>  #define HV_X64_MSR_VP_ASSIST_PAGE0x4073
>  
>  /* Define synthetic interrupt controller model specific registers. */
> -#define HV_X64_MSR_SCONTROL  0x4080
> -#define HV_X64_MSR_SVERSION  0x4081
> -#define HV_X64_MSR_SIEFP 0x4082
> -#define HV_X64_MSR_SIMP  0x4083
> -#define HV_X64_MSR_EOM   0x4084
> -#define HV_X64_MSR_SINT0 0x4090
> -#define HV_X64_MSR_SINT1 0x4091
> -#define HV_X64_MSR_SINT2 0x4092
> -#define HV_X64_MSR_SINT3 0x4093
> -#define HV_X64_MSR_SINT4 0x4094
> -#define HV_X64_MSR_SINT5 0x4095
> -#define HV_X64_MSR_SINT6 0x4096
> -#define HV_X64_MSR_SINT7 0x4097
> -#define HV_X64_MSR_SINT8 0x4098
> -#define HV_X64_MSR_SINT9 0x4099
> -#define HV_X64_MSR_SINT100x409A
> -#define HV_X64_MSR_SINT110x409B
> -#define HV_X64_MSR_SINT120x409C
> -#define HV_X64_MSR_SINT130x409D
> -#define HV_X64_MSR_SINT140x409E
> -#define HV_X64_MSR_SINT150x409F
> +#define HV_REGISTER_SCONTROL 0x4080
> +#define HV_REGISTER_SVERSION 0x4081
> +#define HV_REGISTER_SIEFP0x4082
> +#define HV_RE

Re: [PATCH 02/10] x86/hyper-v: Move hv_message_type to architecture neutral module

2021-02-21 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:37PM -0800, Michael Kelley wrote:
> The definition of enum hv_message_type includes arch neutral and
> x86/x64-specific values. Ideally there would be a way to put the
> arch neutral values in an arch neutral module, and the arch
> specific values in an arch specific module. But C doesn't provide
> a way to extend enum types. As a compromise, move the entire
> definition into an arch neutral module, to avoid duplicating the
> arch neutral values for x86/x64 and for ARM64.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 29 -
>  include/asm-generic/hyperv-tlfs.h  | 35 +++
>  2 files changed, 35 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index 6bf42ae..dd74066 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -263,35 +263,6 @@ struct hv_tsc_emulation_status {
>  #define HV_X64_MSR_TSC_REFERENCE_ENABLE  0x0001
>  #define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT   12
>  
> -
> -/* Define hypervisor message types. */
> -enum hv_message_type {
> - HVMSG_NONE  = 0x,
> -
> - /* Memory access messages. */
> - HVMSG_UNMAPPED_GPA  = 0x8000,
> - HVMSG_GPA_INTERCEPT = 0x8001,
> -
> - /* Timer notification messages. */
> - HVMSG_TIMER_EXPIRED = 0x8010,
> -
> - /* Error messages. */
> - HVMSG_INVALID_VP_REGISTER_VALUE = 0x8020,
> - HVMSG_UNRECOVERABLE_EXCEPTION   = 0x8021,
> - HVMSG_UNSUPPORTED_FEATURE   = 0x8022,
> -
> - /* Trace buffer complete messages. */
> - HVMSG_EVENTLOG_BUFFERCOMPLETE   = 0x8040,
> -
> - /* Platform-specific processor intercept messages. */
> - HVMSG_X64_IOPORT_INTERCEPT  = 0x8001,
> - HVMSG_X64_MSR_INTERCEPT = 0x80010001,
> - HVMSG_X64_CPUID_INTERCEPT   = 0x80010002,
> - HVMSG_X64_EXCEPTION_INTERCEPT   = 0x80010003,
> - HVMSG_X64_APIC_EOI  = 0x80010004,
> - HVMSG_X64_LEGACY_FP_ERROR   = 0x80010005
> -};
> -
>  struct hv_nested_enlightenments_control {
>   struct {
>   __u32 directhypercall:1;
> diff --git a/include/asm-generic/hyperv-tlfs.h 
> b/include/asm-generic/hyperv-tlfs.h
> index e73a118..d06f7b1 100644
> --- a/include/asm-generic/hyperv-tlfs.h
> +++ b/include/asm-generic/hyperv-tlfs.h
> @@ -213,6 +213,41 @@ enum HV_GENERIC_SET_FORMAT {
>  #define HV_MESSAGE_PAYLOAD_BYTE_COUNT(240)
>  #define HV_MESSAGE_PAYLOAD_QWORD_COUNT   (30)
>  
> +/*
> + * Define hypervisor message types. Some of the message types
> + * are x86/x64 specific, but there's no good way to separate
> + * them out into the arch-specific version of hyperv-tlfs.h
> + * because C doesn't provide a way to extend enum types.
> + * Keeping them all in the arch neutral hyperv-tlfs.h seems
> + * the least messy compromise.
> + */
> +enum hv_message_type {
> + HVMSG_NONE  = 0x,
> +
> + /* Memory access messages. */
> + HVMSG_UNMAPPED_GPA  = 0x8000,
> + HVMSG_GPA_INTERCEPT = 0x8001,
> +
> + /* Timer notification messages. */
> + HVMSG_TIMER_EXPIRED = 0x8010,
> +
> + /* Error messages. */
> + HVMSG_INVALID_VP_REGISTER_VALUE = 0x8020,
> + HVMSG_UNRECOVERABLE_EXCEPTION   = 0x8021,
> + HVMSG_UNSUPPORTED_FEATURE   = 0x8022,
> +
> + /* Trace buffer complete messages. */
> + HVMSG_EVENTLOG_BUFFERCOMPLETE   = 0x8040,
> +
> + /* Platform-specific processor intercept messages. */
> + HVMSG_X64_IOPORT_INTERCEPT  = 0x8001,
> + HVMSG_X64_MSR_INTERCEPT = 0x80010001,
> + HVMSG_X64_CPUID_INTERCEPT   = 0x80010002,
> + HVMSG_X64_EXCEPTION_INTERCEPT   = 0x80010003,
> + HVMSG_X64_APIC_EOI  = 0x80010004,
> + HVMSG_X64_LEGACY_FP_ERROR   = 0x80010005
> +};
> +
>  /* Define synthetic interrupt controller message flags. */
>  union hv_message_flags {
>   __u8 asu8;
> -- 
> 1.8.3.1
> 


Re: [PATCH 01/10] Drivers: hv: vmbus: Move Hyper-V page allocator to arch neutral code

2021-02-21 Thread Boqun Feng
On Wed, Jan 27, 2021 at 12:23:36PM -0800, Michael Kelley wrote:
> The Hyper-V page allocator functions are implemented in an architecture
> neutral way.  Move them into the architecture neutral VMbus module so
> a separate implementation for ARM64 is not needed.
> 
> No functional change.
> 
> Signed-off-by: Michael Kelley 

Reviewed-by: Boqun Feng 

Regards,
Boqun

> ---
>  arch/x86/hyperv/hv_init.c   | 22 --
>  arch/x86/include/asm/mshyperv.h |  5 -
>  drivers/hv/hv.c | 36 
>  include/asm-generic/mshyperv.h  |  4 
>  4 files changed, 40 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index e04d90a..2d1688e 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -44,28 +44,6 @@
>  u32 hv_max_vp_index;
>  EXPORT_SYMBOL_GPL(hv_max_vp_index);
>  
> -void *hv_alloc_hyperv_page(void)
> -{
> - BUILD_BUG_ON(PAGE_SIZE != HV_HYP_PAGE_SIZE);
> -
> - return (void *)__get_free_page(GFP_KERNEL);
> -}
> -EXPORT_SYMBOL_GPL(hv_alloc_hyperv_page);
> -
> -void *hv_alloc_hyperv_zeroed_page(void)
> -{
> -BUILD_BUG_ON(PAGE_SIZE != HV_HYP_PAGE_SIZE);
> -
> -return (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> -}
> -EXPORT_SYMBOL_GPL(hv_alloc_hyperv_zeroed_page);
> -
> -void hv_free_hyperv_page(unsigned long addr)
> -{
> - free_page(addr);
> -}
> -EXPORT_SYMBOL_GPL(hv_free_hyperv_page);
> -
>  static int hv_cpu_init(unsigned int cpu)
>  {
>   u64 msr_vp_index;
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index ffc2899..29d0414 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -224,9 +224,6 @@ static inline struct hv_vp_assist_page 
> *hv_get_vp_assist_page(unsigned int cpu)
>  
>  void __init hyperv_init(void);
>  void hyperv_setup_mmu_ops(void);
> -void *hv_alloc_hyperv_page(void);
> -void *hv_alloc_hyperv_zeroed_page(void);
> -void hv_free_hyperv_page(unsigned long addr);
>  void set_hv_tscchange_cb(void (*cb)(void));
>  void clear_hv_tscchange_cb(void);
>  void hyperv_stop_tsc_emulation(void);
> @@ -255,8 +252,6 @@ static inline void hv_set_msi_entry_from_desc(union 
> hv_msi_entry *msi_entry,
>  #else /* CONFIG_HYPERV */
>  static inline void hyperv_init(void) {}
>  static inline void hyperv_setup_mmu_ops(void) {}
> -static inline void *hv_alloc_hyperv_page(void) { return NULL; }
> -static inline void hv_free_hyperv_page(unsigned long addr) {}
>  static inline void set_hv_tscchange_cb(void (*cb)(void)) {}
>  static inline void clear_hv_tscchange_cb(void) {}
>  static inline void hyperv_stop_tsc_emulation(void) {};
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index f202ac7..cca8d5e 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -37,6 +37,42 @@ int hv_init(void)
>  }
>  
>  /*
> + * Functions for allocating and freeing memory with size and
> + * alignment HV_HYP_PAGE_SIZE. These functions are needed because
> + * the guest page size may not be the same as the Hyper-V page
> + * size. We depend upon kmalloc() aligning power-of-two size
> + * allocations to the allocation size boundary, so that the
> + * allocated memory appears to Hyper-V as a page of the size
> + * it expects.
> + */
> +
> +void *hv_alloc_hyperv_page(void)
> +{
> + BUILD_BUG_ON(PAGE_SIZE <  HV_HYP_PAGE_SIZE);
> +
> + if (PAGE_SIZE == HV_HYP_PAGE_SIZE)
> + return (void *)__get_free_page(GFP_KERNEL);
> + else
> + return kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
> +}
> +
> +void *hv_alloc_hyperv_zeroed_page(void)
> +{
> + if (PAGE_SIZE == HV_HYP_PAGE_SIZE)
> + return (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> + else
> + return kzalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
> +}
> +
> +void hv_free_hyperv_page(unsigned long addr)
> +{
> + if (PAGE_SIZE == HV_HYP_PAGE_SIZE)
> + free_page(addr);
> + else
> + kfree((void *)addr);
> +}
> +
> +/*
>   * hv_post_message - Post a message using the hypervisor message IPC.
>   *
>   * This involves a hypercall.
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index c577996..762d3ac 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -114,6 +114,10 @@ static inline void vmbus_signal_eom(struct hv_message 
> *msg, u32 old_msg_type)
>  /* Sentinel value for an uninitialized entry in hv_vp_index array */
>  #define VP_INVAL U32_MAX
>  
> +void *hv_alloc_hyperv_page(void);
> +void *hv_alloc_hyperv_zeroed_page(void);
> +void hv_free_hyperv_page(unsigned long addr);
> +
>  /**
>   * hv_cpu_number_to_vp_number() - Map CPU to VP.
>   * @cpu_number: CPU number in Linux terms
> -- 
> 1.8.3.1
> 


Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state

2021-01-27 Thread Boqun Feng
On Wed, Jan 27, 2021 at 11:18:31AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 27, 2021 at 03:05:24PM +0800, Boqun Feng wrote:
> > On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > > > Hi Paul,
> > > > 
> > > > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paul...@kernel.org wrote:
> > > > > From: "Paul E. McKenney" 
> > > > > 
> > > > > Historically, a task that has been subjected to RCU priority boosting 
> > > > > is
> > > > > deboosted at rcu_read_unlock() time.  However, with the advent of 
> > > > > deferred
> > > > > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > > > > either bottom halves, interrupts, or preemption disabled, the 
> > > > > deboosting
> > > > > will be delayed for some time.  During this time, a low-priority 
> > > > > process
> > > > > might be incorrectly running at a high real-time priority level.
> > > > > 
> > > > > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > > > > forcing a minimal deferral of quiescent states, at least for kernels
> > > > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > > > when expedited grace periods are pending that might be blocked by the
> > > > > current task.  This commit therefore causes those mechanisms to also 
> > > > > be
> > > > > used in cases where the current task has been or might soon be 
> > > > > subjected
> > > > > to RCU priority boosting.  Note that this applies to all kernels built
> > > > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > > > built with CONFIG_PREEMPT_RT=y.
> > > > > 
> > > > > This approach assumes that kernels build for use with aggressive 
> > > > > real-time
> > > > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > > > > simpler to enable CONFIG_IRQ_WORK=y than to implement a 
> > > > > fast-deboosting
> > > > > scheme that works correctly in its absence.
> > > > > 
> > > > > While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
> > > > > function's local variables.
> > > > > 
> > > > > Cc: Sebastian Andrzej Siewior 
> > > > > Cc: Scott Wood 
> > > > > Cc: Lai Jiangshan 
> > > > > Cc: Thomas Gleixner 
> > > > > Signed-off-by: Paul E. McKenney 
> > > > > ---
> > > > >  kernel/rcu/tree_plugin.h | 26 ++
> > > > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index 8b0feb2..fca31c6 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -660,9 +660,9 @@ static void 
> > > > > rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > >  static void rcu_read_unlock_special(struct task_struct *t)
> > > > >  {
> > > > >   unsigned long flags;
> > > > > + bool irqs_were_disabled;
> > > > >   bool preempt_bh_were_disabled =
> > > > >   !!(preempt_count() & (PREEMPT_MASK | 
> > > > > SOFTIRQ_MASK));
> > > > > - bool irqs_were_disabled;
> > > > >  
> > > > >   /* NMI handlers cannot block and cannot safely manipulate 
> > > > > state. */
> > > > >   if (in_nmi())
> > > > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct 
> > > > > task_struct *t)
> > > > >   local_irq_save(flags);
> > > > >   irqs_were_disabled = irqs_disabled_flags(flags);
> > > > >   if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > > > - bool exp;
> > > > > + bool expboost; // Expedited GP in flight or possible 
> > > > > boosting.
> > > > >   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > > >   struct rcu_node *rnp = rdp->mynode;
> > > > >  
> > > > > -  

Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state

2021-01-26 Thread Boqun Feng
On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > Hi Paul,
> > 
> > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paul...@kernel.org wrote:
> > > From: "Paul E. McKenney" 
> > > 
> > > Historically, a task that has been subjected to RCU priority boosting is
> > > deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> > > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > > either bottom halves, interrupts, or preemption disabled, the deboosting
> > > will be delayed for some time.  During this time, a low-priority process
> > > might be incorrectly running at a high real-time priority level.
> > > 
> > > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > > forcing a minimal deferral of quiescent states, at least for kernels
> > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > when expedited grace periods are pending that might be blocked by the
> > > current task.  This commit therefore causes those mechanisms to also be
> > > used in cases where the current task has been or might soon be subjected
> > > to RCU priority boosting.  Note that this applies to all kernels built
> > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > built with CONFIG_PREEMPT_RT=y.
> > > 
> > > This approach assumes that kernels build for use with aggressive real-time
> > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > > scheme that works correctly in its absence.
> > > 
> > > While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
> > > function's local variables.
> > > 
> > > Cc: Sebastian Andrzej Siewior 
> > > Cc: Scott Wood 
> > > Cc: Lai Jiangshan 
> > > Cc: Thomas Gleixner 
> > > Signed-off-by: Paul E. McKenney 
> > > ---
> > >  kernel/rcu/tree_plugin.h | 26 ++
> > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 8b0feb2..fca31c6 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct 
> > > irq_work *iwp)
> > >  static void rcu_read_unlock_special(struct task_struct *t)
> > >  {
> > >   unsigned long flags;
> > > + bool irqs_were_disabled;
> > >   bool preempt_bh_were_disabled =
> > >   !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > - bool irqs_were_disabled;
> > >  
> > >   /* NMI handlers cannot block and cannot safely manipulate state. */
> > >   if (in_nmi())
> > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct 
> > > task_struct *t)
> > >   local_irq_save(flags);
> > >   irqs_were_disabled = irqs_disabled_flags(flags);
> > >   if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > - bool exp;
> > > + bool expboost; // Expedited GP in flight or possible boosting.
> > >   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > >   struct rcu_node *rnp = rdp->mynode;
> > >  
> > > - exp = (t->rcu_blocked_node &&
> > > -READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > -   (rdp->grpmask & READ_ONCE(rnp->expmask));
> > > + expboost = (t->rcu_blocked_node && 
> > > READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > +(rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > > +(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled 
> > > &&
> > > + t->rcu_blocked_node);
> > 
> > I take it that you check whether possible boosting is in progress via
> > the last expression of "||", ie:
> > 
> > (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > t->rcu_blocked_node)
> > 
> > if so, I don't see the point of using the new "expboost" in the
> > raise_softirq_irqoff() branch, because if in_irq() is false, we only
> > raise softirq if irqs_were_disabled is false (otherwise, we may take

Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state

2021-01-26 Thread Boqun Feng
Hi Paul,

On Tue, Jan 19, 2021 at 08:32:33PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> Historically, a task that has been subjected to RCU priority boosting is
> deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> quiescent states, if the outermost rcu_read_unlock() was invoked with
> either bottom halves, interrupts, or preemption disabled, the deboosting
> will be delayed for some time.  During this time, a low-priority process
> might be incorrectly running at a high real-time priority level.
> 
> Fortunately, rcu_read_unlock_special() already provides mechanisms for
> forcing a minimal deferral of quiescent states, at least for kernels
> built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> when expedited grace periods are pending that might be blocked by the
> current task.  This commit therefore causes those mechanisms to also be
> used in cases where the current task has been or might soon be subjected
> to RCU priority boosting.  Note that this applies to all kernels built
> with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> built with CONFIG_PREEMPT_RT=y.
> 
> This approach assumes that kernels build for use with aggressive real-time
> applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> scheme that works correctly in its absence.
> 
> While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
> function's local variables.
> 
> Cc: Sebastian Andrzej Siewior 
> Cc: Scott Wood 
> Cc: Lai Jiangshan 
> Cc: Thomas Gleixner 
> Signed-off-by: Paul E. McKenney 
> ---
>  kernel/rcu/tree_plugin.h | 26 ++
>  1 file changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 8b0feb2..fca31c6 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct 
> irq_work *iwp)
>  static void rcu_read_unlock_special(struct task_struct *t)
>  {
>   unsigned long flags;
> + bool irqs_were_disabled;
>   bool preempt_bh_were_disabled =
>   !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> - bool irqs_were_disabled;
>  
>   /* NMI handlers cannot block and cannot safely manipulate state. */
>   if (in_nmi())
> @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct 
> *t)
>   local_irq_save(flags);
>   irqs_were_disabled = irqs_disabled_flags(flags);
>   if (preempt_bh_were_disabled || irqs_were_disabled) {
> - bool exp;
> + bool expboost; // Expedited GP in flight or possible boosting.
>   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>   struct rcu_node *rnp = rdp->mynode;
>  
> - exp = (t->rcu_blocked_node &&
> -READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> -   (rdp->grpmask & READ_ONCE(rnp->expmask));
> + expboost = (t->rcu_blocked_node && 
> READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> +(rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> +(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled 
> &&
> + t->rcu_blocked_node);

I take it that you check whether possible boosting is in progress via
the last expression of the "||", i.e.:

(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
t->rcu_blocked_node)

if so, I don't see the point of using the new "expboost" in the
raise_softirq_irqoff() branch, because if in_irq() is false, we only
raise softirq if irqs_were_disabled is false (otherwise, we may take the
risk of doing a wakeup with a pi or rq lock held, IIRC), and the
boosting part of the "expboost" above is only true if irqs_were_disabled
is true, so using expboost makes no difference here.
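
To spell out the Boolean point above (a throwaway sketch rather than kernel
code; the use_softirq term is dropped and the two expedited-GP legs are
folded into a single "exp_gp" flag):

#include <stdbool.h>

/*
 * "rcu_boost_enabled" stands for IS_ENABLED(CONFIG_RCU_BOOST) and
 * "blocked" for t->rcu_blocked_node != NULL.
 */
static bool takes_softirq_branch(bool in_irq, bool irqs_were_disabled,
				 bool rcu_boost_enabled, bool exp_gp,
				 bool blocked)
{
	bool expboost = exp_gp ||
			(rcu_boost_enabled && irqs_were_disabled && blocked);

	/*
	 * With in_irq == false, taking the branch requires
	 * !irqs_were_disabled, while the boost leg of expboost requires
	 * irqs_were_disabled, so the boost leg alone can never make the
	 * branch be taken.
	 */
	return in_irq || (expboost && !irqs_were_disabled);
}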

>   // Need to defer quiescent state until everything is enabled.
> - if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
> + if (use_softirq && (in_irq() || (expboost && 
> !irqs_were_disabled))) {
>   // Using softirq, safe to awaken, and either the
> - // wakeup is free or there is an expedited GP.
> + // wakeup is free or there is either an expedited
> + // GP in flight or a potential need to deboost.

and this comment will be incorrect: we won't enter here solely because
there is a potential need to deboost.

That said, why does the boosting condition have "irqs_were_disabled" in it?
What if a task gets boosted because of RCU priority boosting, and exits the
RCU read-side critical section with irqs enabled while there is no expedited
GP in flight; will the task get deboosted quickly enough?

Maybe I'm missing something subtle?

Regards,
Boqun

>   raise_softirq_irqoff(RCU_SOFTIRQ);
> 

[tip: locking/core] locking/lockdep: Add a skip() function to __bfs()

2021-01-14 Thread tip-bot2 for Boqun Feng
The following commit has been merged into the locking/core branch of tip:

Commit-ID: bc2dd71b283665f0a409d5b6fc603d5a6fdc219e
Gitweb:
https://git.kernel.org/tip/bc2dd71b283665f0a409d5b6fc603d5a6fdc219e
Author:Boqun Feng 
AuthorDate:Thu, 10 Dec 2020 11:02:40 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 14 Jan 2021 11:20:17 +01:00

locking/lockdep: Add a skip() function to __bfs()

Some __bfs() walks will have additional iteration constraints (beyond
the path being strong). Provide an additional function to allow
terminating graph walks.

Signed-off-by: Boqun Feng 
Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/locking/lockdep.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b061e29..f50f026 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1672,6 +1672,7 @@ static inline struct lock_list *__bfs_next(struct 
lock_list *lock, int offset)
 static enum bfs_result __bfs(struct lock_list *source_entry,
 void *data,
 bool (*match)(struct lock_list *entry, void *data),
+bool (*skip)(struct lock_list *entry, void *data),
 struct lock_list **target_entry,
 int offset)
 {
@@ -1732,7 +1733,12 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
/*
 * Step 3: we haven't visited this and there is a strong
 * dependency path to this, so check with @match.
+* If @skip is provide and returns true, we skip this
+* lock (and any path this lock is in).
 */
+   if (skip && skip(lock, data))
+   continue;
+
if (match(lock, data)) {
*target_entry = lock;
return BFS_RMATCH;
@@ -1775,9 +1781,10 @@ static inline enum bfs_result
 __bfs_forwards(struct lock_list *src_entry,
   void *data,
   bool (*match)(struct lock_list *entry, void *data),
+  bool (*skip)(struct lock_list *entry, void *data),
   struct lock_list **target_entry)
 {
-   return __bfs(src_entry, data, match, target_entry,
+   return __bfs(src_entry, data, match, skip, target_entry,
 offsetof(struct lock_class, locks_after));
 
 }
@@ -1786,9 +1793,10 @@ static inline enum bfs_result
 __bfs_backwards(struct lock_list *src_entry,
void *data,
bool (*match)(struct lock_list *entry, void *data),
+  bool (*skip)(struct lock_list *entry, void *data),
struct lock_list **target_entry)
 {
-   return __bfs(src_entry, data, match, target_entry,
+   return __bfs(src_entry, data, match, skip, target_entry,
 offsetof(struct lock_class, locks_before));
 
 }
@@ -2019,7 +2027,7 @@ static unsigned long __lockdep_count_forward_deps(struct 
lock_list *this)
unsigned long  count = 0;
struct lock_list *target_entry;
 
-   __bfs_forwards(this, (void *), noop_count, _entry);
+   __bfs_forwards(this, (void *), noop_count, NULL, _entry);
 
return count;
 }
@@ -2044,7 +2052,7 @@ static unsigned long __lockdep_count_backward_deps(struct 
lock_list *this)
unsigned long  count = 0;
struct lock_list *target_entry;
 
-   __bfs_backwards(this, (void *), noop_count, _entry);
+   __bfs_backwards(this, (void *), noop_count, NULL, _entry);
 
return count;
 }
@@ -2072,11 +2080,12 @@ unsigned long lockdep_count_backward_deps(struct 
lock_class *class)
 static noinline enum bfs_result
 check_path(struct held_lock *target, struct lock_list *src_entry,
   bool (*match)(struct lock_list *entry, void *data),
+  bool (*skip)(struct lock_list *entry, void *data),
   struct lock_list **target_entry)
 {
enum bfs_result ret;
 
-   ret = __bfs_forwards(src_entry, target, match, target_entry);
+   ret = __bfs_forwards(src_entry, target, match, skip, target_entry);
 
if (unlikely(bfs_error(ret)))
print_bfs_bug(ret);
@@ -2103,7 +2112,7 @@ check_noncircular(struct held_lock *src, struct held_lock 
*target,
 
debug_atomic_inc(nr_cyclic_checks);
 
-   ret = check_path(target, _entry, hlock_conflict, _entry);
+   ret = check_path(target, _entry, hlock_conflict, NULL, 
_entry);
 
if (unlikely(ret == BFS_RMATCH)) {
if (!*trace) {
@@ -2152,7 +2161,7 @@ check_redundant(struct held_lock *src, struct held_lock 
*target)
 
debug_atomic_inc(nr_redundant_checks);
 
-   ret = check_path(target, _entry, hlock_equal, _entry);
+   ret = check_path(target, _entry, hlock_equal, NULL, _entry);
 
if (ret == BFS_RMATCH)
debug_at

[tip: locking/core] lockdep/selftest: Add wait context selftests

2021-01-14 Thread tip-bot2 for Boqun Feng
The following commit has been merged into the locking/core branch of tip:

Commit-ID: 9271a40d2a1429113160ccc4c16150921600bcc1
Gitweb:
https://git.kernel.org/tip/9271a40d2a1429113160ccc4c16150921600bcc1
Author:Boqun Feng 
AuthorDate:Tue, 08 Dec 2020 18:31:12 +08:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 14 Jan 2021 11:20:16 +01:00

lockdep/selftest: Add wait context selftests

These tests are added for two purposes:

*   Test the implementation of wait context checks and related
annotations.

*   Semi-document the rules for wait context nesting when
PROVE_RAW_LOCK_NESTING=y.

The test cases are only available for PROVE_RAW_LOCK_NESTING=y, as wait
context checking makes more sense for that configuration.

Signed-off-by: Boqun Feng 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201208103112.2838119-5-boqun.f...@gmail.com
---
 lib/locking-selftest.c | 232 -
 1 file changed, 232 insertions(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index 9959ea2..23376ee 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -64,6 +64,9 @@ static DEFINE_SPINLOCK(lock_B);
 static DEFINE_SPINLOCK(lock_C);
 static DEFINE_SPINLOCK(lock_D);
 
+static DEFINE_RAW_SPINLOCK(raw_lock_A);
+static DEFINE_RAW_SPINLOCK(raw_lock_B);
+
 static DEFINE_RWLOCK(rwlock_A);
 static DEFINE_RWLOCK(rwlock_B);
 static DEFINE_RWLOCK(rwlock_C);
@@ -1306,6 +1309,7 @@ 
GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion3_soft_wlock)
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define I_SPINLOCK(x) lockdep_reset_lock(_##x.dep_map)
+# define I_RAW_SPINLOCK(x) lockdep_reset_lock(_lock_##x.dep_map)
 # define I_RWLOCK(x)   lockdep_reset_lock(_##x.dep_map)
 # define I_MUTEX(x)lockdep_reset_lock(_##x.dep_map)
 # define I_RWSEM(x)lockdep_reset_lock(_##x.dep_map)
@@ -1315,6 +1319,7 @@ 
GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion3_soft_wlock)
 #endif
 #else
 # define I_SPINLOCK(x)
+# define I_RAW_SPINLOCK(x)
 # define I_RWLOCK(x)
 # define I_MUTEX(x)
 # define I_RWSEM(x)
@@ -1358,9 +1363,12 @@ static void reset_locks(void)
I1(A); I1(B); I1(C); I1(D);
I1(X1); I1(X2); I1(Y1); I1(Y2); I1(Z1); I1(Z2);
I_WW(t); I_WW(t2); I_WW(o.base); I_WW(o2.base); I_WW(o3.base);
+   I_RAW_SPINLOCK(A); I_RAW_SPINLOCK(B);
lockdep_reset();
I2(A); I2(B); I2(C); I2(D);
init_shared_classes();
+   raw_spin_lock_init(_lock_A);
+   raw_spin_lock_init(_lock_B);
 
ww_mutex_init(, _lockdep); ww_mutex_init(, _lockdep); 
ww_mutex_init(, _lockdep);
memset(, 0, sizeof(t)); memset(, 0, sizeof(t2));
@@ -2419,6 +2427,226 @@ static void fs_reclaim_tests(void)
pr_cont("\n");
 }
 
+#define __guard(cleanup) __maybe_unused __attribute__((__cleanup__(cleanup)))
+
+static void hardirq_exit(int *_)
+{
+   HARDIRQ_EXIT();
+}
+
+#define HARDIRQ_CONTEXT(name, ...) \
+   int hardirq_guard_##name __guard(hardirq_exit); \
+   HARDIRQ_ENTER();
+
+#define NOTTHREADED_HARDIRQ_CONTEXT(name, ...) \
+   int notthreaded_hardirq_guard_##name __guard(hardirq_exit); \
+   local_irq_disable();\
+   __irq_enter();  \
+   WARN_ON(!in_irq());
+
+static void softirq_exit(int *_)
+{
+   SOFTIRQ_EXIT();
+}
+
+#define SOFTIRQ_CONTEXT(name, ...) \
+   int softirq_guard_##name __guard(softirq_exit); \
+   SOFTIRQ_ENTER();
+
+static void rcu_exit(int *_)
+{
+   rcu_read_unlock();
+}
+
+#define RCU_CONTEXT(name, ...) \
+   int rcu_guard_##name __guard(rcu_exit); \
+   rcu_read_lock();
+
+static void rcu_bh_exit(int *_)
+{
+   rcu_read_unlock_bh();
+}
+
+#define RCU_BH_CONTEXT(name, ...)  \
+   int rcu_bh_guard_##name __guard(rcu_bh_exit);   \
+   rcu_read_lock_bh();
+
+static void rcu_sched_exit(int *_)
+{
+   rcu_read_unlock_sched();
+}
+
+#define RCU_SCHED_CONTEXT(name, ...)   \
+   int rcu_sched_guard_##name __guard(rcu_sched_exit); \
+   rcu_read_lock_sched();
+
+static void rcu_callback_exit(int *_)
+{
+   rcu_lock_release(_callback_map);
+}
+
+#define RCU_CALLBACK_CONTEXT(name, ...)
\
+   int rcu_callback_guard_##name __guard(rcu_callback_exit);   \
+   rcu_lock_acquire(_callback_map);
+
+
+static void raw_spinlock_exit(raw_spinlock_t **lock)
+{
+   raw_spin_unlock(*lock);
+}
+
+#define RAW_SPINLOCK_CONTEXT(name, lock)   
\
+   raw_spinlock_t *raw_spinlock_guard_##name __guard(raw_spinlock_exit) = 
&(lock); \
+   raw_spin_lock(&(lock));
+
+static void spinlock_exit

[tip: locking/core] locking/lockdep: Exclude local_lock_t from IRQ inversions

2021-01-14 Thread tip-bot2 for Boqun Feng
The following commit has been merged into the locking/core branch of tip:

Commit-ID: 5f2962401c6e195222f320d12b3a55377b2d4653
Gitweb:
https://git.kernel.org/tip/5f2962401c6e195222f320d12b3a55377b2d4653
Author:Boqun Feng 
AuthorDate:Thu, 10 Dec 2020 11:15:00 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 14 Jan 2021 11:20:17 +01:00

locking/lockdep: Exclude local_lock_t from IRQ inversions

The purpose of local_lock_t is to abstract: preempt_disable() /
local_bh_disable() / local_irq_disable(). These are the traditional
means of gaining access to per-cpu data, but are fundamentally
non-preemptible.

local_lock_t provides a per-cpu lock, that on !PREEMPT_RT reduces to
no-ops, just like regular spinlocks do on UP.

This gives rise to:

CPU0CPU1

local_lock(B)   spin_lock_irq(A)

  spin_lock(A)  local_lock(B)

Where lockdep then figures things will lock up, which would be true if
B were any other kind of lock. However, this is a false positive: no
such deadlock actually exists.

For !RT the above local_lock(B) is preempt_disable(), and there's
obviously no deadlock; alternatively, CPU0's B != CPU1's B.

For RT the argument is that since local_lock() nests inside
spin_lock(), it cannot be used in hardirq context, and therefore the
CPU0 scenario cannot in fact happen. Even though B is a real lock, it
is a preemptible lock and any threaded-irq would simply schedule out and
let the preempted task (which holds B) continue such that the task on
CPU1 can make progress, after which the threaded-irq resumes and can
finish.

This means that we can never form an IRQ inversion on a local_lock
dependency, so terminate the graph walk when looking for IRQ
inversions when we encounter one.

One consequence is that (for LOCKDEP_SMALL) when we look for redundant
dependencies, A -> B is not redundant in the presence of A -> L -> B.

Signed-off-by: Boqun Feng 
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/locking/lockdep.c | 57 ---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index f2ae8a6..ad9afd8 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2200,6 +2200,44 @@ static inline bool usage_match(struct lock_list *entry, 
void *mask)
return !!((entry->class->usage_mask & LOCKF_IRQ) & *(unsigned 
long *)mask);
 }
 
+static inline bool usage_skip(struct lock_list *entry, void *mask)
+{
+   /*
+* Skip local_lock() for irq inversion detection.
+*
+* For !RT, local_lock() is not a real lock, so it won't carry any
+* dependency.
+*
+* For RT, an irq inversion happens when we have lock A and B, and on
+* some CPU we can have:
+*
+*  lock(A);
+*  
+*lock(B);
+*
+* where lock(B) cannot sleep, and we have a dependency B -> ... -> A.
+*
+* Now we prove local_lock() cannot exist in that dependency. First we
+* have the observation for any lock chain L1 -> ... -> Ln, for any
+* 1 <= i <= n, Li.inner_wait_type <= L1.inner_wait_type, otherwise
+* wait context check will complain. And since B is not a sleep lock,
+* therefore B.inner_wait_type >= 2, and since the inner_wait_type of
+* local_lock() is 3, which is greater than 2, therefore there is no
+* way the local_lock() exists in the dependency B -> ... -> A.
+*
+* As a result, we will skip local_lock(), when we search for irq
+* inversion bugs.
+*/
+   if (entry->class->lock_type == LD_LOCK_PERCPU) {
+   if (DEBUG_LOCKS_WARN_ON(entry->class->wait_type_inner < 
LD_WAIT_CONFIG))
+   return false;
+
+   return true;
+   }
+
+   return false;
+}
+
 /*
  * Find a node in the forwards-direction dependency sub-graph starting
  * at @root->class that matches @bit.
@@ -2215,7 +2253,7 @@ find_usage_forwards(struct lock_list *root, unsigned long 
usage_mask,
 
debug_atomic_inc(nr_find_usage_forwards_checks);
 
-   result = __bfs_forwards(root, _mask, usage_match, NULL, 
target_entry);
+   result = __bfs_forwards(root, _mask, usage_match, usage_skip, 
target_entry);
 
return result;
 }
@@ -2232,7 +2270,7 @@ find_usage_backwards(struct lock_list *root, unsigned 
long usage_mask,
 
debug_atomic_inc(nr_find_usage_backwards_checks);
 
-   result = __bfs_backwards(root, _mask, usage_match, NULL, 
target_entry);
+   result = __bfs_backwards(root, _mask, usage_match, usage_skip, 
target_entry);
 
return result;
 }
@@ -2597,7 +2635,7 @@ static int check_irq_usage(struct task_struct *curr, 
struct held_lock *prev,
 */
bfs_init_rootb(, prev)

Re: [PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features

2021-01-07 Thread Boqun Feng
On Wed, Jan 06, 2021 at 08:49:32PM +, Dexuan Cui wrote:
> > From: Michael Kelley 
> > Sent: Wednesday, January 6, 2021 9:38 AM
> > From: Dexuan Cui 
> > Sent: Tuesday, December 22, 2020 4:12 PM
> > >
> > > When a Linux VM runs on Hyper-V, if the host toolstack doesn't support
> > > hibernation for the VM (this happens on old Hyper-V hosts like Windows
> > > Server 2016, or new Hyper-V hosts if the admin or user doesn't declare
> > > the hibernation intent for the VM), the VM is discouraged from trying
> > > hibernation (because the host doesn't guarantee that the VM's virtual
> > > hardware configuration will remain exactly the same across hibernation),
> > > i.e. the VM should not try to set up the swap partition/file for
> > > hibernation, etc.
> > >
> > > x86 Hyper-V uses the presence of the virtual ACPI S4 state as the
> > > indication of the host toolstack support for a VM. Currently there is
> > > no easy and reliable way for the userspace to detect the presence of
> > > the state (see ...).  Add
> > > /sys/bus/vmbus/supported_features for this purpose.
> >
> > I'm OK with surfacing the hibernation capability via an entry in
> > /sys/bus/vmbus.  Correct me if I'm wrong, but I think the concept
> > being surfaced is not "ACPI S4 state" precisely, but slightly more
> > generally whether hibernation is supported for the VM.  While
> > those two concepts may be 1:1 for the moment, there might be
> > future configurations where "hibernation is supported" depends
> > on other factors as well.
> 
> For x86, I believe the virtual ACPI S4 state exists only when the
> admin/user declares the intent of "enable hibernation for the VM" via
> some PowwerShell/WMI command. On Azure, if a VM size is not suitable
> for hibernation (e.g. an existing VM has an ephemeral local disk),
> the toolstack on the host should not enable the ACPI S4 state for the
> VM. That's why we implemented hv_is_hibernation_supported() for x86 by
> checking the ACPI S4 state, and we have used the function
> hv_is_hibernation_supported() in hv_utils and hv_balloon for quite a
> while.
> 
> For ARM, IIRC there is no concept of ACPI S4 state, so currently
> hv_is_hibernation_supported() is actually not implemented. Not sure

That's because the core support for ARM64 Hyper-V is not merged yet. In
Michael's core patchset, hv_is_hibernation_supported() is implemented as
always returning false, and there is more work (beyond Michael's core
patchset) to make hibernation work on ARM64 Hyper-V guests.
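
For reference, the two behaviours under discussion boil down to roughly
the following (a sketch of the behaviour only, assuming the usual ACPI
helper; the real code lives in separate per-arch files):

	bool hv_is_hibernation_supported(void)
	{
	#ifdef CONFIG_X86
		/* x86: report support only when the virtual ACPI S4 state
		 * has been exposed by the host toolstack.
		 */
		return acpi_sleep_state_supported(ACPI_STATE_S4);
	#else
		/* ARM64 core patchset: hibernation not supported yet. */
		return false;
	#endif
	}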

Regards,
Boqun

> why hv_utils and hv_balloon can build successfully... :-) Probably
> Boqun can help to take a look.
> 
> >
> > The guidance for things in /sys is that they generally should
> > be single valued (see Documentation/filesystems/sysfs.rst).  So my
> > recommendation is to create a "hibernation" entry that has a value
> > of 0 or 1.
> >
> > Michael
> 
> Got it. Then let's use /sys/bus/vmbus/hibernation.
> 
> Will post v3.
> 
> Thanks,
> -- Dexuan
> 


Re: [PATCH 3/3] powerpc: rewrite atomics to use ARCH_ATOMIC

2020-12-22 Thread Boqun Feng
On Tue, Dec 22, 2020 at 01:52:50PM +1000, Nicholas Piggin wrote:
> Excerpts from Boqun Feng's message of November 14, 2020 1:30 am:
> > Hi Nicholas,
> > 
> > On Wed, Nov 11, 2020 at 09:07:23PM +1000, Nicholas Piggin wrote:
> >> All the cool kids are doing it.
> >> 
> >> Signed-off-by: Nicholas Piggin 
> >> ---
> >>  arch/powerpc/include/asm/atomic.h  | 681 ++---
> >>  arch/powerpc/include/asm/cmpxchg.h |  62 +--
> >>  2 files changed, 248 insertions(+), 495 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/include/asm/atomic.h 
> >> b/arch/powerpc/include/asm/atomic.h
> >> index 8a55eb8cc97b..899aa2403ba7 100644
> >> --- a/arch/powerpc/include/asm/atomic.h
> >> +++ b/arch/powerpc/include/asm/atomic.h
> >> @@ -11,185 +11,285 @@
> >>  #include 
> >>  #include 
> >>  
> >> +#define ARCH_ATOMIC
> >> +
> >> +#ifndef CONFIG_64BIT
> >> +#include 
> >> +#endif
> >> +
> >>  /*
> >>   * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
> >>   * a "bne-" instruction at the end, so an isync is enough as a acquire 
> >> barrier
> >>   * on the platform without lwsync.
> >>   */
> >>  #define __atomic_acquire_fence()  \
> >> -  __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory")
> >> +  asm volatile(PPC_ACQUIRE_BARRIER "" : : : "memory")
> >>  
> >>  #define __atomic_release_fence()  \
> >> -  __asm__ __volatile__(PPC_RELEASE_BARRIER "" : : : "memory")
> >> +  asm volatile(PPC_RELEASE_BARRIER "" : : : "memory")
> >>  
> >> -static __inline__ int atomic_read(const atomic_t *v)
> >> -{
> >> -  int t;
> >> +#define __atomic_pre_full_fence   smp_mb
> >>  
> >> -  __asm__ __volatile__("lwz%U1%X1 %0,%1" : "=r"(t) : "m"(v->counter));
> >> +#define __atomic_post_full_fence  smp_mb
> >>  
> 
> Thanks for the review.
> 
> > Do you need to define __atomic_{pre,post}_full_fence for PPC? IIRC, they
> > are default smp_mb__{before,atomic}_atomic(), so are smp_mb() defautly
> > on PPC.
> 
> Okay I didn't realise that's not required.
> 
> >> -  return t;
> >> +#define arch_atomic_read(v)   
> >> __READ_ONCE((v)->counter)
> >> +#define arch_atomic_set(v, i) 
> >> __WRITE_ONCE(((v)->counter), (i))
> >> +#ifdef CONFIG_64BIT
> >> +#define ATOMIC64_INIT(i)  { (i) }
> >> +#define arch_atomic64_read(v) 
> >> __READ_ONCE((v)->counter)
> >> +#define arch_atomic64_set(v, i)   
> >> __WRITE_ONCE(((v)->counter), (i))
> >> +#endif
> >> +
> > [...]
> >>  
> >> +#define ATOMIC_FETCH_OP_UNLESS_RELAXED(name, type, dtype, width, asm_op) \
> >> +static inline int arch_##name##_relaxed(type *v, dtype a, dtype u)
> >> \
> > 
> > I don't think we have atomic_fetch_*_unless_relaxed() at atomic APIs,
> > ditto for:
> > 
> > atomic_fetch_add_unless_relaxed()
> > atomic_inc_not_zero_relaxed()
> > atomic_dec_if_positive_relaxed()
> > 
> > , and we don't have the _acquire() and _release() variants for them
> > either, and if you don't define their fully-ordered version (e.g.
> > atomic_inc_not_zero()), atomic-arch-fallback.h will use read and cmpxchg
> > to implement them, and I think not what we want.
> 
> Okay. How can those be added? The atoimc generation is pretty 
> complicated.
> 

Yeah, I know ;-) I think you can just implement and define the
fully-ordered versions:

arch_atomic_fetch_*_unless()
arch_atomic_inc_not_zero()
arch_atomic_dec_if_positive()

, that should work.

Rules of atomic generation, IIRC:

1.  If you define _relaxed, _acquire, _release or fully-ordered
version, atomic generation will use that version

2.  If you define _relaxed, atomic generation will use that and
barriers to generate _acquire, _release and fully-ordered
versions, unless they are already defined (as Rule #1 says)

3.  If you don't define _relaxed, but define the fully-ordered
version, atomic generation will use the fully-ordered version
as the _relaxed variant and generate the rest using Rule #2.
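
For example, when only the _relaxed variant is provided, the generated
fallbacks look roughly like this (a sketch of the kind of code the
atomic fallback generation produces, not the exact generated code):

	#ifndef arch_atomic_fetch_add_acquire
	static __always_inline int
	arch_atomic_fetch_add_acquire(int i, atomic_t *v)
	{
		int ret = arch_atomic_fetch_add_relaxed(i, v);
		__atomic_acquire_fence();
		return ret;
	}
	#endif

	#ifndef arch_atomic_fetch_add
	static __always_inline int
	arch_atomic_fetch_add(int i, atomic_t *v)
	{
		int ret;

		__atomic_pre_full_fence();
		ret = arch_atomic_fetch_add_relaxed(i, v);
		__atomic_post_full_fence();
		return ret;
	}
	#endif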

> > [...]
> >>  
> >>  #endif /* __KERNEL__ */
> >>  #endif /* _ASM_POWERPC_ATOMIC_H_ */
> >> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> >> b/arch/powerpc/include/asm/cmpxchg.h
> >> index cf091c4c22e5..181f7e8b3281 100644
> >> --- a/arch/powerpc/include/asm/cmpxchg.h
> >> +++ b/arch/powerpc/include/asm/cmpxchg.h
> >> @@ -192,7 +192,7 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned 
> >> int size)
> >>(unsigned long)_x_, sizeof(*(ptr)));
> >>  \
> >>})
> >>  
> >> -#define xchg_relaxed(ptr, x)  
> >> \
> >> +#define arch_xchg_relaxed(ptr, x) \
> >>  ({
> >> \
> >>__typeof__(*(ptr)) _x_ = (x);   \
> >>(__typeof__(*(ptr))) 

Re: WARNING: suspicious RCU usage in modeset_lock

2020-12-21 Thread Boqun Feng
Hi Dmitry,

On Fri, Dec 18, 2020 at 12:27:04PM +0100, Dmitry Vyukov wrote:
> On Fri, Dec 18, 2020 at 2:30 AM Boqun Feng  wrote:
> >
> > On Thu, Dec 17, 2020 at 07:21:18AM -0800, Paul E. McKenney wrote:
> > > On Thu, Dec 17, 2020 at 11:03:20AM +0100, Daniel Vetter wrote:
> > > > On Wed, Dec 16, 2020 at 5:16 PM Paul E. McKenney  
> > > > wrote:
> > > > >
> > > > > On Wed, Dec 16, 2020 at 10:52:06AM +0100, Daniel Vetter wrote:
> > > > > > On Wed, Dec 16, 2020 at 2:14 AM syzbot
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > syzbot found the following issue on:
> > > > > > >
> > > > > > > HEAD commit:94801e5c Merge tag 'pinctrl-v5.10-3' of 
> > > > > > > git://git.kernel.o..
> > > > > > > git tree:   upstream
> > > > > > > console output: 
> > > > > > > https://syzkaller.appspot.com/x/log.txt?x=130558c550
> > > > > > > kernel config:  
> > > > > > > https://syzkaller.appspot.com/x/.config?x=ee8a1012a5314210
> > > > > > > dashboard link: 
> > > > > > > https://syzkaller.appspot.com/bug?extid=972b924c988834e868b2
> > > > > > > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > > > > > > userspace arch: i386
> > > > > > >
> > > > > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > > > > >
> > > > > > > IMPORTANT: if you fix the issue, please add the following tag to 
> > > > > > > the commit:
> > > > > > > Reported-by: syzbot+972b924c988834e86...@syzkaller.appspotmail.com
> > > > > > >
> > > > > > > =
> > > > > > > WARNING: suspicious RCU usage
> > > > > > > 5.10.0-rc7-syzkaller #0 Not tainted
> > > > > > > -
> > > > > > > kernel/sched/core.c:7270 Illegal context switch in RCU-sched 
> > > > > > > read-side critical section!
> > > > > > >
> > > > > > > other info that might help us debug this:
> > > > > > >
> > > > > > >
> > > > > > > rcu_scheduler_active = 2, debug_locks = 0
> > > > > > > 7 locks held by syz-executor.1/9232:
> > > > > > >  #0: 8b328c60 (console_lock){+.+.}-{0:0}, at: 
> > > > > > > do_fb_ioctl+0x2e4/0x690 drivers/video/fbdev/core/fbmem.c:1106
> > > > > > >  #1: 888041bd4078 (_info->lock){+.+.}-{3:3}, at: 
> > > > > > > lock_fb_info include/linux/fb.h:636 [inline]
> > > > > > >  #1: 888041bd4078 (_info->lock){+.+.}-{3:3}, at: 
> > > > > > > do_fb_ioctl+0x2ee/0x690 drivers/video/fbdev/core/fbmem.c:1107
> > > > > > >  #2: 888041adca78 (>lock){+.+.}-{3:3}, at: 
> > > > > > > drm_fb_helper_pan_display+0xce/0x970 
> > > > > > > drivers/gpu/drm/drm_fb_helper.c:1448
> > > > > > >  #3: 8880159f01b8 (>master_mutex){+.+.}-{3:3}, at: 
> > > > > > > drm_master_internal_acquire+0x1d/0x70 
> > > > > > > drivers/gpu/drm/drm_auth.c:407
> > > > > > >  #4: 888041adc898 (>modeset_mutex){+.+.}-{3:3}, at: 
> > > > > > > drm_client_modeset_commit_locked+0x44/0x580 
> > > > > > > drivers/gpu/drm/drm_client_modeset.c:1143
> > > > > > >  #5: c90001c07730 (crtc_ww_class_acquire){+.+.}-{0:0}, at: 
> > > > > > > drm_client_modeset_commit_atomic+0xb7/0x7c0 
> > > > > > > drivers/gpu/drm/drm_client_modeset.c:981
> > > > > > >  #6: 888015986108 (crtc_ww_class_mutex){+.+.}-{3:3}, at: 
> > > > > > > ww_mutex_lock_slow include/linux/ww_mutex.h:287 [inline]
> > > > > > >  #6: 888015986108 (crtc_ww_class_mutex){+.+.}-{3:3}, at: 
> > > > > > > modeset_lock+0x31c/0x650 drivers/gpu/drm/drm_modeset_lock.c:260
> > > > > >
> > > > > > Given that we managed to take all these locks without upsetting 
> > > > > > anyone
> > > > > > the rcu section is very deep down. And looking at the backtrace 
> &

Re: WARNING: suspicious RCU usage in modeset_lock

2020-12-17 Thread Boqun Feng
On Thu, Dec 17, 2020 at 07:21:18AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 17, 2020 at 11:03:20AM +0100, Daniel Vetter wrote:
> > On Wed, Dec 16, 2020 at 5:16 PM Paul E. McKenney  wrote:
> > >
> > > On Wed, Dec 16, 2020 at 10:52:06AM +0100, Daniel Vetter wrote:
> > > > On Wed, Dec 16, 2020 at 2:14 AM syzbot
> > > >  wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > syzbot found the following issue on:
> > > > >
> > > > > HEAD commit:94801e5c Merge tag 'pinctrl-v5.10-3' of 
> > > > > git://git.kernel.o..
> > > > > git tree:   upstream
> > > > > console output: 
> > > > > https://syzkaller.appspot.com/x/log.txt?x=130558c550
> > > > > kernel config:  
> > > > > https://syzkaller.appspot.com/x/.config?x=ee8a1012a5314210
> > > > > dashboard link: 
> > > > > https://syzkaller.appspot.com/bug?extid=972b924c988834e868b2
> > > > > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > > > > userspace arch: i386
> > > > >
> > > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > > >
> > > > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > > > commit:
> > > > > Reported-by: syzbot+972b924c988834e86...@syzkaller.appspotmail.com
> > > > >
> > > > > =
> > > > > WARNING: suspicious RCU usage
> > > > > 5.10.0-rc7-syzkaller #0 Not tainted
> > > > > -
> > > > > kernel/sched/core.c:7270 Illegal context switch in RCU-sched 
> > > > > read-side critical section!
> > > > >
> > > > > other info that might help us debug this:
> > > > >
> > > > >
> > > > > rcu_scheduler_active = 2, debug_locks = 0
> > > > > 7 locks held by syz-executor.1/9232:
> > > > >  #0: 8b328c60 (console_lock){+.+.}-{0:0}, at: 
> > > > > do_fb_ioctl+0x2e4/0x690 drivers/video/fbdev/core/fbmem.c:1106
> > > > >  #1: 888041bd4078 (_info->lock){+.+.}-{3:3}, at: lock_fb_info 
> > > > > include/linux/fb.h:636 [inline]
> > > > >  #1: 888041bd4078 (_info->lock){+.+.}-{3:3}, at: 
> > > > > do_fb_ioctl+0x2ee/0x690 drivers/video/fbdev/core/fbmem.c:1107
> > > > >  #2: 888041adca78 (>lock){+.+.}-{3:3}, at: 
> > > > > drm_fb_helper_pan_display+0xce/0x970 
> > > > > drivers/gpu/drm/drm_fb_helper.c:1448
> > > > >  #3: 8880159f01b8 (>master_mutex){+.+.}-{3:3}, at: 
> > > > > drm_master_internal_acquire+0x1d/0x70 drivers/gpu/drm/drm_auth.c:407
> > > > >  #4: 888041adc898 (>modeset_mutex){+.+.}-{3:3}, at: 
> > > > > drm_client_modeset_commit_locked+0x44/0x580 
> > > > > drivers/gpu/drm/drm_client_modeset.c:1143
> > > > >  #5: c90001c07730 (crtc_ww_class_acquire){+.+.}-{0:0}, at: 
> > > > > drm_client_modeset_commit_atomic+0xb7/0x7c0 
> > > > > drivers/gpu/drm/drm_client_modeset.c:981
> > > > >  #6: 888015986108 (crtc_ww_class_mutex){+.+.}-{3:3}, at: 
> > > > > ww_mutex_lock_slow include/linux/ww_mutex.h:287 [inline]
> > > > >  #6: 888015986108 (crtc_ww_class_mutex){+.+.}-{3:3}, at: 
> > > > > modeset_lock+0x31c/0x650 drivers/gpu/drm/drm_modeset_lock.c:260
> > > >
> > > > Given that we managed to take all these locks without upsetting anyone
> > > > the rcu section is very deep down. And looking at the backtrace below
> > > > I just couldn't find anything.
> > > >
> > > > Best I can think of is that an interrupt of some sort leaked an rcu
> > > > section, and we got shot here. But I'd assume the rcu debugging would
> > > > catch this? Backtrace of the start of that rcu read side section would
> > > > be really useful here, but I'm not seeing that in the logs. There's
> > > > more stuff there, but it's just the usual "everything falls apart"
> > > > stuff of little value to understanding how we got there.
> > >
> > > In my experience, lockdep will indeed complain if an interrupt handler
> > > returns while in an RCU read-side critical section.
> > >
> > > > Adding some rcu people for more insights on what could have gone wrong 
> > > > here.
> > > > -Daniel
> > > >
> > > > > stack backtrace:
> > > > > CPU: 1 PID: 9232 Comm: syz-executor.1 Not tainted 
> > > > > 5.10.0-rc7-syzkaller #0
> > > > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > > > > rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > > > > Call Trace:
> > > > >  __dump_stack lib/dump_stack.c:77 [inline]
> > > > >  dump_stack+0x107/0x163 lib/dump_stack.c:118
> > > > >  ___might_sleep+0x25d/0x2b0 kernel/sched/core.c:7270
> > > > >  __mutex_lock_common kernel/locking/mutex.c:935 [inline]
> > > > >  __ww_mutex_lock.constprop.0+0xa9/0x2cc0 kernel/locking/mutex.c:
> > > > >  ww_mutex_lock+0x3d/0x170 kernel/locking/mutex.c:1190
> > >
> > > Acquiring a mutex while under the influence of rcu_read_lock() will
> > > definitely get you this lockdep complaint, and rightfully so.
> > >
> > > If you need to acquire a mutex with RCU-like protection, one approach
> > > is to use SRCU.  But usually this indicates (as you suspected) that
> > > someone forgot to invoke rcu_read_unlock().
> > >
> > > One way to locate this is to enlist the aid of 

Re: [PATCH 00/19] rcu/nocb: De-offload and re-offload support v4

2020-12-09 Thread Boqun Feng
On Wed, Dec 09, 2020 at 05:21:58PM -0800, Paul E. McKenney wrote:
> On Tue, Dec 08, 2020 at 01:51:04PM +0100, Frederic Weisbecker wrote:
> > Hi Boqun Feng,
> > 
> > On Tue, Dec 08, 2020 at 10:41:31AM +0800, Boqun Feng wrote:
> > > Hi Frederic,
> > > 
> > > On Fri, Nov 13, 2020 at 01:13:15PM +0100, Frederic Weisbecker wrote:
> > > > This keeps growing up. Rest assured, most of it is debug code and sanity
> > > > checks.
> > > > 
> > > > Boqun Feng found that holding rnp lock while updating the offloaded
> > > > state of an rdp isn't needed, and he was right despite my initial
> > > > reaction. The sites that read the offloaded state while holding the rnp
> > > > lock are actually protected because they read it locally in a non
> > > > preemptible context.
> > > > 
> > > > So I removed the rnp lock in "rcu/nocb: De-offloading CB". And just to
> > > > make sure I'm not missing something, I added sanity checks that ensure
> > > > we always read the offloaded state in a safe way (3 last patches).
> > > > 
> > > > Still passes TREE01 (but I had to fight!)
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > > rcu/nocb-toggle-v4
> > > > 
> > > > HEAD: 579e15efa48fb6fc4ecf14961804051f385807fe
> > > > 
> > > 
> > > This whole series look good to me, plus I've run a test, so far
> > > everything seems working ;-) Here is my setup for the test:
> > > 
> > > I'm using a ARM64 guest (running on Hyper-V) to do the test, and the
> > > guest has 8 VCPUs. The code I'm using is v5.10-rc6 + Hyper-V ARM64 guest
> > > support [1] + your patchset (I actually did a merge from your
> > > rcu/nocb-toggle-v5 branch, because IIUC some modification for rcutorture
> > > is still in Paul's tree). I compiled with my normal configuration for
> > > ARM64 Hyper-V guest plus TREE01, boot the kernel with:
> > > 
> > >   ignore_loglevel rcutree.gp_preinit_delay=3 rcutree.gp_init_delay=3 
> > > rcutree.gp_cleanup_delay=3 rcu_nocbs=0-1,3-7 
> > > 
> > > and run rcutorture via:
> > > 
> > >   modprobe rcutorture nocbs_nthreads=8 nocbs_toggle=1000 fwd_progress=0
> > > 
> > > I ran the rcutorture twice, one last for a week or so and one for a day
> > > or two and I didn't observe any problem so far. The latest test summary
> > > is:
> > > 
> > >   [...] rcu-torture: rtc: f794686f ver: 2226396 tfle: 0 rta: 
> > > 2226397 rtaf: 0 rtf: 2226385 rtmbe: 0 rtmbkf: 0/1390141 rtbe: 0 rtbke: 0 
> > > rtbre: 0 rtbf: 0 rtb: 0 nt: 181415346 onoff: 0/0:0/0 -1,0:-1,0 0:0 
> > > (HZ=1000) barrier: 0/0:0 read-exits: 108102 nocb-toggles: 306964:306974 
> > > 
> > > Is there anything I'm missing for a useful test? Do you have other setup
> > > (kernel cmdline or rcutorture parameters) that you want me to try?
> > 
> > Thanks a lot for reviewing and testing. You seem to have tested with the 
> > right
> > options, I have nothing better to suggest. Plus I'm glad you tested on
> > ARM64. x86 is the only target I have tested so far.
> 
> Boqun, would you be willing to give your Tested-by?
> 

Sure, FWIW, for the whole series, feel free to add:

Tested-by: Boqun Feng 

Regards,
Boqun

>   Thanx, Paul


Re: One potential issue with concurrent execution of RCU callbacks...

2020-12-08 Thread Boqun Feng
Hi Frederic,

On Tue, Dec 08, 2020 at 11:04:38PM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 08, 2020 at 10:24:09AM -0800, Paul E. McKenney wrote:
> > > It reduces the code scope running with BH disabled.
> > > Also narrowing down helps to understand what it actually protects.
> > 
> > I thought that you would call out unnecessarily delaying other softirq
> > handlers.  ;-)
> > 
> > But if such delays are a problem (and they might well be), then to
> > avoid them on non-rcu_nocb CPUs would instead/also require changing the
> > early-exit checks to check for other pending softirqs to the existing
> > checks involving time, need_resched, and idle.  At which point, entering and
> > exiting BH-disabled again doesn't help, other than your point about the
> > difference in BH-disabled scopes on rcu_nocb and non-rcu_nocb CPUs.
> 
> Wise observation!
> 
> > 
> > Would it make sense to exit rcu_do_batch() if more than some amount
> > of time had elapsed and there was some non-RCU softirq pending?
> > 
> > My guess is that the current tlimit checks in rcu_do_batch() make this
> > unnecessary.
> 
> Right and nobody has complained about it so far.
> 
> But I should add a comment explaining the reason for the BH-disabled
> section in my series.
> 

Some background for the original question: I'm revisiting the wait
context checking feature of lockdep (which can detect bugs like
acquiring a spinlock_t inside a raw_spinlock_t); I've posted my first
version:


https://lore.kernel.org/lkml/20201208103112.2838119-1-boqun.f...@gmail.com/ 

, and will surely copy you in the next version ;-)

The reason I asked about the RCU callback context requirement is that we
have a virtual lock (rcu_callback_map) that marks an RCU callback
context, so if RCU callback contexts have special restrictions on the
locking usage inside them, we can use the wait context checking to do
the check (like what I did in patch #3 of the above series).

My current summary is that, since in certain configs (use_softirq is
true and nocb is disabled) RCU callbacks are executed in softirq
context, the minimum requirement for any RCU callback is that it obeys
the rules for softirq contexts. And yes, I'm aware that in some configs
RCU callbacks are not executed in a softirq context (sometimes BH is
not even disabled), but we need to make all the callbacks work in the
"worst" (or strictest) case (callbacks executing in softirq contexts).
Currently, the effect of using a wait context for rcu_callback_map in
my patchset is that lockdep will complain if an RCU callback uses a
mutex or other sleepable locks, but using spinlock_t (even on
PREEMPT_RT) won't cause lockdep to complain. Am I getting this right?
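
As a concrete example of the kind of bug the annotation is meant to
catch (a sketch; the callback, mutex and helper names are made up):

	static DEFINE_MUTEX(example_mutex);
	static struct rcu_head example_head;

	static void example_rcu_cb(struct rcu_head *rhp)
	{
		/* Callbacks can run in softirq context (use_softirq is true
		 * and nocb is disabled), so taking a sleepable lock here is
		 * a bug; with the wait context annotation on
		 * rcu_callback_map, lockdep flags this mutex_lock() as an
		 * invalid wait context, while a spinlock_t would still be
		 * accepted.
		 */
		mutex_lock(&example_mutex);
		mutex_unlock(&example_mutex);
	}

	static void example_queue(void)
	{
		call_rcu(&example_head, example_rcu_cb);
	}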

Regards,
Boqun

> Thanks.
> 
> > 
> > Thoughts?
> > 
> > Thanx, Paul


Re: [RFC lockdep 4/4] lockdep/selftest: Add wait context selftests

2020-12-08 Thread Boqun Feng
On Tue, Dec 08, 2020 at 03:33:24PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 08, 2020 at 06:31:12PM +0800, Boqun Feng wrote:
> > These tests are added for two purposes:
> > 
> > *   Test the implementation of wait context checks and related
> > annotations.
> > 
> > *   Semi-document the rules for wait context nesting when
> > PROVE_RAW_LOCK_NESTING=y.
> 
> Documentation/locking/locktypes.rst should have that.
> 

Thanks for the pointer!

I missed it before, and it's really a comprehensive document for lock
nesting rules. Still, I think more rules can be (and should be) put in
that document: a broader idea is the context nesting rule (e.g. whether
a spinlock_t is allowed in a hard irq handler). And the document
reminds me that I'm missing some locks (e.g. local_lock) in the test
cases. So I will improve both the document and the test cases in the
next version. In the meantime, feel free to point out any mistake or
misunderstanding of mine in the rules or the tests; I'm still learning
these locks with PREEMPT_RT taken into consideration, thanks!

Regards,
Boqun

> > The test cases are only avaible for PROVE_RAW_LOCK_NESTING=y, as wait
> > context checking makes more sense for that configuration.
> 
> Looks about right ;-)


[RFC lockdep 0/4] Fixes and self testcases for wait context detection

2020-12-08 Thread Boqun Feng
Hi Peter,

Recently I looked into the wait context check feature and found some
places that could use fixes; besides that, a suite of test cases is
added to verify these fixes and future development.

Note: I'm not 100% sure all the expected results of the test cases are
correct, so please do have a look at the comment of patch #4 in case I
missed something subtle.

Suggestions and comments are welcome!

Regards,
Boqun


Boqun Feng (4):
  lockdep/selftest: Make HARDIRQ context threaded
  lockdep: Allow wait context checking with empty ->held_locks
  rcu/lockdep: Annotate the rcu_callback_map with proper wait contexts
  lockdep/selftest: Add wait context selftests

 kernel/locking/lockdep.c |   6 +-
 kernel/rcu/update.c  |   8 +-
 lib/locking-selftest.c   | 233 +++
 3 files changed, 244 insertions(+), 3 deletions(-)

-- 
2.29.2



[RFC lockdep 2/4] lockdep: Allow wait context checking with empty ->held_locks

2020-12-08 Thread Boqun Feng
Currently, the guard for !curr->lockdep_depth in check_wait_context() is
unnecessary, because the code will work well without it. Moreover, there
are cases that we will miss if we skip for curr->lockdep_depth == 0. For
example:


some_irq_handler():
  // curr->lockdep_depth == 0
  mutex_lock(_mutex):
check_wait_context() // skip the check!

Clearly, it's a bug, but due to the skip for !curr->lockdep_depth, we
cannot detect it in check_wait_context().

Therefore, remove the !curr->lockdep_depth check and add a comment on
why it still works without it. The idea is that if we currently don't
hold any lock, then the current context is the only one we should use
for the check.

Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index c1418b47f625..d4fd52b22804 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4508,7 +4508,7 @@ static int check_wait_context(struct task_struct *curr, 
struct held_lock *next)
short curr_inner;
int depth;
 
-   if (!curr->lockdep_depth || !next_inner || next->trylock)
+   if (!next_inner || next->trylock)
return 0;
 
if (!next_outer)
@@ -4516,6 +4516,10 @@ static int check_wait_context(struct task_struct *curr, 
struct held_lock *next)
 
/*
 * Find start of current irq_context..
+*
+* Note: if curr->lockdep_depth == 0, we have depth == 0 after the
+* "depth++" below, and will skip the second for loop, i.e. using
+* the current task context as the curr_inner.
 */
for (depth = curr->lockdep_depth - 1; depth >= 0; depth--) {
struct held_lock *prev = curr->held_locks + depth;
-- 
2.29.2



[RFC lockdep 4/4] lockdep/selftest: Add wait context selftests

2020-12-08 Thread Boqun Feng
These tests are added for two purposes:

*   Test the implementation of wait context checks and related
annotations.

*   Semi-document the rules for wait context nesting when
PROVE_RAW_LOCK_NESTING=y.

The test cases are only available for PROVE_RAW_LOCK_NESTING=y, as wait
context checking makes more sense for that configuration.

Signed-off-by: Boqun Feng 
---
 lib/locking-selftest.c | 232 +
 1 file changed, 232 insertions(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index 0af91a07fd18..c00ef4e69637 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -63,6 +63,9 @@ static DEFINE_SPINLOCK(lock_B);
 static DEFINE_SPINLOCK(lock_C);
 static DEFINE_SPINLOCK(lock_D);
 
+static DEFINE_RAW_SPINLOCK(raw_lock_A);
+static DEFINE_RAW_SPINLOCK(raw_lock_B);
+
 static DEFINE_RWLOCK(rwlock_A);
 static DEFINE_RWLOCK(rwlock_B);
 static DEFINE_RWLOCK(rwlock_C);
@@ -1306,6 +1309,7 @@ 
GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion3_soft_wlock)
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define I_SPINLOCK(x) lockdep_reset_lock(_##x.dep_map)
+# define I_RAW_SPINLOCK(x) lockdep_reset_lock(_lock_##x.dep_map)
 # define I_RWLOCK(x)   lockdep_reset_lock(_##x.dep_map)
 # define I_MUTEX(x)lockdep_reset_lock(_##x.dep_map)
 # define I_RWSEM(x)lockdep_reset_lock(_##x.dep_map)
@@ -1315,6 +1319,7 @@ 
GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion3_soft_wlock)
 #endif
 #else
 # define I_SPINLOCK(x)
+# define I_RAW_SPINLOCK(x)
 # define I_RWLOCK(x)
 # define I_MUTEX(x)
 # define I_RWSEM(x)
@@ -1358,9 +1363,12 @@ static void reset_locks(void)
I1(A); I1(B); I1(C); I1(D);
I1(X1); I1(X2); I1(Y1); I1(Y2); I1(Z1); I1(Z2);
I_WW(t); I_WW(t2); I_WW(o.base); I_WW(o2.base); I_WW(o3.base);
+   I_RAW_SPINLOCK(A); I_RAW_SPINLOCK(B);
lockdep_reset();
I2(A); I2(B); I2(C); I2(D);
init_shared_classes();
+   raw_spin_lock_init(_lock_A);
+   raw_spin_lock_init(_lock_B);
 
ww_mutex_init(, _lockdep); ww_mutex_init(, _lockdep); 
ww_mutex_init(, _lockdep);
memset(, 0, sizeof(t)); memset(, 0, sizeof(t2));
@@ -2358,6 +2366,226 @@ static void queued_read_lock_tests(void)
pr_cont("\n");
 }
 
+#define __guard(cleanup) __maybe_unused __attribute__((__cleanup__(cleanup)))
+
+static void hardirq_exit(int *_)
+{
+   HARDIRQ_EXIT();
+}
+
+#define HARDIRQ_CONTEXT(name, ...) \
+   int hardirq_guard_##name __guard(hardirq_exit); \
+   HARDIRQ_ENTER();
+
+#define NOTTHREADED_HARDIRQ_CONTEXT(name, ...) \
+   int notthreaded_hardirq_guard_##name __guard(hardirq_exit); \
+   local_irq_disable();\
+   __irq_enter();  \
+   WARN_ON(!in_irq());
+
+static void softirq_exit(int *_)
+{
+   SOFTIRQ_EXIT();
+}
+
+#define SOFTIRQ_CONTEXT(name, ...) \
+   int softirq_guard_##name __guard(softirq_exit); \
+   SOFTIRQ_ENTER();
+
+static void rcu_exit(int *_)
+{
+   rcu_read_unlock();
+}
+
+#define RCU_CONTEXT(name, ...) \
+   int rcu_guard_##name __guard(rcu_exit); \
+   rcu_read_lock();
+
+static void rcu_bh_exit(int *_)
+{
+   rcu_read_unlock_bh();
+}
+
+#define RCU_BH_CONTEXT(name, ...)  \
+   int rcu_bh_guard_##name __guard(rcu_bh_exit);   \
+   rcu_read_lock_bh();
+
+static void rcu_sched_exit(int *_)
+{
+   rcu_read_unlock_sched();
+}
+
+#define RCU_SCHED_CONTEXT(name, ...)   \
+   int rcu_sched_guard_##name __guard(rcu_sched_exit); \
+   rcu_read_lock_sched();
+
+static void rcu_callback_exit(int *_)
+{
+   rcu_lock_release(_callback_map);
+}
+
+#define RCU_CALLBACK_CONTEXT(name, ...)
\
+   int rcu_callback_guard_##name __guard(rcu_callback_exit);   \
+   rcu_lock_acquire(_callback_map);
+
+
+static void raw_spinlock_exit(raw_spinlock_t **lock)
+{
+   raw_spin_unlock(*lock);
+}
+
+#define RAW_SPINLOCK_CONTEXT(name, lock)   
\
+   raw_spinlock_t *raw_spinlock_guard_##name __guard(raw_spinlock_exit) = 
&(lock); \
+   raw_spin_lock(&(lock));
+
+static void spinlock_exit(spinlock_t **lock)
+{
+   spin_unlock(*lock);
+}
+
+#define SPINLOCK_CONTEXT(name, lock)   
\
+   spinlock_t *spinlock_guard_##name __guard(spinlock_exit) = &(lock); 
\
+   spin_lock(&(lock));
+
+static void mutex_exit(struct mutex **lock)
+{
+   mutex_unlock(*lock);
+}
+
+#define MUTEX_CONTEXT(name, lock)  \
+   struct mutex *mutex_guard_##name __guard(mutex_exit) = &(lock); \
+   m

[RFC lockdep 3/4] rcu/lockdep: Annotate the rcu_callback_map with proper wait contexts

2020-12-08 Thread Boqun Feng
rcu_callback_map is a virtual lock to annotate a context where RCU
callbacks are executed. RCU callbacks are required to work in
softirq-disabled contexts because with some config combinations
(use_softirq is true and nocb is disabled) RCU callbacks only execute
in softirq contexts. Therefore wait context annotations can be added
to detect bugs like using a mutex in an RCU callback.

Signed-off-by: Boqun Feng 
---
 kernel/rcu/update.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 39334d2d2b37..dd59e6412f61 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -269,8 +269,12 @@ EXPORT_SYMBOL_GPL(rcu_sched_lock_map);
 
 // Tell lockdep when RCU callbacks are being invoked.
 static struct lock_class_key rcu_callback_key;
-struct lockdep_map rcu_callback_map =
-   STATIC_LOCKDEP_MAP_INIT("rcu_callback", _callback_key);
+struct lockdep_map rcu_callback_map = {
+   .name = "rcu_callback",
+   .key = _callback_key,
+   .wait_type_outer = LD_WAIT_FREE,
+   .wait_type_inner = LD_WAIT_CONFIG, /* RCU callbacks are handled in 
softirq context */
+};
 EXPORT_SYMBOL_GPL(rcu_callback_map);
 
 noinstr int notrace debug_lockdep_rcu_enabled(void)
-- 
2.29.2



[RFC lockdep 1/4] lockdep/selftest: Make HARDIRQ context threaded

2020-12-08 Thread Boqun Feng
Since we now use spinlock_t instead of raw_spinlock_t in the lockdep
self tests, we should make the emulated HARDIRQ context threaded;
otherwise, spinlock_t cannot be used in the HARDIRQ context and some
test cases will fail because of wait context checking when
PROVE_RAW_LOCK_NESTING=y.

Signed-off-by: Boqun Feng 
---
 lib/locking-selftest.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index afa7d4bb291f..0af91a07fd18 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -186,6 +186,7 @@ static void init_shared_classes(void)
 #define HARDIRQ_ENTER()\
local_irq_disable();\
__irq_enter();  \
+   lockdep_hardirq_threaded(); \
WARN_ON(!in_irq());
 
 #define HARDIRQ_EXIT() \
-- 
2.29.2



Re: [PATCH 00/19] rcu/nocb: De-offload and re-offload support v4

2020-12-07 Thread Boqun Feng
Hi Frederic,

On Fri, Nov 13, 2020 at 01:13:15PM +0100, Frederic Weisbecker wrote:
> This keeps growing up. Rest assured, most of it is debug code and sanity
> checks.
> 
> Boqun Feng found that holding rnp lock while updating the offloaded
> state of an rdp isn't needed, and he was right despite my initial
> reaction. The sites that read the offloaded state while holding the rnp
> lock are actually protected because they read it locally in a non
> preemptible context.
> 
> So I removed the rnp lock in "rcu/nocb: De-offloading CB". And just to
> make sure I'm not missing something, I added sanity checks that ensure
> we always read the offloaded state in a safe way (3 last patches).
> 
> Still passes TREE01 (but I had to fight!)
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>   rcu/nocb-toggle-v4
> 
> HEAD: 579e15efa48fb6fc4ecf14961804051f385807fe
> 

This whole series looks good to me; plus, I've run a test and so far
everything seems to be working ;-) Here is my setup for the test:

I'm using an ARM64 guest (running on Hyper-V) to do the test, and the
guest has 8 VCPUs. The code I'm using is v5.10-rc6 + Hyper-V ARM64 guest
support [1] + your patchset (I actually did a merge from your
rcu/nocb-toggle-v5 branch, because IIUC some modifications for
rcutorture are still in Paul's tree). I compiled with my normal
configuration for an ARM64 Hyper-V guest plus TREE01, and booted the
kernel with:

ignore_loglevel rcutree.gp_preinit_delay=3 rcutree.gp_init_delay=3 
rcutree.gp_cleanup_delay=3 rcu_nocbs=0-1,3-7 

and run rcutorture via:

modprobe rcutorture nocbs_nthreads=8 nocbs_toggle=1000 fwd_progress=0

I ran rcutorture twice, one run lasting a week or so and one a day or
two, and I didn't observe any problems so far. The latest test summary
is:

[...] rcu-torture: rtc: f794686f ver: 2226396 tfle: 0 rta: 
2226397 rtaf: 0 rtf: 2226385 rtmbe: 0 rtmbkf: 0/1390141 rtbe: 0 rtbke: 0 rtbre: 
0 rtbf: 0 rtb: 0 nt: 181415346 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=1000) barrier: 
0/0:0 read-exits: 108102 nocb-toggles: 306964:306974 

Is there anything I'm missing for a useful test? Do you have another
setup (kernel cmdline or rcutorture parameters) that you want me to try?

Regards,
Boqun

> Thanks,
>   Frederic
> ---
> 
> Frederic Weisbecker (19):
>   rcu/nocb: Turn enabled/offload states into a common flag
>   rcu/nocb: Provide basic callback offloading state machine bits
>   rcu/nocb: Always init segcblist on CPU up
>   rcu/nocb: De-offloading CB kthread
>   rcu/nocb: Don't deoffload an offline CPU with pending work
>   rcu/nocb: De-offloading GP kthread
>   rcu/nocb: Re-offload support
>   rcu/nocb: Shutdown nocb timer on de-offloading
>   rcu: Flush bypass before setting SEGCBLIST_SOFTIRQ_ONLY
>   rcu/nocb: Set SEGCBLIST_SOFTIRQ_ONLY at the very last stage of 
> de-offloading
>   rcu/nocb: Only cond_resched() from actual offloaded batch processing
>   rcu/nocb: Process batch locally as long as offloading isn't complete
>   rcu/nocb: Locally accelerate callbacks as long as offloading isn't 
> complete
>   tools/rcutorture: Support nocb toggle in TREE01
>   rcutorture: Remove weak nocb declarations
>   rcutorture: Export nocb (de)offloading functions
>   cpu/hotplug: Add lockdep_is_cpus_held()
>   timer: Add timer_curr_running()
>   rcu/nocb: Detect unsafe checks for offloaded rdp
> 
> 
>  include/linux/cpu.h|   1 +
>  include/linux/rcu_segcblist.h  | 119 +-
>  include/linux/rcupdate.h   |   4 +
>  include/linux/timer.h  |   2 +
>  kernel/cpu.c   |   7 +
>  kernel/rcu/rcu_segcblist.c |  13 +-
>  kernel/rcu/rcu_segcblist.h |  45 ++-
>  kernel/rcu/rcutorture.c|   3 -
>  kernel/rcu/tree.c  |  49 ++-
>  kernel/rcu/tree.h  |   2 +
>  kernel/rcu/tree_plugin.h   | 416 
> +++--
>  kernel/time/timer.c|  13 +
>  .../selftests/rcutorture/configs/rcu/TREE01.boot   |   4 +-
>  13 files changed, 614 insertions(+), 64 deletions(-)


Re: BUG: Invalid wait context with KMEMLEAK and KASAN enabled

2020-12-06 Thread Boqun Feng
Hi Richard,

On Sun, Dec 06, 2020 at 11:59:16PM +0100, Richard Weinberger wrote:
> Hi!
> 
> With both KMEMLEAK and KASAN enabled, I'm facing the following lockdep
> splat at random times on Linus' tree as of today.
> Sometimes it happens at bootup, sometimes much later when userspace has 
> started.
> 
> Does this ring a bell?
> 
> [2.298447] =
> [2.298971] [ BUG: Invalid wait context ]
> [2.298971] 5.10.0-rc6+ #388 Not tainted
> [2.298971] -
> [2.298971] ksoftirqd/1/15 is trying to lock:
> [2.298971] 888100b94598 (>list_lock){}-{3:3}, at:
> free_debug_processing+0x3d/0x210

I guess you also had CONFIG_PROVE_RAW_LOCK_NESTING=y, right? With that
config, the wait context detection of lockdep will treat spinlock_t as
a sleepable lock (considering the PREEMPT_RT kernel), and here it
complained about trying to acquire a sleepable lock (in a PREEMPT_RT
kernel) inside an irq context which cannot be threaded (in this case,
it's the IPI). A proper fix would be to change
kmem_cache_node->list_lock into a raw_spinlock_t.
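
Schematically, the direction of that fix would be along these lines (an
illustrative sketch only, not a tested patch; the struct and function
names carry a _sketch suffix to make that clear):

	struct kmem_cache_node_sketch {
		raw_spinlock_t list_lock;	/* was: spinlock_t list_lock; */
	};

	static void free_debug_processing_sketch(struct kmem_cache_node_sketch *n)
	{
		unsigned long flags;

		/* A raw_spinlock_t stays a real spinning lock on PREEMPT_RT,
		 * so taking it from a non-threadable irq context (the IPI in
		 * the splat) no longer trips the wait context check.
		 */
		raw_spin_lock_irqsave(&n->list_lock, flags);
		/* ... debug processing ... */
		raw_spin_unlock_irqrestore(&n->list_lock, flags);
	}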

Regards,
Boqun

> [2.298971] other info that might help us debug this:
> [2.298971] context-{2:2}
> [2.298971] 1 lock held by ksoftirqd/1/15:
> [2.298971]  #0: 835f4140 (rcu_callback){}-{0:0}, at:
> rcu_core+0x408/0x1040
> [2.298971] stack backtrace:
> [2.298971] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 5.10.0-rc6+ #388
> [2.298971] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.12.0-0-ga698c89-rebuilt.opensuse.org 04/01/2014
> [2.298971] Call Trace:
> [2.298971]  
> [2.298971]  dump_stack+0x9a/0xcc
> [2.298971]  __lock_acquire.cold+0xce/0x34b
> [2.298971]  ? lockdep_hardirqs_on_prepare+0x1f0/0x1f0
> [2.298971]  ? rcu_read_lock_sched_held+0x9c/0xd0
> [2.298971]  lock_acquire+0x153/0x4c0
> [2.298971]  ? free_debug_processing+0x3d/0x210
> [2.298971]  ? lock_release+0x690/0x690
> [2.298971]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [2.298971]  ? pvclock_clocksource_read+0xd9/0x1a0
> [2.298971]  _raw_spin_lock_irqsave+0x3b/0x80
> [2.298971]  ? free_debug_processing+0x3d/0x210
> [2.298971]  ? qlist_free_all+0x35/0xd0
> [2.298971]  free_debug_processing+0x3d/0x210
> [2.298971]  __slab_free+0x286/0x490
> [2.298971]  ? lockdep_enabled+0x39/0x50
> [2.298971]  ? rcu_read_lock_sched_held+0x9c/0xd0
> [2.298971]  ? run_posix_cpu_timers+0x256/0x2c0
> [2.298971]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [2.298971]  ? posix_cpu_timers_exit_group+0x30/0x30
> [2.298971]  qlist_free_all+0x59/0xd0
> [2.298971]  ? qlist_free_all+0xd0/0xd0
> [2.298971]  per_cpu_remove_cache+0x47/0x50
> [2.298971]  flush_smp_call_function_queue+0xea/0x2b0
> [2.298971]  __sysvec_call_function+0x6c/0x250
> [2.298971]  asm_call_irq_on_stack+0x12/0x20
> [2.298971]  
> [2.298971]  sysvec_call_function+0x84/0xa0
> [2.298971]  asm_sysvec_call_function+0x12/0x20
> [2.298971] RIP: 0010:__asan_load4+0x1d/0x80
> [2.298971] Code: 10 00 75 ee c3 0f 1f 84 00 00 00 00 00 4c 8b 04
> 24 48 83 ff fb 77 4d 48 b8 ff ff ff ff ff 7f ff ff 48 39 c7 76 3e 48
> 8d 47 03 <48> 89 c2 83 e2 07 48 83 fa 02 76 17 48 b9 00 00 00 00 00 fc
> ff df
> [2.298971] RSP: :888100e4f858 EFLAGS: 0216
> [2.298971] RAX: 83c55773 RBX: 81002431 RCX: 
> dc00
> [2.298971] RDX: 0001 RSI: 83ee8d78 RDI: 
> 83c55770
> [2.298971] RBP: 83c5576c R08: 81083433 R09: 
> fbfff07e333d
> [2.298971] R10: 0001803d R11: fbfff07e333c R12: 
> 83c5575c
> [2.298971] R13: 83c55774 R14: 83c55770 R15: 
> 83c55770
> [2.298971]  ? ret_from_fork+0x21/0x30
> [2.298971]  ? __orc_find+0x63/0xc0
> [2.298971]  ? stack_access_ok+0x35/0x90
> [2.298971]  __orc_find+0x63/0xc0
> [2.298971]  unwind_next_frame+0x1ee/0xbd0
> [2.298971]  ? ret_from_fork+0x22/0x30
> [2.298971]  ? ret_from_fork+0x21/0x30
> [2.298971]  ? deref_stack_reg+0x40/0x40
> [2.298971]  ? __unwind_start+0x2e8/0x370
> [2.298971]  ? create_prof_cpu_mask+0x20/0x20
> [2.298971]  arch_stack_walk+0x83/0xf0
> [2.298971]  ? ret_from_fork+0x22/0x30
> [2.298971]  ? rcu_core+0x488/0x1040
> [2.298971]  stack_trace_save+0x8c/0xc0
> [2.298971]  ? stack_trace_consume_entry+0x80/0x80
> [2.298971]  ? sched_clock_local+0x99/0xc0
> [2.298971]  kasan_save_stack+0x1b/0x40
> [2.298971]  ? kasan_save_stack+0x1b/0x40
> [2.298971]  ? kasan_set_track+0x1c/0x30
> [2.298971]  ? kasan_set_free_info+0x1b/0x30
> [2.298971]  ? __kasan_slab_free+0x10f/0x150
> [2.298971]  ? kmem_cache_free+0xa8/0x350
> [2.298971]  ? rcu_core+0x488/0x1040
> [2.298971]  ? __do_softirq+0x101/0x573
> [2.298971]  ? run_ksoftirqd+0x21/0x50
> [2.298971]  ? smpboot_thread_fn+0x1fc/0x380
> [2.298971]  

[tip: locking/core] lockdep/selftest: Add spin_nest_lock test

2020-12-03 Thread tip-bot2 for Boqun Feng
The following commit has been merged into the locking/core branch of tip:

Commit-ID: e04ce676e7aa490dcf5df880592e3db5e842a9bc
Gitweb:
https://git.kernel.org/tip/e04ce676e7aa490dcf5df880592e3db5e842a9bc
Author:Boqun Feng 
AuthorDate:Mon, 02 Nov 2020 13:37:42 +08:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 03 Dec 2020 11:20:50 +01:00

lockdep/selftest: Add spin_nest_lock test

Add a self test case to test the behavior for the following case:

lock(A);
lock_nest_lock(C1, A);
lock(B);
lock_nest_lock(C2, A);

This is a reproducer for a problem[1] reported by Chris Wilson, and is
helpful to prevent this.

[1]: 
https://lore.kernel.org/lkml/160390684819.31966.12048967113267928...@build.alporthouse.com/

Signed-off-by: Boqun Feng 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201102053743.450459-2-boqun.f...@gmail.com
---
 lib/locking-selftest.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index afa7d4b..4c24ac8 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -2009,6 +2009,19 @@ static void ww_test_spin_nest_unlocked(void)
U(A);
 }
 
+/* This is not a deadlock, because we have X1 to serialize Y1 and Y2 */
+static void ww_test_spin_nest_lock(void)
+{
+   spin_lock(_X1);
+   spin_lock_nest_lock(_Y1, _X1);
+   spin_lock(_A);
+   spin_lock_nest_lock(_Y2, _X1);
+   spin_unlock(_A);
+   spin_unlock(_Y2);
+   spin_unlock(_Y1);
+   spin_unlock(_X1);
+}
+
 static void ww_test_unneeded_slow(void)
 {
WWAI();
@@ -2226,6 +2239,10 @@ static void ww_tests(void)
dotest(ww_test_spin_nest_unlocked, FAILURE, LOCKTYPE_WW);
pr_cont("\n");
 
+   print_testname("spinlock nest test");
+   dotest(ww_test_spin_nest_lock, SUCCESS, LOCKTYPE_WW);
+   pr_cont("\n");
+
printk("  -\n");
printk(" |block | try  |context|\n");
printk("  -\n");


Re: [arm64] db410c: BUG: Invalid wait context

2020-12-02 Thread Boqun Feng
Hi Naresh,

On Wed, Dec 02, 2020 at 10:15:44AM +0530, Naresh Kamboju wrote:
> While running kselftests on arm64 db410c platform "BUG: Invalid wait context"
> noticed at different runs this specific platform running stable-rc 5.9.12-rc1.
> 
> While running these two test cases we have noticed this BUG and not easily
> reproducible.
> 
> # selftests: bpf: test_xdp_redirect.sh
> # selftests: net: ip6_gre_headroom.sh
> 
> [  245.694901] kauditd_printk_skb: 100 callbacks suppressed
> [  245.694913] audit: type=1334 audit(251.699:25757): prog-id=12883 op=LOAD
> [  245.735658] audit: type=1334 audit(251.743:25758): prog-id=12884 op=LOAD
> [  245.801299] audit: type=1334 audit(251.807:25759): prog-id=12885 op=LOAD
> [  245.832034] audit: type=1334 audit(251.839:25760): prog-id=12886 op=LOAD
> [  245.888601]
> [  245.888631] =
> [  245.889156] [ BUG: Invalid wait context ]
> [  245.893071] 5.9.12-rc1 #1 Tainted: GW
> [  245.897056] -
> [  245.902091] pool/1279 is trying to lock:
> [  245.906083] 32fc1218
> (>perf_event_mutex){+.+.}-{3:3}, at:
> perf_event_exit_task+0x34/0x3a8
> [  245.910085] other info that might help us debug this:
> [  245.919539] context-{4:4}
> [  245.924484] 1 lock held by pool/1279:
> [  245.927087]  #0: 8000127819b8 (rcu_read_lock){}-{1:2}, at:
> dput+0x54/0x460
> [  245.930739] stack backtrace:
> [  245.938203] CPU: 1 PID: 1279 Comm: pool Tainted: GW
> 5.9.12-rc1 #1
> [  245.941243] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> [  245.948621] Call trace:
> [  245.955390]  dump_backtrace+0x0/0x1f8
> [  245.957560]  show_stack+0x2c/0x38
> [  245.961382]  dump_stack+0xec/0x158
> [  245.964679]  __lock_acquire+0x59c/0x15c8
> [  245.967978]  lock_acquire+0x124/0x4d0
> [  245.972058]  __mutex_lock+0xa4/0x970
> [  245.975615]  mutex_lock_nested+0x54/0x70
> [  245.979261]  perf_event_exit_task+0x34/0x3a8
> [  245.983168]  do_exit+0x394/0xad8
> [  245.987420]  do_group_exit+0x4c/0xa8
> [  245.990633]  get_signal+0x16c/0xb40
> [  245.994193]  do_notify_resume+0x2ec/0x678
> [  245.997404]  work_pending+0x8/0x200
> 

From the PoV of lockdep, this means someone tries to acquire a mutex
inside an RCU read-side critical section, which is bad, because one
cannot sleep (voluntarily) inside RCU.
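
For illustration, the pattern lockdep believes it is seeing would look
roughly like the sketch below (hypothetical code, not taken from the
splat above): a sleeping lock acquired inside an RCU read-side critical
section.

	static DEFINE_MUTEX(m);

	static void invalid_wait_context_example(void)
	{
		rcu_read_lock();
		mutex_lock(&m);		/* may sleep: invalid inside RCU */
		/* ... */
		mutex_unlock(&m);
		rcu_read_unlock();
	}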

However, I don't think that's really what happened here, because 1)
normally people are very careful not to put mutexes or other sleepable
locks inside RCU, and 2) in the above splats, lockdep finds the RCU read
lock held at dput() while the mutex is acquired at ret_to_user();
clearly there is no call path (in the same context) from the RCU
read-side critical section of dput() to ret_to_user().

One way to hit this is a bug in context/irq tracing that makes lockdep
treat the contexts of dput() and ret_to_user() as a single context, so
lockdep gets confused and reports a false positive.

FWIW, I think this might be related to some known issues for ARM64 with
lockdep and irq tracing:

https://lore.kernel.org/lkml/20201119225352.GA5251@willie-the-truck/

And Mark already has series to fix them:


https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/irq-fixes

But I must defer to Mark for the latest fix ;-)

Regards,
Boqun

> and BUG in an other run,
> 
> [ 1012.068407] audit: type=1700 audit(1018.803:25886): dev=eth0 prom=0
> old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
> [ 1012.250561] IPv6: ADDRCONF(NETDEV_CHANGE): swp1: link becomes ready
> [ 1012.251298] IPv6: ADDRCONF(NETDEV_CHANGE): h1: link becomes ready
> [ 1012.252559]
> [ 1012.261892] =
> [ 1012.263453] [ BUG: Invalid wait context ]
> [ 1012.267363] 5.9.12-rc1 #1 Tainted: GW
> [ 1012.271354] -
> [ 1012.276389] systemd/454 is trying to lock:
> [ 1012.280381] 3985a918 (>mmap_lock){}-{3:3}, at:
> __might_fault+0x60/0xa8
> [ 1012.284378] other info that might help us debug this:
> [ 1012.292275] context-{4:4}
> [ 1012.297396] 1 lock held by systemd/454:
> [ 1012.27]  #0: 8000127d1f38 (rcu_read_lock){}-{1:2}, at:
> path_init+0x40/0x718
> [ 1012.303649] stack backtrace:
> [ 1012.311638] CPU: 2 PID: 454 Comm: systemd Tainted: GW
>   5.9.12-rc1 #1
> [ 1012.314760] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> [ 1012.322139] Call trace:
> [ 1012.329084]  dump_backtrace+0x0/0x1f8
> [ 1012.331254]  show_stack+0x2c/0x38
> [ 1012.335075]  dump_stack+0xec/0x158
> [ 1012.338371]  __lock_acquire+0x59c/0x15c8
> [ 1012.341672]  lock_acquire+0x124/0x4d0
> [ 1012.345751]  __might_fault+0x84/0xa8
> [ 1012.349311]  cp_new_stat+0x114/0x1b8
> [ 1012.352956]  __do_sys_newfstat+0x44/0x70
> [ 1012.356513]  __arm64_sys_newfstat+0x24/0x30
> [ 1012.358652] IPv6: ADDRCONF(NETDEV_CHANGE): swp3: link becomes ready
> [ 1012.360424]  el0_svc_common.constprop.3+0x7c/0x198
> [ 1012.370575]  do_el0_svc+0x34/0xa0
> [ 1012.375437] 

Re: [PATCH] leds: trigger: fix potential deadlock with libata

2020-11-25 Thread Boqun Feng
asm_common_interrupt+0x1e/0x40
> >  native_safe_halt+0xe/0x10
> >  arch_cpu_idle+0x15/0x20
> >  default_idle_call+0x59/0x1c0
> >  do_idle+0x22c/0x2c0
> >  cpu_startup_entry+0x20/0x30
> >  start_secondary+0x11d/0x150
> >  secondary_startup_64_no_verify+0xa6/0xab
> > INITIAL USE at:
> > lock_acquire+0x15f/0x420
> > _raw_spin_lock_irqsave+0x52/0xa0
> > ata_dev_init+0x54/0xe0
> > ata_link_init+0x8b/0xd0
> > ata_port_alloc+0x1f1/0x210
> > ata_host_alloc+0xf1/0x130
> > ata_host_alloc_pinfo+0x14/0xb0
> > ata_pci_sff_prepare_host+0x41/0xa0
> > ata_pci_bmdma_prepare_host+0x14/0x30
> > piix_init_one+0x21f/0x600
> > local_pci_probe+0x48/0x80
> > pci_device_probe+0x105/0x1c0
> > really_probe+0x221/0x490
> > driver_probe_device+0xe9/0x160
> > device_driver_attach+0xb2/0xc0
> > __driver_attach+0x91/0x150
> > bus_for_each_dev+0x81/0xc0
> > driver_attach+0x1e/0x20
> > bus_add_driver+0x138/0x1f0
> > driver_register+0x91/0xf0
> > __pci_register_driver+0x73/0x80
> > piix_init+0x1e/0x2e
> > do_one_initcall+0x5f/0x2d0
> > kernel_init_freeable+0x26f/0x2cf
> > kernel_init+0xe/0x113
> > ret_from_fork+0x1f/0x30
> >   }
> >   ... key  at: [] __key.6+0x0/0x10
> >   ... acquired at:
> > __lock_acquire+0x9da/0x2370
> > lock_acquire+0x15f/0x420
> > _raw_spin_lock_irqsave+0x52/0xa0
> > ata_bmdma_interrupt+0x27/0x200
> > __handle_irq_event_percpu+0xd5/0x2b0
> > handle_irq_event+0x57/0xb0
> > handle_edge_irq+0x8c/0x230
> > asm_call_irq_on_stack+0xf/0x20
> > common_interrupt+0x100/0x1c0
> > asm_common_interrupt+0x1e/0x40
> > native_safe_halt+0xe/0x10
> > arch_cpu_idle+0x15/0x20
> > default_idle_call+0x59/0x1c0
> > do_idle+0x22c/0x2c0
> > cpu_startup_entry+0x20/0x30
> > start_secondary+0x11d/0x150
> > secondary_startup_64_no_verify+0xa6/0xab
> > 
> > This lockdep splat is reported after:
> > commit e918188611f0 ("locking: More accurate annotations for read_lock()")
> > 
> > To clarify:
> >  - read-locks are recursive only in interrupt context (when
> >in_interrupt() returns true)
> >  - after acquiring host->lock in CPU1, another cpu (i.e. CPU2) may call
> >write_lock(&trig->leddev_list_lock) that would be blocked by CPU0
> >that holds trig->leddev_list_lock in read-mode
> >  - when CPU1 (ata_ac_complete()) tries to read-lock
> >trig->leddev_list_lock, it would be blocked by the write-lock waiter
> >on CPU2 (because we are not in interrupt context, so the read-lock is
> >not recursive)
> >  - at this point if an interrupt happens on CPU0 and
> >ata_bmdma_interrupt() is executed it will try to acquire host->lock,
> >that is held by CPU1, that is currently blocked by CPU2, so:
> > 
> >* CPU0 blocked by CPU1
> >* CPU1 blocked by CPU2
> >* CPU2 blocked by CPU0
> > 
> >  *** DEADLOCK ***
> > 
> > The deadlock scenario is better represented by the following schema
> > (thanks to Boqun Feng  for the schema and the
> > detailed explanation of the deadlock condition):
> > 
> >  CPU 0:  CPU 1:CPU 2:
> >  -   - -
> >  led_trigger_event():
> >read_lock(&trig->leddev_list_lock);
> > 
> > ata_hsm_qc_complete():
> >   spin_lock_irqsave(&host->lock);
> > 
> > write_lock(&trig->leddev_list_lock);
> >   ata_port_freeze():
> > ata_do_link_abort():
> >   ata_qc_complete():
> > ledtrig_disk_activity():
> > 

Re: [PATCH] kfence: Avoid stalling work queue task without allocations

2020-11-23 Thread Boqun Feng
Hi Steven,

On Mon, Nov 23, 2020 at 01:42:27PM -0500, Steven Rostedt wrote:
> On Mon, 23 Nov 2020 11:28:12 -0500
> Steven Rostedt  wrote:
> 
> > I noticed:
> > 
> > 
> > [  237.650900] enabling event benchmark_event
> > 
> > In both traces. Could you disable CONFIG_TRACEPOINT_BENCHMARK and see if
> > the issue goes away. That event kicks off a thread that spins in a tight
> > loop for some time and could possibly cause some issues.
> > 
> > It still shouldn't break things, we can narrow it down if it is the culprit.
> 
> [ Added Thomas  ]
> 
> And that's just one issue. I don't think that has anything to do with the
> other one:
> 
> [ 1614.162007] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 1614.168625]  (detected by 0, t=3752 jiffies, g=3529, q=1)
> [ 1614.170825] rcu: All QSes seen, last rcu_preempt kthread activity 242 
> (4295293115-4295292873), jiffies_till_next_fqs=1, root ->qsmask 0x0
> [ 1614.194272] 
> [ 1614.196673] 
> [ 1614.199738] WARNING: inconsistent lock state
> [ 1614.203056] 5.10.0-rc4-next-20201119-4-g77838ee21ff6-dirty #21 Not 
> tainted
> [ 1614.207012] 
> [ 1614.210125] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> [ 1614.213832] swapper/0/1 [HC0[0]:SC0[0]:HE0:SE1] takes:
> [ 1614.217288] d942547f47d8 (rcu_node_0){?.-.}-{2:2}, at: 
> rcu_sched_clock_irq+0x7c0/0x17a0
> [ 1614.225496] {IN-HARDIRQ-W} state was registered at:
> [ 1614.229031]   __lock_acquire+0xae8/0x1ac8
> [ 1614.232203]   lock_acquire+0x268/0x508
> [ 1614.235254]   _raw_spin_lock_irqsave+0x78/0x14c
> [ 1614.238547]   rcu_sched_clock_irq+0x7c0/0x17a0
> [ 1614.241757]   update_process_times+0x6c/0xb8
> [ 1614.244950]   tick_sched_handle.isra.0+0x58/0x88
> [ 1614.248225]   tick_sched_timer+0x68/0xe0
> [ 1614.251304]   __hrtimer_run_queues+0x288/0x730
> [ 1614.254516]   hrtimer_interrupt+0x114/0x288
> [ 1614.257650]   arch_timer_handler_virt+0x50/0x70
> [ 1614.260922]   handle_percpu_devid_irq+0x104/0x4c0
> [ 1614.264236]   generic_handle_irq+0x54/0x78
> [ 1614.267385]   __handle_domain_irq+0xac/0x130
> [ 1614.270585]   gic_handle_irq+0x70/0x108
> [ 1614.273633]   el1_irq+0xc0/0x180
> [ 1614.276526]   rcu_irq_exit_irqson+0x40/0x78
> [ 1614.279704]   trace_preempt_on+0x144/0x1a0
> [ 1614.282834]   preempt_schedule_common+0xf8/0x1a8
> [ 1614.286126]   preempt_schedule+0x38/0x40
> [ 1614.289240]   __mutex_lock+0x608/0x8e8
> [ 1614.292302]   mutex_lock_nested+0x3c/0x58
> [ 1614.295450]   static_key_enable_cpuslocked+0x7c/0xf8
> [ 1614.298828]   static_key_enable+0x2c/0x40
> [ 1614.301961]   tracepoint_probe_register_prio+0x284/0x3a0
> [ 1614.305464]   tracepoint_probe_register+0x40/0x58
> [ 1614.308776]   trace_event_reg+0xe8/0x150
> [ 1614.311852]   __ftrace_event_enable_disable+0x2e8/0x608
> [ 1614.315351]   __ftrace_set_clr_event_nolock+0x160/0x1d8
> [ 1614.318809]   __ftrace_set_clr_event+0x60/0x90
> [ 1614.322061]   event_trace_self_tests+0x64/0x12c
> [ 1614.325335]   event_trace_self_tests_init+0x88/0xa8
> [ 1614.328758]   do_one_initcall+0xa4/0x500
> [ 1614.331860]   kernel_init_freeable+0x344/0x3c4
> [ 1614.335110]   kernel_init+0x20/0x16c
> [ 1614.338102]   ret_from_fork+0x10/0x34
> [ 1614.341057] irq event stamp: 3206302
> [ 1614.344123] hardirqs last  enabled at (3206301): [] 
> rcu_irq_exit_irqson+0x64/0x78
> [ 1614.348697] hardirqs last disabled at (3206302): [] 
> el1_irq+0x80/0x180
> [ 1614.353013] softirqs last  enabled at (3204216): [] 
> __do_softirq+0x630/0x6b4
> [ 1614.357509] softirqs last disabled at (3204191): [] 
> irq_exit+0x1cc/0x1e0
> [ 1614.361737] 
> [ 1614.361737] other info that might help us debug this:
> [ 1614.365566]  Possible unsafe locking scenario:
> [ 1614.365566] 
> [ 1614.369128]CPU0
> [ 1614.371747]
> [ 1614.374282]   lock(rcu_node_0);
> [ 1614.378818]   
> [ 1614.381394] lock(rcu_node_0);
> [ 1614.385997] 
> [ 1614.385997]  *** DEADLOCK ***
> [ 1614.385997] 
> [ 1614.389613] 5 locks held by swapper/0/1:
> [ 1614.392655]  #0: d9425480e940 (event_mutex){+.+.}-{3:3}, at: 
> __ftrace_set_clr_event+0x48/0x90
> [ 1614.401701]  #1: d9425480a530 (tracepoints_mutex){+.+.}-{3:3}, at: 
> tracepoint_probe_register_prio+0x48/0x3a0
> [ 1614.410973]  #2: d9425476abf0 (cpu_hotplug_lock){}-{0:0}, at: 
> static_key_enable+0x24/0x40
> [ 1614.419858]  #3: d94254816348 (jump_label_mutex){+.+.}-{3:3}, at: 
> static_key_enable_cpuslocked+0x7c/0xf8
> [ 1614.429049]  #4: d942547f47d8 (rcu_node_0){?.-.}-{2:2}, at: 
> rcu_sched_clock_irq+0x7c0/0x17a0
> [ 1614.438029] 
> [ 1614.438029] stack backtrace:
> [ 1614.441436] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.10.0-rc4-next-20201119-4-g77838ee21ff6-dirty #21
> [ 1614.446149] Hardware name: linux,dummy-virt (DT)
> [ 1614.449621] Call trace:
> [ 1614.452337]  dump_backtrace+0x0/0x240
> [ 1614.455372]  show_stack+0x34/0x88
> [ 1614.458306]  dump_stack+0x140/0x1bc
> [ 1614.461258]  print_usage_bug+0x2a0/0x2f0
> [ 

Re: [PATCH] video: hyperv_fb: Directly use the MMIO VRAM

2020-11-21 Thread Boqun Feng
Hi Dexuan,

On Fri, Nov 20, 2020 at 05:45:47PM -0800, Dexuan Cui wrote:
> Late in 2019, 2 commits (see the 2 Fixes tags) were introduced to
> mitigate the slow framebuffer issue. Now that we have fixed the
> slowness issue by correctly mapping the MMIO VRAM (see
> commit 5f1251a48c17 ("video: hyperv_fb: Fix the cache type when mapping
> the VRAM")), we can remove most of the code introduced by the 2 commits,
> i.e. we no longer need to allocate physical memory and use it to back up
> the VRAM in Generation-1 VM, and we also no longer need to allocate
> physical memory to back up the framebuffer in a Generation-2 VM and copy
> the framebuffer to the real VRAM.
> 
> synthvid_deferred_io() is kept, because it's still desirable to send the
> SYNTHVID_DIRT message only for the exact dirty rectangle, and only when
> needed.
> 
> Fixes: d21987d709e8 ("video: hyperv: hyperv_fb: Support deferred IO for 
> Hyper-V frame buffer driver")
> Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for fb on 
> HyperV Gen 1 VMs.")
> Cc: Wei Hu 
> Cc: Boqun Feng 
> Signed-off-by: Dexuan Cui 

After I applied this patch and the patch ("video: hyperv_fb: Fix the
cache type when mapping the VRAM") on my development branch (with
Michael's patchset for ARM64 core support on Hyper-V), everything worked
fine. So feel free to add:

Tested-by: Boqun Feng 

Regards,
Boqun

> ---
> 
> This patch changes drivers/video/fbdev/Kconfig, but I hope this can
> still go through the Hyper-V tree
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-next
> because it's unlikely to cause any build issue to other fbdev drivers
> (that line was introduced by 3a6fb6c4255c only for hyperv_fb.c)
> 
> Note: this patch is based on the Hyper-V tree's hyperv-fixes branch, but
> it should also apply cleanly to the branch hyperv-next if the commit
> 5f1251a48c17 is applied first.  This patch is for v5.11 rather than
> v5.10.
> 
>  drivers/video/fbdev/Kconfig |   1 -
>  drivers/video/fbdev/hyperv_fb.c | 170 ++--
>  2 files changed, 9 insertions(+), 162 deletions(-)
> 
> diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
> index 402e85450bb5..05b37fb3c6d6 100644
> --- a/drivers/video/fbdev/Kconfig
> +++ b/drivers/video/fbdev/Kconfig
> @@ -2205,7 +2205,6 @@ config FB_HYPERV
>   select FB_CFB_COPYAREA
>   select FB_CFB_IMAGEBLIT
>   select FB_DEFERRED_IO
> - select DMA_CMA if HAVE_DMA_CONTIGUOUS && CMA
>   help
> This framebuffer driver supports Microsoft Hyper-V Synthetic Video.
>  
> diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
> index 58c74d2356ba..8131f4e66f98 100644
> --- a/drivers/video/fbdev/hyperv_fb.c
> +++ b/drivers/video/fbdev/hyperv_fb.c
> @@ -31,16 +31,6 @@
>   * "set-vmvideo" command. For example
>   * set-vmvideo -vmname name -horizontalresolution:1920 \
>   * -verticalresolution:1200 -resolutiontype single
> - *
> - * Gen 1 VMs also support direct using VM's physical memory for framebuffer.
> - * It could improve the efficiency and performance for framebuffer and VM.
> - * This requires to allocate contiguous physical memory from Linux kernel's
> - * CMA memory allocator. To enable this, supply a kernel parameter to give
> - * enough memory space to CMA allocator for framebuffer. For example:
> - *cma=130m
> - * This gives 130MB memory to CMA allocator that can be allocated to
> - * framebuffer. For reference, 8K resolution (7680x4320) takes about
> - * 127MB memory.
>   */
>  
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> @@ -267,16 +257,8 @@ struct hvfb_par {
>   /* If true, the VSC notifies the VSP on every framebuffer change */
>   bool synchronous_fb;
>  
> - /* If true, need to copy from deferred IO mem to framebuffer mem */
> - bool need_docopy;
> -
>   struct notifier_block hvfb_panic_nb;
>  
> - /* Memory for deferred IO and frame buffer itself */
> - unsigned char *dio_vp;
> - unsigned char *mmio_vp;
> - phys_addr_t mmio_pp;
> -
>   /* Dirty rectangle, protected by delayed_refresh_lock */
>   int x1, y1, x2, y2;
>   bool delayed_refresh;
> @@ -405,21 +387,6 @@ synthvid_update(struct fb_info *info, int x1, int y1, 
> int x2, int y2)
>   return 0;
>  }
>  
> -static void hvfb_docopy(struct hvfb_par *par,
> - unsigned long offset,
> - unsigned long size)
> -{
> - if (!par || !par->mmio_vp || !par->dio_vp || !par->fb_ready ||
> - size == 0 || offset >= dio_fb_size)
> - return;
> -
>

[tip: locking/urgent] lockdep: Put graph lock/unlock under lock_recursion protection

2020-11-19 Thread tip-bot2 for Boqun Feng
The following commit has been merged into the locking/urgent branch of tip:

Commit-ID: 43be4388e94b915799a24f0eaf664bf95b85231f
Gitweb:
https://git.kernel.org/tip/43be4388e94b915799a24f0eaf664bf95b85231f
Author:Boqun Feng 
AuthorDate:Fri, 13 Nov 2020 19:05:03 +08:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 17 Nov 2020 13:15:35 +01:00

lockdep: Put graph lock/unlock under lock_recursion protection

A warning was hit when running xfstests/generic/068 in a Hyper-V guest:

[...] [ cut here ]
[...] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
[...] WARNING: CPU: 2 PID: 1350 at kernel/locking/lockdep.c:5280 
check_flags.part.0+0x165/0x170
[...] ...
[...] Workqueue: events pwq_unbound_release_workfn
[...] RIP: 0010:check_flags.part.0+0x165/0x170
[...] ...
[...] Call Trace:
[...]  lock_is_held_type+0x72/0x150
[...]  ? lock_acquire+0x16e/0x4a0
[...]  rcu_read_lock_sched_held+0x3f/0x80
[...]  __send_ipi_one+0x14d/0x1b0
[...]  hv_send_ipi+0x12/0x30
[...]  __pv_queued_spin_unlock_slowpath+0xd1/0x110
[...]  __raw_callee_save___pv_queued_spin_unlock_slowpath+0x11/0x20
[...]  .slowpath+0x9/0xe
[...]  lockdep_unregister_key+0x128/0x180
[...]  pwq_unbound_release_workfn+0xbb/0xf0
[...]  process_one_work+0x227/0x5c0
[...]  worker_thread+0x55/0x3c0
[...]  ? process_one_work+0x5c0/0x5c0
[...]  kthread+0x153/0x170
[...]  ? __kthread_bind_mask+0x60/0x60
[...]  ret_from_fork+0x1f/0x30

The cause of the problem is we have call chain lockdep_unregister_key()
->  lockdep_unlock() ->
arch_spin_unlock() -> __pv_queued_spin_unlock_slowpath() -> pv_kick() ->
__send_ipi_one() -> trace_hyperv_send_ipi_one().

Although this particular warning is triggered because Hyper-V has a
trace point in IPI sending, in general arch_spin_unlock() may call
another function having a trace point in it, so put the arch_spin_lock()
and arch_spin_unlock() under lock_recursion protection to fix this
problem and avoid similar ones.

Signed-off-by: Boqun Feng 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201113110512.1056501-1-boqun.f...@gmail.com
---
 kernel/locking/lockdep.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d9fb9e1..c1418b4 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -108,19 +108,21 @@ static inline void lockdep_lock(void)
 {
DEBUG_LOCKS_WARN_ON(!irqs_disabled());
 
+   __this_cpu_inc(lockdep_recursion);
arch_spin_lock(&__lock);
__owner = current;
-   __this_cpu_inc(lockdep_recursion);
 }
 
 static inline void lockdep_unlock(void)
 {
+   DEBUG_LOCKS_WARN_ON(!irqs_disabled());
+
if (debug_locks && DEBUG_LOCKS_WARN_ON(__owner != current))
return;
 
-   __this_cpu_dec(lockdep_recursion);
__owner = NULL;
arch_spin_unlock(&__lock);
+   __this_cpu_dec(lockdep_recursion);
 }
 
 static inline bool lockdep_assert_locked(void)


Re: [RFC] Are you good with Lockdep?

2020-11-17 Thread Boqun Feng
Hi Matthew,

On Mon, Nov 16, 2020 at 03:37:29PM +, Matthew Wilcox wrote:
[...]
> 
> It's not just about lockdep for semaphores.  Mutexes will spin if the
> current owner is still running, so to convert an interrupt-released
> semaphore to a mutex, we need a way to mark the mutex as being released

Could you provide an example of the conversion from an
interrupt-released semaphore to a mutex? I'd like to see if we can
improve lockdep to help with that case.
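
For reference, the kind of usage being discussed (a semaphore released
from a different context than the one that acquired it, which a plain
mutex does not allow) would look roughly like this hypothetical sketch;
all names are made up for illustration:

	static DEFINE_SEMAPHORE(io_sem);

	static void start_io(void)
	{
		down(&io_sem);		/* acquired in process context */
		/* kick off the hardware ... */
	}

	static irqreturn_t io_done_irq(int irq, void *dev_id)
	{
		up(&io_sem);		/* released from interrupt context */
		return IRQ_HANDLED;
	}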

Regards,
Boqun

> by the new owner.
> 
> I really don't think you want to report subsequent lockdep splats.


Re: [PATCH AUTOSEL 5.9 13/21] lockdep: Avoid to modify chain keys in validate_chain()

2020-11-17 Thread Boqun Feng
Hi Sasha,

I don't think this commit should be picked up for stable, since the
problem it fixes is caused by commit f611e8cf98ec ("lockdep: Take
read/write status in consideration when generate chainkey"), which only
got merged in the 5.10 merge window. So 5.9 and 5.4 don't have the
problem.

Regards,
Boqun

On Tue, Nov 17, 2020 at 07:56:44AM -0500, Sasha Levin wrote:
> From: Boqun Feng 
> 
> [ Upstream commit d61fc96a37603384cd531622c1e89de1096b5123 ]
> 
> Chris Wilson reported a problem spotted by check_chain_key(): a chain
> key got changed in validate_chain() because we modify the ->read in
> validate_chain() to skip checks for dependency adding, and ->read is
> taken into calculation for chain key since commit f611e8cf98ec
> ("lockdep: Take read/write status in consideration when generate
> chainkey").
> 
> Fix this by avoiding to modify ->read in validate_chain() based on two
> facts: a) since we now support recursive read lock detection, there is
> no need to skip checks for dependency adding for recursive readers, b)
> since we have a), there is only one case left (nest_lock) where we want
> to skip checks in validate_chain(), we simply remove the modification
> for ->read and rely on the return value of check_deadlock() to skip the
> dependency adding.
> 
> Reported-by: Chris Wilson 
> Signed-off-by: Boqun Feng 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: https://lkml.kernel.org/r/20201102053743.450459-1-boqun.f...@gmail.com
> Signed-off-by: Sasha Levin 
> ---
>  kernel/locking/lockdep.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3eb35ad1b5241..f3a4302a1251f 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2421,7 +2421,9 @@ print_deadlock_bug(struct task_struct *curr, struct 
> held_lock *prev,
>   * (Note that this has to be done separately, because the graph cannot
>   * detect such classes of deadlocks.)
>   *
> - * Returns: 0 on deadlock detected, 1 on OK, 2 on recursive read
> + * Returns: 0 on deadlock detected, 1 on OK, 2 if another lock with the same
> + * lock class is held but nest_lock is also held, i.e. we rely on the
> + * nest_lock to avoid the deadlock.
>   */
>  static int
>  check_deadlock(struct task_struct *curr, struct held_lock *next)
> @@ -2444,7 +2446,7 @@ check_deadlock(struct task_struct *curr, struct 
> held_lock *next)
>* lock class (i.e. read_lock(lock)+read_lock(lock)):
>*/
>   if ((next->read == 2) && prev->read)
> - return 2;
> + continue;
>  
>   /*
>* We're holding the nest_lock, which serializes this lock's
> @@ -3227,16 +3229,13 @@ static int validate_chain(struct task_struct *curr,
>  
>   if (!ret)
>   return 0;
> - /*
> -  * Mark recursive read, as we jump over it when
> -  * building dependencies (just like we jump over
> -  * trylock entries):
> -  */
> - if (ret == 2)
> - hlock->read = 2;
>   /*
>* Add dependency only if this lock is not the head
> -  * of the chain, and if it's not a secondary read-lock:
> +  * of the chain, and if the new lock introduces no more
> +  * lock dependency (because we already hold a lock with the
> +  * same lock class) nor deadlock (because the nest_lock
> +  * serializes nesting locks), see the comments for
> +  * check_deadlock().
>*/
>   if (!chain_head && ret != 2) {
>   if (!check_prevs_add(curr, hlock))
> -- 
> 2.27.0
> 


Re: [PATCH 3/3] powerpc: rewrite atomics to use ARCH_ATOMIC

2020-11-13 Thread Boqun Feng
Hi Nicholas,

On Wed, Nov 11, 2020 at 09:07:23PM +1000, Nicholas Piggin wrote:
> All the cool kids are doing it.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/atomic.h  | 681 ++---
>  arch/powerpc/include/asm/cmpxchg.h |  62 +--
>  2 files changed, 248 insertions(+), 495 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/atomic.h 
> b/arch/powerpc/include/asm/atomic.h
> index 8a55eb8cc97b..899aa2403ba7 100644
> --- a/arch/powerpc/include/asm/atomic.h
> +++ b/arch/powerpc/include/asm/atomic.h
> @@ -11,185 +11,285 @@
>  #include 
>  #include 
>  
> +#define ARCH_ATOMIC
> +
> +#ifndef CONFIG_64BIT
> +#include 
> +#endif
> +
>  /*
>   * Since *_return_relaxed and {cmp}xchg_relaxed are implemented with
>   * a "bne-" instruction at the end, so an isync is enough as a acquire 
> barrier
>   * on the platform without lwsync.
>   */
>  #define __atomic_acquire_fence() \
> - __asm__ __volatile__(PPC_ACQUIRE_BARRIER "" : : : "memory")
> + asm volatile(PPC_ACQUIRE_BARRIER "" : : : "memory")
>  
>  #define __atomic_release_fence() \
> - __asm__ __volatile__(PPC_RELEASE_BARRIER "" : : : "memory")
> + asm volatile(PPC_RELEASE_BARRIER "" : : : "memory")
>  
> -static __inline__ int atomic_read(const atomic_t *v)
> -{
> - int t;
> +#define __atomic_pre_full_fence  smp_mb
>  
> - __asm__ __volatile__("lwz%U1%X1 %0,%1" : "=r"(t) : "m"(v->counter));
> +#define __atomic_post_full_fence smp_mb
>  

Do you need to define __atomic_{pre,post}_full_fence for PPC? IIRC,
they default to smp_mb__{before,after}_atomic(), which are smp_mb() by
default on PPC.
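
For reference, when an arch doesn't define these hooks, the generic
layer falls back to something along the lines of the sketch below (an
approximation of include/linux/atomic.h from memory, not the exact
source):

	#ifndef __atomic_pre_full_fence
	#define __atomic_pre_full_fence		smp_mb__before_atomic
	#endif

	#ifndef __atomic_post_full_fence
	#define __atomic_post_full_fence	smp_mb__after_atomic
	#endif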

> - return t;
> +#define arch_atomic_read(v)  __READ_ONCE((v)->counter)
> +#define arch_atomic_set(v, i)
> __WRITE_ONCE(((v)->counter), (i))
> +#ifdef CONFIG_64BIT
> +#define ATOMIC64_INIT(i) { (i) }
> +#define arch_atomic64_read(v)
> __READ_ONCE((v)->counter)
> +#define arch_atomic64_set(v, i)  
> __WRITE_ONCE(((v)->counter), (i))
> +#endif
> +
[...]
>  
> +#define ATOMIC_FETCH_OP_UNLESS_RELAXED(name, type, dtype, width, asm_op) \
> +static inline int arch_##name##_relaxed(type *v, dtype a, dtype u)   \

I don't think we have atomic_fetch_*_unless_relaxed() in the atomic APIs,
ditto for:

atomic_fetch_add_unless_relaxed()
atomic_inc_not_zero_relaxed()
atomic_dec_if_positive_relaxed()

, and we don't have the _acquire() and _release() variants for them
either. If you don't define their fully-ordered versions (e.g.
atomic_inc_not_zero()), atomic-arch-fallback.h will use a plain read
plus cmpxchg to implement them (see the sketch below), and I don't
think that's what we want.
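
For example, the generated fallback for a missing
atomic_fetch_add_unless() is roughly the read-then-cmpxchg loop sketched
below (a simplified approximation, not the exact atomic-arch-fallback.h
output):

	static __always_inline int
	atomic_fetch_add_unless(atomic_t *v, int a, int u)
	{
		int c = atomic_read(v);

		do {
			if (unlikely(c == u))
				break;
		} while (!atomic_try_cmpxchg(v, &c, c + a));

		return c;
	}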

[...]
>  
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_ATOMIC_H_ */
> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> b/arch/powerpc/include/asm/cmpxchg.h
> index cf091c4c22e5..181f7e8b3281 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -192,7 +192,7 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned int 
> size)
>   (unsigned long)_x_, sizeof(*(ptr)));
>  \
>})
>  
> -#define xchg_relaxed(ptr, x) \
> +#define arch_xchg_relaxed(ptr, x)\
>  ({   \
>   __typeof__(*(ptr)) _x_ = (x);   \
>   (__typeof__(*(ptr))) __xchg_relaxed((ptr),  \
> @@ -448,35 +448,7 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned 
> long new,
>   return old;
>  }
>  
> -static __always_inline unsigned long
> -__cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
> -   unsigned int size)
> -{
> - switch (size) {
> - case 1:
> - return __cmpxchg_u8_acquire(ptr, old, new);
> - case 2:
> - return __cmpxchg_u16_acquire(ptr, old, new);
> - case 4:
> - return __cmpxchg_u32_acquire(ptr, old, new);
> -#ifdef CONFIG_PPC64
> - case 8:
> - return __cmpxchg_u64_acquire(ptr, old, new);
> -#endif
> - }
> - BUILD_BUG_ON_MSG(1, "Unsupported size for __cmpxchg_acquire");
> - return old;
> -}
> -#define cmpxchg(ptr, o, n)\
> -  ({  \
> - __typeof__(*(ptr)) _o_ = (o);\
> - __typeof__(*(ptr)) _n_ = (n);\
> - (__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_,   
>  \
> - (unsigned long)_n_, sizeof(*(ptr))); \
> -  })
> -
> -

If you remove {atomic_}_cmpxchg_{,_acquire}() and use the version
provided by atomic-arch-fallback.h, then a fail cmpxchg 

[RFC] lockdep: Put graph lock/unlock under lock_recursion protection

2020-11-13 Thread Boqun Feng
A warning was hit when running xfstests/generic/068 in a Hyper-V guest:

[...] [ cut here ]
[...] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
[...] WARNING: CPU: 2 PID: 1350 at kernel/locking/lockdep.c:5280 
check_flags.part.0+0x165/0x170
[...] ...
[...] Workqueue: events pwq_unbound_release_workfn
[...] RIP: 0010:check_flags.part.0+0x165/0x170
[...] ...
[...] Call Trace:
[...]  lock_is_held_type+0x72/0x150
[...]  ? lock_acquire+0x16e/0x4a0
[...]  rcu_read_lock_sched_held+0x3f/0x80
[...]  __send_ipi_one+0x14d/0x1b0
[...]  hv_send_ipi+0x12/0x30
[...]  __pv_queued_spin_unlock_slowpath+0xd1/0x110
[...]  __raw_callee_save___pv_queued_spin_unlock_slowpath+0x11/0x20
[...]  .slowpath+0x9/0xe
[...]  lockdep_unregister_key+0x128/0x180
[...]  pwq_unbound_release_workfn+0xbb/0xf0
[...]  process_one_work+0x227/0x5c0
[...]  worker_thread+0x55/0x3c0
[...]  ? process_one_work+0x5c0/0x5c0
[...]  kthread+0x153/0x170
[...]  ? __kthread_bind_mask+0x60/0x60
[...]  ret_from_fork+0x1f/0x30

The cause of the problem is we have call chain lockdep_unregister_key()
->  lockdep_unlock() ->
arch_spin_unlock() -> __pv_queued_spin_unlock_slowpath() -> pv_kick() ->
__send_ipi_one() -> trace_hyperv_send_ipi_one().

Although this particular warning is triggered because Hyper-V has a
trace point in IPI sending, in general arch_spin_unlock() may call
another function having a trace point in it, so put the arch_spin_lock()
and arch_spin_unlock() under lock_recursion protection to fix this
problem and avoid similar ones.

Signed-off-by: Boqun Feng 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
---
 kernel/locking/lockdep.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b71ad8d9f1c9..b98e44f88c6a 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -108,19 +108,21 @@ static inline void lockdep_lock(void)
 {
DEBUG_LOCKS_WARN_ON(!irqs_disabled());
 
+   __this_cpu_inc(lockdep_recursion);
arch_spin_lock(&__lock);
__owner = current;
-   __this_cpu_inc(lockdep_recursion);
 }
 
 static inline void lockdep_unlock(void)
 {
+   DEBUG_LOCKS_WARN_ON(!irqs_disabled());
+
if (debug_locks && DEBUG_LOCKS_WARN_ON(__owner != current))
return;
 
-   __this_cpu_dec(lockdep_recursion);
__owner = NULL;
arch_spin_unlock(&__lock);
+   __this_cpu_dec(lockdep_recursion);
 }
 
 static inline bool lockdep_assert_locked(void)
-- 
2.29.2



Re: [RFC] fs: Avoid to use lockdep information if it's turned off

2020-11-11 Thread Boqun Feng
Hi David,

On Wed, Nov 11, 2020 at 03:01:21PM +0100, David Sterba wrote:
> On Tue, Nov 10, 2020 at 04:33:27PM +0100, David Sterba wrote:
> > On Tue, Nov 10, 2020 at 09:37:37AM +0800, Boqun Feng wrote:
> > 
> > I'll run another test on top of the development branch in case there are
> > unrelated lockdep warning bugs that have been fixed meanwhile.
> 
> Similar reports but earlier test and probably completely valid due to
> "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!"
> 

Thanks for trying this out. These results are as expected: first a
lockdep splat is hit, caused either by the detection of a deadlock case
or by an internal lockdep issue ("BUG: MAX_LOCKDEP_CHAIN_HLOCKS too
low!" in this case); lockdep gets turned off afterwards, and then when
__sb_start_write() wants to use the lock holding information, we detect
that, stop using the information and do a WARN_ON_ONCE().

Without this patch, __sb_start_write() will get incorrect lock holding
information, resulting in the task hangs reported by Filipe. Darrick's
patch:


https://lore.kernel.org/linux-fsdevel/160494580419.772573.9286165021627298770.stgit@magnolia/T/#t

can also fix that by not relying on the lock holding information at all
in __sb_start_write(), and I think that's a better fix.


For the "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!" warning, do you see it
every time you run xfstests without seeing any other lockdep splat? If
so, that means we are hitting the limit on the number of lockdep hlock
chains, and we should fix that.

Regards,
Boqun

> btrfs/057 [16:01:29][ 1580.146087] run fstests btrfs/057 at 
> 2020-11-10 16:01:29
> [ 1580.787867] BTRFS info (device vda): disk space caching is enabled
> [ 1580.789366] BTRFS info (device vda): has skinny extents
> [ 1581.052542] BTRFS: device fsid 84018822-2e45-4341-80be-da6d2b4e033a devid 
> 1 transid 5 /dev/vdb scanned by mkfs.btrfs (18739)
> [ 1581.105177] BTRFS info (device vdb): turning on sync discard
> [ 1581.106834] BTRFS info (device vdb): disk space caching is enabled
> [ 1581.108423] BTRFS info (device vdb): has skinny extents
> [ 1581.109799] BTRFS info (device vdb): flagging fs with big metadata feature
> [ 1581.120343] BTRFS info (device vdb): checking UUID tree
> [ 1586.942699] BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!
> [ 1586.945725] turning off the locking correctness validator.
> [ 1586.948823] Please attach the output of /proc/lock_stat to the bug report
> [ 1586.952153] CPU: 0 PID: 18771 Comm: fsstress Not tainted 
> 5.10.0-rc3-default+ #1355
> [ 1586.954919] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> [ 1586.958630] Call Trace:
> [ 1586.959214]  dump_stack+0x77/0x97
> [ 1586.960030]  add_chain_cache.cold+0x29/0x30
> [ 1586.961028]  validate_chain+0x278/0x780
> [ 1586.961979]  __lock_acquire+0x3fb/0x730
> [ 1586.962880]  lock_acquire.part.0+0xac/0x1a0
> [ 1586.963895]  ? try_to_wake_up+0x59/0x450
> [ 1586.965153]  ? rcu_read_lock_sched_held+0x3f/0x70
> [ 1586.966569]  ? lock_acquire+0xc4/0x150
> [ 1586.967699]  ? try_to_wake_up+0x59/0x450
> [ 1586.968882]  _raw_spin_lock_irqsave+0x43/0x90
> [ 1586.970207]  ? try_to_wake_up+0x59/0x450
> [ 1586.971404]  try_to_wake_up+0x59/0x450
> [ 1586.973149]  wake_up_q+0x60/0xb0
> [ 1586.974620]  __up_write+0x117/0x1d0
> [ 1586.975080] [ cut here ]
> [ 1586.976039]  btrfs_release_path+0xc8/0x180 [btrfs]
> [ 1586.977718] WARNING: CPU: 2 PID: 18772 at fs/super.c:1676 
> __sb_start_write+0x113/0x2a0
> [ 1586.979478]  __btrfs_update_delayed_inode+0x1c1/0x2c0 [btrfs]
> [ 1586.979506]  btrfs_commit_inode_delayed_inode+0x115/0x120 [btrfs]
> [ 1586.982484] Modules linked in:
> [ 1586.984080]  btrfs_evict_inode+0x1e2/0x370 [btrfs]
> [ 1586.985557]  dm_flakey
> [ 1586.986419]  ? evict+0xc3/0x220
> [ 1586.986421]  evict+0xd5/0x220
> [ 1586.986423]  vfs_rmdir.part.0+0x10c/0x180
> [ 1586.986426]  do_rmdir+0x14b/0x1b0
> [ 1586.987504]  dm_mod
> [ 1586.988244]  do_syscall_64+0x2d/0x70
> [ 1586.988947]  xxhash_generic btrfs
> [ 1586.989779]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1586.990906]  blake2b_generic
> [ 1586.991808] RIP: 0033:0x7f0ad919b5d7
> [ 1586.992451]  libcrc32c
> [ 1586.993427] Code: 73 01 c3 48 8b 0d 99 f8 0c 00 f7 d8 64 89 01 48 83 c8 ff 
> c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 54 00 00 00 0f 05 <48> 3d 
> 01 f0 ff ff 73 01 c3 48 8b 0d 69 f8 0c 00 f7 d8 64 89 01 48
> [ 1586.994380]  crc32c_intel
> [ 1586.995546] RSP: 002b:7ffc152bf368 EFLAGS: 0202 ORIG_RAX: 
> 0054
> [ 1586.996034]  xor
> [ 1586.996613] RAX: ffda RBX: 01f4 RCX: 
> 7f0ad919b5d7
> [ 1586.996615] RDX: 0

[tip: locking/urgent] lockdep: Avoid to modify chain keys in validate_chain()

2020-11-11 Thread tip-bot2 for Boqun Feng
The following commit has been merged into the locking/urgent branch of tip:

Commit-ID: d61fc96a37603384cd531622c1e89de1096b5123
Gitweb:
https://git.kernel.org/tip/d61fc96a37603384cd531622c1e89de1096b5123
Author:Boqun Feng 
AuthorDate:Mon, 02 Nov 2020 13:37:41 +08:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 10 Nov 2020 18:38:38 +01:00

lockdep: Avoid to modify chain keys in validate_chain()

Chris Wilson reported a problem spotted by check_chain_key(): a chain
key got changed in validate_chain() because we modify the ->read in
validate_chain() to skip checks for dependency adding, and ->read is
taken into calculation for chain key since commit f611e8cf98ec
("lockdep: Take read/write status in consideration when generate
chainkey").

Fix this by not modifying ->read in validate_chain(), based on two
facts: a) since we now support recursive read lock detection, there is
no need to skip checks for dependency adding for recursive readers; b)
since we have a), there is only one case left (nest_lock) where we want
to skip checks in validate_chain(), so simply remove the modification
of ->read and rely on the return value of check_deadlock() to skip the
dependency adding.

Reported-by: Chris Wilson 
Signed-off-by: Boqun Feng 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20201102053743.450459-1-boqun.f...@gmail.com
---
 kernel/locking/lockdep.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b71ad8d..d9fb9e1 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2765,7 +2765,9 @@ print_deadlock_bug(struct task_struct *curr, struct 
held_lock *prev,
  * (Note that this has to be done separately, because the graph cannot
  * detect such classes of deadlocks.)
  *
- * Returns: 0 on deadlock detected, 1 on OK, 2 on recursive read
+ * Returns: 0 on deadlock detected, 1 on OK, 2 if another lock with the same
+ * lock class is held but nest_lock is also held, i.e. we rely on the
+ * nest_lock to avoid the deadlock.
  */
 static int
 check_deadlock(struct task_struct *curr, struct held_lock *next)
@@ -2788,7 +2790,7 @@ check_deadlock(struct task_struct *curr, struct held_lock 
*next)
 * lock class (i.e. read_lock(lock)+read_lock(lock)):
 */
if ((next->read == 2) && prev->read)
-   return 2;
+   continue;
 
/*
 * We're holding the nest_lock, which serializes this lock's
@@ -3593,15 +3595,12 @@ static int validate_chain(struct task_struct *curr,
if (!ret)
return 0;
/*
-* Mark recursive read, as we jump over it when
-* building dependencies (just like we jump over
-* trylock entries):
-*/
-   if (ret == 2)
-   hlock->read = 2;
-   /*
 * Add dependency only if this lock is not the head
-* of the chain, and if it's not a secondary read-lock:
+* of the chain, and if the new lock introduces no more
+* lock dependency (because we already hold a lock with the
+* same lock class) nor deadlock (because the nest_lock
+* serializes nesting locks), see the comments for
+* check_deadlock().
 */
if (!chain_head && ret != 2) {
if (!check_prevs_add(curr, hlock))


Re: [RFC] fs: Avoid to use lockdep information if it's turned off

2020-11-09 Thread Boqun Feng
On Mon, Nov 09, 2020 at 05:49:25PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 10, 2020 at 09:37:37AM +0800, Boqun Feng wrote:
> > Filipe Manana reported a warning followed by task hanging after attempts
> > to freeze a filesystem[1]. The problem happened in a LOCKDEP=y kernel,
> > and percpu_rwsem_is_held() provided incorrect results when
> > debug_locks == 0. Although the behavior is caused by commit 4d004099a668
> > ("lockdep: Fix lockdep recursion"): after that lock_is_held() and its
> > friends always return true if debug_locks == 0. However, one could argue
> 
> ...the silent trylock conversion with no checking of the return value is
> completely broken.  I already sent a patch to tear all this out:
> 
> https://lore.kernel.org/linux-fsdevel/160494580419.772573.9286165021627298770.stgit@magnolia/T/#t
> 

Thanks! That looks good to me. I'm all for removing that piece of code.

While we are at it, I have to ask: when you hit the original problem
(the warning after the trylock in __sb_start_write()), did you see any
lockdep splat happen previously? Or, just like Filipe, did you hit it
without any earlier lockdep splat? Thanks! I'm trying to track down the
silent lockdep turn-off.

Regards,
Boqun

> --D
> 
> > that querying the lock holding information regardless if the lockdep
> > turn-off status is inappropriate in the first place. Therefore instead
> > of reverting lock_is_held() and its friends to the previous semantics,
> > add the explicit checking in fs code to avoid use the lock holding
> > information if lockdpe is turned off. And since the original problem
> > also happened with a silent lockdep turn-off, put a warning if
> > debug_locks is 0, which will help us spot the silent lockdep turn-offs.
> > 
> > [1]: 
> > https://lore.kernel.org/lkml/a5cf643b-842f-7a60-73c7-85d738a92...@suse.com/
> > 
> > Reported-by: Filipe Manana 
> > Fixes: 4d004099a668 ("lockdep: Fix lockdep recursion")
> > Signed-off-by: Boqun Feng 
> > Cc: Peter Zijlstra 
> > Cc: Jan Kara 
> > Cc: David Sterba 
> > Cc: Nikolay Borisov 
> > Cc: "Darrick J. Wong" 
> > ---
> > Hi Filipe,
> > 
> > I use the slightly different approach to fix this problem, and I think
> > it should have the similar effect with my previous fix[2], except that
> > you will hit a warning if the problem happens now. The warning is added
> > on purpose because I don't want to miss a silent lockdep turn-off.
> > 
> > Could you and other fs folks give this a try?
> > 
> > Regards,
> > Boqun
> > 
> > [2]: https://lore.kernel.org/lkml/20201103140828.GA2713762@boqun-archlinux/
> > 
> >  fs/super.c | 11 +++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index a51c2083cd6b..1803c8d999e9 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1659,12 +1659,23 @@ int __sb_start_write(struct super_block *sb, int 
> > level, bool wait)
> >  * twice in some cases, which is OK only because we already hold a
> >  * freeze protection also on higher level. Due to these cases we have
> >  * to use wait == F (trylock mode) which must not fail.
> > +*
> > +* Note: lockdep can only prove correct information if debug_locks != 0
> >  */
> > if (wait) {
> > int i;
> >  
> > for (i = 0; i < level - 1; i++)
> > if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
> > +   /*
> > +* XXX: the WARN_ON_ONCE() here is to help
> > +* track down silent lockdep turn-off, i.e.
> > +* this warning is triggered, but no lockdep
> > +* splat is reported.
> > +*/
> > +   if (WARN_ON_ONCE(!debug_locks))
> > +   break;
> > +
> > force_trylock = true;
> > break;
> > }
> > -- 
> > 2.29.2
> > 


Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-09 Thread Boqun Feng
On Mon, Nov 09, 2020 at 09:57:05AM +, Filipe Manana wrote:
> 
> 
> On 09/11/20 08:44, Boqun Feng wrote:
> > Hi Filipe,
> > 
> > On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
> >> On Wed, Nov 04, 2020 at 07:54:40PM +, Filipe Manana wrote:
> >> [...]
> >>>
> >>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> >>>
> >>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> >>> index b71ad8d9f1c9..b31d4ad482c7 100644
> >>> --- a/kernel/locking/lockdep.c
> >>> +++ b/kernel/locking/lockdep.c
> >>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> >>> LOCK_TRACE_SIZE_IN_LONGS;
> >>>
> >>> if (max_entries <= 0) {
> >>> -   if (!debug_locks_off_graph_unlock())
> >>> +   if (!debug_locks_off_graph_unlock()) {
> >>> +   WARN_ON_ONCE(1);
> >>> return NULL;
> >>> +   }
> >>>
> >>> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too 
> >>> low!");
> >>> dump_stack();
> >>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> >>> lockdep_map *lock, int read)
> >>> unsigned long flags;
> >>> int ret = 0;
> >>>
> >>> -   if (unlikely(!lockdep_enabled()))
> >>> +   if (unlikely(debug_locks && !lockdep_enabled()))
> >>> return 1; /* avoid false negative lockdep_assert_held() */
> >>>
> >>> raw_local_irq_save(flags);
> >>>
> >>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> >>> Unexpected, right?
> >>>
> >>
> >> Kinda, that means we still don't know why lockdep was turned off.
> >>
> >>> Should I try something else?
> >>>
> >>
> >> Thanks for trying this. Let me set up the reproducer on my side, and see
> >> if I could get more information.
> >>
> > 
> > I could hit this with btrfs/187, and when we hit it, lockdep will report
> > the deadlock and turn off, and I think this is the root cause for your
> > hitting the original problem, I will add some analysis after the lockdep
> > splat.
> > 
> > [12295.973309] 
> > [12295.974770] WARNING: possible recursive locking detected
> > [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
> > [12295.974770] 
> > [12295.974770] zsh/701247 is trying to acquire lock:
> > [12295.974770] 92cef43480b8 (>lock){}-{2:2}, at: 
> > btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] 
> >but task is already holding lock:
> > [12295.974770] 92cef434a038 (>lock){}-{2:2}, at: 
> > btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] 
> >other info that might help us debug this:
> > [12295.974770]  Possible unsafe locking scenario:
> > 
> > [12295.974770]CPU0
> > [12295.974770]
> > [12295.974770]   lock(>lock);
> > [12295.974770]   lock(>lock);
> > [12295.974770] 
> > *** DEADLOCK ***
> > 
> > [12295.974770]  May be due to missing lock nesting notation
> > 
> > [12295.974770] 2 locks held by zsh/701247:
> > [12295.974770]  #0: 92cef3d315b0 (>cred_guard_mutex){+.+.}-{3:3}, 
> > at: bprm_execve+0xaa/0x920
> > [12295.974770]  #1: 92cef434a038 (>lock){}-{2:2}, at: 
> > btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] 
> >stack backtrace:
> > [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 
> > 5.10.0-rc2-btrfs-next-71 #20
> > [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
> > Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
> > [12295.974770] Call Trace:
> > [12295.974770]  dump_stack+0x8b/0xb0
> > [12295.974770]  __lock_acquire.cold+0x175/0x2e9
> > [12295.974770]  lock_acquire+0x15b/0x490
> > [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]  ? read_block_for_search+0xf4/0x350 [btrfs]
> > [12295.974770]  _raw_read_lock+0x40/0xa0
> > [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]  btrf

[RFC] fs: Avoid to use lockdep information if it's turned off

2020-11-09 Thread Boqun Feng
Filipe Manana reported a warning followed by a task hanging after
attempts to freeze a filesystem[1]. The problem happened in a LOCKDEP=y
kernel, and percpu_rwsem_is_held() provided incorrect results when
debug_locks == 0. The behavior is caused by commit 4d004099a668
("lockdep: Fix lockdep recursion"): after that, lock_is_held() and its
friends always return true if debug_locks == 0. However, one could argue
that querying the lock holding information regardless of the lockdep
turn-off status is inappropriate in the first place. Therefore, instead
of reverting lock_is_held() and its friends to the previous semantics,
add explicit checking in the fs code to avoid using the lock holding
information if lockdep is turned off. And since the original problem
also happened with a silent lockdep turn-off, put a warning there if
debug_locks is 0, which will help us spot silent lockdep turn-offs.

[1]: https://lore.kernel.org/lkml/a5cf643b-842f-7a60-73c7-85d738a92...@suse.com/

Reported-by: Filipe Manana 
Fixes: 4d004099a668 ("lockdep: Fix lockdep recursion")
Signed-off-by: Boqun Feng 
Cc: Peter Zijlstra 
Cc: Jan Kara 
Cc: David Sterba 
Cc: Nikolay Borisov 
Cc: "Darrick J. Wong" 
---
Hi Filipe,

I use a slightly different approach to fix this problem, and I think it
should have a similar effect to my previous fix[2], except that you will
now hit a warning if the problem happens. The warning is added on
purpose because I don't want to miss a silent lockdep turn-off.

Could you and other fs folks give this a try?

Regards,
Boqun

[2]: https://lore.kernel.org/lkml/20201103140828.GA2713762@boqun-archlinux/

 fs/super.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index a51c2083cd6b..1803c8d999e9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1659,12 +1659,23 @@ int __sb_start_write(struct super_block *sb, int level, 
bool wait)
 * twice in some cases, which is OK only because we already hold a
 * freeze protection also on higher level. Due to these cases we have
 * to use wait == F (trylock mode) which must not fail.
+*
+* Note: lockdep can only prove correct information if debug_locks != 0
 */
if (wait) {
int i;
 
for (i = 0; i < level - 1; i++)
if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
+   /*
+* XXX: the WARN_ON_ONCE() here is to help
+* track down silent lockdep turn-off, i.e.
+* this warning is triggered, but no lockdep
+* splat is reported.
+*/
+   if (WARN_ON_ONCE(!debug_locks))
+   break;
+
force_trylock = true;
break;
}
-- 
2.29.2



Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-09 Thread Boqun Feng
Hi Filipe,

On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
> On Wed, Nov 04, 2020 at 07:54:40PM +, Filipe Manana wrote:
> [...]
> > 
> > Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> > 
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index b71ad8d9f1c9..b31d4ad482c7 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> > LOCK_TRACE_SIZE_IN_LONGS;
> > 
> > if (max_entries <= 0) {
> > -   if (!debug_locks_off_graph_unlock())
> > +   if (!debug_locks_off_graph_unlock()) {
> > +   WARN_ON_ONCE(1);
> > return NULL;
> > +   }
> > 
> > print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> > dump_stack();
> > @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> > lockdep_map *lock, int read)
> > unsigned long flags;
> > int ret = 0;
> > 
> > -   if (unlikely(!lockdep_enabled()))
> > +   if (unlikely(debug_locks && !lockdep_enabled()))
> > return 1; /* avoid false negative lockdep_assert_held() */
> > 
> > raw_local_irq_save(flags);
> > 
> > With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> > Unexpected, right?
> > 
> 
> Kinda, that means we still don't know why lockdep was turned off.
> 
> > Should I try something else?
> > 
> 
> Thanks for trying this. Let me set up the reproducer on my side, and see
> if I could get more information.
> 

I could hit this with btrfs/187; when we hit it, lockdep reports the
deadlock and turns itself off, and I think this is the root cause of
your hitting the original problem. I will add some analysis after the
lockdep splat.

[12295.973309] 
[12295.974770] WARNING: possible recursive locking detected
[12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
[12295.974770] 
[12295.974770] zsh/701247 is trying to acquire lock:
[12295.974770] 92cef43480b8 (>lock){}-{2:2}, at: 
btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] 
   but task is already holding lock:
[12295.974770] 92cef434a038 (>lock){}-{2:2}, at: 
btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] 
   other info that might help us debug this:
[12295.974770]  Possible unsafe locking scenario:

[12295.974770]CPU0
[12295.974770]
[12295.974770]   lock(>lock);
[12295.974770]   lock(>lock);
[12295.974770] 
*** DEADLOCK ***

[12295.974770]  May be due to missing lock nesting notation

[12295.974770] 2 locks held by zsh/701247:
[12295.974770]  #0: 92cef3d315b0 (>cred_guard_mutex){+.+.}-{3:3}, at: 
bprm_execve+0xaa/0x920
[12295.974770]  #1: 92cef434a038 (>lock){}-{2:2}, at: 
btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] 
   stack backtrace:
[12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 
5.10.0-rc2-btrfs-next-71 #20
[12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[12295.974770] Call Trace:
[12295.974770]  dump_stack+0x8b/0xb0
[12295.974770]  __lock_acquire.cold+0x175/0x2e9
[12295.974770]  lock_acquire+0x15b/0x490
[12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  ? read_block_for_search+0xf4/0x350 [btrfs]
[12295.974770]  _raw_read_lock+0x40/0xa0
[12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  btrfs_search_slot+0x6ac/0xca0 [btrfs]
[12295.974770]  btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
[12295.974770]  btrfs_getxattr+0x67/0x130 [btrfs]
[12295.974770]  __vfs_getxattr+0x53/0x70
[12295.974770]  get_vfs_caps_from_disk+0x68/0x1a0
[12295.974770]  ? sched_clock_cpu+0x114/0x180
[12295.974770]  cap_bprm_creds_from_file+0x181/0x6c0
[12295.974770]  security_bprm_creds_from_file+0x2a/0x40
[12295.974770]  begin_new_exec+0xf4/0xc40
[12295.974770]  ? load_elf_phdrs+0x6b/0xb0
[12295.974770]  load_elf_binary+0x66b/0x1620
[12295.974770]  ? read_hv_sched_clock_tsc+0x5/0x20
[12295.974770]  ? sched_clock+0x5/0x10
[12295.974770]  ? sched_clock_local+0x12/0x80
[12295.974770]  ? sched_clock_cpu+0x114/0x180
[12295.974770]  bprm_execve+0x3ce/0x920
[12295.974770]  do_execveat_common+0x1b0/0x1f0
[12295.974770]  __x64_sys_execve+0x39/0x50
[12295.974770]  do_syscall_64+0x33/0x80
[12295.974770]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[12295.974770] RIP: 0033:0x7f6aaefc13cb

Re: [PATCH memory-model 5/8] tools/memory-model: Add a glossary of LKMM terms

2020-11-06 Thread Boqun Feng
On Fri, Nov 06, 2020 at 10:01:02AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 06, 2020 at 09:47:22AM +0800, Boqun Feng wrote:
> > On Thu, Nov 05, 2020 at 02:00:14PM -0800, paul...@kernel.org wrote:
> > > From: "Paul E. McKenney" 
> > > 
> > > Signed-off-by: Paul E. McKenney 
> > > ---
> > >  tools/memory-model/Documentation/glossary.txt | 155 
> > > ++
> > >  1 file changed, 155 insertions(+)
> > >  create mode 100644 tools/memory-model/Documentation/glossary.txt
> > > 
> > > diff --git a/tools/memory-model/Documentation/glossary.txt 
> > > b/tools/memory-model/Documentation/glossary.txt
> > > new file mode 100644
> > > index 000..036fa28
> > > --- /dev/null
> > > +++ b/tools/memory-model/Documentation/glossary.txt
> > > @@ -0,0 +1,155 @@
> > > +This document contains brief definitions of LKMM-related terms.  Like 
> > > most
> > > +glossaries, it is not intended to be read front to back (except perhaps
> > > +as a way of confirming a diagnosis of OCD), but rather to be searched
> > > +for specific terms.
> > > +
> > > +
> > > +Address Dependency:  When the address of a later memory access is 
> > > computed
> > > + based on the value returned by an earlier load, an "address
> > > + dependency" extends from that load extending to the later access.
> > > + Address dependencies are quite common in RCU read-side critical
> > > + sections:
> > > +
> > > +  1 rcu_read_lock();
> > > +  2 p = rcu_dereference(gp);
> > > +  3 do_something(p->a);
> > > +  4 rcu_read_unlock();
> > > +
> > > +  In this case, because the address of "p->a" on line 3 is computed
> > > +  from the value returned by the rcu_dereference() on line 2, the
> > > +  address dependency extends from that rcu_dereference() to that
> > > +  "p->a".  In rare cases, optimizing compilers can destroy address
> > > +  dependencies.  Please see Documentation/RCU/rcu_dereference.txt
> > > +  for more information.
> > > +
> > > +  See also "Control Dependency".
> > > +
> > > +Acquire:  With respect to a lock, acquiring that lock, for example,
> > > + using spin_lock().  With respect to a non-lock shared variable,
> > > + a special operation that includes a load and which orders that
> > > + load before later memory references running on that same CPU.
> > > + An example special acquire operation is smp_load_acquire(),
> > > + but atomic_read_acquire() and atomic_xchg_acquire() also include
> > > + acquire loads.
> > > +
> > > + When an acquire load returns the value stored by a release store
> > > + to that same variable, then all operations preceding that store
> > 
> > Change this to:
> > 
> > When an acquire load reads-from a release store
> > 
> > , and put a reference to "Reads-from"? I think this makes the document
> > more consistent in that it makes clear "an acquire load returns the
> > value stored by a release store to the same variable" is not a special
> > case, it's simple a "Reads-from".
> > 
> > > + happen before any operations following that load acquire.
> > 
> > Add a reference to the definition of "happen before" in explanation.txt?
> 
> How about as shown below?  I currently am carrying this as a separate
> commit, but I might merge it into this one later on.
> 

Looks good to me, thanks!

Regards,
Boqun

>   Thanx, Paul
> 
> 
> 
> commit 774a52cd3d80d6b657ae6c14c10bd9fc437068f3
> Author: Paul E. McKenney 
> Date:   Fri Nov 6 09:58:01 2020 -0800
> 
> tools/memory-model: Tie acquire loads to reads-from
> 
> This commit explicitly makes the connection between acquire loads and
> the reads-from relation.  It also adds an entry for happens-before,
> and refers to the corresponding section of explanation.txt.
> 
> Reported-by: Boqun Feng 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/tools/memory-model/Documentation/glossary.txt 
> b/tools/memory-model/Documentation/glossary.txt
> index 3924aca..383151b 100644
> --- a/tools/memory-model/Documentation/glossary.txt
> +++ b/tools/memory-model/Documentation/glossary.txt
> @@ -33,10 +33,11 @@ Acquire:  With respect to a lock, acquiring that lock, 
> for
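
As a side note on the wording being discussed: the "reads-from plus happens-before" formulation can be illustrated with a kernel-style sketch (the data/flag variables and the two functions below are invented for the example and are not part of the patch):

int data;
int flag;

void writer(void)
{
        data = 42;                      /* A: plain store   */
        smp_store_release(&flag, 1);    /* B: release store */
}

void reader(void)
{
        if (smp_load_acquire(&flag)) {  /* C: acquire load  */
                /*
                 * If C reads-from B, everything preceding B (including
                 * the store to data) happens before everything following
                 * C, so this check can never fire.
                 */
                WARN_ON(data != 42);
        }
}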

Re: [PATCH v2] kcsan: Fix encoding masks and regain address bit

2020-11-06 Thread Boqun Feng
On Fri, Nov 06, 2020 at 10:34:56AM +0100, Marco Elver wrote:
> The watchpoint encoding masks for size and address were off-by-one bit
> each, with the size mask using 1 unnecessary bit and the address mask
> missing 1 bit. However, due to the way the size is shifted into the
> encoded watchpoint, we were effectively wasting and never using the
> extra bit.
> 
> For example, on x86 with PAGE_SIZE==4K, we have 1 bit for the is-write
> bit, 14 bits for the size bits, and then 49 bits left for the address.
> Prior to this fix we would end up with this usage:
> 
>   [ write<1> | size<14> | wasted<1> | address<48> ]
> 
> Fix it by subtracting 1 bit from the GENMASK() end and start ranges of
> size and address respectively. The added static_assert()s verify that
> the masks are as expected. With the fixed version, we get the expected
> usage:
> 
>   [ write<1> | size<14> | address<49> ]
> 
> Functionally no change is expected, since that extra address bit is
> insignificant for enabled architectures.
> 
> Signed-off-by: Marco Elver 

Acked-by: Boqun Feng 

Regards,
Boqun

> ---
> v2:
> * Use WATCHPOINT_ADDR_BITS to avoid duplicating "BITS_PER_LONG-1 -
>   WATCHPOINT_SIZE_BITS" per Boqun's suggestion.
> ---
>  kernel/kcsan/encoding.h | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/kcsan/encoding.h b/kernel/kcsan/encoding.h
> index 4f73db6d1407..7ee405524904 100644
> --- a/kernel/kcsan/encoding.h
> +++ b/kernel/kcsan/encoding.h
> @@ -37,14 +37,12 @@
>   */
>  #define WATCHPOINT_ADDR_BITS (BITS_PER_LONG-1 - WATCHPOINT_SIZE_BITS)
>  
> -/*
> - * Masks to set/retrieve the encoded data.
> - */
> -#define WATCHPOINT_WRITE_MASK BIT(BITS_PER_LONG-1)
> -#define WATCHPOINT_SIZE_MASK 
>   \
> - GENMASK(BITS_PER_LONG-2, BITS_PER_LONG-2 - WATCHPOINT_SIZE_BITS)
> -#define WATCHPOINT_ADDR_MASK 
>   \
> - GENMASK(BITS_PER_LONG-3 - WATCHPOINT_SIZE_BITS, 0)
> +/* Bitmasks for the encoded watchpoint access information. */
> +#define WATCHPOINT_WRITE_MASKBIT(BITS_PER_LONG-1)
> +#define WATCHPOINT_SIZE_MASK GENMASK(BITS_PER_LONG-2, WATCHPOINT_ADDR_BITS)
> +#define WATCHPOINT_ADDR_MASK GENMASK(WATCHPOINT_ADDR_BITS-1, 0)
> +static_assert(WATCHPOINT_ADDR_MASK == (1UL << WATCHPOINT_ADDR_BITS) - 1);
> +static_assert((WATCHPOINT_WRITE_MASK ^ WATCHPOINT_SIZE_MASK ^ 
> WATCHPOINT_ADDR_MASK) == ~0UL);
>  
>  static inline bool check_encodable(unsigned long addr, size_t size)
>  {
> -- 
> 2.29.2.222.g5d2a92d10f8-goog
> 
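
For anyone who wants to experiment with the layout described above, here is a small standalone sketch of the [ write<1> | size<14> | address<49> ] encoding on a 64-bit word; the names are local to this example and it is not the actual kernel/kcsan/encoding.h code:

#include <stdbool.h>
#include <stdint.h>

#define EX_SIZE_BITS    14
#define EX_ADDR_BITS    (64 - 1 - EX_SIZE_BITS)         /* 49 */

#define EX_WRITE_MASK   (1ULL << 63)
#define EX_SIZE_MASK    (((1ULL << EX_SIZE_BITS) - 1) << EX_ADDR_BITS)
#define EX_ADDR_MASK    ((1ULL << EX_ADDR_BITS) - 1)

static uint64_t ex_encode(bool is_write, uint64_t size, uint64_t addr)
{
        return (is_write ? EX_WRITE_MASK : 0) |
               ((size << EX_ADDR_BITS) & EX_SIZE_MASK) |
               (addr & EX_ADDR_MASK);
}

static void ex_decode(uint64_t wp, bool *is_write, uint64_t *size, uint64_t *addr)
{
        *is_write = !!(wp & EX_WRITE_MASK);
        *size     = (wp & EX_SIZE_MASK) >> EX_ADDR_BITS;
        *addr     = wp & EX_ADDR_MASK;
}

The three masks are disjoint and together cover all 64 bits, mirroring the static_assert()s added by the patch.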


Re: [PATCH kcsan 3/3] kcsan: Fix encoding masks and regain address bit

2020-11-06 Thread Boqun Feng
On Fri, Nov 06, 2020 at 10:03:21AM +0100, Marco Elver wrote:
> On Fri, 6 Nov 2020 at 02:23, Boqun Feng  wrote:
> > Hi Marco,
> >
> > On Thu, Nov 05, 2020 at 02:03:24PM -0800, paul...@kernel.org wrote:
> > > From: Marco Elver 
> > >
> > > The watchpoint encoding masks for size and address were off-by-one bit
> > > each, with the size mask using 1 unnecessary bit and the address mask
> > > missing 1 bit. However, due to the way the size is shifted into the
> > > encoded watchpoint, we were effectively wasting and never using the
> > > extra bit.
> > >
> > > For example, on x86 with PAGE_SIZE==4K, we have 1 bit for the is-write
> > > bit, 14 bits for the size bits, and then 49 bits left for the address.
> > > Prior to this fix we would end up with this usage:
> > >
> > >   [ write<1> | size<14> | wasted<1> | address<48> ]
> > >
> > > Fix it by subtracting 1 bit from the GENMASK() end and start ranges of
> > > size and address respectively. The added static_assert()s verify that
> > > the masks are as expected. With the fixed version, we get the expected
> > > usage:
> > >
> > >   [ write<1> | size<14> | address<49> ]
> > >
> > > Functionally no change is expected, since that extra address bit is
> > > insignificant for enabled architectures.
> > >
> > > Signed-off-by: Marco Elver 
> > > Signed-off-by: Paul E. McKenney 
> > > ---
> > >  kernel/kcsan/encoding.h | 14 ++
> > >  1 file changed, 6 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/kernel/kcsan/encoding.h b/kernel/kcsan/encoding.h
> > > index 4f73db6..b50bda9 100644
> > > --- a/kernel/kcsan/encoding.h
> > > +++ b/kernel/kcsan/encoding.h
> > > @@ -37,14 +37,12 @@
> > >   */
> > >  #define WATCHPOINT_ADDR_BITS (BITS_PER_LONG-1 - WATCHPOINT_SIZE_BITS)
> > >
> > > -/*
> > > - * Masks to set/retrieve the encoded data.
> > > - */
> > > -#define WATCHPOINT_WRITE_MASK BIT(BITS_PER_LONG-1)
> > > -#define WATCHPOINT_SIZE_MASK 
> > >   \
> > > - GENMASK(BITS_PER_LONG-2, BITS_PER_LONG-2 - WATCHPOINT_SIZE_BITS)
> > > -#define WATCHPOINT_ADDR_MASK 
> > >   \
> > > - GENMASK(BITS_PER_LONG-3 - WATCHPOINT_SIZE_BITS, 0)
> > > +/* Bitmasks for the encoded watchpoint access information. */
> > > +#define WATCHPOINT_WRITE_MASKBIT(BITS_PER_LONG-1)
> > > +#define WATCHPOINT_SIZE_MASK GENMASK(BITS_PER_LONG-2, BITS_PER_LONG-1 - 
> > > WATCHPOINT_SIZE_BITS)
> > > +#define WATCHPOINT_ADDR_MASK GENMASK(BITS_PER_LONG-2 - 
> > > WATCHPOINT_SIZE_BITS, 0)
> > > +static_assert(WATCHPOINT_ADDR_MASK == (1UL << WATCHPOINT_ADDR_BITS) - 1);
> >
> > Nit:
> >
> > Since you use the static_assert(), why not define WATCHPOINT_ADDR_MASK
> > as:
> >
> > #define WATCHPOINT_ADDR_MASK (BIT(WATCHPOINT_SIZE_BITS) - 1)
> 
> This is incorrect, as the static_assert()s would have indicated. It
> should probably be (BIT(WATCHPOINT_ADDR_BITS) - 1)?
> 
> As an aside, I explicitly did *not* want to use additional arithmetic
> to generate the masks but purely rely on BIT(), and GENMASK(), as it
> would be inconsistent otherwise. The static_assert()s then sanity
> check everything without BIT+GENMASK (because I've grown slightly
> paranoid about off-by-1s here). So I'd rather not start bikeshedding
> about which way around things should go.
> 
> In general, GENMASK() is safer, because subtracting 1 to get the mask
> doesn't always work, specifically e.g. (BIT(BITS_PER_LONG) - 1) does
> not work.
> 
> > Besides, WATCHPOINT_SIZE_MASK can also be defined as:
> 
> No, sorry it cannot.
> 
> > #define WATCHPOINT_SIZE_MASK GENMASK(BITS_PER_LONG - 2, 
> > WATCHPOINT_SIZE_BITS)
> 
>GENMASK(BITS_PER_LONG - 2, WATCHPOINT_SIZE_BITS)
> 
> is not equivalent to the current
> 
>   GENMASK(BITS_PER_LONG-2, BITS_PER_LONG-1 - WATCHPOINT_SIZE_BITS)
> 
> Did you mean GENMASK(BITS_PER_LONG-2, WATCHPOINT_ADDR_BITS)? I can

You're right! Guess I should double-check what vim completes for me
;-) And I agree with you on preferring GENMASK().

> send a v2 for this one.

Let me add an ack for that one, thanks!

Regards,
Boqun

> 
> Thanks,
> -- Marco
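
To spell out why the subtract-one idiom is not a general replacement for GENMASK() (a standalone illustration, not kernel code):

#define EX_BITS_PER_LONG        (8 * sizeof(unsigned long))

unsigned long low10    = (1UL << 10) - 1;       /* bits 9..0: fine */
/* unsigned long bad = (1UL << EX_BITS_PER_LONG) - 1; */
/* ^ undefined behaviour: the shift count equals the width of the type. */
unsigned long all_bits = ~0UL;  /* what GENMASK(BITS_PER_LONG-1, 0) evaluates to */

GENMASK() sidesteps the boundary case entirely, which is why it is preferred for mask definitions like these.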


Re: [PATCH memory-model 5/8] tools/memory-model: Add a glossary of LKMM terms

2020-11-05 Thread Boqun Feng
On Thu, Nov 05, 2020 at 02:00:14PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> Signed-off-by: Paul E. McKenney 
> ---
>  tools/memory-model/Documentation/glossary.txt | 155 
> ++
>  1 file changed, 155 insertions(+)
>  create mode 100644 tools/memory-model/Documentation/glossary.txt
> 
> diff --git a/tools/memory-model/Documentation/glossary.txt 
> b/tools/memory-model/Documentation/glossary.txt
> new file mode 100644
> index 000..036fa28
> --- /dev/null
> +++ b/tools/memory-model/Documentation/glossary.txt
> @@ -0,0 +1,155 @@
> +This document contains brief definitions of LKMM-related terms.  Like most
> +glossaries, it is not intended to be read front to back (except perhaps
> +as a way of confirming a diagnosis of OCD), but rather to be searched
> +for specific terms.
> +
> +
> +Address Dependency:  When the address of a later memory access is computed
> + based on the value returned by an earlier load, an "address
> + dependency" extends from that load extending to the later access.
> + Address dependencies are quite common in RCU read-side critical
> + sections:
> +
> +  1 rcu_read_lock();
> +  2 p = rcu_dereference(gp);
> +  3 do_something(p->a);
> +  4 rcu_read_unlock();
> +
> +  In this case, because the address of "p->a" on line 3 is computed
> +  from the value returned by the rcu_dereference() on line 2, the
> +  address dependency extends from that rcu_dereference() to that
> +  "p->a".  In rare cases, optimizing compilers can destroy address
> +  dependencies.  Please see Documentation/RCU/rcu_dereference.txt
> +  for more information.
> +
> +  See also "Control Dependency".
> +
> +Acquire:  With respect to a lock, acquiring that lock, for example,
> + using spin_lock().  With respect to a non-lock shared variable,
> + a special operation that includes a load and which orders that
> + load before later memory references running on that same CPU.
> + An example special acquire operation is smp_load_acquire(),
> + but atomic_read_acquire() and atomic_xchg_acquire() also include
> + acquire loads.
> +
> + When an acquire load returns the value stored by a release store
> + to that same variable, then all operations preceding that store

Change this to:

When an acquire load reads-from a release store

, and put a reference to "Reads-from"? I think this makes the document
more consistent in that it makes clear "an acquire load returns the
value stored by a release store to the same variable" is not a special
case, it's simply a "Reads-from".

> + happen before any operations following that load acquire.

Add a reference to the definition of "happen before" in explanation.txt?

Regards,
Boqun

> +
> + See also "Relaxed" and "Release".
> +
[...]
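
As a companion to the "Address Dependency" entry quoted above, here is a sketch of the classic way an optimizing compiler can destroy such a dependency; the pointer comparison is added purely for illustration, and do_something() is the placeholder from the glossary example (see Documentation/RCU/rcu_dereference.txt for the full discussion):

struct foo { int a; };
struct foo default_foo;
struct foo __rcu *gp;

void reader(void)
{
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(gp);
        if (p == &default_foo) {
                /*
                 * After the comparison the compiler may use default_foo.a
                 * instead of p->a, so the access no longer depends on the
                 * value returned by rcu_dereference() and the ordering
                 * guarantee is lost on weakly ordered systems.
                 */
                do_something(p->a);
        }
        rcu_read_unlock();
}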


Re: [PATCH kcsan 3/3] kcsan: Fix encoding masks and regain address bit

2020-11-05 Thread Boqun Feng
Hi Marco,

On Thu, Nov 05, 2020 at 02:03:24PM -0800, paul...@kernel.org wrote:
> From: Marco Elver 
> 
> The watchpoint encoding masks for size and address were off-by-one bit
> each, with the size mask using 1 unnecessary bit and the address mask
> missing 1 bit. However, due to the way the size is shifted into the
> encoded watchpoint, we were effectively wasting and never using the
> extra bit.
> 
> For example, on x86 with PAGE_SIZE==4K, we have 1 bit for the is-write
> bit, 14 bits for the size bits, and then 49 bits left for the address.
> Prior to this fix we would end up with this usage:
> 
>   [ write<1> | size<14> | wasted<1> | address<48> ]
> 
> Fix it by subtracting 1 bit from the GENMASK() end and start ranges of
> size and address respectively. The added static_assert()s verify that
> the masks are as expected. With the fixed version, we get the expected
> usage:
> 
>   [ write<1> | size<14> | address<49> ]
> 
> Functionally no change is expected, since that extra address bit is
> insignificant for enabled architectures.
> 
> Signed-off-by: Marco Elver 
> Signed-off-by: Paul E. McKenney 
> ---
>  kernel/kcsan/encoding.h | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/kcsan/encoding.h b/kernel/kcsan/encoding.h
> index 4f73db6..b50bda9 100644
> --- a/kernel/kcsan/encoding.h
> +++ b/kernel/kcsan/encoding.h
> @@ -37,14 +37,12 @@
>   */
>  #define WATCHPOINT_ADDR_BITS (BITS_PER_LONG-1 - WATCHPOINT_SIZE_BITS)
>  
> -/*
> - * Masks to set/retrieve the encoded data.
> - */
> -#define WATCHPOINT_WRITE_MASK BIT(BITS_PER_LONG-1)
> -#define WATCHPOINT_SIZE_MASK 
>   \
> - GENMASK(BITS_PER_LONG-2, BITS_PER_LONG-2 - WATCHPOINT_SIZE_BITS)
> -#define WATCHPOINT_ADDR_MASK 
>   \
> - GENMASK(BITS_PER_LONG-3 - WATCHPOINT_SIZE_BITS, 0)
> +/* Bitmasks for the encoded watchpoint access information. */
> +#define WATCHPOINT_WRITE_MASKBIT(BITS_PER_LONG-1)
> +#define WATCHPOINT_SIZE_MASK GENMASK(BITS_PER_LONG-2, BITS_PER_LONG-1 - 
> WATCHPOINT_SIZE_BITS)
> +#define WATCHPOINT_ADDR_MASK GENMASK(BITS_PER_LONG-2 - WATCHPOINT_SIZE_BITS, 
> 0)
> +static_assert(WATCHPOINT_ADDR_MASK == (1UL << WATCHPOINT_ADDR_BITS) - 1);

Nit:

Since you use the static_assert(), why not define WATCHPOINT_ADDR_MASK
as:

#define WATCHPOINT_ADDR_MASK (BIT(WATCHPOINT_SIZE_BITS) - 1)

Besides, WATCHPOINT_SIZE_MASK can also be defined as:

#define WATCHPOINT_SIZE_MASK GENMASK(BITS_PER_LONG - 2, WATCHPOINT_SIZE_BITS)

Regards,
Boqun

> +static_assert((WATCHPOINT_WRITE_MASK ^ WATCHPOINT_SIZE_MASK ^ 
> WATCHPOINT_ADDR_MASK) == ~0UL);
>  
>  static inline bool check_encodable(unsigned long addr, size_t size)
>  {
> -- 
> 2.9.5
> 


Re: [PATCH 1/2] lockdep: Avoid to modify chain keys in validate_chain()

2020-11-04 Thread Boqun Feng
Hi Chris,

Could you try this to see if it fixes the problem? Thanks!

Regards,
Boqun

On Mon, Nov 02, 2020 at 01:37:41PM +0800, Boqun Feng wrote:
> Chris Wilson reported a problem spotted by check_chain_key(): a chain
> key got changed in validate_chain() because we modify the ->read in
> validate_chain() to skip checks for dependency adding, and ->read is
> taken into calculation for chain key since commit f611e8cf98ec
> ("lockdep: Take read/write status in consideration when generate
> chainkey").
> 
> Fix this by avoiding to modify ->read in validate_chain() based on two
> facts: a) since we now support recursive read lock detection, there is
> no need to skip checks for dependency adding for recursive readers, b)
> since we have a), there is only one case left (nest_lock) where we want
> to skip checks in validate_chain(), we simply remove the modification
> for ->read and rely on the return value of check_deadlock() to skip the
> dependency adding.
> 
> Reported-by: Chris Wilson 
> Signed-off-by: Boqun Feng 
> Cc: Peter Zijlstra 
> ---
> Peter,
> 
> I managed to get a reproducer for the problem Chris reported, please see
> patch #2. With this patch, that problem gets fixed.
> 
> This small patchset is based on your locking/core, patch #2 actually
> relies on your "s/raw_spin/spin" changes, thanks for taking care of that
> ;-)
> 
> Regards,
> Boqun
> 
>  kernel/locking/lockdep.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..a294326fd998 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2765,7 +2765,9 @@ print_deadlock_bug(struct task_struct *curr, struct 
> held_lock *prev,
>   * (Note that this has to be done separately, because the graph cannot
>   * detect such classes of deadlocks.)
>   *
> - * Returns: 0 on deadlock detected, 1 on OK, 2 on recursive read
> + * Returns: 0 on deadlock detected, 1 on OK, 2 if another lock with the same
> + * lock class is held but nest_lock is also held, i.e. we rely on the
> + * nest_lock to avoid the deadlock.
>   */
>  static int
>  check_deadlock(struct task_struct *curr, struct held_lock *next)
> @@ -2788,7 +2790,7 @@ check_deadlock(struct task_struct *curr, struct 
> held_lock *next)
>* lock class (i.e. read_lock(lock)+read_lock(lock)):
>*/
>   if ((next->read == 2) && prev->read)
> - return 2;
> + continue;
>  
>   /*
>* We're holding the nest_lock, which serializes this lock's
> @@ -3592,16 +3594,13 @@ static int validate_chain(struct task_struct *curr,
>  
>   if (!ret)
>   return 0;
> - /*
> -  * Mark recursive read, as we jump over it when
> -  * building dependencies (just like we jump over
> -  * trylock entries):
> -  */
> - if (ret == 2)
> - hlock->read = 2;
>   /*
>* Add dependency only if this lock is not the head
> -  * of the chain, and if it's not a secondary read-lock:
> +  * of the chain, and if the new lock introduces no more
> +  * lock dependency (because we already hold a lock with the
> +  * same lock class) nor deadlock (because the nest_lock
> +  * serializes nesting locks), see the comments for
> +  * check_deadlock().
>*/
>   if (!chain_head && ret != 2) {
>   if (!check_prevs_add(curr, hlock))
> -- 
> 2.28.0
> 


Re: possible deadlock in send_sigurg (2)

2020-11-04 Thread Boqun Feng
Hi,

On Wed, Nov 04, 2020 at 04:18:08AM -0800, syzbot wrote:
> syzbot has bisected this issue to:
> 
> commit e918188611f073063415f40fae568fa4d86d9044
> Author: Boqun Feng 
> Date:   Fri Aug 7 07:42:20 2020 +
> 
> locking: More accurate annotations for read_lock()
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1414273250
> start commit:   4ef8451b Merge tag 'perf-tools-for-v5.10-2020-11-03' of gi..
> git tree:   upstream
> final oops: https://syzkaller.appspot.com/x/report.txt?x=1614273250
> console output: https://syzkaller.appspot.com/x/log.txt?x=1214273250
> kernel config:  https://syzkaller.appspot.com/x/.config?x=61033507391c77ff
> dashboard link: https://syzkaller.appspot.com/bug?extid=c5e32344981ad9f33750
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1519786250
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13c59f6c50
> 
> Reported-by: syzbot+c5e32344981ad9f33...@syzkaller.appspotmail.com
> Fixes: e918188611f0 ("locking: More accurate annotations for read_lock()")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

Thanks for reporting this. This is actually a potential deadlock
detected by the newly added recursive read deadlock detection; see my
analysis:


https://lore.kernel.org/lkml/20200910071523.gf7...@debian-boqun.qqnc3lrjykvubdpftowmye0fmh.lx.internal.cloudapp.net

Besides, other reports[1][2] are caused by the same problem. I made a
fix for this; please give it a try and see if the problem gets fixed.

Regards,
Boqun

[1]: https://lore.kernel.org/lkml/d7136005aee14...@google.com
[2]: https://lore.kernel.org/lkml/6e29ed05b3009...@google.com

-->8
From 7fbe730fcff2d7909be034cf6dc8bf0604d0bf14 Mon Sep 17 00:00:00 2001
From: Boqun Feng 
Date: Thu, 5 Nov 2020 14:02:57 +0800
Subject: [PATCH] fs/fcntl: Fix potential deadlock in send_sig{io, urg}()

Syzbot reports a potential deadlock found by the newly added recursive
read deadlock detection in lockdep:

[...] 
[...] WARNING: possible irq lock inversion dependency detected
[...] 5.9.0-rc2-syzkaller #0 Not tainted
[...] 
[...] syz-executor.1/10214 just changed the state of lock:
[...] 88811f506338 (>f_owner.lock){.+..}-{2:2}, at: 
send_sigurg+0x1d/0x200
[...] but this lock was taken by another, HARDIRQ-safe lock in the past:
[...]  (>event_lock){-...}-{2:2}
[...]
[...]
[...] and interrupts could create inverse lock ordering between them.
[...]
[...]
[...] other info that might help us debug this:
[...] Chain exists of:
[...]   &dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
[...]
[...]  Possible interrupt unsafe locking scenario:
[...]
[...]        CPU0                    CPU1
[...]        ----                    ----
[...]   lock(&f->f_owner.lock);
[...]                                local_irq_disable();
[...]                                lock(&dev->event_lock);
[...]                                lock(&new->fa_lock);
[...]   <Interrupt>
[...]     lock(&dev->event_lock);
[...]
[...]  *** DEADLOCK ***

The corresponding deadlock case is as follows:

CPU 0                   CPU 1                           CPU 2
read_lock(&fown->lock);
                        spin_lock_irqsave(&dev->event_lock, ...)
                                                        write_lock_irq(&f->f_owner.lock); // wait for the lock
                        read_lock(&fown->lock); // have to wait until the writer release
                                                // due to the fairness
<interrupt>
spin_lock_irqsave(&dev->event_lock); // wait for the lock

The lock dependency on CPU 1 happens if there exists a call sequence:

input_inject_event():
  spin_lock_irqsave(&dev->event_lock, ...);
  input_handle_event():
    input_pass_values():
      input_to_handler():
        handler->event(): // evdev_event()
          evdev_pass_values():
            spin_lock(&client->buffer_lock);
            __pass_event():
              kill_fasync():
                kill_fasync_rcu():
                  read_lock(&fa->fa_lock);
                  send_sigio():
                    read_lock(&fown->lock);

To fix this, make the reader in send_sigurg() and send_sigio() use
read_lock_irqsave() and read_unlock_irqrestore().

Reported-by: syzbot+22e87cdf94021b984...@syzkaller.appspotmail.com
Reported-by: syzbot+c5e32344981ad9f33...@syzkaller.appspotmail.com
Signed-off-by: Boqun Feng 
---
 fs/fcntl.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 19ac5baad50f..05b36b28f2e8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -781,9 +781,10 @@ void send_sigio(struct fown_stru
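
A sketch of the shape of the change (function body abridged; the real code lives in fs/fcntl.c, and the same pattern applies to send_sigurg()):

void send_sigio(struct fown_struct *fown, int fd, int band)
{
        unsigned long flags;

        read_lock_irqsave(&fown->lock, flags);      /* was: read_lock(&fown->lock)   */
        /* ... walk the owner PID(s) and deliver SIGIO ... */
        read_unlock_irqrestore(&fown->lock, flags); /* was: read_unlock(&fown->lock) */
}

With interrupts disabled across the read-side critical section, the CPU 0 column in the scenario above can no longer take an interrupt while holding the reader, which breaks the cycle.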

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-04 Thread Boqun Feng
On Wed, Nov 04, 2020 at 07:54:40PM +, Filipe Manana wrote:
[...]
> 
> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index b71ad8d9f1c9..b31d4ad482c7 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> LOCK_TRACE_SIZE_IN_LONGS;
> 
> if (max_entries <= 0) {
> -   if (!debug_locks_off_graph_unlock())
> +   if (!debug_locks_off_graph_unlock()) {
> +   WARN_ON_ONCE(1);
> return NULL;
> +   }
> 
> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> dump_stack();
> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> lockdep_map *lock, int read)
> unsigned long flags;
> int ret = 0;
> 
> -   if (unlikely(!lockdep_enabled()))
> +   if (unlikely(debug_locks && !lockdep_enabled()))
> return 1; /* avoid false negative lockdep_assert_held() */
> 
> raw_local_irq_save(flags);
> 
> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> Unexpected, right?
> 

Kinda, that means we still don't know why lockdep was turned off.

> Should I try something else?
> 

Thanks for trying this. Let me set up the reproducer on my side and see
whether I can get more information.

Regards,
Boqun

> Thanks.
> 
> 
> > 
> > Thanks!
> > 
> >>
> >> Alternatively, it's also helpful if you can try the following debug
> >> diff, with teh full set of xfstests:
> >>
> >> Thanks! Just trying to understand the real problem.
> >>
> >> Regards,
> >> Boqun
> >>
> >> -->8
> >> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> >> index b71ad8d9f1c9..9ae3e089e5c0 100644
> >> --- a/kernel/locking/lockdep.c
> >> +++ b/kernel/locking/lockdep.c
> >> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> >>LOCK_TRACE_SIZE_IN_LONGS;
> >>  
> >>if (max_entries <= 0) {
> >> -  if (!debug_locks_off_graph_unlock())
> >> +  if (!debug_locks_off_graph_unlock()) {
> >> +  WARN_ON_ONCE(1);
> >>return NULL;
> >> +  }
> >>  
> >>print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> >>dump_stack();
> >>
> >>> I guess I will have to reproduce this myself for further analysis, could
> >>> you share you .config?
> >>>
> >>> Anyway, I think this fix still makes a bit sense, I will send a proper
> >>> patch so that the problem won't block fs folks.
> >>>
> >>> Regards,
> >>> Boqun
> >>>
>  Thanks!
> 
> >
> >
> >
> >> What happens is percpu_rwsem_is_held() is apparently returning a false
> >> positive, so this makes __sb_start_write() do a
> >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> >> is expected to always succeed, because if the calling task is holding a
> >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> >> the semaphore at level 2, since the freeze semaphores are always
> >> acquired by increasing level order.
> >>
> >> But the try_lock fails, it triggers the warning at __sb_start_write(),
> >> then its caller sb_start_pagefault() ignores the return value and
> >> callers such as btrfs_page_mkwrite() make the assumption the freeze
> >> semaphore was taken, proceed to do their stuff, and later call
> >> sb_end_pagefault(), which which will do an up_read() on the 
> >> percpu_rwsem
> >> at level 2 despite not having not been able to down_read() the
> >> semaphore. This obviously corrupts the semaphore's read_count state, 
> >> and
> >> later causes any task trying to down_write() it to hang forever.
> >>
> >> After such a hang I ran a drgn script to confirm it:
> >>
> >> $ cat dump_freeze_sems.py
> >> import sys
> >> import drgn
> >> from drgn import NULL, Object, cast, container_of, execscript, \
> >> reinterpret, sizeof
> >> from drgn.helpers.linux import *
> >>
> >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> >>
> >> mnt = None
> >> for mnt in for_each_mount(prog, dst = mnt_path):
> >> pass
> >>
> >> if mnt is None:
> >> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> >> sys.exit(1)
> >>
> >> def dump_sem(level_enum):
> >> level = level_enum.value_()
> >> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> >> print(f'freeze semaphore at level {level}, {str(level_enum)}')
> >> print(f'block {sem.block.counter.value_()}')
> >> for i in for_each_possible_cpu(prog):
> >> read_count = per_cpu_ptr(sem.read_count, i)
> >> print(f'read_count at cpu {i} = {read_count}')
> >> 

Re: [PATCH 05/16] rcu: De-offloading CB kthread

2020-11-04 Thread Boqun Feng
On Wed, Nov 04, 2020 at 03:31:35PM +0100, Frederic Weisbecker wrote:
[...]
> > 
> > > + rcu_segcblist_offload(cblist, false);
> > > + raw_spin_unlock_rcu_node(rnp);
> > > +
> > > + if (rdp->nocb_cb_sleep) {
> > > + rdp->nocb_cb_sleep = false;
> > > + wake_cb = true;
> > > + }
> > > + rcu_nocb_unlock_irqrestore(rdp, flags);
> > > +
> > > + if (wake_cb)
> > > + swake_up_one(>nocb_cb_wq);
> > > +
> > > + swait_event_exclusive(rdp->nocb_state_wq,
> > > +   !rcu_segcblist_test_flags(cblist, 
> > > SEGCBLIST_KTHREAD_CB));
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static long rcu_nocb_rdp_deoffload(void *arg)
> > > +{
> > > + struct rcu_data *rdp = arg;
> > > +
> > > + WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id());
> > 
> > I think this warning can actually happen, if I understand how workqueue
> > works correctly. Consider that the corresponding cpu gets offlined right
> > after the rcu_nocb_cpu_deoffloaed(), and the workqueue of that cpu
> > becomes unbound, and IIUC, workqueues don't do migration during
> > cpu-offlining, which means the worker can be scheduled to other CPUs,
> > and the work gets executed on another cpu. Am I missing something here?.
> 
> We are holding cpus_read_lock() in rcu_nocb_cpu_offload(), this should
> prevent from that.
> 

But what if the work doesn't get executed until we cpus_read_unlock()
and someone offlines that CPU?

Regards,
Boqun

> Thanks!


Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-03 Thread Boqun Feng
On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
> On Tue, Nov 03, 2020 at 07:44:29PM +, Filipe Manana wrote:
> > 
> > 
> > On 03/11/20 14:08, Boqun Feng wrote:
> > > Hi Filipe,
> > > 
> > > On Mon, Oct 26, 2020 at 11:26:49AM +, Filipe Manana wrote:
> > >> Hello,
> > >>
> > >> I've recently started to hit a warning followed by tasks hanging after
> > >> attempts to freeze a filesystem. A git bisection pointed to the
> > >> following commit:
> > >>
> > >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> > >> Author: Peter Zijlstra 
> > >> Date:   Fri Oct 2 11:04:21 2020 +0200
> > >>
> > >> lockdep: Fix lockdep recursion
> > >>
> > >> This happens very reliably when running all xfstests with lockdep
> > >> enabled, and the tested filesystem is btrfs (haven't tried other
> > >> filesystems, but it shouldn't matter). The warning and task hangs always
> > >> happen at either test generic/068 or test generic/390, and (oddly)
> > >> always have to run all tests for it to trigger, running those tests
> > >> individually on an infinite loop doesn't seem to trigger it (at least
> > >> for a couple hours).
> > >>
> > >> The warning triggered is at fs/super.c:__sb_start_write() which always
> > >> results later in several tasks hanging on a percpu rw_sem:
> > >>
> > >> https://pastebin.com/qnLvf94E
> > >>
> > > 
> > > In your dmesg, I see line:
> > > 
> > >   [ 9304.920151] INFO: lockdep is turned off.
> > > 
> > > , that means debug_locks is 0, that usually happens when lockdep find a
> > > problem (i.e. a deadlock) and it turns itself off, because a problem is
> > > found and it's pointless for lockdep to continue to run.
> > > 
> > > And I haven't found a lockdep splat in your dmesg, do you have a full
> > > dmesg so that I can have a look?
> > > 
> > > This may be relevant because in commit 4d004099a66, we have
> > > 
> > >   @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct 
> > > lockdep_map *lock, int read)
> > >   unsigned long flags;
> > >   int ret = 0;
> > > 
> > >   -   if (unlikely(current->lockdep_recursion))
> > >   +   if (unlikely(!lockdep_enabled()))
> > >   return 1; /* avoid false negative lockdep_assert_held() 
> > > */
> > > 
> > > before this commit lock_is_held_type() and its friends may return false
> > > if debug_locks==0, after this commit lock_is_held_type() and its friends
> > > will always return true if debug_locks == 0. That could cause the
> > > behavior here.
> > > 
> > > In case I'm correct, the following "fix" may be helpful. 
> > > 
> > > Regards,
> > > Boqun
> > > 
> > > --8
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index 3e99dfef8408..c0e27fb949ff 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct 
> > > lockdep_map *lock, int read)
> > >   unsigned long flags;
> > >   int ret = 0;
> > >  
> > > - if (unlikely(!lockdep_enabled()))
> > > + if (unlikely(debug_locks && !lockdep_enabled()))
> > >   return 1; /* avoid false negative lockdep_assert_held() */
> > >  
> > >   raw_local_irq_save(flags);
> > 
> > Boqun, the patch fixes the problem for me!
> > You can have Tested-by: Filipe Manana 
> > 
> 
> Thanks. Although I think it still means that we have a lock issue when
> running xfstests (because we don't know why debug_locks gets cleared),

I might have found a place where we could turn lockdep off silently:

in print_circular_bug(), we turn off lockdep via
debug_locks_off_graph_unlock() and then try to save the trace for the
lockdep splat; however, if we have used up the stack_trace buffer (i.e.
nr_stack_trace_entries), save_trace() returns NULL and we return
silently.

Filipe, in order to check whether that happens, could you share your
/proc/lockdep_stats after the full set of xfstests has finished?

Alternatively, it's also helpful if you can try the following debug
diff with the full set of xfstests:

Thanks! Just trying to understand the real problem.

Regards,
Boqun

-->8
diff --git a/kernel/locking/lockdep

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-03 Thread Boqun Feng
On Tue, Nov 03, 2020 at 07:44:29PM +, Filipe Manana wrote:
> 
> 
> On 03/11/20 14:08, Boqun Feng wrote:
> > Hi Filipe,
> > 
> > On Mon, Oct 26, 2020 at 11:26:49AM +, Filipe Manana wrote:
> >> Hello,
> >>
> >> I've recently started to hit a warning followed by tasks hanging after
> >> attempts to freeze a filesystem. A git bisection pointed to the
> >> following commit:
> >>
> >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> >> Author: Peter Zijlstra 
> >> Date:   Fri Oct 2 11:04:21 2020 +0200
> >>
> >> lockdep: Fix lockdep recursion
> >>
> >> This happens very reliably when running all xfstests with lockdep
> >> enabled, and the tested filesystem is btrfs (haven't tried other
> >> filesystems, but it shouldn't matter). The warning and task hangs always
> >> happen at either test generic/068 or test generic/390, and (oddly)
> >> always have to run all tests for it to trigger, running those tests
> >> individually on an infinite loop doesn't seem to trigger it (at least
> >> for a couple hours).
> >>
> >> The warning triggered is at fs/super.c:__sb_start_write() which always
> >> results later in several tasks hanging on a percpu rw_sem:
> >>
> >> https://pastebin.com/qnLvf94E
> >>
> > 
> > In your dmesg, I see line:
> > 
> > [ 9304.920151] INFO: lockdep is turned off.
> > 
> > , that means debug_locks is 0, that usually happens when lockdep find a
> > problem (i.e. a deadlock) and it turns itself off, because a problem is
> > found and it's pointless for lockdep to continue to run.
> > 
> > And I haven't found a lockdep splat in your dmesg, do you have a full
> > dmesg so that I can have a look?
> > 
> > This may be relevant because in commit 4d004099a66, we have
> > 
> > @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct 
> > lockdep_map *lock, int read)
> > unsigned long flags;
> > int ret = 0;
> > 
> > -   if (unlikely(current->lockdep_recursion))
> > +   if (unlikely(!lockdep_enabled()))
> > return 1; /* avoid false negative lockdep_assert_held() 
> > */
> > 
> > before this commit lock_is_held_type() and its friends may return false
> > if debug_locks==0, after this commit lock_is_held_type() and its friends
> > will always return true if debug_locks == 0. That could cause the
> > behavior here.
> > 
> > In case I'm correct, the following "fix" may be helpful. 
> > 
> > Regards,
> > Boqun
> > 
> > --8
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index 3e99dfef8408..c0e27fb949ff 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct 
> > lockdep_map *lock, int read)
> > unsigned long flags;
> > int ret = 0;
> >  
> > -   if (unlikely(!lockdep_enabled()))
> > +   if (unlikely(debug_locks && !lockdep_enabled()))
> > return 1; /* avoid false negative lockdep_assert_held() */
> >  
> > raw_local_irq_save(flags);
> 
> Boqun, the patch fixes the problem for me!
> You can have Tested-by: Filipe Manana 
> 

Thanks. Although I think it still means that we have a lock issue when
running xfstests (because we don't know why debug_locks gets cleared),
I guess I will have to reproduce this myself for further analysis. Could
you share your .config?

Anyway, I think this fix still makes some sense, so I will send a proper
patch so that the problem won't block the fs folks.

Regards,
Boqun

> Thanks!
> 
> > 
> > 
> > 
> >> What happens is percpu_rwsem_is_held() is apparently returning a false
> >> positive, so this makes __sb_start_write() do a
> >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> >> is expected to always succeed, because if the calling task is holding a
> >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> >> the semaphore at level 2, since the freeze semaphores are always
> >> acquired by increasing level order.
> >>
> >> But the try_lock fails, it triggers the warning at __sb_start_write(),
> >> then its caller sb_start_pagefault() ignores the return value and
> >> callers such as btrfs_page_mkwrite() make the assumption the freeze
> >> semaphore was taken, proceed to do their stuff, and later call
> >> sb_e

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-03 Thread Boqun Feng
Hi Filipe,

On Mon, Oct 26, 2020 at 11:26:49AM +, Filipe Manana wrote:
> Hello,
> 
> I've recently started to hit a warning followed by tasks hanging after
> attempts to freeze a filesystem. A git bisection pointed to the
> following commit:
> 
> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> Author: Peter Zijlstra 
> Date:   Fri Oct 2 11:04:21 2020 +0200
> 
> lockdep: Fix lockdep recursion
> 
> This happens very reliably when running all xfstests with lockdep
> enabled, and the tested filesystem is btrfs (haven't tried other
> filesystems, but it shouldn't matter). The warning and task hangs always
> happen at either test generic/068 or test generic/390, and (oddly)
> always have to run all tests for it to trigger, running those tests
> individually on an infinite loop doesn't seem to trigger it (at least
> for a couple hours).
> 
> The warning triggered is at fs/super.c:__sb_start_write() which always
> results later in several tasks hanging on a percpu rw_sem:
> 
> https://pastebin.com/qnLvf94E
> 

In your dmesg, I see the line:

[ 9304.920151] INFO: lockdep is turned off.

, that means debug_locks is 0, which usually happens when lockdep finds a
problem (i.e. a deadlock) and turns itself off, because at that point it's
pointless for lockdep to continue to run.

And I haven't found a lockdep splat in your dmesg; do you have a full
dmesg so that I can have a look?

This may be relevant because in commit 4d004099a66, we have

@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct 
lockdep_map *lock, int read)
unsigned long flags;
int ret = 0;

-   if (unlikely(current->lockdep_recursion))
+   if (unlikely(!lockdep_enabled()))
return 1; /* avoid false negative lockdep_assert_held() 
*/

Before this commit, lock_is_held_type() and its friends could return false
when debug_locks == 0; after this commit, lock_is_held_type() and its
friends always return true when debug_locks == 0. That could cause the
behavior seen here.

If I'm correct, the following "fix" may be helpful.

Regards,
Boqun

-->8
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..c0e27fb949ff 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map 
*lock, int read)
unsigned long flags;
int ret = 0;
 
-   if (unlikely(!lockdep_enabled()))
+   if (unlikely(debug_locks && !lockdep_enabled()))
return 1; /* avoid false negative lockdep_assert_held() */
 
raw_local_irq_save(flags);
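
For context, a simplified sketch of why a spurious "held" answer matters in the report below (this paraphrases the __sb_start_write() logic Filipe describes; names and structure are simplified and are not the exact fs/super.c code):

static bool lower_freeze_level_held(struct super_block *sb, int level)
{
        int i;

        /* With debug_locks == 0, lockdep now answers "held" for everything... */
        for (i = 0; i < level - 1; i++)
                if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i))
                        return true;    /* ...so this fires spuriously */
        return false;
}

static bool sb_start_write_sketch(struct super_block *sb, int level)
{
        if (lower_freeze_level_held(sb, level))
                /*
                 * Expected to always succeed when a lower level really is
                 * held; with the false positive it can fail, the caller
                 * ignores the failure, and the matching sb_end_*() later
                 * does an unbalanced up_read() that corrupts read_count.
                 */
                return percpu_down_read_trylock(sb->s_writers.rw_sem + level - 1);

        percpu_down_read(sb->s_writers.rw_sem + level - 1);
        return true;
}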



> What happens is percpu_rwsem_is_held() is apparently returning a false
> positive, so this makes __sb_start_write() do a
> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> is expected to always succeed, because if the calling task is holding a
> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> the semaphore at level 2, since the freeze semaphores are always
> acquired by increasing level order.
> 
> But the try_lock fails, it triggers the warning at __sb_start_write(),
> then its caller sb_start_pagefault() ignores the return value and
> callers such as btrfs_page_mkwrite() make the assumption the freeze
> semaphore was taken, proceed to do their stuff, and later call
> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> at level 2 despite not having not been able to down_read() the
> semaphore. This obviously corrupts the semaphore's read_count state, and
> later causes any task trying to down_write() it to hang forever.
> 
> After such a hang I ran a drgn script to confirm it:
> 
> $ cat dump_freeze_sems.py
> import sys
> import drgn
> from drgn import NULL, Object, cast, container_of, execscript, \
> reinterpret, sizeof
> from drgn.helpers.linux import *
> 
> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> 
> mnt = None
> for mnt in for_each_mount(prog, dst = mnt_path):
> pass
> 
> if mnt is None:
> sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> sys.exit(1)
> 
> def dump_sem(level_enum):
> level = level_enum.value_()
> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> print(f'freeze semaphore at level {level}, {str(level_enum)}')
> print(f'block {sem.block.counter.value_()}')
> for i in for_each_possible_cpu(prog):
> read_count = per_cpu_ptr(sem.read_count, i)
> print(f'read_count at cpu {i} = {read_count}')
> print()
> 
> # dump semaphore read counts for all freeze levels (fs.h)
> dump_sem(prog['SB_FREEZE_WRITE'])
> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> dump_sem(prog['SB_FREEZE_FS'])
> 
> 
> $ drgn dump_freeze_sems.py
> freeze semaphore at level 1, (enum )SB_FREEZE_WRITE
> block 1
> read_count at cpu 0 = *(unsigned int *)0xc2ec3ee00c74 = 3
> read_count at cpu 1 = *(unsigned int 

Re: [PATCH 05/16] rcu: De-offloading CB kthread

2020-11-02 Thread Boqun Feng
Hi Frederic,

Could you copy r...@vger.kernel.org if you post another version? It
will help RCU hobbyists like me catch up on RCU news, thanks! ;-)

Please see below for some comments. I'm still reading the whole
patchset, so I have probably missed something.

On Fri, Oct 23, 2020 at 04:46:38PM +0200, Frederic Weisbecker wrote:
> In order to de-offload the callbacks processing of an rdp, we must clear
> SEGCBLIST_OFFLOAD and notify the CB kthread so that it clears its own
> bit flag and goes to sleep to stop handling callbacks. The GP kthread
> will also be notified the same way. Whoever acknowledges and clears its
> own bit last must notify the de-offloading worker so that it can resume
> the de-offloading while being sure that callbacks won't be handled
> remotely anymore.
> 
> Inspired-by: Paul E. McKenney 
> Signed-off-by: Frederic Weisbecker 
> Cc: Paul E. McKenney 
> Cc: Josh Triplett 
> Cc: Steven Rostedt 
> Cc: Mathieu Desnoyers 
> Cc: Lai Jiangshan 
> Cc: Joel Fernandes 
> Cc: Neeraj Upadhyay 
> ---
>  include/linux/rcupdate.h   |   2 +
>  kernel/rcu/rcu_segcblist.c |  10 ++-
>  kernel/rcu/rcu_segcblist.h |   2 +-
>  kernel/rcu/tree.h  |   1 +
>  kernel/rcu/tree_plugin.h   | 134 +++--
>  5 files changed, 126 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 7c1ceff02852..bf8eb02411c2 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -104,8 +104,10 @@ static inline void rcu_user_exit(void) { }
>  
>  #ifdef CONFIG_RCU_NOCB_CPU
>  void rcu_init_nohz(void);
> +int rcu_nocb_cpu_deoffload(int cpu);
>  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
>  static inline void rcu_init_nohz(void) { }
> +static inline int rcu_nocb_cpu_deoffload(int cpu) { return 0; }
>  #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
>  
>  /**
> diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
> index a96511b7cc98..3f6b5b724b39 100644
> --- a/kernel/rcu/rcu_segcblist.c
> +++ b/kernel/rcu/rcu_segcblist.c
> @@ -170,10 +170,14 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp)
>   * Mark the specified rcu_segcblist structure as offloaded.  This
>   * structure must be empty.
>   */
> -void rcu_segcblist_offload(struct rcu_segcblist *rsclp)
> +void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload)
>  {
> - rcu_segcblist_clear_flags(rsclp, SEGCBLIST_SOFTIRQ_ONLY);
> - rcu_segcblist_set_flags(rsclp, SEGCBLIST_OFFLOADED);
> + if (offload) {
> + rcu_segcblist_clear_flags(rsclp, SEGCBLIST_SOFTIRQ_ONLY);
> + rcu_segcblist_set_flags(rsclp, SEGCBLIST_OFFLOADED);
> + } else {
> + rcu_segcblist_clear_flags(rsclp, SEGCBLIST_OFFLOADED);
> + }
>  }
>  
>  /*
> diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
> index 575896a2518b..00ebeb8d39b7 100644
> --- a/kernel/rcu/rcu_segcblist.h
> +++ b/kernel/rcu/rcu_segcblist.h
> @@ -105,7 +105,7 @@ static inline bool rcu_segcblist_restempty(struct 
> rcu_segcblist *rsclp, int seg)
>  void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp);
>  void rcu_segcblist_init(struct rcu_segcblist *rsclp);
>  void rcu_segcblist_disable(struct rcu_segcblist *rsclp);
> -void rcu_segcblist_offload(struct rcu_segcblist *rsclp);
> +void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload);
>  bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp);
>  bool rcu_segcblist_pend_cbs(struct rcu_segcblist *rsclp);
>  struct rcu_head *rcu_segcblist_first_cb(struct rcu_segcblist *rsclp);
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index e4f66b8f7c47..8047102be878 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -200,6 +200,7 @@ struct rcu_data {
>   /* 5) Callback offloading. */
>  #ifdef CONFIG_RCU_NOCB_CPU
>   struct swait_queue_head nocb_cb_wq; /* For nocb kthreads to sleep on. */
> + struct swait_queue_head nocb_state_wq; /* For offloading state changes 
> */
>   struct task_struct *nocb_gp_kthread;
>   raw_spinlock_t nocb_lock;   /* Guard following pair of fields. */
>   atomic_t nocb_lock_contended;   /* Contention experienced. */
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index fd8a52e9a887..09caf319a4a9 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2081,16 +2081,29 @@ static int rcu_nocb_gp_kthread(void *arg)
>   return 0;
>  }
>  
> +static inline bool nocb_cb_can_run(struct rcu_data *rdp)
> +{
> + u8 flags = SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_CB;
> + return rcu_segcblist_test_flags(>cblist, flags);
> +}
> +
> +static inline bool nocb_cb_wait_cond(struct rcu_data *rdp)
> +{
> + return nocb_cb_can_run(rdp) && !READ_ONCE(rdp->nocb_cb_sleep);
> +}
> +
>  /*
>   * Invoke any ready callbacks from the corresponding no-CBs CPU,
>   * then, if there are no more, wait for more to appear.
>   */
>  static void nocb_cb_wait(struct rcu_data *rdp)
>  

[PATCH 2/2] lockdep/selftest: Add spin_nest_lock test

2020-11-01 Thread Boqun Feng
Add a self test case to cover the following sequence:

lock(A);
lock_nest_lock(C1, A);
lock(B);
lock_nest_lock(C2, A);

This is a reproducer for a problem[1] reported by Chris Wilson, and it
helps guard against reintroducing that problem.

[1]: 
https://lore.kernel.org/lkml/160390684819.31966.12048967113267928...@build.alporthouse.com/

Signed-off-by: Boqun Feng 
Cc: Chris Wilson 
---
 lib/locking-selftest.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index afa7d4bb291f..4c24ac8a456c 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -2009,6 +2009,19 @@ static void ww_test_spin_nest_unlocked(void)
U(A);
 }
 
+/* This is not a deadlock, because we have X1 to serialize Y1 and Y2 */
+static void ww_test_spin_nest_lock(void)
+{
+   spin_lock(&lock_X1);
+   spin_lock_nest_lock(&lock_Y1, &lock_X1);
+   spin_lock(&lock_A);
+   spin_lock_nest_lock(&lock_Y2, &lock_X1);
+   spin_unlock(&lock_A);
+   spin_unlock(&lock_Y2);
+   spin_unlock(&lock_Y1);
+   spin_unlock(&lock_X1);
+}
+
 static void ww_test_unneeded_slow(void)
 {
WWAI();
@@ -2226,6 +2239,10 @@ static void ww_tests(void)
dotest(ww_test_spin_nest_unlocked, FAILURE, LOCKTYPE_WW);
pr_cont("\n");
 
+   print_testname("spinlock nest test");
+   dotest(ww_test_spin_nest_lock, SUCCESS, LOCKTYPE_WW);
+   pr_cont("\n");
+
printk("  -\n");
printk(" |block | try  |context|\n");
printk("  -\n");
-- 
2.28.0
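
For readers unfamiliar with the annotation being exercised: spin_lock_nest_lock(inner, outer) tells lockdep that every lock of inner's class is only taken while 'outer' is held, so holding several of them at once is serialized by 'outer' and is not reported as a deadlock. A sketch of the pattern this selftest models (the structures below are invented for illustration):

#include <linux/list.h>
#include <linux/spinlock.h>

struct bucket {
        spinlock_t lock;
        struct list_head items;
};

struct table {
        spinlock_t big_lock;    /* serializes multi-bucket operations */
        struct bucket buckets[16];
};

/* Hold two bucket locks of the same class under the outer big_lock. */
static void move_first_item(struct table *t, int from, int to)
{
        spin_lock(&t->big_lock);
        spin_lock_nest_lock(&t->buckets[from].lock, &t->big_lock);
        spin_lock_nest_lock(&t->buckets[to].lock, &t->big_lock);

        if (!list_empty(&t->buckets[from].items))
                list_move(t->buckets[from].items.next, &t->buckets[to].items);

        spin_unlock(&t->buckets[to].lock);
        spin_unlock(&t->buckets[from].lock);
        spin_unlock(&t->big_lock);
}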



[PATCH 1/2] lockdep: Avoid to modify chain keys in validate_chain()

2020-11-01 Thread Boqun Feng
Chris Wilson reported a problem spotted by check_chain_key(): a chain
key got changed in validate_chain() because we modify the ->read in
validate_chain() to skip checks for dependency adding, and ->read is
taken into calculation for chain key since commit f611e8cf98ec
("lockdep: Take read/write status in consideration when generate
chainkey").

Fix this by not modifying ->read in validate_chain(), based on two
facts: a) since we now support recursive read lock detection, there is
no need to skip checks for dependency adding for recursive readers, b)
since we have a), there is only one case left (nest_lock) where we want
to skip checks in validate_chain(), we simply remove the modification
for ->read and rely on the return value of check_deadlock() to skip the
dependency adding.

Reported-by: Chris Wilson 
Signed-off-by: Boqun Feng 
Cc: Peter Zijlstra 
---
Peter,

I managed to get a reproducer for the problem Chris reported; please see
patch #2. With this patch, that problem gets fixed.

This small patchset is based on your locking/core branch; patch #2 actually
relies on your "s/raw_spin/spin" changes, thanks for taking care of that
;-)

Regards,
Boqun

 kernel/locking/lockdep.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..a294326fd998 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2765,7 +2765,9 @@ print_deadlock_bug(struct task_struct *curr, struct 
held_lock *prev,
  * (Note that this has to be done separately, because the graph cannot
  * detect such classes of deadlocks.)
  *
- * Returns: 0 on deadlock detected, 1 on OK, 2 on recursive read
+ * Returns: 0 on deadlock detected, 1 on OK, 2 if another lock with the same
+ * lock class is held but nest_lock is also held, i.e. we rely on the
+ * nest_lock to avoid the deadlock.
  */
 static int
 check_deadlock(struct task_struct *curr, struct held_lock *next)
@@ -2788,7 +2790,7 @@ check_deadlock(struct task_struct *curr, struct held_lock 
*next)
 * lock class (i.e. read_lock(lock)+read_lock(lock)):
 */
if ((next->read == 2) && prev->read)
-   return 2;
+   continue;
 
/*
 * We're holding the nest_lock, which serializes this lock's
@@ -3592,16 +3594,13 @@ static int validate_chain(struct task_struct *curr,
 
if (!ret)
return 0;
-   /*
-* Mark recursive read, as we jump over it when
-* building dependencies (just like we jump over
-* trylock entries):
-*/
-   if (ret == 2)
-   hlock->read = 2;
/*
 * Add dependency only if this lock is not the head
-* of the chain, and if it's not a secondary read-lock:
+* of the chain, and if the new lock introduces no more
+* lock dependency (because we already hold a lock with the
+* same lock class) nor deadlock (because the nest_lock
+* serializes nesting locks), see the comments for
+* check_deadlock().
 */
if (!chain_head && ret != 2) {
if (!check_prevs_add(curr, hlock))
-- 
2.28.0



Re: lockdep: possible irq lock inversion dependency detected (trig->leddev_list_lock)

2020-11-01 Thread Boqun Feng
Hi Andrea,

On Sun, Nov 01, 2020 at 10:26:14AM +0100, Andrea Righi wrote:
> I'm getting the following lockdep splat (see below).
> 
> Apparently this warning starts to be reported after applying:
> 
>  e918188611f0 ("locking: More accurate annotations for read_lock()")
> 
> It looks like a false positive to me, but it made me think a bit and
> IIUC there can be still a potential deadlock, even if the deadlock
> scenario is a bit different than what lockdep is showing.
> 
> In the assumption that read-locks are recursive only in_interrupt()
> context (as stated in e918188611f0), the following scenario can still
> happen:
> 
>  CPU0 CPU1
>   
>  read_lock(>leddev_list_lock);
>   write_lock(>leddev_list_lock);
>  
>  kbd_bh()
>-> read_lock(>leddev_list_lock);
> 
>  *** DEADLOCK ***
> 
> The write-lock is waiting on CPU1 and the second read_lock() on CPU0
> would be blocked by the write-lock *waiter* on CPU1 => deadlock.
> 

No, this is not a deadlock: a write-lock waiter only blocks
*non-recursive* readers, and since the read_lock() in kbd_bh() is called
in soft-irq context (where in_interrupt() returns true), it's a recursive
reader and won't be blocked by the write-lock waiter.
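
To make the distinction concrete, a kernel-style sketch (example_lock and the two callers are invented; the behaviour described is that of the queued rwlock implementation these annotations model):

#include <linux/spinlock.h>

static DEFINE_RWLOCK(example_lock);

void reader_in_task_context(void)
{
        /*
         * Non-recursive reader: if a writer is already spinning in
         * write_lock(&example_lock) on another CPU, this reader queues
         * behind it, because queued rwlocks are fair to writers.
         */
        read_lock(&example_lock);
        /* ... */
        read_unlock(&example_lock);
}

void reader_in_softirq(void)    /* e.g. run from a BH, like kbd_bh() */
{
        /*
         * in_interrupt() is true here, so this reader is recursive: it is
         * granted the lock even though a writer may be waiting, which is
         * why the kbd_bh() read_lock() cannot complete the cycle in the
         * scenario quoted above.
         */
        read_lock(&example_lock);
        /* ... */
        read_unlock(&example_lock);
}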

> In that case we could prevent this deadlock condition using a workqueue
> to call kbd_propagate_led_state() instead of calling it directly from
> kbd_bh() (even if lockdep would still report the false positive).
> 

The deadlock scenario reported by the following splat is:


CPU 0:                          CPU 1:                          CPU 2:
-----                           -----                           -----
led_trigger_event():
  read_lock(&trig->leddev_list_lock);
                                ata_hsm_qc_complete():
                                  spin_lock_irqsave(&host->lock);
                                                                write_lock(&trig->leddev_list_lock);
                                  ata_port_freeze():
                                    ata_do_link_abort():
                                      ata_qc_complete():
                                        ledtrig_disk_activity():
                                          led_trigger_blink_oneshot():
                                            read_lock(&trig->leddev_list_lock);
                                            // ^ not in in_interrupt() context, so could get blocked by CPU 2
<interrupt>
  ata_bmdma_interrupt():
    spin_lock_irqsave(&host->lock);

, where CPU 0 is blocked by CPU 1 because of the spin_lock_irqsave() in
ata_bmdma_interrupt() and CPU 1 is blocked by CPU 2 because of the
read_lock() in led_trigger_blink_oneshot() and CPU 2 is blocked by CPU 0
because of an arbitrary writer on &trig->leddev_list_lock.

So I don't think it's a false positive, but I might be missing something
obvious, because I don't know what the code here actually does ;-)

Regards,
Boqun

> Can you help me to understand if this assumption is correct or if I'm
> missing something?
> 
> Thanks,
> -Andrea
> 
> Lockdep trace:
> 
> [1.087260] WARNING: possible irq lock inversion dependency detected
> [1.087267] 5.10.0-rc1+ #18 Not tainted
> [1.088829] softirqs last  enabled at (0): [] 
> copy_process+0x6c7/0x1c70
> [1.089662] 
> [1.090284] softirqs last disabled at (0): [<>] 0x0
> [1.092766] swapper/3/0 just changed the state of lock:
> [1.093325] 888006394c18 (>lock){-...}-{2:2}, at: 
> ata_bmdma_interrupt+0x27/0x200
> [1.094190] but this lock took another, HARDIRQ-READ-unsafe lock in the 
> past:
> [1.094944]  (>leddev_list_lock){.+.?}-{2:2}
> [1.094946] 
> [1.094946] 
> [1.094946] and interrupts could create inverse lock ordering between them.
> [1.094946] 
> [1.096600] 
> [1.096600] other info that might help us debug this:
> [1.097250]  Possible interrupt unsafe locking scenario:
> [1.097250] 
> [1.097940]CPU0CPU1
> [1.098401]
> [1.098873]   lock(>leddev_list_lock);
> [1.099315]local_irq_disable();
> [1.099932]lock(>lock);
> [1.100527]lock(>leddev_list_lock);
> [1.101219]   
> [1.101490] lock(>lock);
> [1.101844] 
> [1.101844]  *** DEADLOCK ***
> [1.101844] 
> [1.102447] no locks held by swapper/3/0.
> [1.102858] 
> [1.102858] the shortest dependencies between 2nd lock and 1st lock:
> [1.103646]  -> (>leddev_list_lock){.+.?}-{2:2} ops: 46 {
> [1.104248] HARDIRQ-ON-R at:
> [1.104600]   

Re: [tip: locking/core] lockdep: Fix usage_traceoverflow

2020-10-29 Thread Boqun Feng
Hi Peter,

On Wed, Oct 28, 2020 at 08:59:10PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 28, 2020 at 08:42:09PM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 28, 2020 at 05:40:48PM +, Chris Wilson wrote:
> > > Quoting Chris Wilson (2020-10-27 16:34:53)
> > > > Quoting Peter Zijlstra (2020-10-27 15:45:33)
> > > > > On Tue, Oct 27, 2020 at 01:29:10PM +, Chris Wilson wrote:
> > > > > 
> > > > > > <4> [304.908891] hm#2, depth: 6 [6], 3425cfea6ff31f7f != 
> > > > > > 547d92e9ec2ab9af
> > > > > > <4> [304.908897] WARNING: CPU: 0 PID: 5658 at 
> > > > > > kernel/locking/lockdep.c:3679 check_chain_key+0x1a4/0x1f0
> > > > > 
> > > > > Urgh, I don't think I've _ever_ seen that warning trigger.
> > > > > 
> > > > > The comments that go with it suggest memory corruption is the most
> > > > > likely trigger of it. Is it easy to trigger?
> > > > 
> > > > For the automated CI, yes, the few machines that run that particular HW
> > > > test seem to hit it regularly. I have not yet reproduced it for myself.
> > > > I thought it looked like something kasan would provide some insight for
> > > > and we should get a kasan run through CI over the w/e. I suspect we've
> > > > feed in some garbage and called it a lock.
> > > 
> > > I tracked it down to a second invocation of 
> > > lock_acquire_shared_recursive()
> > > intermingled with some other regular mutexes (in this case ww_mutex).
> > > 
> > > We hit this path in validate_chain():
> > >   /*
> > >* Mark recursive read, as we jump over it when
> > >* building dependencies (just like we jump over
> > >* trylock entries):
> > >*/
> > >   if (ret == 2)
> > >   hlock->read = 2;
> > > 
> > > and that is modifying hlock_id() and so the chain-key, after it has
> > > already been computed.
> > 
> > Ooh, interesting.. I'll have to go look at this in the morning, brain is
> > fried already. Thanks for digging into it.
> 

Sorry for the late response.

> So that's commit f611e8cf98ec ("lockdep: Take read/write status in
> consideration when generate chainkey") that did that.
> 

Yeah, I think that's related, however ...

> So validate_chain() requires the new chain_key, but can change ->read
> which then invalidates the chain_key we just calculated.
> 
> This happens when check_deadlock() returns 2, which only happens when:
> 
>   - next->read == 2 && ... ; however @hext is our @hlock, so that's
> pointless
> 

I don't think we should return 2 (earlier) in this case anymore. Because
we now have recursive read deadlock detection, it's safe to add the
dependency "prev -> next" to the dependency graph. I think we can just
continue in this case. Actually, I think this is something I missed in my
recursive read detection patchset :-/

>   - when there's a nest_lock involved ; ww_mutex uses that !!!
> 

That leaves check_deadlock() returning 2 only if hlock is a nest_lock, and
...

> I suppose something like the below _might_ just do it, but I haven't
> compiled it, and like said, my brain is fried.
> 
> Boqun, could you have a look, you're a few timezones ahead of us so your
> morning is earlier ;-)
> 
> ---
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..3caf63532bc2 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -3556,7 +3556,7 @@ static inline int lookup_chain_cache_add(struct 
> task_struct *curr,
>  
>  static int validate_chain(struct task_struct *curr,
> struct held_lock *hlock,
> -   int chain_head, u64 chain_key)
> +   int chain_head, u64 *chain_key)
>  {
>   /*
>* Trylock needs to maintain the stack of held locks, but it
> @@ -3568,6 +3568,7 @@ static int validate_chain(struct task_struct *curr,
>* (If lookup_chain_cache_add() return with 1 it acquires
>* graph_lock for us)
>*/
> +again:
>   if (!hlock->trylock && hlock->check &&
>   lookup_chain_cache_add(curr, hlock, chain_key)) {
>   /*
> @@ -3597,8 +3598,12 @@ static int validate_chain(struct task_struct *curr,
>* building dependencies (just like we jump over
>* trylock entries):
>*/
> - if (ret == 2)
> + if (ret == 2) {
>   hlock->read = 2;
> + *chain_key = iterate_chain_key(hlock->prev_chain_key, 
> hlock_id(hlock));

If "ret == 2" means hlock is a a nest_lock, than we don't need the
"->read = 2" trick here and we don't need to update chain_key either.
We used to have this "->read = 2" only because we want to skip the
dependency adding step afterwards. So how about the following:

It survived a lockdep selftest at boot time.

Regards,
Boqun

->8
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..b23ca6196561 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2765,7 +2765,7 @@ print_deadlock_bug(struct task_struct 

Re: [PATCH v3 2/6] docs: lockdep-design: fix some warning issues

2020-10-22 Thread Boqun Feng
On Wed, Oct 21, 2020 at 02:17:23PM +0200, Mauro Carvalho Chehab wrote:
> There are several warnings caused by a recent change
> 224ec489d3cd ("lockdep/Documention: Recursive read lock detection reasoning")
> 
> Those are reported by htmldocs build:
> 
> Documentation/locking/lockdep-design.rst:429: WARNING: Definition list 
> ends without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:452: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:453: WARNING: Unexpected 
> indentation.
> Documentation/locking/lockdep-design.rst:453: WARNING: Blank line 
> required after table.
> Documentation/locking/lockdep-design.rst:454: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:455: WARNING: Unexpected 
> indentation.
> Documentation/locking/lockdep-design.rst:455: WARNING: Blank line 
> required after table.
> Documentation/locking/lockdep-design.rst:456: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> Documentation/locking/lockdep-design.rst:457: WARNING: Unexpected 
> indentation.
> Documentation/locking/lockdep-design.rst:457: WARNING: Blank line 
> required after table.
> 
> Besides the reported issues, there are some missing blank
> lines that ended producing wrong html output, and some
> literals are not properly identified.
> 
> Also, the symbols used at the irq enabled/disable table
> are not displayed as expected, as they're not literals.
> Also, on another table they're using a different notation.
> 
> Fixes: 224ec489d3cd ("lockdep/Documention: Recursive read lock detection 
> reasoning")
> Signed-off-by: Mauro Carvalho Chehab 

Acked-by: Boqun Feng 

Regards,
Boqun

> ---
>  Documentation/locking/lockdep-design.rst | 51 ++--
>  1 file changed, 31 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/locking/lockdep-design.rst 
> b/Documentation/locking/lockdep-design.rst
> index cec03bd1294a..9f3cfca9f8a4 100644
> --- a/Documentation/locking/lockdep-design.rst
> +++ b/Documentation/locking/lockdep-design.rst
> @@ -42,6 +42,7 @@ The validator tracks lock-class usage history and divides 
> the usage into
>  (4 usages * n STATEs + 1) categories:
>  
>  where the 4 usages can be:
> +
>  - 'ever held in STATE context'
>  - 'ever held as readlock in STATE context'
>  - 'ever held with STATE enabled'
> @@ -49,10 +50,12 @@ where the 4 usages can be:
>  
>  where the n STATEs are coded in kernel/locking/lockdep_states.h and as of
>  now they include:
> +
>  - hardirq
>  - softirq
>  
>  where the last 1 category is:
> +
>  - 'ever used'   [ == !unused]
>  
>  When locking rules are violated, these usage bits are presented in the
> @@ -96,9 +99,9 @@ exact case is for the lock as of the reporting time.
>+--+-+--+
>|  | irq enabled | irq disabled |
>+--+-+--+
> -  | ever in irq  |  ?  |   -  |
> +  | ever in irq  | '?' |  '-' |
>+--+-+--+
> -  | never in irq |  +  |   .  |
> +  | never in irq | '+' |  '.' |
>+--+-+--+
>  
>  The character '-' suggests irq is disabled because if otherwise the
> @@ -216,7 +219,7 @@ looks like this::
> BD_MUTEX_PARTITION
>};
>  
> -mutex_lock_nested(>bd_contains->bd_mutex, BD_MUTEX_PARTITION);
> +  mutex_lock_nested(>bd_contains->bd_mutex, BD_MUTEX_PARTITION);
>  
>  In this case the locking is done on a bdev object that is known to be a
>  partition.
> @@ -334,7 +337,7 @@ Troubleshooting:
>  
>  
>  The validator tracks a maximum of MAX_LOCKDEP_KEYS number of lock classes.
> -Exceeding this number will trigger the following lockdep warning:
> +Exceeding this number will trigger the following lockdep warning::
>  
>   (DEBUG_LOCKS_WARN_ON(id >= MAX_LOCKDEP_KEYS))
>  
> @@ -420,7 +423,8 @@ the critical section of another reader of the same lock 
> instance.
>  
>  The difference between recursive readers and non-recursive readers is 
> because:
>  recursive readers get blocked only by a write lock *holder*, while 
> non-recursive
> -readers could get blocked by a write lock *waiter*. Considering the follow 
> example:
> +readers could get blocked by a write lock *waiter*. Considering the follow
> +example::
>  
>   TASK A: TASK B:
>  
> @@ 

Re: [PATCH 1/3] sched: fix exit_mm vs membarrier (v4)

2020-10-22 Thread Boqun Feng
Hi,

On Tue, Oct 20, 2020 at 10:59:58AM -0400, Mathieu Desnoyers wrote:
> - On Oct 20, 2020, at 10:36 AM, Peter Zijlstra pet...@infradead.org wrote:
> 
> > On Tue, Oct 20, 2020 at 09:47:13AM -0400, Mathieu Desnoyers wrote:
> >> +void membarrier_update_current_mm(struct mm_struct *next_mm)
> >> +{
> >> +  struct rq *rq = this_rq();
> >> +  int membarrier_state = 0;
> >> +
> >> +  if (next_mm)
> >> +  membarrier_state = atomic_read(_mm->membarrier_state);
> >> +  if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> >> +  return;
> >> +  WRITE_ONCE(rq->membarrier_state, membarrier_state);
> >> +}
> > 
> > This is suspisioucly similar to membarrier_switch_mm().
> > 
> > Would something like so make sense?
> 
> Very much yes. Do you want me to re-send the series, or you
> want to fold this in as you merge it ?
> 
> Thanks,
> 
> Mathieu
> 
> > 
> > ---
> > --- a/kernel/sched/membarrier.c
> > +++ b/kernel/sched/membarrier.c
> > @@ -206,14 +206,7 @@ void membarrier_exec_mmap(struct mm_stru
> > 
> > void membarrier_update_current_mm(struct mm_struct *next_mm)
> > {
> > -   struct rq *rq = this_rq();
> > -   int membarrier_state = 0;
> > -
> > -   if (next_mm)
> > -   membarrier_state = atomic_read(_mm->membarrier_state);
> > -   if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> > -   return;
> > -   WRITE_ONCE(rq->membarrier_state, membarrier_state);
> > +   membarrier_switch_mm(this_rq(), NULL, next_mm);
> > }
> > 
> > static int membarrier_global_expedited(void)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index d2621155393c..3d589c2ffd28 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2645,12 +2645,14 @@ static inline void membarrier_switch_mm(struct rq 
> > *rq,
> > struct mm_struct *prev_mm,
> > struct mm_struct *next_mm)
> > {
> > -   int membarrier_state;
> > +   int membarrier_state = 0;
> > 
> > if (prev_mm == next_mm)

Unless I'm missing something subtle, in exit_mm(),
membarrier_update_current_mm() is called with @next_mm == NULL, and
inside membarrier_update_current_mm(), membarrier_switch_mm() is called
with @prev_mm == NULL. As a result, the branch above is taken, so
membarrier_update_current_mm() becomes a nop. I think we should use the
previous value of current->mm as the @prev_mm, something like below
maybe?

void update_current_mm(struct mm_struct *next_mm)
{
struct mm_struct *prev_mm;
unsigned long flags;

local_irq_save(flags);
prev_mm = current->mm;
current->mm = next_mm;
membarrier_switch_mm(this_rq(), prev_mm, next_mm);
local_irq_restore(flags);
}

, and then replace all assignments to "current->mm" in the kernel with
update_current_mm().
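
For example, the exit_mm() site (which does this under task_lock(current))
would then change along these lines (just a sketch of the idea, not a
tested diff):

-	current->mm = NULL;
+	update_current_mm(NULL);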

Thoughts?

Regards,
Boqun

> > return;
> > 
> > -   membarrier_state = atomic_read(_mm->membarrier_state);
> > +   if (next_mm)
> > +   membarrier_state = atomic_read(_mm->membarrier_state);
> > +
> > if (READ_ONCE(rq->membarrier_state) == membarrier_state)
> > return;
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-13 Thread Boqun Feng
On Tue, Oct 13, 2020 at 12:27:15PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 12, 2020 at 11:11:10AM +0800, Boqun Feng wrote:
> 
> > I think this happened because in this commit debug_lockdep_rcu_enabled()
> > didn't adopt to the change that made lockdep_recursion a percpu
> > variable?
> > 
> > Qian, mind to try the following?
> > 
> > Although, arguably the problem still exists, i.e. we still have an RCU
> > read-side critical section inside lock_acquire(), which may be called on
> 
> There is actual RCU usage from the trace_lock_acquire().
> 
> > a yet-to-online CPU, which RCU doesn't watch. I think this used to be OK
> > because we don't "free" anything from lockdep, IOW, there is no
> > synchronize_rcu() or call_rcu() that _needs_ to wait for the RCU
> > read-side critical sections inside lockdep. But now we lock class
> > recycling, so it might be a problem.
> > 
> > That said, currently validate_chain() and lock class recycling are
> > mutually excluded via graph_lock, so we are safe for this one ;-)
> 
> We should have a comment on that somewhere, could you write one?
> 

Sure, I will write something tomorrow.
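
Roughly along these lines, perhaps (wording is only a sketch of the
reasoning above):

	/*
	 * lock_acquire() and friends contain RCU read-side critical
	 * sections that may run on a CPU which RCU is not watching yet
	 * (e.g. a not-yet-online CPU). This is only safe because lockdep
	 * never relies on synchronize_rcu()/call_rcu() to wait for those
	 * sections: the graph walk in validate_chain() and the recycling
	 * (zapping) of lock classes are mutually excluded by graph_lock.
	 */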

Regards,
Boqun

> > --->8
> > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > index 39334d2d2b37..35d9bab65b75 100644
> > --- a/kernel/rcu/update.c
> > +++ b/kernel/rcu/update.c
> > @@ -275,8 +275,8 @@ EXPORT_SYMBOL_GPL(rcu_callback_map);
> >  
> >  noinstr int notrace debug_lockdep_rcu_enabled(void)
> >  {
> > -   return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
> > -  current->lockdep_recursion == 0;
> > +   return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE &&
> > +  __lockdep_enabled;
> >  }
> >  EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
> 
> Urgh, I didn't expect (and forgot to grep) lockdep_recursion users
> outside of lockdep itself :/ It looks like this is indeed the only one.
> 
> 


Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-11 Thread Boqun Feng
Hi,

On Fri, Oct 09, 2020 at 09:41:24AM -0400, Qian Cai wrote:
> On Fri, 2020-10-09 at 07:58 +, tip-bot2 for Peter Zijlstra wrote:
> > The following commit has been merged into the locking/core branch of tip:
> > 
> > Commit-ID: 4d004099a668c41522242aa146a38cc4eb59cb1e
> > Gitweb:
> > https://git.kernel.org/tip/4d004099a668c41522242aa146a38cc4eb59cb1e
> > Author:Peter Zijlstra 
> > AuthorDate:Fri, 02 Oct 2020 11:04:21 +02:00
> > Committer: Ingo Molnar 
> > CommitterDate: Fri, 09 Oct 2020 08:53:30 +02:00
> > 
> > lockdep: Fix lockdep recursion
> > 
> > Steve reported that lockdep_assert*irq*(), when nested inside lockdep
> > itself, will trigger a false-positive.
> > 
> > One example is the stack-trace code, as called from inside lockdep,
> > triggering tracing, which in turn calls RCU, which then uses
> > lockdep_assert_irqs_disabled().
> > 
> > Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu
> > variables")
> > Reported-by: Steven Rostedt 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Signed-off-by: Ingo Molnar 
> 
> Reverting this linux-next commit fixed booting RCU-list warnings everywhere.
> 

I think this happened because in this commit debug_lockdep_rcu_enabled()
didn't adapt to the change that made lockdep_recursion a percpu
variable?

Qian, would you mind trying the following?

Although, arguably the problem still exists, i.e. we still have an RCU
read-side critical section inside lock_acquire(), which may be called on
a yet-to-online CPU, which RCU doesn't watch. I think this used to be OK
because we don't "free" anything from lockdep, IOW, there is no
synchronize_rcu() or call_rcu() that _needs_ to wait for the RCU
read-side critical sections inside lockdep. But now we have lock class
recycling, so it might be a problem.

That said, currently validate_chain() and lock class recycling are
mutually excluded via graph_lock, so we are safe for this one ;-)

--->8
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 39334d2d2b37..35d9bab65b75 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -275,8 +275,8 @@ EXPORT_SYMBOL_GPL(rcu_callback_map);
 
 noinstr int notrace debug_lockdep_rcu_enabled(void)
 {
-   return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
-  current->lockdep_recursion == 0;
+   return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE &&
+  __lockdep_enabled;
 }
 EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
 

> == x86 ==
> [8.101841][T1] rcu: Hierarchical SRCU implementation.
> [8.110615][T5] NMI watchdog: Enabled. Permanently consumes one hw-PMU 
> counter.
> [8.153506][T1] smp: Bringing up secondary CPUs ...
> [8.163075][T1] x86: Booting SMP configuration:
> [8.167843][T1]  node  #0, CPUs:#1
> [4.002695][T0] 
> [4.002695][T0] =
> [4.002695][T0] WARNING: suspicious RCU usage
> [4.002695][T0] 5.9.0-rc8-next-20201009 #2 Not tainted
> [4.002695][T0] -
> [4.002695][T0] kernel/locking/lockdep.c:3497 RCU-list traversed in 
> non-reader section!!
> [4.002695][T0] 
> [4.002695][T0] other info that might help us debug this:
> [4.002695][T0] 
> [4.002695][T0] 
> [4.002695][T0] RCU used illegally from offline CPU!
> [4.002695][T0] rcu_scheduler_active = 1, debug_locks = 1
> [4.002695][T0] no locks held by swapper/1/0.
> [4.002695][T0] 
> [4.002695][T0] stack backtrace:
> [4.002695][T0] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> 5.9.0-rc8-next-20201009 #2
> [4.002695][T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 
> Gen10, BIOS A40 07/10/2019
> [4.002695][T0] Call Trace:
> [4.002695][T0]  dump_stack+0x99/0xcb
> [4.002695][T0]  __lock_acquire.cold.76+0x2ad/0x3e0
> lookup_chain_cache at kernel/locking/lockdep.c:3497
> (inlined by) lookup_chain_cache_add at kernel/locking/lockdep.c:3517
> (inlined by) validate_chain at kernel/locking/lockdep.c:3572
> (inlined by) __lock_acquire at kernel/locking/lockdep.c:4837
> [4.002695][T0]  ? lockdep_hardirqs_on_prepare+0x3d0/0x3d0
> [4.002695][T0]  lock_acquire+0x1c8/0x820
> lockdep_recursion_finish at kernel/locking/lockdep.c:435
> (inlined by) lock_acquire at kernel/locking/lockdep.c:5444
> (inlined by) lock_acquire at kernel/locking/lockdep.c:5407
> [4.002695][T0]  ? __debug_object_init+0xb4/0xf50
> [4.002695][T0]  ? memset+0x1f/0x40
> [4.002695][T0]  ? rcu_read_unlock+0x40/0x40
> [4.002695][T0]  ? mce_gather_info+0x170/0x170
> [4.002695][T0]  ? arch_freq_get_on_cpu+0x270/0x270
> [4.002695][T0]  ? mce_cpu_restart+0x40/0x40
> [4.002695][T0]  _raw_spin_lock_irqsave+0x30/0x50
> [4.002695][T0]  ? __debug_object_init+0xb4/0xf50
> [4.002695][T0]  

Re: lockdep null-ptr-deref

2020-10-02 Thread Boqun Feng
On Fri, Oct 02, 2020 at 03:09:29PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 02, 2020 at 08:36:02PM +0800, Boqun Feng wrote:
> 
> > But what if f2() is called with interrupt disabled? Or f2() disables
> > interrupt inside the function, like:
> > 
> > void f2(...)
> > {
> > local_irq_disable();
> > spin_lock();
> > g(...);
> > ...
> > local_irq_enable();
> > }
> > 
> > In this case, there wouldn't be any LOCK_ENABLED_*_READ usage for
> > rwlock_t A. As a result, we won't see it in the lockdep splat.
> 
> Hurm, fair enough. So just to make sure, you're arguing for:
> 
> -#define LOCK_TRACE_STATES  (XXX_LOCK_USAGE_STATES*4 + 1)
> +#define LOCK_TRACE_STATES  (XXX_LOCK_USAGE_STATES*4 + 2)
> 
> On top of my earlier patch, right?

Yep. Thanks ;-)

Regards,
Boqun


Re: lockdep null-ptr-deref

2020-10-02 Thread Boqun Feng
On Wed, Sep 30, 2020 at 09:02:28PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 30, 2020 at 08:18:18PM +0800, Boqun Feng wrote:
> 
> > For one thing, I do think that LOCK_READ_USED trace is helpful for
> > better reporting, because if there is a read lock in the dependency path
> > which causes the deadlock, it's better to have the LOCK_READ_USED trace
> > to know at least the initial READ usage. For example, if we have
> > 
> > void f1(...)
> > {
> > write_lock();
> > spin_lock();
> > // A -> C
> > ...
> > }
> > 
> > void g(...)
> > {
> > read_lock();
> > ...
> > }
> > void f2(...)
> > {
> > spin_lock();
> > g(...);
> > // B -> A
> > }
> > 
> > void f3(...) {
> > spin_lock();
> > spin_lock();
> > // C -> B, trigger lockdep splat
> > }
> > 
> > when lockdep reports the deadlock (at the time f3() is called), it will
> > be useful if we have a trace like:
> > 
> > INITIAL READ usage at:
> > g+0x.../0x...
> > f2+0x.../0x...
> > 
> > Thoughts?
> 
> Wouldn't that also be in LOCK_ENABLED_*_READ ?
> 

But what if f2() is called with interrupts disabled? Or f2() disables
interrupts inside the function, like:

void f2(...)
{
local_irq_disable();
spin_lock();
g(...);
...
local_irq_enable();
}

In this case, there wouldn't be any LOCK_ENABLED_*_READ usage for
rwlock_t A. As a result, we won't see it in the lockdep splat.

Regards,
Boqun

> That is, with PROVE_LOCKING on, the initial usage is bound to set more
> states, except for !check||trylock usage, and those aren't really all
> that interesting.


Re: lockdep null-ptr-deref

2020-09-30 Thread Boqun Feng
On Wed, Sep 30, 2020 at 11:49:37AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 30, 2020 at 11:16:11AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 30, 2020 at 07:08:23AM +0800, Boqun Feng wrote:
> > > I think there are two problems here:
> > > 
> > > 1) the "(null)" means we don't have the "usage_str" for a usage bit,
> > > which I think is the LOCK_USED_READ bit introduced by Peter at
> > > 23870f122768 ('locking/lockdep: Fix "USED" <- "IN-NMI" inversions').
> > > 
> > > 2) the next null-ptr-deref, and I think this is also caused by
> > > LOCK_USED_READ bit, because in the loop inside
> > > print_lock_class_header(), we iterate from 0 to LOCK_USAGE_STATES (which
> > > is 4*2 + 3), however the class->usage_traces[] only has
> > > XXX_LOCK_USAGE_STATES (which is 4*2 + 1) elements, so if we have
> > > LOCK_USED_READ bit set in ->usage_mask, we will try to access an element
> > > out of the ->usage_traces[] array.
> > > 
> > > Probably the following helps? And another possible fix is to enlarge the
> > > ->usage_trace[] array and record the call trace of LOCK_READ_USED.
> > 
> > Urgh.. yeah, I wanted to avoid saving that trace; it's pretty useless :/
> > The existing USED trace is already mostly pointless, the consistent
> > thing would be to remove both but that might be too radical.
> > 
> > But you're right in that I made a right mess of it. Not sure what's
> > best here.
> > 
> > Let me have a play.
> 
> How's something like this? It's bigger than I'd like, but I feel the
> result is more consistent/readable.
> 

Looks good to me.

For one thing, I do think that the LOCK_READ_USED trace is helpful for
better reporting, because if there is a read lock in the dependency path
that causes the deadlock, it's better to have the LOCK_READ_USED trace
so that we at least know the initial READ usage. For example, if we have

void f1(...)
{
write_lock();
spin_lock();
// A -> C
...
}

void g(...)
{
read_lock();
...
}
void f2(...)
{
spin_lock();
g(...);
// B -> A
}

void f3(...) {
spin_lock();
spin_lock();
// C -> B, trigger lockdep splat
}

when lockdep reports the deadlock (at the time f3() is called), it would
be useful to have a trace like:

INITIAL READ usage at:
g+0x.../0x...
f2+0x.../0x...

Thoughts?

Regards,
Boqun

> ---
> diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
> index bb35b449f533..a55b1d314ae8 100644
> --- a/include/linux/lockdep_types.h
> +++ b/include/linux/lockdep_types.h
> @@ -35,8 +35,12 @@ enum lockdep_wait_type {
>  /*
>   * We'd rather not expose kernel/lockdep_states.h this wide, but we do need
>   * the total number of states... :-(
> + *
> + * XXX_LOCK_USAGE_STATES is the number of lines in lockdep_states.h, for each
> + * of those we generates 4 states, Additionally we (for now) report on USED.
>   */
> -#define XXX_LOCK_USAGE_STATES(1+2*4)
> +#define XXX_LOCK_USAGE_STATES2
> +#define LOCK_TRACE_STATES(XXX_LOCK_USAGE_STATES*4 + 1)
>  
>  /*
>   * NR_LOCKDEP_CACHING_CLASSES ... Number of classes
> @@ -106,7 +110,7 @@ struct lock_class {
>* IRQ/softirq usage tracking bits:
>*/
>   unsigned long   usage_mask;
> - const struct lock_trace *usage_traces[XXX_LOCK_USAGE_STATES];
> + const struct lock_trace *usage_traces[LOCK_TRACE_STATES];
>  
>   /*
>* Generation counter, when doing certain classes of graph walking,
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 454355c033d2..4f98ac8b4575 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -600,6 +600,8 @@ static const char *usage_str[] =
>  #include "lockdep_states.h"
>  #undef LOCKDEP_STATE
>   [LOCK_USED] = "INITIAL USE",
> + [LOCK_USED_READ] = "INITIAL READ USE",
> + /* abused as string storage for verify_lock_unused() */
>   [LOCK_USAGE_STATES] = "IN-NMI",
>  };
>  #endif
> @@ -2231,7 +2233,7 @@ static void print_lock_class_header(struct lock_class 
> *class, int depth)
>  #endif
>   printk(KERN_CONT " {\n");
>  
> - for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
> + for (bit = 0; bit < LOCK_TRACE_STATES; bit++) {
>   i

[tip: locking/core] lockdep: Optimize the memory usage of circular queue

2020-09-30 Thread tip-bot2 for Boqun Feng
The following commit has been merged into the locking/core branch of tip:

Commit-ID: 6d1823ccc480866e571ab1206665d693aeb600cf
Gitweb:
https://git.kernel.org/tip/6d1823ccc480866e571ab1206665d693aeb600cf
Author:Boqun Feng 
AuthorDate:Thu, 17 Sep 2020 16:01:50 +08:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 29 Sep 2020 09:56:59 +02:00

lockdep: Optimize the memory usage of circular queue

Qian Cai reported a BFS_EQUEUEFULL warning [1] after read-recursive
deadlock detection was merged into the tip tree recently. Unlike the
previous lockdep graph search, which iterates every lock class (every
node in the graph) exactly once, the graph search for read-recursive
deadlock detection needs to iterate every lock dependency (every edge in
the graph) once. As a result, the maximum memory cost of the circular
queue changes from O(V), where V is the number of lock classes (nodes or
vertices) in the graph, to O(E), where E is the number of lock
dependencies (edges), because every lock class or dependency gets
enqueued once in the BFS. Therefore we hit the BFS_EQUEUEFULL case.

However, we don't actually need to enqueue all dependencies for the BFS,
because every time we enqueue a dependency, we almost always enqueue all
the other dependencies in the same dependency list ("almost" because we
currently check before enqueueing, so a dependency that doesn't pass the
check stage is not enqueued; however, we could always do the check in
reverse order). Based on this, we can enqueue only the first dependency
from a dependency list, and every time we want to fetch a new dependency
to work on, we can either:

  1) fetch the dependency next to the current dependency in the
     dependency list
or

  2) if the dependency in 1) doesn't exist, fetch a dependency from
     the queue.
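
For illustration, the queue-saving trick looks roughly like this on a
generic adjacency-list graph (a simplified sketch with made-up types and
names, not the lockdep data structures):

#include <stdbool.h>
#include <stddef.h>

struct node;

struct edge {
	struct edge *next;	/* sibling edge in the same adjacency list */
	struct node *to;	/* node this edge points to */
};

struct node {
	struct edge *edges;	/* head of the adjacency list */
	bool visited;
};

/*
 * Visit every node reachable from @src. Only the first edge of each
 * adjacency list is ever enqueued; its siblings are reached lazily via
 * ->next, so the queue holds far fewer entries than one per edge.
 * Queue-overflow handling is omitted for brevity.
 */
static void bfs(struct node *src, struct edge **queue, size_t qsize)
{
	size_t head = 0, tail = 0;
	struct edge *e = src->edges;

	src->visited = true;

	while (e || (head < tail && (e = queue[head++ % qsize]))) {
		struct node *n = e->to;

		if (!n->visited) {
			n->visited = true;
			/* enqueue only the first child edge of @n */
			if (n->edges)
				queue[tail++ % qsize] = n->edges;
		}
		/* walk this edge's sibling next; dequeue only when none left */
		e = e->next;
	}
}

This mirrors the shape of __bfs() after the patch: process the current
entry, enqueue only the head of its dep list, then move on to the
sibling entry and fall back to the queue once the sibling list is
exhausted.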

With this approach, the "max bfs queue depth" for a x86_64_defconfig +
lockdep and selftest config kernel can get descreased from:

max bfs queue depth:   201

to (after apply this patch)

max bfs queue depth:   61

While I'm at it, clean up the code logic a little (e.g. directly return
other than set a "ret" value and goto the "exit" label).

[1]: 
https://lore.kernel.org/lkml/17343f6f7f2438fc376125384133c5ba70c2a681.ca...@redhat.com/

Reported-by: Qian Cai 
Reported-by: syzbot+62ebe501c1ce9a91f...@syzkaller.appspotmail.com
Signed-off-by: Boqun Feng 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200917080210.108095-1-boqun.f...@gmail.com
---
 kernel/locking/lockdep.c |  99 +++---
 1 file changed, 60 insertions(+), 39 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index cccf4bc..9560a4e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1606,6 +1606,15 @@ static inline void bfs_init_rootb(struct lock_list *lock,
lock->only_xr = (hlock->read != 0);
 }
 
+static inline struct lock_list *__bfs_next(struct lock_list *lock, int offset)
+{
+   if (!lock || !lock->parent)
+   return NULL;
+
+   return list_next_or_null_rcu(get_dep_list(lock->parent, offset),
+>entry, struct lock_list, entry);
+}
+
 /*
  * Breadth-First Search to find a strong path in the dependency graph.
  *
@@ -1639,36 +1648,25 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
 struct lock_list **target_entry,
 int offset)
 {
+   struct circular_queue *cq = _cq;
+   struct lock_list *lock = NULL;
struct lock_list *entry;
-   struct lock_list *lock;
struct list_head *head;
-   struct circular_queue *cq = _cq;
-   enum bfs_result ret = BFS_RNOMATCH;
+   unsigned int cq_depth;
+   bool first;
 
lockdep_assert_locked();
 
-   if (match(source_entry, data)) {
-   *target_entry = source_entry;
-   ret = BFS_RMATCH;
-   goto exit;
-   }
-
-   head = get_dep_list(source_entry, offset);
-   if (list_empty(head))
-   goto exit;
-
__cq_init(cq);
__cq_enqueue(cq, source_entry);
 
-   while ((lock = __cq_dequeue(cq))) {
-   bool prev_only_xr;
-
-   if (!lock->class) {
-   ret = BFS_EINVALIDNODE;
-   goto exit;
-   }
+   while ((lock = __bfs_next(lock, offset)) || (lock = __cq_dequeue(cq))) {
+   if (!lock->class)
+   return BFS_EINVALIDNODE;
 
/*
+* Step 1: check whether we already finish on this one.
+*
 * If we have visited all the dependencies from this @lock to
 * others (iow, if we have visited all lock_list entries in
 * @lock->class->locks_{after,before}) we skip, other

Re: lockdep null-ptr-deref

2020-09-29 Thread Boqun Feng
On Tue, Sep 29, 2020 at 10:31:56AM -0400, Qian Cai wrote:
> I tried to add a few new Kconfig options like LEDS_TRIGGERS instantly trigger 
> a
> warning during the boot, and then there is null-ptr-deref in lockdep below. 
> Any
> idea?
> 
> [   16.487309] WARNING: possible irq lock inversion dependency detected
> [   16.488313] 5.9.0-rc7-next-20200928+ #9 Not tainted
> [   16.488936] 
> [   16.489767] swapper/6/0 just changed the state of lock:
> [   16.490449] 8889eea6f418 (>lock){-...}-{2:2}, at: 
> ata_bmdma_interrupt+0x1e/0x530 [libata]
> __ata_sff_interrupt at /home/linux-mm/linux-next/drivers/ata/libata-sff.c:1534
> (inlined by) ata_bmdma_interrupt at 
> /home/linux-mm/linux-next/drivers/ata/libata-sff.c:2832
> [   16.491639] but this lock took another, HARDIRQ-READ-unsafe lock in the 
> past:
> [   16.492561]  (>leddev_list_lock){.+.?}-{2:2}
> [   16.492565] 
> [   16.492565] 
> [   16.492565] and interrupts could create inverse lock ordering between them.
> [   16.492565] 
> [   16.494635] 
> [   16.494635] other info that might help us debug this:
> [   16.495479]  Possible interrupt unsafe locking scenario:
> [   16.495479] 
> [   16.496360]CPU0CPU1
> [   16.496941]
> [   16.497542]   lock(>leddev_list_lock);
> [   16.498095]local_irq_disable();
> [   16.498864]lock(>lock);
> [   16.499611]lock(>leddev_list_lock);
> [   16.500481]   
> [   16.500833] lock(>lock);
> [   16.501289] 
> [   16.501289]  *** DEADLOCK ***
> [   16.501289] 
> [   16.502044] no locks held by swapper/6/0.
> [   16.502566] 
> [   16.502566] the shortest dependencies between 2nd lock and 1st lock:
> [   16.503578]  -> (>leddev_list_lock){.+.?}-{2:2} {
> [   16.504259] HARDIRQ-ON-R at:
> [   16.504692]   lock_acquire+0x17f/0x7e0
> [   16.505411]   _raw_read_lock+0x38/0x70
> [   16.506120]   led_trigger_event+0x2b/0xb0
> led_trigger_event at drivers/leds/led-triggers.c:386
> (inlined by) led_trigger_event at drivers/leds/led-triggers.c:377
> [   16.506868]   kbd_propagate_led_state+0x5d/0x80
> [   16.507680]   kbd_bh+0x14d/0x1d0
> [   16.508335]   tasklet_action_common.isra.13+0x23a/0x2e0
> [   16.509235]   __do_softirq+0x1ce/0x828
> [   16.509940]   run_ksoftirqd+0x26/0x50
> [   16.510647]   smpboot_thread_fn+0x30f/0x740
> [   16.511413]   kthread+0x357/0x420
> [   16.512068]   ret_from_fork+0x22/0x30
> [   16.512762] IN-SOFTIRQ-R at:
> [   16.513187]   lock_acquire+0x17f/0x7e0
> [   16.513891]   _raw_read_lock+0x38/0x70
> [   16.514602]   led_trigger_event+0x2b/0xb0
> [   16.515356]   kbd_propagate_led_state+0x5d/0x80
> [   16.516165]   kbd_bh+0x14d/0x1d0
> [   16.516810]   tasklet_action_common.isra.13+0x23a/0x2e0
> [   16.517701]   __do_softirq+0x1ce/0x828
> [   16.518418]   run_ksoftirqd+0x26/0x50
> [   16.519119]   smpboot_thread_fn+0x30f/0x740
> [   16.519874]   kthread+0x357/0x420
> [   16.520460] scsi 0:0:0:0: Direct-Access ATA  QEMU HARDDISK2.5+ 
> PQ: 0 ANSI: 5
> [   16.520528]   ret_from_fork+0x22/0x30
> [   16.520531] SOFTIRQ-ON-R at:
> [   16.522704]   lock_acquire+0x17f/0x7e0
> [   16.523423]   _raw_read_lock+0x5d/0x70
> [   16.524124]   led_trigger_event+0x2b/0xb0
> [   16.524865]   kbd_propagate_led_state+0x5d/0x80
> [   16.525671]   kbd_start+0xd2/0xf0
> [   16.526332]   input_register_handle+0x282/0x4f0
> [   16.527142]   kbd_connect+0xc0/0x120
> [   16.527826]   input_attach_handler+0x10a/0x170
> [   16.528758]   input_register_device.cold.22+0xac/0x29d
> [   16.529651]   atkbd_connect+0x58f/0x810
> [   16.530374]   serio_connect_driver+0x4a/0x70
> [   16.531144]   really_probe+0x222/0xb20
> [   16.531861]   driver_probe_device+0x1f6/0x380
> [   16.532650]   device_driver_attach+0xea/0x120
> [   16.533441]   __driver_attach+0xf5/0x270
> [   16.534172]   bus_for_each_dev+0x11c/0x1b0
> [   16.534924]   serio_handle_event+0x1df/0x7f0
> [   16.535701]   process_one_work+0x842/0x1410
> [   16.536470]   worker_thread+0x87/0xb40
> [   16.537178]  

Re: [PATCH] lockdep: Optimize the memory usage of circular queue

2020-09-28 Thread Boqun Feng
On Mon, Sep 28, 2020 at 10:51:04AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 17, 2020 at 04:01:50PM +0800, Boqun Feng wrote:
> 
> > __cq_init(cq);
> > __cq_enqueue(cq, source_entry);
> >  
> > +   while (lock || (lock = __cq_dequeue(cq))) {
> > +   if (!lock->class)
> > +   return BFS_EINVALIDNODE;
> >  
> > /*
> > +* Step 1: check whether we already finish on this one.
> > +*
> >  * If we have visited all the dependencies from this @lock to
> >  * others (iow, if we have visited all lock_list entries in
> >  * @lock->class->locks_{after,before}) we skip, otherwise go
> 
> > @@ -1698,29 +1685,68 @@ static enum bfs_result __bfs(struct lock_list 
> > *source_entry,
> >  
> > /* If nothing left, we skip */
> > if (!dep)
> > +   goto next;
> >  
> > /* If there are only -(*R)-> left, set that for the 
> > next step */
> > +   lock->only_xr = !(dep & (DEP_SN_MASK | DEP_EN_MASK));
> > +   }
> >  
> > +   /*
> > +* Step 3: we haven't visited this and there is a strong
> > +* dependency path to this, so check with @match.
> > +*/
> > +   if (match(lock, data)) {
> > +   *target_entry = lock;
> > +   return BFS_RMATCH;
> > +   }
> > +
> > +   /*
> > +* Step 4: if not match, expand the path by adding the
> > +* afterwards or backwards dependencis in the search
> > +*
> > +* Note we only enqueue the first of the list into the queue,
> > +* because we can always find a sibling dependency from one
> > +* (see label 'next'), as a result the space of queue is saved.
> > +*/
> > +   head = get_dep_list(lock, offset);
> > +   entry = list_first_or_null_rcu(head, struct lock_list, entry);
> > +   if (entry) {
> > +   unsigned int cq_depth;
> > +
> > +   if (__cq_enqueue(cq, entry))
> > +   return BFS_EQUEUEFULL;
> >  
> > cq_depth = __cq_get_elem_count(cq);
> > if (max_bfs_queue_depth < cq_depth)
> > max_bfs_queue_depth = cq_depth;
> > }
> > +
> > +   /*
> > +* Update the ->parent, so when @entry is iterated, we know the
> > +* previous dependency.
> > +*/
> > +   list_for_each_entry_rcu(entry, head, entry)
> > +   visit_lock_entry(entry, lock);
> 
> This confused me for a while. I think it might be a little clearer if we
> put this inside the previous block.
> 
> Alternatively, we could try and write it something like so:
> 
>   /*
>* Step 4: if not match, expand the path by adding the
>* afterwards or backwards dependencis in the search
>*/
>   first = true;
>   head = get_dep_list(lock, offset);
>   list_for_each_entry_rcu(entry, head, entry) {
>   visit_lock_entry(entry, lock);
> 
>   if (!first)
>   continue;
> 
>   /*
>* Only enqueue the first entry of the list,
>* we'll iterate it's siblings at the next
>* label.
>*/
>   first = false;
>   if (__cq_enqueue(cq, entry))
>   return BFS_EQUEUEFULL;
> 
>   cq_depth = __cq_get_elem_count(cq);
>   if (max_bfs_queue_depth < cq_depth)
>   max_bfs_queue_depth = cq_depth;
>   }
> 
> Hmm?
> 

Better than mine ;-)

> > +next:
> > +   /*
> > +* Step 5: fetch the next dependency to process.
> > +*
> > +* If there is a previous dependency, we fetch the sibling
> > +* dependency in the dep list of previous dependency.
> > +*
> > +* Otherwise set @lock to NULL to fetch the next entry from
> > +* queue.
> > +*/
> > +   if (lock->parent) {
> > +  

Re: [PATCH] lockdep: Optimize the memory usage of circular queue

2020-09-28 Thread Boqun Feng
On Mon, Sep 28, 2020 at 10:03:19AM +0200, Dmitry Vyukov wrote:
> On Thu, Sep 24, 2020 at 5:13 PM Boqun Feng  wrote:
> >
> > Ping ;-)
> >
> > Regards,
> > Boqun
> 
> Hi Boqun,
> 
> Peter says this may also fix this issue:
> https://syzkaller.appspot.com/bug?extid=62ebe501c1ce9a91f68c
> please add the following tag to the patch so that the bug report will
> be closed on merge:
> Reported-by: syzbot+62ebe501c1ce9a91f...@syzkaller.appspotmail.com
> 

Sure, I will if another version of this patch is required; otherwise (if
this one looks good to Peter), I will rely on Peter to add the tag ;-)
Does that work for you?

Regards,
Boqun

> > On Thu, Sep 17, 2020 at 04:01:50PM +0800, Boqun Feng wrote:
> > > Qian Cai reported a BFS_EQUEUEFULL warning [1] after read recursive
> > > deadlock detection merged into tip tree recently. Unlike the previous
> > > lockep graph searching, which iterate every lock class (every node in
> > > the graph) exactly once, the graph searching for read recurisve deadlock
> > > detection needs to iterate every lock dependency (every edge in the
> > > graph) once, as a result, the maximum memory cost of the circular queue
> > > changes from O(V), where V is the number of lock classes (nodes or
> > > vertices) in the graph, to O(E), where E is the number of lock
> > > dependencies (edges), because every lock class or dependency gets
> > > enqueued once in the BFS. Therefore we hit the BFS_EQUEUEFULL case.
> > >
> > > However, actually we don't need to enqueue all dependencies for the BFS,
> > > because every time we enqueue a dependency, we almostly enqueue all
> > > other dependencies in the same dependency list ("almostly" is because
> > > we currently check before enqueue, so if a dependency doesn't pass the
> > > check stage we won't enqueue it, however, we can always do in reverse
> > > ordering), based on this, we can only enqueue the first dependency from
> > > a dependency list and every time we want to fetch a new dependency to
> > > work, we can either:
> > >
> > > 1)fetch the dependency next to the current dependency in the
> > >   dependency list
> > > or
> > > 2)if the dependency in 1) doesn't exist, fetch the dependency from
> > >   the queue.
> > >
> > > With this approach, the "max bfs queue depth" for a x86_64_defconfig +
> > > lockdep and selftest config kernel can get descreased from:
> > >
> > > max bfs queue depth:   201
> > >
> > > to (after apply this patch)
> > >
> > > max bfs queue depth:   61
> > >
> > > While I'm at it, clean up the code logic a little (e.g. directly return
> > > other than set a "ret" value and goto the "exit" label).
> > >
> > > [1]: 
> > > https://lore.kernel.org/lkml/17343f6f7f2438fc376125384133c5ba70c2a681.ca...@redhat.com/
> > >
> > > Reported-by: Qian Cai 
> > > Signed-off-by: Boqun Feng 
> > > ---
> > >  kernel/locking/lockdep.c | 108 ---
> > >  1 file changed, 67 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index cccf4bc759c6..761c2327e9cf 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -1640,35 +1640,22 @@ static enum bfs_result __bfs(struct lock_list 
> > > *source_entry,
> > >int offset)
> > >  {
> > >   struct lock_list *entry;
> > > - struct lock_list *lock;
> > > + struct lock_list *lock = NULL;
> > >   struct list_head *head;
> > >   struct circular_queue *cq = _cq;
> > > - enum bfs_result ret = BFS_RNOMATCH;
> > >
> > >   lockdep_assert_locked();
> > >
> > > - if (match(source_entry, data)) {
> > > - *target_entry = source_entry;
> > > - ret = BFS_RMATCH;
> > > - goto exit;
> > > - }
> > > -
> > > - head = get_dep_list(source_entry, offset);
> > > - if (list_empty(head))
> > > - goto exit;
> > > -
> > >   __cq_init(cq);
> > >   __cq_enqueue(cq, source_entry);
> > >
> > > - while ((lock = __cq_dequeue(cq))) {
> > > - bool prev_only_xr;
>

Re: [PATCH] lockdep: Optimize the memory usage of circular queue

2020-09-24 Thread Boqun Feng
Ping ;-)

Regards,
Boqun

On Thu, Sep 17, 2020 at 04:01:50PM +0800, Boqun Feng wrote:
> Qian Cai reported a BFS_EQUEUEFULL warning [1] after read recursive
> deadlock detection merged into tip tree recently. Unlike the previous
> lockep graph searching, which iterate every lock class (every node in
> the graph) exactly once, the graph searching for read recurisve deadlock
> detection needs to iterate every lock dependency (every edge in the
> graph) once, as a result, the maximum memory cost of the circular queue
> changes from O(V), where V is the number of lock classes (nodes or
> vertices) in the graph, to O(E), where E is the number of lock
> dependencies (edges), because every lock class or dependency gets
> enqueued once in the BFS. Therefore we hit the BFS_EQUEUEFULL case.
> 
> However, actually we don't need to enqueue all dependencies for the BFS,
> because every time we enqueue a dependency, we almostly enqueue all
> other dependencies in the same dependency list ("almostly" is because
> we currently check before enqueue, so if a dependency doesn't pass the
> check stage we won't enqueue it, however, we can always do in reverse
> ordering), based on this, we can only enqueue the first dependency from
> a dependency list and every time we want to fetch a new dependency to
> work, we can either:
> 
> 1)fetch the dependency next to the current dependency in the
>   dependency list
> or
> 2)if the dependency in 1) doesn't exist, fetch the dependency from
>   the queue.
> 
> With this approach, the "max bfs queue depth" for a x86_64_defconfig +
> lockdep and selftest config kernel can get descreased from:
> 
> max bfs queue depth:   201
> 
> to (after apply this patch)
> 
> max bfs queue depth:   61
> 
> While I'm at it, clean up the code logic a little (e.g. directly return
> other than set a "ret" value and goto the "exit" label).
> 
> [1]: 
> https://lore.kernel.org/lkml/17343f6f7f2438fc376125384133c5ba70c2a681.ca...@redhat.com/
> 
> Reported-by: Qian Cai 
> Signed-off-by: Boqun Feng 
> ---
>  kernel/locking/lockdep.c | 108 ---
>  1 file changed, 67 insertions(+), 41 deletions(-)
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index cccf4bc759c6..761c2327e9cf 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -1640,35 +1640,22 @@ static enum bfs_result __bfs(struct lock_list 
> *source_entry,
>int offset)
>  {
>   struct lock_list *entry;
> - struct lock_list *lock;
> + struct lock_list *lock = NULL;
>   struct list_head *head;
>   struct circular_queue *cq = _cq;
> - enum bfs_result ret = BFS_RNOMATCH;
>  
>   lockdep_assert_locked();
>  
> - if (match(source_entry, data)) {
> - *target_entry = source_entry;
> - ret = BFS_RMATCH;
> - goto exit;
> - }
> -
> - head = get_dep_list(source_entry, offset);
> - if (list_empty(head))
> - goto exit;
> -
>   __cq_init(cq);
>   __cq_enqueue(cq, source_entry);
>  
> - while ((lock = __cq_dequeue(cq))) {
> - bool prev_only_xr;
> -
> - if (!lock->class) {
> - ret = BFS_EINVALIDNODE;
> - goto exit;
> - }
> + while (lock || (lock = __cq_dequeue(cq))) {
> + if (!lock->class)
> + return BFS_EINVALIDNODE;
>  
>   /*
> +  * Step 1: check whether we already finish on this one.
> +  *
>* If we have visited all the dependencies from this @lock to
>* others (iow, if we have visited all lock_list entries in
>* @lock->class->locks_{after,before}) we skip, otherwise go
> @@ -1676,17 +1663,17 @@ static enum bfs_result __bfs(struct lock_list 
> *source_entry,
>* list accessed.
>*/
>   if (lock_accessed(lock))
> - continue;
> + goto next;
>   else
>   mark_lock_accessed(lock);
>  
> - head = get_dep_list(lock, offset);
> -
> - prev_only_xr = lock->only_xr;
> -
> - list_for_each_entry_rcu(entry, head, entry) {
> - unsigned int cq_depth;
> - u8 dep = entry->dep;
> + /*
> +  * Step 2: check whether prev dependency and this form a strong
> +  * dependency path.
> +  

[PATCH] lockdep: Optimize the memory usage of circular queue

2020-09-17 Thread Boqun Feng
Qian Cai reported a BFS_EQUEUEFULL warning [1] after read-recursive
deadlock detection was merged into the tip tree recently. Unlike the
previous lockdep graph search, which iterates every lock class (every
node in the graph) exactly once, the graph search for read-recursive
deadlock detection needs to iterate every lock dependency (every edge in
the graph) once. As a result, the maximum memory cost of the circular
queue changes from O(V), where V is the number of lock classes (nodes or
vertices) in the graph, to O(E), where E is the number of lock
dependencies (edges), because every lock class or dependency gets
enqueued once in the BFS. Therefore we hit the BFS_EQUEUEFULL case.

However, we don't actually need to enqueue all dependencies for the BFS,
because every time we enqueue a dependency, we almost always enqueue all
the other dependencies in the same dependency list ("almost" because we
currently check before enqueueing, so a dependency that doesn't pass the
check stage is not enqueued; however, we could always do the check in
reverse order). Based on this, we can enqueue only the first dependency
from a dependency list, and every time we want to fetch a new dependency
to work on, we can either:

1)  fetch the dependency next to the current dependency in the
    dependency list
or
2)  if the dependency in 1) doesn't exist, fetch a dependency from
    the queue.

With this approach, the "max bfs queue depth" for a x86_64_defconfig +
lockdep and selftest config kernel can get descreased from:

max bfs queue depth:   201

to (after apply this patch)

max bfs queue depth:   61

While I'm at it, clean up the code logic a little (e.g. directly return
other than set a "ret" value and goto the "exit" label).

[1]: 
https://lore.kernel.org/lkml/17343f6f7f2438fc376125384133c5ba70c2a681.ca...@redhat.com/

Reported-by: Qian Cai 
Signed-off-by: Boqun Feng 
---
 kernel/locking/lockdep.c | 108 ---
 1 file changed, 67 insertions(+), 41 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index cccf4bc759c6..761c2327e9cf 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1640,35 +1640,22 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
 int offset)
 {
struct lock_list *entry;
-   struct lock_list *lock;
+   struct lock_list *lock = NULL;
struct list_head *head;
struct circular_queue *cq = _cq;
-   enum bfs_result ret = BFS_RNOMATCH;
 
lockdep_assert_locked();
 
-   if (match(source_entry, data)) {
-   *target_entry = source_entry;
-   ret = BFS_RMATCH;
-   goto exit;
-   }
-
-   head = get_dep_list(source_entry, offset);
-   if (list_empty(head))
-   goto exit;
-
__cq_init(cq);
__cq_enqueue(cq, source_entry);
 
-   while ((lock = __cq_dequeue(cq))) {
-   bool prev_only_xr;
-
-   if (!lock->class) {
-   ret = BFS_EINVALIDNODE;
-   goto exit;
-   }
+   while (lock || (lock = __cq_dequeue(cq))) {
+   if (!lock->class)
+   return BFS_EINVALIDNODE;
 
/*
+* Step 1: check whether we already finish on this one.
+*
 * If we have visited all the dependencies from this @lock to
 * others (iow, if we have visited all lock_list entries in
 * @lock->class->locks_{after,before}) we skip, otherwise go
@@ -1676,17 +1663,17 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
 * list accessed.
 */
if (lock_accessed(lock))
-   continue;
+   goto next;
else
mark_lock_accessed(lock);
 
-   head = get_dep_list(lock, offset);
-
-   prev_only_xr = lock->only_xr;
-
-   list_for_each_entry_rcu(entry, head, entry) {
-   unsigned int cq_depth;
-   u8 dep = entry->dep;
+   /*
+* Step 2: check whether prev dependency and this form a strong
+* dependency path.
+*/
+   if (lock->parent) { /* Parent exists, check prev dependency */
+   u8 dep = lock->dep;
+   bool prev_only_xr = lock->parent->only_xr;
 
/*
 * Mask out all -(S*)-> if we only have *R in previous
@@ -1698,29 +1685,68 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
 
/* If nothing left, we skip */
if (!dep)
-   continue;
+

Re: [RFC v7 11/19] lockdep: Fix recursive read lock related safe->unsafe detection

2020-09-16 Thread Boqun Feng
On Wed, Sep 16, 2020 at 05:11:59PM -0400, Qian Cai wrote:
> On Thu, 2020-09-17 at 00:14 +0800, Boqun Feng wrote:
> > Found a way to resolve this while still keeping the BFS. Every time when
> > we want to enqueue a lock_list, we basically enqueue a whole dep list of
> > entries from the previous lock_list, so we can use a trick here: instead
> > enqueue all the entries, we only enqueue the first entry and we can
> > fetch other silbing entries with list_next_or_null_rcu(). Patch as
> > below, I also took the chance to clear the code up and add more
> > comments. I could see this number (in /proc/lockdep_stats):
> > 
> > max bfs queue depth:   201
> > 
> > down to (after apply this patch)
> > 
> > max bfs queue depth:   61
> > 
> > with x86_64_defconfig along with lockdep and selftest configs.
> > 
> > Qian, could you give it a try?
> 
> It works fine as the number went down from around 3000 to 500 on our 
> workloads.
> 

Thanks, let me send a proper patch. I will add a Reported-by tag from
you.

Regards,
Boqun


Re: [RFC v7 11/19] lockdep: Fix recursive read lock related safe->unsafe detection

2020-09-16 Thread Boqun Feng
On Wed, Sep 16, 2020 at 04:10:46PM +0800, Boqun Feng wrote:
> On Tue, Sep 15, 2020 at 02:32:51PM -0400, Qian Cai wrote:
> > On Fri, 2020-08-07 at 15:42 +0800, Boqun Feng wrote:
> > > Currently, in safe->unsafe detection, lockdep misses the fact that a
> > > LOCK_ENABLED_IRQ_*_READ usage and a LOCK_USED_IN_IRQ_*_READ usage may
> > > cause deadlock too, for example:
> > > 
> > >   P1  P2
> > >   
> > >   write_lock(l1); 
> > >   read_lock(l2);
> > >   write_lock(l2);
> > >   
> > >   read_lock(l1);
> > > 
> > > Actually, all of the following cases may cause deadlocks:
> > > 
> > >   LOCK_USED_IN_IRQ_* -> LOCK_ENABLED_IRQ_*
> > >   LOCK_USED_IN_IRQ_*_READ -> LOCK_ENABLED_IRQ_*
> > >   LOCK_USED_IN_IRQ_* -> LOCK_ENABLED_IRQ_*_READ
> > >   LOCK_USED_IN_IRQ_*_READ -> LOCK_ENABLED_IRQ_*_READ
> > > 
> > > To fix this, we need to 1) change the calculation of exclusive_mask() so
> > > that READ bits are not dropped and 2) always call usage() in
> > > mark_lock_irq() to check usage deadlocks, even when the new usage of the
> > > lock is READ.
> > > 
> > > Besides, adjust usage_match() and usage_acculumate() to recursive read
> > > lock changes.
> > > 
> > > Signed-off-by: Boqun Feng 
> > 
> > So our daily CI starts to trigger a warning (graph corruption?) below. From 
> > the
> > call traces, this recent patchset changed a few related things here and 
> > there.
> > Does it ring any bells?
> > 
> > [14828.805563][T145183] lockdep bfs error:-1
> 
> -1 is BFS_EQUEUEFULL, that means we hit the size limitation in lockdep
> searching, which is possible since recursive read deadlock detection
> tries to make the all edges (dependencies) searched. So maybe we should
> switch to DFS instead of BFS, I will look into this, in the meanwhile,
> could you try the following to see if it can help on the warnings you
> got?
> 

Found a way to resolve this while still keeping the BFS. Every time we
want to enqueue a lock_list, we basically enqueue the whole dep list of
entries from the previous lock_list, so we can use a trick here: instead
of enqueueing all the entries, we only enqueue the first entry, and we
can fetch the other sibling entries with list_next_or_null_rcu(). Patch
is below; I also took the chance to clean the code up and add more
comments. I could see this number (in /proc/lockdep_stats):

max bfs queue depth:   201

go down to (after applying this patch)

max bfs queue depth:   61

with x86_64_defconfig along with the lockdep and selftest configs.

Qian, could you give it a try?

Regards,
Boqun

-->8
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 454355c033d2..1cc1302bf319 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1640,35 +1640,22 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
 int offset)
 {
struct lock_list *entry;
-   struct lock_list *lock;
+   struct lock_list *lock = NULL;
struct list_head *head;
struct circular_queue *cq = _cq;
-   enum bfs_result ret = BFS_RNOMATCH;
 
lockdep_assert_locked();
 
-   if (match(source_entry, data)) {
-   *target_entry = source_entry;
-   ret = BFS_RMATCH;
-   goto exit;
-   }
-
-   head = get_dep_list(source_entry, offset);
-   if (list_empty(head))
-   goto exit;
-
__cq_init(cq);
__cq_enqueue(cq, source_entry);
 
-   while ((lock = __cq_dequeue(cq))) {
-   bool prev_only_xr;
-
-   if (!lock->class) {
-   ret = BFS_EINVALIDNODE;
-   goto exit;
-   }
+   while (lock || (lock = __cq_dequeue(cq))) {
+   if (!lock->class)
+   return BFS_EINVALIDNODE;
 
/*
+* Step 1: check whether we already finish on this one.
+*
 * If we have visited all the dependencies from this @lock to
 * others (iow, if we have visited all lock_list entries in
 * @lock->class->locks_{after,before}) we skip, otherwise go
@@ -1676,17 +1663,17 @@ static enum bfs_result __bfs(struct lock_list 
*source_entry,
 * list accessed.
 */
if (lock_accessed(lock))
-   continue;
+   goto next;
else
mark_lock_accessed

Re: [RFC v7 11/19] lockdep: Fix recursive read lock related safe->unsafe detection

2020-09-16 Thread Boqun Feng
On Tue, Sep 15, 2020 at 02:32:51PM -0400, Qian Cai wrote:
> On Fri, 2020-08-07 at 15:42 +0800, Boqun Feng wrote:
> > Currently, in safe->unsafe detection, lockdep misses the fact that a
> > LOCK_ENABLED_IRQ_*_READ usage and a LOCK_USED_IN_IRQ_*_READ usage may
> > cause deadlock too, for example:
> > 
> > P1  P2
> > 
> > write_lock(l1); 
> > read_lock(l2);
> > write_lock(l2);
> > 
> > read_lock(l1);
> > 
> > Actually, all of the following cases may cause deadlocks:
> > 
> > LOCK_USED_IN_IRQ_* -> LOCK_ENABLED_IRQ_*
> > LOCK_USED_IN_IRQ_*_READ -> LOCK_ENABLED_IRQ_*
> > LOCK_USED_IN_IRQ_* -> LOCK_ENABLED_IRQ_*_READ
> > LOCK_USED_IN_IRQ_*_READ -> LOCK_ENABLED_IRQ_*_READ
> > 
> > To fix this, we need to 1) change the calculation of exclusive_mask() so
> > that READ bits are not dropped and 2) always call usage() in
> > mark_lock_irq() to check usage deadlocks, even when the new usage of the
> > lock is READ.
> > 
> > Besides, adjust usage_match() and usage_acculumate() to recursive read
> > lock changes.
> > 
> > Signed-off-by: Boqun Feng 
> 
> So our daily CI starts to trigger a warning (graph corruption?) below. From 
> the
> call traces, this recent patchset changed a few related things here and there.
> Does it ring any bells?
> 
> [14828.805563][T145183] lockdep bfs error:-1

-1 is BFS_EQUEUEFULL, which means we hit the size limit of the queue in
the lockdep search. That is possible since recursive read deadlock
detection needs to search all the edges (dependencies). So maybe we
should switch to DFS instead of BFS; I will look into this. In the
meanwhile, could you try the following to see if it helps with the
warnings you got?

Regards,
Boqun

--->8
index 454355c033d2..8f07bf37ab62 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1365,7 +1365,7 @@ static int add_lock_to_list(struct lock_class *this,
 /*
  * For good efficiency of modular, we use power of 2
  */
-#define MAX_CIRCULAR_QUEUE_SIZE4096UL
+#define MAX_CIRCULAR_QUEUE_SIZE8192UL
 #define CQ_MASK(MAX_CIRCULAR_QUEUE_SIZE-1)
 
 /*


> [14828.826015][T145183] WARNING: CPU: 28 PID: 145183 at 
> kernel/locking/lockdep.c:1960 print_bfs_bug+0xfc/0x180
> [14828.871595][T145183] Modules linked in: vfio_pci vfio_virqfd 
> vfio_iommu_type1 vfio loop nls_ascii nls_cp437 vfat fat kvm_intel kvm 
> irqbypass efivars ip_tables x_tables sd_mod bnx2x hpsa mdio 
> scsi_transport_sas firmware_class dm_mirror dm_region_hash dm_log dm_mod 
> efivarfs [last unloaded: dummy_del_mod]
> [14828.994188][T145183] CPU: 28 PID: 145183 Comm: trinity-c28 Tainted: G  
>  O  5.9.0-rc5-next-20200915+ #2
> [14829.041983][T145183] Hardware name: HP ProLiant BL660c Gen9, BIOS I38 
> 10/17/2018
> [14829.075779][T145183] RIP: 0010:print_bfs_bug+0xfc/0x180
> [14829.099551][T145183] Code: 04 08 00 00 01 48 c7 05 4e 02 75 07 00 00 00 00 
> c6 05 87 02 75 07 00 45 85 e4 74 10 89 ee 48 c7 c7 e0 71 45 90 e8 78 15 0a 01 
> <0f> 0b 5b 5d 41 5c c3 e8 a8 74 0d 01 85 c0 74 dd 48 c7 c7 18 9f 59
> [14829.189430][T145183] RSP: 0018:c90023d7ed90 EFLAGS: 00010082
> [14829.217056][T145183] RAX:  RBX: 888ac6238040 RCX: 
> 0027
> [14829.253274][T145183] RDX: 0027 RSI: 0004 RDI: 
> 1e29fe08
> [14829.289767][T145183] RBP:  R08: ed1103c53fc2 R09: 
> ed1103c53fc2
> [14829.328689][T145183] R10: 1e29fe0b R11: ed1103c53fc1 R12: 
> 0001
> [14829.367921][T145183] R13:  R14: 888ac6238040 R15: 
> 888ac62388e8
> [14829.404156][T145183] FS:  7f850d4a0740() GS:1e28() 
> knlGS:
> [14829.78][T145183] CS:  0010 DS:  ES:  CR0: 80050033
> [14829.474221][T145183] CR2: 7f850c3b00fc CR3: 000a3634a001 CR4: 
> 001706e0
> [14829.511287][T145183] DR0: 7f850ae99000 DR1: 7f850b399000 DR2: 
> 
> [14829.548612][T145183] DR3:  DR6: 0ff0 DR7: 
> 0600
> [14829.586621][T145183] Call Trace:
> [14829.602266][T145183]  check_irq_usage+0x6a1/0xc30
> check_irq_usage at kernel/locking/lockdep.c:2586
> [14829.624092][T145183]  ? print_usage_bug+0x1e0/0x1e0
> [14829.646334][T145183]  ? mark_lock.part.47+0x109/0x1920
> [14829.670176][T145183]  ? print_irq_inversion_bug+0x210/0x210
> [14829.695950][T145183]  ? print_usage_bug+0x1e0/0x1e0
> [14829.718164][T145183]  ? h

[PATCH v4 01/11] Drivers: hv: vmbus: Always use HV_HYP_PAGE_SIZE for gpadl

2020-09-15 Thread Boqun Feng
Since the hypervisor always uses 4K as its page size, the PFNs used for
the gpadl should be computed with HV_HYP_PAGE_SIZE rather than PAGE_SIZE,
so adjust this accordingly in preparation for supporting 16K/64K page
size guests. No functional change on x86, since PAGE_SIZE is always 4K
(equal to HV_HYP_PAGE_SIZE).
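
As a concrete illustration (the numbers below are hypothetical, not from
the patch): on a guest using 64K pages, PAGE_SHIFT is 16 while the
hypervisor still expects 4K-based frame numbers, so the two shifts give
different PFNs for the same buffer:

/* sketch only: contrasting the two shifts on a 64K-page guest */
#define HV_HYP_PAGE_SHIFT	12	/* hypervisor page size: 4K */
#define GUEST_PAGE_SHIFT	16	/* PAGE_SHIFT on a 64K-page guest */

phys_addr_t paddr = 0x40010000;		/* made-up physical address */

unsigned long guest_pfn = paddr >> GUEST_PAGE_SHIFT;	/* 0x4001  */
unsigned long hyp_pfn   = paddr >> HV_HYP_PAGE_SHIFT;	/* 0x40010 */

/*
 * A 64K gpadl buffer therefore needs
 * 64K >> HV_HYP_PAGE_SHIFT == 16 PFN entries,
 * not 64K >> PAGE_SHIFT == 1.
 */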

Signed-off-by: Boqun Feng 
Reviewed-by: Michael Kelley 
---
 drivers/hv/channel.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 3ebda7707e46..4d0f8e5a88d6 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -22,9 +22,6 @@
 
 #include "hyperv_vmbus.h"
 
-#define NUM_PAGES_SPANNED(addr, len) \
-((PAGE_ALIGN(addr + len) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT))
-
 static unsigned long virt_to_hvpfn(void *addr)
 {
phys_addr_t paddr;
@@ -35,7 +32,7 @@ static unsigned long virt_to_hvpfn(void *addr)
else
paddr = __pa(addr);
 
-   return  paddr >> PAGE_SHIFT;
+   return  paddr >> HV_HYP_PAGE_SHIFT;
 }
 
 /*
@@ -330,7 +327,7 @@ static int create_gpadl_header(void *kbuffer, u32 size,
 
int pfnsum, pfncount, pfnleft, pfncurr, pfnsize;
 
-   pagecount = size >> PAGE_SHIFT;
+   pagecount = size >> HV_HYP_PAGE_SHIFT;
 
/* do we need a gpadl body msg */
pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
@@ -360,7 +357,7 @@ static int create_gpadl_header(void *kbuffer, u32 size,
gpadl_header->range[0].byte_count = size;
for (i = 0; i < pfncount; i++)
gpadl_header->range[0].pfn_array[i] = virt_to_hvpfn(
-   kbuffer + PAGE_SIZE * i);
+   kbuffer + HV_HYP_PAGE_SIZE * i);
*msginfo = msgheader;
 
pfnsum = pfncount;
@@ -412,7 +409,7 @@ static int create_gpadl_header(void *kbuffer, u32 size,
 */
for (i = 0; i < pfncurr; i++)
gpadl_body->pfn[i] = virt_to_hvpfn(
-   kbuffer + PAGE_SIZE * (pfnsum + i));
+   kbuffer + HV_HYP_PAGE_SIZE * (pfnsum + 
i));
 
/* add to msg header */
list_add_tail(>msglistentry,
@@ -441,7 +438,7 @@ static int create_gpadl_header(void *kbuffer, u32 size,
gpadl_header->range[0].byte_count = size;
for (i = 0; i < pagecount; i++)
gpadl_header->range[0].pfn_array[i] = virt_to_hvpfn(
-   kbuffer + PAGE_SIZE * i);
+   kbuffer + HV_HYP_PAGE_SIZE * i);
 
*msginfo = msgheader;
}
-- 
2.28.0


