Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-04-05 Thread Paul E. McKenney
On Mon, Apr 05, 2021 at 09:37:40AM +0200, Juergen Gross wrote:
> On 02.04.21 18:11, Paul E. McKenney wrote:
> > On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
> > > On 30.03.21 19:33, Paul E. McKenney wrote:
> > > > On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > > > > On 02.03.21 07:28, Juergen Gross wrote:
> > > > > > [...]

Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-04-05 Thread Juergen Gross

On 02.04.21 18:11, Paul E. McKenney wrote:
> On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
> > On 30.03.21 19:33, Paul E. McKenney wrote:
> > > On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > > > On 02.03.21 07:28, Juergen Gross wrote:
> > > > > [...]

Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-04-02 Thread Paul E. McKenney
On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
> On 30.03.21 19:33, Paul E. McKenney wrote:
> > On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > > On 02.03.21 07:28, Juergen Gross wrote:
> > > > [...]

Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-04-02 Thread Juergen Gross

On 30.03.21 19:33, Paul E. McKenney wrote:

> On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > On 02.03.21 07:28, Juergen Gross wrote:
> > > [...]

Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-03-30 Thread Paul E. McKenney
On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> On 02.03.21 07:28, Juergen Gross wrote:
> > [...]

Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-03-24 Thread Jürgen Groß

On 02.03.21 07:28, Juergen Gross wrote:

> [...]

Just an update regarding the current status of debugging the
underlying issue:

On a customer's machine with a backport of this patch applied, we've
seen another case of the hang. In the logs we've found:

smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting 500046 ns for CPU#06 do_flush_tlb_all+0x0/0x30(  (null)).
smp: csd: CSD lock (#1) unresponsive.
smp: csd: cnt(000): -> queue
smp: csd: cnt(001): ->0006 idle
smp: csd: cnt(0025dba): 0012->0006 queue
smp: csd: cnt(0025dbb): 0012->0006 noipi
smp: csd: cnt(01d1333): 001a->0006 pinged
smp: csd: cnt(01d1334): ->0006 gotipi
smp: csd: cnt(01d1335): ->0006 handle
smp: csd: cnt(01d1336): ->0006 dequeue (src cpu 0 == empty)
smp: csd: cnt(01d1337): ->0006 hdlend (src cpu 0 == early)
smp: csd: cnt(01d16cb): 0012->0005 ipi
smp: csd: cnt(01d16cc): 0012->0006 queue
smp: csd: cnt(01d16cd): 0012->0006 ipi
smp: csd: cnt(01d16f3): 0012->001a ipi
smp: csd: cnt(01d16f4): 0012-> ping
smp: csd:

[PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

2021-03-02 Thread Juergen Gross
In order to help identify problems with IPI handling and remote
function execution, add some more data to the IPI debugging code.

There have been multiple reports of cpus looping for long times (many
seconds) in smp_call_function_many(), waiting for another cpu to
execute a function such as a tlb flush. Most of these reports have
been for cases where the kernel was running as a guest on top of KVM
or Xen (there are rumours of that happening under VMware, too, and
even on bare metal).

Finding the root cause hasn't been successful yet, even after more
than two years of different developers chasing this bug.

Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
diagnostics") tried to address this by adding some debug code and by
issuing another IPI when a hang was detected. This helped mitigate
the problem (the repeated IPI unlocks the hang), but the root cause
is still unknown.

Currently available data suggests that either an IPI wasn't sent when
it should have been, or that the IPI didn't result in the target cpu
executing the queued function (due to the IPI not reaching the cpu,
the IPI handler not being called, or the handler not seeing the
queued request).

Try to add more diagnostic data by introducing a global atomic counter
which is incremented when performing critical operations (before and
after queueing a new request, when sending an IPI, and when dequeueing
a request). The counter values are stored in percpu variables which
can be printed out when a hang is detected.

The data of the last event (consisting of sequence counter, source
cpu, target cpu, and event type) is stored in a global variable. When
a new event is to be traced, the data of the last event is stored in
the event-related percpu location and the global data is updated with
the new event's data. This allows tracking two events in one data
location: one by the value of the event data (the event before the
current one), and one by the location itself (the current event).
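
As a reading aid, here is a minimal, self-contained sketch of that
recording scheme. All identifiers (csd_dbg_*, CSD_DBG_*) are invented
for this illustration and are not the names used by the patch, and the
sketch deliberately glosses over making the global update race-free:

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <linux/types.h>

/* Event types roughly matching the names seen in the printouts. */
enum csd_dbg_type {
	CSD_DBG_QUEUE,
	CSD_DBG_IPI,
	CSD_DBG_NOIPI,
	CSD_DBG_PING,
	CSD_DBG_PINGED,
	CSD_DBG_GOTIPI,
	CSD_DBG_HANDLE,
	CSD_DBG_DEQUEUE,
	CSD_DBG_HDLEND,
	CSD_DBG_IDLE,
	CSD_DBG_MAX,
};

/* One event: sequence count plus source cpu, target cpu, and type. */
struct csd_dbg_event {
	u64	cnt;
	u16	src;
	u16	dst;
	u8	type;
};

static atomic64_t csd_dbg_cnt = ATOMIC64_INIT(0);	/* global sequence counter */
static struct csd_dbg_event csd_dbg_last;		/* most recent event, globally */
static DEFINE_PER_CPU(struct csd_dbg_event, csd_dbg_seen[CSD_DBG_MAX]);

/*
 * Record one event: save the previous global event in this cpu's slot
 * for the new event's type, then overwrite the global data.  Each slot
 * thereby tracks two events: its stored value (the event before the
 * current one) and the slot itself (the current event).  Callers are
 * assumed to run with preemption disabled, as the CSD paths do.
 */
static void csd_dbg_record(unsigned int src, unsigned int dst,
			   enum csd_dbg_type type)
{
	struct csd_dbg_event *slot = this_cpu_ptr(&csd_dbg_seen[type]);

	*slot = csd_dbg_last;
	csd_dbg_last.cnt = atomic64_inc_return(&csd_dbg_cnt);
	csd_dbg_last.src = src;
	csd_dbg_last.dst = dst;
	csd_dbg_last.type = type;
}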

A typical printout with a detected hang will look like this:

csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 53 ns for CPU#06 scf_handler_1+0x0/0x50(0xa2a881bb1410).
csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xa2a8813823c0) request.
csd: cnt(8cc): -> dequeue (src cpu 0 == empty)
csd: cnt(8cd): ->0006 idle
csd: cnt(0003668): 0001->0006 queue
csd: cnt(0003669): 0001->0006 ipi
csd: cnt(0003e0f): 0007->000a queue
csd: cnt(0003e10): 0001-> ping
csd: cnt(0003e71): 0003-> ping
csd: cnt(0003e72): ->0006 gotipi
csd: cnt(0003e73): ->0006 handle
csd: cnt(0003e74): ->0006 dequeue (src cpu 0 == empty)
csd: cnt(0003e7f): 0004->0006 ping
csd: cnt(0003e80): 0001-> pinged
csd: cnt(0003eb2): 0005->0001 noipi
csd: cnt(0003eb3): 0001->0006 queue
csd: cnt(0003eb4): 0001->0006 noipi
csd: cnt now: 0003f00
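
Reading these entries: the hex number in cnt() is the value of the
global sequence counter for that event, followed by the event's source
cpu, target cpu, and type. Going by the analysis below and the V2
changelog, "queue" marks a request being queued, "ipi"/"noipi" record
whether queueing required an IPI, "ping"/"pinged" appear to bracket
the actual IPI send on the sender side, and "gotipi", "handle" and
"dequeue" trace the target cpu receiving the IPI, entering the handler
and draining its queue ("hdlend" appears to be the V2 trace point for
leaving flush_smp_call_function_queue()). Annotations such as
"(src cpu 0 == empty)" carry the extra information added in V2 when an
empty call_single_queue is found.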

This example (an artificial one, produced with a previous version of
this patch that lacked the "hdlend" event) shows that cpu#6 started to
handle an IPI (cnt 3e72-3e74), but didn't start to handle another IPI
(sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
queued (3eb3), but no IPI was needed (cnt 3eb4, as the event from
cpu#4 was already in the queue).

The idea is to print only relevant entries. Those are all events
associated with the hang (so sender-side events for the source cpu of
the hanging request, and receiver-side events for the target cpu),
plus the related events just before those (adding the data needed to
identify a possible race). Printing all available data would be
possible, but it would produce large amounts of output on larger
configurations.
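
A hedged sketch of that selection rule, reusing the illustrative types
from the sketch further above (again, not the patch's actual code):

/*
 * An event is relevant if it was sent by the hanging request's source
 * cpu or targets the hanging request's destination cpu.  Because each
 * percpu slot also holds the event that globally preceded it, printing
 * a matching slot brings that predecessor along as context.
 */
static bool csd_dbg_relevant(const struct csd_dbg_event *e,
			     unsigned int hang_src, unsigned int hang_dst)
{
	return e->src == hang_src || e->dst == hang_dst;
}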

Signed-off-by: Juergen Gross 
Tested-by: Paul E. McKenney 
---
V2:
- add automatic data deciphering and sorting of entries
- add new trace point for leaving flush_smp_call_function_queue()
- add information when finding an empty call_single_queue
V3:
- move new code to generic_exec_single() (Peter Zijlstra)
---
 .../admin-guide/kernel-parameters.txt |   4 +
 kernel/smp.c  | 228 +-
 2 files changed, 228 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 31dbf7b2f0e8..80c72f8e780d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -789,6 +789,10 @@
 			printed to the console in case a hanging cpu is
 			detected and that cpu is pinged again in order to try
 			to resolve the hang situation.
+			0: disable csdlock debugging (default)
+			1: enable basic csdlock debugging (minor impact)
+			ext: enable extended csdlock debugging
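
As a usage note, and assuming the option documented in this hunk is
the csdlock_debug= boot parameter (the parameter name itself sits
above the quoted context): adding "csdlock_debug=1" to the kernel
command line would enable the basic CSD lock checks, while
"csdlock_debug=ext" would additionally record and print the event data
introduced by this patch; given that even the basic mode is labeled
"minor impact", the extended mode can be expected to cost somewhat
more.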