On 24.10.23 09:43, David Woodhouse wrote:
> On Tue, 2023-10-24 at 08:53 +0200, Juergen Gross wrote:
>> I'm puzzled. This path doesn't contain any of the RCU usage I've added in
>> commit 87797fad6cce.
>>
>> Are you sure that with just reverting commit 87797fad6cce the issue doesn't
>> manifest anymore? I'd rather expect commit 721255b9826b having caused this
>> behavior, just telling from the messages above.
>
> Retesting in the cold light of day, yes. Using v6.6-rc5 which is the parent
> commit of the offending 87797fad6cce.
>
> I now see this warning at boot time again, which I believe was an aspect of
> what you were trying to fix:
>
> [ 0.059014] xen:events: Using FIFO-based ABI
> [ 0.059029] xen:events: Xen HVM callback vector for event delivery is enabled
> [ 0.059227] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [ 0.059296]
> [ 0.059297] =============================
> [ 0.059298] [ BUG: Invalid wait context ]
> [ 0.059299] 6.6.0-rc5 #1374 Not tainted
> [ 0.059300] -----------------------------
> [ 0.059301] swapper/0/0 is trying to lock:
> [ 0.059303] ffffffff8ad595f8 (evtchn_rwlock){....}-{3:3}, at: xen_evtchn_do_upcall+0x59/0xd0

Indeed.

What I still don't get is why the rcu_dereference_check() splat isn't
happening without my patch.

IMHO it should be related to the fact that cpuhp_report_idle_dead() is trying
to send an IPI via xen_send_IPI_one(), which is using notify_remote_via_irq(),
which in turn needs to call irq_get_chip_data(). This is using the maple-tree
since 721255b9826b, which is using rcu_read_lock().

I can probably change xen_send_IPI_one() to not need irq_get_chip_data().

But I'd like to understand why my patch causes the problem to surface only
now, instead of having been prominent since commit 721255b9826b.

Paul, do you have an explanation for the splat only coming out now?


Juergen
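
The call chain above, and the idea of making xen_send_IPI_one() independent of irq_get_chip_data(), can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code and not the actual fix: every name in it (rcu_watching, rcu_violations, irq_to_evtchn, ipi_to_irq, ipi_to_evtchn, bind_ipi() and the two ipi_evtchn_via_*() helpers) is hypothetical. The RCU-protected maple-tree lookup is reduced to an array access that counts how often it runs while the CPU is no longer watched by RCU.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS        4
#define NR_IPI_VECTORS 3
#define NR_IRQS        16

/* Toy stand-in for RCU state: false models a CPU in the idle-dead path. */
bool rcu_watching = true;
/* Counts lookups done while RCU is not watching (the "splat" case). */
unsigned int rcu_violations;

/* Hypothetical tables; the real kernel data structures differ. */
unsigned int irq_to_evtchn[NR_IRQS];                  /* reached via irq lookup */
int ipi_to_irq[NR_CPUS][NR_IPI_VECTORS];              /* per-CPU IPI -> irq */
unsigned int ipi_to_evtchn[NR_CPUS][NR_IPI_VECTORS];  /* per-CPU IPI -> evtchn cache */

/* Models a lookup that needs an RCU read-side critical section. */
unsigned int lookup_evtchn_by_irq(int irq)
{
    assert(irq >= 0 && irq < NR_IRQS);
    if (!rcu_watching)
        rcu_violations++;
    return irq_to_evtchn[irq];
}

/* Record the binding in both the irq-indexed table and the per-CPU cache. */
void bind_ipi(int cpu, int vector, int irq, unsigned int evtchn)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    assert(vector >= 0 && vector < NR_IPI_VECTORS);
    assert(irq >= 0 && irq < NR_IRQS);
    irq_to_evtchn[irq] = evtchn;
    ipi_to_irq[cpu][vector] = irq;
    ipi_to_evtchn[cpu][vector] = evtchn;
}

/* Send path that goes through the irq lookup (needs RCU). */
unsigned int ipi_evtchn_via_irq(int cpu, int vector)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    assert(vector >= 0 && vector < NR_IPI_VECTORS);
    return lookup_evtchn_by_irq(ipi_to_irq[cpu][vector]);
}

/* Send path that reads the cached event channel directly (no RCU needed). */
unsigned int ipi_evtchn_via_cache(int cpu, int vector)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    assert(vector >= 0 && vector < NR_IPI_VECTORS);
    return ipi_to_evtchn[cpu][vector];
}
```

In this model, calling ipi_evtchn_via_irq() with rcu_watching set to false increments rcu_violations, while ipi_evtchn_via_cache() returns the same event channel without touching the RCU-dependent lookup.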