Re: coherency issue observed after hotplug on POWER8
On 24/09/2021 19:17, Naveen N. Rao wrote:
> Hi Cascardo,
> Thanks for reporting this.
>
> Thadeu Lima de Souza Cascardo wrote:
>> Hi, there.
>>
>> We have been investigating an issue we have observed on POWER8 POWERNV
>> systems.
>> When running the kernel selftests reuseport_bpf_cpu after a CPU hotplug,
>> we see crashes, in different forms. [1]
>
> Just to re-confirm: you are only seeing this on P8 powernv, and not in a
> P8 guest/LPAR? I haven't been able to reproduce this on a firestone --
> can you share more details about your power8 machine?
>
> Also, do you only see this with ubuntu kernels, or are you also able to
> reproduce this with the upstream tree?

Let me just cover this part of your email:

Upstream trees (5.11, 5.13, 5.14) reproduce it as well. See also:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1927076/comments/28

I could not reproduce it on a Power8 LPAR, nor on a Power9 QEMU guest.

Reproduced on a few machines:
IBM, POWER8NVL, 8335-GTB
POWER8, 8001-22C and 8335-GTA

lscpu output for the last one:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1927076/comments/15

Best regards,
Krzysztof
Re: coherency issue observed after hotplug on POWER8
Hi Cascardo,
Thanks for reporting this.

Thadeu Lima de Souza Cascardo wrote:
> Hi, there.
>
> We have been investigating an issue we have observed on POWER8 POWERNV
> systems.
> When running the kernel selftests reuseport_bpf_cpu after a CPU hotplug,
> we see crashes, in different forms. [1]

Just to re-confirm: you are only seeing this on P8 powernv, and not in a
P8 guest/LPAR? I haven't been able to reproduce this on a firestone --
can you share more details about your power8 machine?

Also, do you only see this with ubuntu kernels, or are you also able to
reproduce this with the upstream tree?

> I managed to get xmon on that trap, and did some debugging. [2]
>
> I tried to dump the BPF JIT code, and it looks different when dumped from
> CPU#0 and CPU#0x9f (the one that was hotplugged, offlined, then onlined).

Next time you reproduce this, can you try dumping the SLBs for the cpus
(command 'u' in xmon)?

> Here is my partial analysis [3].
>
> Basically, the BPF JIT fills a page with invalid instructions (traps, in
> ppc64 case), and puts the BPF program in a random offset of the page. In
> the case of the hotplugged CPU, which was the one that compiled the
> program, the page had the expected contents (BPF program started at the
> offset used to run the program). On the other CPU (in many cases, CPU #0),
> the same memory address/page had different contents, with the program
> starting at a different offset.

From [3], I think fp->aux->jit_data can be NULL if there are subprogs.
But, I find it interesting that you don't always see the correct
bpf_func, as reported in comment #25. Can you also try dumping the full
bpf_prog structure (prog/fp) from xmon?

> Is this a case of a bug in the micro-architecture or the firmware when
> doing the hotplug? Can someone chime in?

It's possible that something is going wrong when offlining the cpu. Can
you try booting the kernel with 'powersave=off' and see if the problem
goes away?

> Notice that we can't reproduce the same issue on a POWER9 system.
>
> Thanks.
> Cascardo.
>
> [1] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076
> [2] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/29
> [3] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/30

- Naveen
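For anyone trying to reproduce this, the offline/online cycle mentioned in the thread could be scripted roughly as below. This is only a sketch: it assumes the standard sysfs CPU hotplug file (/sys/devices/system/cpu/cpuN/online), needs root on a real system, and the helper name cycle_cpu is hypothetical. It does not run the reuseport_bpf_cpu selftest itself.

```python
from pathlib import Path

# Standard sysfs location for per-CPU hotplug control files.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def cycle_cpu(cpu, root=SYSFS_CPU_ROOT):
    """Offline then online one CPU and return the final 'online' value.

    `cycle_cpu` is a hypothetical helper for illustration. `root` is
    overridable so the logic can be exercised against a scratch directory.
    """
    online = Path(root) / f"cpu{cpu}" / "online"
    online.write_text("0")  # take the CPU offline
    online.write_text("1")  # bring it back online
    return online.read_text().strip()
```

For example, cycle_cpu(0x9f) would cycle the CPU named in the report, after which the selftest would be run as usual.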
coherency issue observed after hotplug on POWER8
Hi, there.

We have been investigating an issue we have observed on POWER8 POWERNV
systems. When running the kernel selftests reuseport_bpf_cpu after a CPU
hotplug, we see crashes, in different forms. [1]

I managed to get xmon on that trap, and did some debugging. [2]

I tried to dump the BPF JIT code, and it looks different when dumped from
CPU#0 and CPU#0x9f (the one that was hotplugged, offlined, then onlined).

Here is my partial analysis [3].

Basically, the BPF JIT fills a page with invalid instructions (traps, in
ppc64 case), and puts the BPF program in a random offset of the page. In
the case of the hotplugged CPU, which was the one that compiled the
program, the page had the expected contents (BPF program started at the
offset used to run the program). On the other CPU (in many cases, CPU #0),
the same memory address/page had different contents, with the program
starting at a different offset.

Is this a case of a bug in the micro-architecture or the firmware when
doing the hotplug? Can someone chime in?

Notice that we can't reproduce the same issue on a POWER9 system.

Thanks.
Cascardo.

[1] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076
[2] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/29
[3] https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/comments/30
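To make the page layout described above concrete, here is a small userspace model of it. It is a sketch under stated assumptions, not the kernel's JIT code: the page size, the little-endian byte order, the use of 0x7fe00008 (the ppc64 unconditional "trap" encoding) as filler, and the names build_jit_page and find_program_start are all illustrative. The second helper shows how the program's start offset could be recovered from a raw page dump, which is the value that differed between the two CPUs.

```python
import random
import struct

PAGE_SIZE = 4096        # illustrative page size (assumption)
INSN_SIZE = 4           # ppc64 instructions are 4 bytes wide
TRAP_INSN = 0x7FE00008  # encoding of the unconditional "trap" instruction
TRAP_BYTES = struct.pack("<I", TRAP_INSN)  # little-endian is an assumption


def build_jit_page(program, rng):
    """Fill a page with trap words and copy `program` at a random offset.

    Returns (page_bytes, offset). The offset is instruction-aligned and
    chosen so the program fits entirely inside the page.
    """
    if len(program) % INSN_SIZE != 0 or len(program) > PAGE_SIZE:
        raise ValueError("program must be instruction-aligned and fit in a page")
    page = bytearray(TRAP_BYTES * (PAGE_SIZE // INSN_SIZE))
    free_slots = (PAGE_SIZE - len(program)) // INSN_SIZE
    offset = rng.randrange(free_slots + 1) * INSN_SIZE
    page[offset:offset + len(program)] = program
    return bytes(page), offset


def find_program_start(page):
    """Return the offset of the first non-trap word, or None if all traps.

    Assumes the program itself does not begin with a trap instruction.
    """
    for off in range(0, len(page), INSN_SIZE):
        if page[off:off + INSN_SIZE] != TRAP_BYTES:
            return off
    return None
```

Running find_program_start on the same page as dumped from each CPU and comparing the results against the offset the kernel uses to call the program would show the mismatch directly.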