Re: [RFC PATCH v2 00/12] Unify SMP stop generic logic to common code
On 13/09/2019 19:27, Russell King - ARM Linux admin wrote: > On Fri, Sep 13, 2019 at 07:19:41PM +0100, Cristian Marussi wrote: >> Tested as follows: >> >> - arm: >> 1. boot > > So this basically means the code paths you're touching are untested on > ARM... given that, and the variety of systems we have out there, why > should the patches touching ARM be taken? > Yes, but sincerely it's an RFC, so I was not expecting any change to be picked up by anyone at this stage: the expectation was to have some feedback on the general approach used in the common code side of the series (patches 01-02-03-04): is it worth ? is it over-engineered ? is it badly coded ? is it complete crap ? In fact in the cover letter I stated: > A couple more of still to be done potential enhancements have been noted > on specific commits, and a lot more of testing remains surely to be done > as of now, but, in the context of this RFC, any thoughts on this approach > in general ? I didn't want to port and test a lot of architectures before having some basic feedback: in fact I did port more than one arch just to verify if they could easily all fit into the new common code logic/layout I introduced, and, also, to show that it could be generally useful to more than on arch. (as asked in V1) As you noticed, though, I did certainly test as of now a lot more on some of them: - arm64: because is where the initial bug was observed, so I had to verify if all of the above at least also fixed something at the end - x86: because the original x86 SMP stop code differs more than other archs and so it was a good challenge to see if it could fit inside the new common SMP code logic (and in fact I had to extend the common framework to fit also x86 SMP stop needs) Moreover within this series structure it is not mandatory for all archs to switch to the new common logic: if not deemed important they can simply stick to their old code, while other archs can switch to it. So testing and porting to further archs is certainly work in progress at this time, but in this RFC stage, I could be wrong, but I considered the arch-patches in this series more as an example to showcase the usefulness (or not) of the series related to the common code changes: I did not extensively tested all archs to the their full extent, so more fixes could come in V3 (if ever) together with more testing and archs. > Given that you're an ARM Ltd employee, I'm sure you can find 32-bit > systems to test - or have ARM Ltd got rid of everything that isn't > 64-bit? ;) > well...worst case there's always Amazon anyway ... :D Cheers Cristian
Re: [RFC PATCH v2 00/12] Unify SMP stop generic logic to common code
On Fri, Sep 13, 2019 at 07:19:41PM +0100, Cristian Marussi wrote: > Tested as follows: > > - arm: > 1. boot So this basically means the code paths you're touching are untested on ARM... given that, and the variety of systems we have out there, why should the patches touching ARM be taken? Given that you're an ARM Ltd employee, I'm sure you can find 32-bit systems to test - or have ARM Ltd got rid of everything that isn't 64-bit? ;) -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up
[RFC PATCH v2 00/12] Unify SMP stop generic logic to common code
Hi all, the logic underlying SMP stop and kexec crash procedures, beside containing some arch-specific bits, is mostly generic and common across all archs: despite this fact, such logic is now scattered across all architectures and on some of them is flawed, in such a way that, under some specific conditions, you can end up with a CPU left still running after a panic and possibly lost across a subsequent kexec crash reboot. [1] Beside the flaws on some archs, there is anyway lots of code duplication, so this patch series attempts to move into common code all the generic SMP stop and crash logic, fixing observed issues, and leaving only the arch specific bits inside properly provided arch-specific helpers. An architecture willing to rely on this SMP common logic has to define its own helpers and set CONFIG_ARCH_USE_COMMON_SMP_STOP=y. This series wire this up for arm64, x86, arm, sparc64. Behaviour is not changed for architectures not adopting this new common logic. In v2 the SMP common stop/crash code is generalized a bit more to address the needs of the newly ported architectures (x86, arm, sparc64). Tested as follows: - arm64: 1. boot/reboot/emergency reboot 2. panic on a starting CPU within a 2 CPUs system (freezing properly) 3. kexec reboot after a panic like 2. (not losing any CPU on reboot) 4. kexec reboot after a panic like 2. and a simultaneous reboot (instrumenting code to delay the stop messages transmission to have time to inject a reboot -f) - x86: 1. boot/reboot/emergency reboot/emergency reboot with VMX enabled 2. panic on a starting CPU within a 2 CPUs system 2. panic sysrq 4-CPus 3. kexec reboot after a panic - arm: 1. boot - sparc64: 1. build A couple more of still to be done potential enhancements have been noted on specific commits, and a lot more of testing remains surely to be done as of now, but, in the context of this RFC, any thoughts on this approach in general ? Thanks, Cristian Changes: v1-->v2: - rebased on v5.3-rc8 - moved common Kconfig bits to arch/Kconfig - extended SMP common stop/crash generic code to address new architectures needs: custom wait/timeouts, max_retries, attempt numbers. - ported x86 to SMP common stop/crash code - ported arm to SMP common stop/crash code - ported sparc64 to SMP common stop code [1] [root@arch ~]# echo 1 > /sys/devices/system/cpu/cpu1/online [root@arch ~]# [ 152.583368] [ cut here ] [ 152.583872] kernel BUG at arch/arm64/kernel/cpufeature.c:852! [ 152.584693] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 152.585228] Modules linked in: [ 152.586040] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.3.0-rc5-1-gcabd12118c4a-dirty #2 [ 152.586218] Hardware name: Foundation-v8A (DT) [ 152.586478] pstate: 01c5 (nzcv dAIF -PAN -UAO) [ 152.587260] pc : has_cpuid_feature+0x35c/0x360 [ 152.587398] lr : verify_local_elf_hwcaps+0x6c/0xf0 [ 152.587520] sp : 118bbf60 [ 152.587605] x29: 118bbf60 x28: [ 152.587784] x27: x26: [ 152.587882] x25: 1167a010 x24: 112f59f8 [ 152.587992] x23: x22: [ 152.588085] x21: 112ea018 x20: 10fe5518 [ 152.588180] x19: 10ba3f30 x18: 0036 [ 152.588285] x17: x16: [ 152.588380] x15: x14: 80087a821210 [ 152.588481] x13: x12: [ 152.588599] x11: 0080 x10: 00400032b5503510 [ 152.588709] x9 : x8 : 10b93204 [ 152.588810] x7 : 81d8 x6 : 0005 [ 152.588910] x5 : x4 : [ 152.589021] x3 : x2 : 8000 [ 152.589121] x1 : 00180480 x0 : 00180480 [ 152.589379] Call trace: [ 152.589646] has_cpuid_feature+0x35c/0x360 [ 152.589763] verify_local_elf_hwcaps+0x6c/0xf0 [ 152.589858] check_local_cpu_capabilities+0x88/0x118 [ 152.589968] secondary_start_kernel+0xc4/0x168 [ 152.590530] Code: d53801e0 1758 d5380600 1756 (d421) [ 152.592215] ---[ end trace 80ea98416149c87e ]--- [ 152.592734] Kernel panic - not syncing: Attempted to kill the idle task! [ 152.593173] Kernel Offset: disabled [ 152.593501] CPU features: 0x0004,20c02008 [ 152.593678] Memory Limit: none [ 152.594208] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- [root@arch ~]# bash: echo: write error: Input/output error [root@arch ~]# [root@arch ~]# [root@arch ~]# echo HELO HELO Cristian Marussi (12): smp: add generic SMP-stop support to common code smp: unify crash_ and smp_send_stop() logic smp: coordinate concurrent crash/smp stop calls smp: address races of starting CPUs while stopping arm64: smp: use generic SMP stop common code arm64: smp: use SMP crash-stop common code arm64: smp: add arch specific cpu parking helper x86: smp: use generic SMP