Re: arm64: unhandled level 0 translation fault
On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
> Hi Dave,
>
> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin wrote:
> > On Thu, Dec 14, 2017 at 07:08:27PM +0100, Geert Uytterhoeven wrote:
> >> On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martin wrote:

[...]

> >> > Good work on the bisect -- I'll need to have a think about this...
> >> >
> >> > That patch fixes a genuine problem so we can't just revert it.
> >> >
> >> > What if you revert _just this function_ back to what it was in v4.14?
> >>
> >> With fpsimd_update_current_state() reverted to v4.14, and
> >>
> >> -       __this_cpu_write(fpsimd_last_state, st);
> >> +       __this_cpu_write(fpsimd_last_state.st, st);
> >>
> >> to make it build, the problem seems to be fixed, too.
> >
> > Interesting if I apply that to v4.14 and then flatten the new code for
> > CONFIG_ARM64_SVE=n, I get:
> >
> > Working:
> >
> > void fpsimd_update_current_state(struct fpsimd_state *state)
> > {
> >         local_bh_disable();
> >
> >         fpsimd_load_state(state);
> >         if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> >                 struct fpsimd_state *st = &current->thread.fpsimd_state;
> >
> >                 __this_cpu_write(fpsimd_last_state.st, st);
> >                 st->cpu = smp_processor_id();
> >         }
> >
> >         local_bh_enable();
> > }
> >
> > Broken:
> >
> > void fpsimd_update_current_state(struct fpsimd_state *state)
> > {
> >         struct fpsimd_last_state_struct *last;
> >         struct fpsimd_state *st;
> >
> >         local_bh_disable();
> >
> >         current->thread.fpsimd_state = *state;
> >         fpsimd_load_state(&current->thread.fpsimd_state);
> >
> >         if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> >                 last = this_cpu_ptr(&fpsimd_last_state);
> >                 st = &current->thread.fpsimd_state;
> >
> >                 last->st = st;
> >                 last->sve_in_use = test_thread_flag(TIF_SVE);
> >                 st->cpu = smp_processor_id();
> >         }
> >
> >         local_bh_enable();
> > }
> >
> > Can you try my flattened "broken" version by itself and see if that does
> > reproduce the bug? If not, my flattening may be making bad assumptions...
> >
> > Assuming the "broken" version reproduces the bug, I can't yet see exactly
> > where the breakage comes from.
>
> Correct, above "Working" is working, and "Broken" is broken.
>
> > The two important differences here seem to be
> >
> > 1) Staging the state via current->thread.fpsimd_state instead of loading
> > directly:
> >
> > -       fpsimd_load_state(state);
> > +       current->thread.fpsimd_state = *state;
> > +       fpsimd_load_state(&current->thread.fpsimd_state);
>
> The change above introduces the breakage.
>
> > 2) Using this_cpu_ptr() + assignment instead of __this_cpu_write() when
> > reassociating the task's fpsimd context with the cpu:
> >
> >  {
> > +       struct fpsimd_last_state_struct *last;
> > +       struct fpsimd_state *st;
> >
> > [...]
> >
> >         if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> > -               struct fpsimd_state *st = &current->thread.fpsimd_state;
> > -
> > -               __this_cpu_write(fpsimd_last_state.st, st);
> > -               st->cpu = smp_processor_id();
> > +               last = this_cpu_ptr(&fpsimd_last_state);
> > +               st = &current->thread.fpsimd_state;
> > +
> > +               last->st = st;
> > +               last->sve_in_use = test_thread_flag(TIF_SVE);
> > +               st->cpu = smp_processor_id();
> >         }
>
> The change above is fine.

Thanks for this.

Will came up with a convincing hypothesis for how the dodgy change broke
things here -- see the diff in his separate reply.

I'll cook up a more complete fix, but the diff Will provided should at
least get things working.

Cheers
---Dave
Re: arm64: unhandled level 0 translation fault
On Fri, Dec 15, 2017 at 04:59:28PM +0100, Geert Uytterhoeven wrote:
> On Fri, Dec 15, 2017 at 3:27 PM, Will Deacon wrote:
> > On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
> >> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin wrote:
> >> > The two important differences here seem to be
> >> >
> >> > 1) Staging the state via current->thread.fpsimd_state instead of loading
> >> > directly:
> >> >
> >> > -       fpsimd_load_state(state);
> >> > +       current->thread.fpsimd_state = *state;
> >> > +       fpsimd_load_state(&current->thread.fpsimd_state);
> >>
> >> The change above introduces the breakage.
> >
> > I finally managed to reproduce this, but only by using the exact same
> > compiler as Geert:
> >
> > https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz
> >
> > I then reliably see the problem if I run:
> >
> > # /usr/bin/update-ca-certificates
>
> /usr/sbin/... ?
>
> > from Debian Jessie.
>
> Funny, I've just got both
>
> *** Error in `/bin/sh': free(): invalid pointer: 0xc17d4988 ***
>
> and
>
> mountall.sh[2172]: unhandled level 0 translation fault (11) at
> 0x004d, esr 0x9204, in dash[ce7e5000+1a000]
>
> during boot up, but I can't get update-ca-certificates to fail...

Can you try the diff below, please?

Will

--->8

diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 540a1e010eb5..fae81f7964b4 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -1043,7 +1043,7 @@ void fpsimd_update_current_state(struct fpsimd_state *state)

 	local_bh_disable();

-	current->thread.fpsimd_state = *state;
+	current->thread.fpsimd_state.user_fpsimd = state->user_fpsimd;
 	if (system_supports_sve() && test_thread_flag(TIF_SVE))
 		fpsimd_to_sve(current);
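A note on why a one-line change like this can matter. The thread does not spell out the hypothesis behind the diff, so what follows is only one plausible reading, stated as an assumption. If struct fpsimd_state carries kernel bookkeeping next to the user-visible registers (the quoted code writes st->cpu, and the diff copies only the user_fpsimd member), then a whole-struct assignment also overwrites that bookkeeping with whatever the source copy happened to contain. The sketch below models this in plain user-space C. The *_model types, the helper names and the toy values are hypothetical and are not the kernel's definitions.

```c
/* Toy model only: the names ending in _model are hypothetical stand-ins,
 * not the kernel's real definitions. */
struct user_fpsimd_model {
	unsigned long long vregs[4];	/* toy register file */
	unsigned int fpsr;
	unsigned int fpcr;
};

struct fpsimd_state_model {
	struct user_fpsimd_model user_fpsimd;	/* user-visible register values */
	unsigned int cpu;	/* bookkeeping: CPU that last loaded this state */
};

/* A source copy whose cpu field was never meant to be consumed. */
static struct fpsimd_state_model make_source(unsigned int junk_cpu)
{
	struct fpsimd_state_model src = {
		.user_fpsimd = { .vregs = { 1, 2, 3, 4 }, .fpsr = 0, .fpcr = 0 },
		.cpu = junk_cpu,
	};
	return src;
}

/* Whole-struct assignment, shaped like the line the diff removes. */
unsigned int cpu_after_whole_copy(unsigned int task_cpu, unsigned int junk_cpu)
{
	struct fpsimd_state_model task = { .cpu = task_cpu };
	struct fpsimd_state_model src = make_source(junk_cpu);

	task = src;		/* copies the registers AND the cpu field */
	return task.cpu;
}

/* Member-only assignment, shaped like the line the diff adds. */
unsigned int cpu_after_member_copy(unsigned int task_cpu, unsigned int junk_cpu)
{
	struct fpsimd_state_model task = { .cpu = task_cpu };
	struct fpsimd_state_model src = make_source(junk_cpu);

	task.user_fpsimd = src.user_fpsimd;	/* copies the registers only */
	return task.cpu;
}
```

With a task last loaded on CPU 2 and a source copy whose cpu field holds 7, cpu_after_whole_copy(2, 7) returns 7, while cpu_after_member_copy(2, 7) returns 2. Whether this is the exact mechanism behind the reported faults is not confirmed by anything quoted here.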
Re: arm64: unhandled level 0 translation fault
Hi Will, On Fri, Dec 15, 2017 at 3:27 PM, Will Deaconwrote: > On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote: >> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin wrote: >> > The two important differences here seem to be >> > >> > 1) Staging the state via current->thread.fpsimd_state instead of loading >> > directly: >> > >> > - fpsimd_load_state(state); >> > + current->thread.fpsimd_state = *state; >> > + fpsimd_load_state(>thread.fpsimd_state); >> >> The change above introduces the breakage. > > I finally managed to reproduce this, but only by using the exact same > compiler as Geert: > > https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz > > I then reliably see the problem if I run: > > # /usr/bin/update-ca-certificates /usr/sbin/... ? > from Debian Jessie. Funny, I've just got both *** Error in `/bin/sh': free(): invalid pointer: 0xc17d4988 *** and mountall.sh[2172]: unhandled level 0 translation fault (11) at 0x004d, esr 0x9204, in dash[ce7e5000+1a000] during boot up, but I can't get update-ca-certificates to fail... Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: arm64: unhandled level 0 translation fault
On Fri, Dec 15, 2017 at 3:27 PM, Will Deaconwrote: > On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote: >> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin wrote: >> > The two important differences here seem to be >> > >> > 1) Staging the state via current->thread.fpsimd_state instead of loading >> > directly: >> > >> > - fpsimd_load_state(state); >> > + current->thread.fpsimd_state = *state; >> > + fpsimd_load_state(>thread.fpsimd_state); >> >> The change above introduces the breakage. > > I finally managed to reproduce this, but only by using the exact same > compiler as Geert: > > https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz > > I then reliably see the problem if I run: > > # /usr/bin/update-ca-certificates > > from Debian Jessie. > > Note that my normal toolchain (Linaro 7.1.1 build) works fine and also > if I use the toolchain above but disable CONFIG_ARM64_CRYPTO then things > work too. > > So there's some toolchain-specific interaction between this change and the > crypto code... > > Will -- Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: arm64: unhandled level 0 translation fault
On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin wrote:
> > The two important differences here seem to be
> >
> > 1) Staging the state via current->thread.fpsimd_state instead of loading
> > directly:
> >
> > -       fpsimd_load_state(state);
> > +       current->thread.fpsimd_state = *state;
> > +       fpsimd_load_state(&current->thread.fpsimd_state);
>
> The change above introduces the breakage.

I finally managed to reproduce this, but only by using the exact same
compiler as Geert:

https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz

I then reliably see the problem if I run:

# /usr/bin/update-ca-certificates

from Debian Jessie.

Note that my normal toolchain (Linaro 7.1.1 build) works fine and also
if I use the toolchain above but disable CONFIG_ARM64_CRYPTO then things
work too.

So there's some toolchain-specific interaction between this change and the
crypto code...

Will
Re: arm64: unhandled level 0 translation fault
Hi Dave, On Fri, Dec 15, 2017 at 12:23 PM, Dave Martinwrote: > On Thu, Dec 14, 2017 at 07:08:27PM +0100, Geert Uytterhoeven wrote: >> On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martin wrote: >> > On Thu, Dec 14, 2017 at 02:34:50PM +, Geert Uytterhoeven wrote: >> >> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven >> >> wrote: >> >> > During userspace (Debian jessie NFS root) boot on arm64: >> >> > >> >> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008, >> >> > esr 0x9204, in dash[adf77000+1a000] >> >> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted >> >> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51 >> >> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 >> >> > ES2.0+ (DT) >> >> >> >> This is a quad Cortex A57. >> >> >> >> > pstate: 8000 (Nzcv daif -PAN -UAO) >> >> > pc : 0xadf8a51c >> >> > lr : 0xadf8ac08 >> >> > sp : cffeac00 >> >> > x29: cffeac00 x28: adfa1000 >> >> > x27: cffebf7c x26: cffead20 >> >> > x25: cea1c5f0 x24: >> >> > x23: adfa1000 x22: adfa1000 >> >> > x21: x20: 0008 >> >> > x19: x18: cffeb500 >> >> > x17: a22babfc x16: adfa1ae8 >> >> > x15: a2363588 x14: >> >> > x13: 0020 x12: 0010 >> >> > x11: 0101010101010101 x10: adfa1000 >> >> > x9 : ff81 x8 : adfa2000 >> >> > x7 : x6 : >> >> > x5 : adfa2338 x4 : adfa2000 >> >> > x3 : adfa2338 x2 : >> >> > x1 : adfa28b0 x0 : adfa4c30 >> >> > >> >> > Sometimes it happens with other processes, but the main address, esr, >> >> > and >> >> > pstate values are always the same. >> >> > >> >> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers >> >> > releases, so the last time was two weeks ago), but never saw the issue >> >> > before until today, so probably v4.15-rc1 is OK. >> >> > Unfortunately it doesn't happen during every boot, which makes it >> >> > cumbersome to bisect. 
>> >> > >> >> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that, >> >> > and even without today's arm64/for-next/core merged in, I still managed >> >> > to >> >> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or >> >> > v4.15-rc3. >> >> > >> >> > Once, when the kernel message above wasn't shown, I got an error from >> >> > userspace, which may be related: >> >> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 *** >> >> >> >> With more boots (10 instead of 6) to declare a kernel good, I bisected >> >> this >> >> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD >> >> state after signals"). >> >> >> >> Reverting that commit on top of v4.15-rc3 fixed the issue for me. >> > >> > Good work on the bisect -- I'll need to have a think about this... >> > >> > That patch fixes a genuine problem so we can't just revert it. >> > >> > What if you revert _just this function_ back to what it was in v4.14? >> >> With fpsimd_update_current_state() reverted to v4.14, and >> >> - __this_cpu_write(fpsimd_last_state, st); >> + __this_cpu_write(fpsimd_last_state.st, st); >> >> to make it build, the problem seems to be fixed, too. 
> Interesting if I apply that to v4.14 and then flatten the new code for > CONFIG_ARM64_SVE=n, I get: > > Working: > > void fpsimd_update_current_state(struct fpsimd_state *state) > { > local_bh_disable(); > > fpsimd_load_state(state); > if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) { > struct fpsimd_state *st = >thread.fpsimd_state; > > __this_cpu_write(fpsimd_last_state.st, st); > st->cpu = smp_processor_id(); > } > > local_bh_enable(); > } > > Broken: > > void fpsimd_update_current_state(struct fpsimd_state *state) > { > struct fpsimd_last_state_struct *last; > struct fpsimd_state *st; > > local_bh_disable(); > > current->thread.fpsimd_state = *state; > fpsimd_load_state(>thread.fpsimd_state); > > if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) { > last = this_cpu_ptr(_last_state); > st = >thread.fpsimd_state; > > last->st = st; > last->sve_in_use = test_thread_flag(TIF_SVE); > st->cpu = smp_processor_id(); > } > > local_bh_enable(); > } > > Can you try my flattened "broken" version by itself and see if that does > reproduce the bug? If not, my flattening may be making bad assumptions... > > Assuming the "broken" version reproduces the bug, I can't yet see exactly > where the breakage comes from. Correct, above "Working" is working, and
Re: arm64: unhandled level 0 translation fault
On Thu, Dec 14, 2017 at 07:08:27PM +0100, Geert Uytterhoeven wrote: > Hi Dave, > > On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martinwrote: > > On Thu, Dec 14, 2017 at 02:34:50PM +, Geert Uytterhoeven wrote: > >> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven > >> wrote: > >> > During userspace (Debian jessie NFS root) boot on arm64: > >> > > >> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008, > >> > esr 0x9204, in dash[adf77000+1a000] > >> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted > >> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51 > >> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 > >> > ES2.0+ (DT) > >> > >> This is a quad Cortex A57. > >> > >> > pstate: 8000 (Nzcv daif -PAN -UAO) > >> > pc : 0xadf8a51c > >> > lr : 0xadf8ac08 > >> > sp : cffeac00 > >> > x29: cffeac00 x28: adfa1000 > >> > x27: cffebf7c x26: cffead20 > >> > x25: cea1c5f0 x24: > >> > x23: adfa1000 x22: adfa1000 > >> > x21: x20: 0008 > >> > x19: x18: cffeb500 > >> > x17: a22babfc x16: adfa1ae8 > >> > x15: a2363588 x14: > >> > x13: 0020 x12: 0010 > >> > x11: 0101010101010101 x10: adfa1000 > >> > x9 : ff81 x8 : adfa2000 > >> > x7 : x6 : > >> > x5 : adfa2338 x4 : adfa2000 > >> > x3 : adfa2338 x2 : > >> > x1 : adfa28b0 x0 : adfa4c30 > >> > > >> > Sometimes it happens with other processes, but the main address, esr, and > >> > pstate values are always the same. > >> > > >> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers > >> > releases, so the last time was two weeks ago), but never saw the issue > >> > before until today, so probably v4.15-rc1 is OK. > >> > Unfortunately it doesn't happen during every boot, which makes it > >> > cumbersome to bisect. > >> > > >> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that, > >> > and even without today's arm64/for-next/core merged in, I still managed > >> > to > >> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or > >> > v4.15-rc3. 
> >> > > >> > Once, when the kernel message above wasn't shown, I got an error from > >> > userspace, which may be related: > >> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 *** > >> > >> With more boots (10 instead of 6) to declare a kernel good, I bisected this > >> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD > >> state after signals"). > >> > >> Reverting that commit on top of v4.15-rc3 fixed the issue for me. > > > > Good work on the bisect -- I'll need to have a think about this... > > > > That patch fixes a genuine problem so we can't just revert it. > > > > What if you revert _just this function_ back to what it was in v4.14? > > With fpsimd_update_current_state() reverted to v4.14, and > > - __this_cpu_write(fpsimd_last_state, st); > + __this_cpu_write(fpsimd_last_state.st, st); > > to make it build, the problem seems to be fixed, too. > > Thanks! > > Gr{oetje,eeting}s, > > Geert Interesting if I apply that to v4.14 and then flatten the new code for CONFIG_ARM64_SVE=n, I get: Working: void fpsimd_update_current_state(struct fpsimd_state *state) { local_bh_disable(); fpsimd_load_state(state); if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) { struct fpsimd_state *st = >thread.fpsimd_state; __this_cpu_write(fpsimd_last_state.st, st); st->cpu = smp_processor_id(); } local_bh_enable(); } Broken: void fpsimd_update_current_state(struct fpsimd_state *state) { struct fpsimd_last_state_struct *last; struct fpsimd_state *st; local_bh_disable(); current->thread.fpsimd_state = *state; fpsimd_load_state(>thread.fpsimd_state); if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) { last = this_cpu_ptr(_last_state); st = >thread.fpsimd_state; last->st = st; last->sve_in_use = test_thread_flag(TIF_SVE); st->cpu = smp_processor_id(); } local_bh_enable(); } Can you try my flattened "broken" version by itself and see if that does reproduce the bug? If not, my flattening may be making bad assumptions... 
Assuming the "broken" version reproduces the bug, I can't yet see exactly where the breakage comes from. The two important differences here seem to be 1) Staging the state via current->thread.fpsimd_state instead of loading directly: - fpsimd_load_state(state); + current->thread.fpsimd_state = *state; +
Re: arm64: unhandled level 0 translation fault
Hi Dave, On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martinwrote: > On Thu, Dec 14, 2017 at 02:34:50PM +, Geert Uytterhoeven wrote: >> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven >> wrote: >> > During userspace (Debian jessie NFS root) boot on arm64: >> > >> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008, >> > esr 0x9204, in dash[adf77000+1a000] >> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted >> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51 >> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 >> > ES2.0+ (DT) >> >> This is a quad Cortex A57. >> >> > pstate: 8000 (Nzcv daif -PAN -UAO) >> > pc : 0xadf8a51c >> > lr : 0xadf8ac08 >> > sp : cffeac00 >> > x29: cffeac00 x28: adfa1000 >> > x27: cffebf7c x26: cffead20 >> > x25: cea1c5f0 x24: >> > x23: adfa1000 x22: adfa1000 >> > x21: x20: 0008 >> > x19: x18: cffeb500 >> > x17: a22babfc x16: adfa1ae8 >> > x15: a2363588 x14: >> > x13: 0020 x12: 0010 >> > x11: 0101010101010101 x10: adfa1000 >> > x9 : ff81 x8 : adfa2000 >> > x7 : x6 : >> > x5 : adfa2338 x4 : adfa2000 >> > x3 : adfa2338 x2 : >> > x1 : adfa28b0 x0 : adfa4c30 >> > >> > Sometimes it happens with other processes, but the main address, esr, and >> > pstate values are always the same. >> > >> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers >> > releases, so the last time was two weeks ago), but never saw the issue >> > before until today, so probably v4.15-rc1 is OK. >> > Unfortunately it doesn't happen during every boot, which makes it >> > cumbersome to bisect. >> > >> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that, >> > and even without today's arm64/for-next/core merged in, I still managed to >> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or >> > v4.15-rc3. 
>> > >> > Once, when the kernel message above wasn't shown, I got an error from >> > userspace, which may be related: >> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 *** >> >> With more boots (10 instead of 6) to declare a kernel good, I bisected this >> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD >> state after signals"). >> >> Reverting that commit on top of v4.15-rc3 fixed the issue for me. > > Good work on the bisect -- I'll need to have a think about this... > > That patch fixes a genuine problem so we can't just revert it. > > What if you revert _just this function_ back to what it was in v4.14? With fpsimd_update_current_state() reverted to v4.14, and - __this_cpu_write(fpsimd_last_state, st); + __this_cpu_write(fpsimd_last_state.st, st); to make it build, the problem seems to be fixed, too. Thanks! Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: arm64: unhandled level 0 translation fault
On Thu, Dec 14, 2017 at 02:34:50PM +0000, Geert Uytterhoeven wrote:
> Hi Catalin, Will, Dave,
>
> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven wrote:
> > During userspace (Debian jessie NFS root) boot on arm64:
> >
> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> > esr 0x9204, in dash[adf77000+1a000]
> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+
> > (DT)
>
> This is a quad Cortex A57.
>
> > pstate: 8000 (Nzcv daif -PAN -UAO)
> > pc : 0xadf8a51c
> > lr : 0xadf8ac08
> > sp : cffeac00
> > x29: cffeac00 x28: adfa1000
> > x27: cffebf7c x26: cffead20
> > x25: cea1c5f0 x24:
> > x23: adfa1000 x22: adfa1000
> > x21: x20: 0008
> > x19: x18: cffeb500
> > x17: a22babfc x16: adfa1ae8
> > x15: a2363588 x14:
> > x13: 0020 x12: 0010
> > x11: 0101010101010101 x10: adfa1000
> > x9 : ff81 x8 : adfa2000
> > x7 : x6 :
> > x5 : adfa2338 x4 : adfa2000
> > x3 : adfa2338 x2 :
> > x1 : adfa28b0 x0 : adfa4c30
> >
> > Sometimes it happens with other processes, but the main address, esr, and
> > pstate values are always the same.
> >
> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> > releases, so the last time was two weeks ago), but never saw the issue
> > before until today, so probably v4.15-rc1 is OK.
> > Unfortunately it doesn't happen during every boot, which makes it
> > cumbersome to bisect.
> >
> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> > and even without today's arm64/for-next/core merged in, I still managed to
> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> > v4.15-rc3.
> >
> > Once, when the kernel message above wasn't shown, I got an error from
> > userspace, which may be related:
> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***
>
> With more boots (10 instead of 6) to declare a kernel good, I bisected this
> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
> state after signals").
>
> Reverting that commit on top of v4.15-rc3 fixed the issue for me.

Good work on the bisect -- I'll need to have a think about this...

That patch fixes a genuine problem so we can't just revert it.

What if you revert _just this function_ back to what it was in v4.14?

Cheers
---Dave
Re: arm64: unhandled level 0 translation fault
Hi Geert, On Thu, Dec 14, 2017 at 03:34:50PM +0100, Geert Uytterhoeven wrote: > On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven >wrote: > > During userspace (Debian jessie NFS root) boot on arm64: > > > > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008, > > esr 0x9204, in dash[adf77000+1a000] > > CPU: 0 PID: 1083 Comm: rpcbind Not tainted > > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51 > > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ > > (DT) > > This is a quad Cortex A57. It's so bizarre that nobody else is running into this! > > pstate: 8000 (Nzcv daif -PAN -UAO) > > pc : 0xadf8a51c > > lr : 0xadf8ac08 > > sp : cffeac00 > > x29: cffeac00 x28: adfa1000 > > x27: cffebf7c x26: cffead20 > > x25: cea1c5f0 x24: > > x23: adfa1000 x22: adfa1000 > > x21: x20: 0008 > > x19: x18: cffeb500 > > x17: a22babfc x16: adfa1ae8 > > x15: a2363588 x14: > > x13: 0020 x12: 0010 > > x11: 0101010101010101 x10: adfa1000 > > x9 : ff81 x8 : adfa2000 > > x7 : x6 : > > x5 : adfa2338 x4 : adfa2000 > > x3 : adfa2338 x2 : > > x1 : adfa28b0 x0 : adfa4c30 > > > > Sometimes it happens with other processes, but the main address, esr, and > > pstate values are always the same. > > > > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers > > releases, so the last time was two weeks ago), but never saw the issue > > before until today, so probably v4.15-rc1 is OK. > > Unfortunately it doesn't happen during every boot, which makes it > > cumbersome to bisect. > > > > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that, > > and even without today's arm64/for-next/core merged in, I still managed to > > reproduce the issue, so I believe it was introduced in v4.15-rc2 or > > v4.15-rc3. 
> > > > Once, when the kernel message above wasn't shown, I got an error from > > userspace, which may be related: > > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 *** > > With more boots (10 instead of 6) to declare a kernel good, I bisected this > to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD > state after signals"). > > Reverting that commit on top of v4.15-rc3 fixed the issue for me. Thanks for persevering with the bisect. We'll get this fixed ASAP, but we'll be relying on you to test the patch we come up with. Cheers, Will
Re: arm64: unhandled level 0 translation fault
Hi Catalin, Will, Dave, On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoevenwrote: > During userspace (Debian jessie NFS root) boot on arm64: > > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008, > esr 0x9204, in dash[adf77000+1a000] > CPU: 0 PID: 1083 Comm: rpcbind Not tainted > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51 > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ > (DT) This is a quad Cortex A57. > pstate: 8000 (Nzcv daif -PAN -UAO) > pc : 0xadf8a51c > lr : 0xadf8ac08 > sp : cffeac00 > x29: cffeac00 x28: adfa1000 > x27: cffebf7c x26: cffead20 > x25: cea1c5f0 x24: > x23: adfa1000 x22: adfa1000 > x21: x20: 0008 > x19: x18: cffeb500 > x17: a22babfc x16: adfa1ae8 > x15: a2363588 x14: > x13: 0020 x12: 0010 > x11: 0101010101010101 x10: adfa1000 > x9 : ff81 x8 : adfa2000 > x7 : x6 : > x5 : adfa2338 x4 : adfa2000 > x3 : adfa2338 x2 : > x1 : adfa28b0 x0 : adfa4c30 > > Sometimes it happens with other processes, but the main address, esr, and > pstate values are always the same. > > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers > releases, so the last time was two weeks ago), but never saw the issue > before until today, so probably v4.15-rc1 is OK. > Unfortunately it doesn't happen during every boot, which makes it > cumbersome to bisect. > > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that, > and even without today's arm64/for-next/core merged in, I still managed to > reproduce the issue, so I believe it was introduced in v4.15-rc2 or > v4.15-rc3. > > Once, when the kernel message above wasn't shown, I got an error from > userspace, which may be related: > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 *** With more boots (10 instead of 6) to declare a kernel good, I bisected this to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD state after signals"). Reverting that commit on top of v4.15-rc3 fixed the issue for me. 
Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: arm64: unhandled level 0 translation fault
Hi Geert,

Thanks for trying to bisect this.

On Tue, Dec 12, 2017 at 09:54:05PM +0100, Geert Uytterhoeven wrote:
> On Tue, Dec 12, 2017 at 5:57 PM, Will Deacon wrote:
> > Do you reckon you can bisect between -rc1 and -rc2? We've been unable to
> > reproduce this on any of our systems, unfortunately.
>
> I've tried, but ended up on an unrelated XFS merge commit. Probably I
> marked a few commits good due to not seeing this heisenbug.
>
> For reference, here's the bisect log.
>
> Bad commits showed one or both of "unhandled level 0 translation fault" and
> "invalid pointer". Good commits didn't show any during 6 tries.
>
> git bisect start
> # bad: [ae64f9bd1d3621b5e60d7363bc20afb46aede215] Linux 4.15-rc2
> git bisect bad ae64f9bd1d3621b5e60d7363bc20afb46aede215
> # good: [4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323] Linux 4.15-rc1
> git bisect good 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323
> # good: [9e0600f5cf6cecfcab5046d1453a9538c054d8a7] Merge tag
> 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
> git bisect good 9e0600f5cf6cecfcab5046d1453a9538c054d8a7
> # good: [503505bfea19b7d69e2572297e6defa0f9c2404e] Merge branch
> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into
> drm-fixes
> git bisect good 503505bfea19b7d69e2572297e6defa0f9c2404e
> # good: [ae753ee2771a1bacade56411bb98037b2545c929] Merge tag
> 'afs-fixes-20171201' of
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
> git bisect good ae753ee2771a1bacade56411bb98037b2545c929
> # good: [e1ba1c99dad92c5917b22b1047cf36e4426b124a] Merge tag
> 'riscv-for-linus-4.15-rc2_cleanups' of
> git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux
> git bisect good e1ba1c99dad92c5917b22b1047cf36e4426b124a

^^ This one is the first "good" commit containing the arm64-fixes pull.
Maybe try stressing it a bit more and see if it also fails?

That said, I'm still suspicious that nobody else is seeing this -- I also
checked the various build/boot farms and everything looks ok.

Will
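The risk of marking a bad commit "good" after a fixed number of clean boots can be put into numbers. The sketch below assumes each boot of a bad kernel fails independently with the same probability, which the thread does not establish; the per-boot failure rate was never measured here, so any value passed in is hypothetical, and both function names are made up for illustration.

```c
/* Probability that a genuinely bad kernel survives `boots` boots without
 * showing the failure, assuming independent boots that each fail with
 * probability p_fail (an assumption, not a measured value). */
double false_good_probability(double p_fail, int boots)
{
	double miss = 1.0;

	for (int i = 0; i < boots; i++)
		miss *= 1.0 - p_fail;
	return miss;
}

/* Smallest number of clean boots needed so that the chance of wrongly
 * declaring a bad kernel good is at most max_false_good.
 * Returns -1 when no finite answer exists for the given inputs. */
int boots_needed(double p_fail, double max_false_good)
{
	int boots = 0;
	double miss = 1.0;

	if (p_fail <= 0.0 || max_false_good <= 0.0)
		return -1;
	while (miss > max_false_good) {
		miss *= 1.0 - p_fail;
		boots++;
	}
	return boots;
}
```

As an illustration with a made-up rate of one failing boot in four, six clean boots would still leave about an 18% chance of a wrong "good" verdict, and ten clean boots about 6%. A single wrong verdict is enough to send a bisect to an unrelated commit.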
Re: arm64: unhandled level 0 translation fault
Hi Will,

On Tue, Dec 12, 2017 at 5:57 PM, Will Deacon wrote:
> On Tue, Dec 12, 2017 at 05:00:33PM +0100, Geert Uytterhoeven wrote:
>> On Tue, Dec 12, 2017 at 4:11 PM, Geert Uytterhoeven wrote:
>> > On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon wrote:
>> >> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
>> >>> During userspace (Debian jessie NFS root) boot on arm64:
>> >>>
>> >>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>> >>> esr 0x9204, in dash[adf77000+1a000]
>> >>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>> >>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>> >>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795
>> >>> ES2.0+ (DT)
>> >>> pstate: 8000 (Nzcv daif -PAN -UAO)
>> >>> pc : 0xadf8a51c
>> >>> lr : 0xadf8ac08
>> >>> sp : cffeac00
>> >>> x29: cffeac00 x28: adfa1000
>> >>> x27: cffebf7c x26: cffead20
>> >>> x25: cea1c5f0 x24:
>> >>> x23: adfa1000 x22: adfa1000
>> >>> x21: x20: 0008
>> >>> x19: x18: cffeb500
>> >>> x17: a22babfc x16: adfa1ae8
>> >>> x15: a2363588 x14:
>> >>> x13: 0020 x12: 0010
>> >>> x11: 0101010101010101 x10: adfa1000
>> >>> x9 : ff81 x8 : adfa2000
>> >>> x7 : x6 :
>> >>> x5 : adfa2338 x4 : adfa2000
>> >>> x3 : adfa2338 x2 :
>> >>> x1 : adfa28b0 x0 : adfa4c30
>> >>>
>> >>> Sometimes it happens with other processes, but the main address, esr, and
>> >>> pstate values are always the same.
>> >>>
>> >>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>> >>> releases, so the last time was two weeks ago), but never saw the issue
>> >>> before until today, so probably v4.15-rc1 is OK.
>> >>> Unfortunately it doesn't happen during every boot, which makes it
>> >>> cumbersome to bisect.
>> >>>
>> >>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>> >>> and even without today's arm64/for-next/core merged in, I still managed
>> >>> to
>> >>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>> >>> v4.15-rc3.
>> >>
>> >> Urgh, this looks nasty. Thanks for the report! A few questions:
>> >>
>> >> - Can you share your .config somewhere please?
>> >
>> > I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig,
>> > and
>> > renesas_defconfig (from Simon's repo).
>>
>> v4.15-rc2 is affected, too.
>
> Do you reckon you can bisect between -rc1 and -rc2? We've been unable to
> reproduce this on any of our systems, unfortunately.

I've tried, but ended up on an unrelated XFS merge commit. Probably I
marked a few commits good due to not seeing this heisenbug.

For reference, here's the bisect log.

Bad commits showed one or both of "unhandled level 0 translation fault" and
"invalid pointer". Good commits didn't show any during 6 tries.

git bisect start
# bad: [ae64f9bd1d3621b5e60d7363bc20afb46aede215] Linux 4.15-rc2
git bisect bad ae64f9bd1d3621b5e60d7363bc20afb46aede215
# good: [4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323] Linux 4.15-rc1
git bisect good 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323
# good: [9e0600f5cf6cecfcab5046d1453a9538c054d8a7] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect good 9e0600f5cf6cecfcab5046d1453a9538c054d8a7
# good: [503505bfea19b7d69e2572297e6defa0f9c2404e] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
git bisect good 503505bfea19b7d69e2572297e6defa0f9c2404e
# good: [ae753ee2771a1bacade56411bb98037b2545c929] Merge tag 'afs-fixes-20171201' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect good ae753ee2771a1bacade56411bb98037b2545c929
# good: [e1ba1c99dad92c5917b22b1047cf36e4426b124a] Merge tag 'riscv-for-linus-4.15-rc2_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux
git bisect good e1ba1c99dad92c5917b22b1047cf36e4426b124a
# bad: [2db767d9889cef087149a5eaa35c1497671fa40f] Merge tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
git bisect bad 2db767d9889cef087149a5eaa35c1497671fa40f
# good: [22a6c83777ac7c17d6c63891beeeac24cf5da450] xfs: ubsan fixes
git bisect good 22a6c83777ac7c17d6c63891beeeac24cf5da450
# bad: [788c1da05b73aee68ed98f05b577c308351f5619] Merge tag 'xfs-4.15-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad 788c1da05b73aee68ed98f05b577c308351f5619
# good: [3b42d385753c22b29d259ccb9d4c3f419e583b30] xfs: scrub inode mode properly
git bisect good 3b42d385753c22b29d259ccb9d4c3f419e583b30
# good: [373b0589dc8d58bc09c9a28d03611ae4fb216057] xfs: Properly retry failed dquot items in case of error during buffer writeback
git bisect
Re: arm64: unhandled level 0 translation fault
On Tue, Dec 12, 2017 at 05:00:33PM +0100, Geert Uytterhoeven wrote:
> On Tue, Dec 12, 2017 at 4:11 PM, Geert Uytterhoeven wrote:
> > On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon wrote:
> >> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
> >>> During userspace (Debian jessie NFS root) boot on arm64:
> >>>
> >>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> >>> esr 0x9204, in dash[adf77000+1a000]
> >>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> >>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> >>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795
> >>> ES2.0+ (DT)
> >>> pstate: 8000 (Nzcv daif -PAN -UAO)
> >>> pc : 0xadf8a51c
> >>> lr : 0xadf8ac08
> >>> sp : cffeac00
> >>> x29: cffeac00 x28: adfa1000
> >>> x27: cffebf7c x26: cffead20
> >>> x25: cea1c5f0 x24:
> >>> x23: adfa1000 x22: adfa1000
> >>> x21: x20: 0008
> >>> x19: x18: cffeb500
> >>> x17: a22babfc x16: adfa1ae8
> >>> x15: a2363588 x14:
> >>> x13: 0020 x12: 0010
> >>> x11: 0101010101010101 x10: adfa1000
> >>> x9 : ff81 x8 : adfa2000
> >>> x7 : x6 :
> >>> x5 : adfa2338 x4 : adfa2000
> >>> x3 : adfa2338 x2 :
> >>> x1 : adfa28b0 x0 : adfa4c30
> >>>
> >>> Sometimes it happens with other processes, but the main address, esr, and
> >>> pstate values are always the same.
> >>>
> >>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> >>> releases, so the last time was two weeks ago), but never saw the issue
> >>> before until today, so probably v4.15-rc1 is OK.
> >>> Unfortunately it doesn't happen during every boot, which makes it
> >>> cumbersome to bisect.
> >>>
> >>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> >>> and even without today's arm64/for-next/core merged in, I still managed to
> >>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> >>> v4.15-rc3.
> >>
> >> Urgh, this looks nasty. Thanks for the report! A few questions:
> >>
> >> - Can you share your .config somewhere please?
> >
> > I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig, and
> > renesas_defconfig (from Simon's repo).
>
> v4.15-rc2 is affected, too.

Do you reckon you can bisect between -rc1 and -rc2? We've been unable to
reproduce this on any of our systems, unfortunately.

Will
Re: arm64: unhandled level 0 translation fault
Hi Will,

On Tue, Dec 12, 2017 at 4:11 PM, Geert Uytterhoeven wrote:
> On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon wrote:
>> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
>>> During userspace (Debian jessie NFS root) boot on arm64:
>>>
>>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>>> esr 0x9204, in dash[adf77000+1a000]
>>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+
>>> (DT)
>>> pstate: 8000 (Nzcv daif -PAN -UAO)
>>> pc : 0xadf8a51c
>>> lr : 0xadf8ac08
>>> sp : cffeac00
>>> x29: cffeac00 x28: adfa1000
>>> x27: cffebf7c x26: cffead20
>>> x25: cea1c5f0 x24:
>>> x23: adfa1000 x22: adfa1000
>>> x21: x20: 0008
>>> x19: x18: cffeb500
>>> x17: a22babfc x16: adfa1ae8
>>> x15: a2363588 x14:
>>> x13: 0020 x12: 0010
>>> x11: 0101010101010101 x10: adfa1000
>>> x9 : ff81 x8 : adfa2000
>>> x7 : x6 :
>>> x5 : adfa2338 x4 : adfa2000
>>> x3 : adfa2338 x2 :
>>> x1 : adfa28b0 x0 : adfa4c30
>>>
>>> Sometimes it happens with other processes, but the main address, esr, and
>>> pstate values are always the same.
>>>
>>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>>> releases, so the last time was two weeks ago), but never saw the issue
>>> before until today, so probably v4.15-rc1 is OK.
>>> Unfortunately it doesn't happen during every boot, which makes it
>>> cumbersome to bisect.
>>>
>>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>>> and even without today's arm64/for-next/core merged in, I still managed to
>>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>>> v4.15-rc3.
>>
>> Urgh, this looks nasty. Thanks for the report! A few questions:
>>
>> - Can you share your .config somewhere please?
>
> I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig, and
> renesas_defconfig (from Simon's repo).

v4.15-rc2 is affected, too.

>> - What was your last known-good kernel?
>
> v4.15-rc1.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Re: arm64: unhandled level 0 translation fault
Hi Will,

On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon wrote:
> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
>> During userspace (Debian jessie NFS root) boot on arm64:
>>
>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>> esr 0x9204, in dash[adf77000+1a000]
>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+
>> (DT)
>> pstate: 8000 (Nzcv daif -PAN -UAO)
>> pc : 0xadf8a51c
>> lr : 0xadf8ac08
>> sp : cffeac00
>> x29: cffeac00 x28: adfa1000
>> x27: cffebf7c x26: cffead20
>> x25: cea1c5f0 x24:
>> x23: adfa1000 x22: adfa1000
>> x21: x20: 0008
>> x19: x18: cffeb500
>> x17: a22babfc x16: adfa1ae8
>> x15: a2363588 x14:
>> x13: 0020 x12: 0010
>> x11: 0101010101010101 x10: adfa1000
>> x9 : ff81 x8 : adfa2000
>> x7 : x6 :
>> x5 : adfa2338 x4 : adfa2000
>> x3 : adfa2338 x2 :
>> x1 : adfa28b0 x0 : adfa4c30
>>
>> Sometimes it happens with other processes, but the main address, esr, and
>> pstate values are always the same.
>>
>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>> releases, so the last time was two weeks ago), but never saw the issue
>> before until today, so probably v4.15-rc1 is OK.
>> Unfortunately it doesn't happen during every boot, which makes it
>> cumbersome to bisect.
>>
>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>> and even without today's arm64/for-next/core merged in, I still managed to
>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>> v4.15-rc3.
>
> Urgh, this looks nasty. Thanks for the report! A few questions:
>
> - Can you share your .config somewhere please?

I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig, and
renesas_defconfig (from Simon's repo).

> - What was your last known-good kernel?

v4.15-rc1.

> - Have you seen it on any other Soc?

I haven't seen it on any Renesas arm32 SoC, only on arm64.

> - What's the CPU in your SoC?

Quad Cortex A57.

> If I can reproduce the failure here, then I should be able to debug ASAP.

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
Re: arm64: unhandled level 0 translation fault
Hi Geert,

On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
> During userspace (Debian jessie NFS root) boot on arm64:
>
> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> esr 0x9204, in dash[adf77000+1a000]
> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+
> (DT)
> pstate: 8000 (Nzcv daif -PAN -UAO)
> pc : 0xadf8a51c
> lr : 0xadf8ac08
> sp : cffeac00
> x29: cffeac00 x28: adfa1000
> x27: cffebf7c x26: cffead20
> x25: cea1c5f0 x24:
> x23: adfa1000 x22: adfa1000
> x21: x20: 0008
> x19: x18: cffeb500
> x17: a22babfc x16: adfa1ae8
> x15: a2363588 x14:
> x13: 0020 x12: 0010
> x11: 0101010101010101 x10: adfa1000
> x9 : ff81 x8 : adfa2000
> x7 : x6 :
> x5 : adfa2338 x4 : adfa2000
> x3 : adfa2338 x2 :
> x1 : adfa28b0 x0 : adfa4c30
>
> Sometimes it happens with other processes, but the main address, esr, and
> pstate values are always the same.
>
> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> releases, so the last time was two weeks ago), but never saw the issue
> before until today, so probably v4.15-rc1 is OK.
> Unfortunately it doesn't happen during every boot, which makes it
> cumbersome to bisect.
>
> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> and even without today's arm64/for-next/core merged in, I still managed to
> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> v4.15-rc3.

Urgh, this looks nasty. Thanks for the report! A few questions:

- Can you share your .config somewhere please?
- What was your last known-good kernel?
- Have you seen it on any other Soc?
- What's the CPU in your SoC?

If I can reproduce the failure here, then I should be able to debug ASAP.

Cheers,

Will
arm64: unhandled level 0 translation fault
Hi Catalin, Will, et al,

During userspace (Debian jessie NFS root) boot on arm64:

rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008, esr 0x9204, in dash[adf77000+1a000]
CPU: 0 PID: 1083 Comm: rpcbind Not tainted 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
pstate: 8000 (Nzcv daif -PAN -UAO)
pc : 0xadf8a51c
lr : 0xadf8ac08
sp : cffeac00
x29: cffeac00 x28: adfa1000
x27: cffebf7c x26: cffead20
x25: cea1c5f0 x24:
x23: adfa1000 x22: adfa1000
x21: x20: 0008
x19: x18: cffeb500
x17: a22babfc x16: adfa1ae8
x15: a2363588 x14:
x13: 0020 x12: 0010
x11: 0101010101010101 x10: adfa1000
x9 : ff81 x8 : adfa2000
x7 : x6 :
x5 : adfa2338 x4 : adfa2000
x3 : adfa2338 x2 :
x1 : adfa28b0 x0 : adfa4c30

Sometimes it happens with other processes, but the main address, esr, and
pstate values are always the same.

I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
releases, so the last time was two weeks ago), but never saw the issue
before until today, so probably v4.15-rc1 is OK.
Unfortunately it doesn't happen during every boot, which makes it
cumbersome to bisect.

My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
and even without today's arm64/for-next/core merged in, I still managed to
reproduce the issue, so I believe it was introduced in v4.15-rc2 or
v4.15-rc3.

Once, when the kernel message above wasn't shown, I got an error from
userspace, which may be related:

*** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***

Do you have a clue?

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
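For readers unfamiliar with where the "level 0 translation fault" wording comes from: it reflects the fault status code in the arm64 exception syndrome register. Below is a minimal sketch of that decoding, based on the architectural ESR_ELx layout (EC in bits [31:26]; for data aborts, WnR in bit 6 and DFSC in bits [5:0]). The esr value as printed in this archive looks truncated, so the sample value 0x92000004 is an assumption chosen to be consistent with the message text, not a value taken from the report. The function and table names are hypothetical.

```python
# Exception classes handled by this sketch (ESR_ELx.EC, bits [31:26]).
EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort from the same exception level",
}

# DFSC values 0x00..0x0f encode (fault type << 2) | translation level.
FAULT_TYPES = (
    "address size fault",
    "translation fault",
    "access flag fault",
    "permission fault",
)


def decode_esr(esr: int) -> dict:
    """Decode the EC field and, for data aborts only, WnR and DFSC."""
    ec = (esr >> 26) & 0x3F
    info = {"ec": ec, "ec_name": EC_NAMES.get(ec, "other (not decoded here)")}
    if ec in EC_NAMES:
        dfsc = esr & 0x3F
        info["write"] = bool(esr & (1 << 6))  # WnR: set for a write access
        if dfsc < 0x10:
            info["fault"] = f"level {dfsc & 0x3} {FAULT_TYPES[dfsc >> 2]}"
        else:
            info["fault"] = "other (not decoded here)"
    return info


if __name__ == "__main__":
    # Assumed sample value; see the note above about the truncated log.
    print(decode_esr(0x92000004))
```

With that assumed value, the sketch reports a read access from userspace that faulted at the top level of the page-table walk, which is what one would expect for a load through a near-NULL pointer such as the 0x...0008 address in the log.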