Re: arm64: unhandled level 0 translation fault

2017-12-15 Thread Dave Martin
On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
> Hi Dave,
> 
> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin  wrote:
> > On Thu, Dec 14, 2017 at 07:08:27PM +0100, Geert Uytterhoeven wrote:
> >> On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martin  wrote:

[...]

> >> > Good work on the bisect -- I'll need to have a think about this...
> >> >
> >> > That patch fixes a genuine problem so we can't just revert it.
> >> >
> >> > What if you revert _just this function_ back to what it was in v4.14?
> >>
> >> With fpsimd_update_current_state() reverted to v4.14, and
> >>
> >> -   __this_cpu_write(fpsimd_last_state, st);
> >> +   __this_cpu_write(fpsimd_last_state.st, st);
> >>
> >> to make it build, the problem seems to be fixed, too.
> 
> > Interesting. If I apply that to v4.14 and then flatten the new code for
> > CONFIG_ARM64_SVE=n, I get:
> >
> > Working:
> >
> > void fpsimd_update_current_state(struct fpsimd_state *state)
> > {
> >     local_bh_disable();
> >
> >     fpsimd_load_state(state);
> >     if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> >         struct fpsimd_state *st = &current->thread.fpsimd_state;
> >
> >         __this_cpu_write(fpsimd_last_state.st, st);
> >         st->cpu = smp_processor_id();
> >     }
> >
> >     local_bh_enable();
> > }
> >
> > Broken:
> >
> > void fpsimd_update_current_state(struct fpsimd_state *state)
> > {
> >     struct fpsimd_last_state_struct *last;
> >     struct fpsimd_state *st;
> >
> >     local_bh_disable();
> >
> >     current->thread.fpsimd_state = *state;
> >     fpsimd_load_state(&current->thread.fpsimd_state);
> >
> >     if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> >         last = this_cpu_ptr(&fpsimd_last_state);
> >         st = &current->thread.fpsimd_state;
> >
> >         last->st = st;
> >         last->sve_in_use = test_thread_flag(TIF_SVE);
> >         st->cpu = smp_processor_id();
> >     }
> >
> >     local_bh_enable();
> > }
> >
> > Can you try my flattened "broken" version by itself and see if that does
> > reproduce the bug?  If not, my flattening may be making bad assumptions...
> >
> > Assuming the "broken" version reproduces the bug, I can't yet see exactly
> > where the breakage comes from.
> 
> Correct, above "Working" is working, and "Broken" is broken.
> 
> > The two important differences here seem to be
> >
> > 1) Staging the state via current->thread.fpsimd_state instead of loading
> > directly:
> >
> > -   fpsimd_load_state(state);
> > +   current->thread.fpsimd_state = *state;
> > +   fpsimd_load_state(&current->thread.fpsimd_state);
> 
> The change above introduces the breakage.
> 
> > 2) Using this_cpu_ptr() + assignment instead of __this_cpu_write() when
> > reassociating the task's fpsimd context with the cpu:
> >
> >  {
> > +   struct fpsimd_last_state_struct *last;
> > +   struct fpsimd_state *st;
> >
> > [...]
> >
> > if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
> > -   struct fpsimd_state *st = &current->thread.fpsimd_state;
> > -
> > -   __this_cpu_write(fpsimd_last_state.st, st);
> > -   st->cpu = smp_processor_id();
> > +   last = this_cpu_ptr(&fpsimd_last_state);
> > +   st = &current->thread.fpsimd_state;
> > +
> > +   last->st = st;
> > +   last->sve_in_use = test_thread_flag(TIF_SVE);
> > +   st->cpu = smp_processor_id();
> > }
> 
> The change above is fine.

Thanks for this.

Will came up with a convincing hypothesis for how the dodgy change broke
things here -- see the diff in his separate reply.

I'll cook up a more complete fix, but the diff Will provided should at
least get things working.

Cheers
---Dave


Re: arm64: unhandled level 0 translation fault

2017-12-15 Thread Will Deacon
On Fri, Dec 15, 2017 at 04:59:28PM +0100, Geert Uytterhoeven wrote:
> On Fri, Dec 15, 2017 at 3:27 PM, Will Deacon  wrote:
> > On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
> >> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin  wrote:
> >> > The two important differences here seem to be
> >> >
> >> > 1) Staging the state via current->thread.fpsimd_state instead of loading
> >> > directly:
> >> >
> >> > -   fpsimd_load_state(state);
> >> > +   current->thread.fpsimd_state = *state;
> >> > +   fpsimd_load_state(&current->thread.fpsimd_state);
> >>
> >> The change above introduces the breakage.
> >
> > I finally managed to reproduce this, but only by using the exact same
> > compiler as Geert:
> >
> > https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz
> >
> > I then reliably see the problem if I run:
> >
> >   # /usr/bin/update-ca-certificates
> 
> /usr/sbin/... ?
> 
> > from Debian Jessie.
> 
> Funny, I've just got both
> 
> *** Error in `/bin/sh': free(): invalid pointer: 0xc17d4988 ***
> 
> and
> 
> mountall.sh[2172]: unhandled level 0 translation fault (11) at
> 0x004d, esr 0x9204, in dash[ce7e5000+1a000]
> 
> during boot up, but I can't get update-ca-certificates to fail...

Can you try the diff below, please?

Will

--->8

diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 540a1e010eb5..fae81f7964b4 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -1043,7 +1043,7 @@ void fpsimd_update_current_state(struct fpsimd_state *state)
 
    local_bh_disable();
 
-   current->thread.fpsimd_state = *state;
+   current->thread.fpsimd_state.user_fpsimd = state->user_fpsimd;
    if (system_supports_sve() && test_thread_flag(TIF_SVE))
        fpsimd_to_sve(current);


Re: arm64: unhandled level 0 translation fault

2017-12-15 Thread Geert Uytterhoeven
Hi Will,

On Fri, Dec 15, 2017 at 3:27 PM, Will Deacon  wrote:
> On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
>> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin  wrote:
>> > The two important differences here seem to be
>> >
>> > 1) Staging the state via current->thread.fpsimd_state instead of loading
>> > directly:
>> >
>> > -   fpsimd_load_state(state);
>> > +   current->thread.fpsimd_state = *state;
>> > +   fpsimd_load_state(&current->thread.fpsimd_state);
>>
>> The change above introduces the breakage.
>
> I finally managed to reproduce this, but only by using the exact same
> compiler as Geert:
>
> https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz
>
> I then reliably see the problem if I run:
>
>   # /usr/bin/update-ca-certificates

/usr/sbin/... ?

> from Debian Jessie.

Funny, I've just got both

*** Error in `/bin/sh': free(): invalid pointer: 0xc17d4988 ***

and

mountall.sh[2172]: unhandled level 0 translation fault (11) at
0x004d, esr 0x9204, in dash[ce7e5000+1a000]

during boot up, but I can't get update-ca-certificates to fail...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: arm64: unhandled level 0 translation fault

2017-12-15 Thread Geert Uytterhoeven
On Fri, Dec 15, 2017 at 3:27 PM, Will Deacon  wrote:
> On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
>> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin  wrote:
>> > The two important differences here seem to be
>> >
>> > 1) Staging the state via current->thread.fpsimd_state instead of loading
>> > directly:
>> >
>> > -   fpsimd_load_state(state);
>> > +   current->thread.fpsimd_state = *state;
>> > +   fpsimd_load_state(&current->thread.fpsimd_state);
>>
>> The change above introduces the breakage.
>
> I finally managed to reproduce this, but only by using the exact same
> compiler as Geert:
>
> https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz
>
> I then reliably see the problem if I run:
>
>   # /usr/bin/update-ca-certificates
>
> from Debian Jessie.
>
> Note that my normal toolchain (Linaro 7.1.1 build) works fine and also
> if I use the toolchain above but disable CONFIG_ARM64_CRYPTO then things
> work too.
>
> So there's some toolchain-specific interaction between this change and the
> crypto code...
>
> Will



-- 
Gr{oetje,eeting}s,

Geert



Re: arm64: unhandled level 0 translation fault

2017-12-15 Thread Will Deacon
On Fri, Dec 15, 2017 at 02:30:00PM +0100, Geert Uytterhoeven wrote:
> On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin  wrote:
> > The two important differences here seem to be
> >
> > 1) Staging the state via current->thread.fpsimd_state instead of loading
> > directly:
> >
> > -   fpsimd_load_state(state);
> > +   current->thread.fpsimd_state = *state;
> > +   fpsimd_load_state(&current->thread.fpsimd_state);
> 
> The change above introduces the breakage.

I finally managed to reproduce this, but only by using the exact same
compiler as Geert:

https://www.kernel.org/pub/tools/crosstool/files/bin/x86_64/4.9.0/x86_64-gcc-4.9.0-nolibc_aarch64-linux.tar.xz

I then reliably see the problem if I run:

  # /usr/bin/update-ca-certificates

from Debian Jessie.

Note that my normal toolchain (Linaro 7.1.1 build) works fine and also
if I use the toolchain above but disable CONFIG_ARM64_CRYPTO then things
work too.

So there's some toolchain-specific interaction between this change and the
crypto code...

Will


Re: arm64: unhandled level 0 translation fault

2017-12-15 Thread Geert Uytterhoeven
Hi Dave,

On Fri, Dec 15, 2017 at 12:23 PM, Dave Martin  wrote:
> On Thu, Dec 14, 2017 at 07:08:27PM +0100, Geert Uytterhoeven wrote:
>> On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martin  wrote:
>> > On Thu, Dec 14, 2017 at 02:34:50PM +, Geert Uytterhoeven wrote:
>> >> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven
>> >>  wrote:
>> >> > During userspace (Debian jessie NFS root) boot on arm64:
>> >> >
>> >> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>> >> > esr 0x9204, in dash[adf77000+1a000]
>> >> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>> >> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>> >> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 
>> >> > ES2.0+ (DT)
>> >>
>> >> This is a quad Cortex A57.
>> >>
>> >> > pstate: 8000 (Nzcv daif -PAN -UAO)
>> >> > pc : 0xadf8a51c
>> >> > lr : 0xadf8ac08
>> >> > sp : cffeac00
>> >> > x29: cffeac00 x28: adfa1000
>> >> > x27: cffebf7c x26: cffead20
>> >> > x25: cea1c5f0 x24: 
>> >> > x23: adfa1000 x22: adfa1000
>> >> > x21:  x20: 0008
>> >> > x19:  x18: cffeb500
>> >> > x17: a22babfc x16: adfa1ae8
>> >> > x15: a2363588 x14: 
>> >> > x13: 0020 x12: 0010
>> >> > x11: 0101010101010101 x10: adfa1000
>> >> > x9 : ff81 x8 : adfa2000
>> >> > x7 :  x6 : 
>> >> > x5 : adfa2338 x4 : adfa2000
>> >> > x3 : adfa2338 x2 : 
>> >> > x1 : adfa28b0 x0 : adfa4c30
>> >> >
>> >> > Sometimes it happens with other processes, but the main address, esr, 
>> >> > and
>> >> > pstate values are always the same.
>> >> >
>> >> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>> >> > releases, so the last time was two weeks ago), but never saw the issue
>> >> > before until today, so probably v4.15-rc1 is OK.
>> >> > Unfortunately it doesn't happen during every boot, which makes it
>> >> > cumbersome to bisect.
>> >> >
>> >> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>> >> > and even without today's arm64/for-next/core merged in, I still managed 
>> >> > to
>> >> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>> >> > v4.15-rc3.
>> >> >
>> >> > Once, when the kernel message above wasn't shown, I got an error from
>> >> > userspace, which may be related:
>> >> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***
>> >>
>> >> With more boots (10 instead of 6) to declare a kernel good, I bisected 
>> >> this
>> >> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
>> >> state after signals").
>> >>
>> >> Reverting that commit on top of v4.15-rc3 fixed the issue for me.
>> >
>> > Good work on the bisect -- I'll need to have a think about this...
>> >
>> > That patch fixes a genuine problem so we can't just revert it.
>> >
>> > What if you revert _just this function_ back to what it was in v4.14?
>>
>> With fpsimd_update_current_state() reverted to v4.14, and
>>
>> -   __this_cpu_write(fpsimd_last_state, st);
>> +   __this_cpu_write(fpsimd_last_state.st, st);
>>
>> to make it build, the problem seems to be fixed, too.

> Interesting. If I apply that to v4.14 and then flatten the new code for
> CONFIG_ARM64_SVE=n, I get:
>
> Working:
>
> void fpsimd_update_current_state(struct fpsimd_state *state)
> {
>     local_bh_disable();
>
>     fpsimd_load_state(state);
>     if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
>         struct fpsimd_state *st = &current->thread.fpsimd_state;
>
>         __this_cpu_write(fpsimd_last_state.st, st);
>         st->cpu = smp_processor_id();
>     }
>
>     local_bh_enable();
> }
>
> Broken:
>
> void fpsimd_update_current_state(struct fpsimd_state *state)
> {
>     struct fpsimd_last_state_struct *last;
>     struct fpsimd_state *st;
>
>     local_bh_disable();
>
>     current->thread.fpsimd_state = *state;
>     fpsimd_load_state(&current->thread.fpsimd_state);
>
>     if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
>         last = this_cpu_ptr(&fpsimd_last_state);
>         st = &current->thread.fpsimd_state;
>
>         last->st = st;
>         last->sve_in_use = test_thread_flag(TIF_SVE);
>         st->cpu = smp_processor_id();
>     }
>
>     local_bh_enable();
> }
>
> Can you try my flattened "broken" version by itself and see if that does
> reproduce the bug?  If not, my flattening may be making bad assumptions...
>
> Assuming the "broken" version reproduces the bug, I can't yet see exactly
> where the breakage comes from.

Correct, above "Working" is working, and "Broken" is broken.

[...]

Re: arm64: unhandled level 0 translation fault

2017-12-15 Thread Dave Martin
On Thu, Dec 14, 2017 at 07:08:27PM +0100, Geert Uytterhoeven wrote:
> Hi Dave,
> 
> On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martin  wrote:
> > On Thu, Dec 14, 2017 at 02:34:50PM +, Geert Uytterhoeven wrote:
> >> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven
> >>  wrote:
> >> > During userspace (Debian jessie NFS root) boot on arm64:
> >> >
> >> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> >> > esr 0x9204, in dash[adf77000+1a000]
> >> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> >> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> >> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 
> >> > ES2.0+ (DT)
> >>
> >> This is a quad Cortex A57.
> >>
> >> > pstate: 8000 (Nzcv daif -PAN -UAO)
> >> > pc : 0xadf8a51c
> >> > lr : 0xadf8ac08
> >> > sp : cffeac00
> >> > x29: cffeac00 x28: adfa1000
> >> > x27: cffebf7c x26: cffead20
> >> > x25: cea1c5f0 x24: 
> >> > x23: adfa1000 x22: adfa1000
> >> > x21:  x20: 0008
> >> > x19:  x18: cffeb500
> >> > x17: a22babfc x16: adfa1ae8
> >> > x15: a2363588 x14: 
> >> > x13: 0020 x12: 0010
> >> > x11: 0101010101010101 x10: adfa1000
> >> > x9 : ff81 x8 : adfa2000
> >> > x7 :  x6 : 
> >> > x5 : adfa2338 x4 : adfa2000
> >> > x3 : adfa2338 x2 : 
> >> > x1 : adfa28b0 x0 : adfa4c30
> >> >
> >> > Sometimes it happens with other processes, but the main address, esr, and
> >> > pstate values are always the same.
> >> >
> >> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> >> > releases, so the last time was two weeks ago), but never saw the issue
> >> > before until today, so probably v4.15-rc1 is OK.
> >> > Unfortunately it doesn't happen during every boot, which makes it
> >> > cumbersome to bisect.
> >> >
> >> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> >> > and even without today's arm64/for-next/core merged in, I still managed 
> >> > to
> >> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> >> > v4.15-rc3.
> >> >
> >> > Once, when the kernel message above wasn't shown, I got an error from
> >> > userspace, which may be related:
> >> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***
> >>
> >> With more boots (10 instead of 6) to declare a kernel good, I bisected this
> >> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
> >> state after signals").
> >>
> >> Reverting that commit on top of v4.15-rc3 fixed the issue for me.
> >
> > Good work on the bisect -- I'll need to have a think about this...
> >
> > That patch fixes a genuine problem so we can't just revert it.
> >
> > What if you revert _just this function_ back to what it was in v4.14?
> 
> With fpsimd_update_current_state() reverted to v4.14, and
> 
> -   __this_cpu_write(fpsimd_last_state, st);
> +   __this_cpu_write(fpsimd_last_state.st, st);
> 
> to make it build, the problem seems to be fixed, too.
> 
> Thanks!
> 
> Gr{oetje,eeting}s,
> 
> Geert

Interesting. If I apply that to v4.14 and then flatten the new code for
CONFIG_ARM64_SVE=n, I get:

Working:

void fpsimd_update_current_state(struct fpsimd_state *state)
{
    local_bh_disable();

    fpsimd_load_state(state);
    if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
        struct fpsimd_state *st = &current->thread.fpsimd_state;

        __this_cpu_write(fpsimd_last_state.st, st);
        st->cpu = smp_processor_id();
    }

    local_bh_enable();
}

Broken:

void fpsimd_update_current_state(struct fpsimd_state *state)
{
    struct fpsimd_last_state_struct *last;
    struct fpsimd_state *st;

    local_bh_disable();

    current->thread.fpsimd_state = *state;
    fpsimd_load_state(&current->thread.fpsimd_state);

    if (test_and_clear_thread_flag(TIF_FOREIGN_FPSTATE)) {
        last = this_cpu_ptr(&fpsimd_last_state);
        st = &current->thread.fpsimd_state;

        last->st = st;
        last->sve_in_use = test_thread_flag(TIF_SVE);
        st->cpu = smp_processor_id();
    }

    local_bh_enable();
}

Can you try my flattened "broken" version by itself and see if that does
reproduce the bug?  If not, my flattening may be making bad assumptions...


Assuming the "broken" version reproduces the bug, I can't yet see exactly
where the breakage comes from.

The two important differences here seem to be

1) Staging the state via current->thread.fpsimd_state instead of loading
directly:

-   fpsimd_load_state(state);
+   current->thread.fpsimd_state = *state;
+   fpsimd_load_state(&current->thread.fpsimd_state);

[...]

Re: arm64: unhandled level 0 translation fault

2017-12-14 Thread Geert Uytterhoeven
Hi Dave,

On Thu, Dec 14, 2017 at 4:24 PM, Dave P Martin  wrote:
> On Thu, Dec 14, 2017 at 02:34:50PM +, Geert Uytterhoeven wrote:
>> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven
>>  wrote:
>> > During userspace (Debian jessie NFS root) boot on arm64:
>> >
>> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>> > esr 0x9204, in dash[adf77000+1a000]
>> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 
>> > ES2.0+ (DT)
>>
>> This is a quad Cortex A57.
>>
>> > pstate: 8000 (Nzcv daif -PAN -UAO)
>> > pc : 0xadf8a51c
>> > lr : 0xadf8ac08
>> > sp : cffeac00
>> > x29: cffeac00 x28: adfa1000
>> > x27: cffebf7c x26: cffead20
>> > x25: cea1c5f0 x24: 
>> > x23: adfa1000 x22: adfa1000
>> > x21:  x20: 0008
>> > x19:  x18: cffeb500
>> > x17: a22babfc x16: adfa1ae8
>> > x15: a2363588 x14: 
>> > x13: 0020 x12: 0010
>> > x11: 0101010101010101 x10: adfa1000
>> > x9 : ff81 x8 : adfa2000
>> > x7 :  x6 : 
>> > x5 : adfa2338 x4 : adfa2000
>> > x3 : adfa2338 x2 : 
>> > x1 : adfa28b0 x0 : adfa4c30
>> >
>> > Sometimes it happens with other processes, but the main address, esr, and
>> > pstate values are always the same.
>> >
>> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>> > releases, so the last time was two weeks ago), but never saw the issue
>> > before until today, so probably v4.15-rc1 is OK.
>> > Unfortunately it doesn't happen during every boot, which makes it
>> > cumbersome to bisect.
>> >
>> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>> > and even without today's arm64/for-next/core merged in, I still managed to
>> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>> > v4.15-rc3.
>> >
>> > Once, when the kernel message above wasn't shown, I got an error from
>> > userspace, which may be related:
>> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***
>>
>> With more boots (10 instead of 6) to declare a kernel good, I bisected this
>> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
>> state after signals").
>>
>> Reverting that commit on top of v4.15-rc3 fixed the issue for me.
>
> Good work on the bisect -- I'll need to have a think about this...
>
> That patch fixes a genuine problem so we can't just revert it.
>
> What if you revert _just this function_ back to what it was in v4.14?

With fpsimd_update_current_state() reverted to v4.14, and

-   __this_cpu_write(fpsimd_last_state, st);
+   __this_cpu_write(fpsimd_last_state.st, st);

to make it build, the problem seems to be fixed, too.

Thanks!

Gr{oetje,eeting}s,

Geert



Re: arm64: unhandled level 0 translation fault

2017-12-14 Thread Dave P Martin
On Thu, Dec 14, 2017 at 02:34:50PM +, Geert Uytterhoeven wrote:
> Hi Catalin, Will, Dave,
>
> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven
>  wrote:
> > During userspace (Debian jessie NFS root) boot on arm64:
> >
> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> > esr 0x9204, in dash[adf77000+1a000]
> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ 
> > (DT)
>
> This is a quad Cortex A57.
>
> > pstate: 8000 (Nzcv daif -PAN -UAO)
> > pc : 0xadf8a51c
> > lr : 0xadf8ac08
> > sp : cffeac00
> > x29: cffeac00 x28: adfa1000
> > x27: cffebf7c x26: cffead20
> > x25: cea1c5f0 x24: 
> > x23: adfa1000 x22: adfa1000
> > x21:  x20: 0008
> > x19:  x18: cffeb500
> > x17: a22babfc x16: adfa1ae8
> > x15: a2363588 x14: 
> > x13: 0020 x12: 0010
> > x11: 0101010101010101 x10: adfa1000
> > x9 : ff81 x8 : adfa2000
> > x7 :  x6 : 
> > x5 : adfa2338 x4 : adfa2000
> > x3 : adfa2338 x2 : 
> > x1 : adfa28b0 x0 : adfa4c30
> >
> > Sometimes it happens with other processes, but the main address, esr, and
> > pstate values are always the same.
> >
> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> > releases, so the last time was two weeks ago), but never saw the issue
> > before until today, so probably v4.15-rc1 is OK.
> > Unfortunately it doesn't happen during every boot, which makes it
> > cumbersome to bisect.
> >
> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> > and even without today's arm64/for-next/core merged in, I still managed to
> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> > v4.15-rc3.
> >
> > Once, when the kernel message above wasn't shown, I got an error from
> > userspace, which may be related:
> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***
>
> With more boots (10 instead of 6) to declare a kernel good, I bisected this
> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
> state after signals").
>
> Reverting that commit on top of v4.15-rc3 fixed the issue for me.

Good work on the bisect -- I'll need to have a think about this...

That patch fixes a genuine problem so we can't just revert it.


What if you revert _just this function_ back to what it was in v4.14?

Cheers
---Dave


Re: arm64: unhandled level 0 translation fault

2017-12-14 Thread Will Deacon
Hi Geert,

On Thu, Dec 14, 2017 at 03:34:50PM +0100, Geert Uytterhoeven wrote:
> On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven
>  wrote:
> > During userspace (Debian jessie NFS root) boot on arm64:
> >
> > rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> > esr 0x9204, in dash[adf77000+1a000]
> > CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> > 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> > Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ 
> > (DT)
> 
> This is a quad Cortex A57.

It's so bizarre that nobody else is running into this!

> > pstate: 8000 (Nzcv daif -PAN -UAO)
> > pc : 0xadf8a51c
> > lr : 0xadf8ac08
> > sp : cffeac00
> > x29: cffeac00 x28: adfa1000
> > x27: cffebf7c x26: cffead20
> > x25: cea1c5f0 x24: 
> > x23: adfa1000 x22: adfa1000
> > x21:  x20: 0008
> > x19:  x18: cffeb500
> > x17: a22babfc x16: adfa1ae8
> > x15: a2363588 x14: 
> > x13: 0020 x12: 0010
> > x11: 0101010101010101 x10: adfa1000
> > x9 : ff81 x8 : adfa2000
> > x7 :  x6 : 
> > x5 : adfa2338 x4 : adfa2000
> > x3 : adfa2338 x2 : 
> > x1 : adfa28b0 x0 : adfa4c30
> >
> > Sometimes it happens with other processes, but the main address, esr, and
> > pstate values are always the same.
> >
> > I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> > releases, so the last time was two weeks ago), but never saw the issue
> > before until today, so probably v4.15-rc1 is OK.
> > Unfortunately it doesn't happen during every boot, which makes it
> > cumbersome to bisect.
> >
> > My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> > and even without today's arm64/for-next/core merged in, I still managed to
> > reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> > v4.15-rc3.
> >
> > Once, when the kernel message above wasn't shown, I got an error from
> > userspace, which may be related:
> > *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***
> 
> With more boots (10 instead of 6) to declare a kernel good, I bisected this
> to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
> state after signals").
> 
> Reverting that commit on top of v4.15-rc3 fixed the issue for me.

Thanks for persevering with the bisect. We'll get this fixed ASAP, but we'll
be relying on you to test the patch we come up with.

Cheers,

Will


Re: arm64: unhandled level 0 translation fault

2017-12-14 Thread Geert Uytterhoeven
Hi Catalin, Will, Dave,

On Tue, Dec 12, 2017 at 11:20 AM, Geert Uytterhoeven
 wrote:
> During userspace (Debian jessie NFS root) boot on arm64:
>
> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> esr 0x9204, in dash[adf77000+1a000]
> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ 
> (DT)

This is a quad Cortex A57.

> pstate: 8000 (Nzcv daif -PAN -UAO)
> pc : 0xadf8a51c
> lr : 0xadf8ac08
> sp : cffeac00
> x29: cffeac00 x28: adfa1000
> x27: cffebf7c x26: cffead20
> x25: cea1c5f0 x24: 
> x23: adfa1000 x22: adfa1000
> x21:  x20: 0008
> x19:  x18: cffeb500
> x17: a22babfc x16: adfa1ae8
> x15: a2363588 x14: 
> x13: 0020 x12: 0010
> x11: 0101010101010101 x10: adfa1000
> x9 : ff81 x8 : adfa2000
> x7 :  x6 : 
> x5 : adfa2338 x4 : adfa2000
> x3 : adfa2338 x2 : 
> x1 : adfa28b0 x0 : adfa4c30
>
> Sometimes it happens with other processes, but the main address, esr, and
> pstate values are always the same.
>
> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> releases, so the last time was two weeks ago), but never saw the issue
> before until today, so probably v4.15-rc1 is OK.
> Unfortunately it doesn't happen during every boot, which makes it
> cumbersome to bisect.
>
> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> and even without today's arm64/for-next/core merged in, I still managed to
> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> v4.15-rc3.
>
> Once, when the kernel message above wasn't shown, I got an error from
> userspace, which may be related:
> *** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***

With more boots (10 instead of 6) to declare a kernel good, I bisected this
to commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
state after signals").

Reverting that commit on top of v4.15-rc3 fixed the issue for me.

Gr{oetje,eeting}s,

Geert



Re: arm64: unhandled level 0 translation fault

2017-12-13 Thread Will Deacon
Hi Geert,

Thanks for trying to bisect this.

On Tue, Dec 12, 2017 at 09:54:05PM +0100, Geert Uytterhoeven wrote:
> On Tue, Dec 12, 2017 at 5:57 PM, Will Deacon  wrote:
> > Do you reckon you can bisect between -rc1 and -rc2? We've been unable to
> > reproduce this on any of our systems, unfortunately.
> 
> I've tried, but ended up on an unrelated XFS merge commit. Probably I
> marked a few commits good due to not seeing this heisenbug.
> 
> For reference, here's the bisect log.
> 
> Bad commits showed one or both of "unhandled level 0 translation fault" and
> "invalid pointer". Good commits didn't show any during 6 tries.
> 
> git bisect start
> # bad: [ae64f9bd1d3621b5e60d7363bc20afb46aede215] Linux 4.15-rc2
> git bisect bad ae64f9bd1d3621b5e60d7363bc20afb46aede215
> # good: [4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323] Linux 4.15-rc1
> git bisect good 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323
> # good: [9e0600f5cf6cecfcab5046d1453a9538c054d8a7] Merge tag
> 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
> git bisect good 9e0600f5cf6cecfcab5046d1453a9538c054d8a7
> # good: [503505bfea19b7d69e2572297e6defa0f9c2404e] Merge branch
> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into
> drm-fixes
> git bisect good 503505bfea19b7d69e2572297e6defa0f9c2404e
> # good: [ae753ee2771a1bacade56411bb98037b2545c929] Merge tag
> 'afs-fixes-20171201' of
> git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
> git bisect good ae753ee2771a1bacade56411bb98037b2545c929
> # good: [e1ba1c99dad92c5917b22b1047cf36e4426b124a] Merge tag
> 'riscv-for-linus-4.15-rc2_cleanups' of
> git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux
> git bisect good e1ba1c99dad92c5917b22b1047cf36e4426b124a

^^ This one is the first "good" commit containing the arm64-fixes pull.
Maybe try stressing it a bit more and see if it also fails?

That said, I'm still suspicious that nobody else is seeing this -- I also
checked the various build/boot farms and everything looks ok.

Will


Re: arm64: unhandled level 0 translation fault

2017-12-12 Thread Geert Uytterhoeven
Hi Will,

On Tue, Dec 12, 2017 at 5:57 PM, Will Deacon  wrote:
> On Tue, Dec 12, 2017 at 05:00:33PM +0100, Geert Uytterhoeven wrote:
>> On Tue, Dec 12, 2017 at 4:11 PM, Geert Uytterhoeven
>>  wrote:
>> > On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon  wrote:
>> >> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
>> >>> During userspace (Debian jessie NFS root) boot on arm64:
>> >>>
>> >>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>> >>> esr 0x9204, in dash[adf77000+1a000]
>> >>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>> >>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>> >>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 
>> >>> ES2.0+ (DT)
>> >>> pstate: 8000 (Nzcv daif -PAN -UAO)
>> >>> pc : 0xadf8a51c
>> >>> lr : 0xadf8ac08
>> >>> sp : cffeac00
>> >>> x29: cffeac00 x28: adfa1000
>> >>> x27: cffebf7c x26: cffead20
>> >>> x25: cea1c5f0 x24: 
>> >>> x23: adfa1000 x22: adfa1000
>> >>> x21:  x20: 0008
>> >>> x19:  x18: cffeb500
>> >>> x17: a22babfc x16: adfa1ae8
>> >>> x15: a2363588 x14: 
>> >>> x13: 0020 x12: 0010
>> >>> x11: 0101010101010101 x10: adfa1000
>> >>> x9 : ff81 x8 : adfa2000
>> >>> x7 :  x6 : 
>> >>> x5 : adfa2338 x4 : adfa2000
>> >>> x3 : adfa2338 x2 : 
>> >>> x1 : adfa28b0 x0 : adfa4c30
>> >>>
>> >>> Sometimes it happens with other processes, but the main address, esr, and
>> >>> pstate values are always the same.
>> >>>
>> >>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>> >>> releases, so the last time was two weeks ago), but never saw the issue
>> >>> before until today, so probably v4.15-rc1 is OK.
>> >>> Unfortunately it doesn't happen during every boot, which makes it
>> >>> cumbersome to bisect.
>> >>>
>> >>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>> >>> and even without today's arm64/for-next/core merged in, I still managed to
>> >>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>> >>> v4.15-rc3.
>> >>
>> >> Urgh, this looks nasty. Thanks for the report! A few questions:
>> >>
>> >>  - Can you share your .config somewhere please?
>> >
>> > I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig, and
>> > renesas_defconfig (from Simon's repo).
>>
>> v4.15-rc2 is affected, too.
>
> Do you reckon you can bisect between -rc1 and -rc2? We've been unable to
> reproduce this on any of our systems, unfortunately.

I've tried, but ended up on an unrelated XFS merge commit. I probably
marked a few bad commits as good because the heisenbug did not show up.

For reference, here's the bisect log.

Bad commits showed one or both of "unhandled level 0 translation fault" and
"invalid pointer". Good commits didn't show any during 6 tries.
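
A rough sketch of why six clean boots may not be enough to mark a commit
good. It assumes each boot of a bad kernel hits the fault independently
with probability p_fail; that rate was not measured in this thread, and the
function names are hypothetical. Under that assumption the chance of a
false "good" after n boots is (1 - p_fail)^n:

```python
import math


def false_good_probability(p_fail: float, tries: int) -> float:
    """Chance that a genuinely bad commit shows no failure in `tries` boots.

    Assumes every boot fails independently with probability `p_fail`.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be within [0, 1]")
    if tries < 0:
        raise ValueError("tries must be non-negative")
    return (1.0 - p_fail) ** tries


def tries_needed(p_fail: float, max_risk: float) -> int:
    """Smallest number of clean boots that keeps the false-good risk
    at or below `max_risk`, under the same independence assumption."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < max_risk < 1.0:
        raise ValueError("max_risk must be strictly between 0 and 1")
    return math.ceil(math.log(max_risk) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Illustrative rates only; the real failure rate is unknown.
    for p in (0.5, 0.25, 0.1):
        risk = false_good_probability(p, 6)
        print(f"p_fail={p}: false-good risk after 6 boots = {risk:.3f}, "
              f"boots for <1% risk = {tries_needed(p, 0.01)}")
```

With a failure rate well below one in two, several mislabelled commits over
a dozen bisect steps become likely, which would be consistent with landing
on an unrelated merge.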

git bisect start
# bad: [ae64f9bd1d3621b5e60d7363bc20afb46aede215] Linux 4.15-rc2
git bisect bad ae64f9bd1d3621b5e60d7363bc20afb46aede215
# good: [4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323] Linux 4.15-rc1
git bisect good 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323
# good: [9e0600f5cf6cecfcab5046d1453a9538c054d8a7] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect good 9e0600f5cf6cecfcab5046d1453a9538c054d8a7
# good: [503505bfea19b7d69e2572297e6defa0f9c2404e] Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
git bisect good 503505bfea19b7d69e2572297e6defa0f9c2404e
# good: [ae753ee2771a1bacade56411bb98037b2545c929] Merge tag 'afs-fixes-20171201' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
git bisect good ae753ee2771a1bacade56411bb98037b2545c929
# good: [e1ba1c99dad92c5917b22b1047cf36e4426b124a] Merge tag 'riscv-for-linus-4.15-rc2_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux
git bisect good e1ba1c99dad92c5917b22b1047cf36e4426b124a
# bad: [2db767d9889cef087149a5eaa35c1497671fa40f] Merge tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
git bisect bad 2db767d9889cef087149a5eaa35c1497671fa40f
# good: [22a6c83777ac7c17d6c63891beeeac24cf5da450] xfs: ubsan fixes
git bisect good 22a6c83777ac7c17d6c63891beeeac24cf5da450
# bad: [788c1da05b73aee68ed98f05b577c308351f5619] Merge tag 'xfs-4.15-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad 788c1da05b73aee68ed98f05b577c308351f5619
# good: [3b42d385753c22b29d259ccb9d4c3f419e583b30] xfs: scrub inode mode properly
git bisect good 3b42d385753c22b29d259ccb9d4c3f419e583b30
# good: [373b0589dc8d58bc09c9a28d03611ae4fb216057] xfs: Properly retry failed dquot items in case of error during buffer writeback
git bisect 

Re: arm64: unhandled level 0 translation fault

2017-12-12 Thread Will Deacon
On Tue, Dec 12, 2017 at 05:00:33PM +0100, Geert Uytterhoeven wrote:
> On Tue, Dec 12, 2017 at 4:11 PM, Geert Uytterhoeven wrote:
> > On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon wrote:
> >> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
> >>> During userspace (Debian jessie NFS root) boot on arm64:
> >>>
> >>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> >>> esr 0x9204, in dash[adf77000+1a000]
> >>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> >>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> >>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
> >>> pstate: 8000 (Nzcv daif -PAN -UAO)
> >>> pc : 0xadf8a51c
> >>> lr : 0xadf8ac08
> >>> sp : cffeac00
> >>> x29: cffeac00 x28: adfa1000
> >>> x27: cffebf7c x26: cffead20
> >>> x25: cea1c5f0 x24: 
> >>> x23: adfa1000 x22: adfa1000
> >>> x21:  x20: 0008
> >>> x19:  x18: cffeb500
> >>> x17: a22babfc x16: adfa1ae8
> >>> x15: a2363588 x14: 
> >>> x13: 0020 x12: 0010
> >>> x11: 0101010101010101 x10: adfa1000
> >>> x9 : ff81 x8 : adfa2000
> >>> x7 :  x6 : 
> >>> x5 : adfa2338 x4 : adfa2000
> >>> x3 : adfa2338 x2 : 
> >>> x1 : adfa28b0 x0 : adfa4c30
> >>>
> >>> Sometimes it happens with other processes, but the main address, esr, and
> >>> pstate values are always the same.
> >>>
> >>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> >>> releases, so the last time was two weeks ago), but never saw the issue
> >>> before until today, so probably v4.15-rc1 is OK.
> >>> Unfortunately it doesn't happen during every boot, which makes it
> >>> cumbersome to bisect.
> >>>
> >>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> >>> and even without today's arm64/for-next/core merged in, I still managed to
> >>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> >>> v4.15-rc3.
> >>
> >> Urgh, this looks nasty. Thanks for the report! A few questions:
> >>
> >>  - Can you share your .config somewhere please?
> >
> > I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig, and
> > renesas_defconfig (from Simon's repo).
> 
> v4.15-rc2 is affected, too.

Do you reckon you can bisect between -rc1 and -rc2? We've been unable to
reproduce this on any of our systems, unfortunately.

Will


Re: arm64: unhandled level 0 translation fault

2017-12-12 Thread Geert Uytterhoeven
Hi Will,

On Tue, Dec 12, 2017 at 4:11 PM, Geert Uytterhoeven wrote:
> On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon wrote:
>> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
>>> During userspace (Debian jessie NFS root) boot on arm64:
>>>
>>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>>> esr 0x9204, in dash[adf77000+1a000]
>>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
>>> pstate: 8000 (Nzcv daif -PAN -UAO)
>>> pc : 0xadf8a51c
>>> lr : 0xadf8ac08
>>> sp : cffeac00
>>> x29: cffeac00 x28: adfa1000
>>> x27: cffebf7c x26: cffead20
>>> x25: cea1c5f0 x24: 
>>> x23: adfa1000 x22: adfa1000
>>> x21:  x20: 0008
>>> x19:  x18: cffeb500
>>> x17: a22babfc x16: adfa1ae8
>>> x15: a2363588 x14: 
>>> x13: 0020 x12: 0010
>>> x11: 0101010101010101 x10: adfa1000
>>> x9 : ff81 x8 : adfa2000
>>> x7 :  x6 : 
>>> x5 : adfa2338 x4 : adfa2000
>>> x3 : adfa2338 x2 : 
>>> x1 : adfa28b0 x0 : adfa4c30
>>>
>>> Sometimes it happens with other processes, but the main address, esr, and
>>> pstate values are always the same.
>>>
>>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>>> releases, so the last time was two weeks ago), but never saw the issue
>>> before until today, so probably v4.15-rc1 is OK.
>>> Unfortunately it doesn't happen during every boot, which makes it
>>> cumbersome to bisect.
>>>
>>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>>> and even without today's arm64/for-next/core merged in, I still managed to
>>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>>> v4.15-rc3.
>>
>> Urgh, this looks nasty. Thanks for the report! A few questions:
>>
>>  - Can you share your .config somewhere please?
>
> I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig, and
> renesas_defconfig (from Simon's repo).

v4.15-rc2 is affected, too.

>>  - What was your last known-good kernel?
>
> v4.15-rc1.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: arm64: unhandled level 0 translation fault

2017-12-12 Thread Geert Uytterhoeven
Hi Will,

On Tue, Dec 12, 2017 at 11:36 AM, Will Deacon wrote:
> On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
>> During userspace (Debian jessie NFS root) boot on arm64:
>>
>> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
>> esr 0x9204, in dash[adf77000+1a000]
>> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
>> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
>> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
>> pstate: 8000 (Nzcv daif -PAN -UAO)
>> pc : 0xadf8a51c
>> lr : 0xadf8ac08
>> sp : cffeac00
>> x29: cffeac00 x28: adfa1000
>> x27: cffebf7c x26: cffead20
>> x25: cea1c5f0 x24: 
>> x23: adfa1000 x22: adfa1000
>> x21:  x20: 0008
>> x19:  x18: cffeb500
>> x17: a22babfc x16: adfa1ae8
>> x15: a2363588 x14: 
>> x13: 0020 x12: 0010
>> x11: 0101010101010101 x10: adfa1000
>> x9 : ff81 x8 : adfa2000
>> x7 :  x6 : 
>> x5 : adfa2338 x4 : adfa2000
>> x3 : adfa2338 x2 : 
>> x1 : adfa28b0 x0 : adfa4c30
>>
>> Sometimes it happens with other processes, but the main address, esr, and
>> pstate values are always the same.
>>
>> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
>> releases, so the last time was two weeks ago), but never saw the issue
>> before until today, so probably v4.15-rc1 is OK.
>> Unfortunately it doesn't happen during every boot, which makes it
>> cumbersome to bisect.
>>
>> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
>> and even without today's arm64/for-next/core merged in, I still managed to
>> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
>> v4.15-rc3.
>
> Urgh, this looks nasty. Thanks for the report! A few questions:
>
>  - Can you share your .config somewhere please?

I managed to reproduce it on plain v4.15-rc3 using both arm64_defconfig, and
renesas_defconfig (from Simon's repo).

>  - What was your last known-good kernel?

v4.15-rc1.

>  - Have you seen it on any other SoC?

I haven't seen it on any Renesas arm32 SoC, only on arm64.

>  - What's the CPU in your SoC?

Quad Cortex A57.

> If I can reproduce the failure here, then I should be able to debug ASAP.

Thanks!

Gr{oetje,eeting}s,

Geert



Re: arm64: unhandled level 0 translation fault

2017-12-12 Thread Will Deacon
Hi Geert,

On Tue, Dec 12, 2017 at 11:20:09AM +0100, Geert Uytterhoeven wrote:
> During userspace (Debian jessie NFS root) boot on arm64:
> 
> rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
> esr 0x9204, in dash[adf77000+1a000]
> CPU: 0 PID: 1083 Comm: rpcbind Not tainted
> 4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
> Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
> pstate: 8000 (Nzcv daif -PAN -UAO)
> pc : 0xadf8a51c
> lr : 0xadf8ac08
> sp : cffeac00
> x29: cffeac00 x28: adfa1000
> x27: cffebf7c x26: cffead20
> x25: cea1c5f0 x24: 
> x23: adfa1000 x22: adfa1000
> x21:  x20: 0008
> x19:  x18: cffeb500
> x17: a22babfc x16: adfa1ae8
> x15: a2363588 x14: 
> x13: 0020 x12: 0010
> x11: 0101010101010101 x10: adfa1000
> x9 : ff81 x8 : adfa2000
> x7 :  x6 : 
> x5 : adfa2338 x4 : adfa2000
> x3 : adfa2338 x2 : 
> x1 : adfa28b0 x0 : adfa4c30
> 
> Sometimes it happens with other processes, but the main address, esr, and
> pstate values are always the same.
> 
> I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
> releases, so the last time was two weeks ago), but never saw the issue
> before until today, so probably v4.15-rc1 is OK.
> Unfortunately it doesn't happen during every boot, which makes it
> cumbersome to bisect.
> 
> My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
> and even without today's arm64/for-next/core merged in, I still managed to
> reproduce the issue, so I believe it was introduced in v4.15-rc2 or
> v4.15-rc3.

Urgh, this looks nasty. Thanks for the report! A few questions:

 - Can you share your .config somewhere please?
 - What was your last known-good kernel?
 - Have you seen it on any other SoC?
 - What's the CPU in your SoC?

If I can reproduce the failure here, then I should be able to debug ASAP.

Cheers,

Will


arm64: unhandled level 0 translation fault

2017-12-12 Thread Geert Uytterhoeven
Hi Catalin, Will, et al,

During userspace (Debian jessie NFS root) boot on arm64:

rpcbind[1083]: unhandled level 0 translation fault (11) at 0x0008,
esr 0x9204, in dash[adf77000+1a000]
CPU: 0 PID: 1083 Comm: rpcbind Not tainted
4.15.0-rc3-arm64-renesas-02176-g14f9a1826e48e355 #51
Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
pstate: 8000 (Nzcv daif -PAN -UAO)
pc : 0xadf8a51c
lr : 0xadf8ac08
sp : cffeac00
x29: cffeac00 x28: adfa1000
x27: cffebf7c x26: cffead20
x25: cea1c5f0 x24: 
x23: adfa1000 x22: adfa1000
x21:  x20: 0008
x19:  x18: cffeb500
x17: a22babfc x16: adfa1ae8
x15: a2363588 x14: 
x13: 0020 x12: 0010
x11: 0101010101010101 x10: adfa1000
x9 : ff81 x8 : adfa2000
x7 :  x6 : 
x5 : adfa2338 x4 : adfa2000
x3 : adfa2338 x2 : 
x1 : adfa28b0 x0 : adfa4c30

Sometimes it happens with other processes, but the main address, esr, and
pstate values are always the same.
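
For reading the esr field in a report like this, the AArch64 ESR_ELx
layout puts the exception class (EC) in bits [31:26], the instruction
length bit (IL) in bit 25 and, for data aborts, the write-not-read flag
(WnR) in bit 6 and the data fault status code (DFSC) in bits [5:0]. A
minimal sketch follows; the helper name and the sample value 0x92000004
are assumptions chosen for illustration and are not taken from the dump
above, and only a small subset of EC and DFSC encodings is named:

```python
# Subset of ESR_ELx exception classes relevant to data aborts.
EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}

# Subset of DFSC encodings: translation faults at levels 0 to 3.
DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into the fields used for data aborts.

    Hypothetical helper: encodings outside the two tables above are
    reported as "not decoded here" rather than guessed.
    """
    ec = (esr >> 26) & 0x3F      # exception class, bits [31:26]
    il = (esr >> 25) & 0x1       # instruction length, bit 25
    iss = esr & 0x1FFFFFF        # syndrome, bits [24:0]
    result = {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not decoded here"),
        "il": il,
        "iss": iss,
    }
    if ec in EC_NAMES:           # ISS layout below applies to data aborts
        dfsc = iss & 0x3F
        result["dfsc"] = dfsc
        result["fault"] = DFSC_NAMES.get(dfsc, "not decoded here")
        result["write"] = bool((iss >> 6) & 0x1)
    return result


if __name__ == "__main__":
    # Assumed sample value, used only to exercise the decoder.
    for key, value in decode_esr(0x92000004).items():
        print(f"{key}: {value}")
```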

I regularly run arm64/for-next/core (through bi-weekly renesas-drivers
releases, so the last time was two weeks ago), and never saw the issue
until today, so v4.15-rc1 is probably OK.
Unfortunately it doesn't happen during every boot, which makes it
cumbersome to bisect.

My first guess was UNMAP_KERNEL_AT_EL0, but even after disabling that,
and even without today's arm64/for-next/core merged in, I still managed to
reproduce the issue, so I believe it was introduced in v4.15-rc2 or
v4.15-rc3.

Once, when the kernel message above wasn't shown, I got an error from
userspace, which may be related:
*** Error in `/bin/sh': free(): invalid pointer: 0xdd970988 ***

Do you have a clue?
Thanks!

Gr{oetje,eeting}s,

Geert
