Re: mysterious crashes on OMAP5 uevm

2015-09-18 Thread Tony Lindgren
Hi Grazvydas,

* Tony Lindgren  [150908 14:11]:
> * Grazvydas Ignotas  [150908 13:44]:
> > On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren  wrote:
> OK nice to hear you found it. Yeah looks like some runtime
> capability check is needed.
>  
> > > Do you have some easy way to reproduce this issue?
> > 
> > Just moving a browser window around with mouse usually triggers it
> > within a minute.
> 
> OK good to know.

Just FYI, I was now able to reproduce it here too by moving
iceweasel around for about a minute. And I can confirm Russell's patch
fixes the problem.

I'm using the i3 tiling window manager here and hardly ever have
any floating windows, which probably explains why I did not run
into this issue earlier with my lapdock experiments :)

Regards,

Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mysterious crashes on OMAP5 uevm

2015-09-16 Thread Russell King - ARM Linux
On Tue, Sep 15, 2015 at 08:31:44PM +0300, Grazvydas Ignotas wrote:
> On Mon, Sep 14, 2015 at 10:35 PM, Dr. H. Nikolaus Schaller
>  wrote:
> >
> > Am 14.09.2015 um 21:02 schrieb Tony Lindgren :
> >
> >> * Russell King - ARM Linux  [150914 05:16]:
> >>> On Fri, Sep 11, 2015 at 03:03:07PM +0100, Russell King - ARM Linux wrote:
> 
> >>>> Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the problem,
> >>>> and I doubt there's any ARMv6 non-T2 systems out there that would be
> >>>> affected by clearing the IT state bits.
> >>>
> >>> Please test the following patch:
> >>
> >> While we're waiting for Grazvydas to test.. Looks good to me:
> >>
> >> Acked-by: Tony Lindgren 
> >
> > I have tested on:
> > * GTA04 with DM3730 (OMAP3)
> > * Pyra prototype with OMAP5432
> > No X server crashes seen any more.
> >
> > Tested-by: H. Nikolaus Schaller 
> 
> Tested-by: Grazvydas Ignotas 
> on OMAP5 uevm running v4.2 built with omap2plus_defconfig.
> On v4.3-rc1 hsmmc controller probe is deferred for whatever reason and
> never reprobes, so my rootfs is never mounted and I could not test,
> but that looks unrelated.

Thanks.

> I guess it's worth marking this one for stable.

Indeed.

Having looked closer at the ARM ARM, these bits on older CPUs are marked
as UNK/SBZP (unknown, should be zero or preserved).  So it's safe to get
rid of that #if entirely.  Removing that #if won't affect the validity
of your testing as you've only tested on ARMv7 platforms with ARMv6
included in the kernel.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: mysterious crashes on OMAP5 uevm

2015-09-15 Thread Grazvydas Ignotas
On Mon, Sep 14, 2015 at 10:35 PM, Dr. H. Nikolaus Schaller
 wrote:
>
> Am 14.09.2015 um 21:02 schrieb Tony Lindgren :
>
>> * Russell King - ARM Linux  [150914 05:16]:
>>> On Fri, Sep 11, 2015 at 03:03:07PM +0100, Russell King - ARM Linux wrote:

>>>> Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the problem,
>>>> and I doubt there's any ARMv6 non-T2 systems out there that would be
>>>> affected by clearing the IT state bits.
>>>
>>> Please test the following patch:
>>
>> While we're waiting for Grazvydas to test.. Looks good to me:
>>
>> Acked-by: Tony Lindgren 
>
> I have tested on:
> * GTA04 with DM3730 (OMAP3)
> * Pyra prototype with OMAP5432
> No X server crashes seen any more.
>
> Tested-by: H. Nikolaus Schaller 

Tested-by: Grazvydas Ignotas 
on OMAP5 uevm running v4.2 built with omap2plus_defconfig.
On v4.3-rc1 hsmmc controller probe is deferred for whatever reason and
never reprobes, so my rootfs is never mounted and I could not test,
but that looks unrelated.

I guess it's worth marking this one for stable.

Gražvydas


Re: mysterious crashes on OMAP5 uevm

2015-09-14 Thread Dr. H. Nikolaus Schaller

Am 14.09.2015 um 21:02 schrieb Tony Lindgren :

> * Russell King - ARM Linux  [150914 05:16]:
>> On Fri, Sep 11, 2015 at 03:03:07PM +0100, Russell King - ARM Linux wrote:
>>> 
>>> Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the problem,
>>> and I doubt there's any ARMv6 non-T2 systems out there that would be
>>> affected by clearing the IT state bits.
>> 
>> Please test the following patch:
> 
> While we're waiting for Grazvydas to test.. Looks good to me:
> 
> Acked-by: Tony Lindgren 

I have tested on:
* GTA04 with DM3730 (OMAP3)
* Pyra prototype with OMAP5432
No X server crashes seen any more.

Tested-by: H. Nikolaus Schaller 



Re: mysterious crashes on OMAP5 uevm

2015-09-14 Thread Tony Lindgren
* Russell King - ARM Linux  [150914 05:16]:
> On Fri, Sep 11, 2015 at 03:03:07PM +0100, Russell King - ARM Linux wrote:
> > 
> > Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the problem,
> > and I doubt there's any ARMv6 non-T2 systems out there that would be
> > affected by clearing the IT state bits.
> 
> Please test the following patch:

While we're waiting for Grazvydas to test.. Looks good to me:

Acked-by: Tony Lindgren 


Re: mysterious crashes on OMAP5 uevm

2015-09-14 Thread Russell King - ARM Linux
On Fri, Sep 11, 2015 at 03:03:07PM +0100, Russell King - ARM Linux wrote:
> On Fri, Sep 11, 2015 at 03:27:13PM +0200, Grazvydas Ignotas wrote:
> > On Thu, Sep 10, 2015 at 10:30 AM, Russell King - ARM Linux
> >  wrote:
> > > On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
> > >> ...
> > >>
> > >> Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away and 
> > >> adding the
> > >> >> #if 0 //__LINUX_ARM_ARCH__ >= 7
> > >> makes it re-appear.
> > >>
> > >> A while ago I tried to debug running the x-server under strace and could 
> > >> find that it also has
> > >> something to do with SIGALRM.
> > >>
> > >> And that is very consistent with “enable/disable” by modifying 
> > >> arch/arm/kernel/signal.c
> > >
> > > It would be really nice if someone could diagnose what's going on here.
> > > What exception is causing the X server to be killed (someone said a
> > > segfault)?  What is the register state at the point that happens?  What
> > > > does the code look like?  Is it happening inside the SIGALRM handler, or
> > > when the SIGALRM handler has returned?
> > >
> > > I'd suggest attaching gdb to the X server, but remember to set gdb to
> > > ignore SIGPIPEs.
> > 
> > It's actually pretty random, see some debug sessions in [1].
> > The first one is the most useful one, but I haven't thought of checking
> > what pixman_rasterize_edges() was doing when the signal arrived, and
> > most often the "less useful" segfaults occur. However from the
> > disassembly (see debug1_libpixman.gz) it can be seen that the signal
> > arrived right after IT.
> > 
> > [1] http://notaz.gp2x.de/tmp/thumb_segfault/
> 
> We're not going from ARM -> Thumb or Thumb -> ARM here, but Thumb code
> in libpixman is being interrupted calling a Thumb signal handler.
> 
> Working through the code:
> 
>0x7f717ec8 : ldr r2, [pc, #20]   ; = 0x0004112e
>0x7f717eca :   ldr r1, [pc, #24]   ; = 0x0c48
>0x7f717ecc :   ldr r3, [pc, #24]   ; = 0x0e6c
>0x7f717ece :   add r2, pc
>0x7f717ed0 :   ldr r1, [r2, r1]
>0x7f717ed2 :  ldr r3, [r2, r3]
> => 0x7f717ed4 :  ldr r2, [r1, #0]
> 
> The instruction at 0x7f717ed4 was trying to access 0xd1242963 which
> is in kernel space, and this is the faulting instruction.
> 
> At this point, r2 should contain 0x0004112e plus the PC value.  r2 in
> the register dump was 0x7f717fa0.  Let's calculate the value that PC
> should be here.  0x7f717fa0 - 0x0004112e = 0x7f6d6e72, which is
> clearly wrong.
> 
> So, I don't think the first instruction here was executed by the CPU.
> 
> gdb indicates that the parent context to the signal frame, pc was at
> 0xb6dd87f8, which works out at 0x297f8 into the libpixman-1 library:
> 
>    297f0:   449c        add     ip, r3
>    297f2:   f1bc 0fff   cmp.w   ip, #255        ; 0xff
>    297f6:   bfd4        ite     le
>    297f8:   fa5f fc8c   uxtble.w        ip, ip
>    297fc:   f04f 0cff   movgt.w ip, #255        ; 0xff
>    29800:   f88a c000   strb.w  ip, [sl]
> 
> and as you say, is just after an IT instruction, which would have
> set the IT execution state to appropriately skip either the first or
> the second instruction.
> 
> Unfortunately, the IT instruction's condition is being carried forward
> to the signal handler, causing either the first or second instruction
> there to be skipped.
> 
> Looking back at the history, the original commit introducing the
> clearing of the PSR_IT_MASK bits is just wrong:
> 
> -       if (thumb)
> +       if (thumb) {
>                 cpsr |= PSR_T_BIT;
> -       else
> +#if __LINUX_ARM_ARCH__ >= 7
> +               /* clear the If-Then Thumb-2 execution state */
> +               cpsr &= ~PSR_IT_MASK;
> +#endif
> +       } else
>                 cpsr &= ~PSR_T_BIT;
> 
> This shouldn't be a compile-time decision at all, and it certainly should
> not be dependent on __LINUX_ARM_ARCH__, which marks the _lowest_ supported
> architecture.
> 
> However, even the idea that it's ARMv7 or later is wrong.  According to
> the ARM ARM, the IT instruction is present in ARMv6T2 as well, which
> means it's ARMv6 too (which would have __LINUX_ARM_ARCH__ = 6).
> 
> Looking at the ARM ARM, these bits are "reserved" in previous non-T2
> architectures, have an undefined value at reset, and are probably zero
> anyway.
> 
> Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the problem,
> and I doubt there's any ARMv6 non-T2 systems out there that would be
> affected by clearing the IT state bits.

Please test the following patch:

8<===
From: Russell King 
Subject: [PATCH] ARM: fix Thumb2 signal handling when ARMv6 is enabled

When a kernel is built covering ARMv6 to ARMv7, we omit to clear the
IT state when entering a signal handler.  This can cause the first
few instructions to be conditionally executed depending on the parent
context.

In any case, the 

RE: mysterious crashes on OMAP5 uevm

2015-09-11 Thread Woodruff, Richard
> From: Russell King - ARM Linux [mailto:li...@arm.linux.org.uk]
> Sent: Friday, September 11, 2015 12:49 PM

> Frankly, Richard, you're getting on my nerves in this thread - you seem to
> know all about this problem, yet you never reported the problem upstream,
> so people are effectively having to waste time re-doing the work that you've
> already done.
>
> Nothing annoys me more than having people say "oh yes, I found that
> problem and worked on it" and nothing coming of it (no report, no patch, no
> nothing.)

Yes, when I put out the hint (to help speed resolution) I expected there might 
be some negative interpretation.

When I originally hit the issue, I did pass along information to folks who work 
in the area with expectation they would follow through.  Probably it got lost.

When I noticed this thread, it appeared like the CPSR.IT information didn't 
make it out, so I directly posted what I recalled.

> As you have "old notes" you've already investigated this issue, and
> presumably you came up with a patch.  Where is it?

I didn't generate a comprehensive one. I did a couple of hack versions but was 
unsure in some of the areas your analysis has cleared... for that issue I ended 
up advising a reversion of MULTI_V6 for that older kernel.

Regards,
Richard W.



Re: mysterious crashes on OMAP5 uevm

2015-09-11 Thread Russell King - ARM Linux
On Fri, Sep 11, 2015 at 04:12:21PM +, Woodruff, Richard wrote:
> > From: linux-omap-ow...@vger.kernel.org [mailto:linux-omap-
> > ow...@vger.kernel.org] On Behalf Of Russell King - ARM Linux
> > Sent: Friday, September 11, 2015 9:03 AM
> > To: Grazvydas Ignotas
> 
> > However, even the idea that it's ARMv7 or later is wrong.  According to
> > the ARM ARM, the IT instruction is present in ARMv6T2 as well, which
> > means it's ARMv6 too (which would have __LINUX_ARM_ARCH__ = 6).
> 
> I recall seeing ARMv6T2 first implemented in the ARM1156 which is a
> v6 CPU with T2 option added.

Exactly, which is why we need to be dealing with the IT bits in signal
handling for >= ARMv6, not >= ARMv7.

> > Looking at the ARM ARM, these bits are "reserved" in previous non-T2
> > architectures, have an undefined value at reset, and are probably zero
> > anyway.
> > 
> > Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the
> > problem,
> > and I doubt there's any ARMv6 non-T2 systems out there that would be
> > affected by clearing the IT state bits.
> 
> Probably you already looked, but cpsr.it usage is not restricted to this
> one spot.

Other places:

arch/arm/mm/extable.c-#ifdef CONFIG_THUMB2_KERNEL
arch/arm/mm/extable.c-  /* Clear the IT state to avoid nasty surprises in the fixup */
arch/arm/mm/extable.c:  regs->ARM_cpsr &= ~PSR_IT_MASK;
arch/arm/mm/extable.c-#endif

which is irrelevant here.  This code only deals with kernel mode, and
the only time that this makes sense is when the kernel is built using
Thumb2 instructions.  CONFIG_THUMB2_KERNEL covers the case properly.

arch/arm/probes/kprobes/test-core.c-    regs->ARM_lr = val ^ (14 << 8);
arch/arm/probes/kprobes/test-core.c:    regs->ARM_cpsr &= ~(APSR_MASK | PSR_IT_MASK);
arch/arm/probes/kprobes/test-core.c-    regs->ARM_cpsr |= test_context_cpsr(scenario);

From what I can see, this happens unconditionally.

KVM and Xen code... that requires virtualisation support, which is ARMv7.

arch/arm/probes/kprobes/actions-thumb.c... emulating an IT instruction.
arch/arm/probes/decode.h::it_advance... emulating Thumb2.

So really there's no other places that need fixing.

> Looking back at old notes I think both debug and signal handler code
> keyed on bit usage.  I see from LXR kernel KVM code also uses in some
> capacity.

Frankly, Richard, you're getting on my nerves in this thread - you
seem to know all about this problem, yet you never reported the problem
upstream, so people are effectively having to waste time re-doing the
work that you've already done.

Nothing annoys me more than having people say "oh yes, I found that
problem and worked on it" and nothing coming of it (no report, no
patch, no nothing.)

As you have "old notes" you've already investigated this issue, and
presumably you came up with a patch.  Where is it?

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


RE: mysterious crashes on OMAP5 uevm

2015-09-11 Thread Woodruff, Richard
> From: linux-omap-ow...@vger.kernel.org [mailto:linux-omap-
> ow...@vger.kernel.org] On Behalf Of Russell King - ARM Linux
> Sent: Friday, September 11, 2015 9:03 AM
> To: Grazvydas Ignotas

> However, even the idea that it's ARMv7 or later is wrong.  According to
> the ARM ARM, the IT instruction is present in ARMv6T2 as well, which
> means it's ARMv6 too (which would have __LINUX_ARM_ARCH__ = 6).

I recall seeing ARMv6T2 first implemented in the ARM1156 which is a v6 CPU with 
T2 option added.

The Cortex-R class was the ARMv7 successor to the 1156 CPU, and it also uses T2.

> Looking at the ARM ARM, these bits are "reserved" in previous non-T2
> architectures, have an undefined value at reset, and are probably zero
> anyway.
> 
> Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the
> problem,
> and I doubt there's any ARMv6 non-T2 systems out there that would be
> affected by clearing the IT state bits.

Probably you already looked, but cpsr.it usage is not restricted to this one 
spot.

Looking back at old notes, I think both the debug and the signal handler code
keyed on bit usage.  I see from LXR that the kernel's KVM code also uses these
bits in some capacity.

The 1156/Cortex-R are typically MMU-less.  There may (or may not) be something
else to consider for them when fixing this.

Regards,
Richard W.



Re: mysterious crashes on OMAP5 uevm

2015-09-11 Thread Russell King - ARM Linux
On Fri, Sep 11, 2015 at 03:27:13PM +0200, Grazvydas Ignotas wrote:
> On Thu, Sep 10, 2015 at 10:30 AM, Russell King - ARM Linux
>  wrote:
> > On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
> >> ...
> >>
> >> Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away and adding 
> >> the
> >> >> #if 0 //__LINUX_ARM_ARCH__ >= 7
> >> makes it re-appear.
> >>
> >> A while ago I tried to debug running the x-server under strace and could 
> >> find that it also has
> >> something to do with SIGALRM.
> >>
> >> And that is very consistent with “enable/disable” by modifying 
> >> arch/arm/kernel/signal.c
> >
> > It would be really nice if someone could diagnose what's going on here.
> > What exception is causing the X server to be killed (someone said a
> > segfault)?  What is the register state at the point that happens?  What
> > does the code look like?  Is it happening inside the SIGALRM handler, or
> > when the SIGALRM handler has returned?
> >
> > I'd suggest attaching gdb to the X server, but remember to set gdb to
> > ignore SIGPIPEs.
> 
> It's actually pretty random, see some debug sessions in [1].
> The first one is the most useful one, but I haven't thought of checking
> what pixman_rasterize_edges() was doing when the signal arrived, and
> most often the "less useful" segfaults occur. However from the
> disassembly (see debug1_libpixman.gz) it can be seen that the signal
> arrived right after IT.
> 
> [1] http://notaz.gp2x.de/tmp/thumb_segfault/

We're not going from ARM -> Thumb or Thumb -> ARM here, but Thumb code
in libpixman is being interrupted calling a Thumb signal handler.

Working through the code:

   0x7f717ec8 : ldr r2, [pc, #20]   ; = 0x0004112e
   0x7f717eca :   ldr r1, [pc, #24]   ; = 0x0c48
   0x7f717ecc :   ldr r3, [pc, #24]   ; = 0x0e6c
   0x7f717ece :   add r2, pc
   0x7f717ed0 :   ldr r1, [r2, r1]
   0x7f717ed2 :  ldr r3, [r2, r3]
=> 0x7f717ed4 :  ldr r2, [r1, #0]

The instruction at 0x7f717ed4 was trying to access 0xd1242963 which
is in kernel space, and this is the faulting instruction.

At this point, r2 should contain 0x0004112e plus the PC value.  r2 in
the register dump was 0x7f717fa0.  Let's calculate the value that PC
should be here.  0x7f717fa0 - 0x0004112e = 0x7f6d6e72, which is
clearly wrong.

So, I don't think the first instruction here was executed by the CPU.

gdb indicates that the parent context to the signal frame, pc was at
0xb6dd87f8, which works out at 0x297f8 into the libpixman-1 library:

   297f0:   449c        add     ip, r3
   297f2:   f1bc 0fff   cmp.w   ip, #255        ; 0xff
   297f6:   bfd4        ite     le
   297f8:   fa5f fc8c   uxtble.w        ip, ip
   297fc:   f04f 0cff   movgt.w ip, #255        ; 0xff
   29800:   f88a c000   strb.w  ip, [sl]

and as you say, is just after an IT instruction, which would have
set the IT execution state to appropriately skip either the first or
the second instruction.

Unfortunately, the IT instruction's condition is being carried forward
to the signal handler, causing either the first or second instruction
there to be skipped.

Looking back at the history, the original commit introducing the
clearing of the PSR_IT_MASK bits is just wrong:

-       if (thumb)
+       if (thumb) {
                cpsr |= PSR_T_BIT;
-       else
+#if __LINUX_ARM_ARCH__ >= 7
+               /* clear the If-Then Thumb-2 execution state */
+               cpsr &= ~PSR_IT_MASK;
+#endif
+       } else
                cpsr &= ~PSR_T_BIT;

This shouldn't be a compile-time decision at all, and it certainly should
not be dependent on __LINUX_ARM_ARCH__, which marks the _lowest_ supported
architecture.

However, even the idea that it's ARMv7 or later is wrong.  According to
the ARM ARM, the IT instruction is present in ARMv6T2 as well, which
means it's ARMv6 too (which would have __LINUX_ARM_ARCH__ = 6).

Looking at the ARM ARM, these bits are "reserved" in previous non-T2
architectures, have an undefined value at reset, and are probably zero
anyway.

Merely changing __LINUX_ARM_ARCH__ >= 7 to >= 6 should fix the problem,
and I doubt there's any ARMv6 non-T2 systems out there that would be
affected by clearing the IT state bits.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: mysterious crashes on OMAP5 uevm

2015-09-11 Thread Grazvydas Ignotas
On Thu, Sep 10, 2015 at 10:30 AM, Russell King - ARM Linux
 wrote:
> On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
>> ...
>>
>> Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away and adding the
>> >> #if 0 //__LINUX_ARM_ARCH__ >= 7
>> makes it re-appear.
>>
>> A while ago I tried to debug running the x-server under strace and could 
>> find that it also has
>> something to do with SIGALRM.
>>
>> And that is very consistent with “enable/disable” by modifying 
>> arch/arm/kernel/signal.c
>
> It would be really nice if someone could diagnose what's going on here.
> What exception is causing the X server to be killed (someone said a
> segfault)?  What is the register state at the point that happens?  What
> does the code look like?  Is it happening inside the SIGALRM handler, or
> when the SIGALRM handler has returned?
>
> I'd suggest attaching gdb to the X server, but remember to set gdb to
> ignore SIGPIPEs.

It's actually pretty random, see some debug sessions in [1].
The first one is the most useful one, but I haven't thought of checking
what pixman_rasterize_edges() was doing when the signal arrived, and
most often the "less useful" segfaults occur. However from the
disassembly (see debug1_libpixman.gz) it can be seen that the signal
arrived right after IT.

[1] http://notaz.gp2x.de/tmp/thumb_segfault/

Gražvydas


RE: mysterious crashes on OMAP5 uevm

2015-09-10 Thread Woodruff, Richard
> From: linux-arm-kernel [mailto:linux-arm-kernel-
> boun...@lists.infradead.org] On Behalf Of Russell King - ARM Linux
 
> >  There are 2 workarounds that I know which make the problem go
> >  away (one is enough):
> >  - recompile Xorg with -marm (I'm using Debian armhf so it's
> >  thumb2 by default)
> >  - disable ARCH_MULTI_V6 in the kernel config

This reminds me of a customer crash I saw quite a while ago relating to thumb2. 
 I thought it was fixed but maybe not.

In a couple spots the PSR_IT_MASK was not conditionally handled well in 
ARCH_MULTI_V6 flow.  Some stack sanity check failed and a BUG() was triggered.

Compiling the app for v6 or pulling MULTI from the kernel build solved the 
issue.

Additionally it was not handled correctly in GDB.   The old build of GDB didn't 
do MULTI and needed a hack to be useable on thumb2 code.

Regards,
Richard W.



Re: mysterious crashes on OMAP5 uevm

2015-09-10 Thread Dr. H. Nikolaus Schaller

Am 10.09.2015 um 10:30 schrieb Russell King - ARM Linux 
:

> On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
>> 
>> Am 08.09.2015 um 23:07 schrieb Tony Lindgren :
>> 
>>> * Grazvydas Ignotas  [150908 13:44]:
>>>> On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren wrote:
>>>>> * Grazvydas Ignotas [150908 05:50]:
>>>>>> Hi,
>>>>>>
>>>>>> this is a longstanding problem I'm seeing since the very beginning,
>>>>>> which was around 3.12 or so (when I've first got the hardware) and it
>>>>>> seems 4.2 is affected by it still. Basically what happens is Xorg
>>>>>> randomly segfaults at some "impossible" location. I don't have the
>>>>>> details at the moment (could get them if needed), but from what I
>>>>>> examined with gdb some time ago the situation did not make any sense.
>>>>>>
>>>>>> There are 2 workarounds that I know which make the problem go away
>>>>>> (one is enough):
>>>>>> - recompile Xorg with -marm (I'm using Debian armhf so it's thumb2 by
>>>>>> default)
>>>>>> - disable ARCH_MULTI_V6 in the kernel config
>>>>>>
>>>>>> Because of the above workarounds I have forgotten about it several
>>>>>> times, but it regularly comes back and bites again. It would look like
>>>>>> some missing erratum workaround, but I have all of them enabled in the
>>>>>> kernel.
>>>>>>
>>>>>> Does anyone know about this? Perhaps some missing erratum workaround
>>>>>> in the bootloader? u-boot isn't too old here (2015.07).
>>>>>
>>>>> Seems like some incorrect handling with CONFIG_CPU_V6 compiled in..
>>>>> Maybe try to narrow it down by commenting out some CONFIG_CPU_V6 and
>>>>> __LINUX_ARM_ARCH__ = 6 ifdefs in the git grep CONFIG_CPU_V6
>>>>> places ignoring uncompress and davinci code.
>>>>
>>>> ok with that it was quite easy to find. On a kernel with ARCH_MULTI_V6
>>>> disabled, it is enough to just do this:
>>>>
>>>> --- a/arch/arm/kernel/signal.c
>>>> +++ b/arch/arm/kernel/signal.c
>>>> @@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
>>>>         /*
>>>>          * The LSB of the handler determines if we're going to
>>>>          * be using THUMB or ARM mode for this signal handler.
>>>>          */
>>>>         thumb = handler & 1;
>>>>
>>>> -#if __LINUX_ARM_ARCH__ >= 7
>>>> +#if 0 //__LINUX_ARM_ARCH__ >= 7
>>>>         /*
>>>>          * Clear the If-Then Thumb-2 execution state
>>>>          * ARM spec requires this to be all 000s in ARM mode
>>>>          * Snapdragon S4/Krait misbehaves on a Thumb=>ARM
>>>>          * signal transition without this.
>>>>          */
>>>>
>>>> ... and the problem appears, so I guess this needs some real
>>>> multiplatform handling.
>>> 
>>> OK nice to hear you found it. Yeah looks like some runtime
>>> capability check is needed.
>>> 
>>>>> Do you have some easy way to reproduce this issue?
>>>>
>>>> Just moving a browser window around with mouse usually triggers it
>>>> within a minute.
>>> 
>>> OK good to know.
>> 
>> It looks as if this is the solution for the same symptom on our OMAP3 board 
>> (gta04).
>> There, it suffices to draw on the touch screen for ~10 seconds to make the 
>> xserver segfault.
>> 
>> [we are using the binary xserver from debian wheezy
>> ii  xserver-xorg-core  2:1.12.4-6+deb7u5  armhf  Xorg X server - core server]
>> 
>> We know about this bug for a while, but so far did think that some touch 
>> screen
>> event bit has changed and we have to fix our touch screen driver.
>> 
>> Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away and adding the
>>>> #if 0 //__LINUX_ARM_ARCH__ >= 7
>> makes it re-appear.
>> 
>> A while ago I tried to debug running the x-server under strace and could 
>> find that it also has
>> something to do with SIGALRM.
>> 
>> And that is very consistent with “enable/disable” by modifying 
>> arch/arm/kernel/signal.c
> 
> It would be really nice if someone could diagnose what's going on here.
> What exception is causing the X server to be killed (someone said a
> segfault)?  What is the register state at the point that happens?  What
> does the code look like?  Is it happening inside the SIGALRM handler, or
> when the SIGALRM handler has returned?
> 
> I'd suggest attaching gdb to the X server, but remember to set gdb to
> ignore SIGPIPEs.

I don’t have a setup to run gdb (with source) on the device and really zero
experience with Xserver sources. But maybe Grazvydas can do that better
than me.

Attached is some strace output I recorded during my earlier experiments.
The X server appears to make heavy use not only of SIGALRM but also of SIGIO.

And it looks as if the segfault occurs inside the SIGIO handler after
it has done 3 syscalls (select, read, clock_gettime) but before the
sigreturn. At least in this example.

The X server then does a graceful shutdown after the segfault, i.e. it
prints the segfault message by itself.

Hope this is a useful piece to solve the puzzl

Re: mysterious crashes on OMAP5 uevm

2015-09-10 Thread Russell King - ARM Linux
On Thu, Sep 10, 2015 at 08:42:57AM +0200, Dr. H. Nikolaus Schaller wrote:
> 
> Am 08.09.2015 um 23:07 schrieb Tony Lindgren :
> 
> > * Grazvydas Ignotas  [150908 13:44]:
> >> On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren  wrote:
> >>> * Grazvydas Ignotas  [150908 05:50]:
> >>>> Hi,
> >>>>
> >>>> this is a longstanding problem I'm seeing since the very beginning,
> >>>> which was around 3.12 or so (when I've first got the hardware) and it
> >>>> seems 4.2 is affected by it still. Basically what happens is Xorg
> >>>> randomly segfaults at some "impossible" location. I don't have the
> >>>> details at the moment (could get them if needed), but from what I
> >>>> examined with gdb some time ago the situation did not make any sense.
> >>>>
> >>>> There are 2 workarounds that I know which make the problem go away
> >>>> (one is enough):
> >>>> - recompile Xorg with -marm (I'm using Debian armhf so it's thumb2 by
> >>>> default)
> >>>> - disable ARCH_MULTI_V6 in the kernel config
> >>>>
> >>>> Because of the above workarounds I have forgotten about it several
> >>>> times, but it regularly comes back and bites again. It would look like
> >>>> some missing erratum workaround, but I have all of them enabled in the
> >>>> kernel.
> >>>>
> >>>> Does anyone know about this? Perhaps some missing erratum workaround
> >>>> in the bootloader? u-boot isn't too old here (2015.07).
> >>> 
> >>> Seems like some incorrect handling with CONFIG_CPU_V6 compiled in..
> >>> Maybe try to narrow it down by commenting out some CONFIG_CPU_V6 and
> >>> __LINUX_ARM_ARCH__ = 6 ifdefs in the git grep CONFIG_CPU_V6
> >>> places ignoring uncompress and davinci code.
> >> 
> >> ok with that it was quite easy to find. On a kernel with ARCH_MULTI_V6
> >> disabled, it is enough to just do this:
> >> 
> >> --- a/arch/arm/kernel/signal.c
> >> +++ b/arch/arm/kernel/signal.c
> >> @@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
> >>         /*
> >>          * The LSB of the handler determines if we're going to
> >>          * be using THUMB or ARM mode for this signal handler.
> >>          */
> >>         thumb = handler & 1;
> >> 
> >> -#if __LINUX_ARM_ARCH__ >= 7
> >> +#if 0 //__LINUX_ARM_ARCH__ >= 7
> >>         /*
> >>          * Clear the If-Then Thumb-2 execution state
> >>          * ARM spec requires this to be all 000s in ARM mode
> >>          * Snapdragon S4/Krait misbehaves on a Thumb=>ARM
> >>          * signal transition without this.
> >>          */
> >> 
> >> ... and the problem appears, so I guess this needs some real
> >> multiplatform handling.
> > 
> > OK nice to hear you found it. Yeah looks like some runtime
> > capability check is needed.
> > 
> >>> Do you have some easy way to reproduce this issue?
> >> 
> >> Just moving a browser window around with mouse usually triggers it
> >> within a minute.
> > 
> > OK good to know.
> 
> It looks as if this is the solution for the same symptom on our OMAP3 board 
> (gta04).
> There, it suffices to draw on the touch screen for ~10 seconds to make the 
> xserver segfault.
> 
> [we are using the binary xserver from debian wheezy
> ii  xserver-xorg-core  2:1.12.4-6+deb7u5  armhf  Xorg X server - core server]
> 
> We know about this bug for a while, but so far did think that some touch 
> screen
> event bit has changed and we have to fix our touch screen driver.
> 
> Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away, and adding
> the
> >> #if 0 //__LINUX_ARM_ARCH__ >= 7
> change makes it reappear.
> 
> A while ago I tried to debug this by running the X server under strace and
> found that it also has something to do with SIGALRM.
> 
> That is very consistent with being able to enable and disable the bug by
> modifying arch/arm/kernel/signal.c.

It would be really nice if someone could diagnose what's going on here.
What exception is causing the X server to be killed (someone said a
segfault)?  What is the register state at the point that happens?  What
does the code look like?  Is it happening inside the SIGALRM handler, or
when the SIGALRM handler has returned?

I'd suggest attaching gdb to the X server, but remember to set gdb to
ignore SIGPIPEs.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: mysterious crashes on OMAP5 uevm

2015-09-09 Thread Dr. H. Nikolaus Schaller

On 08.09.2015 at 23:07, Tony Lindgren wrote:

> * Grazvydas Ignotas  [150908 13:44]:
>> On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren  wrote:
>>> * Grazvydas Ignotas  [150908 05:50]:
>>>> Hi,
>>>>
>>>> this is a longstanding problem I'm seeing since the very beginning,
>>>> which was around 3.12 or so (when I've first got the hardware) and it
>>>> seems 4.2 is affected by it still. Basically what happens is Xorg
>>>> randomly segfaults at some "impossible" location. I don't have the
>>>> details at the moment (could get them is needed), but from what I
>>>> examined with gdb some time ago the situation did not make any sense.
>>>>
>>>> There are 2 workarounds that I know which make the problem go away
>>>> (one is enough):
>>>> - recompile Xorg with -marm (I'm using Debian armhf so it's thumb2 by default)
>>>> - disable ARCH_MULTI_V6 in the kernel config
>>>>
>>>> Because of the above workarounds I have forgotten about it several
>>>> times, but it regularly comes back and bites again. It would look like
>>>> some missing erratum workaround, but I have all of them enabled in the
>>>> kernel.
>>>>
>>>> Does anyone know about this? Perhaps some missing erratum workaround
>>>> in the bootloader? u-boot isn't too old here (2015.07).
>>> 
>>> Seems like some incorrect handling with CONFIG_CPU_V6 compiled in..
>>> Maybe try to narrow it down by commenting out some CONFIG_CPU_V6 and
>>> __LINUX_ARM_ARCH__ = 6 ifdefs in the git grep CONFIG_CPU_V6
>>> places ignoring uncompress and davinci code.
>> 
>> ok with that it was quite easy to find. On a kernel with ARCH_MULTI_V6
>> disabled, it is enough to just do this:
>> 
>> --- a/arch/arm/kernel/signal.c
>> +++ b/arch/arm/kernel/signal.c
>> @@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
>>/*
>> * The LSB of the handler determines if we're going to
>> * be using THUMB or ARM mode for this signal handler.
>> */
>>thumb = handler & 1;
>> 
>> -#if __LINUX_ARM_ARCH__ >= 7
>> +#if 0 //__LINUX_ARM_ARCH__ >= 7
>>/*
>> * Clear the If-Then Thumb-2 execution state
>> * ARM spec requires this to be all 000s in ARM mode
>> * Snapdragon S4/Krait misbehaves on a Thumb=>ARM
>> * signal transition without this.
>> */
>> 
>> ... and the problem appears, so I guess this needs some real
>> multiplatform handling.
> 
> OK nice to hear you found it. Yeah looks like some runtime
> capability check is needed.
> 
>>> Do you have some easy way to reproduce this issue?
>> 
>> Just moving a browser window around with mouse usually triggers it
>> within a minute.
> 
> OK good to know.

It looks as if this is the solution for the same symptom on our OMAP3
board (gta04). There, it suffices to draw on the touch screen for ~10
seconds to make the xserver segfault.

[we are using the binary xserver from debian wheezy
ii  xserver-xorg-core  2:1.12.4-6+deb7u5  armhf  Xorg X server - core server]

We have known about this bug for a while, but so far we thought that some
touch screen event bit had changed and that we had to fix our touch screen
driver.

Now, disabling CONFIG_ARCH_MULTI_V6 also makes the bug go away, and adding
the
>> #if 0 //__LINUX_ARM_ARCH__ >= 7
change makes it reappear.

A while ago I tried to debug this by running the X server under strace and
found that it also has something to do with SIGALRM.

That is very consistent with being able to enable and disable the bug by
modifying arch/arm/kernel/signal.c.

BR,
Nikolaus




Re: mysterious crashes on OMAP5 uevm

2015-09-08 Thread Tony Lindgren
* Grazvydas Ignotas  [150908 13:44]:
> On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren  wrote:
> > * Grazvydas Ignotas  [150908 05:50]:
> >> Hi,
> >>
> >> this is a longstanding problem I'm seeing since the very beginning,
> >> which was around 3.12 or so (when I've first got the hardware) and it
> >> seems 4.2 is affected by it still. Basically what happens is Xorg
> >> randomly segfaults at some "impossible" location. I don't have the
> >> details at the moment (could get them is needed), but from what I
> >> examined with gdb some time ago the situation did not make any sense.
> >>
> >> There are 2 workarounds that I know which make the problem go away
> >> (one is enough):
> >> - recompile Xorg with -marm (I'm using Debian armhf so it's thumb2 by 
> >> default)
> >> - disable ARCH_MULTI_V6 in the kernel config
> >>
> >> Because of the above workarounds I have forgotten about it several
> >> times, but it regularly comes back and bites again. It would look like
> >> some missing erratum workaround, but I have all of them enabled in the
> >> kernel.
> >>
> >> Does anyone know about this? Perhaps some missing erratum workaround
> >> in the bootloader? u-boot isn't too old here (2015.07).
> >
> > Seems like some incorrect handling with CONFIG_CPU_V6 compiled in..
> > Maybe try to narrow it down by commenting out some CONFIG_CPU_V6 and
> > __LINUX_ARM_ARCH__ = 6 ifdefs in the git grep CONFIG_CPU_V6
> > places ignoring uncompress and davinci code.
> 
> ok with that it was quite easy to find. On a kernel with ARCH_MULTI_V6
> disabled, it is enough to just do this:
> 
> --- a/arch/arm/kernel/signal.c
> +++ b/arch/arm/kernel/signal.c
> @@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
> /*
>  * The LSB of the handler determines if we're going to
>  * be using THUMB or ARM mode for this signal handler.
>  */
> thumb = handler & 1;
> 
> -#if __LINUX_ARM_ARCH__ >= 7
> +#if 0 //__LINUX_ARM_ARCH__ >= 7
> /*
>  * Clear the If-Then Thumb-2 execution state
>  * ARM spec requires this to be all 000s in ARM mode
>  * Snapdragon S4/Krait misbehaves on a Thumb=>ARM
>  * signal transition without this.
>  */
> 
> ... and the problem appears, so I guess this needs some real
> multiplatform handling.

OK, nice to hear you found it. Yeah, it looks like some runtime
capability check is needed.
 
> > Do you have some easy way to reproduce this issue?
> 
> Just moving a browser window around with mouse usually triggers it
> within a minute.

OK good to know.

Regards,

Tony


Re: mysterious crashes on OMAP5 uevm

2015-09-08 Thread Grazvydas Ignotas
On Tue, Sep 8, 2015 at 4:38 PM, Tony Lindgren  wrote:
> * Grazvydas Ignotas  [150908 05:50]:
>> Hi,
>>
>> this is a longstanding problem I'm seeing since the very beginning,
>> which was around 3.12 or so (when I've first got the hardware) and it
>> seems 4.2 is affected by it still. Basically what happens is Xorg
>> randomly segfaults at some "impossible" location. I don't have the
>> details at the moment (could get them is needed), but from what I
>> examined with gdb some time ago the situation did not make any sense.
>>
>> There are 2 workarounds that I know which make the problem go away
>> (one is enough):
>> - recompile Xorg with -marm (I'm using Debian armhf so it's thumb2 by 
>> default)
>> - disable ARCH_MULTI_V6 in the kernel config
>>
>> Because of the above workarounds I have forgotten about it several
>> times, but it regularly comes back and bites again. It would look like
>> some missing erratum workaround, but I have all of them enabled in the
>> kernel.
>>
>> Does anyone know about this? Perhaps some missing erratum workaround
>> in the bootloader? u-boot isn't too old here (2015.07).
>
> Seems like some incorrect handling with CONFIG_CPU_V6 compiled in..
> Maybe try to narrow it down by commenting out some CONFIG_CPU_V6 and
> __LINUX_ARM_ARCH__ = 6 ifdefs in the git grep CONFIG_CPU_V6
> places ignoring uncompress and davinci code.

OK, with that it was quite easy to find. On a kernel with ARCH_MULTI_V6
disabled, it is enough to just do this:

--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -340,13 +340,13 @@ setup_return(struct pt_regs *regs, struct ksignal *ksig,
 	/*
 	 * The LSB of the handler determines if we're going to
 	 * be using THUMB or ARM mode for this signal handler.
 	 */
 	thumb = handler & 1;
 
-#if __LINUX_ARM_ARCH__ >= 7
+#if 0 //__LINUX_ARM_ARCH__ >= 7
 	/*
 	 * Clear the If-Then Thumb-2 execution state
 	 * ARM spec requires this to be all 000s in ARM mode
 	 * Snapdragon S4/Krait misbehaves on a Thumb=>ARM
 	 * signal transition without this.
 	 */

... and the problem appears, so I guess this needs some real
multiplatform handling.
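
To spell out why the config option matters, here is a rough user-space
model of what that block does to the saved CPSR.  It is a sketch, not the
kernel code: handler_cpsr() is a made-up helper, and its min_arch argument
stands in for the compile-time __LINUX_ARM_ARCH__, which as far as I
understand is the lowest architecture the kernel is built for (so 6 once
ARCH_MULTI_V6 is enabled).  The two PSR constants are the values from
arch/arm/include/uapi/asm/ptrace.h.

```c
#include <stdint.h>

#define PSR_T_BIT	0x00000020u	/* Thumb state bit */
#define PSR_IT_MASK	0x0600fc00u	/* If-Then execution state bits */

/*
 * Hypothetical model of the CPSR set up for a signal handler.
 * 'handler' is the handler address (LSB set means Thumb), 'min_arch'
 * plays the role of __LINUX_ARM_ARCH__ at build time.
 */
uint32_t handler_cpsr(uint32_t cpsr, uint32_t handler, int min_arch)
{
	int thumb = handler & 1;

	/* With min_arch == 6 this is skipped, even on an ARMv7 CPU. */
	if (min_arch >= 7)
		cpsr &= ~PSR_IT_MASK;

	if (thumb)
		cpsr |= PSR_T_BIT;
	else
		cpsr &= ~PSR_T_BIT;

	return cpsr;
}
```

If the signal arrives in the middle of an IT block, the interrupted CPSR
has some of the PSR_IT_MASK bits set.  With min_arch at 7 they are cleared
before the handler runs; with min_arch at 6 they are carried over into a
Thumb handler, which would then start executing with a stale IT state.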

> Do you have some easy way to reproduce this issue?

Just moving a browser window around with mouse usually triggers it
within a minute.

>
> Regards,
>
> Tony

Gražvydas


Re: mysterious crashes on OMAP5 uevm

2015-09-08 Thread Tony Lindgren
* Grazvydas Ignotas  [150908 05:50]:
> Hi,
> 
> this is a longstanding problem I'm seeing since the very beginning,
> which was around 3.12 or so (when I've first got the hardware) and it
> seems 4.2 is affected by it still. Basically what happens is Xorg
> randomly segfaults at some "impossible" location. I don't have the
> details at the moment (could get them is needed), but from what I
> examined with gdb some time ago the situation did not make any sense.
> 
> There are 2 workarounds that I know which make the problem go away
> (one is enough):
> - recompile Xorg with -marm (I'm using Debian armhf so it's thumb2 by default)
> - disable ARCH_MULTI_V6 in the kernel config
> 
> Because of the above workarounds I have forgotten about it several
> times, but it regularly comes back and bites again. It would look like
> some missing erratum workaround, but I have all of them enabled in the
> kernel.
> 
> Does anyone know about this? Perhaps some missing erratum workaround
> in the bootloader? u-boot isn't too old here (2015.07).

Seems like some incorrect handling with CONFIG_CPU_V6 compiled in.
Maybe try to narrow it down by commenting out some of the CONFIG_CPU_V6
and __LINUX_ARM_ARCH__ = 6 ifdefs in the places that git grep
CONFIG_CPU_V6 turns up, ignoring the uncompress and davinci code.

Do you have some easy way to reproduce this issue?

Regards,

Tony