Re: kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-11 Thread Michael Ellerman
Madhavan Srinivasan  writes:

> On Friday 07 April 2017 06:06 PM, Michael Ellerman wrote:
>> Sachin Sant  writes:
>>
>>> I have run into a few instances where the lost_exception_test from
>>> powerpc kselftest fails with SIGABRT. Following o/p is against
>>> 4.11.0-rc5. The failure is intermittent.
>> What hardware are you on?
>>
>> How long does it take to run when it fails? I assume ~2 minutes?
>
> Started a run on a POWER8 host (habanero); it has been going for more than
> 24 hours and hasn't failed yet. So this should be a guest/VM scenario then?

Aha good point. I never tested this much (at all?) on VMs because it was
about verifying a workaround for a hardware bug.

So does it happen on both KVM and PowerVM or just one or the other?

cheers


Re: kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-10 Thread Sachin Sant

> On 07-Apr-2017, at 6:06 PM, Michael Ellerman  wrote:
> 
> Sachin Sant  writes:
> 
>> I have run into a few instances where the lost_exception_test from
>> powerpc kselftest fails with SIGABRT. Following o/p is against
>> 4.11.0-rc5. The failure is intermittent. 
> 
> What hardware are you on?

I have seen this problem on a POWER8 LPAR.

> 
> How long does it take to run when it fails? I assume ~2 minutes?

Yes somewhere around 2 min.


>> MMCR2 0x
>> EBBHR 0x10003dcc
>> BESCR 0x8001 GE PMAE 
> 
> And that says we have global enable set and events enabled.
> 
> 
> So I think there is a bug here somewhere. I don't really have time to
> dig into it now, neither does Maddy I think. But we should try and get
> to it at some point.
> 

Let me know if I can help with debug.

Thanks
-Sachin


> cheers
> 



Re: kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-09 Thread Madhavan Srinivasan



On Friday 07 April 2017 06:06 PM, Michael Ellerman wrote:
> Sachin Sant  writes:
>
>> I have run into a few instances where the lost_exception_test from
>> powerpc kselftest fails with SIGABRT. Following o/p is against
>> 4.11.0-rc5. The failure is intermittent.
>
> What hardware are you on?
>
> How long does it take to run when it fails? I assume ~2 minutes?

Started a run on a POWER8 host (habanero); it has been going for more than
24 hours and hasn't failed yet. So this should be a guest/VM scenario then?

>> When the test fails it is killed due to SIGABRT.
>>
>> # ./lost_exception_test
>> test: lost_exception
>> tags: git_version:unknown
>> Binding to cpu 8
>> main test running as pid 9208
>> EBB Handler is at 0x10003dcc
>> !! killing lost_exception
>
> This is the parent (the test harness) saying it's about to kill the child,
> because it took too long.
>
> It sends SIGTERM, but the child catches that, prints all this info, and
> then aborts() - so that's why you're seeing SIGABRT.
>
>> ebb_state:
>>   ebb_count= 191529
>
> The test usually runs until it's taken 1,000,000 EBBs, so it looks like
> we got stuck.
>
>>   spurious = 0
>>   negative = 0
>>   no_overflow  = 0
>>   pmc[1] count = 0x0
>>   pmc[2] count = 0x0
>>   pmc[3] count = 0x0
>>   pmc[4] count = 0x4c1b707
>
> We use a varying sample period of between 400 and 600, and from above
> we've taken 191,529 EBBs.
>
> 0x4c1b707 / 191,529 ~= 416
>
> So that looks reasonable.
>
>>   pmc[5] count = 0x0
>>   pmc[6] count = 0x0
>> HW state:
>> MMCR0 0x8080 FC PMAO
>
> But this says we're stopped with counters frozen and an event pending.
>
>> MMCR2 0x
>> EBBHR 0x10003dcc
>> BESCR 0x8001 GE PMAE
>
> And that says we have global enable set and events enabled.
>
>
> So I think there is a bug here somewhere. I don't really have time to
> dig into it now, neither does Maddy I think. But we should try and get
> to it at some point.
>
> cheers





Re: kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-07 Thread Michael Ellerman
Sachin Sant  writes:

> I have run into a few instances where the lost_exception_test from
> powerpc kselftest fails with SIGABRT. Following o/p is against
> 4.11.0-rc5. The failure is intermittent. 

What hardware are you on?

How long does it take to run when it fails? I assume ~2 minutes?

> When the test fails it is killed due to SIGABRT.

> # ./lost_exception_test 
> test: lost_exception
> tags: git_version:unknown
> Binding to cpu 8
> main test running as pid 9208
> EBB Handler is at 0x10003dcc
> !! killing lost_exception

This is the parent (the test harness) saying it's about to kill the child,
because it took too long.

It sends SIGTERM, but the child catches that, prints all this info, and
then aborts() - so that's why you're seeing SIGABRT.
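
To make that sequence concrete, here is a rough sketch of the same pattern
in Python, assuming a POSIX system. It is not the kselftest harness code:
CHILD, run_with_timeout and the 0.2 second timeout are made-up names and
values for illustration only. The parent waits, gives up, and sends
SIGTERM; the child catches it, prints a stand-in state dump, and aborts,
so the parent sees death by signal 6.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child: loops forever (pretending to be stuck) and, on SIGTERM,
# prints a placeholder dump and then aborts, which raises SIGABRT.
CHILD = textwrap.dedent("""
    import os, signal, time

    def on_sigterm(signum, frame):
        print("ebb_state: (state dump would go here)", flush=True)
        os.abort()

    signal.signal(signal.SIGTERM, on_sigterm)
    print("ready", flush=True)
    while True:
        time.sleep(0.05)
""")


def run_with_timeout(timeout_s):
    """Run the child; if it outlives timeout_s, send SIGTERM like the harness."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # wait for "ready": the handler is installed
    try:
        child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print("!! killing child")
        child.send_signal(signal.SIGTERM)
    output, _ = child.communicate()  # collect the dump and reap the child
    return child.returncode, output


if __name__ == "__main__":
    rc, out = run_with_timeout(0.2)
    print(out, end="")
    # A negative return code from subprocess means "killed by that signal".
    print("child died by signal", -rc)
```

The point is only that the signal the parent reports (SIGABRT) is not the
one it sent (SIGTERM), because the child's handler converts one into the
other after dumping its state.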

> ebb_state:
>   ebb_count= 191529

The test usually runs until it's taken 1,000,000 EBBs, so it looks like
we got stuck.

>   spurious = 0
>   negative = 0
>   no_overflow  = 0
>   pmc[1] count = 0x0
>   pmc[2] count = 0x0
>   pmc[3] count = 0x0
>   pmc[4] count = 0x4c1b707

We use a varying sample period of between 400 and 600, and from above
we've taken 191,529 EBBs.

0x4c1b707 / 191,529 ~= 416

So that looks reasonable.
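
The arithmetic above can be checked with a few lines of Python. This is
just a sketch; it assumes, as the calculation above does, that the
"pmc[4] count" value is the total accumulated over all EBBs taken so far.
The function name average_period is made up for illustration.

```python
# Values copied from the ebb_state dump quoted above.
PMC4_TOTAL = 0x4c1b707   # accumulated pmc[4] count
EBB_COUNT = 191529       # EBBs taken before the test was killed

# Sample period range stated above.
PERIOD_MIN, PERIOD_MAX = 400, 600


def average_period(total, ebbs):
    """Average number of counted events per EBB."""
    return total / ebbs


avg = average_period(PMC4_TOTAL, EBB_COUNT)
print("average period ~= %d" % int(avg))
print("within expected range:", PERIOD_MIN <= avg <= PERIOD_MAX)
```

So the counts collected up to the point where the test got stuck are
consistent with the configured sample period.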

>   pmc[5] count = 0x0
>   pmc[6] count = 0x0
> HW state:
> MMCR0 0x8080 FC PMAO 

But this says we're stopped with counters frozen and an event pending.

> MMCR2 0x
> EBBHR 0x10003dcc
> BESCR 0x8001 GE PMAE 

And that says we have global enable set and events enabled.


So I think there is a bug here somewhere. I don't really have time to
dig into it now, neither does Maddy I think. But we should try and get
to it at some point.

cheers


kselftest:lost_exception_test failure with 4.11.0-rc5

2017-04-07 Thread Sachin Sant
I have run into a few instances where the lost_exception_test from
powerpc kselftest fails with SIGABRT. Following o/p is against
4.11.0-rc5. The failure is intermittent. 

When the test fails it is killed due to SIGABRT.

# ./lost_exception_test 
test: lost_exception
tags: git_version:unknown
Binding to cpu 8
main test running as pid 9208
EBB Handler is at 0x10003dcc
!! killing lost_exception
ebb_state:
  ebb_count= 191529
  spurious = 0
  negative = 0
  no_overflow  = 0
  pmc[1] count = 0x0
  pmc[2] count = 0x0
  pmc[3] count = 0x0
  pmc[4] count = 0x4c1b707
  pmc[5] count = 0x0
  pmc[6] count = 0x0
HW state:
MMCR0 0x8080 FC PMAO 
MMCR2 0x
EBBHR 0x10003dcc
BESCR 0x8001 GE PMAE 
PMC1  0x
PMC2  0x
PMC3  0x
PMC4  0x8000
PMC5  0x88d4f0c8
PMC6  0x1e49da22
SIAR  0x3fffad60a608
!! child died by signal 6
failure: lost_exception
#

Thanks
-Sachin