Re: kselftest:lost_exception_test failure with 4.11.0-rc5
Madhavan Srinivasan writes:
> On Friday 07 April 2017 06:06 PM, Michael Ellerman wrote:
>> Sachin Sant writes:
>>
>>> I have run into few instances where the lost_exception_test from
>>> powerpc kselftest fails with SIGABRT. Following o/p is against
>>> 4.11.0-rc5. The failure is intermittent.
>> What hardware are you on?
>>
>> How long does it take to run when it fails? I assume ~2 minutes?
>
> Started a run in power8 host (habanero) and it is more than 24hrs and
> hasn't failed yet. So this should be guest/VM scenario then?

Aha good point.

I never tested this much (at all?) on VMs because it was about verifying
a workaround for a hardware bug.

So does it happen on both KVM and PowerVM or just one or the other?

cheers
Re: kselftest:lost_exception_test failure with 4.11.0-rc5
> On 07-Apr-2017, at 6:06 PM, Michael Ellerman wrote:
>
> Sachin Sant writes:
>
>> I have run into few instances where the lost_exception_test from
>> powerpc kselftest fails with SIGABRT. Following o/p is against
>> 4.11.0-rc5. The failure is intermittent.
>
> What hardware are you on?

I have seen this problem on a POWER8 LPAR.

>
> How long does it take to run when it fails? I assume ~2 minutes?

Yes somewhere around 2 min.

>> MMCR2 0x
>> EBBHR 0x10003dcc
>> BESCR 0x8001 GE PMAE
>
> And that says we have global enable set and events enabled.
>
>
> So I think there is a bug here somewhere. I don't really have time to
> dig into it now, neither does Maddy I think. But we should try and get
> to it at some point.
>

Let me know if I can help with debug.

Thanks
-Sachin

> cheers
>
Re: kselftest:lost_exception_test failure with 4.11.0-rc5
On Friday 07 April 2017 06:06 PM, Michael Ellerman wrote:
> Sachin Sant writes:
>
>> I have run into few instances where the lost_exception_test from
>> powerpc kselftest fails with SIGABRT. Following o/p is against
>> 4.11.0-rc5. The failure is intermittent.
> What hardware are you on?
>
> How long does it take to run when it fails? I assume ~2 minutes?

Started a run in power8 host (habanero) and it is more than 24hrs and
hasn't failed yet. So this should be guest/VM scenario then?

>> When the test fails it is killed due to SIGABRT.
>> # ./lost_exception_test
>> test: lost_exception
>> tags: git_version:unknown
>> Binding to cpu 8
>> main test running as pid 9208
>> EBB Handler is at 0x10003dcc
>> !! killing lost_exception
> This is the parent (test harness) saying it's about to kill the child,
> because it took too long. It sends SIGTERM, but the child catches that,
> prints all this info, and then aborts() - so that's why you're seeing
> SIGABRT.
>
>> ebb_state:
>> ebb_count= 191529
> The test usually runs until it's taken 1,000,000 EBBs, so it looks like
> we got stuck.
>
>> spurious = 0
>> negative = 0
>> no_overflow = 0
>> pmc[1] count = 0x0
>> pmc[2] count = 0x0
>> pmc[3] count = 0x0
>> pmc[4] count = 0x4c1b707
> We use a varying sample period of between 400 and 600, and from above
> we've taken 191,529 EBBs.
>
>   0x4c1b707 / 191,529 ~= 416
>
> So that looks reasonable.
>
>> pmc[5] count = 0x0
>> pmc[6] count = 0x0
>> HW state:
>> MMCR0 0x8080 FC PMAO
> But this says we're stopped with counters frozen and an event pending.
>
>> MMCR2 0x
>> EBBHR 0x10003dcc
>> BESCR 0x8001 GE PMAE
> And that says we have global enable set and events enabled.
>
>
> So I think there is a bug here somewhere. I don't really have time to
> dig into it now, neither does Maddy I think. But we should try and get
> to it at some point.
>
> cheers
Re: kselftest:lost_exception_test failure with 4.11.0-rc5
Sachin Sant writes:
> I have run into few instances where the lost_exception_test from
> powerpc kselftest fails with SIGABRT. Following o/p is against
> 4.11.0-rc5. The failure is intermittent.

What hardware are you on?

How long does it take to run when it fails? I assume ~2 minutes?

> When the test fails it is killed due to SIGABRT.
> # ./lost_exception_test
> test: lost_exception
> tags: git_version:unknown
> Binding to cpu 8
> main test running as pid 9208
> EBB Handler is at 0x10003dcc
> !! killing lost_exception

This is the parent (test harness) saying it's about to kill the child,
because it took too long. It sends SIGTERM, but the child catches that,
prints all this info, and then aborts() - so that's why you're seeing
SIGABRT.

> ebb_state:
> ebb_count= 191529

The test usually runs until it's taken 1,000,000 EBBs, so it looks like
we got stuck.

> spurious = 0
> negative = 0
> no_overflow = 0
> pmc[1] count = 0x0
> pmc[2] count = 0x0
> pmc[3] count = 0x0
> pmc[4] count = 0x4c1b707

We use a varying sample period of between 400 and 600, and from above
we've taken 191,529 EBBs.

  0x4c1b707 / 191,529 ~= 416

So that looks reasonable.

> pmc[5] count = 0x0
> pmc[6] count = 0x0
> HW state:
> MMCR0 0x8080 FC PMAO

But this says we're stopped with counters frozen and an event pending.

> MMCR2 0x
> EBBHR 0x10003dcc
> BESCR 0x8001 GE PMAE

And that says we have global enable set and events enabled.

So I think there is a bug here somewhere. I don't really have time to
dig into it now, neither does Maddy I think. But we should try and get
to it at some point.

cheers
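
For anyone wanting to re-check the sample-period arithmetic in this reply, a
small Python sketch follows. It only redoes the division using the two numbers
from the dump; the function name average_sample_period is made up for
illustration and is not part of the selftest.

```python
# Values copied from the ebb_state dump quoted above.
pmc4_total = 0x4C1B707   # "pmc[4] count"
ebb_count = 191529       # "ebb_count"


def average_sample_period(total_events, ebbs):
    """Average number of counted events per EBB (hypothetical helper)."""
    return total_events / ebbs


avg = average_sample_period(pmc4_total, ebb_count)

# The reply says the sample period varies between 400 and 600.
in_range = 400 <= avg <= 600

print(f"average period ~= {avg:.1f}, within 400..600: {in_range}")
```

The average lands a little under 417, which matches the ~416 quoted above and
sits inside the 400 to 600 window, so the counts taken before the hang look
sane.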
kselftest:lost_exception_test failure with 4.11.0-rc5
I have run into few instances where the lost_exception_test from
powerpc kselftest fails with SIGABRT. Following o/p is against
4.11.0-rc5. The failure is intermittent.

When the test fails it is killed due to SIGABRT.

# ./lost_exception_test
test: lost_exception
tags: git_version:unknown
Binding to cpu 8
main test running as pid 9208
EBB Handler is at 0x10003dcc
!! killing lost_exception
ebb_state:
ebb_count= 191529
spurious = 0
negative = 0
no_overflow = 0
pmc[1] count = 0x0
pmc[2] count = 0x0
pmc[3] count = 0x0
pmc[4] count = 0x4c1b707
pmc[5] count = 0x0
pmc[6] count = 0x0
HW state:
MMCR0 0x8080 FC PMAO
MMCR2 0x
EBBHR 0x10003dcc
BESCR 0x8001 GE PMAE
PMC1 0x
PMC2 0x
PMC3 0x
PMC4 0x8000
PMC5 0x88d4f0c8
PMC6 0x1e49da22
SIAR 0x3fffad60a608
!! child died by signal 6
failure: lost_exception
#

Thanks
-Sachin
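
Signal 6 in the output above is SIGABRT on Linux. The reply in this thread
explains how a timeout turns into that signal: the harness sends SIGTERM, the
child catches it, dumps its state and calls abort(). Below is a rough Python
sketch of that sequence, assuming a POSIX system. It is not the kselftest
harness (which is written in C); child_main and run_harness are hypothetical
names, and the short timeout stands in for the real one of about 2 minutes.

```python
import os
import signal
import time


def child_main(ready_fd):
    """Stand-in for a test that hangs but traps SIGTERM."""
    def on_sigterm(signum, frame):
        # Placeholder for the state dump the real test prints here.
        os.write(2, b"!! dumping state, then aborting\n")
        os.abort()  # raises SIGABRT with the default action

    signal.signal(signal.SIGTERM, on_sigterm)
    os.write(ready_fd, b"r")   # tell the parent the handler is installed
    while True:                # pretend to be stuck
        time.sleep(0.05)


def run_harness(timeout_s=0.2):
    """Fork a child, 'time out', send SIGTERM, return the fatal signal."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        try:
            child_main(write_fd)
        finally:
            os._exit(1)        # never fall back into the parent's code
    os.close(write_fd)
    os.read(read_fd, 1)        # wait until the child is ready
    os.close(read_fd)
    time.sleep(timeout_s)      # the "test took too long" window
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None


if __name__ == "__main__":
    print("child died by signal", run_harness())
```

The parent sends SIGTERM but sees the child die by SIGABRT, which is why the
report shows signal 6 rather than 15.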