[Bug 1667239] Comment bridged from LTC Bugzilla

bugproxy Tue, 04 Jul 2017 06:16:23 -0700

------- Comment From [email protected] 2017-07-04 09:03 EDT-------
This CMVC defect is being cancelled by the CDE Bridge because the corresponding 
CQ Defect [SW354783] was transferred out of the bridge domain.
Here are the additional details:
New Subsystem = ppc_triage
New Release = unspecified
New Component = ubuntu_linux
New OwnerInfo = Chavez, Luciano ([email protected])
To continue tracking this issue, please follow CQ defect [SW354783].


Opened defect SW355478 on the new failure to see if it is the same issue. I
made it sev 1 since the system is in XMON right now, which is preventing
further testing.

Like I mentioned earlier, the fail could be related to this defect.

For this defect...

The "Oops: Kernel access of bad area, sig: 11 [#1]" in the logs happens
during HTX run.

On the reboot (which happened ~30 minutes after the first error), I saw the
partition hang/crash. I had to use ipmitool to power down the system.
The current xmon crash in SW355478 / 142348 is different from the
one being tracked in this bug. Will wait for a recreate of the original issue.

The FlashGT HST team still needs to recreate this issue.

SW357236 "HTX fail during superpipe 128 per LUN testing...during Guardband 
Testing" is now marked as a duplicate of this SW354783.
Per comment from JVP (SW357236 submitter), he is attempting a recreate again 
with the latest Firmware for his Tuleta-L.
We will monitor that attempt at recreate, and reopen this SW354783 if a new 
recreate is achieved.

This original recreate attempt on Firestone, fsbmc30, may be delayed, as
it is currently tied up with debugging a link training issue.

<Automated Update> The severity of defect SW354783 was increased from 2
to 1 because defect SW358210 was rejected as the duplicate of defect
SW354783 and the severity of defect SW358210 was higher than 2

The defect submitter, Dion, is out on vacation until 7/11. So that we can make
progress on this most recent recreate (SW358210, dup'd to this SW354783),
I request that the defect owner, Luciano/ScreenTeam, please reopen this
SW354783 and continue live debug on the held system from SW358210:

#=#=# 2016-07-05 17:12:28 (CDT) #=#=#
Action = [reopen]

I'm not quite sure how to handle this defect (I'll ping Mark Smith).

Dion's defect
SW358210 : FlashGT STC GA3: capiredp01: TMF timed out and Unable to handle 
kernel paging request before system drops into xmon debugger, was running HTX 
for superpipe with 1600 virtual luns across 4 FlashGT NVME cards

was just dup'd to this one.

That system is currently in XMON debugger now and can be debugged to 1) verify 
it is same issue and 2) maybe try to find root cause (his defect can be 
re-opened if not the same issue).
#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#
Not able to look at SW358210.
Looking into the capiredp01 machine.
Machine details:

FSP: capiredfsp.aus.stglabs.ibm.com (dev/FipSdev)
Partition: capiredp01.aus.stglabs.ibm.com
IPMI console: ipmitool -I lanplus -H capiredfsp.aus.stglabs.ibm.com -P abc123 sol activate

The failure on "capiredfsp" seems to be the same as the one reported in this
bug. The hxesurelock process segfaulted and the kernel crashed while
generating the core dump.

cde00 ([email protected]) added native attachment
/tmp/AIXOS05866176/dmesg_backtrace_capiredfsp on 2016-07-07 06:19:39

Hi Dominic,
Can you please have someone from the kernel team look at this?
The HTX (hxesurelock) process segfaulted and the kernel crashed while
generating the core. Kernel logs are attached to the bug. The machine is
sitting in xmon and available for debug.
(In reply to comment #25)
> Hi Dominic,
>            Can you please have some one from kernel team look at this ?
> HTX (hxesurelock) process has segfaulted and kernel has crashed while
> generating core. Attached kernel logs with bug . Machine is sitting in
> xmon and available for debug.

Vipin,

I cannot ssh to capiredfsp.aus.stglabs.ibm.com (dev/FipSdev). Is the
machine still in xmon?

(In reply to comment #26)
> Vipin,
> I cannot ssh to capiredfsp.aus.stglabs.ibm.com (dev/FipSdev). Is the machine
> still in xmon?

Yes, it's still sitting in xmon. You can open a console via IPMI.
Please see comment 22 for machine access details.

Just wanted to point out the send_tmf timeout (at the end of the kernel
log) before the crash, even though I am not sure it is the cause. The
system is in xmon. Please advise if additional debug data needs to be
collected. Thanks.

Snippet at the end of the kernel log:

[ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
[ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
[ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
[ 8816.530233] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
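
To make the timing in the snippet above easier to read, here is a minimal sketch that measures the gap between the last "TMF timed out" message and the first fault line that follows it. The helper names (dmesg_timestamp, tmf_to_fault_gap) are hypothetical and not part of any tool mentioned in this bug; the matched strings are taken from the log lines above, and each log entry is assumed to be on a single line.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse the "[ seconds.micros]" prefix of a dmesg line.
 * Returns 1 and stores the timestamp on success, 0 otherwise. */
int dmesg_timestamp(const char *line, double *ts)
{
    return sscanf(line, "[ %lf]", ts) == 1;
}

/* Hypothetical helper: seconds between the last "TMF timed out" line and the
 * first later line reporting a fault. Returns -1.0 if either is missing. */
double tmf_to_fault_gap(const char *const lines[], size_t n)
{
    double last_tmf = -1.0;
    double ts;

    for (size_t i = 0; i < n; i++) {
        if (!dmesg_timestamp(lines[i], &ts))
            continue;                       /* no timestamp prefix, skip */
        if (strstr(lines[i], "TMF timed out")) {
            last_tmf = ts;                  /* remember the most recent timeout */
        } else if (last_tmf >= 0.0 &&
                   (strstr(lines[i], "unhandled signal") ||
                    strstr(lines[i], "Unable to handle kernel paging request"))) {
            return ts - last_tmf;           /* first fault after a timeout */
        }
    }
    return -1.0;
}
```

Fed the lines above, this gives roughly 10.3 seconds between the second TMF timeout (8806.190383) and the first hxesurelock signal 11 (8816.507485). That only shows ordering and spacing; it does not establish that the timeout caused the crash.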

Snippet of the send_tmf() code:
                cmd_checkin(cmd);
                spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
                cfg->tmf_active = false;
                spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);
                goto out;
        }

        spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
        to = msecs_to_jiffies(5000);
        to = wait_event_interruptible_lock_irq_timeout(cfg->tmf_waitq,
                                                       !cfg->tmf_active,
                                                       cfg->tmf_slock,
                                                       to);
        if (!to) {
                cfg->tmf_active = false;
                dev_err(dev, "%s: TMF timed out!\n", __func__);
                rc = -1;
        }
        spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);

Boqun,

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1667239

Title:
  FlashGT Integration and Setup: fsbmc30: After 17th reboot of soft
  bootme, HTX & Linux errors seen with 256 virtual LUNs

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1667239/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
