I attempted to bisect this, using the following process:
  - Run the kernel-build-reboot-loop test on 3 machines in parallel.
    I used 2 CRB1S systems (anuchin, bestovius) and 1 R120-T33 (seidel).
  - If any machine crashes with the parity error message, consider it
    "failed".
  - If all machines survive overnight, consider it "OK".
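
The pass/fail rule above can be sketched roughly as follows. This is only
an illustration, not the script actually used: the function names are
hypothetical, and it assumes each machine's console log is available as a
string and that the error can be recognised by the text from the bug title.

```python
from typing import Mapping

# Assumption: the crash is recognised by this substring, taken from the bug title.
PARITY_MARKER = "synchronous parity or ECC error"


def machine_failed(console_log: str) -> bool:
    """True if this machine's console log contains the parity error."""
    return PARITY_MARKER in console_log


def bisect_verdict(logs: Mapping[str, str]) -> str:
    """Apply the rule: any crash means "failed"; all survive means "OK".

    `logs` maps a machine name (e.g. "anuchin") to its overnight console log.
    """
    if any(machine_failed(log) for log in logs.values()):
        return "failed"
    return "OK"
```

Note that a single flaky machine is enough to mark a kernel as "failed"
under this rule, which matters for the note about anuchin below.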

Unfortunately, the commit it landed on looks bogus:

# first bad commit: [852643165aea0999bb862b36511c5b9f6b11449f] fs//binfmt_elf.c: move variables initialization closer to their usage
(Reverse bisect - this would in theory be the commit that *fixed* it)

Just in case, I tried reverting that commit from 5.5-rc6. As noted in
comment #2, 5.5-rc6 seems immune to this problem. Reverting the commit
didn't change that - 5.5-rc6 still survived overnight.

Note: Of the 3 systems, anuchin was usually the one that failed during
the bisect. It could be that this is a generic hw issue, and anuchin is
just more severely impacted than the others. It could also be that this
symptom can be caused by both a sw and a hw issue, and anuchin is
impacted by the hw part, making it a bad choice for a bisect. Either
way, bisection seems like a poor strategy for identifying the issue.
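
To illustrate why an intermittent failure makes bisection fragile, here is
a back-of-the-envelope sketch. The crash probabilities are purely
hypothetical inputs, not measurements, and the model assumes machines
crash independently of each other, which the anuchin observation above
already calls into question.

```python
def p_false_ok(p_crash_per_machine: float, machines: int) -> float:
    """Chance that an affected kernel survives the night on every machine.

    Assumes each machine independently crashes with the given probability.
    """
    return (1.0 - p_crash_per_machine) ** machines


def p_bisect_misled(p_false: float, affected_steps: int) -> float:
    """Chance that at least one affected kernel is wrongly marked "OK".

    `affected_steps` is the number of bisect steps that test an affected
    kernel; one wrong verdict is enough to send the bisect astray.
    """
    return 1.0 - (1.0 - p_false) ** affected_steps
```

Even a modest per-step chance of a false "OK" compounds over the steps of
a bisect, so a single wrong verdict can plausibly explain landing on an
unrelated commit.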


** Attachment added: "bisect.log"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1860013/+attachment/5323904/+files/bisect.log

Title:
  [thunderx] Synchronous External Abort: synchronous parity or ECC error
