[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-10-07 Thread Mauricio Faria de Oliveira
Disco is now fix released with linux 5.0.0-31.33 (bionic: linux-hwe 5.0.0-31.33~18.04.1).

** Changed in: linux (Ubuntu Disco)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1841132

Title:
  mpt3sas - storage controller resets under heavy disk io

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Disco:
  Fix Released
Status in linux source package in Eoan:
  Fix Released

Bug description:
  [summary]
  when a server running ubuntu 18.04 with an lsi sas controller experiences high disk io, there is a chance the storage controller will reset
  this can take weeks or months to happen, but once the controller resets it will keep resetting every few seconds or minutes, dramatically degrading disk io
  the server must be rebooted to restore the controller to a normal state

  [hardware configuration]
  server: dell poweredge r7415, purchased 2019-02
  cpu/chipset: amd epyc naples
  storage controller: "dell hba330 mini" with chipset "lsi sas3008"
  drives: 4x samsung 860 pro 2TB ssd

  [software configuration]
  ubuntu 18.04 server
  mdadm raid6
  all firmware is fully updated (bios 1.9.3) (hba330 16.17.00.03) (ssd rvm01b6q)

  [what happened]
  server was operating as a vm host for months without issue
  one day the syslog was flooded with messages like "mpt3sas_cm0: sending diag reset !!" and "Power-on or device reset occurred", along with unusably slow disk io
  the server was removed from production and I looked for a way to reproduce the issue
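
For reference, a quick way to scan a system's kernel logs for the two reset messages quoted above (a minimal sketch; only the quoted log strings come from this report, the rest is generic):

#!/bin/bash
# scan kernel logs for the mpt3sas diag-reset and device-reset messages
PATTERN='mpt3sas_cm[0-9]+: sending diag reset|Power-on or device reset occurred'
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k | grep -E "$PATTERN"
else
    grep -E "$PATTERN" /var/log/syslog
fi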

  [how to reproduce the issue]
  there are probably many ways to trigger this issue; the hackish way I found to reliably reproduce it was (a scripted version is sketched after these steps):
  have the four ssds in an mdadm raid6 with an ext4 filesystem
  create three 500GB files containing random data
  open three terminals: one calculates the md5sum of file1 in a loop, another does the same for file2, and the third copies file3 to file3-temp in a loop
  the number of files is arbitrary; the goal is just to generate a lot of disk io on files too large to be cached in memory
  then initiate an array check with "/usr/share/mdadm/checkarray -a" to cause even more drive thrashing
  within 1-15min the controller will enter the broken state; the longest I ever saw it take was 30min, and I reproduced this several times
  rebooting the server restores the controller to a normal state
  if the server is not rebooted and the controller is left in this broken state, eventually drives will fall out of the array, and sometimes array/filesystem corruption will occur
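
A minimal scripted form of the reproducer above (a sketch, not from the original report; the /mnt/array mount point and file names are assumptions to adjust for the array under test):

#!/bin/bash
# reproducer sketch: sustained reads plus a copy loop on files larger
# than RAM, with an md array check running at the same time
DIR=/mnt/array   # assumed ext4 mount point of the raid6 array

# create three 500GB files of random data (500 * 1024 MiB each), once
for i in 1 2 3; do
    [ -f "$DIR/file$i" ] || dd if=/dev/urandom of="$DIR/file$i" bs=1M count=512000
done

# the "three terminals": two checksum loops and one copy loop
while true; do md5sum "$DIR/file1"; done &
while true; do md5sum "$DIR/file2"; done &
while true; do cp "$DIR/file3" "$DIR/file3-temp"; done &

# cause even more drive thrashing with an array check on all arrays
/usr/share/mdadm/checkarray -a

wait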

  [why this is being reported here]
  It's unlikely I am exceeding the limits of the hardware, since this server chassis can hold 24 drives and I am only using 4. The controller specs indicate I should not hit pcie bandwidth limits until at least 16 drives.
  My first thought was that the lsi controller firmware was at fault, since it has been historically buggy; however, I reproduced this with the newest firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may be dell-specific).
  I then tried the most recent motherboard bios "1.9.3", and downgraded to "1.9.2"; no change.
  I then wanted to eliminate the possibility of a bad drive, so I swapped out all 4 drives for different ones of the same model; no change.
  I then upgraded from the standard 18.04 kernel to the newer backported hwe kernel, which also came with a newer mpt3sas driver; no change.
  I then ran the same test on the same array but with rhel 8; to my surprise, I could no longer reproduce the issue.
  -
  tl;dr version:
  ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00): storage controller breaks in 1-10min
  ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00): storage controller breaks in 1-15min, max 30min
  rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00): same stress test on same array for 19h, no errors
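
For reference, the kernel and mpt3sas driver versions quoted in this summary can be read off a running system with standard commands (generic, not specific to this report):

uname -r                                # running kernel version
modinfo mpt3sas | grep -i '^version'    # driver version from the module file
cat /sys/module/mpt3sas/version         # same, once the module is loaded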

  [caveats]
  Server os misconfiguration is possible; however, this is a rather basic vm host running kvm with no 3rd-party packages.
  I can't conclusively prove this isn't a hardware fault, since I don't have a second unused identical server to test on right now; however, the fact that the problem can be easily reproduced under ubuntu but not under rhel seems noteworthy.
  There is another bug (LP: #1810781) similar to this one; I didn't post there because it's already marked as fixed.
  There is also a debian bug (Debian #926202) that encountered this on kernel 4.19.0, but I'm unable to tell if it's the same issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1841132/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-09-19 Thread Mauricio Faria de Oliveira
Disco currently has the mpt3sas fix in disco-proposed (version 5.0.0-30.32), also available as the linux-hwe kernel in bionic-proposed.

** Changed in: linux (Ubuntu Disco)
   Status: In Progress => Fix Committed

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-09-04 Thread Mauricio Faria de Oliveira
Hi Drew,

Yes, the HWE kernel syncs automatically from the normal kernel as it moves forward.

Once Disco gets the patch, the HWE from Disco in Bionic should get it as well (same version number plus ~18.04.1 suffix).

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-09-03 Thread Drew Woodard
Are there any plans to include this fix in the hwe kernel?
I tested today on the current 18.04 hwe kernel 5.0.0-27 and the bug appeared in 18min.

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-09-03 Thread Mauricio Faria de Oliveira
Marking Bionic as Fix Released, as the kernel from bionic-proposed has been promoted to bionic-updates.

** Changed in: linux (Ubuntu Bionic)
   Status: Fix Committed => Fix Released

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-28 Thread Mauricio Faria de Oliveira
** Changed in: linux (Ubuntu Eoan)
   Status: Incomplete => Fix Released

** Changed in: linux (Ubuntu Disco)
   Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
   Status: New => Fix Committed

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-28 Thread Mauricio Faria de Oliveira
Drew,

Thanks for testing bionic-proposed!
So it will be resolved for bionic kernels shortly, when it hits bionic-updates.

Disco/19.04 will get this patch via stable updates in the near future [1].

Eoan has it applied (LP: #1839588).

So this is all good.

Thanks again,
Mauricio

[1] https://lists.ubuntu.com/archives/kernel-team/2019-August/103416.html

** Also affects: linux (Ubuntu Eoan)
   Importance: Undecided
   Status: Incomplete

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Disco)
   Importance: Undecided
   Status: New

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-28 Thread Drew Woodard
Stress-tested the bionic-proposed kernel 4.15.0-60-generic #67-Ubuntu for 6.5h with no errors, so it appears to be patched in that version as expected.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 proposed (kernel 4.15.0) (mpt3sas driver 17.100.00.00) working
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc3) (mpt3sas driver 29.100.00.00) working
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-27 Thread Mauricio Faria de Oliveira
Hi Drew,

That's very good news!  So it looks like that patch resolves the problem.

Could you please test the kernel in bionic-proposed [1] (4.15.0-60-generic), which has that patch, to confirm it also works correctly?

Thanks!
Mauricio

[1] https://wiki.ubuntu.com/Testing/EnableProposed
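
For anyone following along, enabling the proposed pocket on bionic and installing the test kernel looks roughly like this (a sketch based on the wiki page in [1]; the exact package names depend on the kernel version under test):

# add the bionic-proposed pocket (see [1] for the full procedure)
echo 'deb http://archive.ubuntu.com/ubuntu/ bionic-proposed restricted main multiverse universe' \
  | sudo tee /etc/apt/sources.list.d/ubuntu-bionic-proposed.list
sudo apt-get update

# install only the proposed kernel rather than everything from -proposed
sudo apt-get install -t bionic-proposed linux-image-4.15.0-60-generic linux-modules-extra-4.15.0-60-generic
sudo reboot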

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-27 Thread Drew Woodard
rc3 has been stress-testing for 7h without error, so I believe Mauricio Faria de Oliveira correctly identified the patch that corrects this issue.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc3) (mpt3sas driver 29.100.00.00) working
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-26 Thread Drew Woodard
rc2 does contain the bug; annoyingly, it took 54min to trigger, which is longer than any previous version.
rc3 is stress-testing at the moment.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-26 Thread Mauricio Faria de Oliveira
Mentioned upstream candidate fix is:

commit df9a606184bfdb5ae3ca9d226184e9489f5c24f7
Author: Suganath Prabu 
Date:   Tue Jul 30 03:43:57 2019 -0400

scsi: mpt3sas: Use 63-bit DMA addressing on SAS35 HBA

Although SAS3 & SAS3.5 IT HBA controllers support 64-bit DMA addressing, as
per hardware design, if DMA-able range contains all 64-bits
set (0x-) then it results in a firmware fault.

E.g. SGE's start address is 0x-000 and data length is 0x1000
bytes. when HBA tries to DMA the data at 0x- location then
HBA will fault the firmware.

Driver will set 63-bit DMA mask to ensure the above address will not be
used.

Cc:  # 5.1.20+
Signed-off-by: Suganath Prabu 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Martin K. Petersen 

git/linux $ git describe --contains df9a606184bfdb5ae3ca9d226184e9489f5c24f7
v5.3-rc3~21^2~1

git/ubuntu-bionic $ git log --oneline Ubuntu-4.15.0-60.67 -- drivers/scsi/mpt3sas/
395f1e3037b8 scsi: mpt3sas: Use 63-bit DMA addressing on SAS35 HBA
...

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-26 Thread Mauricio Faria de Oliveira
Hi Drew,

There's an mpt3sas fix in v5.3-rc3 for a problem that may cause an adapter
firmware fault (although I'm not sure of the exact fault state code, it
should cause a reset anyway).

If you could please test either
1) v5.3-rc2 [1], to confirm the issue happens with v5.3-rc2 but not with
v5.3-rc3;
or
2) 4.15.0-60.67 (in bionic-proposed), which has the fix, to check that the
issue no longer happens (a sketch of one way to install it follows below),
that would be great.
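
If it helps, a rough sketch of one way to pull that kernel from
bionic-proposed follows; the package names are my assumption based on the
version above:

$ echo "deb http://archive.ubuntu.com/ubuntu bionic-proposed main universe" | sudo tee /etc/apt/sources.list.d/bionic-proposed.list
$ sudo apt update
$ sudo apt install linux-image-4.15.0-60-generic linux-modules-4.15.0-60-generic
# plus linux-modules-extra-4.15.0-60-generic if mpt3sas is not in the base modules package
$ sudo reboot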

If that doesn't help, please continue with the great regression tip
provided by Kai-Heng Feng.

Thanks!
Mauricio

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.3-rc2/

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-25 Thread Kai-Heng Feng
Would it be possible for you to do a reverse kernel bisection?

First, find the first -rc kernel that works and the last -rc kernel that
doesn't work from http://kernel.ubuntu.com/~kernel-ppa/mainline/

Then,
$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect new $(the working version you found)
$ git bisect old $(the non-working version you found)
$ make localmodconfig
$ make -j`nproc` deb-pkg
Install the newly built kernel, then reboot with it.
If it doesn’t work,
$ git bisect old
Otherwise,
$ git bisect new
Repeat from "make -j`nproc` deb-pkg" until you find the commit that fixes the issue; an illustrative first pair of marks follows below.
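
For example, with the results reported in this bug so far, the first pair of
marks might be (versions illustrative):
$ git bisect new v5.3-rc5   # mainline build that survived the stress test
$ git bisect old v5.0   # hwe kernel version that still hit the resets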

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-25 Thread Drew Woodard
I let the stress test run on the mainline kernel for 22h, no errors.


So, in summary:
ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-24 Thread Drew Woodard
Hi Kai-Heng, thanks for your response.

I followed your advice and installed these packages from the ppa:
linux-headers-5.3.0-050300rc5_5.3.0-050300rc5.201908182231_all.deb
linux-headers-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-image-unsigned-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-modules-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb

I then rebooted the system and checked versions with "uname -a" and "modinfo 
mpt3sas":
(kernel 5.3.0) (mpt3sas driver 29.100.00.00)
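
For reference, the version checks were roughly (grep just trims the output):
$ uname -r
$ modinfo mpt3sas | grep '^version'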

The stress test has now been running for 4h with no errors, which is 8x as long 
as the previous best on 18.04.
I will leave the stress test running overnight in the event that the bug still 
exists but occurs less frequently.

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-24 Thread Kai-Heng Feng
Please test latest mainline kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.3-rc5/

[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io

2019-08-22 Thread Drew Woodard
attaching copy of syslog from moment of error

** Attachment added: "syslog.txt"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1841132/+attachment/5284119/+files/syslog.txt
