[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Disco is now fix released with linux 5.0.0-31.33 (bionic: linux-hwe 5.0.0-31.33~18.04.1).

** Changed in: linux (Ubuntu Disco)
   Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1841132

Title:
  mpt3sas - storage controller resets under heavy disk io

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Released
Status in linux source package in Disco: Fix Released
Status in linux source package in Eoan: Fix Released

Bug description:

[summary]
When a server running Ubuntu 18.04 with an LSI SAS controller experiences high disk IO, there is a chance the storage controller will reset. This can take weeks or months to first occur, but once the controller resets it keeps resetting every few seconds or minutes, dramatically degrading disk IO. The server must be rebooted to restore the controller to a normal state.

[hardware configuration]
server: Dell PowerEdge R7415, purchased 2019-02
cpu/chipset: AMD EPYC Naples
storage controller: "Dell HBA330 Mini" with chipset "LSI SAS3008"
drives: 4x Samsung 860 PRO 2TB SSD

[software configuration]
Ubuntu 18.04 server, mdadm raid6. All firmware is fully updated (BIOS 1.9.3, HBA330 16.17.00.03, SSD RVM01B6Q).

[what happened]
The server operated as a VM host for months without issue. One day the syslog was flooded with messages like "mpt3sas_cm0: sending diag reset !!" and "Power-on or device reset occurred", along with unusably slow disk IO. The server was removed from production and I looked for a way to reproduce the issue.

[how to reproduce the issue]
There are probably many ways to provoke this issue; the hackish way I found to reliably reproduce it was:
- put the four SSDs in an mdadm raid6 with an ext4 filesystem
- create three 500GB files containing random data
- open three terminals: one calculates the md5sum of file1 in a loop, another does the same for file2, and the third copies file3 to file3-temp in a loop (the number of files is arbitrary; the goal is just to generate a lot of disk IO on files too large to be cached in memory)
- then initiate an array check with "/usr/share/mdadm/checkarray -a" to cause even more drive thrashing

Within 1-15min the controller will enter the broken state; the longest I ever saw it take was 30min. I reproduced this several times. Rebooting the server restores the controller to a normal state. If the server is not rebooted and the controller is left in this broken state, eventually drives will fall out of the array, and sometimes array/filesystem corruption will occur.

[why this is being reported here]
It's unlikely I am exceeding the limits of the hardware, since this server chassis can hold 24 drives and I am only using 4; the controller specs indicate I should not hit PCIe bandwidth limits until at least 16 drives. My first thought was that the LSI controller firmware was at fault, since it has been historically buggy; however, I reproduced this with the newest firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may be Dell-specific). I then tried the most recent motherboard BIOS "1.9.3" and downgraded to "1.9.2": no change. To eliminate the possibility of a bad drive, I swapped out all 4 drives with different ones of the same model: no change. I then upgraded from the standard 18.04 kernel to the newer backported HWE kernel, which also came with a newer mpt3sas driver: no change. Finally, I ran the same test on the same array under RHEL 8 and, to my surprise, could no longer reproduce the issue.

tl;dr version:
ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00): storage controller breaks in 1-10min
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00): storage controller breaks in 1-15min, max 30min
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00): same stress test on same array for 19h, no errors

[caveats]
Server OS misconfiguration is possible, but this is a rather basic VM host running KVM with no 3rd-party packages. I can't conclusively prove this isn't a hardware fault, since I don't have a second unused identical server to test on right now; however, the fact that the problem can be easily reproduced under Ubuntu but not under RHEL seems noteworthy. There is another bug (LP: #1810781) similar to this; I didn't post there because it's already marked as fixed. There is also a Debian bug (Debian #926202) that encountered this on kernel 4.19.0, but I'm unable to tell whether it's the same issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1841132/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to: kernel-packages@lists.launchpad.net
Unsubscribe: https://launchpad.net/~kernel-packages
More help :
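The reproduction steps above can be sketched as a single shell script. This is a hedged sketch, not the reporter's exact commands: the mount point `/mnt/array` and the `DRYRUN` guard are assumptions added here, and `dd` from `/dev/urandom` is one plausible way to create the random files.

```shell
#!/bin/sh
# Sketch of the reproduction workload from the bug description above.
# MNT is a hypothetical mount point for the raid6 array; adjust as needed.
# DRYRUN=1 (the default) only prints each command instead of running it.
MNT=${MNT:-/mnt/array}
DRYRUN=${DRYRUN:-1}

run() {
    if [ "$DRYRUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# One-time setup: three ~500GB files of random data, too large to be cached in RAM
for f in file1 file2 file3; do
    run dd if=/dev/urandom of="$MNT/$f" bs=1M count=512000
done

# The report ran each of these in its own terminal, in an endless loop:
run md5sum "$MNT/file1"                 # terminal 1: checksum loop
run md5sum "$MNT/file2"                 # terminal 2: checksum loop
run cp "$MNT/file3" "$MNT/file3-temp"   # terminal 3: copy loop

# Finally, kick off an array check to add even more drive thrashing
run /usr/share/mdadm/checkarray -a
```

With `DRYRUN=0` and the three commands wrapped in `while true; do …; done` loops, this matches the described workload; per the report, the controller typically entered the broken state within 1-15 minutes.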
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Disco currently has the mpt3sas fix in disco-proposed (version 5.0.0-30.32), also available as the linux-hwe kernel in bionic-proposed.

** Changed in: linux (Ubuntu Disco)
   Status: In Progress => Fix Committed
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Hi Drew,

Yes, the HWE kernel syncs automatically from the normal kernel as it moves forward. Once Disco gets the patch, the HWE from Disco in Bionic should get it as well (same version number plus ~18.04.1 suffix).
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Are there any plans to include this fix in the hwe kernel? I tested today on the current 18.04 hwe kernel 5.0.0-27 and the bug appeared in 18min.
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Marking Bionic as Fix Released as the kernel from bionic-proposed has been promoted to bionic-updates.

** Changed in: linux (Ubuntu Bionic)
   Status: Fix Committed => Fix Released
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
** Changed in: linux (Ubuntu Eoan)
   Status: Incomplete => Fix Released
** Changed in: linux (Ubuntu Disco)
   Status: New => In Progress
** Changed in: linux (Ubuntu Bionic)
   Status: New => Fix Committed
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Drew,

Thanks for testing bionic-proposed! It will be resolved for bionic kernels shortly, when the kernel hits bionic-updates. Disco/19.04 will get this patch via stable updates in the near future [1]. Eoan has it applied (LP: #1839588). So this is all good.

Thanks again,
Mauricio

[1] https://lists.ubuntu.com/archives/kernel-team/2019-August/103416.html

** Also affects: linux (Ubuntu Eoan)
   Importance: Undecided
   Status: Incomplete
** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New
** Also affects: linux (Ubuntu Disco)
   Importance: Undecided
   Status: New
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Stress tested bionic-proposed kernel 4.15.0-60-generic #67-Ubuntu for 6.5h with no errors, so it appears to be patched in that version as expected.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 proposed (kernel 4.15.0) (mpt3sas driver 17.100.00.00) working
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc3) (mpt3sas driver 29.100.00.00) working
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working

Bug description:

[summary]
when a server running ubuntu 18.04 with an lsi sas controller experiences high disk io, there is a chance the storage controller will reset. this can take weeks or months, but once the controller resets it will keep resetting every few seconds or minutes, dramatically degrading disk io. the server must be rebooted to restore the controller to a normal state.

[hardware configuration]
server: dell poweredge r7415, purchased 2019-02
cpu/chipset: amd epyc naples
storage controller: "dell hba330 mini" with chipset "lsi sas3008"
drives: 4x samsung 860 pro 2TB ssd

[software configuration]
ubuntu 18.04 server, mdadm raid6
all firmware is fully updated (bios 1.9.3) (hba330 16.17.00.03) (ssd rvm01b6q)

[what happened]
server was operating as a vm host for months without issue. one day the syslog was flooded with messages like "mpt3sas_cm0: sending diag reset !!" and "Power-on or device reset occurred", along with unusably slow disk io. the server was removed from production and I looked for a way to reproduce the issue.

[how to reproduce the issue]
there are probably many ways to produce this issue; the hackish way I found to reliably reproduce it was:
- have the four ssds in a mdadm raid6 with an ext4 filesystem
- create three 500GB files containing random data
- open three terminals: one calculates the md5sum of file1 in a loop, another does the same for file2, the third copies file3 to file3-temp in a loop (the number of files is arbitrary, the goal is just to generate a lot of disk io on files too large to be cached in memory)
- then initiate an array check with "/usr/share/mdadm/checkarray -a" to cause even more drive thrashing
within 1-15min the controller will enter the broken state; the longest I ever saw it take was 30min. I reproduced this several times. rebooting the server restores the controller to a normal state. if the server is not rebooted and the controller is left in this broken state, eventually drives will fall out of the array, and sometimes array/filesystem corruption will occur.

[why this is being reported here]
It's unlikely I am exceeding limits of the hardware, since this server chassis can hold 24 drives and I am only using 4; the controller specs indicate I should not hit pcie bandwidth limits until at least 16 drives. My first thought was that the lsi controller firmware was at fault since it has been historically buggy, however I reproduced this with the newest firmware "16.17.00.03" and the previous version "15.17.09.06" (versions may be dell-specific). I then tried the most recent motherboard bios "1.9.3", and downgraded to "1.9.2": no change. I then wanted to eliminate the possibility of a bad drive, so I swapped out all 4 drives with different ones of the same model: no change. I then upgraded from the standard 18.04 kernel to the newer backported hwe kernel, which also came with a newer mpt3sas driver: no change. I then ran the same test on the same array but with rhel 8, and to my surprise I could no longer reproduce the issue.

tl;dr version:
ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) storage controller breaks in 1-10min
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) storage controller breaks in 1-15min, max 30min
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) same stress test on same array for 19h, no errors

[caveats]
Server os misconfiguration is possible, however this is a rather basic vm host running kvm with no 3rd-party packages. I can't conclusively prove this isn't a hardware fault since I don't have a second unused identical server to test on right now, however the fact that the problem can be easily reproduced under ubuntu but not under rhel seems noteworthy. There is another bug (LP: #1810781) similar to this; I didn't post there because it's already marked as fixed. There is also a debian bug (Debian #926202) that encountered this on kernel 4.19.0, but I'm unable to tell if it's the same issue.
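The reproduction workload from the bug description can be sketched as a script. The paths, file sizes, and iteration counts below are illustrative stand-ins (the report used three 500GB files of random data on the raid6 array and ran the loops unbounded until the controller faulted); the checkarray step is left commented out since it acts on a live md array.

```shell
#!/bin/sh
# Sketch of the reproduction workload; DIR, SIZE_MB, and ITERATIONS are
# illustrative -- the original report used three 500GB files on the
# raid6 array and looped until the controller faulted.
DIR=${DIR:-.}
SIZE_MB=${SIZE_MB:-1}
ITERATIONS=${ITERATIONS:-2}

for f in file1 file2 file3; do
    dd if=/dev/urandom of="$DIR/$f" bs=1M count="$SIZE_MB" 2>/dev/null
done

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    md5sum "$DIR/file1" > /dev/null &   # terminal 1 in the report
    md5sum "$DIR/file2" > /dev/null &   # terminal 2
    cp "$DIR/file3" "$DIR/file3-temp"   # terminal 3
    wait
    i=$((i + 1))
done

# On the real array, also start a check for extra drive thrashing:
#   /usr/share/mdadm/checkarray -a
echo "workload finished"
```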
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Hi Drew, That's very good news! So it looks like that patch resolves the problem. Could you please test the kernel in bionic-proposed [1] (4.15.0-60-generic) which has that patch to confirm it's also working correctly? Thanks! Mauricio

[1] https://wiki.ubuntu.com/Testing/EnableProposed
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
rc3 has been stress testing for 7h without error, so I believe Mauricio Faria de Oliveira correctly identified the patch that corrects this issue.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc3) (mpt3sas driver 29.100.00.00) working
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
rc2 does contain the bug; annoyingly it took 54min to trigger, which is longer than any previous version. rc3 is stress testing at the moment.

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00) BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc2) (mpt3sas driver 29.100.00.00) BUG
ubuntu 18.04 mainline (kernel 5.3.0-rc5) (mpt3sas driver 29.100.00.00) working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00) working
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Mentioned upstream candidate fix is:

commit df9a606184bfdb5ae3ca9d226184e9489f5c24f7
Author: Suganath Prabu
Date:   Tue Jul 30 03:43:57 2019 -0400

    scsi: mpt3sas: Use 63-bit DMA addressing on SAS35 HBA

    Although SAS3 & SAS3.5 IT HBA controllers support 64-bit DMA
    addressing, as per hardware design, if DMA-able range contains
    all 64-bits set (0xFFFFFFFF-FFFFFFFF) then it results in a
    firmware fault. E.g. SGE's start address is 0xFFFFFFFF-FFFFF000
    and data length is 0x1000 bytes. when HBA tries to DMA the data
    at 0xFFFFFFFF-FFFFFFFF location then HBA will fault the firmware.

    Driver will set 63-bit DMA mask to ensure the above address will
    not be used.

    Cc: # 5.1.20+
    Signed-off-by: Suganath Prabu
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin K. Petersen

git/linux $ git describe --contains df9a606184bfdb5ae3ca9d226184e9489f5c24f7
v5.3-rc3~21^2~1

git/ubuntu-bionic $ git log --oneline Ubuntu-4.15.0-60.67 -- drivers/scsi/mpt3sas/
395f1e3037b8 scsi: mpt3sas: Use 63-bit DMA addressing on SAS35 HBA
...
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Hi Drew, There's an mpt3sas fix in v5.3-rc3 for a problem that may cause an adapter firmware fault (not sure of the exact fault state code, but it should cause a reset anyway). If you could please test either 1) v5.3-rc2 [1], to confirm the issue happens with v5.3-rc2 but not with v5.3-rc3; or 2) 4.15.0-60.67 (in bionic-proposed), which has the fix (so checking that the issue doesn't happen), that would be great. If that doesn't help, please continue with the great regression tip provided by Kai-Heng Feng. Thanks! Mauricio

[1] https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.3-rc2/
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Would it be possible for you to do a reverse kernel bisection? First, find the first -rc kernel that works and the last -rc kernel that doesn't work from http://kernel.ubuntu.com/~kernel-ppa/mainline/ Then,

$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect new $(the working version you found)
$ git bisect old $(the non-working version you found)
$ make localmodconfig
$ make -j`nproc` deb-pkg

Install the newly built kernel, then reboot with it. If it doesn't work,
$ git bisect old
Otherwise,
$ git bisect new
Repeat up to "make -j`nproc` deb-pkg" until you find the commit that fixes the issue.
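The old/new terminology above is what makes this a *reverse* bisection: "old" marks still-broken revisions and "new" marks fixed ones, so git bisect converges on the commit that fixed the bug rather than the one that introduced it. As a toy sketch of the mechanics on a synthetic repository (the ten commits and the "fixed" marker file stand in for kernel builds and the stress test; this is not the kernel workflow itself):

```shell
#!/bin/sh
# Toy reverse bisection: find the commit that FIXED a bug.
# The synthetic repo stands in for the kernel tree; "test -e fixed"
# stands in for "build, boot, and run the stress test".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect

# ten commits; the "fix" lands in commit 7
for n in 1 2 3 4 5 6 7 8 9 10; do
    echo "$n" > version
    if [ "$n" -ge 7 ]; then touch fixed; fi
    git add -A
    git commit -qm "commit $n"
    git tag "c$n"
done

git bisect start
git bisect new c10   # newest revision: stress test passes
git bisect old c1    # oldest revision: stress test fails

# for old/new terms, git bisect run treats exit 0 as "old" (still
# broken here) and nonzero as "new" (fixed here)
git bisect run sh -c 'test ! -e fixed'

# the first "new" commit is the fixing commit
git log -1 --format=%s refs/bisect/new
```

In the real procedure each `git bisect run` test step is replaced by the manual build/reboot/stress-test cycle from the instructions above.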
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
I let the stress test run on the mainline kernel for 22h, no errors. So in summary:

ubuntu 18.04 (kernel 4.15.0) (mpt3sas driver 17.100.00.00): BUG
ubuntu 18.04 hwe (kernel 5.0.0) (mpt3sas driver 27.101.00.00): BUG
ubuntu 18.04 mainline (kernel 5.3.0) (mpt3sas driver 29.100.00.00): working
rhel 8 (kernel 4.18.0) (mpt3sas driver 27.101.00.00): working
--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Hi Kai-Heng, thanks for your response. I followed your advice and installed these packages from the ppa:

linux-headers-5.3.0-050300rc5_5.3.0-050300rc5.201908182231_all.deb
linux-headers-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-image-unsigned-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb
linux-modules-5.3.0-050300rc5-generic_5.3.0-050300rc5.201908182231_amd64.deb

I then rebooted the system and checked versions with "uname -a" and "modinfo mpt3sas": (kernel 5.3.0) (mpt3sas driver 29.100.00.00)

The stress test has now been running for 4h with no errors, which is 8x as long as the previous best on 18.04. I will leave the stress test running overnight in the event that the bug still exists but occurs less frequently.
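For reference, the two version checks mentioned above can be narrowed to just the fields of interest; `-F`/`--field` is kmod modinfo's flag for printing a single field (the example versions in the comments are from this report, not guaranteed elsewhere):

```shell
#!/bin/sh
# Report the running kernel and the mpt3sas driver version, as used
# above to confirm the mainline kernel was actually booted.
uname -r                      # e.g. 5.3.0-050300rc5-generic
# modinfo defaults to the running kernel's module directory; guard
# for machines without the mpt3sas module installed
modinfo -F version mpt3sas 2>/dev/null || echo "mpt3sas module not found"
# e.g. 29.100.00.00
```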
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
Please test the latest mainline kernel:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.3-rc5/
[Kernel-packages] [Bug 1841132] Re: mpt3sas - storage controller resets under heavy disk io
attaching copy of syslog from moment of error

** Attachment added: "syslog.txt"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1841132/+attachment/5284119/+files/syslog.txt