[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From cdead...@us.ibm.com 2017-08-10 16:33 EDT ---

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1469829

Title:
  ppc64el should use 'deadline' as default io scheduler

Status in The Ubuntu-power-systems project: Fix Released
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Trusty: Fix Released
Status in linux-lts-utopic source package in Trusty: Fix Released
Status in linux source package in Utopic: Won't Fix
Status in linux source package in Vivid: Fix Released

Bug description:

[Impact]
Using cfq instead of deadline as the default I/O scheduler starves certain
workloads and causes performance issues. In addition, every other arch we
build uses deadline as the default scheduler.

[Fix]
Change the configuration to the following for ppc64el:

  CONFIG_DEFAULT_DEADLINE=y
  CONFIG_DEFAULT_IOSCHED="deadline"

[Test Case]
Boot and cat /sys/block/*/queue/scheduler to see whether deadline is being
used.

-- Problem Description --

A Firestone system given to the DASD group failed an HTX overnight test with
a miscompare error. HTX mdt.hdbuster was running on the secondary drive and
failed about 12 hours into the test.

HTX miscompare analysis:

  Device under test:  /dev/sdb
  Stanza running:     rule_3
  miscompare offset:  0x40
  Transfer size:      Random Size
  LBA number:         0x70fc
  miscompare length:  all the blocks in the transfer size

  *- STANZA 3: Creates a number of threads twice the queue depth. Each  -*
  *- thread does 2 num_oper with an RC operation with xfer size between -*
  *- 1 block and 256K.                                                  -*

This miscompare shows that the read operation is unable to get the expected
data from the disk. The re-read buffer also shows the same data as the first
read operation. Since the first read and the subsequent re-read show the same
data, there could be a failure in a write operation (from the previous rule
stanza that initializes the disk with pattern 007). The same miscompare
behavior shows for all the blocks in the transfer size.

  /dev/sdb  Jun  2 02:29:43 2015  err=03b6  sev=2  hxestorage   <<=== device name (/dev/sdb)
  rule_3_13  numopers= 2  loop= 767  blk=0x70fc  len=89088
  min_blkno=0  max_blkno=0x74706daf, RANDOM access
  Seed Values = 37303, 290, 23235
  Data Pattern Seed Values = 37303, 291, 23235
  BWRC LBA fencepost Detail:
    th_num  min_lba   max_lba   status
    0       0         1c9be3ff  R
    1       1d1c1b6c  3a3836d7  F
    2       3a3836d8  57545243  F
    3       57545244  74706daf  F
  Miscompare at buffer offset 64 (0x40)   <<=== miscompare offset (0x40)
  (Flags: badsig=0; cksum=0x6)
  Maximum LBA = 0x74706daf
  wbuf (baseaddr 0x3ffe1c0e6600) b0ff
  rbuf (baseaddr 0x3ffe1c0fc400) 850100fc700200fd700300fe700400ff7005
  Write buffer saved in /tmp/htxsdb.wbuf1
  Read buffer saved in /tmp/htxsdb.rbuf1
  Re-read fails compare at offset 64; buffer saved in /tmp/htxsdb.rerd1
  errno: 950 (Unknown error 950)

Asghar reproduced the HTX hang he has been seeing. Looking in the kernel
logs, I see some messages from the kernel that there are user threads
blocked on getting reads serviced, so HTX is likely seeing the same thing.

I've asked Asghar to try using the deadline I/O scheduler rather than CFQ to
see if that makes any difference. If that does not make any difference, the
next thing to try is reducing the queue depth of the device. Right now it's
31, which I think is pretty high.

Step 1:

  echo deadline > /sys/block/sda/queue/scheduler
  echo deadline > /sys/block/sdb/queue/scheduler

If that reproduces the issue, go to step 2:

  echo cfq > /sys/block/sda/queue/scheduler
  echo cfq > /sys/block/sdb/queue/scheduler
  echo 8 > /sys/block/sda/device/queue_depth
  echo 8 > /sys/block/sdb/device/queue_depth

Breno - it looks like the default I/O scheduler plus the default queue depth
for the SATA disks in Firestone is not optimal: when running a heavy I/O
workload, we see read starvation occurring, which makes the system nearly
unusable. Once we changed the I/O scheduler from cfq to deadline, all the
issues went away and the system is able to run the same workload yet still
be responsive. Suggest we either encourage Canonical to change the default
I/O scheduler to deadline or, at the very least, provide documentation to
encourage our customers to make this change themselves.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1469829/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
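The test case above (checking which scheduler each block device is using) can be scripted. A minimal sketch, assuming the pre-blk-mq sysfs format where the active scheduler is the bracketed entry in each scheduler file; the helper name active_sched is ours, not part of HTX or the kernel:

```shell
# Print the active (bracketed) scheduler from a line such as
# "noop [deadline] cfq".
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z_-]*\)\].*/\1/p'
}

# Walk every block device and flag any that is not using deadline.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}; dev=${dev%%/*}
    cur=$(active_sched "$(cat "$f")")
    if [ "$cur" != "deadline" ]; then
        echo "$dev: active scheduler is '$cur', not deadline"
    fi
done
```

On an affected system this would flag sda and sdb before applying the Step 1 commands, and print nothing afterwards.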
[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From mainam...@in.ibm.com 2015-10-27 06:13 EDT ---

*** Bug 128832 has been marked as a duplicate of this bug. ***
[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From jgriv...@us.ibm.com 2015-10-14 11:49 EDT ---

Installed the latest Ubuntu 14.04.3 and it seems like the default is
"deadline" now:

# uname -a
Linux amp 3.19.0-30-generic #34~14.04.1-Ubuntu SMP Fri Oct 2 22:21:52 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

root@amp:/sys/block# for i in `ls`; do echo -n "$i: "; cat $i/queue/scheduler; done | grep -v none
dm-0: noop [deadline] cfq
dm-7: noop [deadline] cfq
dm-8: noop [deadline] cfq
dm-9: noop [deadline] cfq
sda: noop [deadline] cfq
sdb: noop [deadline] cfq
sdc: noop [deadline] cfq
sdd: noop [deadline] cfq
sde: noop [deadline] cfq
sdf: noop [deadline] cfq
sdg: noop [deadline] cfq
sdh: noop [deadline] cfq
sdi: noop [deadline] cfq
sdj: noop [deadline] cfq
sdk: noop [deadline] cfq
sdl: noop [deadline] cfq
sdm: noop [deadline] cfq
sdn: noop [deadline] cfq
sdo: noop [deadline] cfq
sdp: noop [deadline] cfq
sdq: noop [deadline] cfq
sr0: noop [deadline] cfq
sr1: noop [deadline] cfq
sr2: noop [deadline] cfq
sr3: noop [deadline] cfq

We'll restart tests.
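On kernels that still default to cfq, the scheduler change can also be made persistent across reboots rather than re-applied via sysfs each boot. A minimal sketch, assuming a GRUB-based Ubuntu install and the legacy (pre-blk-mq) block layer, which accepts an elevator= boot parameter:

```shell
# Assumption: GRUB-based boot. In /etc/default/grub, append
# elevator=deadline to the default kernel command line, e.g.:
#
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=deadline"
#
# Then regenerate the GRUB configuration and reboot:
sudo update-grub
```

This only changes the boot-time default; the per-device sysfs writes shown earlier still work for switching schedulers on a running system.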
[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From bjki...@us.ibm.com 2015-10-12 18:00 EDT ---

*** Bug 129789 has been marked as a duplicate of this bug. ***
[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From cdead...@us.ibm.com 2015-09-25 06:22 EDT ---

add last comment from CQ side to see if empty comment header sync to LTC.
please ignore
[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From cdead...@us.ibm.com 2015-09-24 13:59 EDT ---
[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From cdead...@us.ibm.com 2015-09-03 20:26 EDT ---

*** Bug 125891 has been marked as a duplicate of this bug. ***
[Kernel-packages] [Bug 1469829] Comment bridged from LTC Bugzilla
--- Comment From bjki...@us.ibm.com 2015-08-25 21:03 EDT ---

*** Bug 129364 has been marked as a duplicate of this bug. ***

--- Comment From bjki...@us.ibm.com 2015-08-25 21:04 EDT ---

Recently had another report of this issue with 14.04.3. Is this fix in the
queue for 14.04.X as well?