Re: [zfs-discuss] Periodic flush
Robert Milkowski writes:

Hello Roch,

Saturday, June 28, 2008, 11:25:17 AM, you wrote:

RB> I suspect, a single dd is cpu bound.

I don't think so.

We're nearly so, as you show. More below.

See below one with a stripe of 48x disks again. Single dd with 1024k block size and 64GB to write.

bash-3.2# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test         333K  21.7T      1      1   147K   147K
test         333K  21.7T      0      0      0      0
test         333K  21.7T      0      0      0      0
test         333K  21.7T      0      0      0      0
test         333K  21.7T      0      0      0      0
test         333K  21.7T      0      0      0      0
test         333K  21.7T      0      0      0      0
test         333K  21.7T      0      0      0      0
test         333K  21.7T      0  1.60K      0   204M
test         333K  21.7T      0  20.5K      0  2.55G
test        4.00G  21.7T      0  9.19K      0  1.13G
test        4.00G  21.7T      0      0      0      0
test        4.00G  21.7T      0  1.78K      0   228M
test        4.00G  21.7T      0  12.5K      0  1.55G
test        7.99G  21.7T      0  16.2K      0  2.01G
test        7.99G  21.7T      0      0      0      0
test        7.99G  21.7T      0  13.4K      0  1.68G
test        12.0G  21.7T      0  4.31K      0   530M
test        12.0G  21.7T      0      0      0      0
test        12.0G  21.7T      0  6.91K      0   882M
test        12.0G  21.7T      0  21.8K      0  2.72G
test        16.0G  21.7T      0    839      0  88.4M
test        16.0G  21.7T      0      0      0      0
test        16.0G  21.7T      0  4.42K      0   565M
test        16.0G  21.7T      0  18.5K      0  2.31G
test        20.0G  21.7T      0  8.87K      0  1.10G
test        20.0G  21.7T      0      0      0      0
test        20.0G  21.7T      0  12.2K      0  1.52G
test        24.0G  21.7T      0  9.28K      0  1.14G
test        24.0G  21.7T      0      0      0      0
test        24.0G  21.7T      0      0      0      0
test        24.0G  21.7T      0      0      0      0
test        24.0G  21.7T      0  14.5K      0  1.81G
test        28.0G  21.7T      0  10.1K  63.6K  1.25G
test        28.0G  21.7T      0      0      0      0
test        28.0G  21.7T      0  10.7K      0  1.34G
test        32.0G  21.7T      0  13.6K  63.2K  1.69G
test        32.0G  21.7T      0      0      0      0
test        32.0G  21.7T      0      0      0      0
test        32.0G  21.7T      0  11.1K      0  1.39G
test        36.0G  21.7T      0  19.9K      0  2.48G
test        36.0G  21.7T      0      0      0      0
test        36.0G  21.7T      0      0      0      0
test        36.0G  21.7T      0  17.7K      0  2.21G
test        40.0G  21.7T      0  5.42K  63.1K   680M
test        40.0G  21.7T      0      0      0      0
test        40.0G  21.7T      0  6.62K      0   844M
test        44.0G  21.7T      1  19.8K   125K  2.46G
test        44.0G  21.7T      0      0      0      0
test        44.0G  21.7T      0      0      0      0
test        44.0G  21.7T      0  18.0K      0  2.24G
test        47.9G  21.7T      1  13.2K   127K  1.63G
test        47.9G  21.7T      0      0      0      0
test        47.9G  21.7T      0      0      0      0
test        47.9G  21.7T      0  15.6K      0  1.94G
test        47.9G  21.7T      1  16.1K   126K  1.99G
test        51.9G  21.7T      0      0      0      0
test        51.9G  21.7T      0      0      0      0
test        51.9G  21.7T      0  14.2K      0  1.77G
test        55.9G  21.7T      0  14.0K  63.2K  1.73G
test        55.9G  21.7T      0      0      0      0
test        55.9G  21.7T      0      0      0      0
test        55.9G  21.7T      0  16.3K      0  2.04G
test        59.9G  21.7T      0  14.5K  63.2K  1.80G
test        59.9G  21.7T      0      0      0      0
test        59.9G  21.7T      0      0      0      0
test        59.9G  21.7T      0  17.7K      0  2.21G
test        63.9G  21.7T      0  4.84K  62.6K   603M
test        63.9G  21.7T      0      0      0      0
test        63.9G  21.7T      0      0      0      0
test        63.9G  21.7T      0      0      0      0
test        63.9G  21.7T      0      0      0      0
test        63.9G  21.7T      0      0      0      0
test        63.9G  21.7T      0      0      0      0
test        63.9G  21.7T      0      0      0      0
^C
bash-3.2#

bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536
65536+0 records in
65536+0 records out

real     1:06.312
user        0.074
sys        54.060
bash-3.2#

Doesn't look like it's CPU bound.

So with sys at 54s of 66s elapsed, we're at 81% of CPU saturation. If you make this 100% you will still have zeros in the zpool iostat. We
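Roch's 81% figure follows directly from the ptime numbers; a quick sanity check (the constants are copied from the run above):

```python
# Back-of-envelope check of the ptime output above: how close is the
# single dd to CPU saturation, and what throughput did it achieve?
real_s = 66.312          # elapsed, from "real 1:06.312"
user_s = 0.074
sys_s = 54.060

cpu_fraction = (user_s + sys_s) / real_s
mb_per_s = 64 * 1024 / real_s      # 64 GB written in real_s seconds

print(f"CPU saturation: {cpu_fraction:.1%}")   # ~81.6%
print(f"throughput: {mb_per_s:.0f} MB/s")      # ~988 MB/s
```

So the single dd sustains close to 1 GB/s while leaving only ~18% of one CPU idle, which is why Roch argues that even at full saturation the periodic zeros in zpool iostat would remain.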
Re: [zfs-discuss] Periodic flush
Hello Roch,

Saturday, June 28, 2008, 11:25:17 AM, you wrote:

RB> I suspect, a single dd is cpu bound.

I don't think so. See below one with a stripe of 48x disks again. Single dd with 1024k block size and 64GB to write.

bash-3.2# zpool iostat 1
[same zpool iostat output as quoted in the previous message]
^C
bash-3.2#

bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536
65536+0 records in
65536+0 records out

real     1:06.312
user        0.074
sys        54.060
bash-3.2#

Doesn't look like it's CPU bound.

Let's try to read the file after zpool export test; zpool import test

bash-3.2# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test        64.0G  21.7T     15     46  1.22M  1.76M
test        64.0G  21.7T      0      0      0      0
test        64.0G  21.7T      0
Re: [zfs-discuss] Periodic flush
Hello Robert,

Tuesday, July 1, 2008, 12:01:03 AM, you wrote:

RM> Nevertheless the main issue is jumpy writing...

I was just wondering how much throughput I can get running multiple dd - one per disk drive - and what kind of aggregated throughput I would get. So for each of the 48 disks I did:

dd if=/dev/zero of=/dev/rdsk/c6t7d0s0 bs=128k

The iostat looks like:

bash-3.2# iostat -xnzC 1 | egrep 'c[0-6]$|devic'
[skipped the first output]
                    extended device statistics
    r/s     w/s    kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0  5308.0     0.0  679418.9   0.1   7.2     0.0     1.4   0  718  c1
    0.0  5264.2     0.0  673813.1   0.1   7.2     0.0     1.4   0  720  c2
    0.0  4047.6     0.0  518095.1   0.1   7.3     0.0     1.8   0  725  c3
    0.0  5340.1     0.0  683532.5   0.1   7.2     0.0     1.3   0  718  c4
    0.0  5325.1     0.0  681608.0   0.1   7.1     0.0     1.3   0  714  c5
    0.0  4089.3     0.0  523434.0   0.1   7.3     0.0     1.8   0  727  c6
                    extended device statistics
    r/s     w/s    kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0  5283.1     0.0  676231.2   0.1   7.2     0.0     1.4   0  723  c1
    0.0  5215.2     0.0  667549.5   0.1   7.2     0.0     1.4   0  720  c2
    0.0  4009.0     0.0  513152.8   0.1   7.3     0.0     1.8   0  725  c3
    0.0  5281.9     0.0  676082.5   0.1   7.2     0.0     1.4   0  722  c4
    0.0  5316.6     0.0  680520.9   0.1   7.2     0.0     1.4   0  720  c5
    0.0  4159.5     0.0  532420.9   0.1   7.3     0.0     1.7   0  726  c6
                    extended device statistics
    r/s     w/s    kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0  5322.0     0.0  681213.6   0.1   7.2     0.0     1.4   0  720  c1
    0.0  5292.9     0.0  677494.0   0.1   7.2     0.0     1.4   0  722  c2
    0.0  4051.4     0.0  518573.3   0.1   7.3     0.0     1.8   0  727  c3
    0.0  5315.0     0.0  680318.8   0.1   7.2     0.0     1.4   0  721  c4
    0.0  5313.1     0.0  680074.3   0.1   7.2     0.0     1.4   0  723  c5
    0.0  4184.8     0.0  535648.7   0.1   7.3     0.0     1.7   0  730  c6
                    extended device statistics
    r/s     w/s    kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0  5296.4     0.0  677940.2   0.1   7.1     0.0     1.3   0  714  c1
    0.0  5236.4     0.0  670265.3   0.1   7.2     0.0     1.4   0  720  c2
    0.0  4023.5     0.0  515011.5   0.1   7.3     0.0     1.8   0  728  c3
    0.0  5291.4     0.0  677300.7   0.1   7.2     0.0     1.4   0  723  c4
    0.0  5297.4     0.0  678072.8   0.1   7.2     0.0     1.4   0  720  c5
    0.0  4095.6     0.0  524236.0   0.1   7.3     0.0     1.8   0  726  c6
^C

one full output:

                    extended device statistics
    r/s     w/s    kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
    0.0  5302.0     0.0  678658.6   0.1   7.2     0.0     1.4   0  722  c1
    0.0   664.0     0.0   84992.8   0.0   0.9     0.0     1.4   1   90  c1t0d0
    0.0   657.0     0.0   84090.5   0.0   0.9     0.0     1.3   1   89  c1t1d0
    0.0   666.0     0.0   85251.4   0.0   0.9     0.0     1.3   1   89  c1t2d0
    0.0   662.0     0.0   84735.6   0.0   0.9     0.0     1.4   1   91  c1t3d0
    0.0   669.1     0.0   85638.4   0.0   0.9     0.0     1.4   1   92  c1t4d0
    0.0   665.0     0.0   85122.9   0.0   0.9     0.0     1.4   1   91  c1t5d0
    0.0   652.9     0.0   83575.1   0.0   0.9     0.0     1.4   1   90  c1t6d0
    0.0   666.0     0.0   85251.8   0.0   0.9     0.0     1.4   1   91  c1t7d0
    0.0  5293.3     0.0  677537.5   0.1   7.3     0.0     1.4   0  725  c2
    0.0   660.0     0.0   84481.2   0.0   0.9     0.0     1.4   1   91  c2t0d0
    0.0   661.0     0.0   84610.3   0.0   0.9     0.0     1.4   1   90  c2t1d0
    0.0   664.0     0.0   84997.4   0.0   0.9     0.0     1.4   1   90  c2t2d0
    0.0   662.0     0.0   84739.4   0.0   0.9     0.0     1.4   1   92  c2t3d0
    0.0   655.0     0.0   83836.6   0.0   0.9     0.0     1.4   1   89  c2t4d0
    0.0   663.1     0.0   84871.3   0.0   0.9     0.0     1.4   1   90  c2t5d0
    0.0   663.1     0.0   84871.5   0.0   0.9     0.0     1.4   1   92  c2t6d0
    0.0   665.1     0.0   85129.7   0.0   0.9     0.0     1.4   1   92  c2t7d0
    0.0  4072.1     0.0  521228.9   0.1   7.3     0.0     1.8   0  728  c3
    0.0   506.9     0.0   64879.3   0.0   0.9     0.0     1.8   1   90  c3t0d0
    0.0   513.9     0.0   65782.4   0.0   0.9     0.0     1.8   1   92  c3t1d0
    0.0   511.9     0.0   65524.4   0.0   0.9     0.0     1.8   1   91  c3t2d0
    0.0   505.9     0.0   64750.5   0.0   0.9     0.0     1.8   1   91  c3t3d0
    0.0   502.8     0.0   64363.6   0.0   0.9     0.0     1.8   1   90  c3t4d0
    0.0   506.9     0.0   64879.6   0.0   0.9     0.0     1.8   1   91  c3t5d0
    0.0   513.9     0.0   65782.6   0.0   0.9     0.0     1.8   1   92  c3t6d0
    0.0   509.9     0.0   65266.6   0.0   0.9     0.0     1.8   1   91  c3t7d0
    0.0  5298.7     0.0  678232.6   0.1   7.3     0.0     1.4   0  725  c4
    0.0   664.1     0.0   85001.4   0.0   0.9     0.0     1.4   1   92  c4t0d0
    0.0   662.1     0.0   84743.4   0.0   0.9     0.0     1.4   1   90  c4t1d0
    0.0   663.1     0.0   84872.4   0.0   0.9     0.0     1.4   1   92  c4t2d0
    0.0   664.1     0.0   85001.4   0.0   0.9     0.0     1.3   1   88  c4t3d0
    0.0   657.1     0.0   84105.4   0.0   0.9     0.0     1.4   1   91  c4t4d0
    0.0   658.1     0.0   84234.5   0.0   0.9     0.0     1.4   1   91  c4t5d0
    0.0   669.2     0.0   85653.4   0.0   0.9
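Summing the per-controller kw/s column from any one of the samples above gives the aggregate raw write bandwidth of the 48 concurrent dd streams; a quick sketch using the first sample:

```python
# Aggregate raw-disk write bandwidth, from the kw/s (KB written per
# second) column of the first iostat sample above.
kw_per_controller = {
    "c1": 679418.9, "c2": 673813.1, "c3": 518095.1,
    "c4": 683532.5, "c5": 681608.0, "c6": 523434.0,
}
total_kw = sum(kw_per_controller.values())
total_gb_s = total_kw / 1024 ** 2        # KB/s -> GB/s
per_disk_mb_s = total_kw / 48 / 1024     # 48 spindles

print(f"aggregate: {total_gb_s:.2f} GB/s")     # ~3.59 GB/s
print(f"per disk:  {per_disk_mb_s:.0f} MB/s")  # ~76 MB/s
```

So the raw disks sustain roughly 3.6 GB/s aggregate, well above what the single dd through ZFS achieved, which is the point of the comparison.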
Re: [zfs-discuss] Periodic flush
On 28 Jun 08, at 05:14, Robert Milkowski wrote:

Hello Mark,

Tuesday, April 15, 2008, 8:32:32 PM, you wrote:

MM The new write throttle code put back into build 87 attempts to
MM smooth out the process. We now measure the amount of time it takes
MM to sync each transaction group, and the amount of data in that group.
MM We dynamically resize our write throttle to try to keep the sync
MM time constant (at 5secs) under write load. We also introduce
MM fairness delays on writers when we near pipeline capacity: each
MM write is delayed 1/100sec when we are about to fill up. This
MM prevents a single heavy writer from starving out occasional
MM writers. So instead of coming to an abrupt halt when the pipeline
MM fills, we slow down our write pace. The result should be a constant
MM even IO load.

snv_91, 48x 500GB sata drives in one large stripe:

# zpool create -f test c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0 c5t6d0 c5t7d0 c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 c6t5d0 c6t6d0 c6t7d0
# zfs set atime=off test
# dd if=/dev/zero of=/test/q1 bs=1024k
^C34374+0 records in
34374+0 records out

# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
[...]
test        58.9M  21.7T      0  1.19K      0  80.8M
test         862M  21.7T      0  6.67K      0   776M
test        1.52G  21.7T      0  5.50K      0   689M
test        1.52G  21.7T      0  9.28K      0  1.16G
test        2.88G  21.7T      0  1.14K      0   135M
test        2.88G  21.7T      0  1.61K      0   206M
test        2.88G  21.7T      0  18.0K      0  2.24G
test        5.60G  21.7T      0     79      0   264K
test        5.60G  21.7T      0      0      0      0
test        5.60G  21.7T      0  10.9K      0  1.36G
test        9.59G  21.7T      0  7.09K      0   897M
test        9.59G  21.7T      0      0      0      0
test        9.59G  21.7T      0  6.33K      0   807M
test        9.59G  21.7T      0  17.9K      0  2.24G
test        13.6G  21.7T      0  1.96K      0   239M
test        13.6G  21.7T      0      0      0      0
test        13.6G  21.7T      0  11.9K      0  1.49G
test        17.6G  21.7T      0  9.91K      0  1.23G
test        17.6G  21.7T      0      0      0      0
test        17.6G  21.7T      0  5.48K      0   700M
test        17.6G  21.7T      0  20.0K      0  2.50G
test        21.6G  21.7T      0  2.03K      0   244M
test        21.6G  21.7T      0      0      0      0
test        21.6G  21.7T      0      0      0      0
test        21.6G  21.7T      0  4.03K      0   513M
test        21.6G  21.7T      0  23.7K      0  2.97G
test        25.6G  21.7T      0  1.83K      0   225M
test        25.6G  21.7T      0      0      0      0
test        25.6G  21.7T      0  13.9K      0  1.74G
test        29.6G  21.7T      1  1.40K   127K   167M
test        29.6G  21.7T      0      0      0      0
test        29.6G  21.7T      0  7.14K      0   912M
test        29.6G  21.7T      0  19.2K      0  2.40G
test        33.6G  21.7T      1    378   127K  34.8M
test        33.6G  21.7T      0      0      0      0
^C

Well, doesn't actually look good. Checking with iostat I don't see any problems like long service times, etc.

I suspect a single dd is cpu bound.

Reducing zfs_txg_synctime to 1 helps a little bit but it's still not an even stream of data. If I start 3 dd streams at the same time then it is slightly better (zfs_txg_synctime set back to 5) but still very jumpy.

Try zfs_txg_synctime = 10; that reduces the txg overhead.

Reading with one dd produces steady throughput but I'm disappointed with the actual performance:

Again, probably cpu bound. What's ptime dd... saying?

test         161G  21.6T  9.94K      0  1.24G      0
test         161G  21.6T  10.0K      0  1.25G      0
test         161G  21.6T  10.3K      0  1.29G      0
test         161G  21.6T  10.1K      0  1.27G      0
test         161G  21.6T  10.4K      0  1.31G      0
test         161G  21.6T  10.1K      0  1.27G      0
test         161G  21.6T  10.4K      0  1.30G      0
test         161G  21.6T  10.2K      0  1.27G      0
test         161G  21.6T  10.3K      0  1.29G      0
test         161G  21.6T  10.0K      0  1.25G      0
test         161G  21.6T  9.96K      0  1.24G      0
test         161G  21.6T  10.6K      0  1.33G      0
test         161G  21.6T  10.1K      0  1.26G      0
test         161G  21.6T  10.2K      0  1.27G      0
test         161G  21.6T  10.4K      0  1.30G      0
test         161G  21.6T  9.62K      0  1.20G      0
test         161G  21.6T  8.22K      0  1.03G      0
test         161G  21.6T  9.61K      0
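For reference - and assuming an OpenSolaris build of that era which still exposes the zfs_txg_synctime tunable Roch refers to - the sync-time target could be inspected and changed at runtime with mdb. A sketch (requires root; the tunable name and default are not a stable interface):

```shell
# Print the current txg sync-time target (seconds), then set it to 10.
# 0t10 is mdb notation for decimal 10.  Assumes zfs_txg_synctime exists
# in this kernel build.
echo "zfs_txg_synctime/D" | mdb -k
echo "zfs_txg_synctime/W 0t10" | mdb -kw
```

The change does not persist across reboots; a matching `set zfs:zfs_txg_synctime=10` line in /etc/system would be the persistent form.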
Re: [zfs-discuss] Periodic flush
Hello Mark,

Tuesday, April 15, 2008, 8:32:32 PM, you wrote:

MM The new write throttle code put back into build 87 attempts to
MM smooth out the process. We now measure the amount of time it takes
MM to sync each transaction group, and the amount of data in that group.
MM We dynamically resize our write throttle to try to keep the sync
MM time constant (at 5secs) under write load. We also introduce
MM fairness delays on writers when we near pipeline capacity: each
MM write is delayed 1/100sec when we are about to fill up. This
MM prevents a single heavy writer from starving out occasional
MM writers. So instead of coming to an abrupt halt when the pipeline
MM fills, we slow down our write pace. The result should be a constant
MM even IO load.

snv_91, 48x 500GB sata drives in one large stripe:

# zpool create -f test c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0 c5t6d0 c5t7d0 c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 c6t5d0 c6t6d0 c6t7d0
# zfs set atime=off test
# dd if=/dev/zero of=/test/q1 bs=1024k
^C34374+0 records in
34374+0 records out

# zpool iostat 1
[same zpool iostat output as quoted in the previous message]
^C

Well, doesn't actually look good. Checking with iostat I don't see any problems like long service times, etc.

Reducing zfs_txg_synctime to 1 helps a little bit but it's still not an even stream of data. If I start 3 dd streams at the same time then it is slightly better (zfs_txg_synctime set back to 5) but still very jumpy.

Reading with one dd produces steady throughput but I'm disappointed with the actual performance:

test         161G  21.6T  9.94K      0  1.24G      0
test         161G  21.6T  10.0K      0  1.25G      0
test         161G  21.6T  10.3K      0  1.29G      0
test         161G  21.6T  10.1K      0  1.27G      0
test         161G  21.6T  10.4K      0  1.31G      0
test         161G  21.6T  10.1K      0  1.27G      0
test         161G  21.6T  10.4K      0  1.30G      0
test         161G  21.6T  10.2K      0  1.27G      0
test         161G  21.6T  10.3K      0  1.29G      0
test         161G  21.6T  10.0K      0  1.25G      0
test         161G  21.6T  9.96K      0  1.24G      0
test         161G  21.6T  10.6K      0  1.33G      0
test         161G  21.6T  10.1K      0  1.26G      0
test         161G  21.6T  10.2K      0  1.27G      0
test         161G  21.6T  10.4K      0  1.30G      0
test         161G  21.6T  9.62K      0  1.20G      0
test         161G  21.6T  8.22K      0  1.03G      0
test         161G  21.6T  9.61K      0  1.20G      0
test         161G  21.6T  10.2K      0  1.28G      0
test         161G  21.6T  9.12K      0  1.14G      0
test         161G  21.6T  9.96K      0  1.25G      0
test         161G  21.6T  9.72K      0  1.22G      0
test         161G  21.6T  10.6K      0  1.32G      0
test         161G  21.6T  9.93K
Re: [zfs-discuss] Periodic flush
Bob Friesenhahn writes:

On Tue, 15 Apr 2008, Mark Maybee wrote:

going to take 12sec to get this data onto the disk. This impedance mis-match is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is

Yes. With an application which also needs to make best use of available CPU, these I/O gaps cut into available CPU time (by blocking the process) unless the application uses multithreading and an intermediate write queue (more memory) to separate the CPU-centric parts from the I/O-centric parts. While the single-threaded application is waiting for data to be written, it is not able to read and process more data. Since reads take time to complete, being blocked on write stops new reads from being started so the data is ready when it is needed.

There is one down side to this new model: if a write load is very bursty, e.g., a large 5GB write followed by 30secs of idle, the new code may be less efficient than the old. In the old code, all

This is also a common scenario. :-) Presumably the special slow I/O code would not kick in unless the burst was large enough to fill quite a bit of the ARC.

Bursts of 1/8th of physical memory or 5 seconds of storage throughput, whichever is smallest.

-r

Real time throttling is quite a challenge to do in software.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Periodic flush
Hello Mark,

Tuesday, April 15, 2008, 8:32:32 PM, you wrote:

MM ZFS has always done a certain amount of write throttling. In the past
MM (or the present, for those of you running S10 or pre build 87 bits) this
MM throttling was controlled by a timer and the size of the ARC: we would
MM cut a transaction group every 5 seconds based off of our timer, and
MM we would also cut a transaction group if we had more than 1/4 of the
MM ARC size worth of dirty data in the transaction group. So, for example,
MM if you have a machine with 16GB of physical memory it wouldn't be
MM unusual to see an ARC size of around 12GB. This means we would allow
MM up to 3GB of dirty data into a single transaction group (if the writes
MM complete in less than 5 seconds). Now we can have up to three
MM transaction groups in progress at any time: open context, quiesce
MM context, and sync context. As a final wrinkle, we also don't allow more
MM than 1/2 the ARC to be composed of dirty write data. All taken
MM together, this means that there can be up to 6GB of writes in the pipe
MM (using the 12GB ARC example from above).

MM Problems with this design start to show up when the write-to-disk
MM bandwidth can't keep up with the application: if the application is
MM writing at a rate of, say, 1GB/sec, it will fill the pipe within
MM 6 seconds. But if the IO bandwidth to disk is only 512MB/sec, its
MM going to take 12sec to get this data onto the disk. This impedance
MM mis-match is going to manifest as pauses: the application fills
MM the pipe, then waits for the pipe to empty, then starts writing again.
MM Note that this won't be smooth, since we need to complete an entire
MM sync phase before allowing things to progress. So you can end up
MM with IO gaps. This is probably what the original submitter is
MM experiencing. Note there are a few other subtleties here that I
MM have glossed over, but the general picture is accurate.

MM The new write throttle code put back into build 87 attempts to
MM smooth out the process. We now measure the amount of time it takes
MM to sync each transaction group, and the amount of data in that group.
MM We dynamically resize our write throttle to try to keep the sync
MM time constant (at 5secs) under write load. We also introduce
MM fairness delays on writers when we near pipeline capacity: each
MM write is delayed 1/100sec when we are about to fill up. This
MM prevents a single heavy writer from starving out occasional
MM writers. So instead of coming to an abrupt halt when the pipeline
MM fills, we slow down our write pace. The result should be a constant
MM even IO load.

MM There is one down side to this new model: if a write load is very
MM bursty, e.g., a large 5GB write followed by 30secs of idle, the
MM new code may be less efficient than the old. In the old code, all
MM of this IO would be let in at memory speed and then more slowly make
MM its way out to disk. In the new code, the writes may be slowed down.
MM The data makes its way to the disk in the same amount of time, but
MM the application takes longer. Conceptually: we are sizing the write
MM buffer to the pool bandwidth, rather than to the memory size.

First - thank you for your explanation - it is very helpful.

I'm worried about the last part, though it's hard to be optimal for all workloads. Sometimes the problem is simply that you change the behavior from the application's perspective. With other file systems, I guess, you are able to fill most of memory and still keep the disks 100% busy without IO gaps. My biggest concern was these gaps in IO - zfs should keep the disks 100% busy if needed.

--
Best regards,
Robert Milkowski
mailto:[EMAIL PROTECTED]
http://milek.blogspot.com
Re: [zfs-discuss] Periodic flush
ZFS has always done a certain amount of write throttling. In the past (or the present, for those of you running S10 or pre build 87 bits) this throttling was controlled by a timer and the size of the ARC: we would cut a transaction group every 5 seconds based off of our timer, and we would also cut a transaction group if we had more than 1/4 of the ARC size worth of dirty data in the transaction group. So, for example, if you have a machine with 16GB of physical memory it wouldn't be unusual to see an ARC size of around 12GB. This means we would allow up to 3GB of dirty data into a single transaction group (if the writes complete in less than 5 seconds). Now we can have up to three transaction groups in progress at any time: open context, quiesce context, and sync context. As a final wrinkle, we also don't allow more than 1/2 the ARC to be composed of dirty write data. All taken together, this means that there can be up to 6GB of writes in the pipe (using the 12GB ARC example from above).

Problems with this design start to show up when the write-to-disk bandwidth can't keep up with the application: if the application is writing at a rate of, say, 1GB/sec, it will fill the pipe within 6 seconds. But if the IO bandwidth to disk is only 512MB/sec, it's going to take 12sec to get this data onto the disk. This impedance mis-match is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is experiencing. Note there are a few other subtleties here that I have glossed over, but the general picture is accurate.

The new write throttle code put back into build 87 attempts to smooth out the process. We now measure the amount of time it takes to sync each transaction group, and the amount of data in that group. We dynamically resize our write throttle to try to keep the sync time constant (at 5secs) under write load. We also introduce fairness delays on writers when we near pipeline capacity: each write is delayed 1/100sec when we are about to fill up. This prevents a single heavy writer from starving out occasional writers. So instead of coming to an abrupt halt when the pipeline fills, we slow down our write pace. The result should be a constant even IO load.

There is one down side to this new model: if a write load is very bursty, e.g., a large 5GB write followed by 30secs of idle, the new code may be less efficient than the old. In the old code, all of this IO would be let in at memory speed and then more slowly make its way out to disk. In the new code, the writes may be slowed down. The data makes its way to the disk in the same amount of time, but the application takes longer. Conceptually: we are sizing the write buffer to the pool bandwidth, rather than to the memory size.

Robert Milkowski wrote:

Hello eric,

Thursday, March 27, 2008, 9:36:42 PM, you wrote:

ek On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote:

On Thu, 27 Mar 2008, Neelakanth Nadgir wrote: This causes the sync to happen much faster, but as you say, suboptimal. Haven't had the time to go through the bug report, but probably CR 6429205 "each zpool needs to monitor its throughput and throttle heavy writers" will help.

I hope that this feature is implemented soon, and works well. :-)

ek Actually, this has gone back into snv_87 (and no, we don't know which
ek s10uX it will go into yet).

Could you share more details on how it works right now, after the change?
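The arithmetic in Mark's example is easy to restate; a small sketch of the old throttle's sizing and the fill/drain mismatch he describes (numbers taken from the text above):

```python
# Old (pre-build-87) write-throttle sizing, per the example above:
# a txg is cut at 1/4 ARC of dirty data, and at most 1/2 the ARC may
# be dirty across the three in-flight txgs.
arc_gb = 12.0
txg_cut_gb = arc_gb / 4        # 3 GB of dirty data cuts a txg
pipe_gb = arc_gb / 2           # at most 6 GB of writes in the pipe

# Impedance mismatch: the app writes 1 GB/s but disks drain 0.5 GB/s.
app_gb_s, disk_gb_s = 1.0, 0.5
fill_s = pipe_gb / app_gb_s    # app fills the pipe in 6 s
drain_s = pipe_gb / disk_gb_s  # ...which takes 12 s to reach disk

print(txg_cut_gb, pipe_gb)     # 3.0 6.0
print(fill_s, drain_s)         # 6.0 12.0
```

The gap between fill_s and drain_s is the stall the application sees: it writes freely for about 6 seconds, then blocks while the remaining dirty data syncs.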
Re: [zfs-discuss] Periodic flush
On Tue, 15 Apr 2008, Mark Maybee wrote:

going to take 12sec to get this data onto the disk. This impedance mis-match is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is

Yes. With an application which also needs to make best use of available CPU, these I/O gaps cut into available CPU time (by blocking the process) unless the application uses multithreading and an intermediate write queue (more memory) to separate the CPU-centric parts from the I/O-centric parts. While the single-threaded application is waiting for data to be written, it is not able to read and process more data. Since reads take time to complete, being blocked on write stops new reads from being started so the data is ready when it is needed.

There is one down side to this new model: if a write load is very bursty, e.g., a large 5GB write followed by 30secs of idle, the new code may be less efficient than the old. In the old code, all

This is also a common scenario. :-) Presumably the special slow I/O code would not kick in unless the burst was large enough to fill quite a bit of the ARC.

Real time throttling is quite a challenge to do in software.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
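The structure Bob describes - a multithreaded application with an intermediate write queue decoupling the CPU-centric part from the I/O-centric part - might look roughly like this in outline. A hypothetical sketch; names, sizes, and the trivial "processing" step are illustrative, not from the thread:

```python
# Minimal sketch of decoupling CPU work from bursty writes with a
# bounded queue and a dedicated writer thread, as Bob describes.
import queue
import threading

def process(chunk: bytes) -> bytes:
    return chunk.upper()           # stand-in for the CPU-heavy step

def writer(q, out):
    # Drains the queue; in the real case these would be file writes
    # that may stall for seconds during a txg sync without blocking
    # the reader/processor thread.
    while (item := q.get()) is not None:
        out.append(item)

q = queue.Queue(maxsize=64)        # bounded: limits the extra memory
written = []
t = threading.Thread(target=writer, args=(q, written))
t.start()

for chunk in (b"abc", b"def"):     # stand-in for the read/process loop
    q.put(process(chunk))          # blocks only when the queue is full
q.put(None)                        # sentinel: no more data
t.join()

print(written)                     # [b'ABC', b'DEF']
```

The bounded maxsize is the "more memory" trade-off Bob mentions: the main thread keeps reading and processing while writes stall, up to the queue's capacity.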
Re: [zfs-discuss] Periodic flush
Hello eric,

Thursday, March 27, 2008, 9:36:42 PM, you wrote:

ek On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote:

On Thu, 27 Mar 2008, Neelakanth Nadgir wrote: This causes the sync to happen much faster, but as you say, suboptimal. Haven't had the time to go through the bug report, but probably CR 6429205 "each zpool needs to monitor its throughput and throttle heavy writers" will help.

I hope that this feature is implemented soon, and works well. :-)

ek Actually, this has gone back into snv_87 (and no, we don't know which
ek s10uX it will go into yet).

Could you share more details on how it works right now, after the change?

--
Best regards,
Robert Milkowski
mailto:[EMAIL PROTECTED]
http://milek.blogspot.com
Re: [zfs-discuss] Periodic flush
The question is: does the IO pausing behaviour you noticed penalize your application? What are the consequences at the application level?

For instance, we have seen applications doing some kind of data capture from an external device (video, for example) that require constant throughput to disk (a data feed), otherwise risking loss of data. In this case qfs might be a better option (not free, though).

If your application is not suffering, then you should be able to live with these apparent IO hangs.

s-

On Thu, Mar 27, 2008 at 3:35 AM, Bob Friesenhahn [EMAIL PROTECTED] wrote:

My application processes thousands of files sequentially, reading input files, and outputting new files. I am using Solaris 10U4. While running the application in a verbose mode, I see that it runs very fast but pauses about every 7 seconds for a second or two. This is while reading 50MB/second and writing 73MB/second (ARC cache miss rate of 87%). The pause does not occur if the application spends more time doing real work. However, it would be nice if the pause went away.

I have tried turning down the ARC size (from 14GB to 10GB) but the behavior did not noticeably improve. The storage device is trained to ignore cache flush requests. According to the Evil Tuning Guide, the pause I am seeing is due to a cache flush after the uberblock updates. It does not seem like a wise choice to disable ZFS cache flushing entirely. Is there a better way, other than adding a small delay into my application?

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

--
Blog: http://fakoli.blogspot.com/
Re: [zfs-discuss] Periodic flush
On Wed, 26 Mar 2008, Neelakanth Nadgir wrote:

When you experience the pause at the application level, do you see an increase in writes to disk? This might be the regular syncing of the transaction group to disk.

If I use 'zpool iostat' with a one second interval, what I see is two or three samples with no write I/O at all followed by a huge write of 100 to 312MB/second. Writes claimed at a lower rate are split across two sample intervals. It seems that writes are being cached and then issued all at once. This behavior assumes that the file may be written multiple times, so a delayed write is more efficient.

If I run a script like

while true
do
  sync
done

then the write data rate is much more consistent (at about 66MB/second) and the program does not stall. Of course this is not very efficient.

Are the 'zpool iostat' statistics accurate?

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
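The burst sizes Bob reports are consistent with a roughly 5-second txg cycle at his sustained rate; trivial arithmetic, assuming the default ~5s sync interval of that era:

```python
# At ~66 MB/s sustained and a txg synced every ~5 s, one-second
# zpool iostat samples show a few idle samples and then one burst:
sustained_mb_s = 66
txg_interval_s = 5
burst_mb = sustained_mb_s * txg_interval_s
print(burst_mb)   # 330 -> roughly the burst magnitude Bob observes
```

A burst of that size landing within one or two one-second samples also explains why lower-rate writes appear "split across two sample intervals".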
Re: [zfs-discuss] Periodic flush
Selim Daoud wrote:

The question is: does the IO pausing behaviour you noticed penalize your application? What are the consequences at the application level?

For instance, we have seen applications doing some kind of data capture from an external device (video, for example) that require constant throughput to disk (a data feed), otherwise risking loss of data. In this case qfs might be a better option (not free, though).

If your application is not suffering, then you should be able to live with these apparent IO hangs.

I would look at txg_time first... for lots of streaming writes on a machine with limited memory, you can smooth out the sawtooth.

QFS is open sourced. http://blogs.sun.com/samqfs

-- richard
Re: [zfs-discuss] Periodic flush
Bob Friesenhahn wrote: On Wed, 26 Mar 2008, Neelakanth Nadgir wrote: When you experience the pause at the application level, do you see an increase in writes to disk? This might be the regular syncing of the transaction group to disk.

If I use 'zpool iostat' with a one-second interval, what I see is two or three samples with no write I/O at all followed by a huge write of 100 to 312MB/second. Writes claimed to be at a lower rate are split across two sample intervals. It seems that writes are being cached and then issued all at once. This behavior assumes that the file may be written multiple times, so a delayed write is more efficient.

This does sound like the regular syncing.

If I run a script like "while true; do sync; done" then the write data rate is much more consistent (at about 66MB/second) and the program does not stall. Of course this is not very efficient.

This causes the sync to happen much faster but, as you say, is suboptimal. I haven't had time to go through the bug report, but CR 6429205 ("each zpool needs to monitor its throughput and throttle heavy writers") will probably help.

Are the 'zpool iostat' statistics accurate?

Yes. You could also look at regular iostat and correlate it.

-neel
Re: [zfs-discuss] Periodic flush
On Thu, 27 Mar 2008, Neelakanth Nadgir wrote: This causes the sync to happen much faster but, as you say, is suboptimal. Haven't had time to go through the bug report, but CR 6429205 ("each zpool needs to monitor its throughput and throttle heavy writers") will probably help.

I hope that this feature is implemented soon, and works well. :-)

I tested with my application outputting to a UFS filesystem on a single 15K RPM SAS disk and saw that it writes about 50MB/second, without the bursty behavior of ZFS. When writing to a ZFS filesystem on a RAID array, 'zpool iostat' reports an average (over 10 seconds) write rate of 54MB/second. Given that the throughput is not much higher on the RAID array, I assume that the bottleneck is in my application.

Are the 'zpool iostat' statistics accurate? Yes. You could also look at regular iostat and correlate it.

iostat shows that my RAID array disks are loafing, with only 9MB/second written to each but 82 writes/second.

Bob
==
Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Periodic flush
On Mar 27, 2008, at 9:24 AM, Bob Friesenhahn wrote: On Thu, 27 Mar 2008, Neelakanth Nadgir wrote: This causes the sync to happen much faster but, as you say, is suboptimal. Haven't had time to go through the bug report, but CR 6429205 ("each zpool needs to monitor its throughput and throttle heavy writers") will probably help.

I hope that this feature is implemented soon, and works well. :-)

Actually, this has gone back into snv_87 (and no, we don't know which s10uX it will go into yet).

eric
Re: [zfs-discuss] Periodic flush
You may want to try disabling the disk write cache on the single disk. Also, for the RAID, disable 'host cache flush' if such an option exists. That solved the problem for me. Let me know.

Bob Friesenhahn [EMAIL PROTECTED] wrote: On Thu, 27 Mar 2008, Neelakanth Nadgir wrote: This causes the sync to happen much faster but, as you say, is suboptimal. Haven't had time to go through the bug report, but CR 6429205 ("each zpool needs to monitor its throughput and throttle heavy writers") will probably help.

I hope that this feature is implemented soon, and works well. :-)

I tested with my application outputting to a UFS filesystem on a single 15K RPM SAS disk and saw that it writes about 50MB/second, without the bursty behavior of ZFS. When writing to a ZFS filesystem on a RAID array, 'zpool iostat' reports an average (over 10 seconds) write rate of 54MB/second. Given that the throughput is not much higher on the RAID array, I assume that the bottleneck is in my application.

Are the 'zpool iostat' statistics accurate? Yes. You could also look at regular iostat and correlate it.

iostat shows that my RAID array disks are loafing, with only 9MB/second written to each but 82 writes/second.

Bob
==
Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
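On Solaris, the per-disk write cache can usually be toggled from the expert mode of format(1M). The session below is a sketch only: the cache menu is not exposed for every drive type, the exact menu entries vary, and disabling the cache trades write performance for safety on devices that ignore cache flush requests.

```
# format -e              (expert mode exposes the cache menu on many drives)
(select the disk from the menu)
format> cache
cache> write_cache
write_cache> disable
```

On a hardware RAID array, the equivalent setting (if any) lives in the controller's own configuration tool rather than in format(1M).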
[zfs-discuss] Periodic flush
My application processes thousands of files sequentially, reading input files and outputting new files. I am using Solaris 10U4. While running the application in verbose mode, I see that it runs very fast but pauses about every 7 seconds for a second or two. This is while reading 50MB/second and writing 73MB/second (ARC cache miss rate of 87%). The pause does not occur if the application spends more time doing real work. However, it would be nice if the pause went away. I have tried turning down the ARC size (from 14GB to 10GB) but the behavior did not noticeably improve. The storage device is trained to ignore cache flush requests. According to the Evil Tuning Guide, the pause I am seeing is due to a cache flush after the uberblock updates. It does not seem like a wise choice to disable ZFS cache flushing entirely. Is there a better way other than adding a small delay into my application?

Bob
==
Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Periodic flush
Bob Friesenhahn wrote: My application processes thousands of files sequentially, reading input files and outputting new files. I am using Solaris 10U4. While running the application in verbose mode, I see that it runs very fast but pauses about every 7 seconds for a second or two.

When you experience the pause at the application level, do you see an increase in writes to disk? This might be the regular syncing of the transaction group to disk. This is normal behavior. The length of the pause is determined by how much data needs to be synced. You could of course decrease it by reducing the time between syncs (either by reducing the ARC and/or decreasing txg_time); however, I am not sure it will translate to better performance for you.

hth,
-neel

This is while reading 50MB/second and writing 73MB/second (ARC cache miss rate of 87%). The pause does not occur if the application spends more time doing real work. However, it would be nice if the pause went away. I have tried turning down the ARC size (from 14GB to 10GB) but the behavior did not noticeably improve. The storage device is trained to ignore cache flush requests. According to the Evil Tuning Guide, the pause I am seeing is due to a cache flush after the uberblock updates. It does not seem like a wise choice to disable ZFS cache flushing entirely. Is there a better way other than adding a small delay into my application?

Bob
==
Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/