Re: [zfs-discuss] Weird write performance problem
>> One day, the write performance of zfs degraded: it dropped from
>> 60MB/s to about 6MB/s for sequential writes.
>>
>> Command:
>> date;dd if=/dev/zero of=block bs=1024*128 count=1;date

See this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=139317&tstart=45

and search in the page for "metaslab_min_alloc_size". Try adjusting the
metaslab size and see if it fixes your performance problem.
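The way that tuning is usually applied is sketched below (untested
here; the 0x1000 = 4K value is the one quoted alongside that thread, so
verify it against your release before relying on it):

  # on the live kernel, via mdb (mdb input is hex, so 1000 = 0x1000):
  echo "metaslab_min_alloc_size/Z 1000" | mdb -kw

  # or persistently, in /etc/system:
  set zfs:metaslab_min_alloc_size = 0x1000

-Don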
Re: [zfs-discuss] Weird write performance problem
And one comment: when we do a write operation (with the dd command),
heavy read activity appears, rising from zero to about 3M per disk,
while the write bandwidth stays poor. The disk I/O %b rises from 0 to
about 60. I don't understand why this happens.

                                       capacity     operations    bandwidth
pool                                 used  avail   read  write   read  write
-----------------------------------  ----- -----  -----  -----  -----  -----
datapool                             19.8T 5.48T    543     47  1.74M  5.89M
  raidz1                             5.64T  687G    146     13   480K  1.66M
    c3t600221900085486703B2490FB009d0    -     -     49     13  3.26M   293K
    c3t600221900085486703B4490FB063d0    -     -     48     13  3.19M   296K
    c3t6002219000852889055F4CB79C10d0    -     -     48     13  3.19M   293K
    c3t600221900085486703B8490FB0FFd0    -     -     50     13  3.28M   284K
    c3t600221900085486703BA490FB14Fd0    -     -     50     13  3.31M   287K
    c3t6002219000852889041C490FAFA0d0    -     -     49     14  3.27M   297K
    c3t600221900085486703C0490FB27Dd0    -     -     48     14  3.24M   300K
  raidz1                             5.73T  594G    102      7   337K   996K
    c3t600221900085486703C2490FB2BFd0    -     -     52      5  3.59M   166K
    c3t6002219000852889041F490FAFD0d0    -     -     54      5  3.72M   166K
    c3t60022190008528890428490FB0D8d0    -     -     55      5  3.79M   166K
    c3t60022190008528890422490FB02Cd0    -     -     52      5  3.57M   166K
    c3t60022190008528890425490FB07Cd0    -     -     53      5  3.64M   166K
    c3t60022190008528890434490FB24Ed0    -     -     55      5  3.76M   166K
    c3t6002219000852889043949100968d0    -     -     55      5  3.83M   166K
  raidz1                             5.81T  519G    117     10   388K  1.26M
    c3t6002219000852889056B4CB79D66d0    -     -     46      9  3.09M   215K
    c3t600221900085486704B94CB79F91d0    -     -     44      9  2.91M   215K
    c3t600221900085486704BB4CB79FE1d0    -     -     44      9  2.97M   224K
    c3t600221900085486704BD4CB7A035d0    -     -     44      9  2.96M   215K
    c3t600221900085486704BF4CB7A0ABd0    -     -     44      9  2.97M   216K
    c3t6002219000852889055C4CB79BB8d0    -     -     45      9  3.04M   215K
    c3t600221900085486704C14CB7A0FDd0    -     -     46      9  3.02M   215K
  raidz1                             2.59T 3.72T    176     16   581K  2.00M
    c3t6002219000852889042B490FB124d0    -     -     48      5  3.21M   342K
    c3t600221900085486704C54CB7A199d0    -     -     46      5  2.99M   342K
    c3t600221900085486704C74CB7A1D5d0    -     -     49      5  3.27M   342K
    c3t600221900085288905594CB79B64d0    -     -     46      6  3.00M   342K
    c3t600221900085288905624CB79C86d0    -     -     47      6  3.11M   342K
    c3t600221900085288905654CB79CCCd0    -     -     50      6  3.29M   342K
    c3t600221900085288905684CB79D1Ed0    -     -     45      5  2.98M   342K
  c3t6B8AC6FF837605864DC9E9F1d0         4K  928G      0      0      0      0
-----------------------------------  ----- -----  -----  -----  -----  -----
^C
root@nas-hz-01:~#

On 06/08/2011 11:07 AM, Ding Honghui wrote:
> Hi,
>
> I've got a weird write performance problem and need your help.
>
> One day, the write performance of zfs degraded: it dropped from
> 60MB/s to about 6MB/s for sequential writes.
>
> Command:
>
> date;dd if=/dev/zero of=block bs=1024*128 count=1;date
>
> The hardware configuration is 1 Dell MD3000 and 1 MD1000 with 30
> disks. The OS is Solaris 10U8, zpool version 15 and zfs version 4.
>
> I ran DTrace to trace the write performance:
>
> fbt:zfs:zfs_write:entry
> {
>         self->ts = timestamp;
> }
>
> fbt:zfs:zfs_write:return
> /self->ts/
> {
>         @time = quantize(timestamp - self->ts);
>         self->ts = 0;
> }
>
> It shows:
>
>            value  ------------- Distribution -------------  count
>             8192 |                                          0
>            16384 |                                          16
>            32768 |                                          3270
>            65536 |@@@                                       898
>           131072 |@@@                                       985
>           262144 |                                          33
>           524288 |                                          1
>          1048576 |                                          1
>          2097152 |                                          3
>          4194304 |                                          0
>          8388608 |@                                         180
>         16777216 |                                          33
>         33554432 |                                          0
>         67108864 |                                          0
>        134217728 |                                          0
>        268435456 |                                          1
>        536870912 |                                          1
>       1073741824 |                                          2
>       2147483648 |                                          0
>       4294967296 |                                          0
>       8589934592 |                                          0
>      17179869184 |                                          2
>      34359738368 |                                          3
>      68719476736 |                                          0
>
> Compared to a storage system that works well (1 MD3000), where the
> maximum zfs_write time is 4294967296 ns, this system is about 10
> times slower.
>
> Any suggestions?
>
> Thanks
>
> Ding
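P.S. For completeness, the numbers above were captured along these
lines (the target path is just an example, not the real one):

  # sequential write load in the background ...
  dd if=/dev/zero of=/datapool/fs/block bs=128k count=8192 &

  # ... while watching per-device operations and bandwidth:
  zpool iostat -v datapool 5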
[zfs-discuss] Weird write performance problem
Hi,

I've got a weird write performance problem and need your help.

One day, the write performance of zfs degraded: it dropped from 60MB/s
to about 6MB/s for sequential writes.

Command:

date;dd if=/dev/zero of=block bs=1024*128 count=1;date

The hardware configuration is 1 Dell MD3000 and 1 MD1000 with 30 disks.
The OS is Solaris 10U8, zpool version 15 and zfs version 4.

I ran DTrace to trace the write performance:

fbt:zfs:zfs_write:entry
{
        self->ts = timestamp;
}

fbt:zfs:zfs_write:return
/self->ts/
{
        @time = quantize(timestamp - self->ts);
        self->ts = 0;
}

It shows:

           value  ------------- Distribution -------------  count
            8192 |                                          0
           16384 |                                          16
           32768 |                                          3270
           65536 |@@@                                       898
          131072 |@@@                                       985
          262144 |                                          33
          524288 |                                          1
         1048576 |                                          1
         2097152 |                                          3
         4194304 |                                          0
         8388608 |@                                         180
        16777216 |                                          33
        33554432 |                                          0
        67108864 |                                          0
       134217728 |                                          0
       268435456 |                                          1
       536870912 |                                          1
      1073741824 |                                          2
      2147483648 |                                          0
      4294967296 |                                          0
      8589934592 |                                          0
     17179869184 |                                          2
     34359738368 |                                          3
     68719476736 |                                          0

Compared to a storage system that works well (1 MD3000), where the
maximum zfs_write time is 4294967296 ns, this system is about 10 times
slower.

Any suggestions?
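If anyone wants to reproduce the measurement: I saved the probes above
as zfs_write_lat.d and ran them roughly like this (the dd target path
is just an example; the quantize aggregation prints when dtrace exits
along with the command):

  dtrace -s zfs_write_lat.d \
      -c "dd if=/dev/zero of=/datapool/fs/block bs=128k count=1000"

Thanks

Ding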
Re: [zfs-discuss] L2ARC and poor read performance
On 07/06/2011 22:57, LaoTsao wrote:
> You have an unbalanced setup: FC at 4Gbps vs a 10Gbps NIC.

It's actually 2x 4Gbps (using MPXIO) vs 1x 10Gbps.

> After 8b/10b encoding it is even worse, but this does not impact your
> benchmark yet.
>
> Sent from my iPad
>
> Hung-Sheng Tsao ( LaoTsao) Ph.D
>
> On Jun 7, 2011, at 5:46 PM, Phil Harman wrote:
>> On 07/06/2011 20:34, Marty Scholes wrote:
>>> I'll throw out some (possibly bad) ideas.
>>
>> Thanks for taking the time.
>>
>>> Is ARC satisfying the caching needs? 32 GB for ARC should almost
>>> cover the 40GB of total reads, suggesting that the L2ARC doesn't
>>> add any value for this test.
>>>
>>> Are the SSD devices saturated from an I/O standpoint? Put another
>>> way, can ZFS put data to them fast enough? If they aren't taking
>>> writes fast enough, then maybe they can't effectively load for
>>> caching. Certainly if they are saturated for writes they can't do
>>> much for reads.
>>
>> The SSDs are barely ticking over, and can deliver almost as much
>> throughput as the current SAN storage.
>>
>>> Are some of the reads sequential? Sequential reads don't go to
>>> L2ARC.
>>
>> That'll be it. I assume the L2ARC is just taking metadata. In
>> situations such as mine, I would quite like the option of routing
>> sequential read data to the L2ARC also.
>>
>> I do notice a benefit with a sequential update (i.e. COW for each
>> block), and I think this is because the L2ARC satisfies most of the
>> metadata reads instead of having to read them from the SAN.
>>
>>> What does iostat say for the SSD units? What does arc_summary.pl
>>> (maybe spelled differently) say about the ARC / L2ARC usage? How
>>> much of the SSD units are in use as reported in zpool iostat -v?
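P.S. If anyone fancies experimenting: as far as I know, the "sequential
reads don't go to L2ARC" behaviour is governed by the l2arc_noprefetch
tunable. Untested on this box, so treat the following as a sketch and
check your release first:

  # let prefetched (i.e. sequential) reads be cached in the L2ARC;
  # 0t0 is decimal 0, written to the live kernel:
  echo "l2arc_noprefetch/W 0t0" | mdb -kw

  # or persistently, via /etc/system:
  set zfs:l2arc_noprefetch = 0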
Re: [zfs-discuss] L2ARC and poor read performance
You have an unbalanced setup: FC at 4Gbps vs a 10Gbps NIC. After 8b/10b
encoding it is even worse, but this does not impact your benchmark yet.

Sent from my iPad

Hung-Sheng Tsao ( LaoTsao) Ph.D

On Jun 7, 2011, at 5:46 PM, Phil Harman wrote:

> On 07/06/2011 20:34, Marty Scholes wrote:
>> I'll throw out some (possibly bad) ideas.
>
> Thanks for taking the time.
>
>> Is ARC satisfying the caching needs? 32 GB for ARC should almost
>> cover the 40GB of total reads, suggesting that the L2ARC doesn't add
>> any value for this test.
>>
>> Are the SSD devices saturated from an I/O standpoint? Put another
>> way, can ZFS put data to them fast enough? If they aren't taking
>> writes fast enough, then maybe they can't effectively load for
>> caching. Certainly if they are saturated for writes they can't do
>> much for reads.
>
> The SSDs are barely ticking over, and can deliver almost as much
> throughput as the current SAN storage.
>
>> Are some of the reads sequential? Sequential reads don't go to L2ARC.
>
> That'll be it. I assume the L2ARC is just taking metadata. In
> situations such as mine, I would quite like the option of routing
> sequential read data to the L2ARC also.
>
> I do notice a benefit with a sequential update (i.e. COW for each
> block), and I think this is because the L2ARC satisfies most of the
> metadata reads instead of having to read them from the SAN.
>
>> What does iostat say for the SSD units? What does arc_summary.pl
>> (maybe spelled differently) say about the ARC / L2ARC usage? How much
>> of the SSD units are in use as reported in zpool iostat -v?
Re: [zfs-discuss] L2ARC and poor read performance
On 07/06/2011 20:34, Marty Scholes wrote:
> I'll throw out some (possibly bad) ideas.

Thanks for taking the time.

> Is ARC satisfying the caching needs? 32 GB for ARC should almost
> cover the 40GB of total reads, suggesting that the L2ARC doesn't add
> any value for this test.
>
> Are the SSD devices saturated from an I/O standpoint? Put another
> way, can ZFS put data to them fast enough? If they aren't taking
> writes fast enough, then maybe they can't effectively load for
> caching. Certainly if they are saturated for writes they can't do
> much for reads.

The SSDs are barely ticking over, and can deliver almost as much
throughput as the current SAN storage.

> Are some of the reads sequential? Sequential reads don't go to L2ARC.

That'll be it. I assume the L2ARC is just taking metadata. In
situations such as mine, I would quite like the option of routing
sequential read data to the L2ARC also.

I do notice a benefit with a sequential update (i.e. COW for each
block), and I think this is because the L2ARC satisfies most of the
metadata reads instead of having to read them from the SAN.

> What does iostat say for the SSD units? What does arc_summary.pl
> (maybe spelled differently) say about the ARC / L2ARC usage? How much
> of the SSD units are in use as reported in zpool iostat -v?
Re: [zfs-discuss] L2ARC and poor read performance
I'll throw out some (possibly bad) ideas.

Is ARC satisfying the caching needs? 32 GB for ARC should almost cover
the 40GB of total reads, suggesting that the L2ARC doesn't add any
value for this test.

Are the SSD devices saturated from an I/O standpoint? Put another way,
can ZFS put data to them fast enough? If they aren't taking writes fast
enough, then maybe they can't effectively load for caching. Certainly
if they are saturated for writes they can't do much for reads.

Are some of the reads sequential? Sequential reads don't go to L2ARC.

What does iostat say for the SSD units? What does arc_summary.pl (maybe
spelled differently) say about the ARC / L2ARC usage? How much of the
SSD units are in use as reported in zpool iostat -v?
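Concretely, the sort of commands I have in mind (the pool name here is
a placeholder, substitute your own):

  # per-device service times and utilisation, SSDs included:
  iostat -xnz 5

  # cache-device usage, reads and writes:
  zpool iostat -v tank 5

  # ARC / L2ARC sizes and hit rates:
  arc_summary.pl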
Re: [zfs-discuss] Separate Log Devices
> The guide suggests that the zil be sized to 1/2 the amount of ram in
> the server which would be 1GB.

The ZFS Best Practices Guide does detail the absolute maximum size the
ZIL can grow to in theory, which as you stated is 1/2 the size of the
host's physical memory. But in practice, the very next bullet point
details the log device sizing equation, which we have found to be a
more relevant indicator. Excerpt below:

"For a target throughput of X MB/sec and given that ZFS pushes
transaction groups every 5 seconds (and have 2 outstanding), we also
expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service
100MB/sec of synchronous writes, 1 GB of log device should be
sufficient."

> What happens if I oversize the zil?

"Oversizing" the log device capacity has no negative repercussions
other than the underutilization of your SSD.

> If I create a 1GB slice for the zil, can I add another slice for
> another zil in the future when more ram is added?

If the question is whether multiple disk slices can be striped to
aggregate capacity, then the answer is yes. Be aware that with most
SSDs, including the Intel X25-E, using a disk slice instead of the
entire device will automatically disable the on-board write cache.
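A sketch of what that looks like (hypothetical controller/target
numbers; create the s0/s1 slices with format(1M) first):

  # add a mirrored log made of two 1GB slices:
  zpool add tank log mirror c4t0d0s0 c4t1d0s0

  # later, once more RAM is added, stripe in a second mirrored pair:
  zpool add tank log mirror c4t0d0s1 c4t1d0s1

Christopher George
Founder / CTO
http://www.ddrdrive.com/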
[zfs-discuss] L2ARC and poor read performance
Ok, here's the thing ...

A customer has some big tier 1 storage, and has presented 24 LUNs (from
four RAID6 groups) to an OI148 box which is acting as a kind of
iSCSI/FC bridge (using some of the cool features of ZFS along the way).

The OI box currently has 32GB configured for the ARC, and 4x 223GB SSDs
for L2ARC. It has a dual port QLogic HBA, and is currently configured
to do round-robin MPXIO over two 4Gbps links. The iSCSI traffic is over
a dual 10Gbps card (rather like the one Sun used to sell).

I've just built a fresh pool, and have created 20x 100GB zvols which
are mapped to iSCSI clients. I have initialised the first 20GB of each
zvol with random data.

I've had a lot of success with write performance (e.g. in earlier tests
I had 20 parallel streams writing 100GB each at over 600MB/sec
aggregate), but read performance is very poor. Right now I'm just
playing with 20 parallel streams of reads from the first 2GB of each
zvol (i.e. 40GB in all).

During each run, I see lots of writes to the L2ARC, but less than a
quarter of that volume in reads. Yet my FC LUNs are hot with 1000s of
reads per second. This doesn't change from run to run.

Why? Surely 20x 2GB of data (and its associated metadata) will sit
nicely in 4x 223GB SSDs?
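For the record, I'm watching the L2ARC with the kstat counters below
(the exact set of fields varies a little between releases, so take this
as a sketch):

  # L2ARC size, hits and misses from the ZFS arcstats kstat:
  kstat -p zfs:0:arcstats | egrep 'l2_(size|hits|misses|read_bytes|write_bytes)'

Phil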
[zfs-discuss] Separate Log Devices
The server I currently have only has 2GB of RAM. At some point I will
be adding more, but I'm not sure when.

I want to add a mirrored zil. I have 2 Intel 32GB SSDSA2SH032G1GN
drives. As such, I have been reading the ZFS Best Practices Guide:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

The guide suggests that the zil be sized to 1/2 the amount of ram in
the server, which would be 1GB. I have a couple of questions:

What happens if I oversize the zil?

If I create a 1GB slice for the zil, can I add another slice for
another zil in the future when more ram is added?

Thanks

Karl
[zfs-discuss] Resilver / Scrub Status
I am running zpool 22 (Solaris 10U9) and I am looking for a way to
determine how much more work has to be done to complete a resilver
operation (it is already at 100%, but I know that is not a really
accurate number).

From my understanding of how the resilver operation works, it walks the
metadata structure and then the transaction groups. So if there is no
write (or snapshot or clone or ...) activity, once it completes the
walk of the metadata it is done (I assume the % complete number is
based on this). If there is write activity, it then replays the TXGs
that came in after the resilver started.

I have two zpools that are resilvering and have write activity. I know
data is still being committed to the devices being resilvered, but I am
looking for a way to determine how close they are to being done.

So is there a kernel structure I can look at (with kstat or mdb) that
will tell me how many TXGs remain to be written to complete the
resilver? I know this will be a dynamic number, but it would help me
decide whether we should idle the replication job (in one of our two
cases) and catch up later (the replication happens over a WAN link, so
it is not very fast, 3 MB/sec maybe) or just wait it out.

I'll be honest, I am nervous with a raidz2 vdev not at full strength,
and I am looking for some comfort :-)
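In the meantime I'm just polling the scan line of zpool status to see
whether it is still moving, along these lines (the pool name is a
placeholder):

  # print the resilver progress line once a minute:
  while :; do
      date
      zpool status tank | grep resilver
      sleep 60
  done

--
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players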