Re: [zfs-discuss] zfs-discuss Digest, Vol 59, Issue 13
On 09.09.2010 at 07:00, zfs-discuss-requ...@opensolaris.org wrote: What's the write workload like? You could try disabling the ZIL to see if that makes a difference. If it does, the addition of an SSD-based ZIL / slog device would most certainly help. Maybe you could describe the makeup of your zpool as well? Ray The zpool is a mirrored root pool (2 SATA 250GB devices). The box is a Dell PE T710. When I copy via NFS, zpool iostat reports 4MB/sec throughout the copy process. When I copy via scp I get a network performance of about 50 MB/sec, and zpool iostat reports 105 MB/sec for a short interval about 5 seconds after scp completes. As far as I can tell, the problem is the NFS commit, which forces the filesystem to write data directly to disk instead of caching the data stream, as happens in the scp example. NFS existed long before SSD-based drives were available. I cannot imagine that NFS performance was ever limited to less than 1/3 the speed of a 10BaseT connection... Martin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
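The difference Martin describes (NFS COMMIT forcing data to stable storage, versus scp just filling the page cache) can be sketched with a small, hypothetical Python timing test. The helper name, file path, and chunk counts are arbitrary, and absolute numbers will vary by machine; fsync stands in for what an NFS commit effectively demands:

```python
import os
import tempfile
import time

def write_1mb_chunks(path, n, sync):
    """Write n 1 MB chunks; optionally fsync after each, like an NFS COMMIT."""
    buf = b"x" * (1 << 20)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    t0 = time.time()
    for _ in range(n):
        os.write(fd, buf)
        if sync:
            os.fsync(fd)  # force data to stable storage before continuing
    os.close(fd)
    return time.time() - t0

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "testfile")
    print("buffered:", write_1mb_chunks(p, 32, sync=False))
    print("fsync'd: ", write_1mb_chunks(p, 32, sync=True))
```

On rotating disks without a slog, the synced variant is typically far slower, which is the 4 MB/sec vs. 50 MB/sec gap Martin sees.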
Re: [zfs-discuss] Suggested RaidZ configuration...
On 9/8/2010 10:08 PM, Freddie Cash wrote: On Wed, Sep 8, 2010 at 6:27 AM, Edward Ned Harvey sh...@nedharvey.com wrote: Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1. No, it (21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all. My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died. No problem, it's ZFS, it's meant to be easy to replace a drive, just offline, swap, replace, wait for it to resilver. Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously, and the box was basically unusable (5 minutes to get the password line to appear on the console). Tried rebooting a few times, stopped all disk I/O to the machine (it was our backups box, running rsync every night for - at the time - 50+ remote servers), let it do its thing. After 3 weeks of trying to get the resilver to complete (or even reach 50%), we pulled the plug and destroyed the pool, rebuilding it using 3x 8-drive raidz2 vdevs. Things have been a lot smoother ever since. Have replaced 8 of the drives (1 vdev) with 1.5 TB drives. Have replaced multiple dead drives. Resilvers, while running outgoing rsync all day and incoming rsync all night, take 3 days for a 1.5 TB drive (with SNMP showing 300 MB/s disk I/O). You most definitely do not want to use a single super-wide raidz vdev. It just won't work.
Instead of the Best Practices Guide saying Don't put more than ___ disks into a single vdev, the BPG should say Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses. Yeah, I still don't buy it. Even spreading disks out such that you have 4 SATA drives per PCI-X/PCIe bus, I don't think you'd be able to get a 500 GB SATA disk to resilver in a 24-disk raidz vdev (even a raidz1) in a 50% full pool. Especially if you are using the pool for anything at the same time. The thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction. This is *regardless* of the amount of data being reconstructed. So, the bottleneck tends to be the IOPS value of the single disk being reconstructed. Thus, having fewer disks in a vdev leads to less data being required to be resilvered, which leads to fewer IOPS being required to finish the resilver. Example (for ease of calculation, let's do the disk-drive mfg's cheat of 1k = 1000 bytes): Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4x32k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive. Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.
From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit. In addition, remember that having more disks means you have to wait longer for each IOPS to complete. That is, it takes longer (fractionally, but in the aggregate, a measurable amount) for 9 drives to each return 14k of info than it does for 4 drives to return 32k of data. This is due to rotational and seek access delays. So, not only are you having to do more total IOPS in Scenario 2, but each IOPS takes longer to complete (the read cycle taking longer, the write/reconstruct cycle taking the same amount of time). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
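Erik's two scenarios can be replayed with a short sketch (a hypothetical helper, not ZFS code; it uses the same 1k = 1000 bytes manufacturer's cheat as the examples above):

```python
def resilver_iops(disk_bytes, ndisks, parity=1, slab_bytes=128_000):
    """Estimate IOPS needed to resilver one failed drive in a raidz vdev."""
    data_disks = ndisks - parity
    # each slab is split across the data disks, so this much lands per disk
    per_disk_bytes = slab_bytes / data_disks
    # one IOPS reconstructs per_disk_bytes on the failed drive
    return disk_bytes / per_disk_bytes

one_tb = 10**12
print(resilver_iops(one_tb, ndisks=5))   # scenario 1: ~31e6 IOPS
print(resilver_iops(one_tb, ndisks=10))  # scenario 2: ~70e6 IOPS
```

The linear growth with the number of data drives falls straight out of the division: more data disks means fewer bytes per disk per slab, hence more slab reads per terabyte reconstructed.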
Re: [zfs-discuss] Suggested RaidZ configuration...
On 9/9/2010 2:15 AM, taemun wrote: Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kb being written to each drive. That doesn't sound very efficient to me. (when I say the above, I mean a five disk raidz or a ten disk raidz2, etc) Cheers, Well, since the size of a slab can vary (from 512 bytes to 128k), it's hard to say. Length (size) of the slab is likely the better determination. Remember each block on a hard drive is 512 bytes (for now). So, it's really not any more efficient to write 16k than 14k (or vice versa). Both are integer multiples of 512 bytes. IIRC, there was something about using a power-of-two number of data drives in a RAIDZ, but I can't remember what that was. It may just be a phantom memory. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] performance leakage when copy huge data
On 08 September, 2010 - Fei Xu sent me these 5,9K bytes: I dig deeper into it and might find some useful information. I attached an X25 SSD for ZIL to see if it helps, but no luck. I ran iostat -xnz for more details and got the interesting results below (maybe too long). Some explanation: 1. c2d0 is the SSD for ZIL. 2. c0t3d0, c0t20d0, c0t21d0, c0t22d0 are the source pool. ...

    extended device statistics
    r/s    w/s   kr/s   kw/s  wait  actv  wsvc_t   asvc_t  %w  %b  device
    0.3    0.0    1.2    0.0   0.0   0.0     0.0      0.1   0   0  c2d0
    0.1   17.7    0.1   51.7   0.0   0.1     0.2      4.1   0   7  c3d0
    0.1    2.1    0.0   79.8   0.0   0.0     0.1      4.0   0   0  c0t2d0
    0.2    0.0    7.1    0.0   0.1   2.3   278.5  11365.1   1  46  c0t3d0

Service time here is crap. 11 seconds to reply.

    0.1    2.2    0.0   79.9   0.0   0.0     0.1      3.7   0   0  c0t5d0
    0.1    2.3    0.0   80.0   0.0   0.0     0.1      9.2   0   0  c0t6d0
    0.1    2.5    0.0   80.1   0.0   0.0     0.1      3.8   0   0  c0t10d0
    0.1    2.4    0.0   80.0   0.0   0.0     0.1      9.5   0   0  c0t11d0
    1.9    0.0  133.0    0.0   0.1   2.8    60.2   1520.6   2  51  c0t20d0

1.5 seconds to reply. crap.

    extended device statistics
    r/s    w/s   kr/s   kw/s  wait  actv  wsvc_t   asvc_t  %w  %b  device
    ...
    0.7    0.0   39.1    0.0   0.0   0.6    64.0    884.1   1  10  c0t3d0
    ...
    2.1    0.0  135.8    0.0   0.1   5.2    67.8   2498.1   3  88  c0t21d0
    ...
    extended device statistics
    r/s    w/s   kr/s   kw/s  wait  actv  wsvc_t   asvc_t  %w  %b  device
    ...
    3.5    0.0  246.8    0.0   0.0   0.8     6.3    229.8   1  20  c0t3d0
    ...
    0.7    0.0   29.2    0.0   0.0   0.6     0.0    911.0   0  12  c0t21d0
    1.9    0.0  138.7    0.0   0.1   4.7    73.0   2428.6   2  66  c0t22d0
    ...

Service times here are crap. Disks are malfunctioning in some way. If your source disks can take seconds (or 10+ seconds) to reply, then of course your copy will be slow. The disk is probably having a hard time reading the data or something. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] performance leakage when copy huge data
Service times here are crap. Disks are malfunctioning in some way. If your source disks can take seconds (or 10+ seconds) to reply, then of course your copy will be slow. Disk is probably having a hard time reading the data or something. Yeah, that should not go over 15ms. I just cannot understand why it starts OK with hundreds of GB of files transferred and then suddenly falls asleep. By the way, the WDIDLE timer, which can cause such issues, is already disabled. I've changed to another system to test ZFS send between an 8*1TB pool and a 4*1TB pool. Hope everything's OK in this case.
Re: [zfs-discuss] Suggested RaidZ configuration...
Very interesting... Well, let's see if we can do the numbers for my setup. From a previous post of mine: [i]This is my exact breakdown (cheap disks on cheap bus :P): PCI-E 8X 4-port ESata Raid Controller. 4 x ESata to 5-Sata Port multipliers (each connected to an ESata port on the controller). 20 x Samsung 1TB HDDs (each connected to a Port Multiplier). The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives gives you 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time. So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB. Best case scenario, we can read 7.2TB at 3Gbps = 57.6 Tb at 3Gbps = 57600 Gb at 3Gbps = 19200 seconds = 320 minutes = 5 hours 20 minutes. Even if it takes twice that amount of time, I'm happy. Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks + 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs. I've never actually sat and done the math before. Hope it's decently accurate :)[/i] My scenario, as from Erik's post: Scenario: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive. Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb.
There will be no other IO while a rebuild is going. Best case: I'll read at 12Gbps, write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB.
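hatish's bandwidth arithmetic can be replayed in a few lines (a sketch; the 60800 Gb figure and the link speeds come from the post above, and the helper name is made up):

```python
def rebuild_seconds(total_gbit, read_gbps):
    # the rebuild is read-bandwidth bound here: time = data / link speed
    return total_gbit / read_gbps

total_gbit = 7600 * 8  # 7600 GB of data to read, expressed in gigabits
print(rebuild_seconds(total_gbit, 12.0))  # best case: ~5067 s (~84 min)
print(rebuild_seconds(total_gbit, 1.5))   # realistic: ~40533 s (~11 h 15 m)
```

As the follow-up in this thread notes, this bandwidth-only model ignores the per-IOPS cost of the resilver, which turns out to dominate.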
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Freddie Cash No, it (21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all. My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died. No problem, it's ZFS, it's meant to be easy to replace a drive, just offline, swap, replace, wait for it to resilver. Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously, I don't believe your situation is typical. I think you either encountered a bug, or you had something happening that you weren't aware of (scrub, autosnapshots, etc) ... because the only time I've ever seen anything remotely similar to the behavior you described was the bug I've mentioned in other emails, which occurs when a disk is 100% full and a scrub is taking place. I know it's not the same bug for you, because you said your pool was only 50% full. But I don't believe that what you saw was normal or typical.
Re: [zfs-discuss] Suggested RaidZ configuration...
On 9/9/2010 5:49 AM, hatish wrote: Very interesting... Well, let's see if we can do the numbers for my setup. From a previous post of mine: [i]This is my exact breakdown (cheap disks on cheap bus :P): PCI-E 8X 4-port ESata Raid Controller. 4 x ESata to 5-Sata Port multipliers (each connected to an ESata port on the controller). 20 x Samsung 1TB HDDs (each connected to a Port Multiplier). The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives gives you 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time. So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB. Best case scenario, we can read 7.2TB at 3Gbps = 57.6 Tb at 3Gbps = 57600 Gb at 3Gbps = 19200 seconds = 320 minutes = 5 hours 20 minutes. Even if it takes twice that amount of time, I'm happy. Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks + 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs. I've never actually sat and done the math before. Hope it's decently accurate :)[/i] My scenario, as from Erik's post: Scenario: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive. Let's assume the drives are at 95% capacity, which is a pretty bad scenario.
So that's 7600GB, which is 60800Gb. There will be no other IO while a rebuild is going. Best case: I'll read at 12Gbps, write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB. Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it. So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
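Erik's IOPS-bound estimate, worked through as a sketch (hypothetical helper; note the result comes out in days, which the follow-ups in this thread also point out):

```python
def iops_bound_rebuild(total_iops, disk_iops=100):
    """Return (seconds, hours, days) for a resilver serialized on one drive's IOPS."""
    seconds = total_iops / disk_iops  # one resilver write per IOPS, serialized
    return seconds, seconds / 3600, seconds / 86400

s, hours, days = iops_bound_rebuild(62.5e6)
print(s, hours, days)  # 625000 s, ~173.6 hours, ~7.2 days
```

625000 seconds is a little over one week, not seven; the "7.25 WEEKS" above is the slip corrected in the replies below.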
Re: [zfs-discuss] Suggested RaidZ configuration...
On Thu, Sep 9, 2010 at 09:03, Erik Trimble erik.trim...@oracle.com wrote: Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it. So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS. No argument on IOPS, but 173 hours is 7 days, or a little over one week. Will
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Erik Trimble the thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction. If I'm trying to interpret your whole message, Erik, and condense it, I think I get the following. Please tell me if and where I'm wrong. In any given zpool, some number of slabs are used in the whole pool. In raidzN, a portion of each slab is written on each disk. Therefore, during resilver, if there are a total of 1 million slabs used in the zpool, it means each good disk will need to read 1 million partial slabs, and the replaced disk will need to write 1 million partial slabs. Each good disk receives a read request in parallel, and all of them must complete before a write is given to the new disk. Each read/write cycle is completed before the next cycle begins. (It seems this could be accelerated by allowing all the good disks to continue reading in parallel instead of waiting, right?) The conclusion I would reach is: Given no bus bottleneck: It is true that resilvering a raidz will be slower with many disks in the vdev, because the average latency for the worst of N disks will increase as N increases. But that effect is only marginal, and bounded between the average latency of a single disk and the worst-case latency of a single disk. The characteristic that *really* makes a big difference is the number of slabs in the pool. i.e. if your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files.
Re: [zfs-discuss] Suggested RaidZ configuration...
From: Hatish Narotam [mailto:hat...@gmail.com] PCI-E 8X 4-port ESata Raid Controller. 4 x ESata to 5Sata Port multipliers (each connected to a ESata port on the controller). 20 x Samsung 1TB HDD's. (each connected to a Port Multiplier). Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm sata disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration. You think that your sata card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your sata controller, but please prove me wrong. ;-) I think the backplane of the sata controller is more likely either 3G or 6G. If it's 3G, then you should use 4 groups of raidz1. If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit). If it's 12G or higher, then you can make all of your drives one big vdev of raidz3. According to Samsungs site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives gives you 10Gbps. I guarantee you this is not a sustainable speed for 7.2krpm sata disks. You can get a decent measure of sustainable speed by doing something like:

    (write 1G byte)
    time dd if=/dev/zero of=/some/file bs=1024k count=1024
    (beware: you might get an inaccurate speed measurement here due to ram buffering. See below.)
    (reboot to ensure nothing is in cache)
    (read 1G byte)
    time dd if=/some/file of=/dev/null bs=1024k
    (Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey The characteristic that *really* makes a big difference is the number of slabs in the pool. i.e. if your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files. Oh, if at least some of my reasoning was correct, there is one valuable take-away point for hatish: Given some number X of total slabs used in the whole pool: If you use a single vdev for the whole pool, you will have X partial slabs written on each disk. If you have 2 vdevs, you'll have approx X/2 partial slabs written on each disk. 3 vdevs ~ X/3 partial slabs on each disk. Therefore, the resilver time approximately divides by the number of separate vdevs you are using in your pool. So the largest factor affecting resilver time of a single large vdev versus many smaller vdevs is NOT the quantity of data written on each disk, but the fact that fewer slabs are used on each disk when using smaller vdevs. If you want to choose between (a) a 21-disk raidz3 versus (b) 3 vdevs of 7-disk raidz1 each, then: The raidz3 provides better redundancy, but has the disadvantage that every slab must be partially written on every disk.
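Ned's take-away scales as sketched below (hypothetical numbers; X slabs assumed to spread evenly across n vdevs):

```python
def partial_slabs_per_disk(total_slabs, nvdevs):
    # each vdev holds roughly 1/nvdevs of the pool's slabs, and every disk
    # in a raidz vdev carries a partial write of each of that vdev's slabs
    return total_slabs // nvdevs

X = 1_000_000
for n in (1, 2, 3):
    print(n, partial_slabs_per_disk(X, n))  # 1000000, 500000, 333333
```

Under this model the 21-disk raidz3 (n = 1) resilvers every slab in the pool, while each 7-disk raidz1 in the 3-vdev layout resilvers only about a third of them.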
Re: [zfs-discuss] Suggested RaidZ configuration...
Erik wrote: Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it. So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS. My OCD is coming out and I will split that hair with you: 173 hours is just over a week. This is a fascinating and timely discussion. My personal (biased and unhindered by facts) preference is wide-stripe RAIDZ3. Ned is right that I kept reading that RAIDZx should not exceed _ devices and couldn't find real numbers behind those conclusions. Discussions in this thread have opened my eyes a little, and I am in the middle of deploying a second 22-disk fibre array on my home server, so I have been struggling with the best way to allocate pools. Up until reading this thread, the biggest downside to wide stripes that I was aware of has been low IOPS. And let's be clear: while on paper the IOPS of a wide stripe is the same as a single disk, it actually is worse. In truth, the service time for any request on a wide stripe is the service time of the SLOWEST disk for that request. The slowest disk may vary from request to request, but will always delay the entire stripe operation. Since all of the 44 spindles are 15K disks, I am about to convince myself to go with two pools of wide stripes and keep several spindles for L2ARC and SLOG. The thinking is that other background operations (scrub and resilver) can take place with little impact to application performance, since those will be using L2ARC and SLOG. Of course, I could be wrong on any of the above. Cheers, Marty
Re: [zfs-discuss] performance leakage when copy huge data
On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote: Service times here are crap. Disks are malfunctioning in some way. If your source disks can take seconds (or 10+ seconds) to reply, then of course your copy will be slow. Disk is probably having a hard time reading the data or something. Yeah, that should not go over 15ms. I just cannot understand why it starts OK with hundreds of GB of files transferred and then suddenly falls asleep. By the way, WDIDLE time is already disabled, which might cause some issue. I've changed to another system to test ZFS send between an 8*1TB pool and a 4*1TB pool. Hope everything's OK in this case. This might be the dreaded WD TLER issue. Basically the drive keeps retrying a read operation over and over after a bit error, trying to recover from the read error itself. With ZFS one really needs to disable this behavior and have the drives fail immediately. Check your drives to see if they have this feature; if so, think about replacing the drives in the source pool that have long service times, and make sure this feature is disabled on the destination pool drives. -Ross
Re: [zfs-discuss] performance leakage when copy huge data
On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote: This might be the dreaded WD TLER issue. Basically the drive keeps retrying a read operation over and over after a bit error, trying to recover from the read error itself. With ZFS one really needs to disable this behavior and have the drives fail immediately. Check your drives to see if they have this feature; if so, think about replacing the drives in the source pool that have long service times, and make sure this feature is disabled on the destination pool drives. -Ross It might be due to TLER issues, but I'd try to pin the Greens down to SATA1 mode (use a jumper, or force it via the controller). It might help a bit with these disks, although these are not really suitable disks for use in any raid configuration due to the TLER issue, which cannot be disabled in later firmware versions. Yours Markus Kovero
Re: [zfs-discuss] zpool create using whole disk - do I add p0? E.g. c4t2d0 or c42d0p0
Hi-- It might help to review the disk component terminology description:

    c#t#d#p# = represents the fdisk partition on x86 systems, where you can have up to 4 fdisk partitions, such as one for the Solaris OS or a Windows OS. An fdisk partition is the larger container of the disk or disk slices.
    c#t#d#   = represents the whole disk.
    c#t#d#s# = represents the disk slice, used for the root pool because of the current boot limitation that says we must boot from a slice.

The issue is that if you don't understand that the c#t#d#p# device contains the c#t#d# or c#t#d#s# devices, you might create a pool that contains p#, d#, and s# components, in an overlapping kind of way (we've seen it). A bug exists to prevent pool creation with p# devices. You are probably okay if you use c0t0d0p0 and c0t1d0p0 and never overlap the fdisk components, but we don't test this configuration and it's not supported. Thanks, Cindy On 09/08/10 23:07, R.G. Keen wrote: Hi Craig, Don't use the p* devices for your storage pools. They represent the larger fdisk partition. Use the d* devices instead, like this example below: Good advice, something I wondered about too. However, aside from my having guessed right once (I think...) I have no clue why this should be. Can you expound a bit on the reasoning behind this advice?
[zfs-discuss] NFS performance near zero on a very full pool
Hi, currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal. What I found out so far: It seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write takes up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero. It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, now the writes get served continuously as it should be. What is the explanation for this behaviour? Is it intentional and can the threshold be tuned? I experienced this on Sol10 U8. Thanks, Arne
Re: [zfs-discuss] performance leakage when copy huge data
On Thu, 9 Sep 2010 14:05:51 +, Markus Kovero markus.kov...@nebula.fi wrote: On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote: This might be the dreaded WD TLER issue. Basically the drive keeps retrying a read operation over and over after a bit error, trying to recover from the read error on its own. With ZFS one really needs to disable this and have the drives fail immediately. Check your drives to see if they have this feature; if so, think about replacing the drives in the source pool that have long service times, and make sure this feature is disabled on the destination pool drives. -Ross It might be due to TLER issues, but I'd try to pin the Greens down to SATA1 mode (use a jumper, or force it via the controller). It might help a bit with these disks, although these are not really suitable disks for use in any raid configuration due to the TLER issue, which cannot be disabled in later firmware versions. Yours Markus Kovero Just to clarify - do you mean TLER should be off or on? TLER = Time Limited Error Recovery, so the drive only takes a max time (eg: 7 seconds) to retrieve data or returns an error. So you say 'cannot be disabled' but I think you mean 'cannot be ENABLED'? I've been doing a lot of research for a new storage box at work, and from reading a lot of the info available in the Storage forum on hardforum.com, the experts there seem to recommend NOT having TLER enabled when using ZFS, as ZFS can be configured for its timeouts, etc, and the main reason to use TLER is when using those drives with hardware RAID cards, which will kick a drive out of the array if it takes longer than 10 seconds. Can anyone else here comment if they have had experience with the WD drives and ZFS and if they have TLER enabled or disabled? Cheers, Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance near zero on a very full pool
Arne, NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge. Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free size and I thought it was 90%. Neil. On 09/09/10 09:00, Arne Jansen wrote: Hi, currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal. What I found out so far: it seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write takes up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero. It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, and now the writes get served continuously, as they should be. What is the explanation for this behaviour? Is it intentional, and can the threshold be tuned? I experienced this on Sol10 U8. Thanks, Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] NetApp/Oracle-Sun lawsuit done
Seems that things have been cleared up: NetApp (NASDAQ: NTAP) today announced that both parties have agreed to dismiss their pending patent litigation, which began in 2007 between Sun Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits dismissed without prejudice. The terms of the agreement are confidential. http://tinyurl.com/39qkzgz http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html A recap of the history at: http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance near zero on a very full pool
I should also have mentioned that if the pool has a separate log device then this shouldn't happen. Assuming the slog is big enough, it should have enough blocks to not be forced into using main pool device blocks. Neil. On 09/09/10 10:36, Neil Perrin wrote: Arne, NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge. Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free size and I thought it was 90%. Neil. On 09/09/10 09:00, Arne Jansen wrote: Hi, currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal. What I found out so far: it seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write takes up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero. It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, and now the writes get served continuously, as they should be. What is the explanation for this behaviour? Is it intentional, and can the threshold be tuned? I experienced this on Sol10 U8.
Thanks, Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance near zero on a very full pool
Hi Neil, Neil Perrin wrote: NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge. I think this is not what we saw, for two reasons: a) we have a mirrored slog device. According to zpool iostat -v only 16MB out of 4GB were in use. b) it didn't seem like the txg would have been closed early. Rather, it kept to approximately the 30-second intervals. Internally we came up with a different explanation, without any backing that it might be correct: when the pool reaches 96%, zfs goes into a 'self defense' mode. Instead of allocating blocks from the ZIL, every write turns synchronous and has to wait for the txg to finish naturally. The reasoning behind this might be that even if the ZIL is available, there might not be enough space left to commit the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above 96%. While this might be proper for small pools, on large pools 4% are still some TB of free space, so there should be an upper limit of maybe 10GB on this hidden reserve. Also this sudden switch of behavior is completely unexpected and at least under-documented. Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free size and I thought it was 90%. In this situation, not only writes suffered, but as a side effect reads also came to a nearly complete halt. -- Arne Neil. On 09/09/10 09:00, Arne Jansen wrote: Hi, currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%).
Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal. What I found out so far: it seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write takes up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero. It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, and now the writes get served continuously, as they should be. What is the explanation for this behaviour? Is it intentional, and can the threshold be tuned? I experienced this on Sol10 U8. Thanks, Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
This is welcome news. -- richard On Sep 9, 2010, at 9:38 AM, David Magda wrote: Seems that things have been cleared up: NetApp (NASDAQ: NTAP) today announced that both parties have agreed to dismiss their pending patent litigation, which began in 2007 between Sun Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits dismissed without prejudice. The terms of the agreement are confidential. http://tinyurl.com/39qkzgz http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html A recap of the history at: http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/ -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com ZFS and performance consulting http://www.RichardElling.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
On 9/9/2010 10:25 AM, Richard Elling wrote: This is welcome news. -- richard On Sep 9, 2010, at 9:38 AM, David Magda wrote: Seems that things have been cleared up: NetApp (NASDAQ: NTAP) today announced that both parties have agreed to dismiss their pending patent litigation, which began in 2007 between Sun Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits dismissed without prejudice. The terms of the agreement are confidential. http://tinyurl.com/39qkzgz http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html A recap of the history at: http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss Yes, it's welcome to get it over with. I do get to bitch about one aspect here of the US civil legal system, though. If you've gone so far as to burn our (the public's) time and money to file a lawsuit, you shouldn't be able to seal up the court transcript, or have a non-public settlement. Call it the price you pay for wasting our time (i.e. the court system's time). -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance near zero on a very full pool
On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote: Hi Neil, Neil Perrin wrote: NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge. I think this is not what we saw, for two reasons: a) we have a mirrored slog device. According to zpool iostat -v only 16MB out of 4GB were in use. b) it didn't seem like the txg would have been closed early. Rather, it kept to approximately the 30-second intervals. Internally we came up with a different explanation, without any backing that it might be correct: when the pool reaches 96%, zfs goes into a 'self defense' mode. Instead of allocating blocks from the ZIL, every write turns synchronous and has to wait for the txg to finish naturally. The reasoning behind this might be that even if the ZIL is available, there might not be enough space left to commit the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above 96%. While this might be proper for small pools, on large pools 4% are still some TB of free space, so there should be an upper limit of maybe 10GB on this hidden reserve. I do not believe this is correct. At 96% the first-fit algorithm changes to best-fit and ganging can be expected. This has nothing to do with the ZIL. There is already a reserve set aside for metadata and the ZIL so that you can remove files when the file system is 100% full. This reserve is 32 MB or 1/64 of the pool size. Also this sudden switch of behavior is completely unexpected and at least under-documented.
Methinks you are just seeing the change in performance from the allocation algorithm change. Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free size and I thought it was 90%. In this situation, not only writes suffered, but as a side effect reads also came to a nearly complete halt. If you have atime=on, then reads create writes. -- richard -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com ZFS and performance consulting http://www.RichardElling.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
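For scale, the reserve Richard mentions can be computed directly. I'm reading his "32 MB or 1/64 of the pool size" as the larger of the two, which is an assumption on my part; check the metaslab/spa code for the exact rule:

```python
def zfs_reserve_bytes(pool_size_bytes):
    """Space ZFS holds back for metadata/ZIL so deletes still work
    on a full pool: the larger of 32 MB or 1/64 of the pool size
    (my reading of the '32 MB or 1/64' in this thread)."""
    return max(32 * 2**20, pool_size_bytes // 64)

GiB = 2**30
# On Arne's scale the reserve is small: a 10 TB pool reserves
# 1/64 of its size, i.e. 160 GiB, far less than the ~400 GB still
# free at the 96% mark he describes.
print(zfs_reserve_bytes(10 * 2**40) // GiB)
```

Either way, this reserve is distinct from the allocator's first-fit/best-fit switch, which is what Richard points to as the cause of the slowdown.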
Re: [zfs-discuss] NFS performance near zero on a very full pool
Richard Elling wrote: On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote: Hi Neil, Neil Perrin wrote: NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge. I think this is not what we saw, for two reasons: a) we have a mirrored slog device. According to zpool iostat -v only 16MB out of 4GB were in use. b) it didn't seem like the txg would have been closed early. Rather, it kept to approximately the 30-second intervals. Internally we came up with a different explanation, without any backing that it might be correct: when the pool reaches 96%, zfs goes into a 'self defense' mode. Instead of allocating blocks from the ZIL, every write turns synchronous and has to wait for the txg to finish naturally. The reasoning behind this might be that even if the ZIL is available, there might not be enough space left to commit the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above 96%. While this might be proper for small pools, on large pools 4% are still some TB of free space, so there should be an upper limit of maybe 10GB on this hidden reserve. I do not believe this is correct. At 96% the first-fit algorithm changes to best-fit and ganging can be expected. This has nothing to do with the ZIL. There is already a reserve set aside for metadata and the ZIL so that you can remove files when the file system is 100% full. This reserve is 32 MB or 1/64 of the pool size. Maybe it is some side-effect of this change of allocation scheme. But I'm very sure about what I saw.
The change was drastic and abrupt. I had a dtrace script running that measured the time for rfs3_write to complete. With the pool at 96% I saw a burst of writes every 30 seconds, with completion times of up to 30s. With the pool at 95%, I saw a continuous stream of writes with completion times of mostly a few microseconds. In this situation, not only writes suffered, but as a side effect reads also came to a nearly complete halt. If you have atime=on, then reads create writes. atime is off. The impact on reads/lookups/getattr came, imho, because all server threads had been occupied by blocking writes for a prolonged time. I'll try to reproduce this on a test machine. -- Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
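For anyone wanting to repeat the measurement, the kind of dtrace script Arne describes might look like the sketch below. This is an untested sketch: the fbt probe name is an assumption (rfs3_write is the NFSv3 server write handler, but fbt probe availability varies by Solaris release), and the output format is whatever quantize() produces.

```d
#!/usr/sbin/dtrace -s
/* Latency distribution of NFSv3 WRITE handling, printed every 30 s. */
fbt::rfs3_write:entry  { self->ts = timestamp; }
fbt::rfs3_write:return /self->ts/ {
    @lat["rfs3_write ns"] = quantize(timestamp - self->ts);
    self->ts = 0;
}
tick-30s { printa(@lat); trunc(@lat); }
```

On a healthy pool the histogram should sit in the microsecond buckets; the pathological case shows a second mode out near the txg interval.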
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
On Thu, 9 Sep 2010, Erik Trimble wrote: Yes, it's welcome to get it over with. I do get to bitch about one aspect here of the US civil legal system, though. If you've gone so far as to burn our (the public's) time and money to file a lawsuit, you shouldn't be able to seal up the court transcript, or have a non-public settlement. Call it the price you pay for wasting our time (i.e. the court system's time). Unfortunately, this may just be a case of Oracle's patents vs NetApp's patents. Oracle obviously holds a lot of patents and could counter-sue using one of its own patents. Oracle's handshake agreement with NetApp does not in any way shield other zfs commercial users from a patent lawsuit from NetApp. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] performance leakage when copy huge data
ml == Mark Little marklit...@koallo.com writes: ml Just to clarify - do you mean TLER should be off or on? It should be set to ``do not have asvc_t 11 seconds and 1 io/s''. ...which is not one of the settings of the TLER knob. This isn't a problem with the TLER *setting*. TLER does not even apply unless the drive has a latent sector error. TLER does not even apply unless the drive has a latent sector error. TLER does not even apply unless the drive has a latent sector error. GOT IT? so if the drive is not defective, but is erratically having huge latency when not busy, this isn't a TLER problem. It's a drive-is-unpredictable-piece-of-junk problem. Will the problem go away if you change the TLER setting to the opposite of whatever it is? Who knows?! It shouldn't based on the claimed purpose of TLER, but in reality, maybe, maybe not, because the drive shouldn't (``shouldn't'', haha) act like that to begin with. It will be more likely to go away if you replace the drive with a different model, though. ml Storage forum on hardforum.com, the experts there seem to ml recommend NOT having TLER enabled when using ZFS as ZFS can be ml configured for its timeouts, etc, I don't believe there are any configurable timeouts in ZFS. The ZFS developers take the position that timeouts are not our problem and push all that work down the stack to the controller driver and the disk driver, which cooperate (this is two drivers, now. plus a third ``SCSI mid-layer'' perhaps, for some controllers but not others.) to implement a variety of inconsistent, silly, undocumented cargo-cult flailing timeout regimes that we all have to put up with. However they are always quite long. The ATA max timeout is 30sec, and AIUI they are all much longer than that. My new favorite thing, though, is the reference counting. OS: ``This disk/iSCSIdisk is `busy' so you can't detach it''. me: ``bullshit. YOINK, detached, now deal with it.'' IMO this area is in need of some serious bar-raising. 
ml and the main reason to use TLER is when using those drives ml with hardware RAID cards which will kick a drive out of the ml array if it takes longer than 10 seconds. yup. which is something the drive will not do unless it encounters an ERROR. that is the E in TLER. In other words, the feature as described prevents you from noticing and invoking warranty replacement on your about-to-fail drive. For this you pay double. Have I got that right? In any case the obvious proper place to fix this is in the RAID-on-a-card firmware, not the disk firmware, if it does even need fixing, which is unclear to me. unless the disk manufacturers are going to offer a feature ``do not spend more than 1 second out of every 2 seconds `trying harder' to read marginal data, just return errors'' which would actually have real value, the only reason TLER is proper is that it can convince all you gamers to pay twice as much for a drive because they've flipped a single bit in the firmware and then shovelled a big pile of bullshit into your heads. ml Can anyone else here comment if they have had experience with ml the WD drives and ZFS and if they have TLER enabled or ml disabled? I do not have any problems with drives dropping out of ZFS using the normal TLER setting. I do have problems with slowly-failing drives fucking up the whole system. ZFS doesn't deal with them gracefully, and I have to find the bad drive and remove it by hand. All this stuff about cold spares automatically replacing and users never noticing is largely a fantasy. Neither observation leads me to want TLER. however observations like this ``why did my disks suddenly slow down?'' lead me to avoid WD drives period, for ZFS or not ZFS or anything at all. Whipping up all this marketing silliness around TLER also leads me to avoid them because I know they will shovel bullshit and FUD to justify jacked prices.
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
dm == David Magda dma...@ee.ryerson.ca writes: dm http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/ http://www.groklaw.net/articlebasic.php?story=20050121014650517 says when the MPL was modified to become the CDDL, clauses were removed which would have required Oracle to disclose any patent licenses it might have negotiated with NetApp covering CDDL code. The disclosure would have to be added to hg, freeze or no: ``If Contributor obtains such knowledge after the Modification is made available as described in Section 3.2, Contributor shall promptly modify the LEGAL file in all copies Contributor makes available thereafter and shall take other steps (such as notifying appropriate mailing lists or newsgroups) reasonably calculated to inform those who received the Covered Code that new knowledge has been obtained.'' This is in MPL but removed from CDDL. The groklaw poster's concern is that this is a mechanism through which Oracle could maneuver to make the CDDL worthless as a guarantee of zfs users' software freedom. CDDL does implicitly grant rights to Oracle's patents, but not to negotiations for shield from NetApp's. AIUI GPLv3 is different and does not have this problem, though I don't understand it well so I could be wrong. With MPL at least we would know about the negotiations: the settlement was ``secret'' which is exactly the disaster scenario the groklaw poster warned of. I'm sorry you cannot be uninterested in licenses and ``just want to get work done.'' To me it looks like the patent situation is mostly an obstacle to getting ZFS development funded. If you used ZFS secretly in some kind of cloud service, and never told anyone about it, you could be pretty certain of getting away with it without any patent claims throughout the entire decade or so that ZFS remains relevant, but if you want to participate in a horizontally-divided market like Coraid, or otherwise share source changes, you might get sued.
This regime has to be a huge drag on the industry, and it makes things really unpredictable which has to discourage investment, and it strongly favours large companies. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
On 9/9/2010 11:11 AM, Garrett D'Amore wrote: On Thu, 2010-09-09 at 12:58 -0500, Bob Friesenhahn wrote: On Thu, 9 Sep 2010, Erik Trimble wrote: Yes, it's welcome to get it over with. I do get to bitch about one aspect here of the US civil legal system, though. If you've gone so far as to burn our (the public's) time and money to file a lawsuit, you shouldn't be able to seal up the court transcript, or have a non-public settlement. Call it the price you pay for wasting our time (i.e. the court system's time). Unfortunately, this may just be a case of Oracle's patents vs NetApp's patents. Oracle obviously holds a lot of patents and could counter-sue using one of its own patents. Oracle's handshake agreement with NetApp does not in any way shield other zfs commercial users from a patent lawsuit from NetApp. True. But, I wonder if the settlement sets a precedent? Certainly the lack of a successful lawsuit has *failed* to set any precedent conclusively indicating that NetApp has enforceable patents where ZFS is concerned. IANAL, but it seems like if Oracle and NetApp were to reach some kind of licensing arrangement, then it might be construed to be anticompetitive if NetApp were to fail to offer similar licensing arrangements to downstream consumers? Does anyone know if there is any basis for such a theory, or are these just my idle imaginings? As far as I know, Nexenta has not been approached by NetApp. I'd like to see what happens with Coraid ... but ultimately those conversations are between Coraid and NetApp. - Garrett This is *exactly* the reason I advocate forced public settlement agreements. If you've availed yourself of the court system, you should be obligated to put into the public record any agreement reached, just as if you had gotten a verdict. It would help prevent a lot of the cross-licensing discrimination due to secrecy. Oh well. 
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
On Thu, 9 Sep 2010, Garrett D'Amore wrote: True. But, I wonder if the settlement sets a precedent? No precedent has been set. Certainly the lack of a successful lawsuit has *failed* to set any precedent conclusively indicating that NetApp has enforceable patents where ZFS is concerned. Right. IANAL, but it seems like if Oracle and NetApp were to reach some kind of licensing arrangement, then it might be construed to be anticompetitive if NetApp were to fail to offer similar licensing arrangements to downstream consumers? Does anyone know if there is any basis for such a theory, or are these just my idle imaginings? Idle imaginings. A patent holder is not compelled to license use of the patent to anyone else, and can be selective regarding who gets a license. As far as I know, Nexenta has not been approached by NetApp. I'd like to see what happens with Coraid ... but ultimately those conversations are between Coraid and NetApp. There should be little doubt that NetApp's goal was to make money by suing Sun. Nexenta does not have enough income/assets to make a risky lawsuit worthwhile. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] resilver = defrag?
A) Resilver = Defrag. True/false? B) If I buy larger drives and resilver, does defrag happen? C) Does zfs send zfs receive mean it will defrag? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How to migrate to 4KB sector drives?
ZFS does not handle 4K sector drives well, you need to create a new zpool with 4K property (ashift) set. http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html Are there plans to allow resilver to handle 4K sector drives? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
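Part of why the wrong ashift hurts: a drive with 4 KB physical sectors that emulates 512 B logical sectors must read-modify-write any physical sector a write only partially covers. A toy model of the extra reads (my own illustration, not from the linked article):

```python
def rmw_reads(offset, length, physical=4096):
    """Extra physical-sector reads a 4 KB-native drive must perform
    before servicing one logical write: one per ragged (partially
    covered) physical sector at either end of the write range."""
    partial = set()
    if offset % physical:
        partial.add(offset // physical)          # ragged head sector
    end = offset + length
    if end % physical:
        partial.add((end - 1) // physical)       # ragged tail sector
    return len(partial)

# A 4 KB-aligned write (what ashift=12 guarantees) needs no extra
# reads, while a 512 B write at a 512 B boundary (ashift=9) does.
print(rmw_reads(0, 4096), rmw_reads(512, 512))
```

With ashift=9 the pool freely issues 512 B-granularity writes, so these read-modify-write cycles happen constantly; ashift=12 keeps every allocation 4 KB-aligned.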
Re: [zfs-discuss] resilver = defrag?
On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote: A) Resilver = Defrag. True/false? False. Resilver just rebuilds a drive in a vdev based on the redundant data stored on the other drives in the vdev. Similar to how replacing a dead drive works in a hardware RAID array. B) If I buy larger drives and resilver, does defrag happen? No. C) Does zfs send zfs receive mean it will defrag? No. ZFS doesn't currently have a defragmenter. That will come when the legendary block pointer rewrite feature is committed. -- Freddie Cash fjwc...@gmail.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Bob Friesenhahn There should be little doubt that NetApp's goal was to make money by suing Sun. Nexenta does not have enough income/assets to make a risky lawsuit worthwhile. But in all likelihood, Apple still won't touch ZFS. Apple would be worth suing. A big fat juicy... One interesting take-away point, however: Oracle is now in a solid position to negotiate with Apple. If Apple wants to pay for ZFS and indemnification against a NetApp lawsuit, Oracle can grant it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver = defrag?
I am speaking from my own observations and nothing scientific such as reading the code or designing the process. A) Resilver = Defrag. True/false? False B) If I buy larger drives and resilver, does defrag happen? No. The first X sectors of the bigger drive are identical to the smaller drive, fragments and all. C) Does zfs send zfs receive mean it will defrag? Yes. The data is laid out on the receiving side in a sane manner, until it later becomes fragmented. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver = defrag?
On Thu, Sep 9, 2010 at 1:26 PM, Freddie Cash fjwc...@gmail.com wrote: On Thu, Sep 9, 2010 at 1:04 PM, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote: A) Resilver = Defrag. True/false? False. Resilver just rebuilds a drive in a vdev based on the redundant data stored on the other drives in the vdev. Similar to how replacing a dead drive works in a hardware RAID array. B) If I buy larger drives and resilver, does defrag happen? No. Actually, thinking about it ... since the resilver is writing new data to an empty drive, in essence, the drive is defragmented. C) Does zfs send zfs receive mean it will defrag? No. Same here, but only if the receiving pool has never had any snapshots deleted or files deleted, so that there are no holes in the pool. Then the newly written data will be contiguous (not fragmented). ZFS doesn't currently have a defragmenter. That will come when the legendary block pointer rewrite feature is committed. -- Freddie Cash fjwc...@gmail.com -- Freddie Cash fjwc...@gmail.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
On Thu, Sep 9, 2010 at 2:49 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Thu, 9 Sep 2010, Garrett D'Amore wrote: True. But, I wonder if the settlement sets a precedent? No precedent has been set. Certainly the lack of a successful lawsuit has *failed* to set any precedent conclusively indicating that NetApp has enforceable patents where ZFS is concerned. Right. IANAL, but it seems like if Oracle and NetApp were to reach some kind of licensing arrangement, then it might be construed as anticompetitive if NetApp were to fail to offer similar licensing arrangements to downstream consumers? Does anyone know if there is any basis for such a theory, or are these just my idle imaginings? Idle imaginings. A patent holder is not compelled to license use of the patent to anyone else, and can be selective regarding who gets a license. As far as I know, Nexenta has not been approached by NetApp. I'd like to see what happens with Coraid ... but ultimately those conversations are between Coraid and NetApp. There should be little doubt that NetApp's goal was to make money by suing Sun. Nexenta does not have enough income/assets to make a risky lawsuit worthwhile. There should be little doubt that it's a complete waste of money for NetApp to go to court with a second party when the outcome of their primary lawsuit will determine the outcome of the second. They had absolutely nothing to gain by suing Nexenta while they still had a pending lawsuit with Sun. Furthermore, unless you work as legal counsel for Nexenta, I'd say you have absolutely no clue whether or not they received a cease and desist from NetApp. If they had received one, I strongly doubt it's something they would want to publicize; it wouldn't exactly give potential customers the warm and fuzzies. I *STRONGLY* doubt the goal was money for NetApp. They've got that coming out of their ears. It was either cross-licensing issues (almost assuredly this), or a hope to stop/slow down ZFS.
--Tim
Re: [zfs-discuss] Suggested RaidZ configuration...
Comment at end... Mattias Pantzare wrote: On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey sh...@nedharvey.com wrote: From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias Pantzare It is about 1 vdev with 12 disk or 2 vdev with 6 disks. If you have 2 vdev you have to read half the data compared to 1 vdev to resilver a disk. Let's suppose you have 1T of data. You have 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G. Let's suppose you have 1T of data. You have 2 vdev's that are each 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G. Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdev's each containing a 7 disk raidz1. In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and cheap disks for crap bus, so approx 5 disks max out approx 1 bus.) In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ... Instead of the Best Practices Guide saying Don't put more than ___ disks into a single vdev, the BPG should say Avoid the bus bandwidth bottleneck by constructing your vdev's using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses. This is assuming that you have no other IO besides the scrub. 
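Edward's back-of-the-envelope numbers can be sketched as a toy calculation (pure arithmetic from the example above; it deliberately ignores parity overhead, seek behavior, and any competing I/O):

```python
# Per-disk resilver I/O for the two layouts in the example above:
# 1 TB (here 1000 GB) of pool data, one 12-disk raidz2 vdev versus
# two 6-disk raidz1 vdevs. Only byte counts are modeled.

def per_disk_gb(pool_data_gb, vdevs, disks_per_vdev):
    """Approximate GB each surviving disk reads (and the new disk writes)."""
    data_per_vdev = pool_data_gb / vdevs   # data is striped evenly across vdevs
    return data_per_vdev / disks_per_vdev  # and evenly across disks in a vdev

wide = per_disk_gb(1000, vdevs=1, disks_per_vdev=12)   # 12-disk raidz2
narrow = per_disk_gb(1000, vdevs=2, disks_per_vdev=6)  # 2 x 6-disk raidz1

# Both layouts move roughly the same data per disk (~83 GB), which is
# the basis of the "equal resilver time" claim, bus bottlenecks aside.
print(round(wide, 1), round(narrow, 1))
```

The model only supports the "bytes moved" part of the argument; Freddie's report earlier in this digest shows that seek-bound random I/O on a wide vdev can dominate in practice.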
You should of course keep the number of disks in a vdev low for general performance reasons unless you only have linear reads (as your IOPS will be close to what only one disk can give for the whole vdev). There is another optimization in the Best Practices Guide that says the number of devices in a vdev should be (N+P), with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8; that is, 2^k + P devices, where k is 1, 2, or 3 and P is the RAIDZ level. Optimal sizes: RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev; RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev; RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev.
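The quoted sizing rule reduces to a one-liner; a quick sketch (the parity levels and resulting device counts are taken from the guide text above):

```python
# "Power-of-two data disks plus P parity" rule from the Best Practices
# Guide discussion above: a raidz vdev with P parity disks is sized
# optimally when its data-disk count is 2, 4, or 8.

def optimal_vdev_widths(parity, max_exp=3):
    """Total device counts (data + parity) for exponents 1..max_exp."""
    return [2 ** k + parity for k in range(1, max_exp + 1)]

print(optimal_vdev_widths(1))  # raidz1 -> [3, 5, 9]
print(optimal_vdev_widths(2))  # raidz2 -> [4, 6, 10]
print(optimal_vdev_widths(3))  # raidz3 -> [5, 7, 11]
```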
Re: [zfs-discuss] Suggested RaidZ configuration...
Erik Trimble wrote: On 9/9/2010 2:15 AM, taemun wrote: Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kb being written to each drive. That doesn't sound very efficient to me. (when I say the above, I mean a five disk raidz or a ten disk raidz2, etc) Cheers, Well, since the size of a slab can vary (from 512 bytes to 128k), it's hard to say. Length (size) of the slab is likely the better determination. Remember each block on a hard drive is 512 bytes (for now). So, it's really not any more efficient to write 16k than 14k (or vice versa). Both are integer multiples of 512 bytes. IIRC, there was something about using a power-of-two number of data drives in a RAIDZ, but I can't remember what that was. It may just be a phantom memory. Not a phantom memory... From Matt Ahrens in a thread titled 'Metaslab alignment on RAID-Z': http://www.opensolaris.org/jive/thread.jspa?messageID=60241 'To eliminate the blank round up sectors for power-of-two blocksizes of 8k or larger, you should use a power-of-two plus 1 number of disks in your raid-z group -- that is, 3, 5, or 9 disks (for double-parity, use a power-of-two plus 2 -- that is, 4, 6, or 10). Smaller blocksizes are more constrained; for 4k, use 3 or 5 disks (for double parity, use 4 or 6) and for 2k, use 3 disks (for double parity, use 4).' These round up sectors are skipped and used as padding to simplify space accounting and improve performance. I may have referred to them as zero padding sectors in other posts, however they're not necessarily zeroed. See the thread titled 'raidz stripe size (not stripe width)' http://opensolaris.org/jive/thread.jspa?messageID=495351 This looks to be the reasoning behind the optimization in the ZFS Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8. I.e. 
2^k + P devices, where k is 1, 2, or 3 and P is the RAIDZ level. Optimal sizes: RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev; RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev; RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev. The Best Practices Guide recommendation of 3-9 devices per vdev appears to be based on RAIDZ1's optimal sizes of 3-9 devices for k = 1 to 3. Victor Latushkin said the same thing in a thread titled 'odd versus even', and Adam Leventhal said in the same thread that this has a 'very slight space-efficiency benefit': http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg05460.html --- That said, the recommendations in the Best Practices Guide for RAIDZ2 to start with 5 disks and RAIDZ3 to start with 8 disks do not match the last recommendation. What is the reasoning behind 5 and 8? Reliability vs. space? Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1). Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2). Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3). (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8. Perhaps the Best Practices Guide should also recommend: - the use of striped vdevs in order to bring up the IOPS number, particularly when using enough hard drives to meet the capacity and reliability requirements - avoiding slow consumer-class drives (fast ones may be okay for some users) - more sample array configurations for common drive chassis capacities - considering a RAIDZ1 main pool with a RAIDZ1 backup pool rather than higher-level RAIDZ or mirroring (touch on the value of backup vs.
stronger RAIDZ) - watching out for BIOS or firmware upgrades that change host protected area (HPA) settings on drives, making them appear smaller than before. The BPG should also resolve this discrepancy: the Storage Pools section says 'For production systems, use whole disks rather than slices for storage pools', while Additional Cautions for Storage Pools says 'Consider planning ahead and reserving some space by creating a slice which is smaller than the whole disk instead of the whole disk.' --- Other (somewhat) related threads: From Darren Dunham in a thread titled 'ZFS raidz2 number of disks': http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/dd1b5997bede5265 '1. Why is the recommendation for a raidz2 3-9 disks? What are the cons of having 16 in a pool compared to 8? Reads potentially have to pull data from all data columns to reconstruct a filesystem block for verification. For random read workloads, increasing the number of columns in the raidz does not increase the read IOPS. So limiting the column count usually makes sense (with a cost tradeoff). 16 is valid, but not recommended.' From Richard Elling in a thread titled 'rethinking RaidZ and Record size': http://opensolaris.org/jive/thread.jspa?threadID=121016 'The raidz pathological worst case is a random read
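The round-up sectors Matt Ahrens describes above come down to divisibility; a toy check (deliberately simplified, and not ZFS's actual raidz space-accounting logic):

```python
# Split a power-of-two ZFS block across N data disks and ask whether
# each disk's share is a whole number of 512-byte sectors. It is only
# when N is itself a power of two, matching the 3/5/9 (raidz1) and
# 4/6/10 (raidz2) vdev width recommendations quoted above.

SECTOR = 512
BLOCK = 128 * 1024  # 128 KiB, the largest blocksize, typical for streaming writes

def splits_into_whole_sectors(block_bytes, data_disks):
    chunk, remainder = divmod(block_bytes, data_disks)
    return remainder == 0 and chunk % SECTOR == 0

for n in (2, 3, 4, 5, 8):
    ok = splits_into_whole_sectors(BLOCK, n)
    print(f"{n} data disks: {'whole sectors' if ok else 'needs round-up padding'}")
```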
Re: [zfs-discuss] performance leakage when copy huge data
Just to update the status and findings: I've checked the TLER settings and they are off by default. I moved the source pool to another chassis and did the 3.8 TB send again; this time, no problems! The differences are: 1. new chassis; 2. bigger memory, 32 GB vs. 12 GB; 3. although wdidle is disabled by default, I've changed the HD mode from silent to performance in HDTune. This is something I once read on a website that might also fix the disk head park/unpark issue (aka C1). It seems TLER is not the root cause, or at least leaving it off is OK. My next steps will be: 1. move the HDs back to see if it's the performance mode that fixed the issue; 2. if not, add more memory and try again. By the way, in HDTune I saw that C7 (Ultra DMA CRC error count) is a little high, which indicates a potential connection issue. Maybe it's all caused by the enclosure?
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar A) Resilver = Defrag. True/false? I think everyone will agree false on this question. However, more detail may be appropriate; see below. B) If I buy larger drives and resilver, does defrag happen? Scores so far: 2 No, 1 Yes. C) Does zfs send | zfs receive mean it will defrag? Scores so far: 1 No, 2 Yes. ... Does anybody here know what they're talking about? I'd feel good if perhaps Erik ... or Neil ... perhaps ... answered the question with actual knowledge. Thanks...
Re: [zfs-discuss] Suggested RaidZ configuration...
From: Haudy Kazemi [mailto:kaze0...@umn.edu] There is another optimization in the Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8. I.e. 2^N + P where N is 1, 2, or 3 and P is the RAIDZ level. I.e. Optimal sizes RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev This sounds logical, although I don't know how real it is. The logic seems to be: assuming slab sizes of 128K, the amount of data written to each disk within the vdev gets divided into something which is a multiple of 512b or 4K (newer drives supposedly starting to use 4K sectors instead of 512b). But I have doubts about the real-ness here, because an awful lot of the time your actual slabs are smaller than 128K, simply because you're not performing sustained sequential writes very often. But it seems to make sense: whenever you *do* have some sequential writes, you would want the data written to each disk to be a multiple of 512b or 4K. If you had a 128K slab divided into 5, then each disk would write 25.6K, and even for sustained sequential writes some degree of fragmentation would be impossible to avoid. Actually, I don't think fragmentation is technically the correct term for that behavior; it might be more appropriate to simply say it forces a less-than-100% duty cycle. And another thing: doesn't the checksum take up some space anyway? Even if you obeyed the BPG and used, let's say, 4 disks for N, then each disk has 32K of data to write, which is a multiple of 4K and 512b, but each disk also needs to write the checksum. So each disk writes 32K + a few bytes, which defeats the whole purpose anyway, doesn't it? The effect, if real at all, might be negligible. I don't know how small it is, but I'm quite certain it's not huge.
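Edward's 25.6 K figure checks out arithmetically; a quick sketch against both 512-byte and 4 KiB sectors (toy arithmetic only, not ZFS allocation code):

```python
# A 128 KiB slab divided across N data disks: 4 disks give 32 KiB per
# disk (a clean multiple of both 512 B and 4 KiB sectors), while 5
# disks give 25.6 KiB, which is not sector-aligned either way.

SLAB = 128 * 1024

for n in (4, 5):
    chunk = SLAB / n
    print(f"{n} disks: {chunk / 1024:g} KiB/disk, "
          f"512B-aligned={chunk % 512 == 0}, 4K-aligned={chunk % 4096 == 0}")
```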
Re: [zfs-discuss] resilver = defrag?
On 09/09/10 20:08, Edward Ned Harvey wrote: Scores so far: 2 No, 1 Yes No. Resilver does not re-layout your data or change what's in the block pointers on disk. If it was fragmented before, it will be fragmented after. C) Does zfs send | zfs receive mean it will defrag? Scores so far: 1 No, 2 Yes Maybe. If there is sufficient contiguous free space in the destination pool, files may be less fragmented. But if you do incremental sends of multiple snapshots, you may well replicate some or all of the fragmentation on the origin (because snapshots only copy the blocks that change, and receiving an incremental send does the same). And if the destination pool is short on space, you may end up more fragmented than the source. - Bill