Re: [zfs-discuss] ashift and vdevs
dm == David Magda <dma...@ee.ryerson.ca> writes:

dm> The other thing is that with the growth of SSDs, if more OS
dm> vendors support dynamic sectors, SSD makers can have
dm> different values for the sector size

okay, but if the size of whatever you're talking about is a multiple of 512, we don't actually need (or, probably, want!) any SCSI sector-size monkeying around. Just establish a minimum write size in the filesystem, and always write multiple aligned 512-byte sectors at once instead. The 520-byte sectors you mentioned can't be accommodated this way, but for 4 kByte it seems fine.

dm> to allow for performance changes as the technology evolves.
dm> Currently everything is hard-coded,

XFS is hardcoded. NTFS has a settable block size. ZFS has ashift (almost). The ZFS slog is apparently hardcoded, though. So, two of those four are not hardcoded, and the two hardcoded ones are hardcoded to 4 kByte.

dm> Until you're in a virtualized environment. I believe that in
dm> the combination of NetApp and VMware, a 64K alignment is best
dm> practice, last I heard. Similarly with the various stripe widths
dm> available on traditional RAID arrays, it could be advantageous
dm> for the OS/FS to know it.

There is another setting in XFS for RAID stripe size, but I don't know what it does. It's separate from the (unsettable) XFS block size setting. So... this 64 kByte thing might not be the same thing as what we're talking about so far... though in terms of aligning partitions it's the same, I guess.
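(Put concretely, treating 4 kByte as the minimum write size just means every write starts at an LBA that is a multiple of 8 and spans a multiple of 8 sectors. A trivial shell sketch, with made-up example numbers:)

    # Check whether a hypothetical write is safe for a 4 kByte-sector drive
    # sitting behind a 512-byte interface. lba and nsect are arbitrary examples.
    lba=4096
    nsect=16
    if [ $((lba % 8)) -eq 0 ] && [ $((nsect % 8)) -eq 0 ]; then
        echo "write at LBA $lba for $nsect sectors needs no read-modify-write"
    else
        echo "write at LBA $lba for $nsect sectors triggers read-modify-write"
    fi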
Re: [zfs-discuss] ashift and vdevs
Brandon High writes:
> On Tue, Nov 23, 2010 at 9:55 AM, Krunal Desai <mov...@gmail.com> wrote:
>> What is the upgrade path like from this? For example, currently I
>
> The ashift is set in the pool when it's created and will persist
> through the life of that pool. If you set it at pool creation, it will
> stay regardless of OS upgrades.

It is indeed persistent, but each top-level vdev (mirror, raid-z group, or drive in a stripe) will have its own value, based on the sector size when the vdev was integrated into the pool. The sector size of a vdev that is part of a pool had better not increase (or the vdev will be faulted).

-r

> -B
> --
> Brandon High : bh...@freaks.com
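(A quick way to see those per-vdev values on an existing pool; the pool name "tank" below is only a placeholder:)

    # zdb prints the cached pool config, including one ashift per
    # top-level vdev: 9 means 512-byte sectors, 12 means 4 kByte.
    zdb -C tank | grep ashift

On a mixed pool you should see one ashift line per top-level vdev.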
Re: [zfs-discuss] ashift and vdevs
kd == Krunal Desai <mov...@gmail.com> writes:

kd> http://support.microsoft.com/kb/whatever

dude. seriously? This is worse than a waste of time. Don't read a URL that starts this way.

kd> Windows 7 (even with SP1) has no support for 4K-sector
kd> drives.

NTFS has 4 kByte allocation units, so all you have to do is make sure the NTFS partition starts at an LBA that's a multiple of 8, and you have full performance. Probably NTFS is the reason WD has chosen 4 kByte. Linux XFS is also locked at a 4 kByte sector size, because that's the VM page size and XFS cannot use any block size other than the page size. So, 4 kByte is good (except for ZFS).

kd> can you explicate further about these drives and their
kd> emulation (or lack thereof), I'd appreciate it!

Further explication: all drives will have the emulation, or else you wouldn't be able to boot from them. The world of peecees isn't as clean as you imagine.

kd> which 4K sector drives offer a jumper or other method to
kd> completely disable any form of emulation and appear to the
kd> host OS as a 4K-sector drive?

None that I know of. It's probably simpler and less silly to leave the emulation in place forever than to start adding jumpers and modes and more secret commands.

It doesn't matter what sector size the drive presents to the host OS, because you can get the same performance characteristics by always writing an aligned set of 8 sectors at once, which is what people are trying to force ZFS to do by adding 3 to ashift. Whether the number is reported by some messy new invented SCSI command, input by the operator, or derived by a mini-benchmark added to format/fmthard/zpool/whatever-applies-the-label, this is done once for the life of the disk, and after that, whenever the OS needs the number it's gotten by issuing READ on the label. Day-to-day, the drive doesn't need to report it. Therefore, it is the ``ability to accommodate a minimum aligned write size'' which people badly want added to their operating systems, and no one sane really cares about automatic electronic reporting of true sector size.

Unfortunately (but predictably) it sounds like if you 'zpool replace' a 512-byte drive with a 4096-byte drive you are screwed. Therefore even people with 512-byte drives might want to set their ashift for 4096-byte drives right now. This is another reason it's a waste of time to worry about reporting/querying a drive's ``true'' sector size: for a pool of redundant disks, the needed planning is more complicated than query-report-obey.

Also, did anyone ever clarify whether the slog has an ashift? or is it forced-512? or derived from whatever vdev will eventually contain the separately-logged data? I would expect generalized immediate Caring about that, since no slogs except ACARD and DDRdrive will have 512-byte sectors.
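(On the "multiple of 8" point: a rough way to check slice alignment on Solaris with prtvtoc; the device path is only an example, substitute your own disk:)

    # Print each slice's starting LBA and whether it sits on a 4 kByte
    # boundary (i.e. the first 512-byte sector number is a multiple of 8).
    prtvtoc /dev/rdsk/c0t1d0s2 | awk '
        /^[^*]/ && NF >= 6 {
            print "slice", $1, "starts at LBA", $4,
                  (($4 % 8 == 0) ? "(4K-aligned)" : "(misaligned)")
        }'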
Re: [zfs-discuss] ashift and vdevs
On 12/01/10 22:14, Miles Nordin wrote:
> Also, did anyone ever clarify whether the slog has an ashift? or is it
> forced-512? or derived from whatever vdev will eventually contain the
> separately-logged data? I would expect generalized immediate Caring
> about that, since no slogs except ACARD and DDRdrive will have
> 512-byte sectors.

The minimum slog write is

    #define ZIL_MIN_BLKSZ 4096

and all writes are also rounded to multiples of ZIL_MIN_BLKSZ.

Neil.
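(To make that rounding concrete, a shell sketch of the arithmetic; the 1300-byte record size is an arbitrary example:)

    # Round an example ZIL write up to the next multiple of ZIL_MIN_BLKSZ.
    ZIL_MIN_BLKSZ=4096
    nbytes=1300                              # hypothetical record size
    rounded=$(( (nbytes + ZIL_MIN_BLKSZ - 1) / ZIL_MIN_BLKSZ * ZIL_MIN_BLKSZ ))
    echo "$nbytes bytes -> $rounded bytes on the slog"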
Re: [zfs-discuss] ashift and vdevs
I'd also note that at some point in the future we won't be able to purchase 512B drives any more. In particular, I think that 3TB drives will all be 4KB formatted. So it isn't a bad idea for a pool that you plan on expanding to have ashift=12 (imo).

One new thought occurred to me: I know some of the 4K drives emulate 512-byte sectors, so to the host OS they appear no different from other 512B drives. With this additional layer of emulation, I would assume that ashift wouldn't be needed, though I have read reports of the emulation affecting performance. I think I'll need to confirm exactly what drives do what, and then decide on an ashift if needed.
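(For what it's worth, the patched zpool binary discussed elsewhere in this thread hard-wires ashift=12; ZFS implementations that later grew an ashift pool property -- ZFS on Linux / OpenZFS, for instance -- accept it at pool creation, roughly like this. Pool and device names are placeholders, and stock Solaris builds from this era will reject the option:)

    # Force 4 kByte allocation (ashift=12 => 2^12 bytes) regardless of what
    # the drives report about themselves. Names below are examples only.
    zpool create -o ashift=12 tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0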
Re: [zfs-discuss] ashift and vdevs
On 27 November 2010 08:05, Krunal Desai <mov...@gmail.com> wrote:
> One new thought occurred to me; I know some of the 4K drives emulate
> 512 byte sectors, so to the host OS, they appear to be no different
> than other 512b drives. With this additional layer of emulation, I
> would assume that ashift wouldn't be needed, though I have read
> reports of this affecting performance. I think I'll need to confirm
> what drives do what exactly and then decide on an ashift if needed.

Consider that for a drive with 4KB internal sectors and a 512B external interface, a request for a 512B write results in the drive reading 4KB, modifying it (putting the new 512B in), and then writing the 4KB out again. This is terrible from a latency perspective. I recall seeing 20 IOPS on a WD EARS 2TB drive (ie, 50ms latency for random 512B writes).
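(For reference, the IOPS figure converts to per-operation latency like this, using the observed number above:)

    # 20 random 512B writes per second  ==>  1000 ms / 20  =  50 ms each.
    iops=20
    echo "scale=1; 1000 / $iops" | bc    # prints 50.0 (ms per write)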
Re: [zfs-discuss] ashift and vdevs
On Nov 26, 2010, at 20:09, taemun wrote:
> Consider that for a drive with 4KB internal sectors and a 512B external
> interface, a request for a 512B write results in the drive reading 4KB,
> modifying it (putting the new 512B in), and then writing the 4KB out
> again. This is terrible from a latency perspective. I recall seeing 20
> IOPS on a WD EARS 2TB drive (ie, 50ms latency for random 512B writes).

Agreed. However, if you look at this MS KB article, Windows 7 (even with SP1) has no support for 4K-sector drives: http://support.microsoft.com/kb/982018/en-us

Obviously we're dealing with ZFS and Solaris/BSD here, but what I'm getting at is: which 4K-sector drives offer a jumper or other method to completely disable any form of emulation and appear to the host OS as a 4K-sector drive? I believe the Barracuda LP (5900rpm) disks can do this, but I'm not sure about others like the F4s. I believe you earlier said you were using F4s (the HD204UIs) and the 5900rpm Seagates; if you can explicate further about these drives and their emulation (or lack thereof), I'd appreciate it!

--khd
Re: [zfs-discuss] ashift and vdevs
On Tue, Nov 23, 2010 at 9:55 AM, Krunal Desai <mov...@gmail.com> wrote:
> What is the upgrade path like from this? For example, currently I

The ashift is set in the pool when it's created and will persist through the life of that pool. If you set it at pool creation, it will stay regardless of OS upgrades.

-B

--
Brandon High : bh...@freaks.com
[zfs-discuss] ashift and vdevs
zdb -C shows an ashift value on each vdev in my pool; I was just wondering if it is vdev-specific, or pool-wide. Google didn't seem to know.

I'm considering a mixed pool with some advanced format (4KB sector) drives and some normal 512B sector drives, and was wondering if the ashift can be set per vdev, or only per pool. Theoretically, this would save me some size on metadata on the 512B sector drives.

Cheers,
Re: [zfs-discuss] ashift and vdevs
On Tue, November 23, 2010 08:53, taemun wrote:
> zdb -C shows an ashift value on each vdev in my pool; I was just
> wondering if it is vdev-specific, or pool-wide. Google didn't seem to
> know.
>
> I'm considering a mixed pool with some advanced format (4KB sector)
> drives and some normal 512B sector drives, and was wondering if the
> ashift can be set per vdev, or only per pool. Theoretically, this would
> save me some size on metadata on the 512B sector drives.

It's a per-pool property, and currently hard-coded to a value of nine (i.e., 2^9 = 512).

Sun/Oracle are aware of the new, upcoming sector sizes and some changes have been made in the code:

a. PSARC/2008/769: Multiple disk sector size support
   http://arc.opensolaris.org/caselog/PSARC/2008/769/

b. PSARC/2010/296: Add tunable to control RMW for Flash Devices
   http://arc.opensolaris.org/caselog/PSARC/2010/296/

(a) appears to have been fixed in snv_118 or so:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6710930

However, at this time there is no publicly available code that dynamically determines physical sector size and then adjusts ZFS pools automatically. Even if there were, most disks don't support the necessary ATA/SCSI command extensions to report the difference between physical and logical sizes. AFAIK, they all simply report 512 when asked.

If all of your disks will be 4K, you can hack together a solution to take advantage of that fact:

http://tinyurl.com/25gmy7o
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

Hopefully it'll make it into at least Solaris 11, as during the lifetime of that product there will be even more disks with that property.

There's also the fact that many LUNs from SANs also have alignment issues, though they tend to be at 64K. (At least that's what the VMware and NetApp best practices state.)
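(For readers following along, ashift is just a base-2 exponent, so the two values in play map to sector sizes like this:)

    # ashift is the log2 of the allocation unit ZFS uses for a vdev.
    for a in 9 12; do
        echo "ashift=$a -> $((1 << a))-byte allocation"
    done
    # ashift=9  -> 512-byte allocation
    # ashift=12 -> 4096-byte allocation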
Re: [zfs-discuss] ashift and vdevs
Interesting, I didn't realize that Soracle was working on / had a solution somewhat in place for 4K drives. I wonder what will happen first for me: Hitachi 7K2000s hitting a reasonable price, or 4K/variable-sector-size support hitting so I can use Samsung F4s or Barracuda LPs.

On Tue, Nov 23, 2010 at 9:40 AM, David Magda <dma...@ee.ryerson.ca> wrote:
> It's a per-pool property, and currently hard-coded to a value of nine
> (i.e., 2^9 = 512). Sun/Oracle are aware of the new, upcoming sector
> sizes and some changes have been made in the code:
> [snip]

--
--khd
Re: [zfs-discuss] ashift and vdevs
Cheers for the links David, but you'll note that I've commented on the blog you linked (ie, was aware of it).

The zpool-12 binary linked from http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/ worked perfectly on my SX11 installation. (It threw some error on b134, so it relies on some external code, to some extent.) I'd note, for those who are going to try, that that binary produces a pool of as high a version as the system supports. I was surprised that it was higher than the code for which it was compiled (ie, b147 = zpool v28).

I'm currently populating a pool with a 9-wide raidz vdev of Samsung HD204UI 2TB (5400rpm, 4KB sector) and a 9-wide raidz vdev of Seagate LP ST32000542AS 2TB (5900rpm, 4KB sector) which was created with that binary, and I haven't seen any of the performance issues I've had in the past with WD EARS drives.

It would be lovely if Oracle could see fit to implement correct detection of these drives! Or, at the very least, an -o ashift=12 parameter to zpool create.
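(For anyone repeating this, two quick sanity checks on what such a binary actually produced; the pool name "tank" is only a placeholder:)

    # What pool version did the new pool actually get?
    zpool get version tank

    # And which versions does this system's ZFS claim to support?
    zpool upgrade -v | head -5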
Re: [zfs-discuss] ashift and vdevs
On Tue, Nov 23, 2010 at 9:59 AM, taemun <tae...@gmail.com> wrote:
> I'm currently populating a pool with a 9-wide raidz vdev of Samsung
> HD204UI 2TB (5400rpm, 4KB sector) and a 9-wide raidz vdev of Seagate LP
> ST32000542AS 2TB (5900rpm, 4KB sector) which was created with that
> binary, and I haven't seen any of the performance issues I've had in
> the past with WD EARS drives.
>
> It would be lovely if Oracle could see fit to implement correct
> detection of these drives! Or, at the very least, an -o ashift=12
> parameter to zpool create.

What is the upgrade path like from this? For example, currently I have b134 OpenSolaris with 8x 1.5TB drives in a raidz2 storage pool. I would like to go to OpenIndiana and move that data to a new pool built of three 6-drive raidz2 vdevs (using 2TB drives). I am going to stagger my drive purchases to give my wallet a breather, so I would likely start with two 6-drive raidz2 vdevs at the beginning.

If I were to use that binary/hack to force the ashift for 4K drives, would I still be able to upgrade, down the road, to a zpool version that is happy with and aware of 4K drives? I know the safest route would be to just go with 512-byte-sector 7K2000s, but their prices do not drop nearly as often as the LPs or F4s do.
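(A sketch of what that path would look like, assuming the forced-ashift pool imports cleanly on the newer build; "tank" is a placeholder, and note that zpool upgrade is one-way:)

    # After installing the newer OS build, import the existing pool ...
    zpool import tank

    # ... and, once there's no need to go back, bring the pool format up to
    # the newest version the new OS supports. The ashift already recorded in
    # each vdev's label is not changed by this step.
    zpool upgrade tank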