Re: [zfs-discuss] ZFS best practice for FreeBSD?
On Thu, 11 Oct 2012, Freddie Cash wrote:

> On Thu, Oct 11, 2012 at 2:47 PM, andy thomas <a...@time-domain.co.uk> wrote:
>> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?
>
> Solaris disabled the disk cache if the disk was partitioned, thus the recommendation to always use the entire disk with ZFS. FreeBSD's GEOM architecture allows the disk cache to be enabled whether you use the full disk or partition it.
>
> Personally, I find it nicer to use GPT partitions on the disk. That way, you can start the partition at 1 MB (gpart add -b 2048 on 512B disks, or gpart add -b 512 on 4K disks), leave a little wiggle-room at the end of the disk, and use GPT labels to identify the disk (using gpt/label-name for the device when adding to the pool).

This is apparently what had been done in this case:

  gpart add -b 34 -s 6000000 -t freebsd-swap da0
  gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da1

gpart show (stuff relating to a compact flash/SATA boot disk deleted):

  =>        34  1953525101  da0  GPT  (932G)
            34     6000000    1  freebsd-swap  (2.9G)
       6000034  1947525101    2  freebsd-zfs   (929G)

  =>        34  1953525101  da2  GPT  (932G)
            34     6000000    1  freebsd-swap  (2.9G)
       6000034  1947525101    2  freebsd-zfs   (929G)

  =>        34  1953525101  da1  GPT  (932G)
            34     6000000    1  freebsd-swap  (2.9G)
       6000034  1947525101    2  freebsd-zfs   (929G)

Is this a good scheme? The server has 12 GB of memory (upped from 4 GB last year after it kept crashing with out-of-memory reports on the console screen) so I doubt the swap would actually be used very often.

Running Bonnie++ on this pool comes up with some very good results for sequential disk writes, but the latency of over 43 seconds for block reads is terrible and is obviously impacting performance as a mail server, as shown here:

Version 1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
hsl-main.hsl.of 24G    63  67 80584  20 70568  17   314  98 554226  60 410.1  13
Latency             77140us   43145ms   28872ms     171ms     212ms     232ms

Version 1.96       ------Sequential Create------ --------Random Create--------
hsl-main.hsl.office -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 19261  93 +++++ +++ 18491  97 21542  92 +++++ +++ 20691  94
Latency             15399us     488us     226us   27733us     103us     138us

The other issue with this server is that it needs to be rebooted every 8-10 weeks, as disk I/O slows to a crawl over time and the server becomes unusable. After a reboot, it's fine again. I'm told ZFS 13 on FreeBSD 8.0 has a lot of problems, so I was planning to rebuild the server with FreeBSD 9.0 and ZFS 28, but I didn't want to make any basic design mistakes in doing this.

>> Another point about the Sun ZFS paper - it mentioned optimum performance would be obtained with RAIDz pools if the number of disks was between 3 and 9. So I've always limited my pools to a maximum of 9 active disks plus spares, but the other day someone here was talking of seeing hundreds of disks in a single pool! So what is the current advice for ZFS in Solaris and FreeBSD?
>
> You can have multiple disks in a vdev. And you can have multiple vdevs in a pool. Thus, you can have hundreds of disks in a pool. :) Just split the disks up into multiple vdevs, where each vdev is under 9 disks each. :)
> For example, we have 25 disks in the following pool, but only 6 disks in each vdev (plus log/cache):
>
> [root@alphadrive ~]# zpool list -v
> NAME             SIZE  ALLOC   FREE    CAP  DEDUP    HEALTH  ALTROOT
> storage         24.5T  20.7T  3.76T    84%  3.88x  DEGRADED  -
>   raidz2        8.12T  6.78T  1.34T      -
>     gpt/disk-a1     -      -      -      -
>     gpt/disk-a2     -      -      -      -
>     gpt/disk-a3     -      -      -      -
>     gpt/disk-a4     -      -      -      -
>     gpt/disk-a5     -      -      -      -
>     gpt/disk-a6     -      -      -      -
>   raidz2        5.44T  4.57T   888G      -
>     gpt/disk-b1     -      -      -      -
>     gpt/disk-b2     -      -      -      -
>     gpt/disk-b3     -      -      -      -
>     gpt/disk-b4     -      -      -      -
>     gpt/disk-b5     -      -      -      -
>     gpt/disk-b6     -      -      -      -
>   raidz2        5.44T  4.60T   863G      -
>     gpt/disk-c1
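As a rough illustration of the labelled-GPT scheme Freddie describes (disk names, label names and the pool name below are invented, and sizes are left at their defaults):

  # 1 MiB-aligned ZFS partition with a GPT label on each data disk
  gpart create -s gpt da3
  gpart add -t freebsd-zfs -b 2048 -l disk-a1 da3
  gpart create -s gpt da4
  gpart add -t freebsd-zfs -b 2048 -l disk-a2 da4

  # The label, not the raw device node, is what goes into the pool
  zpool create storage mirror gpt/disk-a1 gpt/disk-a2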
Re: [zfs-discuss] ZFS best practice for FreeBSD?
On Thu, 11 Oct 2012, Richard Elling wrote:

> On Oct 11, 2012, at 2:58 PM, Phillip Wagstrom <phillip.wagst...@gmail.com> wrote:
>> On Oct 11, 2012, at 4:47 PM, andy thomas wrote:
>>> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?
>>
>> My understanding of the best practice was that with Solaris, prior to ZFS, it disabled the volatile disk cache.
>
> This is not quite correct. If you use the whole disk, ZFS will attempt to enable the write cache. To understand why, remember that UFS (and ext, by default) can die a horrible death (+fsck) if there is a power outage and cached data is not flushed to disk. So Sun shipped some disks with the write cache disabled by default. Non-Sun disks are most often shipped with the write cache enabled, and the most popular file systems (NTFS) properly issue cache flush requests as needed (for the same reason ZFS issues cache flush requests).

Out of interest, how do you enable the write cache on a disk? I recently replaced a failing Dell-branded disk on a Dell server with an HP-branded disk (both disks were the identical Seagate model) and, on running the EFI diagnostics just to check all was well, it reported the write cache was disabled on the new HP disk but enabled on the remaining Dell disks in the server. I couldn't see any way of enabling the cache from the EFI diags, so I left it as it was - probably not ideal.

>> With ZFS, the disk cache is used, but after every transaction a cache-flush command is issued to ensure that the data made it to the platters.
>
> Write cache is flushed after uberblock updates and for ZIL writes. This is important for uberblock updates, so the uberblock doesn't point to a garbaged MOS. It is important for ZIL writes, because they must be guaranteed written to media before the ack.

Thanks for the explanation, that all makes sense now.

Andy

>> If you slice the disk, enabling the disk cache for the whole disk is dangerous because other file systems (meaning UFS) wouldn't do the cache-flush and there was a risk of data loss should the cache fail due to, say, a power outage. Can't speak to how BSD deals with the disk cache.
>>
>>> I looked at a server earlier this week that was running FreeBSD 8.0 and had 2 x 1 TB SAS disks in a ZFS 13 mirror with a third identical disk as a spare. Large file I/O throughput was OK but the mail jail it hosted had periods when it was very slow at accessing lots of small files. All three disks (the two in the ZFS mirror plus the spare) had been partitioned with gpart so that partition 1 was a 6 GB swap and partition 2 filled the rest of the disk and had a 'freebsd-zfs' partition on it. It was these second partitions that were part of the mirror. This doesn't sound like a very good idea to me as surely disk seeks for swap and for ZFS file I/O are bound to clash, aren't they?
>>
>> It surely would make a slow, memory-starved, swapping system even slower. :)
>>
>>> Another point about the Sun ZFS paper - it mentioned optimum performance would be obtained with RAIDz pools if the number of disks was between 3 and 9. So I've always limited my pools to a maximum of 9 active disks plus spares but the other day someone here was talking of seeing hundreds of disks in a single pool! So what is the current advice for ZFS in Solaris and FreeBSD?
>>
>> That number was drives per vdev, not per pool.
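For reference, since the question of how to turn the write cache on isn't answered directly above: this is typically done with camcontrol on FreeBSD or 'format -e' on Solaris. A rough sketch, assuming a SAS disk da0 and a SATA disk ada0 (device names are examples only):

  # Show the SCSI caching mode page; WCE: 1 means the volatile write cache is on
  camcontrol modepage da0 -m 8
  # Edit the page interactively (set WCE to 1) - use with care
  camcontrol modepage da0 -m 8 -e
  # For a SATA disk, the cache state is visible in the IDENTIFY data
  camcontrol identify ada0 | grep -i 'write cache'

On Solaris, the rough equivalent is 'format -e', selecting the disk, then the cache -> write_cache -> enable menu.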
Re: [zfs-discuss] ZFS best practice for FreeBSD?
2012-10-12 11:11, andy thomas wrote:

> Great, thanks for the explanation! I didn't realise you could have a sort of 'stacked pyramid' vdev/pool structure.

Well, you can - the layers are pool -> top-level VDEVs -> leaf VDEVs, though on trivial pools like single-disk ones the layers kinda merge into one or two :) This should be described in the manpage in greater detail.

So the pool stripes over top-level VDEVs (TLVDEVs), roughly by round-robining whole logical blocks upon write, and then each tlvdev, depending on its redundancy configuration, forms the sectors to be written onto its component leaf vdevs (low-level disks, partitions or slices, LUNs, files, etc.).

Since full-stripe writes are not required by ZFS, smaller blocks can consume fewer sectors than there are leaves (disks) in a tlvdev, but this does not result in lost-space holes nor in read-modify-write cycles like on full-stripe RAID systems. If there's a free hole of contiguous logical addressing (roughly, striped across leaf vdevs within the tlvdev) where the userdata sectors (after optional compression) plus the redundancy sectors fit, it will be used. I guess it is because of this contiguous addressing that a raidzN tlvdev cannot (currently) change the number of component disks, and a pool cannot decrease the number of tlvdevs.

If you add new tlvdevs to an existing pool, the ZFS algorithms will try to put more load on the emptier tlvdevs and balance the writes, although according to discussions this can still lead to imbalance and performance problems on particular installations. In fact, you can (although it is not recommended, for balancing reasons) have tlvdevs of mixed size (like in Freddie's example) and even of different structure (i.e. mixing raidz and mirrors or even single LUNs) by forcing the disk attachment.

Note however that the loss of a tlvdev kills your whole pool, so don't stripe important data over single disks/LUNs ;) And you don't have control over what gets written where, so you'd also get an averaged performance mix of raidz and mirrors, with unpredictable performance for any particular userdata block's storage.

HTH,
//Jim Klimov
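To make the layering concrete, a small sketch (pool and disk names are invented, not from this thread):

  # A pool striped over two top-level raidz2 vdevs, six leaf disks each
  zpool create tank \
      raidz2 da0 da1 da2 da3 da4 da5 \
      raidz2 da6 da7 da8 da9 da10 da11

  # Growing the pool later means adding another top-level vdev; raidz vdevs
  # cannot gain or lose member disks, and top-level vdevs cannot be removed
  zpool add tank raidz2 da12 da13 da14 da15 da16 da17

  # Mixing vdev types (e.g. adding a mirror to a raidz pool) needs -f and is
  # generally discouraged for the balancing reasons described above
  zpool add -f tank mirror da18 da19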
Re: [zfs-discuss] ZFS best practice for FreeBSD?
On 2012-Oct-12 08:11:13 +0100, andy thomas <a...@time-domain.co.uk> wrote:

> This is apparently what had been done in this case:
>
>   gpart add -b 34 -s 6000000 -t freebsd-swap da0
>   gpart add -b 6000034 -s 1947525101 -t freebsd-zfs da1
>
>   gpart show

Assuming that you can be sure you'll keep 512B-sector disks, that's OK, but I'd recommend that you align both the swap and ZFS partitions on at least 4KiB boundaries for future-proofing (i.e. you can safely stick the same partition table onto a 4KiB disk in future).

> Is this a good scheme? The server has 12 GB of memory (upped from 4 GB last year after it kept crashing with out-of-memory reports on the console screen) so I doubt the swap would actually be used very often.

Having enough swap to hold a crash dump is useful. You might consider using gmirror for swap redundancy (though 3-way is overkill). (And I'd strongly recommend against swapping to a zvol or ZFS - FreeBSD has issues with that combination.)

> The other issue with this server is it needs to be rebooted every 8-10 weeks as disk I/O slows to a crawl over time and the server becomes unusable. After a reboot, it's fine again. I'm told ZFS 13 on FreeBSD 8.0 has a lot of problems

Yes, it does - and your symptoms match one of the problems. Does top(1) report lots of inactive and cache memory and very little free memory, and a high kstat.zfs.misc.arcstats.memory_throttle_count, once I/O starts slowing down?

> so I was planning to rebuild the server with FreeBSD 9.0 and ZFS 28 but I didn't want to make any basic design mistakes in doing this.

I'd suggest you test 9.1-RC2 (just released) with a view to using 9.1, rather than installing 9.0. Since your questions are FreeBSD-specific, you might prefer to ask on the freebsd-fs list.

-- 
Peter Jeremy
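A rough sketch of the suggestions above (partition names are illustrative and assume the swap partitions are p1 on the first two disks; geom_mirror_load="YES" in loader.conf would be needed to make the mirror persist across reboots):

  # Mirror the swap partitions on the first two disks and swap on the mirror
  kldload geom_mirror
  gmirror label swap /dev/da0p1 /dev/da1p1
  swapon /dev/mirror/swap

  # Symptom check for the FreeBSD 8.0-era ARC problem Peter mentions
  sysctl kstat.zfs.misc.arcstats.memory_throttle_count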
Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question
> From: Richard Elling [mailto:richard.ell...@gmail.com]
>
> Pedantically, a pool can be made in a file, so it works the same...

A pool can only be made in a file by a system that is able to create a pool. The point is, his receiving system runs Linux and doesn't have any ZFS. His receiving system is remote from his sending system, and it has been suggested that he might consider making an iSCSI target available, so the sending system could 'zpool create' and 'zfs receive' directly into a file or device on the receiving system - but it doesn't seem as if that's going to be possible for him; he's expecting to transport the data over ssh.

So he's looking for a way to do a 'zfs receive' on a Linux system, transported over ssh. Suggested answers so far include building a VM on the receiving side to run OpenIndiana (or whatever), or using zfs-fuse-linux.

He is currently writing his 'zfs send' datastream into a series of files on the receiving system, but this has a few disadvantages compared to doing 'zfs receive' on the receiving side - namely, increased risk of data loss and less granularity for restores. For these reasons, it's been suggested he find a way of receiving via 'zfs receive', and he's exploring the possibilities of how to improve upon this situation: namely, how to 'zfs receive' on a remote Linux system via ssh, instead of cat'ing or redirecting into a series of files.

There, I think I've recapped the whole thread now. ;-)
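For concreteness, a sketch of the two approaches being compared (host, dataset, snapshot and file names are invented for illustration):

  # Current approach: capture the incremental send stream into a file on the Linux box
  zfs send -i tank/data@monday tank/data@tuesday | \
      ssh backup-host 'cat > /backups/data_tuesday.zfsstream'

  # Suggested approach: pipe into a real 'zfs receive' on the far end
  # (requires something on the receiver that can run ZFS, e.g. a VM or zfs-fuse)
  zfs send -i tank/data@monday tank/data@tuesday | \
      ssh backup-host zfs receive backuppool/data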
Re: [zfs-discuss] ZFS best practice for FreeBSD?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of andy thomas
>
> According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well?

I'm not going to address the FreeBSD question. I know others have made some comments on the best practice on Solaris, but here goes: there are two reasons for the best practice of not partitioning, and I disagree with them both.

First, by default, the on-disk write cache is disabled, but if you use the whole disk in a zpool, then ZFS enables the cache. If you partition a disk and use it only for zpools, then you might want to manually enable the cache yourself. This is a fairly straightforward scripting exercise. You may use this if you want (no warranty, etc.; it will probably destroy your system if you don't read, understand and rewrite it yourself before attempting to use it): https://dl.dropbox.com/u/543241/dedup%20tests/cachecontrol/cachecontrol.zip
If you do that, you'll need to re-enable the cache once on each boot (or zfs mount).

The second reason is that when you 'zpool import', it doesn't automatically check all the partitions of all the devices - it only scans whole devices. So if you are forced to move your disks to a new system, you try to import, you get an error message, you panic and destroy your disks. To overcome this problem, you just need to be good at remembering that the disks were partitioned - perhaps you should make a habit of partitioning *all* of your disks, so you'll *always* remember. On 'zpool import', you need to specify the partitions to scan for zpools; I believe this is the 'zpool import -d' option.

And finally - there are at least a couple of solid reasons *in favor* of partitioning.

#1: It seems common, at least to me, that I'll build a server with, let's say, 12 disk slots, and we'll be using 2T disks or something like that. The OS itself only takes like 30G, which means that if I don't partition, I'm wasting 1.99T on each of the first two disks. As a result, when installing the OS, I always partition rpool down to ~80G or 100G, and I will always add the second partitions of the first disks to the main data pool.

#2: A long time ago, there was a bug where you couldn't attach a mirror unless the two devices had precisely the same geometry. That was addressed in a bugfix a couple of years ago. (I had a failed SSD mirror, and Sun shipped me a new SSD with a different firmware rev, and the size of the replacement device was off by 1 block, so I couldn't replace the failed SSD.) After the bugfix, a mirror can be attached if there's a little bit of variation in the sizes of the two devices. But it's not quite enough - as recently as 2 weeks ago, I tried to attach two devices that were nominally precisely the same size, but couldn't because of a size difference. One of them was a local device, and the other was an iSCSI target. So I guess iSCSI must require a little bit of space, and that was enough to make the devices un-mirror-able without partitioning.
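A minimal sketch of the 'zpool import -d' workaround mentioned above (the directory is simply wherever the partition device nodes live - /dev/gpt is a FreeBSD example, /dev/dsk the Solaris default - and 'tank' is a made-up pool name):

  # List importable pools found by scanning a specific directory of device nodes
  zpool import -d /dev/gpt
  # Import the one you want by name
  zpool import -d /dev/gpt tank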
Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question
2012-10-12 16:50, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

> So he's looking for a way to do a zfs receive on a linux system, transported over ssh. Suggested answers so far include building a VM on the receiving side, to run openindiana (or whatever) or using zfs-fuse-linux.

For completeness: if an iSCSI target on the receiving host (or another similar solution) is implemented, the secure-networking part of 'zfs send over ssh' (i.e. sending locally into a pool that lives on the iSCSI target) can be handled by a VPN instead, e.g. OpenVPN, which uses the same OpenSSL encryption.

//Jim
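Roughly, that arrangement would look like this (device and dataset names are hypothetical): the Linux box exports a block device as an iSCSI LUN reachable only over the VPN, and the sending host builds a pool on that LUN and receives into it locally, so no ZFS is needed on the receiver at all.

  # On the sending host, once the iSCSI LUN shows up as (say) da5:
  zpool create backuppool da5
  zfs send tank/data@now | zfs receive backuppool/data
  zpool export backuppool   # export before dropping the iSCSI session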
Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question
Jim, I'm trying to contact you off-list, but it doesn't seem to be working. Can you please contact me off-list?
Re: [zfs-discuss] ZFS best practice for FreeBSD?
On Fri, Oct 12, 2012 at 3:28 AM, Jim Klimov <jimkli...@cos.ru> wrote:

> In fact, you can (although not recommended due to balancing reasons) have tlvdevs of mixed size (like in Freddie's example) and even of different structure (i.e. mixing raidz and mirrors or even single LUNs) by forcing the disk attachment.

My example shows 4 raidz2 vdevs, with each vdev having 6 disks, along with a log vdev and a cache vdev. Not sure where you're seeing an imbalance. Maybe it's because the pool is currently resilvering a drive, thus making it look like one of the vdevs has 7 drives?

My home file server ran with mixed vdevs for a while (a 2-disk IDE mirror vdev alongside a 3-disk SATA raidz1 vdev), as it was built from scrounged parts. But all my work file servers have matched vdevs.

-- 
Freddie Cash
fjwc...@gmail.com
Re: [zfs-discuss] Building an On-Site and Off-Site ZFS server, replication question
On Oct 12, 2012, at 5:50 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

>> From: Richard Elling [mailto:richard.ell...@gmail.com]
>>
>> Pedantically, a pool can be made in a file, so it works the same...
>
> A pool can only be made in a file by a system that is able to create a pool.

You can't send a pool, you can only send a dataset. Whether you receive the dataset into a pool or a file is a minor nit; the send stream itself is consistent.

> [...]
>
> There, I think I've recapped the whole thread now. ;-)

Yep, and cat works fine.
 -- richard
Re: [zfs-discuss] ZFS best practice for FreeBSD?
On 10/13/12 02:12, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

> There are at least a couple of solid reasons *in favor* of partitioning. #1 It seems common, at least to me, that I'll build a server with let's say, 12 disk slots, and we'll be using 2T disks or something like that. The OS itself only takes like 30G which means if I don't partition, I'm wasting 1.99T on each of the first two disks. As a result, when installing the OS, I always partition rpool down to ~80G or 100G, and I will always add the second partitions of the first disks to the main data pool.

How do you provision a spare in that situation?

-- 
Ian.