Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
On 11/27/12 1:52 AM, Grégory Giannoni <s...@narguile.org> wrote:
> On Nov 27, 2012, at 01:17, Erik Trimble wrote:
>> On 11/26/2012 12:54 PM, Grégory Giannoni wrote:
>>> [snip] I switched a few months ago from Sun X45x0 to HP boxes: my
>>> fast NAS machines are now DL 180 G6. I got better perfs using an LSI
>>> 9240-8i rather than HP SmartArray (tried P410 and P812). I'm using
>>> only 600GB SSD drives.
>>
>> That LSI controller supports SATA III, or 6Gbps SATA. The Px1x
>> controllers do 6Gbps SAS, but only 3Gbps SATA, so that's your likely
>> perf difference. The SmartArray Px2x series should do both SATA and
>> SAS at 6Gbps.
>
> The SSD drives I'm using (Intel 320 600GB) are limited to 270MB/sec,
> so I don't think that SATA II is the limit.
>
>> That said, I do think you're right that the LSI controller is
>> probably a better fit for connections requiring a SATA SSD. The only
>> exception is having to give up the 1GB of NVRAM on the HP
>> controller. :-(
>
> I don't think that this is a real issue when using a bunch of SSDs. I
> even wonder whether the NVRAM isn't slowing down writes. My tests were
> done with the ZIL enabled, so a power loss shouldn't damage the data.
> HP recommends disabling the write accelerator on SSD-only volumes:
> http://h2.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=120&prodSeriesId=3802118&prodTypeId=329290&objectID=c02963968
>
> [...]
>
>> Is the bottleneck the LSI controller, the SAS/SATA bus, or the PCIe
>> bus itself? That is, have you tested with an LSI 9240-4i (one per
>> 8-drive cage, which I *believe* can use the HP multi-lane cable), and
>> with an LSI 9260-16i or LSI 9280-24i? My instinct would be to say
>> it's the PCIe bus, and you could probably get away with the 4-channel
>> cards, i.e. 4 channels @ 6Gbit/s = 3GB/s, vs. an x4 PCIe 2.0 slot at
>> 2GB/s.
>
> The first bottleneck we reached (DL 180 / standard 25-drive bay) was
> the HP controller (both P410 and P812 reached the same perfs: 800MB/sec
> writing, 1.3GB/sec reading). With the LSI 9240-8i, we reached 1.2GB/s
> writing, 1.3GB/s reading.
>
> The LSI 9240-4i was not able to connect to the 25-drive bay; I have
> not tested the LSI 9260-16i or LSI 9280-24i. The results were the same
> with 10 or 25 drives, so I suspected either the PCI bus or the
> expander in the 25-drive bay (HP 530946-001). Plugging the disks
> directly into the LSI card gained a few MB/s: the expander was
> limiting a bit, but worse, it prevented using more than one disk
> controller! By replacing the 25-drive bay with three 8-drive bays
> (507803-B21), the system was able to use three LSI 9240-8i
> controllers, with that 4.4GB/sec read rate.

You're correct that you've run into the limitation of the expander on
the 25-disk drive backplane. However, I'm curious about the 8-drive
cage you mention. I use that cage in the ML/DL370 G6 servers; I didn't
think it would fit into a DL180 G6. How is this arranged in your unit?
What does the resulting setup look like? Since the DL180 drive cages
are part of the bezel, do you just have three loose cages connected to
the controllers? Also, with three controllers, didn't you max out the
number of available PCIe slots?

Anyway, the new HP SL4540 server is the next product worth testing in
this realm... 60 x LFF disks.
http://h18004.www1.hp.com/products/quickspecs/14406_na/14406_na.html

--
Edmund White
ewwh...@mac.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
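For readers wanting to reproduce sequential read/write figures like the ones quoted above, a generic streaming test along these lines is a common approach. This is only a sketch - the pool name and file size are hypothetical, this is not the exact method used in the thread, and /dev/zero is a poor data source on pools with compression enabled:

```shell
# Watch per-device bandwidth while the test runs (5-second intervals).
zpool iostat -v tank 5 &

# Sequential write: stream 8 GiB into the pool, then flush to disk.
dd if=/dev/zero of=/tank/seqtest bs=1M count=8192 && sync

# Sequential read: stream the file back, discarding the data.
dd if=/tank/seqtest of=/dev/null bs=1M
```

The dd throughput summaries plus the zpool iostat output give a rough picture of where the pool saturates.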
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
[...]
>> The results were the same with 10 or 25 drives, so I suspected either
>> the PCI bus or the expander in the 25-drive bay (HP 530946-001).
>> Plugging the disks directly into the LSI card gained a few MB/s: the
>> expander was limiting a bit, but worse, it prevented using more than
>> one disk controller!
> [...]
> You're correct that you've run into the limitation of the expander on
> the 25-disk drive backplane. However, I'm curious about the 8-drive
> cage you mention. I use that cage in the ML/DL370 G6 servers; I didn't
> think it would fit into a DL180 G6. How is this arranged in your unit?
> What does the resulting setup look like? Since the DL180 drive cages
> are part of the bezel, do you just have three loose cages connected to
> the controllers?

It was not as easy as just unplugging the 25-drive bay and plugging in
three 8-drive bays: a few rivets to drill, a backplane power cable to
rework (the pins and wire colors are not the same!), a mini-Molex-to-
Molex cable for the drives' power, and some screws to fix the cages.
The result is really clean. Here are a few pictures:
http://www.flickr.com/photos/webzinemaker/6964036523/in/photostream/

> Also, with three controllers, didn't you max out the number of
> available PCIe slots?

Four slots are available on the DL180: three were used for the LSI
controllers, and one for a NIC.

> Anyway, the new HP SL4540 server is the next product worth testing in
> this realm... 60 x LFF disks.
> http://h18004.www1.hp.com/products/quickspecs/14406_na/14406_na.html

It might be a very good alternative to the X4540... But I wonder how
many controllers are connected, and what their perfs are.

--
Grégory Giannoni
http://www.wmaker.net
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Eugen Leitl
>
> Can I make e.g. an LSI SAS3442E directly do SSD caching (it says
> something about CacheCade, but I'm not sure whether it's an OS-side
> driver thing), as it is supposed to boost IOPS? Unlikely shot, but
> probably somebody here would know.

Depending on the type of work you will be doing, the best thing you
could do for performance is to disable the ZIL (zfs set sync=disabled)
and use SSDs for cache. But don't go *crazy* adding SSDs for cache,
because they still have some in-memory footprint. If you have 8G of RAM
and 80G SSDs, maybe just use one of them for cache, and let the other
three do absolutely nothing. Better yet, put your OS on a pair of
mirrored SSDs, then use a pair of mirrored HDDs for the storage pool,
and one SSD for cache. Then you have one SSD unused, which you could
optionally add as a dedicated log device to your storage pool.

There are specific situations where it's OK or not OK to disable the
ZIL - look around and ask here if you have any confusion about it.

Don't do redundancy in hardware. Let ZFS handle it.
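The layout suggested above can be sketched with standard zpool/zfs commands. Device names (c0t0d0 etc.) are hypothetical and must be replaced with the actual disks on the box:

```shell
# OS on two mirrored SSDs (this is normally done at install time).
zpool create rpool mirror c0t0d0 c0t1d0

# Data pool on two mirrored HDDs.
zpool create tank mirror c0t2d0 c0t3d0

# Third SSD as L2ARC read cache for the data pool.
zpool add tank cache c0t4d0

# Optional: fourth SSD as a dedicated log (SLOG) device.
zpool add tank log c0t5d0

# Only if the workload tolerates losing the last few seconds of
# synchronous writes on power failure - see the caveats above.
zfs set sync=disabled tank
```

Note that cache and log devices can be added and removed later without recreating the pool, so it is easy to experiment with which SSDs earn their keep.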
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
On Tue, Nov 27, 2012 at 12:12:43PM +0000, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Eugen Leitl
>>
>> Can I make e.g. an LSI SAS3442E directly do SSD caching (it says
>> something about CacheCade, but I'm not sure whether it's an OS-side
>> driver thing), as it is supposed to boost IOPS? Unlikely shot, but
>> probably somebody here would know.
>
> Depending on the type of work you will be doing, the best thing you
> could do for performance is to disable the ZIL (zfs set sync=disabled)
> and use SSDs for cache. But don't go *crazy* adding SSDs for cache,
> because they still have some in-memory footprint. If you have 8G of
> RAM and 80G SSDs, maybe just use one of them for cache, and let the
> other three do absolutely nothing. Better yet, put your OS on a pair
> of mirrored SSDs, then use a pair of mirrored HDDs for the storage
> pool, and one SSD for cache. Then you have one SSD unused, which you
> could optionally add as a dedicated log device to your storage pool.
>
> There are specific situations where it's OK or not OK to disable the
> ZIL - look around and ask here if you have any confusion about it.
>
> Don't do redundancy in hardware. Let ZFS handle it.

Thanks. I'll try doing that, and see how it works out.
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
Performance-wise, I think you should go for mirrors/RAID10, and
separate the pools (i.e. an rpool mirror on SSD and a data mirror on
HDDs). If you have 4 SSDs, you might mirror the other couple for zone
roots or for some databases in datasets delegated into zones, for
example.

Don't use dedup.

Carve out some space for L2ARC. As Ed noted, you might not want to
dedicate much disk space to it, due to the remaining RAM pressure when
using the cache; however, spreading the IO load between smaller cache
partitions/slices on each SSD may help your IOPS on average.

Maybe go for compression. I really hope someone better versed in
compression - like Saso - will chime in to say whether gzip-9 vs. lzjb
(or lz4) sucks in terms of read speeds from the pools. My general
HDD-based assumption is that the less data you read (or write) on
platters, the better, and the spare CPU cycles can usually take the
hit.

I'd spread the different data types (i.e. WORM programs, WORM-append
logs, and random-IO application data) across various datasets with
different settings, backed by different storage - since you have the
luxury.

Many best-practice documents (and the original Sol10/SXCE/LiveUpgrade
requirements) place the zone roots on the same rpool so they can be
upgraded seamlessly as part of the OS image. However, you can also
delegate ZFS datasets into zones and/or have lofs mounts from the GZ to
the LZ (maybe needed for shared datasets like distros and homes - and
faster/more robust than NFS from GZ to LZ).

For OS images (zone roots) I'd use gzip-9 or better (likely lz4 when it
gets integrated), same for logfile datasets, and lzjb, zle or none for
the random-IO datasets. For structured things like databases I also
research the block IO size and use that (at dataset creation time) to
reduce extra work with ZFS COW during writes - at the expense of more
metadata.

You'll likely benefit from having OS images on SSDs, logs on HDDs
(including logs from the GZ and LZ OSes, to reduce needless writes on
the SSDs), and databases on SSDs. Things vary for other data types,
and in general would be helped by L2ARC on the SSDs. Also note that
much of the default OS image is not really used (i.e. X11 on headless
boxes), so you might want to do weird things with GZ or LZ rootfs data
layouts - but note that these might puzzle your beadm/liveupgrade
software, so you'll have to do any upgrades with lots of manual
labor :)

On a somewhat orthogonal route, I'd start with setting up a generic
dummy zone, perhaps with much unneeded software, and zfs-clone that to
spawn application zones. This way you only pay the footprint price
once - at least until you have to upgrade the LZ OSes; in that case it
might be cheaper (in terms of storage at least) to upgrade the dummy,
clone it again, and port the LZs' customizations (installed software)
by finding the differences between the old dummy and the current zone
state (zfs diff, rsync -cn, etc.). In such upgrades you're really well
served by storing volatile data in datasets separate from the zone OS
root - you just reattach these datasets to the upgraded OS image and go
on serving.

As a particular example of a thing that is often upgraded and takes
considerable disk space per copy, I'd have the current JDK installed in
the GZ: either simply lofs-mounted from the GZ into the LZs, or in a
separate dataset, cloned and delegated into LZs (if JDK customizations
are further needed by some - but not all - local zones, i.e. timezone
updates, trusted CA certs, etc.).

HTH,
//Jim Klimov
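The per-dataset tuning described above might look like the following. Pool and dataset names here are made up for illustration, and the 8K recordsize is just one example of matching a database's block IO size:

```shell
# WORM-ish OS images and append-only logs: heavy compression pays off.
zfs create -o compression=gzip-9 tank/zoneroots
zfs create -o compression=gzip-9 tank/logs

# Random-IO application data: cheap compression (or none) to keep
# latency low.
zfs create -o compression=lzjb tank/appdata

# Structured database data: set recordsize at creation time to match
# the database's page size (e.g. 8K), reducing read-modify-write work
# in ZFS COW at the expense of more metadata.
zfs create -o recordsize=8k -o compression=lzjb tank/db
```

Since compression and recordsize only apply to newly written blocks, setting them at dataset creation time (before loading data) matters.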
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
Now that I have thought about it some more, a follow-up is due on my
advice:

1) While the best practices do (or did) dictate setting up zone roots
in the rpool, this is certainly not required - and I maintain lots of
systems which store zones in separate data pools. This minimizes the
write impact on rpools and gives the fuzzy feeling of keeping the
systems safer from unmountable or overfilled roots.

2) Whether LZs and GZs are in the same rpool for you, or you stack tens
of your LZ roots in a separate pool, they do in fact offer a nice
target for dedup - with an expected large dedup ratio which would
outweigh both the overheads and IO lags (especially on an SSD pool) and
the inconveniences of my approach with cloned dummy zones - especially
upgrades thereof. Just remember to use the same compression settings
(or lack of compression) on all zone roots, so that the ZFS blocks for
OS image files will be identical and dedupable.

HTH,
//Jim Klimov
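A minimal sketch of point 2 - deduplicating a pool full of near-identical zone roots - could look like this (names are hypothetical). The key detail from above is keeping compression uniform so identical file contents yield identical, dedupable blocks:

```shell
# Parent dataset with dedup and one uniform compression setting;
# child datasets (one per zone root) inherit both properties.
zfs create -o dedup=on -o compression=lzjb tank/zones
zfs create tank/zones/zone1
zfs create tank/zones/zone2

# The DEDUP column reports the achieved deduplication ratio.
zpool list tank
```

Remember that the dedup table must largely fit in RAM (or L2ARC) to avoid severe write penalties, which is why this is attractive specifically for the highly redundant zone-root case and not in general.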
Re: [zfs-discuss] zfs on SunFire X2100M2 with hybrid pools
On Tue, Nov 27, 2012 at 5:13 AM, Eugen Leitl <eu...@leitl.org> wrote:
> Now there are multiple configurations for this. Some using Linux (root
> fs on a RAID10, /home on RAID1) or zfs. Now zfs on Linux probably
> wouldn't do hybrid zfs pools (would it?)

Sure it does. You can even use the whole disk for zfs, with no
additional partition required (not even for /boot).

> and it probably wouldn't be stable enough for production. Right?

Depends on how you define stable, and what kind of in-house expertise
you have. Some companies are selling (or plan to sell, as their product
is in open beta) storage appliances powered by ZFS on Linux (search the
ZoL list for details). So it's definitely stable enough for them.

--
Fajar
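For what it's worth, a hybrid pool under ZFS on Linux uses exactly the same commands as on Solaris; only the device naming differs. A sketch with hypothetical /dev/disk/by-id names (use your own - by-id names are preferred over sda/sdb because they survive reordering):

```shell
# Mirrored HDD data pool using whole disks.
zpool create tank mirror \
    /dev/disk/by-id/ata-HDD_SERIAL_A /dev/disk/by-id/ata-HDD_SERIAL_B

# SSDs as L2ARC read cache and dedicated log device.
zpool add tank cache /dev/disk/by-id/ata-SSD_SERIAL_A
zpool add tank log   /dev/disk/by-id/ata-SSD_SERIAL_B

zpool status tank
```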
Re: [zfs-discuss] Intel DC S3700
Going a bit on a tangent, does anyone know if those drives are
available for sale anywhere?
[zfs-discuss] Question about degraded drive
Hello,

I have a degraded mirror set, and this has happened a few times (not
always the same drive) over the last two years. In the past I replaced
the drive, ran zpool replace, and all was well. I am wondering,
however, if it is safe to run zpool replace without replacing the
drive, to see if it has in fact failed. On traditional RAID systems I
have had drives drop out of an array but be perfectly fine. Adding them
back to the array returned the drive to service and all was well. Does
that approach work with ZFS? If not, is there another way to test the
drive before making the decision to yank and replace?

Thank you!
Chris
Re: [zfs-discuss] Question about degraded drive
You don't use replace on mirror vdevs. 'zpool detach' the failed drive,
then 'zpool attach' the new drive.

On Nov 27, 2012 6:00 PM, Chris Dunbar - Earthside, LLC
<cdun...@earthside.net> wrote:
> Hello,
>
> I have a degraded mirror set, and this has happened a few times (not
> always the same drive) over the last two years. In the past I replaced
> the drive, ran zpool replace, and all was well. I am wondering,
> however, if it is safe to run zpool replace without replacing the
> drive, to see if it has in fact failed. On traditional RAID systems I
> have had drives drop out of an array but be perfectly fine. Adding
> them back to the array returned the drive to service and all was well.
> Does that approach work with ZFS? If not, is there another way to test
> the drive before making the decision to yank and replace?
>
> Thank you!
> Chris
Re: [zfs-discuss] Question about degraded drive
And you can try 'zpool online' on the failed drive to see if it comes
back online.

On Nov 27, 2012 6:08 PM, Freddie Cash <fjwc...@gmail.com> wrote:
> You don't use replace on mirror vdevs. 'zpool detach' the failed
> drive, then 'zpool attach' the new drive.
>
> On Nov 27, 2012 6:00 PM, Chris Dunbar - Earthside, LLC
> <cdun...@earthside.net> wrote:
>> Hello,
>>
>> I have a degraded mirror set, and this has happened a few times (not
>> always the same drive) over the last two years. In the past I
>> replaced the drive, ran zpool replace, and all was well. I am
>> wondering, however, if it is safe to run zpool replace without
>> replacing the drive, to see if it has in fact failed. On traditional
>> RAID systems I have had drives drop out of an array but be perfectly
>> fine. Adding them back to the array returned the drive to service and
>> all was well. Does that approach work with ZFS? If not, is there
>> another way to test the drive before making the decision to yank and
>> replace?
>>
>> Thank you!
>> Chris
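Putting the two suggestions together, the sequence might look like this (pool and device names are hypothetical):

```shell
# First, try to bring the dropped disk back into service.
zpool online tank c0t2d0

# Watch whether it resilvers and stays ONLINE without new errors.
zpool status tank

# If the disk really is dead, swap it out of the mirror:
zpool detach tank c0t2d0
zpool attach tank c0t1d0 c0t3d0   # attach the new disk alongside the
                                  # surviving half of the mirror
```

zpool attach takes the existing mirror member as its second argument and the new device as the third, so the new disk resilvers from the healthy side.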
Re: [zfs-discuss] Question about degraded drive
Hi Chris,

On Tue, Nov 27, 2012 at 6:56 PM, Chris Dunbar - Earthside, LLC
<cdun...@earthside.net> wrote:
> Hello,
>
> I have a degraded mirror set, and this has happened a few times (not
> always the same drive) over the last two years. In the past I replaced
> the drive, ran zpool replace, and all was well. I am wondering,
> however, if it is safe to run zpool replace without replacing the
> drive, to see if it has in fact failed. On traditional RAID systems I
> have had drives drop out of an array but be perfectly fine. Adding
> them back to the array returned the drive to service and all was well.
> Does that approach work with ZFS? If not, is there another way to test
> the drive before making the decision to yank and replace?

I have two tidbits of useful information.

1) 'zpool scrub mypoolname' will attempt to read all data on all disks
in the pool and verify it against the checksums. If you suspect the
disk is fine, you can clear the errors, run a scrub, and check zpool
status to see if there are read/checksum errors on the disk. If there
are, I'd replace the drive.

2) If you have an additional hard drive bay/cable/controller, you can
do a zpool replace on the offending drive without doing a detach
first - this may save you from the other drive failing during
resilvering.

Jan
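The scrub-based check in point 1 (and the in-place replace in point 2) boils down to a few commands; pool and device names here are hypothetical:

```shell
# Reset the error counters on the suspect disk, then verify everything.
zpool clear tank c0t2d0
zpool scrub tank

# After the scrub completes, nonzero READ/WRITE/CKSUM counts against
# the disk mean it should be replaced.
zpool status -v tank

# Point 2: if a spare bay/port is available, replace in place - the old
# disk keeps serving reads until the new one finishes resilvering.
zpool replace tank c0t2d0 c0t5d0
```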
Re: [zfs-discuss] Question about degraded drive
Sorry, I was skipping bits to get to the main point. I did use replace
(as previously instructed on the list). I think that worked because my
spare had taken over for the failed drive. That's the same situation
now - a spare in service for the failed drive.

Sent from my iPhone

On Nov 27, 2012, at 9:08 PM, Freddie Cash <fjwc...@gmail.com> wrote:
> You don't use replace on mirror vdevs. 'zpool detach' the failed
> drive, then 'zpool attach' the new drive.
>
> On Nov 27, 2012 6:00 PM, Chris Dunbar - Earthside, LLC
> <cdun...@earthside.net> wrote:
>> Hello,
>>
>> I have a degraded mirror set, and this has happened a few times (not
>> always the same drive) over the last two years. In the past I
>> replaced the drive, ran zpool replace, and all was well. I am
>> wondering, however, if it is safe to run zpool replace without
>> replacing the drive, to see if it has in fact failed. On traditional
>> RAID systems I have had drives drop out of an array but be perfectly
>> fine. Adding them back to the array returned the drive to service and
>> all was well. Does that approach work with ZFS? If not, is there
>> another way to test the drive before making the decision to yank and
>> replace?
>>
>> Thank you!
>> Chris