[zfs-discuss] Backing up ZFS metadata
Hi all, I know the easiest answer to this question is don't do it in the first place, and if you do, you should have a backup, however I'll ask it regardless. Is there a way to back up the ZFS metadata on each member device of a pool to another device (possibly non-ZFS)? I have recently read a discussion on this list regarding storing the working metadata on off-data devices (mirrored I assume). Is there a way today to walk, and save, the metadata of an entire pool and save it somewhere? The main motivation for the question is that I recently ruined a large raidz pool by overwriting the start and end of two member disks (and possibly some data). I assume that if I could have restored the lost metadata I could have recovered most of the real data. Thanks Scott
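There is no supported way to save ZFS metadata in a restorable form today, but zdb can at least dump the labels and pool configuration to text so there is a record of what they contained; a rough sketch (the device path and pool name are placeholders, and these are informational dumps only, nothing ZFS can re-import):

  # dump the four on-disk labels from one member device
  zdb -l /dev/rdsk/c0t0d0s0 > /backup/c0t0d0s0-labels.txt

  # for an imported pool, dump the cached configuration, uberblocks and dataset metadata
  zdb -C tank   > /backup/tank-config.txt
  zdb -uuu tank > /backup/tank-uberblocks.txt
  zdb -dd tank  > /backup/tank-datasets.txt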
Re: [zfs-discuss] Recovering lost labels on raidz member
Hi Saso, thanks for your reply. If all disks are the same, is the root pointer the same? Also, is there a signature or something unique to the root block that I can search for on the disk? I'm going through the On-disk specification at the moment. Scott On Mon, Aug 13, 2012 at 10:02:58AM +0200, Sašo Kiselkov wrote: On 08/13/2012 10:00 AM, Sašo Kiselkov wrote: On 08/13/2012 03:02 AM, Scott wrote: Hi all, I have a 5 disk raidz array in a state of disrepair. Suffice to say three disks are ok, while two are missing all their labels. (Both ends of the disks were overwritten). The data is still intact. There are 4 labels on a zfs-labeled disk, two at the start and two at the end. Have all been overwritten? Just re-read your post again, and I realized my question here is redundant. Without the labels your data is toast. -- Saso
Re: [zfs-discuss] Recovering lost labels on raidz member
Thanks again Saso, at least I have closure :) Scott On Mon, Aug 13, 2012 at 11:24:55AM +0200, Sašo Kiselkov wrote: On 08/13/2012 10:45 AM, Scott wrote: Hi Saso, thanks for your reply. If all disks are the same, is the root pointer the same? No. Also, is there a signature or something unique to the root block that I can search for on the disk? I'm going through the On-disk specification at the moment. Nope. The checksums are part of the blockpointer, and the root blockpointer is in the uberblock, which itself resides in the label. By overwriting the label you've essentially erased all hope of practically finding the root of the filesystem tree - not even checksumming all possible block combinations (of which there are quite a few) will help you here, because you have no checksums to compare them against. I'd love to be wrong, and I might be (I don't have as intimate a knowledge of ZFS' on-disk structure as I'd like), but from where I'm standing, your raidz vdev is essentially lost. -- Saso
Re: [zfs-discuss] Recovering lost labels on raidz member
On Mon, Aug 13, 2012 at 10:40:45AM -0700, Richard Elling wrote: On Aug 13, 2012, at 2:24 AM, Sašo Kiselkov wrote: On 08/13/2012 10:45 AM, Scott wrote: Hi Saso, thanks for your reply. If all disks are the same, is the root pointer the same? No. Also, is there a signature or something unique to the root block that I can search for on the disk? I'm going through the On-disk specification at the moment. Nope. The checksums are part of the blockpointer, and the root blockpointer is in the uberblock, which itself resides in the label. By overwriting the label you've essentially erased all hope of practically finding the root of the filesystem tree - not even checksumming all possible block combinations (of which there are quite a few) will help you here, because you have no checksums to compare them against. I'd love to be wrong, and I might be (I don't have as intimate a knowledge of ZFS' on-disk structure as I'd like), but from where I'm standing, your raidz vdev is essentially lost. The labels are not identical, because each contains the guid for the device. It is possible, though nontrivial, to recreate. That said, I've never seen a failure that just takes out only the ZFS labels. You'd have to go out of your way to take out the labels. Which is just what I did (imagine: moving drives over to USB external enclosures, then putting them onto an HP RAID controller (which overwrites the end of the disk) - which also assumed that two disks should be automatically mirrored (if you miss the 5 second prompt where you can tell it not to). Then try and recover the labels without really knowing what you're doing (my bad). Suffice to say I have no confidence in the labels of two drives. On OI I can forcefully import the pool but with any file that lives on multiple disks (ie, over a certain size), all I get is an I/O error. Some of the datasets also fail to mount. Thanks everyone for your input. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422
[zfs-discuss] Recovering lost labels on raidz member
Hi all, I have a 5 disk raidz array in a state of disrepair. Suffice to say three disks are ok, while two are missing all their labels. (Both ends of the disks were overwritten). The data is still intact. Unfortunately I don't have a zpool.cache either. Is there a way to reconstruct the labels using the information from the 3 valid disks? Thanks Scott
[zfs-discuss] Corrupted pool: I/O error and Bad exchange descriptor
Hi all, this is a follow-up to some help I was soliciting with my corrupted pool. The short story is I can have no confidence in the quality of the labels on 2 drives of my 5-drive RAIDZ array, for various reasons. There is a possibility even that one drive has the label of another (a mirroring accident). Anyhoo, for some odd reason, the drives finally mounted (they are actually drive images on another ZFS pool which I have snapshotted). When I imported the pool, ZFS complained that two of the datasets would not mount, but the remainder did. It seems that small files read ok. (Perhaps small enough to fit on a single block - hence probably mirrored and not striped. Assuming my understanding of what happens to small files is correct). But on larger files I get:

root@openindiana-01:/ZP-8T-RZ1-01/incoming# cp httpd-error.log.zip /mnt2/
cp: reading `httpd-error.log.zip': I/O error

and on some directories:

root@openindiana-01:/ZP-8T-RZ1-01/usr# ls -al
ls: cannot access obj: Bad exchange descriptor
total 54
drwxr-xr-x  5 root root  5 2011-11-03 16:28 .
drwxr-xr-x 11 root root 11 2011-11-04 13:14 ..
??????????  ? ?    ?     ?                  obj
drwxr-xr-x 68 root root 83 2011-10-30 01:00 ports
drwxr-xr-x 22 root root 31 2011-09-25 02:00 src

Here is the zpool status output:

root@openindiana-01:/ZP-8T-RZ1-01# zpool status
  pool: ZP-8T-RZ1-01
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Nov 5 23:57:46 2011
    112G scanned out of 6.93T at 6.24M/s, 318h17m to go
    305M repaired, 1.57% done
config:

        NAME                      STATE     READ WRITE CKSUM
        ZP-8T-RZ1-01              DEGRADED     0     0  356K
          raidz1-0                DEGRADED     0     0  722K
            12339070507640025002  UNAVAIL      0     0     0  was /dev/lofi/2
            /dev/lofi/5           DEGRADED     0     0     0  too many errors  (repairing)
            /dev/lofi/4           DEGRADED     0     0     0  too many errors  (repairing)
            /dev/lofi/3           DEGRADED     0     0 74.4K  too many errors  (repairing)
            /dev/lofi/1           DEGRADED     0     0     0  too many errors  (repairing)

All those errors may be caused by one disk actually owning the wrong label. I'm not entirely sure. Also, while it's complaining that /dev/lofi/2 is UNAVAIL, the device certainly is present - although it's probably not labelled with '12339070507640025002'. I'd love to get some of my data back. Any recovery is a bonus. If anyone is keen, I have enabled SSH into the Open Indiana box which I'm using to try and recover the pool, so if you'd like to take a shot please let me know. Thanks in advance, Scott
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Sat, Jun 16, 2012 at 08:54:05AM +0200, Stefan Ring wrote: when you say remove the device, I assume you mean simply make it unavailable for import (I can't remove it from the vdev). Yes, that's what I meant.

root@openindiana-01:/mnt# zpool import -d /dev/lofi
  pool: ZP-8T-RZ1-01
    id: 9952605666247778346
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
config:

        ZP-8T-RZ1-01              FAULTED   corrupted data
          raidz1-0                DEGRADED
            12339070507640025002  UNAVAIL   cannot open
            /dev/lofi/5           ONLINE
            /dev/lofi/4           ONLINE
            /dev/lofi/3           ONLINE
            /dev/lofi/1           ONLINE

It's interesting that even though 4 of the 5 disks are available, it still can't import it as DEGRADED. I agree that it's interesting. Now someone really knowledgable will need to have a look at this. I can only imagine that somehow the devices contain data from different points in time, and that it's too far apart for the aggressive txg rollback that was added in PSARC 2009/479. Btw, did you try that? Try: zpool import -d /dev/lofi -FVX ZP-8T-RZ1-01. Hi again, that got slightly further, but still no dice:

root@openindiana-01:/mnt# zpool import -d /dev/lofi -FVX ZP-8T-RZ1-01
root@openindiana-01:/mnt# zpool list
NAME           SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH   ALTROOT
ZP-8T-RZ1-01     -      -      -    -      -   FAULTED  -
rpool        15.9G  2.17G  13.7G  13%  1.00x   ONLINE   -
root@openindiana-01:/mnt# zpool status
  pool: ZP-8T-RZ1-01
 state: FAULTED
status: One or more devices could not be used because the label is missing or invalid. There are insufficient replicas for the pool to continue functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        ZP-8T-RZ1-01              FAULTED      0     0     1  corrupted data
          raidz1-0                ONLINE       0     0     6
            12339070507640025002  UNAVAIL      0     0     0  was /dev/lofi/2
            /dev/lofi/5           ONLINE       0     0     0
            /dev/lofi/4           ONLINE       0     0     0
            /dev/lofi/3           ONLINE       0     0     0
            /dev/lofi/1           ONLINE       0     0     0

root@openindiana-01:/mnt# zpool scrub ZP-8T-RZ1-01
cannot scrub 'ZP-8T-RZ1-01': pool is currently unavailable

Thanks for your tenacity Stefan. Scott
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Sat, Jun 16, 2012 at 09:09:53AM -0500, Gregg Wonderly wrote: Use 'dd' to replicate as much of lofi/2 as you can onto another device, and then cable that into place? It looks like you just need to put a functioning, working, but not correct device, in that slot so that it will import and then you can 'zpool replace' the new disk into the pool perhaps? Gregg Wonderly

On 6/16/2012 2:02 AM, Scott Aitken wrote: [...]

Hi Greg, lofi/2 is a dd of a real disk. I am using disk images because I can roll back, clone etc without using the original drives (which are long gone anyway). I have tried making /2 unavailable for import, and zfs just moans that it can't be opened. It fails to import even though I have only one disk missing of a RAIDZ array. Scott
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Sat, Jun 16, 2012 at 09:58:40AM -0500, Gregg Wonderly wrote: On Jun 16, 2012, at 9:49 AM, Scott Aitken wrote: [...]

My experience is that ZFS will not import a pool with a missing disk. There has to be something in that slot before the import will occur. 
Even if the disk is corrupt, it needs to be there. I think this is a failsafe mechanism that tries to keep a pool from going live when you have mistakenly not connected all the drives. That keeps the disks from becoming chronologically/txn misaligned which can result in data loss, in the right combinations I believe. Gregg Wonderly Hi again Gregg, not sure if I should be top posting this... Given I am working with images, it's hard to put just anything in place of lofi/2. ZFS scans all of the files in the directory for ZFS labels, so just replacing lofi/2 with an empty file (for example) just means ZFS skips it, which is the same result as deleting lofi/2 altogether. I did this, but to no avail. ZFS complains about having insufficient replicas. There is something more going
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Fri, Jun 15, 2012 at 10:54:34AM +0200, Stefan Ring wrote: Have you also mounted the broken image as /dev/lofi/2? Yep. Wouldn't it be better to just remove the corrupted device? This worked just fine in my case. Hi Stefan, when you say remove the device, I assume you mean simply make it unavailable for import (I can't remove it from the vdev). This is what happens (lofi/2 is the drive which ZFS thinks has corrupted data):

root@openindiana-01:/mnt# zpool import -d /dev/lofi
  pool: ZP-8T-RZ1-01
    id: 9952605666247778346
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        ZP-8T-RZ1-01              FAULTED   corrupted data
          raidz1-0                ONLINE
            12339070507640025002  UNAVAIL   corrupted data
            /dev/lofi/5           ONLINE
            /dev/lofi/4           ONLINE
            /dev/lofi/3           ONLINE
            /dev/lofi/1           ONLINE

root@openindiana-01:/mnt# lofiadm -d /dev/lofi/2
root@openindiana-01:/mnt# zpool import -d /dev/lofi
  pool: ZP-8T-RZ1-01
    id: 9952605666247778346
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
config:

        ZP-8T-RZ1-01              FAULTED   corrupted data
          raidz1-0                DEGRADED
            12339070507640025002  UNAVAIL   cannot open
            /dev/lofi/5           ONLINE
            /dev/lofi/4           ONLINE
            /dev/lofi/3           ONLINE
            /dev/lofi/1           ONLINE

So in the second import, it complains that it can't open the device, rather than saying it has corrupted data. It's interesting that even though 4 of the 5 disks are available, it still can't import it as DEGRADED. Thanks again. Scott
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Thu, Jun 14, 2012 at 09:56:43AM +1000, Daniel Carosone wrote: On Tue, Jun 12, 2012 at 03:46:00PM +1000, Scott Aitken wrote: Hi all, Hi Scott. :-) I have a 5 drive RAIDZ volume with data that I'd like to recover. Yeah, still.. I tried using Jeff Bonwick's labelfix binary to create new labels but it carps because the txg is not zero. Can you provide details of invocation and error response?

# /root/labelfix /dev/lofi/1
assertion failed for thread 0xfecb2a40, thread-id 1: txg == 0, file label.c, line 53
Abort (core dumped)

The reporting line of code is:

    VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);

Here is the entire labelfix code:

#include <devid.h>
#include <dirent.h>
#include <errno.h>
#include <libintl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/vdev_impl.h>

/*
 * Write a label block with a ZBT checksum.
 */
static void
label_write(int fd, uint64_t offset, uint64_t size, void *buf)
{
    zio_block_tail_t *zbt, zbt_orig;
    zio_cksum_t zc;

    zbt = (zio_block_tail_t *)((char *)buf + size) - 1;
    zbt_orig = *zbt;

    ZIO_SET_CHECKSUM(&zbt->zbt_cksum, offset, 0, 0, 0);
    zio_checksum(ZIO_CHECKSUM_LABEL, &zc, buf, size);

    VERIFY(pwrite64(fd, buf, size, offset) == size);

    *zbt = zbt_orig;
}

int
main(int argc, char **argv)
{
    int fd;
    vdev_label_t vl;
    nvlist_t *config;
    uberblock_t *ub = (uberblock_t *)vl.vl_uberblock;
    uint64_t txg;
    char *buf;
    size_t buflen;

    VERIFY(argc == 2);
    VERIFY((fd = open(argv[1], O_RDWR)) != -1);
    VERIFY(pread64(fd, &vl, sizeof (vdev_label_t), 0) ==
        sizeof (vdev_label_t));
    VERIFY(nvlist_unpack(vl.vl_vdev_phys.vp_nvlist,
        sizeof (vl.vl_vdev_phys.vp_nvlist), &config, 0) == 0);
    VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
    VERIFY(txg == 0);
    VERIFY(ub->ub_txg == 0);
    VERIFY(ub->ub_rootbp.blk_birth != 0);

    txg = ub->ub_rootbp.blk_birth;
    ub->ub_txg = txg;

    VERIFY(nvlist_remove_all(config, ZPOOL_CONFIG_POOL_TXG) == 0);
    VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG, txg) == 0);
    buf = vl.vl_vdev_phys.vp_nvlist;
    buflen = sizeof (vl.vl_vdev_phys.vp_nvlist);
    VERIFY(nvlist_pack(config, &buf, &buflen, NV_ENCODE_XDR, 0) == 0);

    label_write(fd, offsetof(vdev_label_t, vl_uberblock),
        1ULL << UBERBLOCK_SHIFT, ub);

    label_write(fd, offsetof(vdev_label_t, vl_vdev_phys),
        VDEV_PHYS_SIZE, &vl.vl_vdev_phys);

    fsync(fd);

    return (0);
}

For the benefit of others, this was at my suggestion; I've been discussing this problem with Scott for.. some time. I can also make the solaris machine available via SSH if some wonderful person wants to poke around. Will take a poke, as discussed. May well raise more discussion here as a result. -- Dan.
[zfs-discuss] Recovery of RAIDZ with broken label(s)
Hi all, I have a 5 drive RAIDZ volume with data that I'd like to recover. The long story runs roughly:

1) The volume was running fine under FreeBSD on motherboard SATA controllers.
2) Two drives were moved to a HP P411 SAS/SATA controller.
3) I *think* the HP controller wrote some volume information to the end of each disk (hence no more ZFS labels 2,3).
4) In its auto configuration wisdom, the HP controller built a mirrored volume using the two drives (and I think started the actual mirroring process). (Hence on at least one of the drives - copied labels 0,1).
5) From there everything went downhill.

This happened a while back, and so the exact order of things (including my botched attempts at recovery) are hazy. I tried using Jeff Bonwick's labelfix binary to create new labels but it carps because the txg is not zero. The situation now is I have dd'd the drives onto a NAS. These images are shared via NFS to a VM running Oracle Solaris 11 11/11 X86. When I attempt to import the pool I get:

root@solaris-01:/mnt# zpool import -d /dev/lofi
  pool: ZP-8T-RZ1-01
    id: 9952605666247778346
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        ZP-8T-RZ1-01              FAULTED   corrupted data
          raidz1-0                ONLINE
            12339070507640025002  UNAVAIL   corrupted data
            /dev/lofi/5           ONLINE
            /dev/lofi/4           ONLINE
            /dev/lofi/3           ONLINE
            /dev/lofi/1           ONLINE

I'm not sure why I can't import although 4 of the 5 drives are ONLINE. Can anyone please point me to a next step? I can also make the solaris machine available via SSH if some wonderful person wants to poke around. If I lose the data that's ok, but it'd be nice to know all avenues were tried before I delete the 9TB of images (I need the space...) Many thanks, Scott zfs-list at thismonkey dot com
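For reference, the image-based recovery setup described above amounts to something like this (device paths and file names are placeholders):

  # copy each physical drive to an image file on the NAS
  dd if=/dev/rdsk/c2t1d0p0 of=/mnt/nas/disk1.img bs=1048576 conv=noerror,sync

  # attach each image as a block device; lofiadm prints the assigned device, e.g. /dev/lofi/1
  lofiadm -a /mnt/nas/disk1.img

  # then point the import at the lofi devices
  zpool import -d /dev/lofi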
Re: [zfs-discuss] Sudden drop in disk performance - WD20EURS 4k sectors to blame?
Did you 4k align your partition table and is ashift=12?
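A quick way to check the ashift half of that question is to read it out of the pool configuration with zdb; roughly (the pool name is a placeholder, and the exact output format varies by release):

  # each top-level vdev reports its ashift: 9 = 512-byte sectors, 12 = 4K sectors
  zdb -C tank | grep ashift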
[zfs-discuss] ZFS Hard link space savings
Hi All, I have an interesting question that may or may not be answerable from some internal ZFS semantics. I have a Sun Messaging Server which has 5 ZFS based email stores. The Sun Messaging Server uses hard links to link identical messages together. Messages are stored in standard SMTP MIME format so the binary attachments are included in the message ASCII. Each individual message is stored in a separate file. So as an example, if a user sends an email with a 2MB attachment to the staff mailing list and there are 3 staff stores with 500 users on each, it will generate space usage like:

/store1 = 1 x 2MB + 499 x 1KB
/store2 = 1 x 2MB + 499 x 1KB
/store3 = 1 x 2MB + 499 x 1KB

So total storage used is around ~7.5MB due to the hard linking taking place on each store. If hard linking capability had been turned off, this same message would have used 1500 x 2MB = 3GB worth of storage. My question is: are there any simple ways of determining the space savings on each of the stores from the usage of hard links? The reason I ask is that our educational institute wishes to migrate these stores to M$ Exchange 2010 which doesn't do message single instancing. I need to try and project what the storage requirement will be on the new target environment. If anyone has any ideas, be it ZFS based or any useful scripts that could help here, I am all ears. I may post this to Sun Managers as well to see if anyone there might have any ideas on this as well. Regards, Scott.
Re: [zfs-discuss] ZFS Hard link space savings
On 13/06/11 10:28 AM, Nico Williams wrote: On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson scott.law...@manukau.ac.nz wrote: I have an interesting question that may or may not be answerable from some internal ZFS semantics. This is really standard Unix filesystem semantics. I understand this, just wanting to see if there is any easy way before I trawl through 10 million little files.. ;) [...] So total storage used is around ~7.5MB due to the hard linking taking place on each store. If hard linking capability had been turned off, this same message would have used 1500 x 2MB = 3GB worth of storage. My question is there any simple ways of determining the space savings on each of the stores from the usage of hard links? [...] But... you just did! :) It's: number of hard links * (file size + sum(size of link names and/or directory slot size)). For sufficiently large files (say, larger than one disk block) you could approximate that as: number of hard links * file size. The key is the number of hard links, which will typically vary, but for e-mails that go to all users, well, you know the number of links then is the number of users. Yes this number varies based on number of recipients, so could be as many a You could write a script to do this -- just look at the size and hard-link count of every file in the store, apply the above formula, add up the inflated sizes, and you're done. Looks like I will have to, just looking for a tried and tested method before I have to create my own one if possible. Just was looking for an easy option before I have to sit down and develop and test a script. I have resigned from my current job of 9 years and finish in 15 days and have a heck of a lot of documentation and knowledge transfer I need to do around other UNIX systems and am running very short on time... Nico PS: Is it really the case that Exchange still doesn't deduplicate e-mails? Really? It's much simpler to implement dedup in a mail store than in a filesystem... As a side note, Exchange 2002 + Exchange 2007 do do this. But apparently M$ decided in Exchange 2010 that they no longer wished to do this and dropped the capability. Bizarre to say the least, but it may come down to what they have done in the underlying store technology changes..
Re: [zfs-discuss] ZFS Hard link space savings
On 13/06/11 11:36 AM, Jim Klimov wrote: Some time ago I wrote a script to find any duplicate files and replace them with hardlinks to one inode. Apparently this is only good for same files which don't change separately in future, such as distro archives. I can send it to you offlist, but it would be slow in your case because it is not quite the tool for the job (it will start by calculating checksums of all of your files ;) ) What you might want to do and script up yourself is a recursive listing find /var/opt/SUNWmsqsr/store/partition... -ls. This would print you the inode numbers and file sizes and link counts. Pipe it through something like this:

  find ... -ls | awk '{print $1,$4,$7}' | sort | uniq

And you'd get 3 columns - inode, count, size. My AWK math is a bit rusty today, so I present a monster-script like this to multiply and sum up the values:

  ( find ... -ls | awk '{print $1,$4,$7}' | sort | uniq | awk '{ print $2*$3"+\\" }'; echo 0 ) | bc

This looks something like what I thought would have to be done, I was just looking to see if there was something tried and tested before I had to invent something. I was really hoping in zdb there might have been some magic information I could have tapped into.. ;) Can be done cleaner, i.e. in a PERL one-liner, and if you have many values - that would probably complete faster too. But as a prototype this would do. HTH, //Jim PS: Why are you replacing the cool Sun Mail? Is it about Oracle licensing and the now-required purchase and support cost? Yes it is about cost mostly. We had Sun Mail for our Staff and students. We had 20,000 + students on it up until Christmas time as well. We have now migrated them to M$ Live@EDU. This leaves us with 1500 Staff left who all like to use LookOut. The Sun connector for LookOut is a bit flaky at best. But the Oracle licensing cost for Messaging and Calendar starts at 10,000 users plus and so is now rather expensive for what mailboxes we have left. M$ also heavily discounts Exchange CALS to Edu and Oracle is not very friendly the way Sun was with their JES licensing. So it is bye bye Sun Messaging Server for us. 2011-06-13 1:14, Scott Lawson wrote: [...]
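A rough single-pass variant of the same idea, for the archives (the store path is a placeholder, and it assumes find -ls prints inode, link count and size in fields 1, 4 and 7, which is worth verifying first):

  find /store1 -type f -ls | awk '
      { links[$1] = $4; size[$1] = $7 }          # one entry per inode
      END {
          for (i in links) {
              actual   += size[i]                # bytes stored once per inode
              inflated += links[i] * size[i]     # bytes if every link were a full copy
          }
          printf("actual %.1f GB, without hard links %.1f GB\n",
                 actual / 2^30, inflated / 2^30)
      }'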
Re: [zfs-discuss] Best choice - file system for system
I don't disagree that zfs is the better choice, but... Seriously though. UFS is dead. It has no advantage over ZFS that I'm aware of. When it comes to dumping and restoring filesystems, there is still no official replacement for ufsdump and ufsrestore. The discussion has been had before, but to my knowledge, there is no consensus on the best method for backing up zfs filesystems. Personally, I like to use variations on zfs send and zfs receive, but others will tell a different story. Still, don't let this put you off using zfs as the root filesystem. Just be aware that you need to do some work and decide what method of backup is best for you.
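As one hedged sketch of the send/receive approach (pool, snapshot and host names are placeholders): take a recursive snapshot and replicate the whole tree to another box, or to a file if that is all you have.

  zfs snapshot -r rpool@backup-20101201
  zfs send -R rpool@backup-20101201 | ssh backuphost zfs receive -duF backup/rpool

  # or capture the stream to a file; note there is no per-file restore from a stream,
  # and the stream format has historically not been guaranteed across releases
  zfs send -R rpool@backup-20101201 > /net/nas/rpool-20101201.zfs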
Re: [zfs-discuss] OT: anyone aware how to obtain 1.8.0 for X2100M2?
Hi, Took me a couple of minutes to find the download for this in my Oracle support. Search for the patch like this: Patches and Updates Panel - Patch Search - Patch Name or Number is: 10275731. Pretty easy really. Scott. PS. I found that patch by using product or family equals x2100 and it found it for me easily. On 20/12/2010 1:04 p.m., Jerry Kemp wrote: Eugen, I would 2nd your observation. I *do* have several support contracts, and as I review my Oracle profile, it does show that I am authorized to download patches, among other items. I really haven't downloaded a lot since SunSolve was killed off. Do others on the list have access to download stuff like this? Or is there some other place within Oracle's site that makes Eugen's link obsolete? Jerry On 12/19/10 12:28, Eugen Leitl wrote: I realize this is off-topic, but Oracle has completely screwed up the support site from Sun. I figured someone here would know how to obtain Sun Fire X2100 M2 Server Software 1.8.0 Image contents: * BIOS is version 3A21 * SP is updated to version 3.24 (ELOM) * Chipset driver is updated to 9.27 from http://www.sun.com/servers/entry/x2100/downloads.jsp I've been trying for an hour, and I'm at the end of my rope. -- ___ Scott Lawson Systems Architect Manukau Institute of Technology Information Communication Technology Services Private Bag 94006 Manukau City Auckland New Zealand Phone : +64 09 968 7611 Fax: +64 09 968 7641 Mobile : +64 27 568 7611 mailto:sc...@manukau.ac.nz http://www.manukau.ac.nz perl -e 'print $i=pack("c5",(41*2),sqrt(7056),(unpack("c","H")-2),oct(115),10);'
Re: [zfs-discuss] how to replace failed vdev on non redundant pool?
If the pool is non-redundant and your vdev has failed, you have lost your data. Just rebuild the pool, but consider a redundant configuration. On Oct 15, 2010, at 3:26 PM, Cassandra Pugh wrote: Hello, I would like to know how to replace a failed vdev in a non redundant pool? I am using fiber attached disks, and cannot simply place the disk back into the machine, since it is virtual. I have the latest kernel from sept 2010 that includes all of the new ZFS upgrades. Please, can you help me? - Cassandra (609) 243-2413 Unix Administrator From a little spark may burst a mighty flame. -Dante Alighieri Scott Meilicke
Re: [zfs-discuss] Optimal raidz3 configuration
Hello Peter, Read the ZFS Best Practices Guide to start. If you still have questions, post back to the list. http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pool_Performance_Considerations -Scott On Oct 13, 2010, at 3:21 PM, Peter Taps wrote: Folks, If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do I create multiple raidz3 vdevs? Is there any advantage of having multiple raidz3 vdevs in a single pool? Thank you in advance for your help. Regards, Peter Scott Meilicke
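For the 20-disk case, the two layouts being compared look roughly like this (pool and device names are placeholders); the two-vdev layout costs six parity disks instead of three, but gives twice the random IOPS and a smaller resilver domain per vdev:

  # one wide raidz3 vdev (17 data + 3 parity)
  zpool create tank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
      c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0

  # two raidz3 vdevs of ten disks each (7 data + 3 parity per vdev)
  zpool create tank \
      raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
      raidz3 c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0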
Re: [zfs-discuss] Bursty writes - why?
On Oct 12, 2010, at 3:31 PM, Bob Friesenhahn wrote: For obvious reasons, the SLOG is designed to write sequentially. Otherwise it would offer much less benefit. Maybe this random-write issue with Sandforce would not be a problem? Isn't writing from cache to disk designed to be sequential, while writes to the ZIL/SLOG will be more random (in order to commit quickly)? Scott Meilicke
Re: [zfs-discuss] [RFC] Backup solution
On Oct 8, 2010, at 8:25 AM, Bob Friesenhahn wrote: It also does not include the human factor which is still the most significant contributor to data loss. This is the most difficult factor to diminish. If the humans have difficulty understanding the system or the hardware, then they are more likely to do something wrong which damages the data. This is often overlooked during a system design. It is very easy to lose your head during a high stress moment, and pull the wrong drive (I of course, have never done that... ahem). Having z2(3) / triple mirrors, graphical pictures of which disk has failed, working LED failure lights, and letting a hot spare finish resilvering before replacing a disk are all good counter measures. It also does not account for an OS kernel which caches quite a lot of data in memory (relying on ECC for reliability), and which may have bugs. At some point you have to rely on your backups for the unexpected and unforeseen. Make sure they are good! Michael, nice reliability write up! -- Scott Meilicke
Re: [zfs-discuss] [RFC] Backup solution
Those must be pretty busy drives. I had a recent failure of a 1.5T disk in a 7 disk raidz2 vdev that took about 16 hours to resilver. There was very little IO on the array, and it had maybe 3.5T of data to resilver. On Oct 7, 2010, at 3:17 PM, Ian Collins wrote: I would seriously consider raidz3, given I typically see 80-100 hour resilver times for 500G drives in raidz2 vdevs. If you haven't already, read Adam Leventhal's paper: http://queue.acm.org/detail.cfm?id=1670144 -- Ian. Scott Meilicke
Re: [zfs-discuss] Finding corrupted files
Scrub? On Oct 6, 2010, at 6:48 AM, Stephan Budach wrote: No - not a trick question, but maybe I didn't make myself clear. Is there a way to discover such bad files other than trying to actually read from them one by one, say using cp or by sending a snapshot elsewhere? I am well aware that the file shown in zpool status -v is damaged and I have already restored it, but I wanted to know if there're more of them. Regards, budy Scott Meilicke
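To expand on the one-word answer: a scrub re-reads and verifies every allocated block, and any files with unrecoverable errors then show up in the status output; roughly (the pool name is a placeholder):

  zpool scrub tank        # walk and verify every allocated block
  zpool status -v tank    # during or after the scrub, lists files with permanent errors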
Re: [zfs-discuss] When is it okay to turn off the verify option.
Why do you want to turn verify off? If performance is the reason, is it significant, on and off? On Oct 4, 2010, at 2:28 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps As I understand, the hash generated by sha256 is almost guaranteed not to collide. I am thinking it is okay to turn off verify property on the zpool. However, if there is indeed a collision, we lose data. Scrub cannot recover such lost data. I am wondering in real life when is it okay to turn off verify option? I guess for storing business critical data (HR, finance, etc.), you cannot afford to turn this option off. Right on all points. It's a calculated risk. If you have a hash collision, you will lose data undetected, and backups won't save you unless *you* are the backup. That is, if the good data, before it got corrupted by your system, happens to be saved somewhere else before it reached your system. Scott Meilicke
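For reference, verification is set per dataset through the dedup property, so the trade-off can be made selectively rather than pool-wide; a sketch (dataset names are placeholders):

  # checksum-only dedup: fastest, trusts that sha256 never collides
  zfs set dedup=sha256 tank/scratch

  # dedup with byte-for-byte verification of checksum matches: the safe setting for critical data
  zfs set dedup=sha256,verify tank/finance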
Re: [zfs-discuss] Is there any way to stop a resilver?
Has it been running long? Initially the numbers are way off. After a while it settles down into something reasonable. How many disks, and what size, are in your raidz2? -Scott On 9/29/10 8:36 AM, LIC mesh licm...@gmail.com wrote: Is there any way to stop a resilver? We gotta stop this thing - at minimum, completion time is 300,000 hours, and maximum is in the millions. Raidz2 array, so it has the redundancy, we just need to get data off.
Re: [zfs-discuss] Is there any way to stop a resilver?
What version of OS? Are snapshots running (turn them off). So are there eight disks? On 9/29/10 8:46 AM, LIC mesh licm...@gmail.com wrote: It's always running less than an hour. It usually starts at around 300,000h estimate (at 1m in), goes up to an estimate in the millions (about 30mins in) and restarts. Never gets past 0.00% completion, and K resilvered on any LUN. 64 LUNs, 32x5.44T, 32x10.88T in 8 vdevs. On Wed, Sep 29, 2010 at 11:40 AM, Scott Meilicke scott.meili...@craneaerospace.com wrote: Has it been running long? Initially the numbers are way off. After a while it settles down into something reasonable. How many disks, and what size, are in your raidz2? -Scott On 9/29/10 8:36 AM, LIC mesh licm...@gmail.com wrote: Is there any way to stop a resilver? We gotta stop this thing - at minimum, completion time is 300,000 hours, and maximum is in the millions. Raidz2 array, so it has the redundancy, we just need to get data off.
Re: [zfs-discuss] Fwd: Is there any way to stop a resilver?
(I left the list off last time sorry) No, the resilver should only be happening if there was a spare available. Is the whole thing scrubbing? It looks like it. Can you stop it with a zpool scrub -s pool

So... Word of warning, I am no expert at this stuff. Think about what I am suggesting before you do it :). Although stopping a scrub is pretty innocuous. -Scott

On 9/29/10 9:22 AM, LIC mesh licm...@gmail.com wrote: You almost have it - each iSCSI target is made up of 4 of the raidz vdevs - 4 * 6 = 24 disks. 16 targets total. We have one LUN with status of UNAVAIL but didn't know if removing it outright would help - it's actually available and well as far as the target is concerned, so we thought it went UNAVAIL as a result of iSCSI timeouts - we've since fixed the switches buffers, etc. See: http://pastebin.com/pan9DBBS

On Wed, Sep 29, 2010 at 12:17 PM, Scott Meilicke scott.meili...@craneaerospace.com wrote: OK, let me see if I have this right: 8 shelves, 1T disks, 24 disks per shelf = 192 disks. 8 shelves, 2T disks, 24 disks per shelf = 192 disks. Each raidz is six disks. 64 raidz vdevs. Each iSCSI target is made up of 8 of these raidz vdevs (8 x 6 disks = 48 disks). Then the head takes these eight targets, and makes a raidz2. So the raidz2 depends upon all 384 disks. So when a failure occurs, the resilver is accessing all 384 disks. If I have this right, which I am in serious doubt :), then that will either take an enormous amount of time to complete, or never. It looks like never. Recovery: From the head, can you see which vdev has failed? If so, can you remove it to stop the resilver?

On 9/29/10 8:57 AM, LIC mesh licm...@gmail.com wrote: This is an iSCSI/COMSTAR array. The head was running 2009.06 stable with version 14 ZFS, but we updated that to build 134 (kept the old OS drives) - did not, however, update the zpool - it's still version 14. The targets are all running 2009.06 stable, exporting 4 raidz1 LUNs each of 6 drives - 8 shelves have 1TB drives, the other 8 have 2TB drives. The head sees the filesystem as comprised of 8 vdevs of 8 iSCSI LUNs each, with SSD ZIL and SSD L2ARC.

On Wed, Sep 29, 2010 at 11:49 AM, Scott Meilicke scott.meili...@craneaerospace.com wrote: [...]
[zfs-discuss] Resliver making the system unresponsive
This must be resilver day :) I just had a drive failure. The hot spare kicked in, and access to the pool over NFS was effectively zero for about 45 minutes. Currently the pool is still resilvering, but for some reason I can access the file system now. Resilver speed has been beaten to death I know, but is there a way to avoid this? For example, is more enterprisy hardware less susceptible to resilvers? This box is used for development VMs, but there is no way I would consider this for production with this kind of performance hit during a resilver.

My hardware:
Dell 2950
16G ram
16 disk SAS chassis
LSI 3801 (I think) SAS card (1068e chip)
Intel x25-e SLOG off of the internal PERC 5/i RAID controller
Seagate 750G disks (7200.11)

I am running Nexenta CE 3.0.3 (SunOS rawhide 5.11 NexentaOS_134f i86pc i386 i86pc Solaris)

  pool: data01
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Sep 29 14:03:52 2010
    1.12T scanned out of 5.00T at 311M/s, 3h37m to go
    82.0G resilvered, 22.42% done
config:

        NAME           STATE     READ WRITE CKSUM
        data01         DEGRADED     0     0     0
          raidz2-0     ONLINE       0     0     0
            c1t8d0     ONLINE       0     0     0
            c1t9d0     ONLINE       0     0     0
            c1t10d0    ONLINE       0     0     0
            c1t11d0    ONLINE       0     0     0
            c1t12d0    ONLINE       0     0     0
            c1t13d0    ONLINE       0     0     0
            c1t14d0    ONLINE       0     0     0
          raidz2-1     DEGRADED     0     0     0
            c1t22d0    ONLINE       0     0     0
            c1t15d0    ONLINE       0     0     0
            c1t16d0    ONLINE       0     0     0
            c1t17d0    ONLINE       0     0     0
            c1t23d0    ONLINE       0     0     0
            spare-5    REMOVED      0     0     0
              c1t20d0  REMOVED      0     0     0
              c8t18d0  ONLINE       0     0     0  (resilvering)
            c1t21d0    ONLINE       0     0     0
        logs
          c0t1d0       ONLINE       0     0     0
        spares
          c8t18d0      INUSE     currently in use

errors: No known data errors

Thanks for any insights. -Scott
Re: [zfs-discuss] Resliver making the system unresponsive
I should add I have 477 snapshots across all file systems. Most of them are hourly snaps (225 of them anyway). On Sep 29, 2010, at 3:16 PM, Scott Meilicke wrote: [...] Scott Meilicke
Re: [zfs-discuss] When Zpool has no space left and no snapshots
Preemptively use quotas? On 9/22/10 7:25 PM, Aleksandr Levchuk alevc...@gmail.com wrote: Dear ZFS Discussion, I ran out of space, consequently could not rm or truncate files. (It makes sense because it's a copy-on-write and any transaction needs to be written to disk. It worked out really well - all I had to do is destroy some snapshots.) If there are no snapshots to destroy, how to prepare for a situation when a ZFS pool loses its last free byte? Alex -- Scott Meilicke
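Concretely, that can be done either by capping the busy datasets or by parking a reservation that can be shrunk in an emergency; a rough sketch (pool and dataset names are placeholders):

  # keep any single dataset from filling the pool
  zfs set quota=500G tank/home

  # or hold back headroom that nothing else can allocate;
  # shrink or destroy this dataset if the pool ever fills up
  zfs create -o refreservation=10G tank/headroom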
Re: [zfs-discuss] Kernel panic on ZFS import - how do I recover?
Brilliant. I set those parameters via /etc/system, rebooted, and the pool imported with just the -f switch. I had seen this as an option earlier, although not that thread, but was not sure it applied to my case. Scrub is running now. Thank you very much! -Scott On 9/23/10 7:07 PM, David Blasingame Oracle david.blasing...@oracle.com wrote: Have you tried setting zfs_recover and aok in /etc/system or setting it with mdb? Read how to set via /etc/system http://opensolaris.org/jive/thread.jspa?threadID=114906 mdb debugger http://www.listware.net/201009/opensolaris-zfs/46706-re-zfs-discuss-how-to-set-zfszfsrecover1-and-aok1-in-grub-at-startup.html After you get the variables set and system booted, try importing, then running a scrub. Dave On 09/23/10 19:48, Scott Meilicke wrote: I posted this on the www.nexentastor.org forums, but no answer so far, so I apologize if you are seeing this twice. I am also engaged with nexenta support, but was hoping to get some additional insights here. I am running nexenta 3.0.3 community edition, based on 134. The box crashed yesterday, and goes into a reboot loop (kernel panic) when trying to import my data pool, screenshot attached. What I have tried thus far: Boot off of DVD, both 3.0.3 and 3.0.4 beta 8. 'zpool import -f data01' causes the panic in both cases. Boot off of 3.0.4 beta 8, ran zpool import -fF data01. That gives me a message like Pool data01 returned to its state as of ..., and then panics. The import -fF does seem to import the pool, but then immediately panic. So after booting off of DVD, I can boot from my hard disks, and the system will not import the pool because it was last imported from another system. I have moved /etc/zfs/zfs.cache out of the way, but no luck after a reboot and import. zpool import shows all of my disks are OK, and the pool itself is online. Is it time to start working with zdb? Any suggestions? This box is hosting development VMs, so I have some people idling their thumbs at the moment. Thanks everyone, -Scott
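In /etc/system form, the two settings mentioned above look roughly like this; they are a recovery aid rather than a permanent configuration (aok=1 turns failed kernel assertions into warnings system-wide), so remove them once the pool is healthy again:

  * /etc/system - temporary ZFS recovery settings
  set aok=1
  set zfs:zfs_recover=1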
Re: [zfs-discuss] Kernel panic on ZFS import - how do I recover?
I just realized that the email I sent to David and the list did not make the list (at least as jive can see it), so here is what I sent on the 23rd: Brilliant. I set those parameters via /etc/system, rebooted, and the pool imported with just the –f switch. I had seen this as an option earlier, although not that thread, but was not sure it applied to my case. Scrub is running now. Thank you very much! -Scott Update: The scrub finished with zero errors. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] My filesystem turned from a directory into a special character device
I am running nexenta CE 3.0.3. I have a file system that at some point in the last week went from a directory per 'ls -l' to a special character device. This results in not being able to get into the file system. Here is my file system, scott2, along with a new file system I just created, as seen by ls -l: drwxr-xr-x 4 root root4 Sep 27 09:14 scott crwxr-xr-x 9 root root 0, 0 Sep 20 11:51 scott2 Notice the 'c' vs. 'd' at the beginning of the permissions list. I had been fiddling with permissions last week, then had problems with a kernel panic. Perhaps this is related? Any ideas how to get access to my file system? Thanks, -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] My filesystem turned from a directory into a special character device
On 9/27/10 9:56 AM, Victor Latushkin victor.latush...@oracle.com wrote: On Sep 27, 2010, at 8:30 PM, Scott Meilicke wrote: I am running nexenta CE 3.0.3. I have a file system that at some point in the last week went from a directory per 'ls -l' to a special character device. This results in not being able to get into the file system. Here is my file system, scott2, along with a new file system I just created, as seen by ls -l: drwxr-xr-x 4 root root 4 Sep 27 09:14 scott crwxr-xr-x 9 root root 0, 0 Sep 20 11:51 scott2 Notice the 'c' vs. 'd' at the beginning of the permissions list. I had been fiddling with permissions last week, then had problems with a kernel panic. Are you still running with aok/zfs_recover being set? Have you seen this issue before the panic? Yes. Well, I have removed those entries in /etc/system, but have not yet rebooted the box. Perhaps this is related? Maybe. Any ideas how to get access to my file system? This can be fixed, but it is a bit more complicated and error prone than setting a couple of variables. OK. Sounds like restoring from my backup would be best? What causes this? I saw this exact same behavior on my home box, and had to restore about two weeks ago. Not very encouraging. :( Is there anything I can provide to help people who know more than me solve this problem? Regards Victor Thanks Victor. -Scott ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup relationship between pool and filesystem
When I do the calculations, assuming 300 bytes per block to be conservative, with 128K blocks, I get 2.34G of cache (RAM or L2ARC) per terabyte of deduped data. But block size is dynamic, and smaller blocks mean more dedup table entries, so you will likely need more than this. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
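The arithmetic behind that figure, for anyone checking (binary units, and the ~300 bytes per dedup-table entry assumed above):

  1 TB / 128 KB per block = 8,388,608 blocks
  8,388,608 blocks x 300 bytes = ~2.34 GB of dedup table per TB of unique data

If your average record size is smaller than 128K, the block count - and therefore the RAM/L2ARC requirement - goes up proportionally.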
Re: [zfs-discuss] Data transfer taking a longer time than expected (Possibly dedup related)
Can I disable dedup on the dataset while the transfer is going on? Yes. Only the blocks copied after disabling dedupe will not be deduped. The stuff you have already copied will be deduped. Can I simply Ctrl-C the procress to stop it? Yes, you can do that to a mv process. Maybe stop the process, delete the deduped file system (your copy target), and create a new file system without dedupe to see if that is any better? Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
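For reference, disabling dedup on the target dataset is just a property change and affects new writes only; a sketch, with a made-up dataset name:

  zfs set dedup=off tank/target
  zfs get dedup tank/target

Blocks written while dedup was on stay in the dedup table until they are freed (for example by destroying the file system), which is why deleting the deduped target and recreating it without dedup can help.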
Re: [zfs-discuss] Dedup relationship between pool and filesystem
Hi Peter, dedupe is pool wide. File systems can opt in or out of dedupe. So if multiple file systems are set to dedupe, then they all benefit from using the same pool of deduped blocks. In this way, if two files share some of the same blocks, even if they are in different file systems, they will dedupe. I am not sure why reporting is not done at the file system level. It may be an accounting issue, i.e. which file system owns the dedupe blocks. But it seems some fair estimate could be made. Maybe the overhead to keep a file system updated with these stats is too high? -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
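This split is also visible in the tools: the ratio is a pool-level property, so you check it with zpool rather than zfs (pool and dataset names hypothetical):

  zpool get dedupratio tank
  zpool list tank

while dedup on/off is the per-dataset knob:

  zfs get dedup tank/fs1 tank/fs2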
Re: [zfs-discuss] Configuration questions for Home File Server (CPU cores, dedup, checksum)?
Craig, 3. I do not think you will get much dedupe on video, music and photos. I would not bother. If you really wanted to know at some later stage, you could create a new file system, enable dedupe, and copy your data (or a subset) into it just to see. In my experience there is a significant CPU penalty as well. My four core (1.86GHz xeons, 4 yrs old) box nearly maxes out when putting a lot of data into a deduped file system. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
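A rough way to run that experiment, with made-up names:

  zfs create -o dedup=on tank/dedup-test
  cp -rp /tank/media/sample /tank/dedup-test/
  zpool get dedupratio tank

Note the ratio is reported pool-wide, so this gives a clean number only if nothing else in the pool is deduped. On recent builds, zdb -S poolname will also simulate dedup against existing data without enabling it, if I remember correctly.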
Re: [zfs-discuss] ZFS development moving behind closed doors
I had already begun the process of migrating my 134 boxes over to Nexenta before Oracle's cunning plans became known. This just reaffirms my decision. Us too. :) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snapshot space - miscalculation?
Are there other file systems underneath daten/backups that have snapshots? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slog/L2ARC on a hard drive and not SSD?
Another data point - I used three 15K disks striped using my RAID controller as a slog for the zil, and performance went down. I had three raidz sata vdevs holding the data, and my load was VMs, i.e. a fair amount of small, random IO (60% random, 50% write, ~16k in size). Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Deleting large amounts of files
If these files are deduped, and there is not a lot of RAM on the machine, it can take a long, long time to work through the dedupe portion. I don't know enough to know if that is what you are experiencing, but it could be the problem. How much RAM do you have? Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Announce: zfsdump
At this point, I will repeat my recommendation about using zpool-in-files as a backup (staging) target. Depending on where you host them, and how you combine the files, you can achieve these scenarios without clunkery, and with all the benefits a zpool provides. This is another good scheme. I see a number of points to consider when choosing amongst the various suggestions for backing up zfs file systems. In no particular order, I have these:

1. Does it work in place, or need an intermediate copy on disk?
2. Does it respect ACLs?
3. Does it respect zfs snapshots?
4. Does it allow random access to files, or only full file system restore?
5. Can it (mostly) survive partial data corruption?
6. Can it handle file systems larger than a single tape?
7. Can it stream to multiple tapes in parallel?
8. Does it understand the concept of incremental backups?

I still see this as a serious gap in the offering of zfs. Clearly so do many other people, as there are a lot of methods offered to handle at least some of the above. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Announce: zfsdump
would be nice if i could pipe the zfs send stream to a split and then send those split streams over the network to a remote system. it would help sending it over to the remote system quicker. can your tool do that? something like this: zfs send | split into multiple streams | send each over the network | join | zfs recv (on the remote system), rather than copying from the fifos to tape(s). Asif Iqbal I did look at doing this, with the intention of allowing simultaneous streams to multiple tape drives, but put the idea to one side. I thought of providing interleaved streams, but wasn't happy with the idea that the whole process would block when one of the pipes stalled. I also contemplated dividing the stream into several large chunks, but for them to run simultaneously that seemed to require several reads of the original dump stream. Besides the expense of this approach, I am not certain that repeated zfs send streams have exactly the same byte content. I think that probably the best approach would be the interleaved streams. That said, I am not sure how this would necessarily help with the situation you describe. Isn't the limiting factor going to be the network bandwidth between remote machines? Won't you end up with four streams running at quarter speed? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Announce: zfsdump
if, for example, the network pipe is bigger than one unsplit stream of zfs send | zfs recv, then splitting it into multiple streams should optimize the network bandwidth, shouldn't it? Well, I guess so. But I wonder what the bottleneck is here. If it is the rate at which zfs send can stream data, there is a good chance that is limited by disk read. If we split it into four pipes, I still think you are going to see four quarter-rate reads. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Announce: zfsdump
evik wrote: Reading this list for a while made it clear that zfs send is not a backup solution, it can be used for cloning the filesystem to a backup array if you are consuming the stream with zfs receive so you get notified immediately about errors. Even one bitflip will render the stream unusable and you will loose all data, not just part of your backup cause zfs receive will restore the whole filesystem or nothing at all depending on the correctness of the stream. You can use par2 or something similar to try to protect the stream against bit flips but that would require a lot of free storage space to recover from errors. e The all or nothing aspect does make me nervous, but there are things which can be done about it. The first step, I think, is to calculate a checksum of the data stream(s). -k chkfile. Calculates MD5 checksums for each tape and for the stream as a whole. These are written to chkfile, or if specified as -, then to stdout. Run the dump stream back through digest -a md5 and verify that it is intact. Certainly, using an error correcting code could help us out, but at additional expense, both computational and storage. Personally, for disaster recovery purposes, I think that verifying the data after writing to tape is good enough. What I am looking to guard against is the unlikely event that I have a hardware (or software) failure, or serious human error. This is okay with the zfs send stream, unless, of course, we get a data corruption on the tape. I think the correlation between hardware failure today and tape corruption since yesterday / last week when I last backed up must be pretty small. In the event that I reach for the tape and find it corrupted, I go back a week to the previous full dump stream. Clearly the strength of the backup solution needs to match the value of the data, and especially the cost of not having the data. For our large database applications we mirror to a remote location, and use tape backup. But still, I find the ability to restore the zfs filesystem with all its snapshots very useful, which is why I choose to work with zfs send. Tristram ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
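For what it's worth, verifying a tape against the recorded checksum is straightforward - something along these lines, with the device and block size being whatever you dumped with:

  mt -f /dev/rmt/1ln rewind
  dd if=/dev/rmt/1ln bs=1024k | digest -a md5
  # compare the output against the per-tape MD5 recorded in chkfile

If the MD5s match, the stream on tape is bit-for-bit what zfs send produced.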
[zfs-discuss] Announce: zfsdump
For quite some time I have been using zfs send -R fsn...@snapname | dd of=/dev/rmt/1ln to make a tape backup of my zfs file system. A few weeks back the size of the file system grew to larger than would fit on a single DAT72 tape, and I once again searched for a simple solution to allow dumping of a zfs file system to multiple tapes. Once again I was disappointed... I expect there are plenty of other ways this could have been handled, but none leapt out at me. I didn't want to pay large sums of cash for a commercial backup product, and I didn't see that Amanda would be an easy thing to fit into my existing scripts. In particular, (and I could well be reading this incorrectly) it seems that the commercial products, Amanda, star, all are dumping the zfs file system file by file (with or without ACLs). I found none which would allow me to dump the file system and its snapshots, unless I used zfs send to a scratch disk, and dumped to tape from there. But, of course, that assumes I have a scratch disk large enough. So, I have implemented zfsdump as a ksh script. The method is as follows: 1. Make a bunch of fifos. 2. Pipe the stream from zfs send to split, with split writing to the fifos (in sequence). 3. Use dd to copy from the fifos to tape(s). When the first tape is complete, zfsdump returns. One then calls it again, specifying that the second tape is to be used, and so on. From the man page: Example 1. Dump the @Tues snapshot of the tank filesystem to the non-rewinding, non-compressing tape, with a 36GB capacity: zfsdump -z t...@tues -a -R -f /dev/rmt/1ln -s 36864 -t 0 For the second tape: zfsdump -z t...@tues -a -R -f /dev/rmt/1ln -s 36864 -t 1 If you would like to try it out, download the package from: http://www.quantmodels.co.uk/zfsdump/ I have packaged it up, so do the usual pkgadd stuff to install. Please, though, [b]try this out with caution[/b]. Build a few test file systems, and see that it works for you. [b]It comes without warranty of any kind.[/b] Tristram -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
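For anyone curious what the fifo plumbing looks like, here is a bare-bones sketch of the same idea - not the zfsdump script itself, and the names, sizes, and the assumption that your split(1) accepts -b with an m suffix are all mine:

  # one fifo per tape-sized chunk
  mkfifo /tmp/chunk.aa /tmp/chunk.ab
  # split the send stream across the fifos, 36 GB per chunk, in the background
  zfs send -R tank@tues | split -b 36864m - /tmp/chunk. &
  # copy the first chunk to tape
  dd if=/tmp/chunk.aa of=/dev/rmt/1ln bs=1024k
  # change tapes, then copy the second chunk
  dd if=/tmp/chunk.ab of=/dev/rmt/1ln bs=1024k

split blocks when it opens a fifo that has no reader yet, which is what lets the whole thing run without a scratch disk.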
Re: [zfs-discuss] Announce: zfsdump
I use Bacula which works very well (much better than Amanda did). You may be able to customize it to do direct zfs send/receive, however I find that although they are great for copying file systems to other machines, they are inadequate for backups unless you always intend to restore the whole file system. Most people want to restore a file or directory tree of files, not a whole file system. In the past 25 years of backups and restores, I've never had to restore a whole file system. I get requests for a few files, or somebody's mailbox or somebody's http document root. You can directly install it from CSW (or blastwave). Thanks for your comments, Brian. I should look at Bacula in more detail. As for full restore versus ad hoc requests for files I just deleted, my experience is mostly similar to yours, although I have had need for full system restore more than once. For the restore of a few files here and there, I believe this is now well handled with zfs snapshots. I have always found these requests to be down to human actions. The need for full system restore has (almost) always been hardware failure. If the file was there an hour ago, or yesterday, or last week, or last month, then we have it in a snapshot. If the disk died horribly during a power outage (grrr!) then it would be very nice to be able to restore not only the full file system, but also the snapshots too. The only way I know of achieving that is by using zfs send etc. On 6/28/2010 11:26 AM, Tristram Scott wrote: [snip] Tristram ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discu ss -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] COMSTAR iSCSI and two Windows computers
Look again at how XenServer does storage. I think you will find it already has a solution, both for iSCSI and NFS. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
Reaching into the dusty regions of my brain, I seem to recall that since RAIDz does not work like a traditional RAID 5, particularly because of variably sized stripes, that the data may not hit all of the disks, but it will always be redundant. I apologize for not having a reference for this assertion, so I may be completely wrong. I assume your hardware is recent, the controllers are on PCIe x4 buses, etc. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool export / import discrepancy
Hello All, I've migrated a JBOD of 16 drives from one server to another. I did a zpool export from the old system and a zpool import to the new system. One thing I did notice is since the drives are on a different controller card, the naming is different (as expected) but the order is also different. I setup the drives as passthrough on the controller card and went through each drive incrementally. I assumed the zpool import would have listed the drives in the order of c10t2d0, d1, d2, ... c10t3d7. As shown below the order the drives were imported is c10t2d0, d2, d3, d1, c10t3d0 through d7.

Original zpool setup on old server:

zpool status backup
  pool: backup
 state: ONLINE
config:
        NAME         STATE     READ WRITE CKSUM
        backup       ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c7t1d0   ONLINE       0     0     0
            c7t2d0   ONLINE       0     0     0
            c7t3d0   ONLINE       0     0     0
            c7t4d0   ONLINE       0     0     0
            c7t5d0   ONLINE       0     0     0
            c7t6d0   ONLINE       0     0     0
            c7t7d0   ONLINE       0     0     0
            c7t8d0   ONLINE       0     0     0
            c7t9d0   ONLINE       0     0     0
            c7t10d0  ONLINE       0     0     0
            c7t11d0  ONLINE       0     0     0
            c7t12d0  ONLINE       0     0     0
            c7t13d0  ONLINE       0     0     0
            c7t14d0  ONLINE       0     0     0
            c7t15d0  ONLINE       0     0     0
        spares
          c7t16d0    AVAIL

Imported zpool on new server:

zpool status backup
  pool: backup
 state: ONLINE
config:
        NAME         STATE     READ WRITE CKSUM
        backup       ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t2d2  ONLINE       0     0     0
            c10t2d3  ONLINE       0     0     0
            c10t2d1  ONLINE       0     0     0
            c10t2d4  ONLINE       0     0     0
            c10t2d5  ONLINE       0     0     0
            c10t2d6  ONLINE       0     0     0
            c10t2d7  ONLINE       0     0     0
            c10t3d0  ONLINE       0     0     0
            c10t3d1  ONLINE       0     0     0
            c10t3d2  ONLINE       0     0     0
            c10t3d3  ONLINE       0     0     0
            c10t3d4  ONLINE       0     0     0
            c10t3d5  ONLINE       0     0     0
            c10t3d6  ONLINE       0     0     0
        spares
          c10t3d7    AVAIL

Is ZFS dependent on the order of the drives? Will this cause any issue down the road? Thank you all; Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
Price? I cannot find it. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] combining series of snapshots
You might bring over all of your old data and snaps, then clone that into a new volume. Bring your recent stuff into the clone. Since the clone only updates blocks that are different than the underlying snap, you may see a significant storage savings. Two clones could even be made - one for your live data, another to access the historical data. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
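A sketch of that layout, with hypothetical names:

  # bring the old data over with its snapshot history
  zfs send -R oldpool/data@final | zfs receive tank/archive
  # clone the newest snapshot to act as the live filesystem
  zfs clone tank/archive@final tank/live
  # copy the recent data into the clone; only blocks that differ consume new space
  rsync -a /path/to/recent/ /tank/live/

A second clone of the same snapshot can be kept read-only for browsing the historical data.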
Re: [zfs-discuss] iScsi slow
iSCSI writes require a sync to disk for every write. SMB writes get cached in memory, therefore are much faster. I am not sure why it is so slow for reads. Have you tried comstar iSCSI? I have read in these forums that it is faster. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] iSCSI confusion
VMware will properly handle sharing a single iSCSI volume across multiple ESX hosts. We have six ESX hosts sharing the same iSCSI volumes - no problems. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Consolidating a huge stack of DVDs using ZFS dedup: automation?
On 05/04/2010 09:29 AM, Kyle McDonald wrote: On 3/2/2010 10:15 AM, Kjetil Torgrim Homme wrote: valrh...@gmail.com valrh...@gmail.com writes: I have been using DVDs for small backups here and there for a decade now, and have a huge pile of several hundred. They have a lot of overlapping content, so I was thinking of feeding the entire stack into some sort of DVD autoloader, which would just read each disk, and write its contents to a ZFS filesystem with dedup enabled. [...] That would allow me to consolidate a few hundred CDs and DVDs onto probably a terabyte or so, which could then be kept conveniently on a hard drive and archived to tape. it would be inconvenient to make a dedup copy on harddisk or tape, you could only do it as a ZFS filesystem or ZFS send stream. it's better to use a generic tool like hardlink(1), and just delete files afterwards with There is a perl script that has been floating around on the internet for years that will convert copies of files on the same FS to hardlinks (sorry I don't have the name handy). So you don't need ZFS. Once this is done you can even recreate an ISO and burn it back to DVD (possibly merging hundreds of CDs into one DVD or BD!). The script can also delete the duplicates, but there isn't much control over which one it keeps - for backups you may really want to keep the earliest (or latest?) backup the file appeared in. I've used Dirvish http://www.dirvish.org/ and rsync to do just that...worked great! Scott Using ZFS Dedup is an interesting way of doing this. However archiving the result may be hard. If you use different datasets (FS's) for each backup, can you only send 1 dataset at a time (since you can only snapshot at the dataset level)? Won't that 'undo' the deduping? If you instead put all the backups on one data set, then the snapshot can theoretically contain the deduped data. I'm not clear on whether 'send'ing it will preserve the deduping or not - or if it's up to the receiving dataset to recognize matching blocks? If the dedup is in the stream, then you may be able to write the stream to a DVD or BD. Still, if you save enough space so that you can add the required level of redundancy, you could just leave it on disk and chuck the DVDs. Not sure I'd do that, but it might let me put the media in the basement, instead of the closet, or on the desk next to me. -Kyle ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for ISCSI ntfs backing store.
At the time we had it setup as 3 x 5 disk raidz, plus a hot spare. These 16 disks were in a SAS cabinet, and the the slog was on the server itself. We are now running 2 x 7 raidz2 plus a hot spare and slog, all inside the cabinet. Since the disks are 1.5T, I was concerned about resliver times for a failed disk. About the only thing I would consider at this point is getting an SSD for the l2arc for dedupe performance. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Benchmarking Methodologies
My use case for OpenSolaris is as a storage server for a VM environment (we also use EqualLogic, and soon an EMC CX4-120). To that end, I use iometer within a VM, simulating my VM IO activity, with some balance given to easy benchmarking. We have about 110 VMs across eight ESX hosts. Here is what I do:

* Attach a 100G vmdk to one Windows 2003 R2 VM
* Create a 32G test file (my opensolaris box has 16G of RAM)
* export/import the pool on the solaris box, and reboot my guest to clear caches all around
* Run a disk queue depth of 32 outstanding IOs
* 60% read, 65% random, 8k block size
* Run five minutes of ramp-up, then run the test for five minutes

My actual workload is closer to 50% read, 16k block size, so I adjust my interpretation of the results accordingly. Probably I should run a lot more iometer daemons. Performance will increase as the benchmark runs due to the l2arc filling up, so I found that starting the benchmark at 5 minutes into the workload was a happy medium. Things will get a bit faster the longer the benchmark runs, but this is good as far as benchmarking goes. Only occasionally do I get wacko results, which I happily toss out the window. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for ISCSI ntfs backing store.
I have used build 124 in this capacity, although I did zero tuning. I had about 4T of data on a single 5T iSCSI volume over gigabit. The windows server was a VM, and the opensolaris box is on a Dell 2950, 16G of RAM, x25e for the zil, no l2arc cache device. I used comstar. It was being used as a target for Doubletake, so it only saw write IO, with very little read. My load testing using iometer was very positive, and I would not have hesitated to use it as the primary node serving about 1000 users, maybe 200-300 active at a time. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] sharing a ssd between rpool and l2arc
Just clarifying Darren's comment - we got bitten by this pretty badly so I figure it's worth saying again here. ZFS will *allow* you to use a ZVOL of one pool as a ZDEV in another pool, but it results in race conditions and an unstable system. (At least on Solaris 10 update 8). We tried to use a ZVOL from rpool (on fast 15k rpm drives) as a cache device for another pool (on slower 7.2k rpm drives). It worked great up until it hit the race condition and hung the system. It would have been nice if zfs had issued a warning, or at least if this fact was better documented. Scott Duckworth, Systems Programmer II Clemson University School of Computing On Tue, Mar 30, 2010 at 5:09 AM, Darren J Moffat darr...@opensolaris.orgwrote: On 30/03/2010 10:05, Erik Trimble wrote: F. Wessels wrote: Thanks for the reply. I didn't get very much further. Yes, ZFS loves raw devices. When I had two devices I wouldn't be in this mess. I would simply install opensolaris on the first disk and add the second ssd to the data pool with a zpool add mpool cache cxtydz Notice that no slices or partitions were used. But I don't have space for two devices. So I have to deal with slices and partitions. I did another clean install in 12Gb partition leaving 18Gb free. I tried parted to resize the partition, but it said that resizing (solaris2) partitions wasn't implemented. I tried fdisk but no luck either. I tried the send and receive, create new partition and slices, restore rpool in slice0, do installgrub but it wouldn't boot anymore. Can anybody give a summary of commands/steps howto accomplish a bootable rpool and l2arc on a ssd. Preferably for the x86 platform. Look up zvols, as this is what you want to use, NOT partitions (for the many reasons you've encountered). In this case partitions is the only way this will work. In essence, do a normal install, using the ENTIRE disk for your rpool. Then create a zvol in the rpool: # zfs create -V 8GB rpool/zvolname Add this zvol as the cache device (L2arc) for your other pool # zpool create tank mirror c1t0d0 c1t1d0s0 cache rpool/zvolname That won't work L2ARC devices can not be a ZVOL of another pool, they can't be a file either. An L2ARC device must be a physical device. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Rethinking my zpool
You will get much better random IO with mirrors, and better reliability when a disk fails with raidz2. Six sets of mirrors are fine for a pool. From what I have read, a hot spare can be shared across pools. I think the correct term would be load balanced mirrors, vs RAID 10. What kind of performance do you need? Maybe raidz2 will give you the performance you need. Maybe not. Measure the performance of each configuration and decide for yourself. I am a big fan of iometer for this type of work. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is this a sensible spec for an iSCSI storage box?
One of the reasons I am investigating solaris for this is sparse volumes and dedupe could really help here. Currently we use direct attached storage on the dom0s and allocate an LVM to the domU on creation. Just like your example above, we have lots of those 80G to start with please volumes with 10's of GB unused. I also think this data set would dedupe quite well since there are a great many identical OS files across the domUs. Is that assumption correct? This is one reason I like NFS - thin by default, and no wasted space within a zvol. zvols can be thin as well, but opensolaris will not know the inside format of the zvol, and you may still have a lot of wasted space after a while as files inside of the zvol come and go. In theory dedupe should work well, but I would be careful about a possible speed hit. I've not seen an example of that before. Do you mean having two 'head units' connected to an external JBOD enclosure or a proper HA cluster type configuration where the entire thing, disks and all, are duplicated? I have not done any type of cluster work myself, but from what I have read on Sun's site, yes, you could connect the same jbod to two head units, active/passive, in an HA cluster, but no duplicate disks/jbod. When the active goes down, passive detects this and takes over the pool by doing an import. During the import, any outstanding transactions on the zil are replayed, whether they are on a slog or not. I believe this is how Sun does it on their open storage boxes (7000 series). Note - two jbods could be used, one for each head unit, making an active/active setup. Each jbod is active on one node, passive on the other. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is this a sensible spec for an iSCSI storage box?
It is hard, as you note, to recommend a box without knowing the load. How many linux boxes are you talking about? I think having a lot of space for your L2ARC is a great idea. Will you mirror your SLOG, or load balance them? I ask because perhaps one will be enough, IO wise. My box has one SLOG (X25-E) and can support about 2600 IOPS using an iometer profile that closely approximates my work load. My ~100 VMs on 8 ESX boxes average around 1000 IOPS, but can peak 2-3x that during backups. Don't discount NFS. I absolutely love NFS for management and thin provisioning reasons. Much easier (to me) than managing iSCSI, and performance is similar. I highly recommend load testing both iSCSI and NFS before you go live. Crash consistent backups of your VMs are possible using NFS, and recovering a VM from a snapshot is a little easier using NFS, I find. Why not larger capacity disks? Hopefully your switches support NIC aggregation? The only issue I have had on 2009.06 using iSCSI (I had a windows VM directly attaching to an iSCSI 4T volume) was solved and back ported to 2009.06 (bug 6794994). -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is this a sensible spec for an iSCSI storage box?
I was planning to mirror them - mainly in the hope that I could hot swap a new one in the event that an existing one started to degrade. I suppose I could start with one of each and convert to a mirror later although the prospect of losing either disk fills me with dread. You do not need to mirror the L2ARC devices, as the system will just hit disk as necessary. Mirroring sounds like a good idea on the SLOG, but this has been much discussed on the forums. Why not larger capacity disks? We will run out of iops before we run out of space. Interesting. I find IOPS is more proportional to the number of VMs vs disk space. User: I need a VM that will consume up to 80G in two years, so give me an 80G disk. Me: OK, but recall we can expand disks and filesystems on the fly, without downtime. User: Well, that is cool, but 80G to start with please. Me: sigh I also believe the SLOG and L2ARC will make using high RPM disks not as necessary. But, from what I have read, higher RPM disks will greatly help with scrubs and reslivers. Maybe two pools - one with fast mirrored SAS, another with big SATA. Or all SATA, but one pool with mirrors, another with raidz2. Many options. But measure to see what works for you. iometer is great for that, I find. Any opinions on the use of battery backed SAS adapters? Surely these will help with performance in write back mode, but I have not done any hard measurements. Anecdotally my PERC5i in a Dell 2950 seemed to greatly help with IOPS on a five disk raidz. There are pros and cons. Search the forums, but off the top of my head 1) SLOGs are much larger than controller caches: 2) only synced write activity is cached in a ZIL, whereas a controller cache will cache everything, needed or not, thus running out of space sooner; 3) SLOGS and L2ARC devices are specialized caches for read and write loads, vs. the all in one cache of a controller. 4) A controller *may* be faster, since it uses ram for the cache. One of the benefits of a SLOG on the SAS/SATA bus is for a cluster. If one node goes down, the other can bring up the pool, check the ZIL for any necessary transactions, and apply them. To do this with battery backed cache, you would need fancy interconnects between the nodes, cache mirroring, etc. All of those things that SAN array products do. Sounds like you have a fun project. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/OSOL/Firewire...
Apple users have different expectations regarding data loss than Solaris and Linux users do. Come on, no Apple user bashing. Not true, not fair. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can we get some documentation on iSCSI sharing after comstar took over?
This is what I used: http://wikis.sun.com/display/OpenSolarisInfo200906/How+to+Configure+iSCSI+Target+Ports

I distilled that to: disable the old, enable the new (comstar)

* sudo svcadm disable iscsitgt
* sudo svcadm enable stmf

Then four steps (using my zfs/zpool info - substitute for yours):

* sudo zfs create -s -V 5t data01/san/gallardo/g (the -s makes it thin, -V specifies a block volume)
* sbdadm create-lu /dev/zvol/rdsk/data01/san/gallardo/g
* sudo itadm create-target
* sudo stmfadm add-view 600144F0E24785004A80910A0001

This should allow any initiator to connect to your volume, no security. Not quite a one liner. After you create the target once (step 3), you do not have to do that again for the next volume. So three lines. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backup zpool to tape
Greg, I am using NetBackup 6.5.3.1 (7.x is out) with fine results. Nice and fast. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [osol-discuss] Moving Storage to opensolaris+zfs. What a
To be clear, you can do what you want with the following items (besides your server): (1) OpenSolaris LiveCD (1) 8GB USB Flash drive As many tapes as you need to store your data pools on. Make sure the USB drive has a saved stream from your rpool. It should also have a downloaded copy of whichever main backup software you use. That's it. You backup data using Amanda/Bacula/et al onto tape. You backup your boot/root filesystem using 'zfs send' onto the USB key. Erik, great! I never thought of the USB key to store an rpool copy. I will give it a go on my test box. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz2 array FAULTED with only 1 drive down
You might have to force the import with -f. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD and ZFS
I don't think adding an SSD mirror to an existing pool will do much for performance. Some of your data will surely go to those SSDs, but I don't think Solaris will know they are SSDs and move blocks in and out according to usage patterns to give you an all-around boost. They will just be used to store data, nothing more. Perhaps it will be more useful to add the SSDs as either an L2ARC or a SLOG for the ZIL, but that will depend upon your workload. If you do NFS or iSCSI access, then putting the ZIL onto the SSD drive(s) will speed up writes. Adding them to the L2ARC will speed up reads. Here is the ZFS best practices guide, which should help with this decision: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide Read that, then come back with more questions. Best, Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
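The commands themselves are simple; roughly (device names are made up, and whether a log can later be removed depends on your pool version):

  # use the SSD as a read cache
  zpool add tank cache c7t1d0
  # or as a dedicated log device (mirrored, if you have two)
  zpool add tank log mirror c7t1d0 c7t2d0

Cache devices can be added and removed freely; be more careful with log devices on older pool versions, since they could not be removed once added.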
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
Thanks Dan. When I try the clone then import: pfexec zfs clone data01/san/gallardo/g...@zfs-auto-snap:monthly-2009-12-01-00:00 data01/san/gallardo/g-testandlab pfexec sbdadm import-lu /dev/zvol/rdsk/data01/san/gallardo/g-testandlab The sbdadm import-lu gives me: sbdadm: guid in use which makes sense, now that I see it. The man pages make it look like I cannot give it another GUID during the import. Any other thoughts? I *could* delete the current lu, import, get my data off and reverse the process, but that would take the current volume off line, which is not what I want to do. Thanks, Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
Sure, but that will put me back into the original situation. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
That is likely it. I create the volume using 2009.06, then later upgraded to 124. I just now created a new zvol, connected it to my windows server, formatted, and added some data. Then I snapped the zvol, cloned the snap, and used 'pfexec sbdadm create-lu'. When presented to the windows server, it behaved as expected. I could see the data I created prior to the snapshot. Thank you very much Dave (and everyone else). Now, -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
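So the working sequence, roughly, using the dataset names from earlier in this thread (the GUID is whatever create-lu prints; the snapshot name is made up):

  zfs snapshot data01/san/gallardo/g@now
  zfs clone data01/san/gallardo/g@now data01/san/gallardo/g-testandlab
  sbdadm create-lu /dev/zvol/rdsk/data01/san/gallardo/g-testandlab
  stmfadm add-view -h HG-Gallardo -t TG-Gallardo <guid-from-create-lu>

create-lu assigns the clone its own GUID, which avoids the 'guid in use' error that import-lu gave on the clone.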
Re: [zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
I plan on filing a support request with Sun, and will try to post back with any results. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Mounting a snapshot of an iSCSI volume using Windows
I have a single zfs volume, shared out using COMSTAR and connected to a Windows VM. I am taking snapshots of the volume regularly. I now want to mount a previous snapshot, but when I go through the process, Windows sees the new volume, but thinks it is blank and wants to initialize it. Any ideas how to get Windows to see that it has data on it? Steps I took after the snap: zfs clone snapshot data01/san/gallardo/g-recovery sbdadm create-lu /dev/zvol/rdsk/data01/san/gallardo/g-recovery stmfadm add-view -h HG-Gallardo -t TG-Gallardo -n 1 600144F0EAE40A004B6B59090003 At this point, my server Gallardo can see the LUN, but like I said, it looks blank to the OS. I suspect the 'sbdadm create-lu' phase. Any help to get Windows to see it as a LUN with NTFS data would be appreciated. Thanks, Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS configuration suggestion with 24 drives
Link aggregation can use different algorithms to load balance. Using L4 (IP plus originating port I think), using a single client computer and the same protocol (NFS), but different origination ports has allowed me to saturate both NICS in my LAG. So yes, you just need more than one 'conversation', but the LAG setup will determine how a conversation is defined. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
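For reference, the hashing policy is chosen when the aggregation is created (or changed later); something like this on a Crossbow-era build, with made-up NIC names:

  dladm create-aggr -P L4 -l e1000g0 -l e1000g1 aggr1
  # or switch an existing aggregation to L4 hashing
  dladm modify-aggr -P L4 aggr1

L2 hashes on MAC, L3 on IP, and L4 folds in the TCP/UDP port, which is what lets multiple NFS connections from one client spread across the links.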
Re: [zfs-discuss] ZFS configuration suggestion with 24 drives
It looks like there is not a free slot for a hot spare? If that is the case, then it is one more factor to push towards raidz2, as you will need time to remove the failed disk and insert a new one. During that time you don't want to be left unprotected. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS/NFS/LDOM performance issues
[Cross-posting to ldoms-discuss] We are occasionally seeing massive time-to-completions for I/O requests on ZFS file systems on a Sun T5220 attached to a Sun StorageTek 2540 and a Sun J4200, and using a SSD drive as a ZIL device. Primary access to this system is via NFS, and with NFS COMMITs blocking until the request has been sent to disk, performance has been deplorable. The NFS server is a LDOM domain on the T5220. To give an idea of how bad the situation is, iotop from the DTrace Toolkit occasionally reports single I/O requests to 15k RPM FC disks that take more than 60 seconds to complete, and even requests to a SSD drive that take over 10 seconds to complete. It's not uncommon to open a small text file using vim (or similar editor) and nothing to pop up for 10-30 seconds. Browsing the web becomes a chore, as the browser locks up for a few seconds after doing anything. I have a full write-up of the situation at http://www.cs.clemson.edu/~duckwos/zfs-performance/. Any thoughts or comments are welcome. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/NFS/LDOM performance issues
No errors reported on any disks.

$ iostat -xe
                     extended device statistics                        errors
device    r/s   w/s    kr/s    kw/s wait actv  svc_t  %w  %b s/w h/w trn tot
vdc0      0.6   5.6    25.0    33.5  0.0  0.1   17.3   0   2   0   0   0   0
vdc1     78.1  24.4  3199.2    68.0  0.0  4.4   43.3   0  20   0   0   0   0
vdc2     78.0  24.6  3187.6    67.6  0.0  4.5   43.5   0  20   0   0   0   0
vdc3     78.1  24.4  3196.0    67.9  0.0  4.5   43.5   0  21   0   0   0   0
vdc4     78.2  24.5  3189.8    67.6  0.0  4.5   43.7   0  21   0   0   0   0
vdc5     78.3  24.4  3200.3    67.9  0.0  4.5   43.5   0  21   0   0   0   0
vdc6     78.4  24.6  3186.5    67.7  0.0  4.5   43.5   0  21   0   0   0   0
vdc7     76.4  25.9  3233.0    67.4  0.0  4.2   40.7   0  20   0   0   0   0
vdc8     76.7  26.0  3222.5    67.1  0.0  4.2   41.1   0  21   0   0   0   0
vdc9     76.5  26.0  3233.9    67.7  0.0  4.2   40.8   0  20   0   0   0   0
vdc10    76.5  25.7  3221.6    67.2  0.0  4.2   41.5   0  21   0   0   0   0
vdc11    76.4  25.9  3228.2    67.4  0.0  4.2   41.1   0  20   0   0   0   0
vdc12    76.4  26.1  3216.2    67.4  0.0  4.3   41.6   0  21   0   0   0   0
vdc13     0.0   8.7     0.3   248.4  0.0  0.0    1.8   0   0   0   0   0   0
vdc14    95.3   8.2  2919.3    28.2  0.0  2.5   24.3   0  21   0   0   0   0
vdc15    95.9   9.4  2917.6    26.2  0.0  2.1   19.7   0  19   0   0   0   0
vdc16    95.3   8.0  2924.3    28.2  0.0  2.6   25.5   0  22   0   0   0   0
vdc17    96.1   9.4  2920.5    26.2  0.0  2.0   19.3   0  19   0   0   0   0
vdc18    95.4   8.2  2923.3    28.2  0.0  2.4   23.4   0  21   0   0   0   0
vdc19    95.8   9.3  2903.2    26.2  0.0  2.5   24.3   0  21   0   0   0   0
vdc20    95.0   8.4  2877.6    28.1  0.0  2.5   23.9   0  21   0   0   0   0
vdc21    95.9   9.5  2848.2    26.2  0.0  2.6   24.3   0  21   0   0   0   0
vdc22    95.0   8.4  2874.3    28.1  0.0  2.5   23.7   0  21   0   0   0   0
vdc23    95.7   9.5  2854.0    26.2  0.0  2.5   23.4   0  21   0   0   0   0
vdc24    95.1   8.4  2883.9    28.1  0.0  2.4   23.5   0  21   0   0   0   0
vdc25    95.6   9.4  2839.3    26.2  0.0  2.8   26.5   0  22   0   0   0   0
vdc26     0.0   6.9     0.2   319.8  0.0  0.0    2.6   0   0   0   0   0   0

Nothing sticks out in /var/adm/messages on either the primary or cs0 domain. The SSD is a recent addition (~3 months ago), and was added in an attempt to counteract the poor performance we were already seeing without the SSD. I will check firmware versions tomorrow. I do recall updating the firmware about 8 months ago when we upgraded CAM to support the new J4200 array. At the time, it was the most recent CAM release available, not the outdated version that shipped on the CD in the array package. My supervisor pointed me to http://forums.sun.com/thread.jspa?threadID=5416833 which describes what seems to be an identical problem. It references http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6547651 which was reported to be fixed in Solaris 10 update 4. No solution was posted, but it was pointed out that a similar configuration without LDOMs in the mix provided superb performance. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/NFS/LDOM performance issues
Thus far there is no evidence that there is anything wrong with your storage arrays, or even with zfs. The problem seems likely to be somewhere else in the kernel. Agreed. And I tend to think that the problem lays somewhere in the LDOM software. I mainly just wanted to get some experienced eyes on the problem to see if anything sticks out before I go through the trouble of reinstalling the system without LDOMs (the original need for VMs in this application no longer exists). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL to disk
I think Y is such a variable and complex number it would be difficult to give a rule of thumb, other than to 'test with your workload'. My server, having three, five disk raidzs (striped) and an intel x25-e as a zil can fill my two G ethernet pipes over NFS (~200MBps) during mostly sequential writes. That same server can only consume about 22 MBps using an artificial load designed to simulate my VM activity (using iometer). So it varies greatly depending upon Y. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz data loss stories?
Yes, a coworker lost a second disk during a rebuild of a raid5 and lost all data. I have not had a failure, however when migrating EqualLogic arrays in and out of pools, I lost a disk on an array. No data loss, but it concerns me because during the moves, you are essentially reading and writing all of the data on the disk. Did I have a latent problem on that particular disk that only exposed itself when doing such a large read/write? What if another disk had failed, and during the rebuild this latent problem was exposed? Trouble, trouble. They say security is an onion. So is data protection. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Using iSCSI on ZFS with non-native FS - How to backup.
It does 'just work', however you may have some file and/or file system corruption if the snapshot was taken at the moment that your mac is updating some files. So use the time slider function and take a lot of snaps. :) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mirroring ZIL device
# 1. It may help to use 15k disks as the zil. When I tested using three 15k disks striped as my zil, it made my workload go slower, even though it seems like it should have been faster. My suggestion is to test it out, and see if it helps. #3. You may get good performance with an inexpensive SSD because the SSD should have fast random writes, but probably not fast sequential writes. But I would test it first against your anticipated workload. :) An Intel 32G X25-E runs just shy of $400, and they are pretty speedy. I don't know if that would fit your budget. There is also some concern about losing power and having the X25 RAM cache disappear during a write. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X45xx storage vs 7xxx Unified storage
If the 7310s can meet your performance expectations, they sound much better than a pair of x4540s. Auto-fail over, SSD performance (although these can be added to the 4540s), ease of management, and a great front end. I haven't seen if you can use your backup software with the 7310s, but from what I have read in this thread, that may be the only downside (a big one). Everything else points to the 7310s. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL/log on SSD weirdness
I second the use of zilstat - very useful, especially if you don't want to mess around with adding a log device and then having to destroy the pool if you don't want the log device any longer. On Nov 18, 2009, at 2:20 AM, Dushyanth wrote: Just to clarify : Does iSCSI traffic from a Solaris iSCSI initiator to a third party target go through ZIL ? It depends on whether the application requires a sync or not. dd does not, but databases (in general) do. As Richard said, ZFS treats the iSCSI volume just like any other vdev (pool of disks), so the fact that it is an iSCSI volume has nothing to do with ZFS' zil usage. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL/log on SSD weirdness
I am sorry that I don't have any links, but here is what I observe on my system. dd does not do sync writes, so the ZIL is not used. iSCSI traffic does sync writes (as of 2009.06, but not 2008.05), so if you repeat your test using an iSCSI target from your system, you should see log activity. Same for NFS. I see no ZIL activity using rsync, for an example of a network file transfer that does not require sync. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CIFS crashes when accessed with Adobe Photoshop Elements 6.0 via Vista
upgrade to the latest dev release fixed the problem for me. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] CIFS crashes when accessed with Adobe Photoshop Elements 6.0 via Vista
I have a repeatable test case for this indecent.Every time I access my ZFS cifs shared file system with Adobe Photoshop elements 6.0 via my Vista workstation the OpenSolaris server stops serving CIFS. The share functions as expected for all other CIFS operations. -Begin Configuration Data- -scotts:zelda# cat /etc/release OpenSolaris 2009.06 snv_111b X86 Copyright 2009 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 07 May 2009 -scotts:zelda# uname -a SunOS zelda 5.11 snv_111b i86pc i386 i86pc -scotts:zelda# -scotts:zelda# prtdiag System Configuration: IBM IBM eServer 325 -[8835W11]- BIOS Configuration: IBM IBM BIOS Version 1.36 -[M1E136AUS-1.36]- 01/19/05 BMC Configuration: IPMI 1.5 (KCS: Keyboard Controller Style) Processor Sockets Version Location Tag -- Opteron CPU0-Socket 940 Opteron CPU1-Socket 940 Memory Device Sockets TypeStatus Set Device Locator Bank Locator --- -- --- --- DRAMin use 1 DDR1Bank 0 DRAMin use 1 DDR2Bank 0 DRAMin use 2 DDR3Bank 1 DRAMin use 2 DDR4Bank 1 DRAMin use 3 DDR5Bank 2 DRAMin use 3 DDR6Bank 2 On-Board Devices = Upgradeable Slots ID StatusType Description --- - 1 in usePCI-XPCI-X Slot 1 2 available PCI-XPCI-X Slot 2 -scotts:zelda# zpool status pool: ary01 state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM ary01 ONLINE 0 0 0 raidz1ONLINE 0 0 0 c5t8d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 c5t4d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c5t0d0 ONLINE 0 0 0 c6t8d0 ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c6t4d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 spares c6t1d0AVAIL errors: No known data errors pool: rpool state: ONLINE scrub: none requested config: NAMESTATE READ WRITE CKSUM rpool ONLINE 0 0 0 c3d0s0ONLINE 0 0 0 errors: No known data errors -scotts:zelda# zfs get all ary01/media NAME PROPERTY VALUE SOURCE ary01/media type filesystem - ary01/media creation Fri Jul 11 23:24 2008 - ary01/media used 347G - ary01/media available 1.09T - ary01/media referenced 344G - ary01/media compressratio 1.00x - ary01/media mountedyes- ary01/media quota none default ary01/media reservationnone default ary01/media recordsize 128K default ary01/media mountpoint /shared_media local ary01/media sharenfs on local ary01/media checksum on default ary01/media compressionoffdefault ary01/media atime on default ary01/media deviceson default ary01/media exec on default ary01/media setuid on default ary01/media readonly offdefault ary01/media zoned offlocal ary01/media snapdirvisiblelocal ary01/media aclmodegroupmask default ary01/media aclinherit restricted default ary01/media canmount on default ary01/media shareiscsi offdefault ary01/media xattr on default ary01/media copies 1 default ary01/media version3 - ary01/media utf8only
Re: [zfs-discuss] Difficulty testing an SSD as a ZIL
Excellent! That worked just fine. Thank you Victor. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Difficulty testing an SSD as a ZIL
Hi all, I received my SSD and wanted to test it out using fake zpools with files as backing stores before attaching it to my production pool. However, when I export the test pool and import it again, I get an error. Here is what I did:

I created a file to use as a backing store for my new pool:
    mkfile 1g /data01/test2/1gtest

Created a new pool:
    zpool create ziltest2 /data01/test2/1gtest

Added the SSD as a log device:
    zpool add -f ziltest2 log c7t1d0
(c7t1d0 is my SSD. I used the -f option since I had done this before with a pool called 'ziltest', same results.)

A 'zpool status' returned no errors.

Exported:
    zpool export ziltest2

Imported:
    zpool import -d /data01/test2 ziltest2
    cannot import 'ziltest2': one or more devices is currently unavailable

This happened twice with two different test pools using file-based backing stores. I am nervous about adding the SSD to my production pool. Any ideas why I am getting the import error? Thanks, Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
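One likely explanation (and my guess at the fix referred to in the reply above) is that 'zpool import -d' restricts the device search to the named directory, so the whole-disk log device under /dev/dsk is never found. Listing both locations should let the import see every vdev:

    zpool import -d /data01/test2 -d /dev/dsk ziltest2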
Re: [zfs-discuss] File level cloning
I don't think so. But you can clone at the ZFS dataset level and then use just the vmdk(s) that you need. As long as you don't muck about with the other stuff in the clone, the space usage should be the same. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
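A minimal sketch of the dataset-level clone mentioned above; the dataset and snapshot names are hypothetical:

    zfs snapshot tank/vmstore@golden
    zfs clone tank/vmstore@golden tank/vmstore-clone
    # register only the vmdk(s) you need from the clone's mountpoint; the
    # untouched files keep sharing blocks with the origin snapshot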
Re: [zfs-discuss] zpool getting in a stuck state?
Hi Jeremy, I had a loosely similar problem with my 2009.06 box. In my case (which may not be yours), working with support we found a bug that was causing my pool to hang. I also got spurious errors when I did a scrub (3 x 5-disk raidz). I am using the same LSI controller. A sure-fire way to kill the box was to set up a file system as an iSCSI target and write a lot of data to it at around 1-2 MB/s. It would usually die within a few hours. NFS writing was not as bad, but within a day it would panic there too. The solution for me was to upgrade to build 124. Since the upgrade three weeks ago, I have had no problems. Again, I don't know if this would fix your problem, but it may be worth a try. Just don't upgrade your ZFS pool version, and you will be able to roll back to 2009.06 at any time. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
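To keep that roll-back path open, it helps to confirm the pool stays at the 2009.06 on-disk version after the OS upgrade; the pool name below is hypothetical:

    zpool get version tank    # should keep reporting 14
    zpool upgrade             # lists pools below the current version; do NOT
                              # run 'zpool upgrade <pool>' if you may roll back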
Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Interesting. We must have different setups with our PERCs. Mine have always auto-rebuilt. -- Scott Meilicke On Oct 22, 2009, at 6:14 AM, Edward Ned Harvey sola...@nedharvey.com wrote: Replacing failed disks is easy when PERC is doing the RAID. Just remove the failed drive and replace with a good one, and the PERC will rebuild automatically. Sorry, not correct. When you replace a failed drive, the PERC card doesn't know for certain that the new drive you're adding is meant to be a replacement. For all it knows, you could coincidentally be adding new disks for a new VirtualDevice which already contains data, during the failure state of some other device. So it will not automatically resilver (which would be a permanently destructive process applied to a disk which is not *certainly* meant for destruction). You have to open the PERC config interface and tell it this disk is a replacement for the old disk (probably you're just saying "this disk is the new global hotspare"), or else the new disk will sit there like a bump on a log, doing nothing. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Thank you Bob and Richard. I will go with A, as it also keeps things simple: one physical device per pool. -Scott On 10/20/09 6:46 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Tue, 20 Oct 2009, Richard Elling wrote: The ZIL device will never require more space than RAM. In other words, if you only have 16 GB of RAM, you won't need more than that for the separate log. Does the wasted storage space annoy you? :-) What happens if the machine is upgraded to 32 GB of RAM later? The write performance of the X25-E is likely to be the bottleneck for a write-mostly storage server if the storage server has excellent network connectivity. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
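A back-of-the-envelope version of that sizing rule, with purely illustrative numbers (not from this thread): sync data only has to live on the log device between transaction group commits, and the same data is held in RAM until then, which is why the slog can never usefully exceed installed RAM.

    # inbound sync writes:   ~110 MB/s   (roughly 1 GbE wire speed, assumed)
    # txg commit interval:   ~10 s       (assumed)
    # worst-case slog usage: 110 MB/s * 10 s  =  ~1.1 GB
    # so a log device of a few GB is already generous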
Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Thanks Ed. It sounds like you have run in this mode? No issues with the PERC? -- Scott Meilicke On Oct 20, 2009, at 9:59 PM, Edward Ned Harvey sola...@nedharvey.com wrote: System: Dell 2950, 16 GB RAM, 16 1.5 TB SATA disks in a SAS chassis hanging off of an LSI 3801e, no extra drive slots, a single zpool. snv_124, but with my zpool still running at the 2009.06 version (14). My plan is to put the SSD into an open disk slot on the 2950, but I will have to configure it as a RAID 0, since the onboard PERC5 controller does not have a JBOD mode. You can JBOD with the PERC. It might technically be a RAID 0 or RAID 1 with a single disk in it, but that would be functionally equivalent to JBOD. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
*sigh* Thanks Frédéric, that is a very interesting read. So my options as I see them now:

1. Keep the X25-E and disable its cache. Performance should still be improved, but not by a *whole* lot, right? I will google for an expectation, but if anyone knows off the top of their head, I would be appreciative.
2. Buy a ZEUS or similar SSD with a capacitor-backed cache. Pricing is a little hard to come by, based on my quick google, but I am seeing $2-3k for an 8G model. Is that right? Yowch.
3. Wait for the X25-E G2, which is rumored to have a capacitor-backed cache, and may or may not work well (but probably will).
4. Put the X25-E with its cache disabled behind my PERC with the PERC cache enabled.

My budget is tight. I want better performance now. #4 sounds good. Thoughts? Regarding mirrored SSDs for the ZIL, it was my understanding that if the SSD-backed ZIL failed, ZFS would fall back to using the regular pool for the ZIL, correct? Assuming this is correct, a mirror would be to preserve performance during a failure? Thanks everyone, this has been really helpful. -Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
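On option 1, disabling the on-disk write cache on Solaris is normally done interactively through format's expert mode; the menu item names below are from memory, so verify them on your build:

    pfexec format -e
    # select the SSD (e.g. c7t1d0), then:
    #   cache -> write_cache -> disable
    #   display    (confirm the write cache now reports disabled)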
Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Ed, your comment: "If solaris is able to install at all, I would have to acknowledge, I have to shutdown anytime I need to change the Perc configuration, including replacing failed disks." Replacing failed disks is easy when the PERC is doing the RAID. Just remove the failed drive and replace it with a good one, and the PERC will rebuild automatically. But are you talking about OpenSolaris-managed RAID? I am pretty sure, but have not tested, that in pseudo-JBOD mode (each disk a RAID 0 or RAID 1), the PERC would still present a replaced disk to the OS without reconfiguring the PERC BIOS. Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
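If the PERC really does hand the replacement disk straight through, these are the usual steps to make the OS and ZFS pick it up (pool and device names are hypothetical):

    pfexec devfsadm -Cv          # refresh /dev links after the swap
    cfgadm -al                   # confirm the slot shows the disk as configured
    echo | format                # the replacement should appear in the disk list
    zpool replace tank c5t4d0    # only if ZFS, not the PERC, owns the redundancy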
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
I have an Intel X25-E 32G in the mail (actually the Kingston version), and wanted to get a sanity check before I start. System: Dell 2950, 16 GB RAM, 16 1.5 TB SATA disks in a SAS chassis hanging off of an LSI 3801e, no extra drive slots, a single zpool. snv_124, but with my zpool still running at the 2009.06 version (14). I will likely get another chassis and 16 disks for another pool in the 3-18 month time frame. My plan is to put the SSD into an open disk slot on the 2950, but I will have to configure it as a RAID 0, since the onboard PERC5 controller does not have a JBOD mode. Options I am considering:

A. Use all 32G for the ZIL.
B. Use 8G for the ZIL and 24G for an L2ARC. Any issues with slicing up an SSD like this?
C. Use 8G for the ZIL, 16G for an L2ARC, and reserve 8G to be used as a ZIL for the future zpool.

Since my future zpool would just be used as a backup-to-disk target, I am leaning towards option C. Any gotchas I should be aware of? Thanks, Scott -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
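For options B and C, the slices would be attached as separate log and cache vdevs once the SSD is partitioned; the pool and slice names below are hypothetical:

    zpool add tank log c7t1d0s0      # ~8G slice as the separate intent log
    zpool add tank cache c7t1d0s1    # remaining space as L2ARC
    zpool iostat -v tank             # verify the new log and cache vdevs appear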