Re: [zfs-discuss] ZFS High Availability
Ross Walker wrote:

On May 12, 2010, at 1:17 AM, schickb schi...@gmail.com wrote:

I'm looking for input on building an HA configuration for ZFS. I've read the FAQ and understand that the standard approach is to have a standby system with access to a shared pool that is imported during a failover. The problem is that we use ZFS for a specialized purpose that results in tens of thousands of filesystems (mostly snapshots and clones). All versions of Solaris and OpenSolaris that we've tested take a long time (over an hour) to import that many filesystems. I've read about replication through AVS, but that also seems to require an import during failover. We'd need something closer to an active-active configuration (even if the second active is only modified through replication). Or some way to greatly speed up imports. Any suggestions?

Bypass the complexities of AVS and the start-up times by implementing a ZFS head server in a pair of ESX/ESXi hosts with hot-spares, using redundant back-end storage (EMC, NetApp, EqualLogic). Then, if there is a hardware or software failure of the head server or the host it is on, the hot-spare automatically kicks in with the same running state as the original.

By hot-spare here, I assume you are talking about a hot-spare ESX virtual machine. If there is a software issue and the hot-spare server comes up with the same state, is it not likely to fail just like the primary server? If it does not, can you explain why it would not?

Cheers Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS High Availability
schickb wrote:

I'm looking for input on building an HA configuration for ZFS. I've read the FAQ and understand that the standard approach is to have a standby system with access to a shared pool that is imported during a failover. The problem is that we use ZFS for a specialized purpose that results in tens of thousands of filesystems (mostly snapshots and clones). All versions of Solaris and OpenSolaris that we've tested take a long time (over an hour) to import that many filesystems.

Do you see this behavior - the long import time - during boot-up as well? Or is it an issue only during an export + import operation? I suspect that the zpool cache helps a bit (during boot) but does not get rid of the problem completely (unless it has been recently addressed).

If it is not an issue during boot-up, I would give Open HA Cluster/Solaris Cluster a try or check with ha-clusters-disc...@opensolaris.org.

Cheers Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Secure delete?
joerg.schill...@fokus.fraunhofer.de wrote:

The secure deletion of the data would be something that happens before the file is actually unlinked (e.g. by rm). This secure deletion would need to open the file in a non-COW mode.

That may not be sufficient. Earlier writes to the file might have left older copies of the blocks lying around which could be recovered.

My $0.02 -Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs recreate questions
JD Trout wrote:

I have a quick ZFS question. With most hardware RAID controllers, all the data and the configuration info are stored on the disks. Therefore, the integrity of the data can survive a controller failure or the deletion of the LUN as long as it is recreated with the same drives in the same location. Does this kind of functionality exist within ZFS? For example, let's say I have a JBOD full of disks connected to a server running OSOL and all the drives are formatted as one big raidz volume. Now let's say I experience a hardware failure and I have to bring in a new server with a new installation of OSOL. Would I be able to put the raidz volume from the JBOD back together so I can see the original data?

The zpool metadata is also on the disks. As long as the disks are fine, you can reconnect them to another server and import them. ZFS will be able to find the zpools (in this case, the raidz volume).

-Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zfs-code] Space allocation failure
Hi Matt, ZFS-team,

Problem
---
libzpool.so, when calling pwrite(2), splits the write into two. This is done to simulate partial disk writes. This has the side effect that the writes are not block aligned. Hence, when the underlying device is a raw device, the write fails. Note: ztest always runs on top of files and hence does not see this failure.

Solution
---
Introduce a flag, split_io, that when set causes writes to be split (the current behavior). This is not set by default and is turned on by ztest.

A patch built on top of build 55 is attached. Could this patch be accepted into OpenSolaris?

Regards, Manoj

Matthew Ahrens wrote: Manoj Joseph wrote: Unlike what I had assumed earlier, the zio_t that is passed to vdev_file_io_start() has an aligned offset and size. The libzpool library, when writing data to the devices below a zpool, splits the write into two. This is done for the sake of testing. The comment in the routine vn_rdwr() says this:

/*
 * To simulate partial disk writes, we split writes into two
 * system calls so that the process can be killed in between.
 */

This has the effect of creating misaligned writes to raw devices, which fail with errno=EINVAL.

Cool, glad you were able to figure it out! --matt

diff -r 77d8e3c86357 usr/src/cmd/ztest/ztest.c
--- a/usr/src/cmd/ztest/ztest.c Mon Dec 11 17:17:14 2006 -0800
+++ b/usr/src/cmd/ztest/ztest.c Fri Aug 17 10:31:21 2007 -0600
@@ -3228,6 +3228,9 @@ main(int argc, char **argv)
 	/* Override location of zpool.cache */
 	spa_config_dir = "/tmp";

+	/* Split writes to simulate partial writes */
+	split_io = B_TRUE;
+
 	ztest_random_fd = open("/dev/urandom", O_RDONLY);

 	process_options(argc, argv);
diff -r 77d8e3c86357 usr/src/lib/libzpool/common/kernel.c
--- a/usr/src/lib/libzpool/common/kernel.c Mon Dec 11 17:17:14 2006 -0800
+++ b/usr/src/lib/libzpool/common/kernel.c Fri Aug 17 10:31:21 2007 -0600
@@ -36,6 +36,7 @@
 #include <sys/spa.h>
 #include <sys/processor.h>

+int split_io = B_FALSE;

 /*
  * Emulation of kernel services in userland.
@@ -373,14 +374,19 @@ vn_rdwr(int uio, vnode_t *vp, void *addr
 	if (uio == UIO_READ) {
 		iolen = pread64(vp->v_fd, addr, len, offset);
 	} else {
-		/*
-		 * To simulate partial disk writes, we split writes into two
-		 * system calls so that the process can be killed in between.
-		 */
-		split = (len > 0 ? rand() % len : 0);
-		iolen = pwrite64(vp->v_fd, addr, split, offset);
-		iolen += pwrite64(vp->v_fd, (char *)addr + split,
-		    len - split, offset + split);
+		if (split_io) {
+			/*
+			 * To simulate partial disk writes, we split writes
+			 * into two system calls so that the process can be
+			 * killed in between.
+			 */
+			split = (len > 0 ? rand() % len : 0);
+			iolen = pwrite64(vp->v_fd, addr, split, offset);
+			iolen += pwrite64(vp->v_fd, (char *)addr + split,
+			    len - split, offset + split);
+		} else {
+			iolen = pwrite64(vp->v_fd, addr, len, offset);
+		}
 	}

 	if (iolen == -1)
diff -r 77d8e3c86357 usr/src/lib/libzpool/common/sys/zfs_context.h
--- a/usr/src/lib/libzpool/common/sys/zfs_context.h Mon Dec 11 17:17:14 2006 -0800
+++ b/usr/src/lib/libzpool/common/sys/zfs_context.h Fri Aug 17 10:31:21 2007 -0600
@@ -341,6 +341,8 @@ typedef struct vattr {
 #define	VN_RELE(vp)	vn_close(vp)

+extern int split_io;
+
 extern int vn_open(char *path, int x1, int oflags, int mode, vnode_t **vpp,
     int x2, int x3);
 extern int vn_openat(char *path, int x1, int oflags, int mode, vnode_t **vpp,

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and powerpath
Peter Tribble wrote:

I've not got that far. During an import, ZFS just pokes around - there doesn't seem to be an explicit way to tell it which particular devices or SAN paths to use.

You can't tell it which devices to use in a straightforward manner. But you can tell it which directories to scan:

zpool import [-d dir]

By default, it scans /dev/dsk. Does a truss of 'zpool import' show the powerpath devices being opened and read from?

Regards, Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: [zfs-code] Space allocation failure
Manoj Joseph wrote: Hi, In brief, what I am trying to do is to use libzpool to access a zpool - like ztest does. [snip] No, AFAIK, the pool is not damaged. But yes, it looks like the device can't be written to by the userland zfs.

Well, I might have figured out something. Trussing the process shows this:

/1: open64(/dev/rdsk/c2t0d0s0, O_RDWR|O_LARGEFILE) = 3
/108: pwrite64(3, X0101\0140104\n $\0\r .., 638, 4198400) Err#22 EINVAL
/108: pwrite64(3, FC BFC BFC BFC BFC BFC B.., 386, 4199038) Err#22 EINVAL
[more failures...]

The writes are not aligned to a block boundary. And, apparently, unlike files, this does not work for devices.

Question: were ztest and libzpool not meant to be run on real devices? Or could there be an issue in how I set things up?

Regards, Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: [zfs-code] Space allocation failure
Manoj Joseph wrote: Manoj Joseph wrote: Hi, In brief, what I am trying to do is to use libzpool to access a zpool - like ztest does. [snip] No, AFAIK, the pool is not damaged. But yes, it looks like the device can't be written to by the userland zfs.

Well, I might have figured out something. Trussing the process shows this:

/1: open64(/dev/rdsk/c2t0d0s0, O_RDWR|O_LARGEFILE) = 3
/108: pwrite64(3, X0101\0140104\n $\0\r .., 638, 4198400) Err#22 EINVAL
/108: pwrite64(3, FC BFC BFC BFC BFC BFC B.., 386, 4199038) Err#22 EINVAL
[more failures...]

The writes are not aligned to a block boundary. And, apparently, unlike files, this does not work for devices.

Question: were ztest and libzpool not meant to be run on real devices? Or could there be an issue in how I set things up?

The failing write has this call stack:

pwrite64:return
	libc.so.1`_pwrite64+0x15
	libzpool.so.1`vn_rdwr+0x5b
	libzpool.so.1`vdev_file_io_start+0x17e
	libzpool.so.1`vdev_io_start+0x18
	libzpool.so.1`zio_vdev_io_start+0x33d
[snip]

usr/src/uts/common/fs/zfs/vdev_file.c has this:

/*
 * From userland we access disks just like files.
 */
#ifndef _KERNEL
vdev_ops_t vdev_disk_ops = {
	vdev_file_open,
	vdev_file_close,
	vdev_default_asize,
	vdev_file_io_start,
	vdev_file_io_done,
	NULL,
	VDEV_TYPE_DISK,	/* name of this vdev type */
	B_TRUE		/* leaf vdev */
};

Guess vdev_file_io_start() does not work very well for devices.

Regards, Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: [zfs-code] Space allocation failure
Hi,

In brief, what I am trying to do is to use libzpool to access a zpool - like ztest does.

Matthew Ahrens wrote: Manoj Joseph wrote: Hi, Replying to myself again. :) I see this problem only if I attempt to use a zpool that already exists. If I create one (using files instead of devices, don't know if it matters) like ztest does, it works like a charm.

You should probably be posting on zfs-discuss.

Switching from zfs-code to zfs-discuss.

The pool you're trying to access is damaged. It would appear that one of the devices can not be written to.

No, AFAIK, the pool is not damaged. But yes, it looks like the device can't be written to by the userland zfs.

bash-3.00# zpool import test
bash-3.00# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test    85K  1.95G  24.5K  /test
bash-3.00# ./udmu test
  pool: test
 state: ONLINE
 scrub: none requested
config:

	NAME      STATE     READ WRITE CKSUM
	test      ONLINE       0     0     0
	  c2t0d0  ONLINE       0     0     0

errors: No known data errors
Export the pool.
cannot open 'test': no such pool
Import the pool.
error: ZFS: I/O failure (write on unknown off 0: zio 8265d80 [L0 unallocated] 4000L/400P DVA[0]=0:1000:400 DVA[1]=0:18001000:400 fletcher4 lzjb LE contiguous birth=245 fill=0 cksum=6bba8d3a44:2cfa96558ac7:c732e55bea858:2b86470f6a83373): error 28
Abort (core dumped)
bash-3.00# zpool import test
bash-3.00# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

	NAME      STATE     READ WRITE CKSUM
	test      ONLINE       0     0     0
	  c2t0d0  ONLINE       0     0     0

errors: No known data errors
bash-3.00# touch /test/z
bash-3.00# sync
bash-3.00# ls -l /test/z
-rw-r--r-- 1 root root 0 Jun 28 04:18 /test/z
bash-3.00#

The userland zfs's export succeeds. But doing a system("zpool status test") right after the spa_export() succeeds shows that the 'kernel zfs' still thinks it is imported. I guess that makes sense. Nothing has been told to the 'kernel zfs' about the export. But I still do not understand why the 'userland zfs' can't write to the pool.
Regards, Manoj

PS: The code I have been tinkering with is attached.

--matt

Any clue as to why this is so would be appreciated.

Cheers Manoj

Manoj Joseph wrote:

Hi, I tried adding an spa_export(); spa_import() to the code snippet. I get a similar crash while importing.

I/O failure (write on unknown off 0: zio 822ed40 [L0 unallocated] 4000L/400P DVA[0]=0:1000:400 DVA[1]=0:18001000:400 fletcher4 lzjb LE contiguous birth=4116 fill=0 cksum=69c3a4acfc:2c42fdcaced5:c5231ffcb2285:2b8c1a5f2cb2bfd): error 28
Abort (core dumped)

I thought ztest could use an existing pool. Is that assumption wrong?

These are the stacks of interest.

d11d78b9 __lwp_park (81c3e0c, 81c3d70, 0) + 19
d11d1ad2 cond_wait_queue (81c3e0c, 81c3d70, 0, 0) + 3e
d11d1fbd _cond_wait (81c3e0c, 81c3d70) + 69
d11d1ffb cond_wait (81c3e0c, 81c3d70) + 24
d131e4d2 cv_wait (81c3e0c, 81c3d6c) + 5e
d12fe2dd txg_wait_synced (81c3cc0, 1014, 0) + 179
d12f9080 spa_config_update (819dac0, 0) + c4
d12f467a spa_import (8047657, 8181f88, 0) + 256
080510c6 main (2, 804749c, 80474a8) + b2
08050f22 _start (2, 8047650, 8047657, 0, 804765c, 8047678) + 7a

d131ed79 vpanic (d1341dbc, ca5cd248) + 51
d131ed9f panic (d1341dbc, d135a384, d135a724, d133a630, 0, 0) + 1f
d131921d zio_done (822ed40) + 455
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
d1318c88 zio_wait_children_done (822ed40) + 18
d131c15d zio_next_stage (822ed40) + 161
d131ba83 zio_vdev_io_assess (822ed40) + 183
d131c15d zio_next_stage (822ed40) + 161
d1307011 vdev_mirror_io_done (822ed40) + 421
d131b8a2 zio_vdev_io_done (822ed40) + 36
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
d1318c88 zio_wait_children_done (822ed40) + 18
d1306be6 vdev_mirror_io_start (822ed40) + 1d2
d131b862 zio_vdev_io_start (822ed40) + 34e
d131c313 zio_next_stage_async (822ed40) + 1ab
d131bb47 zio_vdev_io_assess (822ed40) + 247
d131c15d zio_next_stage (822ed40) + 161
d1307011 vdev_mirror_io_done (822ed40) + 421
d131b8a2 zio_vdev_io_done (822ed40) + 36
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
d1318c88 zio_wait_children_done (822ed40) + 18
d1306be6 vdev_mirror_io_start (822ed40) + 1d2
d131b862 zio_vdev_io_start (822ed40) + 34e
d131c15d zio_next_stage (822ed40) + 161
d1318dc1 zio_ready (822ed40) + 131
d131c15d zio_next_stage (822ed40) + 161
d131b41b zio_dva_allocate (822ed40) + 343
d131c15d zio_next_stage (822ed40) + 161
d131bdcb zio_checksum_generate (822ed40) + 123
d131c15d zio_next_stage (822ed40) + 161
d1319873 zio_write_compress (822ed40) + 4af
d131c15d zio_next_stage (822ed40) + 161
d1318b92 zio_wait_for_children (822ed40, 1, 822ef28) + 6a
d1318c68
Re: [zfs-discuss] fchmod(2) returns ENOSPC on ZFS
Matthew Ahrens wrote: In a COW filesystem such as ZFS, it will sometimes be necessary to return ENOSPC in cases such as chmod(2) which previously did not. This is because there could be a snapshot, so overwriting some information actually requires a net increase in space used. That said, we may be generating this ENOSPC in cases where it is not strictly necessary (eg, when there are no snapshots). We're working on some of these cases. Can you show us the output of 'zfs list' when the ENOSPC occurs? Is there a bug id for this? Regards, Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] fchmod(2) returns ENOSPC on ZFS
Hi,

I find that fchmod(2) on a zfs filesystem can sometimes generate errno = ENOSPC. However, this error value is not in the manpage of fchmod(2). Here's where ENOSPC is generated.

zfs`dsl_dir_tempreserve_impl
zfs`dsl_dir_tempreserve_space+0x4e
zfs`dmu_tx_try_assign+0x230
zfs`dmu_tx_assign+0x21
zfs`zfs_setattr+0x41b
genunix`fop_setattr+0x24
genunix`vpsetattr+0x110
genunix`fdsetattr+0x26
genunix`fchmod+0x2a
genunix`dtrace_systrace_syscall+0xbc
unix`sys_sysenter+0x101

Is this correct behavior? Is it the manpage that needs fixing? zpool list shows this.

NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
pool1   115M  83.1M  31.9M  72%  ONLINE  -

While I am unable to guarantee that there has been no activity after fchmod() has failed, I am fairly sure that the filesystem was not full when it returned ENOSPC. I have done all my analysis on build 54. So I might just be looking at outdated stuff. Please let me know what you think.

Regards, Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] fchmod(2) returns ENOSPC on ZFS
Matthew Ahrens wrote: Manoj Joseph wrote: Hi, I find that fchmod(2) on a zfs filesystem can sometimes generate errno = ENOSPC. However this error value is not in the manpage of fchmod(2). Here's where ENOSPC is generated.

zfs`dsl_dir_tempreserve_impl
zfs`dsl_dir_tempreserve_space+0x4e
zfs`dmu_tx_try_assign+0x230
zfs`dmu_tx_assign+0x21
zfs`zfs_setattr+0x41b
genunix`fop_setattr+0x24
genunix`vpsetattr+0x110
genunix`fdsetattr+0x26
genunix`fchmod+0x2a
genunix`dtrace_systrace_syscall+0xbc
unix`sys_sysenter+0x101

Is this correct behavior? Is it the manpage that needs fixing? zpool list shows this.

In a COW filesystem such as ZFS, it will sometimes be necessary to return ENOSPC in cases such as chmod(2) which previously did not. This is because there could be a snapshot, so overwriting some information actually requires a net increase in space used.

Could the manpage be updated to reflect this?

That said, we may be generating this ENOSPC in cases where it is not strictly necessary (eg, when there are no snapshots). We're working on some of these cases. Can you show us the output of 'zfs list' when the ENOSPC occurs?

-bash-3.00# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1  83.0M      0  82.8M  /pool1
-bash-3.00# zpool list pool1
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
pool1   115M  83.0M  32.0M  72%  ONLINE  -

zfs list does say that there is no available space. There is 32M available on the zpool though. Interesting...

Regards, Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file system full corruption in ZFS
Michael Barrett wrote:

Normally if you have a ufs file system hit 100% and you have a very high level of system and application load on the box (that resides in the 100% file system) you will run into inode issues that require a fsck and show themselves by files not being able to have all their attributes long-listed (ls -la). Not a bug, just what happens.

I don't see how something like this can not be a bug. Don't tell me this is a feature and UFS is working as per design! ;)

Cheers Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] file system full corruption in ZFS
dudekula mastan wrote:

At least in my experience, I saw corruption when the ZFS file system was full. So far there is no way to check the file system consistency on ZFS (to the best of my knowledge). ZFS people claim that the ZFS file system is always consistent and there is no need for an FSCK command.

ZFS is always consistent on disk. This does not mean there cannot be data loss - especially on an unreplicated pool. ZFS can self-heal only when there is redundancy in the pool.

If you do see corruption, you should probably report it here along with the zpool configuration details and test cases, if any. Please do file bugs.

Cheers Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS over a layered driver interface
Shweta Krishnan wrote:

I ran zpool with truss, and here is the system call trace. (again, zfs_lyr is the layered driver I am trying to use to talk to the ramdisk driver). When I compared it to a successful zpool creation, the culprit is the last failing ioctl, i.e. ioctl(3, ZFS_IOC_CREATE_POOL, address). I tried looking at the source code for the failing ioctl, but didn't get any hints there. Guess I must try dtrace (which I am about to learn!).

bash-3.00# truss -f zpool create adsl-pool /devices/pseudo/[EMAIL PROTECTED]:zfsminor1 2> /var/tmp/zpool.truss
bash-3.00# grep Err /var/tmp/zpool.truss
2232: open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
2232: xstat(2, /lib/libdiskmgt.so.1, 0x080469C8) Err#2 ENOENT
2232: xstat(2, /lib/libxml2.so.2, 0x08046868) Err#2 ENOENT
2232: xstat(2, /lib/libz.so.1, 0x08046868) Err#2 ENOENT
2232: stat64(/devices/pseudo/[EMAIL PROTECTED]:zfsminor1s2, 0x080429E0) Err#2 ENOENT
2232: modctl(MODSIZEOF_DEVID, 0x03740001, 0x080429BC, 0x08071714, 0x) Err#22 EINVAL

MODSIZEOF_DEVID is 10.

$ dtrace -n 'syscall::modctl:entry{trace(arg0); ustack();}'

The relevant stack is the following.

0 71587 modctl:entry 10
	libc.so.1`modctl+0x15
	zpool`make_disks+0x1bf
	zpool`make_disks+0x72
	zpool`make_root_vdev+0x56
	zpool`zpool_do_create+0x1c4
	zpool`main+0xa2
	zpool`_start+0x7a

make_disks() calls devid_get() which calls modctl(MODSIZEOF_DEVID). This fails.

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/zpool/zpool_vdev.c#959

The code, however, seems to ignore this. So this might not be the issue.

2232: mkdir(/var/run/sysevent_channels/syseventd_channel, 0755) Err#17 EEXIST
2232: unlink(/var/run/sysevent_channels/syseventd_channel/17) Err#2 ENOENT
2232/1: umount2(/var/run/sysevent_channels/syseventd_channel/17, 0x) Err#22 EINVAL
2232/1: ioctl(7, I_CANPUT, 0x) Err#89 ENOSYS
2232/1: stat64(/adsl-pool, 0x08043330) Err#2 ENOENT
2232/1: ioctl(3, ZFS_IOC_POOL_CREATE, 0x08041BC4) Err#22 EINVAL

ZFS_IOC_POOL_CREATE is failing.
I am not sure if the problem has already happened or if it happens during this ioctl. But you could try dtracing this ioctl and see where EINVAL is being set.

$ dtrace -n 'fbt:zfs:zfs_ioc_pool_create:entry{self->t=1;} \
    fbt:zfs::return/self->t && arg1 == 22/{stack(); exit(0);} \
    fbt:zfs:zfs_ioc_pool_create:return{self->t=0;}'

If it does not provide a clue, you could try the following trace, which is more heavyweight. Warning: it could generate a lot of output. :)

$ dtrace -n 'fbt:zfs:zfs_ioc_pool_create:entry{self->t=1;} \
    fbt:zfs::entry/self->t/{} fbt:zfs::return/self->t/{trace(arg1);} \
    fbt:zfs:zfs_ioc_pool_create:return{self->t=0;}'

Perhaps there are folks on this list who know what the problem is without all the dtracing that I am suggesting. But this is what I would try. Good luck! :)

-Manoj

PS: When running the above scripts, run them in one telnet/ssh/xterm window. Run 'zpool create' in another.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: [osol-discuss] ZFS over a layered driver interface
Hi,

This is probably better discussed on zfs-discuss. I am CCing the list. Follow-up emails could leave out opensolaris-discuss...

Shweta Krishnan wrote:

Does zfs/zpool support the layered driver interface? I wrote a layered driver with a ramdisk device as the underlying device, and successfully got a UFS file system on the ramdisk to boot via the layered device. I am trying to do the same with a ZFS file system. However, since ZFS file systems are created as datasets within a storage pool and not directly on a specified underlying device, I can't think how I will get a ZFS file system to mount using a layered driver atop a real device. I tried specifying the layered device as the storage pool component for 'zpool create', but that gave me an "invalid argument for this pool operation" error. I also tried setting the mountpoint for a zfs filesystem as 'legacy' and doing a regular mount with the layered device, but that gave me an "invalid dataset" error.

You would have to create a zpool even for legacy mounts. You cannot skip that step. Probably the answer to your problem lies in the "invalid argument for this pool operation" error message. Did you try trussing the 'zpool create'? What was the syscall that failed? If you are familiar with dtrace, you might be able to narrow it down to what is causing the failure.

-Manoj

I looked through the documentation for zfs/zpool and searched extensively, but haven't been able to figure this one out yet. I am a newbie to ZFS, so pardon me if this is something trivial. Can someone point me to a possible method to achieve the above?

This message posted from opensolaris.org ___ opensolaris-discuss mailing list [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Will this work?
Robert Thurlow wrote: I've written some about a 4-drive Firewire-attached box based on the Oxford 911 chipset, and I've had I/O grind to a halt in the face of media errors - see bugid 6539587. I haven't played with USB drives enough to trust them more, but this was a hole I fell in with Firewire. I've had fabulous luck with a Firewire attached DVD burner, though. 6539587 does not seem to be visible on the opensolaris bugs database. :-/ -Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Motley group of discs?
Lee Fyock wrote:

least this year. I'd like to favor available space over performance, and be able to swap out a failed drive without losing any data.

Lee Fyock later wrote:

In the mean time, I'd like to hang out with the system and drives I have. As mike said, my understanding is that zfs would provide error correction until a disc fails, if the setup is properly done. That's the setup for which I'm requesting a recommendation.

ZFS always lets you know if the data you are requesting has gone bad. If you have redundancy, it provides error correction as well.

Money isn't an issue here, but neither is creating an optimal zfs system. I'm curious what the right zfs configuration is for the system I have.

You obviously have the option of having a giant pool of all the disks, and what you get is dynamic striping. But if a disk goes toast, the data on it is gone. If you plan to back up important data elsewhere and data loss is something you can live with, this might be a good choice.

The next option is to mirror (/raidz) disks. If you mirror a 200 GB disk with a 250 GB one, you will get only 200 GB of redundant storage. If a disk goes for a toss, all of your data is safe. But you lose disk space. Mirroring the 600 GB disk with a stripe of 160+200+250 would have been nice, but I believe this is not possible with ZFS (yet?).

There is a third option - create a giant pool of all the disks. Set copies=2. ZFS will create two copies of all the data blocks. That is pretty good redundancy. But depending on how full your disks are, the copies may or may not be on different disks. In other words, this does not guarantee that *all* of your data is safe if, say, your 600 GB disk dies. But it might be 'good enough'. From what I understand of your requirements, this just might be your best choice.

A periodic scrub would also be a good thing to do. The earlier you detect a flaky disk, the better it is...

Hope this helps.
-Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Issue with adding existing EFI disks to a zpool
Mario Goebbels wrote: do it. So I added the disk using the zero slice notation (c0d0s0), as suggested for performance reasons. I checked the pool status and noticed however that the pool size didn't raise. I believe you got this wrong. You should have given ZFS the whole disk - c0d0 and not a slice. When presented a whole disk, it EFI-labels it and turns on the write cache. -Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ARC, mmap, pagecache...
Hi, I was wondering about the ARC and its interaction with the VM page cache... When a file on a ZFS filesystem is mmapped, does the ARC cache get mapped into the process' virtual memory? Or is there another copy? -Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: concatination stripe - zfs?
Richard Elling wrote: In other words, the sync command schedules a sync. The consistent way to tell if writing is finished is to observe the actual I/O activity. ZFS goes beyond this POSIX requirement. When a sync(1M) returns, all dirty data that has been cached has been committed to disk. -Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Preferred backup mechanism for ZFS?
Wee Yeh Tan wrote:

On 4/23/07, Robert Milkowski [EMAIL PROTECTED] wrote:

bash-3.00# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd pcisch md ip sctp usba fcp fctl qlc ssd crypto lofs zfs random ptm cpc nfs ]
> segmap_percent/D
segmap_percent: 12

(it's static IIRC)

segmap_percent is only referenced in http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/sun4/os/startup.c#2270. You are right that it is static, but that means you cannot tune it unless you run your own kernel.

I took a quick look at startup.c. Looks like you should be able to set this value in /etc/system. You should not have to compile your own kernel.

Regards, Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
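If setting it from /etc/system does work, the entry would look like the fragment below. This is a sketch under the assumption that the kernel honors the tunable at startup on this platform; the value 20 is purely illustrative, and the change should be verified on a test system before production use.

```
* /etc/system fragment -- illustrative only; confirm that
* segmap_percent is honored on your platform before relying on it
set segmap_percent=20
```

A reboot is required for /etc/system changes to take effect; the mdb one-liner quoted above can then confirm the new value.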
Re: [zfs-discuss] Preferred backup mechanism for ZFS?
Dennis Clarke wrote:

So now here we are ten years later with a new filesystem and I have no way to back it up in such a fashion that I can restore it perfectly. I can take snapshots. I can do a strange send and receive, but the process is not stable.

From zfs(1M) we see:

The format of the stream is evolving. No backwards compatibility is guaranteed. You may not be able to receive your streams on future versions of ZFS.

The format of the stream may not be stable. So you can't dump the stream to a file somewhere and expect to receive from it sometime in the future. But if you stash it away as a pool on the same machine or elsewhere, it is not an issue.

# zfs send [-i b] pool/[EMAIL PROTECTED] | [ssh host] zfs receive poolB/received/[EMAIL PROTECTED]

Right? I think this is quite cool! :)

-Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to bind the oracle 9i data file to zfs volumes
Simon wrote:

So, does this mean it is an Oracle bug? Or is it impossible (or inappropriate) to use ZFS/SVM volumes for Oracle data files; instead, should a zfs or ufs filesystem be used for this?

Oracle can use SVM volumes to hold its data. Unless I am mistaken, it should be able to use zvols as well. However, googling for 'zvol + Oracle' did not get me anything useful. Perhaps it is not a configuration that is very popular. ;)

My $0.02. -Manoj

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: simple Raid-Z question
Erik Trimble wrote: While expanding a zpool in the way you've show is useful, it has nowhere near the flexibility of simply adding single disks to existing RAIDZ vdevs, which was the original desire expressed. This conversation has been had several times now (take a look in the archives around Jan for the last time it came up). Perhaps, this should be added to the FAQ? Cheers Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Setting up for zfsboot
Constantin Gonzalez wrote: Do I still have the advantages of having the whole disk 'owned' by zfs, even though it's split into two parts? I'm pretty sure that this is not the case: ZFS has no guarantee that someone won't do something else with the other partition, so it can't assume the right to turn on the disk cache for the whole disk.

Can the write cache not be turned on manually when the user is sure that ZFS alone is using the entire disk?

-Manoj
Re: [zfs-discuss] File level snapshots in ZFS?
Richard Elling wrote: Atul Vidwansa wrote: Hi Richard, I am not talking about source (ASCII) files. How about versioning production data? I talked about file-level snapshots because snapshotting an entire filesystem does not make sense when the application is changing just a few files at a time.

CVS supports binary files. The nice thing about version control systems is that you can annotate the versions. With ZFS snapshots, you don't get annotations.

Sure, version control systems do file versioning. But ZFS, with its copy-on-write design, brings a new way of doing this. I do not see applications like emacs, StarOffice etc. using SCCS/CVS, but I can easily see them using file snapshots if zfs were to offer them (I am conveniently ignoring portability). It was suggested that filesystem snapshots be used for the same purpose; that would not work if you have to roll back one file's changes but not others'. Extended attributes could potentially be used to annotate file snapshots... ;) I can also see possibilities for clustered/distributed applications (parallel PostgreSQL perhaps?) needing to commit/revoke across servers using this, and layered distributed filesystems could potentially use it for recovery. But I also remember a long thread on this not too long ago going nowhere. ;) Just my $0.02.

-Manoj
Re: [zfs-discuss] How big a write to a regular file is atomic?
Richard L. Hamilton wrote: ...and does it vary by filesystem type? I know I ought to know the answer, but it's been a long time since I thought about it, and I must not be looking at the right man pages. And also, if it varies, how does one tell? For a pipe, there's fpathconf() with _PC_PIPE_BUF, but how about for a regular file?

For ZFS, a write is atomic up to the whole-block level. See: http://www.opensolaris.org/jive/thread.jspa?messageID=18705#18818

-Manoj
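As a side note on the pipe case mentioned above: the _PC_PIPE_BUF limit can be queried at run time. A minimal Python sketch (the exact value is platform-dependent; POSIX only guarantees it is at least 512 bytes):

```python
import os

# POSIX guarantees that writes of up to PIPE_BUF bytes to a pipe are
# atomic; fpathconf() reports the actual limit for this pipe.
r, w = os.pipe()
try:
    pipe_buf = os.fpathconf(w, "PC_PIPE_BUF")
    print(pipe_buf)  # platform-dependent, >= 512 per POSIX
finally:
    os.close(r)
    os.close(w)
```

For a regular file there is no analogous pathconf variable; the atomicity limit (if any) depends on the filesystem, as the ZFS answer above illustrates.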
Re: [zfs-discuss] nv59 + HA-ZFS
David Anderson wrote: Hi, I'm attempting to build a ZFS SAN with iSCSI+IPMP transport. I have two ZFS nodes that access iSCSI disks on the storage network, and the ZFS nodes in turn share ZVOLs via iSCSI to my front-end Linux boxes. My throughput from one Linux box is about 170+ MB/s with nv59 (earlier builds were about 60 MB/s), so I am pleased with the performance so far. My next step is to configure HA-ZFS for failover between the two ZFS nodes. Does Sun Cluster 3.2 work with SXCE? If so, are there any caveats for my situation?

I thought Sun Cluster's support for iSCSI was not ready. You could perhaps check with the Sun Cluster group.

Regards, Manoj
Re: [zfs-discuss] writes lost with zfs !
Ayaz Anjum wrote: Hi! I have some concerns here. From my experience in the past, touching a file (doing some IO) would cause the ufs filesystem to fail over, unlike zfs, where it did not! Why is the behaviour of zfs different from ufs? Is this not compromising data integrity?

As others have explained, until a sync is done, or unless the file is opened for 'sync writes', a write is not guaranteed to be on disk. If the node fails before the disk commit, the data can be lost. Applications are written with this in mind. While ZFS and UFS do lots of things differently, the above applies to both of them, and to all POSIX filesystems in general.

Could you tell us more about how the UFS failover happened? Did you see a UFS panic? Did the Sun Cluster disk path monitor cause the failover?

Regards, Manoj
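To illustrate the "opened for 'sync writes'" case mentioned above, here is a minimal Python sketch (file name is illustrative, not from the thread): with O_SYNC, each write() returns only after the data has reached stable storage, so a node failure immediately afterwards should not lose it.

```python
import os
import tempfile

# Open a file for synchronous writes: O_SYNC makes each write() block
# until the data (and required metadata) has been committed to stable
# storage, rather than sitting in the filesystem's in-memory cache.
path = os.path.join(tempfile.mkdtemp(), "sync_demo")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
try:
    os.write(fd, b"committed before write() returns\n")
finally:
    os.close(fd)
```

Without O_SYNC (or a later fsync), the same write() could succeed and still be lost on a crash, which is the behaviour described in the question.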
Re: [zfs-discuss] Re: ZFS/UFS layout for 4 disk servers
Matt B wrote: Any thoughts on the best-practice points I am raising? It disturbs me that it would make a statement like "don't use slices for production".

ZFS turns on the disk's write cache if you give it the entire disk to manage, which is good for performance. So you should use whole disks whenever possible. Slices work too, but zfs will not turn on the disk's write cache for them.

Cheers Manoj
Re: [zfs-discuss] writes lost with zfs !
Ayaz Anjum wrote: Hi! I have tested the following scenario: I created a zfs filesystem as part of HAStoragePlus in Sun Cluster 3.2 on Solaris 11/06. Currently I have only one FC HBA per server.

1. With no IO to the zfs mountpoint, I disconnected the FC cable. The filesystem on zfs still showed as mounted (because of no IO to the filesystem). I touched a file; still OK. I did a sync, and only then did the node panic and the zfs filesystem fail over to the other cluster node. However, the file I touched was lost.

This is to be expected, I'd say. HAStoragePlus is primarily a wrapper over zfs that manages the import/export and mount/unmount. It cannot and does not provide for a retry of pending IOs. The 'touch' would have been part of a zfs transaction group that never got committed, and it stays lost when the pool is imported on the other node. In other words, it does not provide the same kind of high availability that, say, PxFS provides.

2. With zfs mounted on one cluster node, I created a file and kept updating it every second. Then I removed the FC cable; the writes still continued to the filesystem. After 10 seconds I put the FC cable back, and my writes continued; no failover of zfs happened. It seems that all IO is going to some cache. Any suggestions on what is going wrong here, and what the solution is?

I don't know for sure. But my guess is that if you do an fsync after the writes and wait for it to complete, you might get some action: the fsync should fail, and zfs could panic the node. If it does, you will see a failover.

Hope that helps. -Manoj
Re: [zfs-discuss] Re: SPEC SFS97 benchmark of ZFS,UFS,VxFS
Alan Romeril wrote: PxFS performance improvements of the order of 5-6 times are possible, depending on the workload, using the fastwrite option. Fantastic! Has this been targeted at directory operations? We've had issues with large directories full of small files being very slow to handle over PxFS.

The 'fastwrite' option speeds up write operations, so it doesn't do much for directory operations.

Are there plans for PxFS on ZFS any time soon :) ?

PxFS on ZFS is unlikely to happen. A clusterized version of ZFS (as mentioned before on this alias) is being considered.

Or any plans to release PxFS as part of opensolaris?

PxFS is tightly coupled to the cluster framework. Without open-sourcing the cluster framework, PxFS in its current form cannot be open-sourced, as it would not make sense. As for open-sourcing the cluster framework, my guess is as good as yours.

Regards, Manoj -- Sun Cluster Engineering
Re: [zfs-discuss] Re: ZFS and Sun Cluster....
Tatjana S Heuser wrote: Is it planned to have the cluster fs or proxy fs layer between the ZFS layer and the storage pool layer?

This, AFAIK, is not the current plan of action. Sun Cluster should be moving towards ZFS as a 'true' cluster filesystem. Not going the 'proxy fs layer' way (PxFS/GFS) is, IMHO, not due to technical infeasibility.

Regards, Manoj -- Global Data and Devices, Sun Cluster Engineering.