Re: [zfs-discuss] ZFS High Availability

2010-05-12 Thread Manoj Joseph
Ross Walker wrote:
 On May 12, 2010, at 1:17 AM, schickb schi...@gmail.com wrote:
 
 I'm looking for input on building an HA configuration for ZFS. I've  
 read the FAQ and understand that the standard approach is to have a  
 standby system with access to a shared pool that is imported during  
 a failover.

 The problem is that we use ZFS for a specialized purpose that
 results in tens of thousands of filesystems (mostly snapshots and
 clones). All versions of Solaris and OpenSolaris that we've tested
 take a long time (over an hour) to import that many filesystems.

 I've read about replication through AVS, but that also seems to require
 an import during failover. We'd need something closer to an active-
 active configuration (even if the second active is only modified
 through replication). Or some way to greatly speed up imports.

 Any suggestions?
 
 Bypass the complexities of AVS and the start-up times by implementing  
 a ZFS head server in a pair of ESX/ESXi with Hot-spares using  
 redundant back-end storage (EMC, NetApp, EqualLogic).
 
 Then, if there is a hardware or software failure of the head server or  
 the host it is on, the hot-spare automatically kicks in with the same  
 running state as the original.

By hot-spare here, I assume you are talking about a hot-spare ESX
virtual machine.

If there is a software issue and the hot-spare server comes up with the
same state, is it not likely to fail just like the primary server? If it
does not, can you explain why it would not?

Cheers
Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS High Availability

2010-05-12 Thread Manoj Joseph
schickb wrote:
 I'm looking for input on building an HA configuration for ZFS. I've
 read the FAQ and understand that the standard approach is to have a
 standby system with access to a shared pool that is imported during a
 failover.
 
 The problem is that we use ZFS for a specialized purpose that results
 in tens of thousands of filesystems (mostly snapshots and clones).
 All versions of Solaris and OpenSolaris that we've tested take a long
 time (over an hour) to import that many filesystems.

Do you see this behavior - the long import time - during boot-up as
well? Or is it an issue only during an export + import operation?

I suspect that the zpool cache helps a bit (during boot) but does not
get rid of the problem completely (unless it has been recently addressed).
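
If the slow part is only the device discovery, copying the cache file to the standby and importing from it is worth a try (an untested sketch; 'tank' and the cache file path are placeholders, and the cachefile property needs a reasonably recent build):

# zpool set cachefile=/etc/zfs/zpool.cache tank
# zpool import -c /etc/zfs/zpool.cache -a    # reads the cache file instead of probing every device

Note that this only avoids probing every device; mounting tens of thousands of filesystems is a separate cost.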

If it is not an issue during boot-up, I would give the Open HA
Cluster/Solaris Cluster a try or check with
ha-clusters-disc...@opensolaris.org.

Cheers
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Secure delete?

2010-04-11 Thread Manoj Joseph
joerg.schill...@fokus.fraunhofer.de wrote:
 The secure deletion of the data would be something that happens before
 the file is actually unlinked (e.g. by rm). This secure deletion would
 need to open the file in a non-COW mode.

That may not be sufficient. Earlier writes to the file might have left
older copies of the blocks lying around which could be recovered.

My $0.02

-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs recreate questions

2010-03-29 Thread Manoj Joseph
JD Trout wrote:
 I have a quick ZFS question.  With most hardware raid controllers all
 the data and the info is stored on the disk. Therefore, the integrity
 of the data can survive a controller failure or the deletion of the
 LUN  as long as it is recreated with the same drives in the same
 location.  Does this kind of functionality exist within ZFS?
 
 For example, let's say I have a JBOD full of disks connected to a
 server running OSOL and all the drives are formatted as one big raidz
 volume. Now let's say I experience a hardware failure and I have to
 bring in a new server with a new installation of OSOL. Would I be
 able to put the raidz volume from the JBOD back together so I can see
 the original data?

The zpool metadata is also on the disks. As long as the disks are fine,
you can reconnect them to another server and import the pool. ZFS will
be able to find the zpool (in this case, your raidz volume).
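
In outline, it would be something like this (commands from memory; 'tank' is a made-up pool name, and -f is only needed if the old server never exported the pool):

# zpool import          # scans /dev/dsk and lists the pools it can import
# zpool import -f tank  # import it, forcing since the dead host never exported it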

-Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [zfs-code] Space allocation failure

2007-08-20 Thread Manoj Joseph
Hi Matt, ZFS-team,

Problem
---
libzpool.so, when calling pwrite(2), splits the write into two. This is
done to simulate partial disk writes. This has the side effect that the
writes are not block aligned. Hence when the underlying device is a raw
device, the write fails.

Note: ztest always runs on top of files and hence does not see this failure.

Solution
--------
Introduce a flag, split_io, which, when set, causes writes to be split
(the current behavior). It is not set by default and is turned on by ztest.

A patch built on top of build 55 is attached.

Could this patch be accepted into OpenSolaris?

Regards,
Manoj


Matthew Ahrens wrote:
 Manoj Joseph wrote:
 Unlike what I had assumed earlier, the zio_t that is passed to 
 vdev_file_io_start() has an aligned offset and size.

 The libzpool library, when writing data to the devices below a zpool, 
 splits the write into two. This is done for the sake of testing. The 
 comment in the routine, vn_rdwr() says this:
 /*
  * To simulate partial disk writes, we split writes into two
  * system calls so that the process can be killed in between.
  */

 This has the effect of creating misaligned writes to raw devices which 
 fail with errno=EINVAL.
 
 Cool, glad you were able to figure it out!
 
 --matt


diff -r 77d8e3c86357 usr/src/cmd/ztest/ztest.c
--- a/usr/src/cmd/ztest/ztest.c	Mon Dec 11 17:17:14 2006 -0800
+++ b/usr/src/cmd/ztest/ztest.c	Fri Aug 17 10:31:21 2007 -0600
@@ -3228,6 +3228,9 @@ main(int argc, char **argv)
 	/* Override location of zpool.cache */
 	spa_config_dir = "/tmp";
 
+	/* Split writes to simulate partial writes */
+	split_io = B_TRUE;
+
 	ztest_random_fd = open("/dev/urandom", O_RDONLY);
 
 	process_options(argc, argv);
diff -r 77d8e3c86357 usr/src/lib/libzpool/common/kernel.c
--- a/usr/src/lib/libzpool/common/kernel.c	Mon Dec 11 17:17:14 2006 -0800
+++ b/usr/src/lib/libzpool/common/kernel.c	Fri Aug 17 10:31:21 2007 -0600
@@ -36,6 +36,7 @@
 #include <sys/spa.h>
 #include <sys/processor.h>
 
+int split_io = B_FALSE;
 
 /*
  * Emulation of kernel services in userland.
@@ -373,14 +374,19 @@ vn_rdwr(int uio, vnode_t *vp, void *addr
 	if (uio == UIO_READ) {
 		iolen = pread64(vp->v_fd, addr, len, offset);
 	} else {
-		/*
-		 * To simulate partial disk writes, we split writes into two
-		 * system calls so that the process can be killed in between.
-		 */
-		split = (len > 0 ? rand() % len : 0);
-		iolen = pwrite64(vp->v_fd, addr, split, offset);
-		iolen += pwrite64(vp->v_fd, (char *)addr + split,
-		len - split, offset + split);
+		if (split_io) {
+			/*
+			 * To simulate partial disk writes, we split writes
+			 * into two system calls so that the process can be
+			 * killed in between.
+			 */
+			split = (len > 0 ? rand() % len : 0);
+			iolen = pwrite64(vp->v_fd, addr, split, offset);
+			iolen += pwrite64(vp->v_fd, (char *)addr + split,
+			len - split, offset + split);
+		} else {
+			iolen = pwrite64(vp->v_fd, addr, len, offset);
+		}
 	}
 
 	if (iolen == -1)
diff -r 77d8e3c86357 usr/src/lib/libzpool/common/sys/zfs_context.h
--- a/usr/src/lib/libzpool/common/sys/zfs_context.h	Mon Dec 11 17:17:14 2006 -0800
+++ b/usr/src/lib/libzpool/common/sys/zfs_context.h	Fri Aug 17 10:31:21 2007 -0600
@@ -341,6 +341,8 @@ typedef struct vattr {
 
 #define	VN_RELE(vp)	vn_close(vp)
 
+extern int split_io;
+
 extern int vn_open(char *path, int x1, int oflags, int mode, vnode_t **vpp,
 int x2, int x3);
 extern int vn_openat(char *path, int x1, int oflags, int mode, vnode_t **vpp,

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and powerpath

2007-07-13 Thread Manoj Joseph
Peter Tribble wrote:

 I've not got that far. During an import, ZFS just pokes around - there
 doesn't seem to be an explicit way to tell it which particular devices
 or SAN paths to use.

You can't tell it which devices to use in a straightforward manner. But 
you can tell it which directories to scan.

zpool import [-d dir]

By default, it scans /dev/dsk.
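
If you want to limit the scan to particular paths, one way (a sketch; the emcpower device names below are made up, adjust them for your powerpath setup) is a directory of symlinks:

# mkdir /mypaths
# ln -s /dev/dsk/emcpower0c /mypaths/
# ln -s /dev/dsk/emcpower1c /mypaths/
# zpool import -d /mypaths tank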

Does a truss of zpool import show the powerpath devices being opened and 
read from?

Regards,
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: [zfs-code] Space allocation failure

2007-06-28 Thread Manoj Joseph

Manoj Joseph wrote:

Hi,

In brief, what I am trying to do is to use libzpool to access a zpool - 
like ztest does.


[snip]

No, AFAIK, the pool is not damaged. But yes, it looks like the device 
can't be written to by the userland zfs.


Well, I might have figured out something.

Trussing the process shows this:

/1: open64("/dev/rdsk/c2t0d0s0", O_RDWR|O_LARGEFILE) = 3
/108:   pwrite64(3,  X0101\0140104\n $\0\r  .., 638, 4198400) Err#22 
EINVAL
/108:   pwrite64(3, FC BFC BFC BFC BFC BFC B.., 386, 4199038) Err#22 
EINVAL

[more failures...]

The writes are not aligned to a block boundary. And, apparently, unlike 
files, this does not work for devices.


Question: were ztest and libzpool not meant to be run on real devices? 
Or could there be an issue in how I set things up?


Regards,
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: [zfs-code] Space allocation failure

2007-06-28 Thread Manoj Joseph

Manoj Joseph wrote:

Manoj Joseph wrote:

Hi,

In brief, what I am trying to do is to use libzpool to access a zpool 
- like ztest does.


[snip]

No, AFAIK, the pool is not damaged. But yes, it looks like the device 
can't be written to by the userland zfs.


Well, I might have figured out something.

Trussing the process shows this:

/1: open64("/dev/rdsk/c2t0d0s0", O_RDWR|O_LARGEFILE) = 3
/108:   pwrite64(3,  X0101\0140104\n $\0\r  .., 638, 4198400) Err#22 
EINVAL
/108:   pwrite64(3, FC BFC BFC BFC BFC BFC B.., 386, 4199038) Err#22 
EINVAL

[more failures...]

The writes are not aligned to a block boundary. And, apparently, unlike 
files, this does not work for devices.


Question: were ztest and libzpool not meant to be run on real devices? 
Or could there be an issue in how I set things up?


The failing write has this call stack:

  pwrite64:return
  libc.so.1`_pwrite64+0x15
  libzpool.so.1`vn_rdwr+0x5b
  libzpool.so.1`vdev_file_io_start+0x17e
  libzpool.so.1`vdev_io_start+0x18
  libzpool.so.1`zio_vdev_io_start+0x33d
  [snip]

usr/src/uts/common/fs/zfs/vdev_file.c has this:

/*
 * From userland we access disks just like files.
 */
#ifndef _KERNEL

vdev_ops_t vdev_disk_ops = {
vdev_file_open,
vdev_file_close,
vdev_default_asize,
vdev_file_io_start,
vdev_file_io_done,
NULL,
VDEV_TYPE_DISK, /* name of this vdev type */
B_TRUE  /* leaf vdev */
};

Guess vdev_file_io_start() does not work very well for devices.

Regards,
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: [zfs-code] Space allocation failure

2007-06-27 Thread Manoj Joseph

Hi,

In brief, what I am trying to do is to use libzpool to access a zpool - 
like ztest does.


Matthew Ahrens wrote:

Manoj Joseph wrote:

Hi,

Replying to myself again. :)

I see this problem only if I attempt to use a zpool that already 
exists. If I create one (using files instead of devices, don't know if 
it matters) like ztest does, it works like a charm.


You should probably be posting on zfs-discuss.


Switching from zfs-code to zfs-discuss.

The pool you're trying to access is damaged.  It would appear that one 
of the devices can not be written to.


No, AFAIK, the pool is not damaged. But yes, it looks like the device 
can't be written to by the userland zfs.


bash-3.00# zpool import test
bash-3.00# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test    85K  1.95G  24.5K  /test
bash-3.00# ./udmu test
 pool: test
 state: ONLINE
 scrub: none requested
 config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  c2t0d0    ONLINE       0     0     0

errors: No known data errors
Export the pool.
cannot open 'test': no such pool
Import the pool.
error: ZFS: I/O failure (write on unknown off 0: zio 8265d80 [L0 
unallocated] 4000L/400P DVA[0]=0:1000:400 DVA[1]=0:18001000:400 
fletcher4 lzjb LE contiguous birth=245 fill=0 
cksum=6bba8d3a44:2cfa96558ac7:c732e55bea858:2b86470f6a83373): error 28

Abort (core dumped)
bash-3.00# zpool import test
bash-3.00# zpool status test
  pool: test
 state: ONLINE
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  c2t0d0    ONLINE       0     0     0

errors: No known data errors
bash-3.00# touch /test/z
bash-3.00# sync
bash-3.00# ls -l /test/z
-rw-r--r--   1 root root   0 Jun 28 04:18 /test/z
bash-3.00#

The userland zfs's export succeeds. But doing a system("zpool status 
test") right after the spa_export() succeeds shows that the 'kernel 
zfs' still thinks it is imported.


I guess that makes sense. Nothing has been told to the 'kernel zfs' 
about the export.


But I still do not understand why the 'userland zfs' can't write to the 
pool.


Regards,
Manoj

PS: The code I have been tinkering with is attached.



--matt



Any clue as to why this is so would be appreciated.

Cheers
Manoj

Manoj Joseph wrote:

Hi,

I tried adding an spa_export();spa_import() to the code snippet. I 
get a similar crash while importing.


I/O failure (write on unknown off 0: zio 822ed40 [L0 unallocated] 
4000L/400P DVA[0]=0:1000:400 DVA[1]=0:18001000:400 fletcher4 lzjb 
LE contiguous birth=4116 fill=0 
cksum=69c3a4acfc:2c42fdcaced5:c5231ffcb2285:2b8c1a5f2cb2bfd): error 
28 Abort (core dumped)


I thought ztest could use an existing pool. Is that assumption wrong?

These are the stacks of interest.

 d11d78b9 __lwp_park (81c3e0c, 81c3d70, 0) + 19
 d11d1ad2 cond_wait_queue (81c3e0c, 81c3d70, 0, 0) + 3e
 d11d1fbd _cond_wait (81c3e0c, 81c3d70) + 69
 d11d1ffb cond_wait (81c3e0c, 81c3d70) + 24
 d131e4d2 cv_wait  (81c3e0c, 81c3d6c) + 5e
 d12fe2dd txg_wait_synced (81c3cc0, 1014, 0) + 179
 d12f9080 spa_config_update (819dac0, 0) + c4
 d12f467a spa_import (8047657, 8181f88, 0) + 256
 080510c6 main (2, 804749c, 80474a8) + b2
 08050f22 _start   (2, 8047650, 8047657, 0, 804765c, 8047678) + 7a


 d131ed79 vpanic   (d1341dbc, ca5cd248) + 51
 d131ed9f panic(d1341dbc, d135a384, d135a724, d133a630, 0, 0) + 1f
 d131921d zio_done (822ed40) + 455
 d131c15d zio_next_stage (822ed40) + 161
 d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
 d1318c88 zio_wait_children_done (822ed40) + 18
 d131c15d zio_next_stage (822ed40) + 161
 d131ba83 zio_vdev_io_assess (822ed40) + 183
 d131c15d zio_next_stage (822ed40) + 161
 d1307011 vdev_mirror_io_done (822ed40) + 421
 d131b8a2 zio_vdev_io_done (822ed40) + 36
 d131c15d zio_next_stage (822ed40) + 161
 d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
 d1318c88 zio_wait_children_done (822ed40) + 18
 d1306be6 vdev_mirror_io_start (822ed40) + 1d2
 d131b862 zio_vdev_io_start (822ed40) + 34e
 d131c313 zio_next_stage_async (822ed40) + 1ab
 d131bb47 zio_vdev_io_assess (822ed40) + 247
 d131c15d zio_next_stage (822ed40) + 161
 d1307011 vdev_mirror_io_done (822ed40) + 421
 d131b8a2 zio_vdev_io_done (822ed40) + 36
 d131c15d zio_next_stage (822ed40) + 161
 d1318b92 zio_wait_for_children (822ed40, 11, 822ef30) + 6a
 d1318c88 zio_wait_children_done (822ed40) + 18
 d1306be6 vdev_mirror_io_start (822ed40) + 1d2
 d131b862 zio_vdev_io_start (822ed40) + 34e
 d131c15d zio_next_stage (822ed40) + 161
 d1318dc1 zio_ready (822ed40) + 131
 d131c15d zio_next_stage (822ed40) + 161
 d131b41b zio_dva_allocate (822ed40) + 343
 d131c15d zio_next_stage (822ed40) + 161
 d131bdcb zio_checksum_generate (822ed40) + 123
 d131c15d zio_next_stage (822ed40) + 161
 d1319873 zio_write_compress (822ed40) + 4af
 d131c15d zio_next_stage (822ed40) + 161
 d1318b92 zio_wait_for_children (822ed40, 1, 822ef28) + 6a
 d1318c68

Re: [zfs-discuss] fchmod(2) returns ENOSPC on ZFS

2007-06-15 Thread Manoj Joseph

Matthew Ahrens wrote:
In a COW filesystem such as ZFS, it will sometimes be necessary to 
return ENOSPC in cases such as chmod(2) which previously did not.  This 
is because there could be a snapshot, so overwriting some information 
actually requires a net increase in space used.


That said, we may be generating this ENOSPC in cases where it is not 
strictly necessary (eg, when there are no snapshots).  We're working on 
some of these cases.  Can you show us the output of 'zfs list' when the 
ENOSPC occurs?


Is there a bug id for this?

Regards,
Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] fchmod(2) returns ENOSPC on ZFS

2007-06-13 Thread Manoj Joseph

Hi,

I find that fchmod(2) on a zfs filesystem can sometimes generate errno = 
ENOSPC. However this error value is not in the manpage of fchmod(2).


Here's where ENOSPC is generated.

  zfs`dsl_dir_tempreserve_impl
  zfs`dsl_dir_tempreserve_space+0x4e
  zfs`dmu_tx_try_assign+0x230
  zfs`dmu_tx_assign+0x21
  zfs`zfs_setattr+0x41b
  genunix`fop_setattr+0x24
  genunix`vpsetattr+0x110
  genunix`fdsetattr+0x26
  genunix`fchmod+0x2a
  genunix`dtrace_systrace_syscall+0xbc
  unix`sys_sysenter+0x101

Is this correct behavior? Is it the manpage that needs fixing? zpool 
list shows this.


NAME    SIZE   USED    AVAIL   CAP  HEALTH  ALTROOT
pool1   115M   83.1M   31.9M   72%  ONLINE  -

While I am unable to guarantee that there has been no activity after 
fchmod() has failed, I am fairly sure that the filesystem was not full 
when it returned ENOSPC.
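
For what it is worth, a stack like the one above can be captured with a dtrace one-liner along these lines (untested, from memory; 28 is ENOSPC):

# dtrace -n 'fbt:zfs:dsl_dir_tempreserve_impl:return /arg1 == 28/ { stack(); }'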


I have done all my analysis on build 54. So I might just be looking at 
outdated stuff.


Please let me know what you think.

Regards,
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fchmod(2) returns ENOSPC on ZFS

2007-06-13 Thread Manoj Joseph

Matthew Ahrens wrote:

Manoj Joseph wrote:

Hi,

I find that fchmod(2) on a zfs filesystem can sometimes generate errno 
= ENOSPC. However this error value is not in the manpage of fchmod(2).


Here's where ENOSPC is generated.

  zfs`dsl_dir_tempreserve_impl
  zfs`dsl_dir_tempreserve_space+0x4e
  zfs`dmu_tx_try_assign+0x230
  zfs`dmu_tx_assign+0x21
  zfs`zfs_setattr+0x41b
  genunix`fop_setattr+0x24
  genunix`vpsetattr+0x110
  genunix`fdsetattr+0x26
  genunix`fchmod+0x2a
  genunix`dtrace_systrace_syscall+0xbc
  unix`sys_sysenter+0x101

Is this correct behavior? Is it the manpage that needs fixing? zpool 
list shows this.


In a COW filesystem such as ZFS, it will sometimes be necessary to 
return ENOSPC in cases such as chmod(2) which previously did not.  This 
is because there could be a snapshot, so overwriting some information 
actually requires a net increase in space used.


Could the manpage be updated to reflect this?

That said, we may be generating this ENOSPC in cases where it is not 
strictly necessary (eg, when there are no snapshots).  We're working on 
some of these cases.  Can you show us the output of 'zfs list' when the 
ENOSPC occurs?


-bash-3.00# zfs list pool1
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool1  83.0M      0  82.8M  /pool1

-bash-3.00# zpool list pool1
NAME    SIZE   USED    AVAIL   CAP  HEALTH  ALTROOT
pool1   115M   83.0M   32.0M   72%  ONLINE  -

zfs list does say that there is no available space. There is 32M 
available on the zpool though. Interesting...


Regards,
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file system full corruption in ZFS

2007-05-29 Thread Manoj Joseph

Michael Barrett wrote:

Normally if you have a ufs file system hit 100% and you have a very high 
level of system and application load on the box (that resides in the 
100% file system) you will run into inode issues that require a fsck and 
show themselves by not being able to long list out all their attributes 
(ls -la).  Not a bug, just what happens.


I don't see how something like this can not be a bug. Don't tell me this 
is a feature and UFS is working as per design! ;)


Cheers
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] file system full corruption in ZFS

2007-05-29 Thread Manoj Joseph

dudekula mastan wrote:
At least in my experience, I saw corruptions when the ZFS file system was 
full. So far there is no way to check the file system consistency on ZFS 
(to the best of my knowledge). ZFS people claim that the ZFS file system 
is always consistent and there is no need for an FSCK command.


ZFS is always consistent on disk. This does not mean there cannot be 
data loss - especially on an unreplicated pool. ZFS can self heal only 
when there is redundancy in the pool.
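
A scrub is the closest thing to an fsck-style check; it walks and verifies every block and repairs what it can from redundant copies (pool name is just an example):

# zpool scrub tank
# zpool status -v tank   # shows checksum error counts and, if any, the affected files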


If you do see corruptions, you should probably report them here along 
with the zpool configuration details and test cases if any. Please do 
file bugs.


Cheers
Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS over a layered driver interface

2007-05-14 Thread Manoj Joseph

Shweta Krishnan wrote:

I ran zpool with truss, and here is the system call trace. (again, zfs_lyr is 
the layered driver I am trying to use to talk to the ramdisk driver).

When I compared it to a successful zpool creation, the culprit is the last 
failing ioctl
i.e. ioctl(3, ZFS_IOC_CREATE_POOL, address)

I tried looking at the source code for the failing ioctl, but didn't get any 
hints there.
Guess I must try dtrace (which I am about to learn!).

bash-3.00# truss -f zpool create adsl-pool /devices/pseudo/[EMAIL 
PROTECTED]:zfsminor1 2> /var/tmp/zpool.truss
bash-3.00# grep Err /var/tmp/zpool.truss 
2232:   open("/var/ld/ld.config", O_RDONLY) Err#2 ENOENT

2232:   xstat(2, "/lib/libdiskmgt.so.1", 0x080469C8)    Err#2 ENOENT
2232:   xstat(2, "/lib/libxml2.so.2", 0x08046868)       Err#2 ENOENT
2232:   xstat(2, "/lib/libz.so.1", 0x08046868)          Err#2 ENOENT
2232:   stat64("/devices/pseudo/[EMAIL PROTECTED]:zfsminor1s2", 0x080429E0) 
Err#2 ENOENT
2232:   modctl(MODSIZEOF_DEVID, 0x03740001, 0x080429BC, 0x08071714, 0x) 
Err#22 EINVAL



MODSIZEOF_DEVID is 10.

$ dtrace -n 'syscall::modctl:entry{trace(arg0); ustack();}'

The relevant stack is the following.

  0  71587 modctl:entry10
  libc.so.1`modctl+0x15
  zpool`make_disks+0x1bf
  zpool`make_disks+0x72
  zpool`make_root_vdev+0x56
  zpool`zpool_do_create+0x1c4
  zpool`main+0xa2
  zpool`_start+0x7a

make_disks() calls devid_get() which calls modctl(MODSIZEOF_DEVID). This 
fails.


http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/zpool/zpool_vdev.c#959

The code, however, seems to ignore this. So this might not be the issue.


2232:   mkdir("/var/run/sysevent_channels/syseventd_channel", 0755) Err#17 
EEXIST
2232:   unlink("/var/run/sysevent_channels/syseventd_channel/17") Err#2 ENOENT
2232/1: umount2("/var/run/sysevent_channels/syseventd_channel/17", 
0x) Err#22 EINVAL
2232/1: ioctl(7, I_CANPUT, 0x)  Err#89 ENOSYS
2232/1: stat64("/adsl-pool", 0x08043330)    Err#2 ENOENT
2232/1: ioctl(3, ZFS_IOC_POOL_CREATE, 0x08041BC4)   Err#22 EINVAL



ZFS_IOC_POOL_CREATE is failing. I am not sure if the problem has already 
happened or if it happens during this ioctl.


But you could try dtracing this ioctl and see where EINVAL is being set.

$ dtrace -n 'fbt:zfs:zfs_ioc_pool_create:entry{self->t=1;}  \
  fbt:zfs::return/self->t && arg1 == 22/{stack(); exit(0);} \
  fbt:zfs:zfs_ioc_pool_create:return{self->t=0;}'

If it does not provide a clue, you could try the following trace which is 
more heavyweight. Warning: it could generate a lot of output. :)


$ dtrace -n 'fbt:zfs:zfs_ioc_pool_create:entry{self->t=1;}  \
  fbt:zfs::entry/self->t/{} fbt:zfs::return/self->t/{trace(arg1);}  \
  fbt:zfs:zfs_ioc_pool_create:return{self->t=0;}'

Perhaps there are folks on this list who know what the problem is 
without all the dtracing that I am suggesting. But this is what I would try.


Good luck! :)

-Manoj

PS: When running the above scripts, run them in one telnet/ssh/xterm 
window. Run 'zpool create' in another.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: [osol-discuss] ZFS over a layered driver interface

2007-05-13 Thread Manoj Joseph

Hi,

This is probably better discussed on zfs-discuss. I am CCing the list. 
Followup emails could leave out opensolaris-discuss...


Shweta Krishnan wrote:

Does zfs/zpool support the layered driver interface?

I wrote a layered driver with a ramdisk device as the underlying
device, and successfully got a  UFS file system on the ramdisk to
boot via the layered device.

I am trying to do the same with a ZFS file system. However, since ZFS
file systems are created as datasets within a storage pool and not
directly on a specified underlying device, I can't think how I will
get a ZFS file system to mount using a layered driver atop a real
device.

 I tried specifying the layered device as the storage pool component
 for 'zpool create', but that gave me an 'invalid argument for this
 pool operation' error. I also tried setting the mountpoint for a zfs
 filesystem as 'legacy' and doing a regular mount with the layered
 device, but that gave me an 'invalid dataset' error.


You would have to create a zpool even for legacy mounts. You cannot skip 
that step.
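
Roughly like this (untested; pool, dataset and device names are placeholders) - the pool always comes first, legacy only changes who does the mounting:

# zpool create testpool <your-device>
# zfs create testpool/fs
# zfs set mountpoint=legacy testpool/fs
# mount -F zfs testpool/fs /mnt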


Probably the answer to your problem lies in the 'invalid argument for 
this pool operation' error message. Did you try trussing the zpool 
create? What was the syscall that failed? If you are familiar with 
dtrace, you might be able to narrow it down to what is causing the failure.


-Manoj


I looked through the documentation for zfs/zpool and searched
extensively, but haven't been able to figure this one out yet. I am a
newbie to ZFS, so pardon me if this is something trivial.

Can someone point me to possible method to achieve the above?


This message posted from opensolaris.org 
___ opensolaris-discuss

mailing list [EMAIL PROTECTED]


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Will this work?

2007-05-11 Thread Manoj Joseph

Robert Thurlow wrote:

I've written some about a 4-drive Firewire-attached box based on the
Oxford 911 chipset, and I've had I/O grind to a halt in the face of
media errors - see bugid 6539587.  I haven't played with USB drives
enough to trust them more, but this was a hole I fell in with Firewire.
I've had fabulous luck with a Firewire attached DVD burner, though.


6539587 does not seem to be visible on the opensolaris bugs database. :-/

-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Motley group of discs?

2007-05-05 Thread Manoj Joseph

Lee Fyock wrote:
least this year. I'd like to favor available space over performance, and 
be able to swap out a failed drive without losing any data.


Lee Fyock later wrote:

In the mean time, I'd like to hang out with the system and drives I
have. As mike said, my understanding is that zfs would provide
error correction until a disc fails, if the setup is properly done.
That's the setup for which I'm requesting a recommendation.


ZFS always lets you know if the data you are requesting has gone bad. If 
you have redundancy, it provides error correction as well.



Money isn't an issue here, but neither is creating an optimal zfs
system. I'm curious what the right zfs configuration is for the
system I have.


You obviously have the option of having a giant pool of all the disks, 
and what you get is dynamic striping. But if a disk goes toast, the data 
on it is gone. If you plan to back up important data elsewhere and data 
loss is something you can live with, this might be a good choice.


The next option is to mirror (/raidz) disks. If you mirror a 200 GB disk 
with a 250 GB one, you will get only 200 GB of redundant storage. If a 
disk goes for a toss, all of your data is safe. But you lose disk space.


Mirroring the 600GB disk with a stripe of 160+200+250 would have been 
nice, but I believe this is not possible with ZFS (yet?).


There is a third option - create a giant pool of all the disks. Set 
copies=2. ZFS will create two copies of all the data blocks. That is 
pretty good redundancy. But depending on how full your disks are, the 
copies may or may not be on different disks. In other words, this does 
not guarantee that *all* of your data is safe if, say, your 600 GB disk 
dies. But it might be 'good enough'. From what I understand of your 
requirements, this just might be your best choice.


A periodic scrub would also be a good thing to do. The earlier you 
detect a flaky disk, the better it is...
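
A rough sketch of that third option (device names are made up; substitute your disks):

# zpool create tank c1t0d0 c1t1d0 c2t0d0 c3t0d0   # one big dynamically striped pool
# zfs set copies=2 tank                           # keep two copies of every data block
# zpool scrub tank                                # run periodically, e.g. from cron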


Hope this helps.

-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue with adding existing EFI disks to a zpool

2007-05-05 Thread Manoj Joseph

Mario Goebbels wrote:

 do it. So I added the disk using the zero slice notation (c0d0s0),
 as suggested for performance reasons. I checked the pool status and
 noticed however that the pool size didn't increase.


I believe you got this wrong. You should have given ZFS the whole disk - 
c0d0 and not a slice. When presented with a whole disk, ZFS EFI-labels it 
and turns on the write cache.
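
In other words (hypothetical device names):

# zpool add tank c0d0     # whole disk: ZFS writes an EFI label and enables the write cache
rather than
# zpool add tank c0d0s0   # a slice: no relabel, write cache is left alone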


-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ARC, mmap, pagecache...

2007-04-27 Thread Manoj Joseph

Hi,

I was wondering about the ARC and its interaction with the VM 
pagecache... When a file on a ZFS filesystem is mmaped, does the ARC 
cache get mapped to the process' virtual memory? Or is there another copy?


-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: concatination stripe - zfs?

2007-04-24 Thread Manoj Joseph

Richard Elling wrote:


In other words, the sync command schedules a sync.  The consistent way
to tell if writing is finished is to observe the actual I/O activity.


ZFS goes beyond this POSIX requirement. When a sync(1M) returns, all 
dirty data that has been cached has been committed to disk.


-Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Preferred backup mechanism for ZFS?

2007-04-22 Thread Manoj Joseph

Wee Yeh Tan wrote:

On 4/23/07, Robert Milkowski [EMAIL PROTECTED] wrote:

bash-3.00# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd pcisch md 
ip sctp usba fcp fctl qlc ssd crypto lofs zfs random ptm cpc nfs ]

> segmap_percent/D
segmap_percent:
segmap_percent: 12

(it's static IIRC)


segmap_percent is only referenced in
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/sun4/os/startup.c#2270. 



You are right that it is static but that means you cannot tune that
unless you run your own kernel.


I took a quick look at startup.c. Looks like you should be able to set 
this value in /etc/system. You should not have to compile your own kernel.
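
Something like this in /etc/system should do it (the value is only an example; a reboot is needed for it to take effect):

* example: segmap sized to 20% of memory
set segmap_percent=20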


Regards,
Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Preferred backup mechanism for ZFS?

2007-04-19 Thread Manoj Joseph

Dennis Clarke wrote:

 So now here we are ten years later with a new filesystem and I have no
 way to back it up in such a fashion that I can restore it perfectly. I
 can take snapshots. I can do a strange send and receive, but the
 process is not stable. From zfs(1M) we see:

The format of the stream is evolving. No backwards  compati-
bility  is  guaranteed.  You may not be able to receive your
streams on future versions of ZFS.


The format of the stream may not be stable. So you can't dump the stream 
to a file somewhere and expect to receive from it sometime in the future.


But if you stash it away as a pool on the same machine or elsewhere, it 
is not an issue.


# zfs send [-i b] pool/[EMAIL PROTECTED] | [ssh host] zfs receive 
poolB/received/[EMAIL PROTECTED]

Right? I think this is quite cool! :)

-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to bind the oracle 9i data file to zfs volumes

2007-04-16 Thread Manoj Joseph

Simon wrote:

 So, does this mean it is an Oracle bug? Or is it impossible (or
 inappropriate) to use ZFS/SVM volumes to create Oracle data files -
 should one instead use a zfs or ufs filesystem to do this?


Oracle can use SVM volumes to hold its data. Unless I am mistaken, it 
should be able to use zvols as well.


However, googling for 'zvol + Oracle' did not get me anything useful. 
Perhaps it is not a configuration that is very popular. ;)
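
If you do want to try it, a zvol for Oracle would look roughly like this (size and names are made up):

# zfs create -V 10g dbpool/oradata01
# ls -l /dev/zvol/rdsk/dbpool/oradata01   # point Oracle at this raw device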


My $ 0.02.

-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: simple Raid-Z question

2007-04-08 Thread Manoj Joseph

Erik Trimble wrote:

While expanding a zpool in the way you've shown is useful, it has nowhere 
near the flexibility of simply adding single disks to existing RAIDZ 
vdevs, which was the original desire expressed.  This conversation has 
been had several times now (take a look in the archives around Jan for 
the last time it came up).


Perhaps, this should be added to the FAQ?

Cheers
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Setting up for zfsboot

2007-04-04 Thread Manoj Joseph

Constantin Gonzalez wrote:


Do I still have the advantages of having the whole disk
'owned' by zfs, even though it's split into two parts?


I'm pretty sure that this is not the case:

- ZFS has no guarantee that someone won't do something else with that other
  partition, so it can't assume the right to turn on the disk cache for the whole
  disk.


Can the write cache not be turned on manually, if the user is sure that it 
is only ZFS that is using the entire disk?
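
If memory serves, it can be done per disk from the expert mode of format(1M), roughly like this (disk name is an example):

# format -e c0t1d0
format> cache
cache> write_cache
write_cache> enable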


-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] File level snapshots in ZFS?

2007-03-30 Thread Manoj Joseph

Richard Elling wrote:

Atul Vidwansa wrote:

 Hi Richard,
    I am not talking about source (ASCII) files. How about versioning
 production data? I talked about file-level snapshots because
 snapshotting an entire filesystem does not make sense when the application is
 changing just a few files at a time.


CVS supports binary files.  The nice thing about version control systems
is that you can annotate the versions.  With ZFS snapshots, you don't get
annotations.


Sure, version control systems do file versioning. But ZFS, with its COW, 
brings a new way of doing this.


I do not see applications like emacs, star office etc. using
SCCS/CVS. But I can easily see them using file snapshots if zfs were to 
offer it (I am conveniently ignoring portability).


It was suggested that filesystem snapshots be used to achieve the same 
purpose. That would not work if you have to roll back one file's changes 
but not others'...


Extended attributes could potentially be used to annotate file 
snapshots... ;)


I can also see possibilities with clustered/distributed applications 
(parallel Postgresql perhaps?) needing to commit/revoke across servers 
using this. Layered distributed filesystems could potentially use this 
for recovery.


But I also remember a long thread on this not too long ago going 
nowhere. ;)


Just my $0.02.

-Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How big a write to a regular file is atomic?

2007-03-28 Thread Manoj Joseph

Richard L. Hamilton wrote:

and does it vary by filesystem type? I know I ought to know the
answer, but it's been a long time since I thought about it, and
I must not be looking at the right man pages.  And also, if it varies,
how does one tell?  For a pipe, there's fpathconf() with _PC_PIPE_BUF,
but how about for a regular file?


For ZFS, it is atomic up to the whole-block level.

See: http://www.opensolaris.org/jive/thread.jspa?messageID=18705#18818

-Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] nv59 + HA-ZFS

2007-03-14 Thread Manoj Joseph

David Anderson wrote:

Hi,

I'm attempting to build a ZFS SAN with iSCSI+IPMP transport. I have two 
ZFS nodes that access iSCSI disks on the storage network and then the 
ZFS nodes share ZVOLs via iSCSI to my front-end Linux boxes. My 
throughput from one Linux box is about 170+MB/s with nv59 (earlier 
builds were about 60MB/s), so I am pleased with the performance so far.


My next step is to configure HA-ZFS for failover between the two ZFS 
nodes. Does Sun Cluster 3.2 work with SXCE? If so, are there any caveats 
for my situation?


I thought Sun Cluster's support for iSCSI was not ready. You could 
perhaps check with the sun cluster group.


Regards,
Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] writes lost with zfs !

2007-03-12 Thread Manoj Joseph

Ayaz,

Ayaz Anjum wrote:


HI !

I have some concerns here. From my experience in the past, touching a 
file (doing some I/O) will cause the UFS filesystem to fail over, unlike 
ZFS where it did not! Why is the behaviour of ZFS different from UFS? Is 
this not compromising data integrity?


As others have explained, until a sync is done or unless the file is 
opened to do 'sync writes', a write is not guaranteed to be on disk. If 
the node fails before the disk commit, the data can be lost. 
Applications are written with this in mind.


While ZFS and UFS do lots of things differently, the above applies to 
both of them and to all POSIX filesystems in general.


Could you tell us more about how the UFS failover happened? Did you see 
a UFS panic? Did the Sun Cluster disk path monitor cause the failover?


Regards,
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS/UFS layout for 4 disk servers

2007-03-07 Thread Manoj Joseph

Matt B wrote:

Any thoughts on the best practice points I am raising? It disturbs me
that it would make a statement like don't use slices for
production.


ZFS turns on the write cache on the disk if you give it the entire disk to 
manage. It is good for performance. So, you should use whole disks 
whenever possible.

Slices work too, but the write cache for the disk will not be turned on by ZFS.
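
For instance (hypothetical disk names):

# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0        # whole disks: EFI label and write cache handled by ZFS
versus
# zpool create tank raidz c1t0d0s0 c1t1d0s0 c1t2d0s0  # slices: works, but the write cache is left to you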

Cheers
Manoj
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] writes lost with zfs !

2007-03-07 Thread Manoj Joseph

Ayaz Anjum wrote:


HI !

I have tested the following scenario

created a zfs filesystem as part of HAStoragePlus in SunCluster 3.2, 
Solaris 11/06


 Currently I have only one FC HBA per server.

1. There is no I/O to the zfs mountpoint. I disconnected the FC cable. 
The filesystem on zfs still shows as mounted (because of no I/O to the 
filesystem). I touch a file. Still OK. I did a sync and only then the 
node panicked and the zfs filesystem failed over to the other cluster 
node. However, my file which I touched is lost.


This is to be expected, I'd say.

HAStoragePlus is primarily a wrapper over zfs that manages the 
import/export and mount/unmount. It can not and does not provide for a 
retry of pending IOs.


The 'touch' would have been part of a zfs transaction group that never 
got committed. And it stays lost when the pool is imported on the other 
node.


In other words, it does not provide the same kind of high availability 
that, say, PxFS for instance provides.


2. With zfs mounted on one cluster node, I created a file and kept 
updating it every second. Then I removed the FC cable; the writes still 
continued to the file system. After 10 seconds I put the FC cable back 
and my writes continued; no failover of zfs happened.


It seems that all I/O is going to some cache. Any suggestions on what is 
going wrong over here and what the solution to this is?


I don't know for sure. But my guess is, if you do a fsync after the 
writes and wait for the fsync to complete, then you might get some 
action. fsync should fail. zfs could panic the node. If it does, you 
will see a failover.


Hope that helps.

-Manoj


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: SPEC SFS97 benchmark of ZFS,UFS,VxFS

2006-08-20 Thread Manoj Joseph

Alan Romeril wrote:

PxFS performance improvements of the order of 5-6 times are possible
depending on the workload using Fastwrite option.

Fantastic!  Has this been targeted at directory operations?  We've
had issues with large directories full of small files being very slow
to handle over PxFS.


The 'fastwrite option' speeds up write operations. So this doesn't do 
much for directory operations.


Are there plans for PxFS on ZFS any time soon :) ?  


PxFS on ZFS is unlikely to happen. A clusterized version of ZFS (as 
mentioned before on this alias) is being considered.



Or any plans to
release PxFS as part of opensolaris?


PxFS is tightly coupled to the cluster framework. Without open sourcing 
the cluster framework, PxFS in its current form cannot be open sourced, 
as it would not make sense.


As for open sourcing cluster, my guess is as good as yours.

Regards,
Manoj

--
Sun Cluster Engineering

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS and Sun Cluster....

2006-05-31 Thread Manoj Joseph

Tatjana S Heuser wrote:

Is it planned to have the cluster fs or proxy fs layer between the ZFS layer
and the Storage pool layer?


This, AFAIK, is not the current plan of action. Sun Cluster should be 
moving towards ZFS as a 'true' cluster filesystem.


Not going the 'proxy fs layer' way (PxFS/GFS), IMHO, is not due to 
technical infeasibility.


Regards,
Manoj

--
Global Data and Devices,
Sun Cluster Engineering.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss