Re: [zfs-discuss] X4540

2008-07-10 Thread Spencer Shepler

On Jul 10, 2008, at 7:05 AM, Ross wrote:

 Oh god, I hope not.  A patent on fitting a card in a PCI-E slot, or
 using NVRAM with RAID (which RAID controllers have been doing for
 years) would just be ridiculous.  This is nothing more than cache,
 and even with the American patent system I'd have thought it hard to
 get that past the obviousness test.

How quickly they forget.

Take a look at the Prestoserve User's Guide for a refresher...

http://docs.sun.com/app/docs/doc/801-4896-11





Re: [zfs-discuss] ZFS NFS cannot write

2008-06-05 Thread Spencer Shepler

Check the permissions on the directory...
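For example, something along these lines on the server (the wide-open mode is just for a quick test; pick appropriate ownership and permissions for real use):

 server# ls -ld /data/nfstest
 server# chmod 777 /data/nfstest

Also keep in mind that root on the client maps to an unprivileged user on the server unless the share grants root access, so a root "touch" can fail even when the mode looks right.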

On Jun 5, 2008, at 1:06 PM, Gary Leong wrote:

 This is the first time I tried NFS with ZFS.  I shared the ZFS
 filesystem with NFS, but I can't write to the files even though I mount
 it as read-write.  This is for Solaris 10 update 4.  I wonder if
 there is a bug?

 ---server (sdw2-2)

 #zfs create -o sharenfs=on  data/nfstest

 #zfs get all data/nfstest
 NAME          PROPERTY       VALUE                  SOURCE
 data/nfstest  type           filesystem             -
 data/nfstest  creation       Thu Jun  5 13:22 2008  -
 data/nfstest  used           40.7K                  -
 data/nfstest  available      15.4T                  -
 data/nfstest  referenced     40.7K                  -
 data/nfstest  compressratio  1.00x                  -
 data/nfstest  mounted        yes                    -
 data/nfstest  quota          none                   default
 data/nfstest  reservation    none                   default
 data/nfstest  recordsize     128K                   default
 data/nfstest  mountpoint     /data/nfstest          default
 data/nfstest  sharenfs       on                     local
 data/nfstest  checksum       on                     default
 data/nfstest  compression    off                    default
 data/nfstest  atime          on                     inherited from data
 data/nfstest  devices        on                     default
 data/nfstest  exec           on                     default
 data/nfstest  setuid         on                     default
 data/nfstest  readonly       off                    default
 data/nfstest  zoned          off                    default
 data/nfstest  snapdir        hidden                 default
 data/nfstest  aclmode        groupmask              default
 data/nfstest  aclinherit     secure                 default
 data/nfstest  canmount       on                     default
 data/nfstest  shareiscsi     off                    default
 data/nfstest  xattr          on                     default

 #share
 -   /data/nfstest   rw   

 client

 #mount -o rw sdw2-2:/data/nfstest /data/nfsmount/

 #mount -o rw sdw2-2:/data/nfstest /sdw2nfs/
 touch /sdw2nfs/dummy.txt
 touch: /sdw2nfs/dummy.txt cannot create




Re: [zfs-discuss] NFS4-sharing-ZFS issues

2008-05-21 Thread Spencer Shepler

On May 21, 2008, at 1:43 PM, Will Murnane wrote:

 I'm looking at implementing home directories on ZFS.  This will be
 about 400 users, each with a quota.  The ZFS way of doing this, AIUI, is to
 create one filesystem per user, assign each a quota and/or
 reservation, and set sharenfs=on.  So I tried it:
 # zfs create local-space/test
 # zfs set sharenfs=on local-space/test
 # zfs create local-space/test/foo
 # zfs create local-space/test/foo/bar
 # share
 -   /export/local-space/test   rw   
 -   /export/local-space/test/foo   rw   
 -   /export/local-space/test/foo/bar   rw   
 All good so far.  Now, I understand that with nfs in general, the
 child filesystems will not be mounted, and I do see this behavior on
 Linux.  If I specify nfs4, the children are mounted as I expected:
 # mount -t nfs4 server:/export/local-space/test /mnt/
 # cd /mnt/
 # ls
 foo
 # ls foo
 bar
 Okay, all is well.  Try the same thing on a Solaris client, though,
 and it doesn't work:
 # mount -o vers=4 ds3:/export/local-space/test /mnt/
 # cd mnt
 # ls
 foo
 # ls foo
 nothing

This behavior was a recent addition to the Solaris client, which is why
you are seeing this lack of functionality.  Any recent Solaris Express or
OpenSolaris install will have the behavior you desire.
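As an aside, the per-user layout described above would look roughly like this; the dataset names and sizes are only illustrative:

 # zfs create local-space/home
 # zfs set sharenfs=on local-space/home
 # zfs create -o quota=10G local-space/home/alice
 # zfs create -o quota=10G -o reservation=1G local-space/home/bob

The children inherit sharenfs from the parent, so each user's filesystem is shared automatically.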

Spencer



Re: [zfs-discuss] Not all filesystems shared over NFS

2008-05-18 Thread Spencer Shepler

On May 18, 2008, at 3:39 AM, Johan Kooijman wrote:

 Morning all,

 situation is as follows: OpenSolaris NFS server, Linux client.

 I've created a ZFS filesystem and shared it over NFS:

 -bash-3.2# zfs list | grep vz
 datatank/vz  126M   457G   126M  /datatank/vz
 datatank/vz/private   37K   457G19K  /datatank/ 
 vz/private
 datatank/vz/private/28999 18K   457G18K  /datatank/ 
 vz/private/28999
 datatank/vz/root  37K   457G19K  /datatank/ 
 vz/root
 datatank/vz/root/2899918K   457G18K  /datatank/ 
 vz/root/28999

 -bash-3.2# cat /etc/dfs/sharetab
 /datatank/vz/root/28999     -   nfs   anon=0,sec=sys,[EMAIL PROTECTED]/24
 /datatank/vz/root           -   nfs   anon=0,sec=sys,[EMAIL PROTECTED]/24
 /datatank/vz/private        -   nfs   anon=0,sec=sys,[EMAIL PROTECTED]/24
 /datatank/vz/private/28999  -   nfs   anon=0,sec=sys,[EMAIL PROTECTED]/24
 /datatank/vz                -   nfs   anon=0,sec=sys,[EMAIL PROTECTED]/24

 So far, so good. I can mount it on my linux machine:

 [EMAIL PROTECTED] vz]# mount -t nfs
 192.168.178.31:/datatank/vz on /vz type nfs (rw,addr=192.168.178.31)

 As you can see, I've created a file system datatank/vz/root/28999,
 which should appear on the Linux client.  It doesn't:

 [EMAIL PROTECTED] vz]# ls -l /vz/private/
 total 0

 It does on the server:

 -bash-3.2# ls -l /datatank/vz/private/
 total 3
 drwxr-xr-x   2 root root   2 May 18 09:21 28999

 Can anyone give me some directions on this?

I believe that you will need to mount those filesystems directly.

In later versions of the OpenSolaris NFSv4 client, those
filesystems will be mounted automatically.  I believe this
feature is also available in later versions of the Linux
NFSv4 client, but I don't happen to remember the specifics.
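In the meantime, mounting them directly from the Linux client should work; the paths are taken from your example above, and the mount points are the empty directories you already see under /vz:

 mount -t nfs 192.168.178.31:/datatank/vz/private/28999 /vz/private/28999
 mount -t nfs 192.168.178.31:/datatank/vz/root/28999 /vz/root/28999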

Spencer


Re: [zfs-discuss] share zfs hierarchy over nfs

2008-04-30 Thread Spencer Shepler

On Apr 29, 2008, at 9:35 PM, Tim Wood wrote:

 Hi,
 I have a pool /zfs01 with two sub file systems /zfs01/rep1 and
 /zfs01/rep2.  I used zfs share to make all of these mountable
 over NFS, but clients have to mount either rep1 or rep2
 individually.  If I try to mount /zfs01 it shows directories for
 rep1 and rep2, but none of their contents.

 On a Linux machine I think I'd have to set the no_subtree_check
 flag in /etc/exports to let an NFS mount move through the
 different exports, but I'm just beginning with Solaris, so I'm not
 sure what to do here.

 I found this post in the forum:
 http://opensolaris.org/jive/thread.jspa?messageID=169354#169354

 but that makes it sound like this issue was resolved by changing
 the NFS client behavior in Solaris.  Since my NFS client machines
 are going to be Linux machines, that doesn't help me any.

My understanding is that the Linux client has the same
capabilities as the Solaris client in that it can
traverse server-side mount points dynamically.
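Roughly, on a recent Linux client that would be an NFSv4 mount of the top of the hierarchy (the server name here is only illustrative):

 # mount -t nfs4 server:/zfs01 /mnt

after which /mnt/rep1 and /mnt/rep2 should show their contents as the client crosses the server-side mount points.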

Spencer



Re: [zfs-discuss] NFS async and ZFS zil_disable

2008-04-22 Thread Spencer Shepler

On Apr 22, 2008, at 12:16 PM, msl wrote:

 Hello all,
  I think the two options are very similar from the client-side
 view, but I want to hear from the experts... So, can somebody talk
 a little about the two options?
  We have two different layers here, I think:
  1) The async from the protocol stack, and the other...
  2) From the filesystem point of view.

  My thinking is that the first option could be quicker
 for the client, because the ack happens at a higher level (the NFS protocol).

The NFS client has control over WRITE requests in that it
may ask to have them done async and then follow them with
a COMMIT request to ensure the data is in stable storage (on disk).

However, the NFS client has no control over namespace operations
(file/directory create/remove/rename).  These must be done
synchronously -- no way for the client to direct the operational
behavior of the server in these cases.
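If you want to see how a particular workload splits between async WRITEs and COMMITs on a Solaris server, a rough count can be taken with DTrace, in the same style as the rfs3_create script that appears elsewhere in this archive (the probe names assume the Solaris NFSv3 server routines):

 # dtrace -n 'rfs3_write:entry,rfs3_commit:entry { @[probefunc] = count(); }'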

Spencer



Re: [zfs-discuss] NFS async and ZFS zil_disable

2008-04-22 Thread Spencer Shepler

On Apr 22, 2008, at 2:00 PM, msl wrote:


 On Apr 22, 2008, at 12:16 PM, msl wrote:

 Hello all,
  I think the two options are very similar from the client-side
 view, but I want to hear from the experts... So, can somebody talk
 a little about the two options?
  We have two different layers here, I think:
  1) The async from the protocol stack, and the other...
  2) From the filesystem point of view.

  My thinking is that the first option could be quicker
 for the client, because the ack happens at a higher level (the NFS protocol).

 The NFS client has control over WRITE requests in that it
 may ask to have them done async and then follow them with
 a COMMIT request to ensure the data is in stable storage (on disk).

  Great information... so, is the sync option on the server (export)
 side just a possible option for the client requests?  I mean, is
 the sync/async option part of an NFS write request operation?
 When I asked the question, I was talking about the server side;
 I did not know about the possibility of the client requesting
 sync/async.

The Solaris NFS server does not offer a method to specify sync/async
behavior for NFS WRITE requests.  The Solaris server will do what the
client asks it to do.

 However, the NFS client has no control over namespace operations
 (file/directory create/remove/rename).  These must be done
 synchronously -- no way for the client to direct the operational
 behavior of the server in these cases.

  If I understand correctly, zil_disable is then a problem for
 the NFS semantics... I mean, the service guarantees will be compromised,
 because the NFS client can't control the namespace operations.
 That is a big difference from my initial question.

Yes, zil_disable can be a problem as described by Eric here:
http://blogs.sun.com/erickustarz/entry/zil_disable



 Spencer

  Thanks a lot for your comments! Anybody else?
  P.S.: how can I enable async on the NFS server on Solaris?  Just add
 async to the export options?

See above; not possible.

Spencer


Re: [zfs-discuss] Problem with sharing multiple zfs file systems

2007-11-27 Thread Spencer Shepler

On Nov 27, 2007, at 1:36 AM, Anton B. Rang wrote:

 Given that it will be some time before NFSv4 support, let alone  
 NFSv4 support for mount point crossing, in most client operating  
 systems ... what obstacles are in the way of constructing an NFSv3  
 server which would 'do the right thing' transparently to clients so  
 long as the file systems involved were within a single ZFS pool?

 So far I can think of (a) clients expect inode numbers to be unique  
 -- this could be solved by making them (optionally) unique within a  
 pool; (b) rename and link semantics depend on the file system --  
 for rename this is easy, for link it might require a cross-file-system
 hard link object, which is certainly doable.

 This would go a long way towards making ZFS-with-many-filesystems  
 approaches more palatable. (Hmmm, how does CIFS support deal with  
 the many-filesystems problem today?)


What you describe is the nohide option that was first
introduced in Irix and picked up in the Linux NFS server implementation.
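For reference, on a Linux server that corresponds to an /etc/exports entry along these lines; the paths and client name are only illustrative:

 /export/pool        client.example.com(rw)
 /export/pool/fs1    client.example.com(rw,nohide)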

As you say, inode number uniqueness would be one key issue along
with the others you mention.

I have a bias, but I would rather see effort placed into dealing
with issues that stand in the way of effective deployment and
use of NFSv4; it will be more effective at CIFS/NFS co-existence
and deals with a number of other issues that cannot be
easily solved with NFSv3.

Spencer



Re: [zfs-discuss] Problem with sharing multiple zfs file systems

2007-11-27 Thread Spencer Shepler

On Nov 27, 2007, at 9:48 AM, Richard Elling wrote:

 Anton B. Rang wrote:
 Given that it will be some time before NFSv4 support, let alone  
 NFSv4 support for mount point crossing, in most client operating  
 systems ... what obstacles are in the way of constructing an NFSv3  
 server which would 'do the right thing' transparently to clients  
 so long as the file systems involved were within a single ZFS pool?

 So far I can think of (a) clients expect inode numbers to be  
 unique -- this could be solved by making them (optionally) unique  
 within a pool; (b) rename and link semantics depend on the file  
 system -- for rename this is easy, for link it might require a  
 cross-file-system hard link object, which is certainly doable.

 This would go a long way towards making ZFS-with-many-filesystems  
 approaches more palatable. (Hmmm, how does CIFS support deal with  
 the many-filesystems problem today?)


 One solution, which is only about 18 years old, is automount.
 Does anyone know how ubiquitous automount clients are?

I think you have answered your own question.  18 years and
we have to ask the question? It must not be very ubiquitous. :-)

The fact is that automount has ended up meaning different
things in the various operating environments.  The maps and options
differ slightly across implementations, and therefore automount is
not an effective method of managing a namespace (along with
a host of other issues that stand in the way).  We need to move
away from automount usage and on to namespaces that are managed
at the server side of things and integrated in a way to allow
admins to effectively manage large environments.  NFSv4 and
the client's ability to move from filesystem to filesystem at
the same server, and then to be referred to another server,
gives us a reasonable base to start from.  There is an effort
going on to define the back-end server namespace management,
and it turns out that it will be discussed at the next IETF meeting.
If you are interested in the ideas to date, you can check out the
following internet draft:

http://www.ietf.org/internet-drafts/draft-tewari-federated-fs-protocol-00.txt

Spencer



Re: [zfs-discuss] Problem with sharing multiple zfs file systems

2007-11-26 Thread Spencer Shepler

On Nov 21, 2007, at 2:11 PM, Simon Gao wrote:

 Here is one issue I am running into when setting up a new NFS  
 server to share several zfs file systems.

 I created the following ZFS file systems from a ZFS pool called bigpool.
 The bigpool is the top-level file system and is mounted as /export/bigpool.

 file system   mount point

 bigpool /export/bigpool
 bigpool/zfs1  /export/bigpool/zfs1
 bigpool/zfs2  /export/bigpool/zfs2

 All directories under /export are owned by a group called users.  
 Also group users have write access to them.

 Next,  I exported bigpool (zfs1 and zfs2 inherited from bigpool) as  
 NFS share.

 zfs set sharenfs=on bigpool

 On a Linux client, I can mount all shares directly without
 problem.  If I mount /export/bigpool on /mnt/nfs_share on the
 Linux client, the ownership and permissions
 on /mnt/nfs_share match /export/bigpool on the NFS server.

 However, permissions on /mnt/nfs_share/zfs1 and /mnt/nfs_share/zfs2
 are not inherited correctly.  The group ownership is switched to
 root on /mnt/nfs_share/zfs1 and zfs2, and write permission is removed.  I
 expect /mnt/nfs_share/zfs1 to match /export/bigpool/zfs1, and
 likewise for zfs2.  Why don't ownership and permissions get inherited?

 When I directly mount /export/bigpool/zfs1 to /mnt/nfs_share/zfs1,  
 then ownership and permissions match again.

 Since creating and using multiple file systems is recommended
 practice with ZFS, does that mean a lot more trouble in managing NFS
 shares on the system?  Is there a way to export only the top-level
 file system and let all permissions and ownership flow down correctly
 on the client side?  Or maybe there are some special settings out
 there to solve my problem?

 Any help is appreciated.

What you are describing is general NFS behavior.  Nothing special
about ZFS usage here.  When mounting /export/bigpool at the client,
the client observes the underlying directory /export/bigpool/zfs1 and
hence the change in ownership and permissions.  When the client
mounts the path /export/bigpool/zfs1, it is accessing that
filesystem's root directory and sees the ownership and other attributes
that are expected of that filesystem.
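As a practical matter, the workaround is what you already found: mount each child filesystem explicitly on the client (the server name here is illustrative):

 mount server:/export/bigpool /mnt/nfs_share
 mount server:/export/bigpool/zfs1 /mnt/nfs_share/zfs1
 mount server:/export/bigpool/zfs2 /mnt/nfs_share/zfs2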

With an NFSv4 client that provides 'mirror mounts', the client
will be able to mount /export/bigpool and have the underlying
filesystems automatically mounted when accessed, and the behavior you
describe will go away, since access reaches the desired filesystem.

Spencer



Re: [zfs-discuss] nfs-ownership

2007-10-18 Thread Spencer Shepler

On Oct 17, 2007, at 11:25 AM, Claus Guttesen wrote:

 Did you mount both the parent and all the children on the client ?

 No, I just assumed that the sub-partitions would inherit the same
 uid/gid as the parent. I have done a chown -R.

   Ahhh, the issue is not permissions, but how the NFS server
 sees the various directories to share. Each dataset in the zpool is
 seen as a separate FS from the OS perspective; each is a separate NFS
 share. In which case each has to be mounted separately on the NFS
 client.

 Thank you for the clarification.  When mounting the same partitions
 from a Windows client I get r/w access to both the parent and
 child partitions.

 Will it be possible to implement such a feature in NFS?

NFSv4 allows the client visibility into the shared filesystems
at the server.  It is up to the client to mount or access
those individual filesystems.  The Solaris client is being updated
with this functionality (we have named it mirror mounts); I don't know
about the BSD client's ability to do the same.

Spencer



Re: [zfs-discuss] nfs-ownership

2007-10-16 Thread Spencer Shepler

Claus,

Is the mount using NFSv4?  If so, there is likely a misguided
mapping of the users/groups between the client and server.

While not including BSD info, there is a little bit on
NFSv4 user/group mappings at this blog:
http://blogs.sun.com/nfsv4
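On the Solaris side, the NFSv4 identity mapping domain lives in /etc/default/nfs; a quick sanity check is the following (the BSD client has an equivalent setting that must match):

 server# grep NFSMAPID_DOMAIN /etc/default/nfs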

Spencer

On Oct 16, 2007, at 2:11 PM, Claus Guttesen wrote:

 Hi.

 I have created some zfs-partitions. First I create the
 home/user-partitions. Beneath that I create additional partitions.
 Then I do a chown -R for that user. These partitions are shared
 using sharenfs=on. The owner and group id is 1009.

 These partitions are visible as the user assigned above. But when I
 mount the home/user partition from a FreeBSD client, only the
 top partition has the proper uid and gid assignment. The partitions
 beneath are assigned to root/wheel (uid 0 and gid 0 on FreeBSD).

 Am I doing something wrong?

 From nfs-client:

 ls -l spool
 drwxr-xr-x  181 print  print  181 16 oct 21:00 2007-10-16
 drwxr-xr-x    2 root   wheel    2 11 oct 11:07 c8

 From nfs-server:
 ls -l spool
 drwxr-xr-x 185 print print 185 Oct 16 21:10 2007-10-16
 drwxr-xr-x   6 print print   6 Oct 13 17:10 c8

 The folder 2007-10-16 is a regular folder below the nfs-mounted
 partition, c8 is a zfs-partition.

 -- 
 regards
 Claus

 When lenity and cruelty play for a kingdom,
 the gentlest gamester is the soonest winner.

 Shakespeare


Re: [zfs-discuss] Fileserver performance tests

2007-10-10 Thread Spencer Shepler

On Oct 10, 2007, at 8:41 AM, Luke Lonergan wrote:

 Hi Eric,

 On 10/10/07 12:50 AM, eric kustarz [EMAIL PROTECTED] wrote:

 Since you were already using filebench, you could use the
 'singlestreamwrite.f' and 'singlestreamread.f' workloads (with
 nthreads set to 20, iosize set to 128k) to achieve the same things.

 Yes, but once again we see the utility of the zero-software-needed approach
 to benchmarking!  The dd test rules for a general audience on the mailing
 lists, IMO.

 The other goodness aspect of the dd test is that the results are
 indisputable because dd is baked into the OS.

And filebench will be in the next build in the same way.
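(For readers following along, the dd test being referred to is usually just a sequential write and read-back; the path and sizes here are only illustrative:

 # dd if=/dev/zero of=/tank/fs/ddtest bs=128k count=80000    (sequential write, roughly 10 GB)
 # dd if=/tank/fs/ddtest of=/dev/null bs=128k                (sequential read-back)
)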

Spencer


 That all said - we don't have a simple dd benchmark for random  
 seeking.

 With the latest version of filebench, you can then use the '-c'
 option to compare your results in a nice HTML friendly way.

 That's worth the effort.

 - Luke




Re: [zfs-discuss] Fileserver performance tests

2007-10-10 Thread Spencer Shepler

On Oct 10, 2007, at 2:56 AM, Thomas Liesner wrote:

 Hi Eric,

 Are you talking about the documentation at:
 http://sourceforge.net/projects/filebench
 or:
 http://www.opensolaris.org/os/community/performance/filebench/
 and:
 http://www.solarisinternals.com/wiki/index.php/FileBench
 ?

 I was talking about the solarisinternals wiki.  I can't find any
 documentation at the sourceforge site, and the opensolaris site
 refers to solarisinternals for more detailed documentation.  The
 INSTALL document within the distribution refers to
 solarisinternals and pkgadd, which of course doesn't work without
 providing a package ;)

 This is the output of make within filebench/filebench:

 [EMAIL PROTECTED] # make
 make: Warning: Can't find `../Makefile.cmd': file or directory not found
 make: Fatal error in reader: Makefile, line 27: Read of include file `../Makefile.cmd' failed

I am working to clean that up and will be posting binaries as well.

Spencer



 Before looking at the results, decide if that really *is* your
 expected workload

 Sure enough I have to dig deeper into the filebench workloads and
 create my own workload to represent my expected workload even
 better, but the tasks within the fileserver workload are already
 quite representative (I could skip the append test though...)

 Regards,
 Tom




Re: [zfs-discuss] problem: file copy's aren't getting the current file

2007-08-30 Thread Spencer Shepler

On Aug 30, 2007, at 12:35 PM, Richard Elling wrote:

 NFS clients can cache.  This cache can be loosely synchronized for
 performance reasons.  See the settings for actimeo and related  
 variables
 in mount_nfs(1m)

The NFS client will getattr/OPEN at the point where the application
opens the file (close-to-open consistency) and actimeo will not
change that behavior.  The nocto mount option will disable that.

If the client is copying an older version of the file, then the
client is either not checking the file's modification time
correctly or the NFS server is not telling the truth.
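One quick way to narrow it down is to compare the file's modification time as each side sees it at the moment of the copy (the paths here are only illustrative):

 client# ls -l /home/user/file
 server# ls -l /export/home/user/file

If the client already shows the new mtime but cp still produces old contents, the stale data is coming from the client's cache rather than from the server.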

Spencer



   -- richard

 Russ Petruzzelli wrote:
I'm not sure if this is a zfs, zones, or solaris/nfs problem... So
 I'll start on this alias...

 Problem:
 I am seeing file copies from one machine to another grab an older  
 file.
 (Worded differently:  The cp command is not getting the most  
 recent file.)

 For instance,
 On a T2000, Solaris 10u3, with ZFS set up, and a zone, I try to copy
 a file from my swan home directory into a directory in the zone ...
 The file copied is not the file currently in my home directory.  It is
 an older version of it.

 I've suspected this for some time (months) but today was the first  
 time
 I could actually see it happen.
 The niagara box seems to pull the file from some cache, but where?

 Thanks in advance for any pointers or configuration advice.  This is
 wreaking havoc on my testing.

 Russ




Re: [zfs-discuss] Cluster File System Use Cases

2007-07-13 Thread Spencer Shepler

On Jul 13, 2007, at 2:20 AM, Richard L. Hamilton wrote:

 Bringing this back towards ZFS-land, I think that there are some clever
 things we can do with snapshots and clones.  But the age-old problem
 of arbitration rears its ugly head.  I think I could write an option
 to expose ZFS snapshots to read-only clients.  But in doing so, I
 don't see how to prevent an ill-behaved client from clobbering the
 data.  To solve that problem, an arbiter must decide who can write
 where.  The SCSI protocol has almost nothing to assist us in this
 cause, but NFS, QFS, and pxfs do.  There is room for cleverness, but
 not at the SCSI or block level.
  -- richard

 Yeah; ISTR that IBM mainframe complexes with what they called
 shared DASD (DASD==Direct Access Storage Device, i.e. disk, drum,  
 or the
 like) depended on extent reserves.  IIRC, SCSI dropped extent reserve
 support, and indeed it was never widely nor reliably available anyway.
 AFAIK, all SCSI offers is reserves of an entire LUN; that doesn't  
 even help
 with slices, let alone anything else.  Nor (unlike either the VTOC  
 structure
 on MVS nor VxFS) is ZFS extent-based anyway; so even if extent  
 reserves
 were available, they'd only help a little.  Which means, as he  
 says, some
 sort of arbitration.

 I wonder whether the hooks for putting the ZIL on a separate device
 will be of any use for the cluster filesystem problem; it almost  
 makes me
 wonder if there could be any parallels between pNFS and a refactored
 ZFS.

We are busy layering pNFS on ZFS in the NFSv4.1 project and hope to
allow for coordination with client access and other interesting  
features.

Spencer




Re: [zfs-discuss] Rsync update to ZFS server over SSH faster than over NFS?

2007-05-25 Thread Spencer Shepler


On May 25, 2007, at 11:22 AM, Roch Bourbonnais wrote:



On May 22, 2007, at 01:11, Nicolas Williams wrote:


On Mon, May 21, 2007 at 06:09:46PM -0500, Albert Chin wrote:

But still, how is tar/SSH any more multi-threaded than tar/NFS?


It's not that it is, but that NFS sync semantics and ZFS sync semantics
conspire against single-threaded performance.



Hi Nic, I don't agree with the blanket statement, so to clarify:

There are two independent things at play here.

a) NFS sync semantics conspire against single-thread performance
with any backend filesystem.  However, NVRAM normally offers some
relief from the issue.

b) ZFS sync semantics, along with the storage software and the imprecise
protocol in between, conspire against ZFS performance for some workloads
on NVRAM-backed storage.  NFS is one of the affected workloads.

The conjunction of the two causes worse than expected NFS performance
over a ZFS backend running __on NVRAM-backed storage__.
If you are not considering NVRAM storage, then I know of no ZFS/NFS-specific
problems.

Issue b) is being dealt with by both Solaris and storage vendors
(we need a refined protocol).

Issue a) is not related to ZFS and is rather a fundamental NFS issue.
Maybe a future NFS protocol will help.

Net net: if one finds a way to 'disable cache flushing' on the
storage side, then one reaches the state we'll be in, out of the box,
when b) is implemented by Solaris _and_ the storage vendor.  At that point,
ZFS becomes a fine NFS server not only on JBOD, as it is today, but also
on NVRAM-backed storage.


I will add a third category: the response time of individual requests.

One can think of the ssh stream of filesystem data as one large remote
procedure call that says "put this directory tree and contents on
the server."  The time it takes is essentially the time it takes
to transfer the filesystem data.  The latency of the final acknowledgment,
amortized across the entire stream, is effectively zero.

For the NFS client, there is response time injected at each request,
and the best way to amortize this is through parallelism, which is
very difficult for some applications.  Add the items in a) and b) and
there is a lot to deal with.  Not insurmountable, but it takes a little
more effort to build an effective solution.

Spencer



Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)

2007-04-21 Thread Spencer Shepler


On Apr 21, 2007, at 9:46 AM, Andy Lubel wrote:

So what you are saying is that if we were using NFSv4, things
should be dramatically better?


I certainly don't support this assertion (if it was being made).

NFSv4 does have some advantages from the perspective of enabling
more aggressive file data caching; that will enable NFSv4 to
outperform NFSv3 in some specific workloads.  In general, however,
NFSv4 performs similarly to NFSv3.

Spencer




do you think this applies to any NFS v4 client or only Suns?



-Original Message-
From: [EMAIL PROTECTED] on behalf of Erblichs
Sent: Sun 4/22/2007 4:50 AM
To: Leon Koll
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Re: ZFS+NFS on storedge 6120 (sun t4)

Leon Koll,

As a knowledgeable outsider I can say something.

The benchmark (SFS) page specifies NFSv3/v2 support, so I question
 whether you ran NFSv4. I would expect a major change in
 performance just from going to NFS version 4 and ZFS.

The benchmark seems to stress your configuration enough that
the latency to service NFS ops increases to the point of
non-serviced NFS requests. However, you don't know what the
byte count per IO op is. Reads are bottlenecked against the rtt of
the connection and writes are normally sub-1K with a later
commit. However, many ops are probably just file handle
verifications, which again are limited to your connection
rtt (round trip time). So, my initial guess is that the number
of NFS threads is somewhat related to the number of non-state
(v4 now has state) per-file-handle ops. Thus, if a 64k
ZFS block is being modified by 1 byte, COW would require a
64k byte read, 1 byte modify, and then allocation of another
64k block. So, for every write op, you COULD be writing a
full ZFS block.

This COW philosophy works best with extending delayed writes, etc.,
where later reads would make the trade-off of increased
latency of the larger block on a read op versus being able
to minimize the number of seeks on the write and read. Basically,
this increases the block size from, say, 8k to 64k. Thus, your
read latency goes up just to get the data off the disk
while minimizing the number of seeks, and dropping the read-ahead
logic for the needed 8k to 64k file offset.

I do NOT know that THAT 4000 IO OPS load would match your maximal
load and that your actual load would never increase past 2000 IO ops.
Secondly, jumping from 2000 to 4000 seems to be too big of a jump
for your environment. Going to 2500 or 3000 might be more
appropriate. Lastly wrt the benchmark, some remnants (NFS and/or ZFS
and/or benchmark) seem to remain that have a negative impact.

Lastly, my guess is that this NFS setup and the benchmark are stressing
small partial-block writes, and that is probably one of the worst case
scenarios for ZFS. So, my guess is the proper analogy is trying to
kill a gnat with a sledgehammer. Each write IO op really needs to be
equal to a full-size ZFS block to get the full benefit of ZFS on a per-byte
basis.

Mitchell Erblich
Sr Software Engineer
-





Leon Koll wrote:


 Welcome to the club, Andy...

 I tried several times to attract the attention of the community to
 the dramatic performance degradation (about 3 times) of the NFS/ZFS
 vs. NFS/UFS combination - without any result:
 [1] http://www.opensolaris.org/jive/thread.jspa?messageID=98592
 [2] http://www.opensolaris.org/jive/thread.jspa?threadID=24015

 Just look at the two graphs in my posting dated August 2006
 (http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html)
 to see how bad the situation was and, unfortunately, this situation
 hasn't changed much recently:
 http://photos1.blogger.com/blogger/7591/428/1600/sfs.1.png

 I don't think the storage array is the source of the problems you
 reported.  It's somewhere else...

 -- leon




Re: Re[2]: [zfs-discuss] Multi-tera, small-file filesystems

2007-04-18 Thread Spencer Shepler


On Apr 18, 2007, at 6:44 PM, Robert Milkowski wrote:


Hello Carson,

Thursday, April 19, 2007, 1:22:17 AM, you wrote:

CG Robert Milkowski wrote:


 We did some tests with Linux (2.4 and 2.6) and it seems there's a
 problem if you have thousands of NFS file systems - they won't all be
 mounted automatically, and even doing it manually (or in a script with
 a sleep between each mount) there seems to be a limit below 1000. We
 did not investigate further as in that environment all NFS clients are
 Solaris systems (x86, sparc) and we see no problems with thousands of
 file systems.


 CG The Linux limitation is possibly due to privileged port exhaustion with
 CG TCP mounts, FYI.


We've been thinking along the same lines (1024 minus some services already
running).

But still, with a few hundred NFS entries Linux times out and you end up
with some filesystems not mounted, etc.


See the Linux NFS FAQ at http://nfs.sourceforge.net/
Question/Answer B3.  There is a limit of a few hundred
NFS mounts.

Spencer



Re: [zfs-discuss] Re: Cluster File System Use Cases

2007-03-06 Thread Spencer Shepler


The pNFS protocol doesn't preclude varying meta-data server designs
and their various locking strategies.

As an example, there has been work going on at the University of Michigan/CITI
to extend the Linux/NFSv4 implementation to allow for a pNFS server on
top of the PolyServe solution.

Spencer

On Mar 5, 2007, at 2:37 PM, Rayson Ho wrote:


I read this paper on Sunday. Seems interesting:

The Architecture of PolyServe Matrix Server: Implementing a Symmetric
Cluster File System

http://www.polyserve.com/requestinfo_formq1.php?pdf=2

What interested me the most is that the metadata and lock are spread
across all the nodes. I read the Parallel NFS (pNFS) presentation,
and seems like pNFS still has the metadata on one server... (Lisa,
correct me if I am wrong).

http://opensolaris.org/os/community/os_user_groups/frosug/pNFS/ 
FROSUG-pNFS.pdf


Rayson


Re: [zfs-discuss] Why number of NFS threads jumps to the max value?

2007-03-05 Thread Spencer Shepler


On Mar 5, 2007, at 11:17 AM, Leon Koll wrote:


On 3/5/07, Roch - PAE [EMAIL PROTECTED] wrote:


Leon Koll writes:

  On 3/5/07, Roch - PAE [EMAIL PROTECTED] wrote:
  
   Leon Koll writes:
 On 2/28/07, Roch - PAE [EMAIL PROTECTED] wrote:
 
 
  http://bugs.opensolaris.org/bugdatabase/view_bug.do? 
bug_id=6467988

 
   NFSD threads are created on a demand spike (all of them
   waiting on I/O) but then tend to stick around servicing
   moderate loads.
 
  -r

 Hello Roch,
  It's not my case. NFS stops servicing requests after some point, and the
  reason is in ZFS. It never happens with NFS/UFS.
  Shortly, my scenario:
  1st SFS run, 2000 requested IOPS. NFS is fine, low number of threads.
  2nd SFS run, 4000 requested IOPS. NFS cannot serve all requests, number
  of threads jumps to max.
  3rd SFS run, 2000 requested IOPS. NFS cannot serve all requests, number
  of threads jumps to max.
  The system cannot get back to the same results under equal load (1st and 3rd).
  Reboot between the 2nd and 3rd doesn't help. The only persistent thing is
  the directory structure that was created during the 2nd run (in SFS,
  higher requested load means more directories/files created).
  I am sure it's a bug. I need help. I don't care that ZFS works N times
  worse than UFS. I really care that after heavy load everything is
  totally screwed.

 Thanks,
 -- Leon
  
   Hi Leon,
  
   How much is the slowdown between 1st and 3rd ? How filled is
 
  Typical case is:
  1st: 1996 IOPS, latency  2.7
  3rd: 1375 IOPS, latency 37.9
 

The large latency increase is the side effect of requesting
more than what can be delivered. Queue builds up and latency
follows. So it should not be the primary focus IMO. The
decrease in IOPS is the primary problem.

One hypothesis is that over the life of the FS we're moving
toward spreading access over the full disk platter. We can
imagine some fragmentation hitting as well. I'm not sure
how I'd test either hypothesis.

   the pool at each stage ? What does 'NFS stops to service'
   mean ?
 
  There is a lot of error messages on the NFS(SFS) client :
  sfs352: too many failed RPC calls - 416 good 27 bad
  sfs3132: too many failed RPC calls - 302 good 27 bad
  sfs3109: too many failed RPC calls - 533 good 31 bad
  sfs353: too many failed RPC calls - 301 good 28 bad
  sfs3144: too many failed RPC calls - 305 good 25 bad
  sfs3121: too many failed RPC calls - 311 good 30 bad
  sfs370: too many failed RPC calls - 315 good 27 bad
 

Can this be timing out or queue full drops ? Might be a side
effect of SFS requesting more than what can be delivered.


I don't know whether it was timeouts or full drops. SFS marked such runs as
INVALID.

I can run whatever is needed to help to investigate the problem. If
you have a D script that will tell us more, please send it to me.
I appreciate your help.


The failed RPCs are indeed a result of the SFS client timing out
the requests it has made to the server.  The server is being
overloaded beyond its capabilities and the benchmark results
show that.  I agree with Roch that as the SFS benchmark adds
more data to the filesystems, additional latency is added,
and that for this particular configuration the server is
being over-driven.

The helpful thing would be to run smaller increments in the
benchmark to determine where the response time increases
beyond what the SFS workload can handle.

There have been a number of changes in ZFS recently that should
help with SFS performance measurement but fundamentally it
all depends on the configuration of the server (number of spindles
and CPU available).  So there may be a limit that is being
reached based on the hardware configuration.

What is your real goal here, Leon?  Are you trying to gather SFS
data to feed into sizing of a particular solution, or just trying
to gather performance results for other general comparisons?
There are certainly better benchmarks than SFS for either
sizing or comparison purposes.

Spencer


Re: [zfs-discuss] suggestion: directory promotion to filesystem

2007-02-21 Thread Spencer Shepler


On Feb 21, 2007, at 12:11 PM, Matthew Ahrens wrote:


Adrian Saul wrote:

Not hard to work around - zfs create and a mv/tar command and it is
done... some time later.  If there was say  a zfs graft directory
 newfs command, you could just break off the directory as a new
filesystem and away you go - no copying, no risking cleaning up the
wrong files etc.


Yep, this idea was previously discussed on this list -- search for  
zfs split and see the following RFE:


6400399 want zfs split



Note that the current draft specification for NFSv4.1 has the capability
to split a filesystem such that the NFSv4.1 client will recognize it.
Then the new filesystem can be migrated to another server if needed.

Spencer



Re: [zfs-discuss] Honeycomb

2006-12-20 Thread Spencer Shepler
On Wed, Dennis wrote:
 Hello,
 
 I just wanted to know if there is any news regarding Project Honeycomb? 
 Wasn't it announced for the end of 2006? Is there still development?


http://www.sun.com/storagetek/honeycomb/



Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43

2006-12-09 Thread Spencer Shepler
On Fri, Ben Rockwood wrote:
 eric kustarz wrote:
 So i'm guessing there's lots of files being created over NFS in one 
 particular dataset?
 
 We should figure out how many creates/second you are doing over NFS (i 
 should have put a timeout on the script).  Here's a real simple one 
 (from your snoop it looked like you're only doing NFSv3, so i'm not 
 tracking NFSv4):
 
 #!/usr/sbin/dtrace -s
 
 rfs3_create:entry,
 zfs_create:entry
 {
 @creates[probefunc] = count();
 }
 
 tick-60s
 {
 exit(0);
 }
 
 
 
 Eric, I love you. 
 
 Running this bit of DTrace revealed more than 4,000 files being created 
 in almost any given 60 second window.  And I've only got one system that 
 would fit that sort of mass file creation: our Joyent Connector product's 
 Courier IMAP server, which uses Maildir.  As a test I simply shut down 
 Courier and unmounted the mail NFS share for good measure, and sure 
 enough the problem vanished and could not be reproduced.  10 minutes 
 later I re-enabled Courier and our problem came back. 
 
 Clearly ZFS file creation is just amazingly heavy even with ZIL 
 disabled.  If creating 4,000 files in a minute squashes 4 2.6GHz Opteron 
 cores, we're in big trouble in the longer term.  In the meantime I'm 
 going to find a new home for our IMAP mail so that the other things 
 served from that NFS server at least aren't affected.
 
 You asked for the zpool and zfs info, which I don't want to share 
 because it's confidential (if you want it privately I'll do so, but not 
 on a public list), but I will say that it's a single massive zpool in 
 which we're using less than 2% of the capacity.  But in thinking about 
 this problem, even if we used 2 or more pools, the CPU consumption still 
 would have choked the system, right?  This leaves me really nervous 
 about what we'll do when it's not an internal mail server that's creating 
 all those files but a customer. 
 
 Oddly enough, this might be a very good reason to use iSCSI instead of 
 NFS on the Thumper.
 
 Eric, I owe you a couple cases of beer for sure.  I can't tell you how 
 much I appreciate your help.  Thanks to everyone else who chimed in with 
 ideas and suggestions, all of you guys are the best!

Good to hear that you have figured out what is happening, Ben.

For future reference, there are two commands that you may want to
make use of in observing the behavior of the NFS server and individual
filesystems.

There is the trusty nfsstat command.  In this case, you would have been
able to do something like:
nfsstat -s -v3 60

This will provide all of the server side NFSv3 statistics on 60 second
intervals.  

Then there is a new command, fsstat, that will provide vnode-level
activity on a per-filesystem basis.  Therefore, if the NFS server
has multiple filesystems active and you want to look at just one,
something like this can be helpful:

fsstat /export/foo 60

Fsstat has a 'full' option that will list all of the vnode operations
or just certain types.  It also will watch a filesystem type (e.g. zfs, nfs).
Very useful.

Spencer


Re: [zfs-discuss] ZFS/iSCSI target integration

2006-11-02 Thread Spencer Shepler
On Thu, Darren J Moffat wrote:
 Ceri Davies wrote:
 For NFS, it's possible (but likely suboptimal) for clients to be
 configured to mount the filesystem from server A and fail over to
 server B, assuming that the pool import can happen quickly enough for
 them not to receive ENOENT.
 
 IIRC NFS client side failover is really only intended for read-only 
 mounts.  I can't remember, though, whether this is enforced or not.

NFS client side failover is for read-only exports.  No way to strictly
enforce since the NFSv2/v3 protocols don't have support.  The client
attempts to ensure that active files look the same when failing
over.

NFSv4 has migration support such that a filesystem can move between
servers, but the administrative model is for just that: migration, not
server failover.
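For completeness, the read-only client-side failover referred to above uses a replica list on the mount; it looks roughly like this (server and path names are only illustrative):

 # mount -o ro serv-a:/export/share,serv-b:/export/share /mnt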

Spencer



Re: [zfs-discuss] ZFS/iSCSI target integration

2006-11-01 Thread Spencer Shepler
On Wed, Adam Leventhal wrote:
 On Wed, Nov 01, 2006 at 01:17:02PM -0500, Torrey McMahon wrote:
  Is there going to be a method to override that on the import? I can see 
  a situation where you want to import the pool for some kind of 
  maintenance procedure but you don't want the iSCSI target to fire up 
  automagically.
 
 There isn't -- to my knowledge -- a way to do this today for NFS shares.
 This would be a reasonable RFE that would apply to both NFS and iSCSI.

In the case of NFS, this can be dangerous if the rest of the NFS
server is allowed to come up and serve other filesystems.  The non-shared
filesystem will end up returning ESTALE errors to clients that are
active on that filesystem.  It should be an all or nothing selection...

Spencer

 
  Also, what if I don't have the iSCSI target packages on the node I'm 
  importing to? Error messages? Nothing?
 
 You'll get an error message reporting that it could not be shared.
 
 Adam
 
 -- 
 Adam Leventhal, Solaris Kernel Development   http://blogs.sun.com/ahl


Re: [zfs-discuss] Re: ZFS ACLs and Samba

2006-10-26 Thread Spencer Shepler
On Thu, Joerg Schilling wrote:
 Spencer Shepler [EMAIL PROTECTED] wrote:
 
  On Wed, Jonathan Edwards wrote:
   
   On Oct 25, 2006, at 15:38, Roger Ripley wrote:
   
   IBM has contributed code for NFSv4 ACLs under AIX's JFS; hopefully  
   Sun will not tarry in following their lead for ZFS.
   
   http://lists.samba.org/archive/samba-cvs/2006-September/070855.html
   
   I thought this was still in draft:
   http://ietf.org/internet-drafts/draft-ietf-nfsv4-acl-mapping-05.txt
 
  That I-D describes the Posix/NFSv4 mapping that can be done.
 
  NFSv4 ACLs to/from Samba/NT ACLs are a different story; no interdependency.
 
 NFSv4 ACLs are bitwise identical to Windows NT ACLs; could you please explain
 why there is a difference for Samba?

One known difference between NFSv4 ACLs and NT ACLs is information
about how ACEs were populated via inheritance.  There is a proposal in
the NFSv4 WG at the moment to add this functionality to NFSv4.1.

Spencer



Re: [zfs-discuss] Re: ZFS ACLs and Samba

2006-10-25 Thread Spencer Shepler
On Wed, Jonathan Edwards wrote:
 
 On Oct 25, 2006, at 15:38, Roger Ripley wrote:
 
 IBM has contributed code for NFSv4 ACLs under AIX's JFS; hopefully  
 Sun will not tarry in following their lead for ZFS.
 
 http://lists.samba.org/archive/samba-cvs/2006-September/070855.html
 
 I thought this was still in draft:
 http://ietf.org/internet-drafts/draft-ietf-nfsv4-acl-mapping-05.txt

That I-D describes the Posix/NFSv4 mapping that can be done.

NFSv4 ACLs to/from Samba/NT ACLs are a different story; no interdependency.

Spencer



Re: [nfs-discuss] Re: [zfs-discuss] Re: NFS Performance and Tar

2006-10-13 Thread Spencer Shepler
On Fri, Joerg Schilling wrote:
 Spencer Shepler [EMAIL PROTECTED] wrote:
 
  I didn't comment on the error conditions that can occur during
  the writing of data upon close().  What you describe is the preferred
  method of obtaining any errors that occur during the writing of data.
  This occurs because the NFS client is writing asynchronously and the
  only method the application has of retrieving the error information
  is from the fsync() or close() call.  At close(), it is too late
  to recover, so fsync() can be used to obtain any asynchronous error
  state.
 
  This doesn't change the fact that upon close() the NFS client will
  write data back to the server.  This is done to meet the
  close-to-open semantics of NFS.
 
  Your wording did not match reality; this is why I wrote this.
  You wrote that upon close() the client will first do something similar to
  fsync on that file. The problem is that this is done asynchronously and the
  close() return value does not contain an indication of whether the fsync
  succeeded.

Sorry, the code in Solaris would behave as I described.  Upon the 
application closing the file, modified data is written to the server.
The client waits for completion of those writes.  If there is an error,
it is returned to the caller of close().

Spencer



Re: [nfs-discuss] Re: [zfs-discuss] Re: NFS Performance and Tar

2006-10-13 Thread Spencer Shepler
On Fri, Joerg Schilling wrote:
 Spencer Shepler [EMAIL PROTECTED] wrote:
 
  Sorry, the code in Solaris would behave as I described.  Upon the 
  application closing the file, modified data is written to the server.
  The client waits for completion of those writes.  If there is an error,
  it is returned to the caller of close().
 
  So is this Solaris specific, or why are people warned against depending on
  the close() return code only?

All unix NFS clients that I know of behave the way I described.

I believe the warning about relying on close() is that by the time
the application receives the error it is too late to recover.

If the application uses fsync() and receives an error, the application
can warn the user and they may be able to do something about it (your
example of ENOSPC is a very good one).  Space can be freed, and the
fsync() can be done again and the client will again push the writes
to the server and be successful.

If an application doesn't care about recovery but wants the error to
report back to the user, then close() is sufficient.

Spencer



Re: [nfs-discuss] Re: [zfs-discuss] Re: NFS Performance and Tar

2006-10-12 Thread Spencer Shepler
On Thu, Joerg Schilling wrote:
 Spencer Shepler [EMAIL PROTECTED] wrote:
 
  On Thu, Joerg Schilling wrote:
   Spencer Shepler [EMAIL PROTECTED] wrote:
   
The close-to-open behavior of NFS clients is what ensures that the
file data is on stable storage when close() returns.
   
   In the 1980s this was definitely not the case. When did this change?
 
   It has not.  NFS clients have always flushed (written) modified file data
   to the server before returning from the application's close().  The NFS
   client also asks that the data be committed to disk in this case.
 
  This is definitely wrong.
  
  Our developers lost many files in the 1980s when the NFS file server
  filled up the exported filesystem while several NFS clients tried to
  write back edited files at the same time.
  
  VI at that time did not call fsync and for this reason did not notice that
  the file could not be written back properly.
  
  What happened: all clients called statfs() and assumed that there was
  still space on the server. They all allowed blocks to be put into the local
  client's buffer cache. VI called close, but the client only noticed the
  no-space problem after the close returned, so VI did not notice that the
  file was damaged and allowed the user to quit VI.
  
  Some time later, Sun enhanced VI to first call fsync() and then call
  close(). Only if both return 0 is the file guaranteed to be on the server.
  Sun also told us to write applications this way in order to prevent
  lost file content.

I didn't comment on the error conditions that can occur during
the writing of data upon close().  What you describe is the preferred
method of obtaining any errors that occur during the writing of data.
This occurs because the NFS client is writing asynchronously and the
only method the application has of retrieving the error information
is from the fsync() or close() call.  At close(), it is too late
to recover, so fsync() can be used to obtain any asynchronous error
state.

This doesn't change the fact that upon close() the NFS client will
write data back to the server.  This is done to meet the
close-to-open semantics of NFS.


Having tar create/write/close files concurrently would be a 
big win over NFS mounts on almost any system.
   
   Do you have an idea on how to do this?
 
  My naive thought would be to have multiple threads that create and
  write file data upon extraction.  This multithreaded behavior would
  provide better overall throughput of an extraction given NFS' response
  time characteristics.  More outstanding requests results in better
  throughput.  It isn't only the file data being written to disk that
  is the overhead of the extraction, it is the creation of the directories
  and files that must also be committed to disk in the case of NFS.
  This is the other part that makes things slower than local access.
 
 Doing this with tar (which fetches the data from a serial data stream)
 would only make sense in case that there will be threads that only have the 
 task
 to wait for a final fsync()/close().
 
 It would also make it harder to implement error control as it may be that 
 a problem is detected late while another large file is being extracted.
 Star could not just quit with an error message but would need to delay the
 error caused exit.

Sure, I can see that it would be difficult.  My point is that tar is
not only waiting upon the fsync()/close() but also on file and directory
creation.  There is a longer delay not only because of the network
latency but also because of the latency of writing the filesystem data to
stable storage.  Parallel requests will tend to overcome the delay/bandwidth
issues.  Not easy but can be an advantage with respect to performance.

Spencer


Re: [zfs-discuss] Re: NFS Performance and Tar

2006-10-03 Thread Spencer Shepler
On Tue, eric kustarz wrote:
 Ben Rockwood wrote:
 I was really hoping for some option other than ZIL_DISABLE, but finally 
 gave up the fight.  Some people suggested NFSv4 helping over NFSv3 but it 
 didn't... at least not enough to matter.
 
 ZIL_DISABLE was the solution, sadly.  I'm running B43/X86 and hoping to 
 get up to 48 or so soonish (I BFU'd it straight to B48 last night and 
 brick'ed it).
 
 Here are the times.  This is an untar (gtar xfj) of SIDEkick 
 (http://www.cuddletech.com/blog/pivot/entry.php?id=491) on NFSv4 on a 20TB 
 RAIDZ2 ZFS Pool:
 
 ZIL Enabled:
 real1m26.941s
 
 ZIL Disabled:
 real0m5.789s
 
 
 I'll update this post again when I finally get B48 or newer on the system 
 and try it.  Thanks to everyone for their suggestions.
 
 
 I imagine what's happening is that tar is a single-threaded application
 and it's basically doing: open, asynchronous write, close.  This will go
 really fast locally.  But for NFS, due to the way it does cache
 consistency, on CLOSE it must make sure that the writes are on stable
 storage, so it does a COMMIT, which basically turns your asynchronous
 write into a synchronous write.  Which means you basically have a
 single-threaded app doing synchronous writes, at roughly 1/2 disk
 rotational latency per write.
 
 Check out 'mount_nfs(1M)' and the 'nocto' option.  It might be ok for
 you to relax the cache consistency for the client's mount as you untar
 the file(s).  Then remount without the 'nocto' option once you're done.

This will not correct the problem because tar is extracting and therefore
creating files and directories; those creates will be synchronous at
the NFS server and there is no method to change this behavior at the
client.
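
To get a feel for the cost, a small sketch like the following can time a
single-threaded create/write/close loop on an NFS mount versus a local
filesystem; each NFS iteration pays at least one round trip plus the
server's write to stable storage (the target directory and file count
are placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define	NFILES	100

int
main(void)
{
	struct timeval start, end;
	char buf[4096] = { 0 };
	char path[128];
	int i;

	(void) gettimeofday(&start, NULL);
	for (i = 0; i < NFILES; i++) {
		int fd;

		(void) snprintf(path, sizeof (path), "/mnt/test/f.%d", i);
		if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1) {
			perror("open");
			return (1);
		}
		(void) write(fd, buf, sizeof (buf));
		/* Over NFS, close() waits for the data to reach the server. */
		(void) close(fd);
	}
	(void) gettimeofday(&end, NULL);
	(void) printf("%d files in %.3f seconds\n", NFILES,
	    (end.tv_sec - start.tv_sec) +
	    (end.tv_usec - start.tv_usec) / 1e6);
	return (0);
}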

Spencer

 
 Another option is to run multiple untars together.  I'm guessing that 
 you've got I/O to spare from ZFS's point of view.
 
 eric
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Lots of seeks?

2006-08-08 Thread Spencer Shepler
On Tue, Anton B. Rang wrote:
 So while I'm feeling optimistic :-) we really ought to be able to do this in 
 two I/O operations. If we have, say, 500K of data to write (including all of 
 the metadata), we should be able to allocate a contiguous 500K block on disk 
 and write that with a single operation. Then we update the überblock.
 
 The only inherent problem preventing this right now is that we don't have 
 general scatter/gather at the driver level (ugh). 

Fixing this bug would help the NFS server significantly, given that
incoming write data is generally not contiguous (it is split at mblk
boundaries).

Spencer
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SPEC SFS97 benchmark of ZFS,UFS,VxFS

2006-08-07 Thread Spencer Shepler
On Mon, Leon Koll wrote:
 I performed a SPEC SFS97 benchmark on Solaris 10u2/Sparc with 4 64GB
 LUNs, connected via FC SAN.
 The filesystems that were created on the LUNs: UFS, VxFS, ZFS.
 Unfortunately the ZFS test couldn't complete because the box hung
 under very moderate load (3000 IOPS).
 Additional tests were done using UFS and VxFS that were built on ZFS
 raw devices (Zvolumes).
 Results can be seen here:
 http://napobo3.blogspot.com/2006/08/spec-sfs-bencmark-of-zfsufsvxfs.html

Leon,

Might I suggest that you provide the details as specified in the
SPEC SFS run and reporting rules?  They can be buried in a link
from your blog but it would be helpful to have that information
available to your readers.

Spencer
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS iSCSI: where do do the mirroring/raidz

2006-08-02 Thread Spencer Shepler
On Wed, Darren J Moffat wrote:
 I have 12 36G disks (in a single D2 enclosure) connected to a V880 that 
 I want to share to a v40z that is on the same gigabit network switch.
 I've already decided that NFS is not the answer - the performance of ON 
 consolidation builds over NFS just doesn't cut it for me.

?

With a locally attached 3510 array on a 4-way v40z, I have been 
able to do a full nightly build in 1 hour 7 minutes.  
With NFSv3 access, from the same system, to a couple of 
different NFS servers, I have been able to achieve 1 hour 15 minutes 
in one case and 1 hour 22 minutes in the other.

Is that too slow?

Spencer

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wrong reported free space over NFS

2006-06-09 Thread Spencer Shepler
On Fri, Eric Schrock wrote:
 On Thu, Jun 08, 2006 at 10:53:06PM -0500, Spencer Shepler wrote:
  
  On Thu, Eric Schrock wrote:
   The problem is that statvfs() only returns two values (total blocks and
   free blocks) from which we have to calculate three values: size, free,
  
  ?
  
  From statvfs(2) the following are returned in struct statvfs:
  
   fsblkcnt_t  f_blocks;/* total # of blocks on file system
   in units of f_frsize */
   fsblkcnt_t  f_bfree; /* total # of free blocks */
   fsblkcnt_t  f_bavail;/* # of free blocks avail to
  
  So, the data is being passed back.  Is there something I am missing?
 
 Yes, because these values aren't as straightforward as they seem.  For
 example, consider the return values from UFS:
 
 $ truss -t statvfs -v statvfs df -h /   
 statvfs64(/, 0x080479BC)  = 0
 bsize=8192       frsize=1024     blocks=8068757  bfree=2258725
 bavail=2178038   files=972608    ffree=809612    favail=809612
 fsid=0x198       basetype=ufs    namemax=255
 flag=ST_NOTRUNC
 fstr=
 Filesystem             size   used  avail capacity  Mounted on
 /dev/dsk/c1d0s0        7.7G   5.5G   2.1G    73%    /
 $
 
 Notice that the values don't correspond to your assumption.  In
 particular, 'bfree + bavail != blocks'.  The two values for 'bfree' and
 'bavail' are used for filesystems that have a notion of 'reserved'
 blocks, i.e. metadata blocks which are used by the filesystem but not
 available to the user in the form of free space.  That's why you have
 two values, and if you look at the source code for df(1), you'll see
 that it never uses 'bfree' (except in rare internal calculations)
 because it's basically useless.

I must have been half asleep when looking at this; thanks for the clue bat.

Spencer
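
For reference, a small sketch of how the three df columns can be derived
from the statvfs fields in the truss output quoted above (assuming the
usual df(1) arithmetic; the path is a placeholder):

#include <stdio.h>
#include <sys/statvfs.h>

int
main(void)
{
	struct statvfs vfs;

	if (statvfs("/", &vfs) == -1) {
		perror("statvfs");
		return (1);
	}
	/*
	 * df(1)-style numbers, in units of f_frsize:
	 *   size  = f_blocks
	 *   used  = f_blocks - f_bfree
	 *   avail = f_bavail   (excludes blocks reserved by the filesystem)
	 * Note that used + avail need not add up to size.
	 */
	(void) printf("size  %llu KB\n",
	    (unsigned long long)vfs.f_blocks * vfs.f_frsize / 1024);
	(void) printf("used  %llu KB\n",
	    (unsigned long long)(vfs.f_blocks - vfs.f_bfree) *
	    vfs.f_frsize / 1024);
	(void) printf("avail %llu KB\n",
	    (unsigned long long)vfs.f_bavail * vfs.f_frsize / 1024);
	return (0);
}

Plugging in the numbers quoted above (frsize=1024, blocks=8068757,
bfree=2258725, bavail=2178038) gives roughly the 7.7G size, 5.5G used
and 2.1G available that df reported.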
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wrong reported free space over NFS

2006-06-08 Thread Spencer Shepler

On Thu, Eric Schrock wrote:
 The problem is that statvfs() only returns two values (total blocks and
 free blocks) from which we have to calculate three values: size, free,

?

From statvfs(2) the following are returned in struct statvfs:

 fsblkcnt_t  f_blocks;/* total # of blocks on file system
 in units of f_frsize */
 fsblkcnt_t  f_bfree; /* total # of free blocks */
 fsblkcnt_t  f_bavail;/* # of free blocks avail to

So, the data is being passed back.  Is there something I am missing?

 and available space. Prior to pooled storage, available = size - free.
 This isn't true with ZFS.  On your local filesystem, df(1) recognizes it
 as a ZFS filesystem, and uses libzfs to get the real amount of available
 space.  Over NFS, we have no choice but to stick with POSIX semantics,
 which means that we can never provide you with the right answer.  For
 implementation details, check out adjust_total_blocks() in
 usr/src/cmd/fs.d/df.c.

So, from the comments, that bit of df code seems to be adjusting
for quotas if they exist?  I am not sure I understand why
ZFS's VFS_STATVFS() function can't do what the df command is doing
and then return the appropriate value to both df and the NFS server.

So, in Robert's case, is that 17GB really available?  If so, that
would seem to be an important thing to report to the NFS clients.

Spencer


 On Thu, Jun 08, 2006 at 04:38:57PM -0700, Robert Milkowski wrote:
  NFS server (b39):
  
  bash-3.00# zfs get quota nfs-s5-s8/d5201 nfs-s5-p0/d5110
  NAME PROPERTY   VALUE  SOURCE
  nfs-s5-p0/d5110  quota  600G   local
  nfs-s5-s8/d5201  quota  600G   local
  bash-3.00#
  bash-3.00# df -h | egrep d5201|d5110
  nfs-s5-p0/d5110        600G   527G    73G    88%    /nfs-s5-p0/d5110
  nfs-s5-s8/d5201        600G   314G   269G    54%    /nfs-s5-s8/d5201
  bash-3.00#
  
  
  NFS client (S10U1 + patches, NFSv3 mount over TCP):
  
  bash-3.00# df -h | egrep d5201|d5110
  NFS-srv:/nfs-s5-p0/d5110   600G   527G    73G    88%    /opt/d5110
  NFS-srv:/nfs-s5-s8/d5201   583G   314G   269G    54%    /opt/d5201
  bash-3.00#
  
  
  Well, why do I get a 583GB size for d5201 on the NFS client?
  
  ps. maybe I'm tired and missing something really obvious...?
   
   
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 --
 Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss