Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-24 Thread Pramod Batni

 I would greatly appreciate it if you could open the bug, I don't have an
 opensolaris bugzilla account yet and you'd probably put better technical
 details in it anyway :). If you do, could you please let me know the bug#
 so I can refer to it once S10U6 is out and I confirm it has the same
 behavior?
   

   6763592 creating zfs filesystems gets slower as the number of zfs filesystems increase

Pramod

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-23 Thread Pramod Batni



On 10/23/08 08:19, Paul B. Henson wrote:

On Tue, 21 Oct 2008, Pramod Batni wrote:

  

Why does creating a new ZFS filesystem require enumerating all existing
ones?
  

  This is to determine if any of the filesystems in the dataset are mounted.



Ok, that leads to another question, why does creating a new ZFS filesystem
require determining if any of the existing filesystems in the dataset are
mounted :)? I could see checking the parent filesystems, but why the
siblings?

  

 I am not sure.
 All the checking is done as part of libshare's sa_init(), which calls into
sa_get_zfs_shares().



In any case a bug can be filed on this.



Should I open a Sun support call to request such a bug? I guess I should
wait until U6 is released, since I don't have support for SXCE...
  
 You could do that, or I can open a bug for you citing the Nevada build [b97]
you are using.


Pramod

Thanks...


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-23 Thread Paul B. Henson
On Thu, 23 Oct 2008, Pramod Batni wrote:

 On 10/23/08 08:19, Paul B. Henson wrote:
 
  Ok, that leads to another question, why does creating a new ZFS filesystem
  require determining if any of the existing filesystems in the dataset are
  mounted :)?

 I am not sure. All the checking is done as part of libshare's sa_init(),
 which calls into sa_get_zfs_shares().

It does make a big difference whether or not sharenfs is enabled. I haven't
finished my testing, but at 5000 filesystems it takes about 30 seconds to
create a new filesystem and over 30 minutes to reboot if they are shared,
but only 7 seconds to create a filesystem and about 15 minutes to reboot if
they are not.
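
If the share setup turns out to be the dominant cost on U6 as well, one thing
that might be worth timing (the dataset name below is only a placeholder, and
I have not verified that this actually skips the libshare initialization on
every build) is creating the filesystem with sharing disabled and re-enabling
it afterwards:

# time zfs create -o sharenfs=off export/user/testuser
# zfs inherit sharenfs export/user/testuser

Comparing that against a plain 'zfs create' should show how much of the 30
seconds is really the NFS share handling rather than the create itself.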

 You could do that, or I can open a bug for you citing the Nevada build [b97]
 you are using.

I would greatly appreciate it if you could open the bug, I don't have an
opensolaris bugzilla account yet and you'd probably put better technical
details in it anyway :). If you do, could you please let me know the bug#
so I can refer to it once S10U6 is out and I confirm it has the same
behavior?

Thanks much...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-22 Thread Paul B. Henson
On Tue, 21 Oct 2008, Pramod Batni wrote:

  Why does creating a new ZFS filesystem require enumerating all existing
  ones?

   This is to determine if any of the filesystems in the dataset are mounted.

Ok, that leads to another question, why does creating a new ZFS filesystem
require determining if any of the existing filesystems in the dataset are
mounted :)? I could see checking the parent filesystems, but why the
siblings?

 In any case a bug can be filed on this.

Should I open a Sun support call to request such a bug? I guess I should
wait until U6 is released, since I don't have support for SXCE...

Thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-21 Thread Pramod Batni



On 10/21/08 04:52, Paul B. Henson wrote:

On Mon, 20 Oct 2008, Pramod Batni wrote:

  

Yes, the implementation of the above ioctl walks the list of mounted
filesystems 'vfslist' [in this case it walks 5000 nodes of a linked list
before the ioctl returns] This in-kernel traversal of the filesystems is
taking time.



Hmm, O(n) :(... I guess that is the implementation of getmntent(3C)?
  
  In fact the problem is that 'zfs create' calls the ioctl way too many times.

  getmntent(3C) issues a single ioctl(MNTIOC_GETMNTENT).

Why does creating a new ZFS filesystem require enumerating all existing
ones?
  

 This is to determine if any of the filesystems in the dataset are mounted.
 The ioctl calls are coming from:

 libc.so.1`ioctl+0x8
 libc.so.1`getmntany+0x200
 libzfs.so.1`is_mounted+0x60
 libshare.so.1`sa_get_zfs_shares+0x118
 libshare.so.1`sa_init+0x330
 libzfs.so.1`zfs_init_libshare+0xac
 libzfs.so.1`zfs_share_proto+0x4c
 zfs`zfs_do_create+0x608
 zfs`main+0x2b0
 zfs`_start+0x108

  zfs_init_libshare is walking through a list of filesystems and determining
if each of them is mounted. I think there can be a better way to do this
rather than doing an is_mounted() check on each of the filesystems. In any
case a bug can be filed on this.
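
  For anyone who wants to confirm this call path on their own build, a rough
way (assuming DTrace is available; the dataset name is just an example, and
the probe will slow the create down even further) is to aggregate the ioctls
issued by 'zfs create' by user stack:

# dtrace -n 'syscall::ioctl:entry /pid == $target/ { @[ustack()] = count(); }' \
    -c 'zfs create export/user/dtracetest'

  The hottest stack should be the getmntany()/is_mounted() path under
sa_get_zfs_shares() shown above.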

Pramod
  

You could set 'zfs set mountpoint=none pool-name' and then create the
filesystems under pool-name. [In my experiments the number of ioctls went
down drastically.] You could then set a mountpoint for the pool and then
issue a 'zfs mount -a'.



That would work for an initial mass creation, but we are going to need to
create and delete fairly large numbers of file systems over time, so this
workaround would not help for that.


  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-20 Thread Pramod Batni


Paul B. Henson wrote:
 [snip]

 At about 5000 filesystems, it starts taking over 30 seconds to
 create/delete additional filesystems.

 At 7848, over a minute:

 # time zfs create export/user/test

 real    1m22.950s
 user    1m12.268s
 sys     0m10.184s

 I did a little experiment with truss:

 # truss -c zfs create export/user/test2

 syscall               seconds     calls  errors
 _exit                    .000         1
 read                     .004       892
 open                     .023        67       2
 close                    .001        80
 brk                      .006       653
 getpid                   .037      8598
 mount                    .006         1
 sysi86                   .000         1
 ioctl                 115.534  31303678    7920
 execve                   .000         1
 fcntl                    .000        18
 openat                   .000         2
 mkdir                    .000         1
 getppriv                 .000         1
 getprivimplinfo          .000         1
 issetugid                .000         4
 sigaction                .000         1
 sigfillset               .000         1
 getcontext               .000         1
 setustack                .000         1
 mmap                     .000        78
 munmap                   .000        28
 xstat                    .000        65      21
 lxstat                   .000         1       1
 getrlimit                .000         1
 memcntl                  .000        16
 sysconfig                .000         5
 lwp_sigmask              .000         2
 lwp_private              .000         1
 llseek                   .084     15819
 door_info                .000        13
 door_call                .103      8391
 schedctl                 .000         1
 resolvepath              .000        19
 getdents64               .000         4
 stat64                   .000         3
 fstat64                  .000        98
 zone_getattr             .000         1
 zone_lookup              .000         2
                       -------  --------  ------
 sys totals:           115.804  31338551    7944
 usr time:             107.174
 elapsed:              897.670


 and it seems the majority of time is spent in ioctl calls, specifically:

 ioctl(16, MNTIOC_GETMNTENT, 0x08045A60) = 0
   

Yes, the implementation of the above ioctl walks the list of mounted
filesystems 'vfslist' [in this case it walks 5000 nodes of a linked list
before the ioctl returns]. This in-kernel traversal of the filesystems is
taking time.
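
A quick way to get a feel for how much work each of those MNTIOC_GETMNTENT
passes has to do is simply the size of the mount table, since every
is_mounted() check effectively rescans it from the beginning:

# wc -l /etc/mnttab
# grep -c zfs /etc/mnttab

With several thousand mounted filesystems, several thousand is_mounted()
checks each rescanning several thousand mnttab entries multiplies out to the
tens of millions of ioctl calls in the truss summary above.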
 Interestingly, I tested creating 6 filesystems simultaneously, which took a
 total of only three minutes, rather than 9 minutes had they been created
 sequentially. I'm not sure how parallelizable I can make an identity
 management provisioning system though.

 Was I mistaken about the increased scalability that was going to be
 available? Is there anything I could configure differently to improve this
 performance? We are going to need about 30,000 filesystems to cover our

You could set 'zfs set mountpoint=none pool-name' and then create the
filesystems under pool-name. [In my experiments the number of ioctls went
down drastically.] You could then set a mountpoint for the pool and then
issue a 'zfs mount -a'.
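
As a concrete sketch of that sequence (pool and dataset names are only
placeholders, and note that changing the mountpoint on a live pool will
unmount everything under it, so this is really only practical for an initial
bulk load):

# zfs set mountpoint=none export
# for u in user1 user2 user3; do zfs create export/user/$u; done
# zfs set mountpoint=/export export
# zfs mount -a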

Pramod
 faculty, staff, students, and group project directories. We do have 5
 x4500's which will be allocated to the task, so about 6000 filesystems per
 server. Depending on what time of the quarter it is, our identity management
 system can create hundreds to thousands of accounts, and when we purge accounts
 quarterly we typically delete 10,000 or so. Currently those jobs only take
 2-6 hours, with this level of performance from ZFS they would take days if
 not over a week :(.

 Thanks for any suggestions. What is the internal recommendation on maximum
 number of file systems per server?


   
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-20 Thread Paul B. Henson
On Mon, 20 Oct 2008, Pramod Batni wrote:

 Yes, the implementation of the above ioctl walks the list of mounted
 filesystems 'vfslist' [in this case it walks 5000 nodes of a linked list
 before the ioctl returns]. This in-kernel traversal of the filesystems is
 taking time.

Hmm, O(n) :(... I guess that is the implementation of getmntent(3C)?

Why does creating a new ZFS filesystem require enumerating all existing
ones?

 You could set 'zfs set mountpoint=none pool-name' and then create the
 filesystems under pool-name. [In my experiments the number of ioctls went
 down drastically.] You could then set a mountpoint for the pool and then
 issue a 'zfs mount -a'.

That would work for an initial mass creation, but we are going to need to
create and delete fairly large numbers of file systems over time, so this
workaround would not help for that.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-19 Thread Paul B. Henson

I originally started testing a prototype for an enterprise file service
implementation on our campus using S10U4. Scalability in terms of file
system count was pretty bad, anything over a couple of thousand and
operations started taking way too long.

I had thought there were a number of improvements/enhancements that had
been made since then to improve performance and scalability when a large
number of file systems exist. I've been testing with SXCE (b97) which
presumably has all of the enhancements (and potentially then some) that
will be available in U6, and I'm still seeing very poor scalability once
more than a few thousand filesystems are created.

I have a test install on an x4500 with two TB disks as a ZFS root pool, 44
TB disks configured as mirror pairs belonging to one zpool, and the last
two TB disks as hot spares.

At about 5000 filesystems, it starts taking over 30 seconds to
create/delete additional filesystems.

At 7848, over a minute:

# time zfs create export/user/test

real    1m22.950s
user    1m12.268s
sys     0m10.184s

I did a little experiment with truss:

# truss -c zfs create export/user/test2

syscall               seconds     calls  errors
_exit                    .000         1
read                     .004       892
open                     .023        67       2
close                    .001        80
brk                      .006       653
getpid                   .037      8598
mount                    .006         1
sysi86                   .000         1
ioctl                 115.534  31303678    7920
execve                   .000         1
fcntl                    .000        18
openat                   .000         2
mkdir                    .000         1
getppriv                 .000         1
getprivimplinfo          .000         1
issetugid                .000         4
sigaction                .000         1
sigfillset               .000         1
getcontext               .000         1
setustack                .000         1
mmap                     .000        78
munmap                   .000        28
xstat                    .000        65      21
lxstat                   .000         1       1
getrlimit                .000         1
memcntl                  .000        16
sysconfig                .000         5
lwp_sigmask              .000         2
lwp_private              .000         1
llseek                   .084     15819
door_info                .000        13
door_call                .103      8391
schedctl                 .000         1
resolvepath              .000        19
getdents64               .000         4
stat64                   .000         3
fstat64                  .000        98
zone_getattr             .000         1
zone_lookup              .000         2
                      -------  --------  ------
sys totals:           115.804  31338551    7944
usr time:             107.174
elapsed:              897.670


and it seems the majority of time is spent in ioctl calls, specifically:

ioctl(16, MNTIOC_GETMNTENT, 0x08045A60) = 0

Interestingly, I tested creating 6 filesystems simultaneously, which took a
total of only three minutes, rather than 9 minutes had they been created
sequentially. I'm not sure how parallelizable I can make an identity
management provisioning system though.
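
A minimal sketch of creating a batch in parallel from the shell (the
usernames are only placeholders, this is just an illustration):

#!/bin/sh
# create a batch of filesystems in parallel, then wait for all of them
for u in user1 user2 user3 user4 user5 user6; do
        zfs create export/user/$u &
done
wait

Since the time is mostly userland CPU (user 1m12 of the 1m23 above) rather
than disk I/O, parallel creates can overlap on separate cores, which
presumably explains the 3-minute vs. 9-minute difference.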

Was I mistaken about the increased scalability that was going to be
available? Is there anything I could configure differently to improve this
performance? We are going to need about 30,000 filesystems to cover our
faculty, staff, students, and group project directories. We do have 5
x4500's which will be allocated to the task, so about 6000 filesystems per
server. Depending on what time of the quarter it is, our identity management
system can create hundreds to thousands of accounts, and when we purge accounts
quarterly we typically delete 10,000 or so. Currently those jobs only take
2-6 hours, with this level of performance from ZFS they would take days if
not over a week :(.

Thanks for any suggestions. What is the internal recommendation on maximum
number of file systems per server?


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS scalability in terms of file system count (or lack thereof) in S10U6

2008-10-19 Thread Ed Plese
On Sun, Oct 19, 2008 at 4:08 PM, Paul B. Henson [EMAIL PROTECTED] wrote:
 At about 5000 filesystems, it starts taking over 30 seconds to
 create/delete additional filesystems.

The biggest problem I ran into was the boot time, specifically when
zfs volinit is executing.  With ~3500 filesystems on S10U3 the boot
time for our X4500 was around 40 minutes.  Any idea what your boot
time is like with that many filesystems on the newer releases?
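
For comparing releases, a rough proxy that is easy to capture is to time the
mount and share passes by hand right after importing the pool on a test box
(on a running system everything is already mounted and shared, so the
commands would have nothing left to do), and to note when the relevant
services come up after a boot:

# time zfs mount -a
# time zfs share -a
# svcs -o stime,fmri filesystem/local nfs/server

That would at least separate the mount/share portion of the boot time from
everything else.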


Ed Plese
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss