[zfs-discuss] live upgrade with lots of zfs filesystems -- still broken

2010-10-19 Thread Paul B. Henson


A bit over a year ago I posted about a problem I was having with live 
upgrade on a system with lots of file systems mounted:


http://opensolaris.org/jive/thread.jspa?messageID=411137#411137

An official Sun support call was basically just closed with no 
resolution. I was quite fortunate that Jens Elkner had made a workaround 
available which made live upgrade actually usable for my deployment 
(thanks again, Jens!). I would have been pretty screwed without it.


While still not exactly speedy, with the workaround in place live 
upgrade was fairly usable, and we've been using it for installing 
patches and upgrading to update releases with no problems.


Until now. Unfortunately, after installing the latest live upgrade 
patches on my existing U8 system in preparation for upgrading to U9, 
live upgrade has become even less usable than when I initially tried it 
without the workaround in place. While creating a new BE was still 
reasonably quick, mounting it took over *six* hours to complete 8-/.


Whereas before most of the time was spent mounting/unmounting all the 
filesystems (resolved by Jens' patch), now the majority of the six hours 
was spent spinning in 
/etc/lib/lu/plugins/lupi_bebasic. I don't know exactly what it was doing 
(as the source code to live upgrade does not appear to be available), 
but for most of the six hours it seems it was comparing strings:


# pstack 1670
1670:   /etc/lib/lu/plugins/lupi_bebasic plugin
 fee05973 strcmp   (8046474, 8046478) + 1c3
 fef6ae45 lu_smlGetTagByName (806920c, 16ef, fefa0f30) + 74
 fef71717 lu_tsfSearchFields (806920c, 0, 3, 2, 1, 88369f4) + 13f
 fef4e2da lu_beoGetFstblFilterSwapAndShared (80513bc, 8046978, 8069234, 806920c) + 1be
 fef4f1f7 lu_beoGetFstblToMountBe (80513bc, 80541e4, 80469c4, 80513fc) + 247
 fef515cf lu_beoMountBeByBeName (80513bc, 8046a24, 805419c, 80513fc, 0, 0) + 39c
 0804ba6c  (804ef6c, 1, 8068dd4, 0, 8069ee4, 8069ee4)
 fef5fa2b  (804ef6c, 8046f3c, 8069ee4, 8046ae8)
 fef5f5c3  (804ef6c, 8046f3c, 8069ee4, 8046ae8)
 fef5f397  (804ef6c, 8046f3c, 8069ee4)
 fef5f1c5  (804ef6c, 8046f3c, 8069ee4)
 fef603c6  (804ef6c)
 fef5ec12 lu_pluginProcessLoop (804ef6c) + 42
 0804a028 main (2, 8046fa8, 8046fb4) + 2d3
 08049cba  (2, 80471d8, 80471f9, 0, 8069954, 8069914)

Six hours, fully utilizing a CPU core, comparing strings 8-/.

I considered opening a support ticket, but given the lack of response 
previously, I decided to poke around with it a bit myself first. truss 
of lumount revealed that getmntent was being called to enumerate mount 
points, so initially I tried preloading a shared library to interpose 
the getmntent call and skip all the mount points corresponding to my 
data file systems under /export.
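
(Roughly, the idea was along these lines -- the interposer library name and
the BE name are made up for illustration; LD_PRELOAD_32 is the 32-bit-only
preload variable, matching the 32-bit lu plugins seen in the pstack above:)

---
# Hypothetical sketch: preload a shim whose getmntent() silently skips
# the /export/* entries before lumount ever sees them.
LD_PRELOAD_32=/usr/local/lib/getmntent_filter.so \
    /usr/sbin/lumount patch-20101019 /mnt
---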


That didn't make any difference. I then moved on to look at the multiple 
calls to the zfs binary made by lumount, which seemed like potential sources 
of extraneous data that could cause unnecessary processing. Replacing 
/sbin/zfs with a wrapper script yielded quite unexpected results, as it 
seems there are many links to the zfs binary, which does different magic 
depending on the value of argv[0] 8-/.
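
(A quick way to see this: the link count on the binary is greater than one,
so several names share its inode -- the exact alias names vary by release:)

---
# A link count > 1 means other filesystem names point at the same zfs
# binary, which then keys its behaviour off argv[0].
ls -li /sbin/zfs
---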


The path to the zfs binary is statically defined in 
/etc/lib/lu/liblu.so.1, so in a display of horrid kludginess ;), I 
edited the binary file and replaced all instances of /sbin/zfs with 
/sbin/zfb, and created /sbin/zfb with the content:


---
#! /bin/sh
# Wrapper installed as /sbin/zfb; liblu.so.1 was edited to call this name
# instead of /sbin/zfs. For "zfs list" it filters the output through the
# same ignore list that Jens' patch uses for the rest of live upgrade.

. /etc/default/lu
LUBIN=${LUBIN:=/usr/lib/lu}
. $LUBIN/lulib

if [ "$1" = "list" ] ; then
        /sbin/zfs "$@" | /usr/bin/egrep -v -f "`lulib_get_fs2ignore`"
else
        exec /sbin/zfs "$@"
fi
---

This utilizes the configuration included in Jens' patch to ignore the 
exact same set of file systems ignored by the rest of live upgrade with 
the patch installed.
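
(For the record, the liblu.so.1 edit itself doesn't strictly need a hex
editor: /sbin/zfb is exactly the same length as /sbin/zfs, so a
byte-for-byte substitution leaves every offset in the library intact.
A sketch, working on a backup copy first:)

---
# Keep an untouched copy in case something goes wrong.
cp -p /etc/lib/lu/liblu.so.1 /etc/lib/lu/liblu.so.1.orig
# Confirm the hard-coded path really is in there.
strings /etc/lib/lu/liblu.so.1 | grep /sbin/zfs
# Same-length replacement; perl handles the binary content safely.
perl -pi -e 's{/sbin/zfs}{/sbin/zfb}g' /etc/lib/lu/liblu.so.1
---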


With this kludge in place lumount took *23 seconds*, three orders of 
magnitude less time.


I tend to tilt at windmills, so I probably will end up opening another 
support ticket. Last time there seemed to be no interest in fixing 
live upgrade so it would actually scale :(; maybe this time I'll have 
better luck.


For those Oracle employees in the audience, if anyone could possibly 
explain exactly what processing lupi_bebasic is doing that results in 
six hours of string comparisons, I'm dying of curiosity :). And if 
anyone wants to jump up and champion the cause of getting live upgrade 
to work in an environment with many file systems, I'd be happy to help; 
it would be nice to have shipped code that works without breaking out 
the hex editor ;).


Thanks...


--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-28 Thread Paul B. Henson
On Thu, 27 Aug 2009, Paul B. Henson wrote:

 However, I went to create a new boot environment to install the patches
 into, and so far that's been running for about an hour and a half :(,
 which was not expected or planned for.
[...]
 I don't think I'm going to make my downtime window :(, and will probably
 need to reschedule the patching. I never considered I might have to start
 the patch process six hours before the window.

Well, so far lucreate took 3.5 hours, lumount took 1.5 hours, applying the
patches took all of 10 minutes, luumount took about 20 minutes, and
luactivate has been running for about 45 minutes. I'm assuming it will
probably take at least the 1.5 hours of the lumount (particularly
considering it appears to be running a lumount process under the hood) if
not the 3.5 hours of lucreate. Add in the 1-1.5 hours to reboot, and, well,
so much for patches this maintenance window.

The lupi_bebasic process seems to be the time killer here. Not sure what
it's doing, but it spent 75 minutes running strcmp. Pretty much nothing but
strcmp. 75 CPU minutes running strcmp... I took a look for the source, but
I guess that component's not a part of opensolaris, or at least I couldn't
find it.
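
(For anyone who wants to watch where the time goes on their own box, a
sketch -- microstate accounting shows the plugin pegged in user time:)

---
# Per-LWP microstate view of the plugin, refreshed every 5 seconds;
# the USR column sits near 100% while it churns through strcmp.
prstat -mLp `pgrep -f lupi_bebasic` 5
---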

Hopefully I can figure out how to make this perform a little more
acceptably before our next maintenance window.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-28 Thread Casper . Dik

Well, so far lucreate took 3.5 hours, lumount took 1.5 hours, applying the
patches took all of 10 minutes, luumount took about 20 minutes, and
luactivate has been running for about 45 minutes. I'm assuming it will
probably take at least the 1.5 hours of the lumount (particularly
considering it appears to be running a lumount process under the hood) if
not the 3.5 hours of lucreate. Add in the 1-1.5 hours to reboot, and, well,
so much for patches this maintenance window.

The lupi_bebasic process seems to be the time killer here. Not sure what
it's doing, but it spent 75 minutes running strcmp. Pretty much nothing but
strcmp. 75 CPU minutes running strcmp... I took a look for the source, but
I guess that component's not a part of opensolaris, or at least I couldn't
find it.

Hopefully I can figure out how to make this perform a little more
acceptably before our next maintenance window.


Do you have a lot of files in /etc/mnttab, including nfs filesystems
mounted from server1,server2:/path?

And you're using lucreate for a ZFS root?  It should be quick; we are
changing a number of things in Solaris 10 update 8 and we hope it will
be faster.

Casper



Re: [zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-28 Thread Jens Elkner
On Thu, Aug 27, 2009 at 10:59:16PM -0700, Paul B. Henson wrote:
 On Thu, 27 Aug 2009, Paul B. Henson wrote:
 
  However, I went to create a new boot environment to install the patches
  into, and so far that's been running for about an hour and a half :(,
  which was not expected or planned for.
 [...]
  I don't think I'm going to make my downtime window :(, and will probably
  need to reschedule the patching. I never considered I might have to start
  the patch process six hours before the window.
 
 Well, so far lucreate took 3.5 hours, lumount took 1.5 hours, applying the
 patches took all of 10 minutes, luumount took about 20 minutes, and
 luactivate has been running for about 45 minutes. I'm assuming it will

Have a look at http://iws.cs.uni-magdeburg.de/~elkner/luc/lu-5.10.patch
or http://iws.cs.uni-magdeburg.de/~elkner/luc/lu-5.11.patch ...
So first install the most recent LU patches and then one of the above.
Since I'm still on vacation (for ~8 weeks), I haven't checked whether there
are new LU patches out there and whether the above still match (usually they do).
If not, adjusting the files manually shouldn't be a problem ;-)
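
(An untested sketch of applying one of the above -- the strip level depends
on how the patch was generated, so check the paths inside the patch file
and back up the touched files first:)

---
# Assumes the patch file has already been downloaded to /var/tmp;
# adjust the -p level to match the paths in the patch.
cd /
patch -p0 < /var/tmp/lu-5.10.patch
---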

There are also versions for pre svn_b107 and pre 121430-36,121431-37:
see http://iws.cs.uni-magdeburg.de/~elkner/

More info:
http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html#luslow

Have fun,
jel.
-- 
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 12768


Re: [zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-28 Thread Paul B. Henson
On Fri, 28 Aug 2009 casper@sun.com wrote:

 luactivate has been running for about 45 minutes. I'm assuming it will
 probably take at least the 1.5 hours of the lumount (particularly
 considering it appears to be running a lumount process under the hood) if
 not the 3.5 hours of lucreate.

Eeeek, the luactivate command ended up taking about *7 hours* to complete.
And I'm not sure it was even successful; output excerpts are at the end of
this message.

 Do you have a lot of files in /etc/mnttab, including nfs filesystems
 mounted from server1,server2:/path?

There's only one nfs filesystem in vfstab, which is always mounted; user
home directories are automounted and would be in mnttab if accessed, but
during the lu process no users were on the box.

On the other hand, there are a *lot* of zfs filesystems in mnttab:

# grep zfs /etc/mnttab  | wc -l
8145
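
(A quick way to break that count down by pool, for reference:)

---
# mnttab fields are: special  mountpoint  fstype  options  time
nawk '$3 == "zfs" { split($1, a, "/"); n[a[1]]++ }
      END { for (p in n) print p, n[p] }' /etc/mnttab
---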

 And you're using lucreate for a ZFS root?  It should be quick; we are
 changing a number of things in Solaris 10 update 8 and we hope it will be
 faster/

lucreate on a system with *only* an os root pool is blazing (the magic of
clones). The problem occurs when my data pool (with 6k-odd filesystems) is
also there. The live upgrade process is analyzing all 6k of those
filesystems, mounting them all in the alternate root, unmounting them all,
and who knows what else. This is totally wasted effort; those filesystems
have nothing to do with the OS or patching, and I'm really hoping that they
can just be completely ignored.
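
(The clone point is easy to see, for reference: a new BE's root dataset is
created as a clone, so its origin property points at a snapshot of the
current BE and no data is copied -- a quick look, given the ospool layout
described here:)

---
# Cloned BEs under ospool/ROOT show an origin snapshot; the original
# BE's dataset shows '-'.
zfs list -r -o name,origin ospool/ROOT
---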

So, after 7 hours, here is the last bit of output from luactivate. Other
than taking forever and a day, all of the output up to this point seemed
normal. The BE s10u6 is neither the currently active BE nor the one being
made active, but these errors have me concerned something _bad_ might
happen if I reboot :(. Any thoughts?


Modifying boot archive service
Propagating findroot GRUB for menu conversion.
ERROR: Read-only file system: cannot create mount point
/.alt.s10u6/export/group/ceis
ERROR: failed to create mount point /.alt.s10u6/export/group/ceis for
file system export/group/ceis
ERROR: unmounting partially mounted boot environment file systems
ERROR: No such file or directory: error unmounting ospool/ROOT/s10u6
ERROR: umount: warning: ospool/ROOT/s10u6 not in mnttab
umount: ospool/ROOT/s10u6 no such file or directory
ERROR: cannot unmount ospool/ROOT/s10u6
ERROR: cannot mount boot environment by name s10u6
ERROR: Failed to mount BE s10u6.
ERROR: Failed to mount BE s10u6. Cannot propagate file
/etc/lu/installgrub.findroot to BE
File propagation was incomplete
ERROR: Failed to propagate installgrub
ERROR: Could not propagate GRUB that supports the findroot command.
Activation of boot environment patch-20090817 successful.

According to lustatus everything is good, but shiver... These boxes have
only been in full production for about a month; it would not be good for
them to die during the first scheduled patches.


# lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10u6                      yes      no     no        yes    -
s10u6-20090413             yes      yes    no        no     -
patch-20090817             yes      no     yes       no     -


Thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-28 Thread Paul B. Henson
On Fri, 28 Aug 2009, Jens Elkner wrote:

 More info:
 http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html#luslow

**sweet**!!

This is *exactly* the functionality I was looking for. Thanks much

Do any Sun people have any idea whether Sun has similar functionality planned
for live upgrade? Without this capability, live upgrade is basically useless
on a system with lots of zfs filesystems.

Jens, thanks again, this is perfect.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


[zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-27 Thread Paul B. Henson

Well, so I'm getting ready to install the first set of patches on my x4500
since we deployed into production, and have run into an unexpected snag.

I already knew that with about 5-6k file systems the reboot cycle was going
to be over an hour (not happy about it, but knew about it and planned for it).

However, I went to create a new boot environment to install the patches
into, and so far that's been running for about an hour and a half :(,
which was not expected or planned for.

First, it looks like the ludefine script spent about 20 minutes iterating
through all of my zfs file systems, and then something named lupi_bebasic
ran for over an hour, and then it looks like it mounted all of my zfs
filesystems under /.alt.tmp.b-nAe.mnt, and now it looks like it is
unmounting all of them.

I hadn't noticed it before when I checked on my test system (with only a
handful of filesystems), but evidently when I get to the point of using
lumount to mount the boot environment for patching, it's going to again
mount all of my zfs file systems under the alternative root, and then need
to unmount them all again after I'm done patching, which is going to add
probably another hour or two.

I don't think I'm going to make my downtime window :(, and will probably
need to reschedule the patching. I never considered I might have to start
the patch process six hours before the window.

I poked around a bit, but have not come across any way to exclude zfs
filesystems not part of the boot os pool from the copy and mount process.
I'm really hoping I'm just being stupid and missing something blindingly
obvious. Given a boot pool named ospool, and a data pool named export, is
there any way to make live upgrade completely ignore the data pool? There
is no need for my 6k user file systems to be mounted in the alternative
environment during patching. I only want the file systems in the ospool
copied, processed, and mounted.

fingers crossed Thanks...



-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-27 Thread Trevor Pretty




Paul

You need to exclude all the file systems that are not the "OS".

My S10 Virtual machine is not booted, but you can put all the "excluded"
file systems in a file and use -f, from memory.

You used to have to do this if there was a DVD in the drive, otherwise
/cdrom got copied to the new boot environment. I know this because I
logged an RFE when Live Upgrade first appeared, and it was put into
state Deferred since the workaround was to just exclude it. I think it
did get fixed in a later release, however.

trevor




Paul B. Henson wrote:

  Well, so I'm getting ready to install the first set of patches on my x4500
since we deployed into production, and have run into an unexpected snag.

I already knew that with about 5-6k file systems the reboot cycle was going
to be over an hour (not happy about it, but knew about it and planned for it).

However, I went to create a new boot environment to install the patches
into, and so far that's been running for about an hour and a half :(,
which was not expected or planned for.

First, it looks like the ludefine script spent about 20 minutes iterating
through all of my zfs file systems, and then something named lupi_bebasic
ran for over an hour, and then it looks like it mounted all of my zfs
filesystems under /.alt.tmp.b-nAe.mnt, and now it looks like it is
unmounting all of them.

I hadn't noticed it before when I checked on my test system (with only a
handful of filesystems), but evidently when I get to the point of using
lumount to mount the boot environment for patching, it's going to again
mount all of my zfs file systems under the alternative root, and then need
to unmount them all again after I'm done patching, which is going to add
probably another hour or two.

I don't think I'm going to make my downtime window :(, and will probably
need to reschedule the patching. I never considered I might have to start
the patch process six hours before the window.

I poked around a bit, but have not come across any way to exclude zfs
filesystems not part of the boot os pool from the copy and mount process.
I'm really hoping I'm just being stupid and missing something blindingly
obvious. Given a boot pool named ospool, and a data pool named export, is
there any way to make live upgrade completely ignore the data pool? There
is no need for my 6k user file systems to be mounted in the alternative
environment during patching. I only want the file systems in the ospool
copied, processed, and mounted.

fingers crossed Thanks...



  













Re: [zfs-discuss] live upgrade with lots of zfs filesystems

2009-08-27 Thread Paul B. Henson
On Thu, 27 Aug 2009, Trevor Pretty wrote:

 My S10 Virtual machine is not booted but you can put all the excluded
 file systems in a file and use -f from memory.

Unfortunately, I wasn't that stupid. I saw the -f option, but it's not
applicable to ZFS root:

 -f exclude_list_file

 Use  the  contents  of  exclude_list_file   to   exclude
 specific  files  (including  directories) from the newly
 created BE. exclude_list_file contains a list  of  files
 and directories, one per line. If a line item is a file,
 only that file is excluded; if a directory, that  direc-
 tory  and  all  files  beneath that directory, including
 subdirectories, are excluded.

 This option is not supported when the source BE is on  a
 ZFS file system.

After it finished unmounting everything from the alternative root, it seems
to have spawned *another* lupi_bebasic process which has eaten up 62
minutes of CPU time so far. Evidently it's doing a lot of string
comparisons (per truss):

/1@1:   <- libc:strcmp() = 0
/1@1:   -> libc:strcmp(0x86fceec, 0xfefa1218)
/1@1:   <- libc:strcmp() = 0
/1@1:   -> libc:strcmp(0x86fd534, 0xfefa1218)
/1@1:   <- libc:strcmp() = 0
/1@1:   -> libc:strcmp(0x86fdccc, 0xfefa1218)
/1@1:   <- libc:strcmp() = 0
/1@1:   -> libc:strcmp(0x86fdcfc, 0xfefa1218)
/1@1:   <- libc:strcmp() = 0
/1@1:   -> libc:strcmp(0x86fec84, 0xfefa1218)
/1@1:   <- libc:strcmp() = 0
/1@1:   -> libc:strcmp(0x86fecb4, 0xfefa1218)
/1@1:   <- libc:strcmp() = 0
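
(Output in that form comes from tracing library calls in the running
plugin; the invocation was roughly along these lines:)

---
# Trace just libc's strcmp in the already-running lupi_bebasic process.
truss -u libc:strcmp -p `pgrep -f lupi_bebasic`
---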

The first one finished in a bit over an hour; hopefully this one's about
done too and there's not any more stuff to do.

Thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss