[zfs-discuss] live upgrade with lots of zfs filesystems -- still broken
A bit over a year ago I posted about a problem I was having with live upgrade on a system with lots of file systems mounted: http://opensolaris.org/jive/thread.jspa?messageID=411137#411137

An official Sun support call was basically just closed with no resolution. I was quite fortunate that Jens Elkner had made a workaround available which made live upgrade actually usable for my deployment (thanks again, Jens!); I would have been pretty screwed without it. While still not exactly speedy, with the workaround in place live upgrade was fairly usable, and we've been using it for installing patches and upgrading to update releases with no problems.

Until now. Unfortunately, after installing the latest live upgrade patches on my existing U8 system in preparation for upgrading to U9, live upgrade has become even less usable than when I initially tried it without the workaround in place. While creating a new BE was still reasonably quick, mounting it took over *six* hours to complete 8-/. Before, most of the time was spent mounting/unmounting all the filesystems (resolved by Jens' patch); now the majority of the six hours was spent spinning in /etc/lib/lu/plugins/lupi_bebasic.
I don't know exactly what it was doing (the source code to live upgrade does not appear to be available), but for most of the six hours it seems it was comparing strings:

# pstack 1670
1670:   /etc/lib/lu/plugins/lupi_bebasic plugin
 fee05973 strcmp   (8046474, 8046478) + 1c3
 fef6ae45 lu_smlGetTagByName (806920c, 16ef, fefa0f30) + 74
 fef71717 lu_tsfSearchFields (806920c, 0, 3, 2, 1, 88369f4) + 13f
 fef4e2da lu_beoGetFstblFilterSwapAndShared (80513bc, 8046978, 8069234, 806920c) + 1be
 fef4f1f7 lu_beoGetFstblToMountBe (80513bc, 80541e4, 80469c4, 80513fc) + 247
 fef515cf lu_beoMountBeByBeName (80513bc, 8046a24, 805419c, 80513fc, 0, 0) + 39c
 0804ba6c (804ef6c, 1, 8068dd4, 0, 8069ee4, 8069ee4)
 fef5fa2b (804ef6c, 8046f3c, 8069ee4, 8046ae8)
 fef5f5c3 (804ef6c, 8046f3c, 8069ee4, 8046ae8)
 fef5f397 (804ef6c, 8046f3c, 8069ee4)
 fef5f1c5 (804ef6c, 8046f3c, 8069ee4)
 fef603c6 (804ef6c)
 fef5ec12 lu_pluginProcessLoop (804ef6c) + 42
 0804a028 main     (2, 8046fa8, 8046fb4) + 2d3
 08049cba (2, 80471d8, 80471f9, 0, 8069954, 8069914)

Six hours, fully utilizing a CPU core, comparing strings 8-/.

I considered opening a support ticket, but given the lack of response previously, I decided to poke around with it a bit myself first. truss of lumount revealed that getmntent was being called to enumerate mount points, so initially I tried preloading a shared library to interpose the getmntent call and skip all the mount points corresponding to my data file systems under /export. That didn't make any difference. I then moved on to look at the multiple calls to the zfs binary made by lumount, which seemed like potential sources of extraneous data that could cause unnecessary processing. Replacing /sbin/zfs with a wrapper script yielded quite unexpected results, as it turns out there are many links to the zfs binary, which does different magic depending on the value of argv[0] 8-/.
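[Editorial aside: the argv[0] dispatch that trips up a naive wrapper can be sketched with a toy script; all file and command names below are made up for illustration, not taken from live upgrade. One executable hard-linked under several names branches on `basename "$0"`, just as the zfs binary appears to.]

```shell
#!/bin/sh
# Demonstrate argv[0]-based dispatch: one executable, several hard-linked
# names, different behavior depending on the name it was invoked under.
set -e

dir=$(mktemp -d)

# A single script that inspects the name it was invoked as.
cat > "$dir/tool" <<'EOF'
#!/bin/sh
case $(basename "$0") in
    frobnicate) echo "frobnicating" ;;
    defrag)     echo "defragging"   ;;
    *)          echo "unknown personality: $(basename "$0")" ;;
esac
EOF
chmod +x "$dir/tool"

# Hard links give the same inode several names, as with the zfs binary;
# replacing only one name with a wrapper misses the other personalities.
ln "$dir/tool" "$dir/frobnicate"
ln "$dir/tool" "$dir/defrag"

out1=$("$dir/frobnicate")   # prints "frobnicating"
out2=$("$dir/defrag")       # prints "defragging"
echo "$out1"
echo "$out2"

rm -rf "$dir"
```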
The path to the zfs binary is statically defined in /etc/lib/lu/liblu.so.1, so in a display of horrid kludginess ;), I edited the binary file, replaced all instances of /sbin/zfs with /sbin/zfb, and created /sbin/zfb with the content:

---
#! /bin/sh

. /etc/default/lu
LUBIN=${LUBIN:=/usr/lib/lu}
. $LUBIN/lulib

if [ "$1" = list ]; then
        /sbin/zfs "$@" | /usr/bin/egrep -v -f `lulib_get_fs2ignore`
else
        exec /sbin/zfs "$@"
fi
---

This utilizes the configuration included in Jens' patch to ignore the exact same set of file systems ignored by the rest of live upgrade with the patch installed. With this kludge in place lumount took *23 seconds*, roughly three orders of magnitude less time.

I tend to tilt at windmills, so I probably will end up opening another support ticket. Last time there seemed to be no interest in fixing live upgrade so it would actually scale :(; maybe this time I'll have better luck. For those Oracle employees in the audience, if anyone could possibly explain exactly what processing lupi_bebasic is doing that results in six hours of string comparisons, I'm dying of curiosity :). And if anyone wants to jump up and champion the cause of getting live upgrade to work in an environment with many file systems, I'd be happy to help; it would be nice to have shipped code that works without breaking out the hex editor ;).

Thanks...

--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | hen...@csupomona.edu
California State Polytechnic University | Pomona CA 91768

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] live upgrade with lots of zfs filesystems
On Thu, 27 Aug 2009, Paul B. Henson wrote:

However, I went to create a new boot environment to install the patches into, and so far that's been running for about an hour and a half :(, which was not expected or planned for. [...] I don't think I'm going to make my downtime window :(, and will probably need to reschedule the patching. I never considered I might have to start the patch process six hours before the window.

Well, so far lucreate took 3.5 hours, lumount took 1.5 hours, applying the patches took all of 10 minutes, luumount took about 20 minutes, and luactivate has been running for about 45 minutes. I'm assuming it will probably take at least the 1.5 hours of the lumount (particularly considering it appears to be running a lumount process under the hood) if not the 3.5 hours of lucreate. Add in the 1-1.5 hours to reboot, and, well, so much for patches this maintenance window.

The lupi_bebasic process seems to be the time killer here. Not sure what it's doing, but it spent 75 minutes running strcmp. Pretty much nothing but strcmp. 75 CPU minutes running strcmp.

I took a look for the source, but I guess that component's not part of opensolaris, or at least I couldn't find it. Hopefully I can figure out how to make this perform a little more acceptably before our next maintenance window.
Re: [zfs-discuss] live upgrade with lots of zfs filesystems
Well, so far lucreate took 3.5 hours, lumount took 1.5 hours, applying the patches took all of 10 minutes, luumount took about 20 minutes, and luactivate has been running for about 45 minutes. I'm assuming it will probably take at least the 1.5 hours of the lumount (particularly considering it appears to be running a lumount process under the hood) if not the 3.5 hours of lucreate. Add in the 1-1.5 hours to reboot, and, well, so much for patches this maintenance window. The lupi_bebasic process seems to be the time killer here. Not sure what it's doing, but it spent 75 minutes running strcmp. Pretty much nothing but strcmp. 75 CPU minutes running strcmp. I took a look for the source but I guess that component's not part of opensolaris, or at least I couldn't find it. Hopefully I can figure out how to make this perform a little more acceptably before our next maintenance window.

Do you have a lot of entries in /etc/mnttab, including nfs filesystems mounted from server1,server2:/path? And you're using lucreate for a ZFS root? It should be quick; we are changing a number of things in Solaris 10 update 8 and we hope it will be faster.

Casper
Re: [zfs-discuss] live upgrade with lots of zfs filesystems
On Thu, Aug 27, 2009 at 10:59:16PM -0700, Paul B. Henson wrote:

On Thu, 27 Aug 2009, Paul B. Henson wrote: However, I went to create a new boot environment to install the patches into, and so far that's been running for about an hour and a half :(, which was not expected or planned for. [...] I don't think I'm going to make my downtime window :(, and will probably need to reschedule the patching. I never considered I might have to start the patch process six hours before the window. Well, so far lucreate took 3.5 hours, lumount took 1.5 hours, applying the patches took all of 10 minutes, luumount took about 20 minutes, and luactivate has been running for about 45 minutes. I'm assuming it will

Have a look at http://iws.cs.uni-magdeburg.de/~elkner/luc/lu-5.10.patch or http://iws.cs.uni-magdeburg.de/~elkner/luc/lu-5.11.patch ...

So first install the most recent LU patches and then one of the above. Since I'm still on vacation (for ~8 weeks), I haven't checked whether there are new LU patches out there and whether the patches still match (usually they do). If not, adjusting the files manually shouldn't be a problem ;-)

There are also versions for pre snv_b107 and pre 121430-36,121431-37: see http://iws.cs.uni-magdeburg.de/~elkner/

More info: http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html#luslow

Have fun,
jel.

--
Otto-von-Guericke University    http://www.cs.uni-magdeburg.de/
Department of Computer Science  Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany        Tel: +49 391 67 12768
Re: [zfs-discuss] live upgrade with lots of zfs filesystems
On Fri, 28 Aug 2009 casper@sun.com wrote:

luactivate has been running for about 45 minutes. I'm assuming it will probably take at least the 1.5 hours of the lumount (particularly considering it appears to be running a lumount process under the hood) if not the 3.5 hours of lucreate.

Eeeek, the luactivate command ended up taking about *7 hours* to complete. And I'm not sure it was even successful; output excerpts at the end of this message.

Do you have a lot of files in /etc/mnttab, including nfs filesystems mounted from server1,server2:/path?

There's only one nfs filesystem in vfstab, which is always mounted; user home directories are automounted and would be in mnttab if accessed, but during the lu process no users were on the box. On the other hand, there are a *lot* of zfs filesystems in mnttab:

# grep zfs /etc/mnttab | wc -l
    8145

And you're using lucreate for a ZFS root? It should be quick; we are changing a number of things in Solaris 10 update 8 and we hope it will be faster.

lucreate on a system with *only* an os root pool is blazing (the magic of clones). The problem occurs when my data pool (with 6k odd filesystems) is also there. The live upgrade process is analyzing all 6k of those filesystems, mounting them all in the alternate root, unmounting them all, and who knows what else. This is totally wasted effort; those filesystems have nothing to do with the OS or patching, and I'm really hoping they can just be completely ignored.

So, after 7 hours, here is the last bit of output from luactivate. Other than taking forever and a day, all of the output up to this point seemed normal. The BE s10u6 is neither the currently active BE nor the one being made active, but these errors have me concerned something _bad_ might happen if I reboot :(. Any thoughts?

Modifying boot archive service
Propagating findroot GRUB for menu conversion.
ERROR: Read-only file system: cannot create mount point /.alt.s10u6/export/group/ceis
ERROR: failed to create mount point /.alt.s10u6/export/group/ceis for file system export/group/ceis
ERROR: unmounting partially mounted boot environment file systems
ERROR: No such file or directory: error unmounting ospool/ROOT/s10u6
ERROR: umount: warning: ospool/ROOT/s10u6 not in mnttab
umount: ospool/ROOT/s10u6 no such file or directory
ERROR: cannot unmount ospool/ROOT/s10u6
ERROR: cannot mount boot environment by name s10u6
ERROR: Failed to mount BE s10u6.
ERROR: Failed to mount BE s10u6.
Cannot propagate file /etc/lu/installgrub.findroot to BE
File propagation was incomplete
ERROR: Failed to propagate installgrub
ERROR: Could not propagate GRUB that supports the findroot command.
Activation of boot environment patch-20090817 successful.

According to lustatus everything is good, but shiver... These boxes have only been in full production about a month; it would not be good for them to die during the first scheduled patches.

# lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ------
s10u6                      yes      no     no        yes    -
s10u6-20090413             yes      yes    no        no     -
patch-20090817             yes      no     yes       no     -

Thanks...
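[Editorial aside: the mnttab count above can be broken down per pool to show how much of the mount table belongs to the data pool that live upgrade needn't touch. A sketch against a fabricated sample mnttab; on a live system you would read /etc/mnttab directly, and the dataset names here are invented:]

```shell
#!/bin/sh
# Count zfs filesystems per pool from a mnttab-format file. Column 1 is
# the special (dataset name), column 3 the fstype; the pool is the
# dataset name up to the first '/'.
set -e

sample=$(mktemp)
cat > "$sample" <<'EOF'
ospool/ROOT/s10u6-20090413 / zfs rw 1251000000
export /export zfs rw 1251000000
export/user/alice /export/user/alice zfs rw 1251000000
export/user/bob /export/user/bob zfs rw 1251000000
nfsserver:/home /home nfs rw 1251000000
EOF

# Tally zfs entries by pool, skipping non-zfs mounts like nfs.
counts=$(awk '$3 == "zfs" { split($1, a, "/"); n[a[1]]++ }
              END { for (p in n) print p, n[p] }' "$sample" | sort)
echo "$counts"

rm -f "$sample"
```

For the sample above this prints one line per pool ("export 3" and "ospool 1"), making it obvious which pool dominates the mount table.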
Re: [zfs-discuss] live upgrade with lots of zfs filesystems
On Fri, 28 Aug 2009, Jens Elkner wrote:

More info: http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html#luslow

**sweet**!! This is *exactly* the functionality I was looking for. Thanks much.

Any Sun people have any idea whether Sun has similar functionality planned for live upgrade? Live upgrade without this capability is basically useless on a system with lots of zfs filesystems.

Jens, thanks again, this is perfect.
[zfs-discuss] live upgrade with lots of zfs filesystems
Well, so I'm getting ready to install the first set of patches on my x4500 since we deployed into production, and have run into an unexpected snag. I already knew that with about 5-6k file systems the reboot cycle was going to be over an hour (not happy about it, but known and planned for). However, I went to create a new boot environment to install the patches into, and so far that's been running for about an hour and a half :(, which was not expected or planned for.

First, it looks like the ludefine script spent about 20 minutes iterating through all of my zfs file systems, then something named lupi_bebasic ran for over an hour, then it mounted all of my zfs filesystems under /.alt.tmp.b-nAe.mnt, and now it looks like it is unmounting all of them. I hadn't noticed this before on my test system (with only a handful of filesystems), but evidently when I get to the point of using lumount to mount the boot environment for patching, it's going to again mount all of my zfs file systems under the alternate root, and then need to unmount them all again after I'm done patching, which will probably add another hour or two. I don't think I'm going to make my downtime window :(, and will probably need to reschedule the patching. I never considered I might have to start the patch process six hours before the window.

I poked around a bit, but have not come across any way to exclude zfs filesystems not part of the boot os pool from the copy and mount process. I'm really hoping I'm just being stupid and missing something blindingly obvious. Given a boot pool named ospool and a data pool named export, is there any way to make live upgrade completely ignore the data pool? There is no need for my 6k user file systems to be mounted in the alternate environment during patching. I only want the file systems in the ospool copied, processed, and mounted.

fingers crossed

Thanks...
Re: [zfs-discuss] live upgrade with lots of zfs filesystems
Paul,

You need to exclude all the file systems that are not the "OS". My S10 virtual machine is not booted, but from memory you can put all the "excluded" file systems in a file and use -f. You used to have to do this if there was a DVD in the drive, otherwise /cdrom got copied to the new boot environment. I know this because I logged an RFE when Live Upgrade first appeared, and it was put into state Deferred as the workaround is to just exclude it. I think it did get fixed in a later release, however.

trevor

Paul B. Henson wrote:

Well, so I'm getting ready to install the first set of patches on my x4500 since we deployed into production, and have run into an unexpected snag. I already knew that with about 5-6k file systems the reboot cycle was going to be over an hour (not happy about, but knew about and planned for). However, I went to create a new boot environment to install the patches into, and so far that's been running for about an hour and a half :(, which was not expected or planned for. First, it looks like the ludefine script spent about 20 minutes iterating through all of my zfs file systems, and then something named lupi_bebasic ran for over an hour, and then it looks like it mounted all of my zfs filesystems under /.alt.tmp.b-nAe.mnt, and now it looks like it is unmounting all of them. I hadn't noticed before, but when I went to check on my test system (with only a handful of filesystems), but evidently when I get to the point of using lumount to mount the boot environment for patching, it's going to again mount all of my zfs file systems under the alternative root, and then need to unmount them all again after I'm done patching, which is going to add probably another hour or two. I don't think I'm going to make my downtime window :(, and will probably need to reschedule the patching. I never considered I might have to start the patch process six hours before the window.
I poked around a bit, but have not come across any way to exclude zfs filesystems not part of the boot os pool from the copy and mount process. I'm really hoping I'm just being stupid and missing something blindingly obvious. Given a boot pool named ospool, and a data pool named export, is there anyway to make live upgrade completely ignore the data pool? There is no need for my 6k user file systems to be mounted in the alternative environment during patching. I only want the file systems in the ospool copied, processed, and mounted. fingers crossed Thanks...
Re: [zfs-discuss] live upgrade with lots of zfs filesystems
On Thu, 27 Aug 2009, Trevor Pretty wrote:

My S10 Virtual machine is not booted but you can put all the excluded file systems in a file and use -f from memory.

Unfortunately, I wasn't that stupid. I saw the -f option, but it's not applicable to ZFS root:

     -f exclude_list_file

         Use the contents of exclude_list_file to exclude specific
         files (including directories) from the newly created BE.
         exclude_list_file contains a list of files and directories,
         one per line. If a line item is a file, only that file is
         excluded; if a directory, that directory and all files
         beneath that directory, including subdirectories, are
         excluded. This option is not supported when the source BE
         is on a ZFS file system.

After it finished unmounting everything from the alternative root, it seems to have spawned *another* lupi_bebasic process, which has eaten up 62 minutes of CPU time so far. Evidently it's doing a lot of string comparisons (per truss):

/1@1:  <- libc:strcmp() = 0
/1@1:  -> libc:strcmp(0x86fceec, 0xfefa1218)
/1@1:  <- libc:strcmp() = 0
/1@1:  -> libc:strcmp(0x86fd534, 0xfefa1218)
/1@1:  <- libc:strcmp() = 0
/1@1:  -> libc:strcmp(0x86fdccc, 0xfefa1218)
/1@1:  <- libc:strcmp() = 0
/1@1:  -> libc:strcmp(0x86fdcfc, 0xfefa1218)
/1@1:  <- libc:strcmp() = 0
/1@1:  -> libc:strcmp(0x86fec84, 0xfefa1218)
/1@1:  <- libc:strcmp() = 0
/1@1:  -> libc:strcmp(0x86fecb4, 0xfefa1218)
/1@1:  <- libc:strcmp() = 0

The first one finished in a bit over an hour; hopefully this one's about done too and there's not any more stuff to do.

Thanks...