Re: [zones-discuss] zones on shared storage proposal

Edward Pilatowicz Fri, 22 May 2009 20:05:45 -0700

comments inline below.
ed

On Fri, May 22, 2009 at 11:26:06AM -0500, Mike Gerdts wrote:
> On Fri, May 22, 2009 at 1:57 AM, Edward Pilatowicz
> <edward.pilatow...@sun.com> wrote:
> >
> > i've attached an updated version of the proposal (v1.1) which addresses
> > your feedback.  (i've also attached a second copy of the new proposal
> > that includes change bars, in case you want to review the updates.)
>
> As I was reading through it again, I fixed a few picky things (mostly
> spelling) that don't change the meaning.  I don't think that I "fixed"
> anything that was already right in British English.
>
> diff attached.
>


i've merged your fixes in.  thanks.

> > On Thu, May 21, 2009 at 11:59:22AM -0500, Mike Gerdts wrote:
> >> On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
> >> <edward.pilatow...@sun.com> wrote:
> >
> > nice catch.
> >
> > in early versions of my proposal, the nfs:// uri i was planning to
> > support allowed for the specification of mount options.  this required
> > allowing for per-zone nfs mounts with potentially different mount
> > options.  since then i've simplified things (realizing that most people
> > really don't need or want to specify mount options) and i've switched to
> > using the nfs uri defined in rfc 2224.  this means we can do away
> > with the '<zonename>' path component as you suggest.
>
> That was actually something I thought about after the fact.  When I've
> been involved in performance problems in the past, being able to tune
> mount options (e.g. protocol versions, block sizes, caching behavior,
> etc.) has been important.
>

yeah.  so the idea is to keep it simple for the initial functionality.
i figure that i'll evaluate different options and provide the best
defaults possible.  if customer requests come in for supporting
different options, well, first they can easily work around the issue by
using autofs + path:/// (and if the autofs config is in nis/ldap, then
migration will still work).  then we can just come up with a new uri
spec that allows the user to specify mount options.  the non-obvious and
unfortunate part of having a uri that allows for the specification of
mount options is that we'll probably have to require that the user
percent-encode certain chars in the uri.  :(  leaving this off for now
gives me a simpler nfs uri format.  (that should be good enough for most
people.)
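
to make the percent-encoding point concrete, here's a minimal sketch of
what a user (or tool) would have to do to an option string before it
could be embedded in a uri.  uri_encode() is a hypothetical helper, not
part of the proposal, and the option string in the usage note is only an
illustration.  it conservatively encodes every byte outside the rfc 3986
"unreserved" set, which is more than a real uri spec might require.

```c
#include <stddef.h>

/*
 * Percent-encode every byte of 'in' that is not an RFC 3986 "unreserved"
 * character (ALPHA / DIGIT / "-" / "." / "_" / "~").
 * Returns 0 on success, -1 if 'out' is too small.
 * uri_encode() is a hypothetical helper, not an existing zones interface.
 */
static int
uri_encode(const char *in, char *out, size_t outlen)
{
	static const char hex[] = "0123456789ABCDEF";
	size_t o = 0;

	if (outlen == 0)
		return (-1);

	for (; *in != '\0'; in++) {
		unsigned char c = (unsigned char)*in;
		int unreserved =
		    (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
		    (c >= '0' && c <= '9') ||
		    c == '-' || c == '.' || c == '_' || c == '~';

		if (unreserved) {
			/* need room for this byte plus the trailing NUL */
			if (o + 1 >= outlen)
				return (-1);
			out[o++] = (char)c;
		} else {
			/* need room for "%XX" plus the trailing NUL */
			if (o + 3 >= outlen)
				return (-1);
			out[o++] = '%';
			out[o++] = hex[c >> 4];
			out[o++] = hex[c & 0x0f];
		}
	}
	out[o] = '\0';
	return (0);
}
```

so an option string like "rsize=32768,vers=3" would have to be written
as "rsize%3D32768%2Cvers%3D3" under this (deliberately strict) encoding,
which is the kind of thing i'd rather not ask users to type.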

> >> Perhaps if this is what I would like I would be better off adding a
> >> global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
> >> the path:/// uri instead.
> >>
> >> Thoughts?
> >>
> >
> > i'm not sure i understand how you would like to see this functionality
> > behave.
> >
> > wrt vfstab, i'd rather you not use that since that moves configuration
> > outside of zonecfg.  so later, if you want to migrate the zone, you'll
> > need to remember about that vfstab configuration and move it as well.
> > if at all possible i'd really like to keep all the configuration within
> > zonecfg(1m).
> >
> > perhaps you could explain your issues with the currently planned
> > approach in a different way to help me understand it better?
> >
>
> The key thing here is that all of my zones are served from one or two
> NFS servers.  Let's pretend that I have a T5440 with 200 zones on it.
> The way the proposal is written, I would have 200 mounts in the global
> zone of the form:
>
>    $nfsserver:/vol/zones/zone$i
>         on /var/zones/nfsmount/nfsserver/vol/zones/zone$i
>
> When in reality, all I need is a single mount (subject to
> implementation-specific details, as discussed above with ro vs. rw
> shares):
>
>    $nfsserver:/vol/zones
>         on /var/myzones/nfs/$nfsserver/vol/zones
>
> If my standard global zone deployment mechanism adds a vfstab entry
> for $nfsserver:/vol/zones and configures each zone via path:/// I avoid
> a storm of NFS mount requests at zone boot time as the global zone
> boots.  The NFS mount requests are UDP-based RPC calls, which
> sometimes get lost on the wire.  The timeout/retransmit may be such
> that we add a bit of time to the overall zone startup process.  Not a
> huge deal in most cases, but a confusing problem to understand.
>
> In this case, I wouldn't consider the NFS mounts as being something
> specific to a particular zone.  Rather, it is a common configuration
> setting across all members of a particular "zone farm".
>

so if your nfs server is exporting a bunch of filesystems like:
        $nfsserver:/vol/zones/zone$i

then yes, you'll end up with mounts for each.  but if your nfs server
is exporting
        $nfsserver:/vol/zones

then you'll only end up with one.

that said, if your nfs server is exporting
        $nfsserver:/vol/zones
        $nfsserver:/vol/zones/zone$i

i really don't see any way to avoid having mounts for each zone.  afaik,
if the nfs server has a nested export, the exported subdirectory is only
accessible via a mount.  so you couldn't mount $nfsserver:/vol/zones and
then access $nfsserver:/vol/zones/zone5 without first mounting
$nfsserver:/vol/zones/zone5.  (i could always be wrong about this, but
this is my current understanding of how this works.)
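
as a rough sketch of why the number of mounts follows the number of
exported filesystems: assume the global zone mountpoint is derived
purely from the server name and the exported path, using the
/var/zones/nfsmount layout from your example above.  nfs_mountpoint()
is a hypothetical helper for illustration, not the actual
implementation.

```c
#include <stdio.h>

/* Base directory taken from the example mounts quoted above. */
#define	NFS_MNT_BASE	"/var/zones/nfsmount"

/*
 * Build the global zone mountpoint for one exported filesystem:
 *     <NFS_MNT_BASE>/<server><share>
 * 'share' must be an absolute path on the server (e.g. "/vol/zones").
 * Returns 0 on success, -1 on a bad share path or if 'buf' is too small.
 * nfs_mountpoint() is a hypothetical helper, not a zones interface.
 */
static int
nfs_mountpoint(const char *server, const char *share, char *buf, size_t len)
{
	int n;

	if (share[0] != '/')
		return (-1);

	n = snprintf(buf, len, "%s/%s%s", NFS_MNT_BASE, server, share);
	if (n < 0 || (size_t)n >= len)
		return (-1);
	return (0);
}
```

with this mapping, every zone whose storage lives under the single
export /vol/zones resolves to the same mountpoint, so there's only one
mount.  if each zone$i is its own export, each one maps to a distinct
mountpoint and needs its own mount.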

> >> > ----------
> >> > C.1.viii Zoneadm(1m) clone
> >> >
> >> > Normally when cloning a zone which lives on a zfs filesystem the zones
> >> > framework will take a zfs(1m) snapshot of the source zone and then do a
> >> > zfs(1m) clone operation to create a filesystem for the new zone which is
> >> > being instantiated.  This works well when all the zones on a given
> >> > system live on local storage in a single zfs filesystem, but this model
> >> > doesn't work well for zones with encapsulated roots.  First, with
> >> > encapsulated roots each zone has its own zpool, and zfs(1m) does not
> >> > support cloning across zpools.  Second, zfs(1m) snapshotting/cloning
> >> > within the source zpool and then mounting the resultant filesystem onto
> >> > the target zone's zoneroot would introduce dependencies between zones,
> >> > complicating things like zone migration.
> >> >
> >> > Hence, for cloning operations, if the source zone has an encapsulated
> >> > root, zoneadm(1m) will not use zfs(1m) snapshot/clone.  Currently
> >> > zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
> >> > is unable to use zfs(1m) snapshot/clone.  We could just fall back to
> >> > this default behaviour for encapsulated root zones, but find+cpio are
> >> > not error free and can have problems with large files.  So we propose to
> >> > update zoneadm(1m) clone to detect when both the source and target zones
> >> > are using separate zfs filesystems, and in that case attempt to use zfs
> >> > send/recv before falling back to find+cpio.
> >>
> >> Can a provision be added for running an external command to produce the
> >> clone?  I envision this being used to make a call to a storage device to
> >> tell the storage device to create a clone of the storage.  (This implies
> >> that the super-secret tool to re-write the GUID would need to become
> >> available.)
> >>
> >> The alternative seems to be to have everyone invent their own mechanism
> >> with the same external commands and zoneadm attach.
> >>
> >
> > hm.  currently there are internal brand hooks which are run during a
> > clone operation, but i don't think it would be appropriate to expose
> > these.
> >
> > a "zoneadm clone" is basically a copy + sys-unconfig.  if you have a
> > storage device that can be used to do the copy for you, perhaps you
> > could simply do the copy on the storage device, and then do a "zoneadm
> > attach" of the new zone image?  if you want, i think it would be a
> > pretty trivial RFE to add a sys-unconfig option to "zoneadm attach".
> > that should let you get the same essential functionality as clone,
> > without having to add any new callbacks.  thoughts?
>
> Since cloning already requires the zone to be down, I don't think that
> too many people are probably cloning anything other than zones that
> are intended to be template zones that are never booted.  Such zones
> can be pre-sys-unconfig'd to work around this problem, and in my
> opinion is not worth a lot of effort.
>
> I further suspect that most places would prefer that zones were not
> sys-unconfig'd so that they could just tweak the few things that need
> to be tweaked rather than putting bogus information in /etc/sysidcfg
> then going back and fixing things afterwards.  For example, sysidcfg
> is unable to cope with the notion that you might use LDAP for
> accounts, DNS for hosts, and files for things like services.
> Patching, upgrades, etc. also tend to break things related to sysidcfg
> (e.g. disabling various SMF services required by name services).
> Hopefully sysidcfg goes away or gets fixed...
>

yeah.  i'm not sure what the future plans for sysidcfg are, but i'm
hoping that the AI install project will replace/blow-up all that old
sysid stuff...

> >> > ----------
> >> > C.2 Storage object uid/gid handling
> >> >
> >> > One issue faced by all VTs that support shared storage is dealing with
> >> > file access permissions of storage objects accessible via NFS.  This
> >> > issue doesn't affect device based shared storage, or local files and
> >> > vdisks, since these types of storage are always accessible, regardless
> >> > of the uid of the access process (as long as the accessing process has
> >> > the necessary privileges).  But when accessing files and vdisks via NFS,
> >> > the accessing process cannot use privileges to circumvent restrictive
> >> > file access permissions.  This issue is also complicated by the fact
> >> > that by default most NFS servers will map all accesses by the remote root
> >> > user to a different uid, usually "nobody".  (a process known as "root
> >> > squashing".)
> >> >
> >> > In order to avoid root squashing, or requiring users to setup special
> >> > configurations on their NFS servers, whenever the zone framework
> >> > attempts to create a storage object file or vdisk, it will temporarily
> >> > change its uid and gid to the "xvm" user and group, and then create the
> >> > file with 0600 access permissions.
> >> >
> >> > Additionally, whenever the zones framework attempts to access a storage
> >> > object file or vdisk it will temporarily switch its uid and gid to match
> >> > the owner and group of the file/vdisk, ensure that the file is readable
> >> > and writeable by its owner (updating the file/vdisk permissions if
> >> > necessary), and finally setup the file/vdisk for access via a zpool
> >> > import or lofiadm -a.  This should allow the zones framework to
> >> > access storage object files/vdisks that were created by any user,
> >> > regardless of their ownership, simplifying file ownership and management
> >> > issues for administrators.
> >>
> >> This implies that the xvm user is getting some additional privileges.
> >> What are those privileges?
> >>
> >
> > hm.  afaik, the xvm user isn't defined as having any particular
> > privileges.  (/etc/user_attr doesn't have an xvm entry.)  i wasn't
> > planning on defining any privilege requirements for the xvm user.
> >
> > zoneadmd currently runs as root with all privs.  so zoneadmd will be
> > able to switch to the xvm user to create encapsulated zpool
> > files/vdisks.  similarly, zoneadmd will also be able to switch uid to
> > the owner of any other objects it may need to access.
>
> Gotcha.  It will be along the lines of:
>
>    seteuid(xvmuid);
>    system("/sbin/zpool ...");
>
> Rather than:
>
>    system("/usr/bin/su - xvm /sbin/zpool ...");
>
> Assuming you are using system(3C) and not libzfs.
>

yep.
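
for what it's worth, here's a minimal sketch of that pattern applied to
file creation: switch effective gid/uid, create the file with 0600,
then switch back.  create_as() is a hypothetical helper, not zoneadmd
code.  it assumes the caller has already looked up the target uid/gid
(e.g. for the "xvm" user) and has the privileges needed to switch to
them.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create 'path' with mode 0600 while temporarily running with the given
 * effective uid/gid, then restore the original effective ids.
 * Returns an open file descriptor, or -1 on failure.
 * create_as() is a hypothetical helper for illustration only.
 */
static int
create_as(const char *path, uid_t uid, gid_t gid)
{
	uid_t ouid = geteuid();
	gid_t ogid = getegid();
	int fd;

	/* Switch gid first: once euid is non-root we may not be allowed to. */
	if (setegid(gid) != 0)
		return (-1);
	if (seteuid(uid) != 0) {
		(void) setegid(ogid);
		return (-1);
	}

	/* O_EXCL: fail rather than reuse a file someone else created. */
	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd != -1 && fchmod(fd, 0600) != 0) {
		/* fchmod() makes the mode exact regardless of umask */
		(void) close(fd);
		(void) unlink(path);
		fd = -1;
	}

	/* Restore in reverse order: uid first, then gid. */
	if (seteuid(ouid) != 0 || setegid(ogid) != 0) {
		if (fd != -1)
			(void) close(fd);
		return (-1);
	}
	return (fd);
}
```

since only the effective ids change, the process keeps the ability to
switch back afterwards, which is the difference from spawning a
separate "su - xvm" session.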
