On Fri, May 22, 2009 at 1:57 AM, Edward Pilatowicz
<edward.pilatow...@sun.com> wrote:
> hey mike,
> thanks for all the great feedback.
> my replies to your individual comments are inline below.

Thanks.  I've responded inline where needed.

> i've attached an updated version of the proposal (v1.1) which addresses
> your feedback.  (i've also attached a second copy of the new proposal
> that includes change bars, in case you want to review the updates.)

As I was reading through it again, I fixed a few picky things (mostly
spelling) that don't change the meaning.  I don't think that I "fixed"
anything that was already right in British English.

diff attached.

> thanks again,
> ed
> On Thu, May 21, 2009 at 11:59:22AM -0500, Mike Gerdts wrote:
>> On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
>> <edward.pilatow...@sun.com> wrote:
>> > hey all,
>> >
>> > i've created a proposal for my vision of how zones hosted on shared
>> > storage should work.  if anyone is interested in this functionality then
>> > please give my proposal a read and let me know what you think.  (fyi,
>> > i'm leaving on vacation next week so if i don't reply to comments right
>> > away please don't take offence, i'll get to it when i get back.  ;)
>> >
>> > ed
>> I'm very happy to see this.  Comments appear below.
>> > " please ensure that the vim modeline option is not disabled
>> > vim:textwidth=72
>> >
>> > -------------------------------------------------------------------------------
>> > Zones on shared storage (v1.0)
>> >
>> [snip]
>> > ----------
>> > C.1.i Zonecfg(1m)
>> >
>> > The zonecfg(1m) command will be enhanced with the following two new
>> > resources and associated properties:
>> >
>> >     rootzpool                               resource
>> >             src                             resource property
>> >             install-size                    resource property
>> >             zpool-preserve                  resource property
>> >             dataset                         resource property
>> >
>> >     zpool                                   resource
>> >             src                             resource property
>> >             install-size                    resource property
>> >             zpool-preserve                  resource property
>> >             name                            resource property
>> >
>> > The new resource and properties will be defined as follows:
>> >
>> > "rootzpool"
>> >     - Description: Identifies a shared storage object (and its
>> >     associated parameters) which will be used to contain the root
>> >     zfs filesystem for a zone.
>> >
>> > "zpool"
>> >     - Description: Identifies a shared storage object (and its
>> >     associated parameters) which will be made available to the
>> >     zone as a delegated zfs dataset.
>> That is to say "put your OS stuff in rootzpool, put everything else in
>> zpool" - right?
> yes.  as i see it, this proposal allows for multiple types of deployment
> configurations.
> - a zone with a single encapsulated "rootzpool" zpool.
>        the OS will reside in <zonename>_rpool/ROOT/zbeXXX
>        everything else will also reside in <zonename>_rpool/ROOT/zbeXXX
> - a zone with a single encapsulated "rootzpool" zpool.
>        the OS will reside in <zonename>_rpool/ROOT/zbeXXX
>        everything else will reside in <zonename>_rpool/dataset/<dataset>
> - a zone with multiple encapsulated zpools.
>        the OS will reside in <zonename>_rpool/ROOT/zbeXXX
>        everything else will reside in other encapsulated "zpool"s
> i've added some text to this section of the proposal to explain these
> different configuration scenarios.

Thanks, looks good.
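To make sure I'm reading the three scenarios the same way you are, here they are reduced to a quick Python sketch (the zone, BE, and dataset names are all invented for illustration):

```python
def zone_datasets(zonename, be, delegated=(), extra_zpools=()):
    """Where zone data lands in each of the three scenarios above.

    Uses the <zonename>_rpool naming from the proposal; the zone, BE,
    and dataset names are invented.  Purely illustrative.
    """
    rpool = zonename + "_rpool"
    os_fs = rpool + "/ROOT/" + be
    # scenario 2: everything else in delegated datasets under the root zpool
    data = [rpool + "/dataset/" + d for d in delegated]
    # scenario 3: everything else in additional encapsulated "zpool"s
    data += [zonename + "_" + name for name in extra_zpools]
    # scenario 1: a single encapsulated rootzpool holds it all
    if not data:
        data = [os_fs]
    return {"os": os_fs, "data": data}
```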

>> > ----------
>> > C.1.ii Storage object uri (so-uri) format
>> >
>> > The storage object uri (so-uri) syntax[03] will conform to the standard
>> > uri format defined in RFC 3986 [04].  The nfs URI scheme is defined in
>> > RFC 2224 [05].  The so-uri syntax can be summarised as follows:
>> >
>> > File storage objects:
>> >
>> >     path:///<file-absolute>
>> >     nfs://<host>[:port]/<file-absolute>
>> >
>> > Vdisk storage objects:
>> >
>> >     vpath:///<file-absolute>
>> >     vnfs://<host>[:port]/<file-absolute>
>> >
>> > Device storage objects:
>> >
>> >     fc:///wwn[@<lun>]
>> >     iscsi:///alias=<alias>[@<lun>]
>> >     iscsi:///target=<target>[@<lun>]
>> >     iscsi://host[:port]/[tpgt=<tpgt>/]target=<target>[@<lun>]
>> >
>> > File storage objects point to plain files on local, nfs, or cifs
>> > filesystems.  These files are used to contain zpools which store zone
>> > datasets.  These are the simplest types of storage objects.  Once
>> > created, they have a fixed size, can't be grown, and don't support
>> > advanced features like snapshotting, etc.  Some example file so-uri's
>> > are:
>> >
>> > path:///export/xvm/vm1.disk
>> >     - a local file
>> > path:///net/heaped.sfbay/export/xvm/1.disk
>> >     - a nfs file accessible via autofs
>> > nfs://heaped.sfbay/export/xvm/1.disk
>> >     - same file specified directly via a nfs so-uri
>> >
>> > Vdisk storage objects are similar to file storage objects in that they
>> > can live on local, nfs, or cifs filesystems, but they each have their
>> > own special data format and varying featuresets, with support for things
>> > like snapshotting, etc.  Some common vdisk formats are: VDI, VMDK and
>> > VHD.  Some example vdisk so-uri's are:
>> >
>> > vpath:///export/xvm/vm1.vmdk
>> >     - a local vdisk image
>> > vpath:///net/heaped.sfbay/export/xvm/1.vmdk
>> >     - a nfs vdisk image accessible via autofs
>> > vnfs://heaped.sfbay/export/xvm/1.vmdk
>> >     - same vdisk image specified directly via a nfs so-uri
>> >
>> > Device storage objects specify block storage devices in a host
>> > independent fashion.  When configuring FC or iscsi storage on different
>> > hosts, the storage configuration normally lives outside of zonecfg, and
>> > the configured storage may have varying /dev/dsk/cXtXdX* names.  The
>> > so-uri syntax provides a way to specify storage in a host independent
>> > fashion, and during zone management operations, the zones framework can
>> > map this storage to a host specific device path.  Some example device
>> > so-uri's are:
>> >
>> > fc:///20000014c3474...@0
>> >     - lun 0 of a fc disk with the specified wwn
>> > iscsi:///alias=oracle zone r...@0
>> >     - lun 0 of an iscsi disk with the specified alias.
>> > iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
>> >     - lun 0 of an iscsi disk with the specified target id.
>> What about if there is already the necessary layer of abstraction that
>> provides a consistent namespace?  For example,
>> /dev/vx/dsk/zone1dg/rootvol would refer to a block device named rootvol
>> in the disk group zone1dg.  That may reside on a single disk or span
>> many disks and will have the same name regardless of which host the disk
>> group is imported on.  Since this VxVM volume may span many disks, it
>> would be inappropriate to refer to a single LUN that makes up that disk
>> group.
>> Perhaps the following is appropriate for such situations.
>> dev:///dev/vx/dsk/zone1dg/rootvol
> good point.  but rather than adding another URI type i'd rather just re-use
> the "path:///" uri.
> i've updated the doc to describe this use case and i've added an
> example.

Oh yeah, UNIX presents devices as files.  Duh.  :)
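Since the so-uri grammar rides on RFC 3986, scheme dispatch should fall out of any stock URI parser.  A rough Python sketch of the classification (the scheme list is from C.1.ii; everything else here is my own):

```python
from urllib.parse import urlparse

# Schemes from section C.1.ii; everything below is a sketch.
FILE_SCHEMES = {"path", "nfs"}
VDISK_SCHEMES = {"vpath", "vnfs"}
DEVICE_SCHEMES = {"fc", "iscsi"}

def classify_souri(uri):
    """Return (kind, host, path) for a storage object uri.

    Rough illustration only -- real code would also pull apart the
    lun/target/tpgt pieces of device so-uris.
    """
    u = urlparse(uri)
    if u.scheme in FILE_SCHEMES:
        kind = "file"
    elif u.scheme in VDISK_SCHEMES:
        kind = "vdisk"
    elif u.scheme in DEVICE_SCHEMES:
        kind = "device"
    else:
        raise ValueError("unknown so-uri scheme: " + uri)
    return kind, u.hostname, u.path

classify_souri("nfs://heaped.sfbay/export/xvm/1.disk")
# -> ("file", "heaped.sfbay", "/export/xvm/1.disk")
```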

>> > ----------
>> > C.1.iii Zoneadm(1m) install
>> >
>> > When a zone is installed via the zoneadm(1m) "install" subcommand, the
>> > zones subsystem will first verify that any required so-uris exist and
>> > are accessible.
>> >
>> > If an so-uri points to a plain file, nfs file, or vdisk, and the object
>> > does not exist, the object will be created with the install-size that
>> > was specified via zonecfg(1m).  If the so-uri does not exist and an
>> > install-size was not specified via zonecfg(1m) an error will be
>> > generated and the install will fail.
>> >
>> > If an so-uri points to an explicit nfs server, the zones framework will
>> > need to mount the nfs filesystem containing the storage object.  The nfs
>> > server share containing the specified object will be auto-mounted at:
>> >
>> >     /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
>> Just for clarity, I think you mean:
>> - "will be mounted at".  I think "auto-mounted" conjures up the idea
>>   that there is integration with autofs.
>> - <host> is the NFS server
>> - <nfs-share-name> is the path on the NFS server.  Is this the exact
>>   same thing as <path-absolute> in the URI specification?  Is this the
>>   file that is mounted or the directory above the file?
>> My storage administrators give me grief if I create too many NFS mounts
>> (but I am not sure I've heard a convincing reason).  As I envision NFS
>> server layout, I think I would see something like:
>> vol
>>   zones
>>     zone1
>>       rootzpool
>>       zpool
>>     zone2
>>       rootzpool
>>       zpool
>>     zone3
>>       rootzpool
>>       zpool
>> It seems as though if these three zones are all running on the same box
>> the box will have at least the following mounts:
>> /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
>> /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
>> /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
> well, it all depends on what nfs shares are actually being exported.
> if the nfs server has the following share(s) exported:
>        nfsserver:/vol
> then you would have the following mount(s):
>        /var/zones/nfsmount/zone1/nfsserver/vol
>        /var/zones/nfsmount/zone2/nfsserver/vol
>        /var/zones/nfsmount/zone3/nfsserver/vol
> if the nfs server has the following share(s) exported:
>        nfsserver:/vol/zones
> then you would have the following mount(s):
>        /var/zones/nfsmount/zone1/nfsserver/vol/zones
>        /var/zones/nfsmount/zone2/nfsserver/vol/zones
>        /var/zones/nfsmount/zone3/nfsserver/vol/zones

In either of these cases I'll get nagged about having three mounts
when one would suffice.  I'm OK being nagged about that if it means
that I don't have something guessing how far up the tree they should
try to mount.

> if the nfs server has the following share(s) exported:
>        nfsserver:/vol/zones/zone1
>        nfsserver:/vol/zones/zone2
>        nfsserver:/vol/zones/zone3
> then you would have the following mount(s):
>        /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
>        /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
>        /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3

>> But maybe as many as:
>> /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/rootzpool
>> /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/zpool
>> /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/rootzpool
>> /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/zpool
>> /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/rootzpool
>> /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/zpool
> hm.  afaik, you can only share directories via nfs, and i'm assuming
> that "zpool" and "rootzpool" above are files (or volumes) which can
> actually store data.  in which case you would never mount them directly.

You can only share directories (I think) but you can mount files, much
like lofi allows you to mount files.  The only place I've seen this
done is by the Solaris installer when it mounts a flash archive file
directly rather than mounting the parent directory.  But... if zoneadm
needs to create files, it is hard to mount the file before it is
created.  Mounting the parent directory seems to be the right thing to
do.

>> With a slightly different arrangement this could be reduced to one.
>> Change
>> >     /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
>> To:
>>       /var/zones/nfsmount/<host>/<nfs-share-name>/<zonename>/<file>
> nice catch.
> in early versions of my proposal, the nfs:// uri i was planning to
> support allowed for the specification of mount options.  this required
> allowing for per-zone nfs mounts with potentially different mount
> options.  since then i've simplified things (realizing that most people
> really don't need or want to specify mount options) and i've switched to
> using the nfs uri defined in rfc 2224.  this means we can do away
> with the '<zonename>' path component as you suggest.

That was actually something I thought about after the fact.  When I've
been involved in performance problems in the past, being able to tune
mount options (e.g. protocol versions, block sizes, caching behavior,
etc.) has been important.

> i've updated the doc.
>> I can see that this would complicate things a bit because it would be
>> hard to figure out how far up the path is the right place for the mount.
> afaik, determining the mount point should be pretty straightforward.
> i was planning to get a list of all the shares exported by the specified
> nfs server, and then do a strncmp() of all the exported shares against
> the specified path.  the longest matching share name is the mount path.
> for example.  if we have:
>        nfs://jurassic/a/b/c/d/file
> and jurassic is exporting:
>        jurassic:/a
>        jurassic:/a/b
>        jurassic:/a/b/c
> then our mount path will be:
>        /var/zones/nfsmount/jurassic/a/b/c
> and our encapsulated zvol will be accessible at:
>        /var/zones/nfsmount/jurassic/a/b/c/d/file
> afaik, this is actually the only way that this could be implemented.

So long as we don't try to do one mount that covers the needs of
multiple zones, it is quite simple.  It gets difficult if jurassic is
exporting:

     jurassic:/a (ro)
     jurassic:/a/zones (ro)
     jurassic:/a/zones/zone1 (rw)
     jurassic:/a/zones/zone2 (rw)

Depending on the NFS (v3) server implementation (not this way with the
Solaris NFS implementation, but I think it is with NetApp) this is
problematic if the global zone mounts:

  jurassic:/a/zones on /var/zones/nfsmount/jurassic/a/zones

which makes /var/zones/nfsmount/jurassic/a/zones/zone{1,2} readable
but not writable.

If it doesn't try to be clever and simply mounts

  jurassic:/a/zones/zone1 on /var/zones/nfsmount/jurassic/a/zones/zone1
  jurassic:/a/zones/zone2 on /var/zones/nfsmount/jurassic/a/zones/zone2

Then all is well.

The optimization of a single mount is where things get ugly.  As such,
I'll let my storage people complain about having multiple mounts.
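For my own sanity, here's ed's longest-match rule as a small Python sketch (hypothetical helper, not actual zones code; I match on whole path components rather than a raw strncmp() so a share of /a/b can't claim /a/bc):

```python
import posixpath

def mount_point_for(server, shares, file_path, base="/var/zones/nfsmount"):
    """Pick the mount for an nfs so-uri via the longest-match rule above.

    'shares' is the list of paths exported by 'server'; the longest
    share path that is a prefix of 'file_path' gets mounted.
    """
    best = None
    for share in shares:
        # match whole path components, not a raw strncmp(), so that a
        # share of /a/b cannot claim /a/bc/file
        if file_path == share or file_path.startswith(share.rstrip("/") + "/"):
            if best is None or len(share) > len(best):
                best = share
    if best is None:
        raise ValueError(file_path + " is not under any share on " + server)
    return posixpath.join(base, server) + best

mount_point_for("jurassic", ["/a", "/a/b", "/a/b/c"], "/a/b/c/d/file")
# -> "/var/zones/nfsmount/jurassic/a/b/c"
```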

>> Perhaps if this is what I would like I would be better off adding a
>> global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
>> the path:/// uri instead.
>> Thoughts?
> i'm not sure i understand how you would like to see this functionality
> behave.
> wrt vfstab, i'd rather you not use that since that moves configuration
> outside of zonecfg.  so later, if you want to migrate the zone, you'll
> need to remember about that vfstab configuration and move it as well.
> if at all possible i'd really like to keep all the configuration within
> zonecfg(1m).
> perhaps you could explain your issues with the currently planned
> approach in a different way to help me understand it better?

The key thing here is that all of my zones are served from one or two
NFS servers.  Let's pretend that I have a T5440 with 200 zones on it.
The way the proposal is written, I would have 200 mounts in the global
zone of the form:

        on /var/zones/nfsmount/nfsserver/vol/zones/zone$i

When in reality, all I need is a single mount (subject to
implementation-specific details, as discussed above with ro vs. rw):

        on /var/myzones/nfs/$nfsserver/vol/zones

If my standard global zone deployment mechanism adds a vfstab entry
for $nfsserver:/vol/zones and configures each zone via path:///, I avoid
a storm of NFS mount requests at zone boot time as the global zone
boots.  The NFS mount requests are UDP-based RPC calls, which
sometimes get lost on the wire.  The timeout/retransmit may be such
that we add a bit of time to the overall zone startup process.  Not a
huge deal in most cases, but a confusing problem to understand.

In this case, I wouldn't consider the NFS mounts as being something
specific to a particular zone.  Rather, it is a common configuration
setting across all members of a particular "zone farm".

>> > If an so-uri points to a fibre channel lun, the zones subsystem will
>> > verify that the specified wwn corresponds to a global zone accessible
>> > fibre channel disk device.
>> >
>> > If an so-uri points to an iSCSI target or alias, the zones subsystem
>> > will verify that the iSCSI device is accessible on the local system.  If
>> > an so-uri points to a static iSCSI target and that target is not
>> > already accessible on the local host, then the zones subsystem will
>> > enable static discovery for the local iSCSI initiator and attempt to
>> > apply the specified static iSCSI configuration.  If the iSCSI target
>> > device is not accessible then the install will fail.
>> >
>> > Once a zones install has verified that any required so-uri exists and is
>> > accessible, the zones subsystem will need to initialise the so-uri.  In
>> > the case of a path or nfs path, this will involve creating a zpool
>> > within the specified file.  In the case of a vdisk, fibre channel lun,
>> > or iSCSI lun, this will involve creating an EFI/GPT partition on the
>> > device which uses the entire disk, then a zpool will be created within
>> > this partition.  For data protection purposes, if a storage object
>> > contains any pre-existing partitions, zpools, or ufs filesystems, the
>> > install will fail will fail with an appropriate error message.  To
>> s/will fail will fail/will fail/
> oops.  thanks.  ;)
>> > continue the installation and overwrite any pre-existing data, the user
>> > will be able to specify a new '-f' option to zoneadm(1m) install.  (This
>> > option mimics the '-f' option used by zpool(1m) create.)
>> >
>> > If zpool-preserve is set to true, then before initialising any target
>> > storage objects, the zones subsystem will attempt to import a
>> > pre-existing zpool from those objects.  This will allow users to
>> > pre-create a zpool with custom creation time options, for use with
>> > zones.  To successfully import a pre-created zpool for a zone install,
>> > that zpool must not be attached.  (Ie, any pre-created zpool must be
>> > exported from the system where it was created before a zone can be
>> > installed on it.)  Once the zpool is imported the install process will
>> > check for the existence of a /ROOT filesystem within the zpool.  If this
>> > filesystem exists the install will fail with an appropriate error
>> > message.  To continue the installation the user will need to specify the
>> > '-f' option to zoneadm(1m) install, which will cause the zones framework
>> > to delete the pre-existing /ROOT filesystem within the zpool.
>> Is this because the zone root will be installed <zonepath>/ROOT/<bename>
>> rather than <zonepath>/root?
> yes.
> the current zones zfs filesystem layout and management for
> opensolaris is documented here:
>        http://www.opensolaris.org/jive/thread.jspa?messageID=272726&#272726
> i've mentioned this and referred the user to '[07]'.  (which references
> the link above.)
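Makes sense.  For my notes, the create/preserve/-f rules in C.1.iii reduce to a small decision table; a hypothetical Python sketch (names and return values are mine):

```python
def install_action(preexisting_data, zpool_preserve, imported_ok,
                   root_fs_exists, force):
    """Condense the install rules from C.1.iii (names/returns are mine)."""
    if zpool_preserve and imported_ok:
        # imported a pre-created zpool; a leftover /ROOT needs install -f
        if root_fs_exists and not force:
            raise RuntimeError("zpool has a /ROOT filesystem; use install -f")
        return "reuse-recreate-ROOT" if root_fs_exists else "reuse"
    if preexisting_data and not force:
        # data protection: partitions, zpools, or ufs found on the object
        raise RuntimeError("storage object contains data; use install -f")
    return "overwrite" if preexisting_data else "create"
```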
>> > The newly created or imported root zpool will be named after the zone to
>> > which it is associated, with the assigned name being "<zonename>_rpool".
>> > This zpool will then be mounted at the zones rootpath and then the
>> > install process will continue normally[07].
>> This seems odd... why not have the root zpool mounted at zonepath rather
>> than zoneroot?  This way (e.g.) SUNWdetached.xml would follow the zone
>> during migrations.
> oops.  that's a mistake.  it will be mounted on the zonepath.  i've fixed
> this.
>> > XXX: use altroot at zpool creation or just manually mount zpool?
>> >
>> > If the user has specified a "zpool" resource, then the zones framework
>> > will configure, initialize, and/or import it in a similar manner to a
>> > zpool specified by the "rootzpool" resource.  The key differences are
>> > that the name of the newly created or imported zpool will be
>> > "<zonename>_<name>".  The specified zpool will also have the zfs "zoned"
>> > property set to "on", hence it will not be mounted anywhere in the
>> > global zone.
>> >
>> > XXX: do we need "zpool import -O file-system-property=" to set the
>> >      zoned property upon import.
>> >
>> > Once a zone configured with a so-uri is in the installed state, the
>> > zones framework needs a mechanism to mark that storage as in use to
>> > prevent it from being accessed by multiple hosts simultaneously.  The
>> > most likely situation where this could happen is via a zoneadm(1m)
>> > attach on a remote host.  The easiest way to achieve this is to keep the
>> > zpools associated with the storage imported and mounted at all times,
>> > and leverage the existing zpool support for detecting and preventing
>> > multi-host access.
>> >
>> > So whenever a global zone boots and the zones smf service runs, it will
>> > attempt to configure and import any shared storage objects associated
>> > with installed zones.  It will then continue to behave as it does today
>> > and boot any installed zones that have the autoboot property set.  If
>> > any shared storage objects fail to configure or import, then:
>> >
>> > - the zones associated with the failed storage will be transitioned
>> >   to the "uninstalled" state.
>> Is "uninstalled" a real state?  Perhaps "configured" is more
>> appropriate, as this allows a transition to "installed" via "zoneadm
>> attach".
> oops.  another bug.  fixed.
>> > - an error message will be emitted to the zones smf log file.
>> > - after booting any remaining installed zones that have autoboot set
>> >   to true, the zones smf service will enter the "maintenance" state,
>> >   thereby prompting the administrator to look at the zones smf log
>> >   file.
>> >
>> > After fixing any problems with shared storage accessibility, the
>> > admin should be able to simply re-attach the zone to the system.
>> >
>> > Currently the zones smf service is dependent upon multi-user-server, so
>> > all networking services required for access to shared storage should be
>> > properly configured well before we try to import any shared storage
>> > associated with zones.
>> May I propose a fix to the zones SMF service as part of this?  The
>> current integration with the global zone's SMF is rather weak in
>> reporting the real status of zones and allowing the use of SMF for
>> controlling the zones service.  In particular:
>> - If a zone fails to start, the state of svc:/system/zones:default does
>>   not reflect a maintenance or degraded state.
>> - If an admin wishes to start a zone the same way that the system would
>>   do it, "svcadm restart" and similar have the side effect of rebooting
>>   all zones on the system.
>> - There is no way to establish dependencies between zones or between a
>>   zone and something that needs to happen in the global zone.
>> - There isn't a good way to allow certain individuals within the global
>>   zone the ability to start/stop specific zones with RBAC or
>>   authorizations.
>> I propose that:
>> - zonecfg creates a new services instance svc:/system/zones:zonename
>>   when the zone is configured.  Its initial state is disabled.  If the
>>   service already exists sanity checking may be performed but it should
>>   not whack things like dependencies and authorizations.
>> - After zoneadm installs a zone, the general/enabled property of
>>   svc:/system/zones:zonename is set to match the zonecfg autoboot
>>   property.
>> - "zoneadm boot" is the equivalent of
>>   "svcadm enable -t svc:/system/zones:zonename"
>> - A new command "zoneadm shutdown" is the equivalent of
>>   "svcadm disable -t svc:/system/zones:zonename"
>> - "zoneadm halt" is the equivalent of "svcadm mark maintenance
>>   svc:/system/zones:zonename" followed by the traditional ungraceful
>>   teardown of the zone.
>> - Modification of the autoboot property with zonecfg (so long as the
>>   zone has been installed/attached) triggers the corresponding
>>   general/enabled property change in SMF.  This should set the property
>>   general/enabled without causing an immediate state change.
>> - zoneadm uninstall and zoneadm detach set the service to not autostart.
>> - zonecfg delete also deletes the service.
>> - A new property be added to zonecfg to disable SMF integration of this
>>   particular zone.  This will be important for people that have already
>>   worked around this problem (including ISV's providing clustering
>>   products) that don't want SMF getting in the way of their already
>>   working solution.
> yeah.  the zones team is well aware that our current smf integration
> story is pretty poor.  :(  we really want to improve our smf integration
> by moving all our configuration into smf, adding per-zone smf services,
> etc.  so while this project proposes some minor changes to the behavior
> of our existing smf service, i think that an overhaul of our smf
> integration is really a project in and of itself, and out of scope for
> this proposal.  (this proposal already has plenty of scope that could
> take a while to deliver.  ;)

Very well...
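Fair enough.  When that project does happen, the verb mapping I sketched above is basically table-driven, e.g. (Python sketch; the per-zone svc:/system/zones:&lt;zonename&gt; instances are still hypothetical):

```python
# The zoneadm -> svcadm mapping proposed above, table-driven.
# The per-zone svc:/system/zones:<zonename> instances are hypothetical.
def smf_commands(verb, zonename):
    fmri = "svc:/system/zones:" + zonename
    table = {
        "boot":     ["svcadm enable -t " + fmri],
        "shutdown": ["svcadm disable -t " + fmri],
        # halt also does the traditional ungraceful teardown afterwards
        "halt":     ["svcadm mark maintenance " + fmri],
    }
    return table[verb]
```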

>> > ----------
>> > C.1.viii Zoneadm(1m) clone
>> >
>> > Normally when cloning a zone which lives on a zfs filesystem the zones
>> > framework will take a zfs(1m) snapshot of the source zone and then do a
>> > zfs(1m) clone operation to create a filesystem for the new zone which is
>> > being instantiated.  This works well when all the zones on a given
>> > system live on local storage in a single zfs filesystem, but this model
>> > doesn't work well for zones with encapsulated roots.  First, with
>> > encapsulated roots each zone has its own zpool, and zfs(1m) does not
>> > support cloning across zpools.  Second, zfs(1m) snapshotting/cloning
>> > within the source zpool and then mounting the resultant filesystem onto
>> > the target zone's zoneroot would introduce dependencies between zones,
>> > complicating things like zone migration.
>> >
>> > Hence, for cloning operations, if the source zone has an encapsulated
>> > root, zoneadm(1m) will not use zfs(1m) snapshot/clone.  Currently
>> > zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
>> > is unable to use zfs(1m) snapshot/clone.  We could just fall back to
>> > this default behaviour for encapsulated root zones, but find+cpio is
>> > not error free and can have problems with large files.  So we propose to
>> > update zoneadm(1m) clone to detect when both the source and target zones
>> > are using separate zfs filesystems, and in that case attempt to use zfs
>> > send/recv before falling back to find+cpio.
>> Can a provision be added for running an external command to produce the
>> clone?  I envision this being used to make a call to a storage device to
>> tell the storage device to create a clone of the storage.  (This implies
>> that the super-secret tool to re-write the GUID would need to become
>> available.)
>> The alternative seems to be to have everyone invent their own mechanism
>> with the same external commands and zoneadm attach.
> hm.  currently there are internal brand hooks which are run during a
> clone operation, but i don't think it would be appropriate to expose
> these.
> a "zoneadm clone" is basically a copy + sys-unconfig.  if you have a
> storage device that can be used to do the copy for you, perhaps you
> could simply do the copy on the storage device, and then do a "zoneadm
> attach" of the new zone image?  if you want, i think it would be a
> pretty trivial RFE to add a sys-unconfig option to "zoneadm attach".
> that should let you get the same essential functionality as clone,
> without having to add any new callbacks.  thoughts?

Since cloning already requires the zone to be down, I don't think that
too many people are probably cloning anything other than zones that
are intended to be template zones that are never booted.  Such zones
can be pre-sys-unconfig'd to work around this problem, which in my
opinion is not worth a lot of effort.

I further suspect that most places would prefer that zones were not
sys-unconfig'd so that they could just tweak the few things that need
to be tweaked rather than putting bogus information in /etc/sysidcfg
then going back and fixing things afterwards.  For example, sysidcfg
is unable to cope with the notion that you might use LDAP for
accounts, DNS for hosts, and files for things like services.
Patching, upgrades, etc. also tend to break things related to sysidcfg
(e.g. disabling various SMF services required by name services).
Hopefully sysidcfg goes away or gets fixed...
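Back on the copy mechanics, though: the fallback order in C.1.viii presumably reduces to something like this (Python sketch; names and logic are mine):

```python
def clone_method(same_zpool, src_is_zfs, dst_is_zfs):
    """Pick a clone strategy per C.1.viii (illustrative only).

    zfs snapshot+clone can't cross zpools, so encapsulated-root zones
    (one zpool per zone) fall through to send/recv, and to find+cpio
    only when one side isn't zfs at all.
    """
    if src_is_zfs and dst_is_zfs:
        return "zfs snapshot+clone" if same_zpool else "zfs send/recv"
    return "find+cpio"
```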

>> > Today, the zoneadm(1m) clone operations ignores any additional storage
>> > (specified via the "fs", "device", or "dataset" resources) that may be
>> > associated with the zone.  Similarly, the clone operation will ignore
>> > additional storage associated with any "zpool" resources.
>> >
>> > Since zoneadm(1m) clone will be enhanced to support cloning between
>> > encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
>> > clone will be documented as the recommended migration mechanism for
>> > users who wish to migrate existing zones from one format to another.
>> >
>> >
>> > ----------
>> > C.2 Storage object uid/gid handling
>> >
>> > One issue faced by all VTs that support shared storage is dealing with
>> > file access permissions of storage objects accessible via NFS.  This
>> > issue doesn't affect device based shared storage, or local files and
>> > vdisks, since these types of storage are always accessible, regardless
>> > of the uid of the access process (as long as the accessing process has
>> > the necessary privileges).  But when accessing files and vdisks via NFS,
>> > the accessing process cannot use privileges to circumvent restrictive
>> > file access permissions.  This issue is also complicated by the fact
>> > that by default most NFS servers will map all accesses by a remote root
>> > user to a different uid, usually "nobody".  (a process known as "root
>> > squashing".)
>> >
>> > In order to avoid root squashing, or requiring users to set up special
>> > configurations on their NFS servers, whenever the zones framework
>> > attempts to create a storage object file or vdisk, it will temporarily
>> > change its uid and gid to the "xvm" user and group, and then create the
>> > file with 0600 access permissions.
>> >
>> > Additionally, whenever the zones framework attempts to access a storage
>> > object file or vdisk it will temporarily switch its uid and gid to match
>> > the owner and group of the file/vdisk, ensure that the file is readable
>> > and writeable by its owner (updating the file/vdisk permissions if
>> > necessary), and finally set up the file/vdisk for access via a zpool
>> > import or lofiadm -a.  This should allow the zones framework to
>> > access storage object files/vdisks that were created by any user,
>> > regardless of their ownership, simplifying file ownership and management
>> > issues for administrators.
>> This implies that the xvm user is getting some additional privileges.
>> What are those privileges?
> hm.  afaik, the xvm user isn't defined as having any particular
> privileges.  (/etc/user_attr doesn't have an xvm entry.)  i wasn't
> planning on defining any privilege requirements for the xvm user.
> zoneadmd currently runs as root with all privs.  so zoneadmd will be
> able to switch to the xvm user to create encapsulated zpool
> files/vdisks.  similarly, zoneadmd will also be able to switch uid to
> the owner of any other objects it may need to access.

Gotcha.  It will be along the lines of:

   system("/sbin/zpool ...");

Rather than:

   system("/usr/bin/su - xvm /sbin/zpool ...");

Assuming you are using system(3C) and not libzfs.
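And for the create path in C.2, I'd picture something like this Python sketch (the seteuid/setegid dance is guarded since it only works with the privileges zoneadmd runs with; the caller would pass the "xvm" uid/gid -- all names here are mine):

```python
import os
import stat

def create_storage_object(path, size, uid=None, gid=None):
    """Create a zpool backing file with 0600 perms, per section C.2.

    When running with root privileges (as zoneadmd does), the caller
    would pass the "xvm" uid/gid and we temporarily switch effective
    ids so the file survives NFS root squashing.  Illustrative only.
    """
    privileged = uid is not None and gid is not None and os.geteuid() == 0
    if privileged:
        os.setegid(gid)      # switch group first, while still privileged
        os.seteuid(uid)
    try:
        # O_EXCL: refuse to clobber a pre-existing object (cf. install -f)
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.ftruncate(fd, size)   # sparse file of the requested install-size
        os.close(fd)
    finally:
        if privileged:
            os.seteuid(0)
            os.setegid(0)
    return stat.S_IMODE(os.stat(path).st_mode)
```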

>> > ----------
>> > C.3 Taskq enhancements
>> >
>> > The integration of Duckhorn[08] greatly simplifies the management of cpu
>> > resources assigned to a zone.  This management is partially implemented
>> > through the use of dynamic resource pools, where zones and their
>> > associated cpu resources can both be bound to a pool.
>> >
>> > Internally, zfs has worker threads associated with each zpool.  These
>> > are kernel taskq threads which can run on any cpu which has not been
>> > explicitly allocated to a cpu set/partition/pool.
>> >
>> > So today, for any zones living on zfs filesystems, and running in a
>> > dedicated cpu pool, any zfs disk processing associated with that zone is
>> > not done by the cpus bound to that zone's pool.  Essentially all the
>> > zone's zfs processing is done for "free" by the global zone.
>> >
>> > With the introduction of zpools encapsulated within storage objects,
>> > which are themselves associated with specific zones, it would be
>> > desirable to have the zpool worker threads bound to the cpus currently
>> > allocated to the zone.  Currently, zfs uses taskq threads for each
>> > zpool, so one way of doing this would be to introduce a mechanism that
>> > allows for the binding of taskqs to pools.
>> >
>> > Hence we propose the following new interfaces:
>> >     zfs_poolbind(char *, poolid_t);
>> >     taskq_poolbind(taskq_t, poolid_t);
>> >
>> > When a zone that is bound to a pool is booted, the zones framework
>> > will call zfs_poolbind() for each zpool associated with an encapsulated
>> > storage object bound to the zone being booted.
>> >
>> > Zfs will in turn use the new taskq pool binding interfaces to bind all
>> > its taskqs to the specified pools.  This mapping is transient and zfs
>> > will not record or persist this binding in any way.
>> >
>> > The taskq implementation will be enhanced to allow for binding worker
>> > threads to a specific pool.  If taskq threads are created for a taskq
>> > which is bound to a specific pool, those new threads will also inherit
>> > the same pool binding.  The taskq to pool binding will remain in effect
>> > until the taskq is explicitly rebound or the pool to which it is bound
>> > is destroyed.
>> Any thoughts of doing something similar for dedicated NICs?  From
>> dladm(1M):
>>      cpus
>>          Bind the processing of packets for a given data link  to
>>          a  processor  or a set of processors. The value can be a
>>          comma-separated list of one or more  processor  ids.  If
>>          the  list  consists of more than one processor, the pro-
>>          cessing will spread out to all the  processors.  Connec-
>>          tion  to  processor affinity and packet ordering for any
>>          individual connection will be maintained.
>> That is, the enhancement is already there, it's just a matter of making
>> use of it.
> i'm currently engaged with someone on the crossbow team who is working
> on a proposal to allow for binding datalinks to pools.  but once again,
> that's a separate project.  ;)


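Restating the boot-time flow above as pseudocode, in case it helps other
readers (resource names per section C.1; this is my reading of the
proposal, not the implementation):

```
for each rootzpool (or similar) resource in the zone's configuration:
    pool = the zpool encapsulated in that storage object
    zfs_poolbind(pool's name, poolid of the zone's resource pool)
        -> zfs rebinds each of that zpool's taskqs via
           taskq_poolbind(tq, poolid)
    # the binding is transient: zfs does not persist it, and it lapses
    # when the taskq is rebound or the pool is destroyed
```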
>> > ----------
>> > C.4 Zfs enhancements
>> >
>> > In addition to the zfs_poolbind() interface proposed above, the
>> > zpool(1m) "import" command will need to be enhanced.  Currently the
>> > zpool(1m) import by default scans all storage devices on the system
>> > looking for pools to import.  The caller can also use the '-d' option to
>> > specify a directory within which the zpool(1m) command will scan for
>> > zpools that may be imported.  This scanning involves sampling many
>> > objects.  When dealing with zpools encapsulated in storage objects, this
>> > scanning is unnecessary since we already know the path to the objects
>> > which contain the zpool.  Hence, the '-d' option will be enhanced to
>> > allow for the specification of a file or device.  The user will also be
>> > able to specify this option multiple times, in case the zpool spans
>> > multiple objects.
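For illustration, the enhanced import would presumably be invoked along
these lines (paths and pool name hypothetical; the '-d <file>' form is
the proposed extension, so this does not work today):

```shell
# point zpool(1m) directly at the backing objects rather than letting
# it scan a directory; -d is repeated when the pool spans objects
zpool import -d /net/nfssrv/zones/z1/disk0.img \
             -d /net/nfssrv/zones/z1/disk1.img z1pool
```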
>> >
>> >
>> > ----------
>> > C.5 Lofi and lofiadm(1m) enhancements
>> >
>> > Currently, there is no way for a global zone to access the contents of a
>> > vdisk.  Vdisk support was first introduced in VirtualBox.  xVM then
>> > adopted the VirtualBox code for vdisk support.  With both technologies,
>> > the only way to access the contents of a vdisk is to export it to a VM.
>> >
>> > To allow zones to use vdisk devices we propose to leverage the code
>> > introduced by xVM by incorporating it into lofi.  This will allow any
>> > solaris system to access the contents of vdisk devices.  The interface
>> > changes to lofi to allow for this are fairly straightforward.
>> >
>> > A new '-l' option will be added to the lofiadm(1m) "-a" device creation
>> > mode.  The '-l' option will indicate to lofi that the new device should
>> > have a label associated with it.  Normally lofi devices are named
>> > /dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the lofi device number.
>> > When a disk device has a label associated with it, it exports many
>> > device nodes with different names.  Therefore lofi will need to be
>> > enhanced to support these new device names, with multiple nodes
>> > per device.  These new names will be:
>> >
>> >     /dev/lofi/dsk<I>/p<j>           - block device partitions
>> >     /dev/lofi/dsk<I>/s<j>           - block device slices
>> >     /dev/rlofi/dsk<I>/p<j>          - char device partitions
>> >     /dev/rlofi/dsk<I>/s<j>          - char device slices
>> One of the big weaknesses with lofi is that you can't count on the
>> device name being the same between boots.  Could -l take an argument
>> to be used instead of "dsk<I>"?  That is:
>>    lofiadm -a -l coolgames /media/coolgames.iso
>> Creates:
>>    /dev/lofi/coolgames/p<j>
>>    /dev/lofi/coolgames/s<j>
>>    /dev/rlofi/coolgames/p<j>
>>    /dev/rlofi/coolgames/s<j>
>> For those cases where legacy behavior is desired, an optional %d can be
>> used to create the names you suggest above.
>>    lofiadm -a -l dsk%d /nfs/server/zone/stuff
> so there are a lot of improvements that could be done to lofi.  one
> improvement that i think we should do is to allow for persistent lofi
> devices that come back after reboots.  custom device naming is another.
> but once again, i think that is outside the scope of this project.
> (this project will facilitate these other changes because it is creating
> an smf service for lofi, where persistent configuration could be stored,
> but adding that functionality will have to be another project.)
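Fair enough.  For other readers, here is how I understand the proposed
naming would play out in practice (paths and the device/slice numbers
are assumed for illustration, not taken from the doc):

```shell
# attach a vdisk with a label; suppose lofi assigns device number 1
lofiadm -a -l /zones/z1/root.vdi
# the labelled device then exports per-partition and per-slice nodes
ls /dev/lofi/dsk1            # p0 p1 ... s0 s1 ...
# a zpool inside the vdisk could be imported via the char device
zpool import -d /dev/rlofi/dsk1/s0 z1pool
```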


>> > ----------
>> > C.6 Performance considerations
>> >
>> > As previously mentioned, this proposal primarily simplifies the process
>> > of configuring zones on shared storage.  In most cases these proposed
>> > configurations can be created today, but no one has actually verified
>> > that these configurations perform acceptably.  Hence, in conjunction
>> > with providing functionality to simplify the setup of these configs,
>> > we also need to quantify their performance to make sure that
>> > none of the configurations suffer from gross performance problems.
>> >
>> > The most straightforward configurations, with the least potential for
>> > poor performance, are ones using local devices, fibre channel luns, and
>> > iSCSI luns.  These configurations should perform identically to the
>> > configurations where the global zone uses these objects to host zfs
>> > filesystems without zones.  Additionally, the performance of these
>> > configurations will mostly be dependent upon the hardware associated
>> > with the storage devices.  Hence the performance of these
>> > configurations is for the most part uninteresting and performance
>> > analysis of these configurations can be skipped.
>> >
>> > Looking at the performance of storage objects which are local files or
>> > nfs files is more interesting.  In these cases the zpool that hosts the
>> > zone will be accessing its storage via the zpool vdev_file vdev_ops_t
>> > interface.  Currently, this interface doesn't receive as much use and
>> > performance testing as some of the other zpool vdev_ops_t interfaces.
>> > Hence it will be worthwhile to measure the performance of a zpool backed by
>> > a file within another zfs filesystem.  Likewise we will want to measure
>> > the performance of a zpool backed by a file on an NFS filesystem.
>> > Finally, we should compare these two performance points to a zone which
>> > is not encapsulated within a zpool, but is instead installed directly on
>> > a local zfs filesystem.  (These comparisons are not really that
>> > interesting when dealing with block device based storage objects.)
>> Reminder for when I am testing: is this a case where forcedirectio will
>> make a lot of sense?  That is, zfs is already buffering, don't make NFS
>> do it too.
> this is a great question, and i don't know the answer.  i'll have to
> ask some nfs folks and do some perf testing to determine what should
> be done here.  i've added a note about forcedirectio to the doc.

Sounds good
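For the record, the forcedirectio experiment I have in mind is along
these lines (server name and paths hypothetical):

```shell
# mount the filesystem holding the backing file with forcedirectio so
# the NFS client does not buffer data that zfs is already caching,
# then compare against the same workload on a default (cached) mount
mount -F nfs -o forcedirectio nfssrv:/export/zones /zones-nfs
```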

Mike Gerdts

Attachment: zones_on_shared_storage-1.1.diff
Description: Binary data
