[ second reply, includes revised proposal ]

hey mike,

thanks for all the great feedback.
my replies to your individual comments are inline below.

i've updated my proposal to include your feedback, but i'm unable to
attach it to this reply because of mail size restrictions imposed by
this alias.  i'll send some follow up emails which include the revised
proposal.

thanks again,
ed
" please ensure that the vim modeline option is not disabled
vim:textwidth=72

-------------------------------------------------------------------------------
Zones on shared storage (v1.1)


----------
A. INTRODUCTION

The most commonly requested zones feature is support for the NFS server
within a zone (4964859).  The next two most requested zones features are
support for hosting zones on shared storage, specifically NFS and iSCSI.
These last two requests are being tracked via the following bugs:

        6688400 Want zonepath on iscsi targets
        4963321 RFE: hosting root filesystems for Zones on NFS servers

This document proposes a plan to address the latter two issues.  These
proposed changes will also start to bring the zones storage
administration closer in line with the experience provided by other
virtualization technologies (VTs, examples of which are xVM, LDOMs, and
VirtualBox).


----------
B. BACKGROUND

Currently, it is possible to combine existing supported technologies
(FC, iSCSI, NFS, lofi, zfs, etc) to host zones on shared storage, and
there are a few blog entries out there describing how to create such
configurations[00].  While these configurations are usable and
technically supportable, they require extensive configuration of
multiple different technologies, making them complex, potentially
fragile (since they are not regularly tested), and out of reach for most
customers.

Creating these configurations today also causes problems with the use of
other existing zones features, the most obvious of which is zone
migration.  These configurations complicate zone migration because all
of the additional global zone configuration that was required to host
the zone on shared storage must be tracked and then migrated along with
the zone data itself.  VM migration is a critical component of all
existing VTs, and anything that can be done to improve zone migration
support would be a great benefit to zones administrators.

All the other virtualization technologies currently available in Solaris
support the hosting of Virtual Machines (VMs) on shared storage.  They
all do this in the same fashion, which involves taking a "storage
object"[01] which is accessible from the "global zone"[02], and making
it visible within the VM as a local disk.  In this document we refer
to this process as "encapsulation", since the "disk" which a VM thinks
it's accessing is actually contained (ie encapsulated) within some
storage object in the global zone.

This encapsulation has advantages and disadvantages.  Encapsulation
makes it easier to manage the storage associated with VMs.  These
storage objects may be accessible from multiple hosts simultaneously.
They can be backed up, restored, moved around, and copied as individual
files instead of filesystems.  One disadvantage of the encapsulation
used by these other VTs is that currently on Solaris, there is no way to
open up these encapsulated disks and access their contents from the
global zone.  Currently, the only way to access these encapsulated disks
is to import them into a running VM.

Another interesting development with these encapsulated disks used by
other VTs is the proliferation of storage object formats.  These custom
storage objects.  These custom storage object formats allow for features
which are transparent to the
VM using the disks and independent of any features offered by the
underlying global zone.  While specific feature sets may vary, common
features that may be supported are things like compression, sparseness,
dedup, snapshotting, and rollback.  Management of all these storage
object features is usually done from the global zone.  Examples of some
of these different formats are:

        VDI - VirtualBox Virtual HDD
        VHD - Microsoft Virtual Hard Disk
        VMDK - VMWare Virtual Machine Disk

These formats are also commonly used for moving and sharing VM images,
and it's not uncommon for users to take VMs encapsulated in one format
and convert them into another.


----------
C. PROPOSAL / DESCRIPTION

The essence of this proposal is to enhance zonecfg(1m) to allow for the
specification of shared storage objects which can be used to encapsulate
zone filesystems, including the zones root filesystem.  Once shared
storage objects are specified in zonecfg(1m) the management of this
shared storage will be handled automatically by the zones framework.
This will allow administrators to host zones on shared storage with no
additional system configuration being required outside of zonecfg(1m).
It will also provide zones with a consistent global zone storage usage
and administration experience compared with other VTs.  All the existing
zoneadm(1m) operations that administrators can do today should continue
to work seamlessly with this new support for shared storage objects.

The proposal is broken down into the following subsections:
        C.0 Out of scope
        C.1 Zones enhancements
        C.2 Storage object uid/gid handling
        C.3 Taskq enhancements
        C.4 Zfs enhancements
        C.5 Lofi and lofiadm(1m) enhancements
        C.6 Performance considerations
        C.7 Phased delivery
        C.8 Future work


----------
C.0 Out of scope

Before discussing the details of the changes being proposed, here are
the things this proposal will not address:

- Hosting zones natively on NFS.  We will not be considering
  enhancements which would allow zone roots to live natively (ie,
  unencapsulated) on NFS.  Supporting this functionality is not prevented
  by anything proposed herein, but initially we are not pursuing this
  option because:

        - Encapsulation allows us to host zones on a myriad of storage
          in addition to NFS.
        - Encapsulation brings the zones storage feature set closer in
          line with other VTs.
        - Encapsulation will likely be easier to implement than
          unencapsulated hosting of a zone root filesystem on NFS.

- Advanced shared storage configuration.  Initially there will be no
  support for advanced shared storage configuration options.  If
  advanced configuration options need to be used, they will need to be
  setup in the global zone outside the zones framework.  That said,
  nothing in this proposal should prevent us from adding support for
  more complex configurations in the future, should they become popular.
  Some examples of what would be considered advanced shared storage
  configuration options are (and this is by no means a complete list):

        - iSCSI target discovery management
        - iSCSI target parameter configuration (CHAP, etc)
        - custom vdisk creation options
        - encapsulated zpools spanning multiple storage objects
        - custom zpool/zfs creation options


----------
C.1 Zones enhancements

This is a list of proposed enhancements to specific zones subsystems.
It is broken down into the following categories:
        C.1.i    Zonecfg(1m)
        C.1.ii   Storage object uri (so-uri) format
        C.1.iii  Zoneadm(1m) install
        C.1.iv   Zoneadm(1m) attach
        C.1.v    Zoneadm(1m) boot
        C.1.vi   Zoneadm(1m) detach
        C.1.vii  Zoneadm(1m) uninstall
        C.1.viii Zoneadm(1m) clone

----------
C.1.i Zonecfg(1m)

The zonecfg(1m) command will be enhanced with the following two new
resources and associated properties:

        rootzpool                               resource
                src                             resource property
                install-size                    resource property
                zpool-preserve                  resource property
                dataset                         resource property

        zpool                                   resource
                src                             resource property
                install-size                    resource property
                zpool-preserve                  resource property
                name                            resource property

The new resource and properties will be defined as follows:

"rootzpool"
    - Description: Identifies a shared storage object (and its
        associated parameters) which will be used to contain the root
        zfs filesystem for a zone.

"zpool"
    - Description: Identifies a shared storage object (and its
        associated parameters) which will be made available to the
        zone as a delegated zfs dataset.

"src"
    - Status: Required.
    - Format: Storage object uri (so-uri).  (See definition below.)
    - Description: Identifies the storage object associated with this
        resource.

"install-size"
    - Status: Optional.
    - Format: Integer.  Defaults to bytes, but can be flagged as
        gigabytes, kilobytes, or megabytes, with a g, k, or m suffix,
        respectively.
    - Description: If the specified storage object doesn't exist at zone
        install time it will be created with this specific size.  This
        property has no effect for storage objects which already exist and
        have a pre-defined size.

"zpool-preserve"
    - Status: Optional.
    - Format: Boolean.  Defaults to false.
    - Description: When doing an install, if this property is set to
        true and a zpool already exists on the specified storage object,
        it will be used.  When doing a destroy,
        if this property is set to true, the root zpool will not be
        destroyed.

"dataset"
    - Status: Optional
    - Format: zfs filesystem name component (can't contain a '/')
    - Description: Name of a dataset within the root zpool to delegate
        to the zone.

"name"
    - Status: Required
    - Format: zfs filesystem name component (can't contain a '/')
    - Description: Used as part of the name for a zpool which will be
        delegated to the zone.

Zonecfg(1m) "verify" will verify the syntax of any "rootzpool" resource
group (and its properties), but it will NOT verify the accessibility of
any storage specified by an so-uri.  (This is because accessing the
storage specified by an so-uri could require configuration changes to
other subsystems.)
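
For illustration, here is a minimal zonecfg(1m) session showing how
these resources might be used together (the zone name, storage
locations, and size below are hypothetical examples only):

---8<---
zonecfg -z myzone
...
add rootzpool
set src=nfs://heaped.sfbay/export/xvm/myzone-root.disk
set install-size=8g
end
add zpool
set src=iscsi:///alias=myzone-data@0
set name=data
end
---8<---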

These new resources should provide zone administrators with lots of
flexibility when it comes to deploying zones on shared storage.  Some of
the likely zone deployment models that are envisioned and enabled by
this proposal are:

- A zone with an encapsulated "rootzpool" zpool.  In this scenario the
  OS will be stored in the "rootzpool", and all non-OS software and data
  will also be stored in datasets which are descendants of the zone root
  dataset.  This means that system operations which snapshot and clone the
  OS will also snapshot and clone non-OS software and data.

- A zone with an encapsulated "rootzpool" zpool and a "dataset" defined
  within the "rootzpool".  In this scenario, the OS will be stored in in
  the "rootzpool", and all non-OS software and data will be stored within
  "dataset", which is also contained within the "rootzpool", but is not a
  direct descendant of any zone root dataset.  This means that system
  operations which snapshot and clone the OS will not affect non-OS
  software and data.

- A zone with an encapsulated "rootzpool" and one or more encapsulated
  "zpool"s.  In this scenario, the OS will be stored in in the
  "rootzpool", and all non-OS software and data will be stored within
  other "zpool"s.  This means that system operations which snapshot and
  clone the OS will not affect non-OS software and data.

More information about these new resources and how they will be managed
by the zones framework is available below.


----------
C.1.ii Storage object uri (so-uri) format

The storage object uri (so-uri) syntax[03] will conform to the standard
uri format defined in RFC 3986 [04].  The nfs URI scheme is defined in
RFC 2224 [05].  The so-uri syntax can be summarised as follows:

File storage objects:

    path:///<file-absolute>
    nfs://<host>[:port]/<file-absolute>

Vdisk storage objects:

    vpath:///<file-absolute>
    vnfs://<host>[:port]/<file-absolute>

Device storage objects:

    path:///dev/<file-absolute>
    fc:///<wwn>[@<lun>]
    iscsi:///alias=<alias>[@<lun>]
    iscsi:///target=<target>[@<lun>]
    iscsi://<host>[:port]/[tpgt=<tpgt>/]target=<target>[@<lun>]

File storage objects point to plain files on local, nfs, or cifs
filesystems.  These files are used to contain zpools which store zone
datasets.  These are the simplest types of storage objects.  Once
created, they have a fixed size, can't be grown, and don't support
advanced features like snapshotting, etc.  Some example file so-uri's
are:

path:///export/xvm/vm1.disk
        - a local file
path:///net/heaped.sfbay/export/xvm/1.disk
        - an nfs file accessible via autofs
nfs://heaped.sfbay/export/xvm/1.disk
        - the same file specified directly via an nfs so-uri

Vdisk storage objects are similar to file storage objects in that they
can live on local, nfs, or cifs filesystems, but they each have their
own special data format and varying featuresets, with support for things
like snapshotting, etc.  Some common vdisk formats are: VDI, VMDK and
VHD.  Some example vdisk so-uri's are:

vpath:///export/xvm/vm1.vmdk
        - a local vdisk image
vpath:///net/heaped.sfbay/export/xvm/1.vmdk
        - an nfs vdisk image accessible via autofs
vnfs://heaped.sfbay/export/xvm/1.vmdk
        - the same vdisk image specified directly via an nfs so-uri

Device storage objects specify block storage devices in a host
independent fashion.  Some /dev device names may already be named in a
host independent fashion.  In this case the admin can simply specify the
/dev device path for this device as the so-uri.  When configuring FC or
iscsi storage on different hosts, the storage configuration normally
lives outside of zonecfg, and the configured storage may have varying
/dev/dsk/cXtXdX* names.  In these cases, the so-uri syntax provides a
way to specify storage in a host independent fashion, and during zone
management operations, the zones framework can map this storage to a
host specific device path.  Some example device so-uri's are:

path:///dev/vx/dsk/zone1/rootvol
        - a Veritas volume that is accessible from multiple hosts using the
          same name
fc:///20000014c3474...@0
        - lun 0 of a fc disk with the specified wwn
iscsi:///alias=oracle zone r...@0
        - lun 0 of an iscsi disk with the specified alias.
iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
        - lun 0 of an iscsi disk with the specified target id.


----------
C.1.iii Zoneadm(1m) install

When a zone is installed via the zoneadm(1m) "install" subcommand, the
zones subsystem will first verify that any required so-uris exist and
are accessible.

If an so-uri points to a plain file, nfs file, or vdisk, and the object
does not exist, the object will be created with the install-size that
was specified via zonecfg(1m).  If the so-uri does not exist and an
install-size was not specified via zonecfg(1m) an error will be
generated and the install will fail.

If an so-uri points to an explicit nfs server, the zones framework will
need to mount the nfs filesystem containing the storage object.  The nfs
server share containing the specified object will be mounted at:

        /var/zones/nfsmount/<host>/<nfs-share-name>

If an so-uri points to a fibre channel lun, the zones subsystem will
verify that the specified wwn corresponds to a global zone accessible
fibre channel disk device.

If an so-uri points to an iSCSI target or alias, the zones subsystem
will verify that the iSCSI device is accessible on the local system.  If
an so-uri points to a static iSCSI target and that target is not
already accessible on the local host, then the zones subsystem will
enable static discovery for the local iSCSI initiator and attempt to
apply the specified static iSCSI configuration.  If the iSCSI target
device is not accessible then the install will fail.

Once a zones install has verified that any required so-uri exists and is
accessible, the zones subsystem will need to initialise the so-uri.  In
the case of a path or nfs path, this will involve creating a zpool
within the specified file.  In the case of a vdisk, fibre channel lun,
or iSCSI lun, this will involve creating an EFI/GPT partition on the
device which uses the entire disk, then a zpool will be created within
this partition.  For data protection purposes, if a storage object
contains any pre-existing partitions, zpools, or ufs filesystems, the
install will fail with an appropriate error message.  To continue the
installation and overwrite any pre-existing data, the user will be able
to specify a new '-f' option to zoneadm(1m) install.  (This option
mimics the '-f' option used by zpool(1m) create.)

If zpool-preserve is set to true, then before initialising any target
storage objects, the zones subsystem will attempt to import a
pre-existing zpool from those objects.  This will allow users to
pre-create a zpool with custom creation time options, for use with
zones.  To successfully import a pre-created zpool for a zone install,
that zpool must not be attached.  (Ie, any pre-created zpool must be
exported from the system where it was created before a zone can be
installed on it.)  Once the zpool is imported the install process will
check for the existence of a /ROOT filesystem within the zpool.  If this
filesystem exists the install will fail with an appropriate error
message.  To continue the installation the user will need to specify the
'-f' option to zoneadm(1m) install, which will cause the zones framework
to delete the pre-existing /ROOT filesystem within the zpool.  (This is
done because during install the zones root filesystem will be created
under /ROOT.  See [07] for more details.)
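
For example, an administrator could pre-create and hand off a zpool
with custom creation time options as follows (a sketch; the pool name,
device, and creation options are hypothetical examples, and the pool
uses the <zonename>_rpool naming convention described below):

---8<---
# on any host with access to the shared storage
zpool create -O compression=on myzone_rpool /dev/dsk/c4t0d0
zpool export myzone_rpool
# a later 'zoneadm -z myzone install' with zpool-preserve=true can then
# import and reuse this zpool
---8<---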

The newly created or imported root zpool will be named after the zone to
which it is associated, with the assigned name being "<zonename>_rpool".
This zpool will then be mounted on the zones zonepath and then the
install process will continue normally[07].
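
Conceptually, for a file or nfs based rootzpool the install time setup
amounts to roughly the following sequence, which the zones framework
will perform automatically (hypothetical host, file, and zone names; an
8g install-size is assumed):

---8<---
mkdir -p /var/zones/nfsmount/heaped.sfbay/export/xvm
mount -F nfs heaped.sfbay:/export/xvm \
    /var/zones/nfsmount/heaped.sfbay/export/xvm
mkfile 8g /var/zones/nfsmount/heaped.sfbay/export/xvm/myzone-root.disk
zpool create myzone_rpool \
    /var/zones/nfsmount/heaped.sfbay/export/xvm/myzone-root.disk
---8<---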

XXX: use altroot at zpool creation or just manually mount zpool?

If the user has specified a "zpool" resource, then the zones framework
will configure, initialize, and/or import it in a similar manner to a
zpool specified by the "rootzpool" resource.  The key differences are
that the name of the newly created or imported zpool will be
"<zonename>_<name>".  The specified zpool will also have the zfs "zoned"
property set to "on", hence it will not be mounted anywhere in the
global zone.

XXX: do we need "zpool import -O file-system-property=" to set the
     zoned property upon import.

Once a zone configured with a so-uri is in the installed state, the
zones framework needs a mechanism to mark that storage as in use to
prevent it from being accessed by multiple hosts simultaneously.  The
most likely situation where this could happen is via a zoneadm(1m)
attach on a remote host.  The easiest way to achieve this is to keep the
zpools associated with the storage imported and mounted at all times,
and leverage the existing zpool support for detecting and preventing
multi-host access.

So whenever a global zone boots and the zones smf service runs, it will
attempt to configure and import any shared storage objects associated
with installed zones.  It will then continue to behave as it does today
and boot any installed zones that have the autoboot property set.  If
any shared storage objects fail to configure or import, then:

- the zones associated with the failed storage will be transitioned
  to the "configured" state.
- an error message will be emitted to the zones smf log file.
- after booting any remaining installed zones that have autoboot set
  to true, the zones smf service will enter the "maintenance" state,
  thereby prompting the administrator to look at the zones smf log
  file.

After fixing any problems with shared storage accessibility, the
admin should be able to simply re-attach the zone to the system.

Currently the zones smf service is dependent upon multi-user-server, so
all networking services required for access to shared storage should be
properly configured well before we try to import any shared storage
associated with zones.

On system shutdown, the zones system will NOT export zpools contained
within storage objects used by the zone.  Zpools contained within storage
objects assigned to installed zones will only be exported during zone
detach.  More details about the behaviour of zone detach are provided
below.


----------
C.1.iv Zoneadm(1m) attach

The zoneadm(1m) attach operation moves a zone from the configured to the
installed state.  This operation will behave similarly to a
zoneadm(1m) install wrt so-uri management.  One key difference between
attach and install is that during an attach, any missing storage objects
will not be automatically created and instead the attach will fail.
Once all the storage objects associated with an attaching zone are
configured and accessible, the attach operation will attempt to import
and mount the zpools contained within those storage objects.  If zfs
detects that any zpools are in use by another host the attach will fail.
A new '-f' option will be added to zoneadm(1m) which will allow the user
to force import and mount any zpools which appear to still be in use.
(Once again, this option mimics the '-f' option used by zpool(1m)
import.)
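
For example, a typical attach of a previously detached encapsulated
zone might look like this (hypothetical zone name; the '-f' form is
only needed when zfs reports the zpools as in use elsewhere):

---8<---
zoneadm -z myzone attach
# or, to force the import of zpools that appear to be in use:
zoneadm -z myzone attach -f
---8<---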

Currently, when a zone is attached to a system a software inventory
check is done to verify that the contents of the zone are in sync with
the global zone.  If they are then the zone is successfully attached.
If the contents are older than what is present in the global zone, then
the attach fails unless the user specifies the "-u" update option.  If
the contents of the zone are newer than what is in the global zone then
the attach will fail.  Encapsulated zones within zpools changes the
behaviour of attach slightly because if the zone was created on a host
with newer global zone bits, it's possible that the zpool/zfs filesystem
versions associated with that zpool may be newer than what the host
supports.  If this is the case we will be unable to import the zpool to
check the zone's software contents.  In practice this is not a problem
because we know that if the zpool/zfs versions used by an encapsulated
zone are later than those supported by the global zone, the software
contents of that zone are also guaranteed to be later than what we have
in our global zone, so the attach would fail regardless.  Encapsulated
zpools with older zpool/zfs versions can be imported without a problem,
and their version numbers will remain unchanged by the attach operation.


----------
C.1.v Zoneadm(1m) boot

The "dataset" property within a "rootzpool" is designed to provide a
convenient mechanism for zones to store data within the root zpool, but
outside of the /ROOT filesystem (which is subject to snapshotting and
cloning during boot environment management operations).  To keep things
simple, the dataset property can only specify a single path component.
(ie, it can not contain any '/' characters.)  If the user has specified
the "dataset" property for a "rootzpool" resource, then when the zone is
booted the framework will check for the presence of a zfs filesystem
named /dataset/<dataset> within the root zpool.  If no such filesystem
exists it will be created.  If this zfs filesystem is created by the
zones framework, it will have the zfs mountpoint property set to none.
This filesystem will then be delegated to the zone as if a
zonecfg(1m) "dataset" resource with a name property value set to
"<zonename>_rpool/dataset/<dataset>" existed.

For example, if an admin wants to configure a zone contained within a
single storage object, but still have a separate dataset where they can
manage snapshots and clones of their data independently of their root
filesystem and software, then they could create a configuration as
follows:

---8<---
zonecfg -z ibis
...
add rootzpool
set src=iscsi:///alias=ibis z...@0
set dataset=oracle_data
end
---8<---

Once this zone was booted, the zone would have access to the following
zfs dataset: ibis_rpool/dataset/oracle_data, and it could mount,
snapshot, and clone this dataset, all independently of any zone ROOT
dataset snapshots and clones created by system management tools,
image-updates, etc.
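
From inside the zone, the delegated dataset could then be managed with
the standard zfs(1m) commands, for example (the mountpoint and snapshot
name here are hypothetical):

---8<---
zfs set mountpoint=/oradata ibis_rpool/dataset/oracle_data
zfs snapshot ibis_rpool/dataset/oracle_data@pre_upgrade
---8<---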

If a zone has any "zpool" resources specified, when the zone is booted
those zpools will be delegated to the zone.


----------
C.1.vi Zoneadm(1m) detach

The zoneadm(1m) detach operation will be enhanced so that when detaching
a zone from a system, any zpools residing in storage objects associated
with the detaching zone will be exported.  This will allow the
administrator to attach the zone on another host without having to
specify the force flag ('-f') to zoneadm(1m) attach.  Also, if any
static iSCSI storage objects are associated with the zone, the iSCSI
initiator configuration for those targets will be removed.  (Although
iSCSI initiator static discovery will remain enabled.)
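
Together with attach, this enables a simple migration workflow between
hosts that share access to the storage (a sketch; the zone and host
names are hypothetical, and the zone must already be configured on
hostb):

---8<---
hosta# zoneadm -z ibis detach      # exports the encapsulated zpools
hostb# zoneadm -z ibis attach      # re-imports them from shared storage
---8<---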


----------
C.1.vii Zoneadm(1m) uninstall

If a user initiates a zoneadm(1m) uninstall operation, then for each
zpool specified via a "rootzpool" or "zpool" resource, if the
"zpool-preserve" property is set to false, the zpool contained within
the specified storage object will be destroyed.  If the "zpool-preserve"
property is set to true, then for a "rootzpool" resource only the /ROOT
filesystem (and any of its descendants) within the root zpool will be
destroyed, and then all of the zone's zpools will be exported.

After destroying or exporting the zpool, any associated storage object
will be unconfigured as described in the zoneadm(1m) detach section
above.


----------
C.1.viii Zoneadm(1m) clone

Normally when cloning a zone which lives on a zfs filesystem the zones
framework will take a zfs(1m) snapshot of the source zone and then do a
zfs(1m) clone operation to create a filesystem for the new zone which is
being instantiated.  This works well when all the zones on a given
system live on local storage in a single zfs filesystem, but this model
doesn't work well for zones with encapsulated roots.  First, with
encapsulated roots each zone has its own zpool, and zfs(1m) does not
support cloning across zpools.  Second, zfs(1m) snapshotting/cloning
within the source zpool and then mounting the resultant filesystem onto
the target zones zoneroot would introduce dependencies between zones,
complicating things like zone migration.

Hence, for cloning operations, if the source zone has an encapsulated
root, zoneadm(1m) will not use zfs(1m) snapshot/clone.  Currently
zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
is unable to use zfs(1m) snapshot/clone.  We could just fall back to
this default behaviour for encapsulated root zones, but find+cpio are
not error free and can have problems with large files.  So we propose to
update zoneadm(1m) clone to detect when both the source and target zones
are using separate zfs filesystems, and in that case attempt to use zfs
send/recv before falling back to find+cpio.
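
Conceptually, the zfs send/recv based clone of an encapsulated root
amounts to something like the following (hypothetical zone/zpool names;
the actual dataset layout is per the SNAP design [07]):

---8<---
zfs snapshot -r zone1_rpool/ROOT@clone
zfs send -R zone1_rpool/ROOT@clone | zfs recv zone2_rpool/ROOT
---8<---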

Today, the zoneadm(1m) clone operation ignores any additional storage
(specified via the "fs", "device", or "dataset" resources) that may be
associated with the zone.  Similarly, the clone operation will ignore
additional storage associated with any "zpool" resources.

Since zoneadm(1m) clone will be enhanced to support cloning between
encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
clone will be documented as the recommended migration mechanism for
users who wish to migrate existing zones from one format to another.


----------
C.2 Storage object uid/gid handling

One issue faced by all VTs that support shared storage is dealing with
file access permissions of storage objects accessible via NFS.  This
issue doesn't affect device based shared storage, or local files and
vdisks, since these types of storage are always accessible, regardless
of the uid of the accessing process (as long as the accessing process has
the necessary privileges).  But when accessing files and vdisks via NFS,
the accessing process can not use privileges to circumvent restrictive
file access permissions.  This issue is also complicated by the fact
that by default most NFS servers will map all accesses by the remote
root user to a different uid, usually "nobody" (a practice known as
"root squashing").

In order to avoid root squashing, or requiring users to setup special
configurations on their NFS servers, whenever the zone framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.

Additionally, whenever the zones framework attempts to access a storage
object file or vdisk it will temporarily switch its uid and gid to match
the owner and group of the file/vdisk, ensure that the file is readable
and writeable by its owner (updating the file/vdisk permissions if
necessary), and finally setup the file/vdisk for access via a zpool
import or lofiadm -a.  This will allow the zones framework to
access storage object files/vdisks that were created by any user,
regardless of their ownership, simplifying file ownership and management
issues for administrators.


----------
C.3 Taskq enhancements

The integration of Duckhorn[08] greatly simplifies the management of cpu
resources assigned to zones.  This management is partially implemented
through the use of dynamic resource pools, where zones and their
associated cpu resources can both be bound to a pool.

Internally, zfs has worker threads associated with each zpool.  These
are kernel taskq threads which can run on any cpu which has not been
explicitly allocated to a cpu set/partition/pool.

So today, for any zone living on zfs filesystems and running in a
dedicated cpu pool, any zfs disk processing associated with that zone is
not done by the cpus bound to that zone's pool.  Essentially all of the
zone's zfs processing is done for "free" by the global zone.

With the introduction of zpools encapsulated within storage objects,
which are themselves associated with specific zones, it would be
desirable to have the zpool worker threads bound to the cpus currently
allocated to the zone.  Currently, zfs uses taskq threads for each
zpool, so one way of doing this would be to introduce a mechanism that
allows for the binding of taskqs to pools.

Hence we propose the following new interfaces:
        zfs_poolbind(char *, poolid_t);
        taskq_poolbind(taskq_t, poolid_t);

When a zone, which is bound to a pool, is booted, the zones framework
will call zfs_poolbind() for each zpool associated with an encapsulated
storage object bound to the zone being booted.

Zfs will in turn use the new taskq pool binding interfaces to bind all
its taskqs to the specified pools.  This mapping is transient and zfs
will not record or persist this binding in any way.

The taskq implementation will be enhanced to allow for binding worker
threads to a specific pool.  If taskq threads are created for a taskq
which is bound to a specific pool, those new threads will also inherit
the same pool bindings.  The taskq to pool binding will remain in effect
until the taskq is explicitly rebound or the pool to which it is bound
is destroyed.


----------
C.4 Zfs enhancements

In addition to the zfs_poolbind() interface proposed above, the
zpool(1m) "import" command will need to be enhanced.  Currently,
zpool(1m) import by default scans all storage devices on the system
looking for pools to import.  The caller can also use the '-d' option to
specify a directory within which the zpool(1m) command will scan for
zpools that may be imported.  This scanning involves sampling many
objects.  When dealing with zpools encapsulated in storage objects, this
scanning is unnecessary since we already know the path to the objects
which contain the zpool.  Hence, the '-d' option will be enhanced to
allow for the specification of a file or device.  The user will also be
able to specify this option multiple times, in case the zpool spans
multiple objects.
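
With this enhancement, importing an encapsulated zpool directly from
its backing objects might look like the following (proposed usage only;
the file paths and pool name are hypothetical):

---8<---
zpool import -d /export/xvm/zone1-a.disk -d /export/xvm/zone1-b.disk \
    zone1_rpool
---8<---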


----------
C.5 Lofi and lofiadm(1m) enhancements

Currently, there is no way for a global zone to access the contents of a
vdisk.  Vdisk support was first introduced in VirtualBox.  xVM then
adopted the VirtualBox code for vdisk support.  With both technologies,
the only way to access the contents of a vdisk is to export it to a VM.

To allow zones to use vdisk devices we propose to leverage the code
introduced by xVM by incorporating it into lofi.  This will allow any
solaris system to access the contents of vdisk devices.  The interface
changes to lofi to allow for this are fairly straightforward.

A new '-l' option will be added to the lofiadm(1m) "-a" device creation
mode.  The '-l' option will indicate to lofi that the new device should
have a label associated with it.  Normally lofi devices are named
/dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the lofi device number.
When a disk device has a label associated with it, it exports many
device nodes with different names.  Therefore lofi will need to be
enhanced to support these new device names, which include multiple
nodes per device.  These new names will be:

        /dev/lofi/dsk<I>/p<j>           - block device partitions
        /dev/lofi/dsk<I>/s<j>           - block device slices
        /dev/rlofi/dsk<I>/p<j>          - char device partitions
        /dev/rlofi/dsk<I>/s<j>          - char device slices

A new '-v <vdisk-format>' option will be added to the lofiadm(1m) "-a"
device creation mode.  This will indicate to lofi that the new device
which is being created will be stored within a vdisk instead of a normal
file.  Vdisk formats may provide their own management features such as
snapshotting, compression, encryption, etc.  As such, the lofi vdisk
support exists purely to access the contents of vdisks.  Hence, vdisk
based lofi devices will not support other lofi options such as
encryption ('-c') and compression ('-C' / '-U').  Also, all vdisks
actually contain disks, so they all contain partition/label data.
Hence when attaching a vdisk the '-l' flag is always implied (and should
not be specified.)  The vdisk formats that will be supported by lofi are
whatever vdisk formats happen to be supported by xVM at the time of
integration.  Since the implementation between lofi and xVM will be
shared, as new vdisk format support is added to xVM, it should be
immediately supportable via lofi as well.

The current xVM implementation for accessing vdisks involves two drivers
and a userland utility.  A "frontend" driver runs inside a VM and
exports a normal solaris disk interface.  It takes IO requests to these
disks and transmits them, via a ring buffer, to the "backend" driver
running in the global zone.  The backend driver then maps these ring
requests into a dedicated vdisk process (there is one such process for
every vdisk), and this process translates these ring requests into
access to a vdisk of the requested format.  Given all this existing xVM
functionality, the most straightforward way to support vdisks from within
lofi would be to leverage the xVM implementation.  This will involve
re-factoring the existing xVM code, thereby allowing lofi to utilise
the "frontend" code which translates strategy io requests into ring
buffer requests, and also the "backend" code which exports the ring
buffer to userland.  The unchanged xVM userland vdisk utility can then
be used to map ring buffer requests to the actual vdisk storage.
Currently this utility is only available on x86, but since lofi is a
cross-platform utility, this proposal will require the delivery of this
utility on both sparc and x86.  This utility is currently delivered in
an xVM private directory, /usr/lib/xen/bin/vdisk.  Given that lofi is a
more general and cross platform utility as compared to xVM, and also
given that we don't expect users to access the vdisk management
utilities directly, we propose to move the vdisk application to
/usr/lib/lofi/bin/vdisk.

For RAS purposes, we will need to ensure that this vdisk utility is
always running.  Hence we will introduce a new lofi smf service
svc:/system/lofi:default, which will start a new /usr/lib/lofi/lofid
daemon, which will manage the starting, stopping, monitoring, and
possible re-start of the vdisk utility.  Re-starts of the vdisk utility
should be transparent (aside from a short performance hiccup) to any
zones accessing those vdisks.  By default this service will be disabled.
If a lofi vdisk device is created, this service will be temporarily
enabled.  When the last vdisk based lofi device is destroyed, this
service will disable itself.

XXX: what to do about disk geometry assignment?  sigh.

Here are some examples of how this lofi functionality could be used
(outside of the zone framework).  If there are no lofi devices on
the system, and an admin runs the following command:
        lofiadm -a -l /export/xvm/vm1.disk

they would end up with the following devices:
        /dev/lofi/dsk0/p#               - for # == 0 - 4
        /dev/lofi/dsk0/s#               - for # == 0 - 15
        /dev/rlofi/dsk0/p#              - for # == 0 - 4
        /dev/rlofi/dsk0/s#              - for # == 0 - 15

If there are no lofi devices on the system, and an admin runs the
following command:
        lofiadm -a -v /export/xvm/vm1.vmdk

they would end up with the following devices:
        /dev/lofi/dsk0/p#               - for # == 0 - 4
        /dev/lofi/dsk0/s#               - for # == 0 - 15
        /dev/rlofi/dsk0/p#              - for # == 0 - 4
        /dev/rlofi/dsk0/s#              - for # == 0 - 15

By default, format(1m) will not list these devices in its output.  But
users will be able to treat these devices like regular disks and pass
their names to utilities like fdisk(1m), format(1m), prtvtoc(1m),
fmthard(1m), zpool(1m), etc.
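
For example, assuming the proposed device naming, an admin could
inspect and reuse a labeled lofi device like this (hypothetical image
path and pool name):

---8<---
lofiadm -a -l /export/xvm/vm1.disk
prtvtoc /dev/rlofi/dsk0/s2
zpool create testpool /dev/lofi/dsk0/s0
---8<---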

----------
C.6 Performance considerations

As previously mentioned, this proposal primarily simplifies the process
of configuring zones on shared storage.  In most cases these proposed
configurations can be created today, but no one has actually verified
that these configurations perform acceptably.  Hence, in conjunction
with providing functionality to simplify the setup of these configs,
we also need to quantify their performance to make sure that
none of the configurations suffer from gross performance problems.

The most straightforward configurations, with the least possibilities
for poor performance, are ones using local devices, fibre channel luns,
and iSCSI luns.  These configurations should perform identically to the
configurations where the global zone uses these objects to host zfs
filesystems without zones.  Additionally, the performance of these
configurations will mostly be dependent upon the hardware associated
with the storage devices.  Hence the performance of these configurations
is for the most part uninteresting and performance analysis of these
configurations can be skipped.

Looking at the performance of storage objects which are local files or
nfs files is more interesting.  In these cases the zpool that hosts the
zone will be accessing its storage via the zpool vdev_file vdev_ops_t
interface.  Currently, this interface doesn't receive as much use and
performance testing as some of the other zpool vdev_ops_t interfaces.
Hence it will be worthwhile to measure the performance of a zpool
backed by a file within another zfs filesystem.  Likewise we will want
to measure the performance of a zpool backed by a file on an NFS
filesystem.
Finally, we should compare these two performance points to a zone which
is not encapsulated within a zpool, but is instead installed directly on
a local zfs filesystem.  (These comparisons are not really that
interesting when dealing with block device based storage objects.)  We
will also want to determine if there are any specific NFS mount options
that should be used which could affect performance, for example, should
"forcedirectio" be enabled?

Currently, while it is very common to deploy large numbers of zfs
filesystems, systems with large numbers of zpools are not very common.
The solution proposed in this project will likely result in an increase
of zpools on systems hosting zones.  Hence, we should evaluate the
impact of an increasing number of zpools on performance scalability.
This could be done by comparing the io performance drop-off of an
increasing number of zones hosted in multiple zfs filesystems in a
single zpool vs zones hosted in separate zpools.

Finally, it will be important to do performance measurements for vdisk
configurations.  These configurations are similar to the local file or
nfs configurations, but they will be utilising the vdev_disk backend and
they will have an additional layer of indirection through lofi.

XXX: impact of multiple zpools on arc and l2 arc?  talk to mark maybee.


----------
C.7 Phased delivery

Customers have been asking for a simple mechanism to allow hosting of
zones on NFS since the introduction of zones.  Hence we'd like to get
this functionality into the hands of customers as quickly as possible.
Also, the approach taken by this proposal to supporting zones on shared
storage is different from what was originally anticipated, hence we'd
like to get practical experience with this approach at customer sites
asap to determine if there are situations where this approach may not
meet their requirements.  To accelerate the delivery of the previously
proposed features, we plan to deliver them in three phases:


I - Basic Zone encapsulation support

This will involve the introduction of the new "rootzpool" and "zpool"
resources and support for the "file" and "nfs" type storage objects.  It
will require implementation of all the proposed zone, zfs, and taskq
changes.  This phase will not require any lofi changes.


II - Zone encapsulation on fc and iSCSI storage.

This will provide enhanced so-uri support for fibre channel and iSCSI
type storage objects.


III - Zone encapsulation on vdisk storage.

This will involve implementation of the proposed lofi enhancements and
enhanced so-uri support for vdisk type storage objects.


----------
C.8 Future work

In addition to the proposed work above, there are future enhancements
that could be made which would extend the functionality proposed above.
These are included here because the features proposed above have been
designed with this extensibility in mind.

One simple enhancement would be the addition of a new "rootzpool/zpool"
resource property called "zpool-auto-upgrade".  If set to true, whenever
a zpool is imported by the zones framework (at install, attach, or boot)
the zpool and zfs filesystems would be upgraded to the latest version
supported by the global zone.  Aside from helping to ensure that zones
are using the latest and greatest zpool/zfs features, this feature would
help ensure that all the encapsulated zones on a system are running on
the same zpool/zfs versions.  This is important because zfs send and
receive are not guaranteed to be compatible across zpool/zfs versions.
So by ensuring that all zones are running the latest zpool/zfs versions,
we increase the chances of being able to use zfs send/recv for zone
cloning.
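
In terms of existing commands, this would amount to the framework
running roughly the following at import time (hypothetical zone name):

---8<---
zpool upgrade ibis_rpool
zfs upgrade -r ibis_rpool
---8<---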

Another possible enhancement to the "rootzpool/zpool" resources would be
to allow the specification of multiple so-uris to support more complex
root zpool configurations.

With the introduction of vdisk support, we will have the ability to
access VMs created by other VTs.  Using this functionality it would be
straightforward to enhance the existing zoneadm(1m) attach p2v
functionality to be able to do v2v, where the source system image for
installing a zone could be a vdisk created by another VT.

Finally, a new "storage" zonecfg(1m) resource could be added to allow
for the addition of arbitrary storage objects to zones.  This "storage"
resource would have similar properties to a "zpool" resource, along
with a few additional properties:

        storage                                 resource
                src                             resource property
                install-size                    resource property
                name                            resource property
                locking-required                resource property

The zones framework would ensure that the specified storage objects were
accessible from the global zone, and then it would grant the zone access
to the raw disk devices.  But global zone disk device names can be
different on different hosts, so if global zone device names were used
within the non-global zone, this would negatively impact zone migration
since the software within the zone would have to be updated to deal with
potentially new device names.  To avoid these problems and facilitate
zone migration, the disk devices would be mapped into the non-global
zone with different names which would be global zone independent.
Regardless of the global zone disk device name, from within the
non-global zone, the devices would be named:

        /dev/storage/<name>/p<j>
        /dev/storage/<name>/s<j>

By basing the names of the device as seen from within the zone on the
"name" specified in the zonecfg(1m) (instead of the raw device names
from the global zone), we can guarantee that these device names will not
change when migrating the zones between hosts.

One difficulty with allowing the specification of storage objects which
don't by default contain zpools is that we can no longer use zfs to
ensure that multiple entities are not accessing these objects at the
same time.  Hence other mechanisms will need to be used.  In the case of
fibre channel and iSCSI, we should be able to use SCSI reservations.  In
the case of file paths we should be able to use standard file locking.
In the case of vdisks, some vdisks support meta-data which can be
utilised to prevent concurrent access, and if a vdisk format does not
allow for this then we will need to fall back to file locking.  If for
some reason we are unable to do in-use detection for a specific storage
object, then we could alert the user and refuse to use that storage
object unless the user set the optional "locking-required" property to
false.


----------
D. INTERFACES

Zonecfg(1m):
        rootzpool                               committed, resource
                src                             committed, resource property
                install-size                    committed, resource property
                zpool-preserve                  committed, resource property
                dataset                         committed, resource property

        zpool                                   resource
                src                             resource property
                install-size                    resource property
                zpool-preserve                  resource property
                name                            resource property

Zoneadm(1m):
        install -f                              committed, optional flag
        attach -f                               committed, optional flag

Zones misc:
        /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
                                                project private, nfs mount point

Taskq pool binding:
        taskq_poolbind(taskq_t, poolid_t)       consolidation private

Zfs pool binding:
        zfs_poolbind(char *, poolid_t)          consolidation private

Lofiadm(1m):
        -a -l                                   committed, optional flag
        -a -v <vdisk-format>                    committed, optional flag

Lofi misc:
        svc:/system/lofi:default                project private
        /lib/lofi/lofid                         project private
        /lib/lofi/vdisk                         consolidation private /w
                                                        contract for xVM


----------
E. ABBREVIATIONS

VDI  - VirtualBox Virtual HDD
VHD  - Microsoft Virtual Hard Disk
VM   - Virtual Machine
VMDK - VMWare Virtual Machine Disk
VT   - Virtualization Technology


----------
F. FOOTNOTES / REFERENCES

--
00 - Blogs and emails describing custom configurations used to host
    zones on shared storage:

    June 2005
    Containers on NFS?
        http://blogs.sun.com/jph/entry/containers_on_nfs

    February 2008
    The quick & dirty guide to zones on iSCSI LUNs
        http://opensolaris.org/jive/thread.jspa?messageID=226951

    April 2008
    ZoiT: Solaris Zones on iSCSI Targets (aka NAC: Network-Attached Containers)
        http://blogs.sun.com/JeffV/entry/zoit_solaris_zones_on_iscsi

--
01 - For the purposes of this document, the term storage object refers
    to any device or file which can be used to store data.  These "objects"
    can take the form of files on a local filesystem, files on a remote
    filesystem accessible via NFS, CIFS, etc, local disk devices, FC disk
    devices, iSCSI target devices, etc.

--
02 - Each VT has a different name for the zones concept of the "global
    zone".  With LDOMs we have the "control domain", with xVM we have
    "dom0", with VirtualBox we have the "host domain".  For the purposes of
    this document the "global zone" refers to the OS entity which allocates
    resources to and has control over the state of VMs.  For simplicity,
    this document will always just refer to the "global zone" in place of
    all these other terms.

--
03 - Complete so-uri ABNF syntax definition:

    so-uri              = path-uri
                        / nfs-uri
                        / vpath-uri
                        / vnfs-uri
                        / fc-uri
                        / iscsi-uri

    path-uri            = "path://" file-absolute
    file-absolute       = "/" *( segment "/" ) segment-nz
                        ; Since file-absolute is always the last component
                        ; in a uri reserved characters do not need to be
                        ; percent encoded.  To simplify path management
                        ; we will also not permit '/./' or '/../' segments.

    nfs-uri             = "nfs://" hostport "/" file-absolute
                        ; Defined in RFC 2224 [05]

    hostport            = host [ ":" port ]

    vpath-uri           = "vpath://" file-absolute
    vnfs-uri            = "vnfs://" host [ ":" port ] "/" file-absolute
                        ; same syntax as nfs-uri

    fc-uri              = "fc:///" wwn [ "@" lun ]
    wwn                 = "1000" 12HEXDIG
                        / "2" 15HEXDIG
                        / "50" 14HEXDIG
                        / "60" 14HEXDIG
                        ; see http://en.wikipedia.org/wiki/World_Wide_Name

    iscsi-uri           = "iscsi:///" iscsi-alias
                        / "iscsi:///" iscsi-target
                        / "iscsi://" iscsi-static

    iscsi-alias         = "alias=" iscsi-alias-name [ "@" lun ]
    iscsi-alias-name    = 1*255pchar
                        ; Defined in RFC 3720 [06]
                        ; any "@" chars must be percent encoded

    iscsi-target        = "target=" iscsi-target-iqn [ "@" lun ]
                        / "target=" iscsi-target-eui [ "@" lun ]
    iscsi-target-iqn    = "iqn." iqn-date "." reg-name [ ":" 1*pchar ]
                        ; Defined in RFC 3720 [06]
                        ; any "@" chars must be percent encoded
    iqn-date            = 4DIGIT "-" "0" %x31-39        ; "XXXX-01" - "XXXX-09"
                        / 4DIGIT "-" "1" %x30-32        ; "XXXX-10" - "XXXX-12"
    iscsi-target-eui    = XXX
                        ; Defined in RFC 3720 [06]
                        ; any "@" chars must be percent encoded

    iscsi-static        = hostport "/" [ "tpgt=" iscsi-tpgt "/" ] iscsi-target

    iscsi-tpgt          = DIGIT                         ; 0-9
                        / %x31-39 1*3DIGIT              ; 10-9999
                        / %x31-35 4DIGIT                ; 10000-59999
                        / "6"    %x30-34 3DIGIT         ; 60000-64999
                        / "65"   %x30-34 2DIGIT         ; 65000-65499
                        / "655"  %x30-32 1DIGIT         ; 65500-65529
                        / "6553" %x30-35                ; 65530-65535

    lun                 = DIGIT                         ; 0-9
                        / %x31-39 1*3DIGIT              ; 10-9999
                        / "1"    %x30-35 3DIGIT         ; 10000-15999
                        / "16"   %x30-32 2DIGIT         ; 16000-16299
                        / "163"  %x30-37 1DIGIT         ; 16300-16379
                        / "1638" %x30-33                ; 16380-16383

    DIGIT               = Defined in RFC 3986 [04]
    HEXDIG              = Defined in RFC 3986 [04]
    segment-nz          = Defined in RFC 3986 [04]
    segment             = Defined in RFC 3986 [04]
    host                = Defined in RFC 3986 [04]
    port                = Defined in RFC 3986 [04]

--
04 - RFC 3986: Uniform Resource Identifier (URI): Generic Syntax
    http://www.ietf.org/rfc/rfc3986.txt
    http://www.websitedev.de/temp/rfc3986-check.html.gz

--
05 - RFC 2224 NFS URL Scheme
    http://www.ietf.org/rfc/rfc2224.txt

--
06 - RFC 3720: Internet Small Computer Systems Interface (iSCSI)

--
07 - Zones/SNAP Design - 8/25/2008
    http://www.opensolaris.org/jive/thread.jspa?messageID=272726&#272726

--
08 - PSARC/2006/496 Improved Zones/RM Integration
    http://arc.opensolaris.org/caselog/PSARC/2006/496


-------------------------------------------------------------------------------