[ second reply, includes revised proposal ] hey mike,
thanks for all the great feedback. my replies to your individual comments are inline below. i've updated my proposal to include your feedback, but i'm unable to attach it to this reply because of mail size restrictions imposed by this alias. i'll send some follow up emails which include the revised proposal. thanks again, ed
" please ensure that the vim modeline option is not disabled vim:textwidth=72 ------------------------------------------------------------------------------- Zones on shared storage (v1.1) ---------- A. INTRODUCTION The most commonly requested zones feature is support for the NFS server within a zone (4964859). The next two most requested zones features are support for hosting zones on shared storage, specifically NFS and iSCSI. These last two requests are being tracked via the following bugs: 6688400 Want zonepath on iscsi targets 4963321 RFE: hosting root filesystems for Zones on NFS servers This document proposed a plan to address the latter two issues. These proposed changes will also start to bring the zones storage administration closer in line with the experience provided by other virtualization technologies (VTs, examples of which are xVM, LDOMs, and VirtualBox). ---------- B. BACKGROUND Currently, it is possible to combine existing supported technologies (FC, iSCSI, NFS, lofi, zfs, etc) to host zones on shared storage, and there are a few blog entries out there describing how to create such configurations[00]. While these configurations are usable and technically supportable, they require extensive configuration of multiple different technologies, making them complex, potentially fragile (since they are not regularly tested), and out of reach for most customers. Creating these configurations today also causes problems with the use of other existing zones features, the most obvious of which is zone migration. These configurations complicate zone migration because all of the additional global zone configuration that was required to host the zone on shared storage must be tracked and then migrated along with the zone data itself. VM Migration is critical component of all existing VTs and anything that can be done to improve zone migration support would be a great benifit to zones administrators. All the other virtualization technologies currently available in Solaris support the hosting of Virtual Machines (VMs) on shared storage. They all do this in the same fashion, which involves taking a "storage object"[01] which is accessible from the "global zone"[02], and making it visible within the VM as a local disk. In this document we're refer to this process as "encapsulation", since the "disk" which a VM thinks it's accessing is actually contained (ie encapsulated) within some storage object in the global zone. This encapsulation has advantages and disadvantages. Encapsulation makes it easier to manage the storage associated with VMs. These storage objects may be accessible from multiple hosts simultaneously. They can be backed up, restored, moved around, and copied as individual files instead of filesystems. One disadvantage of the encapsulation used by these other VTs is that currently on Solaris, there is no way to open up these encapsulated disks and access their contents from the global zone. Currently, the only way to access these encapsulated disks is to import them into a running VM. Another interesting development with these encapsulated disks used by other VTs is the proliferation of storage object formats. These custom storage objects formats allows for features which are transparent to the VM using the disks and independent of any features offered by the underlying global zone. While specific feature sets may vary, common features that may be supported are things like compression, sparseness, dedup, snapshotting, and rollback. 
Management of all these storage object features is usually done from
the global zone. Examples of some of these different formats are:

    VDI  - VirtualBox Virtual HDD
    VHD  - Microsoft Virtual Hard Disk
    VMDK - VMWare Virtual Machine Disk

These formats are also commonly used for moving and sharing VM images,
and it is not uncommon for users to take VMs encapsulated in one format
and convert them into another.

----------
C. PROPOSAL / DESCRIPTION

The essence of this proposal is to enhance zonecfg(1m) to allow for the
specification of shared storage objects which can be used to
encapsulate zone filesystems, including the zone's root filesystem.
Once shared storage objects are specified in zonecfg(1m), the
management of this shared storage will be handled automatically by the
zones framework. This will allow administrators to host zones on shared
storage with no additional system configuration required outside of
zonecfg(1m). It will also provide zones with a global zone storage
usage and administration experience consistent with other VTs. All the
existing zoneadm(1m) operations that administrators can perform today
should continue to work seamlessly with this new support for shared
storage objects.

The proposal is broken down into the following subsections:

    C.0 Out of scope
    C.1 Zones enhancements
    C.2 Storage object uid/gid handling
    C.3 Taskq enhancements
    C.4 Zfs enhancements
    C.5 Lofi and lofiadm(1m) enhancements
    C.6 Performance considerations
    C.7 Phased delivery
    C.8 Future work

----------
C.0 Out of scope

Before discussing the details of the changes being proposed, here are
the things this proposal will not address:

- Hosting zones natively on NFS. We will not be considering
  enhancements which would allow zone roots to live natively (i.e.,
  unencapsulated) on NFS. Supporting this functionality is not
  prevented by anything proposed herein, but initially we are not
  pursuing this option because:

  - Encapsulation allows us to host zones on a myriad of storage in
    addition to NFS.
  - Encapsulation brings the zones storage feature set closer in line
    with other VTs.
  - Encapsulation will likely be easier to implement than
    unencapsulated hosting of zone root filesystems on NFS.

- Advanced shared storage configuration. Initially there will be no
  support for advanced shared storage configuration options. If
  advanced configuration options need to be used, they will need to be
  set up in the global zone outside the zones framework. That said,
  nothing in this proposal should prevent us from adding support for
  more complex configurations in the future, should they become
  popular. Some examples of what would be considered advanced shared
  storage configuration options are (and this is by no means a complete
  list):

  - iSCSI target discovery management
  - iSCSI target parameter configuration (CHAP, etc)
  - custom vdisk creation options
  - encapsulated zpools spanning multiple storage objects
  - custom zpool/zfs creation options

----------
C.1 Zones enhancements

This is a list of proposed enhancements to specific zones subsystems.
It is broken down into the following categories:

    C.1.i    Zonecfg(1m)
    C.1.ii   Storage object uri (so-uri) format
    C.1.iii  Zoneadm(1m) install
    C.1.iv   Zoneadm(1m) attach
    C.1.v    Zoneadm(1m) boot
    C.1.vi   Zoneadm(1m) detach
    C.1.vii  Zoneadm(1m) uninstall
    C.1.viii Zoneadm(1m) clone

----------
C.1.i Zonecfg(1m)

The zonecfg(1m) command will be enhanced with the following two new
resources and associated properties:

    rootzpool           resource
        src             resource property
        install-size    resource property
        zpool-preserve  resource property
        dataset         resource property

    zpool               resource
        src             resource property
        install-size    resource property
        zpool-preserve  resource property
        name            resource property

The new resources and properties will be defined as follows:

"rootzpool"
    - Description: Identifies a shared storage object (and its
      associated parameters) which will be used to contain the root zfs
      filesystem for a zone.

"zpool"
    - Description: Identifies a shared storage object (and its
      associated parameters) which will be made available to the zone
      as a delegated zfs dataset.

"src"
    - Status: Required.
    - Format: Storage object uri (so-uri). (See definition below.)
    - Description: Identifies the storage object associated with this
      resource.

"install-size"
    - Status: Optional.
    - Format: Integer. Defaults to bytes, but can be flagged as
      gigabytes, kilobytes, or megabytes, with a g, k, or m suffix,
      respectively.
    - Description: If the specified storage object doesn't exist at
      zone install time, it will be created with this specific size.
      This property has no effect for storage objects which already
      exist and have a pre-defined size.

"zpool-preserve"
    - Status: Optional.
    - Format: Boolean. Defaults to false.
    - Description: When doing an install, if this property is set to
      true and a zpool already exists on the specified storage object,
      that zpool will be used. When doing a destroy, if this property
      is set to true, the root zpool will not be destroyed.

"dataset"
    - Status: Optional.
    - Format: zfs filesystem name component (can't contain a '/').
    - Description: Name of a dataset within the root zpool to delegate
      to the zone.

"name"
    - Status: Required.
    - Format: zfs filesystem name component (can't contain a '/').
    - Description: Used as part of the name for a zpool which will be
      delegated to the zone.

Zonecfg(1m) "verify" will verify the syntax of any "rootzpool" resource
group (and its properties), but it will NOT verify the accessibility of
any storage specified by an so-uri. (This is because accessing the
storage specified by an so-uri could require configuration changes to
other subsystems.) An illustrative configuration is sketched at the end
of this subsection.

These new resources should provide zone administrators with lots of
flexibility when it comes to deploying zones on shared storage. Some of
the likely zone deployment models that are envisioned and enabled by
this proposal are:

- A zone with an encapsulated "rootzpool" zpool. In this scenario the
  OS will be stored in the "rootzpool", and all non-OS software and
  data will also be stored in datasets which are descendants of the
  zone root dataset. This means that system operations which snapshot
  and clone the OS will also snapshot and clone non-OS software and
  data.

- A zone with an encapsulated "rootzpool" zpool and a "dataset" defined
  within the "rootzpool". In this scenario, the OS will be stored in
  the "rootzpool", and all non-OS software and data will be stored
  within "dataset", which is also contained within the "rootzpool", but
  is not a direct descendant of any zone root dataset. This means that
  system operations which snapshot and clone the OS will not affect
  non-OS software and data.

- A zone with an encapsulated "rootzpool" and one or more encapsulated
  "zpool"s. In this scenario, the OS will be stored in the "rootzpool",
  and all non-OS software and data will be stored within other
  "zpool"s. This means that system operations which snapshot and clone
  the OS will not affect non-OS software and data.

More information about these new resources and how they will be
managed by the zones framework is available below.
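For illustration, a configuration using both new resources might look
something like the following sketch (the zone name, NFS server, paths,
and sizes are made-up examples, not part of the proposal):

---8<---
zonecfg -z web1
create
set zonepath=/zones/web1
add rootzpool
set src=nfs://nfsserver.example.com/export/zones/web1-root.disk
set install-size=8g
end
add zpool
set src=nfs://nfsserver.example.com/export/zones/web1-data.disk
set install-size=20g
set name=data
end
commit
---8<---

With a configuration like this, the zones framework would create and
manage a "web1_rpool" root zpool and a delegated "web1_data" zpool, as
described in the following subsections.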
----------
C.1.ii Storage object uri (so-uri) format

The storage object uri (so-uri) syntax[03] will conform to the standard
uri format defined in RFC 3986 [04]. The nfs URI scheme is defined in
RFC 2224 [05]. The so-uri syntax can be summarised as follows:

    File storage objects:
        path:///<file-absolute>
        nfs://<host>[:port]/<file-absolute>

    Vdisk storage objects:
        vpath:///<file-absolute>
        vnfs://<host>[:port]/<file-absolute>

    Device storage objects:
        path:///dev/<file-absolute>
        fc:///<wwn>[@<lun>]
        iscsi:///alias=<alias>[@<lun>]
        iscsi:///target=<target>[@<lun>]
        iscsi://<host>[:port]/[tpgt=<tpgt>/]target=<target>[@<lun>]

File storage objects point to plain files on local, nfs, or cifs
filesystems. These files are used to contain zpools which store zone
datasets. These are the simplest types of storage objects. Once
created, they have a fixed size, can't be grown, and don't support
advanced features like snapshotting, etc. Some example file so-uris
are:

    path:///export/xvm/vm1.disk
        - a local file
    path:///net/heaped.sfbay/export/xvm/1.disk
        - an nfs file accessible via autofs
    nfs://heaped.sfbay/export/xvm/1.disk
        - the same file specified directly via an nfs so-uri

Vdisk storage objects are similar to file storage objects in that they
can live on local, nfs, or cifs filesystems, but they each have their
own special data format and varying feature sets, with support for
things like snapshotting, etc. Some common vdisk formats are VDI, VMDK,
and VHD. Some example vdisk so-uris are:

    vpath:///export/xvm/vm1.vmdk
        - a local vdisk image
    vpath:///net/heaped.sfbay/export/xvm/1.vmdk
        - an nfs vdisk image accessible via autofs
    vnfs://heaped.sfbay/export/xvm/1.vmdk
        - the same vdisk image specified directly via an nfs so-uri

Device storage objects specify block storage devices in a host
independent fashion. Some /dev device names may already be named in a
host independent fashion. In this case the admin can simply specify the
/dev device path for this device as the so-uri. When configuring FC or
iSCSI storage on different hosts, the storage configuration normally
lives outside of zonecfg, and the configured storage may have varying
/dev/dsk/cXtXdX* names. In these cases, the so-uri syntax provides a
way to specify storage in a host independent fashion, and during zone
management operations the zones framework can map this storage to a
host specific device path. Some example device so-uris are:

    path:///dev/vx/dsk/zone1/rootvol
        - a Veritas volume that is accessible from multiple hosts using
          the same name
    fc:///20000014c3474...@0
        - lun 0 of a fc disk with the specified wwn
    iscsi:///alias=oracle zone r...@0
        - lun 0 of an iscsi disk with the specified alias
    iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
        - lun 0 of an iscsi disk with the specified target id

----------
C.1.iii Zoneadm(1m) install

When a zone is installed via the zoneadm(1m) "install" subcommand, the
zones subsystem will first verify that any required so-uris exist and
are accessible.
If an so-uri points to a plain file, nfs file, or vdisk, and the object
does not exist, the object will be created with the install-size that
was specified via zonecfg(1m). If the so-uri does not exist and an
install-size was not specified via zonecfg(1m), an error will be
generated and the install will fail.

If an so-uri points to an explicit nfs server, the zones framework will
need to mount the nfs filesystem containing the storage object. The nfs
server share containing the specified object will be mounted at:

    /var/zones/nfsmount/<host>/<nfs-share-name>

If an so-uri points to a fibre channel lun, the zones subsystem will
verify that the specified wwn corresponds to a global zone accessible
fibre channel disk device.

If an so-uri points to an iSCSI target or alias, the zones subsystem
will verify that the iSCSI device is accessible on the local system. If
an so-uri points to a static iSCSI target and that target is not
already accessible on the local host, then the zones subsystem will
enable static discovery for the local iSCSI initiator and attempt to
apply the specified static iSCSI configuration. If the iSCSI target
device is not accessible then the install will fail.

Once a zone install has verified that any required so-uri exists and is
accessible, the zones subsystem will need to initialise the so-uri. In
the case of a path or nfs path, this will involve creating a zpool
within the specified file. In the case of a vdisk, fibre channel lun,
or iSCSI lun, this will involve creating an EFI/GPT partition on the
device which uses the entire disk, and then a zpool will be created
within this partition. For data protection purposes, if a storage
object contains any pre-existing partitions, zpools, or ufs
filesystems, the install will fail with an appropriate error message.
To continue the installation and overwrite any pre-existing data, the
user will be able to specify a new '-f' option to zoneadm(1m) install.
(This option mimics the '-f' option used by zpool(1m) create.)

If zpool-preserve is set to true, then before initialising any target
storage objects, the zones subsystem will attempt to import a
pre-existing zpool from those objects. This will allow users to
pre-create a zpool with custom creation time options for use with
zones. To successfully import a pre-created zpool for a zone install,
that zpool must not be in use (i.e., any pre-created zpool must be
exported from the system where it was created before a zone can be
installed on it). Once the zpool is imported, the install process will
check for the existence of a /ROOT filesystem within the zpool. If this
filesystem exists the install will fail with an appropriate error
message. To continue the installation the user will need to specify the
'-f' option to zoneadm(1m) install, which will cause the zones
framework to delete the pre-existing /ROOT filesystem within the zpool.
(This is done because during install the zone's root filesystem will be
created under /ROOT. See [07] for more details.)

The newly created or imported root zpool will be named after the zone
with which it is associated, with the assigned name being
"<zonename>_rpool". This zpool will then be mounted on the zone's
zonepath and the install process will continue normally[07].

XXX: use altroot at zpool creation or just manually mount zpool?
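As a sketch of the resulting install flow for the hypothetical "web1"
configuration from section C.1.i (assuming the nfs file so-uri does not
yet exist), the administrator would only run the usual install command:

---8<---
zoneadm -z web1 install
    # the framework mounts the nfs share under /var/zones/nfsmount/...,
    # creates the 8g backing file, creates a zpool within it, and then
    # proceeds with a normal zone install into that zpool
zpool list web1_rpool
    # after the install, the encapsulated root zpool is visible in the
    # global zone; 'zoneadm -z web1 install -f' would be the proposed
    # way to overwrite pre-existing data found in the storage object
---8<---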
If the user has specified a "zpool" resource, then the zones framework
will configure, initialize, and/or import it in a similar manner to a
zpool specified by the "rootzpool" resource. The key differences are
that the name of the newly created or imported zpool will be
"<zonename>_<name>", and the specified zpool will also have the zfs
"zoned" property set to "on", hence it will not be mounted anywhere in
the global zone.

XXX: do we need "zpool import -O file-system-property=" to set the
zoned property upon import?

Once a zone configured with an so-uri is in the installed state, the
zones framework needs a mechanism to mark that storage as in use, to
prevent it from being accessed by multiple hosts simultaneously. The
most likely situation where this could happen is via a zoneadm(1m)
attach on a remote host. The easiest way to achieve this is to keep the
zpools associated with the storage imported and mounted at all times,
and leverage the existing zpool support for detecting and preventing
multi-host access.

So whenever a global zone boots and the zones smf service runs, it will
attempt to configure and import any shared storage objects associated
with installed zones. It will then continue to behave as it does today
and boot any installed zones that have the autoboot property set. If
any shared storage objects fail to configure or import, then:

- the zones associated with the failed storage will be transitioned to
  the "configured" state.
- an error message will be emitted to the zones smf log file.
- after booting any remaining installed zones that have autoboot set to
  true, the zones smf service will enter the "maintenance" state,
  thereby prompting the administrator to look at the zones smf log
  file.

After fixing any problems with shared storage accessibility, the admin
should be able to simply re-attach the zone to the system.

Currently the zones smf service is dependent upon multi-user-server, so
all networking services required for access to shared storage should be
properly configured well before we try to import any shared storage
associated with zones.

On system shutdown, the zones framework will NOT export zpools
contained within storage objects used by the zone. Zpools contained
within storage objects assigned to installed zones will only be
exported during zone detach. More details about the behaviour of zone
detach are provided below.

----------
C.1.iv Zoneadm(1m) attach

The zoneadm(1m) attach operation moves a zone from the configured to
the installed state. This operation will behave similarly to a
zoneadm(1m) install with respect to so-uri management. One key
difference between attach and install is that during an attach, any
missing storage objects will not be automatically created; instead the
attach will fail.

Once all the storage objects associated with an attaching zone are
configured and accessible, the attach operation will attempt to import
and mount the zpools contained within those storage objects. If zfs
detects that any zpools are in use by another host the attach will
fail. A new '-f' option will be added to zoneadm(1m) attach which will
allow the user to force import and mount any zpools which appear to
still be in use. (Once again, this option mimics the '-f' option used
by zpool(1m) import.)

Currently, when a zone is attached to a system a software inventory
check is done to verify that the contents of the zone are in sync with
the global zone. If they are, then the zone is successfully attached.
If the contents are older than what is present in the global zone, then
the attach fails unless the user specifies the "-u" update option. If
the contents of the zone are newer than what is in the global zone then
the attach will fail.

Encapsulating zones within zpools changes the behaviour of attach
slightly, because if the zone was created on a host with newer global
zone bits, it's possible that the zpool/zfs filesystem versions
associated with that zpool may be newer than what the host supports. If
this is the case we will be unable to import the zpool to check the
zone's software contents. In practice this is not a problem, because we
know that if the zpool/zfs versions used by an encapsulated zone are
later than those supported by the global zone, the software contents of
that zone are also guaranteed to be later than what we have in our
global zone, so the attach would fail regardless. In the case of older
zpool/zfs versions, the encapsulated zpool can be imported without a
problem and the version numbers will remain unchanged by the attach
operation.
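Taken together, detach (described in C.1.vi below) and attach give a
simple migration path for an encapsulated zone. A rough sketch
(hostnames and the zone name are illustrative, and the zone
configuration is assumed to have already been recreated on the target
host):

---8<---
hostA# zoneadm -z web1 detach
    # exports web1_rpool and unconfigures any static iSCSI targets
hostB# zoneadm -z web1 attach
    # imports web1_rpool from the shared storage object
hostB# zoneadm -z web1 attach -f
    # only needed if the zpool still appears to be in use by another
    # host (the proposed force option described above)
---8<---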
----------
C.1.v Zoneadm(1m) boot

The "dataset" property within a "rootzpool" is designed to provide a
convenient mechanism for zones to store data within the root zpool, but
outside of the /ROOT filesystem (which is subject to snapshotting and
cloning during boot environment management operations). To keep things
simple, the dataset property can only specify a single path component
(i.e., it cannot contain any '/' characters).

If the user has specified the "dataset" property for a "rootzpool"
resource, then when the zone is booted the framework will check for the
presence of a zfs filesystem named /dataset/<dataset> within the root
zpool. If no such filesystem exists it will be created. If this zfs
filesystem is created by the zones framework, it will have the zfs
mountpoint property set to none. This filesystem will then be delegated
to the zone as if a zonecfg(1m) "dataset" resource with a name property
value of "<zonename>_rpool/dataset/<dataset>" existed.

For example, if an admin wants to configure a zone contained within a
single storage object, but still have a separate dataset where they can
manage snapshots and clones of their data independently of their root
filesystem and software, then they could create a configuration as
follows:

---8<---
zonecfg -z ibis
...
add rootzpool
set src=iscsi:///alias=ibis z...@0
set dataset=oracle_data
end
---8<---

Once this zone was booted, the zone would have access to the zfs
dataset ibis_rpool/dataset/oracle_data, and it could mount, snapshot,
and clone this dataset, all independently of any zone ROOT dataset
snapshots and clones created by system management tools, image-updates,
etc.

If a zone has any "zpool" resources specified, when the zone is booted
those zpools will be delegated to the zone.

----------
C.1.vi Zoneadm(1m) detach

The zoneadm(1m) detach operation will be enhanced so that when
detaching a zone from a system, any zpools residing in storage objects
associated with the detaching zone will be exported. This will allow
the administrator to attach the zone on another host without having to
specify the force flag ('-f') to zoneadm(1m) attach. Also, if any
static iSCSI storage objects are associated with the zone, the iSCSI
initiator configuration for those targets will be removed. (Although
iSCSI initiator static discovery will remain enabled.)

----------
C.1.vii Zoneadm(1m) uninstall

If a user initiates a zoneadm(1m) uninstall operation, then for each
zpool specified via a "rootzpool" or "zpool" resource, if the
"zpool-preserve" property is set to false, the zpool contained within
the specified storage object will be destroyed.
If the "zpool-preserve" property is set to true, then for a "rootzpool" resource only the /ROOT filesystem (and any of it's descendents) within the root zpool will be destroyed, and then all the zones zpools will be exported. After destroying or exporting the zpool, any associated storage object will be unconfigured as described in the zoneadm(1m) detach section above. ---------- C.1.viii Zoneadm(1m) clone Normally when cloning a zone which lives on a zfs filesystem the zones framework will take a zfs(1m) snapshot of the source zone and then do a zfs(1m) clone operation to create a filesystem for the new zone which is being instantiated. This works well when all the zones on a given system live on local storage in a single zfs filesystem, but this model doesn't work well for zones with encapsulated roots. First, with encapsulated roots each zone has it's own zpool, and zfs (1m) does not support cloning across zpools. Second, zfs(1m) snapshotting/cloning within the source zpool and then mounting the resultant filesystem onto the target zones zoneroot would introduce dependencies between zones, complicating things like zone migration. Hence, for cloning operations, if the source zone has an encapsulated root, zoneadm(1m) will not use zfs(1m) snapshot/clone. Currently zoneadm(1m) will fall back to the use of find+cpio to clone zones if it is unable to use zfs(1m) snapshot/clone. We could just fall back to this default behaviour for encapsulated root zones, but find+cpio are not error free and can have problem with large files. So we propose to update zoneadm(1m) clone to detect when both the source and target zones are using separate zfs filesystems, and in that case attempt to use zfs send/recv before falling back to find+cpio. Today, the zoneadm(1m) clone operations ignores any additional storage (specified via the "fs", "device", or "dataset" resources) that may be associated with the zone. Similarly, the clone operation will ignore additional storage associated with any "zpool" resources. Since zoneadm(1m) clone will be enhanced to support cloning between encapsulated root zones and un-encapsulated root zones, zoneadm(1m) clone will be documented as the recommended migration mechanism for users who which to migrate existing zones from one format to another. ---------- C.2 Storage object uid/gid handling One issue faced by all VTs that support shared storage is dealing with file access permissions of storage objects accessible via NFS. This issue doesn't affect device based shared storage, or local files and vdisks, since these types of storage are always accessible, regardless of the uid of the access process (as long as the accessing process has the necessary privileges). But when accessing files and vdisk via NFS, the accessing process can not use privileges to circumvent restrictive file access premissions. This issue is also complicated by the fact that by default most NFS servier will map all accesses by remote root user to a different uid, usually "nobody". (a process known as "root squashing".) In order to avoid root squashing, or requiring users to setup special configurations on their NFS servers, whenever the zone framework attempts to create a storage object file or vdisk, it will temporarily change it's uid and gid to the "xvm" user and group, and then create the file with 0600 access permissions. 
----------
C.2 Storage object uid/gid handling

One issue faced by all VTs that support shared storage is dealing with
the file access permissions of storage objects accessible via NFS. This
issue doesn't affect device based shared storage, or local files and
vdisks, since these types of storage are always accessible regardless
of the uid of the accessing process (as long as the accessing process
has the necessary privileges). But when accessing files and vdisks via
NFS, the accessing process cannot use privileges to circumvent
restrictive file access permissions. This issue is also complicated by
the fact that by default most NFS servers will map all accesses by a
remote root user to a different uid, usually "nobody" (a process known
as "root squashing").

In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zones framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions. Additionally, whenever the zones
framework attempts to access a storage object file or vdisk, it will
temporarily switch its uid and gid to match the owner and group of the
file/vdisk, ensure that the file is readable and writable by its owner
(updating the file/vdisk permissions if necessary), and finally set up
the file/vdisk for access via a zpool import or lofiadm -a. This will
allow the zones framework to access storage object files/vdisks that
were created by any user, regardless of their ownership, simplifying
file ownership and management issues for administrators.

----------
C.3 Taskq enhancements

The integration of Duckhorn[08] greatly simplifies the management of
cpu resources assigned to zones. This management is partially
implemented through the use of dynamic resource pools, where zones and
their associated cpu resources can both be bound to a pool.

Internally, zfs has worker threads associated with each zpool. These
are kernel taskq threads which can run on any cpu which has not been
explicitly allocated to a cpu set/partition/pool. So today, for any
zone living on zfs filesystems and running in a dedicated cpu pool, any
zfs disk processing associated with that zone is not done by the cpus
bound to that zone's pool. Essentially, all of the zone's zfs
processing is done for "free" by the global zone.

With the introduction of zpools encapsulated within storage objects,
which are themselves associated with specific zones, it would be
desirable to have the zpool worker threads bound to the cpus currently
allocated to the zone. Currently, zfs uses taskq threads for each
zpool, so one way of doing this would be to introduce a mechanism that
allows for the binding of taskqs to pools. Hence we propose the
following new interfaces:

    zfs_poolbind(char *, poolid_t);
    taskq_poolbind(taskq_t, poolid_t);

When a zone which is bound to a pool is booted, the zones framework
will call zfs_poolbind() for each zpool associated with an encapsulated
storage object bound to the zone being booted. Zfs will in turn use the
new taskq pool binding interface to bind all its taskqs to the
specified pool. This mapping is transient, and zfs will not record or
persist this binding in any way.

The taskq implementation will be enhanced to allow for binding worker
threads to a specific pool. If new threads are created for a taskq
which is bound to a specific pool, those threads will inherit the same
pool binding. The taskq to pool binding will remain in effect until the
taskq is explicitly rebound or the pool to which it is bound is
destroyed.

----------
C.4 Zfs enhancements

In addition to the zfs_poolbind() interface proposed above, the
zpool(1m) "import" command will need to be enhanced. Currently, zpool
import by default scans all storage devices on the system looking for
pools to import. The caller can also use the '-d' option to specify a
directory within which the zpool(1m) command will scan for zpools that
may be imported. This scanning involves sampling many objects. When
dealing with zpools encapsulated in storage objects, this scanning is
unnecessary since we already know the path to the objects which contain
the zpool. Hence, the '-d' option will be enhanced to allow for the
specification of a file or device. The user will also be able to
specify this option multiple times, in case the zpool spans multiple
objects.
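Under this proposal, importing a zone's root zpool directly from its
backing objects might then look like the following sketch (the paths
and pool name are illustrative):

---8<---
zpool import -d /export/xvm/vm1.disk web1_rpool
    # import directly from a single file storage object, with no
    # device scanning
zpool import -d /dev/dsk/c4t0d0s0 -d /dev/dsk/c4t1d0s0 web1_rpool
    # '-d' specified multiple times for a zpool spanning multiple
    # storage objects
---8<---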
----------
C.5 Lofi and lofiadm(1m) enhancements

Currently, there is no way for a global zone to access the contents of
a vdisk. Vdisk support was first introduced in VirtualBox. xVM then
adopted the VirtualBox code for vdisk support. With both technologies,
the only way to access the contents of a vdisk is to export it to a VM.
To allow zones to use vdisk devices, we propose to leverage the code
introduced by xVM by incorporating it into lofi. This will allow any
solaris system to access the contents of vdisk devices.

The interface changes to lofi to allow for this are fairly
straightforward. A new '-l' option will be added to the lofiadm(1m)
"-a" device creation mode. The '-l' option will indicate to lofi that
the new device should have a label associated with it. Normally, lofi
devices are named /dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the
lofi device number. When a disk device has a label associated with it,
it exports many device nodes with different names. Therefore lofi will
need to be enhanced to support these new device names, with multiple
nodes per device. These new names will be:

    /dev/lofi/dsk<I>/p<j>   - block device partitions
    /dev/lofi/dsk<I>/s<j>   - block device slices
    /dev/rlofi/dsk<I>/p<j>  - char device partitions
    /dev/rlofi/dsk<I>/s<j>  - char device slices

A new '-v <vdisk-format>' option will be added to the lofiadm(1m) "-a"
device creation mode. This will indicate to lofi that the new device
which is being created will be stored within a vdisk instead of a
normal file. Vdisk formats may provide their own management features
such as snapshotting, compression, encryption, etc. As such, the lofi
vdisk support exists purely to access the contents of vdisks. Hence,
vdisk based lofi devices will not support other lofi options such as
encryption ('-c') and compression ('-C' / '-U'). Also, all vdisks
actually contain disks, so they all contain partition/label data; hence
when attaching a vdisk the '-l' flag is always implied (and should not
be specified).

The vdisk formats that will be supported by lofi are whatever vdisk
formats happen to be supported by xVM at the time of integration. Since
the implementation between lofi and xVM will be shared, as new vdisk
format support is added to xVM, it should be immediately supportable
via lofi as well.

The current xVM implementation for accessing vdisks involves two
drivers and a userland utility. A "frontend" driver runs inside a VM
and exports a normal solaris disk interface. It takes IO requests to
these disks and transmits them, via a ring buffer, to the "backend"
driver running in the global zone. The backend driver then maps these
ring requests into a dedicated vdisk process (there is one such process
for every vdisk), and this process translates the ring requests into
accesses to a vdisk of the requested format.

Given all this existing xVM functionality, the most straightforward way
to support vdisks from within lofi would be to leverage the xVM
implementation. This will involve re-factoring the existing xVM code,
thereby allowing lofi to utilise the "frontend" code which translates
strategy io requests into ring buffer requests, and also the "backend"
code which exports the ring buffer to userland. The unchanged xVM
userland vdisk utility can then be used to map ring buffer requests to
the actual vdisk storage. Currently this utility is only available on
x86, but since lofi is a cross-platform utility, this proposal will
require the delivery of this utility on both sparc and x86. This
utility is currently delivered in an xVM private directory,
/usr/lib/xen/bin/vdisk.
Given that lofi is a more general and cross-platform utility as
compared to xVM, and also given that we don't expect users to access
the vdisk management utilities directly, we propose to move the vdisk
application to /usr/lib/lofi/bin/vdisk.

For RAS purposes, we will need to ensure that this vdisk utility is
always running. Hence we will introduce a new lofi smf service,
svc:/system/lofi:default, which will start a new /usr/lib/lofi/lofid
daemon, which will manage the starting, stopping, monitoring, and
possible restart of the vdisk utility. Restarts of the vdisk utility
should be transparent (aside from a short performance hiccup) to any
zones accessing those vdisks. By default this service will be disabled.
If a lofi vdisk device is created, this service will be temporarily
enabled. When the last vdisk based lofi device is destroyed, this
service will disable itself.

XXX: what to do about disk geometry assignment? sigh.

Here are some examples of how this lofi functionality could be used
(outside of the zones framework). If there are no lofi devices on the
system, and an admin runs the following command:

    lofiadm -a -l /export/xvm/vm1.disk

they would end up with the following devices:

    /dev/lofi/dsk0/p#   - for # == 0 - 4
    /dev/lofi/dsk0/s#   - for # == 0 - 15
    /dev/rlofi/dsk0/p#  - for # == 0 - 4
    /dev/rlofi/dsk0/s#  - for # == 0 - 15

If there are no lofi devices on the system, and an admin runs the
following command:

    lofiadm -a -v /export/xvm/vm1.vmdk

they would end up with the following devices:

    /dev/lofi/dsk0/p#   - for # == 0 - 4
    /dev/lofi/dsk0/s#   - for # == 0 - 15
    /dev/rlofi/dsk0/p#  - for # == 0 - 4
    /dev/rlofi/dsk0/s#  - for # == 0 - 15

By default, format(1m) will not list these devices in its output. But
users will be able to treat these devices like regular disks and pass
their names to utilities like fdisk(1m), format(1m), prtvtoc(1m),
fmthard(1m), zpool(1m), etc.

----------
C.6 Performance considerations

As previously mentioned, this proposal primarily simplifies the process
of configuring zones on shared storage. In most cases these proposed
configurations can be created today, but no one has actually verified
that these configurations perform acceptably. Hence, in conjunction
with providing functionality to simplify the setup of these configs, we
also need to quantify their performance to make sure that none of the
configurations suffer from gross performance problems.

The most straightforward configurations, with the fewest possibilities
for poor performance, are ones using local devices, fibre channel luns,
and iSCSI luns. These configurations should perform identically to
configurations where the global zone uses these objects to host zfs
filesystems without zones. Additionally, the performance of these
configurations will mostly be dependent upon the hardware associated
with the storage devices. Hence the performance of these configurations
is for the most part uninteresting, and performance analysis of these
configurations can be skipped.

Looking at the performance of storage objects which are local files or
nfs files is more interesting. In these cases the zpool that hosts the
zone will be accessing its storage via the zpool vdev_file vdev_ops_t
interface. Currently, this interface doesn't receive as much use and
performance testing as some of the other zpool vdev_ops_t interfaces.
Hence it will be worthwhile to measure the performance of a zpool
backed by a file within another zfs filesystem.
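One simple way to set up that comparison point, for example (the paths,
size, and pool name here are arbitrary):

---8<---
mkfile 4g /tank/perf/file-vdev.disk
    # plain file on an existing local zfs filesystem
zpool create perfpool /tank/perf/file-vdev.disk
    # file-backed zpool exercising the vdev_file code path
---8<---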
Likewise, we will want to measure the performance of a zpool backed by
a file on an NFS filesystem. Finally, we should compare these two
performance points to a zone which is not encapsulated within a zpool,
but is instead installed directly on a local zfs filesystem. (These
comparisons are not really that interesting when dealing with block
device based storage objects.) We will also want to determine if there
are any specific NFS mount options that could affect performance and
should be used; for example, should "forcedirectio" be enabled?

Currently, while it is very common to deploy large numbers of zfs
filesystems, systems with large numbers of zpools are not very common.
The solution proposed in this project will likely result in an increase
in the number of zpools on systems hosting zones. Hence, we should
evaluate the impact of an increasing number of zpools on performance
scalability. This could be done by comparing the io performance
drop-off of an increasing number of zones hosted in multiple zfs
filesystems in a single zpool vs. zones hosted in separate zpools.

Finally, it will be important to do performance measurements for vdisk
configurations. These configurations are similar to the local file or
nfs configurations, but they will be utilising the vdev_disk backend
and they will have an additional layer of indirection through lofi.

XXX: impact of multiple zpools on arc and l2 arc? talk to mark maybee.

----------
C.7 Phased delivery

Customers have been asking for a simple mechanism to allow hosting of
zones on NFS since the introduction of zones. Hence we'd like to get
this functionality into the hands of customers as quickly as possible.
Also, the approach taken by this proposal to supporting zones on shared
storage is different from what was originally anticipated, hence we'd
like to get practical experience with this approach at customer sites
asap to determine if there are situations where it may not meet their
requirements.

To accelerate the delivery of the previously proposed features, we plan
to deliver them in three phases:

    I - Basic zone encapsulation support
        This will involve the introduction of the new "rootzpool" and
        "zpool" resources and support for the "file" and "nfs" type
        storage objects. It will require implementation of all the
        proposed zone, zfs, and taskq changes. This phase will not
        require any lofi changes.

    II - Zone encapsulation on fc and iSCSI storage
        This will provide enhanced so-uri support for fibre channel and
        iSCSI type storage objects.

    III - Zone encapsulation on vdisk storage
        This will involve implementation of the proposed lofi
        enhancements and so-uri support for vdisk type storage objects.

----------
C.8 Future work

In addition to the work proposed above, there are future enhancements
that could be made which would extend that functionality. These are
included here because the features proposed above have been designed
with this extensibility in mind.

One simple enhancement would be the addition of a new
"rootzpool"/"zpool" resource property called "zpool-auto-upgrade". If
set to true, whenever a zpool is imported by the zones framework (at
install, attach, or boot) the zpool and zfs filesystems would be
upgraded to the latest versions supported by the global zone. Aside
from helping to ensure that zones are using the latest and greatest
zpool/zfs features, this feature would help ensure that all the
encapsulated zones on a system are running on the same zpool/zfs
versions. This is important because zfs send and receive are not
guaranteed to be compatible across zpool/zfs versions. So by ensuring
that all zones are running the latest zpool/zfs versions, we increase
the chances of being able to use zfs send/recv for zone cloning.
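If such a property were added, the framework would in effect run the
equivalent of the following on each import (pool name illustrative):

---8<---
zpool upgrade web1_rpool
    # upgrade the pool to the latest version supported by this host
zfs upgrade -r web1_rpool
    # recursively upgrade the filesystem versions within the pool
---8<---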
Another possible enhancement to the "rootzpool"/"zpool" resources would
be to allow the specification of multiple so-uris, to support more
complex root zpool configurations.

With the introduction of vdisk support, we will have the ability to
access VMs created by other VTs. Using this functionality it would be
straightforward to enhance the existing zoneadm(1m) attach p2v
functionality to be able to do v2v, where the source system image for
installing a zone could be a vdisk created by another VT.

Finally, a new "storage" zonecfg(1m) resource could be added to allow
for the addition of arbitrary storage objects to zones. This "storage"
resource would have similar properties to a "zpool" resource, along
with a few additional properties:

    storage                 resource
        src                 resource property
        install-size        resource property
        name                resource property
        locking-required    resource property

The zones framework would ensure that the specified storage objects
were accessible from the global zone, and then it would grant the zone
access to the raw disk devices. But global zone disk device names can
be different on different hosts, so if global zone device names were
used within the non-global zone, this would negatively impact zone
migration, since the software within the zone would have to be updated
to deal with potentially new device names. To avoid this problem and
facilitate zone migration, the disk devices would be mapped into the
non-global zone with different names which would be global zone
independent. Regardless of the global zone disk device name, from
within the non-global zone the devices would be named:

    /dev/storage/<name>/p<j>
    /dev/storage/<name>/s<j>

By basing the names of the devices as seen from within the zone on the
"name" specified in zonecfg(1m) (instead of the raw device names from
the global zone), we can guarantee that these device names will not
change when migrating the zone between hosts.

One difficulty with allowing the specification of storage objects which
don't by default contain a zpool is that we can no longer use zfs to
ensure that multiple entities are not accessing these objects at the
same time. Hence other mechanisms will need to be used. In the case of
fibre channel and iSCSI, we should be able to use SCSI reservations. In
the case of file paths we should be able to use standard file locking.
In the case of vdisks, some vdisks support meta-data which can be
utilised to prevent concurrent access, and if a vdisk format does not
allow for this then we will need to fall back to file locking. If for
some reason we are unable to do in-use detection for a specific storage
object, then we could alert the user and refuse to use that storage
object unless the user sets the optional "locking-required" property to
false.
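A rough sketch of how such a future "storage" resource might be
specified (this resource is future work, not part of the current
proposal; the zone and volume names are purely illustrative):

---8<---
zonecfg -z web1
add storage
set src=path:///dev/vx/dsk/web1/oradata
set name=oradata
end
commit
---8<---

Inside the zone, the device would then show up under
/dev/storage/oradata/ regardless of which host the zone is running on.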
----------
D. INTERFACES

Zonecfg(1m):
    rootzpool                   committed, resource
    src                         committed, resource property
    install-size                committed, resource property
    zpool-preserve              committed, resource property
    dataset                     committed, resource property
    zpool                       resource
    src                         resource property
    install-size                resource property
    zpool-preserve              resource property
    name                        resource property

Zoneadm(1m):
    install -f                  committed, optional flag
    attach -f                   committed, optional flag

Zones misc:
    /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
                                project private, nfs mount point

Taskq pool binding:
    taskq_poolbind(taskq_t, poolid_t)
                                consolidation private

Zfs pool binding:
    zfs_poolbind(char *, poolid_t)
                                consolidation private

Lofiadm(1m):
    -a -l                       committed, optional flag
    -a -v <vdisk-format>        committed, optional flag

Lofi misc:
    svc:/system/lofi:default    project private
    /lib/lofi/lofid             project private
    /lib/lofi/vdisk             consolidation private /w contract for xVM

----------
E. ABBREVIATIONS

    VDI  - VirtualBox Virtual HDD
    VHD  - Microsoft Virtual Hard Disk
    VM   - Virtual Machine
    VMDK - VMWare Virtual Machine Disk
    VT   - Virtualization Technology

----------
F. FOOTNOTES / REFERENCES

--
00 - Blogs and emails describing custom configurations used to host
     zones on shared storage:

     June 2005
     Containers on NFS?
     http://blogs.sun.com/jph/entry/containers_on_nfs

     February 2008
     The quick & dirty guide to zones on iSCSI LUNs
     http://opensolaris.org/jive/thread.jspa?messageID=226951

     April 2008
     ZoiT: Solaris Zones on iSCSI Targets (aka NAC: Network-Attached
     Containers)
     http://blogs.sun.com/JeffV/entry/zoit_solaris_zones_on_iscsi

--
01 - For the purposes of this document, the term storage object refers
     to any device or file which can be used to store data. These
     "objects" can take the form of files on a local filesystem, files
     on a remote filesystem accessible via NFS, CIFS, etc, local disk
     devices, FC disk devices, iSCSI target devices, etc.

--
02 - Each VT has a different name for the zones concept of the "global
     zone". With LDOMs we have the "control domain", with xVM we have
     "dom0", and with VirtualBox we have the "host domain". For the
     purposes of this document, the "global zone" refers to the OS
     entity which allocates resources to and has control over the state
     of VMs. For simplicity, this document will always just refer to
     the "global zone" in place of all these other terms.

--
03 - Complete so-uri ABNF syntax definition:

     so-uri           = path-uri / nfs-uri / vpath-uri / vnfs-uri /
                        fc-uri / iscsi-uri

     path-uri         = "path://" file-absolute

     file-absolute    = "/" *( segment "/" ) segment-nz
                        ; Since file-absolute is always the last
                        ; component in a uri, reserved characters do
                        ; not need to be percent encoded. To simplify
                        ; path management we will also not permit
                        ; '/./' or '/../' segments.

     nfs-uri          = "nfs://" hostport "/" file-absolute
                        ; Defined in RFC 2224 [05]

     hostport         = host [ ":" port ]

     vpath-uri        = "vpath://" file-absolute

     vnfs-uri         = "vnfs://" host [ ":" port ] "/" file-absolute
                        ; same syntax as nfs-uri

     fc-uri           = "fc:///" wwn [ "@" lun ]

     wwn              = 1000 12HEXDIG / 2 15HEXDIG /
                        50 14HEXDIG / 60 14HEXDIG
                        ; see http://en.wikipedia.org/wiki/World_Wide_Name

     iscsi-uri        = "iscsi:///" iscsi-alias /
                        "iscsi:///" iscsi-target /
                        "iscsi://" iscsi-static

     iscsi-alias      = "alias=" iscsi-alias-name [ "@" lun ]

     iscsi-alias-name = 1*255pchar
                        ; Defined in RFC 3720 [06]
                        ; any "@" chars must be percent encoded

     iscsi-target     = "target=" iscsi-target-iqn [ "@" lun ] /
                        "target=" iscsi-target-eui [ "@" lun ]

     iscsi-target-iqn = "iqn." iqn-date "."
reg-name [ ":" 1*pchar ] ; Defined in RFC 3720 [06] ; any "@" chars must be percent encoded iqn-date = 4DIGIT "-" "0" %x31-39 ; "XXXX-01" - "XXXX-09" / 4DIGIT "-" "1" %x30-32 ; "XXXX-10" - "XXXX-12" iscsi-target-eui = XXX ; Defined in RFC 3720 [06] ; any "@" chars must be percent encoded iscsi-static = hostport "/" [ "tpgt=" iscsi-tpgt "/" ] iscsi-target iscsi-tpgt = DIGIT ; 0-9 / %x31-39 1*3DIGIT ; 10-9999 / %x31-35 4DIGIT ; 10000-59999 / "6" %x30-34 3DIGIT ; 60000-64999 / "65" %x30-34 2DIGIT ; 65000-65499 / "655" %x30-32 1DIGIT ; 65500-65529 / "6553" %x30-35 ; 65530-65535 lun = DIGIT ; 0-9 / %x31-39 1*3DIGIT ; 10-9999 / "1" %x30-35 3DIGIT ; 10000-15999 / "16" %x30-32 2DIGIT ; 16000-16299 / "163" %x30-37 1DIGIT ; 16300-16379 / "1638" %x30-33 ; 16380-16383 DIGIT = Defined in RFC 3986 [04] HEXDIG = Defined in RFC 3986 [04] segment-nz = Defined in RFC 3986 [04] segment = Defined in RFC 3986 [04] host = Defined in RFC 3986 [04] port = Defined in RFC 3986 [04] -- 04 - RFC 3986: Uniform Resource Identifier (URI): Generic Syntax http://www.ietf.org/rfc/rfc3986.txt http://www.websitedev.de/temp/rfc3986-check.html.gz -- 05 - RFC 2224 NFS URL Scheme http://www.ietf.org/rfc/rfc2224.txt -- 06 - RFC 3720: Internet Small Computer Systems Interface (iSCSI) -- 07 - Zones/SNAP Design - 8/25/2008 http://www.opensolaris.org/jive/thread.jspa?messageID=272726񂥖 -- 08 - PSARC/2006/496 Improved Zones/RM Integration http://arc.opensolaris.org/caselog/PSARC/2006/496 -------------------------------------------------------------------------------