hey illya, thanks for reviewing this and sorry for the delay in replying. my comments are inline below.
i've also attached a document that contains some of the design doc
sections i've revised based on your (and john's) feedback. since the
document is large, i've only included sections that i've modified, and
i've included changebars in the right most column.

ed

On Mon, Sep 07, 2009 at 11:07:41AM +0300, Illya Kysil wrote:
> Hi Edward,
>
> See comments and questions below inline:
>
> 1. Section C.0
> > ... That said, nothing in this proposal should not prevent us from adding
> > support for...
> That "not" before "prevent" is superfluous.
>

done.

> 2. Section C.1.i
> How many instances of "rootzpool" and "zpool" resources is permitted?
> IMO "zero or one" for "rootzpool" and "zero or more" for "zpool" is enough.
>

correct. i've updated the proposal to mention this.

> 3. Section C.1.iii
> > The newly created or imported root zpool will be named after the zone to
> > which it is associated, with the assigned name being "<zonename>_rpool".
> What if zone name is changed later? Will the name of the zpool change as well?
> It is not clear how the association with the zpool will be maintained
> if its name will not change.
>

the pool name must be kept in sync with the zone name. unfortunately
this presents a problem because currently zone renaming is done from
zonecfg and it can be done when a zone is installed, but since zone
zpools are imported at install time, we need to disallow re-naming of
installed zones. hence, i've added the following new section:
    C.1.x Zonecfg(1m) misc changes

> > This zpool will then be mounted on the zones zonepath and then the
> > install process will continue normally.
> >
> > XXX: use altroot at zpool creation or just manually mount zpool?
> What will happen on zoneadm move? IMO the zones framework have to
> remount the zpool in the new location.
>

that's correct. i've added another new section:
    C.1.ix Zoneadm(1m) move

> > If the user has specified a "zpool" resource, then the zones framework
> > will configure, initialize, and/or import it in a similar manner to a
> > zpool specified by the "rootzpool" resource. The key differences are
> > that the name of the newly created or imported zpool will be
> > "<zonename>_<name>".
> What if zone name or zpool resource name is changed later? Will the
> name of the zpool change as well?
> It is not clear how the association with the zpool will be maintained
> if its name will not change.
>
> 4. Section C.1.viii

once again, the zpool name will need to be kept in sync with the
zonename.

> > Since zoneadm(1m) clone will be enhanced to support cloning between
> > encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
> > clone will be documented as the recommended migration mechanism for
> > users who which to migrate existing zones from one format to another.
> "users who which" -> "users who wish"
>
> 5. Section C.5
> > For RAS purposes,...
> What is RAS?
>

a TLA. :) it stands for Reliability, Availability, and Serviceability.

> > Here's some examples of how this lofi functionality could be used
> > (outside of the zone framework). If there are no lofi devices on
> > the system, and an admin runs the following command:
> >     lofiadm -a -l /export/xvm/vm1.disk
> >
> > they would end up with the following device:
> >     /dev/lofi/dsk0/p# - for # == 0 - 4
> >     /dev/lofi/dsk0/s# - for # == 0 - 15
> >     /dev/rlofi/dsk0/p# - for # == 0 - 4
> >     /dev/rlofi/dsk0/s# - for # == 0 - 15
> >
> > If there are no lofi devices on the system, and an admin runs the
> > following command:
> >     lofiadm -a -v /export/xvm/vm1.vmdk
> >
> > they would end up with the following device:
> >     /dev/lofi/dsk0/p# - for # == 0 - 4
> >     /dev/lofi/dsk0/s# - for # == 0 - 15
> >     /dev/rlofi/dsk0/p# - for # == 0 - 4
> >     /dev/rlofi/dsk0/s# - for # == 0 - 15
> The list of devices is the same in both examples. What's the difference?
>

the difference is in the invocation, not in the result. i've re-worded
this section to make things more clear.

> 6. Section D
> > D. INTERFACES
> >
> > Zonecfg(1m):
> >     rootzpool committed, resource
> >     src committed, resource property
> >     install-size committed, resource property
> What is the meaning of "committed" here?
>

this is arc terminology. basically, i'm presenting a new user interface
here and i'm saying that it isn't going to change incompatibly in the
future.

> > Zones misc:
> >     /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
> >     project private, nfs mount point
> The mount point is different from what is described in section C.1.iii
> (see additional comment above):

oops. fixed.

> > If an so-uri points to an explicit nfs server, the zones framework will
> > need to mount the nfs filesystem containing storage object. The nfs
> > server share containing the specified object will be mounted at:
> >     /var/zones/nfsmount/<host>/<nfs-share-name>
>
> 7. What will happen to storage objects on "zonecfg delete"?
>

nothing. a zonecfg delete just deletes a zone's configuration. no
storage objects associated with the zone will be touched.
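
As an illustration of why the zpool names have to track the zone name (items 3 and 4 above), here is a minimal Python sketch of the naming rule quoted from the proposal: "<zonename>_rpool" for a "rootzpool" resource and "<zonename>_<name>" for each "zpool" resource. The helper name zone_zpool_names and the zone and resource names are hypothetical, made up for this example; this is not code from the zones framework.

```python
def zone_zpool_names(zonename, zpool_resource_names=(), has_rootzpool=True):
    """Return the zpool names implied by a zone name.

    Hypothetical helper: it only encodes the naming rule quoted in the
    proposal ("<zonename>_rpool" and "<zonename>_<name>").
    """
    names = []
    if has_rootzpool:
        # At most one "rootzpool" resource may be defined per zone.
        names.append(f"{zonename}_rpool")
    # Zero or more "zpool" resources, each with a "name" property.
    names.extend(f"{zonename}_{name}" for name in zpool_resource_names)
    return names


if __name__ == "__main__":
    # Example zone and resource names are assumptions for illustration.
    before = zone_zpool_names("web", ["data"])
    after = zone_zpool_names("web2", ["data"])
    print(before)
    print(after)
    # Every pool name changes when the zone is renamed, which is why an
    # installed zone would have to be detached before it is renamed.
    print(set(before).isdisjoint(after))
```

Since every derived name embeds the zone name, a rename of an installed zone would leave the imported pools with stale names, which matches the detach, rename, re-attach sequence described in C.1.x.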
<... snip ...>

----------
C.1.i Zonecfg(1m)

The zonecfg(1m) command will be enhanced with the following two new
resources and associated properties:

    rootzpool               resource
        src                 resource property
        install-size        resource property
        zpool-preserve      resource property
        dataset             resource property
        user                resource property                                  +

    zpool                   resource
        src                 resource property
        install-size        resource property
        zpool-preserve      resource property
        name                resource property
        user                resource property                                  +

The new resources and properties will be defined as follows:

    "rootzpool"
    - Status:       Optional.                                                  +
    - Description:  Identifies a shared storage object (and its
                    associated parameters) which will be used to
                    contain the root zfs                                       |
                    filesystem for a zone. Only one "rootzpool" may be defined |
                    per zone.                                                  +

    "zpool"
    - Status:       Optional.                                                  +
    - Description:  Identifies a shared storage object (and its
                    associated parameters) which will be made available
                    to the zone as a delegated zfs dataset.
                    Multiple "zpool" resources                                 |
                    may be defined per zone.                                   +

<... snip ...>

    "user"                                                                     +
    - Status:       Optional                                                   +
    - Format:       User name string                                           +
    - Description:  User name to use when accessing a path:// or nfs:// based  +
                    storage object.                                            +
                                                                               +
<... snip ...>

----------
C.1.ii Storage object uri (so-uri) format

The storage object uri (so-uri) syntax will conform to the standard uri
format defined in RFC 3986. The nfs URI scheme is defined in RFC 2224.
The so-uri syntax can be summarised as follows:

    File and vdisk storage objects:                                            |

        path:///<file-absolute>
        nfs://<host>[:port]/<file-absolute>

<... snip ...>

Vdisk storage objects are similar to file storage objects in that they
can live on local, nfs, or cifs filesystems, but they each have their
own special data format and varying feature sets, with support for things      |
like snapshotting, etc. Some common vdisk formats are: VDI, VMDK and VHD.
Some example vdisk so-uris are:                                                |

    path:///export/xvm/vm1.vmdk                                                |
        - a local vdisk image
    path:///net/heaped.sfbay/export/xvm/1.vmdk                                 |
        - an nfs vdisk image accessible via autofs
    nfs://heaped.sfbay/export/xvm/1.vmdk                                       |
        - same vdisk image specified directly via an nfs so-uri

<... snip ...>

----------
C.1.ix Zoneadm(1m) move                                                        +
                                                                               +
Since an installed zone with a "rootzpool" resource will have zfs              +
datasets mounted on its zonepath, the zoneadm(1m) move subcommand will         +
have to handle unmounting and remounting the "rootzpool" in the process        +
of doing the move operation.                                                   +
                                                                               +
                                                                               +
----------                                                                     +
C.1.x Zonecfg(1m) misc changes                                                 +
                                                                               +
Currently renaming of a zone is done via zonecfg(1m) by setting the            +
zonename property to the new desired zone name. This operation can be          +
done while the zone is in the "configured" or "installed" state. This          +
presents a problem for installed zones which have a "rootzpool" or             +
"zpool" resource, since these zones have associated zpools and the             +
names of these zpools are dependent upon the zone's name. Hence,               +
zonecfg(1m) will need to be modified to prevent the renaming of                +
installed zones which have a "rootzpool" or "zpool" resource.                  +
To rename these zones, the administrator will need to                          +
detach the zone via zoneadm(1m), rename it via zonecfg(1m), and                +
subsequently re-attach the zone via zoneadm(1m).                               +
                                                                               +
Another non-obvious impact upon zonecfg(1m) operation is that since            +
"rootzpool" or "zpool" resources are only configured during zone install       +
and attach, it will not be possible to add, remove, or modify                  +
"rootzpool" or "zpool" resources for an installed zone. Once again, to         +
change these resources the admin will need to detach and re-attach the         +
zone.                                                                          +
                                                                               +
----------
C.2 Storage object uid/gid handling

One issue faced by all VTs that support shared storage is dealing with
file access permissions of storage objects accessible via NFS.
This issue doesn't affect device based shared storage, or local files
and vdisks, since these types of storage are always accessible,
regardless of the uid of the accessing process (as long as the accessing
process has the necessary privileges). But when accessing files and
vdisks via NFS, the accessing process cannot use privileges to
circumvent restrictive file access permissions. This issue is also
complicated by the fact that by default most NFS servers will map all
accesses by a remote root user to a different uid, usually "nobody" (a
process known as "root squashing").

In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zone framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.

If the optional storage object "user" configuration parameter is not           |
specified, then whenever the zones framework attempts to access a              |
storage object file or vdisk it will temporarily switch its uid and gid        |
to match the owner and group of the file/vdisk, ensure that the file is        |
readable and writable by its owner (updating the file/vdisk                    |
permissions if necessary), and finally set up the file/vdisk for access        |
via a zpool import or lofiadm -a. This should allow the zones                  |
framework to access storage object files/vdisks that were created by any       |
user, regardless of their ownership, simplifying file ownership and            |
management issues for administrators.                                          +

If the optional storage object "user" configuration parameter has been         +
specified, then instead of temporarily assuming the uid matching the           +
owner of the file/vdisk, the zones framework will instead assume the uid       +
associated with "user". All the other subsequent operations listed             +
above regarding the group id, permissions, etc. will remain the same.          +

<... snip ...>

----------
C.5 Lofi and lofiadm(1m) enhancements

Currently, there is no way for a global zone to access the contents of a
vdisk. Vdisk support was first introduced in VirtualBox. xVM then
adopted the VirtualBox code for vdisk support. With both technologies,
the only way to access the contents of a vdisk is to export it to a VM.

To allow zones to use vdisk devices we propose to leverage the code
introduced by xVM by incorporating it into lofi. This will allow any
solaris system to access the contents of vdisk devices. The interface
changes to lofi to allow for this are fairly straightforward.

A new '-l' option will be added to the lofiadm(1m) "-a" device creation
mode. The '-l' option will indicate to lofi that the new device should
have a label associated with it. Normally lofi devices are named
/dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the lofi device number.
When a disk device has a label associated with it, it exports many
device nodes with different names. Therefore lofi will need to be
enhanced to support these new device names, with multiple nodes per
device. These new names will be:

    /dev/lofi/dsk<I>/p<j>       - block device partitions
    /dev/lofi/dsk<I>/s<j>       - block device slices
    /dev/rlofi/dsk<I>/p<j>      - char device partitions
    /dev/rlofi/dsk<I>/s<j>      - char device slices

Lofi will also be enhanced to support accessing vdisks in addition to          |
normal files. Accessing a vdisk via lofi will be no different from             |
accessing a disk encapsulated within a file as described above with the        |
-l option. The lofi '-a' option will be enhanced to allow for the              |
automatic detection of vdisk targets on the lofi command line. Since           |
all vdisks represent actual disks they all contain partition/label             |
data, hence if a vdisk is detected on the command line, the '-l'               |
flag is automatically implied. The vdisk formats that will be supported        |
by lofi are whatever vdisk formats happen to be supported by xVM at the        |
time of integration.                                                           |
Since the implementation between lofi and xVM will                             |
be shared, as new vdisk format support is added to xVM, it should be           |
immediately supportable via lofi as well. In many cases, vdisk formats         |
may provide their own management features such as snapshotting,                |
compression, encryption, etc. As such, the lofi vdisk support exists           |
purely to access the contents of vdisks. Hence, vdisk based lofi               |
devices will not support other lofi options such as encryption ('-c')          +
and compression ('-C' / '-U').                                                 +

The current xVM implementation for accessing vdisks involves two drivers
and a userland utility. A "frontend" driver runs inside a VM and it
exports a normal solaris disk interface. It takes IO requests to these
disks and transmits them, via a ring buffer, to the "backend" driver
running in the global zone. The backend driver then maps these ring
requests into a dedicated vdisk process (there is one such process for
every vdisk), and this process translates these ring requests into
access to a vdisk of the requested format.

Given all this existing xVM functionality, the most straightforward way
to support vdisk from within lofi would be to leverage the xVM
implementation. This will involve re-factoring the existing xVM code,
thereby allowing lofi to utilise the "frontend" code which translates
strategy io requests into ring buffer requests, and also the "backend"
code which exports the ring buffer to userland. The unchanged xVM
userland vdisk utility can then be used to map ring buffer requests to
the actual vdisk storage. Currently this utility is only available on
x86, but since lofi is a cross-platform utility, this proposal will
require the delivery of this utility on both sparc and x86. This
utility is currently delivered in an xVM private directory,
/usr/lib/xen/bin/vdisk.
Given that lofi is a more general and cross platform utility as compared
to xVM, and also given that we don't expect users to access the vdisk
management utilities directly, we propose to move the vdisk application
to /usr/lib/lofi/bin/vdisk.

For RAS (Reliability, Availability, and Serviceability) purposes, we
will need to ensure that this vdisk utility is always running. Hence we
will introduce a new lofi smf service svc:/system/lofi:default, which
will start a new /usr/lib/lofi/lofid daemon, which will manage the
starting, stopping, monitoring, and possible re-start of the vdisk
utility. Re-starts of the vdisk utility should be transparent (aside
from a short performance hiccup) to any zones accessing those vdisks.
By default this service will be disabled. If a lofi vdisk device is
created, this service will be temporarily enabled. When the last vdisk
based lofi device is destroyed, this service will disable itself.

XXX: what to do about disk geometry assignment? sigh.

Here are some examples of how this lofi functionality could be used
(outside of the zone framework).                                               |
                                                                               |
If an admin wanted to access a disk embedded within a plain file then          +
they could run the following command:                                          +
    lofiadm -a -l /export/xvm/vm1.disk

If an admin wanted to access a vdisk (which always implicitly has a disk       |
embedded within it) they could run the following command:                      |
    lofiadm -a /export/xvm/vm1.vmdk                                            |

In both of the examples above, if there were no previously configured          |
lofi devices on the system, the following new lofi devices would               |
be created:                                                                    |
    /dev/lofi/dsk0/p#       - for # == 0 - 4
    /dev/lofi/dsk0/s#       - for # == 0 - 15
    /dev/rlofi/dsk0/p#      - for # == 0 - 4
    /dev/rlofi/dsk0/s#      - for # == 0 - 15

By default, format(1m) will not list these devices in its output. But
users will be able to treat these devices like regular disks and pass
their names to utilities like fdisk(1m), format(1m), prtvtoc(1m),
fmthard(1m), zpool(1m), etc.

<... snip ...>
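
The device node layout proposed in C.5 for a labeled lofi device can be sketched as follows. This is only an illustration of the naming scheme (/dev/lofi/dsk<I>/p<j>, /dev/lofi/dsk<I>/s<j> and their /dev/rlofi counterparts), assuming the partition and slice ranges shown in the example above (p0 - p4 and s0 - s15). The function name lofi_label_nodes is hypothetical and is not part of lofi or lofiadm(1m).

```python
def lofi_label_nodes(dev_num, n_partitions=5, n_slices=16):
    """List the device nodes proposed for a labeled lofi device.

    Hypothetical helper; the default counts are taken from the C.5
    example (partitions 0 - 4, slices 0 - 15) and are assumptions.
    """
    nodes = []
    # /dev/lofi holds the block nodes, /dev/rlofi the char (raw) nodes.
    for top in ("/dev/lofi", "/dev/rlofi"):
        for j in range(n_partitions):
            nodes.append(f"{top}/dsk{dev_num}/p{j}")
        for j in range(n_slices):
            nodes.append(f"{top}/dsk{dev_num}/s{j}")
    return nodes


if __name__ == "__main__":
    # First labeled lofi device on a system with no other lofi devices.
    for node in lofi_label_nodes(0):
        print(node)
```

With the default ranges this lists 21 block nodes and 21 char nodes for device number 0, which is the set of names an admin could pass to utilities such as prtvtoc(1m) or zpool(1m).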
_______________________________________________ zones-discuss mailing list firstname.lastname@example.org