hey illya, thanks for reviewing this and sorry for the delay in replying. my comments are inline below.
i've also attached a document that contains some of the design doc
sections i've revised based on your (and john's) feedback.  since the
document is large, i've only included sections that i've modified, and
added changebars in the rightmost column.

ed

On Mon, Sep 07, 2009 at 11:07:41AM +0300, Illya Kysil wrote:
> Hi Edward,
>
> See comments and questions below inline:
>
> 1. Section C.0
> > ... That said, nothing in this proposal should not prevent us from adding
> > support for...
> That "not" before "prevent" is superfluous.
>

done.

> 2. Section C.1.i
> How many instances of "rootzpool" and "zpool" resources is permitted?
> IMO "zero or one" for "rootzpool" and "zero or more" for "zpool" is enough.
>

correct.  i've updated the proposal to mention this.

> 3. Section C.1.iii
> > The newly created or imported root zpool will be named after the zone to
> > which it is associated, with the assigned name being "<zonename>_rpool".
> What if zone name is changed later? Will the name of the zpool change as well?
> It is not clear how the association with the zpool will be maintained
> if its name will not change.
>

the pool name must be kept in sync with the zone name.  unfortunately
this presents a problem because currently zone renaming is done from
zonecfg, and it can be done while a zone is installed, but since zone
zpools are imported at install time, we need to disallow renaming of
installed zones.  hence, i've added the following new section:
    C.1.x Zonecfg(1m) misc changes

> > This zpool will then be mounted on the zones zonepath and then the
> > install process will continue normally[07].
> >
> > XXX: use altroot at zpool creation or just manually mount zpool?
> What will happen on zoneadm move? IMO the zones framework have to
> remount the zpool in the new location.
>

that's correct.  i've added another new section:
    C.1.ix Zoneadm(1m) move

> > If the user has specified a "zpool" resource, then the zones framework
> > will configure, initialize, and/or import it in a similar manner to a
> > zpool specified by the "rootzpool" resource.  The key differences are
> > that the name of the newly created or imported zpool will be
> > "<zonename>_<name>".
> What if zone name or zpool resource name is changed later? Will the
> name of the zpool change as well?
> It is not clear how the association with the zpool will be maintained
> if its name will not change.
>
> 4. Section C.1.viii

once again, the zpool name will need to be kept in sync with the
zonename.

> > Since zoneadm(1m) clone will be enhanced to support cloning between
> > encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
> > clone will be documented as the recommended migration mechanism for
> > users who which to migrate existing zones from one format to another.
> "users who which" -> "users who wish"
>
> 5. Section C.5
> > For RAS purposes,...
> What is RAS?
>

a TLA.  :)  it stands for Reliability, Availability, and Serviceability.

> > Here's some examples of how this lofi functionality could be used
> > (outside of the zone framework).
> > If there are no lofi devices on the system, and an admin runs the
> > following command:
> >     lofiadm -a -l /export/xvm/vm1.disk
> >
> > they would end up with the following device:
> >     /dev/lofi/dsk0/p#   - for # == 0 - 4
> >     /dev/lofi/dsk0/s#   - for # == 0 - 15
> >     /dev/rlofi/dsk0/p#  - for # == 0 - 4
> >     /dev/rlofi/dsk0/s#  - for # == 0 - 15
> >
> > If there are no lofi devices on the system, and an admin runs the
> > following command:
> >     lofiadm -a -v /export/xvm/vm1.vmdk
> >
> > they would end up with the following device:
> >     /dev/lofi/dsk0/p#   - for # == 0 - 4
> >     /dev/lofi/dsk0/s#   - for # == 0 - 15
> >     /dev/rlofi/dsk0/p#  - for # == 0 - 4
> >     /dev/rlofi/dsk0/s#  - for # == 0 - 15
> The list of devices is the same in both examples. What's the difference?
>

the difference is in the invocation, not in the result.  i've re-worded
this section to make things clearer.

> 6. Section D
> > D. INTERFACES
> >
> > Zonecfg(1m):
> >     rootzpool           committed, resource
> >     src                 committed, resource property
> >     install-size        committed, resource property
> What is the meaning of "committed" here?
>

this is arc terminology.  basically, i'm presenting a new user interface
here and i'm saying that it isn't going to change incompatibly in the
future.

> > Zones misc:
> >     /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
> >     project private, nfs mount point
> The mount point is different from what is described in section C.1.iii
> (see additional comment above):

oops.  fixed.

> > If an so-uri points to an explicit nfs server, the zones framework will
> > need to mount the nfs filesystem containing storage object.  The nfs
> > server share containing the specified object will be mounted at:
> >     /var/zones/nfsmount/<host>/<nfs-share-name>
>
> 7. What will happen to storage objects on "zonecfg delete"?
>

nothing.  a zonecfg delete just deletes a zone's configuration.  no
storage objects associated with the zone will be touched.
<... snip ...>

----------
C.1.i Zonecfg(1m)

The zonecfg(1m) command will be enhanced with the following two new
resources and associated properties:

    rootzpool               resource
        src                 resource property
        install-size        resource property
        zpool-preserve      resource property
        dataset             resource property
        user                resource property                          +

    zpool                   resource
        src                 resource property
        install-size        resource property
        zpool-preserve      resource property
        name                resource property
        user                resource property                          +

The new resources and properties will be defined as follows:

    "rootzpool"
        - Status: Optional.                                             +
        - Description: Identifies a shared storage object (and its
          associated parameters) which will be used to contain the
          root zfs filesystem for a zone.  Only one "rootzpool" may     |
          be defined per zone.                                          +

    "zpool"
        - Status: Optional.                                             +
        - Description: Identifies a shared storage object (and its
          associated parameters) which will be made available to the
          zone as a delegated zfs dataset.  Multiple "zpool"            |
          resources may be defined per zone.                            +

    <... snip ...>

    "user"                                                              +
        - Status: Optional                                              +
        - Format: User name string                                      +
        - Description: User name to use when accessing a path:// or     +
          nfs:// based storage object.                                  +
                                                                        +
    <... snip ...>

----------
C.1.ii Storage object uri (so-uri) format

The storage object uri (so-uri) syntax[03] will conform to the standard
uri format defined in RFC 3986 [04].  The nfs URI scheme is defined in
RFC 2224 [05].  The so-uri syntax can be summarised as follows:

    File and vdisk storage objects:                                     |
        path:///<file-absolute>
        nfs://<host>[:port]/<file-absolute>

    <... snip ...>

Vdisk storage objects are similar to file storage objects in that they
can live on local, nfs, or cifs filesystems, but they each have their
own special data format and varying feature sets, with support for      |
things like snapshotting, etc.  Some common vdisk formats are: VDI,
VMDK, and VHD.  Some example vdisk so-uris are:                         |

    path:///export/xvm/vm1.vmdk                                         |
        - a local vdisk image
    path:///net/heaped.sfbay/export/xvm/1.vmdk                          |
        - an nfs vdisk image accessible via autofs
    nfs://heaped.sfbay/export/xvm/1.vmdk                                |
        - the same vdisk image specified directly via an nfs so-uri

    <... snip ...>

----------
C.1.ix Zoneadm(1m) move                                                 +
                                                                        +
Since an installed zone with a "rootzpool" resource will have zfs       +
datasets mounted on its zonepath, the zoneadm(1m) move subcommand       +
will have to handle unmounting and remounting the "rootzpool" in the    +
process of doing the move operation.                                    +
                                                                        +
                                                                        +
----------                                                              +
C.1.x Zonecfg(1m) misc changes                                          +
                                                                        +
Currently, renaming of a zone is done via zonecfg(1m) by setting the    +
zonename property to the new desired zone name.  This operation can     +
be done while the zone is in the "configured" or "installed" state.     +
This presents a problem for installed zones which have a "rootzpool"    +
or "zpool" resource, since these zones have associated zpools and the   +
names of these zpools are dependent upon the zone's name.  Hence,       +
zonecfg(1m) will need to be modified to prevent the renaming of         +
installed zones which have a "rootzpool" or "zpool" resource.  To       +
rename these zones, the administrator will need to detach the zone      +
via zoneadm(1m), rename it via zonecfg(1m), and subsequently            +
re-attach the zone via zoneadm(1m).                                     +
                                                                        +
Another non-obvious impact upon zonecfg(1m) operation is that since     +
"rootzpool" and "zpool" resources are only configured during zone       +
install and attach, it will not be possible to add, remove, or modify   +
"rootzpool" or "zpool" resources for an installed zone.  Once again,    +
to change these resources the admin will need to detach and re-attach   +
the zone.                                                               +
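
As an illustration of the detach/rename/re-attach sequence described
above (the zone names here are hypothetical, and the zonecfg(1m) and
zoneadm(1m) invocations use existing, unmodified syntax):

    # detach the installed zone, rename it, then re-attach it
    zoneadm -z web01 detach
    zonecfg -z web01 "set zonename=web02"
    zoneadm -z web02 attach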

----------
C.2 Storage object uid/gid handling

One issue faced by all VTs that support shared storage is dealing with
file access permissions of storage objects accessible via NFS.  This
issue doesn't affect device based shared storage, or local files and
vdisks, since these types of storage are always accessible, regardless
of the uid of the accessing process (as long as the accessing process
has the necessary privileges).  But when accessing files and vdisks via
NFS, the accessing process cannot use privileges to circumvent
restrictive file access permissions.  This issue is also complicated by
the fact that by default most NFS servers will map all accesses by the
remote root user to a different uid, usually "nobody" (a process known
as "root squashing").

In order to avoid root squashing, or requiring users to set up special
configurations on their NFS servers, whenever the zones framework
attempts to create a storage object file or vdisk, it will temporarily
change its uid and gid to the "xvm" user and group, and then create the
file with 0600 access permissions.

If the optional storage object "user" configuration parameter is not    |
specified, then whenever the zones framework attempts to access a       |
storage object file or vdisk it will temporarily switch its uid and     |
gid to match the owner and group of the file/vdisk, ensure that the     |
file is readable and writable by its owner (updating the file/vdisk     |
permissions if necessary), and finally set up the file/vdisk for        |
access via a zpool import or lofiadm -a.  This will allow the zones     |
framework to access storage object files/vdisks that were created by    |
any user, regardless of their ownership, simplifying file ownership     |
and management issues for administrators.                               +

If the optional storage object "user" configuration parameter has been  +
specified, then instead of temporarily assuming the uid matching the    +
owner of the file/vdisk, the zones framework will instead assume the    +
uid associated with "user".  All the other subsequent operations        +
listed above regarding the group id, permissions, etc. will remain      +
the same.                                                               +

<... snip ...>

----------
C.5 Lofi and lofiadm(1m) enhancements

Currently, there is no way for a global zone to access the contents of
a vdisk.  Vdisk support was first introduced in VirtualBox.  xVM then
adopted the VirtualBox code for vdisk support.  With both technologies,
the only way to access the contents of a vdisk is to export it to a VM.
To allow zones to use vdisk devices we propose to leverage the code
introduced by xVM by incorporating it into lofi.  This will allow any
solaris system to access the contents of vdisk devices.

The interface changes to lofi to allow for this are fairly
straightforward.  A new '-l' option will be added to the lofiadm(1m)
"-a" device creation mode.  The '-l' option will indicate to lofi that
the new device should have a label associated with it.  Normally lofi
devices are named /dev/lofi/<I> and /dev/rlofi/<I>, where <I> is the
lofi device number.  When a disk device has a label associated with it,
it exports many device nodes with different names.  Therefore lofi will
need to be enhanced to support these new device names, with multiple
nodes per device.
These new names will be:

    /dev/lofi/dsk<I>/p<j>   - block device partitions
    /dev/lofi/dsk<I>/s<j>   - block device slices
    /dev/rlofi/dsk<I>/p<j>  - char device partitions
    /dev/rlofi/dsk<I>/s<j>  - char device slices

Lofi will also be enhanced to support accessing vdisks in addition to   |
normal files.  Accessing a vdisk via lofi will be no different from     |
accessing a disk encapsulated within a file as described above with     |
the -l option.  The lofi '-a' option will be enhanced to allow for      |
the automatic detection of vdisk targets on the lofi command line.      |
Since all vdisks represent actual disks, they all contain               |
partition/label data; hence, if a vdisk is detected on the command      |
line, the '-l' flag is automatically implied.  The vdisk formats that   |
will be supported by lofi are whatever vdisk formats happen to be       |
supported by xVM at the time of integration.  Since the implementation  |
will be shared between lofi and xVM, as new vdisk format support is     |
added to xVM, it should be immediately supportable via lofi as well.    |
In many cases, vdisk formats may provide their own management features  |
such as snapshotting, compression, encryption, etc.  As such, the lofi  |
vdisk support exists purely to access the contents of vdisks.  Hence,   |
vdisk based lofi devices will not support other lofi options such as    |
encryption ('-c') and compression ('-C' / '-U').                        +

The current xVM implementation for accessing vdisks involves two
drivers and a userland utility.  A "frontend" driver runs inside a VM
and exports a normal solaris disk interface.  It takes IO requests to
these disks and transmits them, via a ring buffer, to the "backend"
driver running in the global zone.  The backend driver then maps these
ring requests into a dedicated vdisk process (there is one such process
for every vdisk), and this process translates these ring requests into
access to a vdisk of the requested format.

Given all this existing xVM functionality, the most straightforward way
to support vdisks from within lofi would be to leverage the xVM
implementation.  This will involve re-factoring the existing xVM code,
thereby allowing lofi to utilise the "frontend" code which translates
strategy io requests into ring buffer requests, and also the "backend"
code which exports the ring buffer to userland.  The unchanged xVM
userland vdisk utility can then be used to map ring buffer requests to
the actual vdisk storage.

Currently this utility is only available on x86, but since lofi is a
cross-platform utility, this proposal will require the delivery of this
utility on both sparc and x86.  This utility is currently delivered in
an xVM private directory, /usr/lib/xen/bin/vdisk.  Given that lofi is a
more general and cross-platform utility as compared to xVM, and also
given that we don't expect users to access the vdisk management
utilities directly, we propose to move the vdisk application to
/usr/lib/lofi/bin/vdisk.

For RAS purposes, we will need to ensure that this vdisk utility is
always running.  Hence we will introduce a new lofi smf service,
svc:/system/lofi:default, which will start a new /usr/lib/lofi/lofid
daemon, which will manage the starting, stopping, monitoring, and
possible re-start of the vdisk utility.  Re-starts of the vdisk utility
should be transparent (aside from a short performance hiccup) to any
zones accessing those vdisks.  By default this service will be
disabled.  If a lofi vdisk device is created, this service will be
temporarily enabled.  When the last vdisk based lofi device is
destroyed, this service will disable itself.
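
As a sketch of the expected lifecycle (the service FMRI is taken from
this proposal, the vdisk path is reused from the examples below, and
the exact lofiadm(1m) behavior against vdisk-backed devices is assumed
rather than specified here):

    # adding a vdisk backed lofi device should temporarily enable the service
    lofiadm -a /export/xvm/vm1.vmdk
    svcs -p svc:/system/lofi:default

    # removing the last vdisk backed lofi device should let the service
    # disable itself again
    lofiadm -d /export/xvm/vm1.vmdk
    svcs svc:/system/lofi:default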
XXX: what to do about disk geometry assignment?  sigh.

Here are some examples of how this lofi functionality could be used     |
(outside of the zone framework).                                        |
                                                                        |
If an admin wanted to access a disk embedded within a plain file,       +
then they could run the following command:                              +
    lofiadm -a -l /export/xvm/vm1.disk

If an admin wanted to access a vdisk (which always implicitly has a     |
disk embedded within it), they could run the following command:         |
    lofiadm -a /export/xvm/vm1.vmdk                                     |

In both of the examples above, if there were no previously configured   |
lofi devices on the system, the following new lofi devices would        |
be created:                                                             |
    /dev/lofi/dsk0/p#   - for # == 0 - 4
    /dev/lofi/dsk0/s#   - for # == 0 - 15
    /dev/rlofi/dsk0/p#  - for # == 0 - 4
    /dev/rlofi/dsk0/s#  - for # == 0 - 15

By default, format(1m) will not list these devices in its output.  But
users will be able to treat these devices like regular disks and pass
their names to utilities like fdisk(1m), format(1m), prtvtoc(1m),
fmthard(1m), zpool(1m), etc.
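
As a follow-on sketch (assuming the single lofi device created in the
examples above; the pool name is hypothetical):

    # inspect the label through the character device for the backup slice
    prtvtoc /dev/rlofi/dsk0/s2

    # build a zpool directly on top of one of the lofi slices
    zpool create vm1pool /dev/lofi/dsk0/s0
    zpool status vm1pool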

<... snip ...>