Enclosed is a first draft of a spec. for the S10
brand which we plan to submit for a PSARC
inception review.  Please send us any comments
or questions.

Thanks,
Jerry

---

           S10C: A Solaris 10 Branded Zone for Solaris.Next

                     Gerald Jelinek, Jordan Vaughan
                   Solaris Virtualization Technologies


[A note on terminology: This document uses the terms "Solaris 10" and
 "Solaris.Next" very frequently.  As such, the abbreviations "S10" and
  "S.next" respectively are used interchangeably with the longer forms.
  The term "virtualization" is abbreviated as V12N.]


Part 1: Introduction
____________________________

Each new minor release of Solaris brings with it the well known problems
of slow user adoption, slow ISV support and concerns about compatibility.
The compatibility concerns will be more pronounced with the release of
S.next since it's anticipated that there will be greater than normal
user-visible changes (e.g. the packaging system).

Fortunately, since the last minor release of Solaris (Solaris 10), V12N
techniques have become widespread and V12N can be used as a solution to
ease the transition to the new version of Solaris.  Zones[1] combined
with a brand[2] are particularly well suited for this task since the host
system is actually running S.next, whereas this is not necessarily the
case with other V12N solutions.  In addition, zones are usable on any
system which runs S.next, which is also not the case with other V12N
alternatives.

We already have a proven track record delivering this sort of
zones/brand based solution to enable running earlier versions of Solaris
on S10 [3, 4], so in one sense this case breaks little new ground.
However, the earlier 'solaris8' and 'solaris9' brands were used to host
releases that are very static compared to S10.
In addition, S.next can be expected to continue to change rapidly for
the foreseeable future.  Given this, a 'solaris10' brand for S.next poses
additional challenges for projects on both the S10 and S.next sides of
the system.  Many of these challenges are outside of the scope of an
architectural review and include developer education, testing and
procedural changes.  However, the existence of this brand could
potentially impact future projects in various ways and at a minimum will
require ARC consideration for future reviews.  The existence of this
brand can be seen as a potential "tax" on all projects which work on both
sides of the user/kernel boundary for both S10 and S.next.

The benefits of the brand are as follows:

  For customers:
    - Provides a solution to cope with compatibility differences between
      S10 and S.next
    - Protects investment in S10 infrastructure, training, and internal
      support
    - Minimizes the cost of consolidating Solaris 10 systems
    - Enables deployment of new technologies in S.next (e.g., crossbow)
      while still running applications on S10, thereby limiting risk to
      production environment
    - Avoids or delays required application recertification

  For Sun:
    - S.next is adopted sooner
    - Provides a Solaris compatibility environment for S.next
    - Sun is a solution provider easing the burden of getting to S.next
    - Provides a cross-platform virtualization solution for S.next across
      all hardware (it is the only V12N solution on M-Series)

This has been identified as a required feature for S.next.

=== Project Overview ===

As with the earlier 'solaris8' and 'solaris9' brands, this project
delivers the following:

   - A Branded Container which emulates Solaris 10's user environment,
     based on the BrandZ infrastructure provided with zones.
     This brand is called 'solaris10'.  Only Solaris 10u8 and
     beyond will be supported and tested in the zone.

   - A mechanism for archiving existing Solaris 10 systems and for
     redeploying those archives into the branded zone. This
     process is referred to as p2v and uses the same techniques
     as the 'solaris8' and 'solaris9' brands.

In addition, the following capabilities will be provided beyond those
of the 'solaris8' and 'solaris9' brands.

   - This brand will be supported on all hardware architectures
     that run S.next (sun4v, sun4u and x86).  The specific platforms,
     particularly sun4u, will be the same as are certified for S.next.

   - A "virtual to virtual" or v2v mechanism for archiving existing
     Solaris 10 native zones and for redeploying those archives into
     the branded zone on S.next will be provided.  The process will be
     very similar to the existing zone migration [5] feature except that
     the zone's brand will be changed as part of the process.  In
     addition, if the zone is sparse on S10 it must be converted to
     a whole-root zone during the migration.

Part 2: solaris10 Brand
____________________________

The solaris10 brand is conceptually similar to the existing solaris8
and solaris9 brands and builds directly on the BrandZ infrastructure
that was created to support the lx brand.  Familiarity with BrandZ
and the solaris8 and solaris9 brands is assumed.

At this time the brand supports only the shared stack [6] networking
model, in which the zone's network is managed by the global zone.
The exclusive stack model
is anticipated to require more complex solutions or emulation due
to the introduction of Crossbow [7] into S.next.  The exclusive
stack issues will be resolved before commitment review.

The ZFS ioctls have been audited and no issues have been seen.  Because
so much of ZFS has been backported to S10 updates earlier than the first
S10 version being supported in the brand (S10u8), ZFS delegated datasets
appear to work fine.  Further testing needs to be done and future
ZFS enhancements might require work at some point.

=== System Call Emulation ===

This section details the system call emulation provided by the current
solaris10 brand module.

The following system calls are currently being emulated.

        SYS_exec                11
        SYS_ioctl               54
        SYS_execve              59
        SYS_acctctl             71
        SYS_getpagesizes        73
        SYS_issetugid           75
        SYS_uname               135
        SYS_pwrite              174
        SYS_sigqueue            190
        SYS_pwrite64            223
        SYS_zone                227

    SYS_exec
    SYS_execve
        The emulator interposes on these system calls to provide a
        convenient mechanism for branded processes to be able to spawn
        native processes.

    SYS_ioctl
        Emulate process contract ioctls for init(1M) because the
        ioctl parameter structure changed between S10 and Nevada.

    SYS_acctctl
        The mode shift, mode mask and option mask for acctctl changed for
        crossbow.

    SYS_getpagesizes
        New first arg "legacy" must be set to 1.

    SYS_issetugid
        S10's issetugid() syscall is now a subcode to privsys().

    SYS_uname
        The emulator simply passes this through, then modifies the result
        upon return, so that the system call returns 5.10 for the 'release'
        field and 'Generic_Virtual' for the 'version' field.

    SYS_pwrite
    SYS_pwrite64
        pwrite's behavior differs between S10 and Nevada when applied to
        files opened with O_APPEND.  In S10 the offset argument is ignored
        and the buffer is appended to the target file, whereas in Nevada
        the offset argument is honored (i.e., pwrite() acts as though
        the target file wasn't opened with O_APPEND).  This is a result of
        the fix for:
           6655660 pwrite() must ignore the O_APPEND/FAPPEND flag.
        Emulate the old S10 pwrite() behavior by checking whether the target
        file was opened with O_APPEND.  If it was, then invoke the write()
        system call instead of pwrite(); otherwise, invoke the pwrite()
        system call as usual.

    SYS_sigqueue
        New last arg "block" flag should be zero.  The block flag is used
        by the OpenSolaris AIO implementation, which is now part of libc.

    SYS_zone
        See discussion below.
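
The uname and pwrite emulations described above can be sketched in
portable C as follows.  This is a minimal illustration only, not the
brand module's code: the function names s10_uname() and s10_pwrite()
are hypothetical, and the sketch calls the ordinary libc wrappers
rather than interposing on the system call traps as the brand does.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <unistd.h>

/*
 * Hypothetical helper: pass uname() through, then overwrite the
 * 'release' and 'version' fields with the values named in the spec.
 */
int
s10_uname(struct utsname *un)
{
	int rv;

	assert(un != NULL);
	rv = uname(un);
	if (rv < 0)
		return (rv);
	(void) snprintf(un->release, sizeof (un->release), "%s", "5.10");
	(void) snprintf(un->version, sizeof (un->version), "%s",
	    "Generic_Virtual");
	return (rv);
}

/*
 * Hypothetical helper: restore the old S10 pwrite() semantics.  If the
 * descriptor was opened with O_APPEND, ignore the offset and append via
 * write(); otherwise call pwrite() as usual.
 */
ssize_t
s10_pwrite(int fd, const void *buf, size_t nbytes, off_t off)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return (-1);
	if (flags & O_APPEND)
		return (write(fd, buf, nbytes));
	return (pwrite(fd, buf, nbytes, off));
}
```

Checking the open flags with fcntl(F_GETFL) is one way to learn whether
the file was opened with O_APPEND from user level; the actual emulation
may obtain this information differently.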

=== zone(2) support ===

Zones have been part of S10 since its FCS, so in general S10 is
already zone-aware and does the right thing in most cases.  Commands
that are zone-aware will continue to work as they do today in
S10 native zones.  One set of commands which does require emulation
is the S10 SVr4 packaging and patch commands.  Those commands are
zone-aware and in some cases will check if they are running in the
global zone and refuse to function if not.  If running in the global
zone they will also attempt to look for other zones to operate on.

The brand emulation interposes on the zone syscall and selectively
provides emulation when the running command is one of the SVr4
package or patch commands.  In these cases the emulation indicates
that it is the global zone (zoneid 0) and various zone attributes,
such as the zone brand itself, are emulated.  In all other cases
the syscall is passed through so that the other S10 commands continue
to behave as they do currently.  Because the solaris10 branded zones
are whole-root zones, all packaging and patch operations will
be successful, although the kernel components of the package or
patch are not used.  This is exactly the same behavior as on the
solaris8 and solaris9 branded zones.
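
The selective nature of this interposition can be illustrated with a
short sketch.  Everything here is an assumption made for illustration:
this document does not enumerate the commands that receive emulation
or say how the running command is identified, so the command list, the
basename match and the function names below are all hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumed command list; the spec does not enumerate these. */
static const char *const s10_pkg_cmds[] = {
	"pkgadd", "pkgrm", "patchadd", "patchrm"
};

#define	EMULATED_ZONEID	0	/* global zone id reported when emulating */

/* Return 1 if zone(2) calls from this executable should be emulated. */
int
s10_zone_should_emulate(const char *execpath)
{
	const char *base;
	size_t i;

	assert(execpath != NULL);
	base = strrchr(execpath, '/');
	base = (base != NULL) ? base + 1 : execpath;
	for (i = 0; i < sizeof (s10_pkg_cmds) / sizeof (s10_pkg_cmds[0]);
	    i++) {
		if (strcmp(base, s10_pkg_cmds[i]) == 0)
			return (1);
	}
	return (0);
}

/* The zoneid the caller sees: 0 when emulating, else the real id. */
int
s10_zone_reported_id(const char *execpath, int real_zoneid)
{
	return (s10_zone_should_emulate(execpath) ?
	    EMULATED_ZONEID : real_zoneid);
}
```

All other callers fall through to the real zoneid, which matches the
pass-through behavior described above.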

One further consideration for zones is related to the p2v process.
During p2v there may be zones on the original physical system.
Since zones do not nest, p2v-ing these systems means that the zones
themselves are not usable inside the branded zone.  This is detected
when the zone is installed and a warning is issued indicating that
any nested zones will not be usable and that the disk space could be
recovered.  Those zones can be migrated ahead of time using the v2v
feature described below.  In addition, a future project is planned
which will assess a system prior to p2v and report any possible issues
that may arise.  Detecting zones would be part of that report.

=== solaris10 Brand: What's Not Emulated ===

This project does not make any changes to existing native zone
limitations.  One point to note is that TX will continue to
be incompatible with branded zones.  Customers using TX on S10
systems will need to transition to a certified, native S.next TX
solution.  Discussions with the TX team indicate that this is
the normal behavior for users of TX, since the base OS itself must
be certified for TX.

=== Versioning ===

Because of the potential issues with compatibility of various releases
of S10 hosted on differing releases of S.next, a basic versioning
system is incorporated into the brand.  This versioning system works
both ways.  That is, the brand emulation can check which version of
S10 is being hosted in the zone and adjust the emulation accordingly.
Likewise, future S10 updates which require specific emulation can
indicate that a specific version of the emulation is required.  If
necessary, they can also check if they are running in a branded zone
and, if so, determine what version of emulation is available.  The
initial release of the software won't need this versioning mechanism,
but it is being included to cope with possible future enhancements to
either S10 or S.next.

If a change is made to S10 which requires an enhancement to the brand
emulation library, it is expected that this change would be delivered
in a S10 KU patch which provides components on both sides of the
user/kernel boundary.  When the branded zone boots, the brand boot hook
determines the minimal version of the KU that is installed in the zone
to verify that the zone's release is supported (i.e. currently the
minimal KU will be the one from S10u8).  It then makes the associated
version (i.e. version 0 of the emulation) available as an attribute on
the zone.  The brand library can then use this information to provide
conditional emulation if needed.  Future projects that enhance the
emulation for new features in S10 can add a check for a different KU
version number which would then provide associated versions (e.g. 1, 2,
etc.) to the brand library.

If the KU version is not sufficient, future S10 projects may need to
design some other version check for the brand to enable it to properly
detect the S10 changes.  The ability to detect the KU version is
already covered by the contract on the zone "update on attach"
feature [8].

The situation is more complicated for future changes within the S10
code base which will require associated enhancements to the brand
emulation.  There are two mechanisms being proposed.

The first mechanism is that the future version of S10 can specify
that it requires a minimal version of the brand emulation.  It does
this by delivering a version number into the
'/usr/lib/brand/solaris10/version' file on S10.  When this future
version of S10 is p2v-ed into a solaris10 branded zone, the
solaris10 brand will check for the presence of this file and if it
exists, the brand will verify that the brand's version is greater than
or equal to the version specified in the S10 file.  If not, then an
error will be emitted and the zone p2v will fail, leaving the zone in
the configured state.

If the '/usr/lib/brand/solaris10/version' file is missing on S10, that
indicates that the version of S10 is still compatible with the initial
release of the solaris10 brand emulation.  The first time a project is
backported to S10 which requires an enhancement to the emulation, this
file must be created and the version number in the file will be bumped.
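
The check described by this first mechanism might look roughly like the
sketch below.  It rests on assumptions not stated in this document: the
file is taken to hold a single decimal integer, an unreadable or
malformed file is treated as a failure, and the function name
s10_check_version_file() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical check run during p2v.  'path' is the version file inside
 * the zone's root; 'brand_version' is the emulation version provided by
 * the installed solaris10 brand.
 *
 * Returns 0 if the install may proceed, -1 if it must fail.
 */
int
s10_check_version_file(const char *path, int brand_version)
{
	FILE *fp;
	int required;
	int parsed;

	assert(path != NULL);
	if ((fp = fopen(path, "r")) == NULL) {
		/* A missing file means the initial emulation suffices. */
		return (errno == ENOENT ? 0 : -1);
	}
	/* Assumption: the file contains one decimal version number. */
	parsed = (fscanf(fp, "%d", &required) == 1);
	(void) fclose(fp);
	if (!parsed)
		return (-1);
	return (brand_version >= required ? 0 : -1);
}
```

On a -1 return the caller would emit the error and leave the zone in
the configured state, as described above.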

This first mechanism is useful if a future S10 update is fundamentally
incompatible with an older version of the S.next brand emulation.

The second mechanism allows projects that have been backported to S10
to actually be brand aware.  A new zone attribute will be available
indicating which version of the brand emulation is currently installed
on the system.  For these future S10 updates, if they deliver a new
feature which requires changes to the brand library, that S10 feature
can also determine whether it is running in a branded zone and, if so,
whether the necessary emulation is available.  If the newer S10 update is
running in a zone on an older version of S.next which does not provide the
required emulation, the S10 feature can adjust its behavior in the
appropriate manner.

The existing getzoneid() and zone_getattr(ZONE_ATTR_BRAND) functions can
be used by S10 code to determine if it is running in a non-global zone
and if that zone is a 'solaris10' branded zone.  A new solaris10
brand-specific zone attribute, S10_EMUL_VERSION_NUM, is defined.  The
S10 feature can use the zone_getattr(S10_EMUL_VERSION_NUM) function to
determine if the brand emulation supports the feature.  The getzoneid()
and zone_getattr() functions are already used throughout the ON
consolidation for code that is zone-aware.  These functions will continue
to be consolidation private.
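
The decision a brand-aware S10 feature makes with these values can be
sketched as follows.  To keep the example self-contained, the results
of getzoneid() and zone_getattr() are passed in through a plain
structure rather than queried; the structure, the enum and the function
name are hypothetical and are not interfaces proposed by this case.

```c
#include <assert.h>
#include <string.h>

#define	GLOBAL_ZONEID	0

/* Hypothetical snapshot of what the zone queries would report. */
typedef struct zone_env {
	int		ze_zoneid;	 /* as from getzoneid() */
	const char	*ze_brand;	 /* as from ZONE_ATTR_BRAND */
	int		ze_emul_version; /* as from S10_EMUL_VERSION_NUM */
} zone_env_t;

typedef enum {
	FEATURE_NATIVE,		/* not in a solaris10 branded zone */
	FEATURE_EMULATED,	/* branded zone, emulation is new enough */
	FEATURE_FALLBACK	/* branded zone, emulation is too old */
} feature_mode_t;

/* Decide how a backported feature should behave in this environment. */
feature_mode_t
s10_feature_mode(const zone_env_t *ze, int required_version)
{
	assert(ze != NULL && ze->ze_brand != NULL);
	if (ze->ze_zoneid == GLOBAL_ZONEID ||
	    strcmp(ze->ze_brand, "solaris10") != 0)
		return (FEATURE_NATIVE);
	if (ze->ze_emul_version >= required_version)
		return (FEATURE_EMULATED);
	return (FEATURE_FALLBACK);
}
```

In the FEATURE_FALLBACK case the feature would adjust its behavior as
appropriate, for example by disabling itself or reverting to the
older code path.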

Engineers backporting features to a future S10 update will need to
first determine if that feature requires enhancements to the solaris10
brand library.  If so, they will then have to enhance the emulation
in S.next and bump the emulation version number.  They can then either
bump the minimal emulation version number in the
/usr/lib/brand/solaris10/version file on S10 during the S10 backport or
they can add the appropriate checks to the backported S10 code so that
it can determine if the support is available in the brand library and
change behavior accordingly.

This obviously adds a great deal of complexity to projects backporting
features to future S10 updates if those features require emulation to
function correctly in the branded zone.  Ideally, projects requiring
such enhancements to the brand emulation will not be backported.
Perhaps the presence of the S10 brand on S.next may discourage projects
from backporting since the brand provides S10 compatibility on S.next.
Future projects which cross the user/kernel boundary and which request
patch binding should be reviewed by the ARCs to determine if those
projects must take the solaris10 brand into account.

In addition to the above, any changes integrating into S.next which might
impact the solaris10 brand will need to test the supported versions
of S10 in the branded zone and make any needed changes to the solaris10
emulation.


Part 3: Archiving, Installation, p2v & v2v
____________________________

The p2v process for the solaris10 brand is the same as for
the solaris8, solaris9 and native [9] brands.  A contract will
be included with this case for the flar command to explicitly
call out the use of flash archives for migrating system images
into zones.

The v2v process for migrating S10 native zones to solaris10 branded
zones will support the same archive formats as p2v.  This process will
use the 'zoneadm attach' subcommand since that's the existing
interface for migrating [5] zones from one system to another.  The
solaris10 brand attach subcommand will be extended to accept the
following options which correspond to the same options in the install
subcommand.

    -a {path} - specifies a path to an archive to unpack into the zone
    -d {path} - specifies a path to a tree of files as the source for the
                installation.

One issue with v2v of a S10 zone is that those zones can be sparse
but the solaris10 branded zone must be whole root.  The current plan
is that the zone must be readied on the source system.  This will
mount any inherited-pkg-dirs and an archive can then be made of the
readied zone.

The p2v conversion during the installation of the zone will again
be similar to the native p2v process [9].

=== Interface Table ===

  The solaris10 brand seeks minor release binding.

    Exported Interfaces                     Stability
  ----------------------------------------------------------------------
    "solaris10" brand name                  Committed
    "SUNWsolaris10" brand template name     Committed

    For the solaris10 brand
        brand-specific install and
        attach subcommand options           Committed
        documented in this case

    /usr/lib/brand/solaris10 directory      Committed

    SUNWs10brandr, SUNWs10brandu packages   Committed

    /usr/lib/brand/solaris10/version        Committed

    getzoneid(), zone_getattr(),
    ZONE_ATTR_BRAND and
    S10_EMUL_VERSION_NUM attributes         Consolidation Private


    Imported Interfaces                     Stability
  ----------------------------------------------------------------------
    brandz[2]                               Project Private

    Nevada syscall traps documented above   Consolidation Private

    flar(1M)                                Evolving
                                            Contract included with this case


REFERENCES

1. PSARC/2002/174 Virtualization and Namespace Isolation in Solaris
2. PSARC/2005/471 BrandZ: Support for non-native zones
3. PSARC/2007/350 Etude: Migration Technology
4. PSARC/2008/125 Etude Part Deux
5. PSARC/2006/030 Zone migration
6. PSARC/2006/366 Stack instances: Exclusive IP stack per zone
7. PSARC/2006/357 Crossbow - Network Virtualization and Resource Management
8. PSARC/2007/621 zone update on attach
9. PSARC/2008/766 native zones p2v
_______________________________________________
zones-discuss mailing list
zones-discuss@opensolaris.org