Attached is a description of a project we have been refining for
a while now.  The idea is to improve the integration of zones
with some of the existing resource management features in Solaris.
I would appreciate hearing any suggestions or questions.  I'd
like to submit this proposal to our internal architectural review
process by mid-July.  I have also posted a few slides that give an 
overview of the project.  Those are available on the zones files
page (http://www.opensolaris.org/os/community/zones/files/).

Thanks,
Jerry
 
 
SUMMARY: 

        This project enhances Solaris zones[1], pools[2-4] and resource
        caps[5,6] to improve the integration of zones with resource
        management (RM).  It addresses existing RFEs[7-10] in this area and
        lays the groundwork for simplified, coherent management of the various
        RM features exposed through zones.

        We will integrate some basic pool configuration with zones, implement
        the concept of "temporary pools" that are dynamically created/destroyed
        when a zone boots/halts, and simplify the setting of resource
        controls within zonecfg.  We will enhance rcapd so that it can cap
        a zone's memory while rcapd is running in the global zone.  We will
        also make a few other changes to provide a better overall experience
        when using zones with RM.

        Patch binding is requested for these new interfaces and the stability
        of most of these interfaces is "evolving" (see interface table for
        complete list).

PROBLEM:

        Although zones are fairly easy to configure and install, it appears
        that many customers have difficulty setting up a good RM configuration
        to accompany their zone configuration.  Understanding RM involves many
        new terms and concepts, along with a lot of documentation to read.
        As a result, many customers either do not configure
        RM with their zones, or configure it incorrectly, leading them to be
        disappointed when zones, by themselves, do not provide all of the
        containment that they expect.

        This problem will just get worse in the near future with the
        additional RM features that are coming, such as cpu-caps[11], memory
        sets[12] and swap sets[13].

PROPOSAL:

        There are 7 different enhancements outlined below.

1) "Hard" vs. "Soft" RM configuration within zonecfg

        We will enhance zonecfg(1M) so that the user can configure basic RM
        capabilities in a structured way.

        The various existing and upcoming RM features can be broken down
        into "hard" vs. "soft" partitioning of the system's resources.
        With "hard" partitioning, resources are dedicated to the zone using
        processor sets (psets) and memory sets (msets).  With "soft"
        partitioning, resources are shared, but capped, with an upper limit
        on their use by the zone.

                      |  Hard    |  Soft
               -------+----------+-----------
               cpu    |  psets   |  cpu-caps
               memory |  msets   |  rcapd

        There are also some existing rctls (zone.cpu-shares, zone.max-lwps)
        which will be integrated into this overall concept.

        Within zonecfg we will organize the various RM features into four
        basic zonecfg resources so that it is simple for a user to understand
        and configure the RM features that are to be used with their zone.
        Note that zonecfg "resources" are not the same as "resource
        management".  Within zonecfg, a "resource" is the name of a top-level
        property of the zone (see zonecfg(1M) for more information).

        The four new zonecfg resources are:
                dedicated-cpu
                capped-cpu (future, once cpu-caps are integrated)
                dedicated-memory (future, once memory sets are integrated)
                capped-memory

        Each of these zonecfg resources will have properties that are
        appropriate to the RM capabilities associated with that resource.
        Zonecfg will only allow one instance of each of these resources to be
        configured and it will not allow conflicting resources to be added
        (e.g. dedicated-cpu and capped-cpu are mutually exclusive).

        The mapping of these new zonecfg resources to the primary underlying RM
        feature is:
                dedicated-cpu -> temporary pset
                dedicated-memory -> temporary mset
                capped-cpu -> cpu-cap rctl [11]
                capped-memory -> rcapd running in GZ

        Temporary psets and msets are described below, in section 2.
        Rcapd enhancements for running in the global zone are described below,
        in section 4.

        The valid properties for each of these new zonecfg resources will be:

                dedicated-cpu
                        ncpus (a positive integer or range, default value 1)
                        importance (a positive integer, default value 1)
                        max-lwps (an integer >= 100)
                capped-cpu
                        cpu-cap (a positive integer, default value 100 which
                                 represents 100% of one cpu)
                        max-lwps (an integer >= 100)
                        cpu-shares (a positive integer)
                dedicated-memory
                        TBD - once msets [12] are completed
                capped-memory
                        cap (a positive decimal number with optional k, m, g,
                             or t as a modifier, no modifier defaults to units
                             of megabytes(m), must be at least 1m)
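
        As an illustration of the value formats listed above, here is a
        minimal Python sketch of how 'cap' and 'ncpus' strings might be
        validated.  This is not zonecfg code.  The function names are
        hypothetical, and treating k/m/g/t as powers of 1024 and accepting
        only lowercase modifiers are assumptions that the proposal does not
        spell out.

```python
import re

# Assumption: modifiers are binary multiples; a bare number means megabytes.
_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_cap(text):
    """Return a capped-memory 'cap' value in bytes, or raise ValueError."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmgt]?)", text.strip())
    if match is None:
        raise ValueError(f"invalid cap: {text!r}")
    number, suffix = match.groups()
    size = int(float(number) * _UNITS[suffix or "m"])
    if size < _UNITS["m"]:
        raise ValueError("cap must be at least 1m")
    return size


def parse_ncpus(text):
    """Return (low, high) for an 'ncpus' value such as '2' or '2-4'."""
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"invalid ncpus: {text!r}")
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    if low < 1 or high < low:
        raise ValueError("ncpus must be a positive integer or range")
    return low, high
```

        With this sketch, parse_cap("512") and parse_cap("512m") give the
        same result, and parse_ncpus("2-4") yields the pair (2, 4).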

        Some of these properties actually correspond to rctls.  See section 3
        below for a description of how this will work.

        Zonecfg will also be enhanced to check for invalid combinations.
        This means it will disallow a dedicated-cpu resource and the
        zone.cpu-shares rctl being defined at the same time.  It also means
        that explicitly specifying a pool name via the 'pool' resource, along
        with either a 'dedicated-cpu' or 'dedicated-memory' resource is an
        invalid combination.
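
        The combination rules above could be modeled roughly as follows.
        This is a Python sketch only, not the zonecfg implementation.  The
        function name, its arguments and the error strings are hypothetical,
        and the conflict list contains only the pairs named in this section.

```python
# The four new zonecfg resources; at most one instance of each is allowed.
NEW_RESOURCES = {"dedicated-cpu", "capped-cpu",
                 "dedicated-memory", "capped-memory"}

# Pairs that this section describes as invalid together.
CONFLICTS = [
    ("dedicated-cpu", "capped-cpu"),
    ("dedicated-cpu", "zone.cpu-shares"),
    ("pool", "dedicated-cpu"),
    ("pool", "dedicated-memory"),
]


def check_config(resources, rctls=(), pool=None):
    """Return a list of error strings for a proposed zone configuration.

    resources: resource names in the order they were added
    rctls:     names of rctls that were set directly
    pool:      explicit pool name, or None if no 'pool' is configured
    """
    errors = []
    for name in sorted(NEW_RESOURCES):
        if resources.count(name) > 1:
            errors.append(f"only one {name} resource is allowed")
    present = set(resources) | set(rctls)
    if pool is not None:
        present.add("pool")
    for first, second in CONFLICTS:
        if first in present and second in present:
            errors.append(f"{first} conflicts with {second}")
    return errors
```

        An empty list means the configuration passed every check modeled
        here.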

        These new zonecfg resource names (dedicated-cpu, capped-cpu,
        dedicated-memory & capped-memory) were chosen so that their objective
        is reasonably clear, even though they do not exactly align
        with our existing underlying (and inconsistent) RM naming schemes.

2) Temporary Pools.

        We will implement the concept of "temporary pools" within the pools
        framework.

        To improve the integration of zones and pools we are allowing the
        configuration of some basic pool attributes within zonecfg, as
        described above in section 1.  However, we do not want to extend
        zonecfg to completely and directly manage standard pool configurations.
        That would lead to confusion and inconsistency regarding which tool to
        use and where configuration data is stored.  Temporary pools sidestep
        this problem and allow zones to dynamically create a simple pool/pset
        configuration for the basic case where a sysadmin just wants a
        specified number of processors dedicated to the zone (and eventually a
        dedicated amount of memory).

        We believe that the ability to simply specify a fixed number of cpus
        (and eventually a mset size) meets the needs of a large percentage of
        zones users who need "hard" partitioning (e.g. to meet licensing
        restrictions).

        If a dedicated-cpu (or eventually a dedicated-memory) resource is
        configured for the zone, then when the zone boots zoneadmd will create
        a temporary pool dedicated to the zone's use.  Zoneadmd will
        dynamically create a pool & pset (or eventually a mset) and assign the
        number of cpus specified in zonecfg to that pset.  The temporary pool
        & pset will be named 'SUNWzone{zoneid}'.

        Zoneadmd will set the 'pset.min' and 'pset.max' pset properties, as
        well as the 'pool.importance' pool property, based on the values
        specified for dedicated-cpu's 'ncpus' and 'importance' properties
        in zonecfg.

        If the cpu (or memory) resources needed to create the temporary pool
        are unavailable, zoneadmd will issue an error and the zone won't boot.

        When the zone is halted, the temporary pool & pset will be destroyed.
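
        The boot/halt lifecycle described above can be pictured with a toy
        Python model.  It does not call libpool, and the class and method
        names are hypothetical.  Granting exactly the minimum of the 'ncpus'
        range at boot is an assumption made for this sketch; the proposal
        only says that 'pset.min' and 'pset.max' are set from 'ncpus'.

```python
class PoolError(Exception):
    """Raised when a temporary pool cannot be created."""


class System:
    """Toy model of free cpus and the temporary pools carved out of them."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.pools = {}

    def boot_zone(self, zoneid, ncpus_min, ncpus_max, importance=1):
        """Create the temporary pool/pset for a zone, or fail the boot."""
        if self.free_cpus < ncpus_min:
            raise PoolError("not enough cpus; zone will not boot")
        name = f"SUNWzone{zoneid}"
        self.pools[name] = {
            "pool.importance": importance,
            "pset.min": ncpus_min,
            "pset.max": ncpus_max,
            "cpus": ncpus_min,  # assumption: start at the minimum
        }
        self.free_cpus -= ncpus_min
        return name

    def halt_zone(self, zoneid):
        """Destroy the zone's temporary pool and release its cpus."""
        pool = self.pools.pop(f"SUNWzone{zoneid}")
        self.free_cpus += pool["cpus"]
```

        For example, on a model system with 4 free cpus, booting zone 3 with
        ncpus=2-4 creates 'SUNWzone3' and leaves 2 cpus free.  Halting the
        zone removes the pool and returns the cpus.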

        We will add a new boolean property ('temporary') that can exist on
        pools and any resource set.  The 'temporary' property indicates that
        the pool or resource set should never be committed to a static
        configuration (e.g. pooladm -s) and that it should never be destroyed
        when updating the dynamic configuration from a static configuration
        (e.g. pooladm -c).  These temporary pools/resources can only be managed
        in the dynamic configuration.  These changes will be implemented within
        libpool(3LIB).
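
        The intended effect of the 'temporary' property on the two pooladm
        operations mentioned above can be sketched in Python.  Here a
        configuration is modeled as a plain dictionary of pool name to
        properties, which is a simplification made for illustration.  The
        function names are hypothetical and are not libpool interfaces.

```python
def save_static(dynamic):
    """Model of 'pooladm -s': temporary entries are never written out."""
    return {name: dict(props)
            for name, props in dynamic.items()
            if not props.get("temporary", False)}


def commit_static(static, dynamic):
    """Model of 'pooladm -c': load the static configuration, but keep
    temporary entries that already exist in the dynamic configuration."""
    result = {name: dict(props) for name, props in static.items()}
    for name, props in dynamic.items():
        if props.get("temporary", False):
            result[name] = dict(props)
    return result
```

        In this model a non-temporary pool that is missing from the static
        configuration disappears on commit, while a temporary pool such as
        the one created for a running zone survives.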

        It is our expectation that most users will never need to manage
        temporary pools through the existing poolcfg(1M) commands.  For users
        who need more sophisticated pool configuration and management, the
        existing 'pool' resource within zonecfg should be used and users
        should manually create a permanent pool using the existing mechanisms.

3) Resource controls in zonecfg will be simplified.

        Within zonecfg the existing rctls (zone.cpu-shares and zone.max-lwps)
        take a 3-tuple value where only a single component usually has any
        meaning (the 'limit').  The other two components of the value (the
        'priv' and 'action') are not normally changed but users can be confused
        if they don't understand what the other components mean or what values
        can be specified.

        Here is a zonecfg example:
                > add rctl
                rctl> set name=zone.cpu-shares
                rctl> add value (priv=privileged,limit=5,action=none)
                rctl> end

        Within zonecfg we will introduce the idea of rctl aliases.  The alias
        is a simplified name and template for the existing rctls.  Behind the
        scenes we continue to store the data using the existing rctl entries
        in the XML file.  Thus, the alias always refers to the same underlying
        piece of data as the full rctl.

        The purpose of the rctl alias is to provide a simplified name and
        mechanism to set the rctl 'limit'.  For each rctl/alias pair we will
        "know" the expected values for the 'priv' and 'action' components of
        the rctl value.  If an rctl is already defined that does not match this
        "knowledge" (e.g. it has a non-standard 'action' or there are multiple
        values defined for the rctl), then the user will not be allowed to use
        an alias for that rctl.

        Here are the aliases we will define for the rctls:
                alias           rctl
                -----           ----
                max-lwps        zone.max-lwps
                cpu-shares      zone.cpu-shares
                cpu-cap         zone.cpu-cap (future, once cpu-caps integrate)

        Here is an example of the max-lwps alias used as a property within the
        new 'dedicated-cpu' resource:

                > add dedicated-cpu
                dedicated-cpu> set ncpus=2-4
                dedicated-cpu> set max-lwps=500
                dedicated-cpu> end
                > info
                ...
                dedicated-cpu:
                        ncpus: 2-4
                        max-lwps: 500
                rctl:
                        name: zone.max-lwps
                        value: (priv=privileged,limit=500,action=deny)

        In the example, you can see the use of the alias when adding the
        'dedicated-cpu' resource and you can also see the full rctl output
        within the 'info' command.  If the 'max-lwps' property had not been set
        within the 'dedicated-cpu' resource, then the corresponding rctl would
        not be defined.

        If you update the rctl value through the 'rctl' resource within
        zonecfg, then the corresponding value within the 'dedicated-cpu'
        resource would also be updated since both the rctl and its alias refer
        to the same piece of data.

        If an rctl is already defined that does not match the expected value
        (e.g. it has 'action=none' or multiple values), then the 'max-lwps'
        alias will be disabled.  An attempt to set 'max-lwps' within
        'dedicated-cpu' will print the following error:
                "One or more incompatible rctls already exist for this
                 property"
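
        The alias behavior described in this section can be summarized with
        a small Python sketch.  It is an illustration, not zonecfg code.
        The function names and the data layout are hypothetical, and the
        expected 'priv' and 'action' values are taken from the two zonecfg
        examples in this section rather than from a definitive table.

```python
# alias -> (rctl name, expected priv, expected action)
ALIASES = {
    "max-lwps": ("zone.max-lwps", "privileged", "deny"),
    "cpu-shares": ("zone.cpu-shares", "privileged", "none"),
}


class IncompatibleRctl(Exception):
    """Raised when an existing rctl does not fit the alias template."""


def _compatible(values, priv, action):
    """True if there is at most one value and it matches the template."""
    if len(values) > 1:
        return False
    return all(v["priv"] == priv and v["action"] == action for v in values)


def set_alias(rctls, alias, limit):
    """Set an rctl limit through its alias.

    rctls maps an rctl name to its list of value dictionaries, so the
    alias and the full rctl always share one piece of data.
    """
    name, priv, action = ALIASES[alias]
    if not _compatible(rctls.get(name, []), priv, action):
        raise IncompatibleRctl(
            "One or more incompatible rctls already exist for this property")
    rctls[name] = [{"priv": priv, "limit": limit, "action": action}]


def get_alias(rctls, alias):
    """Return the limit seen through the alias, or None if unavailable."""
    name, priv, action = ALIASES[alias]
    values = rctls.get(name, [])
    if len(values) != 1 or not _compatible(values, priv, action):
        return None
    return values[0]["limit"]
```

        Because both views read the same entry, changing the limit on the
        full rctl is immediately visible through get_alias().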

        This rctl alias enhancement is fully backward compatible with the
        existing rctl syntax.  That is, zonecfg output will continue to display
        rctl settings in the current format (in addition to the new aliased
        format) and zonecfg will continue to accept the existing input syntax
        for setting rctls.  This ensures full backward compatibility for any
        existing tools/scripts that parse zonecfg output or configure zones.

4) Enable rcapd to limit zone memory while running in the global zone

        Currently, to use rcapd(1M) to limit zone memory consumption, the
        rcapd process must be run within the zone.  This exposes a loophole
        since the zone administrator, who might be untrusted, can change the
        rcapd limit.

        We will enhance rcapd so that it can limit a zone's memory consumption
        while it is running in the global zone.  This closes the rcapd
        loophole and allows the global zone administrator to set memory
        caps that can be enforced by a single, trusted process.

        The rcapd limit for a zone will be configured using the new
        'capped-memory' resource and 'cap' property within zonecfg.
        When a zone with 'capped-memory' boots, zoneadmd will automatically
        start rcapd in the global zone, if necessary.  The interfaces to
        communicate memory cap information between zoneadmd and rcapd
        are project private.

        As part of this overall project, we will be enhancing the internal
        rcapd rss accounting so that rcapd will have a more accurate
        measurement of the overall rss for each zone.

5) Use FSS when zone.cpu-shares is set 

        Although the zone.cpu-shares rctl can be set on a zone, the Fair Share
        Scheduler (FSS) is not the default scheduling class so this rctl
        frequently has no effect, unless the user also sets FSS as the
        default scheduler or changes the zone's processes to use FSS with the
        priocntl(1M) command.  This means that users can easily think
        they have configured their zone for a behavior that they are not
        actually getting.

        We will enhance zoneadmd so that if the zone.cpu-shares rctl is set
        and FSS is not already the default scheduling class, zoneadmd will set
        the scheduling class to be FSS for processes in the zone.

6) Add RM templates for zone creation

        Zonecfg already supports templates on the 'create' subcommand using
        the '-t' option.  We will update the documentation, which currently
        states that a template must be the name of an existing zone, even
        though we already deliver two templates (SUNWblank and SUNWdefault).

        We will deliver at least four new templates that configure
        reasonable default properties for the four new resources in zonecfg: 
                fully dedicated:                dedicated-cpu & dedicated-memory
                cpu dedicated, memory capped:   dedicated-cpu & capped-memory
                cpu capped, memory dedicated:   capped-cpu & dedicated-memory
                cpu & memory capped:            capped-cpu & capped-memory

        We may also deliver other templates that only pre-configure one of
        the new resources (e.g. only configures dedicated-cpu and leaves
        memory with the default handling).

        We will enhance the 'create' help command to briefly describe the
        templates and why you would use one vs. another.

        The names of all new templates will begin with SUNW.  This namespace
        was already reserved by [1].

        This zonecfg change will primarily impact the documentation.

7) Pools system objective defaults to weighted-load (wt-load)[4]

        Currently pools are delivered with no objective set.  This means that
        if you enable the poold(1M) service, nothing will actually happen on
        your system.

        As part of this project, we will set weighted load
        (system.poold.objectives=wt-load) to be the default objective.
        Delivering this objective as the default does not impact systems out
        of the box since poold is disabled by default.

EXPORTED INTERFACES

        New zonecfg resource names
                dedicated-cpu           Evolving
                capped-cpu              Evolving
                dedicated-memory        Evolving
                capped-memory           Evolving

                The capped-cpu and dedicated-memory resource names are
                being reserved now in anticipation of the future integration
                of the cpu-caps and memory sets projects.  However, we do
                not want to make this project dependent on [11] & [12].

        New zonecfg property names
                ncpus                   Evolving
                importance              Evolving
                cap                     Evolving

        New zonecfg rctl alias names
                max-lwps                Evolving
                cpu-shares              Evolving
                cpu-cap                 Evolving

                The cpu-cap rctl alias is being reserved now in anticipation
                of the future integration of the cpu-caps project.  However,
                we do not want to make this project dependent on [11].

        Temporary pool & resource names
                SUNWzone{id}            Stable

        New temporary pool & resource boolean properties
                'pool.temporary'        Evolving
                'pset.temporary'        Evolving
                '*.temporary'           Evolving
                (for future resources such as mset)

        rcapd/zoneadmd interface
                zone_getattr            Project Private

        wt-load as default              Evolving

IMPORTED INTERFACES

        libpool(3LIB)                   Unstable

REFERENCES

1. PSARC 2002/174 Virtualization and Namespace Isolation in Solaris
2. PSARC 2000/136 Administrative support for processor sets and extensions
3. PSARC 1999/119 Tasks, Sessions, Projects and Accounting
4. PSARC 2002/287 Dynamic Resource Pools
5. PSARC 2002/519 rcapd(1MSRM): resource capping daemon
6. PSARC 2003/155 rcapd(1M) sedimentation
7. 6421202 RFE: simplify and improve zones/pool integration
        http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6421202
8. 6222025 RFE: simplify rctl syntax and improve cpu-shares/FSS interaction
        http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6222025
9. 5026227 RFE: ability to rcap zones from global zone
        http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5026227
10. 6409152 RFE: template support for better RM integration
        http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6409152
11. PSARC 2004/402 CPU Caps
12. PSARC 2000/350 Physical Memory Control 
13. PSARC 2002/181 Swap Sets