Attached is a description of a project we have been refining for
a while now. The idea is to improve the integration of zones
with some of the existing resource management features in Solaris.
I would appreciate hearing any suggestions or questions. I'd
like to submit this proposal to our internal architectural review
process by mid-July. I have also posted a few slides that give an
overview of the project. Those are available on the zones files
page (http://www.opensolaris.org/os/community/zones/files/).
Thanks,
Jerry
------------------------------------------------------------------------
SUMMARY:
This project enhances Solaris zones[1], pools[2-4] and resource
caps[5,6] to improve the integration of zones with resource
management (RM). It addresses existing RFEs[7-10] in this area and
lays the groundwork for simplified, coherent management of the various
RM features exposed through zones.
We will integrate some basic pool configuration with zones, implement
the concept of "temporary pools" that are dynamically created and
destroyed when a zone boots or halts, and simplify the setting of
resource controls within zonecfg. We will enhance rcapd so that it can cap
a zone's memory while rcapd is running in the global zone. We will
also make a few other changes to provide a better overall experience
when using zones with RM.
Patch binding is requested for these new interfaces and the stability
of most of these interfaces is "evolving" (see interface table for
complete list).
PROBLEM:
Although zones are fairly easy to configure and install, it appears
that many customers have difficulty setting up a good RM configuration
to accompany their zone configuration. Understanding RM requires
learning many new terms and concepts, along with a large body of
documentation.
This leads to the problem that many customers either do not configure
RM with their zones, or configure it incorrectly, leading them to be
disappointed when zones, by themselves, do not provide all of the
containment that they expect.
This problem will just get worse in the near future with the
additional RM features that are coming, such as cpu-caps[11], memory
sets[12] and swap sets[13].
PROPOSAL:
Seven enhancements are outlined below.
1) "Hard" vs. "Soft" RM configuration within zonecfg
We will enhance zonecfg(1M) so that the user can configure basic RM
capabilities in a structured way.
The various existing and upcoming RM features can be broken down
into "hard" vs. "soft" partitioning of the system's resources.
With "hard" partitioning, resources are dedicated to the zone using
processor sets (psets) and memory sets (msets). With "soft"
partitioning, resources are shared, but capped, with an upper limit
on their use by the zone.
Hard | Soft
---------------------------------
cpu | psets | cpu-caps
memory | msets | rcapd
There are also some existing rctls (zone.cpu-shares, zone.max-lwps)
which will be integrated into this overall concept.
Within zonecfg we will organize the various RM features into four
basic zonecfg resources so that it is simple for a user to understand
and configure the RM features that are to be used with their zone.
Note that zonecfg "resources" are not the same as "resource
management". Within zonecfg, a "resource" is the name of a top-level
property of the zone (see zonecfg(1M) for more information).
The four new zonecfg resources are:
dedicated-cpu
capped-cpu (future, once cpu-caps are integrated)
dedicated-memory (future, once memory sets are integrated)
capped-memory
Each of these zonecfg resources will have properties that are
appropriate to the RM capabilities associated with that resource.
Zonecfg will only allow one instance of each of these resources to be
configured, and it will not allow conflicting resources to be added
(e.g. dedicated-cpu and capped-cpu are mutually exclusive).
The mapping of these new zonecfg resources to the primary underlying RM
feature is:
dedicated-cpu -> temporary pset
dedicated-memory -> temporary mset
capped-cpu -> cpu-cap rctl [11]
capped-memory -> rcapd running in GZ
Temporary psets and msets are described below, in section 2.
Rcapd enhancements for running in the global zone are described below,
in section 4.
The valid properties for each of these new zonecfg resources will be:
dedicated-cpu
ncpus (a positive integer or range, default value 1)
importance (a positive integer, default value 1)
max-lwps (an integer >= 100)
capped-cpu
cpu-cap (a positive integer, default value 100 which
represents 100% of one cpu)
max-lwps (an integer >= 100)
cpu-shares (a positive integer)
dedicated-memory
TBD - once msets [12] are completed
capped-memory
cap (a positive decimal number with an optional k, m, g,
or t modifier; with no modifier the units default to
megabytes (m); must be at least 1m)
Some of these properties actually correspond to rctls. See section 3
below for a description of how this will work.
Zonecfg will also be enhanced to check for invalid combinations. For
example, it will disallow defining both a dedicated-cpu resource and
the zone.cpu-shares rctl at the same time. Likewise, explicitly
specifying a pool name via the 'pool' resource along with either a
'dedicated-cpu' or 'dedicated-memory' resource is an invalid
combination.
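As an illustration, capping a zone's memory at one gigabyte with the
new 'capped-memory' resource might look like the following zonecfg
session (this is a sketch; the exact 'info' output is illustrative):

```shell
> add capped-memory
capped-memory> set cap=1g
capped-memory> end
> info capped-memory
capped-memory:
	cap: 1G
```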
These new zonecfg resource names (dedicated-cpu, capped-cpu,
dedicated-memory & capped-memory) are chosen so as to be reasonably
clear what the objective is, even though they do not exactly align
with our existing underlying (and inconsistent) RM naming schemes.
2) Temporary Pools.
We will implement the concept of "temporary pools" within the pools
framework.
To improve the integration of zones and pools we are allowing the
configuration of some basic pool attributes within zonecfg, as
described above in section 1. However, we do not want to extend
zonecfg to completely and directly manage standard pool configurations.
That would lead to confusion and inconsistency regarding which tool to
use and where configuration data is stored. Temporary pools sidestep
this problem and allow zones to dynamically create a simple pool/pset
configuration for the basic case where a sysadmin just wants a
specified number of processors dedicated to the zone (and eventually a
dedicated amount of memory).
We believe that the ability to simply specify a fixed number of cpus
(and eventually an mset size) meets the needs of a large percentage of
zones users who need "hard" partitioning (e.g. to meet licensing
restrictions).
If a dedicated-cpu (or eventually a dedicated-memory) resource is
configured for the zone, then when the zone boots zoneadmd will create
a temporary pool dedicated to the zone's use. Zoneadmd will
dynamically create a pool & pset (or eventually an mset) and assign the
number of cpus specified in zonecfg to that pset. The temporary pool
& pset will be named 'SUNWzone{zoneid}'.
Zoneadmd will set the 'pset.min' and 'pset.max' pset properties, as
well as the 'pool.importance' pool property, based on the values
specified for dedicated-cpu's 'ncpus' and 'importance' properties
in zonecfg.
If the cpu (or memory) resources needed to create the temporary pool
are unavailable, zoneadmd will issue an error and the zone won't boot.
When the zone is halted, the temporary pool & pset will be destroyed.
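While such a zone is running, the temporary pool and its pset appear
in the dynamic configuration and can be observed with the existing
tools. A hypothetical sketch for a zone with zoneid 1 and ncpus=2
(the output below is illustrative, not actual program output):

```shell
# poolstat
 id pool                 size used load
  0 pool_default            6 0.00 0.07
  1 SUNWzone1               2 0.00 0.00
```

Once the zone halts, 'SUNWzone1' disappears from this list.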
We will add a new boolean property ('temporary') that can exist on
pools and any resource set. The 'temporary' property indicates that
the pool or resource set should never be committed to a static
configuration (e.g. pooladm -s) and that it should never be destroyed
when updating the dynamic configuration from a static configuration
(e.g. pooladm -c). These temporary pools/resources can only be managed
in the dynamic configuration. These changes will be implemented within
libpool(3LIB).
It is our expectation that most users will never need to manage
temporary pools through the existing poolcfg(1M) commands. For users
who need more sophisticated pool configuration and management, the
existing 'pool' resource within zonecfg should be used and users
should manually create a permanent pool using the existing mechanisms.
3) Resource controls in zonecfg will be simplified.
Within zonecfg the existing rctls (zone.cpu-shares and zone.max-lwps)
take a 3-tuple value in which only a single component (the 'limit')
usually has any meaning. The other two components of the value (the
'priv' and 'action') are not normally changed, but they can confuse
users who don't understand what they mean or what values can be
specified.
Here is a zonecfg example:
> add rctl
rctl> set name=zone.cpu-shares
rctl> add value (priv=privileged,limit=5,action=none)
rctl> end
Within zonecfg we will introduce the idea of rctl aliases. The alias
is a simplified name and template for the existing rctls. Behind the
scenes we continue to store the data using the existing rctl entries
in the XML file. Thus, the alias always refers to the same underlying
piece of data as the full rctl.
The purpose of the rctl alias is to provide a simplified name and
mechanism to set the rctl 'limit'. For each rctl/alias pair we will
"know" the expected values for the 'priv' and 'action' components of
the rctl value. If an rctl is already defined that does not match this
"knowledge" (e.g. it has a non-standard 'action' or there are multiple
values defined for the rctl), then the user will not be allowed to use
an alias for that rctl.
Here are the aliases we will define for the rctls:
alias rctl
----- ----
max-lwps zone.max-lwps
cpu-shares zone.cpu-shares
cpu-cap zone.cpu-cap (future, once cpu-caps integrate)
Here is an example of the max-lwps alias used as a property within the
new 'dedicated-cpu' resource:
> add dedicated-cpu
dedicated-cpu> set ncpus=2-4
dedicated-cpu> set max-lwps=500
dedicated-cpu> end
> info
...
dedicated-cpu:
ncpus: 2-4
max-lwps: 500
rctl:
name: zone.max-lwps
value: (priv=privileged,limit=500,action=deny)
In the example, you can see the use of the alias when adding the
'dedicated-cpu' resource and you can also see the full rctl output
within the 'info' command. If the 'max-lwps' property had not been set
within the 'dedicated-cpu' resource, then the corresponding rctl would
not be defined.
If you update the rctl value through the 'rctl' resource within
zonecfg, then the corresponding value within the 'dedicated-cpu'
resource would also be updated since both the rctl and its alias refer
to the same piece of data.
If an rctl is already defined that does not match the expected values
(e.g. it has 'action=none' or multiple values), then the 'max-lwps'
alias will be disabled. An attempt to set 'max-lwps' within
'dedicated-cpu' will print the following error:
"One or more incompatible rctls already exist for this
property"
This rctl alias enhancement is fully backward compatible with the
existing rctl syntax. That is, zonecfg output will continue to display
rctl settings in the current format (in addition to the new aliased
format) and zonecfg will continue to accept the existing input syntax
for setting rctls. This ensures full backward compatibility for any
existing tools/scripts that parse zonecfg output or configure zones.
4) Enable rcapd to limit zone memory while running in the global zone
Currently, to use rcapd(1M) to limit zone memory consumption, the
rcapd process must be run within the zone. This exposes a loophole
since the zone administrator, who might be untrusted, can change the
rcapd limit.
We will enhance rcapd so that it can limit a zone's memory consumption
while rcapd runs in the global zone. This closes the rcapd
loophole and allows the global zone administrator to set memory
caps that are enforced by a single, trusted process.
The rcapd limit for a zone will be configured using the new
'capped-memory' resource and 'cap' property within zonecfg.
When a zone with 'capped-memory' boots, zoneadmd will automatically
start rcapd in the global zone, if necessary. The interfaces to
communicate memory cap information between zoneadmd and rcapd
are project private.
As part of this overall project, we will enhance rcapd's internal
RSS accounting so that rcapd has a more accurate measurement of
each zone's overall RSS.
5) Use FSS when zone.cpu-shares is set
Although the zone.cpu-shares rctl can be set on a zone, the Fair Share
Scheduler (FSS) is not the default scheduling class, so this rctl
frequently has no effect unless the user also sets FSS as the
default scheduler or moves the zone's processes to FSS with the
priocntl(1M) command. This means that users can easily think
they have configured their zone for a behavior that they are not
actually getting.
We will enhance zoneadmd so that if the zone.cpu-shares rctl is set
and FSS is not already the default scheduling class, zoneadmd will set
the scheduling class to be FSS for processes in the zone.
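This automates what an administrator would otherwise do by hand.
Roughly, the manual steps look like the following sketch (the literal
zoneid 1 is an example; the real id can be found with 'zoneadm list -p'):

```shell
# Make FSS the system-wide default scheduling class
# (affects processes started from now on):
dispadmin -d FSS

# Alternatively, move an already-running zone's processes into FSS:
priocntl -s -c FSS -i zoneid 1
```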
6) Add RM templates for zone creation
Zonecfg already supports templates on the 'create' subcommand using
the '-t' option. We will update the documentation which currently
states that a template must be the name of an existing zone. We
already deliver two existing templates (SUNWblank and SUNWdefault).
We will deliver at least four new templates that configure
reasonable default properties for the four new resources in zonecfg:
fully dedicated: dedicated-cpu & dedicated-memory
cpu dedicated, memory capped: dedicated-cpu & capped-memory
cpu capped, memory dedicated: capped-cpu & dedicated-memory
cpu & memory capped: capped-cpu & capped-memory
We may also deliver other templates that only pre-configure one of
the new resources (e.g. only configures dedicated-cpu and leaves
memory with the default handling).
We will enhance the help text for the 'create' subcommand to briefly
describe the templates and when to use each one.
The names of all new templates will begin with SUNW. This namespace
was already reserved by [1].
This zonecfg change will primarily impact the documentation.
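For reference, the new SUNW RM templates will be used with the same
'create -t' syntax that works today, from within a 'zonecfg -z myzone'
session (the zone name and zonepath here are examples):

```shell
> create -t SUNWdefault
> set zonepath=/zones/myzone
> commit
```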
7) Pools system objective defaults to weighted-load (wt-load)[4]
Currently pools are delivered with no objective set. This means that
if you enable the poold(1M) service, nothing will actually happen on
your system.
As part of this project, we will set weighted load
(system.poold.objectives=wt-load) to be the default objective.
Delivering this objective as the default does not impact systems out
of the box since poold is disabled by default.
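The default being delivered corresponds to what an administrator can
already set by hand with poolcfg; a sketch (the system element name
'host' is an example, and by default it matches the hostname):

```shell
poolcfg -dc 'modify system host (string system.poold.objectives = "wt-load")'
```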
EXPORTED INTERFACES
New zonecfg resource names
dedicated-cpu Evolving
capped-cpu Evolving
dedicated-memory Evolving
capped-memory Evolving
The capped-cpu and dedicated-memory resource names are
being reserved now in anticipation of the future integration
of the cpu-caps and memory sets projects. However, we do
not want to make this project dependent on [11] & [12].
New zonecfg property names
ncpus Evolving
importance Evolving
cap Evolving
New zonecfg rctl alias names
max-lwps Evolving
cpu-shares Evolving
cpu-cap Evolving
The cpu-cap rctl alias is being reserved now in anticipation
of the future integration of the cpu-caps projects. However,
we do not want to make this project dependent on [11].
Temporary pool & resource names
SUNWzone{zoneid} Stable
New temporary pool & resource boolean properties
'pool.temporary' Evolving
'pset.temporary' Evolving
'*.temporary' Evolving
(for future resources such as mset)
rcapd/zoneadmd interface
zone_getattr Project Private
wt-load as default Evolving
IMPORTED INTERFACES
libpool(3LIB) Unstable
REFERENCES
1. PSARC 2002/174 Virtualization and Namespace Isolation in Solaris
2. PSARC 2000/136 Administrative support for processor sets and extensions
3. PSARC 1999/119 Tasks, Sessions, Projects and Accounting
4. PSARC 2002/287 Dynamic Resource Pools
5. PSARC 2002/519 rcapd(1MSRM): resource capping daemon
6. PSARC 2003/155 rcapd(1M) sedimentation
7. 6421202 RFE: simplify and improve zones/pool integration
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6421202
8. 6222025 RFE: simplify rctl syntax and improve cpu-shares/FSS interaction
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6222025
9. 5026227 RFE: ability to rcap zones from global zone
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5026227
10. 6409152 RFE: template support for better RM integration
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6409152
11. PSARC 2004/402 CPU Caps
12. PSARC 2000/350 Physical Memory Control
13. PSARC 2002/181 Swap Sets