Enclosed is a first draft of a spec. for the S10 brand which we plan to submit for a PSARC inception review. Please send us any comments or questions.
Thanks,
Jerry

---

S10C: A Solaris 10 Branded Zone for Solaris.Next

Gerald Jelinek, Jordan Vaughan
Solaris Virtualization Technologies

[A note on terminology: This document uses the terms "Solaris 10" and "Solaris.Next" very frequently. As such, the abbreviations "S10" and "S.next" respectively are used interchangeably with the longer forms. The term "virtualization" is abbreviated as V12N.]

Part 1: Introduction
____________________________

Each new minor release of Solaris brings with it the well-known problems of slow user adoption, slow ISV support and concerns about compatibility. The compatibility concerns will be more pronounced with the release of S.next since it is anticipated that there will be greater than normal user-visible changes (e.g., the packaging system).

Fortunately, since the last minor release of Solaris (Solaris 10), V12N techniques have become widespread and V12N can be used to ease the transition to the new version of Solaris. Zones[1] combined with a brand[2] are particularly well suited for this task since the host system is actually running S.next, whereas this is not necessarily the case with other V12N solutions. In addition, zones are usable on any system which runs S.next, which is also not the case with other V12N alternatives.

We already have a proven track record delivering this sort of zones/brand based solution to enable running earlier versions of Solaris on S10 [3, 4], so in one sense this case breaks little new ground. However, the earlier 'solaris8' and 'solaris9' brands were used to host releases that are very static as compared to hosting a zone running S10. In addition, S.next can be expected to continue to change rapidly for the foreseeable future. Given this, a 'solaris10' brand for S.next poses additional challenges for projects on both the S10 and S.next sides of the system.
Many of these challenges are outside of the scope of an architectural review and include developer education, testing and procedural changes. However, the existence of this brand could potentially impact future projects in various ways and at a minimum will require ARC consideration for future reviews. The existence of this brand can be seen as a potential "tax" on all projects which work on both sides of the user/kernel boundary for both S10 and S.next.

The benefits of the brand are as follows:

For customers:
- Provides a solution to cope with compatibility differences between S10 and S.next
- Protects investment in S10 infrastructure, training, and internal support
- Minimizes the cost of consolidating Solaris 10 systems
- Enables deployment of new technologies in S.next (e.g., Crossbow) while still running applications on S10, thereby limiting risk to the production environment
- Avoids or delays required application recertification

For Sun:
- S.next is adopted sooner
- Provides a Solaris compatibility environment for S.next
- Sun is a solution provider easing the burden of getting to S.next
- Provides a cross-platform virtualization solution for S.next across all hardware (it is the only V12N solution on M-Series)

This has been identified as a required feature for S.next.

=== Project Overview ===

As with the earlier 'solaris8' and 'solaris9' brands, this project delivers the following:
- A Branded Container which emulates Solaris 10's user environment, based on the BrandZ infrastructure provided with zones. This brand is called 'solaris10'. Only Solaris 10u8 and beyond will be supported and tested in the zone.
- A mechanism for archiving existing Solaris 10 systems and for redeploying those archives into the branded zone. This process is referred to as p2v and uses the same techniques as the 'solaris8' and 'solaris9' brands.

In addition, the following capabilities will be provided beyond those of the 'solaris8' and 'solaris9' brands.
- This brand will be supported on all hardware architectures that run S.next (sun4v, sun4u and x86). The specific platforms, particularly sun4u, will be the same as are certified for S.next.
- A "virtual to virtual" or v2v mechanism for archiving existing Solaris 10 native zones and for redeploying those archives into the branded zone on S.next will be provided. The process will be very similar to the existing zone migration [5] feature except that the zone's brand will be changed as part of the process. In addition, if the zone is sparse on S10 it must be converted to a whole-root zone during the migration.

Part 2: solaris10 Brand
____________________________

The solaris10 brand is conceptually similar to the existing solaris8 and solaris9 brands and builds directly on the BrandZ infrastructure that was created to support the lx brand. Familiarity with BrandZ and the solaris8 and solaris9 brands is assumed.

At this time the brand supports only the shared stack [6] networking model, in which the zone's network is managed by the global zone. The exclusive stack model is anticipated to require more complex solutions or emulation due to the introduction of Crossbow [7] into S.next. The exclusive stack issues will be resolved before commitment review.

The ZFS ioctls have been audited and no issues have been seen. Because so much of ZFS has been backported to S10 updates earlier than the first S10 version being supported in the brand (S10u8), ZFS delegated datasets appear to work fine. Further testing needs to be done and future ZFS enhancements might require work at some point.

=== System Call Emulation ===

This section details the system call emulation provided by the current solaris10 brand module. The following system calls are currently being emulated.
    SYS_exec            11
    SYS_ioctl           54
    SYS_execve          59
    SYS_acctctl         71
    SYS_getpagesizes    73
    SYS_issetugid       75
    SYS_uname          135
    SYS_pwrite         174
    SYS_sigqueue       190
    SYS_pwrite64       223
    SYS_zone           227

SYS_exec
SYS_execve
    The emulator interposes on these system calls to provide a convenient mechanism for branded processes to be able to spawn native processes.

SYS_ioctl
    Emulate process contract ioctls for init(1M) because the ioctl parameter structure changed between S10 and Nevada.

SYS_acctctl
    The mode shift, mode mask and option mask for acctctl changed for Crossbow.

SYS_getpagesizes
    New first arg "legacy" must be set to 1.

SYS_issetugid
    S10's issetugid() syscall is now a subcode to privsys().

SYS_uname
    The emulator simply passes this through, then modifies the result upon return, so that the system call returns 5.10 for the 'release' field and 'Generic_Virtual' for the 'version' field.

SYS_pwrite
SYS_pwrite64
    pwrite's behavior differs between S10 and Nevada when applied to files opened with O_APPEND. The offset argument is ignored and the buffer is appended to the target file in S10, whereas the current file position is ignored in Nevada (i.e., pwrite() acts as though the target file wasn't opened with O_APPEND). This is a result of the fix for:

        6655660 pwrite() must ignore the O_APPEND/FAPPEND flag.

    Emulate the old S10 pwrite() behavior by checking whether the target file was opened with O_APPEND. If it was, then invoke the write() system call instead of pwrite(); otherwise, invoke the pwrite() system call as usual.

SYS_sigqueue
    New last arg "block" flag should be zero. The block flag is used by the OpenSolaris AIO implementation, which is now part of libc.

SYS_zone
    See discussion below.

=== zone(2) support ===

Zones have been part of S10 since its FCS, so in general S10 is already zone-aware and does the right thing in most cases. Commands that are zone-aware will continue to work as they do today in S10 native zones.
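
As an aside on the SYS_pwrite/SYS_pwrite64 entry in the System Call Emulation section, the O_APPEND check it describes can be illustrated with a small user-space sketch. This is only an illustration in Python and not the brand library's code: the function name s10_pwrite is hypothetical, and it assumes the open flags can be read back with fcntl(F_GETFL), which is how a user-level process would test for O_APPEND.

```python
import fcntl
import os


def s10_pwrite(fd: int, data: bytes, offset: int) -> int:
    """Hypothetical helper mimicking the S10 pwrite() behavior described above.

    If the descriptor was opened with O_APPEND, the offset is ignored and the
    data is appended via write(); otherwise pwrite() is called as usual.
    Returns the number of bytes written.
    """
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    if flags & os.O_APPEND:
        # S10 semantics: O_APPEND wins, so fall back to a plain write().
        return os.write(fd, data)
    # No O_APPEND: positional write at the requested offset.
    return os.pwrite(fd, data, offset)
```

With this helper, writing at offset 0 to a descriptor opened with O_APPEND extends the file, while the same call on a descriptor opened without O_APPEND overwrites the start of the file.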

One set of commands that does require emulation is the S10 SVr4 packaging and patch commands. Those commands are zone-aware and in some cases will check if they are running in the global zone and refuse to function if not. If running in the global zone they will also attempt to look for other zones to operate on. The brand emulation interposes on the zone syscall and selectively provides emulation when the running command is one of the SVr4 package or patch commands. In these cases the emulation indicates that it is the global zone (zoneid 0) and various zone attributes, such as the zone brand itself, are emulated. In all other cases the syscall is passed through so that the other S10 commands continue to behave as they do currently.

Because the solaris10 branded zones are whole-root zones, all packaging and patch operations will be successful, although the kernel components of the package or patch are not used. This is exactly the same behavior as on the solaris8 and solaris9 branded zones.

One further consideration for zones is related to the p2v process. During p2v there may be zones on the original physical system. Since zones do not nest, p2v-ing these systems means that the zones themselves are not usable inside the branded zone. This is detected when the zone is installed and a warning is issued indicating that any nested zones will not be usable and that the disk space could be recovered. Those zones can be migrated ahead of time using the v2v feature described below. In addition, a future project is planned which will assess a system prior to p2v and report any possible issues that may arise. Detecting zones would be part of that report.

=== solaris10 Brand: What's Not Emulated ===

This project does not make any changes to existing native zone limitations.

One point to note is that TX will continue to be incompatible with branded zones. Customers using TX on S10 systems will need to transition to a certified, native S.next TX solution.
Discussions with the TX team indicate that this is the normal behavior for users of TX, since the base OS itself must be certified for TX.

=== Versioning ===

Because of the potential issues with compatibility of various releases of S10 hosted on differing releases of S.next, a basic versioning system is incorporated into the brand. This versioning system works both ways. That is, the brand emulation can check which version of S10 is being hosted in the zone and adjust the emulation accordingly. Likewise, future S10 updates which require specific emulation can indicate that a specific version of the emulation is required. If necessary, they can also check if they are running in a branded zone and, if so, determine what version of emulation is available.

The initial release of the software won't need this versioning mechanism, but it is being included to cope with possible future enhancements to either S10 or S.next.

If a change is made to S10 which requires an enhancement to the brand emulation library, it is expected that this change would be delivered in a S10 KU patch which provides components on both sides of the user/kernel boundary. When the branded zone boots, the brand boot hook determines the minimal version of the KU that is installed in the zone to verify that the zone's release is supported (i.e. currently the minimal KU will be the one from S10u8). It then makes the associated version (i.e. version 0 of the emulation) available as an attribute on the zone. The brand library can then use this information to provide conditional emulation if needed. Future projects that enhance the emulation for new features in S10 can add a check for a different KU version number which would then provide associated versions (e.g. 1, 2, etc.) to the brand library. If the KU version is not sufficient, future S10 projects may need to design some other version check for the brand to enable it to properly detect the S10 changes.
The ability to detect the KU version is already covered by the contract on the zone "update on attach" feature [8].

The situation is more complicated for future changes within the S10 code base which will require associated enhancements to the brand emulation. There are two mechanisms being proposed.

The first mechanism is that the future version of S10 can specify that it requires a minimal version of the brand emulation. It does this by delivering a version number into the '/usr/lib/brand/solaris10/version' file on S10. When this future version of S10 is p2v-ed into a solaris10 branded zone, the solaris10 brand will check for the presence of this file and if it exists, the brand will verify that the brand's version is greater than or equal to the version specified in the S10 file. If not, then an error will be emitted and the zone p2v will fail, leaving the zone in the configured state. If the '/usr/lib/brand/solaris10/version' file is missing on S10, that indicates that the version of S10 is still compatible with the initial release of the solaris10 brand emulation. The first time a project is backported to S10 which requires an enhancement to the emulation, this file must be created and the version number in the file will be bumped. This first mechanism is useful if a future S10 update is fundamentally incompatible with an older version of the S.next brand emulation.

The second mechanism allows projects that have been backported to S10 to actually be brand aware. A new zone attribute will be available indicating which version of the brand emulation is currently installed on the system. For these future S10 updates, if they deliver a new feature which requires changes to the brand library, that S10 feature can also determine if it is running in a branded zone and if so, if the necessary emulation is available.
If the newer S10 update is running in a zone on an older version of S.next which does not provide the required emulation, the S10 feature can adjust its behavior in the appropriate manner.

The existing getzoneid() and zone_getattr(ZONE_ATTR_BRAND) functions can be used by S10 code to determine if it is running in a non-global zone and if that zone is a 'solaris10' branded zone. A new solaris10 brand-specific zone attribute, S10_EMUL_VERSION_NUM, is defined. The S10 feature can use the zone_getattr(S10_EMUL_VERSION_NUM) function to determine if the brand emulation supports the feature. The getzoneid() and zone_getattr() functions are already used throughout the ON consolidation for code that is zone-aware. These functions will continue to be consolidation private.

Engineers backporting features to a future S10 update will need to first determine if that feature requires enhancements to the solaris10 brand library. If so, they will then have to enhance the emulation in S.next and bump the emulation version number. They can then either bump the minimal emulation version number in the /usr/lib/brand/solaris10/version file on S10 during the S10 backport, or add the appropriate checks to the backported S10 code so that it can determine if the support is available in the brand library and change behavior accordingly.

This adds a great deal of complexity to projects backporting features to future S10 updates if those features require emulation to function correctly in the branded zone. Ideally, projects requiring such enhancements to the brand emulation will not be backported. The presence of the S10 brand on S.next may itself discourage projects from backporting, since the brand provides S10 compatibility on S.next. Future projects which cross the user/kernel boundary and which request patch binding should be reviewed by the ARCs to determine if those projects must take the solaris10 brand into account.
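
To make the two proposed mechanisms concrete, here is a minimal sketch of the checks, written in Python purely for illustration. Nothing in it is the brand's actual code. It assumes the version file holds a single decimal integer, that a missing file corresponds to version 0 (the initial emulation release), and that code outside a 'solaris10' zone is running natively on S10 and so needs no emulation. The function names required_emul_version, p2v_version_ok and feature_usable are hypothetical.

```python
import os

# Path of the version file relative to the root of the S10 image (from the spec).
VERSION_FILE = "usr/lib/brand/solaris10/version"


def required_emul_version(zone_root: str) -> int:
    """Return the minimal emulation version the S10 image asks for.

    Assumption: the file contains one decimal integer. A missing file means
    the image is compatible with the initial emulation release (version 0).
    """
    try:
        with open(os.path.join(zone_root, VERSION_FILE)) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0


def p2v_version_ok(zone_root: str, brand_version: int) -> bool:
    """First mechanism: p2v succeeds only if the brand is new enough."""
    return brand_version >= required_emul_version(zone_root)


def feature_usable(zone_brand: str, emul_version: int, needed: int) -> bool:
    """Second mechanism: a brand-aware S10 feature checks the emulation level.

    zone_brand and emul_version stand in for the values S10 code would obtain
    from zone_getattr(ZONE_ATTR_BRAND) and zone_getattr(S10_EMUL_VERSION_NUM).
    """
    if zone_brand != "solaris10":
        # Assumption: not in a solaris10 zone means running natively on S10.
        return True
    return emul_version >= needed
```

In this sketch a failed p2v_version_ok() corresponds to the case where the spec says an error is emitted and the zone is left in the configured state, while a False result from feature_usable() is where a brand-aware S10 feature would adjust its behavior.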

In addition to the above, any change integrating into S.next that might impact the solaris10 brand will require testing the supported versions of S10 in the branded zone and making any needed changes to the solaris10 emulation.

Part 3: Archiving, Installation, p2v & v2v
____________________________

The p2v process for the solaris10 brand is the same as for the solaris8, solaris9 and native [9] brands. A contract will be included with this case for the flar command to explicitly call out the use of flash archives for migrating system images into zones.

The v2v process for migrating S10 native zones to solaris10 branded zones will support the same archive formats as p2v. This process will use the 'zoneadm attach' subcommand since that's the existing interface for migrating [3] zones from one system to another. The solaris10 brand attach subcommand will be extended to accept the following options, which correspond to the same options in the install subcommand.

    -a {path} - specifies a path to an archive to unpack into the zone
    -d {path} - specifies a path to a tree of files as the source for the installation

One issue with v2v of a S10 zone is that those zones can be sparse but the solaris10 branded zone must be whole root. The current plan is that the zone must be readied on the source system. This will mount any inherited-pkg-dirs and an archive can then be made of the readied zone.

The p2v conversion during the installation of the zone will again be similar to the native p2v process [9].

=== Interface Table ===

The solaris10 brand seeks minor release binding.

Exported Interfaces                             Stability
----------------------------------------------------------------------
"solaris10" brand name                          Committed
"SUNWsolaris10" brand template name             Committed
For the solaris10 brand, brand-specific
  install and attach subcommand options         Committed
  documented in this case
/usr/lib/brand/solaris10 directory              Committed
SUNWs10brandr, SUNWs10brandu packages           Committed
/usr/lib/brand/solaris10/version                Committed
getzoneid(), zone_getattr(),
  ZONE_ATTR_BRAND and S10_EMUL_VERSION_NUM      Consolidation Private
  attributes

Imported Interfaces                             Stability
----------------------------------------------------------------------
brandz[2]                                       Project Private
Nevada syscall traps documented above           Consolidation Private
flar(1M)                                        Evolving
                                                Contract included with this case

REFERENCES

1. PSARC/2002/174 Virtualization and Namespace Isolation in Solaris
2. PSARC/2005/471 BrandZ: Support for non-native zones
3. PSARC/2007/350 Etude: Migration Technology
4. PSARC/2008/125 Etude Part Deux
5. PSARC/2006/030 Zone migration
6. PSARC/2006/366 Stack instances: Exclusive IP stack per zone
7. PSARC/2006/357 Crossbow - Network Virtualization and Resource Management
8. PSARC/2007/621 zone update on attach
9. PSARC/2008/766 native zones p2v

_______________________________________________
zones-discuss mailing list
zones-discuss@opensolaris.org