James, great write-up. The automated configuration management systems available today are quite incapable of handling upstart, so I think there's a huge need for some simpler automation.
My reply is inline, and, I'm afraid, a bit rambling as well, as it was written over a couple of days...

Excerpts from James Hunt's message of Fri Jun 17 12:42:17 -0700 2011:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi All,
>
> = Caveat =
>
> This is very much a brain dump and doesn't have all the answers - please
> comment and fill in the blanks when you spot them! :-)
>
> = Introduction =
>
> We are looking to provide the ability to fully disable a job.
>
> = Rationale =
>
> Lots of users are familiar with the old SysV way of handling jobs and
> are looking for a chkconfig-like tool to ease the transition to Upstart.
>
> The "manual" stanza coupled with the Override facility does already
> provide this facility, but have the following shortcomings.
>
> == Shortcomings of Override Files ==
>
> * There is no programmatic Upstart interface: it requires a tool/user to
>   manually create a ".override" file containing the "manual" stanza (or
>   simply appending "manual" to the ".conf" file).
>
> * It is too generic a facility / not "fail-safe"
>
>   Any Admin/tool/pkg can manipulate ".override" files. If an Admin
>   disables a job using a ".override" file, they might find that it has
>   later been changed by another tool that rewrote the override. This is
>   undesirable since the job may no longer be disabled.

Conflicting changes to a single configuration, whether in one file or a group of files, will always be a problem. In SysV, you can have one tool that disables a service, and another that just moves it from starting at position 20 to position 60. Then insserv comes along and reorders it all for dependencies that the tool didn't account for. What's important is that the system provides a gracefully degrading mechanism that abstracts the disabling/restricting/limiting behavior at the points that make sense. With boot order, there's really only one set of concerns to address: the root user and how they want the system to boot.
So I'm not sure where the override file fails to support this. The tools that modify the override file *must* be able to introspect the situation, or be reasonably sure that they can take blind action that will succeed. Right now,

    echo manual >> /etc/init/job.override

gives one the blind action assurance. I think we identified the need for initctl to be able to tell us whether a job is manual or not, which would give a standard way to achieve introspection.

>
> * Not obvious how to determine if a job *is* enabled or disabled.
>
>   It is possible though. See:
>
>   http://upstart.ubuntu.com/cookbook/#determine-if-a-job-is-disabled

I think there's a need for a stronger keyword, 'disabled', which totally disables the job even if manually started with 'start foo'. Given that, the status tool could show if a job is manual and/or disabled. If the desire is to disable only some of its start/stop conditions, well, that can also be done by adding a new start on to the end of the override file. Basically there is a difference between wanting something to not run ever, and wanting to change the default way something runs without modifying the job file.

>
> = Requirement =
>
> A "chkconfig"-like tool [1] to allow:
>
> * Jobs to be disabled in particular runlevels.

I think this is far less useful on Debian/Ubuntu. Runlevels just aren't as important as they are in RH, where they mean a lot more. I do see some times where people want to be able to affect one start condition, but not the other. The precision of the upstart start on / stop on conditions makes it hard to translate the imprecise (but simple) methods possible w/ runlevels.

>
> * The ability to determine if a job is disabled for a particular
>   runlevel.
>

I think the visualization tool can be used to tell if a job might be started or stopped given a particular event. Am I over-stating its capabilities?
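To make the introspection point above a bit more concrete, here is a rough Python sketch of what "is this job manual, and what is its effective start on?" could look like for a tool that only reads the job files. It is not an existing initctl feature. The function names are made up for illustration, and the rules it encodes are my simplified reading of init(5): the .override stanzas are applied after the .conf, a later start on replaces an earlier one, and manual discards any start on seen before it. Multi-line conditions and script blocks are not handled.

```python
import re


def effective_start_on(conf_text, override_text=""):
    """Return the effective single-line 'start on' condition, or None.

    Simplifying assumptions (hypothetical helper, not Upstart code):
    the override text is processed after the conf text, the last
    'start on' wins, and 'manual' clears any earlier 'start on'.
    """
    condition = None
    for raw in (conf_text + "\n" + override_text).splitlines():
        line = raw.split("#", 1)[0].strip()  # naive comment stripping
        if line == "manual":
            condition = None
            continue
        match = re.match(r"start\s+on\s+(.+)$", line)
        if match:
            condition = match.group(1).strip()
    return condition


def starts_automatically(conf_text, override_text=""):
    """True when some start condition survives the override."""
    return effective_start_on(conf_text, override_text) is not None


conf = "start on runlevel [2345]\nexec /usr/sbin/cron\n"
print(effective_start_on(conf))                            # runlevel [2345]
print(effective_start_on(conf, "manual\n"))                # None
print(effective_start_on(conf, "manual\nstart on foo\n"))  # foo
```

The last call is the case described above: a start on appended after manual in the override file re-enables the job with the new condition.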
> * The ability to determine if a job *will* run for a particular
>   runlevel (note: this is *NOT* the same as the bullet above!
>   See below...)
>
> = Ideal =
>
> The ideal tool would provide the following details:
>
> * Job name.
> * Instance name.
> * Which runlevels a job is enabled and disabled in.
>   This breaks down into:
>   * Job is enabled for specified runlevel.
>   * Job is explicitly disabled for specified runlevel.
>   * Job is *implicitly* disabled for specified runlevel.
> * Whether the job ran last time?
>   (would require an event+job log. Can never be 100% reliable of course
>   since config may have changed between boots.)

If I replace 'runlevel' with 'boot phase', and think of it more like what Scott said ChromeOS does, this makes more sense. I think most of the time an admin who wants to change the 'runlevel' something is enabled or disabled in really just wants to move it from starting before everything else, or after everything else. Otherwise they want to express a non-obvious, non-generic dependency. Either way, they're both handled by following jobs, whether it's a boot phase job or an explicit job. Then the tool just needs to be good at manipulating start/stop on conditions.

>
> = Preliminaries =
>
> == Thoughts and Observations ==
>
> * It is actually rather difficult to map the Upstart event model onto such
>   a tool since SysV init doesn't behave like Upstart (further details below).
>
> * If a job is explicitly disabled completely, jobs which start on that
>   job will be implicitly disabled. This information needs to be
>   conveyed somehow.
>

Again, the manual and (potentially, if it's agreed upon and comes into existence) the disabled keywords need to be carried into the visualization tool. Then whenever you change one of these things, you can very easily print the diff between the two and ask the user if they're ok with that.
Something like

    startup
    -starting mountall
    -starting lxc-mountall
    -started lxc-mountall -> local-filesystems
    + started dbus
    -starting network-manager
    -starting dbus
    -started dbus
    +local-filesystems
    -starting network-manager

where a user or program can detect the diff of jobs that show up on the criteria and report it. This would be useful in automated integration testing for Ubuntu as well, since we could very easily install all co-installable packages and run this, and then raise warnings when a job wasn't going to start.

> * If a job has a start on condition as below, what action should we
>   take if the user requests the job be disabled in runlevel 2?:
>
>     start on foo or runlevel 2
>
>   Since it is (currently) not possible to know upfront whether "foo" or
>   "runlevel 2" will be satisfied at boot time, it may be reasonable to
>   (by default) disable such a job in runlevel 2 since the "start on"
>   has specified it *might* "start on runlevel 2". We could provide an
>   option to control this subtle behaviour.

We'd have our tool mask out runlevel 2 and do the equivalent of

    sudo sh -c 'printf "# added by our tool %s\nstart on foo\n" "$(date)" >> /etc/init/foo.override'

This would disable the 'runlevel 2' part of the start on condition. Asking upstart what its effective start on condition is would present us with 'start on foo'.

>
> * If we provide the ability to disable any job, the system could become
>   unbootable very quickly.
>
>
> == Constraints ==
>
> * Upstart currently has no knowledge of SystemV runlevels: they are
>   supported through events and external applications such as telinit.
>
>   This premise should not need to be contravened - the internals of
>   Upstart should not need to be imbued with runlevel knowledge. This
>   implies that:
>
>   1. The facility should work for *any* event (not just runlevels).
>
>   2. The facility should be driven by an external tool of some kind (in
>      other words either a program or script which calls initctl as
>      appropriate).
>
> * Runlevels are implemented with the "runlevel" event which has a
>   primary environment variable "RUNLEVEL" taking a value from 0 to 6.
>   It needs to be possible to disable a job:
>
>   * entirely (where it has any "start on" condition).
>
>   * in all runlevels ("[0123456]").
>
>   * in some runlevels (for example "[345]").
>
> * Upstart allows jobs to be started based on arbitrarily complex
>   conditions. Any facility to disable a job should consider these
>   conditions.
>

+1 for a tool that helps admins use upstart's event based model rather than hiding it from them.

>
> == Categories of Jobs ==
>
> There are a number of job categories that we need to consider:
>
> 1. Jobs that specify a start on which does *NOT* include
>    runlevel.
>
>    They may start before or after the runlevel event is emitted.
>
> 2. Jobs that start on the initial event.
>
>    A small handful of jobs "start on startup". This is a specialisation
>    of (1).
>
> 3. Jobs that "start on runlevel" (a single event).
>
>    Such jobs may restrict the start on further by specifying
>    environment variables (RUNLEVEL and PREVLEVEL).
>
> 4. Jobs that specify a "complex" start on (one using "and" / "or")
>    which includes "runlevel".
>
> = Terminology =
>
> * "limit"
>
>   Since we want to be able to disable Upstart jobs based on some
>   condition, "disable" is rather a crude term. The word "limit" is
>   better since it connotes the more fine-grained approach being
>   proposed. Its antonym being "delimit" (I'd initially thought of
>   "restrict" and "derestrict" but (,de)limit is shorter :-)
>

I like this term, and I like the idea of it being able to take an optional set of start on / stop on keywords that simply mask the given criteria out of the start on or stop on. Given the above example,

    limit start on runlevel 2

achieves the desired effect.

>
> = Scope =
>
> Ideally, it would be possible to disable a job *instance*. But that is
> probably going to be an "iteration 2" feature.
>
> Of the four categories of Jobs outlined above, only category (3) and (4)
> can reasonably be dealt with by this design. Category (1) breaks down
> into jobs that run before the runlevel event is emitted (about 20 on an
> Ubuntu oneiric system currently) and jobs that run after. The former
> have to be excluded but the latter may be able to be considered. It is
> possible that many of those would end up being implicitly disabled if a
> job in category (3) or (4) were disabled anyway [2].
>
> It isn't reasonable to stop category (2) jobs from running since that
> will almost certainly break your system anyway: mountall won't run for
> starters!
>
>
> = High-Level Plan =
>
> My thoughts at this stage are that we provide 3 new commands (note these
> are not *necessarily* initctl commands):
>
> * limit <job> [<expr>]
>
>   Restrict conditions on which job <job> is started. <expr> is assumed
>   to be a subset of the "start on" condition of <job>, however if it
>   is not, this is not an error (but a warning should probably be
>   issued since the command would have no effect at that point in time).
>
>   QUESTION: If job <job> has already been limited, what do we do:
>
>   1. Throw an error.
>   2. Replace the existing limit with the new one.

If it would be a noop, do nothing, exit 0. If it would change the start/stop criteria, just do it. This way it only happens once. --verbose shows the "not doing anything" or "setting start/stop on to xxx" message for those who are confused why it did nothing.

>
>   QUESTION: How would we handle this scenario?:
>
>     $ restrict cron runlevel [35]
>     $ restrict cron runlevel RUNLEVEL=4

Assuming cron's original start on was

    start on runlevel [2345]

and assuming it knows that 'runlevel RUNLEVEL=4' is equivalent to 'runlevel [4]', the first call would mask 35 from any current arguments to runlevel and add

    # Added by 'limit cron runlevel [35]' Sun Jun 19 19:15:03 -0700
    start on runlevel [24]

to the override file.
The second one would calculate that 4 must be removed and add

    # Added by 'limit cron runlevel RUNLEVEL=4' Sun Jun 19 19:17:08 -0700
    start on runlevel [2]

The key is being able to ask upstart what the current effective start on is, so that the tool acts only on that.

>
>   Possible outcomes:
>
>   1. Cron is restricted in runlevels 3+5.
>   2. Cron is restricted in runlevel 4.
>   3. Cron is restricted in runlevel 3, 4 and 5.
>
> * delimit <job>
>
>   Returns any current limit expression and undoes the effect of
>   "limit".

Simplest implementation has this adding

    # Added by 'delimit cron' Sun Jun 19 19:18:23 -0700
    start on runlevel [2345]

to the override file. A more complicated one might remove all limits from the override file, but I think the former is more elegant, and the parsing of 3 or 4 start on's is pretty close to computationally free, so I don't see any downsides to keeping it simple.

If you want to just remove one limit, forget about it applying only to the limits you've explicitly added. It can delimit *anything*. So

    delimit cron start on started mysql

just adds started mysql as an "OR" condition for cron. This becomes an elegant and simple to understand tool for any automation system to add safe event conditions. Since AND requires thought before doing, let's leave that to manual intervention.

>
> * show-limit [<job> [<expr>]]
>
>   Show limits for all jobs or specified job.
>
>   Command should emit a warning if any limit is found that is not a
>   subset of the "start on" for the job in question (since the limit
>   will have no effect).
>
>   If no expression is supplied, show "raw" limit. If an expression
>   *is* specified, determine if job would run given that expression.
>
>   Example: Assume a job specifies "start on runlevel [345]". If a
>   limit of "runlevel RUNLEVEL=4" has been set, we want a higher-level
>   tool to be able to query directly if the job would run in runlevel 4
>   so returning "runlevel [345]" isn't that helpful.
>   What we really want to say is:
>
>     $ show-limit foo runlevel 4

This falls back on upstart/initctl, I think. A command that says "show me the possible event chain(s) that lead to job X starting" would be highly useful even without limits. If it can put a * next to every condition that is overridden, that might be helpful.

>
>   And have the tool display whether for "runlevel 4" job foo would run
>   based on the limit of "runlevel [345]". This could be displayed in
>   parseable format and also maybe returned via the return code.
>
>   Thought: maybe we could add a "query-limit" command specifically for
>   this and have "show-limit" just return the "raw" limit details?
>
>
> = Implementation Details =
>
> == Limit Condition ==
>
> To satisfy the chkconfig requirement, we could just allow a single event
> and optional environment to be specified. However, the better solution
> is to allow an arbitrary condition (like "start on" and "stop on"). The
> condition could almost be viewed as a "restrict on" stanza. Only one
> such limit condition may be specified.
>
> XXX: Note that the condition itself -- for the example of runlevels --
> cover all the runlevels where that job must not run. This is an
> important point: the condition only specifies a single runlevel if that
> job should only be disabled in a single runlevel. The "norm" is
> probably more likely to be where the condition covers *more than one*
> runlevel. This is perfectly acceptable since "show-limit" allows an
> *actual* runlevel to be specified so a higher-level tool can establish
> if a job would be disabled for a particular runlevel.
>
> == Matching Limits to Events ==
>
> If a job condition becomes "true" such that Upstart would normally
> attempt to start the job and if that job has a limit condition which
> "matches" part of the EventOperator tree, Upstart will not run the job.

If we just have limit as a tool that manipulates the override file, this is no longer part of the implementation, is it?
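For what it's worth, the masking I described for the cron case is simple enough to sketch. The Python below is only an illustration of the idea, not an existing tool: parse_runlevels and limit are hypothetical names, and it only understands a lone runlevel expression such as 'runlevel [2345]', 'runlevel 4' or 'runlevel RUNLEVEL=4'. Anything with "and" / "or" would need a real parser for upstart's event conditions.

```python
import re

# Accepts 'runlevel [2345]', 'runlevel 4' and 'runlevel RUNLEVEL=4'.
RUNLEVEL_RE = re.compile(r"runlevel\s+(?:RUNLEVEL=)?\[?([0-6S]+)\]?$")


def parse_runlevels(expr):
    """Turn a simple runlevel expression into a set of runlevel characters."""
    match = RUNLEVEL_RE.match(expr.strip())
    if match is None:
        raise ValueError("unsupported expression: %r" % expr)
    return set(match.group(1))


def limit(start_on, limit_expr):
    """Mask the runlevels named in limit_expr out of start_on.

    Returns the unchanged condition when the limit is a no-op, the
    reduced condition otherwise, or None when no runlevel is left.
    """
    current = parse_runlevels(start_on)
    remaining = current - parse_runlevels(limit_expr)
    if remaining == current:
        return start_on  # no-op: the tool would do nothing and exit 0
    if not remaining:
        return None  # nothing left to start on
    return "runlevel [%s]" % "".join(sorted(remaining))


step1 = limit("runlevel [2345]", "runlevel [35]")
step2 = limit(step1, "runlevel RUNLEVEL=4")
print(step1)  # runlevel [24]
print(step2)  # runlevel [2]
```

Each result would be appended to the override file as a new start on stanza, which is why the tool has to start from the current effective condition rather than from the original .conf.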
>
> === Examples ===
>
>   start on : runlevel [2345]
>   runlevel : 2
>   limit    : runlevel 2
>   outcome  : match - job will be disabled in runlevel 2.
>
>   start on : runlevel [2345]
>   runlevel : 2
>   limit    : runlevel
>   outcome  : match - job will be disabled in runlevel 2.
>
>   start on : runlevel
>   runlevel : 2
>   limit    : runlevel [2345]
>   outcome  : match - job will be disabled in runlevel 2.
>
>   start on : runlevel 2
>   runlevel : 2
>   limit    : runlevel [2345]
>   outcome  : match - job will be disabled.
>
>   start on : runlevel RUNLEVEL=2
>   runlevel : 2
>   limit    : runlevel [2345]
>   outcome  : match - job will be disabled.
>
>   start on : runlevel [2345]
>   runlevel : 2
>   limit    : runlevel RUNLEVEL=2
>   outcome  : match - job will be disabled.
>
>   start on : runlevel RUNLEVEL=2
>   runlevel : 2
>   limit    : runlevel [2345]
>   outcome  : match - job will be disabled.
>
>   start on : runlevel RUNLEVEL=2 PREVLEVEL=S
>   runlevel : 2
>   limit    : runlevel [2345]
>   outcome  : match - job will be disabled.
>
>   start on : runlevel RUNLEVEL=2
>   runlevel : 2
>   limit    : runlevel [2345] S
>   outcome  : no match - job will run.
>
>   start on : runlevel 2
>   runlevel : 2
>   limit    : runlevel [345]
>   outcome  : no match - job will run. warning will be generated since
>              limit cannot match the start on condition.
>
>   start on : foo or runlevel 2
>   runlevel : 2 (foo has not been emitted).
>   limit    : runlevel [2345]
>   outcome  : match? I think yes.
>
>   start on : foo and runlevel 2
>   runlevel : 2 (and foo has been emitted).
>   limit    : runlevel [2345]
>   outcome  : match - job will not run.
>

Right, all of these are handled gracefully if limit just masks out the conditions passed to it.

>
> == Storage of Limit Conditions ==
>
> The two main ideas here are:
>
> * Create a single file to store all limit information.
>
>   A good location might be "/etc/init.limit".
>   This file would store
>   job restriction details in a simple format such as:
>
>     <job> [<condition>]
>
>   So, if job "cron" was disabled entirely, it would contain:
>
>     cron
>
>   Whereas if the job was disabled in runlevels 3-5 it would contain:
>
>     cron runlevel [345]
>
>   If the file exists on startup, Upstart would read the job
>   limit details.
>
>   Pros:
>
>   * Single file outside of /etc/init/ so might be "safer" in the case
>     where an admin ran "cd /etc/init; rm * .override" say by mistake.
>
>   * It would be a "single point of definition" and thus easier to
>     backup and apply to other systems maybe?
>
>   Cons:
>
>   * File would nominally need to be rewritten each time a change was
>     made. Might not be too bad since changing limits is perceived as
>     being an irregular activity (but tell me if you have other views on
>     this! :)
>
>   * Possible locking issues if multiple requests came in to change a
>     limit at the same time.
>
> * Create per job files
>
>   In a similar fashion to the existing ".conf" and ".override" files,
>   we could introduce "/etc/init/<job>.limit". If this file existed
>   and was empty, the job would be fully disabled (never automatically
>   started). However, if it contains "<condition>", that would be applied.
>
>   Pros:
>   * Analog to ".conf" and ".override" so familiar to users.
>
>   Cons:
>
>   * Easy to inadvertently delete a ".limit" file maybe?
>
>   * We're starting to create a lot of files now. Theoretically there
>     could now be 3 files / job (".conf", ".override" and ".limit").
>     We're not likely to reach the inotify limit (4096 watches?) yet,
>     but it is something to be aware of, moreso in the server or maybe
>     development server environment.
>
> However the Limit Condition file(s) is/are created, care needs to be
> taken to ensure that it is not possible to lose data should
> the system fail / be rebooted in mid-write.

Option 3, just store them as overrides.

Pros:

* Single point for admins to go to look for overridden settings for a job.
* Implementation would simply be a script that is able to parse and understand upstart's event conditions.
* No features need to be added to upstart itself. It may be useful to expose the job parsing as a library, but that is not *essential*.

Cons:

* May conflict with other tools that manipulate override.
* May confuse admins who are using override without expecting a system level tool to override their .. overrides.

-- 
upstart-devel mailing list
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/upstart-devel
