Quoth Tony Nguyen on Thu, Oct 23, 2008 at 04:43:04PM -0700: > David Bustos wrote: > > Quoth Tony Nguyen on Sun, Oct 19, 2008 at 02:49:40AM -0700: > >> This is for 6761070 PSARC 2008/580 Solaris host-based firewall > >> <http://monaco.sfbay/detail.jsf?cr=6761070>* > >> > >> *The webrev and the new event message are respectively at: > >> > >> http://cr.opensolaris.org/~tonyn/firewall/ > >> > >> http://cr.opensolaris.org/~tonyn/firewall/8000-R4 ... > I agree the restarter protocol, restarters or librestart, would probably > be a more appropriate places for these changes. However, these changes > imply: > > - third-party restarters, in addition to services need modification to > use firewall framework > > - potential synchronization issues between services under different > restarters trying to manipulate IPfilter (simultaneous enable/disable of > IPfilter). > > - and in the case of new dedicate restarter, services managed under > another restarter wouldn't be able to use the firewall framework. > > I do, however, agree that we should revisit the long-term plan for > observing service changes.
That's true. There is one entity which can observe and exert control over all services regardless of restarter, though: svc.startd's graph engine. Much like dependencies are evaluated without knowledge of the service or its restarter, IP Filter rules could be installed. I'm not sure the difference in functionality, susceptibility to bugs, or amenability to future changes is small enough that it's ok to proceed with the current architecture. What's your judgement? Another thing that concerns me is the tying of this functionality to the state of network/ipfilter:default. Won't we have more flexibility if this behavior can be turned on or off though a separate command? > > cmd/cmd-inet/usr.lib/inetd/inetd.c > > 291: Where is this handle destroyed? > > The caller is responsible for destroying this handle, except when > function returns non-zero in which case it should destroy the handle. I > added commented and the handle destroy call. Ah, yes, you destroy it in inst_reset_aux_fmri(). I missed that. ... > > cmd/cmd/inet/usr.lib/inetd/repval.c ... > > scf_inst_delete_prop(): Since this function isn't in libscf, please > > use a different prefix. > > How about instance_delete_prop? That's fine. ... > > cmd/ipf/svc/ipfd.c ... > As a general response to most of your concerns about the error handlings > in this file. This daemon is started by network/ipfilter which currently > is a transient service. I don't speak for the Networking team but my > guess is that the team also wants users to be able to manipulate > IPfilter outside of SMF (ipf -E|D). Consequently, this daemon, svc.ipfd > is started by a transient service and won't get restarted by svc.startd. Can you elaborate on the connection between those two? Does ipf -E|D kill your daemon? > My goal is to have svc.ipfd run as long as network/ipfilter is online > and not interrupt it for reasons outside of its control, svc.configd or > libscf interactions. This way, svc.ipfd's availability is consistent > with ipmon, the other process started by network/ipfilter. Thus, the scf > related errors are not handled as well as they should be. I'll address > your concerns as much as I can and would appreciate your suggestions on > how we can best achieve such goal. It's ok to ignore errors. Specifically when they're expected, or they have no impact on functionality expected by the user. When it's not clear to me that either of these is the case, I've asked you to explain. And usually when that's required, I think you should also add a comment so people reading or modifying this code in the future will also understand. ... > > 102: Why is it ok to return 0 if scf_handle_bind() fails for reasons > > other than SCF_ERROR_NO_RESOURCES? > > The return value is meaningless here. Essentially, we try our best and > exit the function. It should probably be a void function. It seems that the return value is used at line 559. The other uses assume that the handle is bound. If scf_handle_bind() fails, we know it isn't bound, and shouldn't return. We should either retry, exit, or return a value and make the callers handle the error appropriately. And perhaps emit a message to the user explaining the impact of the error. ... > > 163: Please add > > > > assert(max_scf_value_size >= scf_limit(SCF_LIMIT_MAX_NAME_LEN) + 1); > > > > somewhere before this. > > I changed the initialization in main() so we wouldn't need to keep checking. > > max_scf_fmri_size = scf_limit(SCF_LIMIT_MAX_FMRI_LENGTH) + 1; > max_scf_value_size = scf_limit(SCF_LIMIT_MAX_VALUE_LENGTH) + 1; > assert(max_scf_fmri_size > 0); > assert(max_scf_value_size > 0); Actually my point was that you're fetching a name into a buffer sized for a value. This is currently safe because values can be much longer than names. But I don't think that's inherent in the abstractions, so I think you should assert() that it's the case, so if it changes, we'll know before a bug about bizarre behavior is filed. > > 164: What should a user do if he sees this log message? > > The problem is most likely in SMF and this daemon is another victim. I > don't see what user can do. What do you suggest we can do here? Well what is the impact of this error? If there is none, then you should probably not log a message. ... > > 310: I presume you considered launching ipfilter in a separate process > > contract? > > That doesn't buy us much since network/ipfilter is a transient service. > Do you see any benefit? I don't recall whether when one child dies in a transient service, all processes are killed. If so, then this would be even more important because SMF won't restart you. But it should be pretty rare and can be deferred if you want. ... > > 312: What should a user do if he sees this log message? > > Again, I'm not sure how a user can correct such systematic problem? Well if it's because the system is low on memory, then maybe he can add memory. If it's because the software is defective, then he can report a bug. If it's because the hardware is introducing undetected memory errors, then he should investigate his hardware. And perhaps more importantly, if his firewall is now inconsistent, then he should probably take steps to correct that. To the extent that you can, I think your error message should convey information which helps the user decide. ... > > 359: I didn't see this behavior of preferring inetd/isrpc over > > firewall_context/isrpc in the ARC case. Did I miss it? Or does it > > not matter? > > An RPC service has either inetd or firewall_context pg but not both. > There isn't a preference as order does not matter. Oh, I missed that in the ARC case. Where is it explained? And will this be enforced, or is it up to the service author? ... > > 407: Please add a comment explaining why it's ok to continue of > > repository_rebind() failed. I presume you accepted this comment. > > 414: It's not clear to me that it's ok to retry in such a tight loop > > for all errors of _scf_notify_add_pgtype() caught by the default > > case. It would help for you to catch the errors explicitly and > > abort() on unknown errors. > > But I'd like to avoid aborting :^( What other options do we have? If > there aren't any, I agree it should abort. What do you think? You should determine the other errors _scf_notify_add_pgtype() can fail with. For errors such as out-of-memory, you should pause before retrying. For unknown errors, I would abort() because it means that the interface is documented incorrectly, or the system is suffering memory errors, and the core should help us understand which. If you don't think a core would be useful, then you could emit an error message instead. And if you would rather make it easier for future libscf coders to introduce new errors, then you can ignore the unknown errors. ... > > 422: Please add a comment explaining why it's ok to ignore errors from > > repository_rebind(). > > repository_rebind doesn't return any useful error at the moment. Why not? Do you regard returning with h unbound a success? > > 426-535: Is there a reason this code can't be in a separate function? > > No, it's simple enough that I thought it'd be ok in the same function. > If you feel strongly about it, I can have it in a separate function. I think it's too long and has too much indentation to be considered simple. Please move it to a separate function. ... > > 438: I'm not sure it's appropriate to log this message without > > explaining what its consequences are. > > Yes, "Lost repository event" may not be a clear consequence. Do you have > other suggestions? Well what are the consequences of the error? > > 443: I believe SCF_ERROR_NOT_SET signifies a bug in libscf or > > svc.configd. Shouldn't you abort() if it occurs? > > If it's a bug in configd or libscf, this daemon should continue to run > so that it can process future events when configd or libscf issues are > sorted out. How can the configd or libscf issues be sorted out? > > 446: What should a user do if he sees this log message? > > The problem is most likely in SMF and this daemon is another victim. I > don't see what user can do. What do you suggest we can do here? Well what errors can lead to this? What are their impacts? Do you think a core would not be useful in diagnosing a problem here? ... > > 458: Please add a comment explaining why it's ok to ignore the other > > errors. > > I added a generic comment explaining that we have minimal scf error > handling since most scf errors are outside the control of the daemon, > essentially we serve when we can. Thus, rebind if connection is broken > and simply continue and wait for future events and hope that SMF can > sort itself out. I don't think hope should be part of this algorithm, especially since it's supposed to enforce security. Unless this error results in no impact on functionality, at the least an error message should be issued. > > 462: This buffer is one character too small. In theory, you should > > allocate a buffer of size scf_limit(SCF_LIMIT_MAX_FMRI_LENGTH + 1). > > Hmm...Would it make more sense if we change the definition to have the > right value? An rfe? It does have the right value, as currently defined. I presume you mean that you don't think the current definition is as useful as a new definition. (The size of the buffer required, rather than the maximum length of the string.) You can certainly file an RFE for that. ... > > 503: What should a user do if he sees this log message? > > If update fails, the problem is probably outside of user's control, > error in allocating memory, forking, or getting service's state. What do > you suggest we can do here? I think sometimes errors allocating memory or getting a service's state are in the user's control. I suggest you explain the impact of the error here, and possibly explain likely causes and their remedies. ... > > lib/librestart/common/librestart.c ... > > instance_validate_aux_fmri(): Is there a reason this function is in > > this part of the file? > > What's the more appropriate location? If there is no reason for it to be there, then move it to the end. If there are related functions, then you should move it near them. ... > > lib/librestart/common/librestart.h > > 240 > > - It seems that all of the other symbols in this file begin with > > "restarter". Is there a reason instance_validate_aux_fmri() > > doesn't? > > how about restarter_inst_validate_aux_fmri? Sure. David