David Bustos wrote:
> Quoth Tony Nguyen on Thu, Oct 23, 2008 at 04:43:04PM -0700:
>> David Bustos wrote:
>>> Quoth Tony Nguyen on Sun, Oct 19, 2008 at 02:49:40AM -0700:
>>>> This is for 6761070 PSARC 2008/580 Solaris host-based firewall
>>>> <http://monaco.sfbay/detail.jsf?cr=6761070>
>>>>
>>>> The webrev and the new event message are respectively at:
>>>>
>>>> http://cr.opensolaris.org/~tonyn/firewall/
>>>>
>>>> http://cr.opensolaris.org/~tonyn/firewall/8000-R4
> ...
>> I agree the restarter protocol, restarters, or librestart would probably be a more appropriate place for these changes. However, these changes imply:
>>
>> - third-party restarters, in addition to services, need modification to use the firewall framework
>>
>> - potential synchronization issues between services under different restarters trying to manipulate IPfilter (simultaneous enable/disable of IPfilter)
>>
>> - and in the case of a new dedicated restarter, services managed under another restarter wouldn't be able to use the firewall framework.
>>
>> I do, however, agree that we should revisit the long-term plan for observing service changes.
>
> That's true. There is one entity which can observe and exert control over all services regardless of restarter, though: svc.startd's graph engine. Much like dependencies are evaluated without knowledge of the service or its restarter, IP Filter rules could be installed.

Yes, the graph engine can observe and request restarters to take appropriate actions. The restarters, currently, are typically responsible for invoking services' methods/actions. My point is that changes in the graph engine would most likely require cooperating changes in all restarters. Moreover, we should probably figure out a more generic mechanism to observe service changes and invoke corresponding actions, not just for this host-based firewall daemon.

> I'm not sure the difference in functionality, susceptibility to bugs, or amenability to future changes is small enough that it's ok to proceed with the current architecture. What's your judgement?

The architecture is essentially made up of IPfilter rule generation and service observability that updates the rules by invoking the rule generation engine. If a better observability mechanism exists, it can replace the current daemon and directly invoke the rule generation engine. I don't see any functional difference. However, susceptibility to bugs is certainly a valid concern, as we're essentially intercepting events intended for the restarters and forming our own conclusions about how they affect firewall configuration. But even if we choose to make changes at the graph engine level specifically for firewall configuration, the logic to determine whether certain events affect firewall configuration is still similar to what we currently have in svc.ipfd.
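To make that separation concrete, here's a rough sketch of the shape I have in mind. The helper names (wait_for_service_event(), update_firewall_rules()) are placeholders for this sketch only, not the actual functions in svc.ipfd:

#include <stdio.h>
#include <string.h>

/*
 * Illustrative stubs: the observability layer reports which instance
 * changed; the rule generation engine is invoked for that instance.
 * A different observer (the graph engine, say) could drive the same
 * generation entry point directly.
 */
static int
wait_for_service_event(char *fmri, size_t sz)
{
        /* svc.ipfd blocks on repository notifications here */
        (void) strlcpy(fmri, "svc:/network/ssh:default", sz);
        return (0);
}

static int
update_firewall_rules(const char *fmri)
{
        /* svc.ipfd would regenerate and reload IPfilter rules here */
        return (printf("updating rules for %s\n", fmri) < 0);
}

int
main(void)
{
        char fmri[256];

        /* the real daemon loops over this for the life of the service */
        if (wait_for_service_event(fmri, sizeof (fmri)) == 0 &&
            update_firewall_rules(fmri) != 0)
                (void) fprintf(stderr, "firewall update failed for %s\n",
                    fmri);

        return (0);
}

Whichever component hosts that loop, the decision logic and the generation engine stay the same; only the observer changes.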
> Another thing that concerns me is the tying of this functionality to the state of network/ipfilter:default. Won't we have more flexibility if this behavior can be turned on or off through a separate command?

That's because the host-based firewall feature is entirely built on network/ipfilter functionality. In other words, if network/ipfilter isn't running, we can't load/unload ipf rules. Moreover, the network/ipfilter service contains the default system-wide firewall policies. However, I see your point. The host-based firewall abstracts network/ipfilter functionality, but users are required to enable network/ipfilter to activate the host-based firewall. Thus, the firewall has a hard dependency on network/ipfilter, but the abstraction isn't quite correct. There's going to be a follow-up firewall panel via the Visual Panels project that will provide the host-based firewall enabling/disabling abstraction. In the meantime, we can undo the network/ipfilter:default commitment in the ARC case to retain that flexibility.

> ...
>>> cmd/ipf/svc/ipfd.c
> ...
>> As a general response to most of your concerns about the error handling in this file: this daemon is started by network/ipfilter, which currently is a transient service. I don't speak for the Networking team, but my guess is that the team also wants users to be able to manipulate IPfilter outside of SMF (ipf -E|D). Consequently, this daemon, svc.ipfd, is started by a transient service and won't get restarted by svc.startd.
>
> Can you elaborate on the connection between those two?

I don't have a definitive answer for why network/ipfilter is a transient service. It was background information on why I wanted svc.ipfd to stay up as much as possible, since svc.startd isn't going to restart processes in orphan contracts.

> Does ipf -E|D kill your daemon?

ipf -E|D makes ioctl calls to /dev/ipf; it has no effect on the service's daemons (ipmon and svc.ipfd).

>> My goal is to have svc.ipfd run as long as network/ipfilter is online and not interrupt it for reasons outside of its control, svc.configd or libscf interactions. This way, svc.ipfd's availability is consistent with ipmon, the other process started by network/ipfilter. Thus, the scf-related errors are not handled as well as they should be. I'll address your concerns as much as I can and would appreciate your suggestions on how we can best achieve such a goal.
>
> It's ok to ignore errors. Specifically when they're expected, or they have no impact on functionality expected by the user. When it's not clear to me that either of these is the case, I've asked you to explain. And usually when that's required, I think you should also add a comment so people reading or modifying this code in the future will also understand.

Agreed.

> ...
>>> 102: Why is it ok to return 0 if scf_handle_bind() fails for reasons other than SCF_ERROR_NO_RESOURCES?
>> The return value is meaningless here. Essentially, we try our best and exit the function. It should probably be a void function.
>
> It seems that the return value is used at line 559. The other uses assume that the handle is bound. If scf_handle_bind() fails, we know it isn't bound, and shouldn't return. We should either retry, exit, or return a value and make the callers handle the error appropriately. And perhaps emit a message to the user explaining the impact of the error.

Ah...yes, we can't continue with an unbound handle. This should be a void function that exits if it fails to bind the handle and emits an error explaining the impact.
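For what it's worth, the rebind path I have in mind is roughly the following. This is a minimal sketch using the usual libscf calls (scf_handle_unbind(), scf_handle_bind(), scf_error(), scf_strerror()); the retry on SCF_ERROR_NO_RESOURCES is my guess at keeping that case non-fatal, not the exact code that will be in the webrev:

#include <libscf.h>
#include <stdlib.h>
#include <syslog.h>
#include <unistd.h>

/*
 * Sketch: (re)bind the repository handle.  If the handle cannot be
 * bound there is nothing useful the daemon can do, so emit an error
 * explaining the impact and exit rather than return a value that
 * callers, which assume a bound handle, would have to interpret.
 */
static void
repository_rebind(scf_handle_t *h)
{
        (void) scf_handle_unbind(h);

        while (scf_handle_bind(h) != 0) {
                if (scf_error() == SCF_ERROR_NO_RESOURCES) {
                        /* repository is busy; back off and retry */
                        (void) sleep(1);
                        continue;
                }

                syslog(LOG_ERR, "scf_handle_bind() failed (%s); "
                    "firewall configuration will no longer track "
                    "service changes", scf_strerror(scf_error()));
                exit(1);
        }
}

With a void signature, callers no longer have to interpret a return value from a handle that may not be bound.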
> ...
>>> 163: Please add
>>>
>>> assert(max_scf_value_size >= scf_limit(SCF_LIMIT_MAX_NAME_LEN) + 1);
>>>
>>> somewhere before this.
>> I changed the initialization in main() so we wouldn't need to keep checking.
>>
>> max_scf_fmri_size = scf_limit(SCF_LIMIT_MAX_FMRI_LENGTH) + 1;
>> max_scf_value_size = scf_limit(SCF_LIMIT_MAX_VALUE_LENGTH) + 1;
>> assert(max_scf_fmri_size > 0);
>> assert(max_scf_value_size > 0);
>
> Actually my point was that you're fetching a name into a buffer sized for a value. This is currently safe because values can be much longer than names. But I don't think that's inherent in the abstractions, so I think you should assert() that it's the case, so if it changes, we'll know before a bug about bizarre behavior is filed.

Ah, I missed the name part. Thanks.
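Concretely, the initialization would become something like the sketch below, assuming SCF_LIMIT_MAX_NAME_LENGTH is the limit you meant; the init_scf_limits() wrapper is only for the sketch, the code itself sits in main():

#include <assert.h>
#include <libscf.h>

static ssize_t max_scf_fmri_size;
static ssize_t max_scf_value_size;

static void
init_scf_limits(void)
{
        max_scf_fmri_size = scf_limit(SCF_LIMIT_MAX_FMRI_LENGTH) + 1;
        max_scf_value_size = scf_limit(SCF_LIMIT_MAX_VALUE_LENGTH) + 1;
        assert(max_scf_fmri_size > 0);
        assert(max_scf_value_size > 0);

        /*
         * Names are fetched into value-sized buffers.  That is safe
         * today because values may be much longer than names, but
         * assert it so we notice if the limits ever change.
         */
        assert(max_scf_value_size >=
            scf_limit(SCF_LIMIT_MAX_NAME_LENGTH) + 1);
}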
>>> 164: What should a user do if he sees this log message?
>> The problem is most likely in SMF and this daemon is another victim. I don't see what a user can do. What do you suggest we can do here?
>
> Well what is the impact of this error? If there is none, then you should probably not log a message.

There's no user impact. It can probably be a debug message rather than an error message.

>>> 310: I presume you considered launching ipfilter in a separate process contract?
>> That doesn't buy us much since network/ipfilter is a transient service. Do you see any benefit?
>
> I don't recall whether when one child dies in a transient service, all processes are killed. If so, then this would be even more important because SMF won't restart you. But it should be pretty rare and can be deferred if you want.

I believe events in orphan contracts are not processed since there's no registered listener. Thus, dying children shouldn't affect the other processes in an orphan contract.

>>> 312: What should a user do if he sees this log message?
>> Again, I'm not sure how a user can correct such a systematic problem?
>
> Well if it's because the system is low on memory, then maybe he can add memory. If it's because the software is defective, then he can report a bug. If it's because the hardware is introducing undetected memory errors, then he should investigate his hardware. And perhaps more importantly, if his firewall is now inconsistent, then he should probably take steps to correct that. To the extent that you can, I think your error message should convey information which helps the user decide.

Understood. It emits the error and has the caller explain the impact, which is the convention I'll try to follow for the rest of the changes.

>>> 359: I didn't see this behavior of preferring inetd/isrpc over firewall_context/isrpc in the ARC case. Did I miss it? Or does it not matter?
>> An RPC service has either the inetd or the firewall_context pg but not both. There isn't a preference, as order does not matter.
>
> Oh, I missed that in the ARC case. Where is it explained? And will this be enforced, or is it up to the service author?

See Firewall Static Configuration, Developer Information, and Exported Interfaces. This property currently exists for inetd services to signal RPC services. We're duplicating the behavior for non-inetd services. If the property is missing, the service is considered to be a non-RPC service.

>>> 407: Please add a comment explaining why it's ok to continue if repository_rebind() failed.
>
> I presume you accepted this comment.

repository_rebind() now exits if it fails to bind the handle, so we no longer continue here.

>>> 414: It's not clear to me that it's ok to retry in such a tight loop for all errors of _scf_notify_add_pgtype() caught by the default case. It would help for you to catch the errors explicitly and abort() on unknown errors.
>> But I'd like to avoid aborting :^( What other options do we have? If there aren't any, I agree it should abort. What do you think?
>
> You should determine the other errors _scf_notify_add_pgtype() can fail with. For errors such as out-of-memory, you should pause before retrying. For unknown errors, I would abort() because it means that the interface is documented incorrectly, or the system is suffering memory errors, and the core should help us understand which. If you don't think a core would be useful, then you could emit an error message instead. And if you would rather make it easier for future libscf coders to introduce new errors, then you can ignore the unknown errors.

The other meaningful error from _scf_notify_add_pgtype() is SCF_ERROR_NO_RESOURCES, for which we can sleep one second before retrying. For unknown errors, we can emit an error and abort. What do you think?
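Something along these lines, then. A sketch only: it assumes the private _scf_notify_add_pgtype(handle, pgtype) interface follows the usual libscf convention of returning non-zero and setting scf_error() on failure, and register_pg_notifications() is a hypothetical wrapper name for the sketch:

#include <libscf.h>
#include <stdlib.h>
#include <syslog.h>
#include <unistd.h>

/* assumed signature of the private libscf notification interface */
extern int _scf_notify_add_pgtype(scf_handle_t *, const char *);

/*
 * Sketch: register for notifications on a property group type.
 * Retry (with a pause) only on resource shortage; treat any other
 * error as a bug and abort() so a core is available for diagnosis.
 */
static void
register_pg_notifications(scf_handle_t *h, const char *pgtype)
{
        while (_scf_notify_add_pgtype(h, pgtype) != 0) {
                switch (scf_error()) {
                case SCF_ERROR_NO_RESOURCES:
                        /* repository is busy; back off and retry */
                        (void) sleep(1);
                        break;

                default:
                        syslog(LOG_ERR, "_scf_notify_add_pgtype() failed: "
                            "%s; aborting", scf_strerror(scf_error()));
                        abort();
                }
        }
}

That keeps the tight retry only for the resource-shortage case and leaves a core behind for anything undocumented.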
>>> 422: Please add a comment explaining why it's ok to ignore errors from repository_rebind().
>> repository_rebind doesn't return any useful error at the moment.
>
> Why not? Do you regard returning with h unbound a success?

repository_rebind() now exits if it fails to bind the handle.

>>> 426-535: Is there a reason this code can't be in a separate function?
>> No, it's simple enough that I thought it'd be ok in the same function. If you feel strongly about it, I can have it in a separate function.
>
> I think it's too long and has too much indentation to be considered simple. Please move it to a separate function.
> ...

Done.

>>> 438: I'm not sure it's appropriate to log this message without explaining what its consequences are.
>> Yes, "Lost repository event" may not be a clear consequence. Do you have other suggestions?
>
> Well what are the consequences of the error?

I moved this section into a new function which processes repository events and emits mostly debug messages. However, if the function fails, the caller will emit an error telling the user that a service may have an incorrect IPfilter configuration.

>>> 443: I believe SCF_ERROR_NOT_SET signifies a bug in libscf or svc.configd. Shouldn't you abort() if it occurs?
>> If it's a bug in configd or libscf, this daemon should continue to run so that it can process future events when configd or libscf issues are sorted out.
>
> How can the configd or libscf issues be sorted out?

My point was that bugs in configd or libscf, possibly temporary issues, are outside of this daemon's control and shouldn't terminate this daemon. However, I agree that a core here and in 446 would help diagnose problems. It'll abort.

>>> 446: What should a user do if he sees this log message?
>> The problem is most likely in SMF and this daemon is another victim. I don't see what a user can do. What do you suggest we can do here?
>
> Well what errors can lead to this? What are their impacts? Do you think a core would not be useful in diagnosing a problem here?

It'll abort.

>>> 458: Please add a comment explaining why it's ok to ignore the other errors.
>> I added a generic comment explaining that we have minimal scf error handling since most scf errors are outside the control of the daemon; essentially we serve when we can. Thus, rebind if the connection is broken and simply continue and wait for future events, and hope that SMF can sort itself out.
>
> I don't think hope should be part of this algorithm, especially since it's supposed to enforce security. Unless this error results in no impact on functionality, at the least an error message should be issued.

An error is issued.

>>> 462: This buffer is one character too small. In theory, you should allocate a buffer of size scf_limit(SCF_LIMIT_MAX_FMRI_LENGTH) + 1.
>> Hmm...Would it make more sense if we change the definition to have the right value? An RFE?
>
> It does have the right value, as currently defined. I presume you mean that you don't think the current definition is as useful as a new definition. (The size of the buffer required, rather than the maximum length of the string.) You can certainly file an RFE for that.

Correct, it'd be nice to have more useful definitions.

>>> 503: What should a user do if he sees this log message?
>> If the update fails, the problem is probably outside of the user's control: an error in allocating memory, forking, or getting the service's state. What do you suggest we can do here?
>
> I think sometimes errors allocating memory or getting a service's state are in the user's control. I suggest you explain the impact of the error here, and possibly explain likely causes and their remedies.

ipfilter_update() is modified to emit errors, and its caller will report the impact: the firewall update failed for that FMRI.

>>> lib/librestart/common/librestart.c
> ...
>>> instance_validate_aux_fmri(): Is there a reason this function is in this part of the file?
>> What's the more appropriate location?
>
> If there is no reason for it to be there, then move it to the end. If there are related functions, then you should move it near them.

Done.

There are quite a few changes from this round of review. I'll do some testing and generate another webrev once we conclude this conversation.

Thanks,
-tony