[smf-discuss] SMF v. NSS: deadlock central

David Bustos Tue, 20 Nov 2007 09:36:15 -0800

Quoth Keith M Wesolowski on Fri, Nov 16, 2007 at 03:29:49PM -0800:
> Thread 4 here is servicing the getpwuid_r() from configd.  configd's
> thread 5 is, in turn, servicing (or not) the datael_get_child_locked
> call from nscd.  What's happening here?  Pretty simple, really - the
> rc_node_delete() call has set RC_NODE_DYING_FLAGS in preparation for
> deleting it near rc_node.c:3379, then later calls perm_granted() near
> 3409.  nscd is attempting to fill in the children of
> svc:/network/nis/client:default and is waiting for
> RC_NODE_CHILDREN_CHANGING to be cleared.  Clearly, that's never going
> to happen.


As you note below, I think this is a bug in svc.configd -- I don't see
an inherent reason why looking up property groups in an instance should
block on an authorization check.

> More generally, now.  I can certainly file a bug on the above (it's
> really another variation on 6598922) but there are other possible
> failure modes here, and it's very hard to be sure we've gotten all of
> them.  For example, suppose an NSS database backend (or NSS itself)
> were to store pieces of its configuration in the SMF repository.  When
> it goes to look up those values to service a lookup, it would be
> vulnerable to the same type of deadlock.
> 
> Duckwater folks: Are you aware of this?  How do you intend to solve
> (or avoid) it?
> 
> SMF folks: is there hope of establishing (and verifying compliance
> with) a set of locking rules that can definitely prevent these
> conditions?  Is it worth doing, or is some alternate set of
> restrictions more appropriate (perhaps banning the name service
> components from calling into configd)?

I think we can solve this by committing to the conditions under which
certain libscf calls will not block on name service lookups, and
reciprocally for the name services team.  I don't know whether an ARC
contract is appropriate for this, but I suspect a formal document of
that ilk is in order.

The problem, of course, is enforcement.  In svc.configd, I think that as
we begin operations which we have committed to be name service-free, we
could set a flag in thread-specific data.  Then we could replace the name
service lookup functions (getpwuid_r() and the like) with wrappers that
assert that the no-nss flag is clear.  That would catch direct
invocations, but not a no-nss thread blocking on a mutex whose holder is
making a lookup.  I suspect that when a thread with the no-nss flag set
locks a mutex, we could annotate it (the mutex) as no-nss as well, and
make any other thread inherit the flag while it holds the lock.  Then
with a sufficiently thorough test suite, we should be able to catch
violations during development.
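
A minimal sketch of the scheme just described, assuming POSIX threads
and the five-argument POSIX form of getpwuid_r().  Every name here
(nonss_enter(), nonss_mutex_t, checked_getpwuid_r() and so on) is
hypothetical; none of it is existing svc.configd or libscf code.

```c
#include <assert.h>
#include <pthread.h>
#include <pwd.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical: per-thread "no-nss" depth kept in thread-specific data. */
static pthread_key_t nonss_key;
static pthread_once_t nonss_once = PTHREAD_ONCE_INIT;

static void
nonss_key_init(void)
{
	(void) pthread_key_create(&nonss_key, NULL);
}

static intptr_t
nonss_depth(void)
{
	(void) pthread_once(&nonss_once, nonss_key_init);
	return ((intptr_t)pthread_getspecific(nonss_key));
}

/* Call when beginning an operation committed to be name service-free. */
void
nonss_enter(void)
{
	(void) pthread_setspecific(nonss_key, (void *)(nonss_depth() + 1));
}

void
nonss_exit(void)
{
	assert(nonss_depth() > 0);
	(void) pthread_setspecific(nonss_key, (void *)(nonss_depth() - 1));
}

int
nonss_is_set(void)
{
	return (nonss_depth() != 0);
}

/* Wrapper for a lookup function: asserts that the no-nss flag is clear. */
int
checked_getpwuid_r(uid_t uid, struct passwd *pwd, char *buf, size_t len,
    struct passwd **res)
{
	assert(!nonss_is_set());
	return (getpwuid_r(uid, pwd, buf, len, res));
}

/* A mutex that remembers whether a no-nss thread has ever locked it. */
typedef struct nonss_mutex {
	pthread_mutex_t	nm_lock;
	int		nm_nonss;	/* sticky; protected by nm_lock */
} nonss_mutex_t;

void
nonss_mutex_lock(nonss_mutex_t *m)
{
	(void) pthread_mutex_lock(&m->nm_lock);
	if (nonss_is_set())
		m->nm_nonss = 1;	/* annotate the mutex */
	if (m->nm_nonss)
		nonss_enter();		/* holder inherits the flag */
}

void
nonss_mutex_unlock(nonss_mutex_t *m)
{
	if (m->nm_nonss)
		nonss_exit();		/* drop the inherited flag */
	(void) pthread_mutex_unlock(&m->nm_lock);
}
```

With this shape, a thread that calls checked_getpwuid_r() while holding
an annotated mutex trips the assertion, even if that thread never
entered a no-nss operation itself.  The annotation only appears once a
no-nss thread has actually taken the lock, so coverage depends on the
test suite exercising that path first.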

You may wonder about condition variables, and particularly the RC_NODE_
flags.  I'm not sure that this method can be extended to condition
variables generally, in which case code reviews would have to suffice.
But it seems to me that the flags function as an array of mutexes
anyway.  In your example, the fact that looking up a property group
takes the children-changing flag could mark it as no-nss, and then the
deletion operation could blow an assertion when it calls perm_granted()
with that flag held.
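
Treating the flags as an array of mutexes might look roughly like the
following.  This is only a sketch: node_t, node_hold_flag(),
node_rele_flag() and the flag values are hypothetical stand-ins, not
the real rc_node.c structures, and the per-thread counter uses C11
_Thread_local for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define	NODE_CHILDREN_CHANGING	0x01	/* stand-in for the RC_NODE_ flag */
#define	NODE_OTHER_FLAG		0x02	/* hypothetical second flag */

typedef struct node {
	pthread_mutex_t	n_lock;
	pthread_cond_t	n_cv;
	uint32_t	n_flags;	/* flags currently held */
	uint32_t	n_nonss;	/* flags a no-nss thread has taken */
} node_t;

/* Hypothetical per-thread no-nss depth. */
static _Thread_local int nonss_depth;

void nonss_enter(void) { nonss_depth++; }
void nonss_exit(void) { assert(nonss_depth > 0); nonss_depth--; }
int nss_allowed(void) { return (nonss_depth == 0); }

/* Take a flag, waiting on the condition variable until it is clear. */
void
node_hold_flag(node_t *np, uint32_t flag)
{
	(void) pthread_mutex_lock(&np->n_lock);
	while (np->n_flags & flag)
		(void) pthread_cond_wait(&np->n_cv, &np->n_lock);
	np->n_flags |= flag;
	if (!nss_allowed())
		np->n_nonss |= flag;	/* a no-nss path uses this flag */
	if (np->n_nonss & flag)
		nonss_depth++;		/* holder inherits no-nss */
	(void) pthread_mutex_unlock(&np->n_lock);
}

void
node_rele_flag(node_t *np, uint32_t flag)
{
	(void) pthread_mutex_lock(&np->n_lock);
	np->n_flags &= ~flag;
	if (np->n_nonss & flag)
		nonss_depth--;		/* drop the inherited no-nss */
	(void) pthread_cond_broadcast(&np->n_cv);
	(void) pthread_mutex_unlock(&np->n_lock);
}

/* What a wrapped authorization check would do before any lookup. */
void
nss_lookup_checkpoint(void)
{
	assert(nss_allowed());
}
```

Once a property group lookup running as no-nss has taken
NODE_CHILDREN_CHANGING, any later holder of that flag, such as a
deletion, would fail nss_lookup_checkpoint() instead of deadlocking.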

Thorough test suites aren't easy to create, though, and while the above
adapts well to new code, there are windows during which mutexes that
will eventually be marked no-nss have not been marked yet.  So I suspect
we could trade off some code maintainability for test suite size by
marking, at initialization time, the mutexes which we know can be taken
by no-nss operations.  I suspect this would lead to false positives for
operations that are allowed to use the name services in some contexts
but not others, but it's not clear to me that configd's operations are
sufficiently coordinated.

I did look into warlock to see if it could help for this, but I didn't
see any shouldn't-block-on-this-function annotations.


David
