Quoth Keith M Wesolowski on Fri, Nov 16, 2007 at 03:29:49PM -0800:
> Thread 4 here is servicing the getpwuid_r() from configd. configd's
> thread 5 is, in turn, servicing (or not) the datael_get_child_locked
> call from nscd. What's happening here? Pretty simple, really - the
> rc_node_delete() call has set RC_NODE_DYING_FLAGS in preparation for
> deleting it near rc_node.c:3379, then later calls perm_granted() near
> 3409. nscd is attempting to fill in the children of
> svc:/network/nis/client:default and is waiting for
> RC_NODE_CHILDREN_CHANGING to be cleared. Clearly, that's never going
> to happen.

As you note below, I think this is a bug in svc.configd -- I don't see
an inherent reason why looking up property groups in an instance should
block on an authorization check.

> More generally, now. I can certainly file a bug on the above (it's
> really another variation on 6598922) but there are other possible
> failure modes here, and it's very hard to be sure we've gotten all of
> them. For example, suppose an NSS database backend (or NSS itself)
> were to store pieces of its configuration in the SMF repository. When
> it goes to look up those values to service a lookup, it would be
> vulnerable to the same type of deadlock.
>
> Duckwater folks: Are you aware of this? How do you intend to solve
> (or avoid) it?
>
> SMF folks: is there hope of establishing (and verifying compliance
> with) a set of locking rules that can definitely prevent these
> conditions? Is it worth doing, or is some alternate set of
> restrictions more appropriate (perhaps banning the name service
> components from calling into configd)?

I think we can solve this by committing to the conditions under which
certain libscf calls will not block on name service lookups, and
reciprocally for the name services team. I don't know whether an ARC
contract is appropriate for this, but I suspect a formal document of
that ilk is in order.

The problem, of course, is enforcement. In svc.configd, I think that as
we begin operations which we have committed to be name service-free, we
could set some thread-specific data. Then we could replace those lookup
functions with wrappers that assert that the no-nss flag is clear. That
would work for direct invocations, but not mutexes. I suspect that when
a thread with the no-nss flag set locks a mutex, we could annotate it
(the mutex) as no-nss as well, and make any other thread inherit the
flag while it holds the lock. Then with a sufficiently thorough test
suite, we should be able to catch violations during development.

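To make the enforcement idea concrete, here is a rough sketch of what I
have in mind. None of this exists in svc.configd: the names no_nss_enter,
no_nss_exit, nss_allowed, tagged_mutex_t and checked_getpwuid_r are all
made up for illustration. It assumes C11 _Thread_local as a stand-in for
real thread-specific data, a depth counter rather than a single flag so
that nesting works, and the POSIX form of getpwuid_r().

```c
#include <assert.h>
#include <pthread.h>
#include <pwd.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-thread counter: > 0 means "must not call into NSS". */
static _Thread_local int no_nss_depth = 0;

void no_nss_enter(void) { no_nss_depth++; }

void no_nss_exit(void)
{
	assert(no_nss_depth > 0);
	no_nss_depth--;
}

bool nss_allowed(void) { return (no_nss_depth == 0); }

/* Hypothetical mutex wrapper carrying a sticky no-nss annotation. */
typedef struct tagged_mutex {
	pthread_mutex_t	lock;
	bool		no_nss;		/* protected by lock */
} tagged_mutex_t;

void tagged_mutex_init(tagged_mutex_t *m, bool no_nss)
{
	(void) pthread_mutex_init(&m->lock, NULL);
	m->no_nss = no_nss;
}

void tagged_mutex_lock(tagged_mutex_t *m)
{
	(void) pthread_mutex_lock(&m->lock);
	if (!nss_allowed())
		m->no_nss = true;	/* a no-nss thread annotates the mutex */
	if (m->no_nss)
		no_nss_enter();		/* the holder is no-nss while it holds it */
}

void tagged_mutex_unlock(tagged_mutex_t *m)
{
	/* no_nss only changes in tagged_mutex_lock(), so this balances it. */
	if (m->no_nss)
		no_nss_exit();
	(void) pthread_mutex_unlock(&m->lock);
}

/* Wrapper for a name service lookup: blows an assertion in no-nss context. */
int checked_getpwuid_r(uid_t uid, struct passwd *pwd, char *buf,
    size_t buflen, struct passwd **result)
{
	assert(nss_allowed());
	return (getpwuid_r(uid, pwd, buf, buflen, result));
}
```

An operation we have committed to be name service-free would bracket
itself with no_nss_enter() and no_nss_exit(). Any mutex it takes along
the way becomes annotated, and from then on a thread that calls
checked_getpwuid_r() while holding that mutex trips the assertion, even
if no deadlock actually occurs on that run.
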
You may wonder about condition variables, and particularly the RC_NODE_
flags. I'm not sure that this method can be extended to condition
variables generally, in which case code reviews would have to suffice.
But it seems to me that the flags function as an array of mutexes
anyway. In your example, the fact that looking up a property group takes
the children-changing flag could mark it as no-nss, and then the
deletion operation could blow an assertion when it calls perm_granted()
with that flag held.

Thorough test suites aren't easy to create, though, and while the above
adapts well to new code, there are windows where mutexes which will be
marked no-nss aren't. So I suspect we could trade off some code
maintainability for test suite size by marking -- at initialization
time -- mutexes which we know can be taken by no-nss operations. I
suspect this would lead to false positives for operations that are
allowed to use the name services in some contexts but not others, but
it's not clear to me that configd's operations are sufficiently
coordinated.

I did look into warlock to see if it could help for this, but I didn't
see any shouldn't-block-on-this-function annotations.

David