Keith M Wesolowski writes:
> Recently, we've seen
>
> 6628289 svc.configd hangs in deadly embrace with nscd
> 6598922 rc_node_modify_permission_check() calls perm_granted() with rn_lock
>
> to name but two instances of the same general class of problem.
> Specifically, configd can call into NSS (either directly or through
> nscd), which can in turn call back into configd. If the call into the
> name service was made with any locks held (6598922) or flags set (see
> below), deadlock can result.
>
> Here's another instance of the problem. To reproduce:
>
> 1. # svcadm disable name-service-cache
> 2. # svcadm disable nis/client:default
> 3. # svccfg -s nis/client:default addpg foo application
> 4. # svcadm enable name-service-cache
> 5. # svcadm enable nis/client:default
> 6. $ svccfg -s nis/client:default delpg foo
>
> This deadlocks unless you're lucky. The key is that the unprivileged
> user in step (6) should be a user with a NIS account. A similar
> problem can occur with LDAP, etc.
>
> The deadlock is seen as follows:
>
> stack pointer for thread 4: fedfb578
> [ fedfb578 libc.so.1`door_call+8() ]
>   fedfb5f0 libc.so.1`_nsc_trydoorcall_ext+0x1d0()
>   fedfb6e8 libc.so.1`_nsc_search+0xbc()
>   fedfb758 libc.so.1`nss_search+0x28()
>   fedfb7d0 libc.so.1`getpwuid_r+0x50()
>   fedfb868 libsecdb.so.1`getuseruid+0x10()
>   fedfbcf0 perm_granted+0xd4()
>   fedfbd58 rc_node_delete+0x2b8()
>   fedfbdc0 entity_delete+0x48()
>   fedfbe20 simple_handler+0x24()
>   fedfbe80 client_switcher+0x24c()
>   fedfbf10 libc.so.1`__door_return+0x60()
> stack pointer for thread 5: fecfba38
> [ fecfba38 libc.so.1`__lwp_park+0x10() ]
>   fecfba98 libc.so.1`cond_wait_queue+0x28()
>   fecfbaf8 libc.so.1`cond_wait+0x10()
>   fecfbb58 libc.so.1`pthread_cond_wait+8()
>   fecfbbb8 rc_node_hold_flag+0x50()
>   fecfbc18 rc_node_fill_children+0x3c()
>   fecfbc78 rc_node_find_named_child+0x2c()
>   fecfbcd8 rc_node_get_child+0x58()
>   fecfbd40 entity_get_child+0x40()
>   fecfbda8 simple_handler+0x24()
>   fecfbe08 client_switcher+0x24c()
>   fecfbe98 libc.so.1`__door_return+0x60()
>
> Thread 4 is servicing svccfg. Thread 5 we'll come back to in a
> moment. Here's what nscd is doing:
>
> stack pointer for thread 4: febf7360
> [ febf7360 libc.so.1`door_call+8() ]
>   febf73d8 libscf.so.1`datael_get_child_locked+0xc4()
>   febf74c0 libscf.so.1`datael_get_child+0x178()
>   febf7520 libscf.so.1`scf_handle_decode_fmri+0x4c0()
>   febf7868 libscf.so.1`scf_simple_prop_get+0xc8()
>   febf78d0 libscf.so.1`smf_get_state+0x24()
>   febf7930 query_smf_state+0x14()
>   febf7990 nss_search+0x538()
>   febf7ae0 nss_psearch+0xc8()
>   febf7ba8 lookup_int+0x7dc()
>   febf7d28 nsc_lookup+0xc()
>   febf7d88 nscd`lookup+0x178()
>   febfbe00 nscd`switcher+0x118()
>   febfbe70 libc.so.1`__door_return+0x60()
>
> Thread 4 here is servicing the getpwuid_r() from configd. configd's
> thread 5 is, in turn, servicing (or not) the datael_get_child_locked
> call from nscd. What's happening here? Pretty simple, really - the
> rc_node_delete() call has set RC_NODE_DYING_FLAGS in preparation for
> deleting it near rc_node.c:3379, then later calls perm_granted() near
> 3409. nscd is attempting to fill in the children of
> svc:/network/nis/client:default and is waiting for
> RC_NODE_CHILDREN_CHANGING to be cleared. Clearly, that's never going
> to happen.
>
> More generally, now. I can certainly file a bug on the above (it's
> really another variation on 6598922) but there are other possible
> failure modes here, and it's very hard to be sure we've gotten all of
> them. For example, suppose an NSS database backend (or NSS itself)
> were to store pieces of its configuration in the SMF repository. When
> it goes to look up those values to service a lookup, it would be
> vulnerable to the same type of deadlock.
>
> Duckwater folks: Are you aware of this? How do you intend to solve
> (or avoid) it?
>
> SMF folks: is there hope of establishing (and verifying compliance
> with) a set of locking rules that can definitely prevent these
> conditions? Is it worth doing, or is some alternate set of
> restrictions more appropriate (perhaps banning the name service
> components from calling into configd)?
>
> --
> Keith M Wesolowski    "Sir, we're surrounded!"
> FishWorks             "Excellent; we can attack in any direction!"
> _______________________________________________
> smf-discuss mailing list
> smf-discuss at opensolaris.org

I think a review of the configd locking rules is definitely in order. As
you point out, it is not just a matter of holding the locks. We also need
to understand when the rn_flags are being set and how they can create a
deadlock. perm_granted() is only called in 6 places, so I think that we
should be able to come up with a solution that we are confident in.

Nico's suggestion is intriguing. I'd like to include it in the study.

Meanwhile, please do file a bug for the case that you found. I would be
happy to work with the Duckwater folks on this issue.

In a separate mail you stated:

> What I'm getting out of this is that people see that this is a problem
> but don't actually have a specific plan for fixing it. Is that a fair
> assessment?

I think that is a fair assessment.

tom