Keith M Wesolowski writes:
> Recently, we've seen
> 
> 6628289 svc.configd hangs in deadly embrace with nscd
> 6598922 rc_node_modify_permission_check() calls perm_granted() with rn_lock
> 
> to name but two instances of the same general class of problem.
> Specifically, configd can call into NSS (either directly or through
> nscd), which can in turn call back into configd.  If the call into the
> name service was made with any locks held (6598922) or flags set (see
> below), deadlock can result.
> 
> Here's another instance of the problem.  To reproduce:
> 
> 1. # svcadm disable name-service-cache
> 2. # svcadm disable nis/client:default
> 3. # svccfg -s nis/client:default addpg foo application
> 4. # svcadm enable name-service-cache
> 5. # svcadm enable nis/client:default
> 6. $ svccfg -s nis/client:default delpg foo
> 
> This deadlocks unless you're lucky.  The key is that the unprivileged
> user in step (6) should be a user with a NIS account.  A similar
> problem can occur with LDAP, etc.
> 
> The deadlock is seen as follows:
> 
> stack pointer for thread 4: fedfb578
> [ fedfb578 libc.so.1`door_call+8() ]
>   fedfb5f0 libc.so.1`_nsc_trydoorcall_ext+0x1d0()
>   fedfb6e8 libc.so.1`_nsc_search+0xbc()
>   fedfb758 libc.so.1`nss_search+0x28()
>   fedfb7d0 libc.so.1`getpwuid_r+0x50()
>   fedfb868 libsecdb.so.1`getuseruid+0x10()
>   fedfbcf0 perm_granted+0xd4()
>   fedfbd58 rc_node_delete+0x2b8()
>   fedfbdc0 entity_delete+0x48()
>   fedfbe20 simple_handler+0x24()
>   fedfbe80 client_switcher+0x24c()
>   fedfbf10 libc.so.1`__door_return+0x60()
> stack pointer for thread 5: fecfba38
> [ fecfba38 libc.so.1`__lwp_park+0x10() ]
>   fecfba98 libc.so.1`cond_wait_queue+0x28()
>   fecfbaf8 libc.so.1`cond_wait+0x10()
>   fecfbb58 libc.so.1`pthread_cond_wait+8()
>   fecfbbb8 rc_node_hold_flag+0x50()
>   fecfbc18 rc_node_fill_children+0x3c()
>   fecfbc78 rc_node_find_named_child+0x2c()
>   fecfbcd8 rc_node_get_child+0x58()
>   fecfbd40 entity_get_child+0x40()
>   fecfbda8 simple_handler+0x24()
>   fecfbe08 client_switcher+0x24c()
>   fecfbe98 libc.so.1`__door_return+0x60()
> 
> Thread 4 is servicing svccfg.  Thread 5 we'll come back to in a
> moment.  Here's what nscd is doing:
> 
> stack pointer for thread 4: febf7360
> [ febf7360 libc.so.1`door_call+8() ]
>   febf73d8 libscf.so.1`datael_get_child_locked+0xc4()
>   febf74c0 libscf.so.1`datael_get_child+0x178()
>   febf7520 libscf.so.1`scf_handle_decode_fmri+0x4c0()
>   febf7868 libscf.so.1`scf_simple_prop_get+0xc8()
>   febf78d0 libscf.so.1`smf_get_state+0x24()
>   febf7930 query_smf_state+0x14()
>   febf7990 nss_search+0x538()
>   febf7ae0 nss_psearch+0xc8()
>   febf7ba8 lookup_int+0x7dc()
>   febf7d28 nsc_lookup+0xc()
>   febf7d88 nscd`lookup+0x178()
>   febfbe00 nscd`switcher+0x118()
>   febfbe70 libc.so.1`__door_return+0x60()
> 
> Thread 4 here is servicing the getpwuid_r() from configd.  configd's
> thread 5 is, in turn, servicing (or not) the datael_get_child_locked
> call from nscd.  What's happening here?  Pretty simple, really: the
> rc_node_delete() call sets RC_NODE_DYING_FLAGS on the node near
> rc_node.c:3379 in preparation for deleting it, then later calls
> perm_granted() near 3409.  nscd is attempting to fill in the children of
> svc:/network/nis/client:default and is waiting for
> RC_NODE_CHILDREN_CHANGING to be cleared.  Clearly, that's never going
> to happen.
> 
> More generally, now.  I can certainly file a bug on the above (it's
> really another variation on 6598922) but there are other possible
> failure modes here, and it's very hard to be sure we've gotten all of
> them.  For example, suppose an NSS database backend (or NSS itself)
> were to store pieces of its configuration in the SMF repository.  When
> it goes to look up those values to service a lookup, it would be
> vulnerable to the same type of deadlock.
> 
> Duckwater folks: Are you aware of this?  How do you intend to solve
> (or avoid) it?
> 
> SMF folks: is there hope of establishing (and verifying compliance
> with) a set of locking rules that can definitely prevent these
> conditions?  Is it worth doing, or is some alternate set of
> restrictions more appropriate (perhaps banning the name service
> components from calling into configd)?
> 
> -- 
> Keith M Wesolowski            "Sir, we're surrounded!" 
> FishWorks                     "Excellent; we can attack in any direction!" 
> _______________________________________________
> smf-discuss mailing list
> smf-discuss at opensolaris.org

I think a review of the configd locking rules is definitely in order.  As
you point out, it is not just a matter of which locks are held.  We also need
to understand when the rn_flags are set and how they can create a
deadlock.  perm_granted() is called in only six places, so I think we
should be able to come up with a solution that we are confident in.
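
To make the kind of rule I have in mind concrete, here is a minimal sketch,
not the configd code.  Every name in it (node_t, NODE_DYING, node_hold_flag,
checked_perm, node_delete, always_allow) is a hypothetical stand-in.  It
assumes that the permission check can be hoisted ahead of the point where
the flag is set, and it uses a per-thread counter so that any call out to
the name service made while a hold flag is owned is reported instead of
being allowed to block.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real configd definitions. */
#define NODE_DYING 0x1u

typedef struct node {
    pthread_mutex_t lock;   /* protects flags */
    pthread_cond_t  cv;     /* signalled whenever a flag is cleared */
    uint32_t        flags;
} node_t;

/* Number of hold flags the calling thread currently owns. */
static _Thread_local int flags_held;

void node_init(node_t *np)
{
    pthread_mutex_init(&np->lock, NULL);
    pthread_cond_init(&np->cv, NULL);
    np->flags = 0;
}

/* Wait until `flag` is clear, then set it (the wait thread 5 is stuck in). */
void node_hold_flag(node_t *np, uint32_t flag)
{
    pthread_mutex_lock(&np->lock);
    while (np->flags & flag)
        pthread_cond_wait(&np->cv, &np->lock);
    np->flags |= flag;
    flags_held++;
    pthread_mutex_unlock(&np->lock);
}

/* Clear `flag` and wake any waiters. */
void node_rele_flag(node_t *np, uint32_t flag)
{
    pthread_mutex_lock(&np->lock);
    assert(flags_held > 0 && (np->flags & flag));
    np->flags &= ~flag;
    flags_held--;
    pthread_cond_broadcast(&np->cv);
    pthread_mutex_unlock(&np->lock);
}

/*
 * Stand-in for a perm_granted()-style check.  `lookup` represents the call
 * into the name service.  Returns the lookup result (1 granted, 0 denied),
 * or -1 without calling out if this thread owns any hold flag.
 */
int checked_perm(int (*lookup)(void))
{
    if (flags_held != 0)
        return -1;
    return lookup();
}

/* Hypothetical lookup that always grants permission. */
int always_allow(void)
{
    return 1;
}

/* Delete path with the permission check done before the flag is set. */
int node_delete(node_t *np, int (*lookup)(void))
{
    int granted = checked_perm(lookup);   /* no flags owned at this point */
    if (granted != 1)
        return granted;
    node_hold_flag(np, NODE_DYING);
    /* ... the node would be unlinked here ... */
    node_rele_flag(np, NODE_DYING);
    return 1;
}
```

With this ordering, node_delete(&n, always_allow) returns 1 and leaves
n.flags clear, while calling checked_perm() between node_hold_flag() and
node_rele_flag() returns -1.  The sketch ignores the window between the
check and the flag being set; a real fix would have to decide whether the
node can change in that window and recheck if so.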

Nico's suggestion is intriguing.  I'd like to include it in the study.

Meanwhile, please do file a bug for the case that you found.

I would be happy to work with the Duckwater folks on this issue.

In a separate mail you stated:
    What I'm getting out of this is that people see that this is a problem
    but don't actually have a specific plan for fixing it.  Is that a fair
    assessment?
I think that is a fair assessment.

tom
