tentitive ucred mutex cleanup patch, MATT's version (was Re: ucred holding patch, BDE version)
Please review this patch. I haven't looked at BDE's patch but this was so simple I decided to just go do it. There are many situations where the allocation of a system structure can be safely done with Giant, but deallocation occurs dangerously deep in the call stack. For these situations I believe it makes sense to have a recycle queue (maybe even a global recycle queue) to handle freeing the refcount-0 structures at a later time (like in the allocation code where we might already need Giant). At least until MALLOC/FREE are really made MP safe. In anycase, here is the patch. It is VERY simple and it also cuts the size of the ucred structure down by a huge amount. As a bonus, here are the gettimeofday() loop test results with this code and the gettimeofday() giant removal: CURRENT, gettimeofday() SMP patch, ucred patch: 1 TG running: 396365 calls/sec 2 TGs running: 322000 calls/sec each. Now with my gdb -k on /dev/mem I note that the contention is occuring in userret(), where Giant is being obtained for the signal processing. If we could optimize that I'll bet the call rate would go above 400K calls per second with two running. -Matt Index: sys/ucred.h === RCS file: /home/ncvs/src/sys/sys/ucred.h,v retrieving revision 1.26 diff -u -r1.26 ucred.h --- sys/ucred.h 11 Oct 2001 23:38:17 - 1.26 +++ sys/ucred.h 16 Feb 2002 21:36:58 - @@ -60,8 +60,9 @@ struct uidinfo *cr_uidinfo;/* per euid resource consumption */ struct uidinfo *cr_ruidinfo; /* per ruid resource consumption */ struct prison *cr_prison; /* jail(4) */ -#definecr_endcopy cr_mtx - struct mtx cr_mtx; /* protect refcount */ +#definecr_endcopy cr_mtxp + struct mtx *cr_mtxp; /* protect refcount */ + struct ucred *cr_freenext; /* next on free list */ }; #define cr_gid cr_groups[0] #define NOCRED ((struct ucred *)0) /* no credential available */ Index: kern/kern_prot.c === RCS file: /home/ncvs/src/sys/kern/kern_prot.c,v retrieving revision 1.133 diff -u -r1.133 kern_prot.c --- kern/kern_prot.c16 Jan 2002 06:55:30 - 1.133 +++ kern/kern_prot.c16 Feb 2002 21:42:17 - @@ -72,6 +72,9 @@ int dummy; }; #endif + +static struct ucred *ucred_free_list; + /* * MPSAFE */ @@ -1582,11 +1585,40 @@ struct ucred * crget() { - register struct ucred *cr; + struct ucred *cr; + + GIANT_REQUIRED; + + mtx_pool_lock(ucred_free_list); + if (ucred_free_list) { + cr = ucred_free_list; + ucred_free_list = cr-cr_freenext; + cr-cr_freenext = NULL; + mtx_pool_unlock(ucred_free_list); - MALLOC(cr, struct ucred *, sizeof(*cr), M_CRED, M_WAITOK | M_ZERO); + /* +* Cleanup previous user of this recycled ucred +* +* Some callers of crget(), such as nfs_statfs(), +* allocate a temporary credential, but don't +* allocate a uidinfo structure. +*/ + if (cr-cr_uidinfo != NULL) + uifree(cr-cr_uidinfo); + if (cr-cr_ruidinfo != NULL) + uifree(cr-cr_ruidinfo); + /* +* Free a prison, if any. +*/ + if (jailed(cr)) + prison_free(cr-cr_prison); + } else { + mtx_pool_unlock(ucred_free_list); + MALLOC(cr, struct ucred *, sizeof(*cr), M_CRED, M_WAITOK); + } + bzero(cr, sizeof(struct ucred)); cr-cr_ref = 1; - mtx_init(cr-cr_mtx, ucred, MTX_DEF); + cr-cr_mtxp = mtx_pool_find(cr); return (cr); } @@ -1598,42 +1630,36 @@ struct ucred *cr; { - mtx_lock(cr-cr_mtx); + mtx_lock(cr-cr_mtxp); cr-cr_ref++; - mtx_unlock(cr-cr_mtx); + mtx_unlock(cr-cr_mtxp); return (cr); } /* * Free a cred structure. * Throws away space when ref count gets to 0. + * + * XXX we should not have to obtain the mutex lock if cr_ref is 1, since + * we are the last user. */ void crfree(cr) struct ucred *cr; { + struct mtx *mtxp = cr-cr_mtxp; - mtx_lock(cr-cr_mtx); + mtx_lock(mtxp); KASSERT(cr-cr_ref 0, (bad ucred refcount: %d, cr-cr_ref)); if (--cr-cr_ref == 0) { - mtx_destroy(cr-cr_mtx); - /* -* Some callers of crget(), such as nfs_statfs(), -* allocate a temporary credential, but don't -* allocate a uidinfo structure. -*/ - if (cr-cr_uidinfo != NULL) -
Re: ucred holding patch, BDE version
Julian Elischer wrote: This makes little sense to me. Maybe I'm missing something, but by virtue of ownership we don't have to worry about the ucred's refcount on entry into the kernel because it is the owner and no one else is allowed to change our privledges besideds ourselves via set[ug]id(). multiple threads can do it.. The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. When Giant goes away it won't be so bad but it will STILL be quicker to not drop it across userland. The multiple threads argument is bogus, since the calls to [gs]et[ug]id() are on a per process, not a per thread basis. An argument which *may* not be bogus (I am unconvinced) is that creds are immutable once instanced, and that the calls to [gs]et[ug]id() instance a new cred and replace, rather than changing an existing cred (this logically follows from credential inheritance, or the first set call would change the cred used by init and all other processes). Personally, I still do not understand the need to have a cred reference per thread, the only thing that makes any sense about that is to optimize the degenerate case of a daemon that makes calls as another ID, on behalf of a lot of users (or, sequentially, at least, different users). One example of such a program would be SAMBA (but *not* NFS, due to access semantics on objects based on path component access exclusion by credential not being an effective mechanism for NFS file handles). I think that you would need to have [gs]et[ug]id() be on a per thread basis for this to be an efficiency, and I think trying to do this pessimizes everything else. My gut tells me that creds should be per process, and that the references to them should be taken sparingly, and then only if a need can be justified, rather than just in case some day. Kirk at one time called vnodes the structure that ate the kernel; he was wrong: it was creds. Perhaps this dicsussion is enough impetus to justify revisiting the atomic_t type definitions, which would be useful as reference counted hold/release mechanisms that would obviate the need for locks here? This would at least let you defer getting rid of the per thread cred instances until later. -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
RE: ucred holding patch, BDE version
On 11-Feb-02 Julian Elischer wrote: here is the BDE version ready to commit. Extended to other architectures. Bruce, John, comments? As I was adding a prototype to ucred.h I stripped the __Ps of the others in that section (in the spirit of change it when editing it anyhow Hmm, acquire_ucred (don't really like that name, maybe thread_updatecred(td) which can use td_proc to get the proc) probably should be declared in sys/proc.h. Well, maybe not, sys/ucred.h is probably fine. But it's implementation should then be in kern_prot.c along with all the other ucred related functions. :) Also, please make the comment above the function into a complete sentence and capitalize appropriately, etc. as per style(9) just to be pedantic. I guess removing __P() as you go is ok if that spirit is what the -arch thread is desired. Personally I thought it should be the other way around just like we don't mix whitespace commits with code commits to avoid obfuscating function changes with style changes. IMO, just commit to ucred.h blowing away __P() first, then commit your functional changes with the rest. -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 11-Feb-02 Alfred Perlstein wrote: * Julian Elischer [EMAIL PROTECTED] [020211 15:00] wrote: In the current world, when the thread enters userland, it does: lock giant crfree() (which includes mutexes) unlock giant This isn't needed afaik. Yes, calling free() without Giant is about as good as calling fdrop() without Giant Alfred. :) if there are ASTs it does this once again for each AST waiting as well. And on the way into the system it does: lock process crhold() (which includes mutex ops) unlock process This isn't needed, at least afaik. Not strictly for the comparison as Julian and Terry pointed out since the race can occur anyway (i.e., you don't need the lock to see if p_ucred changed), however, if you are actually doing a crhold(), you want to make sure p_ucred isn't stale, so you need the proc lock. so if there is a single AST (not uncommon) it does on a system call, 4 to 6 locks and 4 to 6 unlocks just to get a reference on the ucred it already had a reference on. By not freeing it when going to userland, and checking if it is the right one when returning to the kernel, we replace that with a pointer comparison (well maybe 2) 99.999% of the time. John still wants to free it if INVARIANTS is on so he canh trap on inapropriate access. I'm not sure it's needed but am willing to do so.. This makes little sense to me. Maybe I'm missing something, but by virtue of ownership we don't have to worry about the ucred's refcount on entry into the kernel because it is the owner and no one else is allowed to change our privledges besideds ourselves via set[ug]id(). Umm, do you understand the entire thread ucred reference at all? What if you have two threads on two different CPU's from teh same process and one changes your ucred out from under you while you are handlign a syscall? The idea here is to ensure that a thread has a consistent ucred for an entire user trap or syscall to avoid races involving credentials. Therefore the additional hold on entry is completely useless no matter what and with that the release on exit is also useless. No, you just haven't been keeping up. I added td_ucred at least a month or so ago and it was discussed on -arch before it went in. I have a big honking patch to test when I get back from BSDCon that changes the kernel to use td_ucred almost everywhere instead of p_ucred so that we actualyl use the per-thread ucred (which will be needed before we can stop being a 1:1:1:1 system) and also means credential ops such as suser(), and p_can* don't need locks for the current thread. (p_can* still need locks for the target process). -Alfred -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 12-Feb-02 Julian Elischer wrote: The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. _No_! The proc lock is protecting p_ucred, it can't go away! What _can_ go away is the per-ucred mutex to protect the refcount if we ever revive the refcount API. When Giant goes away it won't be so bad but it will STILL be quicker to not drop it across userland. Yes. Actually, calling free() can still be rather expensive even when Giant is gone. -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 12-Feb-02 Terry Lambert wrote: Julian Elischer wrote: This makes little sense to me. Maybe I'm missing something, but by virtue of ownership we don't have to worry about the ucred's refcount on entry into the kernel because it is the owner and no one else is allowed to change our privledges besideds ourselves via set[ug]id(). multiple threads can do it.. The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. When Giant goes away it won't be so bad but it will STILL be quicker to not drop it across userland. The multiple threads argument is bogus, since the calls to [gs]et[ug]id() are on a per process, not a per thread basis. Thread 1 makes a syscall that blocks. Say it blocks in one of 3 VOP's all of which need a credential. Thread 2 changes the credentials of the process. Thread 1 now resumes assuming that say a VADMIN or VACCESS suceeded with the old cred that may not be valid any longer and performs VOP's with the now newer credential (which it may even read a stale value of w/o a lock thus using some random memory for the creds) to do its other VOP's which may now fail weirdly. The idea of per-thread ucred references is so that a thread will have the same credentials for the lifetime of a syscall so that the entire syscall is self consistent. It also means that except for when we update the ucred reference, we don't need locks to access thread credentials since the thread references are to read-only credentials. We discussed this on -arch _months_ ago and everyone agreed with it then. Perhaps this dicsussion is enough impetus to justify revisiting the atomic_t type definitions, which would be useful as reference counted hold/release mechanisms that would obviate the need for locks here? This would at least let you defer getting rid of the per thread cred instances until later. All having the refcount_t and other refcount_* functions would do is let us get rid of the per-ucred mutex (actually a pool mutex, so the overhead per ucred is just a pointer right now). It wouln't change the fact that we need a lock to make sure p_ucred is up to date before we read a value we need to depend on or actually use. -- Terry -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
RE: ucred holding patch, BDE version
On Mon, 11 Feb 2002, John Baldwin wrote: On 11-Feb-02 Julian Elischer wrote: here is the BDE version ready to commit. Extended to other architectures. Bruce, John, comments? As I was adding a prototype to ucred.h I stripped the __Ps of the others in that section (in the spirit of change it when editing it anyhow Hmm, acquire_ucred (don't really like that name, maybe thread_updatecred(td) which can use td_proc to get the proc) probably should be declared in sys/proc.h. Well, maybe not, sys/ucred.h is probably fine. But it's implementation should then be in kern_prot.c along with all the other ucred related functions. :) I guess so. The name requires changing anyhow as it was pointed out to me that Bruce mis-spelled acquire and I didn't notice. Also, please make the comment above the function into a complete sentence and capitalize appropriately, etc. as per style(9) just to be pedantic. I guess removing __P() as you go is ok if that spirit is what the -arch thread is desired. Personally I thought it should be the other way around just like we don't mix whitespace commits with code commits to avoid obfuscating function changes with style changes. IMO, just commit to ucred.h blowing away __P() first, then commit your functional changes with the rest. hmmm I am completely confused as to which way we ended up deciding then.. :-) -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On Mon, 11 Feb 2002, Terry Lambert wrote: Julian Elischer wrote: This makes little sense to me. Maybe I'm missing something, but by virtue of ownership we don't have to worry about the ucred's refcount on entry into the kernel because it is the owner and no one else is allowed to change our privledges besideds ourselves via set[ug]id(). multiple threads can do it.. The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. When Giant goes away it won't be so bad but it will STILL be quicker to not drop it across userland. The multiple threads argument is bogus, since the calls to [gs]et[ug]id() are on a per process, not a per thread basis. there is no such thing as a per process syscall. Two threads can always do the same syscall at the same time. there needs to be a proc-lock to stop it from becoming chaotic in there. In actual fact, since you cannot alter a cred but only replace that which the process points to it's not quite that bad, but you need to either lock it or have atomic reference-counting that can handle the possibility that the cred could have bee decremented to 0 by another thread just before you checked it. An argument which *may* not be bogus (I am unconvinced) is that creds are immutable once instanced, and that the calls to [gs]et[ug]id() instance a new cred and replace, rather than changing an existing cred (this logically follows from credential inheritance, or the first set call would change the cred used by init and all other processes). Personally, I still do not understand the need to have a cred reference per thread, the only thing that makes any sense about that is to optimize the degenerate case of a daemon that makes calls as another ID, on behalf of a lot of users (or, sequentially, at least, different users). One example of such a program would be SAMBA (but *not* NFS, due to access semantics on objects based on path component access exclusion by credential not being an effective mechanism for NFS file handles). the cred that is in force at the time that the syscall STARTS is used for the full syscall otherwise you could have one cred used for the first part of a syscall and a completely differnet one used for the secnd part of a syscall. I think that you would need to have [gs]et[ug]id() be on a per thread basis for this to be an efficiency, and I think trying to do this pessimizes everything else. My gut tells me that creds should be per process, and that the references to them should be taken sparingly, and then only if a need can be justified, rather than just in case some day. creads can only be changed per process but the threads only pick up the change on next syscall startup. Kirk at one time called vnodes the structure that ate the kernel; he was wrong: it was creds. I believe it was Mike Karels. Perhaps this dicsussion is enough impetus to justify revisiting the atomic_t type definitions, which would be useful as reference counted hold/release mechanisms that would obviate the need for locks here? This would at least let you defer getting rid of the per thread cred instances until later. I've made that point before and I believe that jhb has said he would like such primatives. -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
RE: ucred holding patch, BDE version
On 11-Feb-02 Julian Elischer wrote: here is the BDE version ready to commit. Extended to other architectures. Bruce, John, comments? As I was adding a prototype to ucred.h I stripped the __Ps of the others in that section (in the spirit of change it when editing it anyhow Hmm, acquire_ucred (don't really like that name, maybe thread_updatecred(td) which can use td_proc to get the proc) probably should be declared in sys/proc.h. Well, maybe not, sys/ucred.h is probably fine. But it's implementation should then be in kern_prot.c along with all the other ucred related functions. :) Also, please make the comment above the function into a complete sentence and capitalize appropriately, etc. as per style(9) just to be pedantic. I guess removing __P() as you go is ok if that spirit is what the -arch thread is desired. Personally I thought it should be the other way around just like we don't mix whitespace commits with code commits to avoid obfuscating function changes with style changes. IMO, just commit to ucred.h blowing away __P() first, then commit your functional changes with the rest. -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 12-Feb-02 Julian Elischer wrote: The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. _No_! The proc lock is protecting p_ucred, it can't go away! What _can_ go away is the per-ucred mutex to protect the refcount if we ever revive the refcount API. When Giant goes away it won't be so bad but it will STILL be quicker to not drop it across userland. Yes. Actually, calling free() can still be rather expensive even when Giant is gone. -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 11-Feb-02 Alfred Perlstein wrote: * Julian Elischer [EMAIL PROTECTED] [020211 15:00] wrote: In the current world, when the thread enters userland, it does: lock giant crfree() (which includes mutexes) unlock giant This isn't needed afaik. Yes, calling free() without Giant is about as good as calling fdrop() without Giant Alfred. :) if there are ASTs it does this once again for each AST waiting as well. And on the way into the system it does: lock process crhold() (which includes mutex ops) unlock process This isn't needed, at least afaik. Not strictly for the comparison as Julian and Terry pointed out since the race can occur anyway (i.e., you don't need the lock to see if p_ucred changed), however, if you are actually doing a crhold(), you want to make sure p_ucred isn't stale, so you need the proc lock. so if there is a single AST (not uncommon) it does on a system call, 4 to 6 locks and 4 to 6 unlocks just to get a reference on the ucred it already had a reference on. By not freeing it when going to userland, and checking if it is the right one when returning to the kernel, we replace that with a pointer comparison (well maybe 2) 99.999% of the time. John still wants to free it if INVARIANTS is on so he canh trap on inapropriate access. I'm not sure it's needed but am willing to do so.. This makes little sense to me. Maybe I'm missing something, but by virtue of ownership we don't have to worry about the ucred's refcount on entry into the kernel because it is the owner and no one else is allowed to change our privledges besideds ourselves via set[ug]id(). Umm, do you understand the entire thread ucred reference at all? What if you have two threads on two different CPU's from teh same process and one changes your ucred out from under you while you are handlign a syscall? The idea here is to ensure that a thread has a consistent ucred for an entire user trap or syscall to avoid races involving credentials. Therefore the additional hold on entry is completely useless no matter what and with that the release on exit is also useless. No, you just haven't been keeping up. I added td_ucred at least a month or so ago and it was discussed on -arch before it went in. I have a big honking patch to test when I get back from BSDCon that changes the kernel to use td_ucred almost everywhere instead of p_ucred so that we actualyl use the per-thread ucred (which will be needed before we can stop being a 1:1:1:1 system) and also means credential ops such as suser(), and p_can* don't need locks for the current thread. (p_can* still need locks for the target process). -Alfred -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
* Julian Elischer [EMAIL PROTECTED] [020211 15:00] wrote: In the current world, when the thread enters userland, it does: lock giant crfree() (which includes mutexes) unlock giant This isn't needed afaik. if there are ASTs it does this once again for each AST waiting as well. And on the way into the system it does: lock process crhold() (which includes mutex ops) unlock process This isn't needed, at least afaik. so if there is a single AST (not uncommon) it does on a system call, 4 to 6 locks and 4 to 6 unlocks just to get a reference on the ucred it already had a reference on. By not freeing it when going to userland, and checking if it is the right one when returning to the kernel, we replace that with a pointer comparison (well maybe 2) 99.999% of the time. John still wants to free it if INVARIANTS is on so he canh trap on inapropriate access. I'm not sure it's needed but am willing to do so.. This makes little sense to me. Maybe I'm missing something, but by virtue of ownership we don't have to worry about the ucred's refcount on entry into the kernel because it is the owner and no one else is allowed to change our privledges besideds ourselves via set[ug]id(). Therefore the additional hold on entry is completely useless no matter what and with that the release on exit is also useless. -Alfred To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On Mon, 11 Feb 2002, Alfred Perlstein wrote: * Julian Elischer [EMAIL PROTECTED] [020211 15:00] wrote: In the current world, when the thread enters userland, it does: lock giant crfree() (which includes mutexes) unlock giant This isn't needed afaik. Argue with jhb. if there are ASTs it does this once again for each AST waiting as well. And on the way into the system it does: lock process crhold() (which includes mutex ops) unlock process This isn't needed, at least afaik. see below. so if there is a single AST (not uncommon) it does on a system call, 4 to 6 locks and 4 to 6 unlocks just to get a reference on the ucred it already had a reference on. By not freeing it when going to userland, and checking if it is the right one when returning to the kernel, we replace that with a pointer comparison (well maybe 2) 99.999% of the time. John still wants to free it if INVARIANTS is on so he canh trap on inapropriate access. I'm not sure it's needed but am willing to do so.. This makes little sense to me. Maybe I'm missing something, but by virtue of ownership we don't have to worry about the ucred's refcount on entry into the kernel because it is the owner and no one else is allowed to change our privledges besideds ourselves via set[ug]id(). multiple threads can do it.. The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. When Giant goes away it won't be so bad but it will STILL be quicker to not drop it across userland. Therefore the additional hold on entry is completely useless no matter what and with that the release on exit is also useless. -Alfred To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On Mon, 11 Feb 2002, John Baldwin wrote: On 12-Feb-02 Julian Elischer wrote: The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. _No_! The proc lock is protecting p_ucred, it can't go away! What _can_ go away is the per-ucred mutex to protect the refcount if we ever revive the refcount API. hmm ok Alfred, here's the way I see it.. You are never permitted to CHANGE a cred. You ALWAYS allocate another and switch them. When you switch them you decrement the refcount of the old one. If someone else takes a reference on it at the same moment that you drop it, then the order is important down to the bus-cycle as to whether it gets freed or not. We already know that dereferencing it from the proc structure is not important, because a stale ucred pointer is only preventable from the userland. All that is important is that the pointer is a REAL pointer to a cred and that it is not allowed to go to 0 references in its way to 1 reference. An atomic reference count increment that checks against 0 would be part of it but might not be enough. I think we also need to have a lock on something because it might get freed and used for something else that happens to place a 1 in that location inbetween the time that p-p_ucred is read and the refcount is decremented. As an asside: I think that in NT they may have refcounts in the same location in all structures as I think they derived all their structures from a bas class that has ref counts. In that case it WOULD have a 1 in that location if it were re-used. (It's been a long time since I saw NT code so I may be wrong) When Giant goes away it won't be so bad but it will STILL be quicker to not drop it across userland. Yes. Actually, calling free() can still be rather expensive even when Giant is gone. -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
John Baldwin wrote: Yes, calling free() without Giant is about as good as calling fdrop() without Giant Alfred. :) Alfred would be right, for per processor memory pools. 8-). And on the way into the system it does: lock process crhold() (which includes mutex ops) unlock process This isn't needed, at least afaik. Not strictly for the comparison as Julian and Terry pointed out since the race can occur anyway (i.e., you don't need the lock to see if p_ucred changed), however, if you are actually doing a crhold(), you want to make sure p_ucred isn't stale, so you need the proc lock. No. If you _depend_ on the frequency of change being low, you can do this with only atomic reference counts. See the pseudo code in my other posting, in direct response to you. -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
Julian Elischer wrote: The multiple threads argument is bogus, since the calls to [gs]et[ug]id() are on a per process, not a per thread basis. there is no such thing as a per process syscall. Two threads can always do the same syscall at the same time. there needs to be a proc-lock to stop it from becoming chaotic in there. In actual fact, since you cannot alter a cred but only replace that which the process points to it's not quite that bad, but you need to either lock it or have atomic reference-counting that can handle the possibility that the cred could have bee decremented to 0 by another thread just before you checked it. I would argue that: 1) There are a small number of system calls where this is true. 2) It's worth doing the synchornization in-kernel for the process alone, where this is the case. The problem I have with the crd locking is that each process does not do a gratuitous clone of the active cred before doing its own thing (if you will remember, I suggested this in the context of the cred reference count overflow bug, back when I found the problem last year). The upshot of this is that the lock will stall anyone using that cred. This argues that creds should be, minimally, per process, if not per CPU, instead of shared references smeared all over the place, and locked on each reference, even though it's not possible for a cred to be changed at all out from under you -- only replaced wholesale, since once instanced, a cred may not be changed. Ask John and Robert about the permissability of changing cred contents, and the NFS changes that resulted (fairly) recently. Personally, I think that cred changes are rare enough, even in the degenerate case, that it's worth taking the synchronization event as an intraprocess global IPI, rather than using a lock. Personally, I still do not understand the need to have a cred reference per thread, the only thing that makes any sense about that is to optimize the degenerate case of a daemon that makes calls as another ID, on behalf of a lot of users (or, sequentially, at least, different users). One example of such a program would be SAMBA (but *not* NFS, due to access semantics on objects based on path component access exclusion by credential not being an effective mechanism for NFS file handles). the cred that is in force at the time that the syscall STARTS is used for the full syscall otherwise you could have one cred used for the first part of a syscall and a completely differnet one used for the secnd part of a syscall. Again: the model permits dropping, but not gaining, of priviledges. THe worst case failure, then, is: 1) A call starts on a thread without proper program level stall synchornization between threads. 2) A subsequent thread drops priviledges, and this drop is unprotected by the stall. 3) The call completes, but fails during completion for lack of privledges present when the call was started, because the drop was not stalled. Frankly, I don't see this being a problem: badly written code with a race condition will occasionally lose the race, rather than being implicitly protected (an assumption that is not portable, if it is depended upon), and the code will end up being fixed by its author. I think that you would need to have [gs]et[ug]id() be on a per thread basis for this to be an efficiency, and I think trying to do this pessimizes everything else. My gut tells me that creds should be per process, and that the references to them should be taken sparingly, and then only if a need can be justified, rather than just in case some day. creads can only be changed per process but the threads only pick up the change on next syscall startup. I think this is the fundamental misunderstanding: creds never change. The can only be replaced, and then only with a cred of equal or lesser priviledge. Kirk at one time called vnodes the structure that ate the kernel; he was wrong: it was creds. I believe it was Mike Karels. Whatever. It's creds now. Perhaps this dicsussion is enough impetus to justify revisiting the atomic_t type definitions, which would be useful as reference counted hold/release mechanisms that would obviate the need for locks here? This would at least let you defer getting rid of the per thread cred instances until later. I've made that point before and I believe that jhb has said he would like such primatives. I'm still not convinced that it's necessary to lock what you are trying to lock, but I certainly agree on atomic_t. -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
John Baldwin wrote: The multiple threads argument is bogus, since the calls to [gs]et[ug]id() are on a per process, not a per thread basis. Thread 1 makes a syscall that blocks. Say it blocks in one of 3 VOP's all of which need a credential. Thread 2 changes the credentials of the process. Thread 1 now resumes assuming that say a VADMIN or VACCESS suceeded with the old cred that may not be valid any longer and performs VOP's with the now newer credential (which it may even read a stale value of w/o a lock thus using some random memory for the creds) to do its other VOP's which may now fail weirdly. Then the combination of calls fails. There is no case where it will succeed when previously it would not. I would argue that the failure is a lost race condition that exists in the code anyway, and that this approach merely makes the race failure more reliably reproducible. 8-). I think you could fix this case, in any event, by saving a reference to the credential at the start (or, better, as a result of the sleep, when you know it's going to happen), any time you are going to split a system call across multiple potential blocking operations. That certainly is *NOT* as common to all system calls as this discussion implies. The idea of per-thread ucred references is so that a thread will have the same credentials for the lifetime of a syscall so that the entire syscall is self consistent. It also means that except for when we update the ucred reference, we don't need locks to access thread credentials since the thread references are to read-only credentials. We discussed this on -arch _months_ ago and everyone agreed with it then. As long as the pointer updates are atomic, you don't need to hold a lock to reference it anyway. You don't care over the update, since it's a drop of priviledge. Perhaps this dicsussion is enough impetus to justify revisiting the atomic_t type definitions, which would be useful as reference counted hold/release mechanisms that would obviate the need for locks here? This would at least let you defer getting rid of the per thread cred instances until later. All having the refcount_t and other refcount_* functions would do is let us get rid of the per-ucred mutex (actually a pool mutex, so the overhead per ucred is just a pointer right now). It wouln't change the fact that we need a lock to make sure p_ucred is up to date before we read a value we need to depend on or actually use. Not true. You can take a reference to it, and the reference is guaranteed to not change out from under you, so long as it is counted. Use of a stack variable to store the value over the count increment allows a compare after the increment, after which a retry is possible (if there was a race during the reference taking). Thus there is no need for a lock, e.g.: cred * addreftocred(cred **cpp) { cred *rcp; /* stack variable */ again: rcp = *cp; atomic_plusplus( rcp-count); /* magic atomic crap */ if( rcp != *cp) { atomic_mimusminus( rcp-count); goto again; } return( rcp); } This depends on the rarity of credential changes; in the degenerate case, where there is a credential change between each normal system call that is not a credential change, the overhead is immense. But people who write code like that can be safely punished without fear. 8-). -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 12-Feb-02 Julian Elischer wrote: On Mon, 11 Feb 2002, John Baldwin wrote: On 12-Feb-02 Julian Elischer wrote: The proclock is needed to get the reference, guarding against other threads, and giant is needed fo rnot to free it because if it reaches a refcount of 0 it needs to call free(). (which john assures me needs Giant at this time). We could avoid the proclock with judicious use of an atomic refcount incrementing method. _No_! The proc lock is protecting p_ucred, it can't go away! What _can_ go away is the per-ucred mutex to protect the refcount if we ever revive the refcount API. hmm ok Alfred, here's the way I see it.. This is jhb, not alfred. You are never permitted to CHANGE a cred. You ALWAYS allocate another and switch them. When you switch them you decrement the refcount of the old one. If someone else takes a reference on it at the same moment that you drop it, then the order is important down to the bus-cycle as to whether it gets freed or not. We already know that dereferencing it from the proc structure is not important, because a stale ucred pointer is only preventable from the userland. NO, no! You only know it is stale in that case cause you are comparing a non-stale known-good pointer to p_ucred. If p_ucred is stale, then it is equal to td_ucred which still points to a valid ucred so it is ok in that particular case. If p_ucred is stale but points to a new ucred, getting the proc lock around crhold() ensures that you will gain a reference to the really correct value that is guaranteed to point to a correct value. Don't think in terms of bus cycles. On non-i386 architectures, a write can sit in a store buffer waiting until it is pushed out by a memory barrier or for the CPU to get around to posting it. You can't assume that a write will be visible to other CPU's soon. All that is important is that the pointer is a REAL pointer to a cred and that it is not allowed to go to 0 references in its way to 1 reference. But if it's a stale pointer, it's not a pointer to a ucred. That memory could have been free'd and now be a mbuf header. Now you go try to increment the refcount of a mbuf header. Bad juju here. :) Trust me, my jhb_proc kernels were locking up cause I missed one instance of setting p_args to NULL during exec so taht during wait we would trash random data memory when we tried to indirect the stale p_args pointer to dec the refcount. The problem is that you actually have 2 locks involved for 2 different things here during your crhold()'s: - proc lock which protects p_ucred in two ways: - 1) it ensures that the ucred has at least one reference that doesn't go away so it's ok for us to deref the poitner and increment the count - 2) it ensures that the value of the pointer we read is up to date - ucred mutex which protects the ucred refcount Only the second mutex can be replaced by pure atomic operations in theory. The first one cannot. In your comparison test, you aren't dereferencing p_ucred, so if it is stale, that is ok due to other issues. However, when you do crhold(p-p_ucred), you dereference the pointer, so it better darn well be pointing at a ucred and not a mbuf. An atomic reference count increment that checks against 0 would be part of it but might not be enough. I think we also need to have a lock on something because it might get freed and used for something else that happens to place a 1 in that location inbetween the time that p-p_ucred is read and the refcount is decremented. That is what the proc lock is doing. :-P As an asside: I think that in NT they may have refcounts in the same location in all structures as I think they derived all their structures from a bas class that has ref counts. In that case it WOULD have a 1 in that location if it were re-used. (It's been a long time since I saw NT code so I may be wrong) That would involve great evil. Let's not do that. How do you know the refcount hasn't been adjusted on some other structure anyways that you have a stale pointer to? Maybe that current structure is just now being freed? -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 12-Feb-02 Terry Lambert wrote: Julian Elischer wrote: The multiple threads argument is bogus, since the calls to [gs]et[ug]id() are on a per process, not a per thread basis. there is no such thing as a per process syscall. Two threads can always do the same syscall at the same time. there needs to be a proc-lock to stop it from becoming chaotic in there. In actual fact, since you cannot alter a cred but only replace that which the process points to it's not quite that bad, but you need to either lock it or have atomic reference-counting that can handle the possibility that the cred could have bee decremented to 0 by another thread just before you checked it. I would argue that: 1)There are a small number of system calls where this is true. 2)It's worth doing the synchornization in-kernel for the process alone, where this is the case. The problem I have with the crd locking is that each process does not do a gratuitous clone of the active cred before doing its own thing (if you will remember, I suggested this in the context of the cred reference count overflow bug, back when I found the problem last year). ... which has nothing to do with the problem at hand as far as I can tell. Well, if we did, say, embed a cred in each process rather than sharing them, then we wouldn't need to protect p_ucred with a lock, but now we would need to protect all the contents of the cred with a lock, or go with the IPI scheme (yuck, IPI's are very uncheap). The upshot of this is that the lock will stall anyone using that cred. This argues that creds should be, minimally, per process, if not per CPU, instead of shared references smeared all over the place, and locked on each reference, even though it's not possible for a cred to be changed at all out from under you -- only replaced wholesale, since once instanced, a cred may not be changed. What in the world are you talking about Terry. Have you read the code? The lock _only_ protects the reference count, nothing else! Once I commit the code to make sure that we use p_ucred properly when updating credentials then I can commit the code to chagne all of suser(), p_can*() to use the per-thread ucred. Since teh per-thread ucred is immutable, and since td_ucred is only ever touchced by curthread, thsi involves NO LOCKS for ANY calls to suser() or p_can*() EXCEPT in the few syscalls that update credentials. We went over this _months_ ago. I direct you to the archives of -arch please. Personally, I think that cred changes are rare enough, even in the degenerate case, that it's worth taking the synchronization event as an intraprocess global IPI, rather than using a lock. Egads. That still doesn't fix the problem of some syscall changing its creds to get weird behavior halfway through a syscall. You still need locks if two threads call setuid() at the same time. Sure it's a program bug but we can't have the kernel panic over it. creads can only be changed per process but the threads only pick up the change on next syscall startup. I think this is the fundamental misunderstanding: creds never change. The can only be replaced, and then only with a cred of equal or lesser priviledge. Well, Robert wasn't very comfortable with that change, plus it either requires a horribly expensive and ugly IPI, or it requires lots of lock acquires to read p_ucred that we wouldn't need otherwise. -- John Baldwin [EMAIL PROTECTED]http://www.FreeBSD.org/~jhb/ Power Users Use the Power to Serve! - http://www.FreeBSD.org/ To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: ucred holding patch, BDE version
On 12-Feb-02 Terry Lambert wrote: Then the combination of calls fails. There is no case where it will succeed when previously it would not. I would argue that the failure is a lost race condition that exists in the code anyway, and that this approach merely makes the race failure more reliably reproducible. 8-). I think you could fix this case, in any event, by saving a reference to the credential at the start (or, better, as a result of the sleep, when you know it's going to happen), any time you are going to split a system call across multiple potential blocking operations. That certainly is *NOT* as common to all system calls as this discussion implies. The idea of per-thread ucred references is so that a thread will have the same credentials for the lifetime of a syscall so that the entire syscall is self consistent. It also means that except for when we update the ucred reference, we don't need locks to access thread credentials since the thread references are to read-only credentials. We discussed this on -arch _months_ ago and everyone agreed with it then. As long as the pointer updates are atomic, you don't need to hold a lock to reference it anyway. You don't care over the update, since it's a drop of priviledge. Ok, I'll use really small words and diagrams: thread's 1 and 2 belong to process p p-p_ucred starts pointing at memory X CPU 1 is executing thread 1 CPU 2 is executing thread 2 CPU 3 is executing some other random code in the kernel At start both thread 1 and thread 2 have td_ucred == X CPU 1 CPU 2 CPU 3 seteuid(), p_ucred in userland now points to Y ... seteuid(), p_ucred now points to Z ... some syscall, Y's only remaining ref is free'd ... allocates some other kernel structure at Y executes any syscall because on non-i386 arch's, the value of p_ucred may still be sitting in the store buffer and we aren't using a lock, we may get any of the values X, Y, or Z for p_ucred when we test to see if it needs updating. If we get X, we keep using the old reference (this is the acceptable race from before). If we get Z, then we gain a reference to the new ucred. If we get Y, then we increment the refcount of the ucred formerly at Y, in fact corrupting the memory allocated by CPU 3! *BOOM* Is that simple enough? Perhaps this dicsussion is enough impetus to justify revisiting the atomic_t type definitions, which would be useful as reference counted hold/release mechanisms that would obviate the need for locks here? This would at least let you defer getting rid of the per thread cred instances until later. All having the refcount_t and other refcount_* functions would do is let us get rid of the per-ucred mutex (actually a pool mutex, so the overhead per ucred is just a pointer right now). It wouln't change the fact that we need a lock to make sure p_ucred is up to date before we read a value we need to depend on or actually use. Not true. You can take a reference to it, and the reference is guaranteed to not change out from under you, so long as it is counted. Use of a stack variable to store the value over the count increment allows a compare after the increment, after which a retry is possible (if there was a race during the reference taking). Thus there is no need for a lock, e.g.: cred * addreftocred(cred **cpp) { cred *rcp; /* stack variable */ again: rcp = *cp; atomic_plusplus( rcp-count); /* magic atomic crap */ if( rcp != *cp) { atomic_mimusminus( rcp-count); goto again; } return( rcp); } This depends on the rarity of credential changes; in the degenerate case, where there is a credential change between each normal system call that is not a credential change, the overhead is immense. But people who write code like that can be safely punished without fear. 8-). This code is still broken. How do you know that the value you are reading from cp (well, cpp, looks like you have a typo :-P) isn't stale and that your increment isn't corrupting something?