Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
One last followup on that bizarreness about shutdown's checkpoint failing on Alpha platforms --- After changing the checkpoint code to loop, rather than assuming TAS() must succeed the first time, I noticed that it always looped exactly once. This didn't make sense to me at the time, but after querying some Alpha experts at DEC^H^H^HCompaq, it does now. If a new process's first write to a shared memory page is a stq_c, that stq_c is guaranteed to fail (at least on Tru64 Unix), because it will page fault. The shared memory page is inherited read-only and is converted to read-write on first fault. This doesn't seem really necessary, but I suppose it's done to share code with the copy-on-write case for non-shared pages that are inherited via fork(). It makes sense that the checkpoint process's first write to shared memory would be stq_c, because after all it shouldn't be scribbling on shared memory until it's got the spinlock, n'est ce pas? So a failure the first time through the TAS loop is entirely expected for Alpha. I wouldn't be surprised to see similar behavior on other architectures, now that I see the first-write-from-a-process connection. Bottom line is the same: always call TAS() in a retry loop. regards, tom lane
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
One last followup on that bizarreness about shutdown's checkpoint failing on Alpha platforms --- After changing the checkpoint code to loop, rather than assuming TAS() must succeed the first time, I noticed that it always looped exactly once. This didn't make sense to me at the time, but after querying some Alpha experts at DEC^H^H^HCompaq, it does now. If a new process's first write to a shared memory page is a stq_c, that stq_c is guaranteed to fail (at least on Tru64 Unix), because it will page fault. The shared memory page is inherited read-only and is converted to read-write on first fault. This doesn't seem really necessary, but I suppose it's done to share code with the copy-on-write case for non-shared pages that are inherited via fork(). This seems quite bizarre. Why would the process fail on the write, and not just pause and wait for the fault to bring in the page? Doesn't the CPU halt the instruction to fetch in the page and restart the instruction? -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
Bruce Momjian [EMAIL PROTECTED] writes: After changing the checkpoint code to loop, rather than assuming TAS() must succeed the first time, I noticed that it always looped exactly once. This didn't make sense to me at the time, but after querying some Alpha experts at DEC^H^H^HCompaq, it does now. If a new process's first write to a shared memory page is a stq_c, that stq_c is guaranteed to fail (at least on Tru64 Unix), because it will page fault. The shared memory page is inherited read-only and is converted to read-write on first fault. This doesn't seem really necessary, but I suppose it's done to share code with the copy-on-write case for non-shared pages that are inherited via fork(). This seems quite bizarre. Why would the process fail on the write, and not just pause and wait for the fault to bring in the page? An ordinary write would be re-executed and would succeed after the page fault. stq_c is different, because it's only supposed to succeed if the processor has managed to hold an access lock on the target address continuously since the ldq_l. It would be very bad form to try to hold the lock during a page fault. (stq_c will also fail if the processor is interrupted between ldq_l and stq_c, so occasional failures are to be expected. What was surprising me was the consistency of the failure pattern.) See the Alpha Architecture Manual if you really want to discuss this. regards, tom lane
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
Oh, thanks. That makes sense. Bruce Momjian [EMAIL PROTECTED] writes: After changing the checkpoint code to loop, rather than assuming TAS() must succeed the first time, I noticed that it always looped exactly once. This didn't make sense to me at the time, but after querying some Alpha experts at DEC^H^H^HCompaq, it does now. If a new process's first write to a shared memory page is a stq_c, that stq_c is guaranteed to fail (at least on Tru64 Unix), because it will page fault. The shared memory page is inherited read-only and is converted to read-write on first fault. This doesn't seem really necessary, but I suppose it's done to share code with the copy-on-write case for non-shared pages that are inherited via fork(). This seems quite bizarre. Why would the process fail on the write, and not just pause and wait for the fault to bring in the page? An ordinary write would be re-executed and would succeed after the page fault. stq_c is different, because it's only supposed to succeed if the processor has managed to hold an access lock on the target address continuously since the ldq_l. It would be very bad form to try to hold the lock during a page fault. (stq_c will also fail if the processor is interrupted between ldq_l and stq_c, so occasional failures are to be expected. What was surprising me was the consistency of the failure pattern.) See the Alpha Architecture Manual if you really want to discuss this. regards, tom lane -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
Alfred Perlstein [EMAIL PROTECTED] writes: One trick that may help is calling sched_yield(2) on a lock miss, it's a POSIX call and quite new so you'd need a 'configure' test for it. The author of the current s_lock code seems to have thought that select() with a zero delay would do the equivalent of sched_yield(). I'm not sure if that's true on very many kernels, if indeed any... I doubt we could buy much by depending on sched_yield(); if you want to assume POSIX facilities, ISTM you might as well go for user-space semaphores and forget the whole TAS mechanism. Another issue is that sched_yield brings in the pthreads library/hooks on some OS's, which we certainly want to avoid. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
* Bruce Momjian [EMAIL PROTECTED] [010101 23:59] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: One trick that may help is calling sched_yield(2) on a lock miss, it's a POSIX call and quite new so you'd need a 'configure' test for it. The author of the current s_lock code seems to have thought that select() with a zero delay would do the equivalent of sched_yield(). I'm not sure if that's true on very many kernels, if indeed any... I doubt we could buy much by depending on sched_yield(); if you want to assume POSIX facilities, ISTM you might as well go for user-space semaphores and forget the whole TAS mechanism. Another issue is that sched_yield brings in the pthreads library/hooks on some OS's, which we certainly want to avoid. I know it's a major undertaking, but since the work is sort of done, have you guys considered the port to solaris threads and seeing about making a pthreads port of that? I know it would probably get you considerable gains under Windows at the expense of dropping some really really legacy system. Or you could do what apache (is rumored) does and have it do either threads or processes or both... -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
RE: [HACKERS] Assuming that TAS() will succeed the first time is verboten
The code is based on some odd assumptions. A select() with 0 delay returns immediately unless there is an interrupt during its (very short!) time in kernel space. Yeah, I've wondered whether the 0 entries in s_spincycle[] shouldn't be 1. The code author evidently expected select() to at least yield the processor even with delay 0, but the select() man pages I have handy say that it will "return immediately" when delay is 0. I've run some tests with 5 instead of 0 and afair performance was worse, so we should carefully test !0 values. Actually, one slocks are held longer than anothers - probably we should use different delays... Vadim
[HACKERS] Assuming that TAS() will succeed the first time is verboten
I have been digging into the observed failure FATAL: Checkpoint lock is busy while data base is shutting down on some Alpha machines. It apparently doesn't happen on all Alphas, but it's quite reproducible on some of them. The bottom line turns out to be that on the Alpha hardware, it is possible for TAS() to fail even when the lock is initially zero, because that hardware's locking protocol will fail to acquire the lock if the ldq_l/stq_c sequence is interrupted. TAS() *must* be called in a retry loop on Alphas. Thus, the coding presently in xlog.c, while (TAS((XLogCtl-chkp_lck))) { struct timeval delay = {2, 0}; if (shutdown) elog(STOP, "Checkpoint lock is busy while data base is shutting down"); (void) select(0, NULL, NULL, NULL, delay); } is no good because it does not allow for multiple retries. Offhand I see no good reason why the above-quoted code isn't just S_LOCK((XLogCtl-chkp_lck)); and propose to fix this problem by reducing it to that. If the lock is held when it shouldn't be, we'll fail with a stuck-spinlock error. It also bothers me that xlog.c contains several places where there is a potentially infinite wait for a lock. It seems to me that these should time out with stuck-spinlock messages. Do you object to such a change? regards, tom lane
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote: [EMAIL PROTECTED] (Nathan Myers) writes: I wonder about the advisability of using spinlocks in user-level code which might be swapped out any time. The reason we use spinlocks is that we expect the lock to succeed (not block) the majority of the time, and we want the code to fall through as quickly as possible in that case. In particular we do *not* want to expend a kernel call when we are able to acquire the lock immediately. Most implementations of mutex and semaphore do no system call if they get the lock; if they fail to get the lock, they block in the kernel without any need for complicated "back-off" loops. It's not a true "spin" lock because we don't sit in a tight loop when we do have to wait for the lock --- we use select() to delay for a small interval before trying again. See src/backend/storage/buffer/s_lock.c. The design is reasonable, even if a little bit offbeat. I suspect "a little bit offbeat" qualifies as an extreme understatement. In particular, it's not really a spinlock, in the standard sense. The code is based on some odd assumptions. A select() with 0 delay returns immediately unless there is an interrupt during its (very short!) time in kernel space. On a single processor this is extremely unlikely to result in a change to the lock. I suspect #define S_NSPINCYCLE2 #define S_MAX_BUSY 1000 * S_NSPINCYCLE int s_spincycle[S_NSPINCYCLE] = {1,1}; would give better results, assuming we want to keep the existing mechanism. Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
[EMAIL PROTECTED] (Nathan Myers) writes: On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote: The reason we use spinlocks is that we expect the lock to succeed (not block) the majority of the time, and we want the code to fall through as quickly as possible in that case. In particular we do *not* want to expend a kernel call when we are able to acquire the lock immediately. Most implementations of mutex and semaphore do no system call if they get the lock; if they fail to get the lock, they block in the kernel without any need for complicated "back-off" loops. There's been some talk of reimplementing S_LOCK() and friends to use Posix user-space semaphores, on platforms that provide such. But it'll be a *long* time before we can expect such facilities to be available everywhere. The code is based on some odd assumptions. A select() with 0 delay returns immediately unless there is an interrupt during its (very short!) time in kernel space. Yeah, I've wondered whether the 0 entries in s_spincycle[] shouldn't be 1. The code author evidently expected select() to at least yield the processor even with delay 0, but the select() man pages I have handy say that it will "return immediately" when delay is 0. regards, tom lane
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
* Tom Lane [EMAIL PROTECTED] [001228 14:25] wrote: [EMAIL PROTECTED] (Nathan Myers) writes: I wonder about the advisability of using spinlocks in user-level code which might be swapped out any time. The reason we use spinlocks is that we expect the lock to succeed (not block) the majority of the time, and we want the code to fall through as quickly as possible in that case. In particular we do *not* want to expend a kernel call when we are able to acquire the lock immediately. It's not a true "spin" lock because we don't sit in a tight loop when we do have to wait for the lock --- we use select() to delay for a small interval before trying again. See src/backend/storage/buffer/s_lock.c. The design is reasonable, even if a little bit offbeat. It sounds pretty bad, if you have a contested lock you'll trap into the kernel each time you miss, crossing the protection boundry and then waiting. It's a tough call to make, because on UP systems you loose bigtime by spinning for your quantum, however on SMP systems there's a chance that the lock is owned by a process on another CPU and spinning might be benificial. One trick that may help is calling sched_yield(2) on a lock miss, it's a POSIX call and quite new so you'd need a 'configure' test for it. http://www.freebsd.org/cgi/man.cgi?query=sched_yieldapropos=0sektion=0manpath=FreeBSD+4.2-RELEASEformat=html -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
Alfred Perlstein [EMAIL PROTECTED] writes: One trick that may help is calling sched_yield(2) on a lock miss, it's a POSIX call and quite new so you'd need a 'configure' test for it. The author of the current s_lock code seems to have thought that select() with a zero delay would do the equivalent of sched_yield(). I'm not sure if that's true on very many kernels, if indeed any... I doubt we could buy much by depending on sched_yield(); if you want to assume POSIX facilities, ISTM you might as well go for user-space semaphores and forget the whole TAS mechanism. regards, tom lane