Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2001-01-08 Thread Tom Lane

One last followup on that bizarreness about shutdown's checkpoint
failing on Alpha platforms ---

After changing the checkpoint code to loop, rather than assuming TAS()
must succeed the first time, I noticed that it always looped exactly
once.  This didn't make sense to me at the time, but after querying some
Alpha experts at DEC^H^H^HCompaq, it does now.  If a new process's first
write to a shared memory page is a stq_c, that stq_c is guaranteed to
fail (at least on Tru64 Unix), because it will page fault.  The shared
memory page is inherited read-only and is converted to read-write on
first fault.  This doesn't seem really necessary, but I suppose it's
done to share code with the copy-on-write case for non-shared pages
that are inherited via fork().

It makes sense that the checkpoint process's first write to shared
memory would be stq_c, because after all it shouldn't be scribbling
on shared memory until it's got the spinlock, n'est ce pas?

So a failure the first time through the TAS loop is entirely expected
for Alpha.  I wouldn't be surprised to see similar behavior on other
architectures, now that I see the first-write-from-a-process connection.

Bottom line is the same: always call TAS() in a retry loop.

regards, tom lane



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2001-01-08 Thread Bruce Momjian

 One last followup on that bizarreness about shutdown's checkpoint
 failing on Alpha platforms ---
 
 After changing the checkpoint code to loop, rather than assuming TAS()
 must succeed the first time, I noticed that it always looped exactly
 once.  This didn't make sense to me at the time, but after querying some
 Alpha experts at DEC^H^H^HCompaq, it does now.  If a new process's first
 write to a shared memory page is a stq_c, that stq_c is guaranteed to
 fail (at least on Tru64 Unix), because it will page fault.  The shared
 memory page is inherited read-only and is converted to read-write on
 first fault.  This doesn't seem really necessary, but I suppose it's
 done to share code with the copy-on-write case for non-shared pages
 that are inherited via fork().

This seems quite bizarre.  Why would the process fail on the write, and
not just pause and wait for the fault to bring in the page?  Doesn't the
CPU halt the instruction to fetch in the page and restart the
instruction?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2001-01-08 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 After changing the checkpoint code to loop, rather than assuming TAS()
 must succeed the first time, I noticed that it always looped exactly
 once.  This didn't make sense to me at the time, but after querying some
 Alpha experts at DEC^H^H^HCompaq, it does now.  If a new process's first
 write to a shared memory page is a stq_c, that stq_c is guaranteed to
 fail (at least on Tru64 Unix), because it will page fault.  The shared
 memory page is inherited read-only and is converted to read-write on
 first fault.  This doesn't seem really necessary, but I suppose it's
 done to share code with the copy-on-write case for non-shared pages
 that are inherited via fork().

 This seems quite bizarre.  Why would the process fail on the write, and
 not just pause and wait for the fault to bring in the page?

An ordinary write would be re-executed and would succeed after the
page fault.  stq_c is different, because it's only supposed to succeed
if the processor has managed to hold an access lock on the target
address continuously since the ldq_l.  It would be very bad form to try
to hold the lock during a page fault.  (stq_c will also fail if the
processor is interrupted between ldq_l and stq_c, so occasional failures
are to be expected.  What was surprising me was the consistency of the
failure pattern.)

See the Alpha Architecture Manual if you really want to discuss this.

regards, tom lane



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2001-01-08 Thread Bruce Momjian


Oh, thanks.  That makes sense.

 Bruce Momjian [EMAIL PROTECTED] writes:
  After changing the checkpoint code to loop, rather than assuming TAS()
  must succeed the first time, I noticed that it always looped exactly
  once.  This didn't make sense to me at the time, but after querying some
  Alpha experts at DEC^H^H^HCompaq, it does now.  If a new process's first
  write to a shared memory page is a stq_c, that stq_c is guaranteed to
  fail (at least on Tru64 Unix), because it will page fault.  The shared
  memory page is inherited read-only and is converted to read-write on
  first fault.  This doesn't seem really necessary, but I suppose it's
  done to share code with the copy-on-write case for non-shared pages
  that are inherited via fork().
 
  This seems quite bizarre.  Why would the process fail on the write, and
  not just pause and wait for the fault to bring in the page?
 
 An ordinary write would be re-executed and would succeed after the
 page fault.  stq_c is different, because it's only supposed to succeed
 if the processor has managed to hold an access lock on the target
 address continuously since the ldq_l.  It would be very bad form to try
 to hold the lock during a page fault.  (stq_c will also fail if the
 processor is interrupted between ldq_l and stq_c, so occasional failures
 are to be expected.  What was surprising me was the consistency of the
 failure pattern.)
 
 See the Alpha Architecture Manual if you really want to discuss this.
 
   regards, tom lane
 


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2001-01-01 Thread Bruce Momjian

 Alfred Perlstein [EMAIL PROTECTED] writes:
  One trick that may help is calling sched_yield(2) on a lock miss,
  it's a POSIX call and quite new so you'd need a 'configure' test
  for it.
 
 The author of the current s_lock code seems to have thought that
 select() with a zero delay would do the equivalent of sched_yield().
 I'm not sure if that's true on very many kernels, if indeed any...
 
 I doubt we could buy much by depending on sched_yield(); if you want
 to assume POSIX facilities, ISTM you might as well go for user-space
 semaphores and forget the whole TAS mechanism.


Another issue is that sched_yield brings in the pthreads library/hooks
on some OS's, which we certainly want to avoid.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2001-01-01 Thread Alfred Perlstein

* Bruce Momjian [EMAIL PROTECTED] [010101 23:59] wrote:
  Alfred Perlstein [EMAIL PROTECTED] writes:
   One trick that may help is calling sched_yield(2) on a lock miss,
   it's a POSIX call and quite new so you'd need a 'configure' test
   for it.
  
  The author of the current s_lock code seems to have thought that
  select() with a zero delay would do the equivalent of sched_yield().
  I'm not sure if that's true on very many kernels, if indeed any...
  
  I doubt we could buy much by depending on sched_yield(); if you want
  to assume POSIX facilities, ISTM you might as well go for user-space
  semaphores and forget the whole TAS mechanism.
 
 
 Another issue is that sched_yield brings in the pthreads library/hooks
 on some OS's, which we certainly want to avoid.

I know it's a major undertaking, but since the work is sort of done,
have you guys considered the port to solaris threads and seeing about
making a pthreads port of that?

I know it would probably get you considerable gains under Windows
at the expense of dropping some really really legacy system.

Or you could do what apache (is rumored) does and have it do either
threads or processes or both...

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



RE: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2000-12-29 Thread Mikheev, Vadim

  The code is based on some odd assumptions.  A select() with 0 delay 
  returns immediately unless there is an interrupt during its 
  (very short!) time in kernel space.
 
 Yeah, I've wondered whether the 0 entries in s_spincycle[] 
 shouldn't be 1.  The code author evidently expected select()
 to at least yield the processor even with delay 0, but the select()
 man pages I have handy say that it will "return immediately" when delay
 is 0.

I've run some tests with 5 instead of 0 and afair performance was worse,
so we should carefully test !0 values. Actually, one slocks are held
longer than anothers - probably we should use different delays...

Vadim



[HACKERS] Assuming that TAS() will succeed the first time is verboten

2000-12-28 Thread Tom Lane

I have been digging into the observed failure

FATAL: Checkpoint lock is busy while data base is shutting down

on some Alpha machines.  It apparently doesn't happen on all Alphas,
but it's quite reproducible on some of them.

The bottom line turns out to be that on the Alpha hardware, it is
possible for TAS() to fail even when the lock is initially zero,
because that hardware's locking protocol will fail to acquire the
lock if the ldq_l/stq_c sequence is interrupted.  TAS() *must* be
called in a retry loop on Alphas.  Thus, the coding presently in
xlog.c,

while (TAS((XLogCtl-chkp_lck)))
{
struct timeval delay = {2, 0};

if (shutdown)
elog(STOP, "Checkpoint lock is busy while data base is shutting down");
(void) select(0, NULL, NULL, NULL, delay);
}

is no good because it does not allow for multiple retries.

Offhand I see no good reason why the above-quoted code isn't just

S_LOCK((XLogCtl-chkp_lck));

and propose to fix this problem by reducing it to that.  If the lock
is held when it shouldn't be, we'll fail with a stuck-spinlock error.

It also bothers me that xlog.c contains several places where there is a
potentially infinite wait for a lock.  It seems to me that these should
time out with stuck-spinlock messages.  Do you object to such a change?

regards, tom lane



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2000-12-28 Thread Nathan Myers

On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote:
 [EMAIL PROTECTED] (Nathan Myers) writes:
  I wonder about the advisability of using spinlocks in user-level code 
  which might be swapped out any time.
 
 The reason we use spinlocks is that we expect the lock to succeed (not
 block) the majority of the time, and we want the code to fall through
 as quickly as possible in that case.  In particular we do *not* want to
 expend a kernel call when we are able to acquire the lock immediately.

Most implementations of mutex and semaphore do no system call if they 
get the lock; if they fail to get the lock, they block in the kernel 
without any need for complicated "back-off" loops.  

 It's not a true "spin" lock because we don't sit in a tight loop when
 we do have to wait for the lock --- we use select() to delay for a small
 interval before trying again.  See src/backend/storage/buffer/s_lock.c.
 
 The design is reasonable, even if a little bit offbeat.

I suspect "a little bit offbeat" qualifies as an extreme understatement.  
In particular, it's not really a spinlock, in the standard sense.

The code is based on some odd assumptions.  A select() with 0 delay 
returns immediately unless there is an interrupt during its (very short!) 
time in kernel space.  On a single processor this is extremely unlikely 
to result in a change to the lock.  I suspect

  #define S_NSPINCYCLE2
  #define S_MAX_BUSY  1000 * S_NSPINCYCLE

  int s_spincycle[S_NSPINCYCLE] = {1,1};

would give better results, assuming we want to keep the existing
mechanism.

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2000-12-28 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 On Thu, Dec 28, 2000 at 05:12:22PM -0500, Tom Lane wrote:
 The reason we use spinlocks is that we expect the lock to succeed (not
 block) the majority of the time, and we want the code to fall through
 as quickly as possible in that case.  In particular we do *not* want to
 expend a kernel call when we are able to acquire the lock immediately.

 Most implementations of mutex and semaphore do no system call if they 
 get the lock; if they fail to get the lock, they block in the kernel 
 without any need for complicated "back-off" loops.  

There's been some talk of reimplementing S_LOCK() and friends to use
Posix user-space semaphores, on platforms that provide such.  But it'll
be a *long* time before we can expect such facilities to be available
everywhere.

 The code is based on some odd assumptions.  A select() with 0 delay 
 returns immediately unless there is an interrupt during its (very short!) 
 time in kernel space.

Yeah, I've wondered whether the 0 entries in s_spincycle[] shouldn't be
1.  The code author evidently expected select() to at least yield the
processor even with delay 0, but the select() man pages I have handy
say that it will "return immediately" when delay is 0.

regards, tom lane



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2000-12-28 Thread Alfred Perlstein

* Tom Lane [EMAIL PROTECTED] [001228 14:25] wrote:
 [EMAIL PROTECTED] (Nathan Myers) writes:
  I wonder about the advisability of using spinlocks in user-level code 
  which might be swapped out any time.
 
 The reason we use spinlocks is that we expect the lock to succeed (not
 block) the majority of the time, and we want the code to fall through
 as quickly as possible in that case.  In particular we do *not* want to
 expend a kernel call when we are able to acquire the lock immediately.
 It's not a true "spin" lock because we don't sit in a tight loop when
 we do have to wait for the lock --- we use select() to delay for a small
 interval before trying again.  See src/backend/storage/buffer/s_lock.c.
 
 The design is reasonable, even if a little bit offbeat.

It sounds pretty bad, if you have a contested lock you'll trap into
the kernel each time you miss, crossing the protection boundry and
then waiting.  It's a tough call to make, because on UP systems
you loose bigtime by spinning for your quantum, however on SMP
systems there's a chance that the lock is owned by a process on
another CPU and spinning might be benificial.

One trick that may help is calling sched_yield(2) on a lock miss,
it's a POSIX call and quite new so you'd need a 'configure' test
for it.

http://www.freebsd.org/cgi/man.cgi?query=sched_yieldapropos=0sektion=0manpath=FreeBSD+4.2-RELEASEformat=html

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten

2000-12-28 Thread Tom Lane

Alfred Perlstein [EMAIL PROTECTED] writes:
 One trick that may help is calling sched_yield(2) on a lock miss,
 it's a POSIX call and quite new so you'd need a 'configure' test
 for it.

The author of the current s_lock code seems to have thought that
select() with a zero delay would do the equivalent of sched_yield().
I'm not sure if that's true on very many kernels, if indeed any...

I doubt we could buy much by depending on sched_yield(); if you want
to assume POSIX facilities, ISTM you might as well go for user-space
semaphores and forget the whole TAS mechanism.

regards, tom lane