Re: kernel lock contention and scalability

2001-03-10 Thread Anton Blanchard

 
Hi,

> In the slow path of a spinlock_acquire they busy wait for a few
> cycles, and then call schedule with a zero timeout assuming that
> it'll basically do the same as a sched_yield() but more portably.

The obvious problem with this is that we bounce in and out of schedule()
a few times before moving on to the next task. I see this also with
sched_yield().

I had this patch lying around; I think it came about when I was playing
with pthreads (whose spinlocks do sched_yield() for a while before
sleeping). The idea is to knock down the remaining timeslice of a task
that yields, so it is much less likely to be picked again ahead of other
runnable tasks.
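
A minimal sketch of that spin, then yield, then sleep pattern
(illustrative only: the constants are arbitrary and try_lock() is a
stand-in built on a gcc atomic builtin, not the actual pthreads code):

#include <sched.h>
#include <time.h>

#define SPIN_TRIES	1000	/* busy-wait attempts before yielding */
#define YIELD_TRIES	10	/* yields before backing off to a sleep */

static int try_lock(volatile int *lock)
{
	return __sync_lock_test_and_set(lock, 1) == 0;
}

void spin_acquire(volatile int *lock)
{
	struct timespec ts = { 0, 1000000 };	/* 1ms backoff */
	int i;

	for (;;) {
		for (i = 0; i < SPIN_TRIES; i++)
			if (try_lock(lock))
				return;
		for (i = 0; i < YIELD_TRIES; i++) {
			if (try_lock(lock))
				return;
			sched_yield();	/* give the holder a chance to run */
		}
		nanosleep(&ts, NULL);	/* stop burning timeslices */
	}
}

The patch itself: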

--- linux/kernel/sched.c	Fri Mar  9 10:26:56 2001
+++ linux_intel/kernel/sched.c	Fri Mar  9 08:42:39 2001
@@ -505,6 +505,9 @@
 		goto out_unlock;
 	}
 #else
+	if (prev->policy & SCHED_YIELD)
+		prev->counter = (prev->counter >> 4);
+
 	prev->policy &= ~SCHED_YIELD;
 #endif /* CONFIG_SMP */
 }

Anton


/* test sched_yield */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/time.h>

#undef USE_SELECT

void waste_time(void)
{
	int i;

	for (i = 0; i < 1000000; i++)	/* arbitrary busy-work count */
		;
}

void do_stuff(int i)
{
#ifdef USE_SELECT
	struct timeval tv;
#endif

	while (1) {
		fprintf(stderr, "%d\n", i);
		waste_time();
#ifdef USE_SELECT
		tv.tv_sec = 0;
		tv.tv_usec = 0;
		select(0, NULL, NULL, NULL, &tv);
#else
		sched_yield();
#endif
	}
}

int main(void)
{
	int i, pid;

	for (i = 0; i < 10; i++) {
		pid = fork();

		if (!pid)
			do_stuff(i);
	}

	do_stuff(i + 1);

	return 0;
}
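
Without USE_SELECT, the children pound on sched_yield(); defining
USE_SELECT switches them to the same select(0, NULL, NULL, NULL, {0,0})
pattern the postgres spinlocks use, so the two paths through schedule()
can be compared.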



Re: kernel lock contention and scalability

2001-03-10 Thread Anton Blanchard


Hi,
 
> Thanks for looking into postgresql/pgbench-related locking.  Yes,
> apparently postgresql uses a synchronization scheme that calls select()
> to effect back-off delays while attempting to acquire a lock.
> However, it seems to me that runqueue lock contention was not entirely due 
> to postgresql code, since it was largely alleviated by the multiqueue 
> scheduler patch.

I'm not saying that the multiqueue scheduler patch isn't needed, just that
this test case is caused by a bug in postgres. We shouldn't run around
fixing symptoms: dropping the contention on the runqueue lock might not
change the overall performance of the benchmark, whereas fixing the
spinlocks in postgres probably will.

On the other hand, if postgres still pounds on the runqueue lock after
the bug has been fixed, then we need to look at the multiqueue patch.

Cheers,
Anton


Re: kernel lock contention and scalability

2001-03-08 Thread Jeff Dike

[EMAIL PROTECTED] said:
> On a uniprocessor system, a simple fallback is to just use a semaphore
> instead of a spinlock, since you can guarantee that there's no point
> in scheduling the current task until the holder of the "lock" releases
> it. 

Yeah, that works.  But I'm not all that interested in compiling UML 
differently for UP and SMP hosts.

> Otherwise, the spin calling sched_yield() each iteration isn't too
> horrible. 

This looks a lot better.  For UML, if there's a thread spinning on a lock, 
there has to be a runnable thread holding it, and that thread will get a 
timeslice before the spinning one (assuming that the thread holding the lock 
hasn't called a blocking system call, which is something that I intend to make 
sure can't happen).

> > That sounds like a pretty fundamental (and abusable) mechanism.
> 
> It would be if it were generally available. The implementation on
> DYNIX/ptx requires a privilege (PRIV_SCHED IIRC), to be able to use
> it.

OK, that makes sense.

Jeff



Re: kernel lock contention and scalability

2001-03-07 Thread Tim Wright

On Tue, Mar 06, 2001 at 10:12:17PM -0500, Jeff Dike wrote:
> [EMAIL PROTECTED] said:
> > If you're a UP system, it never makes sense to spin in userland, since
> > you'll just burn up a timeslice and prevent the lock holder from
> > running. I haven't looked, but assume that their code only uses
> > spinlocks on SMP. If you're an SMP system, then you shouldn't be using
> > a spinlock unless the critical section is "short", in which case the
> > waiters should simply spin in userland rather than making system calls
> > which is simply overhead.
> 
> This is a problem that UML is going to have when I turn SMP back on.  
> Emulating a multiprocessor box on a UP host with the existing locking 
> primitives is going to result in exactly this problem.
> 

Yes. On a uniprocessor system, a simple fallback is to just use a semaphore
instead of a spinlock, since you can guarantee that there's no point in
scheduling the current task until the holder of the "lock" releases it.
Otherwise, the spin calling sched_yield() each iteration isn't too horrible.

> > Actually, what's really needed here is an efficient form of
> > dynamically marking a process as non-preemptible so that when
> > acquiring a spinlock the process can ensure that it exits the critical
> > section as fast as possible, when it would relinquish its
> > non-preemptible privilege.
> 
> That sounds like a pretty fundamental (and abusable) mechanism.
> 

It would be if it were generally available. The implementation on DYNIX/ptx
requires a privilege (PRIV_SCHED IIRC), to be able to use it. It was added
for a database to prevent preemption during critical sections.

> I had a suggestion from an IBM guy at ALS last year to make UML "spin"-locks 
> actually sleep in the host (this doesn't make them sleep locks in userspace 
> because they don't call schedule), which sounds reasonable.  This gives the 
> lock-holder an opportunity to run immediately.  It's unclear to me what the 
> wake-up mechanism would be, though.
> 

Hmmm... it depends what you mean by sleep, i.e. sleep(3) vs. making a
system call that sleeps. I would have thought the latter, and would use
semaphores again.

> Another thought I had was to raise the priority of a thread holding a 
> spinlock.  This would reduce the chance that it would be preempted by a thread 
> that will waste a timeslice spinning on that lock.  I don't know whether this 
> is a good idea either.
> 

That's basically a weaker version of the no-preempt mechanism. Not a bad
idea, but less than optimal :-)

Regards,

Tim

-- 
Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI


Re: kernel lock contention and scalability

2001-03-06 Thread Jeff Dike

[EMAIL PROTECTED] said:
> Here it is:
>   http://oss.sgi.com/projects/postwait/
> Check out the download section for a 2.4.0 patch. 

After having thought about this a bit more, I don't see why pw_post and 
pw_wait can't be implemented in userspace as:

#include <sys/types.h>
#include <signal.h>
#include <time.h>

static void pw_wakeup(int sig) { /* just interrupt nanosleep() */ }

int pw_post(uid_t uid)
{
	return kill(uid, SIGHUP); /* Or signal of the waiter's choice */
}

int pw_wait(struct timespec *t)
{
	struct timespec forever = { 86400, 0 }; /* NULL t: wait "forever" */

	signal(SIGHUP, pw_wakeup); /* default action would kill the waiter */
	return nanosleep(t ? t : &forever, t);
}

In the case of UML, there would be a uid field in its lock structure and the
spin code would look like:

	lock->uid = getpid();
	pw_wait(NULL);

and the lock release code would be:

	pw_post(lock->uid);

Obviously, sending signals to processes from the outside could massively 
confuse matters, but I don't see that being a big problem, since I think you 
can do that now, and no one is complaining about it.

Is there anything that I'm missing?

Jeff





Re: kernel lock contention and scalability

2001-03-06 Thread Rajagopal Ananthanarayanan

Jeff Dike wrote:
[ ... ]
> 
> > Another synchronization method popular with database peeps is "post/
> > wait" for which SGI have a patch available for Linux. I understand
> > that this is relatively "light weight" and might be a better choice
> > for PG.
> 
> URL?
> 
> Jeff


Here it is:

http://oss.sgi.com/projects/postwait/

Check out the download section for a 2.4.0 patch.

cheers,

ananth.

--
Rajagopal Ananthanarayanan ("ananth")
Member Technical Staff, SGI.
--



Re: kernel lock contention and scalability

2001-03-06 Thread Jeff Dike

[EMAIL PROTECTED] said:
> If you're a UP system, it never makes sense to spin in userland, since
> you'll just burn up a timeslice and prevent the lock holder from
> running. I haven't looked, but assume that their code only uses
> spinlocks on SMP. If you're an SMP system, then you shouldn't be using
> a spinlock unless the critical section is "short", in which case the
> waiters should simply spin in userland rather than making system calls
> which is simply overhead.

This is a problem that UML is going to have when I turn SMP back on.  
Emulating a multiprocessor box on a UP host with the existing locking 
primitives is going to result in exactly this problem.

> Actually, what's really needed here is an efficient form of
> dynamically marking a process as non-preemptible so that when
> acquiring a spinlock the process can ensure that it exits the critical
> section as fast as possible, when it would relinquish its
> non-preemptible privilege.

That sounds like a pretty fundamental (and abusable) mechanism.

I had a suggestion from an IBM guy at ALS last year to make UML "spin"-locks 
actually sleep in the host (this doesn't make them sleep locks in userspace 
because they don't call schedule), which sounds reasonable.  This gives the 
lock-holder an opportunity to run immediately.  It's unclear to me what the 
wake-up mechanism would be, though.

Another thought I had was to raise the priority of a thread holding a 
spinlock.  This would reduce the chance that it would be preempted by a thread 
that will waste a timeslice spinning on that lock.  I don't know whether this 
is a good idea either.
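
A sketch of the priority-boost idea under stated assumptions (plain POSIX
calls; SCHED_FIFO requires root, error checking is omitted, and
with_priority_boost() is a hypothetical helper, not UML code):

#include <sched.h>

/* Run the critical section in a real-time class so the lock holder
 * is not preempted by ordinary tasks, then drop back. */
void with_priority_boost(void (*critical_section)(void))
{
	struct sched_param rt = { 1 };		/* sched_priority = 1 */
	struct sched_param other = { 0 };

	sched_setscheduler(0, SCHED_FIFO, &rt);
	critical_section();			/* lock, work, unlock */
	sched_setscheduler(0, SCHED_OTHER, &other);
}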

> Another synchronization method popular with database peeps is "post/
> wait" for which SGI have a patch available for Linux. I understand
> that this is relatively "light weight" and might be a better choice
> for PG. 

URL?

Jeff





Re: kernel lock contention and scalability

2001-03-06 Thread Tim Wright

On Tue, Mar 06, 2001 at 11:39:17PM +, Matthew Kirkwood wrote:
> On Tue, 6 Mar 2001, Jonathan Lahr wrote:
> 
> [ sorry to reply over another reply, but I don't have
>   the original of this ]
> 
> > > Tridge and I tried out the postgresql benchmark you used here and this
> > > contention is due to a bug in postgres. From a quick strace, we found
> > > the threads do a load of select(0, NULL, NULL, NULL, {0,0}).
> 
> I can shed some light on this (though I'm far from a PG hacker).
> 
> Postgres can use either of two locking methods -- SysV semaphores
> (which it tries to avoid, asusming that they'll be too heavy) or
> userspace spinlocks (via inline assembler on platforms which support
> it).
> 
> In the slow path of a spinlock_acquire they busy wait for a few
> cycles, and then call schedule with a zero timeout assuming that
> it'll basically do the same as a sched_yield() but more portably.
> 

Ugh!
I had a nasty feeling that might be what they were up to. The reason for
the "ugh" is as follows. If you're a UP system, it never makes sense to
spin in userland, since you'll just burn up a timeslice and prevent the
lock holder from running. I haven't looked, but assume that their code only
uses spinlocks on SMP. If you're an SMP system, then you shouldn't be
using a spinlock unless the critical section is "short", in which case the 
waiters should simply spin in userland rather than making system calls which
is simply overhead. If the argument is that the "spinners" take too much
useful time away from other processes, then it sounds like the contention is
too high, or that the critical section is sufficiently long that semaphores
would be a better choice.

Actually, what's really needed here is an efficient form of dynamically
marking a process as non-preemptible so that when acquiring a spinlock the
process can ensure that it exits the critical section as fast as possible,
when it would relinquish its non-preemptible privilege.

Another synchronization method popular with database peeps is "post/wait"
for which SGI have a patch available for Linux. I understand that this is
relatively "light weight" and might be a better choice for PG.

Tim

-- 
Tim Wright - [EMAIL PROTECTED] or [EMAIL PROTECTED] or [EMAIL PROTECTED]
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI



Re: kernel lock contention and scalability

2001-03-06 Thread Matthew Kirkwood

On Tue, 6 Mar 2001, Jonathan Lahr wrote:

[ sorry to reply over another reply, but I don't have
  the original of this ]

> > Tridge and I tried out the postgresql benchmark you used here and this
> > contention is due to a bug in postgres. From a quick strace, we found
> > the threads do a load of select(0, NULL, NULL, NULL, {0,0}).

I can shed some light on this (though I'm far from a PG hacker).

Postgres can use either of two locking methods -- SysV semaphores
(which it tries to avoid, assuming that they'll be too heavy) or
userspace spinlocks (via inline assembler on platforms which support
it).

In the slow path of a spinlock_acquire they busy wait for a few
cycles, and then call schedule with a zero timeout assuming that
it'll basically do the same as a sched_yield() but more portably.
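
Concretely, the slow-path delay being described amounts to this (a sketch
of the reported pattern, not the actual s_lock() source):

#include <sys/time.h>
#include <unistd.h>

/* A zero timeout makes select() return at once; the only effect is a
 * trip into the kernel and, usually, a pass through schedule(). */
static void zero_timeout_delay(void)
{
	struct timeval tv;

	tv.tv_sec = 0;
	tv.tv_usec = 0;
	select(0, NULL, NULL, NULL, &tv);
}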

Matthew.




Re: kernel lock contention and scalability

2001-03-06 Thread Jonathan Lahr


> Tridge and I tried out the postgresql benchmark you used here and this
> contention is due to a bug in postgres. From a quick strace, we found
> the threads do a load of select(0, NULL, NULL, NULL, {0,0}). Basically all
> threads are pounding on schedule().
...
> Our guess is that the app has some form of userspace synchronisation
> (semaphores/spinlocks). I'd argue that the app needs to be fixed not the
> kernel, or a more valid test case is put forwards. :)
...
> PS: I just looked at the postgresql source and the spinlocks (s_lock() etc)
> are in a tight loop doing select(0, NULL, NULL, NULL, {0,0}). 

Anton,

Thanks for looking into postgresql/pgbench-related locking.  Yes,
apparently postgresql uses a synchronization scheme that calls select()
to effect back-off delays while attempting to acquire a lock.
However, it seems to me that runqueue lock contention was not entirely due 
to postgresql code, since it was largely alleviated by the multiqueue 
scheduler patch.

In using postgresql/pgbench to measure lock contention, I was attempting
to apply a typical server workload and measure scalability using only open
software.  My goal is to load the kernel with server work and measure its
performance, so I need to ensure that the software I use represents likely
real-world server configurations.  I did not use MySQL because it cannot
perform transactions, which I considered important.  Any pointers to other
open database software or benchmarks that might be suitable for this effort
would be appreciated.

Jonathan




Re: kernel lock contention and scalability

2001-03-05 Thread Jonathan Lahr


Manfred Spraul [[EMAIL PROTECTED]] wrote:
>
> > lock contention work would be appreciated.  I'm aware of timer scalability
> > work ongoing at people.redhat.com/mingo/scalable-timers, but is anyone
> > working on reducing sem_ids contention?
>
> Is that really a problem?
> The contention is high, but the actual lost time is quite small.

I agree it isn't a major performance problem under that workload.  But
since the contention was high, I thought other workloads that exercise the
semaphore code more heavily might show it to be a significant problem.

> I've attached 2 changes that might reduce the contention, but it's just
> an idea, completely untested.

Thanks for the insight into the semaphore subsystem and the suggested fixes.

--
Jonathan Lahr
IBM Linux Technology Center
Beaverton, Oregon
[EMAIL PROTECTED]
503-578-3385


Re: kernel lock contention and scalability

2001-03-04 Thread Anton Blanchard

 
Hi,

> To discover possible locking limitations to scalability, I have collected 
> locking statistics on a 2-way, 4-way, and 8-way performing as networked
> database servers.  I patched the [48]-way kernels with Kravetz's multiqueue 
> patch in the hope that mitigating runqueue_lock contention might better 
> reveal other lock contention.

...

>   24.38%  23.93%   15us(  218us)   4.3us(  111us)  744475  566289  178186  0  runqueue_lock
>   23.15%  38.78%   28us(  218us)   6.2us(  108us)  376292  230381  145911  0    schedule+0xe0

Tridge and I tried out the postgresql benchmark you used here and this
contention is due to a bug in postgres. From a quick strace, we found
the threads do a load of select(0, NULL, NULL, NULL, {0,0}). Basically all
threads are pounding on schedule().

Our guess is that the app has some form of userspace synchronisation
(semaphores/spinlocks). I'd argue that the app needs to be fixed, not the
kernel, or that a more valid test case be put forward. :)

PS: I just looked at the postgresql source and the spinlocks (s_lock() etc.)
are in a tight loop doing select(0, NULL, NULL, NULL, {0,0}). In samba
we have userspace spinlocks, but they cover small amounts of code and
offer an advantage over IPC semaphores. When you have to synchronise
large sections of code, IPC semaphores are reasonably fast on Linux and
would be a better fit.
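
For illustration, the IPC-semaphore style of lock being recommended looks
roughly like this (a sketch: one SysV semaphore used as a mutex, error
handling omitted):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

static int semid;

void lock_init(void)
{
	struct sembuf up = { 0, 1, 0 };

	semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	semop(semid, &up, 1);		/* start unlocked (value 1) */
}

void lock(void)
{
	struct sembuf down = { 0, -1, 0 };

	semop(semid, &down, 1);		/* sleeps in the kernel; no spinning */
}

void unlock(void)
{
	struct sembuf up = { 0, 1, 0 };

	semop(semid, &up, 1);		/* lets one sleeper proceed */
}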

Cheers,
Anton


Re: kernel lock contention and scalability

2001-02-25 Thread Manfred Spraul

Jonathan Lahr wrote:
> 
> To discover possible locking limitations to scalability, I have collected
> locking statistics on a 2-way, 4-way, and 8-way performing as networked
> database servers.  I patched the [48]-way kernels with Kravetz's multiqueue
> patch in the hope that mitigating runqueue_lock contention might better
> reveal other lock contention.
>

The dual-CPU numbers are really odd: extremely high counts for
add_timer(), del_timer_sync(), schedule() and process_timeout().

That could be a kernel bug: perhaps someone uses

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(100);
	}

without checking signal_pending()?
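
The usual 2.4 idiom checks for signals on each pass, e.g. (a minimal
sketch):

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(100);
		if (signal_pending(current))
			break;	/* without this, a pending signal leaves the task spinning */
	}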


> In the attached document, I describe my test environment and excerpt
> lockstat output to show the more contentious locks for a typical run on
> each of my server configurations.  I'm interested in comparing these data
> to other lock contention data, so information regarding previous or ongoing
> lock contention work would be appreciated.  I'm aware of timer scalability
> work ongoing at people.redhat.com/mingo/scalable-timers, but is anyone
> working on reducing sem_ids contention?
>

Is that really a problem?
The contention is high, but the actual lost time is quite small.

The 8-way test ran for ~ 129 seconds wall clock time (total cpu time
1030 seconds), and around 0.7 seconds were lost due to spinning.
The high contention is caused by the wakeups: cpu0 scans the list of
waiting processes and wakes up any it finds. If the woken thread runs
before cpu0 can release the spinlock, the second cpu will spin.

I've attached 2 changes that might reduce the contention, but it's just
an idea, completely untested.

* slightly more efficient try_atomic_semop().
* don't acquire the spinlock if q->alter was 0. It could slightly
improve performance, but I assume that q->alter will always be 1.

Btw, I found a small bug in try_atomic_semop():
If a semaphore operation with sem_op==0 blocks, then the pid is
corrupted. The bug also exists in 2.2.

--
Manfred

--- sem.c.old   Sun Feb 25 10:50:55 2001
+++ sem.c   Sun Feb 25 10:51:19 2001
@@ -250,23 +250,23 @@
curr = sma->sem_base + sop->sem_num;
sem_op = sop->sem_op;
 
-   if (!sem_op && curr->semval)
+   result = curr->semval;
+   if (!sem_op && result)
goto would_block;
+   result += sem_op;
+   if (result < 0)
+   goto would_block;
+   if (result > SEMVMX)
+   goto out_of_range;
 
curr->sempid = (curr->sempid << 16) | pid;
-   curr->semval += sem_op;
+   curr->semval = result;
if (sop->sem_flg & SEM_UNDO)
un->semadj[sop->sem_num] -= sem_op;
-
-   if (curr->semval < 0)
-   goto would_block;
-   if (curr->semval > SEMVMX)
-   goto out_of_range;
}
 
if (do_undo)
{
-   sop--;
result = 0;
goto undo;
}
@@ -285,6 +285,7 @@
result = 1;
 
 undo:
+   sop--;
while (sop >= sops) {
curr = sma->sem_base + sop->sem_num;
curr->semval -= sop->sem_op;
@@ -305,7 +306,9 @@
 {
int error;
struct sem_queue * q;
+   int do_retry = 0;
 
+retry:
for (q = sma->sem_pending; q; q = q->next) {

if (q->status == 1)
@@ -323,10 +326,17 @@
q->status = 1;
return;
}
-   q->status = error;
remove_from_queue(sma,q);
+   wmb();
+   q->status = error;
+   /* FIXME: retry only required if an increase was
+* executed
+*/
+   do_retry = 1;
}
}
+   if (do_retry)
+   goto retry;
 }
 
 /* The following counts are associated to each semaphore:
@@ -919,7 +929,13 @@
sem_unlock(semid);
 
schedule();
-
+   if (queue.status == 0) {
+   error = 0;
+   if (queue.prev)
+   BUG();
+   current->semsleeping = NULL;
+   goto out_free;
+   }
tmp = sem_lock(semid);
if(tmp==NULL) {
if(queue.prev != NULL)





kernel lock contention and scalability

2001-02-15 Thread Jonathan Lahr


To discover possible locking limitations to scalability, I have collected 
locking statistics on a 2-way, 4-way, and 8-way performing as networked
database servers.  I patched the [48]-way kernels with Kravetz's multiqueue 
patch in the hope that mitigating runqueue_lock contention might better 
reveal other lock contention.

In the attached document, I describe my test environment and excerpt
lockstat output to show the more contentious locks for a typical run on 
each of my server configurations.  I'm interested in comparing these data 
to other lock contention data, so information regarding previous or ongoing 
lock contention work would be appreciated.  I'm aware of timer scalability 
work ongoing at people.redhat.com/mingo/scalable-timers, but is anyone 
working on reducing sem_ids contention?

--
Jonathan Lahr
IBM Linux Technology Center
Beaverton, Oregon
[EMAIL PROTECTED]
503-578-3385




server configuration:
  hardware:
memory:
  2-way:  0.5 GB
  4-way:  1 GB
  8-way:  1 GB
cpus:
  2-way:  Pentium II, 300 MHz
  [48]-way:  Pentium III, 700 MHz
NICs:  100 Mbps ethernet (2)
  software:
distribution:  Redhat 7.0
kernel:  
  2-way:  2.4.0-test10 patched with lockmeter1.4.5-2.4.0 
  [48]-way:  2.4.0 patched with lockmeter1.4.5-2.4.0, 2.4.0.MQ1-sched.rt
database:  postgresql-7.0.2-17
client:  pgbench (distributed with postgresql)

lockstat excerpts:

  2way:

SPINLOCKS                 HOLD                  WAIT
  UTIL    CON      MEAN (   MAX  )    MEAN (   MAX  )   TOTAL  NOWAIT    SPIN  REJECT  NAME

  4.04%   1.22%     50us(  3344us)   5.2us(  2014us)    36515   36068     447       0  kernel_flag
  0.01%   3.47%     46us(   427us)    17us(  2014us)      144     139       5       0    do_coredump+0x24
  0.00%   0.00%    960us(   960us)     0us                  1       1       0       0    do_exit+0x94
  0.00%   4.00%    2.0us(   4.2us)    75us(  1876us)       25      24       1       0    ext2_discard_prealloc+0x24
  0.03%   0.70%     11us(  1048us)   1.3us(   682us)     1144    1136       8       0    ext2_get_block+0x50
  1.78%   0.79%    455us(  3344us)   0.8us(   759us)     1766    1752      14       0    ext2_sync_file+0x28
  0.62%   0.84%     12us(  1289us)   2.5us(  1717us)    23353   23157     196       0    real_lookup+0x68
  1.46%   1.29%    186us(  2980us)   5.4us(  1824us)     3553    3507      46       0    schedule+0x490
  0.01%   0.00%    456us(   596us)     0us                  9       9       0       0    sync_old_buffers+0x20
  0.01%   1.83%    9.4us(    84us)   0.7us(    92us)      328     322       6       0    sys_fcntl64+0x44
  0.00%   3.87%    8.0us(   329us)   6.7us(  1011us)      155     149       6       0    sys_ioctl+0x48
  0.02%   2.79%    1.9us(   805us)    19us(  1986us)     5483    5330     153       0    sys_lseek+0x70
  0.00%   0.00%     22us(    22us)     0us                  1       1       0       0    sys_sysctl+0x50
  0.01%   3.23%     17us(    84us)   0.5us(    25us)      155     150       5       0    tty_read+0xbc
  0.02%   2.35%     39us(   110us)   0.2us(    11us)      213     208       5       0    tty_write+0x1dc
  0.07%   1.09%    168us(  1442us)   0.7us(   116us)      184     182       2       0    vfs_readdir+0x70
  0.00%   0.00%     31us(    31us)     0us                  1       1       0       0    vfs_statfs+0x54

 24.38%  23.93%     15us(   218us)   4.3us(   111us)   744475  566289  178186       0  runqueue_lock
  0.06%  15.97%    4.5us(    26us)   2.6us(    67us)     5592    4699     893       0    __wake_up+0xdc
  0.00%  10.27%    0.4us(   1.3us)   1.5us(    60us)      146     131      15       0    deliver_signal+0x58
  1.16%   8.59%    1.5us(    27us)   2.3us(   111us)   360313  329373   30940       0    process_timeout+0x14
  0.00%   0.00%    0.6us(   0.6us)     0us                  1       1       0       0    release+0x28
 23.15%  38.78%     28us(   218us)   6.2us(   108us)   376292  230381  145911       0    schedule+0xe0
  0.01%  45.34%    3.7us(    24us)    16us(    82us)      686     375     311       0    schedule+0x458
  0.00%   0.00%    2.8us(    70us)     0us                 89      89       0       0    schedule+0x504
  0.01%   8.55%    3.0us(    18us)   1.9us(    68us)     1356    1240     116       0    wake_up_process+0x14

  0.11%   4.97%     12us(  1113us)   1.0us(  1540us)     4041    3840     201       0  sem_ids+0x24
  0.00%   1.32%    7.1us(    88us)   0.1us(    11us)      303     299       4       0    semctl_main+0x4c
  0.06%   3.85%     11us(   281us)   0.5us(    81us)     2392    2300      92       0
