Re: [HACKERS] Spinlock performance improvement proposal

2001-10-01 Thread Karel Zak


On Sat, Sep 29, 2001 at 06:48:56PM +0530, Chamanya wrote:
 
 Number of threads should be equal to or twice that of number of CPUs. I don't 
 think more than those many threads would yield any performance improvement.
 
 
 This assumes that every thread is always running, but each process (thread)
sometimes waits for disk, network, etc. During that time another thread can run.
 The performance of a program does not depend directly on the number of CPUs, but
on the type of work the threads execute. The important thing is how well you can
split the work into small, independent parts.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
 http://home.zf.jcu.cz/~zakkr/
 
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Re: [HACKERS] Spinlock performance improvement proposal

2001-10-01 Thread Justin Clift


Tom Lane wrote:
 
snip
 I think the default NBuffers (64) is too low to give meaningful
 performance numbers, too.  I've been thinking that maybe we should
 raise it to 1000 or so by default.  This would trigger startup failures
 on platforms with small SHMMAX, but we could tell people to use -B until
 they get around to fixing their kernel settings.  It's been a long time
 since we fit into a 1-MB shared memory segment at the default settings
 anyway, so maybe it's time to select somewhat-realistic defaults.
 What we have now is neither very useful nor the lowest common
 denominator...

How about a startup error message which gets displayed when used with
untuned settings (i.e. the default settings), maybe unless an option
like -q (quiet) is given?

My thought is the server should operate, but let the new/novice admin
know they need to configure PostgreSQL properly.  Would probably be a
good reminder for experienced admins if they forget too.

Maybe something simple like pg_ctl shell script message, or something
proper like a postmaster start-up check.

This wouldn't break anything would it?

Regards and best wishes,

Justin Clift

 
 regards, tom lane
 

-- 
My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there.
 - Indira Gandhi


Re: [HACKERS] Spinlock performance improvement proposal

2001-10-01 Thread Giles Lean



Bruce Momjian [EMAIL PROTECTED] wrote:

 From postmaster startup, by default, could we try larger amounts of
 buffer memory until it fails then back off and allocate that?  Seems
 like a nice default to me.

So performance would vary depending on the amount of shared memory
that could be allocated at startup?  Not a good idea IMHO.

Regards,

Giles



Re: [HACKERS] Spinlock performance improvement proposal

2001-10-01 Thread Bruce Momjian


 This is still missing a bet since it fails to mention the option of
 adjusting -B and -N instead of changing kernel parameters, but that's
 easily fixed.  I propose that we reword this message and the semget
 one to mention first the option of changing -B/-N and second the option
 of changing kernel parameters.  Then we could consider raising the
 default -B setting to something more realistic.

Yes, we could do that but it makes things harder for newbies and really
isn't the right numbers for production use anyway.  I think anyone using
default values should see a message asking them to tune it.  Can we
throw a message during initdb?  Of course, we don't have a running
backend at that point so you would always throw a message.

From postmaster startup, by default, could we try larger amounts of
buffer memory until it fails then back off and allocate that?  Seems
like a nice default to me.


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Spinlock performance improvement proposal

2001-10-01 Thread Tom Lane


Bruce Momjian [EMAIL PROTECTED] writes:
 From postmaster startup, by default, could we try larger amounts of
 buffer memory until it fails then back off and allocate that?  Seems
 like a nice default to me.

Chewing all available memory is the very opposite of a nice default,
I'd think.

The real problem here is that some platforms will let us have huge shmem
segments, and some will only let us have tiny ones, and neither of those
is a reasonable default behavior.  Allowing the platform to determine
our sizing is the wrong way round IMHO; the dbadmin should have a clear
idea of what he's getting, and silent adjustment of the B/N parameters
will not give him that.

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-10-01 Thread Tom Lane


Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I think the default NBuffers (64) is too low to give meaningful
 performance numbers, too.  I've been thinking that maybe we should
 raise it to 1000 or so by default.

 Maybe something simple like pg_ctl shell script message, or something
 proper like a postmaster start-up check.

 Yes, this seems like the way to go, probably something in the postmaster
 log file.

Except that a lot of people send postmaster stderr to /dev/null.
I think bleating about untuned parameters in the postmaster log will be
next to useless, because it won't do a thing except for people who are
clueful enough to (a) direct the log someplace useful and (b) look at it
carefully.  Those folks are not the ones who need help about tuning.

We already have quite detailed error messages for shmget/semget
failures, eg

$ postmaster -B 20
IpcMemoryCreate: shmget(key=5440001, size=1668366336, 03600) failed: Invalid argument

This error can be caused by one of three things:

1. The maximum size for shared memory segments on your system was
   exceeded.  You need to raise the SHMMAX parameter in your kernel
   to be at least 4042162176 bytes.

2. The requested shared memory segment was too small for your system.
   You need to lower the SHMMIN parameter in your kernel.

3. The requested shared memory segment already exists but is of the
   wrong size.  This can occur if some other application on your system
   is also using shared memory.

The PostgreSQL Administrator's Guide contains more information about
shared memory configuration.


This is still missing a bet since it fails to mention the option of
adjusting -B and -N instead of changing kernel parameters, but that's
easily fixed.  I propose that we reword this message and the semget
one to mention first the option of changing -B/-N and second the option
of changing kernel parameters.  Then we could consider raising the
default -B setting to something more realistic.
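
As an illustration of the proposed ordering only (not the actual postmaster code), here is a minimal C sketch that builds a failure hint mentioning -B/-N first and kernel parameters second. The function name `sketch_shmget_hint` and the message wording are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a hint for a failed shmget() call.
 * For the errno values that usually indicate a size/limit problem,
 * the -B/-N option is listed first and the kernel setting second.
 * Returns the snprintf() result.
 */
static int sketch_shmget_hint(int err, unsigned long size,
                              char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    if (err != EINVAL && err != ENOMEM && err != ENOSPC)
        return snprintf(buf, buflen, "shmget(size=%lu) failed: %s\n",
                        size, strerror(err));

    return snprintf(buf, buflen,
        "shmget(size=%lu) failed: %s\n"
        "1. Reduce the request by starting the postmaster with a smaller\n"
        "   -B (buffers) or -N (max backends) value.\n"
        "2. Or raise the kernel's shared memory limits (for example SHMMAX).\n",
        size, strerror(err));
}
```

A caller would pass the errno left by the failed shmget() call and print the resulting buffer to stderr.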

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-10-01 Thread Bruce Momjian


 Tom Lane wrote:
  
 snip
  I think the default NBuffers (64) is too low to give meaningful
  performance numbers, too.  I've been thinking that maybe we should
  raise it to 1000 or so by default.  This would trigger startup failures
  on platforms with small SHMMAX, but we could tell people to use -B until
  they get around to fixing their kernel settings.  It's been a long time
  since we fit into a 1-MB shared memory segment at the default settings
  anyway, so maybe it's time to select somewhat-realistic defaults.
  What we have now is neither very useful nor the lowest common
  denominator...
 
 How about a startup error message which gets displayed when used with
 untuned settings (i.e. the default settings), maybe unless an option
 like -q (quiet) is given?
 
 My thought is the server should operate, but let the new/novice admin
 know they need to configure PostgreSQL properly.  Would probably be a
 good reminder for experienced admins if they forget too.
 
 Maybe something simple like pg_ctl shell script message, or something
 proper like a postmaster start-up check.

Yes, this seems like the way to go, probably something in the postmaster
log file.  For single-user developers, we want it to start but we want
production machines to tune it. In fact, picking a higher number for
these values may be almost as far off as our defaults.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Justin Clift


Vadim Mikheev wrote:
 
  I have committed changes to implement this proposal.  I'm not seeing
  any significant performance difference on pgbench on my single-CPU
  system ... but pgbench is I/O bound anyway on this hardware, so that's
  not very surprising.  I'll be interested to see what other people
  observe.  (Tatsuo, care to rerun that 1000-client test?)
 
 What is your system? CPU, memory, IDE/SCSI, OS?
 Scaling factor and # of clients?
 
 BTW1 - shouldn't we rewrite pgbench to use threads instead of
 libpq async queries? At least as option. I'd say that with 1000
 clients current pgbench implementation is very poor.

Would it be useful to run a test like the AS3AP benchmark on this to
look for performance measurements?

On linux the Open Source Database Benchmark (osdb.sf.net) does this, and
it's multi-threaded to simulate multiple clients hitting the database at
once.  The only inconvenience is having to download a separate program
to generate the test data, as OSDB doesn't generate this itself yet.  I
can supply the test program (needs to be run through Wine) and a script
if anyone wants.

???

 
 BTW2 - shouldn't we learn if there are really portability/performance
 issues in using POSIX mutex-es (and cond. variables) in place of
 TAS (and SysV semaphores)?
 
 Vadim
 

-- 
My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there.
 - Indira Gandhi


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Tom Lane


I wrote:
 The following proposal should improve performance substantially when
 there is contention for a lock, but it creates no portability risks
 ...

I have committed changes to implement this proposal.  I'm not seeing
any significant performance difference on pgbench on my single-CPU
system ... but pgbench is I/O bound anyway on this hardware, so that's
not very surprising.  I'll be interested to see what other people
observe.  (Tatsuo, care to rerun that 1000-client test?)

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Vadim Mikheev


 I have committed changes to implement this proposal.  I'm not seeing
 any significant performance difference on pgbench on my single-CPU
 system ... but pgbench is I/O bound anyway on this hardware, so that's
 not very surprising.  I'll be interested to see what other people
 observe.  (Tatsuo, care to rerun that 1000-client test?)

What is your system? CPU, memory, IDE/SCSI, OS?
Scaling factor and # of clients?

BTW1 - shouldn't we rewrite pgbench to use threads instead of
libpq async queries? At least as option. I'd say that with 1000
clients current pgbench implementation is very poor.

BTW2 - shouldn't we learn if there are really portability/performance
issues in using POSIX mutex-es (and cond. variables) in place of
TAS (and SysV semaphores)?

Vadim




Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Chamanya


On Thursday 27 September 2001 04:09, you wrote:
 This depends on your system.  Solaris has a huge difference between
 thread and process context switch times, whereas Linux has very little
 difference (and in fact a Linux process context switch is about as
 fast as a Solaris thread switch on the same hardware--Solaris is just
 a pig when it comes to process context switching).

I have never worked on any big systems but from what (little) I have seen, I 
think there should be a hybrid model.

This whole discussion started off, from poor performance on SMP machines. If 
I am getting this correctly, threads can be spread on multiple CPUs if 
available but process can not.

So I would suggest to have threaded approach for intensive tasks such as 
sorting/searching etc. IMHO converting entire paradigm to thread based is a 
huge task and may not be required in all cases. 

Here is one approach.  Threads are created when they are needed and kept
dormant when they are not, so there is no recreation overhead (if
that's a concern). At any given point in time, one backend connection has
as many threads as there are CPUs; more than that may not yield much
performance improvement. A big task like sorting is split and handed to
different threads so that it can use them all.

It should be easy to switch the threading function and arguments on the fly, 
restricting number of threads and there will not be much of thread switching 
as each thread handles different parts of task and later the results are 
merged.

Number of threads should be equal to or twice that of number of CPUs. I don't 
think more than those many threads would yield any performance improvement.

And with this approach we can migrate one functionality at a time to threaded 
one, thus avoiding big effort at any given time.

Just a suggestion.

 Shridhar




Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Tom Lane


Vadim Mikheev [EMAIL PROTECTED] writes:
 I have committed changes to implement this proposal.  I'm not seeing
 any significant performance difference on pgbench on my single-CPU
 system ... but pgbench is I/O bound anyway on this hardware, so that's
 not very surprising.  I'll be interested to see what other people
 observe.  (Tatsuo, care to rerun that 1000-client test?)

 What is your system? CPU, memory, IDE/SCSI, OS?
 Scaling factor and # of clients?

HP C180, SCSI-2 disks, HPUX 10.20.  I used scale factor 10 and between
1 and 10 clients.  Now that I think about it, I was running with the
default NBuffers (64), which probably constrained performance too.

 BTW1 - shouldn't we rewrite pgbench to use threads instead of
 libpq async queries? At least as option. I'd say that with 1000
 clients current pgbench implementation is very poor.

Well, it uses select() to wait for activity, so as long as all query
responses arrive as single packets I don't see the problem.  Certainly
rewriting pgbench without making libpq thread-friendly won't help a bit.
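
For readers unfamiliar with the pattern, here is a minimal C sketch of waiting on many descriptors with a single select() call. It uses plain file descriptors rather than libpq connections, and the function name `sketch_first_ready` is an assumption for illustration; it is not taken from pgbench.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for any descriptor in fds[] to become readable.
 * Returns the index of the first readable descriptor, or -1 on timeout
 * or error. One process can watch many client sockets this way
 * without needing one thread per client.
 */
static int sketch_first_ready(const int *fds, int nfds, int timeout_ms)
{
    fd_set readable;
    struct timeval tv;
    int maxfd = -1;

    FD_ZERO(&readable);
    for (int i = 0; i < nfds; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &readable);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (select(maxfd + 1, &readable, NULL, NULL, &tv) <= 0)
        return -1;

    for (int i = 0; i < nfds; i++)
        if (FD_ISSET(fds[i], &readable))
            return i;
    return -1;
}
```

The FD_SETSIZE limit checked by the assertion is one practical ceiling on how many clients a single select() loop can watch.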

 BTW2 - shouldn't we learn if there are really portability/performance
 issues in using POSIX mutex-es (and cond. variables) in place of
 TAS (and SysV semaphores)?

Sure, that'd be worth looking into on a long-term basis.

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Bruce Momjian



Good summary.  I agree checkpoint should look like as normal a Proc as
possible.


 At the just-past OSDN database conference, Bruce and I were annoyed by
 some benchmark results showing that Postgres performed poorly on an
 8-way SMP machine.  Based on past discussion, it seems likely that the
 culprit is the known inefficiency in our spinlock implementation.
 After chewing on it for awhile, we came up with an idea for a solution.
 
 The following proposal should improve performance substantially when
 there is contention for a lock, but it creates no portability risks
 because it uses the same system facilities (TAS and SysV semaphores)
 that we have always relied on.  Also, I think it'd be fairly easy to
 implement --- I could probably get it done in a day.
 
 Comments anyone?
 
   regards, tom lane
 
 
 Plan:
 
 Replace most uses of spinlocks with lightweight locks (LW locks)
 implemented by a new lock manager.  The principal remaining use of true
 spinlocks (TAS locks) will be to provide mutual exclusion of access to
 LW lock structures.  Therefore, we can assume that spinlocks are never
 held for more than a few dozen instructions --- and never across a kernel
 call.
 
 It's pretty easy to rejigger the spinlock code to work well when the lock
 is never held for long.  We just need to change the spinlock retry code
 so that it does a tight spin (continuous retry) for a few dozen cycles ---
 ideally, the total delay should be some small multiple of the max expected
 lock hold time.  If lock still not acquired, yield the CPU via a select()
 call (10 msec minimum delay) and repeat.  Although this looks inefficient,
 it doesn't matter on a uniprocessor because we expect that backends will
 only rarely be interrupted while holding the lock, so in practice a held
 lock will seldom be encountered.  On SMP machines the tight spin will win
 since the lock will normally become available before we give up and yield
 the CPU.
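
The retry policy described above can be sketched in C roughly as follows. This is an illustration under assumptions, not the committed code: it uses a C11 `atomic_flag` in place of the platform TAS instruction, and the names (`sketch_spinlock`, `SPINS_BEFORE_SLEEP`) and the spin count of 100 are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

#define SPINS_BEFORE_SLEEP 100    /* assumed value; "a few dozen cycles" */
#define SLEEP_USEC         10000  /* 10 msec, as in the proposal */

typedef struct { atomic_flag held; } sketch_spinlock;
#define SKETCH_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One test-and-set attempt; true means we now hold the lock. */
static bool sketch_spin_try(sketch_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/*
 * Tight-spin briefly, then yield the CPU with select() and retry.
 * Returns how many times we had to sleep (0 on the fast path).
 */
static int sketch_spin_acquire(sketch_spinlock *lock)
{
    int sleeps = 0;

    assert(lock != NULL);
    for (;;) {
        for (int i = 0; i < SPINS_BEFORE_SLEEP; i++)
            if (sketch_spin_try(lock))
                return sleeps;

        /* Lock still held: sleep without watching any descriptors. */
        struct timeval delay = { 0, SLEEP_USEC };
        select(0, NULL, NULL, NULL, &delay);
        sleeps++;
    }
}

static void sketch_spin_release(sketch_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

On an uncontended lock the first test-and-set succeeds and no sleep occurs, which is the fall-through case the proposal relies on.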
 
 Desired properties of the LW lock manager include:
   * very fast fall-through when no contention for lock
   * waiting proc does not spin
   * support both exclusive and shared (read-only) lock modes
   * grant lock to waiters in arrival order (no starvation)
   * small lock structure to allow many LW locks to exist.
 
 Proposed contents of LW lock structure:
 
   spinlock mutex (protects LW lock state and PROC queue links)
   count of exclusive holders (always 0 or 1)
   count of shared holders (0 .. MaxBackends)
   queue head pointer (NULL or ptr to PROC object)
   queue tail pointer (could do without this to save space)
 
 If a backend sees it must wait to acquire the lock, it adds its PROC
 struct to the end of the queue, releases the spinlock mutex, and then
 sleeps by P'ing its per-backend wait semaphore.  A backend releasing the
 lock will check to see if any waiter should be granted the lock.  If so,
 it will update the lock state, release the spinlock mutex, and finally V
 the wait semaphores of any backends that it decided should be released
 (which it removed from the lock's queue while holding the spinlock mutex).  Notice
 that no kernel calls need be done while holding the spinlock.  Since the
 wait semaphore will remember a V occurring before P, there's no problem
 if the releaser is fast enough to release the waiter before the waiter
 reaches its P operation.
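
The lock structure and the queue handling described above can be modelled in C as a single-threaded sketch. All names here (`sketch_lwlock`, `sketch_proc`, `lw_acquire_or_queue`, `lw_release`) are assumptions for illustration. The spinlock mutex is left out, and the per-backend semaphore is reduced to a counter, so this shows only the state transitions, not real blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LW_SHARED, LW_EXCLUSIVE } lw_mode;

typedef struct sketch_proc {
    bool lw_waiting;               /* queued on an LW lock? */
    lw_mode lw_wait_mode;          /* read or write access wanted */
    struct sketch_proc *lw_next;   /* separate LW queue link */
    int wait_sem;                  /* stand-in for the wait semaphore */
} sketch_proc;

typedef struct {
    int exclusive;                 /* exclusive holders: 0 or 1 */
    int shared;                    /* shared holders */
    sketch_proc *head, *tail;      /* waiters in arrival order */
} sketch_lwlock;

static bool lw_can_grant(const sketch_lwlock *lock, lw_mode mode)
{
    if (lock->exclusive != 0)
        return false;
    return mode == LW_SHARED || lock->shared == 0;
}

/* Returns true if granted at once; otherwise queues the caller, which
 * in the real design would release the mutex and P its semaphore. */
static bool lw_acquire_or_queue(sketch_lwlock *lock, sketch_proc *me,
                                lw_mode mode)
{
    /* Never jump ahead of queued waiters: preserves arrival order. */
    if (lock->head == NULL && lw_can_grant(lock, mode)) {
        if (mode == LW_EXCLUSIVE) lock->exclusive = 1; else lock->shared++;
        return true;
    }
    me->lw_waiting = true;
    me->lw_wait_mode = mode;
    me->lw_next = NULL;
    if (lock->tail != NULL) lock->tail->lw_next = me; else lock->head = me;
    lock->tail = me;
    return false;
}

/* Release the lock and grant it to waiters from the queue head.
 * Returns the number of waiters woken. The real design would do the
 * V operations only after releasing the spinlock mutex. */
static int lw_release(sketch_lwlock *lock, lw_mode held)
{
    int woken = 0;

    if (held == LW_EXCLUSIVE) {
        assert(lock->exclusive == 1);
        lock->exclusive = 0;
    } else {
        assert(lock->shared > 0);
        lock->shared--;
    }
    while (lock->head != NULL &&
           lw_can_grant(lock, lock->head->lw_wait_mode)) {
        sketch_proc *p = lock->head;

        if (p->lw_wait_mode == LW_EXCLUSIVE) lock->exclusive = 1;
        else lock->shared++;
        lock->head = p->lw_next;
        if (lock->head == NULL) lock->tail = NULL;
        p->lw_next = NULL;
        p->lw_waiting = false;     /* tells the waiter it owns the lock */
        p->wait_sem++;             /* V */
        woken++;
    }
    return woken;
}
```

In this model a shared request that arrives behind a queued exclusive waiter is queued too, which is one way to get the "no starvation" property listed among the desired properties.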
 
 We will need to add a few fields to PROC structures:
   * Flag to show whether PROC is waiting for an LW lock, and if so
 whether it waits for read or write access
   * Additional PROC queue link field.
 We can't reuse the existing queue link field because it is possible for a
 PROC to be waiting for both a heavyweight lock and a lightweight one ---
 this will occur when HandleDeadLock or LockWaitCancel tries to acquire
 the LockMgr module's lightweight lock (formerly spinlock).
 
 It might seem that we also need to create a second wait semaphore per
 backend, one to wait on HW locks and one to wait on LW locks.  But I
 believe we can get away with just one, by recognizing that a wait for an
 LW lock can never be interrupted by a wait for a HW lock, only vice versa.
 After being awoken (V'd), the LW lock manager must check to see if it was
 actually granted the lock (easiest way: look at own PROC struct to see if
 LW lock wait flag has been cleared).  If not, the V must have been to
 grant us a HW lock --- but we still have to sleep to get the LW lock.  So
 remember this happened, then loop back and P again.  When we finally get
 the LW lock, if there was an extra P operation then V the semaphore once
 before returning.  This will allow ProcSleep to exit the wait for the HW
 lock when we return to it.
 
 Fine points:
 
 While waiting for an LW lock, we need to show in our PROC struct whether
 we are waiting for read or write access.  But we don't need to remember
 this after getting the lock; if we know we have the lock, it's easy to
 see by

Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Gunnar Rønning


* Doug McNaught [EMAIL PROTECTED] wrote:
|
| Depends on what you mean.  For scaling well with many connections and
| simultaneous queries, there's no reason IMHO that the current
| process-per-backend model won't do, assuming the locking issues are
| addressed. 

Wouldn't a threading model allow you to share more data across different
connections ? I'm thinking in terms of introducing more cache functionality
to improve performance. What is shared memory used for today ?

-- 
Gunnar Rønning - [EMAIL PROTECTED]
Senior Consultant, Polygnosis AS, http://www.polygnosis.com/


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread mlw


Chamanya wrote:
 
 On Thursday 27 September 2001 04:09, you wrote:
  This depends on your system.  Solaris has a huge difference between
  thread and process context switch times, whereas Linux has very little
  difference (and in fact a Linux process context switch is about as
  fast as a Solaris thread switch on the same hardware--Solaris is just
  a pig when it comes to process context switching).
 
 I have never worked on any big systems but from what (little) I have seen, I
 think there should be a hybrid model.
 
 This whole discussion started off, from poor performance on SMP machines. If
 I am getting this correctly, threads can be spread on multiple CPUs if
 available but process can not.

Different processes will be handled evenly across all CPUs in an SMP
machine, unless you set process affinity for a process and a CPU.
 
 So I would suggest to have threaded approach for intensive tasks such as
 sorting/searching etc. IMHO converting entire paradigm to thread based is a
 huge task and may not be required in all cases.

Dividing a query into multiple threads is an amazing task. I wish I had a
couple years and someone willing to pay me to try it.

 
 I think of an approach.  Threads are created when they are needed but they
 are kept dormant when not needed. So that there is no recreation overhead(if
 that's a concern). So at any given point of time, one back end connection has
 as many threads as number of CPUs. More than that may not yield much of
 performance improvement. Say a big task like sorting is split and given to
 different threads so that it can use them all.

This is a huge undertaking, and quite frankly, if I understand PostgreSQL, a
complete redesign of the entire system.
 
 It should be easy to switch the threading function and arguments on the fly,
 restricting number of threads and there will not be much of thread switching
 as each thread handles different parts of task and later the results are
 merged.

That is not what I would consider easy.

 
 Number of threads should be equal to or twice that of number of CPUs. I don't
 think more than those many threads would yield any performance improvement.

That isn't true at all.

One of the problems I see when people discuss performance on an SMP
machine, is that they usually think from the perspective of a single task. If
you are doing data mining, one sql query may take a very long time. Which may
be a problem, but in the grander scheme of things there are usually multiple
concurrent performance issues to be considered. Threading the back end for
parallel query processing will probably not help this. More often than not a
database has much more to do than one thing at a time.

Also, if you are threading query processing, you have to analyze what your
query needs to do with the threads.  If your query is CPU bound, then you will
want to use fewer threads, if your query is I/O bound, you should have as many
threads as you have I/O requests, and have each thread block on the I/O.

 
 And with this approach we can migrate one functionality at a time to threaded
 one, thus avoiding big effort at any given time.

Perhaps I am being over dramatic, but I have moved a number of systems from
fork() to threaded (for ports to Windows NT from UNIX), and if my opinion means
anything on this mailing list, I STRONGLY urge against it. PostgreSQL is a huge
system, over a decade old. The original developers are no longer working on it,
and in fact, probably wouldn't recognize it. There are nooks and crannies that
no one knows about.

It has also been my experience going from separate processes to separate
threads does not do much for performance, simply because the operation of your
system does not change, only the methods by which you share memory. If you want
to multithread a single query, that's a different story and a good R&D project
in itself.


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Tom Lane


Bruce Momjian [EMAIL PROTECTED] writes:
 No scale factor, as I illustrated from the initialization command I
 used.  Standard buffers too.  Let me know what values I should use for
 testing.

Scale factor has to be >= max number of clients you use, else you're
just measuring serialization on the branch rows.

I think the default NBuffers (64) is too low to give meaningful
performance numbers, too.  I've been thinking that maybe we should
raise it to 1000 or so by default.  This would trigger startup failures
on platforms with small SHMMAX, but we could tell people to use -B until
they get around to fixing their kernel settings.  It's been a long time
since we fit into a 1-MB shared memory segment at the default settings
anyway, so maybe it's time to select somewhat-realistic defaults.
What we have now is neither very useful nor the lowest common
denominator...

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-29 Thread Bruce Momjian



OK, testing now with 1000 backends and 2000 buffers.  Will report.

 Bruce Momjian [EMAIL PROTECTED] writes:
  No scale factor, as I illustrated from the initialization command I
  used.  Standard buffers too.  Let me know what values I should use for
  testing.
 
 Scale factor has to be >= max number of clients you use, else you're
 just measuring serialization on the branch rows.
 
 I think the default NBuffers (64) is too low to give meaningful
 performance numbers, too.  I've been thinking that maybe we should
 raise it to 1000 or so by default.  This would trigger startup failures
 on platforms with small SHMMAX, but we could tell people to use -B until
 they get around to fixing their kernel settings.  It's been a long time
 since we fit into a 1-MB shared memory segment at the default settings
 anyway, so maybe it's time to select somewhat-realistic defaults.
 What we have now is neither very useful nor the lowest common
 denominator...
 
   regards, tom lane
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread Bruce Momjian


 
 Sounds cool to me ... definitely something to fix before v7.2, if it's as
 easy as you make it sound ... I'm expecting the new drive to be
 installed today (if all goes well) ... Thomas still has his date/time stuff
 to finish off, now that CVSup is fixed ...
 
 Let's try and target Monday for Beta then?  I think the only two
 outstanding are you and Thomas right now?
 
 Bruce, that latest rtree patch looks intriguing also ... can anyone
 comment positive/negative about it, so that we can try and get that in
 before Beta?

I put it in the queue and will apply in a day or two.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread mlw


Bruce Momjian wrote:
 
  Bruce Momjian wrote:
  
Save for the fact that the kernel can switch between threads faster than
it can switch processes considering threads share the same address space,
stack, code, etc.  If need be sharing the data between threads is much
easier than sharing between processes.
  
   Just a clarification but because we fork each backend, don't they share
   the same code space?  Data/stack is still separate.
 
  In Linux and many modern UNIX programs, you share everything at fork time. The
  data and stack pages are marked copy on write which means that if you touch
  it, the processor traps and drops into the memory manager code. A new page is
  created and replaced into your address space where the page, to which you were
  going to write, was.
 
 Yes, very true.  My point was that backends already share code space and
 non-modified data space.  It is just modified data and stack that is
 non-shared, but then again, they would have to be non-shared in a
 threaded backend too.

In a threaded system everything would be shared, depending on the OS, even the
stacks. The stacks could be allocated out of the same global pool.

You would need something like thread local storage to deal with isolating
variables from one thread to another. That always seemed more trouble than it
was worth. Either that or go through each and every global variable in
PostgreSQL and make it a member of a structure, and create an instance of this
structure for each new thread.
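
The second option (moving globals into a structure with one instance per thread) could look roughly like the following C sketch. The structure and its fields are hypothetical and do not correspond to actual PostgreSQL globals. The point is only that every function which used to touch a global must now take the state pointer as an argument.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-thread state replacing former global variables. */
typedef struct {
    int  queries_run;        /* was: a global counter */
    char last_error[64];     /* was: a global error buffer */
} sketch_backend_state;

static void sketch_state_init(sketch_backend_state *st)
{
    assert(st != NULL);
    st->queries_run = 0;
    st->last_error[0] = '\0';
}

/* Record one finished query; error may be NULL on success. */
static void sketch_record_query(sketch_backend_state *st, const char *error)
{
    assert(st != NULL);
    st->queries_run++;
    if (error != NULL)
        snprintf(st->last_error, sizeof st->last_error, "%s", error);
}
```

Each thread would own one `sketch_backend_state` and pass it down every call chain, which hints at why applying this to each and every global is a large job.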

IMHO once you go down the road of using Thread local memory, you are getting to
the same level of difficulty (for the OS) in task switching as just switching
processes. The exception to this is Windows where tasks are such a big hit.

I think threaded software is quite useful, and I have a number of thread based
servers in production. However, my experience tells me that the work trying to
move PostgreSQL to a threaded environment would be extensive and have little or
no tangible benefit.

I would rather see stuff like 64bit OIDs, three options for function definition
(short cache, nocache, long cache), etc. than to waste time making PostgreSQL
threaded. That's just my opinion.


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread mlw


Bruce Momjian wrote:
 
  Save for the fact that the kernel can switch between threads faster than
  it can switch processes considering threads share the same address space,
  stack, code, etc.  If need be sharing the data between threads is much
  easier than sharing between processes.
 
 Just a clarification but because we fork each backend, don't they share
 the same code space?  Data/stack is still separate.

In Linux and many modern UNIX systems, you share everything at fork time. The
data and stack pages are marked copy on write, which means that if you write to
one, the processor traps and drops into the memory manager code. A copy of the
page is created and mapped into your address space in place of the page you
were about to write to.
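
A small sketch of what this looks like from user space, assuming a POSIX
system. It cannot show the lazy page copy itself (that happens inside the
kernel), only its visible effect: after fork() the child's write lands in its
own copy and the parent's value is untouched. The variable and function names
are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ordinary global in the data segment; shared copy-on-write after fork(). */
static int backend_local_counter = 1;

/* Forks, lets the child overwrite the global with child_value, and returns
 * the value the parent still sees afterwards.  Returns -1 on any error. */
int parent_view_after_child_write(int child_value)
{
    /* The demo only means something if the child writes a different value. */
    assert(child_value != backend_local_counter);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write triggers the trap described above and the
         * child gets a private copy of the page. */
        backend_local_counter = child_value;
        _exit(backend_local_counter == child_value ? 0 : 1);
    }

    /* Parent: wait for the child, then read our own (unmodified) copy. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return backend_local_counter;
}
```

Calling parent_view_after_child_write(42) returns 1 in the parent: the child's
42 never becomes visible outside the child.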


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread mlw


Lincoln Yeoh wrote:
 
 At 10:02 AM 9/27/01 -0400, mlw wrote:
 D. Hageman wrote:
  I agree with everything you wrote above except for the first line.  My
  only comment is that process boundaries are only *truly* a powerful
  barrier if the processes are different pieces of code and are not
  dependent on each other in crippling ways.  Forking the same code with the
  bug in it - and only 1 in 5 die - is still 4 copies of buggy code running
  on your system ;-)
 
 This is simply not true. All software has bugs, it is an undeniable fact.
 Some bugs are more likely to be hit than others. 5 processes, when one
 process hits a bug, that does not mean the other 4 will hit the same bug.
 Obscure bugs kill software all the time, the trick is to minimize the
 impact. Software is not perfect, assuming it can be is a mistake.
 
 A bit off topic, but that really reminded me of how Microsoft does their
 forking in hardware.
 
 Basically they fork (cluster) FIVE windows machines to run the same buggy
 code all on the same IP. That way if one process (machine) goes down, the
 other 4 stay running, thus minimizing the impact ;).
 
 They have many of these clusters put together.
 
 See: http://www.microsoft.com/backstage/column_T2_1.htm
 From Microsoft.com Backstage [1]
 
 OK so it's old (1998), but from their recent articles I believe they're
 still using the same method of achieving 100% availability. And they brag
 about it like it's a good thing...
 
 When I first read it I didn't know whether to laugh or get disgusted or
 whatever.

Believe me, I don't think anyone should be shipping software with serious bugs
in it, and I deplore Microsoft's complete lack of accountability when it comes
to quality, but come on now, let's not lie to ourselves. No matter which god you
may pray to, you have to accept that people are not perfect and mistakes will
be made.

At issue is how well programs are isolated from one another (one of the
purposes of operating systems) and how to deal with programmatic errors. I am
not advocating releasing bad software, I am just saying that you must code
defensively, assume a caller may pass the wrong parameters, don't trust that
malloc worked, etc. Stuff happens in the real world. Code to deal with it. 

In the end, no matter what you do, you will have a crash at some point. (The
Tao of Programming.) Accept it. Just try to keep the damage as small as
possible.


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread Lincoln Yeoh


At 10:02 AM 9/27/01 -0400, mlw wrote:
D. Hageman wrote:
 I agree with everything you wrote above except for the first line.  My
 only comment is that process boundaries are only *truly* a powerful
 barrier if the processes are different pieces of code and are not
 dependent on each other in crippling ways.  Forking the same code with the
 bug in it - and only 1 in 5 die - is still 4 copies of buggy code running
 on your system ;-)

This is simply not true. All software has bugs, it is an undeniable fact.
Some bugs are more likely to be hit than others. 5 processes, when one
process hits a bug, that does not mean the other 4 will hit the same bug.
Obscure bugs kill software all the time, the trick is to minimize the
impact. Software is not perfect, assuming it can be is a mistake.

A bit off topic, but that really reminded me of how Microsoft does their
forking in hardware.

Basically they fork (cluster) FIVE windows machines to run the same buggy
code all on the same IP. That way if one process (machine) goes down, the
other 4 stay running, thus minimizing the impact ;).

They have many of these clusters put together.

See: http://www.microsoft.com/backstage/column_T2_1.htm
From Microsoft.com Backstage [1]

OK so it's old (1998), but from their recent articles I believe they're
still using the same method of achieving 100% availability. And they brag
about it like it's a good thing...

When I first read it I didn't know whether to laugh or get disgusted or
whatever.

Cheerio,
Link.

[1]
http://www.microsoft.com/backstage/
http://www.microsoft.com/backstage/archives.htm




Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread Bruce Momjian


 Bruce Momjian wrote:
  
   Save for the fact that the kernel can switch between threads faster than
   it can switch processes considering threads share the same address space,
   stack, code, etc.  If need be sharing the data between threads is much
   easier than sharing between processes.
  
  Just a clarification but because we fork each backend, don't they share
  the same code space?  Data/stack is still separate.
 
 In Linux and many modern UNIX programs, you share everything at fork time. The
 data and stack pages are marked copy on write which means that if you touch
 it, the processor traps and drops into the memory manager code. A new page is
 created and replaced into your address space where the page, to which you were
 going to write, was.

Yes, very true.  My point was that backends already share code space and
non-modified data space.  It is just modified data and stack that is
non-shared, but then again, they would have to be non-shared in a
threaded backend too.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread Bruce Momjian


 We have been doing some scalability testing just recently here at Red
 Hat. The machine I was using was a 4-way 550 MHz Xeon SMP machine, I
 also ran the machine in uniprocessor mode to make some comparisons. All
 runs were made on Red Hat Linux running 2.4.x series kernels. I've
 examined a number of potentially interesting cases -- I'm still
 analyzing the results, but some of the initial results might be
 interesting:

Let me add a little historical information here.  I think the first
report of bad performance on SMP machines was from Tatsuo, where he had
1000 backends running in pgbench.  He was seeing poor
transactions/second with little CPU or I/O usage.  It was clear
something was wrong.

Looking at the code, it was easy to see that on SMP machines, the
spinlock select() was a problem.  Later tests on various OS's found that
no matter how small your select interval was, select() couldn't sleep
for less than one cpu tick, which is typically 10ms (a 100Hz clock).  At
that point we knew that the spinlock backoff code was a serious problem.
On multi-processor machines that could hit the backoff code on lock
failure, there were hundreds of backends sleeping for 10ms, then all
waking up, one gets the lock, and the others sleep again.
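
For anyone who wants to check the select() granularity on their own platform,
a rough measurement could be written along these lines. This is a sketch
assuming POSIX select() and clock_gettime(); the function name is made up. On
a kernel with a 100Hz tick, a 1-microsecond request would be expected to come
back near 10000 microseconds; kernels with finer-grained timers may well
report far less.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/* Sleeps via select() for requested_usec microseconds and returns how many
 * microseconds actually elapsed, or -1 on error (EINTR is not retried). */
long measured_select_sleep_usec(long requested_usec)
{
    struct timespec before, after;
    struct timeval delay;

    assert(requested_usec >= 0);
    delay.tv_sec = requested_usec / 1000000L;
    delay.tv_usec = requested_usec % 1000000L;

    if (clock_gettime(CLOCK_MONOTONIC, &before) != 0)
        return -1;
    /* No file descriptors at all: select() acts as a pure timed sleep. */
    if (select(0, NULL, NULL, NULL, &delay) < 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &after) != 0)
        return -1;

    return (long) (after.tv_sec - before.tv_sec) * 1000000L
         + (after.tv_nsec - before.tv_nsec) / 1000L;
}
```

Comparing measured_select_sleep_usec(1) against the requested 1 microsecond
shows how coarse the backoff really is on a given machine.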

On single-cpu machines, the backoff code doesn't get hit too much, but
it is still a problem.  Tom's implementation changes backoffs in all
cases by placing them in a semaphore queue and reducing the amount of
code protected by the spinlock.

We have these TODO items out of this:

* Improve spinlock code [performance]
    o use SysV semaphores or queue of backends waiting on the lock
    o wakeup sleeper or sleep for less than one clock tick
    o spin for lock on multi-cpu machines, yield on single cpu machines
    o read/write locks




-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-28 Thread Bruce Momjian



FYI, I have added a number of these emails to the 'thread' TODO.detail list.

 On Wed, 26 Sep 2001, D. Hageman wrote:
 
Save for the fact that the kernel can switch between threads faster than
it can switch processes considering threads share the same address space,
stack, code, etc.  If need be sharing the data between threads is much
easier than sharing between processes.
   
   When using a kernel threading model, it's not obvious to me that the
   kernel will switch between threads much faster than it will switch
   between processes.  As far as I can see, the only potential savings is
   not reloading the pointers to the page tables.  That is not nothing,
   but it is also
 major snippage
I can't comment on the isolate data line.  I am still trying to figure 
that one out.
   
   Sometimes you need data which is specific to a particular thread.
  
  When you need data that is specific to a thread you use a TSD (Thread 
  Specific Data).  
 Which Linux does not support with a vengeance, to my knowledge.
 
 As a matter of fact, quote from Linus on the matter was something like
 Solution to slow process switching is fast process switching, not another
 kernel abstraction [referring to threads and TSD]. TSDs make
 implementation of thread switching complex, and fork() complex.
 
 The question about threads boils down to: Is there far more data that is
 shared than unshared? If yes, threads are better, if not, you'll be
 abusing TSD and slowing things down. 
 
 I believe right now, postgresql's model of sharing only things that need to
 be shared is pretty damn good. The only slight problem is overhead of
 forking another backend, but it's still _fast_.
 
 IMHO, threads would not bring large improvement to postgresql.
 
  Actually, if I remember, there was someone who ported postgresql (I think
 it was 6.5) to be multithreaded with major pain, because the requirement
 was to integrate with CORBA. I believe that person posted some benchmarks
 which were essentially identical to non-threaded postgres...
 
 -alex
 
 
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-27 Thread mlw


D. Hageman wrote:

 On 26 Sep 2001, Ian Lance Taylor wrote:
 
   Save for the fact that the kernel can switch between threads faster than
   it can switch processes considering threads share the same address space,
   stack, code, etc.  If need be sharing the data between threads is much
   easier than sharing between processes.
 
  When using a kernel threading model, it's not obvious to me that the
  kernel will switch between threads much faster than it will switch
  between processes.  As far as I can see, the only potential savings is
  not reloading the pointers to the page tables.  That is not nothing,
  but it is also not a lot.

 It is my understanding that avoiding a full context switch of the
 processor can be of a significant advantage.  This is especially important
 on processor architectures that can be kinda slow at doing it (x86). I
 will admit that most modern kernels have features that assist software
 packages utilizing the forking model (copy on write for instance).  It is
 also my impression that these do a good job.  I am the kind of guy that
 looks towards the future (as in a year, year and half or so) and say that
 processors will hopefully get faster at context switching and more and
 more kernels will implement these algorithms to speed up the forking
 model.  At the same time, I see more and more processors being shoved into
 a single box and it appears that the threads model works better on these
 type of systems.

Context switching happens all the time on a multitasking system. On the x86
processor, a context switch happens when you call into the kernel. You have to go
through a call-gate to get to a more privileged (lower-numbered) ring. Context
switching is very fast. The operating system dictates how heavy or light a process
switch is. Under Linux (and I believe FreeBSD with Linux threads, or version 4.x)
threads and processes are virtually identical. The only difference is that the
virtual memory pages are not copy on write. Process vs thread scheduling is also
virtually identical.

If you look to the future, then you should accept that process switching should
become more efficient as the operating systems improve.


   I can't comment on the isolate data line.  I am still trying to figure
   that one out.
 
  Sometimes you need data which is specific to a particular thread.

 When you need data that is specific to a thread you use a TSD (Thread
 Specific Data).

Yes, but Postgres has many global variables. The assumption has always been that
it is a stand-alone process with an explicitly shared paradigm, not implicitly.


  Basically, you have to look at every global variable in the Postgres
  backend, and determine whether to share it among all threads or to
  make it thread-specific.

 Yes, if one was to implement threads into PostgreSQL I would think that
 some re-writing would be in order of several areas.  Like I said before,
 give a person a chance to restructure things so future TODO items wouldn't
 be so hard to implement.  Personally, I like to stay away from global
 variables as much as possible.  They just get you into trouble.

In real live software, software which lives from year to year with active
development, things do get messy. There are always global variables involved in a
program. Efforts, of course, should be made to keep them to a minimum, but the
reality is that they always happen.

Also, the very structure of function calls may need to change when going from a
process model to a threaded model. Functions that never had to be reentrant must
now be reentrant, think about that. That is a huge undertaking. Every single
function may need to be examined for thread safety, with little benefit.


   That last line is a troll if I ever saw one ;-)  I will agree that threads
   aren't for everything and that they have costs just like everything else.  Let
   me stress that last part - like everything else.  Certain costs exist in
   the present model, nothing is - how should we say ... perfect.
 
  When writing in C, threading inevitably loses robustness.  Erratic
  behaviour by one thread, perhaps in a user defined function, can
  subtly corrupt the entire system, rather than just that thread.  Part
  of defensive programming is building barriers between different parts
  of a system.  Process boundaries are a powerful barrier.

 I agree with everything you wrote above except for the first line.  My
 only comment is that process boundaries are only *truly* a powerful
 barrier if the processes are different pieces of code and are not
 dependent on each other in crippling ways.  Forking the same code with the
 bug in it - and only 1 in 5 die - is still 4 copies of buggy code running
 on your system ;-)

This is simply not true. All software has bugs, it is an undeniable fact. Some
bugs are more likely to be hit than others. 5 processes , when one process hits a
bug, that does not mean the other 4 will hit the same bug. Obscure bugs kill
software all the time, the trick is to minimize

Re: [HACKERS] Spinlock performance improvement proposal

2001-09-27 Thread Neil Padgett


Tom Lane wrote:
 
 Neil Padgett [EMAIL PROTECTED] writes:
  Well. Currently the runs are the typical pg_bench runs.
 
 With what parameters?  If you don't initialize the pg_bench database
 with scale proportional to the number of clients you intend to use,
 then you'll naturally get huge lock contention.  For example, if you
 use scale=1, there's only one branch in the database.  Since every
 transaction wants to update the branch's balance, every transaction
 has to write-lock that single row, and so everybody serializes on that
 one lock.  Under these conditions it's not surprising to see lots of
 lock waits and lots of useless runs of the deadlock detector ...

The results you saw with the large number of useless runs of the
deadlock detector had a scale factor of 2. With a scale factor 2, the
performance fall-off began at about 100 clients. So, I reran the 512
client profiling run with a scale factor of 12. (2:100 as 10:500 -- so
12 might be an appropriate scale factor with some cushion?) This does,
of course, reduce the contention. However, the throughput is still only
about twice as much, which sounds good, but is still a small fraction of
the throughput realized on the same machine with a small number of
clients. (This is the uniprocessor machine.)

The new profile looks like this (uniprocessor machine):
Flat profile:

Each sample counts as 1 samples.
     %  cumulative      self               self    total
  time     samples   samples     calls  T1/call  T1/call  name
  9.44    10753.00  10753.00                              pg_fsync
  6.63    18303.01   7550.00                              s_lock_sleep
  6.56    25773.01   7470.00                              s_lock
  5.88    32473.01   6700.00                              heapgettup
  5.28    38487.02   6014.00                              HeapTupleSatisfiesSnapshot
  4.83    43995.02   5508.00                              hash_destroy
  2.77    47156.02   3161.00                              load_file
  1.90    49322.02   2166.00                              XLogInsert
  1.86    51436.02   2114.00                              _bt_compare
  1.82    53514.02   2078.00                              AllocSetAlloc
  1.72    55473.02   1959.00                              LockBuffer
  1.50    57180.02   1707.00                              init_ps_display
  1.40    58775.03   1595.00                              DirectFunctionCall9
  1.26    60211.03   1436.00                              hash_search
  1.14    61511.03   1300.00                              GetSnapshotData
  1.11    62780.03   1269.00                              SpinAcquire
  1.10    64028.03   1248.00                              LockAcquire
  1.04    70148.03   1190.00                              heap_fetch
  0.91    71182.03   1034.00                              _bt_orderkeys
  0.89    72201.03   1019.00                              LockRelease
  0.75    73058.03    857.00                              InitBufferPoolAccess
.
.
.

(pg_fsync: I'd attribute this to the slow disk in the machine -- scale 12
yields a lot of tuples.)

I reran the benchmarks on the SMP machine with a scale of 12 instead of
2. The numbers still show a clear performance drop off at approximately
100 clients, albeit not as sharp. (But still quite pronounced.) In terms
of raw performance, the numbers are comparable. The scale factor
certainly helped -- but it still seems that we might have a problem
here.

Thoughts?

Neil

-- 
Neil Padgett
Red Hat Canada Ltd.   E-Mail:  [EMAIL PROTECTED]
2323 Yonge Street, Suite #300, 
Toronto, ON  M4P 2C9


[HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Tom Lane


At the just-past OSDN database conference, Bruce and I were annoyed by
some benchmark results showing that Postgres performed poorly on an
8-way SMP machine.  Based on past discussion, it seems likely that the
culprit is the known inefficiency in our spinlock implementation.
After chewing on it for awhile, we came up with an idea for a solution.

The following proposal should improve performance substantially when
there is contention for a lock, but it creates no portability risks
because it uses the same system facilities (TAS and SysV semaphores)
that we have always relied on.  Also, I think it'd be fairly easy to
implement --- I could probably get it done in a day.

Comments anyone?

regards, tom lane


Plan:

Replace most uses of spinlocks with lightweight locks (LW locks)
implemented by a new lock manager.  The principal remaining use of true
spinlocks (TAS locks) will be to provide mutual exclusion of access to
LW lock structures.  Therefore, we can assume that spinlocks are never
held for more than a few dozen instructions --- and never across a kernel
call.

It's pretty easy to rejigger the spinlock code to work well when the lock
is never held for long.  We just need to change the spinlock retry code
so that it does a tight spin (continuous retry) for a few dozen cycles ---
ideally, the total delay should be some small multiple of the max expected
lock hold time.  If lock still not acquired, yield the CPU via a select()
call (10 msec minimum delay) and repeat.  Although this looks inefficient,
it doesn't matter on a uniprocessor because we expect that backends will
only rarely be interrupted while holding the lock, so in practice a held
lock will seldom be encountered.  On SMP machines the tight spin will win
since the lock will normally become available before we give up and yield
the CPU.
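
A rough sketch of the retry loop described above, written with C11 atomics
standing in for the platform-specific TAS instruction. The type and function
names, and the value of SPINS_BEFORE_YIELD, are assumptions made for the
example; the proposal only says "a few dozen cycles" and a select() delay.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

#define SPINS_BEFORE_YIELD 100   /* assumed; should be a small multiple of hold time */
#define YIELD_USEC 10000         /* the 10 msec select() delay mentioned above */

typedef struct { atomic_flag held; } SketchSpinLock;
#define SKETCH_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Test-and-set: returns true if the lock was already held by someone. */
static bool sketch_tas(SketchSpinLock *lock)
{
    return atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

/* One attempt only; returns true if we got the lock. */
bool sketch_try_lock(SketchSpinLock *lock)
{
    assert(lock != NULL);
    return !sketch_tas(lock);
}

/* Tight spin first, then give up the CPU via select() and repeat.
 * Returns how many times we had to yield before getting the lock. */
int sketch_lock(SketchSpinLock *lock)
{
    int yields = 0;

    assert(lock != NULL);
    for (;;) {
        for (int i = 0; i < SPINS_BEFORE_YIELD; i++)
            if (!sketch_tas(lock))
                return yields;

        struct timeval delay = { 0, YIELD_USEC };
        (void) select(0, NULL, NULL, NULL, &delay);
        yields++;
    }
}

void sketch_unlock(SketchSpinLock *lock)
{
    assert(lock != NULL);
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

When the lock is free, sketch_lock() returns on the first TAS with zero
yields, which is the uncontended fast path. The return value is there only so
that a test or a profile can see how often the slow path was taken.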

Desired properties of the LW lock manager include:
* very fast fall-through when no contention for lock
* waiting proc does not spin
* support both exclusive and shared (read-only) lock modes
* grant lock to waiters in arrival order (no starvation)
* small lock structure to allow many LW locks to exist.

Proposed contents of LW lock structure:

spinlock mutex (protects LW lock state and PROC queue links)
count of exclusive holders (always 0 or 1)
count of shared holders (0 .. MaxBackends)
queue head pointer (NULL or ptr to PROC object)
queue tail pointer (could do without this to save space)

If a backend sees it must wait to acquire the lock, it adds its PROC
struct to the end of the queue, releases the spinlock mutex, and then
sleeps by P'ing its per-backend wait semaphore.  A backend releasing the
lock will check to see if any waiter should be granted the lock.  If so,
it will update the lock state, release the spinlock mutex, and finally V
the wait semaphores of any backends that it decided should be released
(which it removed from the lock's queue while holding the sema).  Notice
that no kernel calls need be done while holding the spinlock.  Since the
wait semaphore will remember a V occurring before P, there's no problem
if the releaser is fast enough to release the waiter before the waiter
reaches its P operation.
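
To make the bookkeeping concrete, here is a single-threaded model of the
structure and the acquire/release steps described above. It is only a sketch
under stated assumptions: the spinlock mutex is reduced to comments, the
per-backend wait semaphore is modelled as a plain counter (sem_count), and all
identifiers are invented for the example rather than taken from any patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LW_EXCLUSIVE, LW_SHARED } LWMode;

typedef struct Proc {
    bool         lw_waiting;   /* waiting for an LW lock? */
    LWMode       lw_mode;      /* ... and in which mode */
    struct Proc *lw_next;      /* additional queue link field */
    int          sem_count;    /* stand-in for the wait semaphore's value */
} Proc;

typedef struct LWLock {
    /* spinlock mutex would go here; omitted in this single-threaded model */
    int   exclusive;           /* count of exclusive holders, 0 or 1 */
    int   shared;              /* count of shared holders */
    Proc *head, *tail;         /* queue of waiting PROCs */
} LWLock;

/* Returns true if granted at once.  Otherwise the PROC is queued and the
 * real code would release the mutex and P its semaphore. */
bool lw_acquire(LWLock *lock, Proc *me, LWMode mode)
{
    /* -- acquire spinlock mutex -- */
    bool ok = (mode == LW_EXCLUSIVE)
        ? (lock->exclusive == 0 && lock->shared == 0)
        : (lock->exclusive == 0 && lock->head == NULL); /* keep arrival order */

    if (ok) {
        if (mode == LW_EXCLUSIVE) lock->exclusive = 1; else lock->shared++;
    } else {
        me->lw_waiting = true;
        me->lw_mode = mode;
        me->lw_next = NULL;
        if (lock->tail) lock->tail->lw_next = me; else lock->head = me;
        lock->tail = me;
    }
    /* -- release spinlock mutex; if !ok, P(me's semaphore) -- */
    return ok;
}

void lw_release(LWLock *lock, LWMode held)
{
    Proc *wake = NULL;

    /* -- acquire spinlock mutex -- */
    if (held == LW_EXCLUSIVE) { assert(lock->exclusive == 1); lock->exclusive = 0; }
    else                      { assert(lock->shared > 0);     lock->shared--; }

    if (lock->exclusive == 0 && lock->shared == 0 && lock->head != NULL) {
        Proc *last = lock->head;
        wake = last;
        if (last->lw_mode == LW_EXCLUSIVE) {
            lock->exclusive = 1;
        } else {
            /* grant a run of consecutive shared waiters together */
            lock->shared = 1;
            while (last->lw_next && last->lw_next->lw_mode == LW_SHARED) {
                last = last->lw_next;
                lock->shared++;
            }
        }
        lock->head = last->lw_next;
        last->lw_next = NULL;
        if (lock->head == NULL) lock->tail = NULL;
    }
    /* -- release spinlock mutex: no kernel call was made while holding it -- */

    while (wake != NULL) {          /* now V each released waiter */
        Proc *next = wake->lw_next;
        wake->lw_next = NULL;
        wake->lw_waiting = false;
        wake->sem_count++;          /* real code: V(wake's semaphore) */
        wake = next;
    }
}
```

With one exclusive holder and waiters queued as shared, shared, exclusive,
releasing the holder wakes both shared waiters at once and leaves the
exclusive waiter queued until both of them have released.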

We will need to add a few fields to PROC structures:
* Flag to show whether PROC is waiting for an LW lock, and if so
  whether it waits for read or write access
* Additional PROC queue link field.
We can't reuse the existing queue link field because it is possible for a
PROC to be waiting for both a heavyweight lock and a lightweight one ---
this will occur when HandleDeadLock or LockWaitCancel tries to acquire
the LockMgr module's lightweight lock (formerly spinlock).

It might seem that we also need to create a second wait semaphore per
backend, one to wait on HW locks and one to wait on LW locks.  But I
believe we can get away with just one, by recognizing that a wait for an
LW lock can never be interrupted by a wait for a HW lock, only vice versa.
After being awoken (V'd), the LW lock manager must check to see if it was
actually granted the lock (easiest way: look at own PROC struct to see if
LW lock wait flag has been cleared).  If not, the V must have been to
grant us a HW lock --- but we still have to sleep to get the LW lock.  So
remember this happened, then loop back and P again.  When we finally get
the LW lock, if there was an extra P operation then V the semaphore once
before returning.  This will allow ProcSleep to exit the wait for the HW
lock when we return to it.

Fine points:

While waiting for an LW lock, we need to show in our PROC struct whether
we are waiting for read or write access.  But we don't need to remember
this after getting the lock; if we know we have the lock, it's easy to
see by inspecting the lock whether we hold read or write access.

ProcStructLock cannot be replaced by an LW lock, since a backend cannot
use an LW lock until it

Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Marc G. Fournier



Sounds cool to me ... definitely something to fix before v7.2, if it's as
easy as you make it sound ... I'm expecting the new drive to be
installed today (if all goes well) ... Thomas still has his date/time stuff
to finish off, now that CVSup is fixed ...

Let's try and target Monday for Beta then?  I think the only two
outstanding are you and Thomas right now?

Bruce, that latest rtree patch looks intriguing also ... can anyone
comment positive/negative about it, so that we can try and get that in
before Beta?

On Wed, 26 Sep 2001, Tom Lane wrote:

 At the just-past OSDN database conference, Bruce and I were annoyed by
 some benchmark results showing that Postgres performed poorly on an
 8-way SMP machine.  Based on past discussion, it seems likely that the
 culprit is the known inefficiency in our spinlock implementation.
 After chewing on it for awhile, we came up with an idea for a solution.

 The following proposal should improve performance substantially when
 there is contention for a lock, but it creates no portability risks
 because it uses the same system facilities (TAS and SysV semaphores)
 that we have always relied on.  Also, I think it'd be fairly easy to
 implement --- I could probably get it done in a day.

 Comments anyone?

   regards, tom lane


 Plan:

 Replace most uses of spinlocks with lightweight locks (LW locks)
 implemented by a new lock manager.  The principal remaining use of true
 spinlocks (TAS locks) will be to provide mutual exclusion of access to
 LW lock structures.  Therefore, we can assume that spinlocks are never
 held for more than a few dozen instructions --- and never across a kernel
 call.

 It's pretty easy to rejigger the spinlock code to work well when the lock
 is never held for long.  We just need to change the spinlock retry code
 so that it does a tight spin (continuous retry) for a few dozen cycles ---
 ideally, the total delay should be some small multiple of the max expected
 lock hold time.  If lock still not acquired, yield the CPU via a select()
 call (10 msec minimum delay) and repeat.  Although this looks inefficient,
 it doesn't matter on a uniprocessor because we expect that backends will
 only rarely be interrupted while holding the lock, so in practice a held
 lock will seldom be encountered.  On SMP machines the tight spin will win
 since the lock will normally become available before we give up and yield
 the CPU.

 Desired properties of the LW lock manager include:
   * very fast fall-through when no contention for lock
   * waiting proc does not spin
   * support both exclusive and shared (read-only) lock modes
   * grant lock to waiters in arrival order (no starvation)
   * small lock structure to allow many LW locks to exist.

 Proposed contents of LW lock structure:

   spinlock mutex (protects LW lock state and PROC queue links)
   count of exclusive holders (always 0 or 1)
   count of shared holders (0 .. MaxBackends)
   queue head pointer (NULL or ptr to PROC object)
   queue tail pointer (could do without this to save space)

 If a backend sees it must wait to acquire the lock, it adds its PROC
 struct to the end of the queue, releases the spinlock mutex, and then
 sleeps by P'ing its per-backend wait semaphore.  A backend releasing the
 lock will check to see if any waiter should be granted the lock.  If so,
 it will update the lock state, release the spinlock mutex, and finally V
 the wait semaphores of any backends that it decided should be released
 (which it removed from the lock's queue while holding the sema).  Notice
 that no kernel calls need be done while holding the spinlock.  Since the
 wait semaphore will remember a V occurring before P, there's no problem
 if the releaser is fast enough to release the waiter before the waiter
 reaches its P operation.

 We will need to add a few fields to PROC structures:
   * Flag to show whether PROC is waiting for an LW lock, and if so
 whether it waits for read or write access
   * Additional PROC queue link field.
 We can't reuse the existing queue link field because it is possible for a
 PROC to be waiting for both a heavyweight lock and a lightweight one ---
 this will occur when HandleDeadLock or LockWaitCancel tries to acquire
 the LockMgr module's lightweight lock (formerly spinlock).

 It might seem that we also need to create a second wait semaphore per
 backend, one to wait on HW locks and one to wait on LW locks.  But I
 believe we can get away with just one, by recognizing that a wait for an
 LW lock can never be interrupted by a wait for a HW lock, only vice versa.
 After being awoken (V'd), the LW lock manager must check to see if it was
 actually granted the lock (easiest way: look at own PROC struct to see if
 LW lock wait flag has been cleared).  If not, the V must have been to
 grant us a HW lock --- but we still have to sleep to get the LW lock.  So
 remember this happened, then loop

Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Tom Lane


Marc G. Fournier [EMAIL PROTECTED] writes:
 Let's try and target Monday for Beta then?

Sounds like a plan.

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Neil Padgett


Tom Lane wrote:
 
 At the just-past OSDN database conference, Bruce and I were annoyed by
 some benchmark results showing that Postgres performed poorly on an
 8-way SMP machine.  Based on past discussion, it seems likely that the
 culprit is the known inefficiency in our spinlock implementation.
 After chewing on it for awhile, we came up with an idea for a solution.
 
 The following proposal should improve performance substantially when
 there is contention for a lock, but it creates no portability risks
 because it uses the same system facilities (TAS and SysV semaphores)
 that we have always relied on.  Also, I think it'd be fairly easy to
 implement --- I could probably get it done in a day.
 
 Comments anyone?


We have been doing some scalability testing just recently here at Red
Hat. The machine I was using was a 4-way 550 MHz Xeon SMP machine, I
also ran the machine in uniprocessor mode to make some comparisons. All
runs were made on Red Hat Linux running 2.4.x series kernels. I've
examined a number of potentially interesting cases -- I'm still
analyzing the results, but some of the initial results might be
interesting:

- We have tried benchmarking the following: TAS spinlocks (existing
implementation), SysV semaphores (existing implementation), Pthread
Mutexes. Pgbench runs were conducted for 1 to 512 simultaneous backends.

  For these three cases we found:
  - TAS spinlocks fared the best of all three lock types; however, above
100 clients the Pthread mutexes were in lock step with them in performance.
I expect this is due to the cost of any system calls being negligible
relative to lock wait time.
  - The SysV semaphore implementation fared terribly, as expected. However,
it is worse relative to the TAS spinlocks on SMP than on uniprocessor.

- Since the above seemed to indicate that the lock implementation may
not be the problem (Pthread mutexes are supposed to be implemented to be
less bang-bang than the Postgres TAS spinlocks, IIRC), I decided to
profile Postgres. After much trouble, I got results for it using
oprofile, a kernel profiler for Linux. Unfortunately, I can only profile
for uniprocessor right now using oprofile, as it doesn't support SMP
boxes yet. (soon, I hope.)

Initial results (top five -- if you would like a complete profile, let
me know):
Each sample counts as 1 samples.
  %    cumulative      self              self    total
 time     samples   samples    calls  T1/call  T1/call  name
 26.57   42255.02  42255.02                             FindLockCycleRecurse
  5.55   51081.02   8826.00                             s_lock_sleep
  5.07   59145.03   8064.00                             heapgettup
  4.48   66274.03   7129.00                             hash_search
  4.48   73397.03   7123.00                             s_lock
  2.85   77926.03   4529.00                             HeapTupleSatisfiesSnapshot
  2.07   81217.04   3291.00                             SHMQueueNext
  1.85   84154.04   2937.00                             AllocSetAlloc
  1.84   87085.04   2931.00                             fmgr_isbuiltin
  1.64   89696.04   2611.00                             set_ps_display
  1.51   92101.04   2405.00                             FunctionCall2
  1.47   94442.04   2341.00                             XLogInsert
  1.39   96649.04   2207.00                             _bt_compare
  1.22   98597.04   1948.00                             SpinAcquire
  1.22  100544.04   1947.00                             LockBuffer
  1.21  102469.04   1925.00                             tag_hash
  1.01  104078.05   1609.00                             LockAcquire
.
.
.

(The samples are proportional to execution time.)
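
As a rough cross-check of the figures above, the following sketch (Python, for illustration only) estimates the total sample count from the top row and adds up the share taken by the spinlock-related entries. Grouping s_lock_sleep, s_lock and SpinAcquire into one "spinlock" bucket is an assumption made for this example; the profiler does not report such a group.

```python
# (name, percent of time, self samples), copied from the profile above.
PROFILE = [
    ("FindLockCycleRecurse", 26.57, 42255.02),
    ("s_lock_sleep", 5.55, 8826.00),
    ("heapgettup", 5.07, 8064.00),
    ("hash_search", 4.48, 7129.00),
    ("s_lock", 4.48, 7123.00),
    ("SpinAcquire", 1.22, 1948.00),
]

# Assumed grouping for this example only.
SPIN_FUNCS = {"s_lock_sleep", "s_lock", "SpinAcquire"}


def estimated_total_samples(percent, self_samples):
    """Back out the total sample count from one row of the flat profile."""
    return self_samples / (percent / 100.0)


def share(profile, names):
    """Sum the percentage of time spent in the named functions."""
    return sum(pct for name, pct, _ in profile if name in names)


total = estimated_total_samples(26.57, 42255.02)
spin_share = share(PROFILE, SPIN_FUNCS)
detector_share = share(PROFILE, {"FindLockCycleRecurse"})
print(round(total), round(spin_share, 2), round(detector_share, 2))
```

On these numbers the spinlock group accounts for about 11% of the samples, against roughly 27% for FindLockCycleRecurse.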

This would seem to point to the deadlock detector. (Which some have
fingered as a possible culprit before, IIRC.)

However, this seems to be a red herring. Removing the deadlock detector
had no effect. In fact, benchmarking showed removing it yielded no
improvement in transaction processing rate on uniprocessor or SMP
systems. Instead, it seems that the deadlock detector simply amounts to
something to do for the blocked backend while it waits for lock
acquisition. 

Profiling bears this out:

Flat profile:

Each sample counts as 1 samples.
  %    cumulative      self              self    total
 time     samples   samples    calls  T1/call  T1/call  name
 12.38   14112.01  14112.01                             s_lock_sleep
 10.18   25710.01  11598.01                             s_lock
  6.47   33079.01   7369.00                             hash_search
  5.88   39784.02   6705.00                             heapgettup
  5.32   45843.02   6059.00                             HeapTupleSatisfiesSnapshot
  2.62   48830.02   2987.00                             AllocSetAlloc
  2.48   51654.02   2824.00                             fmgr_isbuiltin
  1.89   53813.02   2159.00                             XLogInsert
  1.86   55938.02   2125.00                             _bt_compare
  1.72   57893.03   1955.00

Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Tom Lane


Neil Padgett [EMAIL PROTECTED] writes:
 Initial results (top five -- if you would like a complete profile, let
 me know):
 Each sample counts as 1 samples.
   %    cumulative      self              self    total
  time     samples   samples    calls  T1/call  T1/call  name
  26.57   42255.02  42255.02                             FindLockCycleRecurse

Yipes.  It would be interesting to know more about the locking pattern
of your benchmark --- are there long waits-for chains, or not?  The
present deadlock detector was certainly written with an eye to get it
right rather than make it fast, but I wonder whether this shows a
performance problem in the detector, or just too many executions because
you're waiting too long to get locks.

 However, this seems to be a red herring. Removing the deadlock detector
 had no effect. In fact, benchmarking showed removing it yielded no
 improvement in transaction processing rate on uniprocessor or SMP
 systems. Instead, it seems that the deadlock detector simply amounts to
 something to do for the blocked backend while it waits for lock
 acquisition. 

Do you have any idea about the typical lock-acquisition delay in this
benchmark?  Our docs advise trying to set DEADLOCK_TIMEOUT higher than
the typical acquisition delay, so that the deadlock detector does not
run unnecessarily.
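
A minimal sketch of that advice, in Python rather than backend C: the waiter blocks for up to the timeout first, and only if the lock is still unavailable does it pay for a deadlock check. The names acquire_with_deadlock_check, check_for_deadlock and demo are hypothetical and exist only for this illustration; they are not the backend's functions.

```python
import threading


def acquire_with_deadlock_check(lock, deadlock_timeout, check_for_deadlock):
    """Block on `lock`; run the expensive check only if the wait outlasts
    `deadlock_timeout` (seconds). Returns how many times the check ran."""
    if lock.acquire(timeout=deadlock_timeout):
        return 0                  # granted quickly: the detector never runs
    check_for_deadlock()          # waited too long: look for a cycle once
    lock.acquire()                # no deadlock reported, so keep waiting
    return 1


def demo(hold_seconds, deadlock_timeout):
    """Hold a lock for `hold_seconds`, then report how often the check ran."""
    lock = threading.Lock()
    checks = []
    lock.acquire()                                    # play the lock holder
    releaser = threading.Timer(hold_seconds, lock.release)
    releaser.start()
    runs = acquire_with_deadlock_check(
        lock, deadlock_timeout, lambda: checks.append(1))
    lock.release()
    releaser.join()
    return runs


# Timeout above the typical delay, then timeout below it.
print(demo(0.05, 2.0), demo(0.5, 0.05))
```

With the timeout above the hold time the check is skipped entirely; with the timeout below it, every waiter runs the check once even though no deadlock exists.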

 For example, there has been some suggestion
 that perhaps some component of the database is causing large lock
 contention.

My thought as well.  I would certainly recommend that you use more than
one test case while looking at these things.

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Tom Lane


Neil Padgett [EMAIL PROTECTED] writes:
 Well. Currently the runs are the typical pg_bench runs.

With what parameters?  If you don't initialize the pg_bench database
with scale proportional to the number of clients you intend to use,
then you'll naturally get huge lock contention.  For example, if you
use scale=1, there's only one branch in the database.  Since every
transaction wants to update the branch's balance, every transaction
has to write-lock that single row, and so everybody serializes on that
one lock.  Under these conditions it's not surprising to see lots of
lock waits and lots of useless runs of the deadlock detector ...
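
The effect of the scale factor can be sketched with a small simulation. It assumes each transaction picks its branch uniformly at random out of `scale` branches; the helper hottest_branch_load is hypothetical and only counts how many clients would queue on the same branch row at one instant.

```python
import random
from collections import Counter


def hottest_branch_load(clients, scale, seed=0):
    """Each client picks one of `scale` branches at random; return the
    largest number of clients that landed on the same branch row."""
    rng = random.Random(seed)
    picks = Counter(rng.randint(1, scale) for _ in range(clients))
    return max(picks.values())


for scale in (1, 8, 64):
    print(scale, hottest_branch_load(clients=64, scale=scale))
```

With scale=1 all 64 clients land on the single branch row, which is the serialization described above; raising the scale spreads them over more rows.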

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Neil Padgett


Tom Lane wrote:
 
 Neil Padgett [EMAIL PROTECTED] writes:
  Initial results (top five -- if you would like a complete profile, let
  me know):
  Each sample counts as 1 samples.
    %    cumulative      self              self    total
   time     samples   samples    calls  T1/call  T1/call  name
   26.57   42255.02  42255.02                             FindLockCycleRecurse
 
 Yipes.  It would be interesting to know more about the locking pattern
 of your benchmark --- are there long waits-for chains, or not?  The
 present deadlock detector was certainly written with an eye to get it
 right rather than make it fast, but I wonder whether this shows a
 performance problem in the detector, or just too many executions because
 you're waiting too long to get locks.
 
  However, this seems to be a red herring. Removing the deadlock detector
  had no effect. In fact, benchmarking showed removing it yielded no
  improvement in transaction processing rate on uniprocessor or SMP
  systems. Instead, it seems that the deadlock detector simply amounts to
  something to do for the blocked backend while it waits for lock
  acquisition.
 
 Do you have any idea about the typical lock-acquisition delay in this
 benchmark?  Our docs advise trying to set DEADLOCK_TIMEOUT higher than
 the typical acquisition delay, so that the deadlock detector does not
 run unnecessarily.

Well. Currently the runs are the typical pg_bench runs. This was useful
since it was a handy benchmark that was already done, and I was hoping
it might be useful for comparison since it seems to be popular. More
benchmarks of different types would of course be useful though. 

I think the large time consumed by the deadlock detector in the profile
is simply due to too many executions while waiting to acquire
contended locks. But, I agree that it seems DEADLOCK_TIMEOUT was set too
low, since it appears from the profile output that the deadlock detector
was running unnecessarily. But the deadlock detector isn't causing the
SMP performance hit right now, since the throughput is the same with it
in place or with it removed completely. I therefore didn't make any
attempt to tune DEADLOCK_TIMEOUT. As I mentioned before, it apparently
just gives the backend something to do while it waits for a lock. 

I'm thinking that running the deadlock detector unnecessarily has no
effect on performance since the shared memory is causing some level of
serialization. So, one CPU (or two, or three, but not all) is doing
useful work, while the others are idle (that is to say, doing no useful
work). Whether they are idle spinning or idle running the deadlock
detector, the net throughput is still the same. (This might also indicate that
improving the lock design won't help here.) Of course, another
possibility is that you spend so long spinning simply because you do
spin (rather than sleep), and this is wasting much CPU time so the
useful work backends take longer to get things done. Either is just
speculation right now without any data to back things up.

 
  For example, there has been some suggestion
  that perhaps some component of the database is causing large lock
  contention.
 
 My thought as well.  I would certainly recommend that you use more than
 one test case while looking at these things.

Yes. That is another suggestion for a next step. Several cases might
serve to better expose the path causing the slowdown. I think that
several test cases of varying usage patterns, coupled with hold time
instrumentation (which can tell what routine acquired the lock and how
long it held it, and yield wait-for data in the analysis), are the right
way to go about attacking SMP performance. Any other thoughts?
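
One possible shape for such hold-time instrumentation, sketched in Python rather than backend C. The InstrumentedLock class is hypothetical, and the caller labels are just strings chosen for the example; it records, per caller, how often the lock was taken, how long the caller waited for it and how long it was held.

```python
import threading
import time
from collections import defaultdict


class InstrumentedLock:
    """A lock that records who acquired it, wait time and hold time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.stats = defaultdict(
            lambda: {"acquisitions": 0, "wait": 0.0, "hold": 0.0})
        self._owner = None
        self._acquired_at = 0.0

    def acquire(self, caller):
        started = time.perf_counter()
        self._lock.acquire()
        now = time.perf_counter()
        # We hold the lock here, so updating the bookkeeping is safe.
        self._owner, self._acquired_at = caller, now
        entry = self.stats[caller]
        entry["acquisitions"] += 1
        entry["wait"] += now - started

    def release(self):
        held = time.perf_counter() - self._acquired_at
        self.stats[self._owner]["hold"] += held
        self._owner = None
        self._lock.release()


buf_lock = InstrumentedLock()
buf_lock.acquire("XLogInsert")      # label only; no real backend call
time.sleep(0.01)                    # pretend to do work under the lock
buf_lock.release()
buf_lock.acquire("heapgettup")
buf_lock.release()
for caller, entry in buf_lock.stats.items():
    print(caller, entry["acquisitions"], entry["hold"])
```

Sorting the collected entries by total hold time would point at the routines that keep a contended lock longest.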

Neil

-- 
Neil Padgett
Red Hat Canada Ltd.   E-Mail:  [EMAIL PROTECTED]
2323 Yonge Street, Suite #300, 
Toronto, ON  M4P 2C9


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Ian Lance Taylor


D. Hageman [EMAIL PROTECTED] writes:

  you have a newer kernel scheduled implementation, then you will have the same
  scheduling as separate processes. The only thing you will need to do is
  switch your brain from figuring out how to share data, to trying to figure
  out how to isolate data. A multithreaded implementation lacks many of the
  benefits and robustness of a multiprocess implementation.
 
 Save for the fact that the kernel can switch between threads faster than
 it can switch processes considering threads share the same address space,
 stack, code, etc.  If need be, sharing the data between threads is much
 easier than sharing between processes.

When using a kernel threading model, it's not obvious to me that the
kernel will switch between threads much faster than it will switch
between processes.  As far as I can see, the only potential savings is
not reloading the pointers to the page tables.  That is not nothing,
but it is also not a lot.

 I can't comment on the isolate data line.  I am still trying to figure 
 that one out.

Sometimes you need data which is specific to a particular thread.
Basically, you have to look at every global variable in the Postgres
backend, and determine whether to share it among all threads or to
make it thread-specific.  In other words, you have to take extra steps
to isolate the data within the thread.  This is the reverse of the
current situation, in which you have to take extra steps to share data
among all backend processes.
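
The distinction can be sketched in Python, where threading.local plays the role that thread-specific data (for example pthread_getspecific) plays in C. All names below are hypothetical: one variable is deliberately shared and guarded by a lock, the others are isolated per thread.

```python
import threading

shared_counter = {"queries": 0}     # deliberately shared by all threads
shared_lock = threading.Lock()      # shared data needs explicit locking
per_thread = threading.local()      # each thread sees its own attributes


def backend(name, n_queries, results):
    per_thread.current_user = name  # thread-specific: invisible to others
    per_thread.queries = 0
    for _ in range(n_queries):
        per_thread.queries += 1     # no lock needed, the data is isolated
        with shared_lock:
            shared_counter["queries"] += 1
    with shared_lock:
        results[name] = (per_thread.current_user, per_thread.queries)


results = {}
threads = [
    threading.Thread(target=backend, args=(f"user{i}", 100, results))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_counter["queries"], sorted(results))
```

Every former global has to be sorted into one of these two categories, which is the audit described above.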

 That last line is a troll if I ever saw one ;-)  I will agree that threads
 aren't for everything and that they have costs just like everything else.  Let
 me stress that last part - like everything else.  Certain costs exist in
 the present model, nothing is - how should we say ... perfect.

When writing in C, threading inevitably loses robustness.  Erratic
behaviour by one thread, perhaps in a user defined function, can
subtly corrupt the entire system, rather than just that thread.  Part
of defensive programming is building barriers between different parts
of a system.  Process boundaries are a powerful barrier.

(Actually, though, Postgres is already vulnerable to erratic behaviour
because any backend process can corrupt the shared buffer pool.)

Ian


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Myron Scott




On Wed, 26 Sep 2001, mlw wrote:

 I can only think of two objectives for threading. (1) running the various
 connections in their own thread instead of their own process. (2) running
 complex queries across multiple threads.
 

I did a multi-threaded version of 7.0.2 using Solaris threads about a year
ago in order to try and get multiple backend connections working under one
Java process using JNI.  I used the thread-per-connection model.

I eventually got it working, but it was/is very messy (there were global
variables everywhere!).  Anyway, I was able to get a pretty good speedup
on inserts by scheduling buffer writes from multiple connections on one
common writing thread.

I also got some other features that were important to me at the time.

1.  True prepared statements under Java with bound input and output
    variables
2.  Better system utilization
    a.  Fewer Solaris lightweight processes mapped to threads
    b.  Fewer open files per postgres installation
3.  Automatic vacuums when system activity is low by a daemon thread

But there were some drawbacks.  One rogue thread or bad user
function could take down all connections for that process.  This
was and seems to still be the major drawback to using threads.


Myron Scott
[EMAIL PROTECTED]



Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Doug McNaught


D. Hageman [EMAIL PROTECTED] writes:

 Save for the fact that the kernel can switch between threads faster than
 it can switch processes considering threads share the same address space,
 stack, code, etc.  If need be, sharing the data between threads is much
 easier than sharing between processes.

This depends on your system.  Solaris has a huge difference between
thread and process context switch times, whereas Linux has very little 
difference (and in fact a Linux process context switch is about as
fast as a Solaris thread switch on the same hardware--Solaris is just
a pig when it comes to process context switching). 

 I can't comment on the isolate data line.  I am still trying to figure 
 that one out.

I think his point is one of clarity and maintainability.  When a
task's data is explicitly shared (via shared memory of some sort) it's
fairly clear when you're accessing shared data and need to worry about
locking.  Whereas when all data is shared by default (as with threads)
it's very easy to miss places where threads can step on each other.

-Doug
-- 
In a world of steel-eyed death, and men who are fighting to be warm,
Come in, she said, I'll give you shelter from the storm.-Dylan


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread D. Hageman


On 26 Sep 2001, Ian Lance Taylor wrote:

  Save for the fact that the kernel can switch between threads faster than
  it can switch processes considering threads share the same address space,
  stack, code, etc.  If need be, sharing the data between threads is much
  easier than sharing between processes.
 
 When using a kernel threading model, it's not obvious to me that the
 kernel will switch between threads much faster than it will switch
 between processes.  As far as I can see, the only potential savings is
 not reloading the pointers to the page tables.  That is not nothing,
 but it is also not a lot.

It is my understanding that avoiding a full context switch of the
processor can be a significant advantage.  This is especially important
on processor architectures that can be kinda slow at doing it (x86). I
will admit that most modern kernels have features that assist software
packages utilizing the forking model (copy on write for instance).  It is
also my impression that these do a good job.  I am the kind of guy that
looks towards the future (as in a year, year and a half or so) and says that
processors will hopefully get faster at context switching and more and
more kernels will implement these algorithms to speed up the forking
model.  At the same time, I see more and more processors being shoved into
a single box and it appears that the threads model works better on these
types of systems.

  I can't comment on the isolate data line.  I am still trying to figure 
  that one out.
 
 Sometimes you need data which is specific to a particular thread.

When you need data that is specific to a thread you use a TSD (Thread 
Specific Data).  

 Basically, you have to look at every global variable in the Postgres
 backend, and determine whether to share it among all threads or to
 make it thread-specific.

Yes, if one were to implement threads in PostgreSQL I would think that
some rewriting of several areas would be in order.  Like I said before,
give a person a chance to restructure things so future TODO items wouldn't 
be so hard to implement.  Personally, I like to stay away from global 
variables as much as possible.  They just get you into trouble.

  That last line is a troll if I ever saw one ;-)  I will agree that threads
  aren't for everything and that they have costs just like everything else.  Let
  me stress that last part - like everything else.  Certain costs exist in
  the present model, nothing is - how should we say ... perfect.
 
 When writing in C, threading inevitably loses robustness.  Erratic
 behaviour by one thread, perhaps in a user defined function, can
 subtly corrupt the entire system, rather than just that thread.  Part
 of defensive programming is building barriers between different parts
 of a system.  Process boundaries are a powerful barrier.

I agree with everything you wrote above except for the first line.  My 
only comment is that process boundaries are only *truly* a powerful
barrier if the processes are different pieces of code and are not 
dependent on each other in crippling ways.  Forking the same code with the 
bug in it - and only 1 in 5 die - is still 4 copies of buggy code running 
on your system ;-)  

 (Actually, though, Postgres is already vulnerable to erratic behaviour
 because any backend process can corrupt the shared buffer pool.)

I appreciate your totally honest view of the situation.

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//




Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread D. Hageman


On 26 Sep 2001, Doug McNaught wrote:

 This depends on your system.  Solaris has a huge difference between
 thread and process context switch times, whereas Linux has very little 
 difference (and in fact a Linux process context switch is about as
 fast as a Solaris thread switch on the same hardware--Solaris is just
 a pig when it comes to process context switching). 

Yeah, I kinda commented on this in another e-mail.  Linux has some nice 
tweaks for software using the forking model, but I am sure a couple of 
Solaris admins out there like to run PostgreSQL.  ;-)  You are right in 
that it is very system dependent.  I should have prefaced it with "In
general ..."

  I can't comment on the isolate data line.  I am still trying to figure 
  that one out.
 
 I think his point is one of clarity and maintainability.  When a
 task's data is explicitly shared (via shared memory of some sort) it's
 fairly clear when you're accessing shared data and need to worry about
 locking.  Whereas when all data is shared by default (as with threads)
 it's very easy to miss places where threads can step on each other.

Well, I understand what you are saying and you are correct.  The situation 
is that when you implement anything using pthreads you lock your 
variables (which is where the major performance penalty comes into play 
with threads).  Now, the kicker is how you lock them.  Depending on how 
you do it (as per discussion earlier on this list concerning threads) it 
can be faster or slower.  It all depends on what model you use.  

Data is not explicitly shared between threads unless you make it so.  The
threads just share the same stack and all of that, but you can't 
(shouldn't is probably a better word) really access anything you don't have 
an address for.  Threads just makes it easier to share if you want to.  
Also, see my other e-mail to the list concerning TSDs.

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//



Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Tom Lane


Ian Lance Taylor [EMAIL PROTECTED] writes:
 (Actually, though, Postgres is already vulnerable to erratic behaviour
 because any backend process can corrupt the shared buffer pool.)

Not to mention the other parts of shared memory.

Nonetheless, our experience has been that cross-backend failures due to
memory clobbers in shared memory are very infrequent --- certainly far
less often than we see localized-to-a-backend crashes.  Probably this is
because the shared memory is (a) small compared to the rest of the
address space and (b) only accessed by certain specific modules within
Postgres.

I'm convinced that switching to a thread model would result in a
significant degradation in our ability to recover from coredump-type
failures, even given the (implausible) assumption that we introduce no
new bugs during the conversion.  I'm also *un*convinced that such a
conversion will yield significant performance benefits, unless we
introduce additional cross-thread dependencies (and more fragility
and lock contention) by tactics such as sharing catalog caches across
threads.

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Thomas Lockhart


 ... Thomas still has his date/time stuff
 to finish off, now that CVSup is fixed ...

I'm now getting clean runs through the regression tests on a freshly
merged cvs tree. I'd like to look at it a little more to adjust
pg_proc.h attributes before I commit the changes.

There was a bit of a hiccup when merging since there was some bytea
stuff added to the catalogs over the last couple of weeks. Could folks
hold off on claiming new OIDs until I get this stuff committed? TIA

I expect to be able to merge this stuff by Friday at the latest, more
likely tomorrow.

 - Thomas


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Alex Pilosov


On Wed, 26 Sep 2001, D. Hageman wrote:

   Save for the fact that the kernel can switch between threads faster than
   it can switch processes considering threads share the same address space,
   stack, code, etc.  If need be, sharing the data between threads is much
   easier than sharing between processes.
  
  When using a kernel threading model, it's not obvious to me that the
  kernel will switch between threads much faster than it will switch
  between processes.  As far as I can see, the only potential savings is
  not reloading the pointers to the page tables.  That is not nothing,
  but it is also
major snippage
   I can't comment on the isolate data line.  I am still trying to figure 
   that one out.
  
  Sometimes you need data which is specific to a particular thread.
 
 When you need data that is specific to a thread you use a TSD (Thread 
 Specific Data).  
Which Linux does not support with a vengeance, to my knowledge.

As a matter of fact, a quote from Linus on the matter was something like
"Solution to slow process switching is fast process switching, not another
kernel abstraction" [referring to threads and TSD]. TSDs make the
implementation of thread switching complex, and fork() complex.

The question about threads boils down to: Is there far more data that is
shared than unshared? If yes, threads are better, if not, you'll be
abusing TSD and slowing things down. 

I believe right now, PostgreSQL's model of sharing only things that need to
be shared is pretty damn good. The only slight problem is the overhead of
forking another backend, but it's still _fast_.

IMHO, threads would not bring large improvement to postgresql.

 Actually, if I remember, there was someone who ported postgresql (I think
it was 6.5) to be multithreaded with major pain, because the requirement
was to integrate with CORBA. I believe that person posted some benchmarks
which were essentially identical to non-threaded postgres...

-alex



Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread D. Hageman


On Wed, 26 Sep 2001, Alex Pilosov wrote:

 On Wed, 26 Sep 2001, D. Hageman wrote:
 
  When you need data that is specific to a thread you use a TSD (Thread 
  Specific Data).  

 Which Linux does not support with a vengeance, to my knowledge.

I am not sure what that means.  If it works it works. 

 As a matter of fact, a quote from Linus on the matter was something like
 "Solution to slow process switching is fast process switching, not another
 kernel abstraction" [referring to threads and TSD]. TSDs make the
 implementation of thread switching complex, and fork() complex.

Linus does have some interesting ideas.  I always like to hear his 
perspective on matters, but just like the government - I don't always 
agree with him.  I don't see why TSDs would make the implementation of 
thread switching complex - seems to me that would be something that is 
implemented in the userland side part of the pthreads implementation and
not the kernel side.  I don't really like to talk specifics, but both the 
lightweight process and the system call fork() are implemented using the 
__clone kernel function with the parameters slightly different (This is 
in the Linux kernel, btw, since you wanted to use that as an example).  The
speed improvements the kernel has given the fork() command (like copy on
write) only last until the process writes to memory.  The next time it
comes around - it is for all intents and purposes a full context switch
again.  With threads ... the cost is relatively consistent.

 The question about threads boils down to: Is there far more data that is
 shared than unshared? If yes, threads are better, if not, you'll be
 abusing TSD and slowing things down. 

I think the question about threads boils down to if the core members of 
the PostgreSQL team want to try it or not.  At this time, I would have to 
say they pretty much agree they like things the way they are now, which is 
completely fine.  They are the ones that spend most of the time on it and 
want to support it.

 I believe right now, PostgreSQL's model of sharing only things that need to
 be shared is pretty damn good. The only slight problem is the overhead of
 forking another backend, but it's still _fast_.

Oh, man ... am I reading stuff into what you are writing or are you 
reading stuff into what I am writing?  Maybe a little bit of both?  My 
original contention is that I think that the best way to get the full 
potential out of SMP machines is to use a threads model.  I didn't say the 
present way wasn't fast.  

  Actually, if I remember, there was someone who ported postgresql (I think
 it was 6.5) to be multithreaded with major pain, because the requirement
 was to integrate with CORBA. I believe that person posted some benchmarks
 which were essentially identical to non-threaded postgres...

Actually, it was 7.0.2 and the performance gain was interesting.  The 
posting can be found at:

http://candle.pha.pa.us/mhonarc/todo.detail/thread/msg7.html

The results are:

20 clients, 900 inserts per client, 1 insert per transaction, 4 different
tables.

7.0.2             About 10:52 average completion
multi-threaded    2:42 average completion
7.1beta3          1:13 average completion

If the multi-threaded version was 7.0.2 and threads increased performance 
that much - I would have to say that was a bonus.  However, the 
performance increases that the PostgreSQL team implemented later ... 
pushed the regular version ahead again.  That kinda says to me that 
potential is there.
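
For reference, the ratios implied by the timings quoted above can be worked out as follows. This is only arithmetic on the posted figures (the 7.0.2 time is given as approximate); the helper to_seconds is a hypothetical name for this example.

```python
def to_seconds(mm_ss):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


timings = {"7.0.2": "10:52", "multi-threaded": "2:42", "7.1beta3": "1:13"}
seconds = {name: to_seconds(t) for name, t in timings.items()}

# Speedup of each build relative to stock 7.0.2.
baseline = seconds["7.0.2"]
speedup = {name: baseline / s for name, s in seconds.items()}

for name in timings:
    print(f"{name:15s} {seconds[name]:4d} s  {speedup[name]:.2f}x")
```

So the threaded build was roughly 4 times faster than stock 7.0.2, while 7.1beta3 was roughly 9 times faster than 7.0.2 and a bit over 2 times faster than the threaded build.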

If you look at Myron Scott's post today you will see that it had other 
advantages going for it (like auto-vacuum!) and disadvantages ... rogue 
thread corruption (already debated today).

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//





Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Alex Pilosov


On Wed, 26 Sep 2001, D. Hageman wrote:

 Oh, man ... am I reading stuff into what you are writing or are you 
 reading stuff into what I am writing?  Maybe a little bit of both?  My 
 original contention is that I think that the best way to get the full 
 potential out of SMP machines is to use a threads model.  I didn't say the 
 present way wasn't fast.  
Or alternatively, that the current inter-process locking is a bit
inefficient. It's possible to have inter-process locks that are as fast as
inter-thread locks.

   Actually, if I remember, there was someone who ported postgresql (I think
  it was 6.5) to be multithreaded with major pain, because the requirement
  was to integrate with CORBA. I believe that person posted some benchmarks
  which were essentially identical to non-threaded postgres...
 
 Actually, it was 7.0.2 and the performance gain was interesting.  The 
 posting can be found at:
 
 7.0.2             About 10:52 average completion
 multi-threaded    2:42 average completion
 7.1beta3          1:13 average completion
 
 If the multi-threaded version was 7.0.2 and threads increased performance 
 that much - I would have to say that was a bonus.  However, the 
 performance increases that the PostgreSQL team implemented later ... 
 pushed the regular version ahead again.  That kinda says to me that 
 potential is there.
Alternatively, you could read that 7.1 took the wind out of threaded
sails. :) But I guess we won't know until the current version is ported to
threads...

-alex



Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Tom Lane


D. Hageman [EMAIL PROTECTED] writes:
 If you look at Myron Scott's post today you will see that it had other 
 advantages going for it (like auto-vacuum!) and disadvantages ... rogue 
 thread corruption (already debated today).

But note that Myron did a number of things that are (IMHO) orthogonal
to process-to-thread conversion, such as adding prepared statements,
a separate thread/process/whateveryoucallit for buffer writing, ditto
for vacuuming, etc.  I think his results cannot be taken as indicative
of the benefits of threads per se --- these other things could be
implemented in a pure process model too, and we have no data with which
to estimate which change bought how much.

Threading certainly should reduce the context switch time, but this
comes at the price of increased overhead within each context (since
access to thread-local variables is not free).  It's by no means
obvious that there's a net win there.

regards, tom lane


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Myron Scott



 
 But note that Myron did a number of things that are (IMHO) orthogonal

yes, I did :)

 to process-to-thread conversion, such as adding prepared statements,
 a separate thread/process/whateveryoucallit for buffer writing, ditto
 for vacuuming, etc.  I think his results cannot be taken as indicative
 of the benefits of threads per se --- these other things could be
 implemented in a pure process model too, and we have no data with which
 to estimate which change bought how much.
 

If you are comparing just process vs. thread, I really don't think I
gained much for performance and ended up with some pretty unmanageable
code.

The one thing that led to most of the gains was scheduling all the writes
to one thread which, as noted by Tom,  you could do on the process model.
Besides, most of the advantage in doing this was taken away with the
addition of WAL in 7.1.
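
As a rough illustration of the single-writer idea described above (not the code from the threaded port, which is not shown in this thread), the sketch below funnels every write request through one dedicated thread. The names `writer_loop`, `run_demo`, and the `storage` dictionary are hypothetical, and the dictionary assignment merely stands in for a real disk write.

```python
import queue
import threading


def writer_loop(requests, storage):
    """Apply queued writes one at a time; None is the shutdown signal."""
    while True:
        item = requests.get()
        if item is None:
            break
        page_no, data = item
        # Stand-in for the actual disk write (an assumption for this sketch).
        storage[page_no] = data


def run_demo(n_workers=4, pages_per_worker=3):
    requests = queue.Queue()
    storage = {}
    writer = threading.Thread(target=writer_loop, args=(requests, storage))
    writer.start()

    def worker(wid):
        # Workers never touch storage directly; they only enqueue requests.
        for i in range(pages_per_worker):
            page_no = wid * pages_per_worker + i
            requests.put((page_no, f"worker {wid} page {i}"))

    workers = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    requests.put(None)  # all workers are done, so tell the writer to stop
    writer.join()
    return storage


storage = run_demo()
print(len(storage))
```

Only the writer touches `storage`, so the workers need no locking of their own around it. The same layout would work with a writer process and an inter-process queue, which is the point made about the process model.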

The other real gain that I saw with threading was limiting the number of
open files, but that led me to alter much of the file manager in order to
synchronize access to the files, which probably slowed things a bit.

To be honest, I don't think I, personally, would try this again. I went
pretty far off the beaten path with this thing.  It works well for what I
am doing (a limited number of SQL statements run many times over) but there
probably was a better way.  I'm thinking now that I should have tried to
add a CORBA interface for connections. I would have been able to
accomplish my original goals without creating a dead end for myself.


Thanks all for a great project,

Myron
[EMAIL PROTECTED]



Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread D. Hageman



The plan for the new spinlocks does look like it has some potential.  My 
only comment in regards to performance when we start looking at SMP 
machines is ... it is my belief that getting a true threaded backend may 
be the only way to get the full potential out of SMP machines.  I see that 
is one of the things to experiment with on the TODO list and I have seen 
some people have messed around already with this using Solaris threads.  
It should probably be attempted with pthreads if PostgreSQL is going to 
keep some semblance of cross-platform compatibility.  At that time, it 
would probably be easier to go in and clean up some stuff for the 
implementation of other TODO items (put in the base framework for more 
complex future items) as threading the backend would take a little bit of 
ideology shift.

Of course, it is much easier to stand back and talk about this than 
actually do it - especially coming from someone who has only tried to 
contribute a few pieces of code.  Keep up the good work.


On Wed, 26 Sep 2001, Tom Lane wrote:

 At the just-past OSDN database conference, Bruce and I were annoyed by
 some benchmark results showing that Postgres performed poorly on an
 8-way SMP machine.  Based on past discussion, it seems likely that the
 culprit is the known inefficiency in our spinlock implementation.
 After chewing on it for awhile, we came up with an idea for a solution.
 
 The following proposal should improve performance substantially when
 there is contention for a lock, but it creates no portability risks
 because it uses the same system facilities (TAS and SysV semaphores)
 that we have always relied on.  Also, I think it'd be fairly easy to
 implement --- I could probably get it done in a day.
 
 Comments anyone?
 
   regards, tom lane
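
The quoted summary does not spell out the proposal itself, so the following is only a generic sketch of a lock that spins briefly and then blocks, one common way to combine a test-and-set with a sleeping primitive. It is not Tom Lane's design. Python has no user-level TAS instruction, so a non-blocking `threading.Lock.acquire` stands in for TAS and a blocking acquire stands in for sleeping on a semaphore. The class name `SpinThenBlockLock` and the `max_spins` parameter are hypothetical.

```python
import threading


class SpinThenBlockLock:
    """Try a bounded number of non-blocking acquires, then block."""

    def __init__(self, max_spins=100):
        self._lock = threading.Lock()
        self._max_spins = max_spins

    def acquire(self):
        # Spin phase: each non-blocking attempt plays the role of a TAS.
        for _ in range(self._max_spins):
            if self._lock.acquire(blocking=False):
                return
        # Contended: stop burning CPU and let the OS put us to sleep.
        self._lock.acquire()

    def release(self):
        self._lock.release()


def hammer(lock, counter, n):
    """Increment a shared counter n times under the lock."""
    for _ in range(n):
        lock.acquire()
        try:
            counter[0] += 1
        finally:
            lock.release()


lock = SpinThenBlockLock()
counter = [0]
threads = [threading.Thread(target=hammer, args=(lock, counter, 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])
```

The spin bound is the tuning knob: a short spin is cheap when the lock is held only briefly, while the blocking fallback avoids wasting cycles under heavy contention.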

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//





Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread Doug McNaught


D. Hageman [EMAIL PROTECTED] writes:

 The plan for the new spinlocks does look like it has some potential.  My 
 only comment in regards to performance when we start looking at SMP 
 machines is ... it is my belief that getting a true threaded backend may 
 be the only way to get the full potential out of SMP machines.

Depends on what you mean.  For scaling well with many connections and
simultaneous queries, there's no reason IMHO that the current
process-per-backend model won't do, assuming the locking issues are
addressed. 

If you're talking about making a single query use multiple CPUs, then
yes, we're probably talking about a fundamental rewrite to use threads 
or some other mechanism.

-Doug
-- 
In a world of steel-eyed death, and men who are fighting to be warm,
Come in, she said, I'll give you shelter from the storm.-Dylan


Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread D. Hageman


On 26 Sep 2001, Doug McNaught wrote:

 D. Hageman [EMAIL PROTECTED] writes:
 
  The plan for the new spinlocks does look like it has some potential.  My 
  only comment in regards to performance when we start looking at SMP 
  machines is ... it is my belief that getting a true threaded backend may 
  be the only way to get the full potential out of SMP machines.
 
 Depends on what you mean.  For scaling well with many connections and
 simultaneous queries, there's no reason IMHO that the current
 process-per-backend model won't do, assuming the locking issues are
 addressed. 

Well, I know the current process-per-backend model does quite well.  My 
argument is not that it fails to do as intended.  My original argument is 
that it is my belief (at the moment, with the knowledge I have) that to get 
the full potential out of SMP machines, threads might be the way to go.  The 
data from RedHat is quite interesting, so my feelings on this might 
change or could be reinforced.  I watch anxiously ;-)

 If you're talking about making a single query use multiple CPUs, then
 yes, we're probably talking about a fundamental rewrite to use threads 
 or some other mechanism.

Well, we have several thread model ideologies that we could choose from.  
Only experimentation would let us determine the proper path to follow, and 
even then it wouldn't be ideal for everyone.  You kinda just have to take the 
best scenario and run with it.  My first inclination would be something 
like a thread per connection (to reduce connection overhead), but then we 
could run into limits on different platforms (threads per process).  I 
kinda like the idea of using a thread for replication purposes ... lots 
of interesting possibilities exist and I will be first to admit that I 
don't have all the answers.  

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//



Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread mlw


D. Hageman wrote:

 The plan for the new spinlocks does look like it has some potential.  My
 only comment in regards to performance when we start looking at SMP
 machines is ... it is my belief that getting a true threaded backend may
 be the only way to get the full potential out of SMP machines.  I see that
 is one of the things to experiment with on the TODO list and I have seen
 some people have messed around already with this using Solaris threads.
 It should probably be attempted with pthreads if PostgreSQL is going to
 keep some semblance of cross-platform compatibility.  At that time, it
 would probably be easier to go in and clean up some stuff for the
 implementation of other TODO items (put in the base framework for more
 complex future items) as threading the backend would take a little bit of
 ideology shift.

I can only think of two objectives for threading. (1) running the various
connections in their own thread instead of their own process. (2) running
complex queries across multiple threads.

For  item (1) I see no value to this. It is a lot of work with no tangible
benefit. If you have an old-fashioned pthreads implementation, it will hurt
performance because the threads are scheduled within the single process's time slice. If
you have a newer kernel scheduled implementation, then you will have the same
scheduling as separate processes. The only thing you will need to do is
switch your brain from figuring out how to share data, to trying to figure
out how to isolate data. A multithreaded implementation lacks many of the
benefits and robustness of a multiprocess implementation.
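
The share-versus-isolate distinction can be seen in a few lines. This is a minimal sketch, assuming a Unix-like system where `os.fork` is available; the names `counter`, `bump`, `run_in_thread`, and `run_in_child_process` are made up for the illustration.

```python
import os
import threading

counter = {"value": 0}  # ordinary module-level state


def bump():
    counter["value"] += 1


def run_in_thread():
    # A thread shares the parent's address space, so its write is visible.
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    return counter["value"]


def run_in_child_process():
    # A forked child gets its own copy of memory; its write stays private.
    pid = os.fork()
    if pid == 0:
        bump()
        os._exit(0)  # leave the child immediately, skipping cleanup handlers
    os.waitpid(pid, 0)
    return counter["value"]


after_thread = run_in_thread()
after_process = run_in_child_process()
print(after_thread, after_process)
```

The thread's increment shows up in the parent, while the child process's increment does not. With processes, sharing has to be set up explicitly (for example through shared memory); with threads, everything is shared unless care is taken to keep it apart.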

For item (2) I can see how that could speed up queries in a low utilization
system, and that would be cool, but in a server that is under load, threading
the queries would probably be less efficient.



Re: [HACKERS] Spinlock performance improvement proposal

2001-09-26 Thread D. Hageman


On Wed, 26 Sep 2001, mlw wrote:
 
 I can only think of two objectives for threading. (1) running the various
 connections in their own thread instead of their own process. (2) running
 complex queries across multiple threads.
 
 For  item (1) I see no value to this. It is a lot of work with no tangible
 benefit. If you have an old-fashioned pthreads implementation, it will hurt
 performance because the threads are scheduled within the single process's time slice.

Old-fashioned ... as in a userland library that implements POSIX threads?  
Well, I would agree.  However, most *modern* implementations are done in 
the kernel or kernel and userland coop model and don't have this 
limitation (as you mention later in your e-mail).  You have kinda hit on 
one of my gripes about computers in general.  At what point in time does 
one say something is obsolete or too old to support anymore - that it 
hinders progress instead of adding a feature?

 you have a newer kernel scheduled implementation, then you will have the same
 scheduling as separate processes. The only thing you will need to do is
 switch your brain from figuring out how to share data, to trying to figure
 out how to isolate data. A multithreaded implementation lacks many of the
 benefits and robustness of a multiprocess implementation.

Save for the fact that the kernel can switch between threads faster than 
it can switch processes, considering threads share the same address space, 
code, etc. (each thread still has its own stack).  If need be, sharing the 
data between threads is much easier than sharing between processes. 

I can't comment on the isolate data line.  I am still trying to figure 
that one out.

That last line is a troll if I ever saw one ;-)  I will agree that threads 
aren't for everything and that they have costs just like everything else.  Let 
me stress that last part - like everything else.  Certain costs exist in 
the present model, nothing is - how should we say ... perfect.

 For item (2) I can see how that could speed up queries in a low utilization
 system, and that would be cool, but in a server that is under load, threading
 the queries would probably be less efficient.

Well, I don't follow your logic and you didn't give any substance to back 
up your claim.  I am willing to listen.

Another thought ... Oracle uses threads, doesn't it?  Or at least it had a 
single-processor and a multi-processor version last time I knew ... which do 
they claim is better?  (Not saying that Oracle's proclamation of what is 
good and what is not matters, but it is good for another viewpoint).

-- 
//\\
||  D. Hageman[EMAIL PROTECTED]  ||
\\//


