Re: [HACKERS] mosbench revisited

2011-08-15 Thread Greg Stark
On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote:
  I'm kind of interested by the
 result, actually, as I had feared that the spinlock protecting
 ProcArrayLock was going to be a bigger problem sooner.

I think this depends on how many connections you have. If you try to
scale up your benchmark by having hundreds of connections then you get
O(n^2) increase in the time spent with the procarray locked. It sounds
like they pinned the number of connections at the number of cores they
had. That makes sense if they're intentionally driving a cpu-bound
benchmark but it means they won't run into this problem.

-- 
greg



Re: [HACKERS] mosbench revisited

2011-08-15 Thread Robert Haas
On Mon, Aug 15, 2011 at 6:22 PM, Greg Stark st...@mit.edu wrote:
 On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote:
  I'm kind of interested by the
 result, actually, as I had feared that the spinlock protecting
 ProcArrayLock was going to be a bigger problem sooner.

 I think this depends on how many connections you have. If you try to
 scale up your benchmark by having hundreds of connections then you get
 O(n^2) increase in the time spent with the procarray locked. It sounds
 like they pinned the number of connections at the number of cores they
 had. That makes sense if they're intentionally driving a cpu-bound
 benchmark but it means they won't run into this problem.

The experiments that I've done so far seem to indicate that there are
actually a couple of different ways for ProcArrayLock to become a
problem.

First, as you say, if you have a busy read-write workload, you just
get plain old LWLock contention.  Everyone who wants a snapshot needs
the lock in shared mode; everyone who wants to commit needs it in
exclusive mode.  So, the people who are committing block both
snapshot-taking and other commits; while the snapshot-takers block
commit.

If you switch to a read-only workload, ProcArrayLock becomes
uncontended, since a transaction without an XID does not need to
acquire ProcArrayLock to commit.  But if you have enough CPUs (64 will
do it), the spinlock protecting ProcArrayLock becomes a bottleneck.
However, you can only observe this problem if the working set fits in
shared buffers; otherwise, BufFreelistLock contention slows the whole
system to a crawl, and the spinlock contention never materializes.
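
To make the contention pattern concrete, here is a minimal stand-alone
sketch of the shared/exclusive mix described above (a POSIX rwlock stands
in for the LWLock; this is an illustration, not PostgreSQL source):

/*
 * Illustration only: snapshot-takers acquire the lock shared, committers
 * acquire it exclusive, so each committer serializes against everyone.
 */
#include <pthread.h>

static pthread_rwlock_t proc_array_lock = PTHREAD_RWLOCK_INITIALIZER;

static void
take_snapshot(void)
{
	pthread_rwlock_rdlock(&proc_array_lock);	/* shared: many at once */
	/* ... scan the (simulated) proc array ... */
	pthread_rwlock_unlock(&proc_array_lock);
}

static void
commit_with_xid(void)
{
	pthread_rwlock_wrlock(&proc_array_lock);	/* exclusive: one at a time */
	/* ... clear our (simulated) XID entry ... */
	pthread_rwlock_unlock(&proc_array_lock);
}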

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-10 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 However, it doesn't really do anything to solve this problem.  The
 problem here is not that the size of the relation is changing too
 frequently - indeed, it's not changing at all in this test case.  The
 problem is rather that testing whether or not the size has in fact
 changed is costing too much.

You were complaining about the cost of the cache maintenance, that in
the current scheme of things would have to be called far too often.
Reducing the relation extension traffic would allow, I guess, to have
something more expensive to reset the cache ― it would not happen much.

Now, it could be that the idea is only worth “the electrons it's written
on” if all the relation extensions are taken care of by a background
process...

 The reason why we are testing the size of the relation here rather
 than just using reltuples is because the relation might have been
 extended since it was last analyzed.  We can't count on analyze to run
 often enough to avoid looking at the actual file size.  If the file's
 grown, we have to scale the number of tuples up proportional to the
 growth in relpages.

Could we send the same message to the stat collector as autoanalyze is
doing each time we extend a relation?

-- 
dim



Re: [HACKERS] mosbench revisited

2011-08-10 Thread Robert Haas
2011/8/10 Dimitri Fontaine dfonta...@hi-media.com:
 Robert Haas robertmh...@gmail.com writes:
 However, it doesn't really do anything to solve this problem.  The
 problem here is not that the size of the relation is changing too
 frequently - indeed, it's not changing at all in this test case.  The
 problem is rather that testing whether or not the size has in fact
 changed is costing too much.

 You were complaining about the cost of the cache maintenance, that in
 the current scheme of things would have to be called far too often.
 Reducing the relation extension traffic would allow, I guess, to have
 something more expensive to reset the cache -- it would not happen much.

That's an interesting point.  I confess to having no idea how frequent
relation extension is now, or how much overhead we could add before it
started to get noticeable.

It seems likely to me that we can design something that is
sufficiently lightweight that we don't need to worry about reducing
the relation extension traffic first, but I don't know that for
certain.

 The reason why we are testing the size of the relation here rather
 than just using reltuples is because the relation might have been
 extended since it was last analyzed.  We can't count on analyze to run
 often enough to avoid looking at the actual file size.  If the file's
 grown, we have to scale the number of tuples up proportional to the
 growth in relpages.

 Could we send the same message to the stat collector as autoanalyze is
 doing each time we extend a relation?

Mmm, maybe.  I don't really like the idea of getting the stats
collector involved in this; that seems like it will likely add
complexity without any corresponding benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-10 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Wed Aug 10 21:27:08 -0400 2011:
 2011/8/10 Dimitri Fontaine dfonta...@hi-media.com:
  Robert Haas robertmh...@gmail.com writes:
  However, it doesn't really do anything to solve this problem.  The
  problem here is not that the size of the relation is changing too
  frequently - indeed, it's not changing at all in this test case.  The
  problem is rather that testing whether or not the size has in fact
  changed is costing too much.
 
  You were complaining about the cost of the cache maintenance, that in
  the current scheme of things would have to be called far too often.
  Reducing the relation extension traffic would allow, I guess, to have
  something more expensive to reset the cache -- it would not happen much.
 
 That's an interesting point.  I confess to having no idea how frequent
 relation extension is now, or how much overhead we could add before it
 started to get noticeable.

I have seen several cases in which it (rel extension) is really
troublesome.  Think insertion on large bytea fields -- the toast table's
extension lock was heavily contended.  When this was in 8.1 we had quite
some trouble because of the locking involving some shared buffer pool
lwlock which is fortunately now partitioned; even without that, it is a
major contention point.  These were insert-only tables, so pages are
always full except the last one.

  The reason why we are testing the size of the relation here rather
  than just using reltuples is because the relation might have been
  extended since it was last analyzed.  We can't count on analyze to run
  often enough to avoid looking at the actual file size.  If the file's
  grown, we have to scale the number of tuples up proportional to the
  growth in relpages.
 
  Could we send the same message to the stat collector as autoanalyze is
  doing each time we extend a relation?
 
 Mmm, maybe.  I don't really like the idea of getting the stats
 collector involved in this; that seems like it will likely add
 complexity without any corresponding benefit.

Yeah, it adds delay and uncertainty (UDP being lossy).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] mosbench revisited

2011-08-10 Thread Robert Haas
On Wed, Aug 10, 2011 at 9:46 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of Wed Aug 10 21:27:08 -0400 2011:
 2011/8/10 Dimitri Fontaine dfonta...@hi-media.com:
  Robert Haas robertmh...@gmail.com writes:
  However, it doesn't really do anything to solve this problem.  The
  problem here is not that the size of the relation is changing too
  frequently - indeed, it's not changing at all in this test case.  The
  problem is rather that testing whether or not the size has in fact
  changed is costing too much.
 
  You were complaining about the cost of the cache maintenance, that in
  the current scheme of things would have to be called far too often.
  Reducing the relation extension traffic would allow, I guess, to have
  something more expensive to reset the cache -- it would not happen much.

 That's an interesting point.  I confess to having no idea how frequent
 relation extension is now, or how much overhead we could add before it
 started to get noticeable.

 I have seen several cases in which it (rel extension) is really
 troublesome.  Think insertion on large bytea fields -- the toast table's
 extension lock was heavily contended.  When this was in 8.1 we had quite
 some trouble because of the locking involving some shared buffer pool
 lwlock which is fortunately now partitioned; even without that, it is a
 major contention point.  These were insert-only tables, so pages are
 always full except the last one.

Yeah.  I think there's a good argument to be made that we should
extend relations in larger chunks, both for this reason and to avoid
fragmenting the underlying file.

But in this case, the fact that relation extension is such a
heavyweight operation almost works to our advantage.  I mean, if we're
already acquiring a heavyweight lock (and they're not called
heavyweight for nothing), making at least one system call which
updates filesystem metadata, and then releasing our heavyweight lock,
the additional overhead of setting a flag in shared memory or somesuch
might well be unnoticeable.

Or it might not - hard to know without testing.
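
For concreteness, a hypothetical sketch of the "flag in shared memory"
idea (every name here is invented for illustration): the extending backend
bumps a generation counter while it already holds the heavyweight
extension lock, and other backends can tell with one atomic read whether
their cached size might be stale.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct RelSizeCacheEntry
{
	_Atomic uint64_t extend_generation;	/* bumped on every extension */
	uint64_t	cached_nblocks;			/* size from our last real probe */
	uint64_t	cached_generation;		/* generation at that probe */
} RelSizeCacheEntry;

/* Called by the extender, while the extension lock is already held. */
static void
note_relation_extended(RelSizeCacheEntry *entry)
{
	atomic_fetch_add(&entry->extend_generation, 1);
}

/* Cheap check: only fall back to lseek() when the counter has moved. */
static bool
cached_size_is_fresh(RelSizeCacheEntry *entry)
{
	return atomic_load(&entry->extend_generation) == entry->cached_generation;
}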

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 1:43 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 The approach is to move the important things from a LWLock to a
 spinlock, and to not do any locking for increments to clock-hand
 increment and numBufferAllocs.
 That means that some buffers might occasionally get inspected twice
 and some might not get inspected at all during any given clock cycle,
 but this should not lead to any correctness problems.   (Disclosure:
 Tom didn't like this approach when it was last discussed.)

 I just offer this for whatever it is worth to you--I'm not proposing
 it as an actual patch to be applied.

Interesting approach.

 When data fits in RAM but not shared_buffers, maybe the easiest fix is
 to increase shared_buffers.  Which brings up the other question I had
 for you about your work with Nate's celebrated loaner machine.  Have
 you tried to reproduce the performance problems that have been
 reported (but without public disclosure of how to reproduce) with
 shared_buffers > 8GB on machines with RAM > 8GB?

No.  That's on my list, but thus far has not made it to the top of
said list.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Robert Haas robertmh...@gmail.com writes:
 It would be nice if the Linux guys would fix this problem for us, but
 I'm not sure whether they will.  For those who may be curious, the
 problem is in generic_file_llseek() in fs/read_write.c.  On a platform
 with 8-byte atomic reads, it seems like it ought to be very possible
 to read inode->i_size without taking a spinlock.  A little Googling
 around suggests that some patches along these lines have been proposed
 and - for reasons that I don't fully understand - rejected.  That now
 seems unfortunate.  Barring a kernel-level fix, we could try to
 implement our own cache to work around this problem.  However, any
 such cache would need to be darn cheap to check and update (since we
 can't assume that relation extension is an infrequent event) and must
 somehow avoid having the same sort of mutex contention that's killing the
 kernel in this workload.

 What about making the relation extension much less frequent?  It's been
 talked about before here, that instead of extending 8kB at a time we
 could (should) extend by much larger chunks.  I would go as far as
 preallocating the whole next segment (1GB) (in the background) as soon
 as the current is more than half full, or such a policy.

 Then you have the problem that you can't really use lseek() anymore to
 guess'timate a relation size, but Tom said in this thread that the
 planner certainly doesn't need something that accurate.  Maybe the
 reltuples would do?  If not, it could be that some adapting of its
 accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.

However, it doesn't really do anything to solve this problem.  The
problem here is not that the size of the relation is changing too
frequently - indeed, it's not changing at all in this test case.  The
problem is rather that testing whether or not the size has in fact
changed is costing too much.

The reason why we are testing the size of the relation here rather
than just using reltuples is because the relation might have been
extended since it was last analyzed.  We can't count on analyze to run
often enough to avoid looking at the actual file size.  If the file's
grown, we have to scale the number of tuples up proportional to the
growth in relpages.
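
The arithmetic behind that scaling is simple; roughly (this is just the
back-of-the-envelope version, not the actual planner code):

/*
 * Scale the tuple count from the last ANALYZE by the growth in pages:
 * the density observed then (reltuples / relpages) times the pages we
 * see now.
 */
static double
scaled_tuple_estimate(double reltuples, double relpages, double cur_pages)
{
	double		density;

	if (relpages <= 0)
		return reltuples;			/* never analyzed; nothing to scale by */

	density = reltuples / relpages;	/* tuples per page as of last ANALYZE */
	return density * cur_pages;		/* assume new pages are similarly full */
}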

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 3:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 My experiments have shown that the freelist proper is not
 substantially faster than the freelist clocksweep--and that is even
 under the assumption that putting things back into the freelist is
 absolutely free.

 The freelist isn't there to make buffer allocation faster, though;
 it's there for allocation efficiency.  The point is that when some
 buffers have become completely useless (eg, because we dropped the table
 they were for), they'll be recycled in preference to reclaiming buffers
 that contain still-possibly-useful data.  It would certainly be simple
 to get rid of the freelist and only recycle dead buffers when the clock
 sweep reaches them, but I think we'd be paying for that in extra,
 unnecessary I/O.

This is an interesting point, because I think it really gets at the
heart of the trade-off we're facing here: quality of buffer allocation
vs. concurrency.

Assuming no free buffers are present, we'd ideally like to evict the
buffer that won't be used until the latest possible future time.  LRU
is an approximation of this.  Our clock-sweep algorithm is a
less-accurate approximation of this that pays for itself with less
locking.  If we really wanted true LRU, we'd have to move things
around in the list every time a buffer was pinned.  That would be
really expensive.  Our clock sweep algorithm only requires fighting
over a piece of shared state when we want to kick a buffer out, rather
than every time we want to do anything to a buffer.  And it's probably
only slightly worse in terms of buffer allocation efficiency than true
LRU, so on balance it appears to be a good trade-off.
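
For readers who have not looked at the buffer manager, the clock sweep
under discussion works roughly like this (a simplified stand-alone sketch,
not the actual freelist.c code): each buffer carries a small usage count
that is bumped when the buffer is used, and the sweep hand decays counts
as it passes, evicting the first unpinned buffer whose count has already
reached zero.

typedef struct ToyBuffer
{
	int			pinned;			/* nonzero while in active use */
	int			usage_count;	/* bumped on use, decayed by the sweep */
} ToyBuffer;

static int
toy_clock_sweep(ToyBuffer *buffers, int nbuffers, int *hand)
{
	/* Usage counts are capped in the real code, so a few laps suffice. */
	for (int tries = 0; tries < 8 * nbuffers; tries++)
	{
		int			idx = *hand;
		ToyBuffer  *buf = &buffers[idx];

		*hand = (idx + 1) % nbuffers;
		if (buf->pinned)
			continue;				/* cannot evict a pinned buffer */
		if (buf->usage_count > 0)
			buf->usage_count--;		/* recently used: spare it this lap */
		else
			return idx;				/* victim found */
	}
	return -1;						/* everything pinned or busy */
}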

Even so, I think it's pretty trivial to construct a workload where
performance is throttled by BufFreelistLock.  I seriously doubt we're
going to be able to solve that problem by optimizing the code that
runs while holding that lock.  We might by ourselves a few more
percentage points here or there, but if we really want to bust this
bottleneck we're going to need to move to some algorithm that is an
even-less-perfect approximation of LRU but which doesn't involve a
single cluster-wide lock that throttles all buffer allocation.  No
matter how fast you make the critical section protected by that lock,
given enough CPUs, it will eventually not be fast enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-08 Thread Jesper Krogh

On 2011-08-08 15:29, Robert Haas wrote:

On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontainedimi...@2ndquadrant.fr  wrote:

Robert Haasrobertmh...@gmail.com  writes:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.  A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected.  That now
seems unfortunate.  Barring a kernel-level fix, we could try to
implement our own cache to work around this problem.  However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow avoid having the same sort of mutex contention that's killing the
kernel in this workload.

What about making the relation extension much less frequent?  It's been
talked about before here, that instead of extending 8kB at a time we
could (should) extend by much larger chunks.  I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current is more than half full, or such a policy.

Then you have the problem that you can't really use lseek() anymore to
guess'timate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate.  Maybe the
reltuples would do?  If not, it could be that some adapting of its
accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.

Pre-extending in bigger chunks has other benefits
as well, since it helps the filesystem (if it supports extents) to get
the data from the relation laid out in sequential order on disk.

On a well-filled relation, running filefrag on an ext4 filesystem reveals
that data loaded during initial creation gives 10-11 extents per 1GB
file, whereas a relation filled over time gives as many as 128 extents.

I would suggest 5% of the current relation size, or 25-100MB, whichever is
smaller. That would still keep the size down on small relations.
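
A sketch of that policy (the cap below is an arbitrary pick from the
25-100MB range, for illustration only):

#include <stdint.h>

#define PREALLOC_CAP_BYTES	((uint64_t) 64 * 1024 * 1024)	/* pick in 25-100MB */
#define BLCKSZ				8192

static uint64_t
prealloc_chunk_bytes(uint64_t current_rel_bytes)
{
	uint64_t	chunk = current_rel_bytes / 20;		/* 5% of current size */

	if (chunk > PREALLOC_CAP_BYTES)
		chunk = PREALLOC_CAP_BYTES;
	if (chunk < BLCKSZ)
		chunk = BLCKSZ;				/* never less than one block */
	return chunk;
}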

--
Jesper




Re: [HACKERS] mosbench revisited

2011-08-06 Thread Jeff Janes
On Wed, Aug 3, 2011 at 11:21 AM, Robert Haas robertmh...@gmail.com wrote:
 About nine months ago, we had a discussion of some benchmarking that
 was done by the mosbench folks at MIT:

 http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

 Although the authors used PostgreSQL as a test harness for driving
 load, it's pretty clear from reading the paper that their primary goal
 was to stress the Linux kernel, so the applicability of the paper to
 real-world PostgreSQL performance improvement is less than it might
 be.  Still, having now actually investigated in some detail many of
 the same performance issues that they were struggling with, I have a
 much clearer understanding of what's really going on here.  In
 PostgreSQL terms, here are the bottlenecks they ran into:

 1. We configure PostgreSQL to use a 2 Gbyte application-level cache
 because PostgreSQL protects its free-list with a single lock and thus
 scales poorly with smaller caches.  This is a complaint about
 BufFreeList lock which, in fact, I've seen as a huge point of
 contention on some workloads.  In fact, on read-only workloads, with
 my lazy vxid lock patch applied, this is, I believe, the only
 remaining unpartitioned LWLock that is ever taken in exclusive mode;
 or at least the only one that's taken anywhere near often enough to
 matter.  I think we're going to do something about this, although I
 don't have a specific idea in mind at the moment.

I was going to ask if you if had done any benchmarks with scale such
that the tables fit in RAM but not in shared_buffers.  I guess you
have.

The attached experimental patch fixed freelist contention on 8 cores.
It would be nice to see what happens above that.

It has been cherry picked up to HEAD, but not tested against it. (Last
tested in Dec 2010, my how time flies)

The approach is to move the important things from a LWLock to a
spinlock, and to not do any locking for increments to clock-hand
increment and numBufferAllocs.
That means that some buffers might occasionally get inspected twice
and some might not get inspected at all during any given clock cycle,
but this should not lead to any correctness problems.   (Disclosure:
Tom didn't like this approach when it was last discussed.)

I just offer this for whatever it is worth to you--I'm not proposing
it as an actual patch to be applied.

When data fits in RAM but not shared_buffers, maybe the easiest fix is
to increase shared_buffers.  Which brings up the other question I had
for you about your work with Nate's celebrated loaner machine.  Have
you tried to reproduce the performance problems that have been
reported (but without public disclosure of how to reproduce) with
shared_buffers > 8GB on machines with RAM > 8GB?

Cheers,

Jeff
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index bf9903b..304135e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -28,7 +28,8 @@ typedef struct
 	int			nextVictimBuffer;
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
-	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			lastFreeBuffer;		 /* Tail of list of unused buffers */
+	slock_t			mutex;			/* Protects all of these, except sometimes not nextVictimBuffer */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -44,7 +45,7 @@ typedef struct
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -123,15 +124,13 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
 	 * the rate of buffer consumption.	Note that buffers recycled by a
 	 * strategy object are intentionally not counted here.
+	 * This is now done without a lock, and so updates can be lost
 	 */
+  
 	StrategyControl->numBufferAllocs++;
 
 	/*
@@ -142,6 +141,12 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 	 */
 	while (StrategyControl->firstFreeBuffer >= 0)
 	{
+		SpinLockAcquire(&StrategyControl->mutex);
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			SpinLockRelease(&StrategyControl->mutex);
+			break;
+		}
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
@@ -149,6 +154,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
 
+		SpinLockRelease(&StrategyControl->mutex);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; discard it and retry.  (This can only happen if VACUUM put 

Re: [HACKERS] mosbench revisited

2011-08-06 Thread Jeff Janes
On Wed, Aug 3, 2011 at 3:21 PM, Jim Nasby j...@nasby.net wrote:
 On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
 1. We configure PostgreSQL to use a 2 Gbyte application-level cache
 because PostgreSQL protects its free-list with a single lock and thus
 scales poorly with smaller caches.  This is a complaint about
 BufFreeList lock which, in fact, I've seen as a huge point of
 contention on some workloads.  In fact, on read-only workloads, with
 my lazy vxid lock patch applied, this is, I believe, the only
 remaining unpartitioned LWLock that is ever taken in exclusive mode;
 or at least the only one that's taken anywhere near often enough to
 matter.  I think we're going to do something about this, although I
 don't have a specific idea in mind at the moment.

 This has been discussed before: 
 http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which 
 itself references 2 other threads).

 The basic idea is: have a background process that proactively moves buffers 
 onto the free list so that backends should normally never have to run the 
 clock sweep (which is rather expensive). The challenge there is figuring out 
 how to get stuff onto the free list with minimal locking impact. I think one 
 possible option would be to put the freelist under its own lock (IIRC we
 currently use it to protect the clock sweep as well). Of course, that still 
 means the free list lock could be a point of contention, but presumably it's 
 far faster to add or remove something from the list than it is to run the 
 clock sweep.

Hi Jim,

My experiments have shown that the freelist proper is not
substantially faster than the freelist clocksweep--and that is even
under the assumption that putting things back into the freelist is
absolutely free.  Under all the workloads I've been able to contrive,
other than ones contrived by actually hacking the code itself to make
it pathological, the average number of buffers inspected per run of
the clock sweep is 2.5.  Under contention, the mere act of acquiring
a lock is more traumatic than the actual work carried out under the
lock.

Cheers,

Jeff



Re: [HACKERS] mosbench revisited

2011-08-06 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 It would be nice if the Linux guys would fix this problem for us, but
 I'm not sure whether they will.  For those who may be curious, the
 problem is in generic_file_llseek() in fs/read_write.c.  On a platform
 with 8-byte atomic reads, it seems like it ought to be very possible
 to read inode->i_size without taking a spinlock.  A little Googling
 around suggests that some patches along these lines have been proposed
 and - for reasons that I don't fully understand - rejected.  That now
 seems unfortunate.  Barring a kernel-level fix, we could try to
 implement our own cache to work around this problem.  However, any
 such cache would need to be darn cheap to check and update (since we
 can't assume that relation extension is an infrequent event) and must
 somehow avoid having the same sort of mutex contention that's killing the
 kernel in this workload.

What about making the relation extension much less frequent?  It's been
talked about before here, that instead of extending 8kB at a time we
could (should) extend by much larger chunks.  I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current is more than half full, or such a policy.

Then you have the problem that you can't really use lseek() anymore to
guess'timate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate.  Maybe the
reltuples would do?  If not, it could be that some adapting of its
accuracy could be done?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] mosbench revisited

2011-08-06 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 My experiments have shown that the freelist proper is not
 substantially faster than the freelist clocksweep--and that is even
 under the assumption that putting things back into the freelist is
 absolutely free.

The freelist isn't there to make buffer allocation faster, though;
it's there for allocation efficiency.  The point is that when some
buffers have become completely useless (eg, because we dropped the table
they were for), they'll be recycled in preference to reclaiming buffers
that contain still-possibly-useful data.  It would certainly be simple
to get rid of the freelist and only recycle dead buffers when the clock
sweep reaches them, but I think we'd be paying for that in extra,
unnecessary I/O.

regards, tom lane



Re: [HACKERS] mosbench revisited

2011-08-04 Thread Robert Haas
On Wed, Aug 3, 2011 at 9:16 PM, Robert Haas robertmh...@gmail.com wrote:
 Spinlocks seem to have a very ugly tipping point.

And on that note, here are oprofile results from pgbench -n -T 300 -S
-c 64 -j 64 -M prepared on the latest master branch, compiled with
-O2 -fno-omit-frame-pointer.  shared_buffers=8GB, 64-core machine,
RHEL 6.1.  By running with -M prepared, it dodges the lseek()
problem.

960576   23.7580  postgres    postgres    s_lock
562821   13.9203  no-vmlinux  no-vmlinux  /no-vmlinux
321191    7.9440  postgres    postgres    LWLockRelease
317653    7.8565  postgres    postgres    LWLockAcquire
224812    5.5603  postgres    postgres    GetSnapshotData
 81156    2.0072  postgres    postgres    _bt_compare
 78744    1.9476  postgres    postgres    PinBuffer
 58101    1.4370  postgres    postgres    hash_search_with_hash_value
 43865    1.0849  postgres    postgres    AllocSetAlloc
 25832    0.6389  postgres    postgres    PostgresMain

Since SpinLockAcquire() is an in-line macro that only calls s_lock()
if the initial TAS fails, not only the time directly attributed to
s_lock but also a good chunk of the CPU time attributable to
LWLockAcquire and LWLockRelease() is likely time spent fighting over
spinlocks.  Since I compiled with frame pointers, it's pretty easy to
see where those s_lock calls are coming from.  Here's an excerpt from
opreport -c:

       5  5.0e-04  postgres  postgres  _bt_getbuf
       6  6.0e-04  postgres  postgres  _bt_relandgetbuf
      14   0.0014  postgres  postgres  ReleaseAndReadBuffer
      85   0.0085  postgres  postgres  ReadBuffer_common
     206   0.0207  postgres  postgres  GetSnapshotData
   18344   1.8437  postgres  postgres  UnpinBuffer
   24977   2.5103  postgres  postgres  PinBuffer
  406948  40.9009  postgres  postgres  LWLockRelease
  544376  54.7133  postgres  postgres  LWLockAcquire
994947   23.5746  postgres  postgres  s_lock

It's also fairly easy to track down who is calling LWLockAcquire and
LWLockRelease.  Nearly all of the calls are from just two
contributors:

  241655  27.6830  postgres  postgres  ReadBuffer_common
  566434  64.8885  postgres  postgres  GetSnapshotData
328548     7.7847  postgres  postgres  LWLockAcquire

  176629  23.8917  postgres  postgres  ReadBuffer_common
  524348  70.9259  postgres  postgres  GetSnapshotData
332333     7.8744  postgres  postgres  LWLockRelease

So, most of the s_lock calls come from LWLockAcquire, and most of the
LWLockAcquire calls come from GetSnapshotData.  That's not quite
enough to prove that all the spinning going on here is coming from
contention over the spinlock protecting ProcArrayLock, because it
needn't be the case that all calls to LWLockAcquire are equally likely
to end up in s_lock.  You could speculate that ProcArrayLock isn't
actually responsible for many of those s_lock calls and that some
other lock, like maybe the buffer mapping locks, is disproportionately
responsible for the s_lock calls.  But in fact I think it's exactly
the other way around: the buffer mapping locks are partitioned 16
ways, while there's only one ProcArrayLock.  I'm willing to bet that's
where nearly all of the spinning is happening, and I'll further bet
that that spinning accounts for AT LEAST a third of the total CPU time
on this workload.  And maybe closer to half.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-04 Thread Aidan Van Dyk
On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas robertmh...@gmail.com wrote:

  And hoping that the Linux guys decide to do something about it.
  This isn't really our bug - lseek is quite cheap in the uncontended
 case.

Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where
they've finally removed the BKL from VFS/inode?

I mean, complaining about scalability in linux 2.6.18 is like
complaining about scalability in postgresql 8.2 ;-)

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [HACKERS] mosbench revisited

2011-08-04 Thread Robert Haas
On Thu, Aug 4, 2011 at 5:09 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas robertmh...@gmail.com wrote:

      And hoping that the Linux guys decide to do something about it.
  This isn't really our bug - lseek is quite cheap in the uncontended
 case.

 Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where
 they've finally removed the BKL from VFS/inode?

 I mean, complaining about scalability in linux 2.6.18 is like
 complaining about scalability in postgresql 8.2 ;-)

Hmm.  This machine is running 2.6.32-131.6.1.el6.x86_64, not 2.6.18.
Not sure how much the code has changed since then, but the spinlock is
there in the master branch of Linus's repository.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] mosbench revisited

2011-08-03 Thread Robert Haas
About nine months ago, we had a discussion of some benchmarking that
was done by the mosbench folks at MIT:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

Although the authors used PostgreSQL as a test harness for driving
load, it's pretty clear from reading the paper that their primary goal
was to stress the Linux kernel, so the applicability of the paper to
real-world PostgreSQL performance improvement is less than it might
be.  Still, having now actually investigated in some detail many of
the same performance issues that they were struggling with, I have a
much clearer understanding of what's really going on here.  In
PostgreSQL terms, here are the bottlenecks they ran into:

1. We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches.  This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads.  In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter.  I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.

2. PostgreSQL implements row- and table-level locks atop user-level
mutexes; as a result, even a non-conflicting row- or table-level lock
acquisition requires exclusively locking one of only 16 global
mutexes.  I think that the reference to row-level locks here is a red
herring; or at least, I haven't seen any evidence that row-level
locking is a meaningful source of contention on any workload I've
tested.  Table-level locks clearly are, and this is the problem that
the now-committed fastlock patch addressed.  So, fixed!

3. Our workload creates one PostgreSQL connection per server core and
sends queries (selects or updates) in batches of 256, aggregating
successive read-only transactions into single transactions. This
workload is intended to minimize application-level contention within
PostgreSQL in order to maximize the stress PostgreSQL places on the
kernel.  I had no idea what this was talking about at the time, but
it's now obvious in retrospect that they were working around the
overhead imposed by acquiring and releasing relation and virtualxid
locks.  My pending lazy vxids patch will address the remaining issue
here.

4. With modified PostgreSQL on stock Linux, throughput for both
workloads collapses at 36 cores .. The main reason is the kernel's
lseek implementation.  With the fastlock, sinval-hasmessages, and
lazy-vxid patches applied (the first two are committed now), it's now
much easier to run headlong into this bottleneck.  Prior to those
patches, for this to be an issue, you would need to batch your queries
together in big groups to avoid getting whacked by the lock manager
and/or sinval overhead first.  With those problems and the recently
discovered bottleneck in glibc's random() implementation fixed, good
old pgbench -S is enough to hit this problem if you have enough
clients and enough cores.  And it turns out that the word collapse
is not an exaggeration.  On a 64-core Intel box running RHEL 6.1,
performance ramped up from 24k TPS at 4 clients to 175k TPS at 32
clients and then to 207k TPS at 44 clients.  After that it fell off a
cliff, dropping to 93k TPS at 52 clients and 26k TPS at 64 clients,
consuming truly horrifying amounts of system time in the process.  A
somewhat tedious investigation revealed that the problem is, in fact,
contention on the inode mutex caused by lseek().  Results are much
better with -M prepared (310k TPS at 48 clients, 294k TPS at 64
clients).  All one-minute tests with scale factor 100, fitting inside
8GB of shared_buffers (clearly not enough for serious benchmarking,
but enough to demonstrate this issue).

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.  A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected.  That now
seems unfortunate.  Barring a kernel-level fix, we could try to
implement our own cache to work around this problem.  However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow avoid having the same sort of mutex contention that's killing the
kernel in this workload.
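
For reference, the size probe that keeps hitting that kernel path boils
down to an lseek(SEEK_END) on the relation file; a stripped-down
stand-alone version (not the actual md.c code) looks like this:

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

static off_t
relation_nblocks(int fd)
{
	/* This is the lseek() that takes the contended lock in the kernel. */
	off_t		len = lseek(fd, 0, SEEK_END);

	if (len < 0)
		return -1;
	return len / BLCKSZ;			/* file size in BLCKSZ-sized blocks */
}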

5. With all of the above problems fixed or worked around, the authors
write, PostgreSQL's overall scalability is primarily limited by
contention for the spinlock protecting the buffer 

Re: [HACKERS] mosbench revisited

2011-08-03 Thread Martijn van Oosterhout
On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
 It would be nice if the Linux guys would fix this problem for us, but
 I'm not sure whether they will.  For those who may be curious, the
 problem is in generic_file_llseek() in fs/read_write.c.  On a platform
 with 8-byte atomic reads, it seems like it ought to be very possible
 to read inode->i_size without taking a spinlock.

Interesting. There's this thread from 2003 suggesting the use of pread
instead, it was rejected on the argument that lseek is cheap so not a
problem.

http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

Perhaps we now have a benchmark where the effect can be measured.

There's the issue about whether it screws up the readahead mechanism...

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] mosbench revisited

2011-08-03 Thread Tom Lane
Martijn van Oosterhout klep...@svana.org writes:
 On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
 It would be nice if the Linux guys would fix this problem for us, but
 I'm not sure whether they will.  For those who may be curious, the
 problem is in generic_file_llseek() in fs/read_write.c.  On a platform
 with 8-byte atomic reads, it seems like it ought to be very possible
 to read inode->i_size without taking a spinlock.

 Interesting. There's this thread from 2003 suggesting the use of pread
 instead, it was rejected on the argument that lseek is cheap so not a
 problem.

 http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

That seems rather unrelated.  The point here is our use of lseek to find
out the current file size --- or at least, I would hope they're not
trying to read the inode's file size in a SEEK_CUR call.

The reason -M prepared helps is presumably that it eliminates most of
the RelationGetNumberOfBlocks calls the planner does to check current
table size.  While we could certainly consider using a cheaper (possibly
more stale) value there, it's a bit astonishing to think that that's the
main cost in a parse/plan/execute cycle.  Perhaps there are more hotspot
calls than that one?

regards, tom lane



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Robert Haas
On Wed, Aug 3, 2011 at 2:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Martijn van Oosterhout klep...@svana.org writes:
 On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
 It would be nice if the Linux guys would fix this problem for us, but
 I'm not sure whether they will.  For those who may be curious, the
 problem is in generic_file_llseek() in fs/read_write.c.  On a platform
 with 8-byte atomic reads, it seems like it ought to be very possible
 to read inode->i_size without taking a spinlock.

 Interesting. There's this thread from 2003 suggesting the use of pread
 instead, it was rejected on the argument that lseek is cheap so not a
 problem.

 http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

 That seems rather unrelated.  The point here is our use of lseek to find
 out the current file size --- or at least, I would hope they're not
 trying to read the inode's file size in a SEEK_CUR call.

Correct.

 The reason -M prepared helps is presumably that it eliminates most of
 the RelationGetNumberOfBlocks calls the planner does to check current
 table size.  While we could certainly consider using a cheaper (possibly
 more stale) value there, it's a bit astonishing to think that that's the
 main cost in a parse/plan/execute cycle.  Perhaps there are more hotspot
 calls than that one?

Nope.

On a straight pgbench -S test, you get four system calls per query:
recvfrom(), lseek(), lseek(), sendto().  Adding -M prepared eliminates
the two lseeks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On a straight pgbench -S test, you get four system calls per query:
 recvfrom(), lseek(), lseek(), sendto().  Adding -M prepared eliminates
 the two lseeks.

[ scratches head... ]  Two?  Is that one for the table and one for its
lone index, or are we being redundant there?

(If the query ended up being a seqscan, I'd expect a second
lseek(SEEK_END) when the executor starts up, but I gather from the other
complaints that the mosbench people were only testing simple indexscan
queries.)

regards, tom lane



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Robert Haas
On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On a straight pgbench -S test, you get four system calls per query:
 recvfrom(), lseek(), lseek(), sendto().  Adding -M prepared eliminates
 the two lseeks.

 [ scratches head... ]  Two?

Yep.

 Is that one for the table and one for its
 lone index, or are we being redundant there?

The former.  Specifically, it appears we're smart enough to only test
the last segment (in this case, the table is large enough that there
is a .1 file, and that's what we're lseeking).

 (If the query ended up being a seqscan, I'd expect a second
 lseek(SEEK_END) when the executor starts up, but I gather from the other
 complaints that the mosbench people were only testing simple indexscan
 queries.)

Yeah, it seems that for a sequential scan we lseek the heap, then the
index, then the heap again; but for index scans we just hit the heap
and the index.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 (If the query ended up being a seqscan, I'd expect a second
 lseek(SEEK_END) when the executor starts up, but I gather from the other
 complaints that the mosbench people were only testing simple indexscan
 queries.)

 Yeah, it seems that for a sequential scan we lseek the heap, then the
 index, then the heap again; but for index scans we just hit the heap
 and the index.

Sure.  The first two come from the planner getting the table and index
sizes for estimation purposes (look in plancat.c).  The last is done in
heapam.c's initscan().  We could possibly accept stale values for the
planner estimates, but I think heapam's number had better be accurate.

regards, tom lane



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Robert Haas
On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 (If the query ended up being a seqscan, I'd expect a second
 lseek(SEEK_END) when the executor starts up, but I gather from the other
 complaints that the mosbench people were only testing simple indexscan
 queries.)

 Yeah, it seems that for a sequential scan we lseek the heap, then the
 index, then the heap again; but for index scans we just hit the heap
 and the index.

 Sure.  The first two come from the planner getting the table and index
 sizes for estimation purposes (look in plancat.c).  The last is done in
 heapam.c's initscan().  We could possibly accept stale values for the
 planner estimates, but I think heapam's number had better be accurate.

I think the exact requirement is that, if the relation turns out to be
larger than the size we read, the extra blocks had better not contain
any tuples our snapshot can see.  There's actually no interlock
between smgrnblocks() and smgrextend() right now, so presumably we
don't need to add one.  However, a value cached from a few seconds ago
is clearly not going to cut it.

I don't really think there's any sensible way to implement a
per-backend cache, because that would require invalidation events of
some kind to be sent on relation extension, and that seems utterly
insane from a performance standpoint, even if we invented something
less expensive than sinval.  I guess it might work for planning
purposes if you only sent out invalidation events on every N'th
extension or something, but penalizing the accuracy of planning to
work around a Linux kernel bug that only manifests itself on machines
with 32 cores doesn't seem very appealing.

A shared cache seems like it could work, but the locking is tricky.
Normally we'd just use a hash table protected by an LWLock, one
LWLock per partition, but here that's clearly not going to work.  The
kernel is using a spinlock per file, and that's still too
heavy-weight.  I think that if we could prepopulate the cache with all
the keys (i.e. relfilenodes) we care about and then never add or evict
any, we could run it completely unlocked.  I believe that the existing
memory barriers in things like LWLockAcquire() would be sufficient to
prevent us from reading a too-old value (i.e. block count).  In
particular, you couldn't read a value that predated your snapshot, if
you got your snapshot by holding ProcArrayLock.  But the races
involved with adding and removing items from the cache are hard to
deal with without using locks, especially because the keys are 12
bytes or more and therefore can't be read or written atomically.
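
For reference, the key in question is essentially the relfilenode triple,
paraphrased here from the PostgreSQL headers: three 4-byte OIDs, 12 bytes
in all, wider than anything common hardware can read or write in a single
atomic instruction.

typedef unsigned int Oid;

typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace OID */
	Oid			dbNode;			/* database OID */
	Oid			relNode;		/* relation OID */
} RelFileNode;					/* sizeof(RelFileNode) == 12 */

A cache that also keyed on fork number would be wider still.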

I've been mulling over how we might deal with this and actually coded
up an implementation, but it turns out (surprise, surprise) to have
problems with insufficient locking.  So I'm thinking it over some
more.  And hoping that the Linux guys decide to do something about it.
 This isn't really our bug - lseek is quite cheap in the uncontended
case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... We could possibly accept stale values for the
 planner estimates, but I think heapam's number had better be accurate.

 I think the exact requirement is that, if the relation turns out to be
 larger than the size we read, the extra blocks had better not contain
 any tuples our snapshot can see.  There's actually no interlock
 between smgrnblocks() and smgrextend() right now, so presumably we
 don't need to add one.

No interlock in userspace, you mean.  We're relying on the kernel to do
it, ie, give us a number that is not older than the time of our (already
taken at this point) snapshot.

 I don't really think there's any sensible way to implement a
 per-backend cache, because that would require invalidation events of
 some kind to be sent on relation extension, and that seems utterly
 insane from a performance standpoint, even if we invented something
 less expensive than sinval.

Yeah, that's the issue.  But relation extension is not actually a
cheap operation, since it requires a minimum of one kernel call that is
presumably doing something nontrivial in the filesystem.  I'm not
entirely convinced that we couldn't make this work --- especially since
we could certainly derate the duty cycle by a factor of ten or more
without giving up anything remotely meaningful in planning accuracy.
(I'd be inclined to make it send an inval only once the relation size
had changed at least, say, 10%.)
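
A rough sketch of that derating (all names invented for illustration):
remember the size at which we last broadcast an invalidation and only send
another once the relation has grown by the chosen fraction.

#include <stdbool.h>
#include <stdint.h>

static bool
size_change_worth_inval(uint64_t last_advertised_blocks, uint64_t new_blocks)
{
	/* Always advertise the first extension of a previously empty relation. */
	if (last_advertised_blocks == 0)
		return true;

	/* Otherwise only when we have grown by at least 10% since last time. */
	return new_blocks >= last_advertised_blocks + last_advertised_blocks / 10;
}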

 A shared cache seems like it could work, but the locking is tricky.
 Normally we'd just use a hash table protected by an LWLock, one
 LWLock per partition, but here that's clearly not going to work.  The
 kernel is using a spinlock per file, and that's still too
 heavy-weight.

That still seems utterly astonishing to me.  We're touching each of
those files once per query cycle; a cycle that contains two message
sends, who knows how many internal spinlock/lwlock/heavyweightlock
acquisitions inside Postgres (some of which *do* contend with each
other), and a not insignificant amount of plain old computing.
Meanwhile, this particular spinlock inside the kernel is protecting
what, a single doubleword fetch?  How is that the bottleneck?

I am wondering whether kernel spinlocks are broken.

regards, tom lane



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Jim Nasby
On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
 1. We configure PostgreSQL to use a 2 Gbyte application-level cache
 because PostgreSQL protects its free-list with a single lock and thus
 scales poorly with smaller caches.  This is a complaint about
 BufFreeList lock which, in fact, I've seen as a huge point of
 contention on some workloads.  In fact, on read-only workloads, with
 my lazy vxid lock patch applied, this is, I believe, the only
 remaining unpartitioned LWLock that is ever taken in exclusive mode;
 or at least the only one that's taken anywhere near often enough to
 matter.  I think we're going to do something about this, although I
 don't have a specific idea in mind at the moment.

This has been discussed before: 
http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself 
references 2 other threads).

The basic idea is: have a background process that proactively moves buffers 
onto the free list so that backends should normally never have to run the clock 
sweep (which is rather expensive). The challenge there is figuring out how to 
get stuff onto the free list with minimal locking impact. I think one possible 
option would be to put the freelist under its own lock (IIRC we currently use 
it to protect the clock sweep as well). Of course, that still means the free 
list lock could be a point of contention, but presumably it's far faster to add 
or remove something from the list than it is to run the clock sweep.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] mosbench revisited

2011-08-03 Thread Robert Haas
On Wed, Aug 3, 2011 at 5:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 That still seems utterly astonishing to me.  We're touching each of
 those files once per query cycle; a cycle that contains two message
 sends, who knows how many internal spinlock/lwlock/heavyweightlock
 acquisitions inside Postgres (some of which *do* contend with each
 other), and a not insignificant amount of plain old computing.
 Meanwhile, this particular spinlock inside the kernel is protecting
 what, a single doubleword fetch?  How is that the bottleneck?

Spinlocks seem to have a very ugly tipping point.  When I tested
pgbench -S on a 64-core system with the lazy vxid patch applied and a
patch to use random_r() in lieu of random, the amount of system time
used per SELECT-only transaction at 48 clients was 3.59 times as much
as it was at 4 clients.  And the amount used per transaction at 52
clients was 3.63 times the amount used per transaction at 48 clients.
And the amount used at 56 clients was 3.25 times the amount used at 52
clients.  You can see the throughput graph starting to flatten out in
the 32-44 client range, but it's not particularly alarming.  However,
once you pass that point things rapidly get totally out of control in
a real hurry.  A few more clients and the machine is basically doing
nothing but spin.

 I am wondering whether kernel spinlocks are broken.

I don't think so.  Stefan Kaltenbrunner had one profile where he
showed something like sixty or eighty percent of the usermode CPU time
in s_lock.  I didn't have access to that particular hardware, but the
testing I've done strongly suggests that most of that was the
SInvalReadLock spinlock.  And before I patched pgbench to avoid
calling random(), that was doing the same thing - literally flattening
a 64-core box fighting over a single futex that normally costs almost
nothing.  (That one wasn't quite as bad because the futex actually
deschedules the waiters, but it was still bad.)  I'm actually not
really sure why it shakes out this way (birthday paradox?) but having
seen the effect several times now, I'm disinclined to believe it's an
artifact.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] mosbench revisited

2011-08-03 Thread Robert Haas
On Wed, Aug 3, 2011 at 6:21 PM, Jim Nasby j...@nasby.net wrote:
 On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
 1. We configure PostgreSQL to use a 2 Gbyte application-level cache
 because PostgreSQL protects its free-list with a single lock and thus
 scales poorly with smaller caches.  This is a complaint about
 BufFreeList lock which, in fact, I've seen as a huge point of
 contention on some workloads.  In fact, on read-only workloads, with
 my lazy vxid lock patch applied, this is, I believe, the only
 remaining unpartitioned LWLock that is ever taken in exclusive mode;
 or at least the only one that's taken anywhere near often enough to
 matter.  I think we're going to do something about this, although I
 don't have a specific idea in mind at the moment.

 This has been discussed before: 
 http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which 
 itself references 2 other threads).

 The basic idea is: have a background process that proactively moves buffers 
 onto the free list so that backends should normally never have to run the 
 clock sweep (which is rather expensive). The challenge there is figuring out 
 how to get stuff onto the free list with minimal locking impact. I think one 
 possible option would be to put the freelist under its own lock (IIRC we 
 currently use it to protect the clock sweep as well). Of course, that still 
 means the free list lock could be a point of contention, but presumably it's 
 far faster to add or remove something from the list than it is to run the 
 clock sweep.

Based on recent benchmarking, I'm going to say no.  It doesn't seem
to matter how short you make the critical section: a single
program-wide mutex is a loser.  Furthermore, the free list is a
joke, because it's nearly always going to be completely empty.  We
could probably just rip that out and use the clock sweep and never
miss it, but I doubt it would improve performance much.

I think what we probably need to do is have multiple clock sweeps in
progress at the same time.  So, for example, if you have 8GB of
shared_buffers, you might have 8 mutexes, one for each GB.  When a
process wants a buffer, it locks one of the mutexes and sweeps through
that 1GB partition.  If it finds a buffer before returning to the
point at which it started the scan, it's done.  Otherwise, it releases
its mutex, grabs the next one, and continues on until it finds a free
buffer.

The trick with any modification in this area is that pretty much any
degree of increased parallelism is potentially going to reduce the
quality of buffer replacement to some degree. So the trick will be to
try to squeeze out as much concurrency as possible while minimizing
degradation in the quality of buffer replacements.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
