Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote:
> I'm kind of interested by the result, actually, as I had feared that the spinlock protecting ProcArrayLock was going to be a bigger problem sooner.

I think this depends on how many connections you have. If you try to scale up your benchmark by having hundreds of connections then you get an O(n^2) increase in the time spent with the procarray locked.

It sounds like they pinned the number of connections at the number of cores they had. That makes sense if they're intentionally driving a cpu-bound benchmark, but it means they won't run into this problem.

--
greg
Re: [HACKERS] mosbench revisited
On Mon, Aug 15, 2011 at 6:22 PM, Greg Stark st...@mit.edu wrote:
> On Wed, Aug 3, 2011 at 7:21 PM, Robert Haas robertmh...@gmail.com wrote:
>> I'm kind of interested by the result, actually, as I had feared that the spinlock protecting ProcArrayLock was going to be a bigger problem sooner.
>
> I think this depends on how many connections you have. If you try to scale up your benchmark by having hundreds of connections then you get an O(n^2) increase in the time spent with the procarray locked.
>
> It sounds like they pinned the number of connections at the number of cores they had. That makes sense if they're intentionally driving a cpu-bound benchmark, but it means they won't run into this problem.

The experiments that I've done so far seem to indicate that there are actually a couple of different ways for ProcArrayLock to become a problem.

First, as you say, if you have a busy read-write workload, you just get plain old LWLock contention. Everyone who wants a snapshot needs the lock in shared mode; everyone who wants to commit needs it in exclusive mode. So the people who are committing block both snapshot-taking and other commits, while the snapshot-takers block commits.

If you switch to a read-only workload, ProcArrayLock becomes uncontended, since a transaction without an XID does not need to acquire ProcArrayLock to commit. But if you have enough CPUs (64 will do it), the spinlock protecting ProcArrayLock becomes a bottleneck. However, you can only observe this problem if the working set fits in shared buffers; otherwise, BufFreelistLock contention slows the whole system to a crawl, and the spinlock contention never materializes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
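[To make the collision concrete, here is a minimal standalone sketch of the pattern being discussed -- plain C with pthreads, not PostgreSQL source; the struct, the fixed 64-slot array, and the function names are simplifications assumed purely for illustration.]

#include <pthread.h>

typedef unsigned int TransactionId;

typedef struct
{
    pthread_rwlock_t lock;      /* stand-in for ProcArrayLock */
    int              numProcs;
    TransactionId    xids[64];  /* one slot per backend */
} ProcArraySketch;

/* Every snapshot: shared lock plus an O(backends) scan. */
static int
get_snapshot_sketch(ProcArraySketch *array, TransactionId *xip)
{
    int n = 0;

    pthread_rwlock_rdlock(&array->lock);    /* blocks committers */
    for (int i = 0; i < array->numProcs; i++)
        xip[n++] = array->xids[i];          /* copy the running XIDs */
    pthread_rwlock_unlock(&array->lock);
    return n;
}

/* Every commit of an XID-bearing transaction: exclusive lock. */
static void
commit_sketch(ProcArraySketch *array, int myslot)
{
    pthread_rwlock_wrlock(&array->lock);    /* blocks all snapshot-takers */
    array->xids[myslot] = 0;                /* clear our XID */
    pthread_rwlock_unlock(&array->lock);
}

[Since each snapshot does O(backends) work while holding the lock, n busy connections spend roughly O(n^2) total time with the array locked, which is the scaling effect Greg describes above.]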
Re: [HACKERS] mosbench revisited
Robert Haas robertmh...@gmail.com writes:
> However, it doesn't really do anything to solve this problem. The problem here is not that the size of the relation is changing too frequently - indeed, it's not changing at all in this test case. The problem is rather that testing whether or not the size has in fact changed is costing too much.

You were complaining about the cost of the cache maintenance, which in the current scheme of things would have to be done far too often. Reducing the relation extension traffic would allow us, I guess, to have something more expensive to reset the cache -- it would not happen much. Now, it could be that the idea is only worth "the electrons it's written on" if all the relation extensions are taken care of by a background process...

> The reason why we are testing the size of the relation here rather than just using reltuples is because the relation might have been extended since it was last analyzed. We can't count on analyze to run often enough to avoid looking at the actual file size. If the file's grown, we have to scale the number of tuples up proportional to the growth in relpages.

Could we send the same message to the stats collector as autoanalyze does, each time we extend a relation?

--
dim
Re: [HACKERS] mosbench revisited
2011/8/10 Dimitri Fontaine dfonta...@hi-media.com:
> Robert Haas robertmh...@gmail.com writes:
>> However, it doesn't really do anything to solve this problem. The problem here is not that the size of the relation is changing too frequently - indeed, it's not changing at all in this test case. The problem is rather that testing whether or not the size has in fact changed is costing too much.
>
> You were complaining about the cost of the cache maintenance, which in the current scheme of things would have to be done far too often. Reducing the relation extension traffic would allow us, I guess, to have something more expensive to reset the cache -- it would not happen much.

That's an interesting point. I confess to having no idea how frequent relation extension is now, or how much overhead we could add before it started to get noticeable. It seems likely to me that we can design something that is sufficiently lightweight that we don't need to worry about reducing the relation extension traffic first, but I don't know that for certain.

>> The reason why we are testing the size of the relation here rather than just using reltuples is because the relation might have been extended since it was last analyzed. We can't count on analyze to run often enough to avoid looking at the actual file size. If the file's grown, we have to scale the number of tuples up proportional to the growth in relpages.
>
> Could we send the same message to the stats collector as autoanalyze does, each time we extend a relation?

Mmm, maybe. I don't really like the idea of getting the stats collector involved in this; that seems like it will likely add complexity without any corresponding benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] mosbench revisited
Excerpts from Robert Haas's message of Wed Aug 10 21:27:08 -0400 2011:
> 2011/8/10 Dimitri Fontaine dfonta...@hi-media.com:
>> Robert Haas robertmh...@gmail.com writes:
>>> However, it doesn't really do anything to solve this problem. The problem here is not that the size of the relation is changing too frequently - indeed, it's not changing at all in this test case. The problem is rather that testing whether or not the size has in fact changed is costing too much.
>>
>> You were complaining about the cost of the cache maintenance, which in the current scheme of things would have to be done far too often. Reducing the relation extension traffic would allow us, I guess, to have something more expensive to reset the cache -- it would not happen much.
>
> That's an interesting point. I confess to having no idea how frequent relation extension is now, or how much overhead we could add before it started to get noticeable.

I have seen several cases on which it (rel extension) is really troublesome. Think insertion on large bytea fields -- the toast table's extension lock was heavily contended. When this was in 8.1 we had quite some trouble because of the locking involving some shared buffer pool lwlock which is fortunately now partitioned; even without that, it is a major contention point. These were insert-only tables, so pages are always full except the last one.

>>> The reason why we are testing the size of the relation here rather than just using reltuples is because the relation might have been extended since it was last analyzed. We can't count on analyze to run often enough to avoid looking at the actual file size. If the file's grown, we have to scale the number of tuples up proportional to the growth in relpages.
>>
>> Could we send the same message to the stats collector as autoanalyze does, each time we extend a relation?
>
> Mmm, maybe. I don't really like the idea of getting the stats collector involved in this; that seems like it will likely add complexity without any corresponding benefit.

Yeah, it adds delay and uncertainty (UDP being lossy).

--
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] mosbench revisited
On Wed, Aug 10, 2011 at 9:46 PM, Alvaro Herrera alvhe...@commandprompt.com wrote:
> Excerpts from Robert Haas's message of Wed Aug 10 21:27:08 -0400 2011:
>> 2011/8/10 Dimitri Fontaine dfonta...@hi-media.com:
>>> Robert Haas robertmh...@gmail.com writes:
>>>> However, it doesn't really do anything to solve this problem. The problem here is not that the size of the relation is changing too frequently - indeed, it's not changing at all in this test case. The problem is rather that testing whether or not the size has in fact changed is costing too much.
>>>
>>> You were complaining about the cost of the cache maintenance, which in the current scheme of things would have to be done far too often. Reducing the relation extension traffic would allow us, I guess, to have something more expensive to reset the cache -- it would not happen much.
>>
>> That's an interesting point. I confess to having no idea how frequent relation extension is now, or how much overhead we could add before it started to get noticeable.
>
> I have seen several cases on which it (rel extension) is really troublesome. Think insertion on large bytea fields -- the toast table's extension lock was heavily contended. When this was in 8.1 we had quite some trouble because of the locking involving some shared buffer pool lwlock which is fortunately now partitioned; even without that, it is a major contention point. These were insert-only tables, so pages are always full except the last one.

Yeah. I think there's a good argument to be made that we should extend relations in larger chunks, both for this reason and to avoid fragmenting the underlying file. But in this case, the fact that relation extension is such a heavyweight operation almost works to our advantage. I mean, if we're already acquiring a heavyweight lock (and they're not called heavyweight for nothing), making at least one system call which updates filesystem metadata, and then releasing our heavyweight lock, the additional overhead of setting a flag in shared memory or some such might well be unnoticeable. Or it might not - hard to know without testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
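[As a sketch of what larger-chunk extension could look like: posix_fallocate() is a real POSIX call, but the wrapper, the policy, and the chunk size are assumptions for illustration, not a PostgreSQL API. The 1MB figure echoes the one floated elsewhere in this thread.]

#include <fcntl.h>

#define EXTEND_CHUNK_BYTES (1024 * 1024)   /* 1MB: ~128 blocks per extension */

/*
 * Reserve a whole chunk past the current end of file, instead of one
 * 8kB block at a time. Later block allocations within the chunk touch
 * no filesystem metadata, and the allocator gets a chance to lay the
 * file out contiguously on disk.
 */
static int
extend_by_chunk(int fd, off_t current_bytes)
{
    /* returns 0 on success, an errno value on failure */
    return posix_fallocate(fd, current_bytes, EXTEND_CHUNK_BYTES);
}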
Re: [HACKERS] mosbench revisited
On Sat, Aug 6, 2011 at 1:43 PM, Jeff Janes jeff.ja...@gmail.com wrote:
> The approach is to move the important things from a LWLock to a spinlock, and to not do any locking for increments to the clock hand and numBufferAllocs. That means that some buffers might occasionally get inspected twice and some might not get inspected at all during any given clock cycle, but this should not lead to any correctness problems. (Disclosure: Tom didn't like this approach when it was last discussed.) I just offer this for whatever it is worth to you--I'm not proposing it as an actual patch to be applied.

Interesting approach.

> When data fits in RAM but not shared_buffers, maybe the easiest fix is to increase shared_buffers. Which brings up the other question I had for you about your work with Nate's celebrated loaner machine. Have you tried to reproduce the performance problems that have been reported (but without public disclosure of how to reproduce) with shared_buffers > 8GB on machines with RAM > 8GB?

No. That's on my list, but thus far has not made it to the top of said list. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] mosbench revisited
On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
> Robert Haas robertmh...@gmail.com writes:
>> It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock. A little Googling around suggests that some patches along these lines have been proposed and - for reasons that I don't fully understand - rejected. That now seems unfortunate. Barring a kernel-level fix, we could try to implement our own cache to work around this problem. However, any such cache would need to be darn cheap to check and update (since we can't assume that relation extension is an infrequent event) and must somehow avoid having the same sort of mutex contention that's killing the kernel in this workload.
>
> What about making the relation extension much less frequent? It's been talked about before here, that instead of extending 8kB at a time we could (should) extend by much larger chunks. I would go as far as preallocating the whole next segment (1GB) (in the background) as soon as the current one is more than half full, or some such policy.
>
> Then you have the problem that you can't really use lseek() anymore to guess'timate a relation size, but Tom said in this thread that the planner certainly doesn't need something that accurate. Maybe reltuples would do? If not, it could be that some adapting of its accuracy could be done?

I think that pre-extending relations or extending them in larger increments is probably a good idea, although I think the AMOUNT of preallocation you just proposed would be severe overkill. If we extended the relation in 1MB chunks, we'd reduce the number of relation extensions by more than 99%, with far less space wastage than the approach you are proposing.

However, it doesn't really do anything to solve this problem. The problem here is not that the size of the relation is changing too frequently - indeed, it's not changing at all in this test case. The problem is rather that testing whether or not the size has in fact changed is costing too much.

The reason why we are testing the size of the relation here rather than just using reltuples is because the relation might have been extended since it was last analyzed. We can't count on analyze to run often enough to avoid looking at the actual file size. If the file's grown, we have to scale the number of tuples up proportional to the growth in relpages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] mosbench revisited
On Sat, Aug 6, 2011 at 3:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Jeff Janes jeff.ja...@gmail.com writes:
>> My experiments have shown that the freelist proper is not substantially faster than the freelist clocksweep--and that is even under the assumption that putting things back into the freelist is absolutely free.
>
> The freelist isn't there to make buffer allocation faster, though; it's there for allocation efficiency. The point is that when some buffers have become completely useless (eg, because we dropped the table they were for), they'll be recycled in preference to reclaiming buffers that contain still-possibly-useful data. It would certainly be simple to get rid of the freelist and only recycle dead buffers when the clock sweep reaches them, but I think we'd be paying for that in extra, unnecessary I/O.

This is an interesting point, because I think it really gets at the heart of the trade-off we're facing here: quality of buffer allocation vs. concurrency. Assuming no free buffers are present, we'd ideally like to evict the buffer that won't be used until the latest possible future time. LRU is an approximation of this. Our clock-sweep algorithm is a less-accurate approximation of this that pays for itself with less locking. If we really wanted true LRU, we'd have to move things around in the list every time a buffer was pinned. That would be really expensive. Our clock sweep algorithm only requires fighting over a piece of shared state when we want to kick a buffer out, rather than every time we want to do anything to a buffer. And it's probably only slightly worse in terms of buffer allocation efficiency than true LRU, so on balance it appears to be a good trade-off.

Even so, I think it's pretty trivial to construct a workload where performance is throttled by BufFreelistLock. I seriously doubt we're going to be able to solve that problem by optimizing the code that runs while holding that lock. We might buy ourselves a few more percentage points here or there, but if we really want to bust this bottleneck we're going to need to move to some algorithm that is an even-less-perfect approximation of LRU but which doesn't involve a single cluster-wide lock that throttles all buffer allocation. No matter how fast you make the critical section protected by that lock, given enough CPUs, it will eventually not be fast enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
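[For readers following along, the clock-sweep algorithm in miniature -- a standalone C sketch, not the PostgreSQL implementation; the struct, the array size, and the function name are assumptions for illustration, and it assumes not every buffer is pinned.]

#define NBUFFERS 1024

typedef struct
{
    int refcount;      /* pins: nonzero means in use right now */
    int usage_count;   /* bumped on access, decayed by the sweep */
} BufferSketch;

static BufferSketch buffers[NBUFFERS];
static int clock_hand = 0;

/*
 * Sweep until we find an unpinned buffer whose usage_count has decayed
 * to zero. Each pass decrements the count of buffers it spares, so a
 * recently used buffer survives roughly one extra lap per access.
 */
int
clock_sweep_victim(void)
{
    for (;;)
    {
        int idx = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (buffers[idx].refcount == 0)
        {
            if (buffers[idx].usage_count == 0)
                return idx;                  /* evict this one */
            buffers[idx].usage_count--;      /* spare it one more lap */
        }
    }
}

[In stock PostgreSQL this loop runs while holding BufFreelistLock, which is exactly the single cluster-wide lock being complained about here.]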
Re: [HACKERS] mosbench revisited
On 2011-08-08 15:29, Robert Haas wrote:
> On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
>> Robert Haas robertmh...@gmail.com writes:
>>> It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock. A little Googling around suggests that some patches along these lines have been proposed and - for reasons that I don't fully understand - rejected. That now seems unfortunate. Barring a kernel-level fix, we could try to implement our own cache to work around this problem. However, any such cache would need to be darn cheap to check and update (since we can't assume that relation extension is an infrequent event) and must somehow avoid having the same sort of mutex contention that's killing the kernel in this workload.
>>
>> What about making the relation extension much less frequent? It's been talked about before here, that instead of extending 8kB at a time we could (should) extend by much larger chunks. I would go as far as preallocating the whole next segment (1GB) (in the background) as soon as the current one is more than half full, or some such policy.
>>
>> Then you have the problem that you can't really use lseek() anymore to guess'timate a relation size, but Tom said in this thread that the planner certainly doesn't need something that accurate. Maybe reltuples would do? If not, it could be that some adapting of its accuracy could be done?
>
> I think that pre-extending relations or extending them in larger increments is probably a good idea, although I think the AMOUNT of preallocation you just proposed would be severe overkill. If we extended the relation in 1MB chunks, we'd reduce the number of relation extensions by more than 99%, with far less space wastage than the approach you are proposing.

Pre-extending in bigger chunks has other benefits as well, since it helps the filesystem (if it supports extents) to get the data from the relation laid out in sequential order on disk. On a well-filled relation, doing filefrag on an ext4 filesystem reveals that data loaded during initial creation gives 10-11 extents per 1GB file, whereas a relation filled over time gives as much as 128 extents.

I would suggest 5% of the current relation size, or 25-100MB, whichever is smaller. That would still keep the size down on small relations.

--
Jesper
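[Reading the suggestion as "the smaller of 5% and a fixed cap" -- which is one interpretation of the wording above -- the policy sketches out like this; the function is hypothetical, not PostgreSQL code.]

#include <sys/types.h>

/*
 * Grow by 5% of the current size, but never by more than a fixed cap,
 * so small relations stay small and large ones still extend in big
 * sequential chunks. The 100MB cap is one point in the 25-100MB range
 * suggested above; a real policy would also want a floor of at least
 * one block so tiny relations still grow.
 */
static off_t
preextend_amount(off_t relsize_bytes)
{
    const off_t cap = 100 * 1024 * 1024;
    off_t pct = relsize_bytes / 20;    /* 5% */

    return (pct < cap) ? pct : cap;
}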
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 11:21 AM, Robert Haas robertmh...@gmail.com wrote:
> About nine months ago, we had a discussion of some benchmarking that was done by the mosbench folks at MIT:
>
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php
>
> Although the authors used PostgreSQL as a test harness for driving load, it's pretty clear from reading the paper that their primary goal was to stress the Linux kernel, so the applicability of the paper to real-world PostgreSQL performance improvement is less than it might be. Still, having now actually investigated in some detail many of the same performance issues that they were struggling with, I have a much clearer understanding of what's really going on here. In PostgreSQL terms, here are the bottlenecks they ran into:
>
> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache because PostgreSQL protects its free-list with a single lock and thus scales poorly with smaller caches."
>
> This is a complaint about BufFreeList lock which, in fact, I've seen as a huge point of contention on some workloads. In fact, on read-only workloads, with my lazy vxid lock patch applied, this is, I believe, the only remaining unpartitioned LWLock that is ever taken in exclusive mode; or at least the only one that's taken anywhere near often enough to matter. I think we're going to do something about this, although I don't have a specific idea in mind at the moment.

I was going to ask you if you had done any benchmarks with scale such that the tables fit in RAM but not in shared_buffers. I guess you have.

The attached experimental patch fixed freelist contention on 8 cores. It would be nice to see what happens above that. It has been cherry picked up to HEAD, but not tested against it. (Last tested in Dec 2010, my how time flies.)

The approach is to move the important things from a LWLock to a spinlock, and to not do any locking for increments to the clock hand and numBufferAllocs. That means that some buffers might occasionally get inspected twice and some might not get inspected at all during any given clock cycle, but this should not lead to any correctness problems. (Disclosure: Tom didn't like this approach when it was last discussed.) I just offer this for whatever it is worth to you--I'm not proposing it as an actual patch to be applied.

When data fits in RAM but not shared_buffers, maybe the easiest fix is to increase shared_buffers. Which brings up the other question I had for you about your work with Nate's celebrated loaner machine. Have you tried to reproduce the performance problems that have been reported (but without public disclosure of how to reproduce) with shared_buffers > 8GB on machines with RAM > 8GB?
Cheers,

Jeff

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index bf9903b..304135e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -28,7 +28,8 @@ typedef struct
 	int		nextVictimBuffer;
 	int		firstFreeBuffer;	/* Head of list of unused buffers */
-	int		lastFreeBuffer;		/* Tail of list of unused buffers */
+	int		lastFreeBuffer;		/* Tail of list of unused buffers */
+	slock_t		mutex;			/* Protects all of these, except sometimes not nextVictimBuffer */
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -44,7 +45,7 @@ typedef struct
 } BufferStrategyControl;
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -123,15 +124,13 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		}
 	}
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
 	 * the rate of buffer consumption.  Note that buffers recycled by a
 	 * strategy object are intentionally not counted here.
+	 * This is now done without a lock, and so updates can be lost
 	 */
+
 	StrategyControl->numBufferAllocs++;
 	/*
@@ -142,6 +141,12 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 	 */
 	while (StrategyControl->firstFreeBuffer >= 0)
 	{
+		SpinLockAcquire(&StrategyControl->mutex);
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			SpinLockRelease(&StrategyControl->mutex);
+			break;
+		}
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -149,6 +154,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		SpinLockRelease(&StrategyControl->mutex);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; discard it and retry.  (This can only happen if VACUUM put
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 3:21 PM, Jim Nasby j...@nasby.net wrote:
> On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
>> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache because PostgreSQL protects its free-list with a single lock and thus scales poorly with smaller caches."
>>
>> This is a complaint about BufFreeList lock which, in fact, I've seen as a huge point of contention on some workloads. In fact, on read-only workloads, with my lazy vxid lock patch applied, this is, I believe, the only remaining unpartitioned LWLock that is ever taken in exclusive mode; or at least the only one that's taken anywhere near often enough to matter. I think we're going to do something about this, although I don't have a specific idea in mind at the moment.
>
> This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads). The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive).
>
> The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

Hi Jim,

My experiments have shown that the freelist proper is not substantially faster than the freelist clocksweep--and that is even under the assumption that putting things back into the freelist is absolutely free.

Under all the workloads I've been able to contrive, other than ones contrived by actually hacking the code itself to make it pathological, the average number of buffers inspected per run of the clock sweep is 2.5. Under contention, the mere act of acquiring a lock is more traumatic than the actual work carried out under the lock.

Cheers,

Jeff
Re: [HACKERS] mosbench revisited
Robert Haas robertmh...@gmail.com writes:
> It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock. A little Googling around suggests that some patches along these lines have been proposed and - for reasons that I don't fully understand - rejected. That now seems unfortunate. Barring a kernel-level fix, we could try to implement our own cache to work around this problem. However, any such cache would need to be darn cheap to check and update (since we can't assume that relation extension is an infrequent event) and must somehow avoid having the same sort of mutex contention that's killing the kernel in this workload.

What about making the relation extension much less frequent? It's been talked about before here, that instead of extending 8kB at a time we could (should) extend by much larger chunks. I would go as far as preallocating the whole next segment (1GB) (in the background) as soon as the current one is more than half full, or some such policy.

Then you have the problem that you can't really use lseek() anymore to guess'timate a relation size, but Tom said in this thread that the planner certainly doesn't need something that accurate. Maybe reltuples would do? If not, it could be that some adapting of its accuracy could be done?

Regards,

--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] mosbench revisited
Jeff Janes jeff.ja...@gmail.com writes:
> My experiments have shown that the freelist proper is not substantially faster than the freelist clocksweep--and that is even under the assumption that putting things back into the freelist is absolutely free.

The freelist isn't there to make buffer allocation faster, though; it's there for allocation efficiency. The point is that when some buffers have become completely useless (eg, because we dropped the table they were for), they'll be recycled in preference to reclaiming buffers that contain still-possibly-useful data. It would certainly be simple to get rid of the freelist and only recycle dead buffers when the clock sweep reaches them, but I think we'd be paying for that in extra, unnecessary I/O.

			regards, tom lane
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 9:16 PM, Robert Haas robertmh...@gmail.com wrote:
> Spinlocks seem to have a very ugly tipping point.

And on that note, here are oprofile results from pgbench -n -T 300 -S -c 64 -j 64 -M prepared on the latest master branch, compiled with -O2 -fno-omit-frame-pointer. shared_buffers=8GB, 64-core machine, RHEL 6.1. By running with -M prepared, it dodges the lseek() problem.

960576  23.7580  postgres    postgres    s_lock
562821  13.9203  no-vmlinux  no-vmlinux  /no-vmlinux
321191   7.9440  postgres    postgres    LWLockRelease
317653   7.8565  postgres    postgres    LWLockAcquire
224812   5.5603  postgres    postgres    GetSnapshotData
 81156   2.0072  postgres    postgres    _bt_compare
 78744   1.9476  postgres    postgres    PinBuffer
 58101   1.4370  postgres    postgres    hash_search_with_hash_value
 43865   1.0849  postgres    postgres    AllocSetAlloc
 25832   0.6389  postgres    postgres    PostgresMain

Since SpinLockAcquire() is an in-line macro that only calls s_lock() if the initial TAS fails, not only the time directly attributed to s_lock but also a good chunk of the CPU time attributable to LWLockAcquire and LWLockRelease is likely time spent fighting over spinlocks.

Since I compiled with frame pointers, it's pretty easy to see where those s_lock calls are coming from. Here's an excerpt from opreport -c:

       5  5.0e-04  postgres  postgres  _bt_getbuf
       6  6.0e-04  postgres  postgres  _bt_relandgetbuf
      14   0.0014  postgres  postgres  ReleaseAndReadBuffer
      85   0.0085  postgres  postgres  ReadBuffer_common
     206   0.0207  postgres  postgres  GetSnapshotData
   18344   1.8437  postgres  postgres  UnpinBuffer
   24977   2.5103  postgres  postgres  PinBuffer
  406948  40.9009  postgres  postgres  LWLockRelease
  544376  54.7133  postgres  postgres  LWLockAcquire
  994947  23.5746  postgres  postgres  s_lock

It's also fairly easy to track down who is calling LWLockAcquire and LWLockRelease. Nearly all of the calls are from just two contributors:

  241655  27.6830  postgres  postgres  ReadBuffer_common
  566434  64.8885  postgres  postgres  GetSnapshotData
  328548   7.7847  postgres  postgres  LWLockAcquire

  176629  23.8917  postgres  postgres  ReadBuffer_common
  524348  70.9259  postgres  postgres  GetSnapshotData
  332333   7.8744  postgres  postgres  LWLockRelease

So, most of the s_lock calls come from LWLockAcquire, and most of the LWLockAcquire calls come from GetSnapshotData. That's not quite enough to prove that all the spinning going on here is coming from contention over the spinlock protecting ProcArrayLock, because it needn't be the case that all calls to LWLockAcquire are equally likely to end up in s_lock. You could speculate that ProcArrayLock isn't actually responsible for many of those s_lock calls and that some other lock, like maybe the buffer mapping locks, is disproportionately responsible for the s_lock calls. But in fact I think it's exactly the other way around: the buffer mapping locks are partitioned 16 ways, while there's only one ProcArrayLock. I'm willing to bet that's where nearly all of the spinning is happening, and I'll further bet that that spinning accounts for AT LEAST a third of the total CPU time on this workload. And maybe closer to half.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas robertmh...@gmail.com wrote:
> And hoping that the Linux guys decide to do something about it. This isn't really our bug - lseek is quite cheap in the uncontended case.

Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where they've finally removed the BKL from VFS/inode? I mean, complaining about scalability in linux 2.6.18 is like complaining about scalability in postgresql 8.2 ;-)

a.

--
Aidan Van Dyk                 Create like a god,
ai...@highrise.ca             command like a king,
http://www.highrise.ca/       work like a slave.
Re: [HACKERS] mosbench revisited
On Thu, Aug 4, 2011 at 5:09 PM, Aidan Van Dyk ai...@highrise.ca wrote:
> On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas robertmh...@gmail.com wrote:
>> And hoping that the Linux guys decide to do something about it. This isn't really our bug - lseek is quite cheap in the uncontended case.
>
> Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where they've finally removed the BKL from VFS/inode? I mean, complaining about scalability in linux 2.6.18 is like complaining about scalability in postgresql 8.2 ;-)

Hmm. This machine is running 2.6.32-131.6.1.el6.x86_64, not 2.6.18. Not sure how much the code has changed since then, but the spinlock is there in the master branch of Linus's repository.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] mosbench revisited
About nine months ago, we had a discussion of some benchmarking that was done by the mosbench folks at MIT:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

Although the authors used PostgreSQL as a test harness for driving load, it's pretty clear from reading the paper that their primary goal was to stress the Linux kernel, so the applicability of the paper to real-world PostgreSQL performance improvement is less than it might be. Still, having now actually investigated in some detail many of the same performance issues that they were struggling with, I have a much clearer understanding of what's really going on here. In PostgreSQL terms, here are the bottlenecks they ran into:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache because PostgreSQL protects its free-list with a single lock and thus scales poorly with smaller caches."

This is a complaint about BufFreeList lock which, in fact, I've seen as a huge point of contention on some workloads. In fact, on read-only workloads, with my lazy vxid lock patch applied, this is, I believe, the only remaining unpartitioned LWLock that is ever taken in exclusive mode; or at least the only one that's taken anywhere near often enough to matter. I think we're going to do something about this, although I don't have a specific idea in mind at the moment.

2. "PostgreSQL implements row- and table-level locks atop user-level mutexes; as a result, even a non-conflicting row- or table-level lock acquisition requires exclusively locking one of only 16 global mutexes."

I think that the reference to row-level locks here is a red herring; or at least, I haven't seen any evidence that row-level locking is a meaningful source of contention on any workload I've tested. Table-level locks clearly are, and this is the problem that the now-committed fastlock patch addressed. So, fixed!

3. "Our workload creates one PostgreSQL connection per server core and sends queries (selects or updates) in batches of 256, aggregating successive read-only transactions into single transactions. This workload is intended to minimize application-level contention within PostgreSQL in order to maximize the stress PostgreSQL places on the kernel."

I had no idea what this was talking about at the time, but it's now obvious in retrospect that they were working around the overhead imposed by acquiring and releasing relation and virtualxid locks. My pending lazy vxids patch will address the remaining issue here.

4. "With modified PostgreSQL on stock Linux, throughput for both workloads collapses at 36 cores... The main reason is the kernel's lseek implementation."

With the fastlock, sinval-hasmessages, and lazy-vxid patches applied (the first two are committed now), it's now much easier to run headlong into this bottleneck. Prior to those patches, for this to be an issue, you would need to batch your queries together in big groups to avoid getting whacked by the lock manager and/or sinval overhead first. With those problems and the recently discovered bottleneck in glibc's random() implementation fixed, good old pgbench -S is enough to hit this problem if you have enough clients and enough cores. And it turns out that the word "collapse" is not an exaggeration. On a 64-core Intel box running RHEL 6.1, performance ramped up from 24k TPS at 4 clients to 175k TPS at 32 clients and then to 207k TPS at 44 clients. After that it fell off a cliff, dropping to 93k TPS at 52 clients and 26k TPS at 64 clients, consuming truly horrifying amounts of system time in the process. A somewhat tedious investigation revealed that the problem is, in fact, contention on the inode mutex caused by lseek(). Results are much better with -M prepared (310k TPS at 48 clients, 294k TPS at 64 clients). All one-minute tests with scale factor 100, fitting inside 8GB of shared_buffers (clearly not enough for serious benchmarking, but enough to demonstrate this issue).

It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock. A little Googling around suggests that some patches along these lines have been proposed and - for reasons that I don't fully understand - rejected. That now seems unfortunate. Barring a kernel-level fix, we could try to implement our own cache to work around this problem. However, any such cache would need to be darn cheap to check and update (since we can't assume that relation extension is an infrequent event) and must somehow avoid having the same sort of mutex contention that's killing the kernel in this workload.

5. With all of the above problems fixed or worked around, the authors write, PostgreSQL's overall scalability is primarily limited by contention for the spinlock protecting the buffer
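[For concreteness, the size probe under discussion bottoms out in something like the following; lseek() is the real system call, while the wrapper and constant are illustrative only.]

#include <unistd.h>

#define BLCKSZ 8192   /* PostgreSQL's default block size */

/*
 * One size check: seek to the end of the (last) segment file and
 * convert bytes to blocks. In the kernels discussed here this
 * serializes on a per-inode lock just to read i_size, so with dozens
 * of CPUs probing the same small set of files, this is where the
 * time goes.
 */
static long
relation_blocks(int fd)
{
    off_t len = lseek(fd, 0, SEEK_END);

    return (len < 0) ? -1 : (long) (len / BLCKSZ);
}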
Re: [HACKERS] mosbench revisited
On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
> It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock.

Interesting. There's this thread from 2003 suggesting the use of pread instead; it was rejected on the argument that lseek is cheap so not a problem.

http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

Perhaps we now have a benchmark where the effect can be measured. There's the issue about whether it screws up the readahead mechanism...

Have a nice day,

--
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts.
-- Arthur Schopenhauer
Re: [HACKERS] mosbench revisited
Martijn van Oosterhout klep...@svana.org writes:
> On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
>> It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock.
>
> Interesting. There's this thread from 2003 suggesting the use of pread instead; it was rejected on the argument that lseek is cheap so not a problem.
>
> http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

That seems rather unrelated. The point here is our use of lseek to find out the current file size --- or at least, I would hope they're not trying to read the inode's file size in a SEEK_CUR call.

The reason -M prepared helps is presumably that it eliminates most of the RelationGetNumberOfBlocks calls the planner does to check current table size. While we could certainly consider using a cheaper (possibly more stale) value there, it's a bit astonishing to think that that's the main cost in a parse/plan/execute cycle. Perhaps there are more hotspot calls than that one?

			regards, tom lane
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 2:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Martijn van Oosterhout klep...@svana.org writes:
>> On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:
>>> It would be nice if the Linux guys would fix this problem for us, but I'm not sure whether they will. For those who may be curious, the problem is in generic_file_llseek() in fs/read-write.c. On a platform with 8-byte atomic reads, it seems like it ought to be very possible to read inode->i_size without taking a spinlock.
>>
>> Interesting. There's this thread from 2003 suggesting the use of pread instead; it was rejected on the argument that lseek is cheap so not a problem.
>> http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php
>
> That seems rather unrelated. The point here is our use of lseek to find out the current file size --- or at least, I would hope they're not trying to read the inode's file size in a SEEK_CUR call.

Correct.

> The reason -M prepared helps is presumably that it eliminates most of the RelationGetNumberOfBlocks calls the planner does to check current table size. While we could certainly consider using a cheaper (possibly more stale) value there, it's a bit astonishing to think that that's the main cost in a parse/plan/execute cycle. Perhaps there are more hotspot calls than that one?

Nope. On a straight pgbench -S test, you get four system calls per query: recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates the two lseeks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] mosbench revisited
Robert Haas robertmh...@gmail.com writes:
> On a straight pgbench -S test, you get four system calls per query: recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates the two lseeks.

[ scratches head... ] Two? Is that one for the table and one for its lone index, or are we being redundant there?

(If the query ended up being a seqscan, I'd expect a second lseek(SEEK_END) when the executor starts up, but I gather from the other complaints that the mosbench people were only testing simple indexscan queries.)

			regards, tom lane
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> On a straight pgbench -S test, you get four system calls per query: recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates the two lseeks.
>
> [ scratches head... ] Two?

Yep.

> Is that one for the table and one for its lone index, or are we being redundant there?

The former. Specifically, it appears we're smart enough to only test the last segment (in this case, the table is large enough that there is a .1 file, and that's what we're lseeking).

> (If the query ended up being a seqscan, I'd expect a second lseek(SEEK_END) when the executor starts up, but I gather from the other complaints that the mosbench people were only testing simple indexscan queries.)

Yeah, it seems that for a sequential scan we lseek the heap, then the index, then the heap again; but for index scans we just hit the heap and the index.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] mosbench revisited
Robert Haas robertmh...@gmail.com writes:
> On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> (If the query ended up being a seqscan, I'd expect a second lseek(SEEK_END) when the executor starts up, but I gather from the other complaints that the mosbench people were only testing simple indexscan queries.)
>
> Yeah, it seems that for a sequential scan we lseek the heap, then the index, then the heap again; but for index scans we just hit the heap and the index.

Sure. The first two come from the planner getting the table and index sizes for estimation purposes (look in plancat.c). The last is done in heapam.c's initscan(). We could possibly accept stale values for the planner estimates, but I think heapam's number had better be accurate.

			regards, tom lane
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>>> (If the query ended up being a seqscan, I'd expect a second lseek(SEEK_END) when the executor starts up, but I gather from the other complaints that the mosbench people were only testing simple indexscan queries.)
>>
>> Yeah, it seems that for a sequential scan we lseek the heap, then the index, then the heap again; but for index scans we just hit the heap and the index.
>
> Sure. The first two come from the planner getting the table and index sizes for estimation purposes (look in plancat.c). The last is done in heapam.c's initscan(). We could possibly accept stale values for the planner estimates, but I think heapam's number had better be accurate.

I think the exact requirement is that, if the relation turns out to be larger than the size we read, the extra blocks had better not contain any tuples our snapshot can see. There's actually no interlock between smgrnblocks() and smgrextend() right now, so presumably we don't need to add one. However, a value cached from a few seconds ago is clearly not going to cut it.

I don't really think there's any sensible way to implement a per-backend cache, because that would require invalidation events of some kind to be sent on relation extension, and that seems utterly insane from a performance standpoint, even if we invented something less expensive than sinval. I guess it might work for planning purposes if you only sent out invalidation events on every N'th extension or something, but penalizing the accuracy of planning to work around a Linux kernel bug that only manifests itself on machines with > 32 cores doesn't seem very appealing.

A shared cache seems like it could work, but the locking is tricky. Normally we'd just use a hash table protected by an LWLock, one LWLock per partition, but here that's clearly not going to work. The kernel is using a spinlock per file, and that's still too heavy-weight. I think that if we could prepopulate the cache with all the keys (i.e. relfilenodes) we care about and then never add or evict any, we could run it completely unlocked. I believe that the existing memory barriers in things like LWLockAcquire() would be sufficient to prevent us from reading a too-old value (i.e. block count). In particular, you couldn't read a value that predated your snapshot, if you got your snapshot by holding ProcArrayLock. But the races involved with adding and removing items from the cache are hard to deal with without using locks, especially because the keys are 12 bytes or more and therefore can't be read or written atomically.

I've been mulling over how we might deal with this and actually coded up an implementation, but it turns out (surprise, surprise) to have problems with insufficient locking. So I'm thinking it over some more. And hoping that the Linux guys decide to do something about it. This isn't really our bug - lseek is quite cheap in the uncontended case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
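[A sketch of the unlocked-read idea, under the stated assumption that the key set is prepopulated at startup and never changes; the structure and names are hypothetical, and the key is packed into 8 bytes here purely so a torn read is impossible -- the 12-byte problem mentioned above is thereby assumed away.]

#include <stdint.h>

#define RELSIZE_SLOTS 1024

typedef struct
{
    uint64_t key;       /* packed relfilenode; written once at startup */
    uint64_t nblocks;   /* updated by extenders; aligned 8-byte loads
                         * happen to be atomic on mainstream 64-bit
                         * hardware, though a portable version would
                         * use C11 atomics or explicit barriers */
} RelSizeEntry;

static RelSizeEntry relsize_cache[RELSIZE_SLOTS];

/*
 * Lock-free lookup: because keys never change after startup, a reader
 * can never see a torn key, and the worst it can see in nblocks is a
 * slightly stale value -- acceptable for planner estimates, as
 * discussed above, though not for heapam.
 */
uint64_t
relsize_lookup(uint64_t key)
{
    for (int i = 0; i < RELSIZE_SLOTS; i++)
        if (relsize_cache[i].key == key)
            return relsize_cache[i].nblocks;
    return UINT64_MAX;   /* not cached: fall back to lseek() */
}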
Re: [HACKERS] mosbench revisited
Robert Haas robertmh...@gmail.com writes:
> On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> ... We could possibly accept stale values for the planner estimates, but I think heapam's number had better be accurate.
>
> I think the exact requirement is that, if the relation turns out to be larger than the size we read, the extra blocks had better not contain any tuples our snapshot can see. There's actually no interlock between smgrnblocks() and smgrextend() right now, so presumably we don't need to add one.

No interlock in userspace, you mean. We're relying on the kernel to do it, ie, give us a number that is not older than the time of our (already taken at this point) snapshot.

> I don't really think there's any sensible way to implement a per-backend cache, because that would require invalidation events of some kind to be sent on relation extension, and that seems utterly insane from a performance standpoint, even if we invented something less expensive than sinval.

Yeah, that's the issue. But relation extension is not actually a cheap operation, since it requires a minimum of one kernel call that is presumably doing something nontrivial in the filesystem. I'm not entirely convinced that we couldn't make this work --- especially since we could certainly derate the duty cycle by a factor of ten or more without giving up anything remotely meaningful in planning accuracy. (I'd be inclined to make it send an inval only once the relation size had changed at least, say, 10%.)

> A shared cache seems like it could work, but the locking is tricky. Normally we'd just use a hash table protected by an LWLock, one LWLock per partition, but here that's clearly not going to work. The kernel is using a spinlock per file, and that's still too heavy-weight.

That still seems utterly astonishing to me. We're touching each of those files once per query cycle; a cycle that contains two message sends, who knows how many internal spinlock/lwlock/heavyweightlock acquisitions inside Postgres (some of which *do* contend with each other), and a not insignificant amount of plain old computing. Meanwhile, this particular spinlock inside the kernel is protecting what, a single doubleword fetch? How is that the bottleneck?

I am wondering whether kernel spinlocks are broken.

			regards, tom lane
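[Rendering the parenthetical above as code, under hypothetical names -- no such mechanism exists in PostgreSQL.]

/*
 * Derated invalidation sketch: extend as often as needed, but only
 * broadcast a new size once it has drifted at least 10% past the last
 * value advertised to other backends.
 */
typedef struct
{
    unsigned long nblocks;            /* true current size, in blocks */
    unsigned long advertised_blocks;  /* last size broadcast */
} RelSizeState;

static void
note_extension(RelSizeState *st, void (*send_inval)(unsigned long))
{
    st->nblocks++;
    /* the +1 keeps a brand-new (zero-block) relation from flooding invals */
    if (st->nblocks * 10 >= (st->advertised_blocks + 1) * 11)
    {
        send_inval(st->nblocks);      /* hypothetical broadcast */
        st->advertised_blocks = st->nblocks;
    }
}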
Re: [HACKERS] mosbench revisited
On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache because PostgreSQL protects its free-list with a single lock and thus scales poorly with smaller caches."
>
> This is a complaint about BufFreeList lock which, in fact, I've seen as a huge point of contention on some workloads. In fact, on read-only workloads, with my lazy vxid lock patch applied, this is, I believe, the only remaining unpartitioned LWLock that is ever taken in exclusive mode; or at least the only one that's taken anywhere near often enough to matter. I think we're going to do something about this, although I don't have a specific idea in mind at the moment.

This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads). The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive).

The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell)                http://jim.nasby.net
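[A standalone sketch of that design, using pthreads; run_clock_sweep() is declared as a stand-in for the expensive path, and everything here is hypothetical rather than an existing PostgreSQL mechanism.]

#include <pthread.h>
#include <unistd.h>

#define FREELIST_TARGET 256

static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static int freelist[FREELIST_TARGET];
static int freelist_len = 0;

extern int run_clock_sweep(void);   /* stand-in for the expensive path */

/* Foreground path: a cheap pop; sweep only if the list is empty. */
int
get_free_buffer(void)
{
    int buf = -1;

    pthread_mutex_lock(&freelist_lock);
    if (freelist_len > 0)
        buf = freelist[--freelist_len];
    pthread_mutex_unlock(&freelist_lock);

    return (buf >= 0) ? buf : run_clock_sweep();
}

/* Background process: do the sweeping off the hot path and push results. */
void *
freelist_filler(void *arg)
{
    (void) arg;
    for (;;)
    {
        int buf = run_clock_sweep();
        int full;

        pthread_mutex_lock(&freelist_lock);
        if (freelist_len < FREELIST_TARGET)
            freelist[freelist_len++] = buf;
        full = (freelist_len >= FREELIST_TARGET);
        pthread_mutex_unlock(&freelist_lock);

        if (full)
            usleep(10000);   /* a real design would sleep on a latch */
    }
    return NULL;
}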
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 5:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> That still seems utterly astonishing to me. We're touching each of those files once per query cycle; a cycle that contains two message sends, who knows how many internal spinlock/lwlock/heavyweightlock acquisitions inside Postgres (some of which *do* contend with each other), and a not insignificant amount of plain old computing. Meanwhile, this particular spinlock inside the kernel is protecting what, a single doubleword fetch? How is that the bottleneck?

Spinlocks seem to have a very ugly tipping point. When I tested pgbench -S on a 64-core system with the lazy vxid patch applied and a patch to use random_r() in lieu of random(), the amount of system time used per SELECT-only transaction at 48 clients was 3.59 times as much as it was at 4 clients. And the amount used per transaction at 52 clients was 3.63 times the amount used per transaction at 48 clients. And the amount used at 56 clients was 3.25 times the amount used at 52 clients. You can see the throughput graph starting to flatten out in the 32-44 client range, but it's not particularly alarming. However, once you pass that point things rapidly get totally out of control in a real hurry. A few more clients and the machine is basically doing nothing but spin.

> I am wondering whether kernel spinlocks are broken.

I don't think so. Stefan Kaltenbrunner had one profile where he showed something like sixty or eighty percent of the usermode CPU time in s_lock. I didn't have access to that particular hardware, but the testing I've done strongly suggests that most of that was the SInvalReadLock spinlock. And before I patched pgbench to avoid calling random(), that was doing the same thing - literally flattening a 64-core box fighting over a single futex that normally costs almost nothing. (That one wasn't quite as bad because the futex actually deschedules the waiters, but it was still bad.) I'm actually not really sure why it shakes out this way (birthday paradox?) but having seen the effect several times now, I'm disinclined to believe it's an artifact.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
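[For reference, a test-and-set spinlock in miniature, using C11 atomics -- a sketch of the general technique, not PostgreSQL's s_lock.c, which adds backoff and per-platform assembly.]

#include <stdatomic.h>

typedef atomic_flag spin_sketch_t;

/* initialize with: static spin_sketch_t lock = ATOMIC_FLAG_INIT; */

static void
spin_acquire(spin_sketch_t *lock)
{
    /*
     * Each failed test-and-set pulls the lock's cache line to this CPU
     * in exclusive mode, so every waiter slows down every other waiter.
     */
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;   /* spin */
}

static void
spin_release(spin_sketch_t *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

[Because the contended line ping-pongs between cores, the cost of a single handoff grows with the number of spinning waiters; past some waiter count the lock spends more time moving the cache line than doing work, which is consistent with the cliff described above.]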
Re: [HACKERS] mosbench revisited
On Wed, Aug 3, 2011 at 6:21 PM, Jim Nasby j...@nasby.net wrote:
> On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:
>> 1. "We configure PostgreSQL to use a 2 Gbyte application-level cache because PostgreSQL protects its free-list with a single lock and thus scales poorly with smaller caches."
>>
>> This is a complaint about BufFreeList lock which, in fact, I've seen as a huge point of contention on some workloads. In fact, on read-only workloads, with my lazy vxid lock patch applied, this is, I believe, the only remaining unpartitioned LWLock that is ever taken in exclusive mode; or at least the only one that's taken anywhere near often enough to matter. I think we're going to do something about this, although I don't have a specific idea in mind at the moment.
>
> This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads). The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive).
>
> The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

Based on recent benchmarking, I'm going to say no. It doesn't seem to matter how short you make the critical section: a single program-wide mutex is a loser. Furthermore, the free list is a joke, because it's nearly always going to be completely empty. We could probably just rip that out and use the clock sweep and never miss it, but I doubt it would improve performance much.

I think what we probably need to do is have multiple clock sweeps in progress at the same time. So, for example, if you have 8GB of shared_buffers, you might have 8 mutexes, one for each GB. When a process wants a buffer, it locks one of the mutexes and sweeps through that 1GB partition. If it finds a buffer before returning to the point at which it started the scan, it's done. Otherwise, it releases its mutex, grabs the next one, and continues on until it finds a free buffer.

The trick with any modification in this area is that pretty much any degree of increased parallelism is potentially going to reduce the quality of buffer replacement to some degree. So the trick will be to try to squeeze out as much concurrency as possible while minimizing degradation in the quality of buffer replacements.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
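[A standalone sketch of the partitioned-sweep scheme just described, using pthreads; the sizes, the names, and the two-lap bound are assumptions for illustration, and each partition's mutex must be initialized with pthread_mutex_init() at startup.]

#include <pthread.h>

#define NPARTITIONS   8       /* e.g. one per GB of shared_buffers */
#define BUFS_PER_PART 1024

typedef struct { int refcount; int usage_count; } BufSketch;

typedef struct
{
    pthread_mutex_t mutex;    /* contended only within one partition */
    int             hand;
    BufSketch       bufs[BUFS_PER_PART];
} SweepPartition;

static SweepPartition parts[NPARTITIONS];

/* One bounded sweep over a single partition; -1 if nothing evictable. */
static int
sweep_partition(SweepPartition *p)
{
    for (int seen = 0; seen < 2 * BUFS_PER_PART; seen++)
    {
        int idx = p->hand;

        p->hand = (p->hand + 1) % BUFS_PER_PART;
        if (p->bufs[idx].refcount == 0)
        {
            if (p->bufs[idx].usage_count == 0)
                return idx;                  /* found a victim */
            p->bufs[idx].usage_count--;      /* decay and move on */
        }
    }
    return -1;
}

/* Try our own partition first; fall over to the others only if it is dry. */
int
partitioned_victim(int mypart)
{
    for (int n = 0; n < NPARTITIONS; n++)
    {
        int pno = (mypart + n) % NPARTITIONS;
        SweepPartition *p = &parts[pno];
        int victim;

        pthread_mutex_lock(&p->mutex);
        victim = sweep_partition(p);
        pthread_mutex_unlock(&p->mutex);

        if (victim >= 0)
            return pno * BUFS_PER_PART + victim;   /* global buffer id */
    }
    return -1;   /* nothing evictable anywhere */
}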