Re: [HACKERS] CLOG contention, part 2
On Mon, Feb 27, 2012 at 4:03 AM, Simon Riggs si...@2ndquadrant.com wrote: So please use a scale factor that the hardware can cope with.

OK. I tested this out on Nate Boley's 32-core AMD machine, using scale factor 100 and scale factor 300. I initialized it with Simon's patch, which should have the effect of rendering the entire table unhinted and giving each row a different XID. I used my usual configuration settings for that machine, which are: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms.

I did three runs on master, as of commit 9bf8603c7a9153cada7e32eb0cf7ac1feb1d3b56, and three runs with the clog_history_v4 patch applied.

The command to initialize the database was:

~/install/clog-contention/bin/pgbench -i -I -s $scale

The command to run the test was:

~/install/clog-contention/bin/pgbench -l -T 1800 -c 32 -j 32 -n

Executive Summary: The patch makes things way slower at scale factor 300, and possibly slightly slower at scale factor 100.

Detailed Results:

resultslp.clog_history_v4.32.100.1800:tps = 14286.049637 (including connections establishing)
resultslp.clog_history_v4.32.100.1800:tps = 13532.814984 (including connections establishing)
resultslp.clog_history_v4.32.100.1800:tps = 13972.987301 (including connections establishing)
resultslp.clog_history_v4.32.300.1800:tps = 5061.650470 (including connections establishing)
resultslp.clog_history_v4.32.300.1800:tps = 4871.126457 (including connections establishing)
resultslp.clog_history_v4.32.300.1800:tps = 5861.124177 (including connections establishing)
resultslp.master.32.100.1800:tps = 13420.777222 (including connections establishing)
resultslp.master.32.100.1800:tps = 14912.336257 (including connections establishing)
resultslp.master.32.100.1800:tps = 14505.718977 (including connections establishing)
resultslp.master.32.300.1800:tps = 14766.984548 (including connections establishing)
resultslp.master.32.300.1800:tps = 14783.026190 (including connections establishing)
resultslp.master.32.300.1800:tps = 14567.504887 (including connections establishing)

I don't know whether this is just a bug or whether there's some more fundamental problem with the approach.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Tue, Feb 28, 2012 at 6:11 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Feb 27, 2012 at 4:03 AM, Simon Riggs si...@2ndquadrant.com wrote: So please use a scale factor that the hardware can cope with. OK. I tested this out on Nate Boley's 32-core AMD machine, using scale factor 100 and scale factor 300. I initialized it with Simon's patch, which should have the effect of rendering the entire table unhinted and giving each row a different XID. Thanks for making the test. I think this tells me the only real way to do this kind of testing is not at arms length from a test machine. So time to get my hands on a machine, but not for this release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sun, Feb 26, 2012 at 10:53 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Feb 25, 2012 at 2:16 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas robertmh...@gmail.com wrote: Given that, I obviously cannot test this at this point, Patch with minor corrections attached here for further review. All right, I will set up some benchmarks with this version, and also review the code. Thanks. As a preliminary comment, Tom recently felt that it was useful to reduce the minimum number of CLOG buffers from 8 to 4, to benefit very small installations. So I'm guessing he'll object to an across-the-board doubling of the amount of memory being used, since that would effectively undo that change. It also makes it a bit hard to compare apples to apples, since of course we expect that by using more memory we can reduce the amount of CLOG contention. I think it's really only meaningful to compare contention between implementations that use approximately the same total amount of memory. It's true that doubling the maximum number of buffers from 32 to 64 straight up does degrade performance, but I believe that's because the buffer lookup algorithm is just straight linear search, not because we can't in general benefit from more buffers. I'm happy if you want to benchmark this against simply increasing clog buffers. We expect downsides to that, but it is worth testing nonetheless. pgbench loads all the data in one go, then pretends the data got their one transaction at a time. So pgbench with no mods is actually the theoretically most unreal imaginable. You have to run pgbench for 1 million transactions before you even theoretically show any gain from this patch, and it would need to be a long test indeed before the averaged effect of the patch was large enough to avoid the zero contribution from the first million transacts. Depends on the scale factor. At scale factor 100, the first million transactions figure to have replaced a sizeable percentage of the rows already. But I can use your other patch to set up the run. Maybe scale factor 300 would be good? Clearly if too much I/O is induced by the test we will see the results swamped. The patch is aimed at people with bigger databases and lots of RAM, which is many, many people because RAM is cheap. So please use a scale factor that the hardware can cope with. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
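As a rough sanity check on the "million transactions before any gain" point (this arithmetic is not from the thread; it just combines that figure with the roughly 14,000 tps the 32-core box sustains in the results quoted at the top of the thread):

    t \approx \frac{10^{6}\ \text{xacts}}{14000\ \text{xacts/s}} \approx 71\ \text{s}, \qquad 71 / 1800 \approx 4\%

so in a pgbench run of -T 1800 the "zero-contribution" start-up window is only about 4% of the test, and a long run mostly measures steady-state behaviour.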
Re: [HACKERS] CLOG contention, part 2
On Sat, Feb 25, 2012 at 2:16 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas robertmh...@gmail.com wrote: Given that, I obviously cannot test this at this point, Patch with minor corrections attached here for further review. All right, I will set up some benchmarks with this version, and also review the code. As a preliminary comment, Tom recently felt that it was useful to reduce the minimum number of CLOG buffers from 8 to 4, to benefit very small installations. So I'm guessing he'll object to an across-the-board doubling of the amount of memory being used, since that would effectively undo that change. It also makes it a bit hard to compare apples to apples, since of course we expect that by using more memory we can reduce the amount of CLOG contention. I think it's really only meaningful to compare contention between implementations that use approximately the same total amount of memory. It's true that doubling the maximum number of buffers from 32 to 64 straight up does degrade performance, but I believe that's because the buffer lookup algorithm is just straight linear search, not because we can't in general benefit from more buffers. pgbench loads all the data in one go, then pretends the data got their one transaction at a time. So pgbench with no mods is actually the theoretically most unreal imaginable. You have to run pgbench for 1 million transactions before you even theoretically show any gain from this patch, and it would need to be a long test indeed before the averaged effect of the patch was large enough to avoid the zero contribution from the first million transacts. Depends on the scale factor. At scale factor 100, the first million transactions figure to have replaced a sizeable percentage of the rows already. But I can use your other patch to set up the run. Maybe scale factor 300 would be good? However, there is a potential fly in the ointment: in other cases in which we've reduced contention at the LWLock layer, we've ended up with very nasty contention at the spinlock layer that can sometimes eat more CPU time than the LWLock contention did. In that light, it strikes me that it would be nice to be able to partition the contention N ways rather than just 2 ways. I think we could do that as follows. Instead of having one control lock per SLRU, have N locks, where N is probably a power of 2. Divide the buffer pool for the SLRU N ways, and decree that each slice of the buffer pool is controlled by one of the N locks. Route all requests for a page P to slice P mod N. Unlike this approach, that wouldn't completely eliminate contention at the LWLock level, but it would reduce it proportional to the number of partitions, and it would reduce spinlock contention according to the number of partitions as well. A down side is that you'll need more buffers to get the same hit rate, but this proposal has the same problem: it doubles the amount of memory allocated for CLOG. Of course, this approach is all vaporware right now, so it's anybody's guess whether it would be better than this if we had code for it. I'm just throwing it out there. We've already discussed that and my patch for that has already been rules out by us for this CF. I'm not aware that anybody's coded up the approach I'm talking about. You've proposed splitting this up a couple of ways, but AFAICT they all boil down to splitting up CLOG into multiple SLRUs, whereas what I'm talking about is to have just a single SLRU, but with multiple control locks. 
I feel that approach is a bit more flexible, because it could be applied to any SLRU, not just CLOG. But I haven't coded it, let alone tested it, so I might be all wet. I agree with you that we should further analyse CLOG contention in following releases but that is not an argument against making this change now. No, but the fact that this approach is completely untested, or at least that no test results have been posted, is an argument against it. Assuming this version compiles and works I'll try to see what I can do about bridging that gap. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
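To make the mod-N idea concrete, a minimal sketch of what N control locks for a single SLRU might look like follows. This is illustration only: it is not code from the patch or from PostgreSQL, the names (SlruPartitionedShared, NUM_SLRU_PARTITIONS, SlruPartitionLock) are invented, and it assumes the backend's existing lwlock.h API.

#include "storage/lwlock.h"

/*
 * Hypothetical sketch: guard each slice of one SLRU's buffer pool with its
 * own control lock, instead of the single ControlLock the SLRU code uses
 * today.  N is a power of 2 so the mod becomes a mask.
 */
#define NUM_SLRU_PARTITIONS 8

typedef struct SlruPartitionedShared
{
    LWLockId    part_lock[NUM_SLRU_PARTITIONS]; /* assigned via LWLockAssign() */
    /* per-slice buffer metadata (page numbers, status, etc.) would follow */
} SlruPartitionedShared;

/* Route page P to slice P mod N, each slice guarded by its own lock. */
static inline LWLockId
SlruPartitionLock(SlruPartitionedShared *shared, int pageno)
{
    return shared->part_lock[pageno & (NUM_SLRU_PARTITIONS - 1)];
}

/*
 * A status lookup then contends only with other accesses that map to the
 * same slice, e.g.:
 *
 *     LWLockAcquire(SlruPartitionLock(shared, pageno), LW_SHARED);
 *     ... find or read in pageno within its slice, read the status ...
 *     LWLockRelease(SlruPartitionLock(shared, pageno));
 */

The hit-rate caveat Robert mentions follows from each page being eligible for only one slice, the same trade-off as set associativity in CPU caches.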
Re: [HACKERS] CLOG contention, part 2
On Wed, Feb 8, 2012 at 11:26 PM, Robert Haas robertmh...@gmail.com wrote: Given that, I obviously cannot test this at this point, Patch with minor corrections attached here for further review. but let me go ahead and theorize about how well it's likely to work. What Tom suggested before (and after some reflection I think I believe it) is that the frequency of access will be highest for the newest CLOG page and then drop off for each page further back you go. Clearly, if that drop-off is fast - e.g. each buffer further backward is half as likely to be accessed as the next newer one - then the fraction of accesses that will hit pages that are far enough back to benefit from this optimization will be infinitesmal; 1023 out of every 1024 accesses will hit the first ten pages, and on a high-velocity system those all figure to have been populated since the last checkpoint. That's just making up numbers, so its not much help. The theory would apply to one workload but not another, so may well be true for some workload but I doubt whether all databases work that way. I ask accept the long tail distribution as being very common, we just don't know how long that tail is typically or even if there is a dominant single use case. The best case for this patch should be an access pattern that involves a very long tail; Agreed actually, pgbench is a pretty good fit for that Completely disagree, as described in detail in the other patch about creating a realistic test environment for this patch. pgbench is *not* a real world test. pgbench loads all the data in one go, then pretends the data got their one transaction at a time. So pgbench with no mods is actually the theoretically most unreal imaginable. You have to run pgbench for 1 million transactions before you even theoretically show any gain from this patch, and it would need to be a long test indeed before the averaged effect of the patch was large enough to avoid the zero contribution from the first million transacts. The only real world way to test this patch is to pre-create the database using a scale factor of 100 using the modified pgbench, then run a test. That correctly simulates the real world situation where all data arrived in single transactions. assuming the scale factor is large enough. For example, at scale factor 100, we've got 10,000,000 tuples: choosing one at random, we're almost exactly 90% likely to find one that hasn't been chosen in the last 1,024,576 tuples (i.e. 32 CLOG pages @ 32K txns/page). In terms of reducing contention on the main CLOG SLRU, that sounds pretty promising, but depends somewhat on the rate at which transactions are processed relative to the frequency of checkpoints, since that will affect how many pages back you have go to use the history path. However, there is a potential fly in the ointment: in other cases in which we've reduced contention at the LWLock layer, we've ended up with very nasty contention at the spinlock layer that can sometimes eat more CPU time than the LWLock contention did. In that light, it strikes me that it would be nice to be able to partition the contention N ways rather than just 2 ways. I think we could do that as follows. Instead of having one control lock per SLRU, have N locks, where N is probably a power of 2. Divide the buffer pool for the SLRU N ways, and decree that each slice of the buffer pool is controlled by one of the N locks. Route all requests for a page P to slice P mod N. 
Unlike this approach, that wouldn't completely eliminate contention at the LWLock level, but it would reduce it proportional to the number of partitions, and it would reduce spinlock contention according to the number of partitions as well. A down side is that you'll need more buffers to get the same hit rate, but this proposal has the same problem: it doubles the amount of memory allocated for CLOG. Of course, this approach is all vaporware right now, so it's anybody's guess whether it would be better than this if we had code for it. I'm just throwing it out there. We've already discussed that and my patch for that has already been rules out by us for this CF. A much better take is to list what options for scaling we have: * separate out the history * partition access to the most active parts For me, any loss of performance comes from two areas: (1) concurrent access to pages (2) clog LRU is dirty and delays reading in new pages For the most active parts, (1) is significant. Using partitioning at the page level will be ineffective in reducing contention because almost all of the contention is on the first 1-2 pages. If we do partitioning, it should be done by *striping* the most recent pages across many locks, as I already suggested. Reducing page size would reduce page contention but increase number of new page events and so make (2) more important. Increasing page size will amplify (1). (2) is less significant but much more easily
Re: [HACKERS] CLOG contention, part 2
On Feb 9, 2012 1:27 AM, Robert Haas robertmh...@gmail.com wrote: However, there is a potential fly in the ointment: in other cases in which we've reduced contention at the LWLock layer, we've ended up with very nasty contention at the spinlock layer that can sometimes eat more CPU time than the LWLock contention did. In that light, it strikes me that it would be nice to be able to partition the contention N ways rather than just 2 ways. I think we could do that as follows. Instead of having one control lock per SLRU, have N locks, where N is probably a power of 2. Divide the buffer pool for the SLRU N ways, and decree that each slice of the buffer pool is controlled by one of the N locks. Route all requests for a page P to slice P mod N. Unlike this approach, that wouldn't completely eliminate contention at the LWLock level, but it would reduce it proportional to the number of partitions, and it would reduce spinlock contention according to the number of partitions as well. A down side is that you'll need more buffers to get the same hit rate, but this proposal has the same problem: it doubles the amount of memory allocated for CLOG.

Splitting the SLRU into different parts is exactly the same approach as associativity used in CPU caches. I found some numbers that analyze cache hit rate with different associativities: http://research.cs.wisc.edu/multifacet/misc/spec2000cache-data/

Now obviously CPU cache access patterns are different from CLOG patterns, but I think that the numbers strongly suggest that the reduction in hit rate might be less than what you fear. For example, the harmonic mean of data cache misses over all benchmarks for 16, 32 and 64 cache lines:

| Size | Direct    | 2-way LRU | 4-way LRU | 8-way LRU | Full LRU  |
|------+-----------+-----------+-----------+-----------+-----------|
| 1KB  | 0.0863842 | 0.0697167 | 0.0634309 | 0.0563450 | 0.0533706 |
| 2KB  | 0.0571524 | 0.0423833 | 0.0360463 | 0.0330364 | 0.0305213 |
| 4KB  | 0.0370053 | 0.0260286 | 0.0222981 | 0.0202763 | 0.0190243 |

As you can see, the reduction in hit rate is rather small down to 4 way associative caches.

There may be a performance problem when multiple CLOG pages that happen to sit in a single way become hot at the same time. The most likely case that I can come up with is multiple scans going over unhinted pages created at different time periods. If that is something to worry about, then a tool that's used for CPUs is to employ a fully associative victim cache behind the main cache. If a CLOG page is evicted, it is transferred into the victim cache, evicting a page from there. When a page isn't found in the main cache, the victim cache is first checked for a possible hit. The movement between the two caches doesn't need to involve any memory copying - just swap pointers in metadata.

The victim cache will bring back concurrency issues when the hit rate of the main cache is small - like the pgbench example you mentioned. In that case, a simple associative cache will allow multiple reads of clog pages simultaneously. On the other hand - in that case lock contention seems to be the symptom, rather than the disease. I think that those cases would be better handled by increasing the maximum CLOG SLRU size. The increase in memory usage should be a drop in the bucket for systems that have enough transaction processing velocity for that to be a problem.

-- Ants Aasma
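Ants's victim-cache idea can be sketched as follows. Everything here is invented for the illustration (the names, the fixed four slots, the round-robin replacement), locking is omitted, and it is not code from any posted patch; the point is only that eviction moves a page by swapping buffer pointers rather than copying 8KB of data.

/*
 * Hypothetical victim cache for SLRU pages: a small, fully associative
 * array searched linearly.  On a miss in the main cache, the victim cache
 * is probed; on eviction from the main cache, the page is moved here by
 * swapping buffer pointers, so no page contents are copied.
 */
#define VICTIM_SLOTS 4

typedef struct VictimSlot
{
    int     pageno;     /* -1 means empty */
    char   *buffer;     /* BLCKSZ-sized page owned by this slot */
} VictimSlot;

static VictimSlot victim[VICTIM_SLOTS];
static int        victim_clock = 0;     /* trivial round-robin replacement */

/* Look up pageno; returns its buffer, or NULL if not resident. */
static char *
VictimCacheLookup(int pageno)
{
    for (int i = 0; i < VICTIM_SLOTS; i++)
        if (victim[i].pageno == pageno)
            return victim[i].buffer;
    return NULL;
}

/*
 * Called when the main cache evicts (pageno, buf).  The slot's old buffer is
 * handed back to the caller for reuse, so the transfer is pointer swapping
 * in metadata only.
 */
static char *
VictimCacheInsert(int pageno, char *buf)
{
    VictimSlot *slot = &victim[victim_clock++ % VICTIM_SLOTS];
    char       *old = slot->buffer;

    slot->pageno = pageno;
    slot->buffer = buf;
    return old;                 /* previous occupant's buffer, now free */
}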
Re: [HACKERS] CLOG contention, part 2
On Fri, Feb 10, 2012 at 7:01 PM, Ants Aasma ants.aa...@eesti.ee wrote: On Feb 9, 2012 1:27 AM, Robert Haas robertmh...@gmail.com However, there is a potential fly in the ointment: in other cases in which we've reduced contention at the LWLock layer, we've ended up with very nasty contention at the spinlock layer that can sometimes eat more CPU time than the LWLock contention did. In that light, it strikes me that it would be nice to be able to partition the contention N ways rather than just 2 ways. I think we could do that as follows. Instead of having one control lock per SLRU, have N locks, where N is probably a power of 2. Divide the buffer pool for the SLRU N ways, and decree that each slice of the buffer pool is controlled by one of the N locks. Route all requests for a page P to slice P mod N. Unlike this approach, that wouldn't completely eliminate contention at the LWLock level, but it would reduce it proportional to the number of partitions, and it would reduce spinlock contention according to the number of partitions as well. A down side is that you'll need more buffers to get the same hit rate, but this proposal has the same problem: it doubles the amount of memory allocated for CLOG. Splitting the SLRU into different parts is exactly the same approach as associativity used in CPU caches. I found some numbers that analyze cache hit rate with different associativities: My suggested approach is essentially identical approach to the one we already use for partitioning the buffer cache and lock manager. I expect it to be equally effective at reducing contention. There is little danger of all hitting same partition at once, since there are many xids and they are served out sequentially. In the lock manager case we use the relid as key, so there is some skewing. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 6:04 PM, Simon Riggs si...@2ndquadrant.com wrote: On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes jeff.ja...@gmail.com wrote: If I cast to a int, then I see advancement: I'll initialise it as 0, rather than -1 and then we don't have a problem in any circumstance. I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch. Yep, I've used that one for the testing. Most of the current patch is just bookkeeping to keep track of the point when we can look at history in read only manner. I've isolated the code better to allow you to explore various implementation options. I don't see any performance difference between any of them really, but you're welcome to look. Please everybody note that the clog history doesn't even become active until the first checkpoint, so this is dead code until we've hit the first checkpoint cycle and completed a million transactions since startup. So its designed to tune for real world situations, and is not easy to benchmark. (Maybe we could start earlier, but having extra code just for first few minutes seems waste of energy, especially since we must hit million xids also). I find that this version does not compile: clog.c: In function ‘TransactionIdGetStatus’: clog.c:431: error: ‘clog’ undeclared (first use in this function) clog.c:431: error: (Each undeclared identifier is reported only once clog.c:431: error: for each function it appears in.) Given that, I obviously cannot test this at this point, but let me go ahead and theorize about how well it's likely to work. What Tom suggested before (and after some reflection I think I believe it) is that the frequency of access will be highest for the newest CLOG page and then drop off for each page further back you go. Clearly, if that drop-off is fast - e.g. each buffer further backward is half as likely to be accessed as the next newer one - then the fraction of accesses that will hit pages that are far enough back to benefit from this optimization will be infinitesmal; 1023 out of every 1024 accesses will hit the first ten pages, and on a high-velocity system those all figure to have been populated since the last checkpoint. The best case for this patch should be an access pattern that involves a very long tail; actually, pgbench is a pretty good fit for that, assuming the scale factor is large enough. For example, at scale factor 100, we've got 10,000,000 tuples: choosing one at random, we're almost exactly 90% likely to find one that hasn't been chosen in the last 1,024,576 tuples (i.e. 32 CLOG pages @ 32K txns/page). In terms of reducing contention on the main CLOG SLRU, that sounds pretty promising, but depends somewhat on the rate at which transactions are processed relative to the frequency of checkpoints, since that will affect how many pages back you have go to use the history path. However, there is a potential fly in the ointment: in other cases in which we've reduced contention at the LWLock layer, we've ended up with very nasty contention at the spinlock layer that can sometimes eat more CPU time than the LWLock contention did. In that light, it strikes me that it would be nice to be able to partition the contention N ways rather than just 2 ways. I think we could do that as follows. Instead of having one control lock per SLRU, have N locks, where N is probably a power of 2. Divide the buffer pool for the SLRU N ways, and decree that each slice of the buffer pool is controlled by one of the N locks. 
Route all requests for a page P to slice P mod N. Unlike this approach, that wouldn't completely eliminate contention at the LWLock level, but it would reduce it proportional to the number of partitions, and it would reduce spinlock contention according to the number of partitions as well. A down side is that you'll need more buffers to get the same hit rate, but this proposal has the same problem: it doubles the amount of memory allocated for CLOG. Of course, this approach is all vaporware right now, so it's anybody's guess whether it would be better than this if we had code for it. I'm just throwing it out there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
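The 90% figure checks out with a little arithmetic. This is a back-of-the-envelope verification, not a calculation taken from the thread; it assumes the usual 8KB BLCKSZ and CLOG's 2 bits per transaction, which also suggests the "1,024,576" above was meant to be 1,048,576:

    \text{xacts/page} = 8192 \times 4 = 32768, \qquad 32 \times 32768 = 1048576

    P[\text{row untouched}] = (1 - 10^{-7})^{1048576} \approx e^{-0.105} \approx 0.90

so at scale factor 100, once the system has been running for a while, roughly nine out of ten randomly chosen rows were last updated further back than the 32-buffer window.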
Re: [HACKERS] CLOG contention, part 2
On Mon, Jan 30, 2012 at 12:24 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 27, 2012 at 8:21 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. right -- exactly. but why stop at one page? If you have more than one, you need code to decide which one to evict (just free) every time you need a new one. And every process needs to be running this code, while the kernel is still going to need make its own decisions for the entire system. It seems simpler to just let the kernel do the job for everyone. Are you worried that a read syscall is going to be slow even when the data is presumably cached in the OS? I think that would be a very legitimate worry. You're talking about copying 8kB of data because you need two bits. Even if the user/kernel mode context switch is lightning-fast, that's a lot of extra data copying. I guess the most radical step in the direction I am advocating would be to simply read the one single byte with the data you want. Very little copying, but then the odds of the next thing you want being on the one chunk of data you already had in memory is much smaller. In a previous commit, 33aaa139e6302e81b4fbf2570be20188bb974c4f, we increased the number of CLOG buffers from 8 to 32 (except in very low-memory configurations). The main reason that shows a win on Nate Boley's 32-core test machine appears to be because it avoids the scenario where there are, say, 12 people simultaneously wanting to read 12 different CLOG buffers, and so 4 of them have to wait for a buffer to become available before they can even think about starting a read. The really bad latency spikes were happening not because the I/O took a long time, but because it can't be started immediately. Ah, I hadn't followed that closely. I had thought the main problem solved by that patch was that sometimes all of the CLOG buffers would be dirty, and so no one could read anything in until something else was written out, which could involve either blocking writes on a system with checkpoint-sync related constipation, or (if synchronous_commit=off) fsyncs. By reading the old-enough ones into local memory, you avoid both any locking and any writes. Simon's patch solves the writes, but there is still locking. I don't have enough hardware to test any of these theories, so all I can do is wave hands around. Maybe if I drop the number of buffers from 32 back to 8 or even 4, that would create a model system that could usefully test out the theories on hardware I have, but I'd doubt how transferable the results would be. With Simon's patch if I drop it to 8, it would really be 16 as there are now 2 sets of them, so I suppose it should be compared to head with 16 buffers to put them on an equal footing. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 8:21 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. right -- exactly. but why stop at one page? If you have more than one, you need code to decide which one to evict (just free) every time you need a new one. And every process needs to be running this code, while the kernel is still going to need make its own decisions for the entire system. It seems simpler to just let the kernel do the job for everyone. Are you worried that a read syscall is going to be slow even when the data is presumably cached in the OS? I think that would be a very legitimate worry. You're talking about copying 8kB of data because you need two bits. Even if the user/kernel mode context switch is lightning-fast, that's a lot of extra data copying. In a previous commit, 33aaa139e6302e81b4fbf2570be20188bb974c4f, we increased the number of CLOG buffers from 8 to 32 (except in very low-memory configurations). The main reason that shows a win on Nate Boley's 32-core test machine appears to be because it avoids the scenario where there are, say, 12 people simultaneously wanting to read 12 different CLOG buffers, and so 4 of them have to wait for a buffer to become available before they can even think about starting a read. The really bad latency spikes were happening not because the I/O took a long time, but because it can't be started immediately. However, these spikes are now gone, as a result of the above-commit. Probably you can get them back with enough cores, but you'll probably hit a lot of other, more serious problems first. I assume that if there's any purpose to further optimization here, it's either because the overall miss rate of the cache is too large, or because the remaining locking costs are too high. Unfortunately I haven't yet had time to look at this patch and understand what it does, or machine cycles available to benchmark it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
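For readers who haven't looked at clog.c, the "two bits" arithmetic behind that remark is roughly the following, a simplified standalone restatement of the macros that appear in the patches quoted further down (the real code uses BLCKSZ and the CLOG_XACT_BITMASK macro rather than literal constants):

/* How clog.c locates a transaction's 2-bit status inside an 8KB CLOG page
 * (constants assume BLCKSZ = 8192). */
#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4
#define CLOG_XACTS_PER_PAGE  (8192 * CLOG_XACTS_PER_BYTE)   /* 32768 */

static int
clog_status_of(const unsigned char *page, unsigned int xid)
{
    int byteno = (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
    int bshift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;

    return (page[byteno] >> bshift) & 0x03;     /* 2-bit XidStatus */
}

So a lookup needs one byte of an 8192-byte page, which is where the concern about copying 8kB per read() to get two bits comes from.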
Re: [HACKERS] CLOG contention, part 2
On Sat, Jan 28, 2012 at 1:52 PM, Simon Riggs si...@2ndquadrant.com wrote: Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap.

Will think on that.

For me, there are arguments both ways as to whether it should be in shared or local memory. The one factor that makes the answer shared for me is that it's much easier to reuse existing SLRU code. We don't need to invent a new way of caching/access etc. We just rewire what we already have. So overall, the local/shared debate is much less important than the robustness/code reuse angle. That's what makes this patch fairly simple.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it was. Sorry about that. New version attached, retesting while you read this. In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large /* When we commit advance ClogCtl's shared RecentXminPageno if needed */ if (ClogCtl-shared-RecentXminPageno TransactionIdToPage(RecentXmin)) ClogCtl-shared-RecentXminPageno = TransactionIdToPage(RecentXmin); Thanks for looking at the patch. The patch works fine. RecentXminPageno does move forwards as it is supposed to and there are no uints anywhere in that calculation. The pageno only moves forwards every 32,000 transactions, so I'm guessing that your testing didn't go on for long enough to show it working correctly. As regards to effectiveness, you need to execute more than 1 million transactions before the main clog cache fills, which might sound a lot, but its approximately 1 minute of heavy transactions at the highest rate Robert has published. I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it was. Sorry about that. New version attached, retesting while you read this. In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large /* When we commit advance ClogCtl's shared RecentXminPageno if needed */ if (ClogCtl-shared-RecentXminPageno TransactionIdToPage(RecentXmin)) ClogCtl-shared-RecentXminPageno = TransactionIdToPage(RecentXmin); Thanks for looking at the patch. The patch works fine. RecentXminPageno does move forwards as it is supposed to and there are no uints anywhere in that calculation. Maybe it is system dependent. Or, are you running this patch on top of some other uncommitted patch (other than the pgbench one)? RecentXmin is a TransactionID, which is a uint32. I think the TransactionIdToPage macro preserves that. If I cast to a int, then I see advancement: if (ClogCtl-shared-RecentXminPageno (int) TransactionIdToPage(RecentXmin)) ... I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch. Yep, I've used that one for the testing. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 1:41 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Sun, Jan 29, 2012 at 12:18 PM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it was. Sorry about that. New version attached, retesting while you read this.

In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large:

/* When we commit advance ClogCtl's shared RecentXminPageno if needed */
if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
    ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Thanks for looking at the patch. The patch works fine. RecentXminPageno does move forwards as it is supposed to and there are no uints anywhere in that calculation.

Maybe it is system dependent. Or, are you running this patch on top of some other uncommitted patch (other than the pgbench one)? RecentXmin is a TransactionID, which is a uint32. I think the TransactionIdToPage macro preserves that. If I cast to a int, then I see advancement:

if (ClogCtl->shared->RecentXminPageno < (int) TransactionIdToPage(RecentXmin)) ...

And to clarify, if I don't do the cast, I don't see advancement, using this code:

elog(LOG, "JJJ RecentXminPageno %d, %d", ClogCtl->shared->RecentXminPageno, TransactionIdToPage(RecentXmin));
if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
    ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Then using your pgbench -I -s 100 -c 8 -j8, I get tons of log entries like:

LOG: JJJ RecentXminPageno -1, 149
STATEMENT: INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (nextval('pgbench_accounts_load_seq'), 1 + (lastval()/(10)), 0);

Cheers, Jeff

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
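The behaviour Jeff describes is the standard signed/unsigned comparison trap. A tiny standalone program (not from the thread; the variable names only mirror RecentXminPageno and TransactionIdToPage()) reproduces it:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int      recent_pageno = -1;     /* like RecentXminPageno at startup */
    uint32_t target_pageno = 149;    /* like TransactionIdToPage(RecentXmin) */

    /* The usual arithmetic conversions turn -1 into 4294967295 here... */
    if (recent_pageno < target_pageno)
        printf("advances (unexpected)\n");
    else
        printf("never advances: -1 compares as %u\n", (unsigned) recent_pageno);

    /* ...whereas casting the unsigned side to int behaves as intended. */
    if (recent_pageno < (int) target_pageno)
        printf("advances once the cast is applied\n");

    return 0;
}

Either the cast in Jeff's experiment or initialising the shared page number to 0, the fix Simon settles on elsewhere in the thread, avoids the unwanted conversion.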
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 29, 2012 at 9:41 PM, Jeff Janes jeff.ja...@gmail.com wrote: If I cast to a int, then I see advancement: I'll initialise it as 0, rather than -1 and then we don't have a problem in any circumstance. I've specifically designed the pgbench changes required to simulate conditions of clog contention to help in the evaluation of this patch. Yep, I've used that one for the testing. Most of the current patch is just bookkeeping to keep track of the point when we can look at history in read only manner. I've isolated the code better to allow you to explore various implementation options. I don't see any performance difference between any of them really, but you're welcome to look. Please everybody note that the clog history doesn't even become active until the first checkpoint, so this is dead code until we've hit the first checkpoint cycle and completed a million transactions since startup. So its designed to tune for real world situations, and is not easy to benchmark. (Maybe we could start earlier, but having extra code just for first few minutes seems waste of energy, especially since we must hit million xids also). -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 69b6ef3..8ab1b3c 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -37,6 +37,7 @@ #include access/transam.h #include miscadmin.h #include pg_trace.h +#include utils/snapmgr.h /* * Defines for CLOG page sizes. A page is the same BLCKSZ as is used @@ -70,12 +71,19 @@ /* * Link to shared-memory data structures for CLOG control + * + * As of 9.2, we have 2 structures for commit log data. + * ClogCtl manages the main read/write part of the commit log, while + * the ClogHistoryCtl manages the now read-only, older part. ClogHistory + * removes contention from the path of transaction commits. */ static SlruCtlData ClogCtlData; +static SlruCtlData ClogHistoryCtlData; -#define ClogCtl (ClogCtlData) - +#define ClogCtl (ClogCtlData) +#define ClogHistoryCtl (ClogHistoryCtlData) +static XidStatus TransactionIdGetStatusHistory(TransactionId xid); static int ZeroCLOGPage(int pageno, bool writeXlog); static bool CLOGPagePrecedes(int page1, int page2); static void WriteZeroPageXlogRec(int pageno); @@ -296,6 +304,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids, /* ... then the main transaction */ TransactionIdSetStatusBit(xid, status, lsn, slotno); + + /* When we commit advance ClogCtl's shared RecentXminPageno if needed */ + if (ClogCtl-shared-RecentXminPageno TransactionIdToPage(RecentXmin)) + ClogCtl-shared-RecentXminPageno = TransactionIdToPage(RecentXmin); } /* Set the subtransactions */ @@ -387,6 +399,7 @@ TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, i XidStatus TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) { + bool useClogHistory = true; int pageno = TransactionIdToPage(xid); int byteno = TransactionIdToByte(xid); int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT; @@ -397,15 +410,64 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) /* lock is acquired by SimpleLruReadPage_ReadOnly */ - slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid); - byteptr = ClogCtl-shared-page_buffer[slotno] + byteno; + /* + * Decide whether to use main Clog or read-only ClogHistory. 
+ * + * Our knowledge of the boundary between the two may be a little out + * of date, so if we try Clog and can't find it we need to try again + * against ClogHistory. + */ + if (pageno = ClogCtl-recent_oldest_active_page_number) + { + slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid); + if (slotno = 0) + useClogHistory = false; + } + + if (useClogHistory) + return TransactionIdGetStatusHistory(xid); + + byteptr = clog-shared-page_buffer[slotno] + byteno; status = (*byteptr bshift) CLOG_XACT_BITMASK; lsnindex = GetLSNIndex(slotno, xid); - *lsn = ClogCtl-shared-group_lsn[lsnindex]; + *lsn = clog-shared-group_lsn[lsnindex]; - LWLockRelease(CLogControlLock); + LWLockRelease(clog-shared-ControlLock); + + return status; +} + +/* + * Get state of a transaction from the read-only portion of the clog, + * which we refer to as the clog history. + * + * Code isolated here to more easily allow various implementation options. + */ +static XidStatus +TransactionIdGetStatusHistory(TransactionId xid) +{ + SlruCtl clog = ClogHistoryCtl; + int pageno = TransactionIdToPage(xid); + int byteno = TransactionIdToByte(xid); + int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT; + int slotno; + char *byteptr; + XidStatus status; + + slotno = SimpleLruReadPage_ReadOnly(clog, pageno, xid); + + byteptr = clog-shared-page_buffer[slotno] + byteno; + status = (*byteptr bshift) CLOG_XACT_BITMASK; + + /* +
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 10:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it was. Sorry about that. New version attached, retesting while you read this. In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large /* When we commit advance ClogCtl's shared RecentXminPageno if needed */ if (ClogCtl-shared-RecentXminPageno TransactionIdToPage(RecentXmin)) ClogCtl-shared-RecentXminPageno = TransactionIdToPage(RecentXmin); Thanks, will look again. Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. Will think on that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 12, 2012 at 4:49 AM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Jan 5, 2012 at 6:26 PM, Simon Riggs si...@2ndquadrant.com wrote: Patch to remove clog contention caused by dirty clog LRU. v2, minor changes, updated for recent commits This no longer applies to file src/backend/postmaster/bgwriter.c, due to the latch code, and I'm not confident I know how to fix it. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it was. Sorry about that. New version attached, retesting while you read this. In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large /* When we commit advance ClogCtl's shared RecentXminPageno if needed */ if (ClogCtl-shared-RecentXminPageno TransactionIdToPage(RecentXmin)) ClogCtl-shared-RecentXminPageno = TransactionIdToPage(RecentXmin); Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. right -- exactly. but why stop at one page? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. right -- exactly. but why stop at one page? If you have more than one, you need code to decide which one to evict (just free) every time you need a new one. And every process needs to be running this code, while the kernel is still going to need make its own decisions for the entire system. It seems simpler to just let the kernel do the job for everyone. Are you worried that a read syscall is going to be slow even when the data is presumably cached in the OS? Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
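For comparison, the backend-local approach Jeff describes could look something like the minimal sketch below. The pg_clog segment naming (%04X files of 32 pages each) and the 2-bits-per-xact layout match how SLRU segments are actually stored on disk, but the function itself, its error handling, and the single-byte pread per lookup are illustration only; a real implementation would still need the boundary bookkeeping, subtransaction handling, and LSN tracking the patch deals with.

#include <sys/types.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ                 8192
#define CLOG_XACTS_PER_BYTE    4
#define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define SLRU_PAGES_PER_SEGMENT 32

/* Read just the one byte holding xid's 2-bit status, straight from pg_clog,
 * relying on the OS page cache to make repeated reads cheap. */
static int
clog_status_local(const char *datadir, uint32_t xid)
{
    int      pageno = xid / CLOG_XACTS_PER_PAGE;
    int      segno  = pageno / SLRU_PAGES_PER_SEGMENT;
    off_t    offset = (off_t) (pageno % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
                      + (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
    char     path[1024];
    unsigned char byte;
    int      fd;

    snprintf(path, sizeof(path), "%s/pg_clog/%04X", datadir, (unsigned) segno);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (pread(fd, &byte, 1, offset) != 1)
    {
        close(fd);
        return -1;
    }
    close(fd);

    return (byte >> ((xid % CLOG_XACTS_PER_BYTE) * 2)) & 0x03;
}

Whether a read() per lookup is cheap enough even when the page is sitting in the OS cache is exactly the question Robert takes up in his reply.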
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 6:44 AM, Simon Riggs si...@2ndquadrant.com wrote: OT: It would save lots of time if we had 2 things for the CF app: .. 2. Something that automatically tests patches. If you submit a patch we run up a blank VM and run patch applies on all patches. As soon as we get a fail, an email goes to patch author. That way authors know as soon as a recent commit invalidates something. Well, first the CF app would need to reliably be able to find the actual patch. That is currently not a given. Also, it seems that OID collisions are a dime a dozen, and I'm starting to doubt that they are even worth reporting in the absence of a more substantive review. And in the patches I've looked at, it seems like the OID is not even cross-referenced anywhere else in the patch, the cross-references are all based on symbolic names. I freely admit I have no idea what I am talking about, but it seems like the only purpose of OIDs is to create bit rot. Cheers, Jeff -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 10:44 AM, Robert Haas robertmh...@gmail.com wrote: D'oh. You're right. Looks like I accidentally tried to apply this to the 9.1 sources. Sigh... No worries. It's Friday.

Server passed 'make check' with this patch, but when I tried to fire it up for some test runs, it fell over with:

FATAL: no more LWLockIds available

I assume that it must be dependent on the config settings used. Here are mine:

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sat, Jan 21, 2012 at 1:57 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 20, 2012 at 10:44 AM, Robert Haas robertmh...@gmail.com wrote: D'oh. You're right. Looks like I accidentally tried to apply this to the 9.1 sources. Sigh... No worries. It's Friday. Server passed 'make check' with this patch, but when I tried to fire it up for some test runs, it fell over with: FATAL: no more LWLockIds available I assume that it must be dependent on the config settings used. Here are mine: shared_buffers = 8GB maintenance_work_mem = 1GB synchronous_commit = off checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.9 wal_writer_delay = 20ms Yes, it was. Sorry about that. New version attached, retesting while you read this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 69b6ef3..6ff6894 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -37,6 +37,7 @@ #include access/transam.h #include miscadmin.h #include pg_trace.h +#include utils/snapmgr.h /* * Defines for CLOG page sizes. A page is the same BLCKSZ as is used @@ -70,10 +71,17 @@ /* * Link to shared-memory data structures for CLOG control + * + * As of 9.2, we have 2 structures for commit log data. + * ClogCtl manages the main read/write part of the commit log, while + * the ClogHistoryCtl manages the now read-only, older part. ClogHistory + * removes contention from the path of transaction commits. */ static SlruCtlData ClogCtlData; +static SlruCtlData ClogHistoryCtlData; -#define ClogCtl (ClogCtlData) +#define ClogCtl (ClogCtlData) +#define ClogHistoryCtl (ClogHistoryCtlData) static int ZeroCLOGPage(int pageno, bool writeXlog); @@ -296,6 +304,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids, /* ... then the main transaction */ TransactionIdSetStatusBit(xid, status, lsn, slotno); + + /* When we commit advance ClogCtl's shared RecentXminPageno if needed */ + if (ClogCtl-shared-RecentXminPageno TransactionIdToPage(RecentXmin)) + ClogCtl-shared-RecentXminPageno = TransactionIdToPage(RecentXmin); } /* Set the subtransactions */ @@ -387,6 +399,8 @@ TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, i XidStatus TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) { + SlruCtl clog = ClogCtl; + bool useClogHistory = true; int pageno = TransactionIdToPage(xid); int byteno = TransactionIdToByte(xid); int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT; @@ -397,15 +411,35 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) /* lock is acquired by SimpleLruReadPage_ReadOnly */ - slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid); - byteptr = ClogCtl-shared-page_buffer[slotno] + byteno; + /* + * Decide whether to use main Clog or read-only ClogHistory. + * + * Our knowledge of the boundary between the two may be a little out + * of date, so if we try Clog and can't find it we need to try again + * against ClogHistory. 
+ */ + if (pageno = ClogCtl-recent_oldest_active_page_number) + { + slotno = SimpleLruReadPage_ReadOnly(clog, pageno, xid); + if (slotno = 0) + useClogHistory = false; + } + + if (useClogHistory) + { + clog = ClogHistoryCtl; + slotno = SimpleLruReadPage_ReadOnly(clog, pageno, xid); + Assert(slotno = 0); + } + + byteptr = clog-shared-page_buffer[slotno] + byteno; status = (*byteptr bshift) CLOG_XACT_BITMASK; lsnindex = GetLSNIndex(slotno, xid); - *lsn = ClogCtl-shared-group_lsn[lsnindex]; + *lsn = clog-shared-group_lsn[lsnindex]; - LWLockRelease(CLogControlLock); + LWLockRelease(clog-shared-ControlLock); return status; } @@ -445,15 +479,19 @@ CLOGShmemBuffers(void) Size CLOGShmemSize(void) { - return SimpleLruShmemSize(CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE); + /* Reserve shmem for both ClogCtl and ClogHistoryCtl */ + return SimpleLruShmemSize(2 * CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE); } void CLOGShmemInit(void) { ClogCtl-PagePrecedes = CLOGPagePrecedes; + ClogHistoryCtl-PagePrecedes = CLOGPagePrecedes; SimpleLruInit(ClogCtl, CLOG Ctl, CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, CLogControlLock, pg_clog); + SimpleLruInit(ClogHistoryCtl, CLOG History Ctl, CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, + CLogHistoryControlLock, pg_clog); } /* @@ -592,6 +630,16 @@ CheckPointCLOG(void) TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true); SimpleLruFlush(ClogCtl, true); TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true); + + /* + * Now that we've written out all dirty buffers the only pages that + * will get dirty again will be pages with active transactions on them. + * So we can move forward the oldest_active_page_number and allow + * read only operations via ClogHistoryCtl. + */ + LWLockAcquire(CLogControlLock,
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. This seems to need a rebase. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. This seems to need a rebase. OT: It would save lots of time if we had 2 things for the CF app: 1. Emails that go to appropriate people when status changes. e.g. when someone sets Waiting for Author the author gets an email so they know the reviewer is expecting something. No knowing that wastes lots of days, so if we want to do this in less days that seems like a great place to start. 2. Something that automatically tests patches. If you submit a patch we run up a blank VM and run patch applies on all patches. As soon as we get a fail, an email goes to patch author. That way authors know as soon as a recent commit invalidates something. Those things have wasted time for me in the past, so they're opportunities to improve the process, not must haves. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 9:44 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. This seems to need a rebase. OT: It would save lots of time if we had 2 things for the CF app: 1. Emails that go to appropriate people when status changes. e.g. when someone sets Waiting for Author the author gets an email so they know the reviewer is expecting something. No knowing that wastes lots of days, so if we want to do this in less days that seems like a great place to start. 2. Something that automatically tests patches. If you submit a patch we run up a blank VM and run patch applies on all patches. As soon as we get a fail, an email goes to patch author. That way authors know as soon as a recent commit invalidates something. Those things have wasted time for me in the past, so they're opportunities to improve the process, not must haves. Yeah, I agree that that would be nice. I just haven't had time to implement much of anything for the CF application in a long time. My management has been very interested in the performance and scalability stuff, so that's been my main focus for 9.2. I'm going to see if I can carve out some time for this once the dust settles. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 10:16 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. This seems to need a rebase. Still applies and compiles cleanly for me. D'oh. You're right. Looks like I accidentally tried to apply this to the 9.1 sources. Sigh... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 3:32 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 20, 2012 at 10:16 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. This seems to need a rebase. Still applies and compiles cleanly for me. D'oh. You're right. Looks like I accidentally tried to apply this to the 9.1 sources. Sigh... No worries. It's Friday. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 10:38 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 20, 2012 at 3:32 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 20, 2012 at 10:16 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. This seems to need a rebase. Still applies and compiles cleanly for me. D'oh. You're right. Looks like I accidentally tried to apply this to the 9.1 sources. Sigh... No worries. It's Friday. http://www.youtube.com/watch?v=kfVsfOSbJY0 Of course, I even ran git log to check that I had the latest sources... but what I had, of course, was the latest 9.1 sources, which still have recently-timestamped commits, and I didn't look carefully enough. Sigh. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 20, 2012 at 1:37 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 8, 2012 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. This seems to need a rebase. Still applies and compiles cleanly for me. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention, part 2
On Sun, Jan 8, 2012 at 2:25 PM, Simon Riggs si...@2ndquadrant.com wrote: I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. Why do we need this in 9.2? We now have clog_buffers = 32 and we have write rates ~16,000 tps. At those write rates we fill a clog buffer every 2 seconds, so the clog cache completely churns every 64 seconds. If we wish to achieve those rates in the real world, any access to data that was written by a transaction more than a minute ago will cause clog cache page faults, leading to stalls in new transactions. To avoid those problems we need * background writing of the clog LRU (already posted as a separate patch) * a way of separating access to historical data from the main commit path (this patch) And to evaluate such situations, we need a way to simulate data that contains many transactions. 32 buffers can hold just over 1 million transaction ids, so benchmarks against databases containing 10 million separate transactions are recommended (remembering that this is just 10 mins of data on high TPS systems). A pgbench patch is provided separately to aid in the evaluation. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
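To put numbers to the arithmetic in the message above, here is a small standalone sketch (not PostgreSQL code; it only assumes the stock 8kB block size and 2 clog status bits per transaction, i.e. 32,768 xids per clog page):

#include <stdio.h>

int
main(void)
{
    const int       blcksz = 8192;
    const int       xacts_per_page = blcksz * 4;    /* 2 bits per xact => 4 per byte => 32768 */
    const int       clog_buffers = 32;
    const double    tps = 16000.0;

    printf("seconds to fill one clog page: %.1f\n", xacts_per_page / tps);
    printf("seconds to churn all %d clog buffers: %.1f\n",
           clog_buffers, clog_buffers * xacts_per_page / tps);
    printf("xids covered by the clog cache: %d\n", clog_buffers * xacts_per_page);
    printf("minutes of data in a 10 million xact test: %.1f\n",
           10000000.0 / tps / 60.0);
    return 0;
}

At 16,000 tps this prints roughly 2 seconds per page, about 65 seconds to cycle through 32 buffers, just over 1 million xids held in cache, and about 10 minutes of activity in a 10-million-transaction test, matching the figures quoted above.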
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 6:26 PM, Simon Riggs si...@2ndquadrant.com wrote: Patch to remove clog contention caused by dirty clog LRU. v2, minor changes, updated for recent commits -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 69b6ef3..f3e08e6 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -594,6 +594,26 @@ CheckPointCLOG(void) TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true); } +/* + * Conditionally flush the CLOG LRU. + * + * When a backend does ExtendCLOG we need to write the CLOG LRU if it is + * dirty. Performing I/O while holding XidGenLock prevents new write + * transactions from starting. To avoid that we flush the CLOG LRU, if + * we think that a page write is due soon, according to a heuristic. + * + * Note that we're reading ShmemVariableCache-nextXid without a lock + * since the exact value doesn't matter as input into our heuristic. + */ +void +CLOGBackgroundFlushLRU(void) +{ + TransactionId xid = ShmemVariableCache-nextXid; + int threshold = (CLOG_XACTS_PER_PAGE - (CLOG_XACTS_PER_PAGE / 4)); + + if (TransactionIdToPgIndex(xid) threshold) + SlruBackgroundFlushLRUPage(ClogCtl); +} /* * Make sure that CLOG has room for a newly-allocated XID. diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 30538ff..aea6c09 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -885,6 +885,82 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid) } /* + * Identify the LRU slot but just leave it as it is. + * + * Control lock must be held at entry, and will be held at exit. + */ +static int +SlruIdentifyLRUSlot(SlruCtl ctl) +{ + SlruShared shared = ctl-shared; + int slotno; + int cur_count; + int bestslot; + int best_delta; + int best_page_number; + + /* + * If we find any EMPTY slot, just select that one. Else locate the + * least-recently-used slot. + * + * Normally the page_lru_count values will all be different and so + * there will be a well-defined LRU page. But since we allow + * concurrent execution of SlruRecentlyUsed() within + * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages + * acquire the same lru_count values. In that case we break ties by + * choosing the furthest-back page. + * + * In no case will we select the slot containing latest_page_number + * for replacement, even if it appears least recently used. + * + * Notice that this next line forcibly advances cur_lru_count to a + * value that is certainly beyond any value that will be in the + * page_lru_count array after the loop finishes. This ensures that + * the next execution of SlruRecentlyUsed will mark the page newly + * used, even if it's for a page that has the current counter value. + * That gets us back on the path to having good data when there are + * multiple pages with the same lru_count. + */ + cur_count = (shared-cur_lru_count)++; + best_delta = -1; + bestslot = 0; /* no-op, just keeps compiler quiet */ + best_page_number = 0; /* ditto */ + for (slotno = 0; slotno shared-num_slots; slotno++) + { + int this_delta; + int this_page_number; + + if (shared-page_status[slotno] == SLRU_PAGE_EMPTY) + return slotno; + this_delta = cur_count - shared-page_lru_count[slotno]; + if (this_delta 0) + { + /* + * Clean up in case shared updates have caused cur_count + * increments to get lost. 
We back off the page counts, + * rather than trying to increase cur_count, to avoid any + * question of infinite loops or failure in the presence of + * wrapped-around counts. + */ + shared-page_lru_count[slotno] = cur_count; + this_delta = 0; + } + this_page_number = shared-page_number[slotno]; + if ((this_delta best_delta || + (this_delta == best_delta + ctl-PagePrecedes(this_page_number, best_page_number))) + this_page_number != shared-latest_page_number) + { + bestslot = slotno; + best_delta = this_delta; + best_page_number = this_page_number; + } + } + + return bestslot; +} + +/* * Select the slot to re-use when we need a free slot. * * The target page number is passed because we need to consider the @@ -905,11 +981,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno) /* Outer loop handles restart after I/O */ for (;;) { - int slotno; - int cur_count; int bestslot; - int best_delta; - int best_page_number; + int slotno; /* See if page already has a buffer assigned */ for (slotno = 0; slotno shared-num_slots; slotno++) @@ -919,69 +992,14 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno) return slotno; } - /* - * If we find any EMPTY slot, just select that one. Else locate the - * least-recently-used slot to replace. - * - * Normally the page_lru_count values will all be different and so - *
[HACKERS] CLOG contention, part 2
Recent results from Robert show clog contention is still an issue. In various discussions Tom noted that pages prior to RecentXmin are readonly and we might find a way to make use of that fact in providing different mechanisms or resources. I've taken that idea and used it to build a second Clog cache, known as ClogHistory which allows access to the read-only tail of pages in the clog. Once a page has been written to for the last time, it will be accessed via the ClogHistory Slru in preference to the normal Clog Slru. This separates historical accesses by readers from current write access by committers. Historical access doesn't force dirty writes, nor are commits made to wait when historical access occurs. The patch is very simple because all the writes still continue through the normal route, so is suitable for 9.2. I'm no longer working on clog partitioning patch for this release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 69b6ef3..6ff6894 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -37,6 +37,7 @@ #include "access/transam.h" #include "miscadmin.h" #include "pg_trace.h" +#include "utils/snapmgr.h" /* * Defines for CLOG page sizes. A page is the same BLCKSZ as is used @@ -70,10 +71,17 @@ /* * Link to shared-memory data structures for CLOG control + * + * As of 9.2, we have 2 structures for commit log data. + * ClogCtl manages the main read/write part of the commit log, while + * the ClogHistoryCtl manages the now read-only, older part. ClogHistory + * removes contention from the path of transaction commits. */ static SlruCtlData ClogCtlData; +static SlruCtlData ClogHistoryCtlData; -#define ClogCtl (&ClogCtlData) +#define ClogCtl (&ClogCtlData) +#define ClogHistoryCtl (&ClogHistoryCtlData) static int ZeroCLOGPage(int pageno, bool writeXlog); @@ -296,6 +304,10 @@ TransactionIdSetPageStatus(TransactionId xid, int nsubxids, /* ... then the main transaction */ TransactionIdSetStatusBit(xid, status, lsn, slotno); + + /* When we commit advance ClogCtl's shared RecentXminPageno if needed */ + if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin)) + ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin); } /* Set the subtransactions */ @@ -387,6 +399,8 @@ TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, i XidStatus TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) { + SlruCtl clog = ClogCtl; + bool useClogHistory = true; int pageno = TransactionIdToPage(xid); int byteno = TransactionIdToByte(xid); int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT; @@ -397,15 +411,35 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) /* lock is acquired by SimpleLruReadPage_ReadOnly */ - slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid); - byteptr = ClogCtl->shared->page_buffer[slotno] + byteno; + /* + * Decide whether to use main Clog or read-only ClogHistory. + * + * Our knowledge of the boundary between the two may be a little out + * of date, so if we try Clog and can't find it we need to try again + * against ClogHistory. + */ + if (pageno >= ClogCtl->recent_oldest_active_page_number) + { + slotno = SimpleLruReadPage_ReadOnly(clog, pageno, xid); + if (slotno >= 0) + useClogHistory = false; + } + + if (useClogHistory) + { + clog = ClogHistoryCtl; + slotno = SimpleLruReadPage_ReadOnly(clog, pageno, xid); + Assert(slotno >= 0); + } + + byteptr = clog->shared->page_buffer[slotno] + byteno; status = (*byteptr >> bshift) & CLOG_XACT_BITMASK; lsnindex = GetLSNIndex(slotno, xid); - *lsn = ClogCtl->shared->group_lsn[lsnindex]; + *lsn = clog->shared->group_lsn[lsnindex]; - LWLockRelease(CLogControlLock); + LWLockRelease(clog->shared->ControlLock); return status; } @@ -445,15 +479,19 @@ CLOGShmemBuffers(void) Size CLOGShmemSize(void) { - return SimpleLruShmemSize(CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE); + /* Reserve shmem for both ClogCtl and ClogHistoryCtl */ + return SimpleLruShmemSize(2 * CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE); } void CLOGShmemInit(void) { ClogCtl->PagePrecedes = CLOGPagePrecedes; + ClogHistoryCtl->PagePrecedes = CLOGPagePrecedes; SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, CLogControlLock, "pg_clog"); + SimpleLruInit(ClogHistoryCtl, "CLOG History Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, + CLogHistoryControlLock, "pg_clog"); } /* @@ -592,6 +630,16 @@ CheckPointCLOG(void) TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true); SimpleLruFlush(ClogCtl, true); TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true); + + /* + * Now that we've written out all dirty buffers the only pages that + * will get dirty again will be pages with active transactions on them. + * So we can move forward the
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 10:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: I would be in favor of that, or perhaps some other formula (eg, maybe the minimum should be less than 8 for when you've got very little shmem). I have some results that show that, under the right set of circumstances, 8-32 is a win, and I can quantify by how much it wins. I don't have any data at all to quantify the cost of dropping the minimum from 8-6, or from 8-4, and therefore I'm reluctant to do it. My guess is that it's a bad idea, anyway. Even on a system where shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by 50%), we can increase the number of regular buffers from 1024 to 1028 (i.e. by 0.5%). Maybe you can find a case where that comes out to a win, but you might have to look pretty hard. I think you're rejecting the concept too easily. A setup with very little shmem is only going to be suitable for low-velocity systems that are not pushing too many transactions through per second, so it's not likely to need so many CLOG buffers. And frankly I'm not that concerned about what the performance is like: I'm more concerned about whether PG will start up at all without modifying the system shmem limits, on systems with legacy values for SHMMAX etc. Shaving a few single-purpose buffers to make back what we spent on SSI, for example, seems like a good idea to me. Having 32 clog buffers is important at the high end. I don't think that other complexities should mask that truth and lead to us not doing anything on this topic for this release. Please can we either make it user configurable? prepared transactions require config, lock table size is configurable also, so having SSI and clog require config is not too much of a stretch. We can then discuss intelligent autotuning behaviour when we have more time and more evidence. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: I would be in favor of that, or perhaps some other formula (eg, maybe the minimum should be less than 8 for when you've got very little shmem). I have some results that show that, under the right set of circumstances, 8-32 is a win, and I can quantify by how much it wins. I don't have any data at all to quantify the cost of dropping the minimum from 8-6, or from 8-4, and therefore I'm reluctant to do it. My guess is that it's a bad idea, anyway. Even on a system where shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by 50%), we can increase the number of regular buffers from 1024 to 1028 (i.e. by 0.5%). Maybe you can find a case where that comes out to a win, but you might have to look pretty hard. I think you're rejecting the concept too easily. A setup with very little shmem is only going to be suitable for low-velocity systems that are not pushing too many transactions through per second, so it's not likely to need so many CLOG buffers. Well, if you take the same workload and spread it out over a long period of time, it will still have just as many CLOG misses or shared_buffers misses as it had when you did it all at top speed. Admittedly, you're unlikely to run into the situation where you have people wanting to do simultaneous CLOG reads than there are buffers, but you'll still thrash the cache. And frankly I'm not that concerned about what the performance is like: I'm more concerned about whether PG will start up at all without modifying the system shmem limits, on systems with legacy values for SHMMAX etc. After thinking about this a bit, I think the problem is that the divisor we picked is still too high. Suppose we set num_clog_buffers = (shared_buffers / 4MB), with a minimum of 4 and maximum of 32. That way, pretty much anyone who bothers to set shared_buffers to a non-default value will get 32 CLOG buffers, which should be fine, but people who are in the 32MB-or-less range can ramp down lower than what we've allowed in the past. That seems like it might give us the best of both worlds. Shaving a few single-purpose buffers to make back what we spent on SSI, for example, seems like a good idea to me. I think if we want to buy back that memory, the best way to do it would be to add a GUC to disable SSI at startup time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
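For concreteness, the autotuning formula proposed above could be written in CLOGShmemBuffers() roughly as below. This is only a sketch, assuming 8kB buffers so that 4MB of shared_buffers corresponds to NBuffers / 512; Min, Max and NBuffers are the usual backend macros and globals, and the committed form may differ in detail:

Size
CLOGShmemBuffers(void)
{
    /* One CLOG buffer per 4MB of shared_buffers, clamped to the range 4..32 */
    return Min(32, Max(4, NBuffers / 512));
}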
Re: [HACKERS] CLOG contention
Simon Riggs si...@2ndquadrant.com writes: Please can we either make it user configurable? Weren't you just complaining that *I* was overcomplicating things? I see no evidence to justify inventing a user-visible GUC here. We have rough consensus on both the need for and the shape of a formula, with just minor discussion about the exact parameters to plug into it. Punting the problem off to a GUC is not a better answer. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: After thinking about this a bit, I think the problem is that the divisor we picked is still too high. Suppose we set num_clog_buffers = (shared_buffers / 4MB), with a minimum of 4 and maximum of 32. Works for me. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Fri, Jan 6, 2012 at 3:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: Please can we either make it user configurable? Weren't you just complaining that *I* was overcomplicating things? I see no evidence to justify inventing a user-visible GUC here. We have rough consensus on both the need for and the shape of a formula, with just minor discussion about the exact parameters to plug into it. Punting the problem off to a GUC is not a better answer. As long as we get 32 buffers on big systems, I have no complaint. I'm sorry if I moaned at you personally. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Fri, Jan 6, 2012 at 11:05 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: After thinking about this a bit, I think the problem is that the divisor we picked is still too high. Suppose we set num_clog_buffers = (shared_buffers / 4MB), with a minimum of 4 and maximum of 32. Works for me. Done. I tested this on my MacBook Pro and I see no statistically significant difference from the change on a couple of small pgbench tests. Hopefully that means this is good on large boxes and at worst harmless on small ones. As far as I can see, the trade-off is this: If you increase the number of CLOG buffers, then your CLOG miss rate will go down. On the other hand, the cost of looking up a CLOG buffer will go up. At some point, the reduction in the miss rate will not be enough to pay for a longer linear search - which also means holding CLogControlLock. I think it'd probably be worthwhile to think about looking for something slightly smarter than a linear search at some point, and maybe also looking for a way to partition the locking better. But, this at least picks the available low-hanging fruit, which is a good place to start. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
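The "longer linear search" mentioned above is the per-lookup scan over the SLRU buffer slots in slru.c, which runs while CLogControlLock is held. A stripped-down, standalone sketch of that scan is shown below; it is illustrative only, and the real code also tracks page status and recently-used counters:

/* Minimal stand-in for the relevant fields of SlruSharedData */
typedef struct
{
    int         num_slots;      /* e.g. 8 CLOG buffers before the change, 32 after */
    int        *page_number;    /* which clog page each slot currently holds */
} SlruSharedSketch;

/*
 * Every status lookup scans all slots looking for its page, so the cost of
 * a cache hit grows linearly with the number of buffers, even as the miss
 * rate falls.
 */
int
slru_find_page(const SlruSharedSketch *shared, int pageno)
{
    int         slotno;

    for (slotno = 0; slotno < shared->num_slots; slotno++)
    {
        if (shared->page_number[slotno] == pageno)
            return slotno;      /* hit: no I/O needed */
    }
    return -1;                  /* miss: caller must read the page in */
}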
Re: [HACKERS] CLOG contention
On Tue, Dec 27, 2011 at 5:23 AM, Simon Riggs si...@2ndquadrant.com wrote: On Sat, Dec 24, 2011 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote: Also, if it is that, what do we do about it? I don't think any of the ideas proposed so far are going to help much. If you don't like guessing, don't guess, don't think. Just measure. Does increasing the number of buffers solve the problems you see? That must be the first port of call - is that enough, or not? If not, we can discuss the various ideas, write patches and measure them. Just in case you want a theoretical prediction to test: increasing NUM_CLOG_BUFFERS should reduce the frequency of the spikes you measured earlier. That should happen proportionally, so as that is increased they will become even less frequent. But the size of the buffer will not decrease the impact of each event when it happens. I'm still catching up on email, so apologies for the slow response on this. I actually ran this test before Christmas, but didn't get around to emailing the results. I'm attaching graphs of the last 100 seconds of a run with the normal count of CLOG buffers, and the last 100 seconds of a run with NUM_CLOG_BUFFERS = 32. I am also attaching graphs of the entire runs. It appears to me that increasing the number of CLOG buffers reduced the severity of the latency spikes considerably. In the last 100 seconds, for example, master has several spikes in the 500-700ms range, but with 32 CLOG buffers it never goes above 400 ms. Also, the number of points associated with each spike is considerably less - each spike seems to affect fewer transactions. So it seems that at least on this machine, increasing the number of CLOG buffers both improves performance and reduces latency. I hypothesize that there are actually two kinds of latency spikes here. Just taking a wild guess, I wonder if the *remaining* latency spikes are caused by the effect that you mentioned before: namely, the need to write an old CLOG page every time we advance onto a new one. I further speculate that the spikes are more severe on the unpatched code because this effect combines with the one I mentioned before: if there are more simultaneous I/O requests than there are buffers, a new I/O request has to wait for one of the I/Os already in progress to complete. If the new I/O request that has to wait extra-long happens to be the one caused by XID advancement, then things get really ugly. If that hypothesis is correct, then it supports your previous belief that more than one fix is needed here... but it also means we can get a significant and I think quite worthwhile benefit just out of finding a reasonable way to add some more buffers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company attachment: latency-end.pngattachment: latency-clog32-end.pngattachment: latency.pngattachment: latency-clog32.png -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 4:04 PM, Robert Haas robertmh...@gmail.com wrote: It appears to me that increasing the number of CLOG buffers reduced the severity of the latency spikes considerably. In the last 100 seconds, for example, master has several spikes in the 500-700ms range, but with 32 CLOG buffers it never goes above 400 ms. Also, the number of points associated with each spike is considerably less - each spike seems to affect fewer transactions. So it seems that at least on this machine, increasing the number of CLOG buffers both improves performance and reduces latency. I believed before that the increase was worthwhile and now even more so. Let's commit the change to 32. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Simon Riggs si...@2ndquadrant.com wrote: Robert Haas robertmh...@gmail.com wrote: So it seems that at least on this machine, increasing the number of CLOG buffers both improves performance and reduces latency. I believed before that the increase was worthwhile and now even more so. Let's commit the change to 32. +1 -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 4:04 PM, Robert Haas robertmh...@gmail.com wrote: I hypothesize that there are actually two kinds of latency spikes here. Just taking a wild guess, I wonder if the *remaining* latency spikes are caused by the effect that you mentioned before: namely, the need to write an old CLOG page every time we advance onto a new one. I further speculate that the spikes are more severe on the unpatched code because this effect combines with the one I mentioned before: if there are more simultaneous I/O requests than there are buffers, a new I/O request has to wait for one of the I/Os already in progress to complete. If the new I/O request that has to wait extra-long happens to be the one caused by XID advancement, then things get really ugly. If that hypothesis is correct, then it supports your previous belief that more than one fix is needed here... but it also means we can get a significant and I think quite worthwhile benefit just out of finding a reasonable way to add some more buffers. Sounds reaonable. Patch to remove clog contention caused by my dirty clog LRU. The patch implements background WAL allocation also, with the intention of being separately tested, if possible. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 4060e60..dbefa02 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -565,6 +565,26 @@ CheckPointCLOG(void) TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true); } +/* + * Conditionally flush the CLOG LRU. + * + * When a backend does ExtendCLOG we need to write the CLOG LRU if it is + * dirty. Performing I/O while holding XidGenLock prevents new write + * transactions from starting. To avoid that we flush the CLOG LRU, if + * we think that a page write is due soon, according to a heuristic. + * + * Note that we're reading ShmemVariableCache-nextXid without a lock + * since the exact value doesn't matter as input into our heuristic. + */ +void +CLOGBackgroundFlushLRU(void) +{ + TransactionId xid = ShmemVariableCache-nextXid; + int threshold = (CLOG_XACTS_PER_PAGE - (CLOG_XACTS_PER_PAGE / 4)); + + if (TransactionIdToPgIndex(xid) threshold) + SlruBackgroundFlushLRUPage(ClogCtl); +} /* * Make sure that CLOG has room for a newly-allocated XID. diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 30538ff..aea6c09 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -885,6 +885,82 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid) } /* + * Identify the LRU slot but just leave it as it is. + * + * Control lock must be held at entry, and will be held at exit. + */ +static int +SlruIdentifyLRUSlot(SlruCtl ctl) +{ + SlruShared shared = ctl-shared; + int slotno; + int cur_count; + int bestslot; + int best_delta; + int best_page_number; + + /* + * If we find any EMPTY slot, just select that one. Else locate the + * least-recently-used slot. + * + * Normally the page_lru_count values will all be different and so + * there will be a well-defined LRU page. But since we allow + * concurrent execution of SlruRecentlyUsed() within + * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages + * acquire the same lru_count values. In that case we break ties by + * choosing the furthest-back page. 
+ * + * In no case will we select the slot containing latest_page_number + * for replacement, even if it appears least recently used. + * + * Notice that this next line forcibly advances cur_lru_count to a + * value that is certainly beyond any value that will be in the + * page_lru_count array after the loop finishes. This ensures that + * the next execution of SlruRecentlyUsed will mark the page newly + * used, even if it's for a page that has the current counter value. + * That gets us back on the path to having good data when there are + * multiple pages with the same lru_count. + */ + cur_count = (shared-cur_lru_count)++; + best_delta = -1; + bestslot = 0; /* no-op, just keeps compiler quiet */ + best_page_number = 0; /* ditto */ + for (slotno = 0; slotno shared-num_slots; slotno++) + { + int this_delta; + int this_page_number; + + if (shared-page_status[slotno] == SLRU_PAGE_EMPTY) + return slotno; + this_delta = cur_count - shared-page_lru_count[slotno]; + if (this_delta 0) + { + /* + * Clean up in case shared updates have caused cur_count + * increments to get lost. We back off the page counts, + * rather than trying to increase cur_count, to avoid any + * question of infinite loops or failure in the presence of + * wrapped-around counts. + */ + shared-page_lru_count[slotno] = cur_count; + this_delta = 0; + } + this_page_number = shared-page_number[slotno]; + if ((this_delta best_delta || +
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote: Let's commit the change to 32. I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32. I also think it would be worth a quick test to see how the increase performs on a system with 32 cores. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote: Let's commit the change to 32. I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32 We're talking about an extra 192KB or thereabouts and Clog buffers will only be the size of subtrans when we've finished. If you want to have a special low-memory option, then it would need to include many more things than clog buffers. Let's just use a constant value for clog buffers until the low-memory patch arrives. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 2:21 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote: Let's commit the change to 32. I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32 We're talking about an extra 192KB or thereabouts and Clog buffers will only be the size of subtrans when we've finished. If you want to have a special low-memory option, then it would need to include many more things than clog buffers. Let's just use a constant value for clog buffers until the low-memory patch arrives. Tom already stated that he found that unacceptable. Unless he changes his opinion, we're not going to get far if you're only happy if it's constant and he's only happy if there's a formula. On the other hand, I think there's a decent argument that he should change his opinion, because 192kB of memory is not a lot. However, what I mostly want is something that nobody hates, so we can get it committed and move on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 1:12 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote: Let's commit the change to 32. I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32 The assumption that machines that need this will have gigabytes of shared memory set is not valid IMO. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 7:26 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 2:21 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote: Let's commit the change to 32. I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32 We're talking about an extra 192KB or thereabouts and Clog buffers will only be the size of subtrans when we've finished. If you want to have a special low-memory option, then it would need to include many more things than clog buffers. Let's just use a constant value for clog buffers until the low-memory patch arrives. Tom already stated that he found that unacceptable. Unless he changes his opinion, we're not going to get far if you're only happy if it's constant and he's only happy if there's a formula. On the other hand, I think there's a decent argument that he should change his opinion, because 192kB of memory is not a lot. However, what I mostly want is something that nobody hates, so we can get it committed and move on. If that was a reasonable objection it would have applied when we added serializable support, or any other SLRU for that matter. If memory reduction is a concern to anybody, then a separate patch to address *all* issues is required. Blocking this patch makes no sense. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Excerpts from Simon Riggs's message of jue ene 05 16:21:31 -0300 2012: On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote: Let's commit the change to 32. I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32 We're talking about an extra 192KB or thereabouts and Clog buffers will only be the size of subtrans when we've finished. Speaking of which, maybe it'd be a good idea to parametrize the subtrans size according to the same (or a similar) formula too. (It might be good to reduce multixact memory consumption too; I'd think that 4+4 pages should be more than sufficient for low memory systems, so making those be half the clog values should be good) So you get both things: reduce memory usage for systems on the low end, which has been slowly increasing lately as we've added more uses of SLRU, and more buffers for large systems. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com wrote: Simon Riggs si...@2ndquadrant.com wrote: Robert Haas robertmh...@gmail.com wrote: Simon Riggs si...@2ndquadrant.com wrote: Let's commit the change to 32. I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32 If we go with such a formula, I think 32 MB would be a more appropriate divisor than 128 MB. Even on very large machines where 32 CLOG buffers would be a clear win, we often can't go above 1 or 2 GB of shared_buffers without hitting latency spikes due to overrun of the RAID controller cache. (Now, that may change if we get DW in, but that's not there yet.) 1 GB / 32 is 32 MB. This would leave CLOG pinned at the minimum of 8 buffers (64 KB) all the way up to shared_buffers of 256 MB. Let's just use a constant value for clog buffers until the low-memory patch arrives. Tom already stated that he found that unacceptable. Unless he changes his opinion, we're not going to get far if you're only happy if it's constant and he's only happy if there's a formula. On the other hand, I think there's a decent argument that he should change his opinion, because 192kB of memory is not a lot. However, what I mostly want is something that nobody hates, so we can get it committed and move on. I wouldn't hate it either way, as long as the divisor isn't too large. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: I would like to do that, but I think we need to at least figure out a way to provide an escape hatch for people without much shared memory. We could do that, perhaps, by using a formula like this: 1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a maximum of 32 I would be in favor of that, or perhaps some other formula (eg, maybe the minimum should be less than 8 for when you've got very little shmem). I think that the reason it's historically been a constant is that the original coding took advantage of having a compile-time-constant number of buffers --- but since we went over to the common SLRU infrastructure for several different logs, there's no longer any benefit whatever to using a simple constant. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Simon Riggs si...@2ndquadrant.com writes: On Thu, Jan 5, 2012 at 7:26 PM, Robert Haas robertmh...@gmail.com wrote: On the other hand, I think there's a decent argument that he should change his opinion, because 192kB of memory is not a lot. However, what I mostly want is something that nobody hates, so we can get it committed and move on. If that was a reasonable objection it would have applied when we added serializable support, or any other SLRU for that matter. If memory reduction is a concern to anybody, then a separate patch to address *all* issues is required. Blocking this patch makes no sense. No, your argument is the one that makes no sense. The fact that things could be made better for low-mem situations is not an argument for instead making them worse. Which is what going to a fixed value of 32 would do, in return for no benefit that I can see compared to using a formula of some sort. The details of the formula barely matter, though I would like to see one that bottoms out at less than 8 buffers so that there is some advantage gained for low-memory cases. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 7:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: I think that the reason it's historically been a constant is that the original coding took advantage of having a compile-time-constant number of buffers --- but since we went over to the common SLRU infrastructure for several different logs, there's no longer any benefit whatever to using a simple constant. You astound me, you really do. Parameterised slru buffer sizes were proposed for 8.3 and opposed by you. I guess we all reserve the right to change our minds... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Simon Riggs si...@2ndquadrant.com writes: Parameterised slru buffer sizes were proposed for 8.3 and opposed by you. I guess we all reserve the right to change our minds... When presented with new data, sure. Robert's results offer a reason to worry about this, which we did not have before now. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 2:44 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: If we go with such a formula, I think 32 MB would be a more appropriate divisor than 128 MB. Even on very large machines where 32 CLOG buffers would be a clear win, we often can't go above 1 or 2 GB of shared_buffers without hitting latency spikes due to overrun of the RAID controller cache. (Now, that may change if we get DW in, but that's not there yet.) 1 GB / 32 is 32 MB. This would leave CLOG pinned at the minimum of 8 buffers (64 KB) all the way up to shared_buffers of 256 MB. That seems reasonable to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: I would be in favor of that, or perhaps some other formula (eg, maybe the minimum should be less than 8 for when you've got very little shmem). I have some results that show that, under the right set of circumstances, 8-32 is a win, and I can quantify by how much it wins. I don't have any data at all to quantify the cost of dropping the minimum from 8-6, or from 8-4, and therefore I'm reluctant to do it. My guess is that it's a bad idea, anyway. Even on a system where shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by 50%), we can increase the number of regular buffers from 1024 to 1028 (i.e. by 0.5%). Maybe you can find a case where that comes out to a win, but you might have to look pretty hard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
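For what it's worth, the arithmetic in that last point works out as follows; this is a standalone sketch assuming 8kB pages, so each CLOG buffer displaces exactly one regular shared buffer:

#include <stdio.h>

int
main(void)
{
    const int   regular_buffers = (8 * 1024) / 8;   /* 8MB shared_buffers / 8kB pages = 1024 */
    const int   clog_buffers_saved = 8 - 4;         /* dropping the CLOG minimum from 8 to 4 */

    printf("regular buffers: %d -> %d (a gain of about %.1f%%)\n",
           regular_buffers, regular_buffers + clog_buffers_saved,
           100.0 * clog_buffers_saved / regular_buffers);
    return 0;
}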
Re: [HACKERS] CLOG contention
On Thu, Jan 5, 2012 at 2:25 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 5, 2012 at 2:44 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: If we go with such a formula, I think 32 MB would be a more appropriate divisor than 128 MB. Even on very large machines where 32 CLOG buffers would be a clear win, we often can't go above 1 or 2 GB of shared_buffers without hitting latency spikes due to overrun of the RAID controller cache. (Now, that may change if we get DW in, but that's not there yet.) 1 GB / 32 is 32 MB. This would leave CLOG pinned at the minimum of 8 buffers (64 KB) all the way up to shared_buffers of 256 MB. That seems reasonable to me. likewise (champion bikeshedder here). It just so happens I typically set 'large' server shared memory to 256mb. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane t...@sss.pgh.pa.us wrote: I would be in favor of that, or perhaps some other formula (eg, maybe the minimum should be less than 8 for when you've got very little shmem). I have some results that show that, under the right set of circumstances, 8-32 is a win, and I can quantify by how much it wins. I don't have any data at all to quantify the cost of dropping the minimum from 8-6, or from 8-4, and therefore I'm reluctant to do it. My guess is that it's a bad idea, anyway. Even on a system where shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by 50%), we can increase the number of regular buffers from 1024 to 1028 (i.e. by 0.5%). Maybe you can find a case where that comes out to a win, but you might have to look pretty hard. I think you're rejecting the concept too easily. A setup with very little shmem is only going to be suitable for low-velocity systems that are not pushing too many transactions through per second, so it's not likely to need so many CLOG buffers. And frankly I'm not that concerned about what the performance is like: I'm more concerned about whether PG will start up at all without modifying the system shmem limits, on systems with legacy values for SHMMAX etc. Shaving a few single-purpose buffers to make back what we spent on SSI, for example, seems like a good idea to me. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Dec 20, 2011, at 11:29 PM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: So, what do we do about this? The obvious answer is increase NUM_CLOG_BUFFERS, and I'm not sure that's a bad idea. As you say, that's likely to hurt people running in small shared memory. I too have thought about merging the SLRU areas into the main shared buffer arena, and likewise have concluded that it is likely to be way more painful than it's worth. What I think might be an appropriate compromise is something similar to what we did for autotuning wal_buffers: use a fixed percentage of shared_buffers, with some minimum and maximum limits to ensure sanity. But picking the appropriate percentage would take a bit of research. ISTM that this is based more on number of CPUs rather than total memory, no? Likewise, things like the number of shared buffer partitions would be highly dependent on the number of CPUs. So perhaps we should either probe the number of CPUs on a box, or have a GUC to tell us how many there are... -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Sat, Dec 24, 2011 at 9:25 AM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote: Also, if it is that, what do we do about it? I don't think any of the ideas proposed so far are going to help much. If you don't like guessing, don't guess, don't think. Just measure. Does increasing the number of buffers solve the problems you see? That must be the first port of call - is that enough, or not? If not, we can discuss the various ideas, write patches and measure them. Just in case you want a theoretical prediction to test: increasing NUM_CLOG_BUFFERS should reduce the frequency of the spikes you measured earlier. That should happen proportionally, so as that is increased they will become even less frequent. But increasing the number of buffers will not decrease the impact of each event when it happens. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas robertmh...@gmail.com wrote: You mentioned latency so this morning I ran pgbench with -l and graphed the output. There are latency spikes every few seconds. I'm attaching the overall graph as well as the graph of the last 100 seconds, where the spikes are easier to see clearly. Now, here's the problem: it seems reasonable to hypothesize that the spikes are due to CLOG page replacement since the frequency is at least plausibly right, but this is obviously not enough to prove that conclusively. Ideas? Thanks. That illustrates the effect I explained earlier very clearly, so now we all know I wasn't speculating. Also, if it is that, what do we do about it? I don't think any of the ideas proposed so far are going to help much. If you don't like guessing, don't guess, don't think. Just measure. Does increasing the number of buffers solve the problems you see? That must be the first port of call - is that enough, or not? If not, we can discuss the various ideas, write patches and measure them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Thu, Dec 22, 2011 at 1:04 AM, Simon Riggs si...@2ndquadrant.com wrote: I understand why you say that and take no offence. All I can say is last time I had access to a good test rig and well structured reporting and analysis I was able to see evidence of what I described to you here. I no longer have that access, which is the main reason I've not done anything in the last few years. We both know you do have good access and that's the main reason I'm telling you about it rather than just doing it myself. Right. But I need more details. If I know what to test and how to test it, I can do it. Otherwise, I'm just guessing. I dislike guessing. You mentioned latency so this morning I ran pgbench with -l and graphed the output. There are latency spikes every few seconds. I'm attaching the overall graph as well as the graph of the last 100 seconds, where the spikes are easier to see clearly. Now, here's the problem: it seems reasonable to hypothesize that the spikes are due to CLOG page replacement since the frequency is at least plausibly right, but this is obviously not enough to prove that conclusively. Ideas? Also, if it is that, what do we do about it? I don't think any of the ideas proposed so far are going to help much. Increasing the number of CLOG buffers isn't going to fix the problem that once they're all dirty, you have to write and fsync one before pulling in the next one. Striping might actually make it worse - everyone will move to the next buffer right around the same time, and instead of everybody waiting for one fsync, they'll each be waiting for their own. Maybe the solution is to have the background writer keep an eye on how many CLOG buffers are dirty and start writing them out if the number gets too big. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company attachment: latency.png attachment: latency-end.png -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
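A very rough sketch of the background-writer idea floated at the end of that message. Both helper functions are hypothetical stand-ins, not existing PostgreSQL routines.

    /* Hypothetical sketch: have the background writer flush CLOG buffers
     * once too many are dirty, so foreground backends rarely have to
     * write-and-fsync a victim page at replacement time. */
    #define DIRTY_CLOG_FLUSH_THRESHOLD  4

    extern int  count_dirty_clog_buffers(void);        /* hypothetical helper */
    extern void flush_oldest_dirty_clog_buffer(void);  /* hypothetical helper */

    static void
    bgwriter_maybe_flush_clog(void)
    {
        if (count_dirty_clog_buffers() >= DIRTY_CLOG_FLUSH_THRESHOLD)
            flush_oldest_dirty_clog_buffer();
    }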
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 5:33 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: ... while the main buffer manager is content with some loosey-goosey approximation of recency, the SLRU code makes a fervent attempt at strict LRU (slightly compromised for the sake of reduced locking in SimpleLruReadPage_Readonly). Oh btw, I haven't looked at that code recently, but I have a nasty feeling that there are parts of it that assume that the number of buffers it is managing is fairly small. Cranking up the number might require more work than just changing the value. My memory was that you'd said benchmarks showed NUM_CLOG_BUFFERS needs to be low enough to allow fast lookups, since the lookups don't use an LRU; they just scan all buffers. Indeed, it was your objection that stopped NUM_CLOG_BUFFERS being increased many years before this. With the increased performance we have now, I don't think increasing that alone will be that useful since it doesn't solve all of the problems and (I am told) likely reduces lookup speed. The full list of clog problems I'm aware of is: raw lookup speed, multi-user contention, writes at checkpoint and new xid allocation. Would it be better just to have multiple SLRUs dedicated to the clog? Simply partition things so we have 2^N sets of everything, and we look up the xid in partition (xid % (2^N)). That would overcome all of the problems, not just lookup, in exactly the same way that we partitioned the buffer and lock manager. We would use a graduated offset on the page to avoid zeroing pages at the same time. Clog size wouldn't increase, we'd have the same number of bits, just spread across 2^N files. We'd have more pages too, but that's not a bad thing since it spreads out the contention. Code-wise, those changes would be isolated to clog.c only, probably a day's work if you like the idea. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
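A minimal sketch of the partition selection Simon describes, assuming 2^N partitions each with its own SLRU; the typedef and names are stand-ins for the real PostgreSQL definitions.

    #include <stdint.h>

    /* Sketch only: 2^N-way CLOG partitioning.  Each partition would have
     * its own SLRU buffers and control lock; this just shows the mapping. */
    #define CLOG_PARTITION_BITS  3                        /* 2^3 = 8 partitions */
    #define NUM_CLOG_PARTITIONS  (1 << CLOG_PARTITION_BITS)

    typedef uint32_t TransactionIdSketch;                 /* stand-in for TransactionId */

    static inline int
    clog_partition_for_xid(TransactionIdSketch xid)
    {
        return (int) (xid % NUM_CLOG_PARTITIONS);
    }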
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 12:33 AM, Tom Lane t...@sss.pgh.pa.us wrote: Oh btw, I haven't looked at that code recently, but I have a nasty feeling that there are parts of it that assume that the number of buffers it is managing is fairly small. Cranking up the number might require more work than just changing the value. Oh, you mean like the fact that it tries to do strict LRU page replacement? *rolls eyes* We seem to have named the SLRU system after one of its scalability limitations... I think there probably are some scalability limits to the current implementation, but also I think we could probably increase the current value modestly with something less than a total rewrite. Linearly scanning the slot array won't scale indefinitely, but I think it will scale to more than 8 elements. The performance results I posted previously make it clear that 8 - 32 is a net win at least on that system. One fairly low-impact option might be to make the cache less than fully associative - e.g. given N buffers, a page with pageno % 4 == X is only allowed to be in a slot numbered between (N/4)*X and (N/4)*(X+1)-1. That likely would be counterproductive at N = 8 but might be OK at larger values. We could also switch to using a hash table but that seems awfully heavy-weight. The real question is how to decide how many buffers to create. You suggested a formula based on shared_buffers, but what would that formula be? I mean, a typical large system is going to have 1,048,576 shared buffers, and it probably needs less than 0.1% of that amount of CLOG buffers. My guess is that there's no real reason to skimp: if you are really tight for memory, you might want to crank this down, but otherwise you may as well just go with whatever we decide the best-performing value is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
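A small sketch of the less-than-fully-associative arrangement Robert outlines: with N slots split into 4 buckets, a page may only occupy the slot range belonging to its bucket. Names are illustrative.

    /* Sketch of the slot restriction described above: pages with
     * pageno % 4 == X may only live in slots (N/4)*X .. (N/4)*(X+1)-1.
     * Assumes nslots is a multiple of 4. */
    static void
    slot_range_for_page(int pageno, int nslots, int *lo, int *hi)
    {
        int bucket = pageno % 4;
        int width  = nslots / 4;

        *lo = width * bucket;
        *hi = width * (bucket + 1) - 1;
    }

    /* nslots = 32, pageno = 5 -> bucket 1, slots 8..15
     * nslots = 8,  pageno = 5 -> bucket 1, slots 2..3  (why N = 8 looks too small) */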
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 5:17 AM, Simon Riggs si...@2ndquadrant.com wrote: With the increased performance we have now, I don't think increasing that alone will be that useful since it doesn't solve all of the problems and (I am told) likely reduces lookup speed. I have benchmarks showing that it works, for whatever that's worth. The full list of clog problems I'm aware of is: raw lookup speed, multi-user contention, writes at checkpoint and new xid allocation. What is the best workload to show a bottleneck on raw lookup speed? I wouldn't expect writes at checkpoint to be a big problem because it's so little data. What's the problem with new XID allocation? Would it be better just to have multiple SLRUs dedicated to the clog? Simply partition things so we have 2^N sets of everything, and we look up the xid in partition (xid % (2^N)). That would overcome all of the problems, not just lookup, in exactly the same way that we partitioned the buffer and lock manager. We would use a graduated offset on the page to avoid zeroing pages at the same time. Clog size wouldn't increase, we'd have the same number of bits, just spread across 2^N files. We'd have more pages too, but that's not a bad thing since it spreads out the contention. It seems that would increase memory requirements (clog1 through clog4 with 2 pages each doesn't sound workable). It would also break on-disk compatibility for pg_upgrade. I'm still holding out hope that we can find a simpler solution... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com wrote: Any thoughts on what makes most sense here? I find it fairly tempting to just crank up NUM_CLOG_BUFFERS and call it good, The only thought I have to add to discussion so far is that the need to do anything may be reduced significantly by any work to write hint bits more aggressively. We only consult CLOG for tuples on which hint bits have not yet been set, right? What if, before writing a page, we try to set hint bits where we can? When successful, it would not only prevent one or more later writes of the page, but could also prevent having to load old CLOG pages. Perhaps the hint bit issue should be addressed first, and *then* we check whether we still have a problem with CLOG. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
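A conceptual sketch of Kevin's suggestion, with stand-in types and helpers (the real hint-bit and visibility machinery is considerably more subtle than this).

    /* Conceptual sketch only: before a page is written out, set commit
     * hint bits wherever the CLOG answer is already known, so later
     * readers of the page need not consult CLOG at all.  The structs and
     * the clog_xid_committed() helper are stand-ins, not PostgreSQL API. */
    struct tuple_sketch { unsigned int xmin; int xmin_committed_hint; };
    struct page_sketch  { int ntuples; struct tuple_sketch *tuples; };

    extern int clog_xid_committed(unsigned int xid);    /* hypothetical CLOG probe */

    static void
    hint_page_before_write(struct page_sketch *p)
    {
        for (int i = 0; i < p->ntuples; i++)
        {
            struct tuple_sketch *t = &p->tuples[i];

            if (!t->xmin_committed_hint && clog_xid_committed(t->xmin))
                t->xmin_committed_hint = 1;   /* saves a future CLOG lookup */
        }
        /* the aborted / xmax cases would be handled similarly */
    }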
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 10:51 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: Any thoughts on what makes most sense here? I find it fairly tempting to just crank up NUM_CLOG_BUFFERS and call it good, The only thought I have to add to discussion so far is that the need to do anything may be reduced significantly by any work to write hint bits more aggressively. We only consult CLOG for tuples on which hint bits have not yet been set, right? What if, before writing a page, we try to set hint bits where we can? When successful, it would not only prevent one or more later writes of the page, but could also prevent having to load old CLOG pages. Perhaps the hint bit issue should be addressed first, and *then* we check whether we still have a problem with CLOG. There may be workloads where that will help, but it's definitely not going to cover all cases. Consider my trusty pgbench-at-scale-factor-100 test case: since the working set fits inside shared buffers, we're only writing pages at checkpoint time. The contention happens because we randomly select rows from the table, and whatever row we select hasn't been examined since it was last updated, and so it's unhinted. But we're not reading the page in: it's already in shared buffers, and has never been written out. I don't see any realistic way to avoid the CLOG lookups in that case: nobody else has had any reason to touch that page in any way since the tuple was first written. So I think we need a more general solution. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Excerpts from Robert Haas's message of mié dic 21 13:18:36 -0300 2011: There may be workloads where that will help, but it's definitely not going to cover all cases. Consider my trusty pgbench-at-scale-factor-100 test case: since the working set fits inside shared buffers, we're only writing pages at checkpoint time. The contention happens because we randomly select rows from the table, and whatever row we select hasn't been examined since it was last updated, and so it's unhinted. But we're not reading the page in: it's already in shared buffers, and has never been written out. I don't see any realistic way to avoid the CLOG lookups in that case: nobody else has had any reason to touch that page in any way since the tuple was first written. Maybe we need a background tuple hinter process ... -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: I think there probably are some scalability limits to the current implementation, but also I think we could probably increase the current value modestly with something less than a total rewrite. Linearly scanning the slot array won't scale indefinitely, but I think it will scale to more than 8 elements. The performance results I posted previously make it clear that 8 - 32 is a net win at least on that system. Agreed, the question is whether 32 is enough to fix the problem for anything except this one benchmark. One fairly low-impact option might be to make the cache less than fully associative - e.g. given N buffers, a page with pageno % 4 == X is only allowed to be in a slot numbered between (N/4)*X and (N/4)*(X+1)-1. That likely would be counterproductive at N = 8 but might be OK at larger values. I'm inclined to think that that specific arrangement wouldn't be good. The normal access pattern for CLOG is, I believe, an exponentially decaying probability-of-access for each page as you go further back from current. We have a hack to pin the current (latest) page into SLRU all the time, but you want the design to be such that the next-to-latest page is most likely to still be around, then the second-latest, etc. If I'm reading your equation correctly then the most recent pages would compete against each other, not against much older pages, which is exactly the wrong thing. Perhaps what you actually meant to say was that all pages with the same number mod 4 are in one bucket, which would be better, but still not really ideal: for instance the next-to-latest page could end up getting removed while say the third-latest page is still there because it's in a different associative bucket that's under less pressure. But possibly we could fix that with some other variant of the idea. I certainly agree that strict LRU isn't an essential property here, so long as we have a design that is matched to the expected access pattern statistics. We could also switch to using a hash table but that seems awfully heavy-weight. Yeah. If we're not going to go to hundreds of CLOG buffers, which I think probably wouldn't be useful, then hashing is unlikely to be the best answer. The real question is how to decide how many buffers to create. You suggested a formula based on shared_buffers, but what would that formula be? I mean, a typical large system is going to have 1,048,576 shared buffers, and it probably needs less than 0.1% of that amount of CLOG buffers. Well, something like 0.1% with minimum of 8 and max of 32 might be reasonable. What I'm mainly fuzzy about is the upper limit. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 3:28 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Dec 21, 2011 at 5:17 AM, Simon Riggs si...@2ndquadrant.com wrote: With the increased performance we have now, I don't think increasing that alone will be that useful since it doesn't solve all of the problems and (I am told) likely reduces lookup speed. I have benchmarks showing that it works, for whatever that's worth. The full list of clog problems I'm aware of is: raw lookup speed, multi-user contention, writes at checkpoint and new xid allocation. What is the best workload to show a bottleneck on raw lookup speed? A microbenchmark. I wouldn't expect writes at checkpoint to be a big problem because it's so little data. What's the problem with new XID allocation? Earlier experience shows that those are areas of concern. You aren't measuring response time in your tests, so you won't notice them as problems. But they do affect throughput much more than intuition says they would. Would it be better just to have multiple SLRUs dedicated to the clog? Simply partition things so we have 2^N sets of everything, and we look up the xid in partition (xid % (2^N)). That would overcome all of the problems, not just lookup, in exactly the same way that we partitioned the buffer and lock manager. We would use a graduated offset on the page to avoid zeroing pages at the same time. Clog size wouldn't increase, we'd have the same number of bits, just spread across 2^N files. We'd have more pages too, but that's not a bad thing since it spreads out the contention. It seems that would increase memory requirements (clog1 through clog4 with 2 pages each doesn't sound workable). It would also break on-disk compatibility for pg_upgrade. I'm still holding out hope that we can find a simpler solution... Not sure what you mean by increase memory requirements. How would increasing NUM_CLOG_BUFFERS = 64 differ from having NUM_CLOG_BUFFERS = 8 and NUM_CLOG_PARTITIONS = 8? I think you appreciate that having 8 lwlocks rather than 1 might help scalability. I'm sure pg_upgrade can be tweaked easily enough and it would still work quickly. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote: Agreed, the question is whether 32 is enough to fix the problem for anything except this one benchmark. Right. My thought on that topic is that it depends on what you mean by fix. It's clearly NOT possible to keep enough CLOG buffers around to cover the entire range of XID space that might get probed, at least not without some massive rethinking of the infrastructure. It seems that the amount of space that might need to be covered there is at least on the order of vacuum_freeze_table_age, which is to say 150 million by default. At 32K txns/page, that would require almost 5K pages, which is a lot more than 8. On the other hand, if we just want to avoid having more requests simultaneously in flight than we have buffers, so that backends don't need to wait for an available buffer before beginning their I/O, then something on the order of the number of CPUs in the machine is likely sufficient. I'll do a little more testing and see if I can figure out where the tipping point is on this 32-core box. One fairly low-impact option might be to make the cache less than fully associative - e.g. given N buffers, a page with pageno % 4 == X is only allowed to be in a slot numbered between (N/4)*X and (N/4)*(X+1)-1. That likely would be counterproductive at N = 8 but might be OK at larger values. I'm inclined to think that that specific arrangement wouldn't be good. The normal access pattern for CLOG is, I believe, an exponentially decaying probability-of-access for each page as you go further back from current. We have a hack to pin the current (latest) page into SLRU all the time, but you want the design to be such that the next-to-latest page is most likely to still be around, then the second-latest, etc. If I'm reading your equation correctly then the most recent pages would compete against each other, not against much older pages, which is exactly the wrong thing. Perhaps what you actually meant to say was that all pages with the same number mod 4 are in one bucket, which would be better, That's what I meant. I think the formula works out to that, but in any case it's what I meant. :-) but still not really ideal: for instance the next-to-latest page could end up getting removed while say the third-latest page is still there because it's in a different associative bucket that's under less pressure. Well, sure. But who is to say that's bad? I think you can find a way to throw stones at any given algorithm we might choose to implement. For example, if you contrive things so that you repeatedly access the same old CLOG pages cyclically: 1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,... ...then our existing LRU algorithm will be anti-optimal, because we'll keep the latest page plus the most recently accessed 7 old pages in memory, and every lookup will fault out the page that the next lookup is about to need. If you're not that excited about that happening in real life, neither am I. But neither am I that excited about your scenario: if the next-to-last page gets kicked out, there are a whole bunch of pages -- maybe 8, if you imagine 32 buffers split 4 ways -- that have been accessed more recently than that next-to-last page. So it wouldn't be resident in an 8-buffer pool either. Maybe the last page was mostly transactions updating some infrequently-accessed table, and we don't really need that page right now. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
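The arithmetic behind the "almost 5K pages" figure in the message above, spelled out as a small helper (illustrative, not clog.c code).

    /* Pages needed to keep an entire XID window in CLOG buffers:
     * each 8 KB page covers 8192 * 4 = 32768 xids (2 status bits per xid). */
    static int
    clog_pages_for_xid_window(int window_xids)
    {
        const int xids_per_page = 8192 * 4;

        return (window_xids + xids_per_page - 1) / xids_per_page;
    }

    /* clog_pages_for_xid_window(150000000) == 4578 -- "almost 5K pages",
     * versus only 8 buffers today. */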
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: On Wed, Dec 21, 2011 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote: I'm inclined to think that that specific arrangement wouldn't be good. The normal access pattern for CLOG is, I believe, an exponentially decaying probability-of-access for each page as you go further back from current. ... for instance the next-to-latest page could end up getting removed while say the third-latest page is still there because it's in a different associative bucket that's under less pressure. Well, sure. But who is to say that's bad? I think you can find a way to throw stones at any given algorithm we might choose to implement. The point I'm trying to make is that buffer management schemes like that one are built on the assumption that the probability of access is roughly uniform for all pages. We know (or at least have strong reason to presume) that CLOG pages have very non-uniform probability of access. The straight LRU scheme is good because it deals well with non-uniform access patterns. Dividing the buffers into independent buckets in a way that doesn't account for the expected access probabilities is going to degrade things. (The approach Simon suggests nearby seems isomorphic to yours and so suffers from this same objection, btw.) For example, if you contrive things so that you repeatedly access the same old CLOG pages cyclically: 1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,... Sure, and the reason that that's contrived is that it flies in the face of reasonable assumptions about CLOG access probabilities. Any scheme will lose some of the time, but you don't want to pick a scheme that is more likely to lose for more probable access patterns. It strikes me that one simple thing we could do is extend the current heuristic that says pin the latest page. That is, pin the last K pages into SLRU, and apply LRU or some other method across the rest. If K is large enough, that should get us down to where the differential in access probability among the older pages is small enough to neglect, and then we could apply associative bucketing or other methods to the rest without fear of getting burnt by the common usage pattern. I don't know what K would need to be, though. Maybe it's worth instrumenting a benchmark run or two so we can get some facts rather than guesses about the access frequencies? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
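A sketch of the "pin the last K pages" heuristic Tom proposes: victim selection skips the K most recent CLOG pages and applies plain LRU to everything older. Structures and counters are illustrative, and XID/page wraparound is ignored for simplicity.

    /* Illustrative victim selection: never evict one of the K newest CLOG
     * pages; among the rest, evict the least recently used slot.
     * last_used[i] is the tick at which slot i was last touched
     * (smaller = older); pagenos[i] is the CLOG page held in slot i. */
    #define PROTECTED_RECENT_PAGES  4      /* the K under discussion */

    static int
    choose_victim_slot(int nslots, const int *pagenos,
                       const unsigned int *last_used, int latest_pageno)
    {
        int best = -1;

        for (int i = 0; i < nslots; i++)
        {
            if (latest_pageno - pagenos[i] < PROTECTED_RECENT_PAGES)
                continue;                  /* one of the K newest pages: keep it */
            if (best < 0 || last_used[i] < last_used[best])
                best = i;
        }
        return best;   /* -1 means every resident page is protected */
    }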
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 1:09 PM, Tom Lane t...@sss.pgh.pa.us wrote: It strikes me that one simple thing we could do is extend the current heuristic that says pin the latest page. That is, pin the last K pages into SLRU, and apply LRU or some other method across the rest. If K is large enough, that should get us down to where the differential in access probability among the older pages is small enough to neglect, and then we could apply associative bucketing or other methods to the rest without fear of getting burnt by the common usage pattern. I don't know what K would need to be, though. Maybe it's worth instrumenting a benchmark run or two so we can get some facts rather than guesses about the access frequencies? I guess the point is that it seems to me to depend rather heavily on what benchmark you run. For something like pgbench, we initialize the cluster with one or a few big transactions, so the page containing those XIDs figures to stay hot for a very long time. Then after that we choose rows to update randomly, which will produce the sort of newer-pages-are-hotter-than-older-pages effect that you're talking about. But the slope of the curve depends heavily on the scale factor. If we have scale factor 1 (= 100,000 rows) then chances are that when we randomly pick a row to update, we'll hit one that's been touched within the last few hundred thousand updates - i.e. the last couple of CLOG pages. But if we have scale factor 100 (= 10,000,000 rows) we might easily hit a row that hasn't been updated for many millions of transactions, so there's going to be a much longer tail there. And some other test could yield very different results - e.g. something that uses lots of subtransactions might well have a much longer tail, while something that does more than one update per transaction would presumably have a shorter one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 3:24 PM, Robert Haas robertmh...@gmail.com wrote: I think there probably are some scalability limits to the current implementation, but also I think we could probably increase the current value modestly with something less than a total rewrite. Linearly scanning the slot array won't scale indefinitely, but I think it will scale to more than 8 elements. The performance results I posted previously make it clear that 8 - 32 is a net win at least on that system. Agreed to that, but I don't think it's nearly enough. One fairly low-impact option might be to make the cache less than fully associative - e.g. given N buffers, a page with pageno % 4 == X is only allowed to be in a slot numbered between (N/4)*X and (N/4)*(X+1)-1. That likely would be counterproductive at N = 8 but might be OK at larger values. Which is pretty much the same as saying, yes, let's partition the clog as I suggested, but by a different route. We could also switch to using a hash table but that seems awfully heavy-weight. Which is a re-write of SLRU from the ground up and inappropriate for most SLRU usage. We'd get partitioning for free as long as we re-write. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 2:05 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Dec 21, 2011 at 3:24 PM, Robert Haas robertmh...@gmail.com wrote: I think there probably are some scalability limits to the current implementation, but also I think we could probably increase the current value modestly with something less than a total rewrite. Linearly scanning the slot array won't scale indefinitely, but I think it will scale to more than 8 elements. The performance results I posted previously make it clear that 8 - 32 is a net win at least on that system. Agreed to that, but I don't think its nearly enough. One fairly low-impact option might be to make the cache less than fully associative - e.g. given N buffers, a page with pageno % 4 == X is only allowed to be in a slot numbered between (N/4)*X and (N/4)*(X+1)-1. That likely would be counterproductive at N = 8 but might be OK at larger values. Which is pretty much the same as saying, yes, lets partition the clog as I suggested, but by a different route. We could also switch to using a hash table but that seems awfully heavy-weight. Which is a re-write of SLRU ground up and inapproriate for most SLRU usage. We'd get partitioning for free as long as we re-write. I'm not sure what your point is here. I feel like this is on the edge of turning into an argument, and if we're going to have an argument I'd like to know what we're arguing about. I am not arguing that under no circumstances should we partition anything related to CLOG, nor am I trying to deny you credit for your ideas. I'm merely saying that the specific plan of having multiple SLRUs for CLOG doesn't appeal to me -- mostly because I think it will make life difficult for pg_upgrade without any compensating advantage. If we're going to go that route, I'd rather build something into the SLRU machinery generally that allows for the cache to be less than fully-associative, with all of the savings in terms of lock contention that this entails. Such a system could be used by any SLRU, not just CLOG, if it proved to be helpful; and it would avoid any on-disk changes, with, as far as I can see, basically no downside. That having been said, Tom isn't convinced that any form of partitioning is the right way to go, and since Tom often has good ideas, I'd like to explore his notions of how we might fix this problem other than via some form of partitioning before we focus in on partitioning. Partitioning may ultimately be the right way to go, but let's keep an open mind: this thread is only 14 hours old. The only things I'm completely convinced of at this point are (1) we need more CLOG buffers (but I don't know exactly how many) and (2) the current code isn't designed to manage large numbers of buffers (but I don't know exactly where it starts to fall over). If I'm completely misunderstanding the point of your email, please set me straight (gently). Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 12:48 PM, Robert Haas robertmh...@gmail.com wrote: On the other hand, if we just want to avoid having more requests simultaneously in flight than we have buffers, so that backends don't need to wait for an available buffer before beginning their I/O, then something on the order of the number of CPUs in the machine is likely sufficient. I'll do a little more testing and see if I can figure out where the tipping point is on this 32-core box. I recompiled with NUM_CLOG_BUFFERS = 8, 16, 24, 32, 40, 48 and ran 5-minute tests, using unlogged tables to avoid getting killed by WALInsertLock contentions. With 32-clients on this 32-core box, the tipping point is somewhere in the neighborhood of 32 buffers. 40 buffers might still be winning over 32, or maybe not, but 48 is definitely losing. Below 32, more is better, all the way up. Here are the full results: resultswu.clog16.32.100.300:tps = 19549.454462 (including connections establishing) resultswu.clog16.32.100.300:tps = 19883.583245 (including connections establishing) resultswu.clog16.32.100.300:tps = 19984.857186 (including connections establishing) resultswu.clog24.32.100.300:tps = 20124.147651 (including connections establishing) resultswu.clog24.32.100.300:tps = 20108.504407 (including connections establishing) resultswu.clog24.32.100.300:tps = 20303.964120 (including connections establishing) resultswu.clog32.32.100.300:tps = 20573.873097 (including connections establishing) resultswu.clog32.32.100.300:tps = 20444.289259 (including connections establishing) resultswu.clog32.32.100.300:tps = 20234.209965 (including connections establishing) resultswu.clog40.32.100.300:tps = 21762.222195 (including connections establishing) resultswu.clog40.32.100.300:tps = 20621.749677 (including connections establishing) resultswu.clog40.32.100.300:tps = 20290.990673 (including connections establishing) resultswu.clog48.32.100.300:tps = 19253.424997 (including connections establishing) resultswu.clog48.32.100.300:tps = 19542.095191 (including connections establishing) resultswu.clog48.32.100.300:tps = 19284.962036 (including connections establishing) resultswu.master.32.100.300:tps = 18694.886622 (including connections establishing) resultswu.master.32.100.300:tps = 18417.647703 (including connections establishing) resultswu.master.32.100.300:tps = 18331.718955 (including connections establishing) Parameters in use: shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms It isn't clear to me whether we can extrapolate anything more general from this. It'd be awfully interesting to repeat this experiment on, say, an 8-core server, but I don't have one of those I can use at the moment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
On Wed, Dec 21, 2011 at 4:17 PM, Simon Riggs si...@2ndquadrant.com wrote: Partitioning will give us more buffers and more LWlocks, to spread the contention when we access the buffers. I use that word because its what we call the technique already used in the buffer manager and lock manager. If you wish to call this less than fully-associative I really don't mind, as long as we're discussing the same overall concept, so we can then focus on an implementation of that concept, which no doubt has many ways of doing it. More buffers per lock does reduce the lock contention somewhat, but not by much. So for me, it seems essential that we have more LWlocks to solve the problem, which is where partitioning comes in. My perspective is that there is clog contention in many places, not just in the ones you identified. Well, that's possible. The locking in slru.c is pretty screwy and could probably benefit from better locking granularity. One point worth noting is that the control lock for each SLRU protects all the SLRU buffer mappings and the contents of all the buffers; in the main buffer manager, those responsibilities are split across BufFreelistLock, 16 buffer manager partition locks, one content lock per buffer, and the buffer header spinlocks. (The SLRU per-buffer locks are the equivalent of the I/O-in-progresss locks, not the content locks.) So splitting up CLOG into multiple SLRUs might not be the only way of improving the lock granularity; the current situation is almost comical. But on the flip side, I feel like your discussion of the problems is a bit hand-wavy. I think we need some real test cases that we can look at and measure, not just an informal description of what we think is happening. I'm sure, for example, that repeatedly reading different CLOG pages costs something - but I'm not sure that it's enough to have a material impact on performance. And if it doesn't, then we'd be better off leaving it alone and working on things that do. And if it does, then we need a way to assess how successful any given approach is in addressing that problem, so we can decide which of various proposed approaches is best. * We allocate a new clog page every 32k xids. At the rates you have now measured, we will do this every 1-2 seconds. And a new pg_subtrans page quite a bit more frequently than that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
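For comparison, a sketch (not the real slru.h) of how SLRU locking could be split more finely along the lines Robert describes for the main buffer manager; "lock_t" is a stand-in for whatever lock type would actually be used.

    /* Sketch only: today a single control lock guards both the
     * pageno -> slot mapping and every buffer's contents.  A finer split,
     * mirroring the main buffer manager, might look like this. */
    typedef struct lock_t lock_t;           /* stand-in lock type */

    typedef struct SlruSharedSketch
    {
        lock_t *mapping_lock;               /* guards the page -> slot mapping  */
        lock_t *content_locks[32];          /* one content lock per buffer slot */
        lock_t *io_locks[32];               /* per-buffer I/O-in-progress locks
                                             * (the real code already has these) */
    } SlruSharedSketch;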
Re: [HACKERS] CLOG contention
On Thu, Dec 22, 2011 at 12:28 AM, Robert Haas robertmh...@gmail.com wrote: But on the flip side, I feel like your discussion of the problems is a bit hand-wavy. I think we need some real test cases that we can look at and measure, not just an informal description of what we think is happening. I understand why you say that and take no offence. All I can say is last time I had access to a good test rig and well structured reporting and analysis I was able to see evidence of what I described to you here. I no longer have that access, which is the main reason I've not done anything in the last few years. We both know you do have good access and that's the main reason I'm telling you about it rather than just doing it myself. * We allocate a new clog page every 32k xids. At the rates you have now measured, we will do this every 1-2 seconds. And a new pg_subtrans page quite a bit more frequently than that. It is less of a concern, all the same. In most cases we can simply drop pg_subtrans pages (though we don't do that as often as we could), no fsync is required on write, no WAL record required for extension and no update required at commit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] CLOG contention
A few weeks ago I posted some performance results showing that increasing NUM_CLOG_BUFFERS was improving pgbench performance. http://archives.postgresql.org/pgsql-hackers/2011-12/msg00095.php I spent some time today looking at this in a bit more detail. Somewhat obviously in retrospect, it turns out that the problem becomes more severe the longer you run the test. CLOG lookups are induced when we go to update a row that we've previously updated. When the test first starts, just after pgbench -i, all the rows are hinted and, even if they weren't, they all have the same XID. So no problem. But, as the fraction of rows that have been updated increases, it becomes progressively more likely that the next update will hit a row that's already been updated. Initially, that's OK, because we can keep all the CLOG pages of interest in the 8 available buffers. But once we've eaten through enough XIDs - specifically, 8 buffers * 8192 bytes/buffer * 4 xids/byte = 256k - we can't keep all the necessary pages in memory at the same time, and so we have to keep replacing CLOG pages. This effect is not difficult to see even on my 2-core laptop, although I'm not sure whether it causes any material performance degradation. If you have enough concurrent tasks, a probably-more-serious form of starvation can occur. As SlruSelectLRUPage notes: /* * We need to wait for I/O. Normal case is that it's dirty and we * must initiate a write, but it's possible that the page is already * write-busy, or in the worst case still read-busy. In those cases * we wait for the existing I/O to complete. */ On Nate Boley's 32-core box, after running pgbench for a few minutes, that worst-case scenario starts happening quite regularly, apparently because the number of people who simultaneously wish to read different CLOG pages exceeds the number of available buffers into which they can be read. The ninth and following backends to come along have to wait until the least-recently-used page is no longer read-busy before starting their reads. So, what do we do about this? The obvious answer is increase NUM_CLOG_BUFFERS, and I'm not sure that's a bad idea. 64kB is a pretty small cache on anything other than an embedded system, these days. We could either increase the hard-coded value, or make it configurable - but it would have to be PGC_POSTMASTER, since there's no way to allocate more shared memory later on. The downsides of this approach are: 1. If we make it configurable, nobody will have a clue what value to set. 2. If we just make it bigger, people laboring under the default 32MB shared memory limit will conceivably suffer even more than they do now if they just initdb and go. A more radical approach would be to try to merge the buffer arenas for the various SLRUs either with each other or with shared_buffers, which would presumably allow a lot more flexibility to ratchet the number of CLOG buffers up or down depending on overall memory pressure. 
Merging the buffer arenas into shared_buffers seems like the most flexible solution, but it also seems like a big, complex, error-prone behavior change, because the SLRU machinery does things quite differently from shared_buffers: we look up buffers with a linear array search rather than a hash table probe; we have only a per-SLRU lock and a per-page lock, rather than separate mapping locks, content locks, io-in-progress locks, and pins; and while the main buffer manager is content with some loosey-goosey approximation of recency, the SLRU code makes a fervent attempt at strict LRU (slightly compromised for the sake of reduced locking in SimpleLruReadPage_Readonly). Any thoughts on what makes most sense here? I find it fairly tempting to just crank up NUM_CLOG_BUFFERS and call it good, but the siren song of refactoring is whispering in my other ear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
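The capacity arithmetic from the start of this message, expressed as constants for clarity (illustrative only, not identifiers from clog.c).

    /* How many XIDs the CLOG buffer pool can cover before replacement
     * must begin: buffers * bytes/page * xids/byte. */
    #define CLOG_BUFFERS_SKETCH   8
    #define CLOG_PAGE_BYTES       8192
    #define CLOG_XIDS_PER_BYTE    4        /* 2 status bits per xid */

    /* 8 * 8192 * 4 = 262144, i.e. roughly 256k XIDs -- once the set of
     * recently-touched XIDs grows past that, CLOG pages start getting
     * evicted and re-read. */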
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: So, what do we do about this? The obvious answer is increase NUM_CLOG_BUFFERS, and I'm not sure that's a bad idea. As you say, that's likely to hurt people running in small shared memory. I too have thought about merging the SLRU areas into the main shared buffer arena, and likewise have concluded that it is likely to be way more painful than it's worth. What I think might be an appropriate compromise is something similar to what we did for autotuning wal_buffers: use a fixed percentage of shared_buffers, with some minimum and maximum limits to ensure sanity. But picking the appropriate percentage would take a bit of research. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CLOG contention
Robert Haas robertmh...@gmail.com writes: ... while the main buffer manager is content with some loosey-goosey approximation of recency, the SLRU code makes a fervent attempt at strict LRU (slightly compromised for the sake of reduced locking in SimpleLruReadPage_Readonly). Oh btw, I haven't looked at that code recently, but I have a nasty feeling that there are parts of it that assume that the number of buffers it is managing is fairly small. Cranking up the number might require more work than just changing the value. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers