Re: [HACKERS] 2nd Level Buffer Cache
Josh Berkus wrote:
> Was it really all that bad? IIRC we replaced ARC with the current clock sweep due to patent concerns. (Maybe there were performance concerns as well, I don't remember.)
>
> Yeah, that was why the patent was frustrating. Performance was poor and we were planning on replacing ARC in 8.2 anyway. Instead we had to backport it.

[Replying late.] FYI, the performance problem was that while ARC was slightly better than clock sweep at keeping useful buffers in the cache, it was terrible when multiple CPUs were all modifying the buffer cache, which is why we were going to remove it anyway. In summary, any new algorithm has to be better at keeping useful data in the cache, and also not slow down workloads on multiple CPUs.

-- Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB  http://enterprisedb.com
+ It's impossible for everything to be true. +

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
On 03/24/2011 03:36 PM, Jim Nasby wrote:
> On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
>> Robert Haas robertmh...@gmail.com writes:
>>> It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good.
>> Why not? AIUI the free list is only for buffers that are totally dead, ie contain no info that's possibly of interest to anybody. It is *not* meant to substitute for running the clock sweep when you have to discard a live buffer.
> Turns out we've had this discussion before: http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php
> Investigating this has been on the TODO list for four years now: http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php

I feel that work in this area is blocked behind putting together a decent mix of benchmarks that can be used to test whether changes here are actually good or bad. All of the easy changes to buffer allocation strategy, ones that you could verify by inspection and simple tests, were made in 8.3. The stuff that's left has the potential to either improve or reduce performance, and which will happen is very workload dependent. Setting up systematic benchmarks of multiple workloads to run continuously on big hardware is a large, boring, expensive problem that few can justify financing (except for Jim of course), and even fewer want to volunteer time toward. This whole discussion of cache policy tweaks is fun, but I just delete all the discussion now because it's just going in circles without a good testing regime. The right way to start is by saying: this is the benchmark I'm going to improve with this change, and it has a profiled hotspot at this point.
-- Greg Smith  2ndQuadrant US  g...@2ndquadrant.com  Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 25, 2011 at 8:07 AM, Gurjeet Singh singh.gurj...@gmail.com wrote:
> On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas robertmh...@gmail.com wrote:
>> On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:
>>> On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
>>>> A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter.
>>> Under what circumstances does this happen? Can a simple pgbench -S with a large scaling factor elicit this behavior?
>> To be honest, I'm mostly just reporting what I've heard Greg Smith say on this topic. I don't have any machine with that kind of RAM.
> I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform). That's the largest memory AWS has.

Does AWS have machines with battery-backed write cache? I think people running servers with 192G probably have BBWC, so it may be hard to do realistic tests without also having one on the test machine. But probably a bigger problem is that (to the best of my knowledge) we don't seem to have a non-proprietary, generally implementable benchmark system or load-generator which is known to demonstrate the problem.

Cheers, Jeff
Re: [HACKERS] 2nd Level Buffer Cache
On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas robertmh...@gmail.com wrote:
> On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:
>> On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
>>> On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
>>>> Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained.
>>> A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter.
>> Under what circumstances does this happen? Can a simple pgbench -S with a large scaling factor elicit this behavior?
> To be honest, I'm mostly just reporting what I've heard Greg Smith say on this topic. I don't have any machine with that kind of RAM.

I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform). That's the largest memory AWS has. Let me know if I can help.

Regards,
-- Gurjeet Singh  EnterpriseDB Corporation  The Enterprise PostgreSQL Company
Re: [HACKERS] 2nd Level Buffer Cache
On Mar 25, 2011, at 10:07 AM, Gurjeet Singh wrote:
> [...]
> I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform). That's the largest memory AWS has.

Related to that... after talking to Greg Smith at PGEast last night, he felt it would be very valuable just to profile how much time is being spent waiting on / holding the freelist lock in a real environment. I'm going to see if we can do that on one of our slave databases.

-- Jim C. Nasby, Database Architect  j...@nasby.net  512.569.9461 (cell)  http://jim.nasby.net
Re: [HACKERS] 2nd Level Buffer Cache
On Thu, Mar 24, 2011 at 7:51 PM, Greg Stark gsst...@mit.edu wrote:
> On Thu, Mar 24, 2011 at 11:33 PM, Jeff Janes jeff.ja...@gmail.com wrote:
>> I tried under the circumstances I thought were most likely to show a time difference, and I was unable to detect a reliable difference in timing between free list and clock sweep.
> It strikes me that it shouldn't be terribly hard to add a profiling option to Postgres to dump out a list of precisely which blocks of data were accessed in which order. Then it's fairly straightforward to process that list using different algorithms to measure which generates the fewest cache misses.

It is pretty easy to get the list by adding a couple of elogs. To be safe you probably also need to record pins and unpins, as you can't evict a pinned buffer no matter how otherwise eligible it might be. For most workloads you might be able to get away with just assuming that if a buffer is eligible for replacement under any reasonable strategy, then it is very unlikely to still be pinned. Also, if the list is derived from a concurrent environment, then the order of access you see under a particular policy might no longer be the same if a different policy were adopted.

But whose workload would you use to do the testing? The ones I was testing were simple enough that I just know what the access pattern is: the root and first-level branch blocks are almost always in shared buffers, the leaf and table blocks almost never are.

Here my concern was not how to choose which block to replace in a conceptual way, but rather how to code that selection in a way that is fast, concurrent, and low latency for the latency-sensitive processes. Either method will evict the same blocks, with the exception of differences introduced by race conditions that get resolved differently. A benefit of focusing on the implementation rather than the high-level selection strategy is that improvements in implementation are more likely to carry over to other workloads.
My high-level conclusions were that the running of the selection is generally not a bottleneck, and in the cases where it was, the bottleneck was due to contention on the LWLock, regardless of what was done under that lock. Changing who does the clock sweep is probably not meaningful unless it facilitates a lock-strength reduction or other contention reduction.

I have also played with simulations of different algorithms for managing the usage_count, and I could get improvements, but they weren't big enough or general enough to be very exciting. It was generally the case where, if the data size was X, the improvement was maybe 30% over the current code, but if the data size was 0.8X or 1.2X, there was no difference. So not very general.

Cheers, Jeff
Re: [HACKERS] 2nd Level Buffer Cache
On Mar 25, 2011, at 11:58 AM, Jim Nasby j...@nasby.net wrote:
> Related to that... after talking to Greg Smith at PGEast last night, he felt it would be very valuable just to profile how much time is being spent waiting on / holding the freelist lock in a real environment. I'm going to see if we can do that on one of our slave databases.

Yeah, that would be great. Also, some LWLOCK_STATS output or oprofile output would definitely be useful.

...Robert
Re: [HACKERS] 2nd Level Buffer Cache
On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
> Robert Haas robertmh...@gmail.com writes:
>> It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good.
> Why not? AIUI the free list is only for buffers that are totally dead, ie contain no info that's possibly of interest to anybody. It is *not* meant to substitute for running the clock sweep when you have to discard a live buffer.

Turns out we've had this discussion before: http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php

Tom made the point in the first one that it might be good to proactively move buffers to the freelist so that backends would normally just have to hit the freelist and not run the sweep. Unfortunately I haven't yet been able to do any performance testing of any of this... perhaps someone else can try and measure the amount of time spent by backends running the clock sweep with different shared buffer sizes.

-- Jim C. Nasby, Database Architect  j...@nasby.net  512.569.9461 (cell)  http://jim.nasby.net
Re: [HACKERS] 2nd Level Buffer Cache
Jim Nasby j...@nasby.net Thursday 24 March 2011 20:36:48
> [...]
> Tom made the point in the first one that it might be good to proactively move buffers to the freelist so that backends would normally just have to hit the freelist and not run the sweep. Unfortunately I haven't yet been able to do any performance testing of any of this... perhaps someone else can try and measure the amount of time spent by backends running the clock sweep with different shared buffer sizes.

Would it not be enough to take a spin lock (or use a locked increment instruction on Intel/AMD) around the increment of StrategyControl->nextVictimBuffer? Everything here could be wrapped in a macro GetNextVictimBuffer(). Within the for (;;) loop the valid buffer may then be obtained modulo NBuffers, to decrease lock time. We may try to calculate how many buffers were skipped and decrease e.g. trycount by this value, and put in some additional restriction, like no more than NBuffers*4 calls before reporting an error. This would make the clock sweep concurrent.

Regards, Radek
Re: [HACKERS] 2nd Level Buffer Cache
On Wed, Mar 23, 2011 at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good.
> Why not? AIUI the free list is only for buffers that are totally dead, ie contain no info that's possibly of interest to anybody. It is *not* meant to substitute for running the clock sweep when you have to discard a live buffer.

It seems at least plausible that buffer allocation could be significantly faster if it need only pop the head of a list, rather than scanning until it finds a suitable candidate. Moving as much buffer allocation work as possible into the background seems like it ought to be useful. Granted, I've made no attempt to code or test this.

-- Robert Haas  EnterpriseDB: http://www.enterprisedb.com  The Enterprise PostgreSQL Company
Re: [HACKERS] 2nd Level Buffer Cache
On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas robertmh...@gmail.com wrote:
> It seems at least plausible that buffer allocation could be significantly faster if it need only pop the head of a list, rather than scanning until it finds a suitable candidate. Moving as much buffer allocation work as possible into the background seems like it ought to be useful.

Linked lists are notoriously non-concurrent; that's the whole reason for the clock sweep algorithm to exist at all instead of just using an LRU directly. That said, an LRU needs to be able to remove elements from the middle and not just enqueue elements on the tail, so the situation isn't exactly equivalent. Just popping off the head is simple enough, but the bgwriter would need to be able to add elements to the tail of the list, and the people popping elements off the head would need to compete with it for the lock on the list. And I think you need a single lock for the whole list because of the cases where the list becomes a single element or empty.

The main impact this list would have is that it would presumably need some real number of buffers to satisfy the pressure for victim buffers for a real amount of time. That would represent a decrease in cache size, effectively evicting buffers from cache as if the cache were smaller by that amount. Theoretical results are that a small change in cache size affects cache hit rates substantially. I'm not sure that's borne out by practical experience with Postgres though. People tend to either be doing mostly I/O or very little I/O. Cache hit rate only really matters, and is likely to be affected by small changes in cache size, in the space in between.

-- greg
Re: [HACKERS] 2nd Level Buffer Cache
On Thu, Mar 24, 2011 at 5:34 PM, Greg Stark gsst...@mit.edu wrote:
> [...]
> The main impact this list would have is that it would presumably need some real number of buffers to satisfy the pressure for victim buffers for a real amount of time. That would represent a decrease in cache size, effectively evicting buffers from cache as if the cache were smaller by that amount.

You wouldn't really have to reduce the effective cache size - there's logic in there to just skip to the next buffer if the first one you pull off the freelist has been reused.
But the cache hit rates on those buffers would (you'd hope) be fairly low, since they are the ones we're about to reuse. Maybe it doesn't work out to a win, though.

-- Robert Haas  EnterpriseDB: http://www.enterprisedb.com  The Enterprise PostgreSQL Company
Re: [HACKERS] 2nd Level Buffer Cache
Robert Haas robertmh...@gmail.com Thursday 24 March 2011 22:41:19
> [...]
> You wouldn't really have to reduce the effective cache size - there's logic in there to just skip to the next buffer if the first one you pull off the freelist has been reused.
> But the cache hit rates on those buffers would (you'd hope) be fairly low, since they are the ones we're about to reuse. Maybe it doesn't work out to a win, though.

If I may: under abnormal circumstances (like the current process being preempted by the kernel while holding the lock), obtaining a buffer from the list may be cheaper. This code:

    while (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
        ...
    }

could look like this:

    do
    {
        SpinLockAcquire(&StrategyControl->lock);
        if (StrategyControl->firstFreeBuffer < 0)
        {
            SpinLockRelease(&StrategyControl->lock);
            break;
        }
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
        SpinLockRelease(&StrategyControl->lock);
        ...
    } while (true);

Acquiring a spin lock for the linked list is enough, and cheaper than taking an LWLock, which is more complex than spinning on this. After this, similarly with a spin lock on the clock sweep:

    trycounter = NBuffers * 4;
    for (;;)
    {
        SpinLockAcquire(&StrategyControl->lock);
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        SpinLockRelease(&StrategyControl->lock);
        ...
    }
Re: [HACKERS] 2nd Level Buffer Cache
On Thu, Mar 24, 2011 at 12:36 PM, Jim Nasby j...@nasby.net wrote:
> [...]
> Tom made the point in the first one that it might be good to proactively move buffers to the freelist so that backends would normally just have to hit the freelist and not run the sweep. Unfortunately I haven't yet been able to do any performance testing of any of this... perhaps someone else can try and measure the amount of time spent by backends running the clock sweep with different shared buffer sizes.

I tried under the circumstances I thought were most likely to show a time difference, and I was unable to detect a reliable difference in timing between free list and clock sweep.

Cheers, Jeff
Re: [HACKERS] 2nd Level Buffer Cache
On Thu, Mar 24, 2011 at 11:33 PM, Jeff Janes jeff.ja...@gmail.com wrote:
> I tried under the circumstances I thought were most likely to show a time difference, and I was unable to detect a reliable difference in timing between free list and clock sweep.

It strikes me that it shouldn't be terribly hard to add a profiling option to Postgres to dump out a list of precisely which blocks of data were accessed in which order. Then it's fairly straightforward to process that list using different algorithms to measure which generates the fewest cache misses. This is usually how the topic is handled in academic discussions. The optimal cache policy is the one which flushes the cache entry whose next use lies furthest in the future. Given a precalculated file you can calculate the results from that optimal strategy and then compare each strategy against it.

-- greg
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16
> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote:
>> On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote:
>>> Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html), compatibility issues (possibly obsolete), etc.
>> I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere, so it needs a data structure that represents the size of the locked section, and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though, as the performance may depend on the memory management features of the chipset.
>> That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere.
> Even on 64 bit systems you only have 48 bit address space, which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks.
>> But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard.
>> What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time.
> I wonder if there is any reason to mlock at all... if you are going to 'do' mmap, can't you just hide under the current lock architecture for actual locking and do direct memory access without mlock?
> merlin

I can't reproduce this. A simple test shows 2x faster reads with mmap than with read(). I'm sending what I've done with mmap (really ugly, but I'm in a forest). It is a read-only solution, so init the database first with some amount of data (I have about 300MB) (the 2nd-level scripts may do this for you). This is what I found:

1. If I do not require the new mmap to be placed in the previous region (mmap with FIXED) and instead just do munmap / mmap with each query, execution time grows, about 10%.

2. Sometimes it is enough just to comment or uncomment something that does not have side effects on code flow (bufmgr.c; (un)comment some unused if; put NULL, it will be replaced), and e.g. query execution time may grow 2x.

3. My initial solution was 2% faster, about 9ms when reading; now it's 10% slower, after making it more usable.

Regards, Radek

pg_mmap_20110323.patch.bz2 Description: application/bzip
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Tuesday 22 March 2011 23:06:02
> On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura rsmog...@softperience.eu wrote:
>> [...]
>> Actually, after dealing with mmap and adding munmap, I found a crucial reason not to use mmap: you need to munmap, and for me this takes much time, even if I read with SHARED | PROT_READ. It looks like Linux does a flush or something else, same as with MAP_FIXED, MAP_PRIVATE, etc.
> can you produce a small program demonstrating the problem? This is not how things should work AIUI. I was thinking about playing with an mmap implementation of the clog system -- it's perhaps a better fit. clog is rigidly defined in size, and has very high performance requirements. Also it's much less change than reimplementing heap buffering, and maybe not so much affected by munmap.
> merlin

Ah... just one thing, maybe useful as to why performance is lost with huge memory: I saw the mmaped buffers are allocated at addresses like 0x007..., so definitely above 4GB.
Re: [HACKERS] 2nd Level Buffer Cache
On Mar 22, 2011, at 2:53 PM, Robert Haas wrote: On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained. A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter. Under what circumstances does this happen? Can a simple pgbench -S with a large scaling factor elicit this behavior? To be honest, I'm mostly just reporting what I've heard Greg Smith say on this topic. I don't have any machine with that kind of RAM. When we started using 192G servers we tried switching our largest OLTP database (would have been about 1.2TB at the time) from 8GB shared buffers to 28GB. Performance went down enough to notice; I don't have any solid metrics, but I'd ballpark it at 10-15%. One thing that I've always wondered about is the logic of having backends run the clocksweep on a normal basis. OS's that use clock-sweep have a dedicated process to run the clock in the background, with the intent of keeping X amount of pages on the free list. We actually have most of the mechanisms to do that, we just don't have the added process. I believe bg_writer was intended to handle that, but in reality I don't think it actually manages to keep much of anything on the free list. 
Once we have a performance testing environment I'd be interested to test a modified version that includes a dedicated background clock sweep process that strives to keep X amount of buffers on the free list. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
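A dedicated background sweep process of the kind Jim describes would, in outline, run the clock sweep whenever the free list drops below some target. The sketch below is a toy model (the structures and the per-wakeup work bound are invented for illustration, not taken from PostgreSQL's buffer manager):

```python
from collections import deque

class BufferPool:
    """Toy clock sweep: every buffer holds data; usage_count decays as
    the hand passes, and a buffer reaching zero goes on the free list."""
    def __init__(self, usage_counts):
        self.usage = list(usage_counts)   # usage_count per buffer
        self.free = deque()               # buffers ready for reuse
        self.hand = 0

    def sweep_once(self):
        buf = self.hand
        self.hand = (self.hand + 1) % len(self.usage)
        if self.usage[buf] == 0:
            if buf not in self.free:
                self.free.append(buf)     # reusable without I/O
        else:
            self.usage[buf] -= 1          # decay; survive this pass

def background_sweeper(pool, target, max_work=None):
    """One wakeup of a hypothetical dedicated sweep process: run the
    clock until `target` buffers are free, bounding the work done."""
    if max_work is None:
        max_work = 10 * len(pool.usage)
    while len(pool.free) < target and max_work > 0:
        pool.sweep_once()
        max_work -= 1
    return len(pool.free)
```

A buffer with usage count n survives n passes of the hand before it is freed, so the background process is simply pre-paying sweep work that foreground backends would otherwise do at allocation time.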
Re: [HACKERS] 2nd Level Buffer Cache
On Wed, Mar 23, 2011 at 1:53 PM, Jim Nasby j...@nasby.net wrote: When we started using 192G servers we tried switching our largest OLTP database (would have been about 1.2TB at the time) from 8GB shared buffers to 28GB. Performance went down enough to notice; I don't have any solid metrics, but I'd ballpark it at 10-15%. One thing that I've always wondered about is the logic of having backends run the clocksweep on a normal basis. OS's that use clock-sweep have a dedicated process to run the clock in the background, with the intent of keeping X amount of pages on the free list. We actually have most of the mechanisms to do that, we just don't have the added process. I believe bg_writer was intended to handle that, but in reality I don't think it actually manages to keep much of anything on the free list. Once we have a performance testing environment I'd be interested to test a modified version that includes a dedicated background clock sweep process that strives to keep X amount of buffers on the free list. It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good. I wonder if the background writer shouldn't be trying to maintain the free list. That is, perhaps BgBufferSync() should notice when the number of free buffers drops below some threshold, and run the clock sweep enough to get it back up to that threshold. On a related note, I've been thinking about whether we could make bgwriter_delay adaptively self-tuning. If we notice that we overslept, we don't sleep as long the next time; if not much happens while we sleep, we sleep longer the next time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
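Robert's adaptive bgwriter_delay idea can be stated in a few lines; the multipliers and clamps below are invented for illustration, not proposed values, and nothing here is actual BgBufferSync code:

```python
def adapt_delay(delay_ms, slept_ms, work_done, min_ms=10, max_ms=10000):
    """Adjust the bgwriter sleep: shorten it when we overslept (woke
    late and presumably found work piled up), lengthen it when the
    previous sleep passed with nothing to do."""
    if slept_ms > 1.5 * delay_ms:   # overslept: react faster next time
        delay_ms /= 2
    elif not work_done:             # idle cycle: back off
        delay_ms *= 2
    return max(min_ms, min(max_ms, delay_ms))
```

The clamps matter: without a floor the loop can spin, and without a ceiling an idle system stops reacting to a sudden burst of activity.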
Re: [HACKERS] 2nd Level Buffer Cache
On Wed, Mar 23, 2011 at 8:00 PM, Robert Haas robertmh...@gmail.com wrote: It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good. I wonder if the background writer shouldn't be trying to maintain the free list. That is, perhaps BgBufferSync() should notice when the number of free buffers drops below some threshold, and run the clock sweep enough to get it back up to that threshold. I think this is just a terminology discrepancy. In postgres the free list is only used for buffers that contain no useful data at all. The only time there are buffers on the free list is at startup or if a relation or database is dropped. Most of the time blocks are read into buffers that already contain other data. Candidate buffers to evict are buffers that have been used least recently. That's what the clock sweep implements. What the bgwriter's responsible for is looking at the buffers *ahead* of the clock sweep and flushing them to disk. They stay in ram and don't go on the free list; all that changes is that they're clean and therefore can be reused without having to do any i/o. I'm a bit skeptical that this works because as soon as bgwriter saturates the i/o the os will throttle the rate at which it can write. When that happens even a few dozen milliseconds will be plenty to allow the purely user-space processes consuming the buffers to catch up instantly. But Greg Smith has done a lot of work tuning the bgwriter so that it is at least useful in some circumstances. I could well see it being useful for systems where latency matters and the i/o is not saturated. -- greg
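Greg's description of the bgwriter's job, writing out dirty buffers just ahead of the sweep hand so a later eviction needs no I/O while the buffers themselves stay in RAM, can be sketched as follows (a toy model, not the real BgBufferSync):

```python
def clean_ahead_of_hand(dirty, hand, distance):
    """Flush (mark clean) the `distance` buffers the clock hand will
    reach next.  The buffers keep their contents; only the dirty flag
    changes, so when the sweep later evicts one, no write is needed."""
    n = len(dirty)
    writes = 0
    for i in range(distance):
        buf = (hand + i) % n
        if dirty[buf]:
            dirty[buf] = False   # stands in for the actual write to disk
            writes += 1
    return writes
```

Note that nothing here touches usage counts or the free list; the win is purely that the eventual eviction of these buffers becomes write-free.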
Re: [HACKERS] 2nd Level Buffer Cache
Greg Stark gsst...@mit.edu Wednesday 23 March 2011 21:30:04 On Wed, Mar 23, 2011 at 8:00 PM, Robert Haas robertmh...@gmail.com wrote: It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good. I wonder if the background writer shouldn't be trying to maintain the free list. That is, perhaps BgBufferSync() should notice when the number of free buffers drops below some threshold, and run the clock sweep enough to get it back up to that threshold. I think this is just a terminology discrepancy. In postgres the free list is only used for buffers that contain no useful data at all. The only time there are buffers on the free list is at startup or if a relation or database is dropped. Most of the time blocks are read into buffers that already contain other data. Candidate buffers to evict are buffers that have been used least recently. That's what the clock sweep implements. What the bgwriter's responsible for is looking at the buffers *ahead* of the clock sweep and flushing them to disk. They stay in ram and don't go on the free list, all that changes is that they're clean and therefore can be reused without having to do any i/o. I'm a bit skeptical that this works because as soon as bgwriter saturates the i/o the os will throttle the rate at which it can write. When that happens even a few dozens of milliseconds will be plenty to allow the purely user-space processes consuming the buffers to catch up instantly. But Greg Smith has done a lot of work tuning the bgwriter so that it is at least useful in some circumstances. I could well see it being useful for systems where latency matters and the i/o is not saturated. Freelist is almost useless under normal operations, but it's only one check if it's empty or not, which could be optimized by checking (... -1), or !(... 
0) Regards, Radek
Re: [HACKERS] 2nd Level Buffer Cache
Robert Haas robertmh...@gmail.com writes: It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good. Why not? AIUI the free list is only for buffers that are totally dead, ie contain no info that's possibly of interest to anybody. It is *not* meant to substitute for running the clock sweep when you have to discard a live buffer. regards, tom lane
Re: [HACKERS] 2nd Level Buffer Cache
Hi, hackers. I am interested in this discussion! So I surveyed the buffer replacement algorithms used by other software, and I'd like to share the results. (Sorry, it is only a quick survey.) CLOCK-PRO and LIRS are popular among current buffer algorithms in my survey. Both algorithms are by the same author, Song Jiang. CLOCK-PRO is an improved form of the LIRS algorithm built on the CLOCK algorithm. CLOCK-PRO is used by Apache Derby and NetBSD, and LIRS is used by MySQL. The following is a brief explanation of LIRS. LRU uses a Recency metric, which is the number of other blocks accessed between a block's last reference and the current time. Strong points of LRU: - low overhead and a simple data structure - the LRU assumption generally works well. Weak points of LRU: - a recently used block will not necessarily be used again, or soon - the prediction is based on a single source of information. The LIRS algorithm uses both the Recency metric and an Inter-Reference Recency (IRR) metric, which is the number of other unique blocks accessed between two consecutive references to a block. LIRS prioritizes blocks by IRR first and Recency second; the IRR metric compensates for LRU's weak points. The LIRS paper claims the following: - LIRS has the same overhead as LRU - experimental results indicate that LIRS achieves a higher buffer hit rate than LRU and other buffer algorithms. (Their experiments plugged LIRS and the other algorithms into the PostgreSQL buffer system.) The CLOCK-PRO paper indicates that CLOCK-PRO is superior to LIRS and other buffer algorithms (including ARC). I think that PostgreSQL is a very powerful and reliable database, so I hope the PostgreSQL buffer system will become even more powerful and intelligent. Thanks. 
[References] - CLOCK-PRO: http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang05_CLOCK-Pro.pdf - LIRS: http://dragonstar.ict.ac.cn/course_09/XD_Zhang/%286%29-LIRS-replacement.pdf - Apache Derby (Google Summer of Code): http://www.eecg.toronto.edu/~gokul/derby/derby-report-aug-19-2006.pdf - NetBSD source code: http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD - MySQL source code: http://mysql.lamphost.net/sources/doxygen/mysql-5.1/structPgman_1_1Page__entry.html - Song Jiang's home page: http://www.ece.eng.wayne.edu/~sjiang/ -- Kondo Mitsumasa NTT Corporation, NTT Open Source Software Center
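The Recency and IRR metrics Kondo describes can be computed from an access trace as follows. This is a sketch of the definitions in the LIRS paper, not of any production implementation; LIRS then favours keeping blocks with low IRR over blocks that are merely recent:

```python
def irr_trace(trace):
    """For each access in `trace`, report the block's IRR: the number
    of other distinct blocks accessed between this reference and the
    previous reference to the same block (None on a first access)."""
    last_pos = {}   # block -> index of its most recent access
    out = []
    for i, block in enumerate(trace):
        if block in last_pos:
            between = set(trace[last_pos[block] + 1:i])
            between.discard(block)
            out.append((block, len(between)))
        else:
            out.append((block, None))   # no previous reference yet
        last_pos[block] = i
    return out
```

For the trace A B C A B, both A and B get an IRR of 2 on their second access, which is exactly the "two other unique blocks seen in between" from the definition above.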
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained. A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter. Under what circumstances does this happen? Can a simple pgbench -S with a large scaling factor elicit this behavior? Cheers, Jeff
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 18, 2011 at 8:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: rsmogura rsmog...@softperience.eu wrote: Yes, there is some change, and I looked at this more carefully, as my performance results wasn't such as I expected. I found PG uses BufferAccessStrategy to do sequence scans, so my test query took only 32 buffers from pool and didn't overwritten index pool too much. This BAS is really surprising. In any case when I end polishing I will send good patch, with proof. Yeah, that heuristic makes this less critical, for sure. Actually idea of this patch was like this: Some operations requires many buffers, PG uses clock sweep to get next free buffer, so it may overwrite index buffer. From point of view of good database design We should use indices, so purging out index from cache will affect performance. As the side effect I saw that this 2nd level keeps pg_* indices in memory too, so I think to include 3rd level cache for some pg_* tables. Well, the more complex you make it the more overhead there is, which makes it harder to come out ahead. FWIW, in musing about it (as recently as this week), my idea was to add another field which would factor into the clock sweep calculations. For indexes, it might be levels above leaf pages. The high level blocks of frequently used indexes do a pretty good job of keeping their usage counts high already, and so probably stay in the buffer pool already. And to the extent they don't, promoting all indexes (even infrequently used ones, which I think most databases have) would probably not be the way to encourage the others. I would be more interested in looking at the sweep algorithm itself. One thing I noticed in simulating the clock sweep is that the entry of pages into the buffer with a usage count of 1 might not be very useful. That give that page 2 sweeps of the clock arm before getting evicted, so they have an opportunity to get used again. 
But since all the blocks they are competing against also do the same thing, that just means the arm sweeps about twice as fast, so they don't really get much more of an opportunity. The other thought was that each buffer gets its usage incremented by 2 or 3 rather than 1 each time it is found already in the cache. Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Yeah, that sounds great. Even just having a centrally organized group of scripts/programs that have a good mix of loads, without the automated farm to go with it, would be a help. Cheers, Jeff
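Jeff's two knobs, the usage count a page enters the cache with and the bump it receives on a hit, are easy to expose as parameters of a toy clock-sweep simulator (illustrative only; real shared buffers cap usage_count at 5 and track far more state):

```python
def simulate(trace, nbuffers, entry_count=1, hit_bump=1, cap=5):
    """Run a trace of block accesses through a toy clock-sweep pool and
    return the hit count.  entry_count is the usage_count given to a
    newly loaded page; hit_bump is added on each subsequent hit."""
    pages = {}                    # block -> buffer slot holding it
    blocks = [None] * nbuffers    # slot -> block currently cached
    usage = [0] * nbuffers        # slot -> usage_count
    hand = 0
    hits = 0
    for block in trace:
        if block in pages:        # cache hit: bump the usage count
            hits += 1
            slot = pages[block]
            usage[slot] = min(cap, usage[slot] + hit_bump)
            continue
        while True:               # cache miss: sweep for a victim
            if usage[hand] == 0:
                victim = hand
                hand = (hand + 1) % nbuffers
                break
            usage[hand] -= 1
            hand = (hand + 1) % nbuffers
        if blocks[victim] is not None:
            del pages[blocks[victim]]
        blocks[victim] = block
        pages[block] = victim
        usage[victim] = entry_count
    return hits
```

Replaying recorded buffer traces through something like this, with different (entry_count, hit_bump) pairs, would be a cheap way to screen Jeff's variants before benchmarking any of them in the server.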
Re: [HACKERS] 2nd Level Buffer Cache
On 03/22/2011 12:47 PM, Jeff Janes wrote: Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Yeah, that sounds great. Even just having a centrally organized group of scripts/programs that have a good mix of loads, without the automated farm to go with it, would be a help. Part of the reason for releasing the buildfarm server code a few months ago (see https://github.com/PGBuildFarm/server-code) was to encourage development of a benchmark farm, among other offspring. But I haven't seen such an animal emerging. Someone just needs to sit down and do it and present us with a fait accompli. cheers andrew
Re: [HACKERS] 2nd Level Buffer Cache
On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained. A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter. Under what circumstances does this happen? Can a simple pgbench -S with a large scaling factor elicit this behavior? To be honest, I'm mostly just reporting what I've heard Greg Smith say on this topic. I don't have any machine with that kind of RAM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] 2nd Level Buffer Cache
On Tue, 2011-03-22 at 15:53 -0400, Robert Haas wrote: To be honest, I'm mostly just reporting what I've heard Greg Smith say on this topic. I don't have any machine with that kind of RAM. I thought we had a machine for hackers who want to do performance testing. Mark? -- Devrim GÜNDÜZ Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz
Re: [HACKERS] 2nd Level Buffer Cache
Radek, I have implemented an initial concept of a 2nd level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent those from being overwritten by other operations. I added this functionality to the nbtree index scan. The problem with any special buffering of database objects (other than maybe the system catalogs) is that it improves one use case at the expense of others. For example, special buffering of indexes would have a negative effect on use cases which are primarily seq scans. Also, how would your index buffer work for really huge indexes, like GiST and GIN indexes? In general, I think that improving the efficiency/scalability of our existing buffer system is probably going to bear a lot more fruit than adding extra levels of buffering. That being said, one may argue that the root pages of btree indexes are a legitimate special case. However, it seems like clock-sweep would end up keeping those in shared buffers all the time regardless. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm l) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as it the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard. 
What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. I wonder if there is any reason to mlock at all...if you are going to 'do' mmap, can't you just hide under current lock architecture for actual locking and do direct memory access without mlock? merlin Actually, after dealing with mmap and adding munmap, I found a crucial reason not to use mmap: you need to munmap, and for me this takes much time. Even if I map with PROT_READ and MAP_SHARED, it looks like Linux does a flush or something else, same as with MAP_FIXED, MAP_PRIVATE, etc. Regards, Radek
Re: [HACKERS] 2nd Level Buffer Cache
On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura rsmog...@softperience.eu wrote: Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm l) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as it the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. 
That would require new infrastructure we don't have though it might not be too hard. What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. I wonder if there is any reason to mlock at all...if you are going to 'do' mmap, can't you just hide under current lock architecture for actual locking and do direct memory access without mlock? merlin Actually after dealing with mmap and adding munmap I found crucial thing why to not use mmap: You need to munmap, and for me this takes much time, even if I read with SHARED | PROT_READ, it's looks like Linux do flush or something else, same as with MAP_FIXED, MAP_PRIVATE, etc. can you produce small program demonstrating the problem? This is not how things should work AIUI. I was thinking about playing with mmap implementation of clog system -- it's perhaps better fit. clog is rigidly defined size, and has very high performance requirements. Also it's much less changes than reimplementing heap buffering, and maybe not so much affected by munmap. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus j...@agliodbs.com wrote: To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... A lot of people have talked about it. You can find references to mmap going at least as far back as 2001 or so. The problem is that it would depend on the OS implementing things in a certain way and guaranteeing things we don't think can be portably assumed. We would need to mlock large amounts of address space which most OS's don't allow, and we would need to at least mlock and munlock lots of small bits of memory all over the place which would create lots and lots of mappings which the kernel and hardware implementations would generally not appreciate. As far as I know, no OS has a more sophisticated approach to eviction than LRU. And clock-sweep is a significant improvement on performance over LRU for frequently accessed database objects ... plus our optimizations around not overwriting the whole cache for things like VACUUM. The clock-sweep algorithm was standard OS design before you or I knew how to type. I would expect any half-decent OS to have something at least as good -- perhaps better because it can rely on hardware features to handle things. However the second point is the crux of the issue and of all similar issues on where to draw the line between the OS and Postgres. The OS knows better about the hardware characteristics and can better optimize the overall system behaviour, but Postgres understands better its own access patterns and can better optimize its behaviour whereas the OS is stuck reverse-engineering what Postgres needs, usually from simple heuristics. 
2-level caches work well for a variety of applications. I think 2-level caches with simple heuristics like pin all the indexes is unlikely to be helpful. At least it won't optimize the average case and I think that's been proven. It might be helpful for optimizing the worst-case which would reduce the standard deviation. Perhaps we're at the point now where that matters. Where it might be helpful is as a more refined version of the sequential scans use limited set of buffers patch. Instead of having each sequential scan use a hard coded number of buffers, perhaps all sequential scans should share a fraction of the global buffer pool managed separately from the main pool. Though in my thought experiments I don't see any real win here. In the current scheme if there's any sign the buffer is useful it gets thrown from the sequential scan's set of buffers to reuse anyways. Now, what would be *really* useful is some way to avoid all the data copying we do between shared_buffers and the FS cache. Well the two options are mmap/mlock or directio. The former might be a fun experiment but I expect any OS to fall over pretty quickly when faced with thousands (or millions) of 8kB mappings. The latter would need Postgres to do async i/o and hopefully a global view of its i/o access patterns so it could do prefetching in a lot more cases. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
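The "sequential scans use limited set of buffers" behaviour Greg mentions amounts to giving each scan a small private ring of buffers to recycle, so the scan cannot flush the whole shared pool. Roughly (a toy sketch, with `CountingPool` standing in for the real clock-sweep allocator):

```python
class CountingPool:
    """Stand-in for the shared allocator: hands out fresh buffer ids
    and counts how many the scan has taken from the shared pool."""
    def __init__(self):
        self.n = 0

    def allocate(self):
        self.n += 1
        return self.n - 1

class RingStrategy:
    """Toy buffer access strategy: a sequential scan fills a small ring
    once, then recycles those same buffers round-robin."""
    def __init__(self, shared_pool, ring_size):
        self.pool = shared_pool
        self.ring_size = ring_size
        self.ring = []
        self.next = 0

    def get_buffer(self):
        if len(self.ring) < self.ring_size:
            buf = self.pool.allocate()   # grow the ring up to its cap
            self.ring.append(buf)
            return buf
        buf = self.ring[self.next]       # after that, recycle in place
        self.next = (self.next + 1) % self.ring_size
        return buf
```

However long the scan runs, it only ever borrows ring_size buffers from the shared pool, which is the property Greg refers to; the escape hatch he describes, where a buffer that shows signs of being useful is kicked out of the ring and kept, is omitted here.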
Re: [HACKERS] 2nd Level Buffer Cache
On Mon, 21 Mar 2011 10:24:22 +, Greg Stark wrote: On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus j...@agliodbs.com wrote: To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... A lot of people have talked about it. You can find references to mmap going at least as far back as 2001 or so. The problem is that it would depend on the OS implementing things in a certain way and guaranteeing things we don't think can be portably assumed. We would need to mlock large amounts of address space which most OS's don't allow, and we would need to at least mlock and munlock lots of small bits of memory all over the place which would create lots and lots of mappings which the kernel and hardware implementations would generally not appreciate. Actually, just from curious, I done test with mmap, and I got 2% boost on data reading, maybe because of skipping memcpy in fread. I really curious how fast, if even, it will be if I add some good and needed stuff and how e.g. vacuum will work. snip 2-level caches work well for a variety of applications. I think 2-level caches with simple heuristics like pin all the indexes is unlikely to be helpful. At least it won't optimize the average case and I think that's been proven. It might be helpful for optimizing the worst-case which would reduce the standard deviation. Perhaps we're at the point now where that matters. Actually, 2nd level caches do not pin index buffer. It's just, in simple words, some set of reserved buffers' ids to be used for index pages, all logic with pining, etc. it's same, the difference is that default level operation will not touch 2nd level. 
I am posting some reports from my simple tests. When I was experimenting with 2nd level caches, I saw that some operations may swap out system table buffers, too. snip Regards, Radek
Re: [HACKERS] 2nd Level Buffer Cache
On Mon, Mar 21, 2011 at 5:24 AM, Greg Stark gsst...@mit.edu wrote: snip Now, what would be *really* useful is some way to avoid all the data copying we do between shared_buffers and the FS cache. Well the two options are mmap/mlock or directio. The former might be a fun experiment but I expect any OS to fall over pretty quickly when faced with thousands (or millions) of 8kB mappings. The latter would need Postgres to do async i/o and hopefully a global view of its i/o access patterns so it could do prefetching in a lot more cases. Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html) compatibility issues (possibly obsolete), etc. 
merlin
Re: [HACKERS] 2nd Level Buffer Cache
On 21.03.2011 17:54, Merlin Moncure wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html) compatibility issues (possibly obsolete), etc. That mail is about replacing SysV shared memory with mmap(). Detecting other processes is a problem in that use, but that's not an issue with using mmap() to replace shared buffers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] 2nd Level Buffer Cache
On 3/21/11 3:24 AM, Greg Stark wrote: 2-level caches work well for a variety of applications. I think 2-level caches with simple heuristics like pin all the indexes is unlikely to be helpful. At least it won't optimize the average case and I think that's been proven. It might be helpful for optimizing the worst-case which would reduce the standard deviation. Perhaps we're at the point now where that matters. You're missing my point ... Postgres already *has* a 2-level cache: shared_buffers and the FS cache. Anything we add to that will be adding levels. We already did that, actually, when we implemented ARC: effectively gave PostgreSQL a 3-level cache. The results were not very good, although the algorithm could be at fault there. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard. What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. -- greg
Re: [HACKERS] 2nd Level Buffer Cache
On Mon, Mar 21, 2011 at 4:47 PM, Josh Berkus j...@agliodbs.com wrote: You're missing my point ... Postgres already *has* a 2-level cache: shared_buffers and the FS cache. Anything we add to that will be adding levels. I don't think those two levels are interesting -- they don't interact cleverly at all. I was assuming the two levels were segments of the shared buffers that didn't interoperate at all. If you kick buffers from the higher level cache into the lower level one then why not just increase the number of clock sweeps before you flush a buffer and insert non-index pages into a lower clock level instead of writing code for two levels? I don't think it will outperform in general because LRU is provably within some margin from optimal and the clock sweep is an approximate LRU. The only place you're going to find wins is when you know something extra about the *future* access pattern that the lru/clock doesn't know based on the past behaviour. Just saying indexes are heavily used or system tables are heavily used isn't really extra information since the LRU can figure that out. Something like sequential scans of tables larger than shared buffers don't go back and read old pages before they age out is. The other place you might win is if you have some queries that you want to always be fast at the expense of slower queries. So your short web queries that only need to touch a few small tables and system tables can tag buffers that are higher priority and shouldn't be swapped out to achieve a slightly higher hit rate on the global cache. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
Excerpts from Josh Berkus's message of lun mar 21 13:47:21 -0300 2011: We already did that, actually, when we implemented ARC: effectively gave PostgreSQL a 3-level cache. The results were not very good, although the algorithm could be at fault there. Was it really all that bad? IIRC we replaced ARC with the current clock sweep due to patent concerns. (Maybe there were performance concerns as well, I don't remember). -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] 2nd Level Buffer Cache
Was it really all that bad? IIRC we replaced ARC with the current clock sweep due to patent concerns. (Maybe there were performance concerns as well, I don't remember). Yeah, that was why the patent was frustrating. Performance was poor and we were planning on replacing ARC in 8.2 anyway. Instead we had to backport it. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] 2nd Level Buffer Cache
On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard.
What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. I wonder if there is any reason to mlock at all...if you are going to 'do' mmap, can't you just hide under current lock architecture for actual locking and do direct memory access without mlock? merlin
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm l) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as it the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard. 
What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. Sorry for being curious, but I think mlock prevents swapping, not flushing. I wonder if there is any reason to mlock at all...if you are going to 'do' mmap, can't you just hide under current lock architecture for actual locking and do direct memory access without mlock? merlin The mmap man page does not say anything about when a flush occurs when the mapping is file-backed and shared, so flushes may happen whether intended or not. What's more, from what I've read, SysV shared memory is emulated with mmap (and I think that mmap is on /dev/shm). Radek
Re: [HACKERS] 2nd Level Buffer Cache
On Thu, 17 Mar 2011 16:02:18 -0500, Kevin Grittner wrote: Radosław Smogura rsmog...@softperience.eu wrote: I have implemented an initial concept of a 2nd level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent those from being overwritten by other operations. I added this functionality to the nbtree index scan. I tested this by doing an index scan, a seq read, dropping system buffers, then another index scan, and in a few places I saw performance improvements, but actually I'm not sure if this was just random or an intended improvement. I've often wondered about this. In a database I developed back in the '80s it was clearly a win to have a special cache for index entries and other special pages closer to the database than the general cache. A couple things have changed since the '80s (I mean, besides my waistline and hair color), and PostgreSQL has many differences from that other database, so I haven't been sure it would help as much, but I have wondered. I can't really look at this for a couple weeks, but I'm definitely interested. I suggest that you add this to the next CommitFest as a WIP patch, under the Performance category. https://commitfest.postgresql.org/action/commitfest_view/open There are a few places to optimize in the code as well, and the patch needs much work, but could you look at it and give your opinions? For something like this it makes perfect sense to show proof of concept before trying to cover everything. -Kevin Yes, there has been some change: I looked at this more carefully, as my performance results weren't what I expected. I found that PG uses a BufferAccessStrategy to do sequential scans, so my test query took only 32 buffers from the pool and didn't overwrite the index pool much. This BAS is really surprising. In any case, when I finish polishing I will send a good patch, with proof. Actually the idea of this patch was like this: some operations require many buffers; PG uses the clock sweep to get the next free buffer, so it may overwrite an index buffer.
From the point of view of good database design we should use indices, so purging an index from the cache will affect performance. As a side effect I saw that this 2nd level keeps the pg_* indices in memory too, so I'm thinking of including a 3rd level cache for some pg_* tables. Regards, Radek
Re: [HACKERS] 2nd Level Buffer Cache
rsmogura rsmog...@softperience.eu wrote: Yes, there has been some change: I looked at this more carefully, as my performance results weren't what I expected. I found that PG uses a BufferAccessStrategy to do sequential scans, so my test query took only 32 buffers from the pool and didn't overwrite the index pool much. This BAS is really surprising. In any case, when I finish polishing I will send a good patch, with proof. Yeah, that heuristic makes this less critical, for sure. Actually the idea of this patch was like this: some operations require many buffers; PG uses the clock sweep to get the next free buffer, so it may overwrite an index buffer. From the point of view of good database design we should use indices, so purging an index from the cache will affect performance. As a side effect I saw that this 2nd level keeps the pg_* indices in memory too, so I'm thinking of including a 3rd level cache for some pg_* tables. Well, the more complex you make it the more overhead there is, which makes it harder to come out ahead. FWIW, in musing about it (as recently as this week), my idea was to add another field which would factor into the clock sweep calculations. For indexes, it might be levels above leaf pages. I haven't reviewed the code in depth to know how to use it; this was just idle daydreaming based on that prior experience. It's far from certain that the concept will actually prove beneficial in PostgreSQL. Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained. -Kevin
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained. A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can apparently handle much larger buffer pools without a performance drop-off. There are some advantages to our reliance on the OS buffer cache, to be sure, but as RAM continues to grow this might start to get annoying. On a 4GB system you might have shared_buffers set to 25% of memory, but on a 64GB system it'll be a smaller percentage, and as memory capacities continue to clime it'll be smaller still. Unfortunately I don't have the hardware to investigate this, but it's worth thinking about, especially if we're thinking of doing things that add more caching. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
Excerpts from rsmogura's message of vie mar 18 11:57:48 -0300 2011: Actually the idea of this patch was like this: some operations require many buffers; PG uses the clock sweep to get the next free buffer, so it may overwrite an index buffer. From the point of view of good database design we should use indices, so purging an index from the cache will affect performance. The BufferAccessStrategy stuff was written to solve this problem. As a side effect I saw that this 2nd level keeps the pg_* indices in memory too, so I'm thinking of including a 3rd level cache for some pg_* tables. Keep in mind that there's already another layer of caching (see syscache.c) for system catalogs on top of the buffer cache. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] 2nd Level Buffer Cache
On Mar 18, 2011, at 11:19 AM, Robert Haas wrote: On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can apparently handle much larger buffer pools without a performance drop-off. There are some advantages to our reliance on the OS buffer cache, to be sure, but as RAM continues to grow this might start to get annoying. On a 4GB system you might have shared_buffers set to 25% of memory, but on a 64GB system it'll be a smaller percentage, and as memory capacities continue to clime it'll be smaller still. Unfortunately I don't have the hardware to investigate this, but it's worth thinking about, especially if we're thinking of doing things that add more caching. +1 To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 18, 2011 at 2:15 PM, Jim Nasby j...@nasby.net wrote: +1 To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... The problem is that the OS doesn't offer any mechanism that would allow us to obey the WAL-before-data rule. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] 2nd Level Buffer Cache
Kevin Grittner kevin.gritt...@wicourts.gov Thursday 17 March 2011 22:02:18 Radosław Smogura rsmog...@softperience.eu wrote: I have implemented an initial concept of a 2nd level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent those from being overwritten by other operations. I added this functionality to the nbtree index scan. I tested this by doing an index scan, a seq read, dropping system buffers, then another index scan, and in a few places I saw performance improvements, but actually I'm not sure if this was just random or an intended improvement. I've often wondered about this. In a database I developed back in the '80s it was clearly a win to have a special cache for index entries and other special pages closer to the database than the general cache. A couple things have changed since the '80s (I mean, besides my waistline and hair color), and PostgreSQL has many differences from that other database, so I haven't been sure it would help as much, but I have wondered. I can't really look at this for a couple weeks, but I'm definitely interested. I suggest that you add this to the next CommitFest as a WIP patch, under the Performance category. https://commitfest.postgresql.org/action/commitfest_view/open There are a few places to optimize in the code as well, and the patch needs much work, but could you look at it and give your opinions? For something like this it makes perfect sense to show proof of concept before trying to cover everything. -Kevin Here I attach the latest version of the patch with a few performance improvements (the code is still dirty) and some reports from the tests, as well as my simple test scripts. Actually there is a small improvement without dropping system caches, and a bigger one with dropping. I see a small performance decrease (if we can talk about measurement based on these tests) against the original PG version with the same configuration, but an increase with 2nd level buffers... or maybe I compared the reports badly. In the tests I tried to choose typical, simple queries.
Regards, Radek 2nd_lvl_cache_20110318.diff.bz2 Description: application/bzip test-scritps_20110319_0026.tar.bz2 Description: application/bzip-compressed-tar reports_20110318.tar.bz2 Description: application/bzip-compressed-tar
Re: [HACKERS] 2nd Level Buffer Cache
On 3/18/11 11:15 AM, Jim Nasby wrote: To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... As far as I know, no OS has a more sophisticated approach to eviction than LRU. And clock-sweep is a significant improvement on performance over LRU for frequently accessed database objects ... plus our optimizations around not overwriting the whole cache for things like VACUUM. 2-level caches work well for a variety of applications. Now, what would be *really* useful is some way to avoid all the data copying we do between shared_buffers and the FS cache. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 2nd Level Buffer Cache
Hi, I have implemented an initial concept of a 2nd level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent those from being overwritten by other operations. I added this functionality to the nbtree index scan. I tested this by doing an index scan, a seq read, dropping system buffers, then another index scan, and in a few places I saw performance improvements, but actually I'm not sure if this was just random or an intended improvement. There are a few places to optimize in the code as well, and the patch needs much work, but could you look at it and give your opinions? Regards, Radek

diff --git a/.gitignore b/.gitignore
index 3f11f2e..6542e35 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,3 +22,4 @@ lcov.info
 /GNUmakefile
 /config.log
 /config.status
+/nbproject/private/
\ No newline at end of file
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 2796445..0229f5a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -508,7 +508,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 	if (blkno != P_NEW)
 	{
 		/* Read an existing block of the relation */
-		buf = ReadBuffer(rel, blkno);
+		buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND);
 		LockBuffer(buf, access);
 		_bt_checkpage(rel, buf);
 	}
@@ -548,7 +548,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 			blkno = GetFreeIndexPage(rel);
 			if (blkno == InvalidBlockNumber)
 				break;
-			buf = ReadBuffer(rel, blkno);
+			buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND);
 			if (ConditionalLockBuffer(buf))
 			{
 				page = BufferGetPage(buf);
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..2922711 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -22,6 +22,7 @@
 BufferDesc *BufferDescriptors;
 char	   *BufferBlocks;
 int32	   *PrivateRefCount;
+BufferLevelDesc *bufferLevels;
 
 /*
  * Data Structures:
@@ -72,8 +73,7 @@ int32	   *PrivateRefCount;
 void
 InitBufferPool(void)
 {
-	bool		foundBufs,
-				foundDescs;
+	bool		foundBufs, foundDescs, foundBufferLevels = false;
 
 	BufferDescriptors = (BufferDesc *)
 		ShmemInitStruct("Buffer Descriptors",
@@ -83,19 +83,38 @@ InitBufferPool(void)
 		ShmemInitStruct("Buffer Blocks",
 						NBuffers * (Size) BLCKSZ, &foundBufs);
 
-	if (foundDescs || foundBufs)
+	bufferLevels = (BufferLevelDesc *)
+		ShmemInitStruct("Buffer Levels Descriptors Table",
+						sizeof(BufferLevelDesc) * BUFFER_LEVEL_SIZE,
+						&foundBufferLevels);
+
+	if (foundDescs || foundBufs || foundBufferLevels)
 	{
 		/* both should be present or neither */
-		Assert(foundDescs && foundBufs);
+		Assert(foundDescs && foundBufs && foundBufferLevels);
 		/* note: this path is only taken in EXEC_BACKEND case */
 	}
 	else
 	{
 		BufferDesc *buf;
+		BufferLevelDesc *bufferLevelDesc;
 		int			i;
 
 		buf = BufferDescriptors;
 
+		/* Initialize buffer levels. */
+		// 1st Level - Default
+		bufferLevelDesc = bufferLevels;
+		bufferLevelDesc->index = 0;
+		bufferLevelDesc->super = BUFFER_LEVEL_END_OF_LIST;
+		bufferLevelDesc->lower = BUFFER_LEVEL_END_OF_LIST;
+
+		// 2nd Level - For indices
+		bufferLevelDesc++;
+		bufferLevelDesc->index = 1;
+		bufferLevelDesc->super = BUFFER_LEVEL_END_OF_LIST;
+		bufferLevelDesc->lower = 0;
+
 		/*
 		 * Initialize all the buffer headers.
 		 */
@@ -117,6 +136,10 @@ InitBufferPool(void)
 		 */
 		buf->freeNext = i + 1;
 
+		/* Assign buffer level. */
+		// TODO Currently hardcoded
+		buf->buf_level = (0.3 * NBuffers > i) ? BUFFER_LEVEL_DEFAULT : BUFFER_LEVEL_2ND;
+
 		buf->io_in_progress_lock = LWLockAssign();
 		buf->content_lock = LWLockAssign();
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..867bae0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -47,7 +47,8 @@
 #include "storage/standby.h"
 #include "utils/rel.h"
 #include "utils/resowner.h"
-
+#include "catalog/pg_type.h"
+#include "funcapi.h"
 
 /*
  * Note: these two macros only work on shared buffers, not local ones!
  */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -85,7 +86,7 @@ static volatile BufferDesc *PinCountWaitBuf = NULL;
 static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 				  ForkNumber forkNum, BlockNumber blockNum,
 				  ReadBufferMode mode, BufferAccessStrategy strategy,
-				  bool *hit);
+				  bool *hit, BufferLevel bufferLevel);
 static bool PinBuffer(volatile BufferDesc *buf,
Re: [HACKERS] 2nd Level Buffer Cache
Radosław Smogura rsmog...@softperience.eu wrote: I have implemented an initial concept of a 2nd level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent those from being overwritten by other operations. I added this functionality to the nbtree index scan. I tested this by doing an index scan, a seq read, dropping system buffers, then another index scan, and in a few places I saw performance improvements, but actually I'm not sure if this was just random or an intended improvement. I've often wondered about this. In a database I developed back in the '80s it was clearly a win to have a special cache for index entries and other special pages closer to the database than the general cache. A couple things have changed since the '80s (I mean, besides my waistline and hair color), and PostgreSQL has many differences from that other database, so I haven't been sure it would help as much, but I have wondered. I can't really look at this for a couple weeks, but I'm definitely interested. I suggest that you add this to the next CommitFest as a WIP patch, under the Performance category. https://commitfest.postgresql.org/action/commitfest_view/open There are a few places to optimize in the code as well, and the patch needs much work, but could you look at it and give your opinions? For something like this it makes perfect sense to show proof of concept before trying to cover everything. -Kevin