Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Mon, Nov 22, 2010 at 6:54 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
> On 21.11.2010 15:18, Robert Haas wrote:
>> So is the problem case a race involving owning/disowning a latch vs.
>> setting that same latch?
>
> No. (Or maybe that as well, but that's not what we've been concerned
> about here.) As far as I've understood correctly, the problem is that
> process A does something like this:
>
>     /* set a shared variable */
>     ((volatile bool *) shmem)->variable = true;
>     /* Wake up process B to notice that we changed the variable */
>     SetLatch();
>
> And process B does this:
>
>     for (;;)
>     {
>         ResetLatch();
>         if (((volatile bool *) shmem)->variable)
>             DoStuff();
>         WaitLatch();
>     }
>
> This is the documented usage pattern of latches. The problem arises if
> process A runs just before ResetLatch, but the effect of setting the
> shared variable doesn't become visible until after the if-test in
> process B. Process B will clear the is_set flag in ResetLatch(), but
> it will not DoStuff(), so it in effect misses the wakeup from process
> A and goes back to sleep even though it would have work to do.
>
> This situation doesn't arise in the current use of latches, because
> the shared state comparable to shmem->variable in the above example is
> protected by a spinlock. But it might become an issue in some future
> use case.

Eh, so, should we do anything about this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On 21.11.2010 15:18, Robert Haas wrote:
> On Sat, Nov 20, 2010 at 4:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> I think the general problem can be stated as "process A changes two
>> or more values in shared memory in a fairly short span of time, and
>> process B, which is concurrently examining the same variables, sees
>> those changes occur in a different order than A thought it made them
>> in". [...] So what this boils down to is being an issue for shared
>> data structures that we access without using locks. As, for example,
>> the latch structures.
>
> So is the problem case a race involving owning/disowning a latch vs.
> setting that same latch?

No. (Or maybe that as well, but that's not what we've been concerned
about here.) As far as I've understood correctly, the problem is that
process A does something like this:

    /* set a shared variable */
    ((volatile bool *) shmem)->variable = true;
    /* Wake up process B to notice that we changed the variable */
    SetLatch();

And process B does this:

    for (;;)
    {
        ResetLatch();
        if (((volatile bool *) shmem)->variable)
            DoStuff();
        WaitLatch();
    }

This is the documented usage pattern of latches. The problem arises if
process A runs just before ResetLatch, but the effect of setting the
shared variable doesn't become visible until after the if-test in
process B. Process B will clear the is_set flag in ResetLatch(), but it
will not DoStuff(), so it in effect misses the wakeup from process A
and goes back to sleep even though it would have work to do.

This situation doesn't arise in the current use of latches, because the
shared state comparable to shmem->variable in the above example is
protected by a spinlock. But it might become an issue in some future
use case.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Sat, Nov 20, 2010 at 4:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> So what DO we need to guard against here?
>
> I think the general problem can be stated as "process A changes two or
> more values in shared memory in a fairly short span of time, and
> process B, which is concurrently examining the same variables, sees
> those changes occur in a different order than A thought it made them
> in".
>
> In practice we do not need to worry about changes made with a kernel
> call in between, as any sort of context swap will cause the kernel to
> force cache synchronization. Also, the intention is that the locking
> primitives will take care of this for any shared structures that are
> protected by a lock. (There were some comments upthread suggesting
> maybe our lock code is not bulletproof; but if so that's something to
> fix in the lock code, not a logic error in code using the locks.)
>
> So what this boils down to is being an issue for shared data
> structures that we access without using locks. As, for example, the
> latch structures.

So is the problem case a race involving owning/disowning a latch vs.
setting that same latch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 5:59 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> But what about timings vs. random other stuff? Like in this case
>> there's a problem if the signal arrives before the memory update to
>> latch->is_set becomes visible. I don't know what we need to do to
>> guarantee that.
>
> I don't believe there's an issue there. A context swap into the kernel
> is certainly going to include msync. If you're afraid otherwise, you
> could put an msync before the kill() call, but I think it's a waste of
> effort.

So what DO we need to guard against here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes:
> So what DO we need to guard against here?

I think the general problem can be stated as "process A changes two or
more values in shared memory in a fairly short span of time, and
process B, which is concurrently examining the same variables, sees
those changes occur in a different order than A thought it made them
in".

In practice we do not need to worry about changes made with a kernel
call in between, as any sort of context swap will cause the kernel to
force cache synchronization. Also, the intention is that the locking
primitives will take care of this for any shared structures that are
protected by a lock. (There were some comments upthread suggesting
maybe our lock code is not bulletproof; but if so that's something to
fix in the lock code, not a logic error in code using the locks.)

So what this boils down to is being an issue for shared data structures
that we access without using locks. As, for example, the latch
structures. The other case that I can think of offhand is the signal
multiplexing flags. I think we're all right there so far as the flags
themselves are concerned, because only one atomic update is involved on
each side: there's no possibility of inconsistency due to cache
visibility skew. But we'd be at some risk if we were using any such
flag as a cue to go look at some other shared-memory state.

			regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 05:38:14 Tom Lane wrote:
> Robert Haas robertmh...@gmail.com writes:
>> I'm all in favor of having some memory ordering primitives so that we
>> can try to implement better algorithms, but if we use it here it
>> amounts to a fairly significant escalation in the minimum
>> requirements to compile PG (which is bad) rather than just a
>> performance optimization (which is good).
>
> I don't believe there would be any escalation in compilation
> requirements: we already have the ability to invoke stronger
> primitives than these. What is needed is research to find out what the
> primitives are called, on platforms where we aren't relying on direct
> asm access.
>
> My feeling is it's time to bite the bullet and do that work. We
> shouldn't cripple the latch operations because of laziness at the
> outset.

I don't think developing the actual code is that hard - s_lock.c
contains nearly everything necessary. A "lock xchg" or similar is only
marginally slower than a barrier-only implementation, so doing a TAS()
on a slock_t in private memory should be an easy enough fallback
implementation.

So the complicated case seems to be !defined(HAS_TEST_AND_SET), which
uses spinlocks for that purpose - no idea where that is true these
days.

Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Thu, Nov 18, 2010 at 11:38 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> I'm all in favor of having some memory ordering primitives so that we
>> can try to implement better algorithms, [...]
>
> I don't believe there would be any escalation in compilation
> requirements: we already have the ability to invoke stronger
> primitives than these. What is needed is research to find out what the
> primitives are called, on platforms where we aren't relying on direct
> asm access.

I don't believe that's correct, although it's possible that I may be
missing something. On any platform where we have TAS(), that should be
sufficient to set the flag, but how will we read the flag? A simple
fetch isn't guaranteed to be sufficient; for some architectures, you
might need to insert a read fence, and I don't think we have anything
like that defined right now. We've got special cases in s_lock.h for
all kinds of crazy architectures, and it's not obvious what would be
needed. For example, some operating system I've never heard of called
SINIX has this:

    #include "abi_mutex.h"
    typedef abilock_t slock_t;

    #define TAS(lock)            (!acquire_lock(lock))
    #define S_UNLOCK(lock)       release_lock(lock)
    #define S_INIT_LOCK(lock)    init_lock(lock)
    #define S_LOCK_FREE(lock)    (stat_lock(lock) == UNLOCKED)

It's far from obvious to me how to make this do what we need - I have a
sneaking suspicion it can't be done with those primitives at all - and
I bet neither of us has a machine on which it can be tested. Now maybe
we no longer care about supporting SINIX anyway, but the point is that
if we make this change, every platform for which we don't have working
TAS and read-fence operations becomes an unsupported platform. Forget
about --disable-spinlocks; there is no such thing.

That strikes me as an utterly unacceptable amount of collateral damage
to avoid a basically harmless API change, not to mention a ton of work.
You might be able to convince me that it's no longer important to
support platforms without a working spinlock implementation (although I
think it's rather nice that we can - it might encourage someone to try
out PG and then contribute an implementation for their favorite
platform), but this is also going to break platforms that nominally
have TAS now: some of the TAS implementations aren't really TAS, as in
the above case, and we may not be able to easily determine what's
required for a read fence even where TAS is a real TAS.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund and...@anarazel.de wrote:
> So the complicated case seems to be !defined(HAS_TEST_AND_SET), which
> uses spinlocks for that purpose - no idea where that is true these
> days.

Me neither, which is exactly the problem. Under Tom's proposal, any
architecture we don't explicitly provide for breaks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 15:16:24 Robert Haas wrote:
> On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund and...@anarazel.de wrote:
>> So the complicated case seems to be !defined(HAS_TEST_AND_SET), which
>> uses spinlocks for that purpose - no idea where that is true these
>> days.
>
> Me neither, which is exactly the problem. Under Tom's proposal, any
> architecture we don't explicitly provide for breaks.

I doubt it's that much of a problem, as !defined(HAS_TEST_AND_SET) will
be so slow that there would be noise from that side more often...

Besides, we can just jump into the kernel and back in that case (which
the TAS implementation already does); that does more than just a
fence...

Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 9:27 AM, Aidan Van Dyk ai...@highrise.ca wrote:
> Just a small point of clarification - you need to have both that
> unknown architecture, and that architecture has to have postgres
> processes running simultaneously on different CPUs with caches that
> are incoherent, to have those problems.

Sure you do. But so what? Are you going to compile PostgreSQL and
implement TAS as a simple store and read-fence as a simple load? How
likely is that to work out well?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 9:16 AM, Robert Haas robertmh...@gmail.com wrote:
> On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund and...@anarazel.de wrote:
>> So the complicated case seems to be !defined(HAS_TEST_AND_SET), which
>> uses spinlocks for that purpose - no idea where that is true these
>> days.
>
> Me neither, which is exactly the problem. Under Tom's proposal, any
> architecture we don't explicitly provide for breaks.

Just a small point of clarification - you need to have both that
unknown architecture, and that architecture has to have postgres
processes running simultaneously on different CPUs with caches that are
incoherent, to have those problems.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 15:29:10 Andres Freund wrote:
> Besides, we can just jump into the kernel and back in that case (which
> the TAS implementation already does); that does more than just a
> fence...

Or, if you don't believe that is enough: initialize a lock on the
stack, lock it, and forget it...

Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 9:29 AM, Andres Freund and...@anarazel.de wrote:
> I doubt it's that much of a problem, as !defined(HAS_TEST_AND_SET)
> will be so slow that there would be noise from that side more often...
>
> Besides, we can just jump into the kernel and back in that case (which
> the TAS implementation already does); that does more than just a
> fence...

Eh, really? If there's a workaround for platforms for which we don't
know what the appropriate read-fencing incantation is, then I'd feel
more comfortable about doing this. But I don't see how to make that
work. The whole problem here is that the API is designed in such a way
that the signal handler might be invoked when the lock that it needs to
grab is already held by the same process. The reason memory barriers
solve the problem is that they'll be atomically released when we jump
into the signal handler, but that is not true of a spinlock or a
semaphore.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 9:35 AM, Aidan Van Dyk ai...@highrise.ca wrote:
> If I was trying to port PostgreSQL to some strange architecture, and
> my strange architecture didn't have all the normal TAS and memory
> barrier stuff because it was only a UP system with no cache, then yes,
> and it would work out well ;-)

I get your point, but obviously this case isn't very interesting or
likely in 2010.

> If it was some strange SMP architecture, I wouldn't expect *anything*
> to work out well if the architecture doesn't have some sort of
> TAS/memory barrier/cache-coherency stuff in it ;-)

Well, you'd be pleasantly surprised to find that you could at least get
it to compile using --disable-spinlocks. Yeah, the performance would
probably be lousy and you might run out of semaphores, but at least for
basic stuff it would run. Ripping that out just to avoid an API change
in code we committed two months ago seems a bit extreme, especially
since it's also going to require implementing a read-fence operation on
every platform we want to continue supporting. Maybe you could default
the read fence to a simple read on platforms that are not known to have
an issue, but all the platforms where TAS is calling some OS-provided
routine that does mysterious magic under the covers are going to need
attention; and I just don't think that cleaning up everything that's
going to break is a very worthwhile investment of our limited
development resources, even if it doesn't result in needlessly dropping
platform support.

If we're going to work on memory primitives, I would much rather see us
put that effort into, say, implementing more efficient LWLock
algorithms to solve the bottlenecks that the MOSBENCH guys found,
rather than spending it on trying to avoid a minor API complication for
the latch facility.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 15:38:37 Robert Haas wrote:
> Eh, really? If there's a workaround for platforms for which we don't
> know what the appropriate read-fencing incantation is, then I'd feel
> more comfortable about doing this. But I don't see how to make that
> work. The whole problem here is that the API is designed in such a way
> that the signal handler might be invoked when the lock that it needs
> to grab is already held by the same process. The reason memory
> barriers solve the problem is that they'll be atomically released when
> we jump into the signal handler, but that is not true of a spinlock or
> a semaphore.

Well, it's not generally true - you are right there. But there is a
wide range of syscalls available where it's inherently true (which is
what I sloppily referred to). And you are allowed to call a set of
system calls, although quite a restricted one, even in signal handlers.
I don't have the list for older POSIX versions in mind, but for 2003
you can choose from several calls like write, lseek, or setpgid, which
inherently have to serialize. And I am quite sure there were sensible
calls for earlier versions.

Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 15:14:58 Robert Haas wrote:
> I don't believe that's correct, although it's possible that I may be
> missing something. On any platform where we have TAS(), that should be
> sufficient to set the flag, but how will we read the flag? A simple
> fetch isn't guaranteed to be sufficient; for some architectures, you
> might need to insert a read fence, and I don't think we have anything
> like that defined right now.

A TAS is both a read and a write fence; after that you don't *need* to
fetch it. And even if it were only a write fence on some platforms - if
we consistently issue write fences at the relevant places, that ought
to be enough.

Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 15:49:45 Robert Haas wrote:
> If we're going to work on memory primitives, I would much rather see
> us put that effort into, say, implementing more efficient LWLock
> algorithms to solve the bottlenecks that the MOSBENCH guys found,
> rather than spending it on trying to avoid a minor API complication
> for the latch facility.

But for that you will need more infrastructure in that area anyway.

Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 9:51 AM, Andres Freund and...@anarazel.de wrote:
> But for that you will need more infrastructure in that area anyway.

True, but you don't have to do it all at once. You can continue to do
the same old stuff on the platforms you currently support, and use the
newer stuff on platforms where the right thing to do is readily
apparent, like x86 and x86_64. And people can add support for their
favorite platforms gradually over time, rather than having a flag day
where we stop supporting everything we don't know what to do with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 9:49 AM, Andres Freund and...@anarazel.de wrote:
> Well, it's not generally true - you are right there. But there is a
> wide range of syscalls available where it's inherently true (which is
> what I sloppily referred to). And you are allowed to call a set of
> system calls, although quite a restricted one, even in signal
> handlers. I don't have the list for older POSIX versions in mind, but
> for 2003 you can choose from several calls like write, lseek, or
> setpgid, which inherently have to serialize.

Well, it's not quite enough just to call into the kernel to serialize
on some point of memory, because your point is to make sure that *this
particular piece of memory* is coherent. It doesn't matter if the
kernel has proper fencing in its stuff if the memory it's guarding is
in another cacheline, because that won't *necessarily* force cache
coherency in your local lock/variable memory.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes:
> If we're going to work on memory primitives, I would much rather see
> us put that effort into, say, implementing more efficient LWLock
> algorithms to solve the bottlenecks that the MOSBENCH guys found,
> rather than spending it on trying to avoid a minor API complication
> for the latch facility.

I haven't read all of this very long thread yet, but I will point out
that you seem to be arguing from the position that memory ordering
primitives will only be useful for the latch code. This is nonsense of
the first order. We already know that the sinval signalling mechanism
could use it to avoid needing a spinlock. I submit that it's very
likely that fixing communication bottlenecks elsewhere will similarly
require memory ordering primitives, if we are to avoid the stupid
"use a lock" approach. I think it's time to build that infrastructure.

BTW, I agree with Andres' point that we can probably default memory
barriers to be no-ops on unknown platforms. Weak memory ordering isn't
a common architectural choice. A look through s_lock.h suggests that
PPC and MIPS are the only supported arches that need to worry about
this.

			regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 10:01 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> I haven't read all of this very long thread yet, but I will point out
> that you seem to be arguing from the position that memory ordering
> primitives will only be useful for the latch code. This is nonsense of
> the first order. We already know that the sinval signalling mechanism
> could use it to avoid needing a spinlock. I submit that it's very
> likely that fixing communication bottlenecks elsewhere will similarly
> require memory ordering primitives, if we are to avoid the stupid
> "use a lock" approach. I think it's time to build that infrastructure.

I completely agree, but I'm not too sure I want to drop support for any
platform for which we haven't yet implemented such primitives. What's
different about this case is that falling back to taking the spinlock
is not a workable option.

> BTW, I agree with Andres' point that we can probably default memory
> barriers to be no-ops on unknown platforms. Weak memory ordering
> isn't a common architectural choice. A look through s_lock.h suggests
> that PPC and MIPS are the only supported arches that need to worry
> about this.

That's good to hear. I'm more worried, however, about architectures
where we supposedly have TAS but it isn't really TAS, just some
OS-provided "acquire a lock" primitive. That won't generalize nicely to
what we need for this case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 9:31 AM, Robert Haas robertmh...@gmail.com wrote: Just a small point of clarification - you need to have both that unknown architecture, and that architecture has to have postgres processes running simultaneously on different CPUs with different caches that are incoherent to have those problems. Sure you do. But so what? Are you going to compile PostgreSQL and implement TAS as a simple store and read-fence as a simple load? How likely is that to work out well? If I was trying to port PostgreSQL to some strange architecture, and my strange architecture didn't have all the normal TAS and memory barriers stuff because it was only a UP system with no cache, then yes, and it would work out well ;-) If it was some strange SMP architecture, I wouldn't expect *anything* to work out well if the architecture doesn't have some sort of TAS/memory barrier/cache-coherency stuff in it ;-) a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 15:58:39 Aidan Van Dyk wrote: On Fri, Nov 19, 2010 at 9:49 AM, Andres Freund and...@anarazel.de wrote: Well, it's not generally true - you are right there. But there is a wide range of syscalls available where it's inherently true (which is what I sloppily referred to). And you are allowed to call a, although quite restricted, set of system calls even in signal handlers. I don't have the list for older posix versions in mind, but for 2003 you can choose something from several like write, lseek, setpgid which inherently have to serialize. And I am quite sure there were sensible calls for earlier versions. Well, it's not quite enough just to call into the kernel to serialize on some point of memory, because your point is to make sure that *this particular piece of memory* is coherent. It doesn't matter if the kernel has proper fencing in its stuff if the memory it's guarding is in another cacheline, because that won't *necessarily* force cache coherency in your local lock/variable memory. Yes and no. It provides the same guarantees as our current approach of using spinlocks for exactly that - that it theoretically is not enough is an independent issue (but *definitely* an issue). Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On 11/19/2010 03:58 PM, Aidan Van Dyk wrote: Well, it's not quite enough just to call into the kernel to serialize on some point of memory, because your point is to make sure that *this particular piece of memory* is coherent. Well, that certainly doesn't apply to full fences, which are not specific to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or 'mf' on ia64. Regards Markus Wanner
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes: ... The reason memory barriers solve the problem is because they'll be atomically released when we jump into the signal handler, but that is not true of a spin-lock or a semaphore. Hm, I wonder whether your concern is stemming from a wrong mental model. There is nothing to release. In my view, a memory barrier primitive is a sequence point, having the properties that all writes issued before the barrier shall become visible to other processors before any writes after it, and also that no reads issued after the barrier shall be executed until those writes have become visible. (PPC can separate those two aspects, but I think we probably don't need to get that detailed for our purposes.) On most processors, the barrier primitive will just be ((void) 0) because they don't deal in out-of-order writes anyway. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
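The fence semantics Tom describes here can be sketched in C. The macro name and the per-architecture choices below are illustrative only, not actual PostgreSQL source; the point is that the barrier is a sequence point, which on strongly-ordered machines can compile down to (almost) nothing:

```c
/* A sketch of a full memory-barrier primitive in the style Tom
 * describes: all writes issued before it become visible to other
 * processors before any writes after it. */
#if defined(__powerpc__) || defined(__ppc__)
#define pg_memory_barrier() __asm__ __volatile__ ("sync" : : : "memory")
#elif defined(__mips__)
#define pg_memory_barrier() __asm__ __volatile__ ("sync" : : : "memory")
#else
/* strongly-ordered architectures: effectively the ((void) 0) no-op
 * Tom mentions, kept as a compiler barrier so the compiler itself
 * cannot reorder stores across it */
#define pg_memory_barrier() __asm__ __volatile__ ("" : : : "memory")
#endif

static volatile int shared_data;
static volatile int data_ready;

static void
publish(int value)
{
    shared_data = value;    /* all writes before the barrier ... */
    pg_memory_barrier();    /* ... are fenced here ...           */
    data_ready = 1;         /* ... before any write after it     */
}
```

There is deliberately nothing to "release" here: the macro is a point in the instruction stream, not a lock.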
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes: I completely agree, but I'm not too sure I want to drop support for any platform for which we haven't yet implemented such primitives. What's different about this case is that fall back to taking the spin lock is not a workable option. The point I was trying to make is that the fallback position can reasonably be a no-op. That's good to hear. I'm more worried, however, about architectures where we supposedly have TAS but it isn't really TAS but some OS-provided acquire a lock primitive. That won't generalize nicely to what we need for this case. I did say we need some research ;-). We need to look into what's the appropriate primitive for any such OSes that are available for PPC or MIPS. I don't feel a need to be paranoid about it for other architectures. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Markus Wanner mar...@bluegap.ch writes: Well, that certainly doesn't apply to full fences, that are not specific to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or 'mf' on ia64. Hm, what do those do exactly? We've never had any such thing in the Intel-ish spinlock asm, but if out-of-order writes are possible I should think we'd need 'em. Or does lock xchgb imply an mfence? regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On 11/19/2010 04:51 PM, Tom Lane wrote: Hm, what do those do exactly? Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. [1] Given the memory ordering guarantees of x86, this instruction might only be relevant for SMP systems, though. Or does lock xchgb imply an mfence? Probably on older architectures (given the name "bus locked exchange"), but OTOH I wouldn't bet on that still being true. Locking the entire bus sounds like a prohibitively expensive operation with today's numbers of cores per system. Regards Markus Wanner [1]: random google hit on 'mfence': http://siyobik.info/index.php?module=x86&id=170
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 16:51:00 Tom Lane wrote: Markus Wanner mar...@bluegap.ch writes: Well, that certainly doesn't apply to full fences, that are not specific to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or 'mf' on ia64. Hm, what do those do exactly? We've never had any such thing in the Intel-ish spinlock asm, but if out-of-order writes are possible I should think we'd need 'em. Or does lock xchgb imply an mfence? Out-of-order writes are definitely possible if you consider multiple processors. Locked statements like 'lock xaddl;' guarantee that the specific operands (or their cachelines) are visible on all processors and are done atomically - but it's not influencing the whole cache like mfence would. Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Andres Freund and...@anarazel.de writes: Locked statements like 'lock xaddl;' guarantee that the specific operands (or their cachelines) are visible on all processors and are done atomically - but it's not influencing the whole cache like mfence would. Where is this "locking the whole cache" meme coming from? What we're looking for has nothing to do with locking anything. It's primarily a directive to the processor to flush any dirty cache lines out to main memory. It's not going to block any other processors. regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 17:25:57 Tom Lane wrote: Andres Freund and...@anarazel.de writes: Locked statements like 'lock xaddl;' guarantee that the specific operands (or their cachelines) are visible on all processors and are done atomically - but it's not influencing the whole cache like mfence would. Where is this "locking the whole cache" meme coming from? What we're looking for has nothing to do with locking anything. It's primarily a directive to the processor to flush any dirty cache lines out to main memory. It's not going to block any other processors. I was never talking about 'locking the whole cache' - I was talking about flushing/fencing it like a global read/write barrier would. And lock xchgb/xaddl does not imply anything for other cachelines but its own. I only used 'locked' in the context of 'lock xaddl'. Am I misunderstanding you? Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Andres Freund and...@anarazel.de writes: I was never talking about 'locking the whole cache' - I was talking about flushing/fencing it like a global read/write barrier would. And lock xchgb/xaddl does not imply anything for other cachelines but its own. If that's the case, why aren't the parallel regression tests falling over constantly? My recollection is that when I broke the sinval code by assuming strong memory ordering without spinlocks, it didn't take long at all for the PPC buildfarm members to expose the problem. If it's possible for Intel-ish processors to exhibit weak memory ordering behavior, I'm quite sure that our current code would be showing bugs everywhere. The impression I had of current Intel designs is that they ensure global cache coherency, ie if one processor has a dirty cache line the others know that, and will go get the updated data before attempting to access that piece of memory. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 10:44 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I completely agree, but I'm not too sure I want to drop support for any platform for which we haven't yet implemented such primitives. What's different about this case is that fall back to taking the spin lock is not a workable option. The point I was trying to make is that the fallback position can reasonably be a no-op. Hmm, maybe you're right. I was assuming weak memory ordering was a reasonably common phenomenon, but if it only applies to a very small number of architectures and we're pretty confident we know which ones they are, your approach would be far less frightening than I originally thought. But is that really true? I think it would be useful to try to build up a library of primitives in this area. For this particular task, we really only need a write-with-fence primitive and a read-with-fence primitive. On strong memory ordering machines, these can just do a store and a read, respectively; on weak memory ordering machines, they can insert whatever fencing operations are needed on either the store side or the load side. I think it would also be useful to provide macros for compare-and-swap and fetch-and-add on platforms where they are available. Then we could potentially write code like this: #ifdef HAVE_COMPARE_AND_SWAP ...do it the lock-free way... #else ...oh, well, do it with spinlocks... #endif -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
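The #ifdef pattern Robert proposes might look roughly like this. It is a sketch only: the macro name HAVE_COMPARE_AND_SWAP and the function pg_compare_and_swap are hypothetical, and the implementation leans on the GCC __sync builtin rather than any existing PostgreSQL facility:

```c
#include <stdint.h>

#if defined(__GNUC__)
#define HAVE_COMPARE_AND_SWAP
/* If *ptr equals oldval, store newval; either way, return the value
 * that was in *ptr beforehand, so success is (result == oldval). */
static inline uint32_t
pg_compare_and_swap(volatile uint32_t *ptr, uint32_t oldval, uint32_t newval)
{
    return __sync_val_compare_and_swap(ptr, oldval, newval);
}
#endif

static volatile uint32_t shared_counter = 0;

static void
increment_shared_counter(void)
{
#ifdef HAVE_COMPARE_AND_SWAP
    /* ...do it the lock-free way: retry until our CAS wins... */
    uint32_t old;

    do
        old = shared_counter;
    while (pg_compare_and_swap(&shared_counter, old, old + 1) != old);
#else
    /* ...oh, well, do it with spinlocks (elided in this sketch)... */
#endif
}
```

The attraction of the pattern is that platforms without the primitive still compile and run correctly, just more slowly.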
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes: I think it would be useful to try to build up a library of primitives in this area. For this particular task, we really only need a write-with-fence primitive and a read-with-fence primitive. That's really entirely the wrong way to think about it. You need a fence primitive, full stop. It's a sequence point, not an operation in itself. It guarantees that reads/writes occurring before or after it aren't resequenced around it. I don't even understand what write with fence means --- is the write supposed to be fenced against other writes before it, or other writes after it? I think it would also be useful to provide macros for compare-and-swap and fetch-and-add on platforms where they are available. That would be a great deal more work, because it's not a no-op anywhere; and our need for it is still rather hypothetical. I'm surprised to see you advocating that when you didn't want to touch fencing a moment ago. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
* Andres Freund: I was never talking about 'locking the whole cache' - I was talking about flushing/fencing it like a global read/write barrier would. And lock xchgb/xaddl does not imply anything for other cachelines but its own. My understanding is that once you've seen the result of an atomic operation on i386 and amd64, you are guaranteed to observe all prior writes performed by the thread which did the atomic operation, too. Explicit fencing is only necessary if you need synchronization without atomic operations. -- Florian Weimer fwei...@bfk.de BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
I wrote: Markus Wanner mar...@bluegap.ch writes: Well, that certainly doesn't apply to full fences, that are not specific to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or 'mf' on ia64. Hm, what do those do exactly? I poked around in the Intel manuals a bit. They do have mfence (also lfence and sfence) but so far as I can tell, those are only used to manage loads and stores that are issued by special instructions that explicitly mark the operation as weakly ordered. So the reason we're not seeing bugs is presumably that C compilers don't generate such instructions. Also, Intel architectures do guarantee cache consistency across multiple processors (and it costs them a lot...) I found a fairly interesting and detailed paper about memory fencing in the Linux kernel: http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: I think it would be useful to try to build up a library of primitives in this area. For this particular task, we really only need a write-with-fence primitive and a read-with-fence primitive. That's really entirely the wrong way to think about it. You need a fence primitive, full stop. It's a sequence point, not an operation in itself. It guarantees that reads/writes occurring before or after it aren't resequenced around it. I don't even understand what write with fence means --- is the write supposed to be fenced against other writes before it, or other writes after it? I was taking it to mean something similar to the memory guarantees around synchronized blocks in Java. At the start of a synchronized block you discard any cached data which you've previously read from or written to main memory, and must read everything fresh from that point. At the end of a synchronized block you must write any locally written values to main memory, although you retain them in your thread-local cache for possible re-use. Reads or writes from outside the synchronized block can be pulled into the block and reordered in among the reads and writes within the block (which may also be reordered) unless there's another block to contain them. It works fine once you have your head around it, and allows for significant optimization in a heavily multi-threaded application. I have no idea whether such a model would be useful for PostgreSQL. If I understand Tom he is proposing what sounds roughly like what could be achieved in the Java memory model by keeping all code for a process within a single synchronized block, with the fence being a point where you end it (flushing all local writes to main memory) and start a new one (forcing a discard of locally cached data). Of course I'm ignoring the locking aspect of synchronized blocks and just discussing the memory access aspect of them. 
(A synchronized block in Java always references some [any] Object, and causes an exclusive lock to be held on the object from one end of the block to the other.) -Kevin
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Kevin Grittner kevin.gritt...@wicourts.gov writes: Tom Lane t...@sss.pgh.pa.us wrote: That's really entirely the wrong way to think about it. You need a fence primitive, full stop. It's a sequence point, not an operation in itself. I was taking it to mean something similar to the memory guarantees around synchronized blocks in Java. At the start of a synchronized block you discard any cached data which you've previously read from or written to main memory, and must read everything fresh from that point. At the end of a synchronized block you must write any locally written values to main memory, although you retain them in your thread-local cache for possible re-use. That is basically the model that we have implemented in the spinlock primitives: taking a spinlock corresponds to starting a synchronized block and releasing the spinlock ends it. On processors that need it, the spinlock macros include memory fence instructions that implement the above semantics. However, for lock-free interactions I think this model isn't terribly helpful: it's not clear what is inside and what is outside the sync block, and forcing your code into that model doesn't improve either clarity or performance. What you typically need is a guarantee about the order in which writes become visible. To give a concrete example, the sinval bug I was mentioning earlier boiled down to assuming that a write into an element of the sinval message array would become visible to other processors before the change of the last-message pointer variable became visible to them. Without a fence instruction, that doesn't hold on WMO processors, and so they were able to fetch a stale message value. In some cases you also need to guarantee the order of reads. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
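Tom's sinval example can be sketched in C to show the ordering requirement concretely. The array layout, variable names, and barrier macros below are invented for illustration (and the barriers are stood in by the GCC __sync_synchronize builtin); this is not the actual sinval code:

```c
#include <stdint.h>

#define N_MESSAGES 64

static volatile int messages[N_MESSAGES]; /* the message array        */
static volatile int maxMsgNum;            /* the last-message pointer */

/* all stores before this become visible before any stores after it */
#define pg_write_barrier()  __sync_synchronize()
/* no loads after this are executed before the loads preceding it */
#define pg_read_barrier()   __sync_synchronize()

static void
add_message(int msg)
{
    messages[maxMsgNum % N_MESSAGES] = msg; /* store the payload ...   */
    pg_write_barrier();                     /* ... fence ...           */
    maxMsgNum++;                            /* ... then advertise it   */
}

static int
read_message(int msgNum)
{
    /* caller has already observed msgNum < maxMsgNum */
    pg_read_barrier();      /* don't let the payload load happen "early" */
    return messages[msgNum % N_MESSAGES];
}
```

Without the write barrier, a WMO processor may make the increment of maxMsgNum visible before the payload store, and a reader can fetch a stale message value - exactly the bug described above.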
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Tom Lane t...@sss.pgh.pa.us wrote: What you typically need is a guarantee about the order in which writes become visible. In some cases you also need to guarantee the order of reads. Doesn't that suggest different primitives? -Kevin
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 18:46:00 Tom Lane wrote: I wrote: Markus Wanner mar...@bluegap.ch writes: Well, that certainly doesn't apply to full fences, that are not specific to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or 'mf' on ia64. Hm, what do those do exactly? I poked around in the Intel manuals a bit. They do have mfence (also lfence and sfence) but so far as I can tell, those are only used to manage loads and stores that are issued by special instructions that explicitly mark the operation as weakly ordered. So the reason we're not seeing bugs is presumably that C compilers don't generate such instructions. Well. Some memcpy() implementations use string (or SIMD) operations which are weakly ordered though. Also, Intel architectures do guarantee cache consistency across multiple processors (and it costs them a lot...) Only if you are talking about the *same* locations though. See example 8.2.3.4, combined with: For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor. For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called "cache locking." The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area. Which means something like this (in Intel's terminology) can happen:

initially _X = 0

    P1: mov [_X], 1
    P1: lock xchg [_Y], 1
    P2: lock xchg [_Z], 1
    P2: mov r1, [_X]

A valid result is that r1 on P2 is 0.
I think that is not biting pg because it always uses the same spinlocks at the reading and writing side - but I am not that sure about that. Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Friday 19 November 2010 20:03:27 Andres Freund wrote: Which means something like this (in Intel's terminology) can happen:

initially _X = 0

    P1: mov [_X], 1
    P1: lock xchg [_Y], 1
    P2: lock xchg [_Z], 1
    P2: mov r1, [_X]

A valid result is that r1 on P2 is 0. I think that is not biting pg because it always uses the same spinlocks at the reading and writing side - but I am not that sure about that. Which also seems to mean that a simple read memory barrier that does __asm__ __volatile__("lock; xaddl $0, ???") seems not to be enough unless you use the same address for all those barriers, which would cause horrible cacheline bouncing. Am I missing something? Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Fri, Nov 19, 2010 at 1:51 PM, Tom Lane t...@sss.pgh.pa.us wrote: However, for lock-free interactions I think this model isn't terribly helpful: it's not clear what is inside and what is outside the sync block, and forcing your code into that model doesn't improve either clarity or performance. What you typically need is a guarantee about the order in which writes become visible. To give a concrete example, the sinval bug I was mentioning earlier boiled down to assuming that a write into an element of the sinval message array would become visible to other processors before the change of the last-message pointer variable became visible to them. Without a fence instruction, that doesn't hold on WMO processors, and so they were able to fetch a stale message value. In some cases you also need to guarantee the order of reads. But what about timings vs. random other stuff? Like in this case there's a problem if the signal arrives before the memory update to latch->is_set becomes visible. I don't know what we need to do to guarantee that. This page seems to indicate that x86 is OK as far as this is concerned - we can simply store a 1 and everyone will see it: http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2004-08/0979.html ...but if we were to, say, increment a counter at that location, it would not be safe without a LOCK prefix (further messages in the thread indicate that you might also have a problem if the address in question is unaligned). It's not obvious to me, however, what might be required on other processors. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
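Robert's distinction - a plain aligned store of a flag is safe on x86, but a read-modify-write such as a counter increment needs a lock prefix - can be illustrated with the GCC atomic builtins. The variable names are invented for the example:

```c
#include <stdint.h>

static volatile int latch_is_set = 0;   /* flag: plain store suffices  */
static volatile uint32_t counter = 0;   /* counter: needs an atomic RMW */

/* Setting a flag that readers only test for 0/1 is fine as a plain
 * aligned store on x86: every processor eventually sees the 1, and
 * the worst case is a redundant re-set. */
static void
set_flag(void)
{
    latch_is_set = 1;
}

/* counter++ compiles to separate load/add/store instructions, so two
 * processors can interleave and lose an update.  The builtin emits a
 * lock-prefixed xadd; it returns the value before the addition. */
static uint32_t
bump(void)
{
    return __sync_fetch_and_add(&counter, 1);
}
```

On weakly-ordered processors even the flag store may additionally need a barrier before the signal is sent, which is the question being debated here.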
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes: But what about timings vs. random other stuff? Like in this case there's a problem if the signal arrives before the memory update to latch->is_set becomes visible. I don't know what we need to do to guarantee that. I don't believe there's an issue there. A context swap into the kernel is certainly going to include msync. If you're afraid otherwise, you could put an msync before the kill() call, but I think it's a waste of effort. regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Andres Freund and...@anarazel.de writes: On Friday 19 November 2010 18:46:00 Tom Lane wrote: I poked around in the Intel manuals a bit. They do have mfence (also lfence and sfence) but so far as I can tell, those are only used to manage loads and stores that are issued by special instructions that explicitly mark the operation as weakly ordered. So the reason we're not seeing bugs is presumably that C compilers don't generate such instructions. Well. Some memcpy() implementations use string (or SIMD) operations which are weakly ordered though. I'd expect memcpy to msync at completion of the move if it does that kind of thing. Otherwise it's failing to ensure that the move is really done before it returns. For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor. For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called "cache locking." The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area. Like it says, the cache coherency mechanism prevents this from being a problem for us. Once the change is made in a processor's cache, it's the cache's job to ensure that all processors see it --- and on Intel architectures, the cache does take care of that. regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Saturday 20 November 2010 00:08:07 Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Friday 19 November 2010 18:46:00 Tom Lane wrote: I poked around in the Intel manuals a bit. They do have mfence (also lfence and sfence) but so far as I can tell, those are only used to manage loads and stores that are issued by special instructions that explicitly mark the operation as weakly ordered. So the reason we're not seeing bugs is presumably that C compilers don't generate such instructions. Well. Some memcpy() implementations use string (or SIMD) operations which are weakly ordered though. Like it says, the cache coherency mechanism prevents this from being a problem for us. Once the change is made in a processor's cache, it's the cache's job to ensure that all processors see it --- and on Intel architectures, the cache does take care of that. Check example 8.2.3.4 of 3a. - in my opinion that makes my example correct. Andres
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Mon, Nov 15, 2010 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: In SetLatch, is it enough to add the SpinLockAcquire() call *after* checking that is_set is not already set? Ie. still do the quick exit without holding a lock. Or do we need a memory barrier operation before the fetch, to ensure that we see if the other process has just cleared the flag with ResetLatch() ? Presumable ResetLatch() needs to call SpinLockAcquire() anyway to ensure that other processes see the clearing of the flag. Hmm ... I just remembered the reason why we didn't use a spinlock in these functions already. Namely, that it's unsafe for a signal handler to try to acquire a spinlock that the interrupted code might be holding. So I think a bit more thought is needed here. Maybe we need to bite the bullet and do memory barriers ... The signal handler just checks a process-local, volatile variable called waiting (which should be fine) and then sends a self-pipe byte. It deliberately *doesn't* take a spinlock. So unless I'm missing something (which is perfectly possible) protecting a few more things with a spinlock should be safe enough. Of course, there's still a potential *performance* problem if we end up doing a kernel call while holding a spin lock, but I'm uncertain whether we should worry about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes: On Mon, Nov 15, 2010 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote: Hmm ... I just remembered the reason why we didn't use a spinlock in these functions already. Namely, that it's unsafe for a signal handler to try to acquire a spinlock that the interrupted code might be holding. The signal handler just checks a process-local, volatile variable called waiting (which should be fine) and then sends a self-pipe byte. It deliberately *doesn't* take a spinlock. I'm not talking about latch_sigusr1_handler. I'm talking about the several already-existing signal handlers that call SetLatch. Now maybe it's possible to prove that none of those can fire in a process that's mucking with the same latch in-line, but that sounds fragile as heck; and anyway what of future usage? Given that precedent, somebody is going to write something unsafe at some point, and it'll fail only often enough to be seen in the field but not in our testing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Thu, Nov 18, 2010 at 5:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Nov 15, 2010 at 11:12 AM, Tom Lane t...@sss.pgh.pa.us wrote: Hmm ... I just remembered the reason why we didn't use a spinlock in these functions already. Namely, that it's unsafe for a signal handler to try to acquire a spinlock that the interrupted code might be holding. The signal handler just checks a process-local, volatile variable called waiting (which should be fine) and then sends a self-pipe byte. It deliberately *doesn't* take a spinlock. I'm not talking about latch_sigusr1_handler. I'm talking about the several already-existing signal handlers that call SetLatch. Now maybe it's possible to prove that none of those can fire in a process that's mucking with the same latch in-line, but that sounds fragile as heck; and anyway what of future usage? Given that precedent, somebody is going to write something unsafe at some point, and it'll fail only often enough to be seen in the field but not in our testing. Oh, I get it. You're right. We can't possibly assume that we're not trying to set the latch for our own process, because that's the whole point of having the self-pipe code in the first place. How about changing the API so that the caller must use one function, say, SetOwnedLatch(), to set a latch that they own, and another function, say, SetNonOwnedLatch(), to set a latch that they do not own? The first can simply set is_set (another process might fail to see that the latch has been set, but the worst thing they can do is set it over again, which should be fairly harmless) and send a self-pipe byte. The second can take the spin lock, set is_set, read the owner PID, release the spin lock, and send a signal. WaitLatch() can similarly take the spin lock before reading is_set. This requires the caller to know whether they are setting their own latch or someone else's latch, but I don't believe that's a problem given current usage. 
I'm all in favor of having some memory ordering primitives so that we can try to implement better algorithms, but if we use it here it amounts to a fairly significant escalation in the minimum requirements to compile PG (which is bad) rather than just a performance optimization (which is good). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Robert Haas robertmh...@gmail.com writes: I'm all in favor of having some memory ordering primitives so that we can try to implement better algorithms, but if we use it here it amounts to a fairly significant escalation in the minimum requirements to compile PG (which is bad) rather than just a performance optimization (which is good). I don't believe there would be any escalation in compilation requirements: we already have the ability to invoke stronger primitives than these. What is needed is research to find out what the primitives are called, on platforms where we aren't relying on direct asm access. My feeling is it's time to bite the bullet and do that work. We shouldn't cripple the latch operations because of laziness at the outset. regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Mon, Nov 15, 2010 at 2:15 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Can you elaborate? Weak memory ordering means that stores into shared memory initiated by one processor are not guaranteed to be observed to occur in the same sequence by another processor. This implies first that the latch code could malfunction all by itself, if two processes manipulate a latch at about the same time, and second (probably much less likely) that there could be a malfunction involving a process that's waited on a latch not seeing the shared-memory status updates that another process did before setting the latch. This is not at all hypothetical --- my first attempt at rewriting the sinval signaling code, a couple years back, failed on PPC machines in the buildfarm because of exactly this type of issue. Hmm, SetLatch only sets one flag, so I don't see how it could malfunction all by itself. And I would've thought that declaring the Latch variable volatile prevents rearrangements. It's not a question of code rearrangement. Suppose at time zero, the latch is unset, but owned. At approximately the same time, SetLatch() is called in one process and WaitLatch() in another process. SetLatch() sees that the latch is not set and sends SIGUSR1 to the other process. The other process receives the signal but, since waiting is not yet set, it ignores the signal. It then drains the self-pipe and examines latch->is_set. But as it turns out, the update by the process which called SetLatch() isn't yet visible to this process, because this process has a copy of those bytes in some internal cache that isn't guaranteed to be fully coherent. So even though SetLatch() already changed latch->is_set to true, it still looks false here. Therefore, we go to sleep on the latch. At this point, we are very likely screwed. If we're lucky, yet a third process will come along, also see the latch as still unset (even though it is actually set), and set it again, waking up the owner. 
But if we're unlucky, by the time that third process comes along, the memory update will have become visible everywhere and all future calls to SetLatch() will exit quickly, leaving the poor shmuck who waited on the latch sleeping for all eternity. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On 15.11.2010 15:22, Robert Haas wrote: On Mon, Nov 15, 2010 at 2:15 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Can you elaborate? Weak memory ordering means that stores into shared memory initiated by one processor are not guaranteed to be observed to occur in the same sequence by another processor. This implies first that the latch code could malfunction all by itself, if two processes manipulate a latch at about the same time, and second (probably much less likely) that there could be a malfunction involving a process that's waited on a latch not seeing the shared-memory status updates that another process did before setting the latch. This is not at all hypothetical --- my first attempt at rewriting the sinval signaling code, a couple years back, failed on PPC machines in the buildfarm because of exactly this type of issue. Hmm, SetLatch only sets one flag, so I don't see how it could malfunction all by itself. And I would've thought that declaring the Latch variable volatile prevents rearrangements. It's not a question of code rearrangement. Rearrangement of code, rearrangement of CPU instructions, or rearrangement of the order the changes in the memory become visible to other processes. The end result is the same. Suppose at time zero, the latch is unset, but owned. At approximately the same time, SetLatch() is called in one process and WaitLatch() in another process. SetLatch() sees that the latch is not set and sends SIGUSR1 to the other process. The other process receives the signal but, since waiting is not yet set, it ignores the signal. It then drains the self-pipe and examines latch->is_set. But as it turns out, the update by the process which called SetLatch() isn't yet visible to this process, because this process has a copy of those bytes in some internal cache that isn't guaranteed to be fully coherent. So even though SetLatch() already changed latch->is_set to true, it still looks false here. 
Therefore, we go to sleep on the latch. Surely marking the latch pointer volatile would force the store to is_set to be flushed, if not immediately, at least before the kill() system call. No? Looking at Tom's patch in sinvaladt.c, the problem there was that the store of the shared maxMsgNum variable could become visible to other processes after the store of the message itself. Using a volatile pointer for maxMsgNum would not have helped with that, because the operations on other variables might still be rearranged with respect to the store of maxMsgNum. SetLatch is simpler, there is only one variable (ok, two, but your scenario didn't involve a change in owner_pid). It seems safe to assume that the store becomes visible before the system call. Tom's other scenario, where changing some other variable in shared memory might not have become visible to other processes when SetLatch() runs, seems more plausible (if harder to run into in practice). But if the variable is meant to be examined by other processes, then you should use a lock to protect it. Otherwise you'll have concurrency issues anyway. Or at least use a volatile pointer; I believe it's safe to assume that two operations using a volatile pointer will not be rearranged wrt. each other. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Mon, Nov 15, 2010 at 8:45 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It's not a question of code rearrangement. Rearrangement of code, rearrangement of CPU instructions, or rearrangement of the order the changes in the memory become visible to other processes. The end result is the same. I'll let Tom speak to this, because he understands it better than I do, but I don't think this is really true. Rearrangement of code is something that the compiler has control over, and volatile addresses that issue by preventing the compiler from making certain assumptions, but the order in which memory operations become visible is a result of the architecture of the memory bus, and the compiler doesn't directly do anything about that. It won't, for example, emit a cmpxchg instruction rather than a simple store just because you declared the variable volatile. Suppose at time zero, the latch is unset, but owned. At approximately the same time, SetLatch() is called in one process and WaitLatch() in another process. SetLatch() sees that the latch is not set and sends SIGUSR1 to the other process. The other process receives the signal but, since waiting is not yet set, it ignores the signal. It then drains the self-pipe and examines latch->is_set. But as it turns out, the update by the process which called SetLatch() isn't yet visible to this process, because this process has a copy of those bytes in some internal cache that isn't guaranteed to be fully coherent. So even though SetLatch() already changed latch->is_set to true, it still looks false here. Therefore, we go to sleep on the latch. Surely marking the latch pointer volatile would force the store to is_set to be flushed, if not immediately, at least before the kill() system call. No? I don't think so. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Hmm, SetLatch only sets one flag, so I don't see how it could malfunction all by itself. And I would've thought that declaring the Latch variable volatile prevents rearrangements. It's not a question of code rearrangement. Precisely. volatile prevents the compiler from rearranging the instruction sequence in a way that would *issue* stores out-of-order. However, that doesn't prevent the hardware from *executing* the stores out-of-order from the point of view of a different processor. As Robert noted, the risk cases here come from caching; in particular, that a dirty cache line might get flushed to main memory later than some other dirty cache line. There are some architectures that guarantee that this won't happen (no doubt at significant expenditure of gates). There are others that don't, preferring to optimize the single-processor case. On those, you need an msync type of instruction to force dirty cache lines out to main memory between any two stores whose effects had better become visible to another processor in a certain order. I'm not sure if it's universal, but on PPC there are actually two different sync concepts involved, a write barrier that does the above and a read barrier that ensures the reading processor is up-to-date. I believe it's safe to assume that two operations using a volatile pointer will not be rearranged wrt. each other. This is entirely wrong, so far as cross-processor visibility of changes is concerned. Whether it should be true in some ideal reading of the C spec is not relevant: there are common architectures in which it is not true. The window for trouble is normally pretty small, and in particular any kernel call or context swap is usually going to force an msync. So I'm not sure that there are any risk spots in practice right now with SetLatch. But if we expand our use of it, we can be 100% certain we'll get bit eventually. 
Tom's other scenario, where changing some other variable in shared memory might not have become visible to other processes when SetLatch() runs, seems more plausible (if harder to run into in practice). But if the variable is meant to be examined by other processes, then you should use a lock to protect it. In that case, of what use is the latch stuff? The whole point with that (or at least a lot of the point) is to not have to take locks. regards, tom lane
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On Mon, Nov 15, 2010 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Hmm, SetLatch only sets one flag, so I don't see how it could malfunction all by itself. And I would've thought that declaring the Latch variable volatile prevents rearrangements. It's not a question of code rearrangement. Precisely. volatile prevents the compiler from rearranging the instruction sequence in a way that would *issue* stores out-of-order. However, that doesn't prevent the hardware from *executing* the stores out-of-order from the point of view of a different processor. As Robert noted, the risk cases here come from caching; in particular, that a dirty cache line might get flushed to main memory later than some other dirty cache line. There are some architectures that guarantee that this won't happen (no doubt at significant expenditure of gates). And in fact if this (interesting!) video is any indication, that problem is only going to get worse as core counts go up. This guy built a lock-free, wait-free hash table implementation that can run on a system with hundreds of cores. I'm just guessing here, but I strongly suspect that keeping memory in full sync across that many processors would just kill performance, so they shrug their shoulders and don't. The application programmer gets to pick up the pieces. http://video.google.com/videoplay?docid=2139967204534450862# -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On 15.11.2010 16:51, Tom Lane wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: I believe it's safe to assume that two operations using a volatile pointer will not be rearranged wrt. each other. This is entirely wrong, so far as cross-processor visibility of changes is concerned. Ok. In SetLatch, is it enough to add the SpinLockAcquire() call *after* checking that is_set is not already set? Ie. still do the quick exit without holding a lock. Or do we need a memory barrier operation before the fetch, to ensure that we see if the other process has just cleared the flag with ResetLatch()? Presumably ResetLatch() needs to call SpinLockAcquire() anyway to ensure that other processes see the clearing of the flag. Tom's other scenario, where changing some other variable in shared memory might not have become visible to other processes when SetLatch() runs, seems more plausible (if harder to run into in practice). But if the variable is meant to be examined by other processes, then you should use a lock to protect it. In that case, of what use is the latch stuff? The whole point with that (or at least a lot of the point) is to not have to take locks. The use case for a latch is to wake up another process to examine another piece of shared memory (or a file or something else), and take action based on that other state if needed. Access to that other piece of shared memory needs locking or some other means of concurrency control, regardless of the mechanism used to wake up the other process. Take the walsender latches for example. The other piece of shared memory is the current WAL flush location. The latch is used to wake up a walsender after flushing some WAL. The latch itself doesn't protect the access to the WAL flush pointer in any way; GetFlushRecPtr() uses a spinlock for that. 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: In SetLatch, is it enough to add the SpinLockAcquire() call *after* checking that is_set is not already set? Ie. still do the quick exit without holding a lock. Or do we need a memory barrier operation before the fetch, to ensure that we see if the other process has just cleared the flag with ResetLatch()? Presumably ResetLatch() needs to call SpinLockAcquire() anyway to ensure that other processes see the clearing of the flag. Hmm ... I just remembered the reason why we didn't use a spinlock in these functions already. Namely, that it's unsafe for a signal handler to try to acquire a spinlock that the interrupted code might be holding. So I think a bit more thought is needed here. Maybe we need to bite the bullet and do memory barriers ... regards, tom lane
Re: [HACKERS] max_wal_senders must die
On 13.11.2010 17:07, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: Come to think of it, I'm not really sure I understand what protects SetLatch() against memory ordering hazards. Is that actually safe? Hmm ... that's a good question. It certainly *looks* like it could malfunction on machines with weak memory ordering. Can you elaborate? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] max_wal_senders must die
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 13.11.2010 17:07, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: Come to think of it, I'm not really sure I understand what protects SetLatch() against memory ordering hazards. Is that actually safe? Hmm ... that's a good question. It certainly *looks* like it could malfunction on machines with weak memory ordering. Can you elaborate? Weak memory ordering means that stores into shared memory initiated by one processor are not guaranteed to be observed to occur in the same sequence by another processor. This implies first that the latch code could malfunction all by itself, if two processes manipulate a latch at about the same time, and second (probably much less likely) that there could be a malfunction involving a process that's waited on a latch not seeing the shared-memory status updates that another process did before setting the latch. This is not at all hypothetical --- my first attempt at rewriting the sinval signaling code, a couple years back, failed on PPC machines in the buildfarm because of exactly this type of issue. The quick-and-dirty way to fix this is to attach a spinlock to each latch, because we already have memory ordering sync instructions in the spinlock primitives. Doing better would probably involve developing a new set of processor-specific primitives --- which would be pretty easy for the processors where we have gcc inline asm, but it would take some research for the platforms where we're relying on magic OS-provided subroutines. regards, tom lane
Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)
On 14.11.2010 22:55, Tom Lane wrote: Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 13.11.2010 17:07, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: Come to think of it, I'm not really sure I understand what protects SetLatch() against memory ordering hazards. Is that actually safe? Hmm ... that's a good question. It certainly *looks* like it could malfunction on machines with weak memory ordering. Can you elaborate? Weak memory ordering means that stores into shared memory initiated by one processor are not guaranteed to be observed to occur in the same sequence by another processor. This implies first that the latch code could malfunction all by itself, if two processes manipulate a latch at about the same time, and second (probably much less likely) that there could be a malfunction involving a process that's waited on a latch not seeing the shared-memory status updates that another process did before setting the latch. This is not at all hypothetical --- my first attempt at rewriting the sinval signaling code, a couple years back, failed on PPC machines in the buildfarm because of exactly this type of issue. Hmm, SetLatch only sets one flag, so I don't see how it could malfunction all by itself. And I would've thought that declaring the Latch variable volatile prevents rearrangements. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] max_wal_senders must die
On Fri, Nov 12, 2010 at 11:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Bruce Momjian br...@momjian.us writes: Right. I propose that we set max_wal_senders to unlimited when wal_level = hot_standby. It's a memory allocation parameter ... you can't just set it to unlimited, at least not without a nontrivial amount of work. Yes. This thread would benefit from less uneducated speculation and more examination of what the parameter actually does. I've looked at how it's set up in the hopes of finding an easy fix and haven't come up with anything all that wonderful yet. In particular, there is this code, that gets run on every commit (unless max_wal_senders is zero):

/* Wake up all walsenders */
void
WalSndWakeup(void)
{
	int		i;

	for (i = 0; i < max_wal_senders; i++)
		SetLatch(&WalSndCtl->walsnds[i].latch);
}

Notice that this code is able to iterate over all of the WAL senders that might exist without taking any sort of lock. You do NOT want to unnecessarily iterate over the entire procarray here. The obvious fix is to keep an array that contains only the latches for the WAL senders that actually exist at the moment, but then you have to insert a lock acquisition and release in here, or risk backends failing to see an update to the shared variable that identifies the end of the list, which seems like a pretty bad idea too. One idea I've had is that we might want to think about defining an operation that is effectively store, with a memory barrier. For example, on x86, this could be implemented using xchg. I think if you have a single-word variable in shared memory that is always updated using a locked store, then individual backends should be able to read it unlocked without risk of seeing a stale value. So in the above example you could have an array in shared memory and a variable updated in the manner just described indicating how many slots of the array are in use, and backends could iterate up to the end of the array without needing a lock on the number of slots. 
Of course, you still have to figure out how to compact the array when a WAL sender exits, but maybe that could be done with an extra layer of indirection. Come to think of it, I'm not really sure I understand what protects SetLatch() against memory ordering hazards. Is that actually safe? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] max_wal_senders must die
Robert Haas robertmh...@gmail.com writes: One idea I've had is that we might want to think about defining an operation that is effectively store, with a memory barrier. For example, on x86, this could be implemented using xchg. I think if you have a single-word variable in shared memory that is always updated using a locked store, then individual backends should be able to read it unlocked without risk of seeing a stale value. You're still guilty of fuzzy thinking here. What does stale mean? To do anything useful, you need to be able to fetch the value, execute some sequence of operations that depends on the value, and be assured that the value you fetched remains relevant throughout that sequence. A memory barrier by itself doesn't help. I have seen one or two places where we could use a memory barrier primitive that doesn't include a lock, but they had to do with ensuring that different writes would be seen to have occurred in a particular order. It is not obvious how we could use one here. Now, one could also argue that commit is already a sufficiently heavyweight operation that taking/releasing one more LWLock won't matter too much. But it would be nice to back that argument with some evidence. Come to think of it, I'm not really sure I understand what protects SetLatch() against memory ordering hazards. Is that actually safe? Hmm ... that's a good question. It certainly *looks* like it could malfunction on machines with weak memory ordering. regards, tom lane
Re: [HACKERS] max_wal_senders must die
On Sat, Nov 13, 2010 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: One idea I've had is that we might want to think about defining an operation that is effectively store, with a memory barrier. For example, on x86, this could be implemented using xchg. I think if you have a single-word variable in shared memory that is always updated using a locked store, then individual backends should be able to read it unlocked without risk of seeing a stale value. You're still guilty of fuzzy thinking here. What does stale mean? Well, that obviously depends on what algorithm you use to add and remove items from the array. I know my thinking is fuzzy; I was trying to say I'm mulling over whether we could do something like this rather than I know exactly how to make this work. *thinks it over* Here's an algorithm that might work. Let's assume that we're still going to allocate an array of size max_wal_senders, but we want to make max_wal_senders relatively large but avoid iterating over the entire array on each commit. Store a variable in shared memory called FirstWalSenderOffset which is always updated using a memory barrier operation. This value is -1 if there are no currently connected WAL senders, or the array slot of a currently connected walsender if there is one. Each array slot also has a structure member called NextWalSenderOffset, so that the list of connected WAL senders is effectively stored as a linked list, but with array slot numbers rather than pointers. To add a WAL sender, we initialize the new array slot, setting NextWalSenderOffset to FirstWalSenderOffset, and then set FirstWalSenderOffset to the slot number of the newly allocated slot. To remove a WAL sender, we loop through the notional linked list and, when we find the pointer to the slot we want to remove, we overwrite it with a pointer to the slot to which the removed entry points. 
All of these stores are done using the store-with-memory-barrier, so that readers can iterate over the linked list without a lock. However, we can't let two processes try to MODIFY the list at the same time (at least, not without a more powerful memory ordering primitive; CMPXCHG might be enough) so just protect list modification with a global spinlock; updates won't be frequent. This algorithm makes the assumption that it's OK for a scan to miss WAL senders that are added mid-scan. But the current code makes that assumption, too: a new WAL sender could grab an array slot we've already passed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] max_wal_senders must die
Josh Berkus wrote: None of us know. What I do know is that I don't want PostgreSQL to be slower out of the box. Understandable. So it seems like the answer is getting replication down to one configuration variable for the common case. That eliminates the cycle of oops, need to set X and restart/reload without paying performance penalties on standalone servers. Right. I propose that we set max_wal_senders to unlimited when wal_level = hot_standby. When they tell us they are using hot_standby via wal_level, why make them change another setting (max_wal_senders)? Basically, we don't need to turn everything on by default, but some settings should trigger other behavior automatically. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] max_wal_senders must die
Bruce Momjian br...@momjian.us writes: Right. I propose that we set max_wal_senders to unlimited when wal_level = hot_standby. It's a memory allocation parameter ... you can't just set it to unlimited, at least not without a nontrivial amount of work. regards, tom lane
Re: [HACKERS] max_wal_senders must die
[going back on list with this] Selena Deckelmann selenama...@gmail.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: the other three DBAs here implemented the HS/SR while I was out They told me that it was working great once they figured it out, but it was confusing; it took them a lot of time and a few false starts to get it working. I've been trying to get details to support an improvement in documentation Just curious -- did they use the wiki documentation at all? Was any of that more or less helpful? I finally got a chance to chat with the other DBAs about this -- they actually *just* looked at the Wiki to get through it, and didn't go to the manual at all. They eventually concluded that the problems were all because this was done on a machine where we had multiple major releases of PostgreSQL running for different database clusters, and the version on $PATH was not changed to 9.0. Even though they started PostgreSQL on the standby server with an explicit path to the 9.0 version, it seemed to find some executable or library from 8.4 on $PATH, resulting in bizarre and unhelpful error messages. We do our own builds with a prefix like /usr/local/pgsql-9.0.1 and create a symlink from /usr/local/pgsql to what we want as the default on the machine, when an explicit path is not specified. When they pointed the symlink to 9.0.1 everything worked as expected. They said that except for the quirky path behavior, the installation went fine; the Wiki page instructions were clear and adequate and the installation process was not difficult or confusing. This path issue sounds like a bug to me. -Kevin
Re: [HACKERS] max_wal_senders must die
Hi, On Wednesday 03 November 2010 20:28:03 Kevin Grittner wrote: They said that except for the quirky path behavior, the installation went fine; the Wiki page instructions were clear and adequate and that installation process was not difficult or confusing. This path issue sounds like a bug to me I guess you built both in the same place and just prefix installed it to different directories? That can cause issues like that unless you use --disable-rpath. Andres
Re: [HACKERS] max_wal_senders must die
Andres Freund and...@anarazel.de wrote: I guess you built both in the same place and just prefix installed it to different directories? We always build in a directory tree with a name based on the version, with a prefix based on the version. This is routine for us. I have a hard time believing that they made an error on that, but I'll try to re-create in a controlled way so that I can report specific error messages and confirm my assumptions. -Kevin
Re: [HACKERS] max_wal_senders must die
Joshua D. Drake wrote: On Wed, 2010-10-27 at 19:52 -0400, Robert Haas wrote: Josh Berkus wrote: *you don't know* how many .org users plan to implement replication, whether it's a minority or majority. None of us know. What I do know is that I don't want PostgreSQL to be slower out of the box. Poll TIME! If you do take a poll, be careful to put in an option or two to deal with environments where there is surgical implementation of replication features. We'll almost certainly be using SR with a custom WAL receiver as part of our solution for our biggest and most distributed data set (circuit court data), but an out-of-the-box drop-in usage there is not in the cards anytime soon; whereas we're already using HS/SR out of the box for a small RoR web app's data. By the way, the other three DBAs here implemented the HS/SR while I was out for a couple days while my dad was in the hospital (so they didn't want to even bother me with a phone call). They went straight from the docs, without the benefit of having tracked any PostgreSQL lists. They told me that it was working great once they figured it out, but it was confusing; it took them a lot of time and a few false starts to get it working. I've been trying to get details to support an improvement in documentation, but if those guys had problems I agree we need to do *something* to make this simpler -- they're bright professionals who manage hundreds of PostgreSQL databases on a full time basis. -Kevin
Re: [HACKERS] max_wal_senders must die
Excerpts from Tom Lane's message of Wed Oct 27 19:01:38 -0300 2010: I don't know what Simon is thinking, but I think he's nuts. There is obvious extra overhead in COMMIT: /* * Wake up all walsenders to send WAL up to the COMMIT record * immediately if replication is enabled */ if (max_wal_senders > 0) WalSndWakeup(); which AFAICT is injecting multiple kernel calls into what's not only a hot-spot but a critical section (ie, any error -> PANIC). Hmm, I wonder if that could be moved out of the critical section somehow. Obviously the point here is to allow wal senders to react before we write to clog (which is expensive by itself); but it seems hard to wake up some other process without incurring exactly the same cost (which is basically SetLatch) ... the only difference is that it would be a single one instead of one per walsender. BTW I note that there are no elog(ERROR) calls in that code path at all, because syscall errors are ignored, so PANIC is not a concern (as the code stands currently, at least). ISTM it would be good to have a comment on SetLatch stating that it's used inside critical sections, though. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
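The "move it out of the critical section" idea discussed above could look roughly like the following. This is a hedged sketch, not PostgreSQL's implementation: the function names and the counter standing in for the actual WalSndWakeup() syscalls are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: inside the commit critical section we only set a
 * flag (no syscalls, so no error can PANIC us there); the actual wakeup,
 * whose syscalls might fail, is deferred until the section has ended. */
static bool wal_wakeup_pending = false;
static int  wakeups_issued = 0;         /* stands in for WalSndWakeup() */

static void
commit_record_inserted(void)
{
    /* inside the critical section: just remember that a wakeup is owed */
    wal_wakeup_pending = true;
}

static void
end_critical_section(void)
{
    /* outside the critical section: syscall failure could now be a
     * plain elog(ERROR) rather than a PANIC */
    if (wal_wakeup_pending)
    {
        wal_wakeup_pending = false;
        wakeups_issued++;               /* would call WalSndWakeup() here */
    }
}
```

The trade-off Alvaro notes still applies: the standbys learn about the commit slightly later, since the wakeup no longer precedes the clog write.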
Re: [HACKERS] max_wal_senders must die
Alvaro Herrera alvhe...@commandprompt.com writes: BTW I note that there are no elog(ERROR) calls in that code path at all, because syscall errors are ignored, so PANIC is not a concern (as the code stands currently, at least). ISTM it would be good to have a comment on SetLatch stating that it's used inside critical sections, though. Yeah, I was thinking the same while reading the code yesterday. It already notes that it's used in interrupt handlers, but the critical section angle is an additional reason not to elog(ERROR). regards, tom lane
Re: [HACKERS] max_wal_senders must die
On Thu, 2010-10-28 at 07:05 -0500, Kevin Grittner wrote: Joshua D. Drake wrote: On Wed, 2010-10-27 at 19:52 -0400, Robert Haas wrote: Josh Berkus wrote: *you don't know* how many .org users plan to implement replication, whether it's a minority or majority. None of us know. What I do know is that I don't want PostgreSQL to be slower out of the box. Poll TIME! If you do take a poll, be careful to put in an option or two to deal with environments where there is surgical implementation of replication features. And Poll it is: https://www.postgresqlconference.org/content/replication-poll You don't have to login to take it but of course it helps with validity of results. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Re: [HACKERS] max_wal_senders must die
https://www.postgresqlconference.org/content/replication-poll You don't have to login to take it but of course it helps with validity of results. Oh, I'd already put something up on http://www.postgresql.org/community -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_wal_senders must die
On Thu, 2010-10-28 at 16:25 -0700, Josh Berkus wrote: https://www.postgresqlconference.org/content/replication-poll You don't have to login to take it but of course it helps with validity of results. Oh, I'd already put something up on http://www.postgresql.org/community Sorry, didn't know... I have 122 responses so far, which I think will be surprising (some of them certainly surprised me). I will keep it open until next week and then post the results. JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Re: [HACKERS] max_wal_senders must die
Sorry, didn't know... I have 122 responses so far, which I think will be surprising (some of them certainly surprised me). I will keep it open until next week and then post the results. Well, for any community site poll, I hope you realize that there's a LOT of sampling error. Here's another one: http://www.postgresql.org/community/survey.71 -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_wal_senders must die
On Thu, 2010-10-28 at 17:12 -0700, Josh Berkus wrote: Sorry, didn't know... I have 122 responses so far, which I think will be surprising (some of them certainly surprised me). I will keep it open until next week and then post the results. Well, for any community site poll, I hope you realize that there's a LOT of sampling error. Here's another one: http://www.postgresql.org/community/survey.71 Oh sure. I don't expect this to be some kind of authoritative reference but it is certainly worth at least reviewing. If nothing else it is fun to see the responses and consider their meaning based on your own views. JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Re: [HACKERS] max_wal_senders must die
On Tue, 2010-10-19 at 15:32 -0400, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Oct 19, 2010 at 12:18 PM, Josh Berkus j...@agliodbs.com wrote: On 10/19/2010 09:06 AM, Greg Smith wrote: I think Magnus's idea to bump the default to 5 triages the worst of the annoyance here, without dropping the feature (which has uses) or waiting for new development to complete. Setting max_wal_senders to a non-zero value causes additional work to be done every time a transaction commits, aborts, or is prepared. Yes. Sorry guys, but that is completely wrong. There is no additional work to be done each time a transaction commits, even with sync rep. And I don't mean nearly zero, I mean nada. This isn't just a numeric parameter; it's also a boolean indicating "do I want to pay the overhead to be prepared to be a replication master?". Agreed, but it's to do with wal_level. Josh has completely failed to make a case that that should be the default. Agreed. In fact, the system would fail to start at all if we just changed the default for max_wal_senders and not the default for wal_level. Agree with that as a problem. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] max_wal_senders must die
On Wed, 2010-10-27 at 10:05 -0700, Josh Berkus wrote: Josh has completely failed to make a case that that should be the default. Agreed. In what way have I failed to make a case? I just removed a huge hurdle on the journey to simplification. That doesn't mean I think you have come up with an acceptable solution, though there probably is one. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] max_wal_senders must die
Josh Berkus j...@agliodbs.com writes: Josh has completely failed to make a case that that should be the default. Agreed. In what way have I failed to make a case? You're assuming that we should set up the default behavior to support replication and penalize those who aren't using it. Considering that we haven't even *had* replication until now, it seems a pretty safe bet that the majority of our users aren't using it and won't appreciate that default. We routinely expend large amounts of effort to avoid cross-version performance regressions, and I don't see that this one is acceptable when others aren't. I entirely agree that it ought to be easier to set up replication. But there's a difference between having a big red EASY button for people to push, and pushing it for them. regards, tom lane
Re: [HACKERS] max_wal_senders must die
On Wed, 2010-10-27 at 15:33 -0400, Tom Lane wrote: Josh Berkus j...@agliodbs.com writes: Josh has completely failed to make a case that that should be the default. Agreed. In what way have I failed to make a case? You're assuming that we should set up the default behavior to support replication and penalize those who aren't using it. Considering that we haven't even *had* replication until now, it seems a pretty safe bet that the majority of our users aren't using it and won't appreciate that default. We routinely expend large amounts of effort to avoid cross-version performance regressions, and I don't see that this one is acceptable when others aren't. I entirely agree that it ought to be easier to set up replication. But there's a difference between having a big red EASY button for people to push, and pushing it for them. Replication is an option, not a requirement. So +1 on Tom's argument here. regards, tom lane -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Re: [HACKERS] max_wal_senders must die
You're assuming that we should set up the default behavior to support replication and penalize those who aren't using it. What's the penalty? Simon just said that there isn't one. And there's a difference between saying that "I failed to make a case" vs. "the cost is too great". Saying the former is saying that my argument lacks merit (or content) entirely, rather than saying that it's not sufficient. I made a case, the case just didn't persuade you ... yet. I entirely agree that it ought to be easier to set up replication. But there's a difference between having a big red EASY button for people to push, and pushing it for them. If we have a single boolean GUC called "replication", I would be happy. Even if it defaulted to off. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_wal_senders must die
On Wed, Oct 27, 2010 at 12:33 PM, Tom Lane t...@sss.pgh.pa.us wrote: You're assuming that we should set up the default behavior to support replication and penalize those who aren't using it. Considering that we haven't even *had* replication until now, it seems a pretty safe bet that the majority of our users aren't using it and won't appreciate that default. We routinely expend large amounts of effort to avoid cross-version performance regressions, and I don't see that this one is acceptable when others aren't. So I think we're talking about two different things.

a) whether to default to enabling the no-wal optimizations for newly created tables which break replication
b) whether to impose a limit on the number of replication slaves by default
c) whether we can make these flags changeable without restarting

I think (a) is a non-starter but if we can achieve (b) and (c) without (a) then we're in pretty good shape. Someone would still have to enable replication manually but they could do it without restarting, and once they do it that one parameter would be sufficient; there wouldn't be any hidden options lying in wait to trap them. I think (c) is doable -- if we remember when the last non-logged operation was, we can refuse replication connections unless replication is active and the replica isn't requesting any WAL from prior to the last unlogged operation. -- greg
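Greg's point (c) amounts to a simple admission check at connection time. The sketch below is hypothetical: the function name, the treatment of an LSN as a plain integer, and the globals are all illustrative assumptions, not PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the check in (c): remember the WAL position of
 * the last non-logged operation, and refuse any standby that needs WAL
 * from before it, since that WAL never captured the skipped writes. */
static uint64_t last_unlogged_lsn = 0;  /* advanced by unlogged operations */

static bool
walsender_request_ok(bool replication_enabled, uint64_t requested_lsn)
{
    if (!replication_enabled)
        return false;                   /* replication not switched on */

    /* standby must not need WAL predating the last unlogged operation */
    return requested_lsn >= last_unlogged_lsn;
}
```

Under this scheme, flipping replication on without a restart is safe: any standby whose requested start position predates the last unlogged operation is simply turned away and must take a fresh base backup.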
Re: [HACKERS] max_wal_senders must die
Josh Berkus j...@agliodbs.com writes: You're assuming that we should set up the default behavior to support replication and penalize those who aren't using it. What's the penalty? Simon just said that there isn't one. I don't know what Simon is thinking, but I think he's nuts. There is obvious extra overhead in COMMIT: /* * Wake up all walsenders to send WAL up to the COMMIT record * immediately if replication is enabled */ if (max_wal_senders > 0) WalSndWakeup(); which AFAICT is injecting multiple kernel calls into what's not only a hot-spot but a critical section (ie, any error -> PANIC). That's not even considering the extra WAL that is generated when you move up from wal_level = minimal. That's probably the bigger performance issue in practice. And there's a difference between saying that "I failed to make a case" vs. "the cost is too great". I said, and meant, that you didn't make the case at all; you just presumed it was obvious that we should change the defaults to be replication-friendly. I don't think it is. As I said, I think that only a minority of our users are going to want replication. regards, tom lane
Re: [HACKERS] max_wal_senders must die
That's not even considering the extra WAL that is generated when you move up from wal_level = minimal. That's probably the bigger performance issue in practice. Yeah, I think we've established that we can't change that. I said, and meant, that you didn't make the case at all; you just presumed it was obvious that we should change the defaults to be replication-friendly. I don't think it is. As I said, I think that only a minority of our users are going to want replication. 50% of PGX's active clients have either already converted to 9.0 replication or have scheduled a conversion with us. I expect that to be 80% by the time 9.1 comes out, and the main reason why it's not 100% is that a few clients specifically need Slony (partial replication or similar) or ad-hoc replication systems. Every time I do a walk-through of how to do replication at a PG event it's packed. I've talked to dozens of people who are planning to implement 9.0 replication at conferences, and it's outpaced "how does it compare to MySQL" for stuff people ask me about at booths. From where I sit, you're *dramatically* underestimating the demand for replication. Maybe other people haven't had the same experiences, but I'm seeing an avalanche of demand. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_wal_senders must die
On Wed, 2010-10-27 at 16:13 -0700, Josh Berkus wrote: That's not even considering the extra WAL that is generated when you move up from wal_level = minimal. That's probably the bigger performance issue in practice. Yeah, I think we've established that we can't change that. I said, and meant, that you didn't make the case at all; you just presumed it was obvious that we should change the defaults to be replication-friendly. I don't think it is. As I said, I think that only a minority of our users are going to want replication. 50% of PGX's active clients have either already converted to 9.0 replication or have scheduled a conversion with us. I expect that to be 80% by the time 9.1 comes out, and the main reason why it's not 100% is that a few clients specifically need Slony (partial replication or similar) or ad-hoc replication systems. That's interesting. ZERO % of CMD's clients have converted to 9.0 and many have no current inclination to do so because they are already easily served by Londiste, Slony, DRBD or Log Shipping. I would also agree that the minority of our users will want replication. The majority of CMD customers, PGX customers, EDB Customers will want replication but that is by far NOT the majority of our (.Org) users. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Re: [HACKERS] max_wal_senders must die
I would also agree that the minority of our users will want replication. The majority of CMD customers, PGX customers, EDB Customers will want replication but that is by far NOT the majority of our (.Org) users. That just means that *you don't know* how many .org users plan to implement replication, whether it's a minority or majority. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_wal_senders must die
On Wed, Oct 27, 2010 at 7:45 PM, Josh Berkus j...@agliodbs.com wrote: I would also agree that the minority of our users will want replication. The majority of CMD customers, PGX customers, EDB Customers will want replication but that is by far NOT the majority of our (.Org) users. That just means that *you don't know* how many .org users plan to implement replication, whether it's a minority or majority. None of us know. What I do know is that I don't want PostgreSQL to be slower out of the box. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] max_wal_senders must die
On Wed, 2010-10-27 at 19:52 -0400, Robert Haas wrote: On Wed, Oct 27, 2010 at 7:45 PM, Josh Berkus j...@agliodbs.com wrote: I would also agree that the minority of our users will want replication. The majority of CMD customers, PGX customers, EDB Customers will want replication but that is by far NOT the majority of our (.Org) users. That just means that *you don't know* how many .org users plan to implement replication, whether it's a minority or majority. None of us know. What I do know is that I don't want PostgreSQL to be slower out of the box. Poll TIME! JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Re: [HACKERS] max_wal_senders must die
None of us know. What I do know is that I don't want PostgreSQL to be slower out of the box. Understandable. So it seems like the answer is getting replication down to one configuration variable for the common case. That eliminates the cycle of "oops, need to set X and restart/reload" without paying performance penalties on standalone servers. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_wal_senders must die
On Thu, Oct 21, 2010 at 8:33 PM, Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: On Thu, Oct 21, 2010 at 4:21 PM, Josh Berkus j...@agliodbs.com wrote: On 10/20/10 6:54 PM, Robert Haas wrote: I find it impossible to believe that's a good decision, and IMHO we should be focusing on how to make the parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us most of the same benefits without throwing away hard-won performance. I'd be happy to accept that. Is it possible, though? I sketched an outline of the problem AIUI here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg01348.php I think it's possible; I'm not quite sure how hard it is. Unfortunately, I've not had as much PG-hacking time lately as I'd like... Have we documented these TODOs? I have not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] max_wal_senders must die
On 10/20/10 6:54 PM, Robert Haas wrote: I find it impossible to believe that's a good decision, and IMHO we should be focusing on how to make the parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us most of the same benefits without throwing away hard-won performance. I'd be happy to accept that. Is it possible, though? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_wal_senders must die
On Thu, Oct 21, 2010 at 4:21 PM, Josh Berkus j...@agliodbs.com wrote: On 10/20/10 6:54 PM, Robert Haas wrote: I find it impossible to believe that's a good decision, and IMHO we should be focusing on how to make the parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us most of the same benefits without throwing away hard-won performance. I'd be happy to accept that. Is it possible, though? I sketched an outline of the problem AIUI here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg01348.php I think it's possible; I'm not quite sure how hard it is. Unfortunately, I've not had as much PG-hacking time lately as I'd like... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] max_wal_senders must die
Robert Haas wrote: On Wed, Oct 20, 2010 at 3:40 PM, Greg Stark gsst...@mit.edu wrote: On Wed, Oct 20, 2010 at 6:29 AM, Robert Haas robertmh...@gmail.com wrote: Exactly. It doesn't take many 3-7% slowdowns to add up to being 50% or 100% slower, and that sucks. In fact, I'm still not convinced that we were wise to boost default_statistics_target as much as we did. I argued for a smaller boost at the time. Well we don't want to let ourselves be paralyzed by FUD so it was important to identify specific concerns and then tackle those concerns. Once we identified the worst-case planning cases we profiled them and found that the inflection point of the curve was fairly clearly above 100 but that there were cases where values below 1,000 caused problems. So I'm pretty happy with the evidence-based approach. The inflection point of the curve was certainly a good thing for us to look at but the fact remains that we took a hit on a trivial benchmark, and we can't afford to take too many of those. Agreed. If people start wondering if our new major releases are perhaps _slower_ than previous ones, we have lost a huge amount of momentum. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] max_wal_senders must die
Robert Haas wrote: On Thu, Oct 21, 2010 at 4:21 PM, Josh Berkus j...@agliodbs.com wrote: On 10/20/10 6:54 PM, Robert Haas wrote: I find it impossible to believe that's a good decision, and IMHO we should be focusing on how to make the parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us most of the same benefits without throwing away hard-won performance. I'd be happy to accept that. Is it possible, though? I sketched an outline of the problem AIUI here: http://archives.postgresql.org/pgsql-hackers/2010-10/msg01348.php I think it's possible; I'm not quite sure how hard it is. Unfortunately, I've not had as much PG-hacking time lately as I'd like... Have we documented these TODOs? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] max_wal_senders must die
On Wed, Oct 20, 2010 at 1:06 AM, Greg Smith g...@2ndquadrant.com wrote: Josh Berkus wrote: Well, now that you mention it, I also think that hot standby should be the default. Yes, I know about the overhead, but I also think that the number of our users who want easy replication *far* outnumber the users who care about an extra 10% WAL overhead. I think this whole situation is similar to the resistance to increasing default_statistics_target. There's additional overhead added by enabling both of these settings, in return for making it more likely for the average person to see useful behavior by default. If things are rejiggered so the advanced user has to turn things off in order to get optimal performance when not using these features, in return for the newbie being more likely to get things working in basic form, that's probably a net win for PostgreSQL advocacy. But much like default_statistics_target, there needs to be some more formal work done on quantifying just how bad each of these overheads really are first. We lost 3-7% on multiple simple benchmarks between 8.3 and 8.4[1] because of that retuning for ease of use on real-world workloads, and that's not something you want to repeat too often. Exactly. It doesn't take many 3-7% slowdowns to add up to being 50% or 100% slower, and that sucks. In fact, I'm still not convinced that we were wise to boost default_statistics_target as much as we did. I argued for a smaller boost at the time. Actually, I think the best thing for default_statistics_target might be to scale the target based on the number of rows in the table, e.g. given N rows:

    10 + (N / 1000),   if N < 40,000
    46 + (N / 10,000), if 50,000 < N < 3,540,000
    400,               if N > 3,540,000

Consider a table with 2,000 rows. With default_statistics_target = 100, we can store up to 100 MCVs; and we break the remaining ~1900 values up into 100 buckets with 19 values/bucket. In most cases, that is probably overkill.
Where you tend to run into problems with inadequate statistics is with the values that are not quite common enough to be an MCV, but are still significantly more common than their companions in the same bucket. However, with only 19 values in a bucket, you're probably not going to have that problem. If you scale the table down to 1000 rows you now have 9 values in a bucket, which makes it *really* unlikely you're going to have that problem. On the other hand, on a table with 4 million rows, it is entirely likely that there could be more than 100 values whose frequencies are worth tracking individually, and odds are good also that even if the planning time is a little longer to no purpose, it'll still be small relative to the query execution time. It's unfortunately impractical for the size of the MCV list to track linearly with the size of the table, because there are O(n^2) algorithms in use, but I think some kind of graduated scheme might enable us to buy back some of that lost performance without damaging real workloads very much. Possibly even helping real workloads, because you may very well join large fact tables against small dimension tables, and odds are good that under the present scheme the fact tables have more statistics than they really need. As to replication, I don't believe the contention that most people will want to use replication. Many will, and that is fine, but many also won't. The world is full of development and test machines where replication is a non-issue, and some people won't even run it in production because the nature of their application makes the data on that box non-essential, or because they replicate with Bucardo or Slony. I completely agree that we should make it easier to get replication set up without (multiple) server restarts, but imposing a performance overhead on untuned systems is not the right way to do it.
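Robert's piecewise scaling rule above can be sketched as a small function. This is purely illustrative: the name `scaled_target` and the decision to fold the 40,000-50,000 gap (which the formula as stated leaves unspecified) into the middle branch are my own assumptions, not anything from the thread:

```python
def scaled_target(n_rows):
    """Sketch of the proposed row-count-based statistics target:
    10 + N/1000 for small tables, 46 + N/10000 in the middle range,
    capped at 400 for very large tables."""
    if n_rows < 40_000:
        return 10 + n_rows // 1_000
    elif n_rows <= 3_540_000:
        # The formula as quoted leaves 40,000-50,000 unspecified;
        # fold that gap into this branch for continuity.
        return 46 + n_rows // 10_000
    else:
        return 400
```

Under this rule a 2,000-row table gets a target of 12 rather than 100, a 2,000,000-row table gets 246, and the cap of 400 is first reached at exactly 3,540,000 rows, which is where the three branches of the quoted formula meet.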
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] max_wal_senders must die
Greg Smith g...@2ndquadrant.com writes: Josh Berkus wrote: Well, now that you mention it, I also think that hot standby should be the default. Yes, I know about the overhead, but I also think that the number of our users who want easy replication *far* outnumber the users who care about an extra 10% WAL overhead. ... But much like default_statistics_target, there needs to be some more formal work done on quantifying just how bad each of these overheads really are first. Quite. Josh, have you got any evidence showing that the penalty is only 10%? There are cases, such as COPY and ALTER TABLE, where you'd be looking at 2X or worse penalties, because of the existing optimizations that avoid writing WAL at all for operations where a single final fsync can serve the purpose. I'm not sure what the penalty for typical workloads is, partly because I'm not sure what should be considered a typical workload for this purpose. regards, tom lane
Re: [HACKERS] max_wal_senders must die
On 20.10.2010 17:19, Tom Lane wrote: Greg Smith g...@2ndquadrant.com writes: Josh Berkus wrote: Well, now that you mention it, I also think that hot standby should be the default. Yes, I know about the overhead, but I also think that the number of our users who want easy replication *far* outnumber the users who care about an extra 10% WAL overhead. ... But much like default_statistics_target, there needs to be some more formal work done on quantifying just how bad each of these overheads really are first. Quite. Josh, have you got any evidence showing that the penalty is only 10%? There are cases, such as COPY and ALTER TABLE, where you'd be looking at 2X or worse penalties, because of the existing optimizations that avoid writing WAL at all for operations where a single final fsync can serve the purpose. I'm not sure what the penalty for typical workloads is, partly because I'm not sure what should be considered a typical workload for this purpose. Going from wal_level='minimal' to 'archive' incurs the penalty on WAL-logging COPY etc. That's a big penalty. However, the difference between wal_level='archive' and wal_level='hot_standby' should be tiny. The big reason for separating those two in 9.0 was that it's all new code with new ways to fail and, yes, new bugs. It's not smart to expose people who are not interested in using hot standby to those issues. But maybe we feel more comfortable merging 'archive' and 'hot_standby' levels in 9.1. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] max_wal_senders must die
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Going from wal_level='minimal' to 'archive' incurs the penalty on WAL-logging COPY etc. That's a big penalty. However, the difference between wal_level='archive' and wal_level='hot_standby' should be tiny. I'm not sure I believe that either, because of the costs associated with logging lock acquisitions. We really need some actual benchmarks in this area, rather than handwaving ... regards, tom lane
Re: [HACKERS] max_wal_senders must die
Excerpts from Robert Haas's message of Wed Oct 20 10:29:04 -0300 2010: Actually, I think the best thing for default_statistics_target might be to scale the target based on the number of rows in the table, e.g. given N rows: 10 + (N / 1000), if N < 40,000; 46 + (N / 10,000), if 50,000 < N < 3,540,000; 400, if N > 3,540,000. Consider a table with 2,000 rows. With default_statistics_target = 100, we can store up to 100 MCVs; and we break the remaining ~1900 values up into 100 buckets with 19 values/bucket. Maybe what should be done about this is to have separate sizes for the MCV list and the histogram, where the MCV list is automatically sized during ANALYZE. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
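One way Álvaro's auto-sized MCV list might work is to keep only values noticeably more common than the average frequency, so the list adapts to the data rather than to a fixed target. The 1.25x cutoff, the function name, and the shape of the interface below are entirely my own assumptions for illustration, not what ANALYZE actually does:

```python
def auto_mcv_list(freqs, max_mcvs=400):
    """Illustrative sketch: keep a value as an MCV only if its observed
    frequency is at least 1.25x the average frequency among sampled
    values. The 1.25x threshold and the cap are assumed, not from the
    thread."""
    if not freqs:
        return []
    avg = sum(freqs) / len(freqs)
    mcvs = [f for f in sorted(freqs, reverse=True) if f >= 1.25 * avg]
    return mcvs[:max_mcvs]
```

With frequencies [0.4, 0.3, 0.1, 0.1, 0.05, 0.05] the average is ~0.167, so only the two values above ~0.208 make the list; a near-uniform distribution would produce an empty MCV list and leave everything to the histogram.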
Re: [HACKERS] max_wal_senders must die
On Wed, Oct 20, 2010 at 10:53 AM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Robert Haas's message of Wed Oct 20 10:29:04 -0300 2010: Actually, I think the best thing for default_statistics_target might be to scale the target based on the number of rows in the table, e.g. given N rows: 10 + (N / 1000), if N < 40,000; 46 + (N / 10,000), if 50,000 < N < 3,540,000; 400, if N > 3,540,000. Consider a table with 2,000 rows. With default_statistics_target = 100, we can store up to 100 MCVs; and we break the remaining ~1900 values up into 100 buckets with 19 values/bucket. Maybe what should be done about this is to have separate sizes for the MCV list and the histogram, where the MCV list is automatically sized during ANALYZE. I thought about that, but I'm not sure there's any particular advantage. Automatically scaling the histogram seems just as useful as automatically scaling the MCV list - both things will tend to reduce the estimation error. For a table with 2,000,000 rows, automatically setting the statistics target from 100 to the value that would be computed by the above formula, which happens to be 246, will help the 101st-246th most common values, because they will now be MCVs. It will also help all the remaining values, both because you've pulled 146 fairly common values out of the histogram buckets and also because each bucket now contains ~8130 values rather than ~20,000. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
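The bucket-size arithmetic in that last sentence is easy to check with a one-line sketch (the helper name is mine):

```python
def values_per_bucket(n_rows, target):
    """Approximate values per histogram bucket once the `target` most
    common values are pulled out as MCVs and the remainder is split
    across `target` equi-depth buckets."""
    return (n_rows - target) // target
```

For 2,000,000 rows, a target of 100 leaves ~19,999 values per bucket, while the formula's target of 246 shrinks that to ~8,129, matching Robert's figures.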
Re: [HACKERS] max_wal_senders must die
On Wed, Oct 20, 2010 at 6:29 AM, Robert Haas robertmh...@gmail.com wrote: Exactly. It doesn't take many 3-7% slowdowns to add up to being 50% or 100% slower, and that sucks. In fact, I'm still not convinced that we were wise to boost default_statistics_target as much as we did. I argued for a smaller boost at the time. Well we don't want to let ourselves be paralyzed by FUD so it was important to identify specific concerns and then tackle those concerns. Once we identified the worst-case planning cases we profiled them and found that the inflection point of the curve was fairly clearly above 100 but that there were cases where values below 1,000 caused problems. So I'm pretty happy with the evidence-based approach. The problem with being overly conservative is that it gives free rein to the folks who were shouting that we should just set the default to 1,000. They weren't wrong that the 10 was overly conservative and in the absence of evidence 1,000 was just as reasonable. Actually, I think the best thing for default_statistics_target might be to scale the target based on the number of rows in the table, e.g. given N rows: The number of buckets needed isn't related to the population size -- it's related to how wide the ranges you'll be estimating selectivity for are. That is, with our current code, if you're selecting tuples within a range a..b and that range happens to be the same size as the bucket size then you'll get an accurate estimate with a fixed 95th percentile precision regardless of the size of the table (to an approximation). I'm not sure how our selectivity works at all for the degenerate case of selecting for specific values. I don't understand how histograms are useful for such estimates at all. 
I think the MCV lists are basically an attempt to overcome this problem and as you point out I'm not sure the statistics target is really the right thing to control them -- but since I don't think there's any real statistics behind them I'm not sure there's any right way to control them. -- greg
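Greg's point about buckets and range predicates can be illustrated with a toy equi-depth histogram; everything here (the names, the uniform-within-bucket assumption) is my own sketch, not PostgreSQL's actual estimator:

```python
def range_selectivity(bounds, lo, hi):
    """Toy range-selectivity estimate from an equi-depth histogram.
    `bounds` are sorted bucket boundaries; each of the len(bounds)-1
    buckets holds an equal share of the rows, and values are assumed
    uniformly distributed within a bucket."""
    n_buckets = len(bounds) - 1
    sel = 0.0
    for i in range(n_buckets):
        b_lo, b_hi = bounds[i], bounds[i + 1]
        # fraction of this bucket covered by [lo, hi]
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        if b_hi > b_lo:
            sel += (overlap / (b_hi - b_lo)) / n_buckets
    return sel
```

A query range as wide as one of four buckets estimates 25% selectivity no matter how many rows the table has, which is Greg's point that the needed bucket count tracks predicate width, not population size. And a point predicate (lo == hi) has zero overlap, illustrating why a histogram alone says nothing useful about equality estimates and an MCV list is needed on top of it.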