Re: [HACKERS] Online base backup from the hot-standby
1) Today you can do a backup by just calling pg_start_backup('x'), copying the data directory, and calling pg_stop_backup(); you do not need to use pg_basebackup to create a backup. The solution you are proposing would require pg_basebackup to be used to build backups from standby servers.

YES.

2) If I run pg_basebackup but do not specify '-x' then no pg_xlog segments are included in the output. The relevant pg_xlog segments are in the archive from the master. I can see situations where you are already copying the archive to the remote site that the new standby will be created in, so you don't want to have to copy the pg_xlog segments twice over your network.

No, that doesn't matter, because the behavior is the same as in 9.1. Please see the "Before" part of the following answer.

What Heikki is proposing will work both when you aren't using pg_basebackup (as long as the output of pg_stop_backup() is somehow captured in a way that it can be read) and also with pg_basebackup when '-x' isn't specified.

I received this mail, and I noticed that I had misunderstood what Heikki is proposing.

My understanding:

Before:
* I thought Heikki was proposing to execute the SQL (pg_start_backup('x'); copy the data directory; pg_stop_backup();) from the standby server against the primary server. I disliked this.

Now:
* Heikki is proposing both not using pg_basebackup and not specifying -x. So he thinks:
  * Use the output of pg_stop_backup().
  * Don't use the backup history file.

Right?

Jun Ishizuka NTT Software Corporation TEL: 045-317-7018 E-Mail: ishizuka@po.ntts.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SYNONYMS (again)
On Jun 23, 2011, at 12:52 AM, Alvaro Herrera wrote: Excerpts from Joshua D. Drake's message of mié jun 22 15:37:17 -0400 2011: Per: http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php It seems we did come up with a use case in the procpid discussion: the ability to change the names of columns/databases etc., to handle the fixing of bad decisions made during development over time. Thoughts? Let's start with what was discussed and supported in that thread, that is, databases. It seems less clear that columns are widely believed to be a good idea to have synonyms for. Besides, synonyms for databases should be reasonably simple to implement, which is not something I would say for columns.

yes, implementing synonyms is not too hard. some time ago (3 or 4 years ago most likely) we already posted a patch providing support for synonyms. it was rejected because synonyms were said to be a bad design pattern which encourages app developers to do nasty things. so, if you want to work on it, maybe this patch is the place to start. many thanks, hans -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de
Re: [HACKERS] SYNONYMS (again)
On Jun 23, 2011, at 12:52 AM, Alvaro Herrera wrote: Excerpts from Joshua D. Drake's message of mié jun 22 15:37:17 -0400 2011: Per: http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php It seems we did come up with a use case in the procpid discussion: the ability to change the names of columns/databases etc., to handle the fixing of bad decisions made during development over time. Thoughts? Let's start with what was discussed and supported in that thread, that is, databases. It seems less clear that columns are widely believed to be a good idea to have synonyms for. Besides, synonyms for databases should be reasonably simple to implement, which is not something I would say for columns.

sorry, i missed the link: http://archives.postgresql.org/pgsql-patches/2006-03/msg00085.php many thanks, hans -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de
Re: [HACKERS] Hugetables question
On Wed, Jun 22, 2011 at 02:31:01PM +0200, Radosław Smogura wrote: I strictly disagree with the opinion that if there is 1% it's worthless. 1% here, 1% there, and finally you get 10%; but of course hugepages will work quite well if used in code that requires many random jumps. I think this can be reproduced, and some not-so-common case may be found that gets a performance benefit of about 10% (maybe upload a whole table into shared buffers and randomly peek records one by one).

I think the point is not that 1% is worthless, but that it hasn't been shown that it is a 1% improvement, because the noise is too large. For benefits this small, what you need to do is run each test 100 times and check the mean and standard deviation to see whether the improvement is real or not. When the benefit is 10% you only need a handful of runs to prove it, which is why such results are accepted more easily.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle
Re: [HACKERS] Hugetables question
Martijn van Oosterhout klep...@svana.org Thursday 23 of June 2011 09:10:20 On Wed, Jun 22, 2011 at 02:31:01PM +0200, Radosław Smogura wrote: I strictly disagree with the opinion that if there is 1% it's worthless. 1% here, 1% there, and finally you get 10%; but of course hugepages will work quite well if used in code that requires many random jumps. I think this can be reproduced, and some not-so-common case may be found that gets a performance benefit of about 10% (maybe upload a whole table into shared buffers and randomly peek records one by one). I think the point is not that 1% is worthless, but that it hasn't been shown that it is a 1% improvement, because the noise is too large. For benefits this small, what you need to do is run each test 100 times and check the mean and standard deviation to see whether the improvement is real or not. When the benefit is 10% you only need a handful of runs to prove it, which is why such results are accepted more easily. Have a nice day, Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle

I think the conclusion from this test was "there are much more important things to do than a 1% benefit", not "1% is worthless". I will try hugepages with random peeks today. Regards, Radek
Re: [HACKERS] [v9.2] DROP Reworks Part.0 - 'missing_ok' support of get_object_address
I revised my patch based on your there-is-no-try-v2.patch. It makes it possible to implement 'missing_ok' support without changing the order in which the object name is resolved and the relation locked. Thanks,

2011/6/22 Robert Haas robertmh...@gmail.com: On Wed, Jun 22, 2011 at 12:51 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Robert Haas's message of mié jun 22 08:56:02 -0400 2011: Another option might be to leave heap_openrv() and relation_openrv() alone and add a missing_ok argument to try_heap_openrv() and try_relation_openrv(). Passing true would give the same behavior as presently; passing false would make them behave like the non-try version. That would be pretty weird, having two functions, one of them sometimes doing the same thing as the other one. I understand Noah's concern but I think your original proposal was saner than both options presented so far. I agree with you. If we had a whole pile of options it might be worth having heap_openrv() and heap_openrv_extended() so as not to complicate the simple case, but since there's no foreseeable need to add anything other than missing_ok, my gut is to just add it and call it good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- KaiGai Kohei kai...@kaigai.gr.jp pgsql-v9.2-drop-reworks-part-0.v4.patch Description: Binary data
Re: [HACKERS] fixing PQsetvalue()
2011/6/23 Merlin Moncure mmonc...@gmail.com On Jun 6 (http://archives.postgresql.org/pgsql-hackers/2011-06/msg00272.php), Pavel discovered an issue with PQsetvalue that could cause libpq to wander off into unallocated memory; it was present in 9.0.x. A fairly uninteresting fix was quickly produced, but Tom indicated during subsequent review that he was not happy with the behavior of the function. Everyone was busy with the beta wrap at the time so I didn't press the issue.

A little history here: PQsetvalue's (http://www.postgresql.org/docs/9.0/static/libpq-misc.html) main reason to exist is to allow creating a PGresult from scratch data on a result created via PQmakeEmptyResult(). This behavior works as intended and is not controversial...it was initially done to support libpqtypes but has apparently found use in other places like ecpg. PQsetvalue *also* allows you to mess with results returned by the server using standard query methods, for any tuple, not just the one you are creating as you iterate through. This behavior was unnecessary in terms of what libpqtypes and friends needed and may (as Tom suggested) come back to bite us at some point. As it turns out, PQsetvalue's operation on results that weren't created via PQmakeEmptyResult was totally busted because of the bug, so we have a unique opportunity to tinker with libpq here: you could argue that there is a window of opportunity to change the behavior, since we know it isn't being usefully used. I think it's too late for that, but if it had to be done, the time is now.

Pavel actually has been requesting to go further with being able to mess around with PGresults (see: http://archives.postgresql.org/pgsql-interfaces/2011-06/msg0.php), such that the result would start to have some 'recordset' features you find in various other libraries. Maybe it's not the job of libpq to do that, but I happen to think it's pretty cool. 
Anyways, something has to be done -- I hate to see an unpatched known 9.0 issue remain in the wild.

+1. At this very moment I am thinking about using a modifiable (via PQsetvalue) PGresult instead of std::map in my C++ library, to store parameters for binding to the command being executed. I have already designed how to implement it, and I assumed that PQsetvalue is intended to work with any PGresult, not only with those created via PQmakeEmptyResult... So I am absolutely sure that PQsetvalue should work with any PGresult.

merlin -- // Dmitriy.
Re: [HACKERS] fixing PQsetvalue()
Hello, Merlin. You wrote: MM On Jun 6 MM (http://archives.postgresql.org/pgsql-hackers/2011-06/msg00272.php), MM Pavel discovered an issue with PQsetvalue that could cause libpq to MM wander off into unallocated memory that was present in 9.0.x. A MM fairly uninteresting fix was quickly produced, but Tom indicated MM during subsequent review that he was not happy with the behavior of MM the function. Everyone was busy with the beta wrap at the time so I MM didn't press the issue. MM A little history here: PQsetvalue's MM (http://www.postgresql.org/docs/9.0/static/libpq-misc.html) main MM reason to exist is to allow creating a PGresult out of scratch data on MM a result created via PQmakeEmptyResult(). This behavior works as MM intended and is not controversial...it was initially done to support MM libpqtypes but has apparently found use in other places like ecpg. MM PQsetvalue *also* allows you to mess with results returned by the MM server using standard query methods for any tuple, not just the one MM you are creating as you iterate through. This behavior was MM unnecessary in terms of what libpqtypes and friends needed and may (as MM Tom suggested) come back to bite us at some point. As it turns out, MM PQsetvalue's operation on results that weren't created via MM PQmakeEmptyResult was totally busted because of the bug, so we have a MM unique opportunity to tinker with libpq here: you could argue that you MM have a window of opportunity to change the behavior here since we know MM it isn't being usefully used. I think it's too late for that but it's MM if it had to be done the time is now. MM Pavel actually has been requesting to go further with being able to MM mess around with PGresults (see: MM http://archives.postgresql.org/pgsql-interfaces/2011-06/msg0.php), MM such that the result would start to have some 'recordset' features you MM find in various other libraries. Maybe it's not the job of libpq to MM do that, but I happen to think it's pretty cool. 
Anyways, something MM has to be done -- I hate to see an unpatched known 9.0 issue remain in MM the wild. MM merlin

I confirm my desire to have a delete-tuple routine. And I'm ready to create/test/modify any sources for this. -- With best wishes, Pavel mailto:pa...@gf.microolap.com
Re: [HACKERS] fixing PQsetvalue()
you are creating as you iterate through. This behavior was unnecessary in terms of what libpqtypes and friends needed and may (as Tom suggested) come back to bite us at some point. As it turns out, PQsetvalue's operation on results that weren't created via PQmakeEmptyResult was totally busted because of the bug, so we have a unique opportunity to tinker with libpq here: you could argue that you +1 Exactly at this moment I am thinking about using modifiable (via PQsetvalue) PGresult instead of std::map in my C++ library for store parameters for binding to executing command. I am already designed how to implement it, and I supposed that PQsetvalue is intended to work with any PGresult and not only with those which has been created via PQmakeEmptyResult... So, I am absolutely sure, that PQsetvalue should works with any PGresult. All PGresults are created via PQmakeEmptyPGresult, including libpqtypes. Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult. It might be better to say a server result vs. a client result. Currently, PQsetvalue is broken when provided a server generated result. -- Andrew Chernow eSilo, LLC global backup http://www.esilo.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Make the visibility map crash-safe.
On Wed, Jun 22, 2011 at 10:23 PM, Robert Haas robertmh...@gmail.com wrote: Well, it seems I didn't put nearly enough thought into heap_update(). The fix for the immediate problem looks simple enough - all the code has been refactored to use the new API, so the calls can easily be moved into the critical section (see attached). But looking at this a little more, I see that heap_update() is many bricks short of a load, because there are several places where the buffer can be unlocked and relocked, and we don't recheck whether the page is all-visible after reacquiring the lock. So I've got some more work to do here.

See what you think of the attached. I *think* this covers all bases. It's a little more complicated than I would like, but I don't think fatally so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company heap-update-vm-fix-v2.patch Description: Binary data
Re: [HACKERS] fixing PQsetvalue()
2011/6/23 Andrew Chernow a...@esilo.com you are creating as you iterate through. This behavior was unnecessary in terms of what libpqtypes and friends needed and may (as Tom suggested) come back to bite us at some point. As it turns out, PQsetvalue's operation on results that weren't created via PQmakeEmptyResult was totally busted because of the bug, so we have a unique opportunity to tinker with libpq here: you could argue that you +1 Exactly at this moment I am thinking about using modifiable (via PQsetvalue) PGresult instead of std::map in my C++ library for store parameters for binding to executing command. I am already designed how to implement it, and I supposed that PQsetvalue is intended to work with any PGresult and not only with those which has been created via PQmakeEmptyResult... So, I am absolutely sure, that PQsetvalue should works with any PGresult. All PGresults are created via PQmakeEmptyPGresult, including libpqtypes. Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult. It might be better to say a server result vs. a client result. Currently, PQsetvalue is broken when provided a server generated result.

Yeah, yeah. Thanks for the clarification! I am not a libpq hacker, just a user, so I was thinking that server-returned results were created by libpq's private functions rather than the public PQmakeEmptyPGresult... -1 if this feature (which I personally thought already existed, according to the documentation) makes future optimizations impossible or hard (as mentioned by Tom); otherwise +1. -- // Dmitriy.
Re: [HACKERS] fixing PQsetvalue()
On Thu, Jun 23, 2011 at 7:54 AM, Andrew Chernow a...@esilo.com wrote: you are creating as you iterate through. This behavior was unnecessary in terms of what libpqtypes and friends needed and may (as Tom suggested) come back to bite us at some point. As it turns out, PQsetvalue's operation on results that weren't created via PQmakeEmptyResult was totally busted because of the bug, so we have a unique opportunity to tinker with libpq here: you could argue that you +1 Exactly at this moment I am thinking about using modifiable (via PQsetvalue) PGresult instead of std::map in my C++ library for store parameters for binding to executing command. I am already designed how to implement it, and I supposed that PQsetvalue is intended to work with any PGresult and not only with those which has been created via PQmakeEmptyResult... So, I am absolutely sure, that PQsetvalue should works with any PGresult. All PGresults are created via PQmakeEmptyPGresult, including libpqtypes. Actually, libpqtypes calls PQcopyResult which calls PQmakeEmptyPGresult. It might be better to say a server result vs. a client result. Currently, PQsetvalue is broken when provided a server generated result. er, right-- the divergence was in how the tuples were getting added -- thanks for the clarification. essentially your attached patch fixes that divergence. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hugetables question
On Thu, Jun 23, 2011 at 5:01 AM, Radosław Smogura rsmog...@softperience.eu wrote: I think the conclusion from this test was "there are much more important things to do than a 1% benefit", not "1% is worthless". I will try hugepages with random peeks today.

I think the real conclusion here is Linux will soon do this for us automatically without us needing to do anything, so pretty soon there will be no possible benefit at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Table Partitioning
On Tue, Jun 21, 2011 at 1:52 PM, David Fetter da...@fetter.org wrote: Still, I think a pretty clear way forward here is to try to figure out a way to add some explicit syntax for range partitions, so that you can say... foo_a is for all rows where foo starts with 'a' foo_b is for all rows where foo starts with 'b' ... foo_xyz is for all rows where foo starts with 'xyz' If we have that data represented explicitly in the system catalog, then we can look at doing built-in INSERT-routing and UPDATE-handling. For an added bonus, it's a more natural syntax. Does someone else have such a syntax? Does The Standard™ have anything to say? Yes, and I don't know, respectively. There have been previous discussions of this. You might want to start by reading the discussion around the previous patch. https://commitfest.postgresql.org/action/patch_view?id=266 -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Online base backup from the hot-standby
On 11-06-23 02:41 AM, Jun Ishiduka wrote: I received this mail, and I noticed that I had misunderstood what Heikki is proposing. My understanding: Before: * I thought Heikki was proposing to execute the SQL (pg_start_backup('x'); copy the data directory; pg_stop_backup();) from the standby server against the primary server. I disliked this. Now: * Heikki is proposing both not using pg_basebackup and not specifying -x. So he thinks: * Use the output of pg_stop_backup(). * Don't use the backup history file. Right?

What I think he is proposing would not require using pg_stop_backup(), but you could also modify pg_stop_backup() to work as well. What do you think of this idea? Do you see how the patch can be reworked to accomplish this? Jun Ishizuka NTT Software Corporation TEL: 045-317-7018 E-Mail: ishizuka@po.ntts.co.jp
Re: [HACKERS] Table Partitioning
Robert Haas robertmh...@gmail.com wrote: David Fetter da...@fetter.org wrote: Does The Standard™ have anything to say? I don't know. I dug around in the standard a bit and didn't find anything. Given that the standard doesn't even touch the issue of indexes, because that's a performance-tuning implementation detail, it would seem out of character for them to define table partitioning. -Kevin
[HACKERS] spinlock contention
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote: On Jun12, 2011, at 23:39 , Robert Haas wrote: So, the majority (60%) of the excess spinning appears to be due to SInvalReadLock. A good chunk are due to ProcArrayLock (25%). Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32. However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, so I guess that two adjacent LWLocks currently share one cache line. Currently, the ProcArrayLock has index 4 while SInvalReadLock has index 5, so if I'm not mistaken exactly the two locks where you saw the largest contention on are on the same cache line... Might make sense to try and see if these numbers change if you either make LWLockPadded 64bytes or arrange the locks differently... I fooled around with this a while back and saw no benefit. It's possible a more careful test would turn up something, but I think the only real way forward here is going to be to eliminate some of that locking altogether. SInvalReadLock is a good example of a lock which seems barely to be necessary at all. Except on extraordinarily rare occasions, it is only ever taken in share mode. Only SICleanupQueue() needs to lock out readers, and in the (probably common) case where all backends are reasonably close to being caught up, it wouldn't even matter if the determination of minMsgNum were a little bit fuzzy. I've been straining my brain trying to figure out whether there's some way to rewrite this logic so that it can run concurrently with readers; then we could get rid of SInvalReadLock altogether. I have a feeling if I were 25% smarter I could do it, but I'm not. So my fallback position is to try to find some way to optimize the common case where there are no messages to be read. We're already allowing readers and writers to run in parallel, so it must not be important for a reader to see a message that a writer is in the process of writing, but has not yet finished writing. 
I believe the idea here is that sinval messages should be emitted before releasing the related locks, so if we've successfully acquired a lock on an object, then any pertinent sinval messages must already be completely written. What I'm thinking we could do is just keep a global counter indicating how many messages have been written, and each backend could keep a counter indicating the highest-numbered message it has so far read. When a message is written, the global counter is bumped. When a backend goes to AcceptInvalidationMessages(), it compares the global counter to its own counter, and if they're the same, then it doesn't bother taking SInvalReadLock and just exits quickly. There is an obvious (potential) memory-ordering problem here. If it's possible for a backend doing AcceptInvalidationMessages() to see a stale version of the counter, then it might fail to read messages from the queue that it really ought to have seen. Protecting the global counter with a spinlock defeats the purpose of doing any of this in the first place, because the whole problem is too many people fighting over the spinlock protecting SInvalReadLock. It should be sufficient, though, for the writing backend to execute a memory-barrier operation just after bumping the global counter and the reading backend to execute one just before. As it happens, LWLockRelease() is such a barrier, since it must acquire and release a spinlock, so we should be able to just bump the counter right before releasing the LWLock and call it good. On the reading side, the immediately-preceding lock acquisition seems like it ought to be sufficient - surely it can't be possible to acquire a heavyweight lock on an object without seeing memory updates that some other process made before releasing the same heavyweight lock. A possible objection here is that a 32-bit counter might wrap around, giving a false negative, and a read from a 64-bit counter might not be atomic. 
In practice, the chances of a problem here seem remote, but I think we can guard against it easily enough. We can put each backend's counter of messages read into shared memory, protected by a per-backend lock (probably the same LWLock the fast relation lock patch already introduces). When a backend compares its counter to the global counter, it does so while holding this lock. When SICleanupQueue() resets a backend, it grabs the lock and sets that backend's counter to a special value (like 0) that we guarantee will never match the global counter (which we can cause to wrap from 2^32-1 to 1). So then AcceptInvalidationMessages() - in the common case where there are no invalidation messages to process - will only need the per-backend LWLock rather than fighting over SInvalLock. If anyone has read this far, you have my undying gratitude, especially if you respond with comments. ProcArrayLock looks like a tougher nut to crack - there's simply no way, with the system we have right now, that you can take a snapshot without locking the list of running processes.
Re: [HACKERS] spinlock contention
On 23.06.2011 18:42, Robert Haas wrote: ProcArrayLock looks like a tougher nut to crack - there's simply no way, with the system we have right now, that you can take a snapshot without locking the list of running processes. I'm not sure what to do about that, but we're probably going to have to come up with something, because it seems clear that once we eliminate the lock manager LWLock contention, this is a major bottleneck.

ProcArrayLock is currently held for a relatively long period of time when a snapshot is taken, especially if there are a lot of backends. There are three operations on the procarray: 1. Advertising a new xid belonging to my backend. 2. Ending a transaction. 3. Getting a snapshot. Advertising a new xid is currently done without a lock, assuming that setting an xid in shared memory is atomic. To end a transaction, you acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire ProcArrayLock in shared mode and scan the whole procarray. I wonder if it would be a better tradeoff to keep a materialized snapshot in shared memory that's updated every time a new xid is assigned or a transaction ends. Getting a snapshot would become much cheaper, as you could just memcpy the ready-built snapshot from shared memory into local memory. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] spinlock contention
On Jun23, 2011, at 17:42 , Robert Haas wrote: On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug f...@phlo.org wrote: On Jun12, 2011, at 23:39 , Robert Haas wrote: So, the majority (60%) of the excess spinning appears to be due to SInvalReadLock. A good chunk are due to ProcArrayLock (25%). Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32. However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, so I guess that two adjacent LWLocks currently share one cache line. Currently, the ProcArrayLock has index 4 while SInvalReadLock has index 5, so if I'm not mistaken exactly the two locks where you saw the largest contention are on the same cache line... Might make sense to try and see if these numbers change if you either make LWLockPadded 64 bytes or arrange the locks differently... I fooled around with this a while back and saw no benefit. It's possible a more careful test would turn up something, but I think the only real way forward here is going to be to eliminate some of that locking altogether.

It seems hard to believe that there isn't some effect of two locks sharing a cache line. There are architectures (even some of the Intel architectures, I believe) where cache lines are 32 bytes, though - so measuring this certainly isn't easy. Also, I'm absolutely no expert at this, so it might very well be that I'm missing something fundamental.

SInvalReadLock is a good example of a lock which seems barely to be necessary at all. Except on extraordinarily rare occasions, it is only ever taken in share mode. This is the case we should optimize for, I think. Currently, there's contention even between two SHARE lockers of an LWLock because our LWLock implementation uses a spinlock underneath. 
I've googled around a bit, and found this: http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git It's a userspace rwlock implementation based on atomic instructions and futexes (for blocking) written by Linus Torvalds as a response to the following thread: http://lwn.net/Articles/284741/ It seems to require only a single atomic increment instruction to acquire a share lock in the uncontested case, and a single compare-and-exchange instruction to acquire an exclusive lock (again in the uncontested case). The code doesn't seem to have a license attached, and seems to have been written over a weekend and never touched since, so we might not want (or be able) to use this directly. But it'd still be interesting to see if replacing our LWLocks with this implementation affects throughput.

Only SICleanupQueue() needs to lock out readers, and in the (probably common) case where all backends are reasonably close to being caught up, it wouldn't even matter if the determination of minMsgNum were a little bit fuzzy. I've been straining my brain trying to figure out whether there's some way to rewrite this logic so that it can run concurrently with readers; then we could get rid of SInvalReadLock altogether. I have a feeling if I were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update) in a lockless way. The idea is to copy the data structure (or parts of it) before updating it, and to commit the change afterwards by adjusting a single pointer to point to the new structure (kinda like we do atomic commits by an atomic update of one bit in the clog). The hard part is usually to figure out when it's OK to clean up or reuse the old version of the data structure - for that, you need to know that all previous readers have completed. Here's a rough description of how that could work for the invalidation queue. We'd have two queue structures in shared memory, each containing a version number. 
We'd also have a current pointer pointing to the most recent one. Each reader would publish the version number of the queue he's about to read (the current one) in shared memory (and issue a write barrier). SICleanupQueue() would check whether the non-current queue structure is unused by comparing its version to the published versions of all backends (after issuing a read barrier, and after taking a lock that prevents concurrent SICleanupQueue runs of course). If there still are users of that version, SICleanupQueue() out-waits them. Afterwards, it simply copies the current structure over the old one, does the necessary cleanups, increments the version number to be larger than the one of the current structure, and *then* updates the current pointer. So my fallback position is to try to find some way to optimize the common case where there are no messages to be read. We're already allowing readers and writers to run in parallel, so it must not be important for a reader to see a message that a writer is in the process of writing, but has not yet finished writing. I believe the idea here is that sinval messages should be emitted before releasing the related locks, so if we've successfully acquired a lock on
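The two-buffer, versioned-queue scheme described above can be sketched in C. This is a hypothetical, single-process illustration of the version-comparison logic only (names such as SharedQueue and cleanup_queue are invented here); all real concurrency, the out-waiting, and the read/write barriers are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_BACKENDS 8

typedef struct SharedQueue {
    uint64_t version;
    int      msgs[64];
    int      nmsgs;
} SharedQueue;

/* Two queue structures; versions start nonzero because 0 below means
 * "backend is not currently reading any queue". */
static SharedQueue  queues[2] = {{.version = 2}, {.version = 1}};
static SharedQueue *current_queue = &queues[0];
static uint64_t     published_version[MAX_BACKENDS];

/* Reader: publish the version we are about to read (a write barrier
 * would follow in a real implementation). */
static SharedQueue *reader_enter(int backend)
{
    published_version[backend] = current_queue->version;
    return current_queue;
}

static void reader_exit(int backend)
{
    published_version[backend] = 0;
}

/* Is the non-current buffer still referenced by any reader? */
static int old_buffer_in_use(const SharedQueue *old)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (published_version[i] == old->version)
            return 1;
    return 0;
}

/* The SICleanupQueue()-like step: reuse the old buffer only once no
 * reader has its version published; otherwise report "busy" (a real
 * implementation would out-wait the stragglers). */
static int cleanup_queue(void)
{
    SharedQueue *old = (current_queue == &queues[0]) ? &queues[1] : &queues[0];

    if (old_buffer_in_use(old))
        return 0;

    memcpy(old, current_queue, sizeof(SharedQueue));  /* copy current over old */
    old->nmsgs = 0;                                   /* "the necessary cleanups" */
    old->version = current_queue->version + 1;        /* bump version... */
    current_queue = old;                              /* ...*then* swing the pointer */
    return 1;
}
```

The key property, as in the mail above, is that readers only ever dereference the buffer whose version they have published, so the writer can tell exactly which buffer is safe to recycle.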
Re: [HACKERS] SYNONYMS (again)
Gurjeet Singh singh.gurj...@gmail.com wrote: Instead of just synonyms of columns, why don't we think about implementing virtual columns (feature as named in other RDBMS). This is the ability to define a column in a table which is derived using an expression around other non-virtual columns. How do you see that working differently from what PostgreSQL can currently do?

test=# create table line_item(id int primary key not null, quantity int not null, unit_price numeric(13,2));
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "line_item_pkey" for table "line_item"
CREATE TABLE
test=# insert into line_item values (1,15,'12.53'),(2,5,'16.23');
INSERT 0 2
test=# create function line_total(line_item) returns numeric(13,2) language sql immutable as $$ select ($1.quantity * $1.unit_price)::numeric(13,2);$$;
CREATE FUNCTION
test=# select li.id, li.line_total from line_item li;
 id | line_total
----+------------
  1 |     187.95
  2 |      81.15
(2 rows)

-Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] spinlock contention
On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 23.06.2011 18:42, Robert Haas wrote: ProcArrayLock looks like a tougher nut to crack - there's simply no way, with the system we have right now, that you can take a snapshot without locking the list of running processes. I'm not sure what to do about that, but we're probably going to have to come up with something, because it seems clear that once we eliminate the lock manager LWLock contention, this is a major bottleneck. ProcArrayLock is currently held for a relatively long period of time when a snapshot is taken, especially if there's a lot of backends. There are three operations on the procarray: 1. Advertising a new xid belonging to my backend. 2. Ending a transaction. 3. Getting a snapshot. Advertising a new xid is currently done without a lock, assuming that setting an xid in shared memory is atomic. The issue isn't just whether it's atomic, but whether some other backend scanning the ProcArray just afterwards might fail to see the update due to memory-ordering effects. But I think even if they do, there's no real harm done. Since we're holding XidGenLock, we know that the XID we are advertising is higher than any XID previously advertised, and in particular it must be higher than latestCompletedXid, so GetSnapshotData() will ignore it anyway.
I am rather doubtful that this code is strictly safe: myproc->subxids.xids[nxids] = xid; myproc->subxids.nxids = nxids + 1; In practice, the window for a race condition is minimal, because (a) in many cases, nxids will be on the same cache line as the xid being set; (b) even if it's not, we almost immediately release XidGenLock, which acts as a memory barrier; (c) on many common architectures, such as x86, stores are guaranteed to become visible in execution order; (d) if, on some architecture, you managed to see the stores out of order and thus loaded a garbage XID from the end of the array, it might happen to be zero (which would be ignored, as a non-normal transaction ID) or the transaction might happen never to examine a tuple with that XID anyway. To end a transaction, you acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire ProcArrayLock in shared mode, and scan the whole procarray. I wonder if it would be a better tradeoff to keep a materialized snapshot in shared memory that's updated every time a new xid is assigned or a transaction ends. Getting a snapshot would become much cheaper, as you could just memcpy the ready-built snapshot from shared memory into local memory. At least in the case of the pgbench -S workload, I doubt that would help at all. The problem isn't that the LWLocks are being held for too long -- they are all being held in shared mode and don't conflict anyway. The problem is that everyone has to acquire and release the spinlock in order to acquire the LWLock, and again to release it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
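The store-ordering concern in the quoted snippet can be made explicit with C11 atomics: write the array slot first, then publish the new count with release ordering, so a reader who loads the count with acquire ordering is guaranteed to see the xid in the slot. This is an illustrative sketch, not PostgreSQL's actual code (publish_subxid and read_subxids are invented names).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct SubXids {
    TransactionId xids[64];
    atomic_int    nxids;    /* published count of valid entries */
} SubXids;

/* Writer: store the xid into the slot first, *then* publish the new
 * count with release semantics, so the two stores cannot be observed
 * in the opposite order. */
static void publish_subxid(SubXids *s, TransactionId xid)
{
    int n = atomic_load_explicit(&s->nxids, memory_order_relaxed);
    s->xids[n] = xid;
    atomic_store_explicit(&s->nxids, n + 1, memory_order_release);
}

/* Reader: an acquire load of nxids guarantees that xids[0..n-1] are
 * fully written, eliminating the garbage-XID window described above. */
static int read_subxids(SubXids *s, TransactionId *out)
{
    int n = atomic_load_explicit(&s->nxids, memory_order_acquire);
    for (int i = 0; i < n; i++)
        out[i] = s->xids[i];
    return n;
}
```

The release/acquire pair plays the role that the XidGenLock release accidentally plays in the existing code.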
Re: [HACKERS] SYNONYMS (again)
On Thu, Jun 23, 2011 at 2:58 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Gurjeet Singh singh.gurj...@gmail.com wrote: Instead of just synonyms of columns, why don't we think about implementing virtual columns (feature as named in other RDBMS). This is the ability to define a column in a table which is derived using an expression around other non-virtual columns. How do you see that working differently from what PostgreSQL can currently do?

test=# create table line_item(id int primary key not null, quantity int not null, unit_price numeric(13,2));
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "line_item_pkey" for table "line_item"
CREATE TABLE
test=# insert into line_item values (1,15,'12.53'),(2,5,'16.23');
INSERT 0 2
test=# create function line_total(line_item) returns numeric(13,2) language sql immutable as $$ select ($1.quantity * $1.unit_price)::numeric(13,2);$$;
CREATE FUNCTION
test=# select li.id, li.line_total from line_item li;
 id | line_total
----+------------
  1 |     187.95
  2 |      81.15
(2 rows)

For one, this column is not part of the table, so we can't gather statistics on it to help the optimizer. We can't create primary keys on this expression. Also, say the query wasn't fetching all the columns and we had just the line_total call in the SELECT list; the executor has to fetch the whole row and pass it on to the function even though the function uses only part of the row (2 columns in this case). Regards, -- Gurjeet Singh EnterpriseDB Corporation The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Wed, Jun 22, 2011 at 11:26 AM, Simon Riggs si...@2ndquadrant.com wrote: For people that still think there is still risk, I've added a parameter called relaxed_ddl_locking which defaults to off. That way people can use this feature in an informed way without exposing us to support risks from its use. I can't get excited about adding a GUC that says you can turn this off if you want but don't blame us if it breaks. That just doesn't seem like good software engineering to me. I think we should be clear that this adds *one* lock/unlock for each object for each session, ONCE only after each transaction that executed a DDL statement against an object. So with 200 sessions, a typical ALTER TABLE statement will cause 400 locks. The current lock manager maxes out above 50,000 locks per second, so any comparative analysis will show this is a minor blip on even a heavy workload. I think that using locks to address this problem is the wrong approach. I think the right fix is to use something other than SnapshotNow to do the system catalog scans. However, if we were going to use locking, then we'd need to ensure that: (1) No transaction which has made changes to a system catalog may commit while some other transaction is in the middle of scanning that catalog. (2) No transaction which has made changes to a set of system catalogs may commit while some other transaction is in the midst of fetching interrelated data from a subset of those catalogs. It's important to note, I think, that the problem here doesn't occur at the time that the table rows are updated, because those rows aren't visible yet. The problem happens at commit time. So unless I'm missing something, ALTER TABLE would only need to acquire the relation definition lock after it has finished all of its other work. In fact, it doesn't even really need to acquire it then, either. 
It could just queue up a list of relation definition locks to acquire immediately before commit, which would be advantageous if the transaction went on to abort, since in that case we'd skip the work of acquiring the locks at all. In fact, without doing that, this patch would undo much of the point of the original ALTER TABLE lock strength reduction:

begin;
alter table foo alter column a set storage plain;
(now start a new psql session in another window)
select * from foo;

In master, the SELECT completes without blocking. With this patch applied, the SELECT blocks, just as it did in 9.0, because it can't rebuild the relcache entry without getting the relation definition lock, which the alter table will hold until commit. In fact, the same thing happens even if foo has been accessed previously in the same session. If there is already an open *transaction* that has accessed foo, then, it seems, it can continue to do so until it commits. But as soon as it tries to start a new transaction, it blocks again. I don't quite understand why that is - I didn't think we invalidated the entire relcache on every commit. But that's what happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
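The "queue the locks, acquire them just before commit" idea above can be sketched as a pending-lock list that is drained at pre-commit and simply discarded on abort. Everything here is hypothetical illustration: QueueRelationDefinitionLock, LockRelationDefinition (a stand-in counter, not a real PostgreSQL function), and the AtPreCommit/AtAbort hooks are invented names.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;

typedef struct PendingDefLock {
    Oid                    relid;
    struct PendingDefLock *next;
} PendingDefLock;

static PendingDefLock *pending_def_locks = NULL;
static int             locks_taken = 0;

/* ALTER TABLE et al. just record the relation; no lock is taken yet. */
static void QueueRelationDefinitionLock(Oid relid)
{
    PendingDefLock *p = malloc(sizeof(PendingDefLock));
    p->relid = relid;
    p->next = pending_def_locks;
    pending_def_locks = p;
}

static void LockRelationDefinition(Oid relid)
{
    (void) relid;
    locks_taken++;      /* real code would acquire the lock here */
}

/* Immediately before commit: actually take the queued locks. */
static void AtPreCommit_DefLocks(void)
{
    while (pending_def_locks) {
        PendingDefLock *p = pending_def_locks;
        pending_def_locks = p->next;
        LockRelationDefinition(p->relid);
        free(p);
    }
}

/* On abort: drop the queue without ever acquiring anything, which is
 * exactly the saving the paragraph above points out. */
static void AtAbort_DefLocks(void)
{
    while (pending_def_locks) {
        PendingDefLock *p = pending_def_locks;
        pending_def_locks = p->next;
        free(p);
    }
}
```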
Re: [HACKERS] spinlock contention
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug f...@phlo.org wrote: It seems hard to believe that there isn't some effect of two locks sharing a cache line. There are architectures (even some of the Intel architectures, I believe) where cache lines are 32 bytes, though - so measuring this certainly isn't easy. Also, I'm absolutely no expert on this, so it might very well be that I'm missing something fundamental. Well, I'm sure there is some effect, but my experiments seem to indicate that it's not a very important one. Again, please feel free to provide contrary evidence. I think the basic issue is that - in the best possible case - padding the LWLocks so that you don't have two locks sharing a cache line can reduce contention on the busier lock by at most 2x. (The less busy lock may get a larger reduction, but that may not help you much.) If what you really need is for the contention to decrease by 1000x, you're just not really moving the needle. That's why the basic fast-relation-lock patch helps so much: it replaces a system where every lock request results in contention with a system where NONE of them do. SInvalReadLock is a good example of a lock which seems barely to be necessary at all. Except on extraordinarily rare occasions, it is only ever taken in share mode. This is the case we should optimize for, I think. Currently, there's contention even between two SHARE lockers of an LWLock because our LWLock implementation uses a spinlock underneath. I've googled around a bit, and found this: http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git It's a userspace rwlock implementation based on atomic instructions and futexes (for blocking) written by Linus Torvalds as a response to the following thread: http://lwn.net/Articles/284741/ It seems to require only a single atomic increment instruction to acquire a share lock in the uncontested case, and a single compare-and-exchange instruction to acquire an exclusive lock (again in the uncontested case).
The code doesn't seem to have a license attached, and seems to have been written over a weekend and never touched since, so we might not want (or be able) to use this directly. But it'd still be interesting to see if replacing our LWLocks with this implementation affects throughput. I tried rewriting the LWLocks using CAS. It actually seems to make things slightly worse on the tests I've done so far, perhaps because I didn't make it respect spins_per_delay. Perhaps fetch-and-add would be better, but I'm not holding my breath. Everything I'm seeing so far leads me to the belief that we need to get rid of the contention altogether, not just contend more quickly. Only SICleanupQueue() needs to lock out readers, and in the (probably common) case where all backends are reasonably close to being caught up, it wouldn't even matter if the determination of minMsgNum were a little bit fuzzy. I've been straining my brain trying to figure out whether there's some way to rewrite this logic so that it can run concurrently with readers; then we could get rid of SInvalReadLock altogether. I have a feeling if I were 25% smarter I could do it, but I'm not. This sounds like something which could be done with RCU (Read-Copy-Update) in a lockless way. The idea is to copy the data structure (or parts of it) before updating it, and to commit the change afterwards by adjusting a single pointer to point to the new structure (kinda like we do atomic commits by an atomic update of one bit in the clog). The hard part is usually to figure out when it's OK to clean up or reuse the old version of the data structure - for that, you need to know that all previous readers have completed. Here's a rough description of how that could work for the invalidation queue. We'd have two queue structures in shared memory, each containing a version number. We'd also have a current pointer pointing to the most recent one.
Each reader would publish the version number of the queue he's about to read (the current one) in shared memory (and issue a write barrier). SICleanupQueue() would check whether the non-current queue structure is unused by comparing its version to the published versions of all backends (after issuing a read barrier, and after taking a lock that prevents concurrent SICleanupQueue runs of course). If there still are users of that version, SICleanupQueue() out-waits them. Afterwards, it simply copies the current structure over the old one, does the necessary cleanups, increments the version number to be larger than the one of the current structure, and *then* updates the current pointer. Interesting idea. So my fallback position is to try to find some way to optimize the common case where there are no messages to be read. We're already allowing readers and writers to run in parallel, so it must not be important for a reader to see a message that a writer is in the process of writing, but has
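The fallback mentioned above, optimizing the common case where there are no messages to read, is often done with a global message counter: writers bump it after adding a message, and a reader that sees an unchanged counter can skip taking SInvalReadLock entirely. This is a hypothetical sketch (the names are invented, and the release/acquire pair stands in for the barriers this thread discusses); it is not the patch under discussion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Global count of sinval messages ever written (zero-initialized). */
static atomic_uint_fast64_t sinval_msg_counter;

typedef struct ReaderState {
    uint64_t last_seen;     /* counter value at our last queue drain */
} ReaderState;

static void writer_add_message(void)
{
    /* ...write the message into the queue under the usual lock, then
     * publish its existence with release semantics... */
    atomic_fetch_add_explicit(&sinval_msg_counter, 1, memory_order_release);
}

/* Returns 1 if the reader must take the lock and drain the queue;
 * returns 0 on the lock-free fast path. */
static int reader_has_new_messages(ReaderState *r)
{
    uint64_t now = atomic_load_explicit(&sinval_msg_counter,
                                        memory_order_acquire);
    if (now == r->last_seen)
        return 0;           /* fast path: no messages, no lock taken */
    r->last_seen = now;
    return 1;               /* slow path: go read the queue */
}
```

Whether an acquire load here is cheap enough, and whether the surrounding heavyweight-lock traffic already provides the ordering, is exactly the memory-barrier question debated later in this thread.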
Re: [HACKERS] spinlock contention
On Jun 23, 2011, at 22:15 , Robert Haas wrote: On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug f...@phlo.org wrote: It seems hard to believe that there isn't some effect of two locks sharing a cache line. There are architectures (even some of the Intel architectures, I believe) where cache lines are 32 bytes, though - so measuring this certainly isn't easy. Also, I'm absolutely no expert on this, so it might very well be that I'm missing something fundamental. Well, I'm sure there is some effect, but my experiments seem to indicate that it's not a very important one. Again, please feel free to provide contrary evidence. I think the basic issue is that - in the best possible case - padding the LWLocks so that you don't have two locks sharing a cache line can reduce contention on the busier lock by at most 2x. (The less busy lock may get a larger reduction, but that may not help you much.) If what you really need is for the contention to decrease by 1000x, you're just not really moving the needle. Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate the most heavily contested LWLocks from the others might still be worthwhile. That's why the basic fast-relation-lock patch helps so much: it replaces a system where every lock request results in contention with a system where NONE of them do. I tried rewriting the LWLocks using CAS. It actually seems to make things slightly worse on the tests I've done so far, perhaps because I didn't make it respect spins_per_delay. Perhaps fetch-and-add would be better, but I'm not holding my breath. Everything I'm seeing so far leads me to the belief that we need to get rid of the contention altogether, not just contend more quickly. Is there a patch available? How did you do the slow path (i.e. the case where there's contention and you need to block)?
It seems to me that without some kernel support like futexes it's impossible to do better than LWLocks already do, because any simpler scheme like while (atomic_inc_post(lock) < 0) { atomic_dec(lock); block(lock); } for the shared-locker case suffers from a race condition (the lock might be released before you actually block()). There is an obvious (potential) memory-ordering problem here. If it's possible for a backend doing AcceptInvalidationMessages() to see a stale version of the counter, then it might fail to read messages from the queue that it really ought to have seen. Protecting the global counter with a spinlock defeats the purpose of doing any of this in the first place, because the whole problem is too many people fighting over the spinlock protecting SInvalReadLock. It should be sufficient, though, for the writing backend to execute a memory-barrier operation just after bumping the global counter and the reading backend to execute one just before. As it happens, LWLockRelease() is such a barrier, since it must acquire and release a spinlock, so we should be able to just bump the counter right before releasing the LWLock and call it good. On the reading side, the immediately-preceding lock acquisition seems like it ought to be sufficient - surely it can't be possible to acquire a heavyweight lock on an object without seeing memory updates that some other process made before releasing the same heavyweight lock. If we start doing lockless algorithms, I really think we should add explicit barrier primitives. We could start out with something simple such as #define MemoryBarrierWrite \ do { slock_t l; S_UNLOCK(&l); } while (0) #define MemoryBarrierRead \ do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0) #define MemoryBarrierReadWrite \ do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0) But yeah, it seems that in the case of the sinval queue, the surrounding LWLocks actually provide the necessary barriers.
Yes, a set of general memory barrier primitives would be handy. Unfortunately, it would probably be quite a bit of work to develop this and verify that it works properly on all supported platforms. The idea would be to start out with something trivial like the above. Maybe with an #if for compilers which have something like GCC's __sync_synchronize(). We could then gradually add implementations for specific architectures, hopefully done by people who actually own the hardware and can test. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
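A minimal sketch of the incremental approach described above: start from a single full-barrier macro, using GCC's __sync_synchronize() where the compiler provides it, and let per-architecture work fill in the weaker variants later. The macro names follow the proposal in this thread; the non-GCC fallback shown here is only a placeholder and is NOT a correct hardware barrier on weakly-ordered machines.

```c
#include <assert.h>

#if defined(__GNUC__)
/* GCC (and clang) provide a full memory barrier intrinsic. */
#define MemoryBarrierReadWrite() __sync_synchronize()
#else
/* Placeholder only: a real port would use the spinlock-based trick
 * from the mail above, or a per-architecture instruction. */
#define MemoryBarrierReadWrite() do { } while (0)
#endif

/* Until architectures supply cheaper read-only / write-only barriers,
 * conservatively define both as the full barrier. */
#define MemoryBarrierRead()  MemoryBarrierReadWrite()
#define MemoryBarrierWrite() MemoryBarrierReadWrite()

/* Smoke test: the barriers must compile and be no-ops functionally. */
static int barrier_smoke_test(void)
{
    int x = 1;
    MemoryBarrierWrite();
    int y = x;
    MemoryBarrierRead();
    MemoryBarrierReadWrite();
    return y;
}
```

Defining the weak variants as the full barrier is always correct, just potentially slower, which matches the "gradually add implementations" plan: each architecture can later override only what it can do more cheaply.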
Re: [HACKERS] spinlock contention
On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug f...@phlo.org wrote: Well, I'm sure there is some effect, but my experiments seem to indicate that it's not a very important one. Again, please feel free to provide contrary evidence. I think the basic issue is that - in the best possible case - padding the LWLocks so that you don't have two locks sharing a cache line can reduce contention on the busier lock by at most 2x. (The less busy lock may get a larger reduction, but that may not help you much.) If what you really need is for the contention to decrease by 1000x, you're just not really moving the needle. Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate the most heavily contested LWLocks from the others might still be worthwhile. Hey, if we can show that it works, sign me up. That's why the basic fast-relation-lock patch helps so much: it replaces a system where every lock request results in contention with a system where NONE of them do. I tried rewriting the LWLocks using CAS. It actually seems to make things slightly worse on the tests I've done so far, perhaps because I didn't make it respect spins_per_delay. Perhaps fetch-and-add would be better, but I'm not holding my breath. Everything I'm seeing so far leads me to the belief that we need to get rid of the contention altogether, not just contend more quickly. Is there a patch available? How did you do the slow path (i.e. the case where there's contention and you need to block)? It seems to me that without some kernel support like futexes it's impossible to do better than LWLocks already do, because any simpler scheme like while (atomic_inc_post(lock) < 0) { atomic_dec(lock); block(lock); } for the shared-locker case suffers from a race condition (the lock might be released before you actually block()). Attached... The idea would be to start out with something trivial like the above. Maybe with an #if for compilers which have something like GCC's __sync_synchronize().
We could then gradually add implementations for specific architectures, hopefully done by people who actually own the hardware and can test. Yes. But if we go that route, then we have to also support a code path for architectures for which we don't have that support. That's going to be more work, so I don't want to do it until we have a case where there is a good, clear benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company lwlock-v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
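For concreteness, here is the uncontended fast path of the rwlock scheme this thread keeps referring to, sketched with C11 atomics. This is a hypothetical illustration, not the lwlock-v1.patch attached above: state >= 0 counts shared holders, state == -1 means exclusively held, and the blocking slow path, where the futex race Florian describes lives, is replaced by simple failure returns.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct RWLock { atomic_int state; } RWLock;

/* Shared acquire: one atomic read-modify-write in the uncontested
 * case.  Returns 0 where a real lock would have to block(). */
static int rwlock_try_shared(RWLock *l)
{
    int v = atomic_load_explicit(&l->state, memory_order_relaxed);
    while (v >= 0) {
        if (atomic_compare_exchange_weak_explicit(&l->state, &v, v + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return 1;
    }
    return 0;   /* exclusively held */
}

static void rwlock_release_shared(RWLock *l)
{
    atomic_fetch_sub_explicit(&l->state, 1, memory_order_release);
}

/* Exclusive acquire: a single compare-and-exchange in the uncontested
 * case, as the mail describes for Linus's implementation. */
static int rwlock_try_exclusive(RWLock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(&l->state, &expected, -1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static void rwlock_release_exclusive(RWLock *l)
{
    atomic_store_explicit(&l->state, 0, memory_order_release);
}
```

The hard part this sketch deliberately omits is exactly Florian's point: turning the failure returns into sleeping without a lost-wakeup race requires futex-style kernel support, which is what the existing spinlock-plus-wait-queue LWLock already solves portably.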
Re: [HACKERS] SYNONYMS (again)
On Wed, Jun 22, 2011 at 3:37 PM, Joshua D. Drake j...@commandprompt.com wrote: Per: http://archives.postgresql.org/pgsql-hackers/2010-11/msg02043.php It seems we did come up with a use case in the procpid discussion. The ability to change the names of columns/databases etc, to handle the fixing of bad design decisions during development over time. Thoughts? Instead of just synonyms of columns, why don't we think about implementing virtual columns (feature as named in other RDBMS). This is the ability to define a column in a table which is derived using an expression around other non-virtual columns. I agree it would be much more difficult, and some may even argue it is pointless in the presence of views and expression indexes, but I leave that as an exercise for others. Regards, -- Gurjeet Singh EnterpriseDB Corporation The Enterprise PostgreSQL Company
Re: [HACKERS] crash-safe visibility map, take five
On Wed, Jun 22, 2011 at 10:40 PM, Jeff Davis pg...@j-davis.com wrote: 1. Torn pages -- not a problem because it's a single bit and idempotent. Right. 2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing an action that makes the page all-visible. Depending on what action makes a page all-visible: a. An old snapshot is released -- not a problem, because if there is a crash all snapshots are released. b. Cleanup action on the page -- not a problem, because that will create a WAL record and update the page's LSN before setting the PD_ALL_VISIBLE. Lazy VACUUM is the only thing that makes a page all visible. I don't understand the part about snapshots. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote: Lazy VACUUM is the only thing that makes a page all visible. I don't understand the part about snapshots. Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. After an INSERT to a new page, and after all snapshots are released, the page becomes all-visible, and thus subject to being marked with PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there is no cleanup action that takes place here, so nothing else will bump the LSN either. So, let's say that we hypothetically had persistent snapshots; then you'd have the following problem: 1. INSERT to a new page, marking it with LSN X. 2. WAL flushed to LSN Y (Y > X). 3. Some persistent snapshot (that doesn't see the INSERT) is released, and generates WAL recording that fact with LSN Z (Z > Y). 4. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE. 5. The page is written out because its LSN is still X. 6. Crash. Now, the persistent snapshot is still present because LSN Z never made it to disk; but the page is marked with PD_ALL_VISIBLE. Sure, if these hypothetical persistent snapshots were transactional, and if synchronous_commit is on, then LSN Z would be flushed before step 4; but that's another set of assumptions. That's why I left it simple and said that the assumption was snapshots are released if there's a crash. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
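The crux of the scenario above is the WAL-before-data rule: a dirty page may be written once WAL is flushed past the page's LSN, so a PD_ALL_VISIBLE bit set without bumping the LSN can reach disk while a later WAL record is still unflushed. A toy model (hypothetical types and values, chosen only to mirror X < Y < Z from the mail):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct Page {
    XLogRecPtr lsn;             /* LSN of last WAL-logged change */
    int        pd_all_visible;  /* hint bit, set without bumping lsn */
} Page;

/* The WAL-before-data rule: the buffer manager may write the page
 * only once WAL is durable up to the page's own LSN. */
static int page_writable(const Page *p, XLogRecPtr flushed)
{
    return p->lsn <= flushed;
}
```

Because setting the hint in step 4 leaves the LSN at X, the rule is satisfied already at flush point Y, even though the snapshot-release record Z is not yet on disk; that is the inconsistency the crash exposes.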
[HACKERS] pg_upgrade defaulting to port 25432
Tom Lane wrote: Christopher Browne cbbro...@gmail.com writes: On Wed, Jun 15, 2011 at 5:35 PM, Bruce Momjian br...@momjian.us wrote: [ just recommend using a different port number during pg_upgrade ] +1... That seems to have lots of nice properties. Yeah, that seems like an appropriate expenditure of effort. It's surely not bulletproof, since someone could intentionally connect to the actual port number, but getting to bulletproof is a lot more work than anyone seems to want to do right now. (And, as Bruce pointed out, no complete solution would be back-patchable anyway.) I have created the following patch which uses 25432 as the default port number for pg_upgrade. It also creates two new environment variables, OLDPGPORT and NEWPGPORT, to control the port values because we don't want to default to PGPORT anymore. This will allow most users to use pg_upgrade without needing to specify port parameters, and will prevent accidental connections. The user does have to specify the port number for live checks, but that was always the case because you have to use different port numbers for old/new in that case. I updated the error message to be clearer about this. The patch is small and might be possible for 9.1, except it changes user-visible behavior, which would suggest it should be in 9.2; plus, it doesn't address a serious bug. Comments? I will backpatch a doc recommendation to use 25432 as a port number for versions of pg_upgrade that don't get this patch. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/contrib/pg_upgrade/check.c b/contrib/pg_upgrade/check.c new file mode 100644 index 1ee2aca..323b5f1 *** a/contrib/pg_upgrade/check.c --- b/contrib/pg_upgrade/check.c *** output_check_banner(bool *live_check) *** 30,37 **** { *live_check = true; if (old_cluster.port == new_cluster.port) ! pg_log(PG_FATAL, "When checking a live server, " !
the old and new port numbers must be different.\n); pg_log(PG_REPORT, Performing Consistency Checks on Old Live Server\n); pg_log(PG_REPORT, \n); } --- 30,37 { *live_check = true; if (old_cluster.port == new_cluster.port) ! pg_log(PG_FATAL, When checking a live old server, ! you must specify the old server's port number.\n); pg_log(PG_REPORT, Performing Consistency Checks on Old Live Server\n); pg_log(PG_REPORT, \n); } diff --git a/contrib/pg_upgrade/option.c b/contrib/pg_upgrade/option.c new file mode 100644 index 4401a81..7fbfa8c *** a/contrib/pg_upgrade/option.c --- b/contrib/pg_upgrade/option.c *** parseCommandLine(int argc, char *argv[]) *** 58,65 os_info.progname = get_progname(argv[0]); /* Process libpq env. variables; load values here for usage() output */ ! old_cluster.port = getenv(PGPORT) ? atoi(getenv(PGPORT)) : DEF_PGPORT; ! new_cluster.port = getenv(PGPORT) ? atoi(getenv(PGPORT)) : DEF_PGPORT; os_user_effective_id = get_user_info(os_info.user); /* we override just the database user name; we got the OS id above */ --- 58,65 os_info.progname = get_progname(argv[0]); /* Process libpq env. variables; load values here for usage() output */ ! old_cluster.port = getenv(OLDPGPORT) ? atoi(getenv(OLDPGPORT)) : DEF_PGUPORT; ! new_cluster.port = getenv(NEWPGPORT) ? atoi(getenv(NEWPGPORT)) : DEF_PGUPORT; os_user_effective_id = get_user_info(os_info.user); /* we override just the database user name; we got the OS id above */ *** Options:\n\ *** 231,238 -G, --debugfile=FILENAME output debugging activity to file\n\ -k, --linklink instead of copying files to new cluster\n\ -l, --logfile=FILENAMElog session activity to file\n\ ! -p, --old-port=OLDPORTold cluster port number (default %d)\n\ ! 
-P, --new-port=NEWPORTnew cluster port number (default %d)\n\ -u, --user=NAME clusters superuser (default \%s\)\n\ -v, --verbose enable verbose output\n\ -V, --version display version information, then exit\n\ --- 231,238 -G, --debugfile=FILENAME output debugging activity to file\n\ -k, --linklink instead of copying files to new cluster\n\ -l, --logfile=FILENAMElog session activity to file\n\ ! -p, --old-port=OLDPGPORT old cluster port number (default %d)\n\ ! -P, --new-port=NEWPGPORT new cluster port number (default %d)\n\ -u, --user=NAME clusters superuser (default \%s\)\n\ -v, --verbose enable verbose output\n\ -V, --version display version information, then exit\n\ diff --git a/contrib/pg_upgrade/pg_upgrade.h
Re: [HACKERS] crash-safe visibility map, take five
On Thu, Jun 23, 2011 at 6:40 PM, Jeff Davis pg...@j-davis.com wrote: On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote: Lazy VACUUM is the only thing that makes a page all visible. I don't understand the part about snapshots. Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. After an INSERT to a new page, and after all snapshots are released, the page becomes all-visible, and is thus subject to being marked with PD_ALL_VISIBLE by lazy VACUUM without bumping the LSN. Note that there is no cleanup action that takes place here, so nothing else will bump the LSN either. So, let's say that we hypothetically had persistent snapshots; then you'd have the following problem:

1. INSERT to a new page, marking it with LSN X.
2. WAL is flushed through LSN Y (Y > X).
3. Some persistent snapshot (that doesn't see the INSERT) is released, and generates WAL recording that fact with LSN Z (Z > Y).
4. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE.
5. The page is written out, because its LSN is still X.
6. Crash.

Now, the persistent snapshot is still present because LSN Z never made it to disk, but the page is marked with PD_ALL_VISIBLE. Sure, if these hypothetical persistent snapshots were transactional, and if synchronous_commit is on, then LSN Z would be flushed before step 4; but that's another set of assumptions. That's why I left it simple and said that the assumption was that snapshots are released if there's a crash. I don't really think that's a separate set of assumptions - if we had some system whereby snapshots could survive a crash, then they'd have to be WAL-logged (because that's how we make things survive crashes). And anything that is WAL-logged must obey the WAL-before-data rule. We have a system already that ensures that when synchronous_commit=off, CLOG pages can't be flushed before the corresponding WAL record makes it to disk.
For a system like what you're describing, you'd need something similar - these crash-surviving snapshots would have to make sure that no action which depended on their state hit the disk before the WAL record marking the state change hit the disk. I guess the point you are driving at here is that a page can only go from being all-visible to not-all-visible by virtue of being modified. There's no other piece of state (like a persistent snapshot) that can be lost in a crash that would make us need to change our minds and decide that an all-visible XID is really not all-visible after all. (The reverse is not true: since snapshots are ephemeral, a crash will render every row either all-visible or dead.) I guess I never thought about documenting that particular aspect of it because (to me) it seems fairly self-evident. Maybe I'm wrong... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: Keywords in pg_hba.conf should be field-specific
On 24 June 2011 03:48, Alvaro Herrera alvhe...@commandprompt.com wrote: I have touched next_token() and next_token_expand() a bit more, because it seemed to me that they could be simplified further (the bit about returning the comma in the token, instead of being a boolean return, seemed strange). Please let me know what you think. Cool. I didn't like the way it returned the comma either, but I thought it would be best if I kept the changes in my patch to a minimum. Cheers, BJ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_upgrade version check improvements and small fixes
Dan McGee wrote: On Wed, Jun 22, 2011 at 8:54 PM, Bruce Momjian br...@momjian.us wrote: 0003 is what I really wanted to solve, which was my failure with pg_upgrade. The call to pg_ctl didn't succeed because the binaries didn't match the data directory, thus resulting in this: The error had nothing to do with trust at all; it was simply that I tried to use 9.0 binaries with an 8.4 data directory. My patch checks for this and ensures that the -B bindir is the correct version, just as the -D datadir has to be the correct version. I had not thought about testing whether the bin and data directories were the same major version, but you are right that it generates an odd error if they are not. I changed your sscanf format string because the version can be just 9.2dev and there is no need for the minor version. No problem - you're going to know way more about this than me; I was just going off what I saw in my two installed versions and a quick look over the code of pg_ctl to make sure that string was never translated. On a side note, I don't think I ever mentioned to everyone else why parsing the version from pg_ctl rather than pg_config was done: this is so we don't introduce any additional binary requirements. Tom Lane noted in an earlier cleanup series that pg_config is not really needed at all in the old cluster, so I wanted to keep it that way, but pg_ctl has always been required. Yes, I looked specifically for that. We could have used pg_controldata too, but pg_ctl seemed just as good. I saw no reason to test whether the binary version matches the pg_upgrade version, because we already test the cluster version, and we test that the cluster version is the same as the binary version. Duh. That makes sense. Thanks for getting to these so quickly!
To partially toot my own horn, but also to show where a user like me encountered this: after some packaging hacking, anyone running Arch Linux should be able to do a pg_upgrade from here on out by installing the postgresql-old-upgrade package (http://www.archlinux.org/packages/extra/x86_64/postgresql-old-upgrade/) and possibly consulting the Arch wiki. Great. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Online base backup from the hot-standby
What I think he is proposing would not require using pg_stop_backup(), but you could also modify pg_stop_backup() to work as well. What do you think of this idea? Do you see how the patch can be reworked to accomplish this? Logic that does not use pg_stop_backup() would be difficult, because pg_stop_backup() is used to identify minRecoveryPoint. Jun Ishizuka NTT Software Corporation TEL:045-317-7018 E-Mail: ishizuka@po.ntts.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers