Re: [HACKERS] TAP test breakage on MacOS X
Noah Misch n...@leadboat.com writes:
> On Thu, Oct 30, 2014 at 10:49:33PM -0400, Tom Lane wrote:
>> I think the installs as such aren't the only reason for the sucky
>> performance. We need to also reduce the number of initdb cycles incurred
>> by the TAP tests. It's useless, for example, that the pg_controldata test
>> creates its very own $PGDATA rather than sharing one with other tests.
> One could memoize initdb within the suite. Call it once per distinct
> command line, caching the resulting data directory. Copy the cached data
> directory for each test desiring one.

At least on older/slower machines like prairiedog, even having to copy
$PGDATA for each test is unappealing. Some numbers for reference:

	make install          22 sec
	initdb                76 sec
	initdb -N             33 sec
	cp -r $PGDATA /tmp    17 sec

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2014-10-31 14:55:11 +0900, Michael Paquier wrote:
> On Tue, Oct 28, 2014 at 9:25 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> On 13 October 2014 10:05, Petr Jelinek p...@2ndquadrant.com wrote:
>>> I worked a bit on this patch to make it closer to committable state.
>>> Here is an updated version that works with current HEAD for the
>>> October commitfest.
>> I've reviewed this and it looks good to me. Clean, follows existing
>> code neatly, documented and tested. Please could you document a few
>> things:
>> * ExtendCommitTS() works only because commit_ts_enabled can only be
>> set at server start. We need that documented so somebody doesn't make
>> it more easily enabled and break something. (Could we make it enabled
>> at the next checkpoint or similar?)
>> * The SLRU tracks timestamps of both xids and subxids. We need to
>> document that it does this because the Subtrans SLRU is not
>> persistent. If we made Subtrans persistent we might need to store only
>> the top-level xid's commitTS, but that's very useful for typical use
>> cases and wouldn't save much time at commit.
> Hm. What is the performance impact of this feature using the latest
> version of this patch?

I haven't measured it recently, but it wasn't large, though it was
measurable.

> I imagine that the penalty of the additional operations this feature
> introduces is not zero, particularly for loads with lots of short
> transactions.

Which is why you can disable it...

Greetings,

Andres Freund

-- 
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback
>> We seem to be going in circles. You suggested having two options,
>> --feedback and --fsync, which is almost exactly what Furuya posted
>> originally. I objected to that, because I think that user interface is
>> too complicated. Instead, I suggested having just a single option
>> called --synchronous, or even better, having no option at all: the
>> server tells the client whether it's participating in synchronous
>> replication, pg_receivexlog automatically fsyncs when it is, and not
>> otherwise [1]. That way you don't need to expose any new options to
>> the user. What did you think of that idea?
> I think it's pretty weird to have the fsync behavior of the client be
> controlled by the server. I also don't think that fsync() on the client
> side is useless in asynchronous replication. Yeah, it's true that there
> are no *guarantees* with asynchronous replication, but the bound on how
> long the data can take to get out to disk is a heck of a lot shorter if
> you fsync frequently than if you don't. And on the flip side, that has
> a performance impact. So I actually think the design you proposed is
> not as good as what was proposed by Furuya and Simon. But I don't feel
> incredibly strongly about it.

Thanks for lots of comments!! I fixed the patch.

By default, it behaves like walreceiver: as walreceiver does, it fsyncs
and sends a feedback message after each fsync. When --no-fsync is
specified, it fsyncs only when the WAL file has switched. The feedback
interval is specified by --status-interval.

Regards,
-- 
Furuya Osamu

pg_receivexlog-fsync-feedback-v7.patch
Description: pg_receivexlog-fsync-feedback-v7.patch
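The patch behavior described above reduces to a small decision function: fsync (and then report) on every flush by default, or only at WAL segment switch under --no-fsync. This is a sketch of the policy only, not of pg_receivexlog's actual code:

```python
def should_fsync(no_fsync_opt: bool, segment_switched: bool) -> bool:
    """Default (walreceiver-like) mode: fsync on every flush, then send
    feedback. With --no-fsync: fsync only at WAL segment boundaries."""
    if not no_fsync_opt:
        return True
    return segment_switched

# default mode: every flush is made durable and can be reported upstream
assert should_fsync(no_fsync_opt=False, segment_switched=False)
# --no-fsync: durability is deferred to segment boundaries
assert not should_fsync(no_fsync_opt=True, segment_switched=False)
assert should_fsync(no_fsync_opt=True, segment_switched=True)
```

This also shows why the debate matters: under --no-fsync the bound on data loss grows to a whole WAL segment, which is exactly the trade-off Robert describes.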
Re: [HACKERS] Lockless StrategyGetBuffer() clock sweep
On Thu, Oct 30, 2014 at 5:01 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-10-30 10:23:56 +0530, Amit Kapila wrote:
>> I have a feeling that this might also have some regression at higher
>> loads (like scale_factor = 5000, shared_buffers = 8GB, client_count =
>> 128, 256) for similar reasons as the bgreclaimer patch, meaning that
>> although both reduce contention around the spinlock, they move the
>> contention somewhere else. I have yet to take data before concluding
>> anything (I am just waiting for your other patch (wait free LW_SHARED)
>> to be committed).
> I have a hard time seeing how this could be. In the uncontended case
> the number of cachelines touched and the number of atomic operations is
> exactly the same. In the contended case the new implementation does far
> fewer atomic ops - and doesn't do spinning. What's your theory?

I have observed that once we reduce the contention in one path, it
doesn't always lead to a performance/scalability gain; rather the
contention shifts to another lock if one exists. This is the reason why
we have to work on reducing contention around both the BufFreeList and
Buf Mapping table locks together. I have taken some performance data and
it seems this patch also exhibits similar behaviour to the bgreclaimer
patch, and I believe resolving contention around dynahash can improve
the situation (Robert's chash patch can be helpful).

Performance Data
----------------
Configuration and Db Details
IBM POWER-8, 24 cores, 192 hardware threads
RAM = 492GB
max_connections = 300
shared_buffers = 8GB
checkpoint_segments = 30
checkpoint_timeout = 15min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5 mins
Test mode - pgbench read-only (-M prepared)
Data below is the median of 3 runs; for individual run data, check the
document attached to this mail.

Scale_Factor = 1000
Patch_ver/Client_Count    128      256
HEAD                      265502   201689
Patch                     283448   224888

Scale_Factor = 5000
Patch_ver/Client_Count    128      256
HEAD                      190435   177477
Patch                     171485   167794

The above data indicates that there is a performance gain at scale
factor 1000, but a regression at scale factor 5000.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

perf_lockless_strategy_getbuf.ods
Description: application/vnd.oasis.opendocument.spreadsheet
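For readers following along: the lockless strategy under discussion replaces a spinlock-protected victim counter in the clock sweep with an atomic fetch-and-add over the buffer ring. A toy model of that idea, where CPython's GIL stands in for the atomicity of the real fetch-and-add instruction:

```python
import itertools
import threading

NBUFFERS = 8
_tick = itertools.count()   # models an atomic fetch-and-add counter

def clock_sweep_next() -> int:
    # claim the next tick without taking any lock, then map it onto
    # the circular buffer ring; concurrent callers never spin
    return next(_tick) % NBUFFERS

claimed = []
def worker():
    for _ in range(100):
        claimed.append(clock_sweep_next())

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 400 claims, spread exactly evenly over the ring: the global counter
# hands out 0..399 once each, so every residue mod NBUFFERS appears
# 400 / NBUFFERS = 50 times
assert all(claimed.count(b) == 50 for b in range(NBUFFERS))
```

The numbers above suggest that removing this one spinlock merely shifts the contention, consistent with Amit's point about the buffer mapping locks.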
[HACKERS] Tweaking Foreign Keys for larger tables
Various ways of tweaking Foreign Keys are suggested that are helpful for
larger databases.

* Deferrable Enforcement Timing Clause

  NOT DEFERRABLE - immediate execution
  DEFERRABLE
    INITIALLY IMMEDIATE - existing
    INITIALLY DEFERRED - existing
    INITIALLY NOT ENFORCED

  FK created, but not enforced during DML. Will be/must be marked NOT
  VALID when first created. We can run a VALIDATE on the constraint at
  any time; if it passes the check it is marked VALID and presumed to
  stay that way until the next VALIDATE run. If it fails that check the
  FK would be marked as NOT VALID, causing it to be no longer useful for
  optimization. This allows FKs to be checked in bulk, rather than
  executing in the front-end code path, yet still be there for
  optimization and documentation (or visibility by tools etc). There is
  no corresponding SET CONSTRAINTS call for the NOT ENFORCED case, since
  that would require us to mark the constraint as not valid.

* Referenced Table actions

  ON DELETE IGNORE
  ON UPDATE IGNORE

  If we allow this specification then the FK is one-way: we check the
  existence of a row in the referenced table, but there is no need for a
  trigger on the referenced table to enforce an action on delete or
  update, so no need to lock the referenced table when adding FKs. This
  is very useful for very highly referenced tables, or for larger tables
  where we aren't planning on deleting or updating the referenced table
  without also deleting or updating the referencing table.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
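The bulk check behind VALIDATE for a NOT ENFORCED constraint amounts to a single anti-join between referencing and referenced keys, instead of a per-row check on every DML statement. A toy sketch of that check (the function and sample data are invented for illustration):

```python
def validate_fk(referencing_keys, referenced_keys):
    """Return the set of orphaned keys; the FK is valid iff it is empty.

    Bulk validation: one pass over each input (an anti-join), rather
    than one lookup in the referenced table per inserted/updated row.
    """
    present = set(referenced_keys)
    return {k for k in referencing_keys
            if k is not None and k not in present}

orders = [101, 102, 103, 103]   # customer ids referenced by orders rows
customers = [101, 102]          # primary keys of the referenced table
orphans = validate_fk(orders, customers)
assert orphans == {103}         # constraint would be marked NOT VALID
```

A passing run (empty orphan set) corresponds to marking the constraint VALID until the next VALIDATE.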
[HACKERS] Reducing Catalog Locking
Recent work on parallel query has opened my eyes to exactly how
frequently we request locks on various catalog tables. (Attached file is
a lock analysis of a representative Pg server.)

Given these are catalog tables, we aren't doing much to them that
requires a strong lock. Specifically, only CLUSTER and VACUUM FULL touch
those tables like that. When we do that, pretty much everything else
hangs, because you can't get much done while fundamental tables are
locked.

So my proposal is that we invent a big catalog lock. The benefit of this
is that we greatly reduce lock request traffic, as well as being able to
control exactly when such requests occur. (Fine-grained locking isn't
always helpful.)

Currently, SearchCatCache() requests locks on individual catalog tables.
Alternatively, we could request an AccessShareLock on a big catalog lock
that must be acquired before a strong relation-specific lock is
requested. We just need to change the lockid used for each cache. We can
still CREATE, ALTER, DROP and VACUUM all catalog tables - this idea
would block VACUUM FULL, but that would have been blocked anyway by
general activity.

We reduce lock traffic by having SearchCatCache() use a new call
heap_catalog_open(), which calls a new LockCatalogLock() that caches
whether we have already locked the BigCatalogLock. That cache can be
cleared easily and cheaply at EOX, and it can be set quickly and easily
in parallel worker tasks. We then add a special locking clause for VAC
FULL on catalog tables.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

stat.out
Description: Binary data
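The caching step in this proposal (take the big catalog lock once per transaction, remember that you hold it, clear the memo at end of transaction) could look roughly like this. All names are invented; this only illustrates why the scheme cuts lock request traffic:

```python
class CatalogLockCache:
    """Session-local memo: have we already taken the big catalog lock?"""

    def __init__(self, lock_manager):
        self.lock_manager = lock_manager
        self.held = False

    def lock_catalog(self):
        # only the first catalog access per transaction pays for a real
        # lock request; later accesses just hit the cached flag
        if not self.held:
            self.lock_manager.acquire("BigCatalogLock", "AccessShareLock")
            self.held = True

    def at_end_of_xact(self):
        # cheap reset at EOX, as the proposal describes
        self.held = False

class CountingLockManager:
    """Stand-in lock manager that just records requests."""
    def __init__(self):
        self.requests = []
    def acquire(self, lockid, mode):
        self.requests.append((lockid, mode))

mgr = CountingLockManager()
cache = CatalogLockCache(mgr)
for _ in range(5):          # five syscache lookups in one transaction
    cache.lock_catalog()
cache.at_end_of_xact()
# only one lock request ever reached the lock manager
assert len(mgr.requests) == 1
```

Tom's objection downthread is about a different property: holding the one big lock makes any strong lock on any catalog effectively unobtainable, regardless of how cheaply the shared side is acquired.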
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 31 October 2014 04:42, Amit Kapila amit.kapil...@gmail.com wrote: In fact it would be more sensible to lock the toast table earlier. It might make some sense to lock the toast table earlier for this particular case, but I don't think in general it will be feasible to lock all the tables (including catalog tables) which might get used in the overall operation before starting parallel operation. I believe it will make doing parallel stuff difficult if we try to go via this route, as we need to always do this for any operation we want to make parallel. It will always have the risk that we might miss something and another thing is it might not be feasible in all kind of cases. If I understand correctly, you are suggesting that to get the first version of parallel operation out earlier, we should try to avoid all the stuff (like group locking, ...), I advocate trying to avoid it, to see. The problems cited so far are not so extreme they cannot be easily got around, if you have a will to do so. I've just posted about an idea to reduce catalog locking. have some restrictions on which kind of cases Create Index can work, do some hackery in Create Index path to avoid any kind of locking problems and do the parallel operation for External Sort (which might avoid need for shared memory allocator) and then call it a day and then do the things which we need for other parallel operations as and when they are required. I think this might be a good approach in itself if somebody wants to target something to release early, however in medium to short term when we want to provide non-trivial parallel operations we have to eventually solve those problems and delaying will only make the things worse. My (by now) long experience on Postgres has shown me that developing something now is a good route to take. By moving towards a solution at a reasonable pace you learn things which may affect the longer term viability of the project. New functionality in each release is useful. 
Big bang developments that don't deliver until the end have a habit of
not delivering at the end either. MVP is a term frequently used by VCs
for a reason. Lack of urgency in delivering a useful outcome from the
project looks strange. If all you develop in this release is core
infrastructure for parallelism and a custom scan API, it begins to look
like there may not be an open source version delivered at all. How
people make their income is up to them, but it's a concern that's been
raised my way before, so it seems reasonable to raise it for others
also. That is why we have spent much time explaining clearly the goals
and licensing of BDR, for example. No need for big arguments; these are
not allegations or implications etc.

> In short, I agree that there is merit in your idea w.r.t. getting the
> first version of parallel operation out early; however, if we look at
> the medium to long term, I feel solving the group locking problem (or
> some other general infrastructure that Robert mentioned upthread) can
> yield better results, unless you feel that is not at all required for
> parallel operations.

Is it genuinely required for most parallel operations? I think it's
clear that none of us knows the answer. Sure, the general case needs it,
but is the general case the same thing as the reasonably common case?

Certainly, having a working prototype for something useful would help
build the case for why these things are needed. It would certainly help
the cause to have something working in this release.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs si...@2ndquadrant.com wrote:
> Is it genuinely required for most parallel operations? I think it's
> clear that none of us knows the answer. Sure, the general case needs
> it, but is the general case the same thing as the reasonably common
> case?

Well, I think that the answer is pretty clear. Most of the time, perhaps
in 99.9% of cases, group locking will make no difference as to whether a
parallel operation succeeds or fails. Occasionally, however, it will
cause an undetected deadlock. I don't hear anyone arguing that that's
OK. Does anyone wish to make that argument?

If not, then we must prevent it. The only design, other than what I've
proposed here, that seems like it will do that consistently in all cases
is to have the user backend lock every table that the child backend
might possibly want to lock, and retain those locks throughout the
entire duration of the computation whether the child actually needs them
or not. I think that could be made to work, but there are two problems:

1. Turing's theorem being what it is, predicting what catalog tables the
child might lock is not necessarily simple.

2. It might end up taking many more locks than necessary and holding
them for much longer than necessary. Right now, for example, a syscache
lookup locks the table only if we actually need to read from it, and
releases the lock as soon as the actual read is complete. Under this
design, every syscache that the parallel worker might conceivably
consult needs to be locked for the entire duration of the parallel
computation. I would expect this to provoke a violent negative reaction
from at least one prominent community member (and I bet users wouldn't
like it much, either).

So, I am still of the opinion that group locking makes sense.
While I appreciate the urge to avoid solving difficult problems where it's reasonably avoidable, I think we're in danger of spending more effort trying to avoid solving this particular problem than it would take to actually solve it. Based on what I've done so far, I'm guessing that a complete group locking patch will be between 1000 and 1500 lines of code and that nearly all of the new logic will be skipped when it's not in use (i.e. no parallelism). That sounds to me like a hell of a deal compared to trying to predict what locks the child might conceivably take and preemptively acquire them all, which sounds annoyingly tedious even for simple things (like equality operators) and nearly impossible for anything more complicated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing Catalog Locking
On Fri, Oct 31, 2014 at 6:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
> Recent work on parallel query has opened my eyes to exactly how
> frequently we request locks on various catalog tables. (Attached file
> is a lock analysis on a representative Pg server).

That analysis is interesting.

> Given these are catalog tables, we aren't doing much to them that
> requires a strong lock. Specifically, only CLUSTER and VACUUM FULL
> touch those tables like that. When we do that, pretty much everything
> else hangs, cos you can't get much done while fundamental tables are
> locked.

True, although it's currently the case that catalog tables are only
locked for the minimum time necessary. So, VACUUM FULL on pg_class will
block nearly any new query from starting up, but already-running queries
may be able to keep going, and idle transactions don't cause a problem.
If we held system catalog locks until transaction commit, a VACUUM FULL
on pg_class would basically wait until every other transaction in the
system completed, and preclude any other transaction from starting until
it finished. That's significantly more problematic in my view.

I think that the fast-path locking mechanism prevents the overwhelming
majority of lock-related pain for these kinds of things. Most system
catalog locks are weak within the meaning of fast-path locking, so the
main lock table is rarely touched at all, and manipulating our own
PGPROC - which generally nobody else is touching - just isn't that
expensive.

On a related note, I've previously had the thought that it would be nice
to have a big DDL lock - that is, a lock that prevents concurrent DDL
without preventing anything else - so that pg_dump could get just that
one lock and then not worry about the state of the world changing under
it.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 2014-10-31 08:54:54 -0400, Robert Haas wrote:
> On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs si...@2ndquadrant.com wrote:
>> Is it genuinely required for most parallel operations? I think it's
>> clear that none of us knows the answer. Sure, the general case needs
>> it, but is the general case the same thing as the reasonably common
>> case?
> Well, I think that the answer is pretty clear. Most of the time,
> perhaps in 99.9% of cases, group locking will make no difference as to
> whether a parallel operation succeeds or fails. Occasionally, however,
> it will cause an undetected deadlock. I don't hear anyone arguing that
> that's OK. Does anyone wish to make that argument?

Maybe we can, as a first step, make those edges in the lock graph
visible to the deadlock detector? It's pretty clear that undetected
deadlocks aren't ok, but detectable deadlocks in a couple corner cases
might be acceptable.

> If not, then we must prevent it. The only design, other than what I've
> proposed here, that seems like it will do that consistently in all
> cases is to have the user backend lock every table that the child
> backend might possibly want to lock and retain those locks throughout
> the entire duration of the computation whether the child would
> actually need those locks or not. I think that could be made to work,
> but there are two problems:
> 1. Turing's theorem being what it is, predicting what catalog tables
> the child might lock is not necessarily simple.

Well, there's really no need to be absolutely general here. We're only
going to support a subset of functionality as parallelizable. And in
that case we don't need anything complicated to acquire all locks. It
seems quite possible to combine a heuristic approach with improved
deadlock detection.

> 2. It might end up taking many more locks than necessary and holding
> them for much longer than necessary.
Right now, for example, a syscache lookup locks the table only if we actually need to read from it and releases the lock as soon as the actual read is complete. Under this design, every syscache that the parallel worker might conceivably consult needs to be locked for the entire duration of the parallel computation. I would expect this to provoke a violent negative reaction from at least one prominent community member (and I bet users wouldn't like it much, either). I see little reason to do this for system level relations. So, I am still of the opinion that group locking makes sense. While I appreciate the urge to avoid solving difficult problems where it's reasonably avoidable, I think we're in danger of spending more effort trying to avoid solving this particular problem than it would take to actually solve it. Based on what I've done so far, I'm guessing that a complete group locking patch will be between 1000 and 1500 lines of code and that nearly all of the new logic will be skipped when it's not in use (i.e. no parallelism). That sounds to me like a hell of a deal compared to trying to predict what locks the child might conceivably take and preemptively acquire them all, which sounds annoyingly tedious even for simple things (like equality operators) and nearly impossible for anything more complicated. What I'm worried about is the performance impact of group locking when it's not in use. The heavyweight locking code is already quite complex and often a noticeable bottleneck... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
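Andres's suggestion of making group edges visible to the deadlock detector can be pictured as ordinary cycle detection in a waits-for graph, with all members of a lock group collapsed into a single node (members of one group never deadlock against each other). A minimal sketch, with the graph representation invented for illustration:

```python
def has_deadlock(waits_for, groups):
    """waits_for: dict proc -> set of procs it waits on.
    groups: list of sets of procs sharing locks (lock groups).
    Collapse each group to one representative, then look for a cycle."""
    rep = {}
    for g in groups:
        leader = min(g)
        for p in g:
            rep[p] = leader
    edges = {}
    for src, dsts in waits_for.items():
        s = rep.get(src, src)
        for d in dsts:
            t = rep.get(d, d)
            if s != t:   # intra-group waits are not deadlocks
                edges.setdefault(s, set()).add(t)

    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def visit(n):   # depth-first search; a GREY node revisited = cycle
        color[n] = GREY
        for m in edges.get(n, ()):
            c = color.get(m, WHITE)
            if c == GREY or (c == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False
    return any(visit(n) for n in list(edges)
               if color.get(n, WHITE) == WHITE)

# leader A waits on B, and B waits on A's worker A2: with {A, A2}
# treated as one group this is a cycle; without grouping it is not
assert has_deadlock({"A": {"B"}, "B": {"A2"}}, groups=[{"A", "A2"}])
assert not has_deadlock({"A": {"B"}, "B": {"A2"}}, groups=[])
```

The real detector would of course work on the shared lock table, but the collapse-then-search shape is the essence of the "make the edges visible" idea.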
Re: [HACKERS] TAP test breakage on MacOS X
On 10/30/2014 09:17 PM, Andres Freund wrote: On 2014-10-30 21:03:43 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-10-30 20:13:53 -0400, Tom Lane wrote: As I said upthread, that approach seems to me to be contrary to the project policy about how configure should behave. I don't think that holds much water. There's a fair amount of things that configure detects automatically. I don't think the comparison to plperl or such is meaningful - that's a runtime/install time difference. These tests are not. Meh. Right now, it's easy to dismiss these tests as unimportant, figuring that they play little part in whether the completed build is reliable. But that may not always be true. If they do become a significant part of our test arsenal, silently omitting them will not be cool for configure to do. Well, I'm all for erroring out if somebody passed --enable-foo-tests and the prerequisites aren't there. What I *am* against is requiring an explicit flag to enable them because then they'll just not be run in enough environments. And that's what's much more likely to cause unnoticed bugs. When this is properly sorted out I will enable this in the buildfarm default configuration. So I don't think that's going to be an issue in the long term. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing Catalog Locking
Simon Riggs si...@2ndquadrant.com writes: Recent work on parallel query has opened my eyes to exactly how frequently we request locks on various catalog tables. (Attached file is a lock analysis on a representative Pg server). Given these are catalog tables, we aren't doing much to them that requires a strong lock. Specifically, only CLUSTER and VACUUM FULL touch those tables like that. When we do that, pretty much everything else hangs, cos you can't get much done while fundamental tables are locked. So don't do that --- I'm not aware that either operation is ever considered recommended on catalogs. So my proposal is that we invent a big catalog lock. The benefit of this is that we greatly reduce lock request traffic, as well as being able to control exactly when such requests occur. (Fine grained locking isn't always helpful). Currently, SearchCatCache() requests locks on individual catalog tables. Alternatively, we could request an AccessShareLock on a big catalog lock that must be accessed first before a strong relation-specific lock is requested. We just need to change the lockid used for each cache. I doubt that this can ever be safe, because it will effectively assume that all operations on catalog tables are done by code that knows that it is accessing a catalog. What about manual DML, or even DDL, on a catalog? Miss even one place that can modify a table, and you have a problem. More to the point, how would using a big lock not make the contention situation *worse* rather than better? At least if you decide you need to cluster pg_statistic, you aren't blocking sessions that don't need to touch pg_statistic --- and furthermore, they aren't blocking you. I think the proposal would render it completely impossible to ever get a strong lock on a catalog table in a busy system, not even a little-used catalog. 
In fact, since we can assume that a transaction trying to do CLUSTER
pg_class will have touched at least one syscache during startup, this
proposal would absolutely guarantee that it would fail (even in a
completely idle system), because the transaction would already hold the
BigLock, and that would have to be seen as existing use of the table.

			regards, tom lane
Re: [HACKERS] Missing FIN_CRC32 calls in logical replication code
On 2014-10-27 09:30:33 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-10-27 12:51:44 +0200, Heikki Linnakangas wrote: replication/slot.c and replication/logical/snapbuild.c use a CRC on the physical slot and snapshot files. It uses the same algorithm as used e.g. for the WAL. However, they are not doing the finalization step, FIN_CRC32() on the calculated checksums. Not that it matters much, but it's a bit weird and inconsistent, and was probably an oversight. Hm. Good catch - that's stupid. I wonder what to do about it. I'm tempted to just add a comment about it to 9.4 and fix it on master as changing it is essentially a catversion bump. Any objections to that? Yeah, I think you should get it right the first time. It hardly seems likely that any 9.4 beta testers are depending on those files to be stable yet. Since both state files have the version embedded it'd be trivial to just do the FIN_CRC32() when loading a version 2 file. Does anybody object to the relevant two lines of code + docs? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
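The version-gated fix Andres describes (apply the finalization step only when loading a file whose embedded version says it was written with it) can be sketched as follows. PostgreSQL's FIN_CRC32 is the standard final bit inversion (XOR with 0xFFFFFFFF); zlib's crc32 uses the same polynomial but already folds that inversion in, so the sketch undoes it to model the running (unfinalized) value the slot/snapbuild code computed. Function names and the version semantics here are illustrative only:

```python
import zlib

FIN_XOR = 0xFFFFFFFF   # FIN_CRC32 finalization: final bit inversion

def file_checksum(payload: bytes, version: int) -> int:
    # model the running CRC as computed while writing the state file
    # (zlib.crc32 includes the final XOR; strip it to get the raw value)
    running = zlib.crc32(payload) ^ FIN_XOR
    if version >= 2:
        # fixed format: version-2 files store the finalized value
        return running ^ FIN_XOR
    # buggy/legacy format: finalization step was skipped
    return running

data = b"replication slot state"
assert file_checksum(data, version=2) == zlib.crc32(data)
assert file_checksum(data, version=1) == zlib.crc32(data) ^ FIN_XOR
```

A loader would pick the comparison value based on the version field, which is the "two lines of code" Andres refers to.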
Re: [HACKERS] Reducing Catalog Locking
Robert Haas robertmh...@gmail.com writes: On a related note, I've previously had the thought that it would be nice to have a big DDL lock - that is, a lock that prevents concurrent DDL without preventing anything else - so that pg_dump could get just that one lock and then not worry about the state of the world changing under it. Hm ... how would that work exactly? Every DDL operation has to take the BigDDLLock in shared mode, and then pg_dump takes it in exclusive mode? That would preclude two pg_dumps running in parallel, which maybe isn't a mainstream usage but still there's never been such a restriction before. Parallel pg_dump might have an issue in particular. But more to the point, this seems like optimizing pg_dump startup by adding overhead everywhere else, which doesn't really sound like a great tradeoff to me. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing Catalog Locking
On 2014-10-31 09:48:52 -0400, Tom Lane wrote:
> Robert Haas robertmh...@gmail.com writes:
>> On a related note, I've previously had the thought that it would be
>> nice to have a big DDL lock - that is, a lock that prevents concurrent
>> DDL without preventing anything else - so that pg_dump could get just
>> that one lock and then not worry about the state of the world changing
>> under it.
> Hm ... how would that work exactly? Every DDL operation has to take the
> BigDDLLock in shared mode, and then pg_dump takes it in exclusive mode?
> That would preclude two pg_dumps running in parallel, which maybe isn't
> a mainstream usage but still there's never been such a restriction
> before. Parallel pg_dump might have an issue in particular.

It should probably be a heavyweight lock. Then every DDL operation can
take it in RowExclusiveLock mode and pg_dump can take ShareLock. As
RowExclusive is a fastpath-eligible lock, that'll not even hit the
global lock table most of the time.

> But more to the point, this seems like optimizing pg_dump startup by
> adding overhead everywhere else, which doesn't really sound like a
> great tradeoff to me.

Well, it'd finally make pg_dump correct under concurrent DDL. That's
quite a worthwhile thing.

Greetings,

Andres Freund

-- 
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
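The mode arithmetic in Andres's suggestion can be shown with a tiny conflict check: DDL takes the lock in RowExclusiveLock mode (which does not conflict with itself, so concurrent DDL proceeds), and pg_dump takes ShareLock, which conflicts with RowExclusiveLock but not with another ShareLock, so parallel pg_dumps coexist. The mode names follow PostgreSQL's lock modes; the matrix below is trimmed to just these two modes for illustration:

```python
# conflicts[m] = set of modes that conflict with mode m (trimmed to the
# two modes relevant here; the full PostgreSQL matrix is larger)
CONFLICTS = {
    "RowExclusiveLock": {"ShareLock"},
    "ShareLock": {"RowExclusiveLock"},
}

def can_grant(requested: str, held_modes) -> bool:
    """Grant iff the requested mode conflicts with no already-held mode."""
    return all(requested not in CONFLICTS[h] for h in held_modes)

# two concurrent DDL statements: fine
assert can_grant("RowExclusiveLock", ["RowExclusiveLock"])
# two parallel pg_dumps: fine
assert can_grant("ShareLock", ["ShareLock"])
# pg_dump vs in-flight DDL: blocks until the DDL finishes
assert not can_grant("ShareLock", ["RowExclusiveLock"])
```

This is why the heavyweight-lock variant avoids Tom's objection about precluding parallel pg_dump, while still serializing pg_dump against DDL.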
Re: [HACKERS] Reducing Catalog Locking
On Fri, Oct 31, 2014 at 9:54 AM, Andres Freund and...@2ndquadrant.com wrote:
>> But more to the point, this seems like optimizing pg_dump startup by
>> adding overhead everywhere else, which doesn't really sound like a
>> great tradeoff to me.
> Well, it'd finally make pg_dump correct under concurrent DDL. That's
> quite a worthwhile thing.

Yeah, exactly. I agree with Tom that the overhead might be a concern.
But on the other hand, nobody has been more concerned about the failure
of pg_dump to handle this issue correctly than Tom.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On Tue, Dec 10, 2013 at 2:48 PM, Robert Haas robertmh...@gmail.com wrote:
> Speaking of the functionality this does offer, it seems pretty
> limited. A commit timestamp is nice, but it isn't very interesting on
> its own. You really also want to know what the transaction did, who
> ran it, etc. ISTM some kind of auditing or log-parsing system that
> could tell you all that would be much more useful, but this patch
> doesn't get us any closer to that.

For what it's worth, I think that this has been requested numerous times
over the years by numerous developers of replication solutions.

> My main question (apart from whether or not it may have bugs) is
> whether it makes a noticeable performance difference. If it does, that
> sucks. If it does not, maybe we ought to enable it by default.

+1

It's also requested now and then in the context of auditing and forensic
analysis of application problems. But I also agree that the tolerance
for performance overhead has got to be quite low. If a GUC is introduced
to manage the tradeoff, it should default to 'on'.

merlin
Re: [HACKERS] Reducing Catalog Locking
Andres Freund and...@2ndquadrant.com writes: On 2014-10-31 09:48:52 -0400, Tom Lane wrote: But more to the point, this seems like optimizing pg_dump startup by adding overhead everywhere else, which doesn't really sound like a great tradeoff to me. Well, it'd finally make pg_dump correct under concurrent DDL. That's quite a worthwhile thing. I lack adequate caffeine at the moment, so explain to me how this adds any guarantees whatsoever? It sounded like only a performance optimization from here. regards, tom lane
Re: [HACKERS] tracking commit timestamps
Merlin Moncure mmonc...@gmail.com writes: It's also requested now and then in the context of auditing and forensic analysis of application problems. But I also agree that the tolerance for performance overhead has got to be quite low. If a GUC is introduced to manage the tradeoff, it should be defaulted to 'on'. Alvaro's original submission specified that the feature defaults to off. Since there's no use-case for it unless you've installed some third-party code (eg an external replication solution), I think that should stay true. The feature's overhead might be small, but it is most certainly not zero, and people shouldn't be paying for it unless they need it. regards, tom lane
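For readers following the thread, the opt-in shape being discussed looks roughly like the sketch below. The GUC name track_commit_timestamp comes from the patch series elsewhere in this digest; the SQL lookup function is purely illustrative and not taken from the patch:

```sql
-- postgresql.conf: settable only at server start, defaulting to off,
-- so nobody pays the overhead unless they explicitly ask for it:
--   track_commit_timestamp = on

-- With the feature enabled, a replication or audit tool could map an
-- xid to its commit time. The function name here is a guess at the
-- eventual SQL-level interface (illustrative only):
SELECT pg_xact_commit_timestamp('12345'::xid);
```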
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-10-31 08:54:54 -0400, Robert Haas wrote: On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs si...@2ndquadrant.com wrote: Is it genuinely required for most parallel operations? I think it's clear that none of us knows the answer. Sure, the general case needs it, but is the general case the same thing as the reasonably common case? Well, I think that the answer is pretty clear. Most of the time, perhaps in 99.9% of cases, group locking will make no difference as to whether a parallel operation succeeds or fails. Occasionally, however, it will cause an undetected deadlock. I don't hear anyone arguing that that's OK. Does anyone wish to make that argument? Maybe we can, as a first step, make those edges in the lock graph visible to the deadlock detector? It's pretty clear that undetected deadlocks aren't ok, but detectable deadlocks in a couple corner cases might be acceptable. Interesting idea. I agree that would be more acceptable. It would be kind of sad, though, if you got a deadlock from this: BEGIN; LOCK TABLE foo; SELECT * FROM foo; You can, of course, avoid that, but it's basically transferring the burden from the lock manager to every bit of parallel code we ever write. If it turns out to be easier to do this than what I'm currently proposing, I'm definitely amenable to going this route for version 1, but I don't think it's going to be appealing as a permanent solution. 1. Turing's theorem being what it is, predicting what catalog tables the child might lock is not necessarily simple. Well, there's really no need to be absolutely general here. We're only going to support a subset of functionality as parallelizable. And in that case we don't need anything complicated to acquire all locks. See, that's what I don't believe. Even very simple things like equality operators do a surprising amount of catalog locking. 2. 
It might end up taking many more locks than necessary and holding them for much longer than necessary. Right now, for example, a syscache lookup locks the table only if we actually need to read from it and releases the lock as soon as the actual read is complete. Under this design, every syscache that the parallel worker might conceivably consult needs to be locked for the entire duration of the parallel computation. I would expect this to provoke a violent negative reaction from at least one prominent community member (and I bet users wouldn't like it much, either). I see little reason to do this for system level relations. I'm not following you. So, I am still of the opinion that group locking makes sense. While I appreciate the urge to avoid solving difficult problems where it's reasonably avoidable, I think we're in danger of spending more effort trying to avoid solving this particular problem than it would take to actually solve it. Based on what I've done so far, I'm guessing that a complete group locking patch will be between 1000 and 1500 lines of code and that nearly all of the new logic will be skipped when it's not in use (i.e. no parallelism). That sounds to me like a hell of a deal compared to trying to predict what locks the child might conceivably take and preemptively acquire them all, which sounds annoyingly tedious even for simple things (like equality operators) and nearly impossible for anything more complicated. What I'm worried about is the performance impact of group locking when it's not in use. The heavyweight locking code is already quite complex and often a noticeable bottleneck... I certainly agree that it would be unacceptable for group locking to significantly regress the normal case. I'm pretty optimistic about reducing the overhead to near-zero, though - a few extra branch instructions on the non-fast-path case should be lost in the noise. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
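A concrete sketch of the hazard under discussion, assuming a parallel worker must take its own heavyweight lock on the scanned table (table name hypothetical):

```sql
BEGIN;
LOCK TABLE foo;       -- leader takes AccessExclusiveLock on foo
SELECT * FROM foo;    -- suppose this scan is handed to parallel workers

-- Without group locking, each worker backend requests its own
-- AccessShareLock on foo, which conflicts with the leader's
-- AccessExclusiveLock. The worker then waits on the leader while
-- the leader waits for the worker's tuples: a deadlock the detector
-- cannot see unless leader and workers are treated as one lock group.
```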
[HACKERS] CREATE INDEX CONCURRENTLY?
I have not kept up with PostgreSQL changes and have just been using it. A co-worker recently told me that you need the word CONCURRENTLY in CREATE INDEX to avoid table locking. I called BS on this because to my knowledge PostgreSQL does not lock tables. I referenced this page in the documentation: http://www.postgresql.org/docs/9.3/static/locking-indexes.html However, I do see this sentence in the indexing page that was not in the docs prior to 8.0: Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single scan of the table. Is this true? When/why the change? When we use concurrently, it seems to hang. I am looking into it.
Re: [HACKERS] infinite loop in _bt_getstackbuf
On Thu, Oct 30, 2014 at 11:45 PM, Noah Misch n...@leadboat.com wrote: Given the lack of prior complaints about this loop, I'm not sure I see the need to work harder than that; corruption of this sort must be quite rare. Looks like _bt_getstackbuf() is always called with some buffer lock held, so CHECK_FOR_INTERRUPTS() alone would not help: http://www.postgresql.org/message-id/flat/16519.1401395...@sss.pgh.pa.us That's the insert path. What about the vacuum path? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On 31/10/14 15:07, Tom Lane wrote: Merlin Moncure mmonc...@gmail.com writes: It's also requested now and then in the context of auditing and forensic analysis of application problems. But I also agree that the tolerance for performance overhead has got to be quite low. If a GUC is introduced to manage the tradeoff, it should be defaulted to 'on'. Alvaro's original submission specified that the feature defaults to off. Since there's no use-case for it unless you've installed some third-party code (eg an external replication solution), I think that should stay true. The feature's overhead might be small, but it is most certainly not zero, and people shouldn't be paying for it unless they need it. Agreed, that's why it stayed 'off' in my version too. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Column Redaction
On 16 October 2014 01:29, Claudio Freire klaussfre...@gmail.com wrote: But in any case, if the deterrence isn't enough, and you get attacked, anything involving redaction as fleshed out in the OP is good for nothing. The damage has been done already. The feature doesn't meaningfully slow down extraction of data, so anything you do can only punish the attacker, not prevent further data theft or damaged reputation/business. Deterrence is exactly the goal. Only punishing the attacker is exactly what this is for. This is not the same thing as preventative security. Redaction is designed to prevent authorized users from accidental misuse. Your business already trusts these people. You know their names, their addresses, their bank account details and you'll have already run security scans on them. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 2014-10-31 10:08:59 -0400, Robert Haas wrote: On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-10-31 08:54:54 -0400, Robert Haas wrote: On Fri, Oct 31, 2014 at 6:41 AM, Simon Riggs si...@2ndquadrant.com wrote: Is it genuinely required for most parallel operations? I think it's clear that none of us knows the answer. Sure, the general case needs it, but is the general case the same thing as the reasonably common case? Well, I think that the answer is pretty clear. Most of the time, perhaps in 99.9% of cases, group locking will make no difference as to whether a parallel operation succeeds or fails. Occasionally, however, it will cause an undetected deadlock. I don't hear anyone arguing that that's OK. Does anyone wish to make that argument? Maybe we can, as a first step, make those edges in the lock graph visible to the deadlock detector? It's pretty clear that undetected deadlocks aren't ok, but detectable deadlocks in a couple corner cases might be acceptable. Interesting idea. I agree that would be more acceptable. It would be kind of sad, though, if you got a deadlock from this: BEGIN; LOCK TABLE foo; SELECT * FROM foo; I have serious doubts about the number of cases where it's correct to access relations in a second backend that are exclusively locked. Not so much when that happens for a user issued LOCK statement of course, but when the system has done so. Personally I think to stay sane we'll have to forbid access to anything normal other backends can't access either - otherwise things will get too hard to reason about. So just refusing parallelism as soon as anything has taken an access exclusive lock doesn't sound too bad to me. You can, of course, avoid that, but it's basically transferring the burden from the lock manager to every bit of parallel code we ever write. 
If it turns out to be easier to do this than what I'm currently proposing, I'm definitely amenable to going this route for version 1, but I don't think it's going to be appealing as a permanent solution. It's probably not the solution forever, I agree. But it might be the simplest way forward till we have more actual users. 1. Turing's theorem being what it is, predicting what catalog tables the child might lock is not necessarily simple. Well, there's really no need to be absolutely general here. We're only going to support a subset of functionality as parallelizable. And in that case we don't need anything complicated to acquire all locks. See, that's what I don't believe. Even very simple things like equality operators do a surprising amount of catalog locking. So what? I don't see catalog locking as something problematic? Maybe I'm missing something, but I don't see in which cases you would expect deadlocks or anything like it on catalog relations? So, I am still of the opinion that group locking makes sense. While I appreciate the urge to avoid solving difficult problems where it's reasonably avoidable, I think we're in danger of spending more effort trying to avoid solving this particular problem than it would take to actually solve it. Based on what I've done so far, I'm guessing that a complete group locking patch will be between 1000 and 1500 lines of code and that nearly all of the new logic will be skipped when it's not in use (i.e. no parallelism). That sounds to me like a hell of a deal compared to trying to predict what locks the child might conceivably take and preemptively acquire them all, which sounds annoyingly tedious even for simple things (like equality operators) and nearly impossible for anything more complicated. What I'm worried about is the performance impact of group locking when it's not in use. The heavyweight locking code is already quite complex and often a noticeable bottleneck... 
I certainly agree that it would be unacceptable for group locking to significantly regress the normal case. I'm pretty optimistic about reducing the overhead to near-zero, though - a few extra branch instructions on the non-fast-path case should be lost in the noise. I'm less worried about the additional branches than about increasing the size of the data structures. They're already pretty huge. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] CREATE INDEX CONCURRENTLY?
On 10/31/2014 10:28 AM, Mark Woodward wrote: I have not kept up with PostgreSQL changes and have just been using it. A co-worker recently told me that you need the word CONCURRENTLY in CREATE INDEX to avoid table locking. I called BS on this because to my knowledge PostgreSQL does not lock tables. I referenced this page in the documentation: http://www.postgresql.org/docs/9.3/static/locking-indexes.html That page refers to using the indexes, not creating them. However, I do see this sentence in the indexing page that was not in the docs prior to 8.0: Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single scan of the table. Is this true? When/why the change? When we use concurrently, it seems to hang. I am looking into it. Creating indexes always did lock tables. See for example http://www.postgresql.org/docs/7.4/static/explicit-locking.html#LOCKING-TABLES where CREATE INDEX is documented to take a SHARE lock on the table. CONCURRENTLY was an additional feature to allow you to get around this, at the possible cost of some extra processing. So we haven't made things harder, we've made them easier, and your understanding of old releases is incorrect. cheers andrew
Re: [HACKERS] CREATE INDEX CONCURRENTLY?
On Fri, Oct 31, 2014 at 2:28 PM, Mark Woodward mark.woodw...@actifio.com wrote: I have not kept up with PostgreSQL changes and have just been using it. A co-worker recently told me that you need the word CONCURRENTLY in CREATE INDEX to avoid table locking. I called BS on this because to my knowledge PostgreSQL does not lock tables. I referenced this page in the documentation: You can read from tables while a normal index build is in progress but you can't insert, update, or delete from them. CREATE INDEX CONCURRENTLY allows you to insert, update, and delete data while the index build is running at the expense of having the index build take longer. -- greg
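In lock-level terms, the two behaviors described above look like this (table and index names are illustrative):

```sql
-- Plain build: takes a SHARE lock on the table, so concurrent
-- SELECTs proceed but INSERT/UPDATE/DELETE block until it finishes.
CREATE INDEX t_col_idx ON t (col);

-- Concurrent build: takes a weaker SHARE UPDATE EXCLUSIVE lock, so
-- writes continue, at the cost of extra table scans, waiting out
-- concurrent transactions, and an INVALID index left behind on failure.
CREATE INDEX CONCURRENTLY t_col_idx ON t (col);
```

The waiting on concurrent transactions is why CREATE INDEX CONCURRENTLY can appear to hang: it cannot finish while any transaction that predates the build is still open.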
Re: [HACKERS] Reducing Catalog Locking
On 31 October 2014 13:39, Tom Lane t...@sss.pgh.pa.us wrote: I doubt that this can ever be safe, because it will effectively assume that all operations on catalog tables are done by code that knows that it is accessing a catalog. What about manual DML, or even DDL, on a catalog? I've never really understood why you think it's a good idea to allow such commands. It's pretty easy to see that can screw things up a million ways. It would be easy enough to make the superuser check acquire the BigCatalogLock before it does anything else. That way only the superuser code path would be affected by the special case required to get around that problem. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Reducing Catalog Locking
On 2014-10-31 10:02:28 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-10-31 09:48:52 -0400, Tom Lane wrote: But more to the point, this seems like optimizing pg_dump startup by adding overhead everywhere else, which doesn't really sound like a great tradeoff to me. Well, it'd finally make pg_dump correct under concurrent DDL. That's quite a worthwhile thing. I lack adequate caffeine at the moment, so explain to me how this adds any guarantees whatsoever? It sounded like only a performance optimization from here. A performance optimization might be what Simon intended, but it isn't primarily what I (and presumably Robert) thought it would be useful for. Consider the example in http://archives.postgresql.org/message-id/20130507141526.GA6117%40awork2.anarazel.de If pg_dump were to take the 'ddl lock' *before* acquiring the snapshot to lock all tables, that scenario couldn't happen anymore. As soon as pg_dump has acquired the actual locks the ddl lock could be released again. Taking the ddl lock from SQL would probably require some 'backup' or superuser permission, but luckily there seems to be movement around that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Reducing Catalog Locking
On 31 October 2014 13:03, Robert Haas robertmh...@gmail.com wrote: Given these are catalog tables, we aren't doing much to them that requires a strong lock. Specifically, only CLUSTER and VACUUM FULL touch those tables like that. When we do that, pretty much everything else hangs, cos you can't get much done while fundamental tables are locked. True, although it's currently the case that catalog tables are only locked for the minimum time necessary. So, VF on pg_class will block nearly any new query from starting up, but already-running queries may be able to keep going, and idle transactions don't cause a problem. If we held system catalogs until transaction commit, a VF on pg_class would basically wait until every other transaction in the system completed and preclude any other transaction from starting until it finished. That's significantly more problematic in my view. No, not really. As soon as you put that VF in there, queries will begin to block. It doesn't really matter at what point they block, so it doesn't make the problem worse. VFs on pg_class are very rare and not usually run while trying to make a normal workload happen, so it's a strange thing to care about how well that is optimized. VACUUM FULL on pg_class only ever happens because of temp tables anyway. I have been investigating that for other purposes, see new thread soon. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Temp tables, pg_class_temp and AccessExclusiveLocks
While investigating how to reduce logging of AccessExclusiveLocks for temp tables, I came up with the attached patch. My feeling was that's an ugly patch, but it set me thinking that a more specialised code path around temp tables could be useful in other ways, and therefore worth pursuing further. The patch creates a relation_open_temp(), which could then allow a RelationIdGetTempRelation() which could have a path that calls ScanPgRelation on a new catalog table called pg_class_temp, and other _temp catalog table accesses. So we could isolate all temp table access to specific tables. That would * cause less bloat in the main catalog tables * avoid logging AccessExclusiveLocks on temp tables * generate less WAL traffic Why less WAL traffic? Because we can choose then to make *_temp catalog tables 1. use unlogged table mechanism 2. use local tables inside the local buffer manager My preference would be (1) because all we need to do is change the catalog table oid when searching, we don't need to do anything else. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 31 October 2014 12:54, Robert Haas robertmh...@gmail.com wrote: 1. Turing's theorem being what it is, predicting what catalog tables the child might lock is not necessarily simple. The Pareto principle offers ways to cope with the world's lack of simplicity. You mentioned earlier that functions would need to be marked proisparallel etc. What conditions will that be protecting against? If we aren't going to support the general case where every single thing works, can we at least discuss what the list of cases is that we will support. I don't think we can argue that everything must be generic when we already admit it won't be. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Temp tables, pg_class_temp and AccessExclusiveLocks
On 31 October 2014 14:53, Simon Riggs si...@2ndquadrant.com wrote: While investigating how to reduce logging of AccessExclusiveLocks for temp tables, I came up with the attached patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services temp_tables_skip_logging_locks.v1.patch Description: Binary data
Re: [HACKERS] Reducing Catalog Locking
On 31 October 2014 14:49, Andres Freund and...@2ndquadrant.com wrote: On 2014-10-31 10:02:28 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-10-31 09:48:52 -0400, Tom Lane wrote: But more to the point, this seems like optimizing pg_dump startup by adding overhead everywhere else, which doesn't really sound like a great tradeoff to me. Well, it'd finally make pg_dump correct under concurrent DDL. That's quite a worthwhile thing. I lack adequate caffeine at the moment, so explain to me how this adds any guarantees whatsoever? It sounded like only a performance optimization from here. A performance optimization might be what Simon intended, but it isn't primarily what I (and presumably Robert) thought it would be useful for. Consider the example in http://archives.postgresql.org/message-id/20130507141526.GA6117%40awork2.anarazel.de If pg_dump were to take the 'ddl lock' *before* acquiring the snapshot to lock all tables, that scenario couldn't happen anymore. As soon as pg_dump has acquired the actual locks the ddl lock could be released again. Taking the ddl lock from SQL would probably require some 'backup' or superuser permission, but luckily there seems to be movement around that. Good idea. But it is a different idea. I can do that as well... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf
Hi, Attached is the v2 of the patch with the review comments addressed (see below). On 29/10/14 21:08, Petr Jelinek wrote: On 29/10/14 20:27, Asif Naeem wrote: 1. It seems that the following log message needs to be more descriptive about the reason for shutdown i.e. + if (recoveryShutdownAtTarget reachedStopPoint) + { + ereport(LOG, (errmsg(shutting down))); Agreed, I just wasn't sure on what exactly to write, I originally had there shutting down by user request or some such but that's misleading. I changed it to shutting down at recovery target hope that's better. 2. As Simon suggested, the following recovery settings are not clear i.e. shutdown_at_recovery_target = true pause_at_recovery_target = true Hmm I completely missed Simon's email, strange. Well the other option would be to throw an error if both are set to true - an error will have to happen anyway if action_at_recovery_target is set to shutdown while pause_at_recovery_target is true (I think we need to keep pause_at_recovery_target for compatibility). But considering all of you think something like action_at_recovery_target is a better solution, I will do it that way then. Done, there is now action_at_recovery_target which can be set to either pause, continue or shutdown, defaulting to pause (which is the same as the old behavior of pause_at_recovery_target defaulting to true). I also added a check that prohibits using both pause_at_recovery_target and action_at_recovery_target in the same config, mainly to avoid users shooting themselves in the foot. 3. As it doesn't rename recovery.conf, a subsequent attempt (without any changes in recovery.conf) to start the server keeps showing the following i.e. ... 
LOG: redo starts at 0/1803620 DEBUG: checkpointer updated shared memory configuration values LOG: consistent recovery state reached at 0/1803658 LOG: restored log file 00010002 from archive DEBUG: got WAL segment from archive LOG: restored log file 00010003 from archive DEBUG: got WAL segment from archive LOG: restored log file 00010004 from archive DEBUG: got WAL segment from archive LOG: restored log file 00010005 from archive DEBUG: got WAL segment from archive LOG: restored log file 00010006 from archive DEBUG: got WAL segment from archive … Yes, it will still replay everything since last checkpoint, that's by design since otherwise we would have to write checkpoint on shutdown and that would mean the instance can't be used as hot standby anymore and I consider that an important feature to have. I added note to the documentation that says this will happen. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml index 0f1ff34..fe42394 100644 --- a/doc/src/sgml/recovery-config.sgml +++ b/doc/src/sgml/recovery-config.sgml @@ -289,12 +289,39 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p' # Windows /term listitem para -Specifies whether recovery should pause when the recovery target -is reached. The default is true. -This is intended to allow queries to be executed against the -database to check if this recovery target is the most desirable -point for recovery. The paused state can be resumed by using -functionpg_xlog_replay_resume()/ (See +Alias for action_at_recovery_target, literaltrue/ is same as +action_at_recovery_target = literalpause/ and literalfalse/ +is same as action_at_recovery_target = literalcontinue/. + /para + para +This setting has no effect if xref linkend=guc-hot-standby is not +enabled, or if no recovery target is set. 
+ /para + /listitem + /varlistentry + + /variablelist + + varlistentry id=action-at-recovery-target + xreflabel=action_at_recovery_target + termvarnameaction_at_recovery_target/varname (typeenum/type) + indexterm +primaryvarnameaction_at_recovery_target/ recovery parameter/primary + /indexterm + /term + listitem + para +Specifies what action should PostgreSQL do once the recovery target is +reached. The default is literalpause/, which means recovery will +be paused. literalcontinue/ means there will be no special action +taken when recovery target is reached and normal operation will continue. +Finally literalshutdown/ will stop the PostgreSQL instance at +recovery target. + /para +The intended use of literalpause/ setting is to allow queries to be +executed against the database to check if this recovery target is the +most desirable
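Putting the v2 behavior together, a recovery.conf using the new setting might look like the following sketch (the restore command and target time are illustrative; per the patch, combining action_at_recovery_target with pause_at_recovery_target in the same config is rejected):

```
# recovery.conf
restore_command = 'cp /mnt/archive/%f "%p"'
recovery_target_time = '2014-10-31 12:00:00'

# New in v2: one of 'pause' (default), 'continue', or 'shutdown'.
action_at_recovery_target = 'shutdown'
```

Note that, as discussed above, a later start of the same data directory will still replay from the last checkpoint, since no shutdown checkpoint is written; that is deliberate so the instance can still be used as a hot standby.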
Re: [HACKERS] tracking commit timestamps
Hi, On 28/10/14 13:25, Simon Riggs wrote: On 13 October 2014 10:05, Petr Jelinek p...@2ndquadrant.com wrote: I worked bit on this patch to make it closer to committable state. Here is updated version that works with current HEAD for the October committfest. I've reviewed this and it looks good to me. Clean, follows existing code neatly, documented and tested. Thanks for looking at this. Please could you document a few things * ExtendCommitTS() works only because commit_ts_enabled can only be set at server start. We need that documented so somebody doesn't make it more easily enabled and break something. (Could we make it enabled at next checkpoint or similar?) Maybe we could, but that means some kind of two-step enabling facility and I don't want to write that as part of the initial patch since that will need separate discussion (i.e. do we want to have a new GUC flag for this, or a hack solution just for committs?). So maybe later in a follow-up patch. * The SLRU tracks timestamps of both xids and subxids. We need to document that it does this because Subtrans SLRU is not persistent. If we made Subtrans persistent we might need to store only the top level xid's commitTS, but that's very useful for typical use cases and wouldn't save much time at commit. Attached version with the above comments near the relevant code. 
-- Petr Jelinek                  http://www.2ndQuadrant.com/
   PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index bfb3573..c0a0409 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -9,6 +9,7 @@
 #include "postgres.h"

 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_committs
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We can't support installcheck because normally installcheck users don't have
+# the required track_commit_timestamp on
+installcheck:;
+
+check: regresscheck
+
+submake-regress:
+	$(MAKE) -C $(top_builddir)/src/test/regress all
+
+submake-test_committs:
+	$(MAKE) -C $(top_builddir)/contrib/test_committs
+
+REGRESSCHECKS=committs_on
+
+regresscheck: all | submake-regress submake-test_committs
+	$(MKDIR_P) regression_output
+	$(pg_regress_check) \
+	    --temp-config $(top_srcdir)/contrib/test_committs/committs.conf \
+	    --temp-install=./tmp_check \
+	    --extra-install=contrib/test_committs \
+	    --outputdir=./regression_output \
+	    $(REGRESSCHECKS)
+
+regresscheck-install-force: | submake-regress submake-test_committs
+	$(pg_regress_installcheck) \
+	    --extra-install=contrib/test_committs \
+	    $(REGRESSCHECKS)
+
+.PHONY: submake-test_committs submake-regress check \
+	regresscheck regresscheck-install-force
\ No newline at end of file
diff --git a/contrib/test_committs/committs.conf b/contrib/test_committs/committs.conf
new file mode 100644
index 000..d221a60
--- /dev/null
+++ b/contrib/test_committs/committs.conf
@@ -0,0 +1 @@
+track_commit_timestamp = on
\ No newline at end of file
diff --git a/contrib/test_committs/expected/committs_on.out b/contrib/test_committs/expected/committs_on.out
new file mode 100644
index 000..9920343
--- /dev/null
+++ b/contrib/test_committs/expected/committs_on.out
[HACKERS] why does LIMIT 2 take orders of magnitude longer than LIMIT 1 in this query?
I'm on PostgreSQL 9.3. This should reproduce on any table with 100,000+ rows. The EXPLAIN ANALYZE shows many more rows getting scanned with LIMIT 2, but I can't figure out why.

Limit 1:

EXPLAIN ANALYZE
WITH base AS (
  SELECT *, ROW_NUMBER() OVER () AS rownum FROM a_big_table
), filter AS (
  SELECT rownum, true AS thing FROM base
)
SELECT * FROM base LEFT JOIN filter USING (rownum) WHERE filter.thing LIMIT 1

Result:

Limit  (cost=283512.19..283517.66 rows=1 width=2114) (actual time=0.019..0.019 rows=1 loops=1)
  CTE base
    ->  WindowAgg  (cost=0.00..188702.69 rows=4740475 width=101) (actual time=0.008..0.008 rows=1 loops=1)
          ->  Seq Scan on a_big_table  (cost=0.00..129446.75 rows=4740475 width=101) (actual time=0.003..0.003 rows=1 loops=1)
  CTE filter
    ->  CTE Scan on base base_1  (cost=0.00..94809.50 rows=4740475 width=8) (actual time=0.000..0.000 rows=1 loops=1)
  ->  Nested Loop  (cost=0.00..307677626611.24 rows=56180269915 width=2114) (actual time=0.018..0.018 rows=1 loops=1)
        Join Filter: (base.rownum = filter.rownum)
        ->  CTE Scan on base  (cost=0.00..94809.50 rows=4740475 width=2113) (actual time=0.011..0.011 rows=1 loops=1)
        ->  CTE Scan on filter  (cost=0.00..94809.50 rows=2370238 width=9) (actual time=0.002..0.002 rows=1 loops=1)
              Filter: thing
Total runtime: 0.057 ms

Limit 2:

EXPLAIN ANALYZE
WITH base AS (
  SELECT *, ROW_NUMBER() OVER () AS rownum FROM a_big_table
), filter AS (
  SELECT rownum, true AS thing FROM base
)
SELECT * FROM base LEFT JOIN filter USING (rownum) WHERE filter.thing LIMIT 2

Result:

Limit  (cost=283512.19..283523.14 rows=2 width=2114) (actual time=0.018..14162.283 rows=2 loops=1)
  CTE base
    ->  WindowAgg  (cost=0.00..188702.69 rows=4740475 width=101) (actual time=0.008..4443.359 rows=4714243 loops=1)
          ->  Seq Scan on a_big_table  (cost=0.00..129446.75 rows=4740475 width=101) (actual time=0.002..1421.622 rows=4714243 loops=1)
  CTE filter
    ->  CTE Scan on base base_1  (cost=0.00..94809.50 rows=4740475 width=8) (actual time=0.001..10214.684 rows=4714243 loops=1)
  ->  Nested Loop  (cost=0.00..307677626611.24 rows=56180269915 width=2114) (actual time=0.018..14162.280 rows=2 loops=1)
        Join Filter: (base.rownum = filter.rownum)
        Rows Removed by Join Filter: 4714243
        ->  CTE Scan on base  (cost=0.00..94809.50 rows=4740475 width=2113) (actual time=0.011..0.028 rows=2 loops=1)
        ->  CTE Scan on filter  (cost=0.00..94809.50 rows=2370238 width=9) (actual time=0.009..6595.770 rows=2357122 loops=2)
              Filter: thing
Total runtime: 14247.374 ms
Re: [HACKERS] tracking commit timestamps
On 31 October 2014 15:46, Petr Jelinek p...@2ndquadrant.com wrote: Attached version with the above comments near the relevant code. Looks cooked and ready to serve. Who's gonna commit this? Alvaro, or do you want me to? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: why does LIMIT 2 take orders of magnitude longer than LIMIT 1 in this query?
Chris Rogers wrote I'm on PostgreSQL 9.3. This should reproduce on any table with 100,000+ rows. The EXPLAIN ANALYZE shows many more rows getting scanned with LIMIT 2, but I can't figure out why. EXPLAIN ANALYZE WITH base AS ( SELECT *, ROW_NUMBER() OVER () AS rownum FROM a_big_table ), filter AS ( SELECT rownum, true AS thing FROM base ) SELECT * FROM base LEFT JOIN filter USING (rownum) WHERE filter.thing LIMIT 1 The LIMIT 1 case has been optimized (special cased) while all others end up using a normal plan. Two things make your example query particularly unrealistic: 1. The presence of a ROW_NUMBER() window aggregate on an unsorted input 2. A LEFT JOIN condition matched with a WHERE clause with a right-side column being non-NULL David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/why-does-LIMIT-2-take-orders-of-magnitude-longer-than-LIMIT-1-in-this-query-tp5825209p5825212.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] why does LIMIT 2 take orders of magnitude longer than LIMIT 1 in this query?
Chris Rogers teuk...@gmail.com writes: I'm on PostgreSQL 9.3. This should reproduce on any table with 100,000+ rows. The EXPLAIN ANALYZE shows many more rows getting scanned with LIMIT 2, but I can't figure out why. This is not -hackers material. The first row pulled from the nestloop LEFT JOIN is created by joining the first output row from base to the first output row from filter (which is indirectly also the first row from base). So, cheap; we've only had to read one row from a_big_table. The second attempt to pull a row from the nestloop LEFT JOIN requires evaluating all the rest of the output of the filter CTE, to see if there are any more filter rows matching the first base row. (There are not, since the rownum output is unique, but the nestloop doesn't know that.) That in turn causes scanning all the rest of base and so all the rest of a_big_table. Only after that do we get to the second base row at which another join output row is possible. If the planner were about an order of magnitude cleverer than it is, it might realize that this would happen and switch to another plan ... although TBH the only way I can see to not have a large startup cost would be to somehow realize that the outputs of both CTEs are sorted by rownum and hence can be mergejoined without an explicit sort step. That would require more understanding of the behavior of row_number() than it's got or probably should have. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
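Tom's account of why LIMIT 2 is so much slower can be sketched outside the database. The following is plain Python, not executor code; the names (CountingSource, nestloop_join) are invented for illustration. It shows that pulling a *second* row from a nested loop forces the entire inner side to be scanned to rule out further matches for the first outer row:

```python
# Hypothetical sketch (not PostgreSQL code): why pulling a second row from
# the nested loop forces a full scan of the inner CTE.

class CountingSource:
    """Simulates the 'base' CTE: rows are produced lazily and counted."""
    def __init__(self, n):
        self.n = n
        self.rows_read = 0

    def scan(self):
        for rownum in range(1, self.n + 1):
            self.rows_read += 1
            yield rownum

def nestloop_join(outer_rows, inner_scan):
    # For each outer row, scan the ENTIRE inner side looking for matches;
    # the executor cannot know that rownum is unique, so it never stops early.
    for o in outer_rows:
        for i in inner_scan():
            if o == i:
                yield (o, i)

base = CountingSource(100_000)
outer = list(range(1, 100_001))          # the outer 'base' side
results = nestloop_join(outer, base.scan)

next(results)                 # LIMIT 1: stops at the first inner match
after_one = base.rows_read    # only 1 inner row was read
next(results)                 # LIMIT 2: must exhaust the inner side first
after_two = base.rows_read    # ~all inner rows read before the 2nd match
```

With `after_one == 1` and `after_two` past 100,000, the sketch mirrors the rows=1 vs rows=4714243 actuals in the two plans.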
Re: [HACKERS] CREATE IF NOT EXISTS INDEX
All, FWIW, I've cleanly applied v8 of this patch to master (252e652) and check-world was successful. I also successfully ran through a few manual test cases. -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com
Re: [HACKERS] CREATE IF NOT EXISTS INDEX
On Fri, Oct 31, 2014 at 2:46 PM, Adam Brightwell adam.brightw...@crunchydatasolutions.com wrote: All, FWIW, I've cleanly applied v8 of this patch to master (252e652) and check-world was successful. I also successfully ran through a few manual test cases. Thanks for your review! Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
Re: [HACKERS] CREATE INDEX CONCURRENTLY?
On Friday, 31.10.2014, at 14:43 +, Greg Stark wrote: On Fri, Oct 31, 2014 at 2:28 PM, Mark Woodward mark.woodw...@actifio.com wrote: I have not kept up with PostgreSQL changes and have just been using it. A co-worker recently told me that you need to word CONCURRENTLY in CREATE INDEX to avoid table locking. I called BS on this because to my knowledge PostgreSQL does not lock tables. I referenced this page in the documentation: You can read from tables while a normal index build is in progress but you can't insert, update, or delete from them. CREATE INDEX CONCURRENTLY allows you to insert, update, and delete data while the index build is running at the expense of having the index build take longer. I believe there is one caveat: If there is an idle-in-transaction backend from before the start of CREATE INDEX CONCURRENTLY, it can hold up the index creation indefinitely as long as it doesn't commit. src/backend/access/heap/README.HOT mentions this WRT CIC: Then we wait until every transaction that could have a snapshot older than the second reference snapshot is finished. This ensures that nobody is alive any longer who could need to see any tuples that might be missing from the index, as well as ensuring that no one can see any inconsistent rows in a broken HOT chain (the first condition is stronger than the second). I have seen CIC stall at clients when there were (seemingly) unrelated idle-in-transactions open (their locks even touching only other schemas).
I believe it depends on the specific locks that the other backend acquired, but at least with a DECLARE CURSOR I can trivially reproduce it: first session: postgres=# CREATE SCHEMA foo1; CREATE SCHEMA postgres=# CREATE TABLE foo1.foo1 (id int); CREATE TABLE postgres=# CREATE SCHEMA foo2; CREATE SCHEMA postgres=# CREATE TABLE foo2.foo2 (id int); CREATE TABLE second session: postgres=# BEGIN; DECLARE c1 CURSOR FOR SELECT * FROM foo1.foo1; BEGIN DECLARE CURSOR first session: postgres=# CREATE INDEX CONCURRENTLY ixfoo2 ON foo2.foo2(id); (hangs) I wonder whether that is pilot error (fair enough), or whether something could be done about this? Michael -- Michael Banck Projektleiter / Berater Tel.: +49 (2161) 4643-171 Fax: +49 (2161) 4643-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Hohenzollernstr. 133, 41061 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 10:38 AM, Andres Freund and...@2ndquadrant.com wrote: I have serious doubts about the number of cases where it's correct to access relations in a second backend that are exclusively locked. Not so much when that happens for a user issued LOCK statement of course, but when the system has done so. Personally I think to stay sane we'll have to forbid access to anything normal other backends can't access either - otherwise things will get too hard to reason about. So just refusing parallelism as soon as anything has taken an access exclusive lock doesn't sound too bad to me. That restriction seems onerous to me; for example, it would prevent a parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL. Those seem like worthy targets for parallelism, and maybe even early targets, since I expect a lot of the real complexity here to be in the query planner, and you can avoid most of that complexity when planning DDL. I also think it's unnecessary. It's wrong to think of two parallel backends that both want AccessExclusiveLock on a given target relation as conflicting with each other; if the process were not running in parallel, those lock requests would both have come from the same process and both requests would have been granted. Parallelism is generally not about making different things happen; it's about spreading the stuff that would happen anyway across multiple backends, and things that would succeed if done in a single backend shouldn't fail when spread across 2 or more without some compelling justification. The cases where it's actually unsafe for a table to be accessed even by some other code within the same backend are handled not by the lock manager, but by CheckTableNotInUse(). That call should probably fail categorically if issued while parallelism is active. 1. Turing's theorem being what it is, predicting what catalog tables the child might lock is not necessarily simple. 
Well, there's really no need to be absolutely general here. We're only going to support a subset of functionality as parallelizable. And in that case we don't need anything complicated to acquire all locks. See, that's what I don't believe. Even very simple things like equality operators do a surprising amount of catalog locking. So what? I don't see catalog locking as something problematic? Maybe I'm missing something, but I don't see cases would you expect deadlocks or anything like it on catalog relations? Suppose somebody fires off a parallel sort on a text column, or a parallel sequential scan involving a qual of the form textcol = 'zog'. We launch a bunch of workers to do comparisons; they do lookups against pg_collation. After some but not all of them have loaded the collation information they need into their local cache, the DBA types cluster pg_collate. It queues up for an AccessExclusiveLock. The remaining parallel workers join the wait queue waiting for their AccessShareLock. The system is now deadlocked; the only solution is to move the parallel workers ahead of the AccessExclusiveLock request, but the deadlock detector won't do that unless it knows about the relationship among the parallel workers. It's possible to construct more obscure cases as well. For example, suppose a user (for reasons known only to himself and God) does LOCK TABLE pg_opclass and then begins a sort of a column full of enums. Meanwhile, another user tries to CLUSTER pg_enum. This will deadlock if, and only if, some of the enum OIDs are odd. I mention this example not because I think it's a particularly useful case but because it illustrates just how robust the current system is: it will catch that case, and break the deadlock somehow, and you as a PostgreSQL backend hacker do not need to think about it. The guy who added pg_enum did not need to think about it. It just works. 
If we decide that parallelism is allowed to break that guarantee, then every patch that does parallel anything has got to worry about what undetected deadlocks might result, and then argue about which of them are obscure enough that we shouldn't care and which are not. That doesn't sound like a fun way to spend my time. But, really, the first case is the one I'm more worried about: a bunch of parallel backends all grab the same locks, but slightly separated in time. A strong locker joins the queue midway through and we're hosed. Obviously any system cache lookups whatsoever are going to be separated in time like this, just because the world isn't synchronous. The pg_enum case is an example of how the lookups could be more widely spaced: each backend will scan pg_enum when it first hits an odd OID, and that could happen much later for some than others. But even a small window is enough to render undetected deadlock a possibility, and you will not convince me that "hope the race condition is narrow enough in practice that this never happens" is a remotely sane strategy. Down the road, it's
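The deadlock described here is invisible to a detector that treats each backend as an independent node: the CLUSTER backend waits on a lock a worker holds, another worker waits behind the CLUSTER backend, and the "master waits for its workers' results" dependency never appears in the lock table. A hedged sketch (invented names and a toy wait-for graph, not PostgreSQL's deadlock-detector code) of how collapsing a lock group into one node exposes the cycle:

```python
# Hypothetical sketch: a wait-for graph where each parallel lock group is
# collapsed into a single node before cycle detection.

def has_cycle(edges):
    """Detect a cycle in a directed wait-for graph via DFS."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def dfs(node, path):
        if node in path:
            return True
        return any(dfs(nxt, path | {node}) for nxt in graph.get(node, ()))
    return any(dfs(n, set()) for n in graph)

def collapse_groups(edges, groups):
    """Map every backend to its lock-group leader, then drop self-edges."""
    leader = {m: g for g, members in groups.items() for m in members}
    remapped = {(leader.get(a, a), leader.get(b, b)) for a, b in edges}
    return {(a, b) for a, b in remapped if a != b}

# The scenario from the text: worker W1 already holds AccessShareLock on
# pg_collation; the CLUSTER backend C queues an AccessExclusiveLock, so C
# waits on W1; worker W2 queues behind C's stronger request, so W2 waits on C.
edges = {("C", "W1"), ("W2", "C")}

undetected = has_cycle(edges)   # per-backend view: no cycle, yet no progress
detected = has_cycle(collapse_groups(edges, {"G": {"M", "W1", "W2"}}))
```

With master M and workers W1, W2 collapsed into group G, the edges become G -> C -> G and the detector sees the deadlock it would otherwise miss.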
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 31 October 2014 18:29, Robert Haas robertmh...@gmail.com wrote: Suppose somebody fires off a parallel sort on a text column, or a parallel sequential scan involving a qual of the form textcol = 'zog'. We launch a bunch of workers to do comparisons; they do lookups against pg_collation. After some but not all of them have loaded the collation information they need into their local cache, the DBA types cluster pg_collate. It queues up for an AccessExclusiveLock. The remaining parallel workers join the wait queue waiting for their AccessShareLock. The system is now deadlocked; the only solution is to move the parallel workers ahead of the AccessExclusiveLock request, but the deadlock detector won't do that unless it knows about the relationship among the parallel workers. It's an obscure case and it's not the only solution either. I'm really surprised that having workers do their own locking doesn't scare you. Personally, it scares me. It's not like we're the first to do parallel query. Honestly, how many parallel databases do you think do this? I can't see this being a practical, workable solution for running a parallel query across a cluster of many nodes. Shared, distributed lock table? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 11:02 AM, Simon Riggs si...@2ndquadrant.com wrote: You mentioned earlier that functions would need to be marked proisparallel etc.. What conditions will that be protecting against? If we aren't going to support the general case where every single thing works, can we at least discuss what the list of cases is that we will support. In general, any operation that relies on backend-private state not shared with a cooperating backend isn't parallel-safe. For example, consider: SELECT setseed(1); SELECT random() FROM generate_series(1,1000) g; This sequence of queries can be relied upon to produce the same output every time. But if some of the random() calls are executed in another backend, you'll get different results because random() works by using backend-private memory to store its pseudo-random state. It's unlikely to be worth the effort to move the pseudo-random state to a DSM for parallelism, so we probably just want to disallow parallel query when random() is in use, or, better still, disallow parallelizing only the particular query node that uses random(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
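The setseed()/random() hazard can be mimicked in plain Python (a sketch, not PostgreSQL internals): each "backend" owns its own PRNG state, so a seed applied in one process is invisible to the others. The worker's seed value below is an arbitrary stand-in for "whatever state an unseeded backend happens to have":

```python
# Hedged sketch: backend-private PRNG state is why setseed()/random()
# cannot naively be spread across parallel workers.
import random

# Single backend: seeding makes the whole sequence reproducible.
single = random.Random(1)
run1 = [single.random() for _ in range(4)]
single = random.Random(1)
run2 = [single.random() for _ in range(4)]   # identical to run1

# "Parallel" execution: the worker's private state never saw the master's
# setseed(1), so the combined output differs from the single-backend run
# (seed 2 stands in for the worker's own, unrelated state).
master, worker = random.Random(1), random.Random(2)
parallel = [master.random(), worker.random(),
            master.random(), worker.random()]
```

The single-backend sequence is reproducible; the interleaved parallel sequence is not the same sequence, which is exactly the user-visible guarantee the quoted queries rely on.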
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 2014-10-31 14:29:57 -0400, Robert Haas wrote: On Fri, Oct 31, 2014 at 10:38 AM, Andres Freund and...@2ndquadrant.com wrote: I have serious doubts about the number of cases where it's correct to access relations in a second backend that are exclusively locked. Not so much when that happens for a user issued LOCK statement of course, but when the system has done so. Personally I think to stay sane we'll have to forbid access to anything normal other backends can't access either - otherwise things will get too hard to reason about. So just refusing parallelism as soon as anything has taken an access exclusive lock doesn't sound too bad to me. That restriction seems onerous to me; for example, it would prevent a parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL. Those seems like worthy targets for parallelism, and maybe even early targets, since I expect a lot of the real complexity here to be in the query planner, and you can avoid most of that complexity when planning DDL. Ugh. I think that's aiming too high. You'll suddenly need to share cache invalidations between the backends. I think we should start by requiring that there's no DDL done *at all* in the transaction that starts the parallel activity. For CREATE INDEX PARALLEL that can be done by reusing the logic we have for CONCURRENTLY, thereby moving the parallelized part into a transaction that hasn't done any DDL yet. I also think it's unnecessary. It's wrong to think of two parallel backends that both want AccessExclusiveLock on a given target relation as conflicting with each other; if the process were not running in parallel, those lock requests would both have come from the same process and both requests would have been granted. I don't think that's a realistic view. There's lots of backend private state around. If we try to make all that coherent between backends we'll fail. 
Parallelism is generally not about making different things happen; it's about spreading the stuff that would happen anyway across multiple backends, and things that would succeed if done in a single backend shouldn't fail when spread across 2 or more without some compelling justification. The cases where it's actually unsafe for a table to be accessed even by some other code within the same backend are handled not by the lock manager, but by CheckTableNotInUse(). That call should probably fail categorically if issued while parallelism is active. Which will e.g. prohibit CLUSTER and VACUUM FULL. 1. Turing's theorem being what it is, predicting what catalog tables the child might lock is not necessarily simple. Well, there's really no need to be absolutely general here. We're only going to support a subset of functionality as parallelizable. And in that case we don't need anything complicated to acquire all locks. See, that's what I don't believe. Even very simple things like equality operators do a surprising amount of catalog locking. So what? I don't see catalog locking as something problematic? Maybe I'm missing something, but I don't see cases would you expect deadlocks or anything like it on catalog relations? Suppose somebody fires off a parallel sort on a text column, or a parallel sequential scan involving a qual of the form textcol = 'zog'. We launch a bunch of workers to do comparisons; they do lookups against pg_collation. After some but not all of them have loaded the collation information they need into their local cache, the DBA types cluster pg_collate. It queues up for an AccessExclusiveLock. The remaining parallel workers join the wait queue waiting for their AccessShareLock. The system is now deadlocked; the only solution is to move the parallel workers ahead of the AccessExclusiveLock request, but the deadlock detector won't do that unless it knows about the relationship among the parallel workers. 
I think you would need to make the case more complex - we only hold locks on system objects for a very short time, and mostly not in a nested fashion. But generally I think it's absolutely perfectly alright to fail in such a case. We need to fail reliably, but *none* of this is a new hazard. You can have pretty similar issues today, without any parallelism. It's possible to construct more obscure cases as well. For example, suppose a user (for reasons known only to himself and God) does LOCK TABLE pg_opclass and then begins a sort of a column full of enums. Meanwhile, another user tries to CLUSTER pg_enum. This will deadlock if, and only if, some of the enum OIDs are odd. I mention this example not because I think it's a particularly useful case but because it illustrates just how robust the current system is: it will catch that case, and break the deadlock somehow, and you as a PostgreSQL backend hacker do not need to think about it. The guy who added pg_enum did not need to think about it. It just works. If we decide that
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 2:46 PM, Simon Riggs si...@2ndquadrant.com wrote: On 31 October 2014 18:29, Robert Haas robertmh...@gmail.com wrote: Suppose somebody fires off a parallel sort on a text column, or a parallel sequential scan involving a qual of the form textcol = 'zog'. We launch a bunch of workers to do comparisons; they do lookups against pg_collation. After some but not all of them have loaded the collation information they need into their local cache, the DBA types cluster pg_collate. It queues up for an AccessExclusiveLock. The remaining parallel workers join the wait queue waiting for their AccessShareLock. The system is now deadlocked; the only solution is to move the parallel workers ahead of the AccessExclusiveLock request, but the deadlock detector won't do that unless it knows about the relationship among the parallel workers. It's an obscure case and its not the only solution either. I don't think that's an obscure situation at all. Do you really think a patch that could cause an attempt to VACUUM FULL a system catalog to suffer an undetected deadlock meets this community's quality standards? Because that's what we're talking about. You are right that it isn't the only solution. I have said the same thing myself, multiple times, on this thread. I'm really surprised that having workers do their own locking doesn't scare you. Personally, it scares me. Frankly, the reverse decision scares me a heck of a lot more. It superficially seems like a good idea to have the master take locks on behalf of the workers, but when you start to really think about how many low-level parts of the code take locks, it quickly becomes evident, at least to me, that trying to make the resulting system robust will be a nightmare. Don't forget that there are not only relation locks but also page locks, tuple locks, relation extension locks, XID and VXID locks, locks on arbitrary database or shared objects. The deadlock detector handles them all. 
If avoiding deadlock outside the lock manager is so easy, why do we have a deadlock detector in the first place? Why not just change all of our other code to avoid them instead? It's not like we're the first do parallel query. Honestly, how many parallel databases do you think do this? All of them. I might be wrong, of course, but I see no reason at all to believe that Oracle or SQL Server ignore deadlock detection when parallel query is in use. I can't see this being a practical, workable solution for running a parallel query across a cluster of many nodes. Shared, distributed lock table? I am not proposing a distributed lock manager that can run across many nodes. The considerations for such a piece of software are largely orthogonal to the problem that I'm actually trying to solve here, and at least an order of magnitude more complex. If you make it sound like that's the project that I'm doing, then of course it's going to sound frighteningly complicated. But I'm not. I'm sort of baffled by your concern on this point. Sure, the lock manager is a complex and performance-sensitive piece of code, and we have to be careful not to break it or make it slower. But we're not talking about launching a rocket to Mars here; we're just talking about making the lock manager interface look the same way to two cooperating backends issuing lock requests in parallel that it looks to a single backend making those requests serially. I don't think it's noticeably harder than getting Hot Standby conflict resolution to work, or getting historical snapshots working for logical decoding, or making the visibility map crash-safe - in fact, I'd bet on it being substantially easier than the first two of those, because all the logic is inside a single, relatively simple backend subsystem. Why we'd want to minimize complexity there at the cost of spreading it across half the backend is mysterious to me. 
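What "making the lock manager interface look the same to cooperating backends" might mean can be sketched as a conflict check that exempts members of the same lock group. This is a toy model with an invented API and a drastically simplified two-mode conflict matrix, not the actual patch:

```python
# Hypothetical sketch: lock requests from cooperating backends in the same
# lock group never conflict with each other; conflicts across groups
# behave as usual.

ACCESS_SHARE, ACCESS_EXCLUSIVE = "AccessShareLock", "AccessExclusiveLock"

# Simplified conflict matrix: which (requested, held) mode pairs conflict.
CONFLICTS = {
    (ACCESS_SHARE, ACCESS_SHARE): False,
    (ACCESS_SHARE, ACCESS_EXCLUSIVE): True,
    (ACCESS_EXCLUSIVE, ACCESS_SHARE): True,
    (ACCESS_EXCLUSIVE, ACCESS_EXCLUSIVE): True,
}

def lock_conflicts(req_group, req_mode, holder_group, held_mode):
    # Members of one parallel group mirror a single serial backend: a
    # backend's own held locks never block its new requests, so locks held
    # by any member of the requester's group are treated the same way.
    if req_group == holder_group:
        return False
    return CONFLICTS[(req_mode, held_mode)]

# Two members of group "G" both take AccessExclusiveLock on the target
# relation (as a CLUSTER master and its sort workers might): no conflict.
same_group = lock_conflicts("G", ACCESS_EXCLUSIVE, "G", ACCESS_EXCLUSIVE)
# An unrelated backend is still blocked as usual.
other_group = lock_conflicts("other", ACCESS_SHARE, "G", ACCESS_EXCLUSIVE)
```

The group-membership test is the only change relative to a plain conflict matrix, which is why the logic can stay confined to the lock manager.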
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 3:32 PM, Andres Freund and...@2ndquadrant.com wrote: So just refusing parallelism as soon as anything has taken an access exclusive lock doesn't sound too bad to me. That restriction seems onerous to me; for example, it would prevent a parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL. Those seems like worthy targets for parallelism, and maybe even early targets, since I expect a lot of the real complexity here to be in the query planner, and you can avoid most of that complexity when planning DDL. Ugh. I think that's aiming too high. You'll suddenly need to share cche invalidations between the backends. I think we should start by requiring that there's no DDL done *at all* in the transaction that starts the parallel activity. For CREATE INDEX PARALLEL that can be done by reusing the logic we have for CONCURRENTLY, thereby moving the parallelized part into a transaction that hasn't done any DDL yet. I don't think that's correct. We only need to process local invalidation messages after CommandCounterIncrement(), which I anticipate prohibiting during parallel execution (short thought should convince you that anything else is completely nuts). So I think there's no real problem with invalidation messages being generated during parallel execution. I wouldn't anticipate supporting that in V1, because they'd have to be propagated back to the user backend prior to commit, but in the longer term it seems pretty feasible. I also think it's unnecessary. It's wrong to think of two parallel backends that both want AccessExclusiveLock on a given target relation as conflicting with each other; if the process were not running in parallel, those lock requests would both have come from the same process and both requests would have been granted. I don't think that's a realistic view. There's lots of backend private state around. If we try to make all that coherent between backends we'll fail. 
I agree that there's a lot of backend private state, and that's the principal problem with making this work at all. The challenge here is to figure out which parts of that backend-private state we absolutely must make coherent between backends to enable parallelism to have a useful, non-trivial scope, and which parts we can punt on. I have ideas about that - which are mostly documented at https://wiki.postgresql.org/wiki/Parallel_Sort - but I'm trying to keep an open mind on the details. Parallelism is generally not about making different things happen; it's about spreading the stuff that would happen anyway across multiple backends, and things that would succeed if done in a single backend shouldn't fail when spread across 2 or more without some compelling justification. The cases where it's actually unsafe for a table to be accessed even by some other code within the same backend are handled not by the lock manager, but by CheckTableNotInUse(). That call should probably fail categorically if issued while parallelism is active. Which will e.g. prohibit CLUSTER and VACUUM FULL. I don't think so. I think it's safe to get to the point where we check that the table is not in use, then do a parallel scan and sort, then collect the results and write them out. But I would not try to make it safe, for example, to be scanning a table and have some user-defined function invoked along the way to go try to do anything that involves CheckTableNotInUse(). Many of those would fall over anyway because of PreventTransactionChain(), but I've got no problem nailing that hatch shut twice. And conversely, I would expect we'd disallow the sort operators invoked by CLUSTER from reaching any code guarded by CheckTableNotInUse(). Suppose somebody fires off a parallel sort on a text column, or a parallel sequential scan involving a qual of the form textcol = 'zog'. We launch a bunch of workers to do comparisons; they do lookups against pg_collation. 
After some but not all of them have loaded the collation information they need into their local cache, the DBA types cluster pg_collation. It queues up for an AccessExclusiveLock. The remaining parallel workers join the wait queue waiting for their AccessShareLock. The system is now deadlocked; the only solution is to move the parallel workers ahead of the AccessExclusiveLock request, but the deadlock detector won't do that unless it knows about the relationship among the parallel workers. I think you would need to make the case more complex - we only hold locks on system objects for a very short time, and mostly not in a nested fashion. Hmm. You're right. If the parallel backends release their locks quickly, the CLUSTER could run and then the parallel job could finish. But it sounds as if you get the idea. But generally I think it's absolutely perfectly alright to fail in such case. We need to fail reliably, but *none* of this is a new hazard. You can have pretty similar issues today,
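The deadlock Robert describes can be modeled as a wait-for graph. A minimal sketch (all names invented here, not the actual deadlock-detector code): without edges representing the dependency among the parallel group members, the detector sees no cycle, so the deadlock goes undetected; once those group edges are added, the cycle appears.

```c
#include <assert.h>
#include <stdbool.h>

#define N 4                     /* 0 = leader, 1 and 2 = parallel workers, 3 = the CLUSTER backend */

static bool waits[N][N];        /* waits[a][b]: process a waits for process b */

/* Depth-first search; returns true if a back edge (cycle) is found. */
static bool visit(int v, bool on_stack[N], bool done[N])
{
    if (on_stack[v])
        return true;            /* back edge: a cycle exists */
    if (done[v])
        return false;
    on_stack[v] = true;
    for (int w = 0; w < N; w++)
        if (waits[v][w] && visit(w, on_stack, done))
            return true;
    on_stack[v] = false;
    done[v] = true;
    return false;
}

static bool has_cycle(void)
{
    bool on_stack[N] = {false};
    bool done[N] = {false};

    for (int v = 0; v < N; v++)
        if (visit(v, on_stack, done))
            return true;
    return false;
}
```

The scenario: the CLUSTER backend (3) queues behind worker 1's AccessShareLock, worker 2 queues behind CLUSTER, and the workers can only finish together; only the last fact, which the stock detector does not know, closes the cycle.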
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 2014-10-31 15:56:42 -0400, Robert Haas wrote: On Fri, Oct 31, 2014 at 3:32 PM, Andres Freund and...@2ndquadrant.com wrote: So just refusing parallelism as soon as anything has taken an access exclusive lock doesn't sound too bad to me. That restriction seems onerous to me; for example, it would prevent a parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL. Those seem like worthy targets for parallelism, and maybe even early targets, since I expect a lot of the real complexity here to be in the query planner, and you can avoid most of that complexity when planning DDL. Ugh. I think that's aiming too high. You'll suddenly need to share cache invalidations between the backends. I think we should start by requiring that there's no DDL done *at all* in the transaction that starts the parallel activity. For CREATE INDEX PARALLEL that can be done by reusing the logic we have for CONCURRENTLY, thereby moving the parallelized part into a transaction that hasn't done any DDL yet. I don't think that's correct. We only need to process local invalidation messages after CommandCounterIncrement(), which I anticipate prohibiting during parallel execution (short thought should convince you that anything else is completely nuts). It is more complex, even without CCI. As long as you're doing anything inside a transaction that already has done DDL, you'll have to play nasty tricks. Imagine the primary backend has done a BEGIN; CLUSTER pg_class; and then starts a child backend. Which will get the same snapshot, combocids, yada yada. But it *also* will have preexisting cache entries. Those you need to invalidate before it can do anything correct. I'm sorry to be a bit pessimistic here. But my intuition says that starting to do group locking opens a can of worms that'll take us a long time to close again. Maybe I'm just imagining complexity that won't actually be there. But I have a hard time believing that. 
What's the distinction between teach the deadlock detector to catch these kind of cases and group locking? Because I think those are at least 90% the same thing. I understand under 'group locking' that a primary/second backends can co-own a lock that normally is self-exclusive. That's a fair bit more than adding an edge to the deadlock graph between primary/secondary backends to make the deadlock detector recognize problems. What I have serious doubts about is 'co-owning' locks. Especially if two backends normally wouldn't be able to both get that lock. I wonder if we couldn't implement a 'halfway' by allowing parallel workers to jump the lock queue if the parent process already has the lock. That'd only work for nonconflicting locks, but I think that's actually good. The patch takes precisely that approach; that part of the logic is already implemented. Well, my point is that if the solution is just to jump the queue, there really isn't any data structure changes needed. Secondary acquirers just need to check whether a lock is already owned by the primary and then acquire the lock in the absolutely normal way - with a completely separate entry. Only that they ignored the queue. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
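The queue-jump idea under discussion can be sketched in a few lines. This is a hypothetical toy model, not the actual lock manager: a requester whose group leader already holds the lock acquires immediately instead of joining the wait queue; an unrelated backend waits normally.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, invented model of a lock: who holds it (by group leader id)
 * and how many backends are queued behind it. */
typedef struct
{
    int holder_group;           /* group leader of current holder, -1 if free */
    int waiters;                /* number of queued waiters */
} SimpleLock;

/* Returns true if the requester may proceed immediately. A requester whose
 * group leader already holds the lock skips the wait queue entirely; this
 * is only sound for lock modes that do not conflict within the group. */
static bool try_acquire(SimpleLock *lock, int requester_group)
{
    if (lock->holder_group == -1)
    {
        lock->holder_group = requester_group;
        return true;
    }
    if (lock->holder_group == requester_group)
        return true;            /* jump the queue: leader already holds it */
    lock->waiters++;            /* unrelated backend: wait normally */
    return false;
}
```

Note this mirrors Andres's point: the secondary acquirer gets a completely separate entry and merely ignores the queue; no shared-ownership data structure is required for this much.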
Re: [HACKERS] DISTINCT with btree skip scan
On 27 October 2014 20:24, David Rowley dgrowle...@gmail.com wrote: I've had a quick look at this and it seems like a great win! I'm quite surprised that we've not got this already. I think this technology could also really help performance of queries such as SELECT * from bigtable bt WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol = obt.someIndexedButNonUniqueColumn); I know you're not proposing to improve that first off, but it could be done later once the benefits of this are more commonly realised. I think our shortfalls in this area have not gone unnoticed. I was reading this post https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html about comparing performance of COUNT(DISTINCT col). I think this would give a big win for query 3 in that post. I'm trying to find some time to make some changes to transform queries to allow the group by to happen before the joins when possible, so between that and your patch we'd be 2 steps closer to making query 1 in the link above perform a little better on PostgreSQL. Do you think you'll manage to get time to look at this a bit more? I'd be keen to look into the costing side of it if you think that'll help you any? Hi David, Sorry for the silence, I have been busy moving countries. I am definitely interested in collaborating on a series of patches to implement various kinds of skip-based plans as seen in other RDBMSs if others think it could be useful. I see skip-based DISTINCT as a good place to start. (I suspect the semi-related skip scan plan (for the case when your WHERE clause doesn't have a qual for the leading column(s)) is much harder to plan and execute and I also suspect it's less generally useful). Here is a rebased version of that patch which fixes a crashing off-by-one error, and handles backward scans properly, I think. This is still just a quick hack to play with the general idea at this stage. 
It works by adding a new index operation 'skip' which the executor code can use during a scan to advance to the next value (for some prefix of the index's columns). That may be a terrible idea and totally unnecessary... but let me explain my reasoning: 1. Perhaps some index implementations can do something better than a search for the next key value from the root. Is it possible or desirable to use the current position as a starting point for a btree traversal? I don't know. 2. It seemed that I'd need to create a new search ScanKey to use the 'rescan' interface for skipping to the next value, but I already had an insertion ScanKey so I wanted a way to just reuse that. But maybe there is some other way to reuse existing index interfaces, or maybe there is an easy way to make a new search ScanKey from the existing insertion ScanKey? Currently it uses the existing IndexOnlyScan plan node, adding an extra member. I briefly tried making a different plan called DistinctIndexScan but it seemed I was going to need to duplicate or refactor/share a lot of IndexOnlyScan code (this might be the right thing to do but at this stage I'm just trying to demo the general idea with minimal changes). As for costing, I haven't looked at that part of PG at all yet. If you *can* do a distinct index scan, would you *always* want to? How well could we estimate the cost using existing statistics? I suppose the expected number of page fetches is proportional to the expected number of distinct values (ie skip operations) times btree depth, and the cost may be adjusted in some way for cache effects (?), but how well can we estimate the number of distinct values (a), (a, b) or (a, b, c) in some range for an index on (a, b, c, d)? 
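To illustrate the 'skip' operation in isolation, here is a toy model (invented names; not the patch's btree code): DISTINCT over a sorted array, where each step binary-searches past all duplicates of the current value instead of scanning them one by one. The cost becomes proportional to the number of distinct values times a logarithmic search, matching the intuition above that page fetches scale with distinct values times btree depth.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first element in sorted a[lo..n) strictly greater than x. */
static size_t upper_bound(const int *a, size_t lo, size_t n, int x)
{
    while (lo < n)
    {
        size_t mid = lo + (n - lo) / 2;

        if (a[mid] <= x)
            lo = mid + 1;
        else
            n = mid;
    }
    return lo;
}

/* DISTINCT via skipping: visit the first occurrence of each value, then
 * jump straight past its duplicates. Returns the number of distinct
 * values, writing them in order into out[]. */
static size_t distinct_by_skip(const int *a, size_t n, int *out)
{
    size_t pos = 0;
    size_t k = 0;

    while (pos < n)
    {
        out[k++] = a[pos];
        pos = upper_bound(a, pos + 1, n, a[pos]);   /* the 'skip' step */
    }
    return k;
}
```

With many duplicates per value, the scan touches O(distinct × log n) elements rather than all n, which is why the demo query below drops from hundreds of milliseconds to under one.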
As before, you still need the kludge of disabling hashagg to reach the new plan:

postgres=# set enable_hashagg = true;
SET
Time: 0.213 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 6 │
│ 8 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 9 │
│ 7 │
│ 0 │
└───┘
(10 rows)

Time: 890.166 ms
postgres=# set enable_hashagg = false;
SET
Time: 0.271 ms
postgres=# select distinct a from foo;
┌───┐
│ a │
├───┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
└───┘
(10 rows)

Time: 0.573 ms

Best regards,
Thomas Munro

diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..5f10d7f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -29,6 +29,7 @@
  *		index_can_return - does index support index-only scans?
  *		index_getprocid - get a support procedure OID
  *		index_getprocinfo - get a support procedure's lookup info
+ *		index_skip - advance to the next key value in a scan
  *
  * NOTES
  *		This file contains the index_ routines which used
@@ -742,6 +743,37 @@ index_can_return(Relation indexRelation)
 	PointerGetDatum(indexRelation)));
 }
 
+bool
+index_can_skip(Relation indexRelation)
+{
+	FmgrInfo   *procedure;
+
+	RELATION_CHECKS;
+
+	/* amcanskip is optional; assume FALSE if
Re: [HACKERS] alter user/role CURRENT_USER
All, I agree that we should probably separate the concerns here. Personally, I like the idea of being able to say CURRENT_USER in utility commands to refer to the current user where a role would normally be expected, as I could see it simplifying things for some applications, but that's a new feature and independent of making role-vs-user cases more consistent. So, I've been doing a little digging and it would appear that the ALTER ROLE/USER consistency was brought up earlier in the year. http://www.postgresql.org/message-id/cadyruxmv-tvsbv7mttcs+qedny6xj1+krtzfowvuhdjc5mg...@mail.gmail.com It was returned with feedback in Commitfest 2014-06 and apparently lost steam: https://commitfest.postgresql.org/action/patch_view?id=1408 Tom put forward a suggestion for how to fix it: http://www.postgresql.org/message-id/21570.1408832...@sss.pgh.pa.us I have taken that patch and updated the documentation (attached) and ran it through some cursory testing. At any rate, this is probably a good starting point for those changes. -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com diff --git a/doc/src/sgml/ref/alter_user.sgml b/doc/src/sgml/ref/alter_user.sgml new file mode 100644 index 58ae1da..ac05682 *** a/doc/src/sgml/ref/alter_user.sgml --- b/doc/src/sgml/ref/alter_user.sgml *** ALTER USER replaceable class=PARAMETER *** 38,47 ALTER USER replaceable class=PARAMETERname/replaceable RENAME TO replaceablenew_name/replaceable ! ALTER USER replaceable class=PARAMETERname/replaceable SET replaceableconfiguration_parameter/replaceable { TO | = } { replaceablevalue/replaceable | DEFAULT } ! ALTER USER replaceable class=PARAMETERname/replaceable SET replaceableconfiguration_parameter/replaceable FROM CURRENT ! ALTER USER replaceable class=PARAMETERname/replaceable RESET replaceableconfiguration_parameter/replaceable ! 
ALTER USER replaceable class=PARAMETERname/replaceable RESET ALL /synopsis /refsynopsisdiv --- 38,47 ALTER USER replaceable class=PARAMETERname/replaceable RENAME TO replaceablenew_name/replaceable ! ALTER USER replaceable class=PARAMETERname/replaceable [ IN DATABASE replaceable class=PARAMETERdatabase_name/replaceable ] SET replaceableconfiguration_parameter/replaceable { TO | = } { replaceablevalue/replaceable | DEFAULT } ! ALTER USER { replaceable class=PARAMETERname/replaceable | ALL } [ IN DATABASE replaceable class=PARAMETERdatabase_name/replaceable ] SET replaceableconfiguration_parameter/replaceable FROM CURRENT ! ALTER USER { replaceable class=PARAMETERname/replaceable | ALL } [ IN DATABASE replaceable class=PARAMETERdatabase_name/replaceable ] RESET replaceableconfiguration_parameter/replaceable ! ALTER USER { replaceable class=PARAMETERname/replaceable | ALL } [ IN DATABASE replaceable class=PARAMETERdatabase_name/replaceable ] RESET ALL /synopsis /refsynopsisdiv diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y new file mode 100644 index 0de9584..d7c0586 *** a/src/backend/parser/gram.y --- b/src/backend/parser/gram.y *** static Node *makeRecursiveViewSelect(cha *** 230,236 AlterFdwStmt AlterForeignServerStmt AlterGroupStmt AlterObjectSchemaStmt AlterOwnerStmt AlterSeqStmt AlterSystemStmt AlterTableStmt AlterTblSpcStmt AlterExtensionStmt AlterExtensionContentsStmt AlterForeignTableStmt ! AlterCompositeTypeStmt AlterUserStmt AlterUserMappingStmt AlterUserSetStmt AlterRoleStmt AlterRoleSetStmt AlterPolicyStmt AlterDefaultPrivilegesStmt DefACLAction AnalyzeStmt ClosePortalStmt ClusterStmt CommentStmt --- 230,236 AlterFdwStmt AlterForeignServerStmt AlterGroupStmt AlterObjectSchemaStmt AlterOwnerStmt AlterSeqStmt AlterSystemStmt AlterTableStmt AlterTblSpcStmt AlterExtensionStmt AlterExtensionContentsStmt AlterForeignTableStmt ! 
AlterCompositeTypeStmt AlterUserMappingStmt AlterRoleStmt AlterRoleSetStmt AlterPolicyStmt AlterDefaultPrivilegesStmt DefACLAction AnalyzeStmt ClosePortalStmt ClusterStmt CommentStmt *** static Node *makeRecursiveViewSelect(cha *** 520,525 --- 520,527 %type str opt_existing_window_name %type boolean opt_if_not_exists + %type str role_or_user + /* * Non-keyword token types. These are hard-wired into the flex lexer. * They must be listed first so that their numeric codes do not depend on *** stmt : *** 756,763 | AlterTSConfigurationStmt | AlterTSDictionaryStmt | AlterUserMappingStmt - | AlterUserSetStmt - | AlterUserStmt | AnalyzeStmt | CheckPointStmt | ClosePortalStmt --- 758,763 *** CreateUserStmt: *** 1033,1042 * * Alter a postgresql DBMS role * */ AlterRoleStmt: ! ALTER ROLE RoleId opt_with AlterOptRoleList { AlterRoleStmt *n =
[HACKERS] Let's drop two obsolete features which are bear-traps for novices
PostgreSQL has two bits of obsolete, incomplete functionality which entrap and frustrate new users in large numbers. Both of these features share the following characteristics: * added more than 10 years ago * have the same names as useful features from other databases * were never finished and lack critical functionality * have not seen significant work in the last 4 releases Every other day on IRC I run into a newbie who has used one of these features under the mistaken impression that it is useful, and then had to be guided in how to get their data out of this broken feature at some length. Unknown are the number of users who didn't ask for help but simply chose to use a different database instead. Of course, I'm talking about the MONEY type and hash indexes (not the hash ops class, which is useful, just the index type). It's time to put both of these features out to pasture. Certainly neither of these features would be accepted into PostgreSQL today given the shape they're in. Having these broken features around is like leaving an armed bear-trap in a public park. Now, I know the first thing someone will do is jump up and claim that they were just about to fix WAL-logging on hash indexes, or add casts to the money type. But if that hasn't happened in the last 5 years, it's not going to happen. We'd be doing our users a huge favor by just removing them in 9.5. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Josh Berkus j...@agliodbs.com writes: Of course, I'm talking about the MONEY type and hash indexes (not the hash ops class, which is useful, just the index type). It's time to put both of these features out to pasture. Certainly neither of these features would be accepted into PostgreSQL today given the shape they're in. I don't care one way or the other about the money type, but I will defend hash indexes, especially seeing that we've already added a pretty in-your-face warning as of 9.5: regression=# create table foo(f1 int); CREATE TABLE regression=# create index on foo using hash (f1); WARNING: hash indexes are not WAL-logged and their use is discouraged CREATE INDEX Now, I know the first thing someone will do is jump up and claim that they were just about to fix WAL-logging on hash indexes, I don't know if/when that will happen as such, but Simon was making noises about writing code to treat hash indexes as unlogged automatically, which would more or less fix the worst risks. That's not just a special case for hash indexes, but any index AM that lacks WAL support, as third-party AMs might well do. regards, tom lane
Re: [HACKERS] _mdfd_getseg can be expensive
Hi, On 2014-03-31 12:10:01 +0200, Andres Freund wrote: I recently have seen some perf profiles in which _mdfd_getseg() was in the top #3 when VACUUMing large (~200GB) relations. Called by mdread(), mdwrite(). Looking at its implementation, I am not surprised. It iterates over all segment entries a relation has; for every read or write. That's not painful for smaller relations, but at a couple of hundred GB it starts to be annoying. Especially if kernel readahead has already read in all data from disk. I don't have a good idea what to do about this yet, but it seems like something that should be fixed mid-term. The best I can come up with is caching the last mdvec used, but that's fairly ugly. Alternatively it might be a good idea to not store MdfdVec as a linked list, but as a densely allocated array. I've seen this a couple times more since. On larger relations it gets even more pronounced. When sequentially scanning a 2TB relation, _mdfd_getseg() gets up to 80% proportionate CPU time towards the end of the scan. I wrote the attached patch that gets rid of that essentially quadratic behaviour, by replacing the mdfd chain/singly linked list with an array. Since we seldom grow files by a whole segment I can't see the slightly bigger memory reallocations mattering significantly. In pretty much every other case the array is bound to be a winner. Does anybody have fundamental arguments against that idea? With some additional work we could save a bit more memory by getting rid of the mdfd_segno as it's essentially redundant - but that's not entirely trivial and I'm unsure if it's worth it. I've also attached a second patch that makes PageIsVerified() noticeably faster when the page is new. That's helpful and related because it makes it easier to test the correctness of the md.c rewrite by faking full 1GB segments. It's also pretty simple, so there's imo little reason not to do it. 
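The shape of the change can be shown with a standalone toy (invented names; the real patch operates on md.c's MdfdVec): looking up segment number N in a linked chain touches N+1 nodes on every call, which is what makes a sequential scan's segment lookups quadratic overall, while an array lookup is a single index operation regardless of N.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an open-segment descriptor. */
typedef struct SegNode
{
    int segno;
    struct SegNode *next;
} SegNode;

static int lookups_walked;      /* nodes touched, to expose the cost */

/* Old shape: walk the chain from the head until the segment is found. */
static SegNode *getseg_list(SegNode *head, int segno)
{
    for (SegNode *v = head; v != NULL; v = v->next)
    {
        lookups_walked++;
        if (v->segno == segno)
            return v;
    }
    return NULL;
}

/* New shape: a densely allocated array indexed directly by segment number.
 * A miss just means the segment hasn't been opened (or doesn't exist) yet. */
static SegNode *getseg_array(SegNode *segs, int nsegs, int segno)
{
    return (segno >= 0 && segno < nsegs) ? &segs[segno] : NULL;
}
```

For a 2TB relation (~2048 segments at 1GB each), every read near the end of the file walks the whole chain in the old scheme; the array makes each lookup constant-time at the cost of occasionally reallocating a slightly larger array when a segment is added.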
Greetings,

Andres Freund

PS: The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From f7bf48137d5c3c4268e0fb6231ca6996d9fe Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 30 Oct 2014 13:24:20 +0100
Subject: [PATCH 1/2] Faster PageIsVerified() for the all zeroes case.

---
 src/backend/storage/page/bufpage.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..c175ed0 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -81,7 +81,7 @@
 bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
 	PageHeader	p = (PageHeader) page;
-	char	   *pagebytes;
+	size_t	   *pagebytes;
 	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
@@ -118,10 +118,17 @@ PageIsVerified(Page page, BlockNumber blkno)
 		return true;
 	}
 
-	/* Check all-zeroes case */
+	/*
+	 * Check all-zeroes case. Luckily BLCKSZ is guaranteed to always be a
+	 * multiple of size_t - and it's much faster to compare memory using the
+	 * processor's native word size.
+	 */
+	StaticAssertStmt(BLCKSZ == (BLCKSZ / sizeof(size_t)) * sizeof(size_t),
+					 "BLCKSZ has to be a multiple of sizeof(size_t)");
+
 	all_zeroes = true;
-	pagebytes = (char *) page;
-	for (i = 0; i < BLCKSZ; i++)
+	pagebytes = (size_t *) page;
+	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
 	{
 		if (pagebytes[i] != 0)
 		{
--
2.0.0.rc2.4.g1dc51c6.dirty

From 5bc40c11081a0d034f2857b45cb3d45c5b63fa58 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Thu, 30 Oct 2014 13:24:52 +0100
Subject: [PATCH 2/2] Much faster/O(1) mfdvec. 
---
 src/backend/storage/smgr/md.c   | 307 +---
 src/backend/storage/smgr/smgr.c |   4 +-
 src/include/storage/smgr.h      |   4 +-
 3 files changed, 166 insertions(+), 149 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 167d61c..6a7afa8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -92,27 +92,22 @@
  * out to an unlinked old copy of a segment file that will eventually
  * disappear.
  *
- * The file descriptor pointer (md_fd field) stored in the SMgrRelation
- * cache is, therefore, just the head of a list of MdfdVec objects, one
- * per segment. But note the md_fd pointer can be NULL, indicating
- * relation not open.
+ * File descriptors are stored in the md_seg_fds arrays inside
+ * SMgrRelation. The length of these arrays is stored in md_seg_no. Note
+ * that md_nfd having a specific value does not necessarily mean the relation
+ * doesn't have additional segments; we may just not have opened the next
+ * segment yet. (We could not have all segments are in the array as an
+ *
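For readers who want to play with the PageIsVerified() speedup from patch 1 outside the tree, here is a self-contained version (BLCKSZ hard-coded for illustration): the all-zeroes test compares one size_t word at a time rather than one byte at a time, which is valid because BLCKSZ is a multiple of sizeof(size_t), as the patch's StaticAssertStmt enforces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLCKSZ 8192             /* PostgreSQL's default page size */

/* Word-at-a-time all-zeroes check. The caller must pass a buffer that is
 * suitably aligned for size_t, which PostgreSQL's page buffers always are. */
static bool page_is_all_zeroes(const void *page)
{
    const size_t *words = (const size_t *) page;

    for (size_t i = 0; i < BLCKSZ / sizeof(size_t); i++)
        if (words[i] != 0)
            return false;
    return true;
}
```

On a 64-bit machine this does 1024 comparisons per 8kB page instead of 8192, which is where the speedup for freshly extended (all-zero) pages comes from.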
Re: [HACKERS] tracking commit timestamps
On Sat, Nov 1, 2014 at 1:15 AM, Simon Riggs si...@2ndquadrant.com wrote: On 31 October 2014 15:46, Petr Jelinek p...@2ndquadrant.com wrote: Attached version with the above comments near the relevant code. Looks cooked and ready to serve. Who's gonna commit this? Alvaro, or do you want me to? Could you hold on a bit? I'd like to have a look at it more deeply, and from a quick look at the code there are a couple of things that could be improved. Regards, -- Michael
Re: [HACKERS] _mdfd_getseg can be expensive
Andres Freund and...@2ndquadrant.com writes: I wrote the attached patch that gets rid of that essentially quadratic behaviour, by replacing the mdfd chain/singly linked list with an array. Since we seldom grow files by a whole segment I can't see the slightly bigger memory reallocations mattering significantly. In pretty much every other case the array is bound to be a winner. Does anybody have fundamental arguments against that idea? While the basic idea is sound, this particular implementation seems pretty bizarre. What's with the md_seg_no stuff, and why is that array typed size_t? IOW, why didn't you *just* replace the linked list with an array? This patch seems to be making some other changes that you've failed to explain. regards, tom lane
Re: [HACKERS] _mdfd_getseg can be expensive
On 2014-10-31 18:48:45 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I wrote the attached patch that gets rid of that essentially quadratic behaviour, by replacing the mdfd chain/singly linked list with an array. Since we seldom grow files by a whole segment I can't see the slightly bigger memory reallocations mattering significantly. In pretty much every other case the array is bound to be a winner. Does anybody have fundamental arguments against that idea? While the basic idea is sound, this particular implementation seems pretty bizarre. What's with the md_seg_no stuff, and why is that array typed size_t? IOW, why didn't you *just* replace the linked list with an array? It stores the length of the array of _MdfdVec entries. To know whether it's safe to access some element we first need to check whether we've loaded that many entries. It's size_t[] because that seemed to be the most appropriate type for the length of an array. It's an array because md.c had already chosen to represent relation forks via an array indexed by the fork. So size_t md_seg_no[MAX_FORKNUM + 1]; contains the length of the _MdfdVec array for each fork. These arrays are stored in: struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1]; This patch seems to be making some other changes that you've failed to explain. I'm not aware of any that aren't just a consequence of not iterating through the linked list anymore. What change are you thinking of? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 10/24/14, 6:17 PM, Jim Nasby wrote: - Does anyone have a tangible suggestion for how to reduce the code duplication in patch #6? Between execute_sql_string() and tcop/exec_simple_query()? Is there stuff in exec_simple that's not safe for a bgworker? I'm not seeing why we can't use exec_simple. :/ There are some differences if you compare them closely. Without doing a deep dive, I'm guessing that most of the extra stuff would be safe to re-use; it just wouldn't affect execute_sql_string. Obviously we could add a boolean to exec_simple_query for the case when it's being used by a bgworker. Though, it seems like the biggest differences have to do with logging Here's the differences I see: - Disallowing transaction commands - Logging - What memory context is used (could we just use that differently in a pg_background backend?) - Portal output format - What to do with the output of intermediate commands (surely there's other places we need to handle that? plpgsql maybe?) I think all of those except logging could be handled by a function serving both exec_simple_query and execute_sql_string that accepts a few booleans (or maybe just one to indicate the caller) and some if's. At least I don't think it'd be too bad, without actually writing it. I'm not sure what to do about logging. If the original backend has logging turned on, is it that horrible to do the same logging as exec_simple_query would do? Another option would be factoring out parts of exec_simple_query; the for loop over the parse tree might be a good candidate. But I suspect that would be uglier than a separate support function. I do feel uncomfortable with the amount of duplication there is right now though... I took a stab at this by refactoring the guts of exec_simple_query (patch attached) into a new function called exec_query_string (also attached in raw form). As indicated it needs a bit more work. In particular, how it's dealing with excluding transactional commands is rather ugly. 
Why do we need to do that in pg_background? Andres was concerned about the performance impact of doing this. I tested this by doing for i in {1..99}; do echo 'SELECT 1;' >> test.sql; done and then time psql -f test.sql > /dev/null It appears there may be a ~1% performance hit, but my laptop isn't repeatable enough to be sure. I did try manually in-lining exec_query_string to see if it was the function call causing the issue; it didn't seem to make a difference. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com Author: Jim Nasby jim.na...@bluetreble.com Date: Thu, 30 Oct 2014 20:25:34 -0500 Move the bulk of exec_simple_query into exec_query_string() so that pg_background can also make use of it. I’m not thrilled about the name, and I’m not sure tcopprot.h is the right place to expose this. Also note the XXX comments. --- contrib/pg_background/pg_background.c | 141 ++ src/backend/tcop/postgres.c | 83 src/include/tcop/tcopprot.h | 7 ++ 3 files changed, 80 insertions(+), 151 deletions(-) diff --git a/contrib/pg_background/pg_background.c b/contrib/pg_background/pg_background.c index a566ffb..075ecd8 100644 --- a/contrib/pg_background/pg_background.c +++ b/contrib/pg_background/pg_background.c @@ -817,10 +817,6 @@ pg_background_worker_main(Datum main_arg) static void execute_sql_string(const char *sql) { - List *raw_parsetree_list; - ListCell *lc1; - bool isTopLevel; - int commands_remaining; MemoryContext parsecontext; MemoryContext oldcontext; @@ -839,139 +835,16 @@ execute_sql_string(const char *sql) ALLOCSET_DEFAULT_MINSIZE, ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE); - oldcontext = MemoryContextSwitchTo(parsecontext); - raw_parsetree_list = pg_parse_query(sql); - commands_remaining = list_length(raw_parsetree_list); - isTopLevel = commands_remaining == 1; - MemoryContextSwitchTo(oldcontext); /* -* Do parse analysis, rule rewrite, planning, and execution for each raw -* parsetree. 
We must fully execute each query before beginning parse -* analysis on the next one, since there may be interdependencies. +* Do the real work */ - foreach(lc1, raw_parsetree_list) - { - Node *parsetree = (Node *) lfirst(lc1); - const char *commandTag; - charcompletionTag[COMPLETION_TAG_BUFSIZE]; - List *querytree_list, - *plantree_list; - bool
Re: [HACKERS] infinite loop in _bt_getstackbuf
On Fri, Oct 31, 2014 at 10:29:53AM -0400, Robert Haas wrote: On Thu, Oct 30, 2014 at 11:45 PM, Noah Misch n...@leadboat.com wrote: Given the lack of prior complaints about this loop, I'm not sure I see the need to work harder than that; corruption of this sort must be quite rare. Looks like _bt_getstackbuf() is always called with some buffer lock held, so CHECK_FOR_INTERRUPTS() alone would not help: http://www.postgresql.org/message-id/flat/16519.1401395...@sss.pgh.pa.us That's the insert path. What about the vacuum path? I am not aware of an occasion where the vacuum path will call _bt_getstackbuf() without already holding some buffer lock.
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Oct 31, 2014 at 4:10 PM, Andres Freund and...@2ndquadrant.com wrote: I don't think that's correct. We only need to process local invalidation messages after CommandCounterIncrement(), which I anticipate prohibiting during parallel execution (short thought should convince you that anything else is completely nuts). It is more complex, even without CCI. As long as you're doing anything inside a transaction that already has done DDL, you'll have to play nasty tricks. Imagine the primary backend has done a BEGIN; CLUSTER pg_class; and then starts a child backend. Which will get the same snapshot, combocids, yada yada. But it *also* will have preexisting cache entries. Those you need to invalidate before it can do anything correct. Where will those preexisting cache entries come from, exactly? The postmaster is forking the parallel worker, not the user backend. I'm sorry to be a bit pessimistic here. But my intuition says that starting to do group locking opens a can of worms that'll take us a long time to close again. Maybe I'm just imagining complexity that won't actually be there. But I have a hard time believing that. What's the distinction between teach the deadlock detector to catch these kind of cases and group locking? Because I think those are at least 90% the same thing. I understand under 'group locking' that a primary/second backends can co-own a lock that normally is self-exclusive. That's a fair bit more than adding an edge to the deadlock graph between primary/secondary backends to make the deadlock detector recognize problems. I guess. It seems like a pretty minor extension to me. Getting the deadlock detector to do the right thing is the hard part. What I have serious doubts about is 'co-owning' locks. Especially if two backends normally wouldn't be able to both get that lock. Perhaps we should discuss that more. To me it seems pretty obvious that's both safe and important. 
>>> I wonder if we couldn't implement a 'halfway' by allowing parallel
>>> workers to jump the lock queue if the parent process already has the
>>> lock.  That'd only work for nonconflicting locks, but I think that's
>>> actually good.
>>
>> The patch takes precisely that approach; that part of the logic is
>> already implemented.
>
> Well, my point is that if the solution is just to jump the queue, there
> really aren't any data structure changes needed.  Secondary acquirers
> just need to check whether a lock is already owned by the primary and
> then acquire the lock in the absolutely normal way - with a completely
> separate entry.  Only that they ignored the queue.

Well, they need to be able to identify who is in their group.  You
could possibly do that without any changes at all to the lock manager
data structures, but it seems a bit complex.  I took the approach of
storing the group leader's PGPROC * in each PROCLOCK, which makes it
trivial (and is enough information for the deadlock detector to work
correctly, too).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
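[A minimal sketch of the idea of keeping a group-leader pointer on each
lock holder: the conflict check simply skips holders that share the
requester's group, so a primary and its parallel workers never block each
other.  All names here are invented for illustration; the real PostgreSQL
lock manager structures (PGPROC, PROCLOCK, the conflict tables) are far
more involved.]

```c
#include <stdbool.h>

/* Toy stand-ins for the lock manager structures discussed above. */
typedef struct PGPROC
{
    struct PGPROC *groupLeader;   /* points to self if this is the leader */
} PGPROC;

typedef struct PROCLOCK
{
    PGPROC *owner;
    int     heldMode;             /* bitmask of lock modes held by owner */
} PROCLOCK;

/* Does the requested mode conflict with any held lock, treating holders
 * in the requester's own group as non-conflicting? */
static bool
lock_conflicts(const PGPROC *requester, int requestMode,
               const PROCLOCK *holders, int nholders,
               const int *conflictTab)
{
    for (int i = 0; i < nholders; i++)
    {
        /* Same group: primary and its workers co-own without conflict. */
        if (holders[i].owner->groupLeader == requester->groupLeader)
            continue;
        if (conflictTab[requestMode] & holders[i].heldMode)
            return true;
    }
    return false;
}
```

With the leader pointer in hand, the deadlock detector can likewise treat
the whole group as one node when building its wait graph.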
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On Fri, Oct 31, 2014 at 6:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> I don't know if/when that will happen as such, but Simon was making
> noises about writing code to treat hash indexes as unlogged
> automatically, which would more or less fix the worst risks.  That's
> not just a special case for hash indexes, but any index AM that lacks
> WAL support, as third-party AMs might well do.

As someone writing a third-party AM, literally right this moment, do
you have a link to that thread?  While I follow this list fairly
closely, I don't remember seeing this.  I'd love to understand the
thoughts around handling extension-based AMs.

eric
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 31 October 2014 19:36, Robert Haas robertmh...@gmail.com wrote:
>> It's an obscure case and it's not the only solution either.
>
> I don't think that's an obscure situation at all.  Do you really think
> a patch that could cause an attempt to VACUUM FULL a system catalog to
> suffer an undetected deadlock meets this community's quality
> standards?  Because that's what we're talking about.

Nobody has said that allowing undetected deadlocks is acceptable, so
your other comments are void.

I've suggested *stricter* locking, which would obviously allow deadlock
detection.  You recognised that by claiming that the locking I had
proposed was actually too strict, which is where the above example came
from.

Yes, I have proposed stricter locking, but as explained, the only
things this would slow down are catalog VAC FULLs, which are already a
problem.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 31 October 2014 18:47, Robert Haas robertmh...@gmail.com wrote:
> On Fri, Oct 31, 2014 at 11:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
>> You mentioned earlier that functions would need to be marked
>> proisparallel etc.  What conditions will that be protecting against?
>> If we aren't going to support the general case where every single
>> thing works, can we at least discuss what the list of cases is that
>> we will support?
>
> In general, any operation that relies on backend-private state not
> shared with a cooperating backend isn't parallel-safe.  For example,

Yes, the principle is easy to understand, but that was not the question.

You are saying that placing restrictions on which functions can execute
is necessary, yet at the same time saying that we must have generalised
locking for workers because restrictions on locking are not acceptable.
I don't point that out because I want an argument or to prove a point,
but because there are important things on the table here.

What are the specific restrictions you are suggesting we place on
parallel workers?  We need that definition clear so we can decide how
to mark the functions.

If we don't know, which is easily understandable, then the best way to
find that out is to build a restricted solution and to expand slowly
outwards to find problems.  An obvious milestone there is whether a
function contains SQL, which is the point chosen by some other
databases.  I personally hope to expand upon that, but it would be
useful to reach it first and then continue from there.

I was already thinking of implementing CONTAINS NO SQL as a SQL
Standard function marking, since it can be used to optimize COPY
further.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services