Re: [HACKERS] FIPS mode?
Michael Paquier writes:
> On Sat, Jun 24, 2017 at 12:56 PM, Curtis Ruck wrote:
>> If I clean this up some, maintain styleguide, what is the likelihood of getting this included in the redhat packages, since redhat ships a certified FIPS implementation?
> So they are applying a custom patch to it already?

Don't believe so.  It's been a few years since I was at Red Hat, but my recollection is that their approach was that it was a system-wide configuration choice changing libc's behavior, and there were only very minor fixes required to PG's behavior, all of which got propagated upstream (see, e.g., commit 01824385a).

It sounds like Curtis is trying to enable FIPS mode inside Postgres within a system where it isn't enabled globally, which according to my recollection has basically nothing to do with complying with the actual federal security standard.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Race conditions with WAL sender PID lookups
On Wed, Jun 21, 2017 at 10:52:50PM -0700, Noah Misch wrote:
> On Thu, Jun 15, 2017 at 09:04:44AM +0100, Simon Riggs wrote:
>> On 15 June 2017 at 02:59, Noah Misch wrote:
>>> Formally, the causative commit is the one that removed the superuser() test, namely 25fff40.
>>>
>>> [Action required within three days.  This is a generic notification.]
>>>
>>> The above-described topic is currently a PostgreSQL 10 open item.  Simon, since you committed the patch believed to have created it, you own this open item.  If some other commit is more relevant or if this does not belong as a v10 open item, please let us know.  Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message.  Include a date for your subsequent status update.  Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10.  Consequently, I will appreciate your efforts toward speedy resolution.  Thanks.
>>>
>>> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>>
>> Thanks for letting me know.  I'm on leave at present, so won't be able to get to this until June 20, though will make it a priority for then.
>
> This PostgreSQL 10 open item is past due for your status update.  Kindly send a status update within 24 hours, and include a date for your subsequent status update.  Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due for your status update.  Please reacquaint yourself with the policy on open item ownership[1] and then reply immediately.
If I do not hear from you by 2017-06-25 06:00 UTC, I will transfer this item to release management team ownership without further notice.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
Re: [HACKERS] FIPS mode?
On Sat, Jun 24, 2017 at 12:56 PM, Curtis Ruck wrote:
> I've got a requirement for enabling FIPS support in our environment.  Looking at postgresql's be-secure-openssl.c and mucking with it, it seems fairly straightforward to just add a few ifdefs and enable FIPS with a new configure flag and a new postgresql.conf configuration setting.
>
> If I clean this up some, maintain styleguide, what is the likelihood of getting this included in the redhat packages, since redhat ships a certified FIPS implementation?

So they are applying a custom patch to it already?

> For what it's worth, I've got the FIPS_mode_set(1) working and postgresql seems to function properly.  I'd just like to see this in upstream so I don't end up maintaining a long-lived branch.
>
> Looking at scope, logically it seems mostly confined to libpq, and be-secure-openssl.c, though I'd expect pgcrypto to be affected.

Yes, I would imagine that this is confined to be-secure-openssl.c and fe-secure-openssl.c, as everything should be done when initializing the SSL context.  Here is a manual about patch submission: https://wiki.postgresql.org/wiki/Submitting_a_Patch

As things are now, the next version where new features are accepted will be 11, with a commit fest beginning in September.  Here is its website where patches need to be registered for review and possible integration into the tree: https://commitfest.postgresql.org/
--
Michael
Re: [HACKERS] Causal reads take II
On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro wrote:
> Apply the patch after first applying a small bug fix for replication lag tracking[4].  Then:

That bug fix was committed, so now causal-reads-v17.patch can be applied directly on top of master.

> 1.  Set up some streaming replicas.
> 2.  Stick causal_reads_max_replay_lag = 2s (or any time you like) in the primary's postgresql.conf.
> 3.  Set causal_reads = on in some transactions on various nodes.
> 4.  Try to break it!

Someone asked me off-list how to set this up quickly and easily for testing.  Here is a shell script that will start up a primary server (port 5432) and 3 replicas (ports 5441 to 5443).  Set the two paths at the top of the file before running it.  Log in with psql postgres [-p ], then SET causal_reads = on to test its effect.

causal_reads_max_replay_lag is set to 2s, and depending on your hardware you might find that stuff like CREATE TABLE big_table AS SELECT generate_series(1, 1000) or a large COPY data load causes replicas to be kicked out of the set after a while; you can also pause replay on the replicas with SELECT pg_wal_replay_pause() and pg_wal_replay_resume(), kill -STOP/-CONT or -9 the walreceiver processes to simulate various failure modes, or run the replicas remotely and unplug the network.  SELECT application_name, replay_lag, causal_reads_state FROM pg_stat_replication to see the current situation, and also monitor the primary's LOG messages about transitions.

You should find that the "read-your-writes-or-fail-explicitly" guarantee is upheld, no matter what you do, and furthermore that failing or lagging replicas don't hold the primary up very long: in the worst case causal_reads_lease_time for lost contact, and in the best case the time to exchange a couple of messages with the standby to tell it its lease is revoked and it should start raising an error.  You might find test-causal-reads.c[1] useful for testing.

> Maybe it needs a better name.
Ok, how about this: the feature could be called "synchronous replay".  The new column in pg_stat_replication could be called sync_replay (like the other sync_XXX columns).  The GUCs could be called synchronous_replay, synchronous_replay_max_lag and synchronous_replay_lease_time.  The language in log messages could refer to standbys "joining the synchronous replay set".

Restating the purpose of the feature with that terminology:  If synchronous_replay is set to on, then you see the effects of all synchronous_replay = on transactions that committed before your transaction began, or an error is raised if that is not possible on the current node.  This allows applications to direct read-only queries to read-only replicas for load balancing without seeing stale data.  Is that clearer?

Restating the relationship with synchronous replication with that terminology: while synchronous_commit and synchronous_standby_names are concerned with distributed durability, synchronous_replay is concerned with distributed visibility.  While the former prevents commits from returning if the configured level of durability isn't met (for example "must be flushed on master + any 2 standbys"), the latter will simply drop any standbys from the synchronous replay set if they fail or lag more than synchronous_replay_max_lag.

It is reasonable to want to use both features at once: my policy on distributed durability might be that I want all transactions to be flushed to disk on master + any of three servers before I report information to users, and my policy on distributed visibility might be that I want to be able to run read-only queries on any of my six read-only replicas, but don't want to wait for any that lag by more than 1 second.

Thoughts?
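The semantics described above can be sketched as a psql session.  Note that these GUC names exist only in the proposal and patch under discussion, not in any released PostgreSQL:

```sql
-- Sketch of the proposed semantics; synchronous_replay and
-- synchronous_replay_max_lag come from the patch under discussion
-- and are not a released PostgreSQL API.

-- In the primary's postgresql.conf (proposed setting):
--   synchronous_replay_max_lag = '2s'

-- On a read-only replica:
SET synchronous_replay = on;
BEGIN;
SELECT * FROM orders WHERE id = 42;
-- Either this sees every synchronous_replay = on transaction that
-- committed before this transaction began, or it raises an error
-- because this standby has been dropped from the replay set.
COMMIT;
```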
[1] https://www.postgresql.org/message-id/CAEepm%3D3NF%3D7eLkVR2fefVF9bg6RxpZXoQFmOP3RWE4r4iuO7vg%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
[HACKERS] FIPS mode?
I've got a requirement for enabling FIPS support in our environment.  Looking at postgresql's be-secure-openssl.c and mucking with it, it seems fairly straightforward to just add a few ifdefs and enable FIPS with a new configure flag and a new postgresql.conf configuration setting.

If I clean this up some, maintain styleguide, what is the likelihood of getting this included in the redhat packages, since redhat ships a certified FIPS implementation?

For what it's worth, I've got the FIPS_mode_set(1) working and postgresql seems to function properly.  I'd just like to see this in upstream so I don't end up maintaining a long-lived branch.

Looking at scope, logically it seems mostly confined to libpq, and be-secure-openssl.c, though I'd expect pgcrypto to be affected.
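Tangentially, for anyone reproducing this: whether the host itself is in system-wide FIPS mode (the Red Hat configuration Tom describes elsewhere in the thread, as opposed to forcing FIPS inside a single application) can be checked on Linux via a sysctl.  A minimal sketch — the /proc path is exposed by RHEL-family and other FIPS-capable kernels, and is simply absent elsewhere:

```shell
#!/bin/sh
# Report system-wide FIPS status; the sysctl is absent on kernels
# built without FIPS support, so handle that case too.
if [ -r /proc/sys/crypto/fips_enabled ]; then
    if [ "$(cat /proc/sys/crypto/fips_enabled)" = "1" ]; then
        fips_state="enabled"
    else
        fips_state="disabled"
    fi
else
    fips_state="not available on this kernel"
fi
echo "system-wide FIPS mode: $fips_state"
```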
Re: [HACKERS] Broken hint bits (freeze)
On Fri, Jun 23, 2017 at 8:18 PM, Bruce Momjian wrote:
> On Fri, Jun 23, 2017 at 08:10:17AM +0530, Amit Kapila wrote:
>> On Wed, Jun 21, 2017 at 10:03 PM, Bruce Momjian wrote:
>>> On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
>>>> On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila wrote:
>>>>> Hmm.  I think we need something that works with lesser effort because not all users will be as knowledgeable as you are, so if they make any mistakes in copying the file manually, it can lead to problems.  How about issuing a notification (XLogArchiveNotifySeg) in shutdown checkpoint if archiving is enabled?
>>>>
>>>> I have thought more about the above solution and it seems risky to notify archiver for incomplete WAL segments (which will be possible in this case as there is no guarantee that Checkpoint record will fill the segment).  So, it seems to me we should update the document unless you or someone has some solution to this problem.
>>>
>>> The over-arching question is how do we tell users to verify that the WAL has been replayed on the standby?  I am thinking we would say that for streaming replication, the "Latest checkpoint location" should match on the primary and standby, while for log shipping, the standbys should be exactly one WAL file less than the primary.
>>
>> I am not sure if we can say "standbys should be exactly one WAL file less than the primary" because checkpoint can create few more WAL segments for future use.  I think to make this work user needs to carefully just copy the next WAL segment (next to the last file in standby) which will contain checkpoint record.  Ideally, there should be some way either in form of a tool or a functionality in the database server with which this last file can be copied but I think in the absence of that we can at least document this fact.
>
> I was not clear.
> I was not saying there can be only one extra WAL file.  I am saying the "Latest checkpoint location" should be one WAL file farther on the master.  I think the big problem is that we need a full replay of that WAL file, not just having it one less than the master.

If the user has properly shut down, then that last file should only have the checkpoint record; is it safe to proceed with the upgrade without actually copying that file?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
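The "Latest checkpoint location" comparison discussed above can be scripted with pg_controldata.  A sketch — the awk parsing is the point here; the SAMPLE variables stand in for real pg_controldata output so the snippet runs without a live cluster, and the commented-out lines show the real invocation (data-directory paths are placeholders):

```shell
#!/bin/sh
# Compare "Latest checkpoint location" between two control files.
latest_ckpt() {
    awk -F': *' '/^Latest checkpoint location/ {print $2}'
}

# Canned output lines standing in for `pg_controldata $PGDATA`:
SAMPLE_PRIMARY='Latest checkpoint location:           0/3000060'
SAMPLE_STANDBY='Latest checkpoint location:           0/3000060'

p=$(printf '%s\n' "$SAMPLE_PRIMARY" | latest_ckpt)
s=$(printf '%s\n' "$SAMPLE_STANDBY" | latest_ckpt)

# With real, cleanly shut down clusters you would instead run:
#   p=$(pg_controldata "$PRIMARY_PGDATA" | latest_ckpt)
#   s=$(pg_controldata "$STANDBY_PGDATA" | latest_ckpt)

if [ -n "$p" ] && [ "$p" = "$s" ]; then
    echo "checkpoint locations match: $p"
else
    echo "checkpoint locations differ: primary=$p standby=$s"
fi
```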
Re: [HACKERS] Broken hint bits (freeze)
On Fri, Jun 23, 2017 at 8:47 PM, Sergey Burladyan wrote:
> Bruce Momjian writes:
>> On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
>>> On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila wrote:
>>>> Hmm.  I think we need something that works with lesser effort because not all users will be as knowledgeable as you are, so if they make any mistakes in copying the file manually, it can lead to problems.  How about issuing a notification (XLogArchiveNotifySeg) in shutdown checkpoint if archiving is enabled?
>>>
>>> I have thought more about the above solution and it seems risky to notify archiver for incomplete WAL segments (which will be possible in this case as there is no guarantee that Checkpoint record will fill the segment).  So, it seems to me we should update the document unless you or someone has some solution to this problem.
>>
>> As far as I know this is the only remaining open issue.  Sergey, please verify.  I appreciate the work everyone has done to improve this, and all the existing fixes have been pushed to all supported branches.  :-)
>
> Yes, thank you all for your help!
>
> Yes, this is the last issue with checkpoints that I know of: how to ensure that the standby syncs all shared buffers to disk on shutdown.

I think if we had a command like ALTER SYSTEM FLUSH SHARED BUFFERS, it would have been helpful for what you need here.  You could have run it before shutdown.

> I am thinking about enforcing a restartpoint on shutdown, like:
> src/backend/access/transam/xlog.c
> - 8639 if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
> - 8640     XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
> - 8641 {
> + 8639 if ( !(flags & CHECKPOINT_IS_SHUTDOWN) && (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
> + 8640     XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo) )
> + 8641 {
>
> But I have still not read the source and am not sure about this solution.
It might serve your purpose, but I think it will not be safe to perform a restartpoint always at shutdown.  It will delete the WAL files which should be deleted only after the actual checkpoint record is received from the master.

> PS:
> I successfully upgraded last night from 9.2 to 9.4 and found another issue :-)
>
> It is about hash indexes and promote:
> 1. create master
> 2. create standby from it
> 3. create unlogged table and hash index like:
>    create unlogged table test (id int primary key, v text);
>    create index on test using hash (id);
> 4. stop master
> 5. promote standby
>
> now, if you try to upgrade this newly promoted master, pg_upgrade will stop on this hash index:
> error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
> Failure, exiting

I am not sure if this is a problem, because in the version you are upgrading from, hash indexes are not WAL-logged and the creation of the index will not be replicated on the standby, so the error seems to be expected.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] An attempt to reduce WALWriteLock contention
On Thu, Jun 22, 2017 at 6:54 AM, Andres Freund wrote:
> On 2017-06-21 00:57:32 -0700, jasrajd wrote:
>> We are also seeing contention on the walwritelock and repeated writes to the same offset if we move the flush outside the lock in the Azure environment.  pgbench doesn't scale beyond ~8 cores without saturating the IOPs or bandwidth.  Is there more work being done in this area?

That should not happen if the writes from various backends are combined in some way.  However, it is not very clear what exactly you have done as part of taking the flush calls out of walwritelock.  Can you share a patch or some details about how you have done it and how you have measured the contention you are seeing?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
Peter Eisentraut writes:
> Do you want to take a look at moving those elog calls around a bit?  That should do it.

It would be a good idea to have some clarity on *why* that should do it.  Looking at the original report's log, but without having actually reproduced the problem, I guess what is happening is this:

1. Subscription worker process (23117) gets a duplicate key conflict while trying to apply an update, and in consequence it exits.  (Is that supposed to happen?)

2. Publication server process (23124) doesn't notice client connection loss right away.  By chance, the next thing it tries to send to the client is the debug output from LogicalIncreaseRestartDecodingForSlot.  Then it detects loss of connection (at 2017-06-21 14:55:12.033) and FATAL's out.  But since the spinlock stuff has no tracking infrastructure, we don't know we are still holding the replication slot mutex.

3. Process exit cleanup does know that it's supposed to release the replication slot, so it tries to take the mutex spinlock ... again.  Eventually that times out and we get the "stuck spinlock" panic.

All correct so far?

So, okay, the proximate cause of the crash is a blatant violation of the rule that spinlocks may only be held across straight-line code segments.  But I'm wondering about the client exit having occurred in the first place.  Why is that, and how would one ever recover?  It sure looks like this isn't the first subscription worker process that has tried and failed to apply the update.  If our attitude towards this situation is that it's okay to fork-bomb your server with worker processes continually respawning and making no progress, well, I don't think that's good enough.

			regards, tom lane
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/21/17 22:47, Peter Eisentraut wrote:
> On 6/20/17 22:44, Noah Misch wrote:
>>> A patch has been posted, and it's being reviewed.  Next update Monday.
>>
>> This PostgreSQL 10 open item is past due for your status update.  Kindly send a status update within 24 hours, and include a date for your subsequent status update.  Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>
> I'm not sure how to proceed here.  Nobody else seems terribly interested, and this is really a minor issue, so I don't want to muck around with the locking endlessly.  Let's say, if there are no new insights by Friday, I'll pick one of the proposed patches or just close it without any patch.

After some comments, a new patch has been posted, and I'll give until Monday for review.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 6/22/17 03:26, Masahiko Sawada wrote:
> Thank you for the patch.  Some comments and a question.
> DropSubscriptionRelState requests ExclusiveLock but why is it not ShareRowExclusiveLock?

fixed

> I tested the DROP SUBSCRIPTION case but even with this patch, "tuple concurrently updated" still occurred.
>
> @@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
>  	}
>  	heap_endscan(scan);
>
> -	heap_close(rel, RowExclusiveLock);
> +	heap_close(rel, lockmode);
>  }
>
> I think we should not release the table lock here; should it be heap_close(rel, NoLock) instead?  After changing it so on my environment, DROP SUBSCRIPTION seems to work fine.

fixed

> Also, ALTER SUBSCRIPTION SET PUBLICATION seems the same.  Even with Petr's patch, the same error still occurred during ALTER SUBSCRIPTION SET PUBLICATION.  Currently it acquires RowExclusiveLock on pg_subscription_rel, but as long as both DDL and the table sync worker can modify the same record on pg_subscription this error can happen.

fixed

> On the other hand, if we take a strong lock on pg_subscription during DDL the deadlock risk will be increased.

The original problem was that DROP TABLE locked things in the order 1) user table, 2) pg_subscription_rel, whereas a full-database VACUUM would lock things in the opposite order.  I think this was a pretty wide window if you have many tables.  In this case, however, we are only dealing with pg_subscription and pg_subscription_rel, and I think even VACUUM would always process them in the same order.

Could you please try the attached patch to see if it addresses your test scenarios?
--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From fc50b9e287e8c3c29be21d82c07c39023cbf3882 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Fri, 23 Jun 2017 21:26:46 -0400
Subject: [PATCH v2] Parametrize locking in RemoveSubscriptionRel

Revising the change in 521fd4795e3ec3d0b263b62e5eb58e1557be9c86, we need to take a stronger lock when removing entries from pg_subscription_rel when ALTER/DROP SUBSCRIPTION.  Otherwise, users might see "tuple concurrently updated" when table sync workers make updates at the same time as DDL commands are run.

For DROP TABLE, we keep the lower lock level that was put in place by the above commit, so that DROP TABLE does not deadlock with whole-database VACUUM.
---
 src/backend/catalog/heap.c                | 2 +-
 src/backend/catalog/pg_subscription.c     | 6 +++---
 src/backend/commands/subscriptioncmds.c   | 4 ++--
 src/include/catalog/pg_subscription_rel.h | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index a376b99f1e..8a96bbece6 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1844,7 +1844,7 @@ heap_drop_with_catalog(Oid relid)
 	/*
 	 * Remove any associated relation synchronization states.
 	 */
-	RemoveSubscriptionRel(InvalidOid, relid);
+	RemoveSubscriptionRel(InvalidOid, relid, RowExclusiveLock);
 
 	/*
 	 * Forget any ON COMMIT action for the rel
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index c69c461b62..1d80191ca8 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -369,7 +369,7 @@ GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn,
  * subscription, or for a particular relation, or both.
  */
 void
-RemoveSubscriptionRel(Oid subid, Oid relid)
+RemoveSubscriptionRel(Oid subid, Oid relid, LOCKMODE lockmode)
 {
 	Relation	rel;
 	HeapScanDesc scan;
@@ -377,7 +377,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	HeapTuple	tup;
 	int			nkeys = 0;
 
-	rel = heap_open(SubscriptionRelRelationId, RowExclusiveLock);
+	rel = heap_open(SubscriptionRelRelationId, lockmode);
 
 	if (OidIsValid(subid))
 	{
@@ -405,7 +405,7 @@ RemoveSubscriptionRel(Oid subid, Oid relid)
 	}
 	heap_endscan(scan);
 
-	heap_close(rel, RowExclusiveLock);
+	heap_close(rel, NoLock);
 }
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 9cbd36f646..57954c978b 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -595,7 +595,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 		{
 			char	   *namespace;
 
-			RemoveSubscriptionRel(sub->oid, relid);
+			RemoveSubscriptionRel(sub->oid, relid, ShareRowExclusiveLock);
 
 			logicalrep_worker_stop(sub->oid, relid);
 
@@ -910,7 +910,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 	deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
 
 	/*
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
On 6/23/17 16:15, Andres Freund wrote:
> On 2017-06-23 13:26:58 -0400, Alvaro Herrera wrote:
>> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have some elog(DEBUG1) calls with the slot spinlock held.  That's probably uncool.
>
> It definitely is not cool, rather daft even (it's probably me who wrote that).  No idea why I've done it like that, makes no sense.

Do you want to take a look at moving those elog calls around a bit?  That should do it.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_terminate_backend can terminate background workers and autovacuum launchers
Alvaro, all,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> Yugo Nagata wrote:
>> I tried to make it.  Patch attached.
>>
>> It is easy and simple.  Although I haven't tried for autovacuum worker, I confirmed I can change other process' parameters without affecting others.  Do you want this in PG?
>
> One advantage of this gadget is that you can signal backends that you own without being superuser, so +1 for the general idea.  (Please do create a fresh thread so that this can be appropriately linked to in the commitfest app, though.)

Well, that wouldn't quite work with the proposed patch because the proposed patch REVOKEs execute from public...  I'm trying to figure out how it's actually useful to be able to signal a backend that you own about a config change that you *aren't* able to make without being a superuser.

Now, if you could tell the other backend to use a new value for some USERSET GUC, then *that* would be useful and interesting.  I haven't got any particularly great ideas for how to make that happen though.

> You need a much bigger patch for the autovacuum worker.  The use case I had in mind is that you have a maintenance window and can run fast vacuuming during it, but if the table is large and vacuum can't finish within that window then you want vacuum to slow down, without stopping it completely.  But implementing this requires juggling the stdVacuumCostDelay/Limit values during the reload, which are currently read at start of vacuuming only; the working values are overwritten from those during a rebalance.

Being able to change those values during a vacuuming run would certainly be useful; I've had cases where I would have liked to have been able to do just exactly that.  That's largely independent of this though.

Thanks!

Stephen
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
On 6/23/17 16:10, Peter Eisentraut wrote:
>> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have some elog(DEBUG1) calls with the slot spinlock held.  That's probably uncool.
> Removing the call you pointed out doesn't make a difference, but it's possibly something similar.

Correction: The fault actually is in that function.  I had only looked at the "if" branch, but it was the "else" branch that triggered the problem in this run.  If I remove those elog calls, everything works smoothly.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Small bug in replication lag tracking
On Sat, Jun 24, 2017 at 6:07 AM, Simon Riggs wrote:
> On 23 June 2017 at 08:18, Simon Riggs wrote:
>> On 23 June 2017 at 06:45, Thomas Munro wrote:
>>> I discovered a thinko in the new replication lag interpolation code that can cause a strange number to be reported occasionally.
>>
>> Thanks, will review.
>
> I've pushed this, but I think we should leave the code alone after this and wait for user feedback.

Thanks!  Yeah, I haven't heard any feedback about this, which I've been interpreting as good news... but if anyone's looking for something to beta test, reports would be most welcome.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Can ICU be used for a database's default sort order?
Peter Geoghegan writes:
> On Fri, Jun 23, 2017 at 11:32 AM, Peter Eisentraut wrote:
>> 1) Associate by name only.  That is, you can create a database with any COLLATION "foo" that you want, and it's only checked when you first connect to or do anything in the database.
>>
>> 2) Create shared collations.  Then we'd need a way to manage having a mix of shared and non-shared collations around.
>>
>> There are significant pros and cons to all of these ideas.  Some people I talked to appeared to prefer the shared collations approach.

> I strongly prefer the second approach.  The only downside that occurs to me is that that approach requires more code.  Is there something that I've missed?

I'm not very clear on how you'd bootstrap template1 into anything other than C locale in the second approach.  With our existing libc-based stuff, it's possible to define what the database's locale is before there are any catalogs.  It's not apparent how to do that with a collation-based solution.

In my mind, collations are just a SQL-syntax wrapper for locales that are really defined one level down.  I think we'd be well advised to carry that same approach into the database properties, because otherwise we have circularities to deal with.  So I'm imagining something more like

	create database encoding 'utf8' lc_collate 'icu-en_US' lc_ctype ...

where lc_collate is just a string that we know how to interpret, the same as now.  We could optionally reduce the amount of notation involved by merging the lc_collate and lc_ctype parameters into one, say

	create database encoding 'utf8' locale 'icu-en_US' ...

I'm not too clear on how this would play with other libc locale functionality (lc_monetary and so on), but we'd have to deal with that question anyway.

			regards, tom lane
Re: [HACKERS] pg_terminate_backend can terminate background workers and autovacuum launchers
On Sat, Jun 24, 2017 at 5:07 AM, Alvaro Herrera wrote:
> Yugo Nagata wrote:
>> I tried to make it.  Patch attached.
>>
>> It is easy and simple.  Although I haven't tried for autovacuum worker, I confirmed I can change other process' parameters without affecting others.  Do you want this in PG?

Just browsing the patch...

+    if (r == SIGNAL_BACKEND_NOSUPERUSER)
+        ereport(WARNING,
+                (errmsg("must be a superuser to terminate superuser process")));
+
+    if (r == SIGNAL_BACKEND_NOPERMISSION)
+        ereport(WARNING,
+                (errmsg("must be a member of the role whose process is being terminated or member of pg_signal_backend")));

Both messages are incorrect.  This is not trying to terminate a process.

+Datum
+pg_reload_conf_pid(PG_FUNCTION_ARGS)

I think that a name closer to pg_reload_backend would be better.  Documentation is needed as well.

> One advantage of this gadget is that you can signal backends that you own without being superuser, so +1 for the general idea.  (Please do create a fresh thread so that this can be appropriately linked to in the commitfest app, though.)

That would be nice.

> You need a much bigger patch for the autovacuum worker.  The use case I had in mind is that you have a maintenance window and can run fast vacuuming during it, but if the table is large and vacuum can't finish within that window then you want vacuum to slow down, without stopping it completely.  But implementing this requires juggling the stdVacuumCostDelay/Limit values during the reload, which are currently read at start of vacuuming only; the working values are overwritten from those during a rebalance.

Yeah, that's independent from the patch above.
--
Michael
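For context, usage of the function being reviewed would presumably look something like this.  The function name and behavior come from the patch under discussion only; nothing here is a released PostgreSQL API:

```sql
-- Hypothetical: pg_reload_conf_pid() is from the patch under review,
-- not a released API.  Stage a new value, then ask a single backend
-- (rather than all of them) to re-read the configuration:
ALTER SYSTEM SET log_min_duration_statement = '250ms';
SELECT pg_reload_conf_pid(12345);   -- 12345 = target backend's PID

-- Compare with the existing function, which signals every backend:
SELECT pg_reload_conf();
```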
Re: [HACKERS] Can ICU be used for a database's default sort order?
On Fri, Jun 23, 2017 at 11:32 AM, Peter Eisentraut wrote: > It's something I hope to address soon. I hope you do. I think that we'd realize significant benefits by having ICU become the de facto standard collation provider, that most users get without even realizing it. As things stand, you have to make a point of specifying an ICU collation as your per-column collation within every CREATE TABLE. That's a significant barrier to adoption. > 1) Associate by name only. That is, you can create a database with any > COLLATION "foo" that you want, and it's only checked when you first > connect to or do anything in the database. > > 2) Create shared collations. Then we'd need a way to manage having a > mix of shared and non-shared collations around. > > There are significant pros and cons to all of these ideas. Some people > I talked to appeared to prefer the shared collations approach. I strongly prefer the second approach. The only downside that occurs to me is that that approach requires more code. Is there something that I've missed? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb initialization failure for collation "ja_JP"
I wrote: >> One question that I've got is why the ICU portion refuses to load >> any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()). >> Surely this is completely wrong? I should think that what we load into >> pg_collation ought to be independent of template1's encoding, the same >> as it is for libc collations, and the right place to be making a test >> like that is where somebody attempts to use an ICU collation. Pursuant to the second part of that: I checked on what happens if you try to use an ICU collation in a database with a not-supported-by-ICU encoding. We have to cope with that scenario even with the current (broken IMO) initdb behavior, because even if template1 has a supported encoding, it's possible to create another database that doesn't. It does fail more or less cleanly; you get an "encoding "foo" not supported by ICU" message at runtime (out of get_encoding_name_for_icu). But that's quite a bit unlike the behavior for libc collations: with those, you get an error in collation name lookup, along the lines of collation "en_DK.utf8" for encoding "SQL_ASCII" does not exist The attached proposed patch makes the behavior for ICU collations the same, by dint of injecting the is_encoding_supported_by_icu() check into collation name lookup. regards, tom lane diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c index 64f6fee..029a132 100644 *** a/src/backend/catalog/namespace.c --- b/src/backend/catalog/namespace.c *** OpfamilyIsVisible(Oid opfid) *** 1915,1923 --- 1915,1974 } /* + * lookup_collation + * If there's a collation of the given name/namespace, and it works + * with the given encoding, return its OID. Else return InvalidOid. 
+ */ + static Oid + lookup_collation(const char *collname, Oid collnamespace, int32 encoding) + { + Oid collid; + HeapTuple colltup; + Form_pg_collation collform; + + /* Check for encoding-specific entry (exact match) */ + collid = GetSysCacheOid3(COLLNAMEENCNSP, + PointerGetDatum(collname), + Int32GetDatum(encoding), + ObjectIdGetDatum(collnamespace)); + if (OidIsValid(collid)) + return collid; + + /* + * Check for any-encoding entry. This takes a bit more work: while libc + * collations with collencoding = -1 do work with all encodings, ICU + * collations only work with certain encodings, so we have to check that + * aspect before deciding it's a match. + */ + colltup = SearchSysCache3(COLLNAMEENCNSP, + PointerGetDatum(collname), + Int32GetDatum(-1), + ObjectIdGetDatum(collnamespace)); + if (!HeapTupleIsValid(colltup)) + return InvalidOid; + collform = (Form_pg_collation) GETSTRUCT(colltup); + if (collform->collprovider == COLLPROVIDER_ICU) + { + if (is_encoding_supported_by_icu(encoding)) + collid = HeapTupleGetOid(colltup); + else + collid = InvalidOid; + } + else + { + collid = HeapTupleGetOid(colltup); + } + ReleaseSysCache(colltup); + return collid; + } + + /* * CollationGetCollid * Try to resolve an unqualified collation name. * Returns OID if collation found in search path, else InvalidOid. + * + * Note that this will only find collations that work with the current + * database's encoding. */ Oid CollationGetCollid(const char *collname) *** CollationGetCollid(const char *collname) *** 1935,1953 if (namespaceId == myTempNamespace) continue; /* do not look in temp namespace */ ! /* Check for database-encoding-specific entry */ ! collid = GetSysCacheOid3(COLLNAMEENCNSP, ! PointerGetDatum(collname), ! Int32GetDatum(dbencoding), ! ObjectIdGetDatum(namespaceId)); ! if (OidIsValid(collid)) ! return collid; ! ! /* Check for any-encoding entry */ ! collid = GetSysCacheOid3(COLLNAMEENCNSP, ! PointerGetDatum(collname), ! Int32GetDatum(-1), ! 
ObjectIdGetDatum(namespaceId)); if (OidIsValid(collid)) return collid; } --- 1986,1992 if (namespaceId == myTempNamespace) continue; /* do not look in temp namespace */ ! collid = lookup_collation(collname, namespaceId, dbencoding); if (OidIsValid(collid)) return collid; } *** CollationGetCollid(const char *collname) *** 1961,1966 --- 2000,2008 * Determine whether a collation (identified by OID) is visible in the * current search path. Visible means "would be found by searching * for the unqualified collation name". + * + * Note that only collations that work with the current database's encoding + * will be considered visible. */ bool CollationIsVisible(Oid collid) *** CollationIsVisible(Oid collid) *** 1990,1998 { /* * If it is in the path, it might still not be visible; it could be ! * hidden by another conversion of the same name earlier in the path. ! * So we must do a slow
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
Peter Eisentraut writes: > On 6/23/17 13:26, Alvaro Herrera wrote: >> Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have >> some elog(DEBUG1) calls with the slot spinlock held. That's probably >> uncool. > I can reproduce the issue with --client-min-messages=debug2 or debug1, > but it doesn't appear with the default settings. I don't always get the > "stuck spinlock" message, but it hangs badly pretty reliably after two > or three rounds of erroring. > Removing the call you pointed out doesn't make a difference, but it's > possibly something similar. Maybe some function invoked in a debug ereport accesses shared memory or takes locks (eg, for catalog access)? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb initialization failure for collation "ja_JP"
I wrote: > One question that I've got is why the ICU portion refuses to load > any entries unless is_encoding_supported_by_icu(GetDatabaseEncoding()). > Surely this is completely wrong? I should think that what we load into > pg_collation ought to be independent of template1's encoding, the same > as it is for libc collations, and the right place to be making a test > like that is where somebody attempts to use an ICU collation. But > I've not tried to test it. So I did test that, and found out the presumable reason why that's there: icu_from_uchar() falls over if the database encoding is unsupported, and we use that to convert ICU "display names" for use as comments for the ICU collations. But that's not very much less wrongheaded, because it will allow non-ASCII characters into the initial database contents, which is absolutely not acceptable. We assume we can bit-copy the contents of template0 and it will be valid in any encoding. Therefore, I think the right thing to do is remove that test and change get_icu_locale_comment() so that it rejects non-ASCII text, making the encoding conversion trivial, as in the attached patch. On my Fedora 25 laptop, the only collations that go without a comment in this approach are the "nb" ones (Norwegian Bokmål). As I recall, that locale is a second-class citizen for other reasons already, precisely because of its loony insistence on a non-ASCII name even when we're asking for an Anglicized version. I'm inclined to add a test to reject non-ASCII in the ICU locale names as well as the comments. We've had to do that for libc locale names, and this experience shows that the ICU locale maintainers don't have their heads screwed on any straighter. But this patch doesn't do that. 
regards, tom lane diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c index 1c43f0b..6febe63 100644 *** a/src/backend/commands/collationcmds.c --- b/src/backend/commands/collationcmds.c *** get_icu_language_tag(const char *localen *** 431,437 /* * Get a comment (specifically, the display name) for an ICU locale. ! * The result is a palloc'd string. */ static char * get_icu_locale_comment(const char *localename) --- 431,439 /* * Get a comment (specifically, the display name) for an ICU locale. ! * The result is a palloc'd string, or NULL if we can't get a comment ! * or find that it's not all ASCII. (We can *not* accept non-ASCII ! * comments, because the contents of template0 must be encoding-agnostic.) */ static char * get_icu_locale_comment(const char *localename) *** get_icu_locale_comment(const char *local *** 439,444 --- 441,447 UErrorCode status; UChar displayname[128]; int32 len_uchar; + int32 i; char *result; status = U_ZERO_ERROR; *** get_icu_locale_comment(const char *local *** 446,456 displayname, lengthof(displayname), ); if (U_FAILURE(status)) ! ereport(ERROR, ! (errmsg("could not get display name for locale \"%s\": %s", ! localename, u_errorName(status; ! icu_from_uchar(, displayname, len_uchar); return result; } --- 449,468 displayname, lengthof(displayname), ); if (U_FAILURE(status)) ! return NULL; /* no good reason to raise an error */ ! /* Check for non-ASCII comment */ ! for (i = 0; i < len_uchar; i++) ! { ! if (displayname[i] > 127) ! return NULL; ! } ! ! /* OK, transcribe */ ! result = palloc(len_uchar + 1); ! for (i = 0; i < len_uchar; i++) ! result[i] = displayname[i]; ! 
result[len_uchar] = '\0'; return result; } *** pg_import_system_collations(PG_FUNCTION_ *** 642,655 /* Load collations known to ICU */ #ifdef USE_ICU - if (!is_encoding_supported_by_icu(GetDatabaseEncoding())) - { - ereport(NOTICE, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("encoding \"%s\" not supported by ICU", - pg_encoding_to_char(GetDatabaseEncoding(); - } - else { int i; --- 654,659 *** pg_import_system_collations(PG_FUNCTION_ *** 661,666 --- 665,671 { const char *name; char *langtag; + char *icucomment; const char *collcollate; UEnumeration *en; UErrorCode status; *** pg_import_system_collations(PG_FUNCTION_ *** 686,693 CommandCounterIncrement(); ! CreateComments(collid, CollationRelationId, 0, ! get_icu_locale_comment(name)); } /* --- 691,700 CommandCounterIncrement(); ! icucomment = get_icu_locale_comment(name); ! if (icucomment) ! CreateComments(collid, CollationRelationId, 0, ! icucomment); } /* *** pg_import_system_collations(PG_FUNCTION_ *** 720,727 CommandCounterIncrement(); ! CreateComments(collid,
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
On 2017-06-23 13:26:58 -0400, Alvaro Herrera wrote: > Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have > some elog(DEBUG1) calls with the slot spinlock held. That's probably > uncool. It definitely is not cool, rather daft even (it's probably me who wrote that). No idea why I've done it like that, makes no sense. Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
On 6/23/17 13:26, Alvaro Herrera wrote: > Albe Laurenz wrote: >> Peter Eisentraut wrote: >>> On 6/21/17 09:02, Albe Laurenz wrote: 2017-06-21 14:55:12.033 CEST [23124] LOG: could not send data to client: Broken pipe 2017-06-21 14:55:12.033 CEST [23124] FATAL: connection to client lost 2017-06-21 14:55:17.032 CEST [23133] LOG: logical replication apply worker for subscription "reprec" has started DEBUG: received replication command: IDENTIFY_SYSTEM DEBUG: received replication command: START_REPLICATION SLOT "reprec" LOGICAL 0/0 (proto_version '1', publication_names '"repsend"') 2017-06-21 14:57:24.552 CEST [23124] PANIC: stuck spinlock detected at ReplicationSlotRelease, slot.c:394 2017-06-21 14:57:24.885 CEST [23070] LOG: server process (PID 23124) was terminated by signal 6: Aborted 2017-06-21 14:57:24.885 CEST [23070] LOG: terminating any other active server processes 2017-06-21 14:57:24.887 CEST [23134] LOG: could not send data to client: Broken pipe 2017-06-21 14:57:24.890 CEST [23070] LOG: all server processes terminated; reinitializing >>> >>> I can't reproduce that. I let it loop around for about 10 minutes and >>> it was fine. >>> >>> I notice that you have some debug settings on. Could you share your >>> exact setup steps from initdb, as well as configure options, just in >>> case one of these settings is causing a problem? > > Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have > some elog(DEBUG1) calls with the slot spinlock held. That's probably > uncool. I can reproduce the issue with --client-min-messages=debug2 or debug1, but it doesn't appear with the default settings. I don't always get the "stuck spinlock" message, but it hangs badly pretty reliably after two or three rounds of erroring. Removing the call you pointed out doesn't make a difference, but it's possibly something similar. 
-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_terminate_backend can terminate background workers and autovacuum launchers
Yugo Nagata wrote: > I tried to make it. Patch attached. > > It is easy and simple. Although I haven't tried for autovacuum worker, > I confirmed I can change other process' parameters without affecting others. > Do you want this in PG? One advantage of this gadget is that you can signal backends that you own without being superuser, so +1 for the general idea. (Please do create a fresh thread so that this can be appropriately linked to in commitfest app, though.) You need a much bigger patch for the autovacuum worker. The use case I had in mind is that you have a maintenance window and can run fast vacuuming during it, but if the table is large and vacuum can't finish within that window then you want vacuum to slow down, without stopping it completely. But implementing this requires juggling the stdVacuumCostDelay/Limit values during the reload, which are currently read at start of vacuuming only; the working values are overwritten from those during a rebalance. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Same expression more than once in partition key
Peter Eisentraut writes: > We also allow the same column more than once in an index. We probably > don't have to be more strict here. There actually are valid uses for the same column more than once in an index, eg if you use a different operator class for each instance. I find it hard to envision a similar use-case in partitioning though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] REPLICA IDENTITY FULL
On 6/23/17 13:14, Alvaro Herrera wrote: > Andres Freund wrote: >> On 2017-06-23 13:05:21 -0400, Alvaro Herrera wrote: >>> Tom Lane wrote: Peter Eisentraut writes: > Any thoughts about keeping datumAsEqual() as a first check? I did some > light performance tests, but it was inconclusive. Seems like it would tend to be a win if, in fact, the values are usually equal. If they're usually not, then it's a loser. Do we have any feeling for which case is more common? >> >> Seems like a premature optimization to me - if you care about >> performance and do this frequently, you're not going to end up using >> FULL. If we want to performance optimize, it'd probably better to >> lookup candidate keys and use those if available. > > I can get behind that argument. Thanks for the feedback. I have committed it after removing the datumIsEqual() call. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shift_sjis_2004 related autority files are remaining
On 6/23/17 11:15, Tatsuo Ishii wrote: >> For clarity, I personally perfer to keep all the source text file >> in the repository, especially so that we can detect changes of >> them. But since we decide that at least most of them not to be >> there (from a reason of license), I just don't see a reason to >> keep only the rest even without the restriction. > > So are you saying that if n/m of authority files are not kept because > of license issue, then m-n authority files should not be kept as well? > What's the benefit for us by doing so? I don't have a clear opinion on this particular issue, but I think we should have clarity on why particular files or code exist. So why do these files exist and the others don't? Is it just the license? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Same expression more than once in partition key
On 6/23/17 09:54, Tom Lane wrote: >> However, I can use same expression more than once in partition key. > > ... and so, I can't get excited about extending it to expressions. > Where would you stop --- for example, are i and (i) equal? What > about (i) and (case when true then i else j end)? In principle, > the system could recognize both those identities, but it seems > silly to expend code on it just for nagware purposes. We also allow the same column more than once in an index. We probably don't have to be more strict here. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgjdbc logical replication client throwing exception
On 6/23/17 06:11, sanyam jain wrote: > I'm trying to create something like pg_recvlogical.c in java using > pgjdbc.Its working for small transactions but when i do a big > transaction client throws below exception: What does the server log say? If nothing interesting, turn up debugging. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Can ICU be used for a database's default sort order?
On 6/22/17 23:10, Peter Geoghegan wrote: > On Thu, Jun 22, 2017 at 7:10 PM, Tom Lane wrote: >> Is there some way I'm missing, or is this just a not-done-yet feature? > > It's a not-done-yet feature. It's something I hope to address soon. The main definitional challenge is how to associate a pg_database entry with a collation. What we currently effectively do is duplicate the fields of pg_collation in pg_database. But I imagine over time we'll add more properties in pg_collation, along with additional ALTER COLLATION commands etc., so duplicating all of that would be a significant amount of code complication and result in a puzzling user interface. Ideally, I'd like to see CREATE DATABASE ... COLLATION "foo". But the problem is of course that collations are per-database objects. Possible solutions: 1) Associate by name only. That is, you can create a database with any COLLATION "foo" that you want, and it's only checked when you first connect to or do anything in the database. 2) Create shared collations. Then we'd need a way to manage having a mix of shared and non-shared collations around. There are significant pros and cons to all of these ideas. Some people I talked to appeared to prefer the shared collations approach. Other ideas? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix a typo in snapmgr.c
On 2017-06-23 19:21:57 +0100, Simon Riggs wrote: > On 23 June 2017 at 08:23, Simon Riggs wrote: > > On 23 June 2017 at 08:21, Andres Freund wrote: > >> On 2017-06-07 10:17:31 -0700, Andres Freund wrote: > >>> On 2017-05-08 09:12:13 -0400, Tom Lane wrote: > >>> > Simon Riggs writes: > >>> > > So rearranged code a little to keep it lean. > >>> > > >>> > Didn't you break it with that? As it now stands, the memcpy will > >>> > copy the nonzero value. > >>> > >>> I've not seen a fix and/or alleviating comment about this so far. Did I > >>> miss something? > >> > >> Simon, FWIW, I plan to either revert or fix this up soon-ish. Unless > >> you're going to actually respond on this thread? > > > > Sorry, I've only just seen Tom's reply. Will fix. > I don't see a bug. Perhaps I'm tired and can't see it yet. > Will fix if you thwack me with the explanation. Wasn't my complaint, but here we go: Previous code: /* * Ignore the SubXID array if it has overflowed, unless the snapshot was * taken during recovey - in that case, top-level XIDs are in subxip as * well, and we mustn't lose them. */ if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery) serialized_snapshot.subxcnt = 0; /* Copy struct to possibly-unaligned buffer */ memcpy(start_address, &serialized_snapshot, sizeof(SerializedSnapshotData)); i.e. if suboverflowed, start_address would contain subxcnt = 0. New code: /* Copy struct to possibly-unaligned buffer */ memcpy(start_address, &serialized_snapshot, sizeof(SerializedSnapshotData)); /* Copy XID array */ if (snapshot->xcnt > 0) memcpy((TransactionId *) (start_address + sizeof(SerializedSnapshotData)), snapshot->xip, snapshot->xcnt * sizeof(TransactionId)); /* * Copy SubXID array. Don't bother to copy it if it had overflowed, * though, because it's not used anywhere in that case. Except if it's a * snapshot taken during recovery; all the top-level XIDs are in subxip as * well in that case, so we mustn't lose them. 
*/ if (serialized_snapshot.suboverflowed && !snapshot->takenDuringRecovery) serialized_snapshot.subxcnt = 0; Here the copy is done before subxcnt = 0. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix a typo in snapmgr.c
On 23 June 2017 at 08:23, Simon Riggs wrote: > On 23 June 2017 at 08:21, Andres Freund wrote: >> On 2017-06-07 10:17:31 -0700, Andres Freund wrote: >>> On 2017-05-08 09:12:13 -0400, Tom Lane wrote: >>> > Simon Riggs writes: >>> > > So rearranged code a little to keep it lean. >>> > >>> > Didn't you break it with that? As it now stands, the memcpy will >>> > copy the nonzero value. >>> >>> I've not seen a fix and/or alleviating comment about this so far. Did I >>> miss something? >> >> Simon, FWIW, I plan to either revert or fix this up soon-ish. Unless >> you're going to actually respond on this thread? > > Sorry, I've only just seen Tom's reply. Will fix. I don't see a bug. Perhaps I'm tired and can't see it yet. Will fix if you thwack me with the explanation. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Small bug in replication lag tracking
On 23 June 2017 at 08:18, Simon Riggs wrote: > On 23 June 2017 at 06:45, Thomas Munro wrote: > >> I discovered a thinko in the new replication lag interpolation code >> that can cause a strange number to be reported occasionally. > > Thanks, will review. I've pushed this, but I think we should leave the code alone after this and wait for user feedback. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Code quality issues in ICU patch
On Fri, Jun 23, 2017 at 12:31:40PM -0400, Tom Lane wrote: > icu_to_uchar() and icu_from_uchar(), and perhaps other places, are > touchingly naive about integer overflow hazards in buffer size > calculations. I call particular attention to this bit in > icu_from_uchar(): > > len_result = UCNV_GET_MAX_BYTES_FOR_STRING(len_uchar, > ucnv_getMaxCharSize(icu_converter)); > > The ICU man pages say that that macro is defined as > > #define UCNV_GET_MAX_BYTES_FOR_STRING(length, maxCharSize) > (((int32_t)(length)+10)*(int32_t)(maxCharSize)) > > which means that getting this to overflow (resulting in > probably-exploitable memory overruns) would be about as hard as > taking candy from a baby. So it kicks off really loud and persistent alarms, and isn't as easy as you thought, even taking this into account? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Incorrect documentation about pg_stat_activity
On 21 June 2017 at 15:18, Simon Riggs wrote: > On 21 June 2017 at 16:15, Yugo Nagata wrote: >> On Wed, 21 Jun 2017 19:08:35 +0530 >> Kuntal Ghosh wrote: >> >>> On Wed, Jun 21, 2017 at 6:05 PM, Yugo Nagata wrote: >>> > >>> > Attached is a patch for the documentation fix. >>> > >>> Please attach the patch as well. :-) >> >> I'm sorry, I forgot it. I attahed this. > > Thanks, will apply. Done, thanks -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
Albe Laurenz wrote: > Peter Eisentraut wrote: > > On 6/21/17 09:02, Albe Laurenz wrote: > >> 2017-06-21 14:55:12.033 CEST [23124] LOG: could not send data to client: > >> Broken pipe > >> 2017-06-21 14:55:12.033 CEST [23124] FATAL: connection to client lost > >> 2017-06-21 14:55:17.032 CEST [23133] LOG: logical replication apply > >> worker for subscription "reprec" has started > >> DEBUG: received replication command: IDENTIFY_SYSTEM > >> DEBUG: received replication command: START_REPLICATION SLOT "reprec" > >> LOGICAL 0/0 (proto_version '1', publication_names '"repsend"') > >> 2017-06-21 14:57:24.552 CEST [23124] PANIC: stuck spinlock detected at > >> ReplicationSlotRelease, slot.c:394 > >> 2017-06-21 14:57:24.885 CEST [23070] LOG: server process (PID 23124) was > >> terminated by signal 6: Aborted > >> 2017-06-21 14:57:24.885 CEST [23070] LOG: terminating any other active > >> server processes > >> 2017-06-21 14:57:24.887 CEST [23134] LOG: could not send data to client: > >> Broken pipe > >> 2017-06-21 14:57:24.890 CEST [23070] LOG: all server processes > >> terminated; reinitializing > > > > I can't reproduce that. I let it loop around for about 10 minutes and > > it was fine. > > > > I notice that you have some debug settings on. Could you share your > > exact setup steps from initdb, as well as configure options, just in > > case one of these settings is causing a problem? Hmm, so for instance in LogicalIncreaseRestartDecodingForSlot() we have some elog(DEBUG1) calls with the slot spinlock held. That's probably uncool. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] REPLICA IDENTITY FULL
Andres Freund wrote: > On 2017-06-23 13:05:21 -0400, Alvaro Herrera wrote: > > Tom Lane wrote: > > > Peter Eisentraut writes: > > > > Any thoughts about keeping datumAsEqual() as a first check? I did some > > > > light performance tests, but it was inconclusive. > > > > > > Seems like it would tend to be a win if, in fact, the values are > > > usually equal. If they're usually not, then it's a loser. Do > > > we have any feeling for which case is more common? > > Seems like a premature optimization to me - if you care about > performance and do this frequently, you're not going to end up using > FULL. If we want to performance optimize, it'd probably better to > lookup candidate keys and use those if available. I can get behind that argument. > > who in their right minds would use floating point columns as part of > > replica identity ...? > > Since this is FULL, it'll be all columns... Yeah, I was thinking you shouldn't have floating point columns if you're going to use FULL as identity. But you're tautologically right: doing the wrong thing is likely not the right thing to do. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] REPLICA IDENTITY FULL
On 2017-06-23 13:05:21 -0400, Alvaro Herrera wrote: > Tom Lane wrote: > > Peter Eisentrautwrites: > > > Any thoughts about keeping datumAsEqual() as a first check? I did some > > > light performance tests, but it was inconclusive. > > > > Seems like it would tend to be a win if, in fact, the values are > > usually equal. If they're usually not, then it's a loser. Do > > we have any feeling for which case is more common? Seems like a premature optimization to me - if you care about performance and do this frequently, you're not going to end up using FULL. If we want to performance optimize, it'd probably better to lookup candidate keys and use those if available. > Though, thinking about it, maybe the datumIsEqual test would give the > wrong answer for floating point values, and there'd be no fallback to > equality with the logic I propose. But then maybe that's all > right --- I don't think it'd be ok, we shouldn't just do the wrong thing because we think it's unlikely to happen. > who in their right minds would use floating point columns as part of > replica identity ...? Since this is FULL, it'll be all columns... - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] REPLICA IDENTITY FULL
Tom Lane wrote:
> Peter Eisentraut writes:
> > Any thoughts about keeping datumIsEqual() as a first check?  I did some
> > light performance tests, but it was inconclusive.
>
> Seems like it would tend to be a win if, in fact, the values are
> usually equal.  If they're usually not, then it's a loser.  Do
> we have any feeling for which case is more common?

What about keeping the datumIsEqual test for fixed-length pass-by-value
types (I'm mostly thinking about fixed-width integers here ...) and
always using the full-blown equality comparison for anything more
elaborate than that?

Though, thinking about it, maybe the datumIsEqual test would give the
wrong answer for floating point values, and there'd be no fallback to
equality with the logic I propose.  But then maybe that's all right ---
who in their right minds would use floating point columns as part of
replica identity ...?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Setting pd_lower in GIN metapage
Masahiko Sawada wrote:
> On Fri, Jun 23, 2017 at 3:31 PM, Michael Paquier wrote:
> > On Fri, Jun 23, 2017 at 3:17 PM, Amit Langote wrote:
> >> Initially, I had naively set wal_consistency_check = all before running
> >> make installcheck and then had to wait for a long time to confirm that WAL
> >> generated by the gin test indeed caused consistency check failure on the
> >> standby with the v1 patch.
> >
> > wal_consistency_check = gin would have saved you a lot of I/O.
> >
> >> But I can see Sawada-san's point that there should be some way for
> >> developers writing code that better had gone through WAL consistency
> >> checking facility to do it without much hassle.  But then again, it may
> >> not be that frequent to need that.
>
> Yeah, it should be optional.  I imagined providing such an option of
> pg_regress or TAP test for the developers.

As far as I know it is possible to have third-party modules that extend
the buildfarm client script so that it runs additional tests beyond the
standard ones.  You could have a custom module installed on some
powerful machine of yours that runs the WAL consistency check and
reports the results to the buildfarm.  A single animal running that test
should be enough, right?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] Code quality issues in ICU patch
icu_to_uchar() and icu_from_uchar(), and perhaps other places, are
touchingly naive about integer overflow hazards in buffer size
calculations.  I call particular attention to this bit in
icu_from_uchar():

	len_result = UCNV_GET_MAX_BYTES_FOR_STRING(len_uchar,
					ucnv_getMaxCharSize(icu_converter));

The ICU man pages say that that macro is defined as

	#define UCNV_GET_MAX_BYTES_FOR_STRING(length, maxCharSize) \
		(((int32_t)(length)+10)*(int32_t)(maxCharSize))

which means that getting this to overflow (resulting in
probably-exploitable memory overruns) would be about as hard as taking
candy from a baby.

I also notice that the general approach to handling ICU-reported error
conditions is like

	if (U_FAILURE(status))
		ereport(ERROR,
				(errmsg("ucnv_fromUChars failed: %s", u_errorName(status))));

This lacks an errcode() setting, which is contrary to project policy,
and the error message violates our message style guidelines.

I don't particularly feel like fixing these things myself, but somebody
needs to; the overflow issues in particular are stop-ship security
hazards.

			regards, tom lane
Re: [HACKERS] Broken hint bits (freeze)
Sergey Burladyan writes:
> 1. create master
> 2. create standby from it
> 3. create unlogged table and hash index like:
>  create unlogged table test (id int primary key, v text);
>  create index on test using hash (id);
> 3. stop master
> 4. promote standby
>
> now, if you try to upgrade this new promoted master pg_upgrade will stop
> on this hash index:
> error while creating link for relation "public.test_id_idx"
> ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or
> directory
> Failure, exiting
>
> I touch this file (s/9.2/base/16384/16393) and rerun pg_upgrade from
> scratch and it complete successfully.

Missed test script for it.

-- 
Sergey Burladyan

test_hash.sh
Description: Bourne shell script
Re: [HACKERS] Broken hint bits (freeze)
Bruce Momjian writes:
> On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
> > On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila wrote:
> > > Hmm.  I think we need something that works with lesser effort because
> > > not all users will be as knowledgeable as you are, so if they make any
> > > mistakes in copying the file manually, it can lead to problems.  How
> > > about issuing a notification (XLogArchiveNotifySeg) in shutdown
> > > checkpoint if archiving is enabled?
> >
> > I have thought more about the above solution and it seems risky to
> > notify archiver for incomplete WAL segments (which will be possible in
> > this case as there is no guarantee that Checkpoint record will fill
> > the segment).  So, it seems to me we should update the document unless
> > you or someone has some solution to this problem.
>
> As far as I know this is the only remaining open issue.  Sergey, please
> verify.  I appreciate the work everyone has done to improve this, and
> all the existing fixes have been pushed to all supported branches.  :-)

Yes, thank you all for your help!

Yes, this is the last issue with checkpoints that I know of: how to
ensure that the standby syncs all shared buffers to disk on shutdown.
I am thinking about enforcing a restartpoint on shutdown, like:

src/backend/access/transam/xlog.c
- 8639    if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
- 8640        XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
- 8641    {
+ 8639    if ( !(flags & CHECKPOINT_IS_SHUTDOWN) && (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+ 8640        XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo) )
+ 8641    {

But I still have not read the source and am not sure about this solution.

PS: I successfully upgraded last night from 9.2 to 9.4 and found another
issue :-) It is about hash indexes and promote:
1. create master
2. create standby from it
3. create unlogged table and hash index like:
 create unlogged table test (id int primary key, v text);
 create index on test using hash (id);
3. stop master
4. promote standby

now, if you try to upgrade this new promoted master pg_upgrade will stop
on this hash index:
error while creating link for relation "public.test_id_idx"
("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or
directory
Failure, exiting

I touch this file (s/9.2/base/16384/16393) and rerun pg_upgrade from
scratch and it completes successfully.

-- 
Sergey Burladyan
Re: [HACKERS] shift_sjis_2004 related autority files are remaining
> For clarity, I personally prefer to keep all the source text files
> in the repository, especially so that we can detect changes to
> them.  But since we decided that at least most of them are not to be
> there (for license reasons), I just don't see a reason to
> keep only the rest even without the restriction.

So are you saying that if n out of m authority files are not kept
because of license issues, then the remaining m-n authority files should
not be kept either?  What's the benefit for us in doing so?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Same expression more than once in partition key
On Fri, 23 Jun 2017 09:54:17 -0400
Tom Lane wrote:

> Yugo Nagata writes:
> > When we create a range partitioned table, we cannot use
> > a column more than once in the partition key.
>
> > postgres=# create table t (i int) partition by range(i,i);
> > ERROR:  column "i" appears more than once in partition key
>
> AFAIK, that's just a simple test for obvious mistakes; it's not a
> property that's essential for correctness.

Thanks.  This is what I wanted to know.

> > However, I can use the same expression more than once in the partition key.
>
> ... and so, I can't get excited about extending it to expressions.
> Where would you stop --- for example, are i and (i) equal?  What
> about (i) and (case when true then i else j end)?  In principle,
> the system could recognize both those identities, but it seems
> silly to expend code on it just for nagware purposes.

Agreed.

-- 
Yugo Nagata
Re: [HACKERS] pg_terminate_backend can terminate background workers and autovacuum launchers
On Thu, 22 Jun 2017 14:08:30 +0900
Michael Paquier wrote:

> On Thu, Jun 22, 2017 at 1:52 PM, Yugo Nagata wrote:
> > On Thu, 22 Jun 2017 12:05:19 +0900
> > Michael Paquier wrote:
> >> signal-able).  Different thought, but I'd love to see a SQL function
> >> that allows triggering SIGHUP on a specific process, like an
> >> autovacuum worker to change its cost parameters.  There is
> >> pg_reload_conf() to do so but that's system-wide.
> >
> > For example, is that like this?
> >
> > =# alter system set autovacuum_vacuum_cost_delay to 10;
> > =# select pg_reload_conf();
> > =# alter system reset autovacuum_vacuum_cost_delay;
>
> No need to reset the parameter afterwards as this makes it sensible to
> updates afterwards, but you have the idea.  Note that this is rather
> recent, as autovacuum listens to SIGHUP only since a75fb9b3.

I tried to make it.  Patch attached.  It is easy and simple.

Although I haven't tried it for an autovacuum worker, I confirmed I can
change another process' parameters without affecting others.  Do you
want this in PG?

[session A (PID:18375)]
=# show autovacuum_vacuum_cost_delay;
 autovacuum_vacuum_cost_delay
------------------------------
 20ms
(1 row)

[session B]
postgres=# alter system set autovacuum_vacuum_cost_delay to 10;
ALTER SYSTEM
postgres=# select pg_reload_conf(18375);
 pg_reload_conf
----------------
 t
(1 row)

postgres=# show autovacuum_vacuum_cost_delay;
 autovacuum_vacuum_cost_delay
------------------------------
 20ms
(1 row)

[session A again]
postgres=# show autovacuum_vacuum_cost_delay;
 autovacuum_vacuum_cost_delay
------------------------------
 10ms
(1 row)

> --
> Michael

-- 
Yugo Nagata

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0fdad0c..d79faf7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1128,6 +1128,7 @@ REVOKE EXECUTE ON FUNCTION pg_wal_replay_pause() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_wal_replay_resume() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_rotate_logfile() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_reload_conf() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_reload_conf(int) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_current_logfile(text) FROM public;
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 62341b8..cc173ba 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -339,6 +339,36 @@ pg_reload_conf(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(true);
 }
 
+/*
+ * Signal to reload the database configuration for a specified backend
+ *
+ * Permission checking for this function is managed through the normal
+ * GRANT system.
+ */
+Datum
+pg_reload_conf_pid(PG_FUNCTION_ARGS)
+{
+	int			pid = PG_GETARG_INT32(0);
+	int			r = pg_signal_backend(pid, SIGHUP);
+
+	if (r == SIGNAL_BACKEND_NOSUPERUSER)
+		ereport(WARNING,
+				(errmsg("must be a superuser to terminate superuser process")));
+
+	if (r == SIGNAL_BACKEND_NOPERMISSION)
+		ereport(WARNING,
+				(errmsg("must be a member of the role whose process is being terminated or member of pg_signal_backend")));
+
+	if (r)
+	{
+		ereport(WARNING,
+				(errmsg("failed to send signal to backend: %d", pid)));
+		PG_RETURN_BOOL(false);
+	}
+
+	PG_RETURN_BOOL(true);
+}
+
 /*
  * Rotate log file
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 1191b4a..7258f15 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3255,6 +3255,8 @@ DESCR("true if wal replay is paused");
 DATA(insert OID = 2621 (  pg_reload_conf  PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 16 "" _null_ _null_ _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ ));
 DESCR("reload configuration files");
+DATA(insert OID = 3409 (  pg_reload_conf  PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_reload_conf_pid _null_ _null_ _null_ ));
+DESCR("reload configuration files");
 DATA(insert OID = 2622 (  pg_rotate_logfile  PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 16 "" _null_ _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ ));
 DESCR("rotate log file");
 DATA(insert OID = 3800 (  pg_current_logfile  PGNSP PGUID 12 1 0 0 0 f f f f f f v s 0 0 25 "" _null_ _null_ _null_ _null_ _null_ pg_current_logfile _null_ _null_ _null_ ));

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Broken hint bits (freeze)
On Fri, Jun 23, 2017 at 08:10:17AM +0530, Amit Kapila wrote:
> On Wed, Jun 21, 2017 at 10:03 PM, Bruce Momjian wrote:
> > On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
> >> On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila wrote:
> >> > Hmm.  I think we need something that works with lesser effort because
> >> > not all users will be as knowledgeable as you are, so if they make any
> >> > mistakes in copying the file manually, it can lead to problems.  How
> >> > about issuing a notification (XLogArchiveNotifySeg) in shutdown
> >> > checkpoint if archiving is enabled?
> >>
> >> I have thought more about the above solution and it seems risky to
> >> notify archiver for incomplete WAL segments (which will be possible in
> >> this case as there is no guarantee that Checkpoint record will fill
> >> the segment).  So, it seems to me we should update the document unless
> >> you or someone has some solution to this problem.
> >
> > The over-arching question is how do we tell users to verify that the WAL
> > has been replayed on the standby?  I am thinking we would say that for
> > streaming replication, the "Latest checkpoint location" should match on
> > the primary and standby, while for log shipping, the standbys should be
> > exactly one WAL file less than the primary.
>
> I am not sure if we can say "standbys should be exactly one WAL file
> less than the primary" because checkpoint can create few more WAL
> segments for future use.  I think to make this work user needs to
> carefully just copy the next WAL segment (next to the last file in
> standby) which will contain checkpoint record.  Ideally, there should
> be some way either in form of a tool or a functionality in the
> database server with which this last file can be copied but I think in
> the absence of that we can at least document this fact.

I was not clear.  I was not saying there can be only one extra WAL file.
I am saying the "Latest checkpoint location" should be one WAL file
farther on the master.

I think the big problem is that we need a full replay of that WAL file,
not just having it one less than the master.  I have no idea how to
explain that.  It is easy for streaming replication since the "Latest
checkpoint location" should match, which is simple.

Also, we need something that can be backpatched.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Re: [HACKERS] Pluggable storage
Tomas Vondra writes:
> It would be really great if you could explain why BitmapScans are
> dubious, instead of just labeling them as dubious. (I guess you mean
> Bitmap Heap Scans, right?)

The two things I'm aware of are (a) the assumption you noted, that
fetching tuples in TID sort order is a reasonably efficient thing, and
(b) the "offset" part of a TID can't exceed MaxHeapTuplesPerPage ---
see data structure in tidbitmap.c.  The latter issue means that you
don't really have a full six bytes to play with in a TID, only about
five.

I don't think (b) would be terribly hard to fix if we had a motivation
to, but I wonder whether there aren't other places that also know this
about TIDs.

			regards, tom lane
Re: [HACKERS] REPLICA IDENTITY FULL
Peter Eisentraut writes:
> Any thoughts about keeping datumIsEqual() as a first check?  I did some
> light performance tests, but it was inconclusive.

Seems like it would tend to be a win if, in fact, the values are
usually equal.  If they're usually not, then it's a loser.  Do
we have any feeling for which case is more common?

			regards, tom lane
Re: [HACKERS] Pluggable storage
Hi,

On 6/23/17 10:38 AM, Teodor Sigaev wrote:
> > 1. Table AM with a 6-byte TID.
> > 2. Table AM with a custom locator format, which could be TID-like.
> > 3. Table AM with no locators.
>
> Currently TID has its own type in system catalog.  Seems, it's possible
> that storage claims type of TID which it uses.  Then, AM could claim it
> too, so the core based on that information could solve the question
> about AM-storage compatibility.  Storage could also claim that it hasn't
> TID type at all so it couldn't be used with any access method, use case:
> compressed column oriented storage.

Isn't the fact that TID is an existing type defined in system catalogs a
fairly insignificant detail?  I mean, we could just as easily define a
new 64-bit locator data type, and use that instead, for example.

The main issue here is that we assume things about the TID contents,
i.e. that it contains page/offset etc.  And Bitmap nodes rely on that,
to some extent - e.g. when prefetching data.

> As I remember, only GIN depends on TID format, other indexes use it as
> opaque type.  Except, at least, btree and GiST - they believe that
> internal pointers are the same as outer (to heap).

I think you're probably right - GIN does compress the posting lists by
exploiting the TID redundancy (that it's a page/offset structure), and I
don't think there are other index AMs doing that.

But I'm not sure we can simply rely on that - it's possible people will
try to improve other index types (e.g. by adding similar compression to
other index types).  Moreover we now support extensions defining custom
index AMs, and we have no clue what people may do in those.  So this
would clearly require some sort of flag for each index AM.

> Another dubious part - BitmapScan.

It would be really great if you could explain why BitmapScans are
dubious, instead of just labeling them as dubious.  (I guess you mean
Bitmap Heap Scans, right?)

I see no conceptual issue with bitmap scans on arbitrary locator types,
as long as there's sufficient locality information encoded in the value.
What I mean by that is that for any two locator values A and B:

(1) if (A < B) then (A is stored before B)

(2) if (A is close to B) then (A is stored close to B)

Without these features it's probably futile to try to do bitmap scans,
because the bitmap would not result in a mostly sequential access
pattern, and things like prefetch would not be very efficient, I think.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Same expression more than once in partition key
Yugo Nagata writes:
> When we create a range partitioned table, we cannot use
> a column more than once in the partition key.

> postgres=# create table t (i int) partition by range(i,i);
> ERROR:  column "i" appears more than once in partition key

AFAIK, that's just a simple test for obvious mistakes; it's not a
property that's essential for correctness.

> However, I can use the same expression more than once in the partition key.

... and so, I can't get excited about extending it to expressions.
Where would you stop --- for example, are i and (i) equal?  What
about (i) and (case when true then i else j end)?  In principle,
the system could recognize both those identities, but it seems
silly to expend code on it just for nagware purposes.

			regards, tom lane
Re: [HACKERS] transition table behavior with inheritance appears broken
> "Noah" == Noah Misch writes:

 Noah> This PostgreSQL 10 open item is past due for your status update.
 Noah> Kindly send a status update within 24 hours,

Oops, sorry!  I forgot to include a date in the last one, and in fact a
personal matter delayed things anyway.  I expect to have this wrapped up
by 23:59 BST on the 24th.

-- 
Andrew.
[HACKERS] Removing deprecation changing automatically type of opaque functions
Hi all,

Per bug #14706, there has been discussion of removing the transformation
of opaque functions that happens, for example, when declaring a trigger:
https://www.postgresql.org/message-id/cab7npqtc-7u1k2vz3qlxvsyaxbvncrc9zmuvyf4-wuzj8zf...@mail.gmail.com

As far as I can see, the deprecation has been present since 2002 via
commit a2a3192, and has been around to allow the restore of old dump
files.  On HEAD pg_dump supports servers down to 8.0, which was released
in 2005, so there are already a couple of years in-between.

This is of course not a topic for 10, just for 11 when its development
opens.

Thoughts?
-- 
Michael
Re: [HACKERS] REPLICA IDENTITY FULL
On 6/20/17 00:10, Andres Freund wrote:
> On 2017-06-20 11:52:10 +0900, Tatsuo Ishii wrote:
>> If my understanding is correct, it would not be easy to fix, no?
>>
>>> We might be able to refine that, but there is a general problem that
>>> without an index and an operator class, we are just doing our random
>>> best to match the values.
>
> I don't see the problem as being big.  We should just look up the
> default btree opclass and use the relevant operator.  That's how a
> number of things already work.

Patch for that.

Any thoughts about keeping datumIsEqual() as a first check?  I did some
light performance tests, but it was inconclusive.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 00ea950753c070d440818c71d06e2bbc53b6c294 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Fri, 23 Jun 2017 08:55:13 -0400
Subject: [PATCH] Fix replication with replica identity full

The comparison with the target rows on the subscriber side was done
with datumIsEqual(), which can have false negatives.  For instance, it
didn't work reliably for text columns.  So use the equality operator
provided by the type cache as fallback.

Also add more user documentation about replica identity requirements.

Reported-by: Tatsuo Ishii
---
 doc/src/sgml/logical-replication.sgml      | 28 +++-
 src/backend/executor/execReplication.c     | 22 +-
 src/test/subscription/t/001_rep_changes.pl | 23 ---
 3 files changed, 64 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 92ec175af1..fa8ae536d9 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -110,11 +110,29 @@ Publication
 Publications can choose to limit the changes they produce to any combination of INSERT, UPDATE, and DELETE, similar to how triggers are fired by
- particular event types.
 If a table without a REPLICA
- IDENTITY is added to a publication that
- replicates UPDATE or DELETE
- operations then subsequent UPDATE
- or DELETE operations will fail on the publisher.
+ particular event types. By default, all operation types are replicated.
+
+
+
+ A published table must have a replica identity configured in
+ order to be able to replicate UPDATE
+ and DELETE operations, so that appropriate rows to
+ update or delete can be identified on the subscriber side. By default,
+ this is the primary key, if there is one. Another unique index (with
+ certain additional requirements) can also be set to be the replica
+ identity. If the table does not have any suitable key, then it can be set
+ to replica identity full, which means the entire row becomes
+ the key. This, however, is very inefficient and should only be used as a
+ fallback if no other solution is possible. If a replica identity other
+ than full is set on the publisher side, a replica identity
+ comprising the same or fewer columns must also be set on the subscriber
+ side. See for details on
+ how to set the replica identity. If a table without a replica identity is
+ added to a publication that replicates UPDATE
+ or DELETE operations then
+ subsequent UPDATE or DELETE
+ operations will cause an error on the publisher. INSERT
+ operations can proceed regardless of any replica identity.
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 6dae79a8f0..97da47a1e2 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -24,12 +24,14 @@
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
+#include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/typcache.h"
 #include "utils/tqual.h"
@@ -245,9 +247,27 @@ tuple_equals_slot(TupleDesc desc, HeapTuple tup, TupleTableSlot *slot)
 			continue;
 		att = desc->attrs[attrnum];
+
+		/*
+		 * To compare for equality, first try datumIsEqual().  If that returns
+		 * false, try the equality operator.
+		 */
 		if (!datumIsEqual(values[attrnum], slot->tts_values[attrnum],
 						  att->attbyval, att->attlen))
-			return false;
+		{
+			TypeCacheEntry *typentry;
+
+			typentry = lookup_type_cache(att->atttypid, TYPECACHE_EQ_OPR_FINFO);
+			if (!OidIsValid(typentry->eq_opr_finfo.fn_oid))
+				ereport(ERROR,
+
Re: [HACKERS] Causal reads take II
On Wed, May 24, 2017 at 3:58 PM, Thomas Munro wrote:
>> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>> I'm wondering about status of this patch and how can I try it out?
>
> I ran into a problem while doing this, and it may take a couple more
> days to fix it since I am at pgcon this week.  More soon.

Apologies for the extended delay.  Here is the rebased patch, now with a
couple of improvements (see below).

To recap, this is the third part of the original patch series[1], which
had these components:

1.  synchronous_commit = remote_apply, committed in PostgreSQL 9.6
2.  replication lag tracking, committed in PostgreSQL 10
3.  causal_reads, the remaining part, hereby proposed for PostgreSQL 11

The goal is to allow applications to move arbitrary read-only
transactions to physical replica databases and still know that they can
see all preceding write transactions or get an error.  It's something
like regular synchronous replication with synchronous_commit =
remote_apply, except that it limits the impact on the primary and
handles failure transitions with defined semantics.

The inspiration for this kind of distributed read-follows-write
consistency using read leases was a system called Comdb2[2][3], whose
designer encouraged me to try to extend Postgres's streaming replication
to do something similar.  Read leases can also be found in some
consensus systems like Google Megastore, albeit in more ambitious form
IIUC.  The name is inspired by a MySQL Galera feature (approximately the
same feature, but the approach is completely different; Galera adds read
latency, whereas this patch does not).  Maybe it needs a better name.

Is this a feature that people want to see in PostgreSQL?

IMPROVEMENTS IN V17

The GUC to enable the feature is now called
"causal_reads_max_replay_lag".  Standbys listed in
causal_reads_standby_names whose pg_stat_replication.replay_lag doesn't
exceed that time are "available" for causal reads and will be waited for
by the primary when committing.  When they exceed that threshold they
are briefly in "revoking" state and then "unavailable", and when they
return to an acceptable level they are briefly in "joining" state before
reaching "available".  CR states appear in pg_stat_replication and
transitions are logged at LOG level.

A new GUC called "causal_reads_lease_time" controls the lifetime of read
leases sent from the primary to the standby.  This affects the frequency
of lease replacement messages, and more importantly affects the worst
case of commit stall that can be introduced if connectivity to a standby
is lost and we have to wait for the last sent lease to expire.

In the previous version, one single GUC controlled both maximum
tolerated replay lag and lease lifetime, which was good from the point
of view that fewer GUCs are better, but bad because it had to be set
fairly high when doing both jobs to be conservative about clock skew.
The lease lifetime must be at least 4 x maximum tolerable clock skew.
After the recent botching of a leap-second transition on a popular
public NTP network (TL;DR OpenNTP is not a good choice of implementation
to add to a public time server pool) I came to the conclusion that I
wouldn't want to recommend a default max clock skew under 1.25s, to
allow for some servers to be confused about leap seconds for a while or
to be running different smearing algorithms.  A reasonable
causal_reads_lease_time recommendation for people who don't know much
about the quality of their time source might therefore be 5s.  I think
it's reasonable to want to set the maximum tolerable replay lag to a
lower time than that, or in fact as low as you like, depending on your
workload and hardware.  Therefore I decided to split the old
"causal_reads_timeout" GUC into "causal_reads_max_replay_lag" and
"causal_reads_lease_time".

This new version introduces fast lease revocation.  Whenever the primary
decides that a standby is not keeping up, it kicks it out of the set of
CR-available standbys and revokes its lease, so that anyone trying to
run causal reads transactions there will start receiving a new error.
In the previous version, it always did that by blocking commits while
waiting for the most recently sent lease to expire, which I now call
"slow revocation" because it could take several seconds.  Now it blocks
commits only until the standby acknowledges that it is no longer
available for causal reads OR the lease expires: ideally that takes the
time of a network round trip.  Slow revocation is still needed in
various failure cases such as lost connectivity.

TESTING

Apply the patch after first applying a small bug fix for replication lag
tracking[4].  Then:

1.  Set up some streaming replicas.
2.  Stick causal_reads_max_replay_lag = 2s (or any time you like) in
    the primary's postgresql.conf.
3.  Set causal_reads = on in some transactions on various nodes.
4.  Try to break it!

As long as your
Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
On 06/23/2017 12:11 AM, Tom Lane wrote: > Andrew Dunstan writes: >> On 06/22/2017 10:24 AM, Tom Lane wrote: >>> That earlier request is still valid. Also, if you can reproduce the >>> symptom that lorikeet just showed and get a stack trace from the >>> (hypothetical) postmaster core dump, that would be hugely valuable. >> See attached log and stacktrace > The stacktrace seems to be from the parallel-query-leader backend. > Was there another core file? > > The lack of any indication in the postmaster log that the postmaster saw > the parallel worker's crash sure looks the same as lorikeet's failure. > But if the postmaster didn't dump core, then we're still at a loss as to > why it didn't respond. > > Rerunning with some different settings to see if I can get separate cores. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication: stuck spinlock at ReplicationSlotRelease
Peter Eisentraut wrote: > On 6/21/17 09:02, Albe Laurenz wrote: >> 2017-06-21 14:55:12.033 CEST [23124] LOG: could not send data to client: >> Broken pipe >> 2017-06-21 14:55:12.033 CEST [23124] FATAL: connection to client lost >> 2017-06-21 14:55:17.032 CEST [23133] LOG: logical replication apply worker >> for subscription "reprec" has started >> DEBUG: received replication command: IDENTIFY_SYSTEM >> DEBUG: received replication command: START_REPLICATION SLOT "reprec" >> LOGICAL 0/0 (proto_version '1', publication_names '"repsend"') >> 2017-06-21 14:57:24.552 CEST [23124] PANIC: stuck spinlock detected at >> ReplicationSlotRelease, slot.c:394 >> 2017-06-21 14:57:24.885 CEST [23070] LOG: server process (PID 23124) was >> terminated by signal 6: Aborted >> 2017-06-21 14:57:24.885 CEST [23070] LOG: terminating any other active >> server processes >> 2017-06-21 14:57:24.887 CEST [23134] LOG: could not send data to client: >> Broken pipe >> 2017-06-21 14:57:24.890 CEST [23070] LOG: all server processes terminated; >> reinitializing > > I can't reproduce that. I let it loop around for about 10 minutes and > it was fine. > > I notice that you have some debug settings on. Could you share your > exact setup steps from initdb, as well as configure options, just in > case one of these settings is causing a problem? System is Linux, RHEL 6.9, Kernel 2.6.32-696.1.1.el6.x86_64. 
Configure options:

CFLAGS='-Wall -O0 -g' ./configure --prefix=/home/laurenz/pg10 \
  --sysconfdir=/magwien/etc \
  --docdir=/home/laurenz/pg10/doc \
  --mandir=/home/laurenz/pg10/man \
  --with-perl \
  --with-tcl \
  --with-tclconfig=/usr/lib64 \
  --with-python \
  --with-openssl \
  --with-pam \
  --with-gssapi \
  --enable-nls=de \
  --enable-thread-safety \
  --with-libxml \
  --with-libxslt \
  --enable-debug \
  --enable-cassert \
  --with-ldap

$ initdb -A trust -E UTF8 --locale=de_DE.UTF-8 --lc-messages=C -T german -U postgres -W -k dbhome

$ grep -v '^[[:space:]]*#' postgresql.conf | grep -v '^$' | sed -e 's/#.*$//'
port = 5432
max_connections = 100
password_encryption = scram-sha-256
shared_buffers = 128MB
dynamic_shared_memory_type = posix
wal_level = logical
log_destination = 'stderr'
logging_collector = on
log_filename = 'postgresql-%Y-%m-%d.log'
log_rotation_size = 0
client_min_messages = debug2
log_min_messages = error
log_connections = off
log_disconnections = off
log_timezone = 'Europe/Vienna'
datestyle = 'iso, dmy'
timezone = 'Europe/Vienna'
lc_messages = 'C'
lc_monetary = 'de_DE.UTF-8'
lc_numeric = 'de_DE.UTF-8'
lc_time = 'de_DE.UTF-8'
default_text_search_config = 'pg_catalog.german'

The crash happens reliably here. Yours, Laurenz Albe -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pgjdbc logical replication client throwing exception
Hi,

I'm trying to create something like pg_recvlogical.c in Java using pgjdbc. It's working for small transactions, but when I do a big transaction the client throws the exception below:

org.postgresql.util.PSQLException: Database connection failed when reading from copy
    at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1028)
    at org.postgresql.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:41)
    at org.postgresql.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:150)
    at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:119)
    at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:73)
    at XLogApplier.main(XLogApplier.java:63)
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:140)
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:109)
    at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:191)
    at org.postgresql.core.PGStream.receive(PGStream.java:495)
    at org.postgresql.core.PGStream.receive(PGStream.java:479)
    at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1161)
    at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1026)
    ... 5 more

The PG server reports a broken pipe. Can someone please tell me what can be wrong? It's happening only when I add millions of rows.

Thanks, Sanyam Jain
Re: [HACKERS] A bug in mapping attributes in ATExecAttachPartition()
Thanks for the review again. On 2017/06/22 19:55, Ashutosh Bapat wrote: > On Thu, Jun 15, 2017 at 4:06 PM, Amit Langote > wrote: >> >> Anyway, I tried to implement the refactoring in patch 0002, which is not >> all of the patch 0001 that Jeevan posted. Please take a look. I wondered >> if we should emit a NOTICE when an individual leaf partition validation >> can be skipped? > > Yes. We emit an INFO for the table being attached. We want to do the > same for the child. Or NOTICE in both the places. Actually, I meant INFO. >> No point in adding a new test if the answer to that is >> no, I'd think. Updated the patch 0002 so that an INFO message is printed for each leaf partition, and added a test for the same. >> Attaching here 0001 which fixes the bug (unchanged from the previous >> version) and 0002 which implements the refactoring (and the feature to >> look at the individual leaf partitions' constraints to see if validation >> can be skipped.) > > Comments on 0001 patch. > + * > + * We request an exclusive lock on all the partitions, because we may > + * decide later in this function to scan them to validate the new > + * partition constraint. > Does that mean that we may not scan the partitions later, in which case the > stronger > lock we took was not needed. Is that right? Yes. I wrote that comment thinking only about the deadlock hazard which had then occurred to me, so the text somehow ended up being reflective of that thinking. Please see the new comment, which hopefully is more informative. > Don't we need an exclusive lock to > make sure that the constraints are not changed while we are validating those? If I understand your question correctly, you meant to ask if we don't need the strongest lock on individual partitions while looking at their constraints to prove that we don't need to scan them. We do, and we do take the strongest lock on individual partitions even today in the second call to find_all_inheritors(). We're trying to eliminate the second call here. 
With the current code, we take AccessShareLock in the first call when checking the circularity of inheritance. Then if attachRel doesn't have the constraint to avoid the scan, we decide to look at individual partitions (their rows, not constraints, as of now) when we take AccessExclusiveLock. That might cause a deadlock (was able to reproduce one using the debugger). > if (!skip_validate) > May be we should turn this into "else" condition of the "if" just above. Yes, done. > +/* > + * We already collected the list of partitions, including the table > + * named in the command itself, which should appear at the head of > the > + * list. > + */ > +Assert(list_length(attachRel_children) >= 1 && > +linitial_oid(attachRel_children) == RelationGetRelid(attachRel)); > Probably the Assert should be moved next to find_all_inheritors(). But I don't > see much value in this comment and the Assert at this place. I agree, removed both. > > +/* Skip the original table if it's partitioned. */ > +if (part_relid == RelationGetRelid(attachRel) && > +attachRel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > +continue; > + > > There isn't much point in checking this for every child in the loop. Instead, > we should remove attachRel from the attachRel_children if there are more than > one elements in the list (which means the attachRel is partitioned, may be > verify that by Asserting). Rearranged code considering these comments. > Comments on 0002 patch. > Thanks for the refactoring. The code looks really good now. Thanks. > The name skipPartConstraintValidation() looks very specific to the case at > hand. The function is really checking whether existing constraints on the > table > can imply the given constraints (which happen to be partition constraints). > How > about PartConstraintImpliedByRelConstraint()? The idea is to pick a general > name so that the function can be used for purposes other than skipping > validation scan in future. I liked this idea, so done. 
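The idea behind the renamed PartConstraintImpliedByRelConstraint() is that a validation scan can be skipped when the table's existing CHECK constraints already imply the partition constraint. The real code uses predicate_implied_by(); as a deliberately simplified illustration (not PostgreSQL code), the decision can be modeled with single-column intervals:

```python
# Toy stand-in for the implication test discussed above.  Constraints are
# modeled as half-open intervals [lo, hi) on one column; None means
# unbounded on that side.  If the existing constraint's interval lies
# inside the partition bound's interval, every existing row already
# satisfies the partition constraint and the scan can be skipped.

def implies(existing, partition):
    """Does interval `existing` imply interval `partition`?"""
    elo, ehi = existing
    plo, phi = partition
    lo_ok = plo is None or (elo is not None and elo >= plo)
    hi_ok = phi is None or (ehi is not None and ehi <= phi)
    return lo_ok and hi_ok

# An existing CHECK (a >= 10 AND a < 20) implies partition bound [0, 100):
print(implies((10, 20), (0, 100)))    # True  -> skip the validation scan
print(implies((10, None), (0, 100)))  # False -> must scan
```

predicate_implied_by() handles arbitrary boolean expressions, of course; the interval view is only meant to show what "implied" buys: a skipped table scan at ATTACH PARTITION time.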
> * This basically returns if the partrel's existing constraints, which > returns "true". Add "otherwise returns false". > > if (constr != NULL) > { > ... > } > return false; > May be you should return false when constr == NULL (I prefer !constr, but > others may have different preferences.). That should save us one level of > indentation. At the end just return whatever predicate_implied_by() returns. Good suggestion, done. Attached updated patches. Thanks, Amit From ee034a3bcb25a8b516220636134fe3ed38796cfe Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 14 Jun 2017 11:32:01 +0900 Subject: [PATCH 1/2] Cope with differing attnos in ATExecAttachPartition code If the table being attached has different attnos from the parent for the partitioning columns
Re: [HACKERS] Pluggable storage
1. Table AM with a 6-byte TID.
2. Table AM with a custom locator format, which could be TID-like.
3. Table AM with no locators.

Currently TID has its own type in the system catalog. It seems possible for a storage to declare the type of the TID it uses. An AM could then declare it too, so based on that information the core could resolve the question of AM-storage compatibility. A storage could also declare that it has no TID type at all, so that it couldn't be used with any access method; use case: a compressed column-oriented storage. As I remember, only GIN depends on the TID format; other indexes use it as an opaque type. Except, at least, btree and GiST - they believe that internal pointers have the same format as the outer ones (pointing to heap). Another dubious part - BitmapScan. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
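For option 1 above, the fixed 6-byte TID is a 4-byte block number plus a 2-byte offset number. A quick sketch of packing and unpacking such a locator (the byte order here is chosen for the example and is not PostgreSQL's on-disk ItemPointerData layout):

```python
import struct

# Illustration of a 6-byte TID-style locator: 4-byte block number plus
# 2-byte offset number.  Big-endian packing is an arbitrary choice for
# this sketch, not PostgreSQL's actual representation.

def tid_pack(block: int, offset: int) -> bytes:
    return struct.pack(">IH", block, offset)

def tid_unpack(tid: bytes):
    return struct.unpack(">IH", tid)

tid = tid_pack(123456, 7)
assert len(tid) == 6
print(tid_unpack(tid))  # (123456, 7)
```

A table AM in category 2 would replace this fixed layout with its own locator encoding, and category 3 would drop the notion of a locator entirely.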
Re: [HACKERS] Dynamic instrumentation of lwlock wait times (lwlock flamegraphs)
> On 23 Jun 2017, at 00:08, Andres Freund wrote: > > Hi, > > At pgcon some people were talking about the difficulty of instrumenting > the time actually spent waiting for lwlocks and related measurements. > I'd mentioned that linux these days provides infrastructure to measure > such things in unmodified binaries. > > Attached is a prototype of a script that measures the time spent inside > PGSemaphoreLock(), aggregates that inside the kernel, grouped by pid and > stacktrace. That allows one to generate nice flame graphs showing which > part of the code waits how long for lwlocks. > > The attached script clearly needs improvements, but I thought I'd show > some of the results it can get. To run it you need the python > library of the 'bcc' project [1], and a sufficiently new kernel. Some > distributions, e.g. newer debian versions, package this as python-bpfcc > and similar. > > The output of the script can be turned into a flamegraph with > https://github.com/brendangregg/FlameGraph 's flamegraph.pl. > > Here's a few examples of a pgbench run. The number is the number of > clients, sync/async indicates synchronous_commit on/off. All the > numbers here were generated with the libpq & pgbench batch patches > applied and in use, but that's just because that was the state of my > working tree. > > http://anarazel.de/t/2017-06-22/pgsemwait_8_sync.svg > http://anarazel.de/t/2017-06-22/pgsemwait_8_async.svg > http://anarazel.de/t/2017-06-22/pgsemwait_64_sync.svg > http://anarazel.de/t/2017-06-22/pgsemwait_64_async.svg > > A bonus, not that representative one is the wait for a pgbench readonly > run after the above, with autovacuum previously disabled: > http://anarazel.de/t/2017-06-22/pgsemwait_64_select.svg > interesting to see how the backends themselves never end up having to > flush WAL themselves, or at least not in a manner triggering contention. > > I plan to write a few more of these, because they're hugely useful to > understand actual locking behaviour. 
Among them: > - time between Acquire/Release of lwlocks, grouped by lwlock > - time between Acquire/Release of lwlocks, grouped by stack > - callstack of acquirer and waker of lwlocks, grouped by caller stack, waiter > stack > - ... > > I think it might be interesting to collect a few of these somewhere > centrally once halfway mature. Maybe in src/tools or such. Wow, that’s extremely helpful, thanks a lot. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
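The flamegraph.pl step mentioned above consumes "folded stacks": one line per unique stack, frames joined by semicolons, followed by a numeric value. A small sketch of that aggregation step (the per-sample tuple shape here is hypothetical, not the actual output format of the bcc script):

```python
from collections import defaultdict

# Sketch of post-processing per-stack wait samples into the folded-stack
# format consumed by flamegraph.pl: "frame1;frame2;... value".  The
# (frames, wait_us) tuples below are an assumed shape for illustration.

def fold(samples):
    totals = defaultdict(int)
    for frames, wait_us in samples:
        totals[";".join(frames)] += wait_us
    return [f"{stack} {total}" for stack, total in sorted(totals.items())]

samples = [
    (("main", "exec_simple_query", "LWLockAcquire", "PGSemaphoreLock"), 120),
    (("main", "exec_simple_query", "LWLockAcquire", "PGSemaphoreLock"), 80),
    (("main", "XLogFlush", "LWLockAcquire", "PGSemaphoreLock"), 40),
]
for line in fold(samples):
    print(line)
```

Piping such output through flamegraph.pl yields the SVGs linked above, with the wait time as the flame width.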
Re: [HACKERS] Autovacuum launcher occurs error when cancelled by SIGINT
On Fri, Jun 23, 2017 at 3:01 AM, Thomas Munro wrote: > On Thu, Jun 22, 2017 at 4:29 AM, Kuntal Ghosh > wrote: >> On Wed, Jun 21, 2017 at 7:52 PM, Dilip Kumar wrote: >>> On Wed, Jun 21, 2017 at 7:47 PM, Kuntal Ghosh >>> wrote: > IMHO, It's not a good idea to use DSM call to verify the DSA handle. > Okay. Is there any particular scenario you've in mind where this may fail? >>> >>> It's not about failure, but about the abstraction. When we are using >>> the DSA we should not directly access the DSM which is under DSA. >>> >> Okay. I thought that I've found at least one usage of >> dsm_find_mapping() in the code. :-) >> >> But, I've some more doubts. >> 1. When should we use dsm_find_mapping()? (The first few lines of >> dsm_attach is same as dsm_find_mapping().) >> 2. As a user of dsa, how should we check whether my dsa handle is >> already attached? I guess this is required because, if a user tries to >> re-attach a dsa handle, it's punishing the user by throwing an error >> and the user wants to avoid such errors. > > I thought about this when designing the DSA API. I couldn't think of > any good reason to provide an 'am-I-already-attached?' function > equivalent to dsm_find_mapping. It seemed to me that the client code > shouldn't ever be in any doubt about whether it's attached, and that > wilfully or absent-mindedly throwing away dsa_area pointers and having > to ask for them again doesn't seem like a very good design. I suspect > the same applies to dsm_find_mapping, and I don't see any callers in > the source tree or indeed anywhere on the internet (based on a quick > Google search). But I could be missing something. > Thanks a lot for the clarification. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Broken O(n^2) avoidance in wal segment recycling.
On Fri, Jun 23, 2017 at 3:25 PM, Michael Paquier wrote: > On Thu, Jun 22, 2017 at 6:10 AM, Andres Freund wrote: >> We've not heard any complaints about this afaik, but it's not something >> that's easily diagnosable as being a problem. Therefore I suspect we >> should fix and backpatch this? > > Agreed. I have just poked at this problem and have finished with the > attached. Logically it is not complicated, as the argument values used > by the callers of RemoveXlogFile() are never updated when scanning > pg_wal. Surely this patch needs an extra pair of eyes.

Andres has pointed out to me offline that a bad copy-paste re-introduced the bug fixed by fb886c15. So it is important to properly track the segment name of the switch point as well as the segment from where the recycling begins. There was as well a second mistake in RemoveOldXlogFiles(), causing unnecessary segments to be kept, caused by the same copy-pasto.
-- Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0a6314a642..fd35cd311c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -878,7 +878,8 @@ static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
-static void RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
+static void RemoveXlogFile(const char *segname, XLogSegNo *endlogSegNo,
+               XLogSegNo recycleSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
@@ -3829,10 +3830,18 @@ UpdateLastRemovedPtr(char *filename)
 static void
 RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 {
+   XLogSegNo   endlogSegNo;
+   XLogSegNo   recycleSegNo;
    DIR        *xldir;
    struct dirent *xlde;
    char        lastoff[MAXFNAMELEN];

+   /*
+    * Initialize info about where to try to recycle to.
+    */
+   XLByteToSeg(endptr, endlogSegNo);
+   recycleSegNo = XLOGfileslop(PriorRedoPtr);
+
    xldir = AllocateDir(XLOGDIR);
    if (xldir == NULL)
        ereport(ERROR,
@@ -3875,7 +3884,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
                /* Update the last removed location in shared memory first */
                UpdateLastRemovedPtr(xlde->d_name);

-               RemoveXlogFile(xlde->d_name, PriorRedoPtr, endptr);
+               RemoveXlogFile(xlde->d_name, &endlogSegNo, recycleSegNo);
            }
        }
    }
@@ -3905,8 +3914,16 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
    struct dirent *xlde;
    char        switchseg[MAXFNAMELEN];
    XLogSegNo   endLogSegNo;
+   XLogSegNo   switchLogSegNo;
+   XLogSegNo   recycleSegNo;

-   XLByteToPrevSeg(switchpoint, endLogSegNo);
+   /*
+    * Initialize info about where to begin the work. This will recycle
+    * arbitrarily 10 segments.
+    */
+   XLByteToPrevSeg(switchpoint, switchLogSegNo);
+   XLByteToSeg(switchpoint, endLogSegNo);
+   recycleSegNo = endLogSegNo + 10;

    xldir = AllocateDir(XLOGDIR);
    if (xldir == NULL)
@@ -3918,7 +3935,7 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
    /*
     * Construct a filename of the last segment to be kept.
     */
-   XLogFileName(switchseg, newTLI, endLogSegNo);
+   XLogFileName(switchseg, newTLI, switchLogSegNo);

    elog(DEBUG2, "attempting to remove WAL segments newer than log file %s",
         switchseg);
@@ -3944,7 +3961,7 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
             * - but seems safer to let them be archived and removed later.
             */
            if (!XLogArchiveIsReady(xlde->d_name))
-               RemoveXlogFile(xlde->d_name, InvalidXLogRecPtr, switchpoint);
+               RemoveXlogFile(xlde->d_name, &endlogSegNo, recycleSegNo);
        }
    }

@@ -3954,31 +3971,22 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 /*
  * Recycle or remove a log file that's no longer needed.
  *
- * endptr is current (or recent) end of xlog, and PriorRedoRecPtr is the
- * redo pointer of the previous checkpoint. These are used to determine
- * whether we want to recycle rather than delete no-longer-wanted log files.
- * If PriorRedoRecPtr is not known, pass invalid, and the function will
- * recycle, somewhat arbitrarily, 10 future segments.
+ * segname is the name of the segment to recycle or remove. endlogSegNo
+ * is the segment number of the current (or recent) end of WAL. recycleSegNo
+ * is the segment number to recycle up to.
+ *
+ * endlogSegNo gets incremented if the segment is recycled so as it is not
+ * checked again with future callers of this function.
  */
 static void
-RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
+RemoveXlogFile(const char *segname, XLogSegNo *endlogSegNo,
+              XLogSegNo recycleSegNo)
 {
    char        path[MAXPGPATH];
 #ifdef WIN32
    char        newpath[MAXPGPATH];
 #endif
    struct stat
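The new contract of RemoveXlogFile() can be summarized as: segments are recycled into future slots (starting at endlogSegNo, which advances on each recycle) until the recycleSegNo ceiling is reached, after which they are simply deleted. A toy model of that decision, with plain integers standing in for files (not PostgreSQL code):

```python
# Toy model of the recycle-or-delete decision in the patched
# RemoveXlogFile(): while the next future slot (state["end"]) is at or
# below the recycle ceiling, an old segment is renamed into that slot
# and the slot number advances; beyond the ceiling it is deleted.

def remove_xlog_file(state, recycle_seg_no):
    """state = {'end': next future segno}; returns 'recycled' or 'removed'."""
    if state["end"] <= recycle_seg_no:
        state["end"] += 1          # this future slot is now occupied
        return "recycled"
    return "removed"

state = {"end": 100}
actions = [remove_xlog_file(state, 102) for _ in range(5)]
print(actions)  # first three recycled into slots 100..102, rest removed
```

The bug being fixed is visible in this model: if the caller never advanced `state["end"]` between calls (as the pre-patch code never updated its arguments while scanning pg_wal), every segment would target the same slot.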
Re: [HACKERS] Multi column range partition table
On 23 June 2017 at 08:01, Ashutosh Bapat wrote: > The way we have designed our syntax, we don't have a way to tell that > p3 comes after p2 and they have no gap between those. But I don't > think that's your question. What you are struggling with is a way to > specify a lower bound (10, +infinity) so that anything with i1 > 10 > would go to partition 3. > I think actually there is a fundamental problem here, which arises because UNBOUNDED has 2 different meanings depending on context, and thus it is not possible in general to specify the start of one range to be equal to the end of the previous range, as is necessary to get contiguous non-overlapping ranges. Note that this isn't just a problem for floating point datatypes either, it also applies to other types such as strings. For example, given a partition over (text, int) types defined with the following values: FROM ('a', UNBOUNDED) TO ('b', UNBOUNDED) which is equivalent to FROM ('a', -INFINITY) TO ('b', +INFINITY) where should the next range start? Even if you were to find a way to specify "the next string after 'b'", it wouldn't exactly be pretty. The problem is that the above partition corresponds to "all the strings starting with 'a', plus the string 'b'", which is pretty ugly. A neater way to define the pair of ranges in this case would be: FROM ('a', -INFINITY) TO ('b', -INFINITY) FROM ('b', -INFINITY) TO ('c', -INFINITY) since then all strings starting with 'a' would fall into the first partition and all the strings starting with 'b' would fall into the second one. 
Currently, when there are 2 partition columns, the partition constraint is defined as (a is not null) and (b is not null) and (a > al or (a = al and b >= bl)) and (a < au or (a = au and b < bu)) if the upper bound bu were allowed to be -INFINITY (something that should probably be forbidden unless the previous column's upper bound were finite), then this would simplify to (a is not null) and (b is not null) and (a > al or (a = al and b >= bl)) and (a < au) and in the example above, where al is -INFINITY, it would further simplify to (a is not null) and (b is not null) and (a >= al) and (a < au) There would also be a similar simplification possible if the lower bound of a partition column were allowed to be +INFINITY. So, I think that having UNBOUNDED represent both -INFINITY and +INFINITY depending on context is a design flaw, and that we need to allow both -INFINITY and +INFINITY as upper and lower bounds (provided they are preceded by a column with a finite bound). I think that, in general, that's the only way to allow contiguous non-overlapping partitions to be defined on multiple columns. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
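The two-column constraint quoted above can be checked directly. Transcribing it into Python, with math.inf standing in for the ±INFINITY bounds under discussion (an illustration, not PostgreSQL code), shows Dean's simplification in action:

```python
import math

# Direct transcription of the two-column range constraint quoted above:
#   (a > al OR (a = al AND b >= bl)) AND (a < au OR (a = au AND b < bu))
# with math.inf standing in for +/-INFINITY.

def in_range(a, b, al, bl, au, bu):
    lower_ok = a > al or (a == al and b >= bl)
    upper_ok = a < au or (a == au and b < bu)
    return lower_ok and upper_ok

INF = math.inf

# With bu = -INFINITY the upper test degenerates to a < au, so
# FROM (al, -inf) TO (au, -inf) covers exactly al <= a < au:
assert in_range(5, 999, 5, -INF, 10, -INF)        # any b is accepted
assert not in_range(10, -123, 5, -INF, 10, -INF)  # a = au never qualifies
print("contiguous, non-overlapping")
```

Numeric keys are used here instead of the (text, int) example, but the constraint structure, and hence the simplification, is identical.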
Re: [HACKERS] Setting pd_lower in GIN metapage
On Fri, Jun 23, 2017 at 3:31 PM, Michael Paquier wrote: > On Fri, Jun 23, 2017 at 3:17 PM, Amit Langote > wrote: >> Initially, I had naively set wal_consistency_check = all before running >> make installcheck and then had to wait for a long time to confirm that WAL >> generated by the gin test indeed caused consistency check failure on the >> standby with the v1 patch. > > wal_consistency_check = gin would have saved you a lot of I/O. > >> But I can see Sawada-san's point that there should be some way for >> developers writing code that better had gone through WAL consistency >> checking facility to do it without much hassle. But then again, it may >> not be that frequent to need that. Yeah, it should be optional. I imagined providing such an option to pg_regress or the TAP tests for developers. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix a typo in snapmgr.c
On 23 June 2017 at 08:21, Andres Freund wrote: > On 2017-06-07 10:17:31 -0700, Andres Freund wrote: >> On 2017-05-08 09:12:13 -0400, Tom Lane wrote: >> > Simon Riggs writes: >> > > So rearranged code a little to keep it lean. >> > >> > Didn't you break it with that? As it now stands, the memcpy will >> > copy the nonzero value. >> >> I've not seen a fix and/or alleviating comment about this so far. Did I >> miss something? > > Simon, FWIW, I plan to either revert or fix this up soon-ish. Unless > you're going to actually respond on this thread? Sorry, I've only just seen Tom's reply. Will fix. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix a typo in snapmgr.c
On 2017-06-07 10:17:31 -0700, Andres Freund wrote: > On 2017-05-08 09:12:13 -0400, Tom Lane wrote: > > Simon Riggs writes: > > > So rearranged code a little to keep it lean. > > > > Didn't you break it with that? As it now stands, the memcpy will > > copy the nonzero value. > > I've not seen a fix and/or alleviating comment about this so far. Did I > miss something? Simon, FWIW, I plan to either revert or fix this up soon-ish. Unless you're going to actually respond on this thread? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Small bug in replication lag tracking
On 23 June 2017 at 06:45, Thomas Munro wrote: > I discovered a thinko in the new replication lag interpolation code > that can cause a strange number to be reported occasionally. Thanks, will review. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Same expression more than once in partition key
On Fri, 23 Jun 2017 15:57:54 +0900 Yugo Nagata wrote: > Hi, > > When we create a range partitioned table, we cannot use > a column more than once in the partition key. > > postgres=# create table t (i int) partition by range(i,i); > ERROR: column "i" appears more than once in partition key > > However, I can use the same expression more than once in the partition key. > > postgres=# create table t (i int) partition by range((i),(i)); > CREATE TABLE > > Although this can be blocked by the attached patch, for example, > the following is still possible. I forgot to attach it. > > postgres=# create table t (i int) partition by range((i),i); > CREATE TABLE > > Can these definitions be a problem in the internals of partitioning? > If not, we might not need to check for columns that appear more than once > in the partition key. > > Regards, > > -- > Yugo Nagata

-- Yugo Nagata

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7d9c769..dc4540b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13171,6 +13171,19 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
        PartitionElem *pelem = castNode(PartitionElem, lfirst(l));
        ListCell   *lc;

+       if (pelem->expr)
+       {
+           /* Copy, to avoid scribbling on the input */
+           pelem = copyObject(pelem);
+
+           /* Now do parse transformation of the expression */
+           pelem->expr = transformExpr(pstate, pelem->expr,
+                                       EXPR_KIND_PARTITION_EXPRESSION);
+
+           /* we have to fix its collations too */
+           assign_expr_collations(pstate, pelem->expr);
+       }
+
        /* Check for PARTITION BY ... (foo, foo) */
        foreach(lc, newspec->partParams)
        {
@@ -13183,19 +13196,11 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
                         errmsg("column \"%s\" appears more than once in partition key",
                                pelem->name),
                         parser_errposition(pstate, pelem->location)));
-       }
-
-       if (pelem->expr)
-       {
-           /* Copy, to avoid scribbling on the input */
-           pelem = copyObject(pelem);
-
-           /* Now do parse transformation of the expression */
-           pelem->expr = transformExpr(pstate, pelem->expr,
-                                       EXPR_KIND_PARTITION_EXPRESSION);
-
-           /* we have to fix its collations too */
-           assign_expr_collations(pstate, pelem->expr);
+           else if (pelem->expr && pparam->expr && equal(pelem->expr, pparam->expr))
+               ereport(ERROR,
+                       (errcode(ERRCODE_DUPLICATE_COLUMN),
+                        errmsg("same expression appears more than once in partition key"),
+                        parser_errposition(pstate, pelem->location)));
        }

        newspec->partParams = lappend(newspec->partParams, pelem);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
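The patch above moves the parse transformation of each key item before the duplicate check, so that structurally equal expressions can be compared with equal(). A deliberately simplified model of that check (toy code, not the patch; items are tuples and equal() becomes ==, and the toy error texts differ from PostgreSQL's):

```python
# Toy version of the duplicate-detection added by the patch: after
# transforming each partition-key item, reject a key that repeats either
# a plain column or a structurally equal expression.  Items are modeled
# as ('col', name) or ('expr', tree) tuples.

def check_partition_key(items):
    seen = []
    for item in items:
        if item in seen:
            kind = "column" if item[0] == "col" else "expression"
            raise ValueError(f"same {kind} appears more than once in partition key")
        seen.append(item)

check_partition_key([("col", "i"), ("col", "j")])          # ok
try:
    check_partition_key([("expr", ("i",)), ("expr", ("i",))])
except ValueError as e:
    print(e)  # same expression appears more than once in partition key
```

As Yugo notes, this still cannot catch range((i), i), where the same underlying column appears once as a bare column and once wrapped as an expression, since the two items are not structurally equal.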
Re: [HACKERS] Multi column range partition table
On Fri, Jun 23, 2017 at 6:58 AM, Amit Langote wrote: > On 2017/06/22 20:48, amul sul wrote: >> Hi, >> >> While working on the another patch, I came across the case where >> I need an auto generated partition for a mutil-column range partitioned >> table having following range bound: >> >> PARTITION p1 FROM (UNBOUNDED, UNBOUNDED) TO (10, 10) >> PARTITION p2 FROM (10, 10) TO (10, UNBOUNDED) >> PARTITION p3 FROM (10, UNBOUNDED) TO (20, 10) >> PARTITION p4 FROM (20, 10) TO (20, UNBOUNDED) >> PARTITION p5 FROM (20, UNBOUNDED) TO (UNBOUNDED, UNBOUNDED) >> >> In this, a lower bound of the partition is an upper bound of the >> previous partition. >> >> While trying to create p3 partition with (10, UNBOUNDED) to (20, 10) bound, >> got an overlap partition error. >> >> Here is the SQL to reproduced this error: >> >> CREATE TABLE range_parted ( i1 int, i2 int ) PARTITION BY RANGE (i1, i2); >> CREATE TABLE p1 PARTITION OF range_parted FOR VALUES FROM (UNBOUNDED, >> UNBOUNDED) TO (10, 10); >> CREATE TABLE p2 PARTITION OF range_parted FOR VALUES FROM (10, 10) TO >> (10, UNBOUNDED); >> CREATE TABLE p3 PARTITION OF tab1 FOR VALUES FROM (10, UNBOUNDED) TO (20, >> 10); >> >> ERROR: partition "p3" would overlap partition "tab1_p_10_10" >> >> This happened because of UNBOUNDED handling, where it is a negative infinite >> if it is in FROM clause. Wondering can't we explicitly treat this as >> a positive infinite value, can we? The way we have designed our syntax, we don't have a way to tell that p3 comes after p2 and they have no gap between those. But I don't think that's your question. What you are struggling with is a way to specify a lower bound (10, +infinity) so that anything with i1 > 10 would go to partition 3. > > No, we cannot. What would be greater than (or equal to) +infinite? > Nothing. So, even if you will want p3 to accept (10, 9890148), it won't > because 9890148 is not >= +infinite. 
> It will accept only the rows where the first column is > 10 (the second
> column is not checked in that case).
>
> You will have to define p3 as follows:
>
> CREATE TABLE p3 PARTITION OF tab1 FOR VALUES FROM (11, UNBOUNDED) TO (20, 10);

That's not exactly the same as specifying (10, +infinity) in case i1 is a
float. A user cannot precisely tell what would be the acceptable value just
greater than 10. An UNBOUNDED in the lower bound is always considered to be
-infinity for that data type; there is no way to specify a lower bound which
has +infinity in it. +infinity as a lower bound for the first key may not
make sense (it means the partition will always be empty), but it does make
sense for keys after the first, as Amul has explained below. The question
is: do we have support for that, and if not, will we consider it for v10 or
v11, and how?

> It's fine to use the previous partition's upper bound as the lower bound
> of the current partition, if the former does contain an UNBOUNDED value,
> because whereas a finite value divides the range into two parts (assigned
> to the two partitions respectively), an UNBOUNDED value does not. The
> latter represents an abstract end of the range (either on the positive
> side or the negative).

Not exactly for the second key onwards.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
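The overlap amul hit can be reproduced with plain lexicographic tuple
comparison; the following is a minimal Python sketch of the idea (a toy
model, not PostgreSQL's partition-bound code), with UNBOUNDED in a FROM
clause mapped to -infinity as the thread describes:

```python
# Toy model: multi-column range bounds as tuples compared lexicographically.
# UNBOUNDED in a lower (FROM) bound becomes -inf, in an upper (TO) bound +inf.
NEG_INF = float("-inf")
POS_INF = float("inf")

def overlaps(lo1, hi1, lo2, hi2):
    """Half-open ranges [lo, hi) overlap unless one ends before the other starts."""
    return not (hi1 <= lo2 or hi2 <= lo1)

# p2: FROM (10, 10) TO (10, UNBOUNDED)  ->  [(10, 10), (10, +inf))
p2 = ((10, 10), (10, POS_INF))
# p3: FROM (10, UNBOUNDED) TO (20, 10)  ->  [(10, -inf), (20, 10))
p3 = ((10, NEG_INF), (20, 10))

# Because (10, -inf) sorts before (10, 10), p3's declared range reaches back
# over the earlier partitions, which is why CREATE TABLE reports an overlap.
print(overlaps(*p2, *p3))  # True
```

This is exactly the asymmetry Ashutosh points out: the sentinel is fixed to
-infinity for lower bounds, so there is no way to write the lower bound
(10, +infinity) that would make p3 start where p2 ends.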
[HACKERS] Same expression more than once in partition key
Hi,

When we create a range partitioned table, we cannot use a column more than
once in the partition key.

postgres=# create table t (i int) partition by range(i,i);
ERROR:  column "i" appears more than once in partition key

However, we can use the same expression more than once in the partition key.

postgres=# create table t (i int) partition by range((i),(i));
CREATE TABLE

Although this can be blocked by the attached patch, the following, for
example, is still possible.

postgres=# create table t (i int) partition by range((i),i);
CREATE TABLE

Can these definitions be a problem in the internals of partitioning? If
not, we might not need to check for columns that appear more than once in
the partition key.

Regards,

--
Yugo Nagata
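The check the patch adds boils down to comparing each new partition-key
entry against the ones already collected. A rough Python sketch of that
idea (the names and the string representation here are invented; PostgreSQL
compares parse trees with equal(), not text):

```python
# Illustrative sketch only: reject a partition key in which the same column
# or the same expression appears twice. Entries are ("column", name) or
# ("expr", normalized_text); normalized text stands in for the node tree.
def check_partition_key(params):
    seen = []
    for kind, value in params:
        for seen_kind, seen_value in seen:
            if kind == seen_kind and value == seen_value:
                if kind == "column":
                    raise ValueError(
                        'column "%s" appears more than once in partition key' % value)
                raise ValueError(
                    "same expression appears more than once in partition key")
        seen.append((kind, value))

check_partition_key([("expr", "(i)"), ("column", "j")])    # accepted
# check_partition_key([("expr", "(i)"), ("expr", "(i)")])  # would raise
```

Note that ("expr", "(i)") and ("column", "i") are never compared with each
other, mirroring the remaining hole Yugo describes: range((i),i) still slips
through this kind of check.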
Re: [HACKERS] Fix a typo in README.dependencies
On Fri, Jun 23, 2017 at 2:58 AM, Alvaro Herrera wrote:
> Ashutosh Bapat wrote:
>> On Mon, Jun 5, 2017 at 8:22 AM, atorikoshi wrote:
>> > Hi,
>> >
>> > I found below formula to compute selectivities, but
>> > I think the last probability 'P(b=?)' should be 'P(c=?)'.
>> >
>> >> P(a=?,b=?,c=?) = P(a=?,b=?) * (d + (1-d)*P(b=?))
>> >
>> > Attached patch fixes it, and it also adds some spaces
>> > following another formula which is on line 86 and
>> > computes P(a=?, b=?).
>>
>> Agree. Also, using "d" for the "degree of functional dependence (b=>c)"
>> as well is confusing. We are already using "d" for the "degree of
>> functional dependence (a=>b)". Here's a patch to use "d'" instead of "d".
>
> Since the surrounding text uses single quotes to talk about each letter,
> I thought it was better to use a new letter (e) so that we don't require
> the "prime" notation, which would end up being either inconsistent,
> confusing, stupid-looking, or combinations thereof.

Makes sense.

> Also, your proposed text had a slight mistake: it's not (b=>c) that
> d' is for, but (a,b=>c).

Sorry for that.

> Pushed with those corrections.

Thanks.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
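The corrected formula is easy to sanity-check numerically. A small Python
sketch with made-up selectivities (the letters d and e are used as in the
corrected README text; the numbers are arbitrary, not values from the
statistics system):

```python
# Combine per-column selectivities using functional-dependency degrees:
#   P(a=?,b=?)      = P(a=?) * (d + (1-d)*P(b=?))      d: degree of a=>b
#   P(a=?,b=?,c=?)  = P(a=?,b=?) * (e + (1-e)*P(c=?))  e: degree of (a,b)=>c
def dependent_sel(p_a, p_b, p_c, d, e):
    p_ab = p_a * (d + (1 - d) * p_b)
    return p_ab * (e + (1 - e) * p_c)

# With no dependencies (d = e = 0) this degrades to plain multiplication,
# and with full dependencies (d = e = 1) only P(a=?) matters.
print(dependent_sel(0.1, 0.2, 0.5, 0.0, 0.0))  # == 0.1 * 0.2 * 0.5
print(dependent_sel(0.1, 0.2, 0.5, 1.0, 1.0))  # == 0.1
```

The two limiting cases above are exactly why the last factor must use
P(c=?), not P(b=?): with e = 0 the formula has to reduce to multiplying in
the selectivity of the c clause.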
Re: [HACKERS] Fix a typo in README.dependencies
On 2017/06/23 6:28, Alvaro Herrera wrote:
> Ashutosh Bapat wrote:
>> On Mon, Jun 5, 2017 at 8:22 AM, atorikoshi wrote:
>>> Hi,
>>>
>>> I found below formula to compute selectivities, but
>>> I think the last probability 'P(b=?)' should be 'P(c=?)'.
>>>
>>>> P(a=?,b=?,c=?) = P(a=?,b=?) * (d + (1-d)*P(b=?))
>>>
>>> Attached patch fixes it, and it also adds some spaces
>>> following another formula which is on line 86 and
>>> computes P(a=?, b=?).
>>
>> Agree. Also, using "d" for the "degree of functional dependence (b=>c)"
>> as well is confusing. We are already using "d" for the "degree of
>> functional dependence (a=>b)". Here's a patch to use "d'" instead of "d".
>
> Since the surrounding text uses single quotes to talk about each letter,
> I thought it was better to use a new letter (e) so that we don't require
> the "prime" notation, which would end up being either inconsistent,
> confusing, stupid-looking, or combinations thereof.
>
> Also, your proposed text had a slight mistake: it's not (b=>c) that
> d' is for, but (a,b=>c).
>
> Pushed with those corrections.  Thanks for the reports and patches!

Thanks!

--
Atsushi Torikoshi
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Setting pd_lower in GIN metapage
On Fri, Jun 23, 2017 at 3:17 PM, Amit Langote wrote:
> Initially, I had naively set wal_consistency_check = all before running
> make installcheck and then had to wait for a long time to confirm that WAL
> generated by the gin test indeed caused a consistency check failure on the
> standby with the v1 patch.

wal_consistency_check = gin would have saved you a lot of I/O.

> But I can see Sawada-san's point that there should be some way for
> developers writing code that ought to go through the WAL consistency
> checking facility to do so without much hassle. But then again, the need
> for that may not be that frequent.

Yeah, I have my own set of generic scripts for that. There could be a set
of modules outside the main check-world, though the risk that those end up
rotting is high...
--
Michael
Re: [HACKERS] Broken O(n^2) avoidance in wal segment recycling.
On Thu, Jun 22, 2017 at 6:10 AM, Andres Freund wrote:
> We've not heard any complaints about this afaik, but it's not something
> that's easily diagnosable as being a problem. Therefore I suspect we
> should fix and backpatch this?

Agreed. I have just poked at this problem and have finished with the
attached. Logically it is not complicated, as the argument values used by
the callers of RemoveXlogFile() are never updated when scanning pg_wal.
Surely this patch needs an extra pair of eyes.
--
Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0a6314a642..d594401b6e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -878,7 +878,8 @@ static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
-static void RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr);
+static void RemoveXlogFile(const char *segname, XLogSegNo *endlogSegNo,
+			   XLogSegNo recycleSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
@@ -3829,10 +3830,18 @@ UpdateLastRemovedPtr(char *filename)
 static void
 RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 {
+	XLogSegNo	endlogSegNo;
+	XLogSegNo	recycleSegNo;
 	DIR		   *xldir;
 	struct dirent *xlde;
 	char		lastoff[MAXFNAMELEN];

+	/*
+	 * Initialize info about where to try to recycle to.
+	 */
+	XLByteToPrevSeg(endptr, endlogSegNo);
+	recycleSegNo = XLOGfileslop(PriorRedoPtr);
+
 	xldir = AllocateDir(XLOGDIR);
 	if (xldir == NULL)
 		ereport(ERROR,
@@ -3875,7 +3884,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 				/* Update the last removed location in shared memory first */
 				UpdateLastRemovedPtr(xlde->d_name);

-				RemoveXlogFile(xlde->d_name, PriorRedoPtr, endptr);
+				RemoveXlogFile(xlde->d_name, &endlogSegNo, recycleSegNo);
 			}
 		}
 	}
@@ -3905,8 +3914,14 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 	struct dirent *xlde;
 	char		switchseg[MAXFNAMELEN];
 	XLogSegNo	endLogSegNo;
+	XLogSegNo	recycleSegNo;

+	/*
+	 * Initialize info about where to begin the work.  This will recycle
+	 * arbitrarily 10 segments.
+	 */
 	XLByteToPrevSeg(switchpoint, endLogSegNo);
+	recycleSegNo = endLogSegNo + 10;

 	xldir = AllocateDir(XLOGDIR);
 	if (xldir == NULL)
@@ -3944,7 +3959,7 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 		 * - but seems safer to let them be archived and removed later.
 		 */
 		if (!XLogArchiveIsReady(xlde->d_name))
-			RemoveXlogFile(xlde->d_name, InvalidXLogRecPtr, switchpoint);
+			RemoveXlogFile(xlde->d_name, &endLogSegNo, recycleSegNo);
 	}
 }
@@ -3954,31 +3969,22 @@
 /*
  * Recycle or remove a log file that's no longer needed.
  *
- * endptr is current (or recent) end of xlog, and PriorRedoRecPtr is the
- * redo pointer of the previous checkpoint.  These are used to determine
- * whether we want to recycle rather than delete no-longer-wanted log files.
- * If PriorRedoRecPtr is not known, pass invalid, and the function will
- * recycle, somewhat arbitrarily, 10 future segments.
+ * segname is the name of the segment to recycle or remove.  endlogSegNo
+ * is the segment number of the current (or recent) end of WAL.  recycleSegNo
+ * is the segment number to recycle up to.
+ *
+ * endlogSegNo gets incremented if the segment is recycled so as it is not
+ * checked again with future callers of this function.
  */
 static void
-RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
+RemoveXlogFile(const char *segname, XLogSegNo *endlogSegNo,
+			   XLogSegNo recycleSegNo)
 {
 	char		path[MAXPGPATH];
 #ifdef WIN32
 	char		newpath[MAXPGPATH];
 #endif
 	struct stat statbuf;
-	XLogSegNo	endlogSegNo;
-	XLogSegNo	recycleSegNo;
-
-	/*
-	 * Initialize info about where to try to recycle to.
-	 */
-	XLByteToSeg(endptr, endlogSegNo);
-	if (PriorRedoPtr == InvalidXLogRecPtr)
-		recycleSegNo = endlogSegNo + 10;
-	else
-		recycleSegNo = XLOGfileslop(PriorRedoPtr);

 	snprintf(path, MAXPGPATH, XLOGDIR "/%s", segname);

@@ -3987,9 +3993,9 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	 * segment. Only recycle normal files, pg_standby for example can create
 	 * symbolic links pointing to a separate archive directory.
 	 */
-	if (endlogSegNo <= recycleSegNo &&
+	if (*endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
-		InstallXLogFileSegment(&endlogSegNo, path,
+		InstallXLogFileSegment(endlogSegNo, path,
 							   true, recycleSegNo, true))
 	{
 		ereport(DEBUG2,
@@ -3997,7 +4003,7 @@ RemoveXlogFile(const char
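The shape of the fix — the caller owns the "next free future segment"
cursor and every successful recycle advances it — can be illustrated
outside of C. The following toy model is not PostgreSQL code; the names
and the occupied-set bookkeeping are invented for the sketch:

```python
# Toy model: end_seg is a one-element list standing in for the C pointer
# (XLogSegNo *endlogSegNo). Because each recycle bumps it, the directory
# scan never re-probes future slots it has already filled -- the O(n^2)
# behavior the thread describes came from resetting this cursor per call.
def remove_or_recycle(end_seg, recycle_limit, occupied):
    """Return ("recycled", slot) or ("removed", None); advance end_seg[0] on recycle."""
    while end_seg[0] <= recycle_limit and end_seg[0] in occupied:
        end_seg[0] += 1                   # skip slots already installed
    if end_seg[0] <= recycle_limit:
        occupied.add(end_seg[0])          # "install" the segment under a new name
        end_seg[0] += 1                   # next caller starts past it
        return ("recycled", end_seg[0] - 1)
    return ("removed", None)              # past the limit: just delete

occupied = set()
end_seg = [100]                           # current end of WAL, segment number
actions = [remove_or_recycle(end_seg, 104, occupied) for _ in range(8)]
# Five segments are recycled into slots 100..104; the remaining three are
# removed, and no slot is ever probed twice.
```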
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/06/23 15:07, Michael Paquier wrote:
> On Fri, Jun 23, 2017 at 2:56 PM, Masahiko Sawada wrote:
>> Thank you for updating the patch. It looks good to me.
>> BTW, I'm inclined to have a regression test case where we run 'make
>> check' against a streaming replication environment with
>> wal_consistency_check set on the standby server, so that we can detect
>> bugs around the WAL.
>
> This would be very costly. A single run of the main regression tests
> generates 15 to 20GB of FPWs if I recall correctly. Tiny buildfarm
> members would suffer on that. *cough*

Initially, I had naively set wal_consistency_check = all before running
make installcheck and then had to wait for a long time to confirm that WAL
generated by the gin test indeed caused a consistency check failure on the
standby with the v1 patch.

But I can see Sawada-san's point that there should be some way for
developers writing code that ought to go through the WAL consistency
checking facility to do so without much hassle. But then again, the need
for that may not be that frequent.

Thanks,
Amit
Re: [HACKERS] Setting pd_lower in GIN metapage
On Fri, Jun 23, 2017 at 2:56 PM, Masahiko Sawada wrote:
> Thank you for updating the patch. It looks good to me.
> BTW, I'm inclined to have a regression test case where we run 'make
> check' against a streaming replication environment with
> wal_consistency_check set on the standby server, so that we can detect
> bugs around the WAL.

This would be very costly. A single run of the main regression tests
generates 15 to 20GB of FPWs if I recall correctly. Tiny buildfarm members
would suffer on that. *cough*
--
Michael
Re: [HACKERS] Multi column range partition table
On 2017/06/23 13:42, amul sul wrote:
> On Fri, Jun 23, 2017 at 6:58 AM, Amit Langote wrote:
>> On 2017/06/22 20:48, amul sul wrote:
>>> This happens because of the UNBOUNDED handling, where it means negative
>>> infinity when it appears in the FROM clause. Can't we explicitly treat
>>> it as a positive infinity value here instead?
>>
>> No, we cannot. What would be greater than (or equal to) +infinite?
>> Nothing. So, even if you want p3 to accept (10, 9890148), it won't,
>> because 9890148 is not >= +infinite. It will accept only the rows where
>> the first column is > 10 (the second column is not checked in that case).
>>
>> You will have to define p3 as follows:
>>
>> CREATE TABLE p3 PARTITION OF tab1 FOR VALUES FROM (11, UNBOUNDED) TO (20,
>> 10);
>
> What if the partition key column is FLOAT?

I would say use a value such that btfloat4cmp (or btfloat8cmp) will tell it
to be greater than 10. Of course, we can't write what I just said in the
user-level documentation, because the fact that we use the system- or
user-defined btree comparison proc (btfloat4/8cmp) for partitioning may be
irrelevant to users. Although, we do mention in the documentation that we
use btree-operator-class-specified semantics for partitioning.

In any case, defining your partitioning or indexing on raw float type
column(s) is prone to semantic caveats, I'd think. Also, there was an
interesting exchange on this topic during the patch development [1].
Excerpt: "Same for ranges of floating-point numbers, which are also
probably an unlikely candidate for a partitioning key anyway."

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CA%2BTgmoaucSqQ%3DdJFhaojSpb1706MQYo1Tfn_3tWv6CVWhAOdrQ%40mail.gmail.com
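Ashutosh's float concern is concrete: "the acceptable value just greater
than 10" does exist for a double, but it is not something a user can write
down comfortably. A short illustration (plain Python, requiring 3.9+ for
math.nextafter; this only demonstrates the float-bound point, it is not
tied to PostgreSQL's comparison procs):

```python
import math

# The smallest double strictly greater than 10.0 -- what a user would have
# to spell out literally to emulate a lower bound of (10, +infinity) on a
# float partition key once UNBOUNDED is pinned to -infinity in FROM.
just_above_ten = math.nextafter(10.0, math.inf)

print(just_above_ten > 10.0)                               # True
# Nothing fits between 10.0 and this value:
print(math.nextafter(just_above_ten, -math.inf) == 10.0)   # True
```

Since no ordinary literal a user would think of writing lands exactly on
this neighbor value, "use 11 instead of 10" style workarounds cannot be
made exact for float keys, which is the gap the thread is pointing at.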