Re: [HACKERS] logical decoding of two-phase transactions
Hi,

FYI all, wanted to mention that I am working on an updated version of the
latest patch that I plan to submit to a later CF.

Regards,
Nikhils

On 14 May 2017 at 04:02, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> On 13 May 2017 at 22:22, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>
>> Apparently you are not testing against current HEAD.  That's been there
>> since d10c626de (a whole two days now ;-))
>
> Indeed, I was working on a more than two-day old antiquity. Unfortunately,
> it's even more complicated to apply this patch against the current HEAD,
> so I'll wait for a rebased version.

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtrans links incorrectly
> I'm suggesting we take the approach that if there is a problem we can
> recreate it as a way of exploring what conditions are required and
> therefore work out the impact. Nikhil Sontakke appears to have
> re-created something, but not quite what I had expected. I think he
> will post here tomorrow with an update for us to discuss.

So, I reverted commit 0874d4f3e183757ba15a4b3f3bf563e0393dd9c2 to go back to
the earlier bad swapped arguments to SubTransSetParent, resulting in
incorrect parent linkages, and used the attached TAP test patch.

The test prepares a 2PC transaction with more than 64 subtransactions. It
then stops the master and promotes the standby. A SELECT query on the newly
promoted master on any of the tables involved in the 2PC hangs.

The hang is due to a loop in SubTransGetTopmostTransaction(). Because of the
incorrect linkages, we get a circular reference in parentxid <-> subxid,
inducing the infinite loop. Any further DML on these objects which needs to
check the visibility of these tuples hangs as well.

All unrelated objects and new transactions are ok AFAICS. I do not see any
data loss, which is good. However, tables involved in the 2PC are
inaccessible until after a hard restart.

The attached TAP test patch can be considered for commit, to test the
handling of 2PC with a large number of subtransactions on promoted standby
instances.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

subxid_bug_with_test_case_v1.0.patch
Description: Binary data
Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtrans links incorrectly
> >>>> Also, when I fix that, it gets further but still crashes at the same
> >>>> Assert in SubTransSetParent.  The proximate cause this time seems to be
> >>>> that RecoverPreparedTransactions's calculation of overwriteOK is wrong:
> >>>> it's computing that as "false", but in reality the subtrans link in
> >>>> question has already been set.
> >>
> >>> Not sure about that, investigating.
> >>
> >> As a quick hack, I just hotwired RecoverPreparedTransactions to set
> >> overwriteOK = true always, and with that and the SubTransSetParent
> >> argument-order fix, HEAD passes the recovery tests.  Maybe we can
> >> be smarter than that, but this might be a good short-term fix to get
> >> the buildfarm green again.
> >
> > That would work. I've been looking into a fix I can explain, but "do
> > it always" may actually be it.

Here's what's happening:

On Master:
    BEGIN;
    INSERT INTO t_009_tbl VALUES (42);
    SAVEPOINT s1;
    INSERT INTO t_009_tbl VALUES (43);
    PREPARE TRANSACTION 'xact_009_1';

On Master: Do a fast shutdown.

On Slave: Restart the slave. This causes StandbyRecoverPreparedTransactions()
to be invoked, which causes ProcessTwoPhaseBuffer() to be invoked with
setParent set to true. This sets up the parent xid for the above
subtransaction.

On Slave: Promote the slave. This causes RecoverPreparedTransactions() to be
invoked, which ends up calling SubTransSetParent() for the above subxid. The
overwriteOK bool remains false since we don't go over the
PGPROC_MAX_CACHED_SUBXIDS limit. Since the parent xid was already set on
restart above, we hit the assert.

---

I am wondering if StandbyRecoverPreparedTransactions() needs the parent
linkage at all? I don't see SubTransGetParent() and related functions being
called on the standby, but we need to be careful here. As a quick test, if I
don't call SubTransSetParent() as part of
StandbyRecoverPreparedTransactions(), then all recovery tests pass ok.

But the fact that StandbyRecoverPreparedTransactions() takes overwriteOK as a
parameter means it might not be so simple.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Failed recovery with new faster 2PC code
Please find attached a second version of my bug fix, which is stylistically
better and clearer than the first one.

Regards,
Nikhils

On 18 April 2017 at 13:47, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
> Hi,
>
> There was a bug in the redo 2PC remove code path. Because of it, autovac
> would think that the 2PC is gone and cause removal of the corresponding
> clog entry earlier than needed.
>
> Please find attached the bug fix: 2pc_redo_remove_bug.patch.
>
> I have been testing this on top of Michael's 2pc-restore-fix.patch and
> things seem to be ok for the past one+ hour. Will keep it running for long.
>
> Jeff, thanks for these very useful scripts. I am going to make a habit of
> running these scripts on my side from now on. Do you have any other script
> that I could try against these patches? Please let me know.
>
> Regards,
> Nikhils
>
> On 18 April 2017 at 12:09, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>> On 17 April 2017 at 15:02, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>>>
>>>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>>>> >> Author: Simon Riggs <si...@2ndquadrant.com>
>>>> >> Date:   Tue Apr 4 15:56:56 2017 -0400
>>>> >>
>>>> >>     Speedup 2PC recovery by skipping two phase state files in normal path
>>>> >
>>>> > Thanks Jeff for your tests.
>>>> >
>>>> > So that's now two crash bugs in as many days and lack of clarity about
>>>> > how to fix it.
>>>
>> The issue seems to be that a prepared transaction is yet to be committed,
>> but autovacuum comes in and causes the clog to be truncated beyond this
>> prepared transaction ID in one of the runs.
>>
>> We only add the corresponding pgproc entry for a surviving 2PC transaction
>> on completion of recovery. So there could be a race condition here.
>> Digging in further.
>>
>> Regards,
>> Nikhils

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

2pc_redo_remove_bug_v2.0.patch
Description: Binary data
Re: [HACKERS] Failed recovery with new faster 2PC code
Hi,

There was a bug in the redo 2PC remove code path. Because of it, autovac
would think that the 2PC is gone and cause removal of the corresponding clog
entry earlier than needed.

Please find attached the bug fix: 2pc_redo_remove_bug.patch.

I have been testing this on top of Michael's 2pc-restore-fix.patch and things
seem to be ok for the past one+ hour. Will keep it running for long.

Jeff, thanks for these very useful scripts. I am going to make a habit of
running these scripts on my side from now on. Do you have any other script
that I could try against these patches? Please let me know.

Regards,
Nikhils

On 18 April 2017 at 12:09, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
> On 17 April 2017 at 15:02, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>>
>>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>>> >> Author: Simon Riggs <si...@2ndquadrant.com>
>>> >> Date:   Tue Apr 4 15:56:56 2017 -0400
>>> >>
>>> >>     Speedup 2PC recovery by skipping two phase state files in normal path
>>> >
>>> > Thanks Jeff for your tests.
>>> >
>>> > So that's now two crash bugs in as many days and lack of clarity about
>>> > how to fix it.
>>
> The issue seems to be that a prepared transaction is yet to be committed,
> but autovacuum comes in and causes the clog to be truncated beyond this
> prepared transaction ID in one of the runs.
>
> We only add the corresponding pgproc entry for a surviving 2PC transaction
> on completion of recovery. So there could be a race condition here.
> Digging in further.
>
> Regards,
> Nikhils

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

2pc_redo_remove_bug.patch
Description: Binary data

2pc-restore-fix.patch
Description: Binary data
Re: [HACKERS] Failed recovery with new faster 2PC code
On 17 April 2017 at 15:02, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>
>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>> >> Author: Simon Riggs <si...@2ndquadrant.com>
>> >> Date:   Tue Apr 4 15:56:56 2017 -0400
>> >>
>> >>     Speedup 2PC recovery by skipping two phase state files in normal path
>> >
>> > Thanks Jeff for your tests.
>> >
>> > So that's now two crash bugs in as many days and lack of clarity about
>> > how to fix it.
>

The issue seems to be that a prepared transaction is yet to be committed, but
autovacuum comes in and causes the clog to be truncated beyond this prepared
transaction ID in one of the runs.

We only add the corresponding pgproc entry for a surviving 2PC transaction on
completion of recovery. So there could be a race condition here. Digging in
further.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Failed recovery with new faster 2PC code
> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
> >> Author: Simon Riggs
> >> Date:   Tue Apr 4 15:56:56 2017 -0400
> >>
> >>     Speedup 2PC recovery by skipping two phase state files in normal path
> >
> > Thanks Jeff for your tests.
> >
> > So that's now two crash bugs in as many days and lack of clarity about
> > how to fix it.

I am trying to reproduce and debug it from my end as well.

Regards,
Nikhils
Re: [HACKERS] Speedup twophase transactions
> I don't have anything else to say about this patch, so should we mark
> that as ready for committer? There are still a couple of days left
> until the end of the CF, and quite a lot has happened, so this could
> get on time into PG10.

Yes, let's mark it "ready for committer". This patch/feature has been a long
journey! :)

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
Please ignore the reports about errors in other tests. They seem spurious.

Regards,
Nikhils

On 28 March 2017 at 10:40, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
> Hi Michael,
>
> The latest patch looks good. By now doing a single scan of the shmem two
> phase data, we have removed the double loops in all the affected functions,
> which is good.
>
> My only question is whether the added call to restoreTwoPhaseData() is good
> enough to handle all 3 functions, PrescanPreparedTransactions(),
> StandbyRecoverPreparedTransactions() and RecoverPreparedTransactions(),
> appropriately. It looks as if it does, but we need to be doubly sure.
>
> PFA, revised patch with a very minor typo fix and a rebase against latest
> master. The test cases pass as needed.
>
> Oh, btw, while running the TAP tests, I got a few errors in unrelated tests:
>
> "# testing connection parameter "target_session_attrs"
> not ok 5 - connect to node master if mode "read-write" and master,standby_1 listed
> #   Failed test 'connect to node master if mode "read-write" and master,standby_1 listed'
> #   at t/001_stream_rep.pl line 93.
> #          got: ''
> #     expected: '1'
> not ok 6 - connect to node master if mode "read-write" and standby_1,master listed
> #   Failed test 'connect to node master if mode "read-write" and standby_1,master listed'
> #   at t/001_stream_rep.pl line 93.
> #          got: ''
> #     expected: '1'
> not ok 7 - connect to node master if mode "any" and master,standby_1 listed
> #   Failed test 'connect to node master if mode "any" and master,standby_1 listed'
> #   at t/001_stream_rep.pl line 93.
> #          got: ''
> #     expected: '1'
> not ok 8 - connect to node standby_1 if mode "any" and standby_1,master listed"
>
> Again, not related to this recovery code path, but not sure if others see
> this as well.
>
> Regards,
> Nikhils
>
> On 27 March 2017 at 05:35, Michael Paquier <michael.paqu...@gmail.com> wrote:
>> On Sun, Mar 26, 2017 at 4:50 PM, Nikhil Sontakke
>> <nikh...@2ndquadrant.com> wrote:
>> > I was away for a bit. I will take a look at this patch and get back to
>> > you soon.
>>
>> No problem. Thanks for your time!
>> --
>> Michael

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
Hi Michael,

The latest patch looks good. By now doing a single scan of the shmem two
phase data, we have removed the double loops in all the affected functions,
which is good.

My only question is whether the added call to restoreTwoPhaseData() is good
enough to handle all 3 functions, PrescanPreparedTransactions(),
StandbyRecoverPreparedTransactions() and RecoverPreparedTransactions(),
appropriately. It looks as if it does, but we need to be doubly sure.

PFA, revised patch with a very minor typo fix and a rebase against latest
master. The test cases pass as needed.

Oh, btw, while running the TAP tests, I got a few errors in unrelated tests:

"# testing connection parameter "target_session_attrs"
not ok 5 - connect to node master if mode "read-write" and master,standby_1 listed
#   Failed test 'connect to node master if mode "read-write" and master,standby_1 listed'
#   at t/001_stream_rep.pl line 93.
#          got: ''
#     expected: '1'
not ok 6 - connect to node master if mode "read-write" and standby_1,master listed
#   Failed test 'connect to node master if mode "read-write" and standby_1,master listed'
#   at t/001_stream_rep.pl line 93.
#          got: ''
#     expected: '1'
not ok 7 - connect to node master if mode "any" and master,standby_1 listed
#   Failed test 'connect to node master if mode "any" and master,standby_1 listed'
#   at t/001_stream_rep.pl line 93.
#          got: ''
#     expected: '1'
not ok 8 - connect to node standby_1 if mode "any" and standby_1,master listed"

Again, not related to this recovery code path, but not sure if others see
this as well.

Regards,
Nikhils

On 27 March 2017 at 05:35, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Sun, Mar 26, 2017 at 4:50 PM, Nikhil Sontakke
> <nikh...@2ndquadrant.com> wrote:
> > I was away for a bit. I will take a look at this patch and get back to
> > you soon.
>
> No problem. Thanks for your time!
> --
> Michael

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

twophase_recovery_shmem_280317.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
Thanks Michael, I was away for a bit. I will take a look at this patch and
get back to you soon.

Regards,
Nikhils

On 22 March 2017 at 07:40, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 5:15 PM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
> > On Fri, Mar 17, 2017 at 5:00 PM, Nikhil Sontakke
> > <nikh...@2ndquadrant.com> wrote:
> >> Michael, it looks like you are working on a final version of this patch?
> >> I will wait to review it from my end, then.
> >
> > I have to admit that I am beginning to get drawn into it...
>
> And here is what I got. I have found a couple of inconsistencies in
> the patch, roughly:
> - During recovery, entries marked with ondisk = true should have their
>   start and end LSN reset to InvalidXLogRecPtr. This was actually
>   leading to some inconsistencies in MarkAsPreparing() for 2PC
>   transactions staying around for more than 2 checkpoints.
> - RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
>   and PrescanPreparedTransactions() doing both a scan of pg_twophase and
>   the shared memory entries was way too complicated. I have changed
>   things so that only memory entries are scanned by those routines, but
>   an initial scan of pg_twophase is done before recovery.
> - Some inconsistencies in the comments and some typos found on the way.
> - Simplification of some routines used in redo, as well as a simplified
>   set of routines made available to users.
>
> Tests are passing for me, an extra lookup would be nice.
> --
> Michael

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> > Ok, we can do that and then yes, RecoverPreparedTransaction() can just
> > have one loop going through the shmem entries. BUT, we cannot ignore
> > "inredo"+"ondisk" entries. For such entries, we will have to read and
> > recover from the corresponding 2PC files.
>
> Yes. About other things I found... In CheckPointTwoPhase(), I am
> actually surprised that prepare_start_lsn and prepare_end_lsn are not
> reset to InvalidXLogRecPtr when a shmem entry is flushed to disk after
> ondisk is set to true; there is no need for them as the data does not
> need to be fetched from WAL segments, so we had better be consistent
> (regression tests fail if I do that). And the two extra arguments in
> MarkAsPreparing() are really unnecessary, they get set all the time to
> InvalidXLogRecPtr.

Michael, it looks like you are working on a final version of this patch? I
will wait to review it from my end, then.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> > I don't think this will work. We cannot replace pg_twophase with shmem
> > entries + WAL pointers. This is because we cannot expect to have WAL
> > entries around for long-running prepared transactions which survive
> > across checkpoints.
>
> But at the beginning of recovery, we can mark such entries with ondisk
> and inredo, in which case the WAL pointers stored in the shmem entries
> do not matter because the data is already on disk.

Ok, we can do that, and then yes, RecoverPreparedTransaction() can just have
one loop going through the shmem entries. BUT, we cannot ignore
"inredo"+"ondisk" entries. For such entries, we will have to read and recover
from the corresponding 2PC files.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> Nikhil, do you mind if I try something like that? As we already know
> what the first XID is when beginning redo, via
> ShmemVariableCache->nextXid, it is possible to discard 2PC files that
> should not be here.

Yeah, that is ok.

> What makes me worry is the control of the maximum
> number of entries in shared memory. If there are legit 2PC files that
> are flushed on disk at checkpoint, you would finish with potentially
> more 2PC transactions than what should be possible (even if updates of
> max_prepared_xacts are WAL-logged).

The max_prepared_xacts number restricts the number of pending PREPARED
transactions *across* the 2PC files and the shmem inredo entries. We can
never have more entries than this value.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> + * * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
> + *     and PrescanPreparedTransactions() have been modified to go through
> + *     gxact->inredo entries that have not made it to disk yet.
>
> It seems to me that there should be an initial scan of pg_twophase at
> the beginning of recovery, discarding on the way, with a WARNING,
> entries that are older than the checkpoint redo horizon. This should
> fill in shmem entries using something close to PrepareRedoAdd(), and
> mark those entries as inredo. Then, at the end of recovery,
> PrescanPreparedTransactions does not need to look at the entries in
> pg_twophase. And that's the case as well of
> RecoverPreparedTransaction(). I think that you could get the patch
> much simplified this way, as any 2PC data can be fetched directly from
> WAL segments and there is no need to rely on scans of pg_twophase;
> this is replaced by scans of entries in TwoPhaseState.

I don't think this will work. We cannot replace pg_twophase with shmem
entries + WAL pointers. This is because we cannot expect to have WAL entries
around for long-running prepared transactions which survive across
checkpoints.

Regards,
Nikhils
Re: [HACKERS] Speedup twophase transactions
> Thanks for the new version. Let's head toward a final patch.

:-)

> > Have added a new function to do this now. It reads either from disk or
> > shared memory and produces error/log messages accordingly as well now.
>
> And that's ProcessTwoPhaseBufferAndReturn in the patch.
> ProcessTwoPhaseBuffer may be a better name.

Renamed to ProcessTwoPhaseBuffer().

> > Since we are using the TwoPhaseState structure to track redo entries, at
> > end of recovery, we will find existing entries. Please see my comments
> > above for RecoverPreparedTransactions()
>
> This is absolutely not good, because it is a direct result of the
> interactions of the first loop of RecoverPreparedTransaction() with
> its second loop, and because MarkAsPreparing() can finish by being
> called *twice* from the same transaction. I really think that this
> portion should be removed and that RecoverPreparedTransactions()
> should be more careful when scanning the entries in pg_twophase by
> looking up what exists as well in shared memory, instead of doing
> that in MarkAsPreparing().

Ok. Modified MarkAsPreparing() to call a new MarkAsPreparingGuts() function.
This function takes in a "gxact" and works on it.
RecoverPreparedTransaction() now calls a newly added
RecoverFromTwoPhaseBuffer() function, which checks if an entry already exists
via redo and calls the MarkAsPreparingGuts() function by passing in that
gxact. Otherwise the existing MarkAsPreparing() gets called.

> Here are some more comments:
>
> +       /*
> +        * Recreate its GXACT and dummy PGPROC
> +        */
> +       gxactnew = MarkAsPreparing(xid, gid,
> +                                  hdr->prepared_at,
> +                                  hdr->owner, hdr->database,
> +                                  gxact->prepare_start_lsn,
> +                                  gxact->prepare_end_lsn);
> +
> +       Assert(gxactnew == gxact);
> Here it would be better to set ondisk to false. This makes the code
> more consistent with the previous loop, and the intention clear.

Done.

> The first loop of RecoverPreparedTransactions() has a lot in common
> with its second loop. You may want to refactor a little bit more here.

Done. Added the new function RecoverFromTwoPhaseBuffer() as mentioned above.

> +/*
> + * PrepareRedoRemove
> + *
> + * Remove the corresponding gxact entry from TwoPhaseState. Also
> + * remove the 2PC file.
> + */
> This could be a bit more expanded. The removal of the 2PC file does not
> happen after removing the in-memory data, it would happen if the
> in-memory data is not found.

Done.

> +MarkAsPreparingInRedo(TransactionId xid, const char *gid,
> +               TimestampTz prepared_at, Oid owner, Oid databaseid,
> +               XLogRecPtr prepare_start_lsn, XLogRecPtr prepare_end_lsn)
> +{
> +       GlobalTransaction gxact;
> +
> +       LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> MarkAsPreparingInRedo is internal to twophase.c. There is no need to
> expose it externally and it is just used in PrepareRedoAdd so you
> could just group both.

Removed the MarkAsPreparingInRedo() function and inlined the code in
PrepareRedoAdd().

>        bool    valid;   /* TRUE if PGPROC entry is in proc array */
>        bool    ondisk;  /* TRUE if prepare state file is on disk */
> +      bool    inredo;  /* TRUE if entry was added via xlog_redo */
> We could have a set of flags here, that's the 3rd boolean of the
> structure used for a status.

This is more of a cleanup and does not need to be part of this patch. It can
be a follow-on cleanup patch.

I also managed to do some perf testing. Modified Stas' earlier scripts
slightly:

\set naccounts 10 * :scale
\set from_aid random(1, :naccounts)
\set to_aid random(1, :naccounts)
\set delta random(1, 100)
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';

Created a base backup with scale factor 125 on an AWS t2.large instance. Set
up archiving and did a 20-minute run with the above script, saving the WALs
in the archive. Then used recovery.conf to point to this WAL location and
used the base backup to recover.

With this patch applied: 20s
Without the patch: stopped measuring after 5 minutes ;-)

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

twophase_recovery_shmem_150317.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
Hi David and Michael,

> It would be great to get this thread closed out after 14 months and many
> commits.

PFA, latest patch which addresses Michael's comments.

> twophase.c: In function ‘PrepareRedoAdd’:
> twophase.c:2539:20: warning: variable ‘gxact’ set but not used
> [-Wunused-but-set-variable]
>   GlobalTransaction gxact;
> There is a warning at compilation.

Fixed.

> The comment at the top of PrescanPreparedTransactions() needs to be
> updated. Not only does this routine look for the contents of
> pg_twophase, but it does also look at the shared memory contents, at
> least those ones marked with inredo and not on_disk.

Changed the comments at the top of PrescanPreparedTransactions(),
StandbyRecoverPreparedTransactions(), and RecoverPreparedTransactions().

> +               ereport(WARNING,
> +                       (errmsg("removing future two-phase state data from memory \"%u\"",
> +                               xid)));
> +               PrepareRedoRemove(xid);
> +               continue
> Those are not normal (partially because unlink is atomic, but not
> durable)... But they match the correct coding pattern regarding
> incorrect 2PC entries... I'd really like to see those switched to a
> FATAL with unlink() made durable for those calls.

Hmm, not sure what exactly we need to do here. If you look at the prior
checks, there we already skip on-disk entries. So, typically, the entries
that we encounter here will be in shmem only.

> +       /* Deconstruct header */
> +       hdr = (TwoPhaseFileHeader *) buf;
> +       Assert(TransactionIdEquals(hdr->xid, xid));
> +
> +       if (TransactionIdPrecedes(xid, result))
> +               result = xid;
> This portion is repeated three times and could be easily refactored.
> You could just have a routine that returns the oldest transaction ID
> used, and ignore the result for StandbyRecoverPreparedTransactions()
> by casting a (void) to let any kind of static analyzer understand that
> we don't care about the result, for example. Handling for subxids is
> necessary as well depending on the code path. Splitting things into a
> couple of sub-routines may be more readable as well. There are really a
> couple of small parts that can be gathered and strengthened.

Have added a new function to do this now. It reads either from disk or shared
memory and produces error/log messages accordingly as well now.

> +       /*
> +        * Recreate its GXACT and dummy PGPROC
> +        */
> +       gxactnew = MarkAsPreparing(xid, gid,
> +                                  hdr->prepared_at,
> +                                  hdr->owner, hdr->database,
> +                                  gxact->prepare_start_lsn,
> +                                  gxact->prepare_end_lsn);
> MarkAsPreparing() does not need to be extended with two new arguments.
> RecoverPreparedTransactions() is used only at the end of recovery,
> where it is not necessary to look at the 2PC state files from the
> records. In this code path inredo is also set to false :)

That's not true. We will have entries with inredo set at the end of recovery
as well. In fact, the MarkAsPreparing() call from
RecoverPreparedTransactions() is the one which removes these inredo entries
and converts them into regular entries. We have optimized the recovery code
path as well.

> +       /* Delete TwoPhaseState gxact entry and/or 2PC file. */
> +       PrepareRedoRemove(parsed.twophase_xid);
> Both things should not be present, no? If the file is pushed to disk
> it means that the checkpoint horizon has already moved.

PREPARE in redo, followed by a checkpoint, followed by a COMMIT/ROLLBACK. We
can have both the bits set in this case.

> -               ereport(ERROR,
> +               /* It's ok to find an entry in the redo/recovery case */
> +               if (!gxact->inredo)
> +                       ereport(ERROR,
>                         (errcode(ERRCODE_DUPLICATE_OBJECT),
>                          errmsg("transaction identifier \"%s\" is already in use",
>                                 gid)));
> +               else
> +               {
> +                       found = true;
> +                       break;
> +               }
> I would not have thought so.

Since we are using the TwoPhaseState structure to track redo entries, at the
end of recovery, we will find existing entries. Please see my comments above
for RecoverPreparedTransactions().

> MarkAsPreparing and MarkAsPreparingInRedo really share the same code.
> What if the setup of the dummy PGPROC entry is made conditional?

I realized that MarkAsPreparingInRedo() does not need to do all the sanity
checking, since it's going to be invoked during redo and everything that
comes in is kosher already. So its contents are much simplified in this
latest patch.

Tests pass with this latest patch.

Regards,
Nikhils

twophase_recovery_shmem_110317.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
Hi Michael, Thanks for taking a look at the patch. > twophase.c: In function ‘PrepareRedoAdd’: > twophase.c:2539:20: warning: variable ‘gxact’ set but not used > [-Wunused-but-set-variable] > GlobalTransaction gxact; > There is a warning at compilation. > > Will fix. > The comment at the top of PrescanPreparedTransactions() needs to be > updated. Not only does this routine look for the contents of > pg_twophase, but it does also look at the shared memory contents, at > least those ones marked with inredo and not on_disk. > > Oh yes. Will change comments at top of StandbyRecoverPreparedTransactions(), RecoverPreparedTransactions() as well. > + ereport(WARNING, > + (errmsg("removing future two-phase state data from > memory \"%u\"", > + xid))); > + PrepareRedoRemove(xid); > + continue > Those are not normal (partially because unlink is atomic, but not > durable)... But they match the correct coding pattern regarding > incorrect 2PC entries... I'd really like to see those switched to a > FATAL with unlink() made durable for those calls. > > Hmm, not sure what exactly we need to do here. If you look at the prior checks, there we already skip on-disk entries. So, typically, the entries that we encounter here will be in shmem only. > + /* Deconstruct header */ > + hdr = (TwoPhaseFileHeader *) buf; > + Assert(TransactionIdEquals(hdr->xid, xid)); > + > + if (TransactionIdPrecedes(xid, result)) > + result = xid; > This portion is repeated three times and could be easily refactored. > You could just have a routine that returns the oldes transaction ID > used, and ignore the result for StandbyRecoverPreparedTransactions() > by casting a (void) to let any kind of static analyzer understand that > we don't care about the result for example. Handling for subxids is > necessary as well depending on the code path. Spliting things into a > could of sub-routines may be more readable as well. There are really a > couple of small parts that can be gathered and strengthened. 
I will see if we can reduce this to a couple of function calls.

> + /*
> +* Recreate its GXACT and dummy PGPROC
> +*/
> + gxactnew = MarkAsPreparing(xid, gid,
> + hdr->prepared_at,
> + hdr->owner, hdr->database,
> + gxact->prepare_start_lsn,
> + gxact->prepare_end_lsn);
> MarkAsPreparing() does not need to be extended with two new arguments.
> RecoverPreparedTransactions() is used only at the end of recovery,
> where it is not necessary to look at the 2PC state files from the
> records. In this code path inredo is also set to false :)

That's not true. We will have entries with inredo set at the end of
recovery as well. In fact, the MarkAsPreparing() call from
RecoverPreparedTransactions() is the one which will remove these inredo
entries and convert them into regular entries. We have optimized the
recovery code path as well.

> + {
> + /*
> +* Entry could be on disk. Call with giveWarning=false
> +* since it can be expected during replay.
> +*/
> + RemoveTwoPhaseFile(xid, false);
> + }
> This would be removed at the end of recovery anyway as a stale entry,
> so that's not necessary.

Ok, will remove this.

> + /* Delete TwoPhaseState gxact entry and/or 2PC file. */
> + PrepareRedoRemove(parsed.twophase_xid);
> Both things should not be present, no? If the file is pushed to disk
> it means that the checkpoint horizon has already moved.

PREPARE in redo, followed by a checkpoint, followed by a
COMMIT/ROLLBACK. We can have both the bits set in this case.

> - ereport(ERROR,
> + /* It's ok to find an entry in the redo/recovery case */
> + if (!gxact->inredo)
> + ereport(ERROR,
> (errcode(ERRCODE_DUPLICATE_OBJECT),
> errmsg("transaction identifier \"%s\" is already in
> use",
> gid)));
> + else
> + {
> + found = true;
> + break;
> + }
> I would not have thought so.

Since we are using the TwoPhaseState structure to track redo entries,
at end of recovery, we will find existing entries.
Please see my comments above for RecoverPreparedTransactions().

> MarkAsPreparing and MarkAsPreparingInRedo really share the same code.
> What if the setup of the dummy PGPROC entry is made conditional?

I thought it was cleaner this way. We can definitely add a bunch of
if-else branches in MarkAsPreparing(), but it won't look pretty.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Yeah. Was thinking about this yesterday. How about adding entries in >> TwoPhaseState itself (which become valid later)? Only if it does not >> cause a lot of code churn. > > That's possible as well, yes. PFA a patch which does the above. It re-uses the TwoPhaseState gxact entries to track 2PC PREPARE/COMMIT in shared memory. The advantage being that CheckPointTwoPhase() becomes the only place where the fsync of 2PC files happens. A minor annoyance in the patch is the duplication of the code to add the 2nd while loop to go through these shared memory entries in PrescanPreparedTransactions, RecoverPreparedTransactions and StandbyRecoverPreparedTransactions. Other than this, I ran TAP tests and they succeed as needed. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services twophase_recovery_shmem_020217.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
>> Shouldn’t the singular part of the message above be:
>> "%u two-phase state file was written for a long-running prepared transaction"
>>
>> But, then, English is not my native language, so I might be off here :-)
>
> If there's one file per long-running prepared transaction, which I
> think is true, then I agree with you.

PFA, small patch to fix this, then.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

twophase_typo.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
>> CheckPointTwoPhase() in (5) does not sync this prepared transaction
>> because the checkpointer's KnownPreparedList is empty.
>
> And that's why this needs to be stored in shared memory with a number
> of elements made of max_prepared_xacts...

Yeah. Was thinking about this yesterday. How about adding entries in
TwoPhaseState itself (which become valid later)? Only if it does not
cause a lot of code churn.

The current non-shmem list patch only needs to handle standby shutdowns
correctly. Other aspects like standby promotion/recovery are handled ok
AFAICS.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Having CheckPointTwoPhase() do the flush would mean shifting the data >> from KnownPreparedList into TwoPhaseState shmem. > > Er, no. For CheckPointTwoPhase(), at recovery what needs to be done is > to have all the entries in KnownPreparedList() flushed to disk and > have those entries removed while holding a shared memory lock. The KnownPreparedList is constructed by the recovery process. CheckPointTwoPhase() gets called by the checkpointer process. The checkpointer does not have access to this valid KnownPreparedList. >And for > the rest we need to be careful to have PrescanPreparedTransactions, > RecoverPreparedTransactions and StandbyRecoverPreparedTransactions > aware of entries that are in KnownPreparedList(). Yeah, that part is straightforward. It does involve duplication of the earlier while loops to work against KnownPreparedList. A smart single while loop which loops over the 2PC files followed by the list would help here :-) > Let's leave the > business of putting the information from KnownPreparedList to > TwoPhaseState in RecoverPreparedTransactions, which need to be aware > of entries in KnownPreparedList() anyway. The only thing that differs > is how the 2PC information is fetched: from the segments or from the > files in pg_twophase. > Yeah. This part is also ok. We also got to be careful to mark the shmem gxact entry with "ondisk = false" and need to set prepare_start_lsn/prepare_end_lsn properly as well. >> I wonder what's the best location for this in the common case when we >> do shutdown of standby. We could add code in XLOG_CHECKPOINT_SHUTDOWN >> and XLOG_CHECKPOINT_ONLINE xlog_redo code path. > > ShutdownXLOG() calls CreateRestartPoint() when a standby shuts down, > so doing all the durability work in CheckPointTwoPhase() would take > care of any problems. > ShutdownXLOG() gets called from the checkpointer process. See comments above about the checkpointer not having access to the proper KnownPreparedList. 
The following test sequence will trigger the issue:

1) start master
2) start replica
3) prepare a transaction on master
4) shutdown master
5) shutdown replica

CheckPointTwoPhase() in (5) does not sync this prepared transaction
because the checkpointer's KnownPreparedList is empty.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
> (errmsg("unexpected timeline ID %u (should be %u)
> in checkpoint record",
> checkPoint.ThisTimeLineID, ThisTimeLineID)));
>
> +KnownPreparedRecreateFiles(checkPoint.redo);
> RecoveryRestartPoint();
> }
> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
> files are not flushed to disk with this patch. This is a problem as a
> new restart point is created... Having the flush in CheckpointTwoPhase
> really makes the most sense.

Having CheckPointTwoPhase() do the flush would mean shifting the data
from KnownPreparedList into TwoPhaseState shmem.

I wonder what's the best location for this in the common case when we
do shutdown of standby. We could add code in XLOG_CHECKPOINT_SHUTDOWN
and XLOG_CHECKPOINT_ONLINE xlog_redo code path.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>>> The xact_redo code will add prepared transactions to the
>>> KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
>>> file.
>>>
>>> At standby promote, the surviving (yet uncommitted) prepared
>>> transactions from KnownPreparedList need to be persisted, right?
>>
>> I don't see why, so please explain or show the error that can be
>> caused if we don't.
>
> I agree with Simon here. There is no point to fsync the 2PC files: in
> case of an immediate crash after promotion, replay will happen from
> the last checkpoint, aka the one before the promotion has been
> triggered. So there is no point to flush them at promotion, they would
> be replayed anyway.

1) start master
2) start streaming replica
3) on master:
   begin;
   create table test1(g int);
   prepare transaction 'test';
4) promote standby via trigger file or via "pg_ctl promote"

I thought if we don't fsync the KnownPreparedList then we might not
create the 2PC state properly on the standby. However, I do realize
that we will be calling RecoverPreparedTransactions() eventually on
promote. This function will be modified to go through the
KnownPreparedList to reload shmem appropriately for 2PC. So, all good
in that case. Apologies for the digression.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
On 27 January 2017 at 15:37, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 27 January 2017 at 09:59, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>>>> But, I put the recovery process and the checkpointer process of the
>>>> standby under gdb with breakpoints on these functions, but both did
>>>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>>>> a promote :-|
>>>
>>> No end-of-recovery checkpoints happen at promotion since 9.3. You can
>>> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
>>> included) behavior.
>>
>> Ok, so that means, we also need to fsync out these 2PC XIDs at promote
>> time as well for their durability.
>
> Why?

The xact_redo code will add prepared transactions to the
KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
file.

At standby promote, the surviving (yet uncommitted) prepared
transactions from KnownPreparedList need to be persisted, right?

Regards,
Nikhils
Re: [HACKERS] Speedup twophase transactions
>> But, I put the recovery process and the checkpointer process of the
>> standby under gdb with breakpoints on these functions, but both did
>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>> a promote :-|
>
> No end-of-recovery checkpoints happen at promotion since 9.3. You can
> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
> included) behavior.

Ok, so that means, we also need to fsync out these 2PC XIDs at promote
time as well for their durability.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
>> promote" code path.
>
> CreateRestartPoint() calls it via CheckPointGuts() while in recovery.

Maybe I am missing something. But I put the recovery process and the
checkpointer process of the standby under gdb with breakpoints on these
functions, and both did not hit CreateRestartPoint() as well as
CheckPointGuts() when I issued a promote :-|

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> I look at this patch from you and that's present for me:
> https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
> (errmsg("unexpected timeline ID %u (should be %u)
> in checkpoint record",
> checkPoint.ThisTimeLineID, ThisTimeLineID)));
>
> +KnownPreparedRecreateFiles(checkPoint.redo);
> RecoveryRestartPoint();
> }

Oh, sorry. I was asking about CheckpointTwoPhase(). I don't see a
function by this name. And now I see, the name is CheckPointTwoPhase() :-)

> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
> files are not flushed to disk with this patch. This is a problem as a
> new restart point is created... Having the flush in CheckpointTwoPhase
> really makes the most sense.

Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
promote" code path.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> The question remains whether saving off a few fsyncs/reads for these >> long-lived prepared transactions is worth the additional code churn. >> Even if we add code to go through the KnownPreparedList, we still will >> have to go through the other on-disk 2PC transactions anyways. So, >> maybe not. > > We should really try to do things right now, or we'll never come back > to it. 9.3 (if my memory does not fail me?) has reduced the time to do > promotion by removing the need of the end-of-recovery checkpoint, > while I agree that there should not be that many 2PC transactions at > this point, if there are for a reason or another, the time it takes to > complete promotion would be impacted. So let's refactor > PrescanPreparedTransactions() so as it is able to handle 2PC data from > a buffer extracted by XlogReadTwoPhaseData(), and we are good to go. > Not quite. If we modify PrescanPreparedTransactions(), we also need to make RecoverPreparedTransactions() and StandbyRecoverPreparedTransactions() handle 2PC data via XlogReadTwoPhaseData(). > + /* > +* Move prepared transactions, if any, from KnownPreparedList to > files. > +* It is possible to skip this step and teach subsequent code about > +* KnownPreparedList, but PrescanPreparedTransactions() happens once > +* during end of recovery or on promote, so probably it isn't worth > +* the additional code. > +*/ This comment is misplaced. Does not make sense before this specific call. > + KnownPreparedRecreateFiles(checkPoint.redo); > RecoveryRestartPoint(); > Looking again at this code, I think that this is incorrect. The > checkpointer should be in charge of doing this work and not the > startup process, so this should go into CheckpointTwoPhase() instead. I don't see a function by the above name in the code? 
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> We are talking about the recovery/promote code path. Specifically this
> call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
>
> We write the files to disk and they get immediately read up in the
> following code. We could instead skip writing the files to disk and
> read KnownPreparedList in the code path that follows, as well as
> elsewhere.

Thinking more on this. The only optimization that's really remaining is
the handling of prepared transactions that have not been committed or
will linger around for long. The short-lived 2PC transactions have been
optimized already via this patch.

The question remains whether saving off a few fsyncs/reads for these
long-lived prepared transactions is worth the additional code churn.
Even if we add code to go through the KnownPreparedList, we still will
have to go through the other on-disk 2PC transactions anyway. So, maybe
not.

Regards,
Nikhils

>
> Regards,
> Nikhils
>
>>> The difference between those two is likely noise.
>>>
>>> By the way, in those measurements, the OS cache is still filled with
>>> the past WAL segments, which is a rather best case, no? What happens
>>> if you do the same kind of tests on a box where memory is busy doing
>>> something else and replayed WAL segments get evicted from the OS cache
>>> more aggressively once the startup process switches to a new segment?
>>> This could be tested for example on a VM with few memory (say 386MB or
>>> less) so as the startup process needs to access again the past WAL
>>> segments to recover the 2PC information it needs to get them back
>>> directly from disk... One trick that you could use here would be to
>>> tweak the startup process so as it drops the OS cache once a segment
>>> is finished replaying, and see the effects of an aggressive OS cache
>>> eviction. This patch is showing really nice improvements with the OS
>>> cache backing up the data, still it would make sense to test things
>>> with a worse test case and see if things could be done better.
The >>> startup process now only reads records sequentially, not randomly >>> which is a concept that this patch introduces. >>> >>> Anyway, perhaps this does not matter much, the non-recovery code path >>> does the same thing as this patch, and the improvement is too much to >>> be ignored. So for consistency's sake we could go with the approach >>> proposed which has the advantage to not put any restriction on the >>> size of the 2PC file contrary to what an implementation saving the >>> contents of the 2PC files into memory would need to do. >> >> Maybe i’m missing something, but I don’t see how OS cache can affect >> something here. >> >> Total WAL size was 0x44 * 16 = 1088 MB, recovery time is about 20s. >> Sequential reading 1GB of data >> is order of magnitude faster even on the old hdd, not speaking of ssd. Also >> you can take a look on flame graphs >> attached to previous message — majority of time during recovery spent in >> pg_qsort while replaying >> PageRepairFragmentation, while whole xact_redo_commit() takes about 1% of >> time. That amount can >> grow in case of uncached disk read but taking into account total recovery >> time this should not affect much. >> >> If you are talking about uncached access only during checkpoint than here we >> are restricted with >> max_prepared_transaction, so at max we will read about hundred of small >> files (usually fitting into one filesystem page) which will also >> be barely noticeable comparing to recovery time between checkpoints. Also >> wal segments cache eviction during >> replay doesn’t seems to me as standard scenario. >> >> Anyway i took the machine with hdd to slow down read speed and run tests >> again. During one of the runs i >> launched in parallel bash loop that was dropping os cache each second (while >> wal fragment replay takes >> also about one second). 
>>
>> 1.5M transactions
>> start segment: 0x06
>> last segment: 0x47
>>
>> patched, with constant cache_drop:
>> total recovery time: 86s
>>
>> patched, without constant cache_drop:
>> total recovery time: 68s
>>
>> (while difference is significant, i bet that happens mostly because of
>> database file segments should be re-read after cache drop)
>>
>> master, without constant cache_drop:
>> time to recover 35 segments: 2h 25m (after that i tired to wait)
>> expected total recovery time: 4.5 hours
>>
>> --
>> Stas Kelvich
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>
> --
> Nikhil Sontakke http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> Thanks for review, Nikhil and Michael.
>
> I don’t follow here. We are moving data away from WAL to files on checkpoint
> because after checkpoint there is no guaranty that WAL segment with our
> prepared tx will be still available.

We are talking about the recovery/promote code path. Specifically this
call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().

We write the files to disk and they get immediately read up in the
following code. We could instead skip writing the files to disk and
read KnownPreparedList in the code path that follows, as well as
elsewhere.

Regards,
Nikhils

>> The difference between those two is likely noise.
>>
>> By the way, in those measurements, the OS cache is still filled with
>> the past WAL segments, which is a rather best case, no? What happens
>> if you do the same kind of tests on a box where memory is busy doing
>> something else and replayed WAL segments get evicted from the OS cache
>> more aggressively once the startup process switches to a new segment?
>> This could be tested for example on a VM with few memory (say 386MB or
>> less) so as the startup process needs to access again the past WAL
>> segments to recover the 2PC information it needs to get them back
>> directly from disk... One trick that you could use here would be to
>> tweak the startup process so as it drops the OS cache once a segment
>> is finished replaying, and see the effects of an aggressive OS cache
>> eviction. This patch is showing really nice improvements with the OS
>> cache backing up the data, still it would make sense to test things
>> with a worse test case and see if things could be done better. The
>> startup process now only reads records sequentially, not randomly
>> which is a concept that this patch introduces.
>>
>> Anyway, perhaps this does not matter much, the non-recovery code path
>> does the same thing as this patch, and the improvement is too much to
>> be ignored.
So for consistency's sake we could go with the approach >> proposed which has the advantage to not put any restriction on the >> size of the 2PC file contrary to what an implementation saving the >> contents of the 2PC files into memory would need to do. > > Maybe i’m missing something, but I don’t see how OS cache can affect > something here. > > Total WAL size was 0x44 * 16 = 1088 MB, recovery time is about 20s. > Sequential reading 1GB of data > is order of magnitude faster even on the old hdd, not speaking of ssd. Also > you can take a look on flame graphs > attached to previous message — majority of time during recovery spent in > pg_qsort while replaying > PageRepairFragmentation, while whole xact_redo_commit() takes about 1% of > time. That amount can > grow in case of uncached disk read but taking into account total recovery > time this should not affect much. > > If you are talking about uncached access only during checkpoint than here we > are restricted with > max_prepared_transaction, so at max we will read about hundred of small files > (usually fitting into one filesystem page) which will also > be barely noticeable comparing to recovery time between checkpoints. Also wal > segments cache eviction during > replay doesn’t seems to me as standard scenario. > > Anyway i took the machine with hdd to slow down read speed and run tests > again. During one of the runs i > launched in parallel bash loop that was dropping os cache each second (while > wal fragment replay takes > also about one second). 
>
> 1.5M transactions
> start segment: 0x06
> last segment: 0x47
>
> patched, with constant cache_drop:
> total recovery time: 86s
>
> patched, without constant cache_drop:
> total recovery time: 68s
>
> (while difference is significant, i bet that happens mostly because of
> database file segments should be re-read after cache drop)
>
> master, without constant cache_drop:
> time to recover 35 segments: 2h 25m (after that i tired to wait)
> expected total recovery time: 4.5 hours
>
> --
> Stas Kelvich
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Speeding up recovery or failover activity via a faster promote is a
>> desirable thing. So, maybe, we should look at teaching the relevant
>> code about using "KnownPreparedList"? I know that would increase the
>> size of this patch and would mean more testing, but this seems to be
>> the last remaining optimization in this code path.
>
> That's a good idea, worth having in this patch. Actually we may not
> want to call KnownPreparedRecreateFiles() here as promotion is not
> synonym of end-of-recovery checkpoint for a couple of releases now.

Once implemented, a good way to performance test this could be to set
checkpoint_timeout to a large value like an hour. Then, generate enough
2PC WAL while ensuring that a checkpoint does not happen automatically
or otherwise. We could then measure the time taken to recover on
startup to see the efficacy.

> Most of the 2PC syncs just won't happen, such transactions normally
> don't last long, and the number you would get during a checkpoint is
> largely lower than what would happen between two checkpoints. When
> working on Postgres-XC, the number of 2PC was really more funky.

Yes, Postgres-XL is full of 2PC, so hopefully this optimization should
help a lot in that case as well.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
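A hedged sketch of the settings such a test run might use (the values are illustrative, not recommendations; the point is only that checkpoint_timeout must exceed the WAL-generation phase so no checkpoint intervenes before the recovery measurement):

```ini
# postgresql.conf fragment for the 2PC recovery measurement (illustrative values)
checkpoint_timeout = 1h          # keep timed checkpoints out of the measurement window
max_wal_size = 20GB              # avoid size-triggered checkpoints as well
max_prepared_transactions = 100  # must be > 0 on both master and standby
wal_level = replica              # needed for the streaming standby
```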
Re: [HACKERS] Speedup twophase transactions
Hi Stas,

> So, here is brand new implementation of the same thing.
>
> Now instead of creating pgproc entry for prepared transaction during recovery,
> I just store recptr/xid correspondence in separate 2L-list and deleting
> entries in that list if redo process faced commit/abort. In case of
> checkpoint or end of recovery transactions remaining in that list are
> dumped to files in pg_twophase.
>
> Seems that current approach is way more simpler and patch has two times less
> LOCs then previous one.

The proposed approach and patch do appear to be much simpler than the
previous one. I have some comments/questions on the patch
(twophase_recovery_list_2.diff):

1)
+ /*
+* Move prepared transactions from KnownPreparedList to files, if any.
+* It is possible to skip that step and teach subsequent code about
+* KnownPreparedList, but whole PrescanPreparedTransactions() happens
+* once during end of recovery or promote, so probably it isn't worth
+* complications.
+*/
+ KnownPreparedRecreateFiles(InvalidXLogRecPtr);
+

Speeding up recovery or failover activity via a faster promote is a
desirable thing. So, maybe, we should look at teaching the relevant
code about using "KnownPreparedList"? I know that would increase the
size of this patch and would mean more testing, but this seems to be
the last remaining optimization in this code path.

2)
+ /*
+* Here we know that file should be moved to disk. But aborting recovery because
+* of absence of unnecessary file doesn't seems to be a good idea, so call remove
+* with giveWarning=false.
+*/
+ RemoveTwoPhaseFile(xid, false);

We are going to call the above in case of COMMIT/ABORT. If so, we
should always find the "xid" entry either in the KnownPreparedList or
as a file. Does it make sense to call the above function with "false"
then?

3) We are pushing the fsyncing of 2PC files to the checkpoint replay
activity with this patch.
Now, typically, we would assume that PREPARE followed by COMMIT/ABORT
would happen within a checkpoint replay cycle. If not, this would make
checkpoint replays slower, since the earlier spread-out fsyncs are now
getting bunched up. I had concerns about this, but clearly your latest
performance measurements don't show any negatives around this directly.

4) Minor nit-pick on existing code.

(errmsg_plural("%u two-phase state file was written "
"for long-running prepared transactions",
"%u two-phase state files were written "
"for long-running prepared transactions",
serialized_xacts,
serialized_xacts)

Shouldn’t the singular part of the message above be:
"%u two-phase state file was written for a long-running prepared transaction"

But, then, English is not my native language, so I might be off here :-)

5) Why isn't KnownPreparedRecreateFiles() tracking which entries from
KnownPreparedList have been written to disk in an earlier iteration
like in the normal code path? Why is it removing the entry from
KnownPreparedList and not just marking it as saved on disk?

6) Do we need to hold TwoPhaseStateLock lock in shared mode while
calling KnownPreparedRecreateFiles()? I think it does not matter in
the recovery/replay code path.

7) I fixed some typos and comments. PFA, patch attached.

Other than this, I ran TAP tests and they succeed as needed.

Regards,
Nikhils

On 23 January 2017 at 16:56, Stas Kelvich wrote:
>
>> On 27 Dec 2016, at 07:31, Michael Paquier wrote:
>>
>> I think that it would be a good idea to actually test that in pure
>> recovery time, aka no client, and just use a base backup and make it
>> recover X prepared transactions that have created Y checkpoints after
>> dropping cache (or restarting server).
>
> I did tests with following setup:
>
> * Start postgres initialised with pgbench
> * Start pg_receivexlog
> * Take basebackup
> * Perform 1.5 M transactions
> * Stop everything and apply wal files stored by pg_receivexlog to base backup.
> > All tests performed on a laptop with nvme ssd > number of transactions: 1.5M > start segment: 0x4 > > -master non-2pc: > last segment: 0x1b > average recovery time per 16 wal files: 11.8s > average total recovery time: 17.0s > > -master 2pc: > last segment: 0x44 > average recovery time per 16 wal files: 142s > average total recovery time: 568s > > -patched 2pc (previous patch): > last segment: 0x44 > average recovery time per 16 wal files: 5.3s > average total recovery time: 21.2s > > -patched2 2pc (dlist_push_tail changed to dlist_push_head): > last segment: 0x44 > average recovery time
Re: [HACKERS] Compression of full-page-writes
Hi Fujii-san, I must be missing something really trivial, but why not try to compress all types of WAL blocks and not just FPW? Regards, Nikhils On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao masao.fu...@gmail.com wrote: Hi, The attached patch adds a new GUC parameter 'compress_backup_block'. When this parameter is enabled, the server just compresses FPW (full-page-writes) in WAL by using pglz_compress() before inserting them into the WAL buffers. Then, the compressed FPW is decompressed in recovery. This is a very simple patch. The purpose of this patch is the reduction of WAL size. Under heavy write load, the server needs to write a large amount of WAL and this is likely to be a bottleneck. What's worse is that, in replication, a large amount of WAL has a harmful effect not only on WAL writing in the master, but also on WAL streaming and WAL writing in the standby. Also we would need to spend more money on storage to store such a large amount of data. I'd like to alleviate such harmful situations by reducing WAL size. My idea is very simple: just compress FPW, because FPW is a big part of WAL. I used pglz_compress() as the compression method, but you might think that another method is better. We can add something like an FPW-compression-hook for that later. The patch adds a new GUC parameter, but I'm thinking of merging it into the full_page_writes parameter to avoid increasing the number of GUCs. That is, I'm thinking of changing full_page_writes so that it can accept the new value 'compress'. I measured how much WAL this patch can reduce by using pgbench.
* Server spec
  CPU: 8 core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  Mem: 16GB
  Disk: 500GB SSD Samsung 840

* Benchmark
  pgbench -c 32 -j 4 -T 900 -M prepared
  scaling factor: 100
  checkpoint_segments = 1024
  checkpoint_timeout = 5min
  (every checkpoint during the benchmark was triggered by checkpoint_timeout)

* Result
  [tps]
  1386.8 (compress_backup_block = off)
  1627.7 (compress_backup_block = on)

  [the amount of WAL generated while running pgbench]
  4302 MB (compress_backup_block = off)
  1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third! The patch is still WIP, but I'd like to hear opinions about this idea before completing it, and then add the patch to the next CF if okay. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
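For intuition on why FPW compression pays off so dramatically (4302 MB down to 1521 MB above, roughly one-third), note that a full-page image is typically dominated by zeroed free space and repetitive tuple data. The sketch below uses zlib as a stand-in for pglz, and the page layout is entirely made up; it only illustrates that a mostly-empty 8 kB page compresses to far below one-third of its raw size:

```python
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size (BLCKSZ)

# Synthetic full-page image: a tiny fake header, a handful of fake
# "tuples" near the end, and zeroed free space in between -- roughly
# the shape of a freshly dirtied heap page (layout invented here).
page = bytearray(PAGE_SIZE)
page[:24] = b"\x01" * 24                 # stand-in page header bytes
for off in range(7000, 8000, 100):       # ten fake 40-byte tuples
    page[off:off + 40] = bytes(range(40))

compressed = zlib.compress(bytes(page))
ratio = len(compressed) / PAGE_SIZE      # far below 1/3 for this page
```

The reported whole-workload reduction (1521/4302 ≈ 0.35) is less extreme than this toy page because real WAL also carries incompressible record headers and non-FPW records.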
Re: [HACKERS] Custom gucs visibility
If we haven't loaded the .so yet, where would we get the list of custom GUCs from? This has come up before. We could show the string value of the GUC, if it's been set in postgresql.conf, but we do not have correct values for any of the other columns in pg_settings; nor are we even sure that the module will think the value is valid once it does get loaded. So the consensus has been that allowing the GUC to be printed would be more misleading than helpful. How about printing them with something along the lines of, Please load extension foobar for details or (less informative, but possibly easier to code) libfoobar.so not loaded. ? Well, we have done the CREATE EXTENSION successfully earlier. Also, the GUC becomes automagically visible after the backend has executed a function from that extension ( in which case the .so gets loaded as part of the function handling). Also note that SET foo.custom_guc works ok by setting up a placeholder guc if the .so has not been loaded yet. I wonder if we should dare to try to load the .so if a 'SHOW extension_name.custom_guc' is encountered via internal_load_library or something? Obviously we should check if the extension was created before as well. Regards, Nikhils Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
[HACKERS] Custom gucs visibility
Hi, Suppose I have a contrib module which adds a custom GUC via DefineCustomIntVariable(), for example. That custom GUC is not visible via a SHOW command unless the corresponding .so has been loaded into memory. E.g.

postgres=# show pg_buffercache.fancy_custom_guc;
ERROR:  unrecognized configuration parameter "pg_buffercache.fancy_custom_guc"

Should we do something about this, or are we ok with the current behavior? I would prefer to be able to see the config parameter value if I happen to want it before that contrib module is loaded. Thoughts? Regards, Nikhils
Re: [HACKERS] Bug in ALTER COLUMN SET DATA TYPE ?
postgres=# CREATE TABLE test (a float check (a > 10.2));
CREATE TABLE
postgres=# CREATE TABLE test_child() INHERITS(test);
CREATE TABLE
postgres=# ALTER TABLE test ALTER COLUMN a SET DATA TYPE numeric;
ERROR: constraint must be added to child tables too

Interestingly, this works in 8.3 and earlier ... but it fails as described in all later versions. So we broke it quite some time back. It would be good to identify exactly what change broke it. Hmm.. I haven't yet tested on other branches. But I'm surprised that you find it broken on such old branches. git annotate suggests that the offending error message was added by commit 09ff76fcdb275769ac4d1a45a67416735613d04b, and that commit is dated Apr 20, 2012. Mea culpa. I did indeed add that error message. But it's strange when Tom says that this is broken since 8.3! Regards, Nikhils -- StormDB - http://www.stormdb.com The Database Cloud Postgres-XC Support and Service
Re: [HACKERS] Bug in ALTER COLUMN SET DATA TYPE ?
So coming back to the issue, do you think it's a good idea to teach ATAddCheckConstraint() that the call is coming from a late phase of ALTER TABLE ? +1 You mentioned AT_PASS_OLD_INDEX in your original mail, but I guess that you meant that we should check for AT_PASS_OLD_CONSTR and then not raise an error? Regards, Nikhils -- StormDB - http://www.stormdb.com The Database Cloud Postgres-XC Support and Service
Re: [HACKERS] xml_is_document and selective pg_re_throw
No, I don't see any particular risk there. The places that might throw ERRCODE_INVALID_XML_DOCUMENT are sufficiently few (as in, exactly one, in this usage) that we can have reasonable confidence we know what the system state is when we catch that error. Hmmm, I was writing some code in which I happened to hold an LWLock when this function was called. The first catch/rethrow cleaned up the InterruptHoldoffCount value. A subsequent release of that LWLock tripped up the Assert(InterruptHoldoffCount > 0) inside RESUME_INTERRUPTS(). I know holding an lwlock like this might not be a good idea, but this behavior just got me thinking about other probable issues. Regards, Nikhils A better way would have been to modify xml_parse to take an additional boolean argument to_rethrow and not to rethrow if that is false? We could do that, but it would greatly complicate xml_parse IMO, since it still needs its own PG_TRY block to handle other error cases, and only one of those error cases ought to optionally return failure instead of re-throwing. regards, tom lane
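The hazard Nikhil describes can be modeled with a toy holdoff counter. All names below are hypothetical simplifications of HOLD_INTERRUPTS()/RESUME_INTERRUPTS() and PostgreSQL's longjmp-based error recovery, not the real implementation: if error cleanup resets the holdoff depth while a caller still holds a lock, the later lock release trips the assertion.

```python
class Backend:
    """Toy model (all names hypothetical) of interrupt-holdoff tracking."""

    def __init__(self):
        self.interrupt_holdoff_count = 0

    def hold_interrupts(self):
        # HOLD_INTERRUPTS(): done e.g. when an LWLock is acquired
        self.interrupt_holdoff_count += 1

    def resume_interrupts(self):
        # RESUME_INTERRUPTS(): Assert(InterruptHoldoffCount > 0)
        assert self.interrupt_holdoff_count > 0, "holdoff underflow"
        self.interrupt_holdoff_count -= 1

    def xml_parse(self, valid):
        if not valid:
            raise ValueError("INVALID_XML_DOCUMENT")

    def xml_is_document(self, valid):
        try:
            self.xml_parse(valid)
            return True
        except ValueError:
            # Swallowing the error resets the holdoff depth, as the
            # error-recovery path does -- even though our caller may
            # still hold a lock acquired before the call.
            self.interrupt_holdoff_count = 0
            return False

backend = Backend()
backend.hold_interrupts()                        # caller acquired an LWLock
assert backend.xml_is_document(False) is False   # error swallowed here
tripped = False
try:
    backend.resume_interrupts()                  # LWLockRelease() would do this
except AssertionError:
    tripped = True                               # the RESUME_INTERRUPTS assert fires
```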
[HACKERS] xml_is_document and selective pg_re_throw
Hi, Consider: SELECT xml '<foo>bar</foo><bar>foo</bar>' IS DOCUMENT; And I was looking at the xml_is_document() source code. It calls xml_parse, which throws an error with the code set to ERRCODE_INVALID_XML_DOCUMENT. The catch block of xml_parse then rethrows. Now xml_is_document does a selective rethrow only if the error is not ERRCODE_INVALID_XML_DOCUMENT. I can understand that this function does this to return true/false, but isn't this behavior of not propagating the error all the way up dangerous? InterruptHoldoffCount inconsistencies, for instance? A better way would have been to modify xml_parse to take an additional boolean argument to_rethrow and not to rethrow if that is false? Thoughts? Regards, Nikhils
Re: [HACKERS] B-tree page deletion boundary cases
Was wondering if there's a similar bug which gets triggered while using VACUUM FULL. See for instance this thread: http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html This issue has been reported on-off from time to time, and in most cases VACUUM or VACUUM FULL appears to be involved. We have usually attributed it to hardware issues, and reindex has been recommended by default as a solution/workaround. I do not perceive much similarity. The bug I've raised can produce wrong query results transiently. It might permit injecting a tuple into the wrong spot in the tree, yielding persistent wrong results. It would not introduce tree-structural anomalies like sibling pointers directed at zeroed pages or internal pages in a 1-level tree. Given the symptoms you reported, I share Robert's suspicion of WAL replay in your scenario. Thanks for taking the time to analyze this, Noah. Regards, Nikhils
Re: [HACKERS] B-tree page deletion boundary cases
Hi Noah, Was wondering if there's a similar bug which gets triggered while using VACUUM FULL. See for instance this thread: http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html This issue has been reported on-off from time to time and in most cases VACUUM or VACUUM FULL appears to be involved. We have usually attributed it to hardware issues and reindex has been recommended by default as a solution/work around.. Regards, Nikhils On Sat, Apr 21, 2012 at 10:22 PM, Noah Misch n...@leadboat.com wrote: For the sake of concurrency, our B-tree implementation has a phased process for reusing empty pages. Excerpting from nbtree/README: A deleted page cannot be reclaimed immediately, since there may be other processes waiting to reference it (ie, search processes that just left the parent, or scans moving right or left from one of the siblings). These processes must observe that the page is marked dead and recover accordingly. Searches and forward scans simply follow the right-link until they find a non-dead page --- this will be where the deleted page's key-space moved to. ... A deleted page can only be reclaimed once there is no scan or search that has a reference to it; until then, it must stay in place with its right-link undisturbed. We implement this by waiting until all transactions that were running at the time of deletion are dead; which is overly strong, but is simple to implement within Postgres. When marked dead, a deleted page is labeled with the next-transaction counter value. VACUUM can reclaim the page for re-use when this transaction number is older than the oldest open transaction. The VACUUM that deletes a page's last tuple calls _bt_pagedel(), which flags the page BTP_DELETED and stores therein the result of ReadNewTransactionId(). When a later VACUUM visits such a page and observes that the stored XID is now less than or equal to RecentXmin, it adds the page to the FSM. 
An INSERT or UPDATE will pull the page from the FSM and repurpose it. As I mentioned[1] peripherally back in November, that algorithm has been insufficient since the introduction of non-XID-bearing transactions in PostgreSQL 8.3. Such transactions do not restrain RecentXmin. If no running transaction has an XID, RecentXmin == ReadNewTransactionId() and the page incorrectly becomes available for immediate reuse. This is difficult to encounter in practice. VACUUM acquires a cleanup lock on every leaf page in each index. Consequently, a problem can only arise around a leaf page deletion when two VACUUMs visit the page during the narrow window when _bt_steppage() has released the pin on its left sibling and not yet acquired a read lock on the target. Non-leaf deletions might not require such narrow conditions, but they are also exponentially less frequent. Here is a test procedure illustrating the bug:

1. In session S1, run these commands:
BEGIN;
CREATE TABLE t (x int, filler character(400));
INSERT INTO t SELECT *, '' FROM generate_series(1, 10000);
CREATE INDEX ON t(x);
DELETE FROM t WHERE x >= 2000 AND x < 4000;
COMMIT;
BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL enable_bitmapscan = off;
DECLARE c CURSOR FOR SELECT x FROM t WHERE x >= 1990 AND x < 4510;
FETCH 5 FROM c;

2. Attach gdb to S1 and set a breakpoint on _bt_getbuf.

3. In S1, run FETCH 10 FROM c. This will hit the breakpoint.

4. In another session S2, run these commands:
VACUUM VERBOSE t; -- mark some pages BTP_DELETED
VACUUM VERBOSE t; -- update FSM to know about the pages
-- reuse the pages
INSERT INTO t SELECT * FROM generate_series(10001, 12000);

5. Exit gdb to free up S1. The FETCH only returns five rows.

(The filler column makes each index page correspond to more heap pages. Without it, heap page pins prevent removing some of the tuples on the index page under test.) The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin. We already use RecentGlobalXmin when wal_level = hot_standby.
If no running transaction has an XID and all running transactions began since the last transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId(). Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <= RecentGlobalXmin as we have today. This also cleanly removes the need for the bit of code in _bt_getbuf() that decrements btpo.xact before sending it down for ResolveRecoveryConflictWithSnapshot(). I suggested[2] that decrement on an unprincipled basis; it was just masking the off-by-one of using <= RecentGlobalXmin instead of < RecentGlobalXmin in _bt_page_recyclable(). This change makes empty B-tree pages wait through two generations of running transactions before reuse, so some additional bloat will arise. Furthermore, the set of transactions having snapshots
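The off-by-one Noah describes can be modeled with plain integers (the real code compares TransactionIds with the wraparound-aware TransactionIdPrecedes(), which this sketch ignores): when no running transaction has an XID, RecentGlobalXmin can equal the very value stamped on the deleted page, and the non-strict comparison recycles the page while a concurrent scan may still visit it.

```python
def recyclable_old(btpo_xact: int, recent_global_xmin: int) -> bool:
    # The pre-fix form of the test: btpo.xact <= RecentGlobalXmin
    return btpo_xact <= recent_global_xmin

def recyclable_fixed(btpo_xact: int, recent_global_xmin: int) -> bool:
    # The proposed test: btpo.xact must *strictly* precede RecentGlobalXmin
    return btpo_xact < recent_global_xmin

# When no running transaction has an XID (and all began after the last
# XID-bearing one), RecentGlobalXmin can equal ReadNewTransactionId() --
# the very value _bt_pagedel() stamped on the deleted page.
next_xid = 1000
stamped_xact = next_xid        # stored on the page at deletion time
recent_global_xmin = next_xid  # no running transaction holds an XID
```

With the strict test, the page must wait until RecentGlobalXmin advances past the stamped XID, i.e. one more generation of transactions, before reuse.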
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
Displace yes. It would error out if someone says ALTER TABLE ONLY... CHECK (); suggesting to use the ONLY with the CHECK. I'd say the behavior for that case can revert to the PostgreSQL 9.1 behavior. If the table has children, raise an error. Otherwise, add an inheritable CHECK constraint, albeit one lacking inheritors at that moment. Ok, that sounds reasonable. Another thing that we should consider is that if we are replacing ONLY with NO INHERIT, then instead of just making a cosmetic syntactic change, we should also replace all the is*only type of field names with noinherit for the sake of completeness and uniformity. Regards, Nikhils
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
Hi, So, I have a patch for this. This patch introduces support for the CHECK ONLY syntax while doing a CREATE TABLE as well as during the usual ALTER TABLE command. Example:

create table atacc7 (test int, test2 int CHECK ONLY (test > 0), CHECK (test2 > 10));
create table atacc8 () inherits (atacc7);

postgres=# \d+ atacc7
                  Table "public.atacc7"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 test   | integer |           | plain   |
 test2  | integer |           | plain   |
Check constraints:
    "atacc7_test2_check" CHECK (test2 > 10)
    "atacc7_test_check" CHECK ONLY (test > 0)
Child tables: atacc8
Has OIDs: no

postgres=# \d+ atacc8
                  Table "public.atacc8"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 test   | integer |           | plain   |
 test2  | integer |           | plain   |
Check constraints:
    "atacc7_test2_check" CHECK (test2 > 10)
Inherits: atacc7
Has OIDs: no

This patch removes the support for:

ALTER TABLE ONLY constraint_rename_test ADD CONSTRAINT con2 CHECK (b > 0);

and uses

ALTER TABLE constraint_rename_test ADD CONSTRAINT con2 CHECK ONLY (b > 0);

Is this what we want? Or would we want the earlier support in place for backward compatibility as well? We are actually introducing this in 9.2, so I guess we can remove this. This is a much cleaner implementation, and we might not even need the changes in pg_dump now because pg_get_constraintdef can provide the info about the ONLY part too. So some cleanup can be done if needed. I know it's a bit late in the commitfest, but if this patch makes this feature more complete, maybe we should consider... Thoughts? P.S. Here's the discussion thread in its entirety for reference: http://postgresql.1045698.n5.nabble.com/how-to-create-a-non-inherited-CHECK-constraint-in-CREATE-TABLE-td5152184.html Regards, Nikhils On Thu, Feb 2, 2012 at 1:32 AM, Peter Eisentraut pete...@gmx.net wrote: On ons, 2012-01-18 at 18:17 -0500, Robert Haas wrote: I agree with Peter that we should have CHECK ONLY.
ONLY is really a property of the constraint, not the ALTER TABLE command -- if it were otherwise, we wouldn't need to store it in the system catalogs, but of course we do. The fact that it's not a standard property isn't a reason not to have proper syntax for it. Clearly, we will eventually want to support inherited and non-inherited constraints of all types. Currently, each type of constraint has an implicit default regarding this property:

check - inherited
not null - inherited
foreign key - not inherited
primary key - not inherited
unique - not inherited
exclusion - not inherited

As discussed above, we need to have a syntax that is attached to the constraint, not the table operation that creates the constraint, so that we can also create these in CREATE TABLE. How should we resolve these different defaults? Also, in ALTER TABLE, if you want to add either an inherited or not inherited constraint to a parent table, you should really say ALTER TABLE ONLY in either case. Because it's conceivably valid that ALTER TABLE foo ADD CHECK () NOINHERIT would add an independent, not inherited check constraint to each child table. So, there are all kinds of inconsistencies and backward compatibility problems lurking here. We might need either a grand transition plan or to document the heck out of these inconsistencies. check_constraint_create_table_support.patch Description: Binary data
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
Hi, Cumulative reaction to all the responses first: Whoa! :) I was under the impression that a majority of us felt that the current mechanism was inadequate. Also, if you go through the nabble thread, the fact that CREATE TABLE did not support such constraints was considered to be an annoyance. And I was asked if/when I could provide this functionality. Apologies, though, about the timing. +1 for fixing up the syntax before 9.2 goes out the door. I think the original syntax was misguided to begin with. Well, it was fine in isolation, but once you consider how to make CREATE TABLE do this too, it's hard to avoid the conclusion that you need to attach the modifier to the CHECK constraint, not the ALTER TABLE command. Yeah, exactly. CHECK NO INHERIT sounds fine to me; will that display ALTER TABLE ONLY x as the one true way of doing this? s/display/displace/, I think you meant? Yeah, that's what I understand the proposal to be. Displace, yes. It would error out if someone says ALTER TABLE ONLY... CHECK (); suggesting to use the ONLY with the CHECK. This patch does this and also makes both CREATE TABLE and ALTER TABLE use it in a uniform manner. Regarding NO INHERIT versus ONLY, we again had discussions on the longish original thread quite a while back: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html But now, if we prefer NO INHERIT, I can live with that. Regards, Nikhils
Re: [HACKERS] Review of patch renaming constraints
Make check passed. Patch has tests for rename constraint. Most normal uses of alter table ... rename constraint ... worked normally. However, the patch does not deal correctly with constraints which are not inherited, such as primary key constraints: That appears to be because creating a primary key constraint does not set pg_constraint.conisonly correctly. This was introduced recently with noninherited check constraints. Umm, conisonly is set to false for primary key entries in pg_constraint. And primary keys are anyway not inherited. So why is the conisonly field interfering in rename? Seems quite orthogonal to me. Regards, Nikhils
Re: [HACKERS] Review of patch renaming constraints
And primary keys are anyway not inherited. So why is the conisonly field interfering in rename? Seems quite orthogonal to me. In the past, each kind of constraint was either always inherited or always not, implicitly. Now, for check constraints we can choose what we want, and in the future, perhaps we will want to choose for primary keys as well. So having conisonly is really a good step into that future, and we should use it uniformly. Agreed. And right now primary key constraints are not marked as only, making them available for inheritance in the future. Or would you prefer it otherwise? Anyway, I fail to see the direct connection between this and renaming. Might have to look at this patch for that. Regards, Nikhils
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
It appears that the only way to create a non-inherited CHECK constraint is using ALTER TABLE. Is there no support in CREATE TABLE planned? That looks a bit odd. There are no plans to do that AFAIR, though maybe you could convince Nikhil to write the patch to do so. That certainly doesn't meet the principle of least surprise... CREATE TABLE should support this. Well, the above was thought about during the original discussion and eventually we felt that CREATE TABLE already has other issues as well, so not having this done as part of creating a table was considered acceptable then: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html#a4647144 But, let me have a stab at it when I get some free cycles. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I have also tried to change the error message as per Alvaro's suggestions. I will really try to see if we have other issues. Really cannot have Robert reporting all the bugs and getting annoyed, but there are lotsa variations possible with inheritance.. So did you find anything? Nothing much as yet. I think we are mostly there with the code but will keep on trying some more variations along the lines of what Robert tried out whenever I get the time next. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I will really try to see if we have other issues. Really cannot have Robert reporting all the bugs and getting annoyed, but there are lotsa variations possible with inheritance.. BTW thank you for doing the work on this. Probably the reason no one bothers too much with inheritance is that even something that's seemingly simple such as what you tried to do here has all these little details that's hard to get right. Thanks Alvaro. Yeah, inheritance is semantics quicksand :) Appreciate all the help from you, Robert and Alex Hunsaker on this. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I don't think this is a given ... In fact, IMO if we're only two or three fixes away from having it all nice and consistent, I think reverting is not necessary. Sure. It's the if part of that sentence that I'm not too sure about. Any specific area of the code that you think is/has become fragile (apart from the non-support for CREATE TABLE based ONLY constraints)? The second bug is a variant of the first. And I have provided a patch for it. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
Hi, There is at least one other problem. Consider:

rhaas=# create table a (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# create table b (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# alter table b inherit a;
ALTER TABLE

This needs to fail if chk is an only constraint, but it doesn't, even with this patch. As you rightly point out, this is because we cannot specify ONLY constraints inside a CREATE TABLE as of now. I think that part of the problem here is fuzzy thinking about the meaning of the word ONLY in ALTER TABLE ONLY b. The word ONLY there is really supposed to mean that the operation is performed on b but not on its descendents; but that's not what you're doing: you're really performing a different operation. In theory, for a table with no current descendents, ALTER TABLE ONLY b and ALTER TABLE b ought to be identical, but here they are not. I think that's probably bad. ONLY has inheritance-based connotations for present/future children. ALTER ONLY combined with ADD is honored better now with this patch IMO (at least for constraints, I think). Another manifestation of this problem is that there's no way to add an ONLY constraint in a CREATE TABLE command. I think that's bad, too: it should be possible to declare any constraint at table creation time, and if the ONLY were part of the constraint definition rather than part of the target-table specification, that would work fine. As it is, it doesn't. Well, the above was thought about during the original discussion, and eventually we felt that CREATE TABLE already has other issues as well, so not having this done as part of creating a table was considered acceptable: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html#a4647144 I am tempted to say we should revert this and rethink. I don't believe we are only a small patch away from finding all the bugs here.
Sure, if we all think that CREATE TABLE should support ONLY CONSTRAINT type of syntax, then +1 for reverting this and a subsequent revised submission. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
And yeah, certainly there's a bug as you point out.

postgres=# create table a1 (ff1 int, constraint chk check (ff1 > 0));
postgres=# create table b1 (ff1 int);
postgres=# alter table only b1 add constraint chk check (ff1 > 0);
postgres=# alter table b1 inherit a1;

The last command should have refused to inherit. Regards, Nikhils On Fri, Dec 23, 2011 at 8:55 AM, Nikhil Sontakke nikkh...@gmail.com wrote: Hi, There is at least one other problem. Consider:

rhaas=# create table a (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# create table b (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# alter table b inherit a;
ALTER TABLE

This needs to fail if chk is an only constraint, but it doesn't, even with this patch. As you rightly point out, this is because we cannot specify ONLY constraints inside a CREATE TABLE as of now. I think that part of the problem here is fuzzy thinking about the meaning of the word ONLY in ALTER TABLE ONLY b. The word ONLY there is really supposed to mean that the operation is performed on b but not on its descendents; but that's not what you're doing: you're really performing a different operation. In theory, for a table with no current descendents, ALTER TABLE ONLY b and ALTER TABLE b ought to be identical, but here they are not. I think that's probably bad. ONLY has inheritance-based connotations for present/future children. ALTER ONLY combined with ADD is honored better now with this patch IMO (at least for constraints, I think). Another manifestation of this problem is that there's no way to add an ONLY constraint in a CREATE TABLE command. I think that's bad, too: it should be possible to declare any constraint at table creation time, and if the ONLY were part of the constraint definition rather than part of the target-table specification, that would work fine. As it is, it doesn't.
Well, the above was thought about during the original discussion and eventually we felt that CREATE TABLE already has other issues as well, so not having this done as part of creating a table was considered acceptable: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html#a4647144 I am tempted to say we should revert this and rethink. I don't believe we are only a small patch away from finding all the bugs here. Sure, if we all think that CREATE TABLE should support ONLY CONSTRAINT type of syntax, then +1 for reverting this and a subsequent revised submission. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I don't think this is a given ... In fact, IMO if we're only two or three fixes away from having it all nice and consistent, I think reverting is not necessary. FWIW, here's a quick fix for the issue that Robert pointed out. Again it's a variation of the first issue that he reported. We have two functions which try to merge existing constraints: MergeWithExistingConstraint() in heap.c MergeConstraintsIntoExisting() in tablecmds.c Nice names indeed :) I have also tried to change the error message as per Alvaro's suggestions. I will really try to see if we have other issues. Really cannot have Robert reporting all the bugs and getting annoyed, but there are lotsa variations possible with inheritance.. Regards, Nikhils only_constraint_no_merge_v2.0.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: Non-inheritable check constraints
Agreed. I just tried out the scenarios laid out by you both with and without the committed patch and AFAICS, normal inheritance semantics have been preserved properly even after the commit. No, they haven't. I didn't expect this to break anything when you have two constraints with different names. The problem is when you have two constraints with the same name. Testing reveals that this is, in fact, broken:

rhaas=# create table A(ff1 int);
CREATE TABLE
rhaas=# create table B () inherits (A);
CREATE TABLE
rhaas=# create table C () inherits (B);
CREATE TABLE
rhaas=# alter table only b add constraint chk check (ff1 > 0);
ALTER TABLE
rhaas=# alter table a add constraint chk check (ff1 > 0);
NOTICE: merging constraint "chk" with inherited definition
ALTER TABLE

At this point, you'll find that a has a constraint, and b has a constraint, but *c does not have a constraint*. That's bad, because a's constraint wasn't only and should therefore have propagated all the way down the tree. Apologies, I did not check this particular scenario. I guess, here, we should not allow merging of the inherited constraint into an only constraint. Because that breaks the semantics for only constraints. If this sounds ok, I can whip up a patch for the same. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
rhaas=# create table A(ff1 int);
CREATE TABLE
rhaas=# create table B () inherits (A);
CREATE TABLE
rhaas=# create table C () inherits (B);
CREATE TABLE
rhaas=# alter table only b add constraint chk check (ff1 > 0);
ALTER TABLE
rhaas=# alter table a add constraint chk check (ff1 > 0);
NOTICE: merging constraint "chk" with inherited definition
ALTER TABLE

At this point, you'll find that a has a constraint, and b has a constraint, but *c does not have a constraint*. That's bad, because a's constraint wasn't only and should therefore have propagated all the way down the tree. Apologies, I did not check this particular scenario. I guess, here, we should not allow merging of the inherited constraint into an only constraint. Because that breaks the semantics for only constraints. If this sounds ok, I can whip up a patch for the same. PFA, patch which does just this.

postgres=# alter table a add constraint chk check (ff1 > 0);
ERROR: constraint "chk" for relation "b" is an ONLY constraint. Cannot merge

Regards, Nikhils only_constraint_no_merge.patch Description: Binary data
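The rule the attached patch enforces can be sketched as a tiny model. The names below loosely follow pg_constraint fields, but this is a hypothetical simplification, not the actual merge code in heap.c/tablecmds.c: when an inherited constraint is pushed down onto a child whose same-named constraint is marked conisonly, the merge must be rejected rather than silently absorbed (silent absorption is what left c without a constraint above).

```python
class CheckConstraint:
    def __init__(self, name, expr, conisonly=False, coninhcount=0):
        self.name = name
        self.expr = expr
        self.conisonly = conisonly      # "only" constraint: never inherited
        self.coninhcount = coninhcount  # number of direct inheritance parents

def merge_inherited_constraint(child_constraints, parent_con):
    """Try to merge an inherited copy of parent_con into a child's list.
    Returns True if merged, False if the caller should copy it down instead."""
    for con in child_constraints:
        if con.name == parent_con.name:
            if con.conisonly:
                # The fix: an ONLY constraint must not absorb an inherited
                # one, or descendants below this child never receive it.
                raise ValueError(
                    'constraint "%s" is an ONLY constraint; cannot merge'
                    % con.name)
            con.coninhcount += 1
            return True
    return False

# The scenario from the example: b's "chk" is an ONLY constraint,
# then a's inherited "chk" arrives.
parent_chk = CheckConstraint("chk", "ff1 > 0")
b_constraints = [CheckConstraint("chk", "ff1 > 0", conisonly=True)]

rejected = False
try:
    merge_inherited_constraint(b_constraints, parent_chk)
except ValueError:
    rejected = True
```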
Re: [HACKERS] Review: Non-inheritable check constraints
It seems hard to believe that ATExecDropConstraint() doesn't need any adjustment. Yeah, I think we could return early for only-type constraints. Also, shouldn't the systable scan break out of the while loop after a matching constraint name has been found? As of now, it's doing a full scan of all the constraints for the given relation.

@@ -6755,6 +6765,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 	HeapTuple	tuple;
 	bool		found = false;
 	bool		is_check_constraint = false;
+	bool		is_only_constraint = false;
 
 	/* At top level, permission check was done in ATPrepCmd, else do it */
 	if (recursing)
@@ -6791,6 +6802,12 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 		/* Right now only CHECK constraints can be inherited */
 		if (con->contype == CONSTRAINT_CHECK)
 			is_check_constraint = true;
+
+		if (con->conisonly)
+		{
+			Assert(is_check_constraint);
+			is_only_constraint = true;
+		}
 
 		/*
 		 * Perform the actual constraint deletion
@@ -6802,6 +6819,9 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 		performDeletion(&conobj, behavior);
 
 		found = true;
+
+		/* constraint found - break from the while loop now */
+		break;
 	}
 
 	systable_endscan(scan);
@@ -6830,7 +6850,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 	 * routines, we have to do this one level of recursion at a time; we can't
 	 * use find_all_inheritors to do it in one pass.
 	 */
-	if (is_check_constraint)
+	if (is_check_constraint && !is_only_constraint)
 		children = find_inheritance_children(RelationGetRelid(rel), lockmode);
 	else
 		children = NIL;

PFA, revised version containing the above changes, based on Alvaro's v4 patch. Regards, Nikhils non_inh_check_v5.0.patch Description: Binary data
Re: [HACKERS] Review: Non-inheritable check constraints
PFA, revised version containing the above changes based on Alvaro's v4 patch. Committed with these fixes, and with my fix for the pg_dump issue noted by Alex, too. Thanks. Thanks a lot Alvaro! Regards, Nikhils -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Review: Non-inheritable check constraints
But did you see Robert Haas' response upthread? It looks like there's still some work to do -- at least some research to determine that we're correctly handling all those cases. As the committer I'm responsible for it, but I don't have resources to get into it any time soon. Yeah, am definitely planning to test out those scenarios and will respond some time late today. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Robert,

First of all, let me state that this ONLY feature has not messed around
with existing inheritance semantics. It allows attaching a constraint to
any table (which can be part of any hierarchy) without the possibility
of it ever playing any part in future or existing inheritance
hierarchies. It is specific to that table, period.

> It's not just that. Suppose that C inherits from B which inherits from
> A. We add an "only" constraint to B and a non-only constraint to A.
> Now, what happens in each of the following scenarios?

An example against latest HEAD should help here:

create table A(ff1 int);
create table B () inherits (A);
create table C () inherits (B);

alter table A add constraint Achk check (ff1 > 10);

The above will attach Achk to A, B and C.

alter table only B add constraint Bchk check (ff1 > 0);

The above will attach Bchk ONLY to table B.

> 1. We drop the constraint from B without specifying ONLY.

postgres=# alter table B drop constraint Achk;
ERROR:  cannot drop inherited constraint "achk" of relation "b"

The above is existing inheritance based behaviour. Now let's look at the
ONLY constraint:

postgres=# alter table B drop constraint Bchk;
ALTER TABLE

Since this constraint is not part of any hierarchy, it can be removed.

postgres=# alter table only B add constraint bchk check (ff1 > 0);
ALTER TABLE
postgres=# alter table only B drop constraint Bchk;
ALTER TABLE

So "only" constraints do not need the "only B" qualification to be
deleted. They work both ways and can always be deleted without any
issues.

> 2. We drop the constraint from B *with* ONLY.

postgres=# alter table only B drop constraint Achk;
ERROR:  cannot drop inherited constraint "achk" of relation "b"

The above is existing inheritance based behavior. So regardless of ONLY,
an inherited constraint cannot be removed from the middle of the
hierarchy.

> 3. We drop the constraint from A without specifying ONLY.
postgres=# alter table A drop constraint Achk;
ALTER TABLE

This removes the constraint from the entire hierarchy across A, B and C.
Again, existing inheritance behavior.

> 4. We drop the constraint from A *with* ONLY.

postgres=# alter table only A drop constraint Achk;
ALTER TABLE

This converts the Achk constraint belonging to B into a local one. C
still has it as an inherited constraint from B. We can now delete those
constraints as per existing inheritance semantics. However, I hope the
difference between these and ONLY constraints is clear: the Achk
constraint associated with B can get inherited in the future, whereas
"only" constraints will not be.

> Off the top of my head, I suspect that #1 should be an error;

It's an error for inherited constraints, but not for "only" constraints.

> #2 should succeed, leaving only the inherited version of the constraint
> on B;

Yeah, "only" constraints removal succeeds, whereas inherited constraints
cannot be removed.

> #3 should remove the constraint from A and leave it on B but I'm not
> sure what should happen to C,

This removes the entire hierarchy.

> and I have no clear vision of what #4 should do.

This removes the constraint from A, but maintains the inheritance
relationship between B and C. Again, standard existing inheritance
semantics.

> As a followup question, if we do #2 followed by #4, or #4 followed by
> #2, do we end up with the same final state in both cases?

Yeah. #2 is not able to do much really, because we do not allow
inherited constraints to be removed from the middle of the hierarchy.

> Here's another scenario. B inherits from A. We add a constraint to A
> using ONLY, and then drop it without ONLY. Does that work or fail?

The constraint gets added to A, and since it is an "only" constraint,
its removal both with and without "only A" works just fine.

> Also, what happens when we add matching constraints to B and A, in each
> case using ONLY, and then remove the constraint from A without using
> ONLY? Does anything happen to B's constraint? Why or why not?
Again, the key differentiation here is that "only" constraints are bound
to that table and won't be inherited ever. So this works just fine:

postgres=# alter table only A add constraint A2chk check (ff1 > 10);
ALTER TABLE
postgres=# alter table only B add constraint A2chk check (ff1 > 10);
ALTER TABLE

> Just to be clear, I like the feature. But I've done some work on this
> code before, and it is amazingly easy to screw it up and end up with
> bugs... so I think lots of careful thought is in order.

Agreed. I just tried out the scenarios laid out by you, both with and
without the committed patch, and AFAICS normal inheritance semantics
have been preserved properly even after the commit.

Regards,
Nikhils
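[Editor's illustration] The drop semantics walked through in the scenarios above can be condensed into a toy model. This is illustrative Python, not PostgreSQL code: the `islocal`/`inhcount`/`isonly` flags only loosely mirror pg_constraint's conislocal/coninhcount/conisonly columns, and the recursion is a simplification of what ATExecDropConstraint() actually does.

```python
# Toy model of inherited vs. ONLY check constraints (illustrative only).
class Table:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.constraints = {}  # name -> {"islocal", "inhcount", "isonly"}
        if parent is not None:
            parent.children.append(self)

def add_constraint(table, name, only=False):
    """ALTER TABLE [ONLY] table ADD CONSTRAINT name CHECK (...)."""
    table.constraints[name] = {"islocal": True, "inhcount": 0, "isonly": only}
    if not only:                       # ONLY constraints never propagate
        for child in table.children:
            _inherit(child, name)

def _inherit(table, name):
    con = table.constraints.setdefault(
        name, {"islocal": False, "inhcount": 0, "isonly": False})
    con["inhcount"] += 1
    for child in table.children:
        _inherit(child, name)

def drop_constraint(table, name):
    """Drop without ONLY, recursing to children (scenarios #1 and #3)."""
    con = table.constraints[name]
    if con["inhcount"] > 0:            # inherited: cannot drop mid-hierarchy
        raise ValueError("cannot drop inherited constraint %s of relation %s"
                         % (name, table.name))
    del table.constraints[name]
    if not con["isonly"]:
        for child in table.children:
            _disinherit(child, name)

def _disinherit(table, name):
    con = table.constraints.get(name)
    if con is None:
        return
    con["inhcount"] -= 1
    if con["inhcount"] == 0 and not con["islocal"]:
        del table.constraints[name]
        for child in table.children:
            _disinherit(child, name)
```

Under this model, adding Achk to A marks it inherited (inhcount > 0) on B and C, so dropping it from B raises the "cannot drop inherited constraint" error of scenario #1, while a Bchk added with only=True stays local and droppable.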
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alvaro,

The patch looks ok to me. I see that we now sort the constraints by
conisonly value too:

@@ -1781,12 +1781,20 @@ describeOneTableDetails(const char *schemaname,
 	/* print table (and column) check constraints */
 	if (tableinfo.checks)
 	{
+		char	   *is_only;
+
+		if (pset.sversion >= 90100)
+			is_only = "r.conisonly";
+		else
+			is_only = "false AS conisonly";
+
 		printfPQExpBuffer(&buf,
-						  "SELECT r.conname, "
+						  "SELECT r.conname, %s, "
						  "pg_catalog.pg_get_constraintdef(r.oid, true)\n"
						  "FROM pg_catalog.pg_constraint r\n"
-						  "WHERE r.conrelid = '%s' AND r.contype = 'c'\nORDER BY 1;",
-						  oid);
+						  "WHERE r.conrelid = '%s' AND r.contype = 'c'\n"
+						  "ORDER BY 2, 1;",
+						  is_only, oid);
 		result = PSQLexec(buf.data, false);
 		if (!result)
 			goto error_return;

My preference would be to see true values first, so we might want to
modify it to "ORDER BY 2 DESC, 1" to have that effect. Your call finally.

Regards,
Nikhils

On Sat, Dec 17, 2011 at 12:36 AM, Alvaro Herrera
<alvhe...@commandprompt.com> wrote:
> Excerpts from Greg Smith's message of vie dic 16 15:02:20 -0300 2011:
>> On 12/04/2011 02:22 AM, Nikhil Sontakke wrote:
>>>> Is it okay to modify an existing constraint to mark it as "only",
>>>> even if it was originally inheritable? This is not clear to me.
>>>> Maybe the safest course of action is to raise an error. Or maybe
>>>> I'm misreading what it does (because I haven't compiled it yet).
>>>
>>> Hmmm, good question. IIRC, the patch will pass is_only as true only
>>> if it is going to be a locally defined, non-inheritable constraint.
>>> So I went by the logic that since it was ok to merge and mark a
>>> constraint as locally defined, it should be ok to mark it
>>> non-inheritable from this moment on with that new local definition?
>
> I think I misread what that was trying to do. I thought it would turn
> on the "is only" bit on a constraint that a child had inherited from a
> previous parent, but that was obviously wrong now that I think about
> it again.
>
>> With this open question, this looks like it's back in Alvaro's hands
>> again to me.
This one started the CF as Ready for Committer and seems stalled there
for now. I'm not going to touch its status, just pointing this fact out.

Yeah.

Nikhil, Alex, this is the merged patch. Have a look that it still works
for you (particularly the pg_dump bits) and I'll commit it. I adjusted
the regression test a bit too. Thanks.

--
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Review: Non-inheritable check constraints
> I had a look at this patch today. The pg_dump bits conflict with
> another patch I committed a few days ago, so I'm about to merge them.
> I have one question which is about this hunk:

Thanks for taking a look Alvaro.

@@ -2312,6 +2317,11 @@ MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
 				con->conislocal = true;
 			else
 				con->coninhcount++;
+			if (is_only)
+			{
+				Assert(is_local);
+				con->conisonly = true;
+			}
 			simple_heap_update(conDesc, &tup->t_self, tup);
 			CatalogUpdateIndexes(conDesc, tup);
 			break;

> Is it okay to modify an existing constraint to mark it as "only", even
> if it was originally inheritable? This is not clear to me. Maybe the
> safest course of action is to raise an error. Or maybe I'm misreading
> what it does (because I haven't compiled it yet).

Hmmm, good question. IIRC, the patch will pass is_only as true only if
it is going to be a locally defined, non-inheritable constraint. So I
went by the logic that since it was ok to merge and mark a constraint as
locally defined, it should be ok to mark it non-inheritable from this
moment on with that new local definition?

Regards,
Nikhils
Re: [HACKERS] So where are we on the open commitfest?
* Non-inheritable check constraints So, this patch got shifted to the next commitfest... Regards, Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
If all you need to do is lock a schema, you can just call LockDatabaseObject(NamespaceRelationId, namespace_oid, 0, AccessShareLock); there's no need to fake up an objectaddress just to take a lock. But I think that's not really all you need to do, because somebody could drop the namespace between the time that you decide what OID to lock and the time you acquire the lock. So I think you need something like what we did in RangeVarGetRelid(). See attached patch. Thanks Robert. But currently there are very few callers of RangeVarGetAndCheckCreationNamespace() function. For the sake of completeness we will have to introduce a call to this function while creating all other objects too. Regards, Nikhils -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
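[Editor's illustration] Robert's point, that the OID you looked up can already be stale by the time the lock is held, can be sketched as a lookup/lock/re-check loop of the same shape RangeVarGetRelid() uses. This is a hypothetical Python model; the catalog dict, the OID value, and the lock set are stand-ins, not PostgreSQL APIs.

```python
catalog = {"test_schema": 16384}      # schema name -> OID (made-up value)
locks = set()                         # OIDs we currently hold a lock on

def lock_object(oid):                 # stand-in for LockDatabaseObject(...)
    locks.add(oid)

def unlock_object(oid):
    locks.discard(oid)

def lookup_and_lock_namespace(name):
    """Resolve a schema name to an OID and lock it, re-checking after the
    lock is acquired in case the schema was dropped (or dropped and
    recreated with a new OID) between the lookup and the lock."""
    while True:
        oid = catalog.get(name)
        if oid is None:
            raise LookupError('schema "%s" does not exist' % name)
        lock_object(oid)
        if catalog.get(name) == oid:  # still the same object: safe
            return oid
        unlock_object(oid)            # raced with a DROP: retry from scratch
```

The re-check after locking is what closes the window: once the lock is held and the name still maps to the same OID, a concurrent DROP SCHEMA must wait for us.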
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
So it's probably going to take a while to get this completely nailed down, but we can keep chipping away at it. Agreed. So are you planning to commit this change? Or we want some more objects to be fixed? Last I looked at this, we will need locking to be done while creating tables, views, types, sequences, functions, indexes, extensions, constraints, operators stuff, ts stuff, rules, domains, etc. that can go into schemas. So might even make sense to write a schema specific function based on your patch template to cater in general to schema locking during object creation. Regards, Nikhils
Re: [HACKERS] Re: pg_dump: schema with OID XXXXX does not exist - was Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Wasn't 7.3 the release that introduced schemas in the first place? I think there's a very good chance that the older reports with similar symptoms are completely unrelated, anyhow. Tom Lane is reluctant and that should tell me something :) So unless the list feels that this should be fixed and also agrees on the general approach, I will not touch a single line of code. Obviously someone else is welcome to have a stab at this too. Regards, Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Hi,

Ok, understood. PFA, a patch against git head. We now take an
AccessShareLock on the schema in DefineRelation. Note that we do not
want to interlock with other concurrent creations in the schema; we only
want to interlock with deletion activity. So even performance wise this
should not be much of an overhead in case of concurrent DDL operations
on the same schema.

Adding this in DefineRelation handles creation of
tables/views/types/sequences. I do not think we need to do stuff in
ALTER commands, because the objects pre-exist and this issue appears to
be with new objects only.

Comments?

Regards,
Nikhils

git_head_lock_schema_ddl.patch
Description: Binary data
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
> Um ... why would we do this only for tables, and not for creations of
> other sorts of objects that belong to schemas?

Right, we need to do it for other objects like functions etc. too.

> Also, if we are going to believe that this is a serious problem, what
> of ALTER ... SET SCHEMA?

I admit, I hadn't thought of this.

> Also, the proposed solution is pretty silly on its face, because it has
> not removed the race condition, only made the window somewhat narrower.
> You would have to acquire the lock as part of the initial schema
> lookup, not lock the OID after the fact. And could we please not do
> something as silly as translate the OID back to a string and then look
> up that string a second time?

The comment mentions that part is a kluge, but that we get to re-use the
existing function because of it. The get_object_address function will
bail out anyway if the schema has vanished from under us, and it does
lock it up immediately after it's found to be valid.

> (To be clear, I don't particularly believe that this is a problem
> worthy of spending code space and cycles on. But if it's deemed to be
> a problem, I want to see a solution that's actually watertight.)

Got the message.

Regards,
Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Continuing in gdb, also completes the creation of c2 table without any errors. We are now left with a dangling entry in pg_class along with all the corresponding data files in our data directory. The problem becomes worse if c2 was created using a TABLESPACE. Now dropping of that tablespace does not work at all. Am sure we can come up with myriad such other issues. Hmm. Does this break pg_dump? I have reported a bug whereby dangling pg_class entries point to a namespace that has since been dropped in the past (and has been reported many times before that, even). Sure does! I just tried it and got: pg_dump: schema with OID 16384 does not exist The bug report is here, whereby I also aggregate other similar bug reports that have taken place over a very long period of time: http://archives.postgresql.org/pgsql-bugs/2011-02/msg00185.php I guess we DO need to pay attention to fix this properly now as there are some reports with 9.x too. And I have just provided a way to reproduce this reliably too. Regards, Nikhils
[HACKERS] pg_dump: schema with OID XXXXX does not exist - was Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Hi,

> But if it's deemed to be a problem, I want to see a solution that's
> actually watertight.

After Daniel's hunch about pg_dump barfing due to such leftover entries
proving out to be true, we have one credible explanation (there might be
other reasons too) for this long standing issue. I see some reports from
as early as 2004 and some as late as Feb, 2011!

http://archives.postgresql.org/pgsql-bugs/2011-02/msg00185.php

One way in 9.x could be to modify get_object_address to additionally
accept objoid as an argument and use that to lock the schema in
AccessShareLock mode from all places where schema based objects (tables,
views, types, sequences, functions, indexes, extensions, constraints,
operators stuff, ts stuff, rules, domains, etc. phew!) can be created.
Or, since this is schema specific, we can as well write a new function
to do this. We might also add logic to only lock user created schemas.

This can be added right after the namespace for the involved object has
been correctly identified. The lock can then get released later as part
of the transaction commit.

Regards,
Nikhils
[HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Hi,

Consider the following sequence of events:

s1 # CREATE SCHEMA test_schema;
s1 # CREATE TABLE test_schema.c1(x int);

Now open another session s2 and via gdb issue a breakpoint on
heap_create_with_catalog(), which is called by DefineRelation().

s2 # CREATE TABLE test_schema.c2(y int);

The above will break on the function. Now issue a drop schema in
session s1:

s1 # DROP SCHEMA test_schema CASCADE;
NOTICE:  drop cascades to table test_schema.c1
DROP SCHEMA

Continuing in gdb also completes the creation of the c2 table without
any errors. We are now left with a dangling entry in pg_class along with
all the corresponding data files in our data directory.

The problem becomes worse if c2 was created using a TABLESPACE: dropping
that tablespace does not work at all now. I'm sure we can come up with
myriad such other issues, and other CREATE commands in this namespace
will have similar issues when faced with a concurrent DROP SCHEMA.

We definitely need some interlocking to handle this. For lack of better
APIs, we could do a LockDatabaseObject() call in AccessShareLock mode on
the namespace and release the same on completion of the creation of the
object.

Thoughts?

Regards,
Nikhils
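[Editor's illustration] The gdb interleaving described above can be re-enacted deterministically with a toy catalog. This is an illustrative Python model (dict-based "catalogs", made-up OIDs), not PostgreSQL code.

```python
# Minimal re-enactment of the race: session 2 resolves the schema OID,
# session 1 then drops the schema, and session 2 finishes creating its
# table, leaving a pg_class row pointing at a namespace that is gone.

pg_namespace = {16384: "test_schema"}   # oid -> name (hypothetical OIDs)
pg_class = {16385: ("c1", 16384)}       # oid -> (relname, relnamespace)

def create_table_begin(schema_name):
    # Step 1: resolve the namespace. No lock is taken -- that's the bug.
    for oid, name in pg_namespace.items():
        if name == schema_name:
            return oid
    raise LookupError("schema does not exist")

def create_table_finish(rel_oid, relname, nsp_oid):
    # Step 2: insert into pg_class (heap_create_with_catalog analogue).
    pg_class[rel_oid] = (relname, nsp_oid)

def drop_schema_cascade(nsp_oid):
    for oid in [o for o, (_, ns) in pg_class.items() if ns == nsp_oid]:
        del pg_class[oid]
    del pg_namespace[nsp_oid]

# s2 resolves the namespace, then stalls (the gdb breakpoint)
nsp = create_table_begin("test_schema")
# s1 drops the schema with CASCADE while s2 is stalled
drop_schema_cascade(nsp)
# s2 resumes and completes without error
create_table_finish(16386, "c2", nsp)

# pg_class now holds an entry whose namespace is gone -- the state that
# later makes pg_dump complain about a missing schema OID.
dangling = [o for o, (_, ns) in pg_class.items() if ns not in pg_namespace]
```

Both sessions succeed, yet the catalogs end up inconsistent, which is exactly why the namespace lookup and the pg_class insertion need to be bridged by a lock.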
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
We definitely need some interlocking to handle this. For lack of better APIs, we could do a LockDatabaseObject() call in AccessShareLock mode on the namespace and release the same on completion of the creation of the object. Thoughts? In general, we've been reluctant to add locking on non-table objects for reasons of overhead. You can, for example, drop a type or function while a query is running that depends on it (which is not true for tables). But I think it is sensible to do it for DDL commands, which shouldn't be frequent enough for the overhead to matter much. Agreed. Especially if the race condition has non-trivial downsides as mentioned in the tablespace case. When I rewrote the comment code for 9.1, I added locking that works just this way, to prevent pg_description entries from being orphaned; see the end of get_object_address(). Yeah thanks, that does the object locking. For pre-9.1 versions, we will need a similar solution. I encountered the issue on 8.3.x.. Regards, Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Yeah thanks, that does the object locking. For pre-9.1 versions, we will need a similar solution. I encountered the issue on 8.3.x.. I don't think we should back-patch a fix of this type. There is a lot of cruftiness of this type scattered throughout the code base, and if we start back-patching all the fixes for it, we're going to end up destabilizing older branches for little real benefit. Ok, understood. Thanks and Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alex,

>> Hmmm, your patch checks for a constraint being "only" via:
>>
>> !recurse && !recursing
>>
>> I hope that is good enough to conclusively conclude that the
>> constraint is 'only'. This check was not too readable in the existing
>> code for me anyways ;). If we check at the grammar level, we can be
>> sure. But I remember not being too comfortable about the right
>> position to ascertain this characteristic.
>
> Well, I traced through it; here was my thinking (maybe it should be a
> comment?):
>
> 1: AlterTable() calls ATController() with
>    recurse = interpretInhOption(stmt->relation->inhOpt)
> 2: ATController() calls ATPrepCmd() with recurse and recursing = false
> 3: ATPrepCmd() saves the recurse flag with the subtype
>    AT_AddConstraintRecurse, otherwise the subtype is AT_AddConstraint
> 4: ATExecCmd() calls ATExecAddConstraint() with recurse == true when
>    subtype == AT_AddConstraintRecurse, recurse = false otherwise
> 5: ATExecAddConstraint() calls ATAddCheckConstraint() with recurse and
>    recursing = false
> 6: now we are in ATAddCheckConstraint() where
>    recurse == interpretInhOption(rv->inhOpt) and recursing == false.
>    Recursing is only true when ATAddCheckConstraint() loops through
>    children and recursively calls ATAddCheckConstraint().
>
> So with it all spelled out, now I see the "constraint must be added to
> child tables too" check is dead code.

Thanks, the above step-wise explanation helps. But AFAICS, the default
inhOpt value can be governed by the SQL_inheritance GUC too. So in that
case, it's possible that recurse is false and child tables are present,
no?

In fact, as I now remember, the reason my patch was looping through was
to handle this very case. It was based on the assumption that some
constraints might be ONLY type and some can be inheritable, although
admittedly the current ALTER TABLE functionality does not allow this. So
maybe we can still keep this check around IMO.

Regards,
Nikhils
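[Editor's illustration] Nikhil's concern can be made concrete with a small sketch of how the recurse flag gets its value, loosely mirroring interpretInhOption(); the names and the parameterized GUC here are illustrative, not the real C signatures.

```python
INH_NO, INH_YES, INH_DEFAULT = 0, 1, 2   # ONLY -> INH_NO, "tab*" -> INH_YES

def interpret_inh_option(inh_opt, sql_inheritance=True):
    """A bare 'ALTER TABLE tab' carries INH_DEFAULT and falls back to the
    sql_inheritance GUC; only the explicit forms are unambiguous."""
    if inh_opt == INH_NO:
        return False
    if inh_opt == INH_YES:
        return True
    return sql_inheritance               # INH_DEFAULT

def looks_like_only(recurse, recursing):
    # The "!recurse && !recursing" test under discussion.
    return (not recurse) and (not recursing)
```

With sql_inheritance = off, a bare ALTER TABLE yields recurse = false, so looks_like_only() is true even though the user never wrote ONLY, which is exactly the ambiguity raised above.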
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alex,

I guess we both are in agreement with each other :)

After sleeping over it, I think that check is indeed dead code with this
new non-inheritable check constraints functionality in place. So unless
you have some other comments, we can mark this as 'Ready for Committer'.

Again, thanks for the thorough review and subsequent changes!

Regards,
Nikhils

On Fri, Oct 7, 2011 at 12:18 PM, Alex Hunsaker <bada...@gmail.com> wrote:
> On Fri, Oct 7, 2011 at 00:28, Nikhil Sontakke <nikkh...@gmail.com> wrote:
>> Hi Alex,
>>
>>> So with it all spelled out, now I see the "constraint must be added
>>> to child tables too" check is dead code.
>>
>> Thanks, the above step-wise explanation helps. But AFAICS, the default
>> inhOpt value can be governed by the SQL_inheritance GUC too. So in
>> that case, it's possible that recurse is false and child tables are
>> present, no?
>
> Well... Do we really want to differentiate between those two cases? I
> would argue no. Given that:
>
> set sql_inheritance to off;
> alter table xxx alter column;
>
> behaves the same as
>
> set sql_inheritance to on;
> alter table only xxx alter column;
>
> Why should we treat constraints differently? Or put another way, if
> "set sql_inheritance off" makes alter table behave with an implicit
> "only", shouldn't add/drop constraint respect that?
>
>> In fact, as I now remember, the reason my patch was looping through
>> was to handle this very case. It was based on the assumption that
>> some constraints might be ONLY type and some can be inheritable.
>> Although admittedly the current ALTER TABLE functionality does not
>> allow this.
>
> Hrm... All I see is a user who turned off sql_inheritance wondering why
> they have to specify ONLY on some alter table commands and not others.
> I think if we want to support ONLY constraint types in the way you are
> thinking about them, we need to put ONLY some place else (alter table
> xxx add only constraint ?). Personally I don't see a reason to have
> that kind of constraint. Mostly because I don't see how it's
> functionally different. Is it somehow?
Anyone else have any thoughts on this?
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alex,

> I didn't care for the changes to gram.y so I reworked it a bit so we
> now pass is_only to AddRelationNewConstraint() (like we do with
> is_local). Seemed simpler but maybe I missed something. Comments?

Hmmm, your patch checks for a constraint being "only" via:

!recurse && !recursing

I hope that is good enough to conclusively conclude that the constraint
is 'only'. This check was not too readable in the existing code for me
anyways ;). If we check at the grammar level, we can be sure. But I
remember not being too comfortable about the right position to ascertain
this characteristic.

> I also moved the is_only check in ATAddCheckConstraint() to before we
> grab and loop through any children. Seemed a bit wasteful to loop
> through the new constraints just to set a flag so that we could bail
> out while looping through the children.

Ditto comment for this function. I thought this function went to great
lengths to spit out a proper error in case of inconsistencies between
parent and child. But if your check makes it simpler, that's good!

> You also forgot to bump Natts_pg_constraint.

Ouch. Thanks for the catch.

> PFA the above changes as well as being rebased against master.

Thanks Alex, appreciate the review!

Regards,
Nikhils
[HACKERS] Issue with listing same tablenames from different schemas in the search_path
Hi,

Consider the following sequence of commands in a psql session:

postgres=# create table public.sample(x int);
postgres=# create schema new;
postgres=# create table new.sample(x int);
postgres=# set search_path=public,new;
postgres=# \dt
 Schema |  Name  | Type  |  Owner
--------+--------+-------+----------
 public | sample | table | postgres
(1 row)

We should have seen two entries in the above listing, so this looks like
a bug to me. The issue is with the call to pg_table_is_visible(). While
scanning for the second entry, it breaks out because there is a matching
entry with the same name in the first schema. What we need is a
variation of this function which checks for visibility of the
corresponding namespace in the search path and emits it out too if so.

Thoughts? I can cook up a patch for this.

Regards,
Nikhils
Re: [HACKERS] Issue with listing same tablenames from different schemas in the search_path
>> postgres=# create table public.sample(x int);
>> postgres=# create schema new;
>> postgres=# create table new.sample(x int);
>> postgres=# set search_path=public,new;
>> postgres=# \dt
>>  Schema |  Name  | Type  |  Owner
>> --------+--------+-------+----------
>>  public | sample | table | postgres
>> (1 row)
>>
>> We should have seen two entries in the above listing. So looks like a
>> bug to me.
>
> No, that's the way it's designed to work. It shows the objects that are
> visible to you, without schema-qualifying them. See
> http://www.postgresql.org/docs/9.0/interactive/app-psql.html#APP-PSQL-PATTERNS:

Hmmm, ok. Makes sense after reading the documentation, but seems a bit
surprising/confusing at first glance. Never mind.

Regards,
Nikhils

> Whenever the pattern parameter is omitted completely, the \d commands
> display all objects that are visible in the current schema search path
> — this is equivalent to using * as the pattern. (An object is said to
> be visible if its containing schema is in the search path and no object
> of the same kind and name appears earlier in the search path. This is
> equivalent to the statement that the object can be referenced by name
> without explicit schema qualification.) To see all objects in the
> database regardless of visibility, use *.* as the pattern.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
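[Editor's illustration] The rule quoted from the docs ("no object of the same kind and name appears earlier in the search path") can be expressed in a few lines. This is an illustrative Python sketch, not psql's actual implementation.

```python
def visible_tables(search_path, tables):
    """Return the (schema, name) pairs visible under a search path.

    tables is a set of (schema, name) pairs; a name is visible only in
    the first schema on the path that contains it, shadowing any later
    schema's table of the same name.
    """
    seen = set()
    result = []
    for schema in search_path:
        for s, name in sorted(tables):
            if s == schema and name not in seen:
                seen.add(name)
                result.append((s, name))
    return result
```

With search_path = public, new and a sample table in each schema, only public.sample is returned, matching the single-row \dt output above; flipping the path order would surface new.sample instead.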
Re: [HACKERS] cataloguing NOT NULL constraints
> So after writing the code to handle named NOT NULL constraints for
> tables, I'm thinking that dumpConstraints needs to be fixed thusly:

@@ -12888,6 +12968,27 @@ dumpConstraint(Archive *fout, ConstraintInfo *coninfo)
 							   NULL, NULL);
 		}
 	}
+	else if (coninfo->contype == 'n' && tbinfo)
+	{
+		/* NOT NULL constraint on a table */
+		if (coninfo->separate)
+		{
+			write_msg(NULL, "NOT NULL constraints cannot be dumped separately from their owning table\n");
+			exit_nicely();
+		}
+	}
+	else if (coninfo->contype == 'n' && tbinfo == NULL)
+	{
+		/* NOT NULL constraint on a domain */
+		TypeInfo   *tyinfo = coninfo->condomain;
+
+		/* Ignore if not to be dumped separately */
+		if (coninfo->separate)
+		{
+			write_msg(NULL, "NOT NULL constraints cannot be dumped separately from their owning domain\n");
+			exit_nicely();
+		}
+	}
 	else
 	{
 		write_msg(NULL, "unrecognized constraint type: %c\n", coninfo->contype);

Some nit-picking. AFAICS above, we seem to be only using 'tbinfo' to
identify the object type here, 'table' vis-a-vis 'domain'. We could
probably reduce the above two elses to a single one and use the check of
tbinfo being not null to decide which object type name to spit out.

Although, it's difficult to see how we could end up marking NOT NULL
constraints as 'separate' ever. So this code will be rarely exercised,
if ever IMO.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Hi,

>> Any preferences for the name? connoinh, conisonly, constatic or
>> confixed?
>
> I'd probably pick "conisonly" from those choices.

The use of \d inside psql will show ONLY constraints without any
embellishments, similar to normal constraints. E.g.:

ALTER TABLE ONLY a ADD CONSTRAINT achk CHECK (FALSE);
ALTER TABLE a ADD CONSTRAINT bchk CHECK (b > 0);

psql=# \d a
      Table "public.a"
 Column |  Type   | Modifiers
--------+---------+-----------
 b      | integer |
Check constraints:
    "achk" CHECK (false)
    "bchk" CHECK (b > 0)

Is this acceptable? Or do we need to put in work into psql to show ONLY
somewhere in the description? If yes, "ONLY CHECK" sounds weird; maybe
we should use "LOCAL CHECK" or some such mention:

Check constraints:
    "achk" LOCAL CHECK (false)
    "bchk" CHECK (b > 0)

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
>> psql=# \d a
>>       Table "public.a"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  b      | integer |
>> Check constraints:
>>     "achk" CHECK (false)
>>     "bchk" CHECK (b > 0)
>>
>> Is this acceptable? Or we need to put in work into psql to show ONLY
>> somewhere in the description? If yes, ONLY CHECK sounds weird, maybe
>> we should use LOCAL CHECK or some such mention:
>>
>> Check constraints:
>>     "achk" LOCAL CHECK (false)
>>     "bchk" CHECK (b > 0)
>
> I think you need to stick with ONLY. Using two different words is just
> going to create confusion. You could fool around with where exactly
> you put it on the line, but switching to a different word seems like
> not a good idea.

Ok, maybe something like:

    "achk" (ONLY) CHECK (false)

> (Also, don't forget you need to hack pg_dump, too.)

Yeah, I have already hacked it a bit. This constraint now needs to be
spit out later as an ALTER command with ONLY attached to it
appropriately. Earlier, all CHECK constraints were generally emitted as
part of the table definition itself.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
>> Yeah, I have already hacked it a bit. This constraint now needs to be
>> spit out later as an ALTER command with ONLY attached to it
>> appropriately. Earlier, all CHECK constraints were generally emitted
>> as part of the table definition itself.
>
> Hrm. That doesn't seem so good. Maybe we've got the design wrong here.
> It doesn't seem like we want to lose the ability to define arbitrary
> constraints at table-creation time.

Well, the handling is different only for ONLY constraints; normal
constraints can still be attached at table-creation time.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Yeah, I have already hacked it a bit. This constraint now needs to be spit out later as an ALTER command with ONLY attached to it appropriately. Earlier all CHECK constraints were generally emitted as part of the table definition itself. IIRC, there's already support for splitting out a constraint that way, in order to deal with circular dependencies. You just need to treat this as an additional reason for splitting. Yeah, I have indeed followed the existing separate printing logic for ONLY constraints. Had to make the table dependent on this constraint to print the constraint *after* the table definition. Regards, Nikhils
Re: [HACKERS] Check constraints on partition parents only?
We could imagine doing something like CHECK ONLY (foo), but that seems quite non-orthogonal with (a) everything else in CREATE TABLE, and (b) ALTER TABLE ONLY. Yeah, I thought about CHECK ONLY support as part of table definition, but as you say - it appears to be too non-standard right now and we can always go back to this later if the need be felt. Regards, Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Hi all,

PFA, a patch which implements non-inheritable ONLY constraints. This has been achieved by introducing a new column, conisonly, in the pg_constraint catalog. Specifying ONLY in the ALTER TABLE ... ADD CONSTRAINT ... CHECK command sets this new column to true. Constraints which have this column set to true can never be inherited by present or future children.

The psql and pg_dump binaries have been modified to account for such persistent non-inheritable check constraints. This patch also has documentation changes along with relevant changes to the test cases. The regression runs pass fine with this patch applied.

Comments and further feedback, if any, appreciated.

Regards,
Nikhils

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5e5f8a7..683ad67 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1995,6 +1995,16 @@
      </row>

      <row>
+      <entry><structfield>conisonly</structfield></entry>
+      <entry><type>bool</type></entry>
+      <entry></entry>
+      <entry>
+       This constraint is defined locally for the relation.  It is a
+       non-inheritable constraint.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>conkey</structfield></entry>
       <entry><type>int2[]</type></entry>
       <entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 4c2a4cd..3ee3ec0 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -984,6 +984,14 @@ ALTER TABLE distributors ADD CONSTRAINT zipchk CHECK (char_length(zipcode) = 5);
  </para>

  <para>
+  To add a check constraint only to a table and not to its children:
+<programlisting>
+ALTER TABLE ONLY distributors ADD CONSTRAINT zipchk CHECK (char_length(zipcode) = 5);
+</programlisting>
+  (The check constraint will not be inherited by future children too.)
+ </para>
+
+ <para>
   To remove a check constraint from a table and all its children:
<programlisting>
ALTER TABLE distributors DROP CONSTRAINT zipchk;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4399493..1b382b8 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -98,10 +98,10 @@ static Oid AddNewRelationType(const char *typeName, Oid new_array_type);
 static void RelationRemoveInheritance(Oid relid);
 static void StoreRelCheck(Relation rel, char *ccname, Node *expr,
-			  bool is_validated, bool is_local, int inhcount);
+			  bool is_validated, bool is_local, int inhcount, bool is_only);
 static void StoreConstraints(Relation rel, List *cooked_constraints);
 static bool MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
-							bool allow_merge, bool is_local);
+							bool allow_merge, bool is_local, bool is_only);
 static void SetRelationNumChecks(Relation rel, int numchecks);
 static Node *cookConstraint(ParseState *pstate, Node *raw_constraint,
@@ -1860,7 +1860,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum, Node *expr)
  */
 static void
 StoreRelCheck(Relation rel, char *ccname, Node *expr,
-			  bool is_validated, bool is_local, int inhcount)
+			  bool is_validated, bool is_local, int inhcount, bool is_only)
 {
 	char	   *ccbin;
 	char	   *ccsrc;
@@ -1943,7 +1943,8 @@ StoreRelCheck(Relation rel, char *ccname, Node *expr,
 						  ccbin,	/* Binary form of check constraint */
 						  ccsrc,	/* Source form of check constraint */
 						  is_local,	/* conislocal */
-						  inhcount);	/* coninhcount */
+						  inhcount,	/* coninhcount */
+						  is_only);	/* conisonly */

 	pfree(ccbin);
 	pfree(ccsrc);
@@ -1984,7 +1985,7 @@ StoreConstraints(Relation rel, List *cooked_constraints)
 				break;
 			case CONSTR_CHECK:
 				StoreRelCheck(rel, con->name, con->expr, !con->skip_validation,
-							  con->is_local, con->inhcount);
+							  con->is_local, con->inhcount, con->is_only);
 				numchecks++;
 				break;
 			default:
@@ -2100,6 +2101,7 @@ AddRelationNewConstraints(Relation rel,
 		cooked->skip_validation = false;
 		cooked->is_local = is_local;
 		cooked->inhcount =
Re: [HACKERS] Check constraints on partition parents only?
> Comments and further feedback, if any, appreciated.

Did you look at how this conflicts with my patch to add not null rows to pg_constraint?

https://commitfest.postgresql.org/action/patch_view?id=601

I was certainly not aware of this patch in the commitfest. Your patch has a larger footprint with more functional changes in it. IMHO, it will be easiest to queue this non-inheritable constraints patch behind your patch in the commitfest. There will be certain bitrot, which I can fix once your patch gets committed.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
> Now that we have coninhcnt, conislocal etc... we can probably support ONLY. But I agree with Robert it's probably a bit more than an afternoon to crank out :-)

Heh, agreed :), I was just looking for some quick and early feedback.

So what we need is basically a way to indicate that a particular constraint is non-inheritable forever (meaning - even for future children), and that should work?

Right now, it seems that the ONLY usage in the SQL only translates to a recurse or no-recurse operation. For the parent, a constraint is marked with conislocal set to true (coninhcount is 0), and for children, coninhcount is used to indicate inheritance of that constraint, with conislocal being set to false.

What we need is to persist the information that a particular constraint is as specified - ONLY for this table. We could do that by adding a new column in pg_constraint like 'connoinh' or something, but I guess we would prefer not to get into the initdb business. Alternatively, we could bring about the same by using a combination of conislocal and coninhcount. For ONLY constraints, we will need to percolate this information down to the point where we define it in the code. We can then mark ONLY constraints to have conislocal set to TRUE and coninhcount set to a special value (-1).

So to summarize, what I am proposing is: introduce a new column connoinh (boolean) in pg_constraint, OR, in the existing infrastructure:

  Normal constraints:    conislocal = true,  coninhcount = 0       (inheritable like today)
  Inherited constraints: conislocal = false, coninhcount = n (> 0)
  ONLY constraints:      conislocal = true,  coninhcount = -1      (not inheritable)

With this arrangement, pg_dump will be able to easily identify and spit out ONLY specifications for specific constraints, and then they won't be blindly passed on to children tables under these new semantics.

Thoughts? Anything missing? Please let me know.

Regards,
Nikhils
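Under the proposed encoding (note: this coninhcount = -1 scheme was only a proposal; the thread later settles on a dedicated conisonly column instead), pg_dump could pick out ONLY constraints with a query along these lines - the table name is illustrative:

```sql
-- hypothetical: relies on the proposed coninhcount = -1 convention
SELECT conname, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'parent_table'::regclass
  AND conislocal
  AND coninhcount = -1;
```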
Re: [HACKERS] Check constraints on partition parents only?
> This approach certainly can't work, because a table can be both an inheritance parent and an inheritance child. It could have an ONLY constraint, and also inherit a copy of the same constraint from one or more parents. IOW, the fact that conislocal = true does not mean that coninhcount is irrelevant.

Oh I see.

> I think what you probably want to do is either (a) add a new column or (b) change conislocal to a char value and make it three-valued:
>
>   n = inherited constraint, no local definition
>   o = defined locally as an ONLY constraint
>   i = defined locally as a non-ONLY constraint
>
> I think I favor the latter approach as more space-efficient, but I hear Tom muttering about backward-compatibility...

Yeah, in your case too an initdb would be required, so might as well go down the route of a new column. Any preferences for the name?

  connoinh
  conisonly
  constatic or confixed

Others?

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Hi,

> Yeah, but I think we need to take that chance. At the very least, we need to support the equivalent of a non-inherited CHECK (false) on parent tables.

> Indeed. I usually enforce that with a trigger that raises an exception, but of course that doesn't help at all with constraint exclusion, and I saw a result just a few weeks ago (I forget the exact details) where it appeared that the plan chosen was significantly worse because the parent table wasn't excluded, so there's a non-trivial downside from having this restriction.

The downside appears to be non-trivial indeed!

I cooked up the attached patch to try to allow ALTER ... ONLY ... CHECK (false) on parent tables. If this approach looks acceptable, I can provide a complete patch later with some documentation changes (I think we ought to mention this special case in the documentation) and a minor test case along with it (if the need is felt for a test case).

Although partitioning ought to be looked at from a different angle completely, maybe this small patch can help out a bit in the current scheme of things, even though this is indeed an unusual bit of special-casing...

Thoughts?

Regards,
Nikhils

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 82bb756..5340402 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5433,6 +5433,7 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	ListCell   *lcon;
 	List	   *children;
 	ListCell   *child;
+	bool		skip_children = false;

 	/* At top level, permission check was done in ATPrepCmd, else do it */
 	if (recursing)
@@ -5502,9 +5503,31 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 * tables; else the addition would put them out of step.
 	 */
 	if (children && !recurse)
-		ereport(ERROR,
+	{
+		/*
+		 * Try a bit harder and check if this is a CHECK(FALSE) kinda
+		 * constraint.  Allow if so, otherwise error out
+		 */
+		if (list_length(newcons) == 1)
+		{
+			CookedConstraint *cooked = linitial(newcons);
+
+			if (cooked->contype == CONSTR_CHECK && cooked->expr)
+			{
+				Node	   *expr = cooked->expr;
+
+				if (IsA(expr, Const) && ((Const *) expr)->consttype == BOOLOID &&
+					((Const *) expr)->constvalue == 0)
+				{
+					skip_children = true;
+				}
+			}
+		}
+
+		if (!skip_children)
+			ereport(ERROR,
 				(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 				 errmsg("constraint must be added to child tables too")));
+	}

 	foreach(child, children)
 	{
@@ -5512,6 +5535,13 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 		Relation	childrel;
 		AlteredTableInfo *childtab;

+		/*
+		 * Skipping the constraint should be good enough for the special case.
+		 * No need to even release the locks on the children immediately..
+		 */
+		if (skip_children)
+			break;
+
 		/* find_inheritance_children already got lock */
 		childrel = heap_open(childrelid, NoLock);
 		CheckTableNotInUse(childrel, "ALTER TABLE");
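To make the intent of the special case concrete, here is a sketch of the usage it is meant to unlock (the table and constraint names are illustrative, not from the patch):

```sql
-- parent acts as an empty routing table in a classic partitioning setup
CREATE TABLE measurements (ts timestamptz, val numeric);
CREATE TABLE measurements_2011
    (CHECK (ts >= '2011-01-01' AND ts < '2012-01-01'))
    INHERITS (measurements);

-- with the patch, this no longer fails with
-- "constraint must be added to child tables too";
-- constraint exclusion can then skip the provably empty parent
ALTER TABLE ONLY measurements ADD CONSTRAINT no_rows CHECK (false);
```

Note that the patch deliberately special-cases only a constant-false CHECK expression; any other constraint added via ALTER TABLE ONLY would still error out.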
Re: [HACKERS] Check constraints on partition parents only?
> 8.4 had this change:
>
>   * Force child tables to inherit CHECK constraints from parents (Alex Hunsaker, Nikhil Sontakke, Tom)
>
> You're not the only one who occasionally bangs his head against it.

Sorry for the occasional head bumps :)

> Yeah. I think it's good that there's a barrier to blindly dropping a constraint that may be important to have on children, but there should be a way to override that.

Hmmm, but then it does open up the possibility of naive users shooting themselves in the foot. It can be easy to conjure up a parent-only constraint that does not gel too well with its children. And that's precisely why this feature was added in the first place..

Regards,
Nikhils
Re: [HACKERS] VACUUM FULL deadlock with backend startup
> Patch applied, thanks!

Thanks Tom!

Regards,
Nikhils
Re: [HACKERS] VACUUM FULL deadlock with backend startup
Hi,

>> The fix is similar to the earlier commit by Tom. I tested this fix against 8.3.13. We lock the parent catalog now before calling index_open. Patch against git HEAD attached with this mail. I guess we will backpatch this? Tom's last commit was backpatched till 8.2 I think.
>
> Is it really worth slowing down the startup sequence for every connection to avoid deadlocking against a very rare maintenance operation?

Not really a performance issue AFAICS. If the relcache init file exists, then the phase2 of the catalog cache, which eventually calls the above code path, is avoided.

Regards,
Nikhils
[HACKERS] VACUUM FULL deadlock with backend startup
Hi,

We encountered a deadlock involving VACUUM FULL (surprise, surprise! :)) in PG 8.3.13 (and still not fixed in 9.0 AFAICS, although the window appears much smaller). The call spits out the following deadlock info:

ERROR:  SQLSTATE 40P01: deadlock detected
DETAIL:  Process 12479 waits for AccessExclusiveLock on relation 2663 of database 16384; blocked by process 14827.
Process 14827 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 12479.
LOCATION:  DeadLockReport, deadlock.c:918

It looked familiar, so I dug up the archives and found that Tom had committed a fix for a similar deadlock via git commit 715120e7. However, this current deadlock involved an index with oid 2663, which is ClassNameNspIndexId. Clearly this was another case of locking the index directly without taking a lock on the parent catalog. Further sleuthing revealed that the culprit function was InitCatCachePhase2, which directly calls index_open in the process startup phase.

Reproducing this was easy once you know the culprit (excruciatingly difficult if you do not know the exact race window). I added a sleep inside the InitCatCachePhase2 function before the call to index_open. Then I invoked a VACUUM FULL pg_class from another session, halting it in gdb just before it takes the exclusive lock via try_relation_open. When a new PG process sleeps inside InitCatCachePhase2, we then take the lock in the VF process, waiting just after it. When the startup continues after the sleep, it will take the ClassNameNspIndexId share lock, but hang trying to take a share lock on pg_class in RelationReloadIndexInfo. Simply continue the VF process in gdb, which will try to take the exclusive lock to vacuum the index. This will reproduce the deadlock in all its glory.

The fix is similar to the earlier commit by Tom. I tested this fix against 8.3.13. We lock the parent catalog now before calling index_open. Patch against git HEAD attached with this mail. I guess we will backpatch this? Tom's last commit was backpatched till 8.2, I think.

Regards,
Nikhils

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d0e364e..c9386aa 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -26,6 +26,7 @@
 #ifdef CATCACHE_STATS
 #include "storage/ipc.h"		/* for on_proc_exit */
 #endif
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/inval.h"
@@ -967,8 +968,16 @@ InitCatCachePhase2(CatCache *cache, bool touch_index)
 	{
 		Relation	idesc;

+		/*
+		 * We must lock the underlying catalog before locking the index to
+		 * avoid deadlock, since RelationReloadIndexInfo might well need to
+		 * read the catalog, and if anyone else is exclusive-locking this
+		 * catalog and index they'll be doing it in that order.
+		 */
+		LockRelationOid(cache->cc_reloid, AccessShareLock);
 		idesc = index_open(cache->cc_indexoid, AccessShareLock);
 		index_close(idesc, AccessShareLock);
+		UnlockRelationOid(cache->cc_reloid, AccessShareLock);
 	}
 }
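The lock ordering at the heart of the deadlock can be sketched as two concurrent sessions (the oids are the ones from the report: 1259 = pg_class, 2663 = ClassNameNspIndexId):

```sql
-- Session A:  VACUUM FULL pg_class;
--   1. takes AccessExclusiveLock on pg_class (1259)
--   2. then wants AccessExclusiveLock on its index (2663)

-- Session B:  new backend in InitCatCachePhase2, before the fix:
--   1. takes AccessShareLock on the index (2663) via index_open
--   2. then RelationReloadIndexInfo wants AccessShareLock on pg_class (1259)

-- A holds 1259 and waits on 2663; B holds 2663 and waits on 1259:
-- a classic ABBA deadlock.  The patch makes B lock 1259 before 2663,
-- so both sessions acquire the two locks in the same order.
```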
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi,

> To summarize, as I see it - the zeroed out block 523 should have been the second leftmost leaf and should have pointed to 522, thus re-establishing the index chain 524 -> 523 -> 522 -> 277 -> ...
>
> Was there a machine restart in the picture as well?

It seems there might have been a machine restart involved too. So I guess even WAL writing could have been impacted. But even if VF was ongoing at the time of restart, the WAL replay on restart should not do anything, since this will be a non-committed transaction?

Also, I was looking at ReadRecord and saw that it logs a message for failed CRC blocks, but the WAL replay just stops at that point since it returns NULL. Is there a way to find out if more blocks follow in the wake of this failed block (should be a matter of calling ReadRecord with NULL as the first argument, I think)? If so, maybe we can warn further that an error was encountered in the middle of WAL replay. However, the last block too could be a CRC-check-fail candidate...

BTW, is there a possibility of encountering trailing blocks with CRC failures regularly? For transactions that were ongoing at the time of shutdown and did not get a chance to commit or WAL-log properly?

Regards,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi,

> Of course, as you mentioned earlier, it's not impossible there's a bug in the recovery code.

Yeah, I was looking at the repair_frag function in 8.3.13 (yup, it's ugly!) and found out that the normal ExecInsertIndexTuples call is used to insert the index entries. That is standard index code used everywhere, so btree WAL bugs in this code path should be pretty rare, I would think..

> But if an OS crash is involved, another possibility is that something went wrong with the fsync - maybe there's a lying writeback cache between PG and the platter, for example.

Yup, plan to re-confirm this too. Thanks Robert!

Regards,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi Daniel,

> I have also, coincidentally, encountered corruption of a system catalog index -- 8.3.11 -- I have saved the file for forensics. Is it possible that I also receive a copy of this program?

Will it be possible for you to share the file/logs off-list with me? I can also try to do some analysis to compare if the situation is similar to the one we are faced with right now, and report it back here if so.

TIA,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi,

>>> 1. Somebody inserts a bunch of new tuples into the relation, causing growth in the index.
>>
>> In case it's not obvious, VACUUM FULL would do precisely that.
>
> Oh, I didn't even think about that. Yeah, that could be it, too.

Thanks a lot Greg and Robert. This theory seems very plausible. VF must have carried out a rearrangement of heap tuples for compaction, and that might have caused new index entries, which might explain the extension of that many blocks.

> maybe VACUUM FULL -> crash before checkpoint -> problem with recovery?

Did I mention that an immediate shutdown was thrown into the mix just after the VF? That is almost close to a crash, and it means that the shared buffers were not written back to the index data file. So that should also account for all these pages still being zeroed out. So change the above to:

VACUUM FULL -> immediate shutdown -> problem with recovery?

But WAL replay should still have handled this. I would presume even an immediate shutdown ensures that WAL is flushed to disk properly? So that means that either there is a corner-case bug in VF which adds incorrect WAL logging in some specific btree layout scenarios, or there was indeed some bit flipping in the WAL, which caused the recovery to end prematurely during WAL replay. What are the scenarios that you would think can cause WAL bit flipping?

I was trying to repro this on the setup by repeatedly creating a table with large inserts, doing lotta deletes, running VF and then issuing an immediate shutdown. However, if I try to inspect the index data file at this point in the test case, it is inconsequential, as the file is largely out of sync since its dirty shared buffers have not been flushed. That leaves me with the option to restart and check the index data file again for problems. If we see problems after the restart, it should generally mean WAL logging errors (but we still cannot discount the bit-flipping case, I guess).

I guess a perusal of the WAL activity done by VF btree index activity is the need of the hour..

Regards,
Nikhils
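The repro loop described above might look roughly like this; the table definition and sizes are illustrative guesses, and the immediate shutdown is the pg_ctl step noted in the comments:

```sql
-- one iteration; afterwards run: pg_ctl stop -m immediate, then restart
-- and inspect the index data file for zeroed or inconsistent pages
DROP TABLE IF EXISTS t;
CREATE TABLE t (id int PRIMARY KEY, pad text);
INSERT INTO t SELECT g, repeat('x', 500) FROM generate_series(1, 100000) g;
DELETE FROM t WHERE id % 2 = 0;   -- lots of deletes to create dead space
VACUUM FULL t;                    -- issue the immediate shutdown right after
```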
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
>> VACUUM FULL -> immediate shutdown -> problem with recovery?
>
> An immediate shutdown == an intentional crash. OK, so you have the VACUUM FULL and the immediate shutdown just afterward. So we just need to figure out what happened during recovery.

Right.

>> But WAL replay should still have handled this. I would presume even an immediate shutdown ensures that WAL is flushed to disk properly?
>
> I'm not sure, but I doubt it. If the VACUUM FULL committed, then the WAL records should be on disk, but if the immediate shutdown happened while it was still running, then the WAL records might still be in wal_buffers, in which case I don't think they'll get written out, and thus zero pages in the index are to be expected. Now that doesn't explain any other corruption in the file, but I believe all-zeroes pages in a relation are an expected consequence of an unclean shutdown. But assuming the VF actually committed before the immediate shutdown, there must be something else going on, since by that point XLOG should have been flushed.

Oh yeah, so if VF committed, the xlog should have been ok too, but can't say the same about the shared buffers.

>> So that means that either there is a corner-case bug in VF which adds incorrect WAL logging in some specific btree layout scenarios, or there was indeed some bit flipping in the WAL, which caused the recovery to prematurely end during WAL replay. What are the scenarios that you would think can cause WAL bit flipping?
>
> Some kind of fluke hard drive malfunction, maybe? I know that the incidence of a hard drive being told to write A and actually writing B is very low, but it's probably not exactly zero. Do you have the logs from the recovery following the immediate shutdown? Anything interesting there?

Unfortunately we do not have the recovery logs. Would have loved to see some signs of issues in the WAL replay to confirm the theory about bit flipping..

> Or, as you say, there could be a corner-case VF bug.

Yeah, much harder to find by just eyeballing the code, I guess :)

Regards,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
>> Oh yeah, so if VF committed, the xlog should have been ok too, but can't say the same about the shared buffers.
>
> But there was a later block that *was* written out. What was the LSN on that block? Everything in the WAL log should have been fsynced up to that point when that buffer was flushed.

Which block are you referring to? After the holes from 279 to 518? We have:

  Deleted 519 (LSN: logid 29, recoff 0xd1fab5bc); its previous points to deleted 520.
  Deleted 520 (LSN: logid 29, recoff 0xd1fac918); its previous points to deleted 521.
  Deleted 521 (LSN: logid 29, recoff 0xd1fadc60); its previous points to 522.

Note that all the above deleted blocks have their next xid set to FrozenXid, meaning VF did this.

Live 522 (LSN: logid 29, recoff 0xd1fade3c) has its previous pointing to the zeroed-out block 523. Note that this seems to be the latest LSN in the data file.

The next of all these 4 blocks (519 to 522) points to live block 277 with LSN (logid 29, recoff 0xd1fadc60). The ROOT block also has this same LSN. So blocks 521, 277 and block 3 have this same LSN. The rest of the live blocks appear to have lower LSNs with lower logids.

The last block in the index data file, which also seems to be the current leftmost block, number 524, has an LSN of (logid 29, recoff 0xd1fa97b8) with its next sibling set to 523. One interesting observation is that there is another deleted block, 278 (again done by VF), with this same LSN and next pointing to 524. So this must have been the original leftmost block before 524 was promoted to that status.

To summarize, as I see it - the zeroed out block 523 should have been the second leftmost leaf and should have pointed to 522, thus re-establishing the index chain 524 -> 523 -> 522 -> 277 -> ...

Phew, I hope the above made some sense and was decipherable. Looking at my png alongside might help a bit more too. Also, maybe some bug is indeed hidden in the guts of VF.

> Was there a machine restart in the picture as well?

I don't think so.

Regards,
Nikhils