Re: [HACKERS] logical decoding of two-phase transactions
Hi,

FYI all, wanted to mention that I am working on an updated version of the
latest patch that I plan to submit to a later CF.

Regards,
Nikhils

On 14 May 2017 at 04:02, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> On 13 May 2017 at 22:22, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>
>> Apparently you are not testing against current HEAD.  That's been there
>> since d10c626de (a whole two days now ;-))
>
> Indeed, I was working on a more than two-day old antiquity. Unfortunately,
> it's even more complicated to apply this patch against the current HEAD,
> so I'll wait for a rebased version.

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtrans links incorrectly
> I'm suggesting we take the approach that if there is a problem we can
> recreate it as a way of exploring what conditions are required and
> therefore work out the impact. Nikhil Sontakke appears to have
> re-created something, but not quite what I had expected. I think he
> will post here tomorrow with an update for us to discuss.

So, I reverted commit 0874d4f3e183757ba15a4b3f3bf563e0393dd9c2 to go back to
the earlier bad swapped arguments to SubTransSetParent, resulting in
incorrect parent linkages, and used the attached TAP test patch.

The test prepares a 2PC transaction with more than 64 subtransactions. It
then stops the master and promotes the standby. A SELECT query on the newly
promoted master on any of the tables involved in the 2PC hangs.

The hang is due to a loop in SubTransGetTopmostTransaction(). Because of the
incorrect linkages, we get a circular reference in parentxid <-> subxid,
inducing the infinite loop. Any further DML on these objects which needs to
check the visibility of these tuples hangs as well.

All unrelated objects and new transactions are ok AFAICS. I do not see any
data loss, which is good. However, tables involved in the 2PC are
inaccessible until after a hard restart.

The attached TAP test patch can be considered for commit, to test the
handling of 2PC with a large number of subtransactions on promoted standby
instances.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

subxid_bug_with_test_case_v1.0.patch
Description: Binary data
Re: [HACKERS] StandbyRecoverPreparedTransactions recovers subtrans links incorrectly
> >>>> Also, when I fix that, it gets further but still crashes at the same
> >>>> Assert in SubTransSetParent.  The proximate cause this time seems to be
> >>>> that RecoverPreparedTransactions's calculation of overwriteOK is wrong:
> >>>> it's computing that as "false", but in reality the subtrans link in
> >>>> question has already been set.
> >>
> >>> Not sure about that, investigating.
> >>
> >> As a quick hack, I just hotwired RecoverPreparedTransactions to set
> >> overwriteOK = true always, and with that and the SubTransSetParent
> >> argument-order fix, HEAD passes the recovery tests.  Maybe we can
> >> be smarter than that, but this might be a good short-term fix to get
> >> the buildfarm green again.
> >
> > That would work. I've been looking into a fix I can explain, but "do
> > it always" may actually be it.

Here's what's happening:

On Master:
    BEGIN;
    INSERT INTO t_009_tbl VALUES (42);
    SAVEPOINT s1;
    INSERT INTO t_009_tbl VALUES (43);
    PREPARE TRANSACTION 'xact_009_1';

On Master: Do a fast shutdown.

On Slave: Restart the slave. This causes StandbyRecoverPreparedTransactions()
to be invoked, which causes ProcessTwoPhaseBuffer() to be invoked with
setParent set to true. This sets up the parent xid for the above
subtransaction.

On Slave: Promote the slave. This causes RecoverPreparedTransactions() to be
invoked, which ends up calling SubTransSetParent() for the above subxid. The
overwriteOK bool remains false since we don't go over the
PGPROC_MAX_CACHED_SUBXIDS limit. Since the parent xid was already set on
restart above, we hit the assert.

---

I am wondering if StandbyRecoverPreparedTransactions() needs the parent
linkage at all? I don't see SubTransGetParent() and related functions being
called on the standby, but we need to be careful here. As a quick test, if I
don't call SubTransSetParent() as part of
StandbyRecoverPreparedTransactions(), then all recovery tests pass ok.

But the fact that StandbyRecoverPreparedTransactions() takes overwriteOK as a
parameter means it might not be so simple.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Failed recovery with new faster 2PC code
Please find attached a second version of my bug fix, which is stylistically
better and clearer than the first one.

Regards,
Nikhils

On 18 April 2017 at 13:47, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
> Hi,
>
> There was a bug in the redo 2PC remove code path. Because of it, autovac
> would think that the 2PC is gone and cause removal of the corresponding
> clog entry earlier than needed.
>
> Please find attached the bug fix: 2pc_redo_remove_bug.patch.
>
> I have been testing this on top of Michael's 2pc-restore-fix.patch and
> things seem to be ok for the past one+ hour. Will keep it running for long.
>
> Jeff, thanks for these very useful scripts. I am going to make a habit of
> running these scripts on my side from now on. Do you have any other script
> that I could try against these patches? Please let me know.
>
> Regards,
> Nikhils
>
> On 18 April 2017 at 12:09, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>> On 17 April 2017 at 15:02, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>>>
>>>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>>>> >> Author: Simon Riggs <si...@2ndquadrant.com>
>>>> >> Date:   Tue Apr 4 15:56:56 2017 -0400
>>>> >>
>>>> >>     Speedup 2PC recovery by skipping two phase state files in normal path
>>>> >
>>>> > Thanks Jeff for your tests.
>>>> >
>>>> > So that's now two crash bugs in as many days and lack of clarity about
>>>> > how to fix it.
>>>
>> The issue seems to be that a prepared transaction is yet to be committed,
>> but autovacuum comes in and causes the clog to be truncated beyond this
>> prepared transaction ID in one of the runs.
>>
>> We only add the corresponding pgproc entry for a surviving 2PC transaction
>> on completion of recovery. So there could be a race condition here.
>> Digging in further.
>>
>> Regards,
>> Nikhils

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

2pc_redo_remove_bug_v2.0.patch
Description: Binary data
Re: [HACKERS] Failed recovery with new faster 2PC code
Hi,

There was a bug in the redo 2PC remove code path. Because of it, autovac
would think that the 2PC is gone and cause removal of the corresponding clog
entry earlier than needed.

Please find attached the bug fix: 2pc_redo_remove_bug.patch.

I have been testing this on top of Michael's 2pc-restore-fix.patch and things
seem to be ok for the past one+ hour. Will keep it running for long.

Jeff, thanks for these very useful scripts. I am going to make a habit of
running these scripts on my side from now on. Do you have any other script
that I could try against these patches? Please let me know.

Regards,
Nikhils

On 18 April 2017 at 12:09, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
> On 17 April 2017 at 15:02, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>>
>>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>>> >> Author: Simon Riggs <si...@2ndquadrant.com>
>>> >> Date:   Tue Apr 4 15:56:56 2017 -0400
>>> >>
>>> >>     Speedup 2PC recovery by skipping two phase state files in normal path
>>> >
>>> > Thanks Jeff for your tests.
>>> >
>>> > So that's now two crash bugs in as many days and lack of clarity about
>>> > how to fix it.
>>
> The issue seems to be that a prepared transaction is yet to be committed,
> but autovacuum comes in and causes the clog to be truncated beyond this
> prepared transaction ID in one of the runs.
>
> We only add the corresponding pgproc entry for a surviving 2PC transaction
> on completion of recovery. So there could be a race condition here.
> Digging in further.
>
> Regards,
> Nikhils

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

2pc_redo_remove_bug.patch
Description: Binary data

2pc-restore-fix.patch
Description: Binary data
Re: [HACKERS] Failed recovery with new faster 2PC code
On 17 April 2017 at 15:02, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>
>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>> >> Author: Simon Riggs <si...@2ndquadrant.com>
>> >> Date:   Tue Apr 4 15:56:56 2017 -0400
>> >>
>> >>     Speedup 2PC recovery by skipping two phase state files in normal path
>> >
>> > Thanks Jeff for your tests.
>> >
>> > So that's now two crash bugs in as many days and lack of clarity about
>> > how to fix it.
>

The issue seems to be that a prepared transaction is yet to be committed, but
autovacuum comes in and causes the clog to be truncated beyond this prepared
transaction ID in one of the runs.

We only add the corresponding pgproc entry for a surviving 2PC transaction on
completion of recovery. So there could be a race condition here. Digging in
further.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Failed recovery with new faster 2PC code
> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
> >> Author: Simon Riggs
> >> Date:   Tue Apr 4 15:56:56 2017 -0400
> >>
> >>     Speedup 2PC recovery by skipping two phase state files in normal path
> >
> > Thanks Jeff for your tests.
> >
> > So that's now two crash bugs in as many days and lack of clarity about
> > how to fix it.

I am trying to reproduce and debug it from my end as well.

Regards,
Nikhils
Re: [HACKERS] Speedup twophase transactions
> I don't have anything else to say about this patch, so should we mark
> that as ready for committer? There are still a couple of days left
> until the end of the CF, and quite a lot has happened, so this could
> get on time into PG10.

Yes, let's mark it "ready for committer". This patch/feature has been a long
journey! :)

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
Please ignore the reports about errors in other tests. They seem spurious.

Regards,
Nikhils

On 28 March 2017 at 10:40, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
> Hi Michael,
>
> The latest patch looks good. By now doing a single scan of the shmem two
> phase data, we have removed the double loops in all the affected functions,
> which is good.
>
> My only question is whether the added call to restoreTwoPhaseData() is good
> enough to handle all 3 functions, PrescanPreparedTransactions(),
> StandbyRecoverPreparedTransactions() and RecoverPreparedTransactions(),
> appropriately. It looks as if it does, but we need to be doubly sure.
>
> PFA, revised patch with a very minor typo fix and a rebase against latest
> master. The test cases pass as needed.
>
> Oh, btw, while running the TAP tests, I got a few errors in unrelated tests:
>
> "# testing connection parameter "target_session_attrs"
> not ok 5 - connect to node master if mode "read-write" and master,standby_1 listed
> #   Failed test 'connect to node master if mode "read-write" and master,standby_1 listed'
> #   at t/001_stream_rep.pl line 93.
> #          got: ''
> #     expected: '1'
> not ok 6 - connect to node master if mode "read-write" and standby_1,master listed
> #   Failed test 'connect to node master if mode "read-write" and standby_1,master listed'
> #   at t/001_stream_rep.pl line 93.
> #          got: ''
> #     expected: '1'
> not ok 7 - connect to node master if mode "any" and master,standby_1 listed
> #   Failed test 'connect to node master if mode "any" and master,standby_1 listed'
> #   at t/001_stream_rep.pl line 93.
> #          got: ''
> #     expected: '1'
> not ok 8 - connect to node standby_1 if mode "any" and standby_1,master listed"
>
> Again, not related to this recovery code path, but not sure if others see
> this as well.
>
> Regards,
> Nikhils
>
> On 27 March 2017 at 05:35, Michael Paquier <michael.paqu...@gmail.com> wrote:
>> On Sun, Mar 26, 2017 at 4:50 PM, Nikhil Sontakke
>> <nikh...@2ndquadrant.com> wrote:
>> > I was away for a bit. I will take a look at this patch and get back to
>> > you soon.
>>
>> No problem. Thanks for your time!
>> --
>> Michael

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
Hi Michael,

The latest patch looks good. By now doing a single scan of the shmem two
phase data, we have removed the double loops in all the affected functions,
which is good.

My only question is whether the added call to restoreTwoPhaseData() is good
enough to handle all 3 functions, PrescanPreparedTransactions(),
StandbyRecoverPreparedTransactions() and RecoverPreparedTransactions(),
appropriately. It looks as if it does, but we need to be doubly sure.

PFA, revised patch with a very minor typo fix and a rebase against latest
master. The test cases pass as needed.

Oh, btw, while running the TAP tests, I got a few errors in unrelated tests:

"# testing connection parameter "target_session_attrs"
not ok 5 - connect to node master if mode "read-write" and master,standby_1 listed
#   Failed test 'connect to node master if mode "read-write" and master,standby_1 listed'
#   at t/001_stream_rep.pl line 93.
#          got: ''
#     expected: '1'
not ok 6 - connect to node master if mode "read-write" and standby_1,master listed
#   Failed test 'connect to node master if mode "read-write" and standby_1,master listed'
#   at t/001_stream_rep.pl line 93.
#          got: ''
#     expected: '1'
not ok 7 - connect to node master if mode "any" and master,standby_1 listed
#   Failed test 'connect to node master if mode "any" and master,standby_1 listed'
#   at t/001_stream_rep.pl line 93.
#          got: ''
#     expected: '1'
not ok 8 - connect to node standby_1 if mode "any" and standby_1,master listed"

Again, not related to this recovery code path, but not sure if others see
this as well.

Regards,
Nikhils

On 27 March 2017 at 05:35, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Sun, Mar 26, 2017 at 4:50 PM, Nikhil Sontakke
> <nikh...@2ndquadrant.com> wrote:
> > I was away for a bit. I will take a look at this patch and get back to
> > you soon.
>
> No problem. Thanks for your time!
> --
> Michael

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

twophase_recovery_shmem_280317.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
Thanks Michael, I was away for a bit. I will take a look at this patch and
get back to you soon.

Regards,
Nikhils

On 22 March 2017 at 07:40, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 5:15 PM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
> > On Fri, Mar 17, 2017 at 5:00 PM, Nikhil Sontakke
> > <nikh...@2ndquadrant.com> wrote:
> >> Michael, it looks like you are working on a final version of this patch?
> >> I will wait to review it from my end, then.
> >
> > I have to admit that I am beginning to get drawn into it...
>
> And here is what I got. I have found a couple of inconsistencies in
> the patch, roughly:
> - During recovery, entries marked with ondisk = true should have their
>   start and end LSN reset to InvalidXLogRecPtr. This was actually
>   leading to some inconsistencies in MarkAsPreparing() for 2PC
>   transactions staying around for more than 2 checkpoints.
> - RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
>   and PrescanPreparedTransactions() doing both a scan of pg_twophase and
>   the shared memory entries was way too complicated. I have changed
>   things so that only memory entries are scanned by those routines, but
>   an initial scan of pg_twophase is done before recovery.
> - Some inconsistencies in the comments and some typos found on the way.
> - Simplification of some routines used in redo, as well as a simplified
>   set of routines made available to users.
>
> Tests are passing for me, an extra lookup would be nice.
> --
> Michael

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> > Ok, we can do that and then yes, RecoverPreparedTransaction() can just
> > have one loop going through the shmem entries. BUT, we cannot ignore
> > "inredo"+"ondisk" entries. For such entries, we will have to read and
> > recover from the corresponding 2PC files.
>
> Yes. About other things I found... In CheckPointTwoPhase(), I am
> actually surprised that prepare_start_lsn and prepare_end_lsn are not
> reset to InvalidXLogRecPtr when a shmem entry is flushed to disk after
> ondisk is set to true; there is no need for them as the data does not
> need to be fetched from WAL segments, so we had better be consistent
> (regression tests fail if I do that). And the two extra arguments in
> MarkAsPreparing() are really unnecessary, they get set all the time to
> InvalidXLogRecPtr.

Michael, it looks like you are working on a final version of this patch? I
will wait to review it from my end, then.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> > I don't think this will work. We cannot replace pg_twophase with shmem
> > entries + WAL pointers. This is because we cannot expect to have WAL
> > entries around for long-running prepared transactions which survive
> > across checkpoints.
>
> But at the beginning of recovery, we can mark such entries with ondisk
> and inredo, in which case the WAL pointers stored in the shmem entries
> do not matter because the data is already on disk.

Ok, we can do that, and then yes, RecoverPreparedTransaction() can just have
one loop going through the shmem entries. BUT, we cannot ignore
"inredo"+"ondisk" entries. For such entries, we will have to read and recover
from the corresponding 2PC files.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> Nikhil, do you mind if I try something like that? As we already know
> what the first XID is when beginning redo, via
> ShmemVariableCache->nextXid, it is possible to discard 2PC files that
> should not be here.

Yeah, that is ok.

> What makes me worry is the control of the maximum
> number of entries in shared memory. If there are legit 2PC files that
> are flushed on disk at checkpoint, you would finish with potentially
> more 2PC transactions than what should be possible (even if updates of
> max_prepared_xacts are WAL-logged).

The max_prepared_xacts number restricts the number of pending PREPARED
transactions *across* the 2PC files and the shmem inredo entries. We can
never have more entries than this value.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> + * * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
> + *     and PrescanPreparedTransactions() have been modified to go through
> + *     gxact->inredo entries that have not made it to disk yet.
>
> It seems to me that there should be an initial scan of pg_twophase at
> the beginning of recovery, discarding on the way, with a WARNING,
> entries that are older than the checkpoint redo horizon. This should
> fill in shmem entries using something close to PrepareRedoAdd(), and
> mark those entries as inredo. Then, at the end of recovery,
> PrescanPreparedTransactions does not need to look at the entries in
> pg_twophase. And that's the case as well of
> RecoverPreparedTransaction(). I think that you could get the patch
> much simplified this way, as any 2PC data can be fetched directly from
> WAL segments and there is no need to rely on scans of pg_twophase;
> this is replaced by scans of entries in TwoPhaseState.

I don't think this will work. We cannot replace pg_twophase with shmem
entries + WAL pointers. This is because we cannot expect to have WAL entries
around for long-running prepared transactions which survive across
checkpoints.

Regards,
Nikhils
Re: [HACKERS] Speedup twophase transactions
> Thanks for the new version. Let's head toward a final patch.

:-)

> > Have added a new function to do this now. It reads either from disk or
> > shared memory and produces error/log messages accordingly as well now.
>
> And that's ProcessTwoPhaseBufferAndReturn in the patch.
> ProcessTwoPhaseBuffer may be a better name.

Renamed to ProcessTwoPhaseBuffer().

> > Since we are using the TwoPhaseState structure to track redo entries, at
> > end of recovery, we will find existing entries. Please see my comments
> > above for RecoverPreparedTransactions()
>
> This is absolutely not good, because it is a direct result of the
> interactions of the first loop of RecoverPreparedTransaction() with
> its second loop, and because MarkAsPreparing() can finish by being
> called *twice* from the same transaction. I really think that this
> portion should be removed and that RecoverPreparedTransactions()
> should be more careful when scanning the entries in pg_twophase by
> looking up what exists as well in shared memory, instead of doing
> that in MarkAsPreparing().

Ok. Modified MarkAsPreparing() to call a new MarkAsPreparingGuts() function.
This function takes in a "gxact" and works on it.
RecoverPreparedTransaction() now calls a newly added
RecoverFromTwoPhaseBuffer() function, which checks if an entry already exists
via redo and calls the MarkAsPreparingGuts() function by passing in that
gxact. Otherwise the existing MarkAsPreparing() gets called.

> Here are some more comments:
>
> +       /*
> +        * Recreate its GXACT and dummy PGPROC
> +        */
> +       gxactnew = MarkAsPreparing(xid, gid,
> +                                  hdr->prepared_at,
> +                                  hdr->owner, hdr->database,
> +                                  gxact->prepare_start_lsn,
> +                                  gxact->prepare_end_lsn);
> +
> +       Assert(gxactnew == gxact);
> Here it would be better to set ondisk to false. This makes the code
> more consistent with the previous loop, and the intention clear.

Done.

> The first loop of RecoverPreparedTransactions() has a lot in common
> with its second loop. You may want to refactor a little bit more here.

Done. Added the new function RecoverFromTwoPhaseBuffer() as mentioned above.

> +/*
> + * PrepareRedoRemove
> + *
> + * Remove the corresponding gxact entry from TwoPhaseState. Also
> + * remove the 2PC file.
> + */
> This could be a bit more expanded. The removal of the 2PC file does not
> happen after removing the in-memory data, it would happen if the
> in-memory data is not found.

Done.

> +MarkAsPreparingInRedo(TransactionId xid, const char *gid,
> +               TimestampTz prepared_at, Oid owner, Oid databaseid,
> +               XLogRecPtr prepare_start_lsn, XLogRecPtr prepare_end_lsn)
> +{
> +       GlobalTransaction gxact;
> +
> +       LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> MarkAsPreparingInRedo is internal to twophase.c. There is no need to
> expose it externally and it is just used in PrepareRedoAdd so you
> could just group both.

Removed the MarkAsPreparingInRedo() function and inlined the code in
PrepareRedoAdd().

>        bool    valid;   /* TRUE if PGPROC entry is in proc array */
>        bool    ondisk;  /* TRUE if prepare state file is on disk */
> +      bool    inredo;  /* TRUE if entry was added via xlog_redo */
> We could have a set of flags here, that's the 3rd boolean of the
> structure used for a status.

This is more of a cleanup and does not need to be part of this patch. It can
be a follow-on cleanup patch.

I also managed to do some perf testing. Modified Stas' earlier scripts
slightly:

\set naccounts 10 * :scale
\set from_aid random(1, :naccounts)
\set to_aid random(1, :naccounts)
\set delta random(1, 100)
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';

Created a base backup with scale factor 125 on an AWS t2.large instance. Set
up archiving and did a 20-minute run with the above script, saving the WALs
in the archive. Then used recovery.conf to point to this WAL location and
used the base backup to recover.

With this patch applied: 20s
Without the patch: stopped measuring after 5 minutes ;-)

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

twophase_recovery_shmem_150317.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
Hi David and Michael,

> It would be great to get this thread closed out after 14 months and many
> commits.

PFA, latest patch which addresses Michael's comments.

> twophase.c: In function ‘PrepareRedoAdd’:
> twophase.c:2539:20: warning: variable ‘gxact’ set but not used
> [-Wunused-but-set-variable]
>   GlobalTransaction gxact;
> There is a warning at compilation.

Fixed.

> The comment at the top of PrescanPreparedTransactions() needs to be
> updated. Not only does this routine look for the contents of
> pg_twophase, but it does also look at the shared memory contents, at
> least those ones marked with inredo and not on_disk.

Changed the comments at the top of PrescanPreparedTransactions(),
StandbyRecoverPreparedTransactions(), and RecoverPreparedTransactions().

> +               ereport(WARNING,
> +                       (errmsg("removing future two-phase state data from memory \"%u\"",
> +                               xid)));
> +               PrepareRedoRemove(xid);
> +               continue
> Those are not normal (partially because unlink is atomic, but not
> durable)... But they match the correct coding pattern regarding
> incorrect 2PC entries... I'd really like to see those switched to a
> FATAL with unlink() made durable for those calls.

Hmm, not sure what exactly we need to do here. If you look at the prior
checks, there we already skip on-disk entries. So, typically, the entries
that we encounter here will be in shmem only.

> +       /* Deconstruct header */
> +       hdr = (TwoPhaseFileHeader *) buf;
> +       Assert(TransactionIdEquals(hdr->xid, xid));
> +
> +       if (TransactionIdPrecedes(xid, result))
> +               result = xid;
> This portion is repeated three times and could be easily refactored.
> You could just have a routine that returns the oldest transaction ID
> used, and ignore the result for StandbyRecoverPreparedTransactions()
> by casting a (void) to let any kind of static analyzer understand that
> we don't care about the result, for example. Handling for subxids is
> necessary as well depending on the code path. Splitting things into a
> couple of sub-routines may be more readable as well. There are really a
> couple of small parts that can be gathered and strengthened.

Have added a new function to do this now. It reads either from disk or shared
memory and produces error/log messages accordingly as well now.

> +       /*
> +        * Recreate its GXACT and dummy PGPROC
> +        */
> +       gxactnew = MarkAsPreparing(xid, gid,
> +                                  hdr->prepared_at,
> +                                  hdr->owner, hdr->database,
> +                                  gxact->prepare_start_lsn,
> +                                  gxact->prepare_end_lsn);
> MarkAsPreparing() does not need to be extended with two new arguments.
> RecoverPreparedTransactions() is used only at the end of recovery,
> where it is not necessary to look at the 2PC state files from the
> records. In this code path inredo is also set to false :)

That's not true. We will have entries with inredo set at the end of recovery
as well. In fact, the MarkAsPreparing() call from
RecoverPreparedTransactions() is the one which removes these inredo entries
and converts them into regular entries. We have optimized the recovery code
path as well.

> +       /* Delete TwoPhaseState gxact entry and/or 2PC file. */
> +       PrepareRedoRemove(parsed.twophase_xid);
> Both things should not be present, no? If the file is pushed to disk
> it means that the checkpoint horizon has already moved.

PREPARE in redo, followed by a checkpoint, followed by a COMMIT/ROLLBACK. We
can have both the bits set in this case.

> -               ereport(ERROR,
> +               /* It's ok to find an entry in the redo/recovery case */
> +               if (!gxact->inredo)
> +                       ereport(ERROR,
>                         (errcode(ERRCODE_DUPLICATE_OBJECT),
>                          errmsg("transaction identifier \"%s\" is already in use",
>                                 gid)));
> +               else
> +               {
> +                       found = true;
> +                       break;
> +               }
> I would not have thought so.

Since we are using the TwoPhaseState structure to track redo entries, at the
end of recovery, we will find existing entries. Please see my comments above
for RecoverPreparedTransactions().

> MarkAsPreparing and MarkAsPreparingInRedo really share the same code.
> What if the setup of the dummy PGPROC entry is made conditional?

I realized that MarkAsPreparingInRedo() does not need to do all the sanity
checking, since it's going to be invoked during redo and everything that
comes in is kosher already. So its contents are much simplified in this
latest patch.

Tests pass with this latest patch.

Regards,
Nikhils

twophase_recovery_shmem_110317.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
Hi Michael, Thanks for taking a look at the patch. > twophase.c: In function ‘PrepareRedoAdd’: > twophase.c:2539:20: warning: variable ‘gxact’ set but not used > [-Wunused-but-set-variable] > GlobalTransaction gxact; > There is a warning at compilation. > > Will fix. > The comment at the top of PrescanPreparedTransactions() needs to be > updated. Not only does this routine look for the contents of > pg_twophase, but it does also look at the shared memory contents, at > least those ones marked with inredo and not on_disk. > > Oh yes. Will change comments at top of StandbyRecoverPreparedTransactions(), RecoverPreparedTransactions() as well. > + ereport(WARNING, > + (errmsg("removing future two-phase state data from > memory \"%u\"", > + xid))); > + PrepareRedoRemove(xid); > + continue > Those are not normal (partially because unlink is atomic, but not > durable)... But they match the correct coding pattern regarding > incorrect 2PC entries... I'd really like to see those switched to a > FATAL with unlink() made durable for those calls. > > Hmm, not sure what exactly we need to do here. If you look at the prior checks, there we already skip on-disk entries. So, typically, the entries that we encounter here will be in shmem only. > + /* Deconstruct header */ > + hdr = (TwoPhaseFileHeader *) buf; > + Assert(TransactionIdEquals(hdr->xid, xid)); > + > + if (TransactionIdPrecedes(xid, result)) > + result = xid; > This portion is repeated three times and could be easily refactored. > You could just have a routine that returns the oldes transaction ID > used, and ignore the result for StandbyRecoverPreparedTransactions() > by casting a (void) to let any kind of static analyzer understand that > we don't care about the result for example. Handling for subxids is > necessary as well depending on the code path. Spliting things into a > could of sub-routines may be more readable as well. There are really a > couple of small parts that can be gathered and strengthened. 
I will see if we can reduce this to a couple of function calls.

> + /*
> +* Recreate its GXACT and dummy PGPROC
> +*/
> + gxactnew = MarkAsPreparing(xid, gid,
> + hdr->prepared_at,
> + hdr->owner, hdr->database,
> + gxact->prepare_start_lsn,
> + gxact->prepare_end_lsn);
> MarkAsPreparing() does not need to be extended with two new arguments.
> RecoverPreparedTransactions() is used only at the end of recovery,
> where it is not necessary to look at the 2PC state files from the
> records. In this code path inredo is also set to false :)

That's not true. We will have entries with inredo set at the end of
recovery as well. In fact, the MarkAsPreparing() call from
RecoverPreparedTransactions() is the one which will remove these inredo
entries and convert them into regular entries. We have optimized the
recovery code path as well.

> + {
> + /*
> +* Entry could be on disk. Call with giveWarning=false
> +* since it can be expected during replay.
> +*/
> + RemoveTwoPhaseFile(xid, false);
> + }
> This would be removed at the end of recovery anyway as a stale entry,
> so that's not necessary.

Ok, will remove this.

> + /* Delete TwoPhaseState gxact entry and/or 2PC file. */
> + PrepareRedoRemove(parsed.twophase_xid);
> Both things should not be present, no? If the file is pushed to disk
> it means that the checkpoint horizon has already moved.

PREPARE in redo, followed by a checkpoint, followed by a
COMMIT/ROLLBACK. We can have both the bits set in this case.

> - ereport(ERROR,
> + /* It's ok to find an entry in the redo/recovery case */
> + if (!gxact->inredo)
> + ereport(ERROR,
> (errcode(ERRCODE_DUPLICATE_OBJECT),
> errmsg("transaction identifier \"%s\" is already in
> use",
> gid)));
> + else
> + {
> + found = true;
> + break;
> + }
> I would not have thought so.

Since we are using the TwoPhaseState structure to track redo entries,
at end of recovery, we will find existing entries.
Please see my comments above for RecoverPreparedTransactions().

> MarkAsPreparing and MarkAsPreparingInRedo really share the same code.
> What if the setup of the dummy PGPROC entry is made conditional?

I thought it was cleaner this way. We can definitely add a bunch of
if-else branches in MarkAsPreparing(), but it won't look pretty.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Yeah. Was thinking about this yesterday. How about adding entries in >> TwoPhaseState itself (which become valid later)? Only if it does not >> cause a lot of code churn. > > That's possible as well, yes. PFA a patch which does the above. It re-uses the TwoPhaseState gxact entries to track 2PC PREPARE/COMMIT in shared memory. The advantage being that CheckPointTwoPhase() becomes the only place where the fsync of 2PC files happens. A minor annoyance in the patch is the duplication of the code to add the 2nd while loop to go through these shared memory entries in PrescanPreparedTransactions, RecoverPreparedTransactions and StandbyRecoverPreparedTransactions. Other than this, I ran TAP tests and they succeed as needed. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services twophase_recovery_shmem_020217.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
>> Shouldn’t the singular part of the message above be:
>> "%u two-phase state file was written for a long-running prepared transaction"
>>
>> But, then, English is not my native language, so I might be off here :-)
>
> If there's one file per long-running prepared transaction, which I
> think is true, then I agree with you.

PFA, small patch to fix this, then.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

twophase_typo.patch
Description: Binary data
Re: [HACKERS] Speedup twophase transactions
>> CheckPointTwoPhase() in (5) does not sync this prepared transaction
>> because the checkpointer's KnownPreparedList is empty.
>
> And that's why this needs to be stored in shared memory with a number
> of elements made of max_prepared_xacts...

Yeah. Was thinking about this yesterday. How about adding entries in
TwoPhaseState itself (which become valid later)? Only if it does not
cause a lot of code churn.

The current non-shmem list patch only needs to handle standby shutdowns
correctly. Other aspects like standby promotion/recovery are handled ok
AFAICS.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Having CheckPointTwoPhase() do the flush would mean shifting the data >> from KnownPreparedList into TwoPhaseState shmem. > > Er, no. For CheckPointTwoPhase(), at recovery what needs to be done is > to have all the entries in KnownPreparedList() flushed to disk and > have those entries removed while holding a shared memory lock. The KnownPreparedList is constructed by the recovery process. CheckPointTwoPhase() gets called by the checkpointer process. The checkpointer does not have access to this valid KnownPreparedList. >And for > the rest we need to be careful to have PrescanPreparedTransactions, > RecoverPreparedTransactions and StandbyRecoverPreparedTransactions > aware of entries that are in KnownPreparedList(). Yeah, that part is straightforward. It does involve duplication of the earlier while loops to work against KnownPreparedList. A smart single while loop which loops over the 2PC files followed by the list would help here :-) > Let's leave the > business of putting the information from KnownPreparedList to > TwoPhaseState in RecoverPreparedTransactions, which need to be aware > of entries in KnownPreparedList() anyway. The only thing that differs > is how the 2PC information is fetched: from the segments or from the > files in pg_twophase. > Yeah. This part is also ok. We also got to be careful to mark the shmem gxact entry with "ondisk = false" and need to set prepare_start_lsn/prepare_end_lsn properly as well. >> I wonder what's the best location for this in the common case when we >> do shutdown of standby. We could add code in XLOG_CHECKPOINT_SHUTDOWN >> and XLOG_CHECKPOINT_ONLINE xlog_redo code path. > > ShutdownXLOG() calls CreateRestartPoint() when a standby shuts down, > so doing all the durability work in CheckPointTwoPhase() would take > care of any problems. > ShutdownXLOG() gets called from the checkpointer process. See comments above about the checkpointer not having access to the proper KnownPreparedList. 
The following test sequence will trigger the issue:

1) start master
2) start replica
3) prepare a transaction on master
4) shutdown master
5) shutdown replica

CheckPointTwoPhase() in (5) does not sync this prepared transaction
because the checkpointer's KnownPreparedList is empty.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
> (errmsg("unexpected timeline ID %u (should be %u)
> in checkpoint record",
> checkPoint.ThisTimeLineID, ThisTimeLineID)));
>
> +KnownPreparedRecreateFiles(checkPoint.redo);
> RecoveryRestartPoint();
> }
> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
> files are not flushed to disk with this patch. This is a problem as a
> new restart point is created... Having the flush in CheckpointTwoPhase
> really makes the most sense.

Having CheckPointTwoPhase() do the flush would mean shifting the data
from KnownPreparedList into TwoPhaseState shmem.

I wonder what's the best location for this in the common case when we
do shutdown of standby. We could add code in XLOG_CHECKPOINT_SHUTDOWN
and XLOG_CHECKPOINT_ONLINE xlog_redo code path.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>>> The xact_redo code will add prepared transactions to the
>>> KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
>>> file.
>>>
>>> At standby promote, the surviving (yet uncommitted) prepared
>>> transactions from KnownPreparedList need to be persisted, right?
>>
>> I don't see why, so please explain or show the error that can be
>> caused if we don't.
>
> I agree with Simon here. There is no point to fsync the 2PC files: in
> case of an immediate crash after promotion, replay will happen from
> the last checkpoint, aka the one before the promotion has been
> triggered. So there is no point to flush them at promotion, they would
> be replayed anyway.

1) start master
2) start streaming replica
3) on master:
   begin;
   create table test1(g int);
   prepare transaction 'test';
4) promote standby via trigger file or via "pg_ctl promote"

I thought if we don't fsync the KnownPreparedList then we might not
create the 2PC state properly on the standby. However, I do realize
that we will be calling RecoverPreparedTransactions() eventually on
promote. This function will be modified to go through the
KnownPreparedList to reload shmem appropriately for 2PC. So, all good
in that case. Apologies for the digression.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
On 27 January 2017 at 15:37, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 27 January 2017 at 09:59, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
>>>> But, I put the recovery process and the checkpointer process of the
>>>> standby under gdb with breakpoints on these functions, but both did
>>>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>>>> a promote :-|
>>>
>>> No end-of-recovery checkpoints happen at promotion since 9.3. You can
>>> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
>>> included) behavior.
>>
>> Ok, so that means, we also need to fsync out these 2PC XIDs at promote
>> time as well for their durability.
>
> Why?

The xact_redo code will add prepared transactions to the
KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
file.

At standby promote, the surviving (yet uncommitted) prepared
transactions from KnownPreparedList need to be persisted, right?

Regards,
Nikhils
Re: [HACKERS] Speedup twophase transactions
>> But, I put the recovery process and the checkpointer process of the
>> standby under gdb with breakpoints on these functions, but both did
>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>> a promote :-|
>
> No end-of-recovery checkpoints happen at promotion since 9.3. You can
> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
> included) behavior.

Ok, so that means, we also need to fsync out these 2PC XIDs at promote
time as well for their durability.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
>> promote" code path.
>
> CreateRestartPoint() calls it via CheckPointGuts() while in recovery.

Maybe I am missing something. But I put the recovery process and the
checkpointer process of the standby under gdb with breakpoints on these
functions, and both did not hit CreateRestartPoint() as well as
CheckPointGuts() when I issued a promote :-|

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> I look at this patch from you and that's present for me:
> https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
> (errmsg("unexpected timeline ID %u (should be %u)
> in checkpoint record",
> checkPoint.ThisTimeLineID, ThisTimeLineID)));
>
> +KnownPreparedRecreateFiles(checkPoint.redo);
> RecoveryRestartPoint();
> }

Oh, sorry. I was asking about CheckpointTwoPhase(). I don't see a
function by this name. And now I see, the name is CheckPointTwoPhase() :-)

> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
> files are not flushed to disk with this patch. This is a problem as a
> new restart point is created... Having the flush in CheckpointTwoPhase
> really makes the most sense.

Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
promote" code path.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> The question remains whether saving off a few fsyncs/reads for these >> long-lived prepared transactions is worth the additional code churn. >> Even if we add code to go through the KnownPreparedList, we still will >> have to go through the other on-disk 2PC transactions anyways. So, >> maybe not. > > We should really try to do things right now, or we'll never come back > to it. 9.3 (if my memory does not fail me?) has reduced the time to do > promotion by removing the need of the end-of-recovery checkpoint, > while I agree that there should not be that many 2PC transactions at > this point, if there are for a reason or another, the time it takes to > complete promotion would be impacted. So let's refactor > PrescanPreparedTransactions() so as it is able to handle 2PC data from > a buffer extracted by XlogReadTwoPhaseData(), and we are good to go. > Not quite. If we modify PrescanPreparedTransactions(), we also need to make RecoverPreparedTransactions() and StandbyRecoverPreparedTransactions() handle 2PC data via XlogReadTwoPhaseData(). > + /* > +* Move prepared transactions, if any, from KnownPreparedList to > files. > +* It is possible to skip this step and teach subsequent code about > +* KnownPreparedList, but PrescanPreparedTransactions() happens once > +* during end of recovery or on promote, so probably it isn't worth > +* the additional code. > +*/ This comment is misplaced. Does not make sense before this specific call. > + KnownPreparedRecreateFiles(checkPoint.redo); > RecoveryRestartPoint(); > Looking again at this code, I think that this is incorrect. The > checkpointer should be in charge of doing this work and not the > startup process, so this should go into CheckpointTwoPhase() instead. I don't see a function by the above name in the code? 
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> We are talking about the recovery/promote code path. Specifically this
> call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
>
> We write the files to disk and they get immediately read up in the
> following code. We could instead skip writing the files to disk and
> read KnownPreparedList in the code path that follows, as well as
> elsewhere.

Thinking more on this. The only optimization that's really remaining is
the handling of prepared transactions that have not been committed or
will linger around for long. The short-lived 2PC transactions have been
optimized already via this patch.

The question remains whether saving off a few fsyncs/reads for these
long-lived prepared transactions is worth the additional code churn.
Even if we add code to go through the KnownPreparedList, we still will
have to go through the other on-disk 2PC transactions anyway. So, maybe
not.

Regards,
Nikhils

>
> Regards,
> Nikhils
>
>>> The difference between those two is likely noise.
>>>
>>> By the way, in those measurements, the OS cache is still filled with
>>> the past WAL segments, which is a rather best case, no? What happens
>>> if you do the same kind of tests on a box where memory is busy doing
>>> something else and replayed WAL segments get evicted from the OS cache
>>> more aggressively once the startup process switches to a new segment?
>>> This could be tested for example on a VM with few memory (say 386MB or
>>> less) so as the startup process needs to access again the past WAL
>>> segments to recover the 2PC information it needs to get them back
>>> directly from disk... One trick that you could use here would be to
>>> tweak the startup process so as it drops the OS cache once a segment
>>> is finished replaying, and see the effects of an aggressive OS cache
>>> eviction. This patch is showing really nice improvements with the OS
>>> cache backing up the data, still it would make sense to test things
>>> with a worse test case and see if things could be done better.
The >>> startup process now only reads records sequentially, not randomly >>> which is a concept that this patch introduces. >>> >>> Anyway, perhaps this does not matter much, the non-recovery code path >>> does the same thing as this patch, and the improvement is too much to >>> be ignored. So for consistency's sake we could go with the approach >>> proposed which has the advantage to not put any restriction on the >>> size of the 2PC file contrary to what an implementation saving the >>> contents of the 2PC files into memory would need to do. >> >> Maybe i’m missing something, but I don’t see how OS cache can affect >> something here. >> >> Total WAL size was 0x44 * 16 = 1088 MB, recovery time is about 20s. >> Sequential reading 1GB of data >> is order of magnitude faster even on the old hdd, not speaking of ssd. Also >> you can take a look on flame graphs >> attached to previous message — majority of time during recovery spent in >> pg_qsort while replaying >> PageRepairFragmentation, while whole xact_redo_commit() takes about 1% of >> time. That amount can >> grow in case of uncached disk read but taking into account total recovery >> time this should not affect much. >> >> If you are talking about uncached access only during checkpoint than here we >> are restricted with >> max_prepared_transaction, so at max we will read about hundred of small >> files (usually fitting into one filesystem page) which will also >> be barely noticeable comparing to recovery time between checkpoints. Also >> wal segments cache eviction during >> replay doesn’t seems to me as standard scenario. >> >> Anyway i took the machine with hdd to slow down read speed and run tests >> again. During one of the runs i >> launched in parallel bash loop that was dropping os cache each second (while >> wal fragment replay takes >> also about one second). 
>>
>> 1.5M transactions
>> start segment: 0x06
>> last segment: 0x47
>>
>> patched, with constant cache_drop:
>> total recovery time: 86s
>>
>> patched, without constant cache_drop:
>> total recovery time: 68s
>>
>> (while difference is significant, i bet that happens mostly because of
>> database file segments should be re-read after cache drop)
>>
>> master, without constant cache_drop:
>> time to recover 35 segments: 2h 25m (after that i tired to wait)
>> expected total recovery time: 4.5 hours
>>
>> --
>> Stas Kelvich
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>
> --
> Nikhil Sontakke http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
> Thanks for review, Nikhil and Michael.
>
> I don’t follow here. We are moving data away from WAL to files on checkpoint
> because after checkpoint there is no guaranty that WAL segment with our
> prepared tx will be still available.

We are talking about the recovery/promote code path. Specifically this
call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().

We write the files to disk and they get immediately read up in the
following code. We could instead skip writing the files to disk and
read KnownPreparedList in the code path that follows, as well as
elsewhere.

Regards,
Nikhils

>> The difference between those two is likely noise.
>>
>> By the way, in those measurements, the OS cache is still filled with
>> the past WAL segments, which is a rather best case, no? What happens
>> if you do the same kind of tests on a box where memory is busy doing
>> something else and replayed WAL segments get evicted from the OS cache
>> more aggressively once the startup process switches to a new segment?
>> This could be tested for example on a VM with few memory (say 386MB or
>> less) so as the startup process needs to access again the past WAL
>> segments to recover the 2PC information it needs to get them back
>> directly from disk... One trick that you could use here would be to
>> tweak the startup process so as it drops the OS cache once a segment
>> is finished replaying, and see the effects of an aggressive OS cache
>> eviction. This patch is showing really nice improvements with the OS
>> cache backing up the data, still it would make sense to test things
>> with a worse test case and see if things could be done better. The
>> startup process now only reads records sequentially, not randomly
>> which is a concept that this patch introduces.
>>
>> Anyway, perhaps this does not matter much, the non-recovery code path
>> does the same thing as this patch, and the improvement is too much to
>> be ignored.
So for consistency's sake we could go with the approach >> proposed which has the advantage to not put any restriction on the >> size of the 2PC file contrary to what an implementation saving the >> contents of the 2PC files into memory would need to do. > > Maybe i’m missing something, but I don’t see how OS cache can affect > something here. > > Total WAL size was 0x44 * 16 = 1088 MB, recovery time is about 20s. > Sequential reading 1GB of data > is order of magnitude faster even on the old hdd, not speaking of ssd. Also > you can take a look on flame graphs > attached to previous message — majority of time during recovery spent in > pg_qsort while replaying > PageRepairFragmentation, while whole xact_redo_commit() takes about 1% of > time. That amount can > grow in case of uncached disk read but taking into account total recovery > time this should not affect much. > > If you are talking about uncached access only during checkpoint than here we > are restricted with > max_prepared_transaction, so at max we will read about hundred of small files > (usually fitting into one filesystem page) which will also > be barely noticeable comparing to recovery time between checkpoints. Also wal > segments cache eviction during > replay doesn’t seems to me as standard scenario. > > Anyway i took the machine with hdd to slow down read speed and run tests > again. During one of the runs i > launched in parallel bash loop that was dropping os cache each second (while > wal fragment replay takes > also about one second). 
>
> 1.5M transactions
> start segment: 0x06
> last segment: 0x47
>
> patched, with constant cache_drop:
> total recovery time: 86s
>
> patched, without constant cache_drop:
> total recovery time: 68s
>
> (while difference is significant, i bet that happens mostly because of
> database file segments should be re-read after cache drop)
>
> master, without constant cache_drop:
> time to recover 35 segments: 2h 25m (after that i tired to wait)
> expected total recovery time: 4.5 hours
>
> --
> Stas Kelvich
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speedup twophase transactions
>> Speeding up recovery or failover activity via a faster promote is a
>> desirable thing. So, maybe, we should look at teaching the relevant
>> code about using "KnownPreparedList"? I know that would increase the
>> size of this patch and would mean more testing, but this seems to be
>> the last remaining optimization in this code path.
>
> That's a good idea, worth having in this patch. Actually we may not
> want to call KnownPreparedRecreateFiles() here as promotion is not
> synonym of end-of-recovery checkpoint for a couple of releases now.

Once implemented, a good way to performance test this could be to set
checkpoint_timeout to a large value like an hour. Then, generate enough
2PC WAL while ensuring that a checkpoint does not happen automatically
or otherwise. We could then measure the time taken to recover on
startup to see the efficacy.

> Most of the 2PC syncs just won't happen, such transactions normally
> don't last long, and the number you would get during a checkpoint is
> largely lower than what would happen between two checkpoints. When
> working on Postgres-XC, the number of 2PC was really more funky.

Yes, Postgres-XL is full of 2PC, so hopefully this optimization should
help a lot in that case as well.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
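A hedged sketch of the settings such a test run might use (the values are illustrative, not recommendations; the point is only that checkpoint_timeout must exceed the WAL-generation phase so no checkpoint intervenes before the recovery measurement):

```ini
# postgresql.conf fragment for the 2PC recovery measurement (illustrative values)
checkpoint_timeout = 1h          # keep timed checkpoints out of the measurement window
max_wal_size = 20GB              # avoid size-triggered checkpoints as well
max_prepared_transactions = 100  # must be > 0 on both master and standby
wal_level = replica              # needed for the streaming standby
```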
Re: [HACKERS] Speedup twophase transactions
Hi Stas,

> So, here is brand new implementation of the same thing.
>
> Now instead of creating pgproc entry for prepared transaction during recovery,
> I just store recptr/xid correspondence in separate 2L-list and deleting
> entries in that list if redo process faced commit/abort. In case of
> checkpoint or end of recovery transactions remaining in that list are
> dumped to files in pg_twophase.
>
> Seems that current approach is way more simpler and patch has two times less
> LOCs then previous one.

The proposed approach and patch do appear to be much simpler than the
previous one. I have some comments/questions on the patch
(twophase_recovery_list_2.diff):

1)
+ /*
+* Move prepared transactions from KnownPreparedList to files, if any.
+* It is possible to skip that step and teach subsequent code about
+* KnownPreparedList, but whole PrescanPreparedTransactions() happens
+* once during end of recovery or promote, so probably it isn't worth
+* complications.
+*/
+ KnownPreparedRecreateFiles(InvalidXLogRecPtr);
+

Speeding up recovery or failover activity via a faster promote is a
desirable thing. So, maybe, we should look at teaching the relevant
code about using "KnownPreparedList"? I know that would increase the
size of this patch and would mean more testing, but this seems to be
the last remaining optimization in this code path.

2)
+ /*
+* Here we know that file should be moved to disk. But aborting recovery because
+* of absence of unnecessary file doesn't seems to be a good idea, so call remove
+* with giveWarning=false.
+*/
+ RemoveTwoPhaseFile(xid, false);

We are going to call the above in case of COMMIT/ABORT. If so, we
should always find the "xid" entry either in the KnownPreparedList or
as a file. Does it make sense to call the above function with "false"
then?

3) We are pushing the fsyncing of 2PC files to the checkpoint replay
activity with this patch.
Now, typically, we would assume that PREPARE followed by COMMIT/ABORT
would happen within a checkpoint replay cycle. If not, this would make
checkpoint replays slower, since the earlier spread-out fsyncs are now
getting bunched up. I had concerns about this, but clearly your latest
performance measurements don't show any negatives around this directly.

4) Minor nit-pick on existing code.

(errmsg_plural("%u two-phase state file was written "
"for long-running prepared transactions",
"%u two-phase state files were written "
"for long-running prepared transactions",
serialized_xacts,
serialized_xacts)

Shouldn’t the singular part of the message above be:
"%u two-phase state file was written for a long-running prepared transaction"

But, then, English is not my native language, so I might be off here :-)

5) Why isn't KnownPreparedRecreateFiles() tracking which entries from
KnownPreparedList have been written to disk in an earlier iteration
like in the normal code path? Why is it removing the entry from
KnownPreparedList and not just marking it as saved on disk?

6) Do we need to hold TwoPhaseStateLock lock in shared mode while
calling KnownPreparedRecreateFiles()? I think it does not matter in
the recovery/replay code path.

7) I fixed some typos and comments. PFA, patch attached.

Other than this, I ran TAP tests and they succeed as needed.

Regards,
Nikhils

On 23 January 2017 at 16:56, Stas Kelvich wrote:
>
>> On 27 Dec 2016, at 07:31, Michael Paquier wrote:
>>
>> I think that it would be a good idea to actually test that in pure
>> recovery time, aka no client, and just use a base backup and make it
>> recover X prepared transactions that have created Y checkpoints after
>> dropping cache (or restarting server).
>
> I did tests with following setup:
>
> * Start postgres initialised with pgbench
> * Start pg_receivexlog
> * Take basebackup
> * Perform 1.5 M transactions
> * Stop everything and apply wal files stored by pg_receivexlog to base backup.
> > All tests performed on a laptop with nvme ssd > number of transactions: 1.5M > start segment: 0x4 > > -master non-2pc: > last segment: 0x1b > average recovery time per 16 wal files: 11.8s > average total recovery time: 17.0s > > -master 2pc: > last segment: 0x44 > average recovery time per 16 wal files: 142s > average total recovery time: 568s > > -patched 2pc (previous patch): > last segment: 0x44 > average recovery time per 16 wal files: 5.3s > average total recovery time: 21.2s > > -patched2 2pc (dlist_push_tail changed to dlist_push_head): > last segment: 0x44 > average recovery time
Re: [HACKERS] Compression of full-page-writes
Hi Fujii-san, I must be missing something really trivial, but why not try to compress all types of WAL blocks and not just FPW? Regards, Nikhils On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao masao.fu...@gmail.com wrote: Hi, The attached patch adds a new GUC parameter 'compress_backup_block'. When this parameter is enabled, the server just compresses FPW (full-page-writes) in WAL by using pglz_compress() before inserting them into the WAL buffers. Then, the compressed FPW is decompressed in recovery. This is a very simple patch. The purpose of this patch is the reduction of WAL size. Under heavy write load, the server needs to write a large amount of WAL and this is likely to be a bottleneck. What's worse is that, in replication, a large amount of WAL has a harmful effect not only on WAL writing in the master, but also on WAL streaming and WAL writing in the standby. Also we would need to spend more money on storage to store such a large amount of data. I'd like to alleviate such harmful situations by reducing WAL size. My idea is very simple: just compress FPW, because FPW is a big part of WAL. I used pglz_compress() as the compression method, but you might think that another method is better. We can add something like an FPW-compression-hook for that later. The patch adds a new GUC parameter, but I'm thinking of merging it into the full_page_writes parameter to avoid increasing the number of GUCs. That is, I'm thinking of changing full_page_writes so that it can accept the new value 'compress'. I measured how much WAL this patch can reduce by using pgbench.
* Server spec
  CPU: 8 core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  Mem: 16GB
  Disk: 500GB SSD Samsung 840

* Benchmark
  pgbench -c 32 -j 4 -T 900 -M prepared
  scaling factor: 100
  checkpoint_segments = 1024
  checkpoint_timeout = 5min
  (every checkpoint during the benchmark was triggered by checkpoint_timeout)

* Result
  [tps]
  1386.8 (compress_backup_block = off)
  1627.7 (compress_backup_block = on)

  [the amount of WAL generated while running pgbench]
  4302 MB (compress_backup_block = off)
  1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third! The patch is still WIP, but I'd like to hear opinions about this idea before completing it, and then add the patch to the next CF if okay. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
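For intuition on why FPW compression pays off so dramatically (4302 MB down to 1521 MB above, roughly one-third), note that a full-page image is typically dominated by zeroed free space and repetitive tuple data. The sketch below uses zlib as a stand-in for pglz, and the page layout is entirely made up; it only illustrates that a mostly-empty 8 kB page compresses to far below one-third of its raw size:

```python
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size (BLCKSZ)

# Synthetic full-page image: a tiny fake header, a handful of fake
# "tuples" near the end, and zeroed free space in between -- roughly
# the shape of a freshly dirtied heap page (layout invented here).
page = bytearray(PAGE_SIZE)
page[:24] = b"\x01" * 24                 # stand-in page header bytes
for off in range(7000, 8000, 100):       # ten fake 40-byte tuples
    page[off:off + 40] = bytes(range(40))

compressed = zlib.compress(bytes(page))
ratio = len(compressed) / PAGE_SIZE      # far below 1/3 for this page
```

The reported whole-workload reduction (1521/4302 ≈ 0.35) is less extreme than this toy page because real WAL also carries incompressible record headers and non-FPW records.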
Re: [HACKERS] Custom gucs visibility
If we haven't loaded the .so yet, where would we get the list of custom GUCs from? This has come up before. We could show the string value of the GUC, if it's been set in postgresql.conf, but we do not have correct values for any of the other columns in pg_settings; nor are we even sure that the module will think the value is valid once it does get loaded. So the consensus has been that allowing the GUC to be printed would be more misleading than helpful. How about printing them with something along the lines of, Please load extension foobar for details or (less informative, but possibly easier to code) libfoobar.so not loaded. ? Well, we have done the CREATE EXTENSION successfully earlier. Also, the GUC becomes automagically visible after the backend has executed a function from that extension ( in which case the .so gets loaded as part of the function handling). Also note that SET foo.custom_guc works ok by setting up a placeholder guc if the .so has not been loaded yet. I wonder if we should dare to try to load the .so if a 'SHOW extension_name.custom_guc' is encountered via internal_load_library or something? Obviously we should check if the extension was created before as well. Regards, Nikhils Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
[HACKERS] Custom gucs visibility
Hi, Suppose I have a contrib module which adds a custom GUC via DefineCustomIntVariable(), for example. That custom GUC is not visible via a SHOW command unless the corresponding .so has been loaded into memory. E.g.

postgres=# show pg_buffercache.fancy_custom_guc;
ERROR:  unrecognized configuration parameter "pg_buffercache.fancy_custom_guc"

Should we do something about this, or are we ok with the current behavior? I would prefer to be able to see the config parameter value if I happen to want it before that contrib module is loaded. Thoughts? Regards, Nikhils
Re: [HACKERS] Bug in ALTER COLUMN SET DATA TYPE ?
postgres=# CREATE TABLE test (a float check (a > 10.2));
CREATE TABLE
postgres=# CREATE TABLE test_child() INHERITS(test);
CREATE TABLE
postgres=# ALTER TABLE test ALTER COLUMN a SET DATA TYPE numeric;
ERROR: constraint must be added to child tables too

Interestingly, this works in 8.3 and earlier ... but it fails as described in all later versions. So we broke it quite some time back. It would be good to identify exactly what change broke it. Hmm.. I haven't yet tested on other branches. But I'm surprised that you find it broken on such old branches. git annotate suggests that the offending error message was added by commit 09ff76fcdb275769ac4d1a45a67416735613d04b, and that commit is dated Apr 20, 2012. Mea culpa. I did indeed add that error message. But it's strange when Tom says that this is broken since 8.3! Regards, Nikhils -- StormDB - http://www.stormdb.com The Database Cloud Postgres-XC Support and Service
Re: [HACKERS] Bug in ALTER COLUMN SET DATA TYPE ?
So coming back to the issue, do you think it's a good idea to teach ATAddCheckConstraint() that the call is coming from a late phase of ALTER TABLE ? +1 You mentioned AT_PASS_OLD_INDEX in your original mail, but I guess that you meant that we should check for AT_PASS_OLD_CONSTR and then not raise an error? Regards, Nikhils -- StormDB - http://www.stormdb.com The Database Cloud Postgres-XC Support and Service
Re: [HACKERS] xml_is_document and selective pg_re_throw
No, I don't see any particular risk there. The places that might throw ERRCODE_INVALID_XML_DOCUMENT are sufficiently few (as in, exactly one, in this usage) that we can have reasonable confidence we know what the system state is when we catch that error. Hmmm, I was writing some code in which I happened to hold an LWLock when this function was called. The first catch/rethrow cleaned up the InterruptHoldoffCount value. A subsequent release of that LWLock tripped up the Assert(InterruptHoldoffCount > 0) inside RESUME_INTERRUPTS(). I know holding an lwlock like this might not be a good idea, but this behavior just got me thinking about other probable issues. Regards, Nikhils A better way would have been to modify xml_parse to take an additional boolean argument to_rethrow and not to rethrow if that is false? We could do that, but it would greatly complicate xml_parse IMO, since it still needs its own PG_TRY block to handle other error cases, and only one of those error cases ought to optionally return failure instead of re-throwing. regards, tom lane
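The hazard Nikhil describes can be modeled with a toy holdoff counter. All names below are hypothetical simplifications of HOLD_INTERRUPTS()/RESUME_INTERRUPTS() and PostgreSQL's longjmp-based error recovery, not the real implementation: if error cleanup resets the holdoff depth while a caller still holds a lock, the later lock release trips the assertion.

```python
class Backend:
    """Toy model (all names hypothetical) of interrupt-holdoff tracking."""

    def __init__(self):
        self.interrupt_holdoff_count = 0

    def hold_interrupts(self):
        # HOLD_INTERRUPTS(): done e.g. when an LWLock is acquired
        self.interrupt_holdoff_count += 1

    def resume_interrupts(self):
        # RESUME_INTERRUPTS(): Assert(InterruptHoldoffCount > 0)
        assert self.interrupt_holdoff_count > 0, "holdoff underflow"
        self.interrupt_holdoff_count -= 1

    def xml_parse(self, valid):
        if not valid:
            raise ValueError("INVALID_XML_DOCUMENT")

    def xml_is_document(self, valid):
        try:
            self.xml_parse(valid)
            return True
        except ValueError:
            # Swallowing the error resets the holdoff depth, as the
            # error-recovery path does -- even though our caller may
            # still hold a lock acquired before the call.
            self.interrupt_holdoff_count = 0
            return False

backend = Backend()
backend.hold_interrupts()                        # caller acquired an LWLock
assert backend.xml_is_document(False) is False   # error swallowed here
tripped = False
try:
    backend.resume_interrupts()                  # LWLockRelease() would do this
except AssertionError:
    tripped = True                               # the RESUME_INTERRUPTS assert fires
```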
[HACKERS] xml_is_document and selective pg_re_throw
Hi, Consider: SELECT xml '<foo>bar</foo><bar>foo</bar>' IS DOCUMENT; And I was looking at the xml_is_document() source code. It calls xml_parse, which throws an error with the code set to ERRCODE_INVALID_XML_DOCUMENT. The catch block of xml_parse then rethrows. Now xml_is_document does a selective rethrow only if the error is not ERRCODE_INVALID_XML_DOCUMENT. I can understand that this function does this to return true/false, but isn't this behavior of not propagating the error all the way up dangerous? InterruptHoldoffCount inconsistencies, for instance? A better way would have been to modify xml_parse to take an additional boolean argument to_rethrow and not to rethrow if that is false? Thoughts? Regards, Nikhils
Re: [HACKERS] B-tree page deletion boundary cases
Was wondering if there's a similar bug which gets triggered while using VACUUM FULL. See for instance this thread: http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html This issue has been reported on-off from time to time, and in most cases VACUUM or VACUUM FULL appears to be involved. We have usually attributed it to hardware issues, and reindex has been recommended by default as a solution/workaround. I do not perceive much similarity. The bug I've raised can produce wrong query results transiently. It might permit injecting a tuple into the wrong spot in the tree, yielding persistent wrong results. It would not introduce tree-structural anomalies like sibling pointers directed at zeroed pages or internal pages in a 1-level tree. Given the symptoms you reported, I share Robert's suspicion of WAL replay in your scenario. Thanks for taking the time to analyze this, Noah. Regards, Nikhils
Re: [HACKERS] B-tree page deletion boundary cases
Hi Noah, Was wondering if there's a similar bug which gets triggered while using VACUUM FULL. See for instance this thread: http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html This issue has been reported on-off from time to time and in most cases VACUUM or VACUUM FULL appears to be involved. We have usually attributed it to hardware issues and reindex has been recommended by default as a solution/work around.. Regards, Nikhils On Sat, Apr 21, 2012 at 10:22 PM, Noah Misch n...@leadboat.com wrote: For the sake of concurrency, our B-tree implementation has a phased process for reusing empty pages. Excerpting from nbtree/README: A deleted page cannot be reclaimed immediately, since there may be other processes waiting to reference it (ie, search processes that just left the parent, or scans moving right or left from one of the siblings). These processes must observe that the page is marked dead and recover accordingly. Searches and forward scans simply follow the right-link until they find a non-dead page --- this will be where the deleted page's key-space moved to. ... A deleted page can only be reclaimed once there is no scan or search that has a reference to it; until then, it must stay in place with its right-link undisturbed. We implement this by waiting until all transactions that were running at the time of deletion are dead; which is overly strong, but is simple to implement within Postgres. When marked dead, a deleted page is labeled with the next-transaction counter value. VACUUM can reclaim the page for re-use when this transaction number is older than the oldest open transaction. The VACUUM that deletes a page's last tuple calls _bt_pagedel(), which flags the page BTP_DELETED and stores therein the result of ReadNewTransactionId(). When a later VACUUM visits such a page and observes that the stored XID is now less than or equal to RecentXmin, it adds the page to the FSM. 
An INSERT or UPDATE will pull the page from the FSM and repurpose it. As I mentioned[1] peripherally back in November, that algorithm has been insufficient since the introduction of non-XID-bearing transactions in PostgreSQL 8.3. Such transactions do not restrain RecentXmin. If no running transaction has an XID, RecentXmin == ReadNewTransactionId() and the page incorrectly becomes available for immediate reuse. This is difficult to encounter in practice. VACUUM acquires a cleanup lock on every leaf page in each index. Consequently, a problem can only arise around a leaf page deletion when two VACUUMs visit the page during the narrow window when _bt_steppage() has released the pin on its left sibling and not yet acquired a read lock on the target. Non-leaf deletions might not require such narrow conditions, but they are also exponentially less frequent. Here is a test procedure illustrating the bug:

1. In session S1, run these commands:
BEGIN;
CREATE TABLE t (x int, filler character(400));
INSERT INTO t SELECT *, '' FROM generate_series(1, 10000);
CREATE INDEX ON t(x);
DELETE FROM t WHERE x >= 2000 AND x < 4000;
COMMIT;
BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL enable_bitmapscan = off;
DECLARE c CURSOR FOR SELECT x FROM t WHERE x >= 1990 AND x < 4510;
FETCH 5 FROM c;

2. Attach gdb to S1 and set a breakpoint on _bt_getbuf.

3. In S1, run FETCH 10 FROM c. This will hit the breakpoint.

4. In another session S2, run these commands:
VACUUM VERBOSE t; -- mark some pages BTP_DELETED
VACUUM VERBOSE t; -- update FSM to know about the pages
-- reuse the pages
INSERT INTO t SELECT * FROM generate_series(10001, 12000);

5. Exit gdb to free up S1. The FETCH only returns five rows.

(The filler column makes each index page correspond to more heap pages. Without it, heap page pins prevent removing some of the tuples on the index page under test.) The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin. We already use RecentGlobalXmin when wal_level = hot_standby.
If no running transaction has an XID and all running transactions began since the last transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId(). Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <= RecentGlobalXmin as we have today. This also cleanly removes the need for the bit of code in _bt_getbuf() that decrements btpo.xact before sending it down for ResolveRecoveryConflictWithSnapshot(). I suggested[2] that decrement on an unprincipled basis; it was just masking the off-by-one of using <= RecentGlobalXmin instead of < RecentGlobalXmin in _bt_page_recyclable(). This change makes empty B-tree pages wait through two generations of running transactions before reuse, so some additional bloat will arise. Furthermore, the set of transactions having snapshots
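The off-by-one Noah describes can be modeled with plain integers (the real code compares TransactionIds with the wraparound-aware TransactionIdPrecedes(), which this sketch ignores): when no running transaction has an XID, RecentGlobalXmin can equal the very value stamped on the deleted page, and the non-strict comparison recycles the page while a concurrent scan may still visit it.

```python
def recyclable_old(btpo_xact: int, recent_global_xmin: int) -> bool:
    # The pre-fix form of the test: btpo.xact <= RecentGlobalXmin
    return btpo_xact <= recent_global_xmin

def recyclable_fixed(btpo_xact: int, recent_global_xmin: int) -> bool:
    # The proposed test: btpo.xact must *strictly* precede RecentGlobalXmin
    return btpo_xact < recent_global_xmin

# When no running transaction has an XID (and all began after the last
# XID-bearing one), RecentGlobalXmin can equal ReadNewTransactionId() --
# the very value _bt_pagedel() stamped on the deleted page.
next_xid = 1000
stamped_xact = next_xid        # stored on the page at deletion time
recent_global_xmin = next_xid  # no running transaction holds an XID
```

With the strict test, the page must wait until RecentGlobalXmin advances past the stamped XID, i.e. one more generation of transactions, before reuse.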
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
Displace yes. It would error out if someone says ALTER TABLE ONLY... CHECK (); suggesting to use the ONLY with the CHECK. I'd say the behavior for that case can revert to the PostgreSQL 9.1 behavior. If the table has children, raise an error. Otherwise, add an inheritable CHECK constraint, albeit one lacking inheritors at that moment. Ok, that sounds reasonable. Another thing that we should consider is that if we are replacing ONLY with NO INHERIT, then instead of just making a cosmetic syntactic change, we should also replace all the is*only type of field names with noinherit for the sake of completeness and uniformity. Regards, Nikhils
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
Hi, So, I have a patch for this. This patch introduces support for the CHECK ONLY syntax while doing a CREATE TABLE as well as during the usual ALTER TABLE command. Example:

create table atacc7 (test int, test2 int CHECK ONLY (test > 0), CHECK (test2 > 10));
create table atacc8 () inherits (atacc7);

postgres=# \d+ atacc7
                  Table "public.atacc7"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 test   | integer |           | plain   |
 test2  | integer |           | plain   |
Check constraints:
    "atacc7_test2_check" CHECK (test2 > 10)
    "atacc7_test_check" CHECK ONLY (test > 0)
Child tables: atacc8
Has OIDs: no

postgres=# \d+ atacc8
                  Table "public.atacc8"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 test   | integer |           | plain   |
 test2  | integer |           | plain   |
Check constraints:
    "atacc7_test2_check" CHECK (test2 > 10)
Inherits: atacc7
Has OIDs: no

This patch removes the support for:

ALTER TABLE ONLY constraint_rename_test ADD CONSTRAINT con2 CHECK (b > 0);

and uses

ALTER TABLE constraint_rename_test ADD CONSTRAINT con2 CHECK ONLY (b > 0);

Is this what we want? Or would we want the earlier support in place for backward compatibility as well? We are actually introducing this in 9.2, so I guess we can remove this. This is a much cleaner implementation, and we might not even need the changes in pg_dump now because pg_get_constraintdef can provide the info about the ONLY part too. So some cleanup can be done if needed. I know it's a bit late in the commitfest, but if this patch makes this feature more complete, maybe we should consider... Thoughts? P.S. Here's the discussion thread in its entirety for reference: http://postgresql.1045698.n5.nabble.com/how-to-create-a-non-inherited-CHECK-constraint-in-CREATE-TABLE-td5152184.html Regards, Nikhils On Thu, Feb 2, 2012 at 1:32 AM, Peter Eisentraut pete...@gmx.net wrote: On ons, 2012-01-18 at 18:17 -0500, Robert Haas wrote: I agree with Peter that we should have CHECK ONLY.
ONLY is really a property of the constraint, not the ALTER TABLE command -- if it were otherwise, we wouldn't need to store it in the system catalogs, but of course we do. The fact that it's not a standard property isn't a reason not to have proper syntax for it. Clearly, we will eventually want to support inherited and non-inherited constraints of all types. Currently, each type of constraint has an implicit default regarding this property:

check - inherited
not null - inherited
foreign key - not inherited
primary key - not inherited
unique - not inherited
exclusion - not inherited

As discussed above, we need to have a syntax that is attached to the constraint, not the table operation that creates the constraint, so that we can also create these in CREATE TABLE. How should we resolve these different defaults? Also, in ALTER TABLE, if you want to add either an inherited or not inherited constraint to a parent table, you should really say ALTER TABLE ONLY in either case. Because it's conceivably valid that ALTER TABLE foo ADD CHECK () NOINHERIT would add an independent, not inherited check constraint to each child table. So, there are all kinds of inconsistencies and backward compatibility problems lurking here. We might need either a grand transition plan or to document the heck out of these inconsistencies. check_constraint_create_table_support.patch Description: Binary data
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
Hi, Cumulative reaction to all the responses first: Whoa! :) I was under the impression that a majority of us felt that the current mechanism was inadequate. Also, if you go through the nabble thread, the fact that CREATE TABLE did not support such constraints was considered to be an annoyance. And I was asked if/when I could provide this functionality. Apologies, though, about the timing. +1 for fixing up the syntax before 9.2 goes out the door. I think the original syntax was misguided to begin with. Well, it was fine in isolation, but once you consider how to make CREATE TABLE do this too, it's hard to avoid the conclusion that you need to attach the modifier to the CHECK constraint, not the ALTER TABLE command. Yeah, exactly. CHECK NO INHERIT sounds fine to me; will that display ALTER TABLE ONLY x as the one true way of doing this? s/display/displace/, I think you meant? Yeah, that's what I understand the proposal to be. Displace, yes. It would error out if someone says ALTER TABLE ONLY... CHECK (); suggesting to use the ONLY with the CHECK. This patch does this and also makes both CREATE TABLE and ALTER TABLE use it in a uniform manner. Regarding NO INHERIT versus ONLY, we again had discussions on the longish original thread quite a while back: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html But now, if we prefer NO INHERIT, I can live with that. Regards, Nikhils
Re: [HACKERS] Review of patch renaming constraints
Make check passed. Patch has tests for rename constraint. Most normal uses of alter table ... rename constraint ... worked normally. However, the patch does not deal correctly with constraints which are not inherited, such as primary key constraints: That appears to be because creating a primary key constraint does not set pg_constraint.conisonly correctly. This was introduced recently with noninherited check constraints. Umm, conisonly is set to false for primary key entries in pg_constraint. And primary keys are anyway not inherited. So why is the conisonly field interfering in rename? Seems quite orthogonal to me. Regards, Nikhils
Re: [HACKERS] Review of patch renaming constraints
And primary keys are anyway not inherited. So why is the conisonly field interfering in rename? Seems quite orthogonal to me. In the past, each kind of constraint was either always inherited or always not, implicitly. Now, for check constraints we can choose what we want, and in the future, perhaps we will want to choose for primary keys as well. So having conisonly is really a good step into that future, and we should use it uniformly. Agreed. And right now primary key constraints are not marked as only, making them available for inheritance in the future. Or would you prefer it otherwise? Anyway, I fail to see the direct connection between this and renaming. Might have to look at this patch for that. Regards, Nikhils
Re: [HACKERS] how to create a non-inherited CHECK constraint in CREATE TABLE
It appears that the only way to create a non-inherited CHECK constraint is using ALTER TABLE. Is there no support in CREATE TABLE planned? That looks a bit odd. There are no plans to do that AFAIR, though maybe you could convince Nikhil to write the patch to do so. That certainly doesn't meet the principle of least surprise... CREATE TABLE should support this. Well, the above was thought about during the original discussion and eventually we felt that CREATE TABLE already has other issues as well, so not having this done as part of creating a table was considered acceptable then: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html#a4647144 But, let me have a stab at it when I get some free cycles. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I have also tried to change the error message as per Alvaro's suggestions. I will really try to see if we have other issues. Really cannot have Robert reporting all the bugs and getting annoyed, but there are lotsa variations possible with inheritance.. So did you find anything? Nothing much as yet. I think we are mostly there with the code but will keep on trying some more variations along the lines of what Robert tried out whenever I get the time next. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I will really try to see if we have other issues. Really cannot have Robert reporting all the bugs and getting annoyed, but there are lotsa variations possible with inheritance.. BTW thank you for doing the work on this. Probably the reason no one bothers too much with inheritance is that even something that's seemingly simple such as what you tried to do here has all these little details that's hard to get right. Thanks Alvaro. Yeah, inheritance is semantics quicksand :) Appreciate all the help from you, Robert and Alex Hunsaker on this. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I don't think this is a given ... In fact, IMO if we're only two or three fixes away from having it all nice and consistent, I think reverting is not necessary. Sure. It's the if part of that sentence that I'm not too sure about. Any specific area of the code that you think is/has become fragile (apart from the non-support for CREATE TABLE based ONLY constraints)? The second bug is a variant of the first. And I have provided a patch for it. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
Hi, There is at least one other problem. Consider:

rhaas=# create table a (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# create table b (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# alter table b inherit a;
ALTER TABLE

This needs to fail if chk is an only constraint, but it doesn't, even with this patch. As you rightly point out, this is because we cannot specify ONLY constraints inside a CREATE TABLE as of now. I think that part of the problem here is fuzzy thinking about the meaning of the word ONLY in ALTER TABLE ONLY b. The word ONLY there is really supposed to mean that the operation is performed on b but not on its descendents; but that's not what you're doing: you're really performing a different operation. In theory, for a table with no current descendents, ALTER TABLE ONLY b and ALTER TABLE b ought to be identical, but here they are not. I think that's probably bad. ONLY has inheritance-based connotations for present/future children. ALTER ONLY combined with ADD is honored better now with this patch IMO (at least for constraints, I think). Another manifestation of this problem is that there's no way to add an ONLY constraint in a CREATE TABLE command. I think that's bad, too: it should be possible to declare any constraint at table creation time, and if the ONLY were part of the constraint definition rather than part of the target-table specification, that would work fine. As it is, it doesn't. Well, the above was thought about during the original discussion, and eventually we felt that CREATE TABLE already has other issues as well, so not having this done as part of creating a table was considered acceptable: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html#a4647144 I am tempted to say we should revert this and rethink. I don't believe we are only a small patch away from finding all the bugs here.
Sure, if we all think that CREATE TABLE should support ONLY CONSTRAINT type of syntax, then +1 for reverting this and a subsequent revised submission. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
And yeah, certainly there's a bug as you point out.

postgres=# create table a1 (ff1 int, constraint chk check (ff1 > 0));
postgres=# create table b1 (ff1 int);
postgres=# alter table only b1 add constraint chk check (ff1 > 0);
postgres=# alter table b1 inherit a1;

The last command should have refused to inherit. Regards, Nikhils On Fri, Dec 23, 2011 at 8:55 AM, Nikhil Sontakke nikkh...@gmail.com wrote: Hi, There is at least one other problem. Consider:

rhaas=# create table a (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# create table b (ff1 int, constraint chk check (ff1 > 0));
CREATE TABLE
rhaas=# alter table b inherit a;
ALTER TABLE

This needs to fail if chk is an only constraint, but it doesn't, even with this patch. As you rightly point out, this is because we cannot specify ONLY constraints inside a CREATE TABLE as of now. I think that part of the problem here is fuzzy thinking about the meaning of the word ONLY in ALTER TABLE ONLY b. The word ONLY there is really supposed to mean that the operation is performed on b but not on its descendents; but that's not what you're doing: you're really performing a different operation. In theory, for a table with no current descendents, ALTER TABLE ONLY b and ALTER TABLE b ought to be identical, but here they are not. I think that's probably bad. ONLY has inheritance-based connotations for present/future children. ALTER ONLY combined with ADD is honored better now with this patch IMO (at least for constraints, I think). Another manifestation of this problem is that there's no way to add an ONLY constraint in a CREATE TABLE command. I think that's bad, too: it should be possible to declare any constraint at table creation time, and if the ONLY were part of the constraint definition rather than part of the target-table specification, that would work fine. As it is, it doesn't.
Well, the above was thought about during the original discussion and eventually we felt that CREATE TABLE already has other issues as well, so not having this done as part of creating a table was considered acceptable: http://postgresql.1045698.n5.nabble.com/Check-constraints-on-partition-parents-only-tt464.html#a4647144 I am tempted to say we should revert this and rethink. I don't believe we are only a small patch away from finding all the bugs here. Sure, if we all think that CREATE TABLE should support ONLY CONSTRAINT type of syntax, then +1 for reverting this and a subsequent revised submission. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
I don't think this is a given ... In fact, IMO if we're only two or three fixes away from having it all nice and consistent, I think reverting is not necessary. FWIW, here's a quick fix for the issue that Robert pointed out. Again it's a variation of the first issue that he reported. We have two functions which try to merge existing constraints: MergeWithExistingConstraint() in heap.c MergeConstraintsIntoExisting() in tablecmds.c Nice names indeed :) I have also tried to change the error message as per Alvaro's suggestions. I will really try to see if we have other issues. Really cannot have Robert reporting all the bugs and getting annoyed, but there are lotsa variations possible with inheritance.. Regards, Nikhils only_constraint_no_merge_v2.0.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: Non-inheritable check constraints
Agreed. I just tried out the scenarios laid out by you both with and without the committed patch and AFAICS, normal inheritance semantics have been preserved properly even after the commit. No, they haven't. I didn't expect this to break anything when you have two constraints with different names. The problem is when you have two constraints with the same name. Testing reveals that this is, in fact, broken:

rhaas=# create table A(ff1 int);
CREATE TABLE
rhaas=# create table B () inherits (A);
CREATE TABLE
rhaas=# create table C () inherits (B);
CREATE TABLE
rhaas=# alter table only b add constraint chk check (ff1 > 0);
ALTER TABLE
rhaas=# alter table a add constraint chk check (ff1 > 0);
NOTICE: merging constraint "chk" with inherited definition
ALTER TABLE

At this point, you'll find that a has a constraint, and b has a constraint, but *c does not have a constraint*. That's bad, because a's constraint wasn't only and should therefore have propagated all the way down the tree. Apologies, I did not check this particular scenario. I guess, here, we should not allow merging of the inherited constraint into an only constraint. Because that breaks the semantics for only constraints. If this sounds ok, I can whip up a patch for the same. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
rhaas=# create table A(ff1 int);
CREATE TABLE
rhaas=# create table B () inherits (A);
CREATE TABLE
rhaas=# create table C () inherits (B);
CREATE TABLE
rhaas=# alter table only b add constraint chk check (ff1 > 0);
ALTER TABLE
rhaas=# alter table a add constraint chk check (ff1 > 0);
NOTICE: merging constraint "chk" with inherited definition
ALTER TABLE

At this point, you'll find that a has a constraint, and b has a constraint, but *c does not have a constraint*. That's bad, because a's constraint wasn't only and should therefore have propagated all the way down the tree. Apologies, I did not check this particular scenario. I guess, here, we should not allow merging of the inherited constraint into an only constraint. Because that breaks the semantics for only constraints. If this sounds ok, I can whip up a patch for the same. PFA, patch which does just this.

postgres=# alter table a add constraint chk check (ff1 > 0);
ERROR: constraint "chk" for relation "b" is an ONLY constraint. Cannot merge

Regards, Nikhils only_constraint_no_merge.patch Description: Binary data
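The rule the attached patch enforces can be sketched as a tiny model. The names below loosely follow pg_constraint fields, but this is a hypothetical simplification, not the actual merge code in heap.c/tablecmds.c: when an inherited constraint is pushed down onto a child whose same-named constraint is marked conisonly, the merge must be rejected rather than silently absorbed (silent absorption is what left c without a constraint above).

```python
class CheckConstraint:
    def __init__(self, name, expr, conisonly=False, coninhcount=0):
        self.name = name
        self.expr = expr
        self.conisonly = conisonly      # "only" constraint: never inherited
        self.coninhcount = coninhcount  # number of direct inheritance parents

def merge_inherited_constraint(child_constraints, parent_con):
    """Try to merge an inherited copy of parent_con into a child's list.
    Returns True if merged, False if the caller should copy it down instead."""
    for con in child_constraints:
        if con.name == parent_con.name:
            if con.conisonly:
                # The fix: an ONLY constraint must not absorb an inherited
                # one, or descendants below this child never receive it.
                raise ValueError(
                    'constraint "%s" is an ONLY constraint; cannot merge'
                    % con.name)
            con.coninhcount += 1
            return True
    return False

# The scenario from the example: b's "chk" is an ONLY constraint,
# then a's inherited "chk" arrives.
parent_chk = CheckConstraint("chk", "ff1 > 0")
b_constraints = [CheckConstraint("chk", "ff1 > 0", conisonly=True)]

rejected = False
try:
    merge_inherited_constraint(b_constraints, parent_chk)
except ValueError:
    rejected = True
```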
Re: [HACKERS] Review: Non-inheritable check constraints
It seems hard to believe that ATExecDropConstraint() doesn't need any adjustment. Yeah, I think we could return early for only-type constraints. Also, shouldn't the systable scan break out of the while loop after a matching constraint name has been found? As of now, it's doing a full scan of all the constraints for the given relation.

@@ -6755,6 +6765,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 	HeapTuple	tuple;
 	bool		found = false;
 	bool		is_check_constraint = false;
+	bool		is_only_constraint = false;
 
 	/* At top level, permission check was done in ATPrepCmd, else do it */
 	if (recursing)
@@ -6791,6 +6802,12 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 		/* Right now only CHECK constraints can be inherited */
 		if (con->contype == CONSTRAINT_CHECK)
 			is_check_constraint = true;
+
+		if (con->conisonly)
+		{
+			Assert(is_check_constraint);
+			is_only_constraint = true;
+		}
 
 		/*
 		 * Perform the actual constraint deletion
@@ -6802,6 +6819,9 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 		performDeletion(&conobj, behavior);
 
 		found = true;
+
+		/* constraint found - break from the while loop now */
+		break;
 	}
 
 	systable_endscan(scan);
@@ -6830,7 +6850,7 @@ ATExecDropConstraint(Relation rel, const char *constrName,
 	 * routines, we have to do this one level of recursion at a time; we can't
 	 * use find_all_inheritors to do it in one pass.
 	 */
-	if (is_check_constraint)
+	if (is_check_constraint && !is_only_constraint)
 		children = find_inheritance_children(RelationGetRelid(rel), lockmode);
 	else
 		children = NIL;

PFA, revised version containing the above changes, based on Alvaro's v4 patch. Regards, Nikhils non_inh_check_v5.0.patch Description: Binary data
Re: [HACKERS] Review: Non-inheritable check constraints
PFA, revised version containing the above changes based on Alvaro's v4 patch. Committed with these fixes, and with my fix for the pg_dump issue noted by Alex, too. Thanks. Thanks a lot Alvaro! Regards, Nikhils -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Review: Non-inheritable check constraints
But did you see Robert Haas' response upthread? It looks like there's still some work to do -- at least some research to determine that we're correctly handling all those cases. As the committer I'm responsible for it, but I don't have resources to get into it any time soon. Yeah, am definitely planning to test out those scenarios and will respond some time late today. Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Robert,

First of all, let me state that this ONLY feature has not messed around
with existing inheritance semantics. It allows attaching a constraint to
any table (which can be part of any hierarchy) without the possibility
of it ever playing any part in future or existing inheritance
hierarchies. It is specific to that table, period.

> It's not just that. Suppose that C inherits from B which inherits from
> A. We add an "only" constraint to B and a non-only constraint to A.
> Now, what happens in each of the following scenarios?

An example against latest HEAD should help here:

create table A(ff1 int);
create table B () inherits (A);
create table C () inherits (B);

alter table A add constraint Achk check (ff1 > 10);

The above will attach Achk to A, B and C.

alter table only B add constraint Bchk check (ff1 > 0);

The above will attach Bchk ONLY to table B.

> 1. We drop the constraint from B without specifying ONLY.

postgres=# alter table B drop constraint Achk;
ERROR:  cannot drop inherited constraint "achk" of relation "b"

The above is existing inheritance based behaviour. Now let's look at the
ONLY constraint:

postgres=# alter table B drop constraint Bchk;
ALTER TABLE

Since this constraint is not part of any hierarchy, it can be removed.

postgres=# alter table only B add constraint bchk check (ff1 > 0);
ALTER TABLE
postgres=# alter table only B drop constraint Bchk;
ALTER TABLE

So "only" constraints do not need the "only B" qualification to be
deleted. They work both ways and can always be deleted without any
issues.

> 2. We drop the constraint from B *with* ONLY.

postgres=# alter table only B drop constraint Achk;
ERROR:  cannot drop inherited constraint "achk" of relation "b"

The above is existing inheritance based behavior. So regardless of ONLY,
an inherited constraint cannot be removed from the middle of the
hierarchy.

> 3. We drop the constraint from A without specifying ONLY.
postgres=# alter table A drop constraint Achk;
ALTER TABLE

This removes the constraint from the entire hierarchy across A, B and C.
Again, existing inheritance behavior.

> 4. We drop the constraint from A *with* ONLY.

postgres=# alter table only A drop constraint Achk;
ALTER TABLE

This converts the Achk constraint belonging to B into a local one. C
still has it as an inherited constraint from B. We can now delete those
constraints as per existing inheritance semantics. However, I hope the
difference between these and ONLY constraints is clear: the Achk
constraint associated with B can get inherited in the future, whereas
"only" constraints will not be.

> Off the top of my head, I suspect that #1 should be an error;

It's an error for inherited constraints, but not for "only" constraints.

> #2 should succeed, leaving only the inherited version of the constraint
> on B;

Yeah, "only" constraints removal succeeds, whereas inherited constraints
cannot be removed.

> #3 should remove the constraint from A and leave it on B but I'm not
> sure what should happen to C,

This removes the entire hierarchy.

> and I have no clear vision of what #4 should do.

This removes the constraint from A, but maintains the inheritance
relationship between B and C. Again, standard existing inheritance
semantics.

> As a followup question, if we do #2 followed by #4, or #4 followed by
> #2, do we end up with the same final state in both cases?

Yeah. #2 is not able to do much really, because we do not allow
inherited constraints to be removed from the middle of the hierarchy.

> Here's another scenario. B inherits from A. We add a constraint to A
> using ONLY, and then drop it without ONLY. Does that work or fail?

The constraint gets added to A, and since it is an "only" constraint,
its removal both with and without "only A" works just fine.

> Also, what happens when we add matching constraints to B and A, in each
> case using ONLY, and then remove the constraint from A without using
> ONLY? Does anything happen to B's constraint? Why or why not?
Again, the key differentiation here is that "only" constraints are bound
to that table and won't be inherited ever. So this works just fine:

postgres=# alter table only A add constraint A2chk check (ff1 > 10);
ALTER TABLE
postgres=# alter table only B add constraint A2chk check (ff1 > 10);
ALTER TABLE

> Just to be clear, I like the feature. But I've done some work on this
> code before, and it is amazingly easy to screw it up and end up with
> bugs... so I think lots of careful thought is in order.

Agreed. I just tried out the scenarios laid out by you, both with and
without the committed patch, and AFAICS normal inheritance semantics
have been preserved properly even after the commit.

Regards,
Nikhils
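[Editor's illustration] The drop semantics walked through in the scenarios above can be condensed into a toy model. This is illustrative Python, not PostgreSQL code: the `islocal`/`inhcount`/`isonly` flags only loosely mirror pg_constraint's conislocal/coninhcount/conisonly columns, and the recursion is a simplification of what ATExecDropConstraint() actually does.

```python
# Toy model of inherited vs. ONLY check constraints (illustrative only).
class Table:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.constraints = {}  # name -> {"islocal", "inhcount", "isonly"}
        if parent is not None:
            parent.children.append(self)

def add_constraint(table, name, only=False):
    """ALTER TABLE [ONLY] table ADD CONSTRAINT name CHECK (...)."""
    table.constraints[name] = {"islocal": True, "inhcount": 0, "isonly": only}
    if not only:                       # ONLY constraints never propagate
        for child in table.children:
            _inherit(child, name)

def _inherit(table, name):
    con = table.constraints.setdefault(
        name, {"islocal": False, "inhcount": 0, "isonly": False})
    con["inhcount"] += 1
    for child in table.children:
        _inherit(child, name)

def drop_constraint(table, name):
    """Drop without ONLY, recursing to children (scenarios #1 and #3)."""
    con = table.constraints[name]
    if con["inhcount"] > 0:            # inherited: cannot drop mid-hierarchy
        raise ValueError("cannot drop inherited constraint %s of relation %s"
                         % (name, table.name))
    del table.constraints[name]
    if not con["isonly"]:
        for child in table.children:
            _disinherit(child, name)

def _disinherit(table, name):
    con = table.constraints.get(name)
    if con is None:
        return
    con["inhcount"] -= 1
    if con["inhcount"] == 0 and not con["islocal"]:
        del table.constraints[name]
        for child in table.children:
            _disinherit(child, name)
```

Under this model, adding Achk to A marks it inherited (inhcount > 0) on B and C, so dropping it from B raises the "cannot drop inherited constraint" error of scenario #1, while a Bchk added with only=True stays local and droppable.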
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alvaro,

The patch looks ok to me. I see that we now sort the constraints by
conisonly value too:

@@ -1781,12 +1781,20 @@ describeOneTableDetails(const char *schemaname,
 	/* print table (and column) check constraints */
 	if (tableinfo.checks)
 	{
+		char	   *is_only;
+
+		if (pset.sversion >= 90100)
+			is_only = "r.conisonly";
+		else
+			is_only = "false AS conisonly";
+
 		printfPQExpBuffer(&buf,
-						  "SELECT r.conname, "
+						  "SELECT r.conname, %s, "
						  "pg_catalog.pg_get_constraintdef(r.oid, true)\n"
						  "FROM pg_catalog.pg_constraint r\n"
-						  "WHERE r.conrelid = '%s' AND r.contype = 'c'\nORDER BY 1;",
-						  oid);
+						  "WHERE r.conrelid = '%s' AND r.contype = 'c'\n"
+						  "ORDER BY 2, 1;",
+						  is_only, oid);
 		result = PSQLexec(buf.data, false);
 		if (!result)
 			goto error_return;

My preference would be to see true values first, so we might want to
modify it to "ORDER BY 2 DESC, 1" to have that effect. Your call finally.

Regards,
Nikhils

On Sat, Dec 17, 2011 at 12:36 AM, Alvaro Herrera
<alvhe...@commandprompt.com> wrote:
> Excerpts from Greg Smith's message of vie dic 16 15:02:20 -0300 2011:
>> On 12/04/2011 02:22 AM, Nikhil Sontakke wrote:
>>>> Is it okay to modify an existing constraint to mark it as "only",
>>>> even if it was originally inheritable? This is not clear to me.
>>>> Maybe the safest course of action is to raise an error. Or maybe
>>>> I'm misreading what it does (because I haven't compiled it yet).
>>>
>>> Hmmm, good question. IIRC, the patch will pass is_only as true only
>>> if it is going to be a locally defined, non-inheritable constraint.
>>> So I went by the logic that since it was ok to merge and mark a
>>> constraint as locally defined, it should be ok to mark it
>>> non-inheritable from this moment on with that new local definition?
>
> I think I misread what that was trying to do. I thought it would turn
> on the "is only" bit on a constraint that a child had inherited from a
> previous parent, but that was obviously wrong now that I think about
> it again.
>
>> With this open question, this looks like it's back in Alvaro's hands
>> again to me.
This one started the CF as Ready for Committer and seems stalled there
for now. I'm not going to touch its status, just pointing this fact out.

Yeah.

Nikhil, Alex, this is the merged patch. Have a look that it still works
for you (particularly the pg_dump bits) and I'll commit it. I adjusted
the regression test a bit too. Thanks.

--
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Review: Non-inheritable check constraints
> I had a look at this patch today. The pg_dump bits conflict with
> another patch I committed a few days ago, so I'm about to merge them.
> I have one question which is about this hunk:

Thanks for taking a look Alvaro.

@@ -2312,6 +2317,11 @@ MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
 				con->conislocal = true;
 			else
 				con->coninhcount++;
+			if (is_only)
+			{
+				Assert(is_local);
+				con->conisonly = true;
+			}
 			simple_heap_update(conDesc, &tup->t_self, tup);
 			CatalogUpdateIndexes(conDesc, tup);
 			break;

> Is it okay to modify an existing constraint to mark it as "only", even
> if it was originally inheritable? This is not clear to me. Maybe the
> safest course of action is to raise an error. Or maybe I'm misreading
> what it does (because I haven't compiled it yet).

Hmmm, good question. IIRC, the patch will pass is_only as true only if
it is going to be a locally defined, non-inheritable constraint. So I
went by the logic that since it was ok to merge and mark a constraint as
locally defined, it should be ok to mark it non-inheritable from this
moment on with that new local definition?

Regards,
Nikhils
Re: [HACKERS] So where are we on the open commitfest?
* Non-inheritable check constraints So, this patch got shifted to the next commitfest... Regards, Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
If all you need to do is lock a schema, you can just call LockDatabaseObject(NamespaceRelationId, namespace_oid, 0, AccessShareLock); there's no need to fake up an objectaddress just to take a lock. But I think that's not really all you need to do, because somebody could drop the namespace between the time that you decide what OID to lock and the time you acquire the lock. So I think you need something like what we did in RangeVarGetRelid(). See attached patch. Thanks Robert. But currently there are very few callers of RangeVarGetAndCheckCreationNamespace() function. For the sake of completeness we will have to introduce a call to this function while creating all other objects too. Regards, Nikhils -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
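[Editor's illustration] Robert's point, that the OID you looked up can already be stale by the time the lock is held, can be sketched as a lookup/lock/re-check loop of the same shape RangeVarGetRelid() uses. This is a hypothetical Python model; the catalog dict, the OID value, and the lock set are stand-ins, not PostgreSQL APIs.

```python
catalog = {"test_schema": 16384}      # schema name -> OID (made-up value)
locks = set()                         # OIDs we currently hold a lock on

def lock_object(oid):                 # stand-in for LockDatabaseObject(...)
    locks.add(oid)

def unlock_object(oid):
    locks.discard(oid)

def lookup_and_lock_namespace(name):
    """Resolve a schema name to an OID and lock it, re-checking after the
    lock is acquired in case the schema was dropped (or dropped and
    recreated with a new OID) between the lookup and the lock."""
    while True:
        oid = catalog.get(name)
        if oid is None:
            raise LookupError('schema "%s" does not exist' % name)
        lock_object(oid)
        if catalog.get(name) == oid:  # still the same object: safe
            return oid
        unlock_object(oid)            # raced with a DROP: retry from scratch
```

The re-check after locking is what closes the window: once the lock is held and the name still maps to the same OID, a concurrent DROP SCHEMA must wait for us.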
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
So it's probably going to take a while to get this completely nailed down, but we can keep chipping away at it. Agreed. So are you planning to commit this change? Or we want some more objects to be fixed? Last I looked at this, we will need locking to be done while creating tables, views, types, sequences, functions, indexes, extensions, constraints, operators stuff, ts stuff, rules, domains, etc. that can go into schemas. So might even make sense to write a schema specific function based on your patch template to cater in general to schema locking during object creation. Regards, Nikhils
Re: [HACKERS] Re: pg_dump: schema with OID XXXXX does not exist - was Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Wasn't 7.3 the release that introduced schemas in the first place? I think there's a very good chance that the older reports with similar symptoms are completely unrelated, anyhow. Tom Lane is reluctant and that should tell me something :) So unless the list feels that this should be fixed and also agrees on the general approach, I will not touch a single line of code. Obviously someone else is welcome to have a stab at this too. Regards, Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Hi,

Ok, understood. PFA, a patch against git head. We now take an
AccessShareLock on the schema in DefineRelation. Note that we do not
want to interlock with other concurrent creations in the schema; we only
want to interlock with deletion activity. So even performance wise this
should not be much of an overhead in case of concurrent DDL operations
on the same schema.

Adding this in DefineRelation handles creation of
tables/views/types/sequences. I do not think we need to do stuff in
ALTER commands, because the objects pre-exist and this issue appears to
be with new objects only.

Comments?

Regards,
Nikhils

git_head_lock_schema_ddl.patch
Description: Binary data
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
> Um ... why would we do this only for tables, and not for creations of
> other sorts of objects that belong to schemas?

Right, we need to do it for other objects like functions etc. too.

> Also, if we are going to believe that this is a serious problem, what
> of ALTER ... SET SCHEMA?

I admit, I hadn't thought of this.

> Also, the proposed solution is pretty silly on its face, because it has
> not removed the race condition, only made the window somewhat narrower.
> You would have to acquire the lock as part of the initial schema
> lookup, not lock the OID after the fact. And could we please not do
> something as silly as translate the OID back to a string and then look
> up that string a second time?

The comment mentions that part is a kluge, but that we get to re-use the
existing function because of it. The get_object_address function will
bail out anyway if the schema has vanished from under us, and it does
lock it up immediately after it's found to be valid.

> (To be clear, I don't particularly believe that this is a problem
> worthy of spending code space and cycles on. But if it's deemed to be
> a problem, I want to see a solution that's actually watertight.)

Got the message.

Regards,
Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Continuing in gdb, also completes the creation of c2 table without any errors. We are now left with a dangling entry in pg_class along with all the corresponding data files in our data directory. The problem becomes worse if c2 was created using a TABLESPACE. Now dropping of that tablespace does not work at all. Am sure we can come up with myriad such other issues. Hmm. Does this break pg_dump? I have reported a bug whereby dangling pg_class entries point to a namespace that has since been dropped in the past (and has been reported many times before that, even). Sure does! I just tried it and got: pg_dump: schema with OID 16384 does not exist The bug report is here, whereby I also aggregate other similar bug reports that have taken place over a very long period of time: http://archives.postgresql.org/pgsql-bugs/2011-02/msg00185.php I guess we DO need to pay attention to fix this properly now as there are some reports with 9.x too. And I have just provided a way to reproduce this reliably too. Regards, Nikhils
[HACKERS] pg_dump: schema with OID XXXXX does not exist - was Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Hi,

> But if it's deemed to be a problem, I want to see a solution that's
> actually watertight.

After Daniel's hunch about pg_dump barfing due to such leftover entries
proving out to be true, we have one credible explanation (there might be
other reasons too) for this long standing issue. I see some reports from
as early as 2004 and some as late as Feb, 2011!

http://archives.postgresql.org/pgsql-bugs/2011-02/msg00185.php

One way in 9.x could be to modify get_object_address to additionally
accept objoid as an argument and use that to lock the schema in
AccessShareLock mode from all places where schema based objects (tables,
views, types, sequences, functions, indexes, extensions, constraints,
operators stuff, ts stuff, rules, domains, etc. phew!) can be created.
Or, since this is schema specific, we can as well write a new function
to do this. We might also add logic to only lock user created schemas.

This can be added right after the namespace for the involved object has
been correctly identified. The lock can then get released later as part
of the transaction commit.

Regards,
Nikhils
[HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Hi,

Consider the following sequence of events:

s1 # CREATE SCHEMA test_schema;
s1 # CREATE TABLE test_schema.c1(x int);

Now open another session s2 and via gdb issue a breakpoint on
heap_create_with_catalog(), which is called by DefineRelation().

s2 # CREATE TABLE test_schema.c2(y int);

The above will break on the function. Now issue a drop schema in
session s1:

s1 # DROP SCHEMA test_schema CASCADE;
NOTICE:  drop cascades to table test_schema.c1
DROP SCHEMA

Continuing in gdb also completes the creation of the c2 table without
any errors. We are now left with a dangling entry in pg_class along with
all the corresponding data files in our data directory.

The problem becomes worse if c2 was created using a TABLESPACE: dropping
that tablespace does not work at all now. I'm sure we can come up with
myriad such other issues, and other CREATE commands in this namespace
will have similar issues when faced with a concurrent DROP SCHEMA.

We definitely need some interlocking to handle this. For lack of better
APIs, we could do a LockDatabaseObject() call in AccessShareLock mode on
the namespace and release the same on completion of the creation of the
object.

Thoughts?

Regards,
Nikhils
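[Editor's illustration] The gdb interleaving described above can be re-enacted deterministically with a toy catalog. This is an illustrative Python model (dict-based "catalogs", made-up OIDs), not PostgreSQL code.

```python
# Minimal re-enactment of the race: session 2 resolves the schema OID,
# session 1 then drops the schema, and session 2 finishes creating its
# table, leaving a pg_class row pointing at a namespace that is gone.

pg_namespace = {16384: "test_schema"}   # oid -> name (hypothetical OIDs)
pg_class = {16385: ("c1", 16384)}       # oid -> (relname, relnamespace)

def create_table_begin(schema_name):
    # Step 1: resolve the namespace. No lock is taken -- that's the bug.
    for oid, name in pg_namespace.items():
        if name == schema_name:
            return oid
    raise LookupError("schema does not exist")

def create_table_finish(rel_oid, relname, nsp_oid):
    # Step 2: insert into pg_class (heap_create_with_catalog analogue).
    pg_class[rel_oid] = (relname, nsp_oid)

def drop_schema_cascade(nsp_oid):
    for oid in [o for o, (_, ns) in pg_class.items() if ns == nsp_oid]:
        del pg_class[oid]
    del pg_namespace[nsp_oid]

# s2 resolves the namespace, then stalls (the gdb breakpoint)
nsp = create_table_begin("test_schema")
# s1 drops the schema with CASCADE while s2 is stalled
drop_schema_cascade(nsp)
# s2 resumes and completes without error
create_table_finish(16386, "c2", nsp)

# pg_class now holds an entry whose namespace is gone -- the state that
# later makes pg_dump complain about a missing schema OID.
dangling = [o for o, (_, ns) in pg_class.items() if ns not in pg_namespace]
```

Both sessions succeed, yet the catalogs end up inconsistent, which is exactly why the namespace lookup and the pg_class insertion need to be bridged by a lock.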
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
We definitely need some interlocking to handle this. For lack of better APIs, we could do a LockDatabaseObject() call in AccessShareLock mode on the namespace and release the same on completion of the creation of the object. Thoughts? In general, we've been reluctant to add locking on non-table objects for reasons of overhead. You can, for example, drop a type or function while a query is running that depends on it (which is not true for tables). But I think it is sensible to do it for DDL commands, which shouldn't be frequent enough for the overhead to matter much. Agreed. Especially if the race condition has non-trivial downsides as mentioned in the tablespace case. When I rewrote the comment code for 9.1, I added locking that works just this way, to prevent pg_description entries from being orphaned; see the end of get_object_address(). Yeah thanks, that does the object locking. For pre-9.1 versions, we will need a similar solution. I encountered the issue on 8.3.x.. Regards, Nikhils
Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers
Yeah thanks, that does the object locking. For pre-9.1 versions, we will need a similar solution. I encountered the issue on 8.3.x.. I don't think we should back-patch a fix of this type. There is a lot of cruftiness of this type scattered throughout the code base, and if we start back-patching all the fixes for it, we're going to end up destabilizing older branches for little real benefit. Ok, understood. Thanks and Regards, Nikhils
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alex,

>> Hmmm, your patch checks for a constraint being "only" via:
>>
>> !recurse && !recursing
>>
>> I hope that is good enough to conclusively conclude that the
>> constraint is 'only'. This check was not too readable in the existing
>> code for me anyways ;). If we check at the grammar level, we can be
>> sure. But I remember not being too comfortable about the right
>> position to ascertain this characteristic.
>
> Well, I traced through it; here was my thinking (maybe it should be a
> comment?):
>
> 1: AlterTable() calls ATController() with
>    recurse = interpretInhOption(stmt->relation->inhOpt)
> 2: ATController() calls ATPrepCmd() with recurse and recursing = false
> 3: ATPrepCmd() saves the recurse flag with the subtype
>    AT_AddConstraintRecurse, otherwise the subtype is AT_AddConstraint
> 4: ATExecCmd() calls ATExecAddConstraint() with recurse == true when
>    subtype == AT_AddConstraintRecurse, recurse = false otherwise
> 5: ATExecAddConstraint() calls ATAddCheckConstraint() with recurse and
>    recursing = false
> 6: now we are in ATAddCheckConstraint() where
>    recurse == interpretInhOption(rv->inhOpt) and recursing == false.
>    Recursing is only true when ATAddCheckConstraint() loops through
>    children and recursively calls ATAddCheckConstraint().
>
> So with it all spelled out, now I see the "constraint must be added to
> child tables too" check is dead code.

Thanks, the above step-wise explanation helps. But AFAICS, the default
inhOpt value can be governed by the SQL_inheritance GUC too. So in that
case, it's possible that recurse is false and child tables are present,
no?

In fact, as I now remember, the reason my patch was looping through was
to handle this very case. It was based on the assumption that some
constraints might be ONLY type and some can be inheritable, although
admittedly the current ALTER TABLE functionality does not allow this. So
maybe we can still keep this check around IMO.

Regards,
Nikhils
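[Editor's illustration] Nikhil's concern can be made concrete with a small sketch of how the recurse flag gets its value, loosely mirroring interpretInhOption(); the names and the parameterized GUC here are illustrative, not the real C signatures.

```python
INH_NO, INH_YES, INH_DEFAULT = 0, 1, 2   # ONLY -> INH_NO, "tab*" -> INH_YES

def interpret_inh_option(inh_opt, sql_inheritance=True):
    """A bare 'ALTER TABLE tab' carries INH_DEFAULT and falls back to the
    sql_inheritance GUC; only the explicit forms are unambiguous."""
    if inh_opt == INH_NO:
        return False
    if inh_opt == INH_YES:
        return True
    return sql_inheritance               # INH_DEFAULT

def looks_like_only(recurse, recursing):
    # The "!recurse && !recursing" test under discussion.
    return (not recurse) and (not recursing)
```

With sql_inheritance = off, a bare ALTER TABLE yields recurse = false, so looks_like_only() is true even though the user never wrote ONLY, which is exactly the ambiguity raised above.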
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alex,

I guess we both are in agreement with each other :)

After sleeping over it, I think that check is indeed dead code with this
new non-inheritable check constraints functionality in place. So unless
you have some other comments, we can mark this as 'Ready for Committer'.

Again, thanks for the thorough review and subsequent changes!

Regards,
Nikhils

On Fri, Oct 7, 2011 at 12:18 PM, Alex Hunsaker <bada...@gmail.com> wrote:
> On Fri, Oct 7, 2011 at 00:28, Nikhil Sontakke <nikkh...@gmail.com> wrote:
>> Hi Alex,
>>
>>> So with it all spelled out, now I see the "constraint must be added
>>> to child tables too" check is dead code.
>>
>> Thanks, the above step-wise explanation helps. But AFAICS, the default
>> inhOpt value can be governed by the SQL_inheritance GUC too. So in
>> that case, it's possible that recurse is false and child tables are
>> present, no?
>
> Well... Do we really want to differentiate between those two cases? I
> would argue no. Given that:
>
> set sql_inheritance to off;
> alter table xxx alter column;
>
> behaves the same as
>
> set sql_inheritance to on;
> alter table only xxx alter column;
>
> Why should we treat constraints differently? Or put another way, if
> "set sql_inheritance off" makes alter table behave with an implicit
> "only", shouldn't add/drop constraint respect that?
>
>> In fact, as I now remember, the reason my patch was looping through
>> was to handle this very case. It was based on the assumption that
>> some constraints might be ONLY type and some can be inheritable.
>> Although admittedly the current ALTER TABLE functionality does not
>> allow this.
>
> Hrm... All I see is a user who turned off sql_inheritance wondering why
> they have to specify ONLY on some alter table commands and not others.
> I think if we want to support ONLY constraint types in the way you are
> thinking about them, we need to put ONLY some place else (alter table
> xxx add only constraint ?). Personally I don't see a reason to have
> that kind of constraint. Mostly because I don't see how it's
> functionally different. Is it somehow?
Anyone else have any thoughts on this?
Re: [HACKERS] Review: Non-inheritable check constraints
Hi Alex,

> I didn't care for the changes to gram.y so I reworked it a bit so we
> now pass is_only to AddRelationNewConstraint() (like we do with
> is_local). Seemed simpler but maybe I missed something. Comments?

Hmmm, your patch checks for a constraint being "only" via:

!recurse && !recursing

I hope that is good enough to conclusively conclude that the constraint
is 'only'. This check was not too readable in the existing code for me
anyways ;). If we check at the grammar level, we can be sure. But I
remember not being too comfortable about the right position to ascertain
this characteristic.

> I also moved the is_only check in ATAddCheckConstraint() to before we
> grab and loop through any children. Seemed a bit wasteful to loop
> through the new constraints just to set a flag so that we could bail
> out while looping through the children.

Ditto comment for this function. I thought this function went to great
lengths to spit out a proper error in case of inconsistencies between
parent and child. But if your check makes it simpler, that's good!

> You also forgot to bump Natts_pg_constraint.

Ouch. Thanks for the catch.

> PFA the above changes as well as being rebased against master.

Thanks Alex, appreciate the review!

Regards,
Nikhils
[HACKERS] Issue with listing same tablenames from different schemas in the search_path
Hi,

Consider the following sequence of commands in a psql session:

postgres=# create table public.sample(x int);
postgres=# create schema new;
postgres=# create table new.sample(x int);
postgres=# set search_path=public,new;
postgres=# \dt
 Schema |  Name  | Type  |  Owner
--------+--------+-------+----------
 public | sample | table | postgres
(1 row)

We should have seen two entries in the above listing, so this looks like
a bug to me. The issue is with the call to pg_table_is_visible(). While
scanning for the second entry, it breaks out because there is a matching
entry with the same name in the first schema. What we need is a
variation of this function which checks for visibility of the
corresponding namespace in the search path and emits it out too if so.

Thoughts? I can cook up a patch for this.

Regards,
Nikhils
Re: [HACKERS] Issue with listing same tablenames from different schemas in the search_path
>> postgres=# create table public.sample(x int);
>> postgres=# create schema new;
>> postgres=# create table new.sample(x int);
>> postgres=# set search_path=public,new;
>> postgres=# \dt
>>  Schema |  Name  | Type  |  Owner
>> --------+--------+-------+----------
>>  public | sample | table | postgres
>> (1 row)
>>
>> We should have seen two entries in the above listing. So looks like a
>> bug to me.
>
> No, that's the way it's designed to work. It shows the objects that are
> visible to you, without schema-qualifying them. See
> http://www.postgresql.org/docs/9.0/interactive/app-psql.html#APP-PSQL-PATTERNS:

Hmmm, ok. Makes sense after reading the documentation, but seems a bit
surprising/confusing at first glance. Never mind.

Regards,
Nikhils

> Whenever the pattern parameter is omitted completely, the \d commands
> display all objects that are visible in the current schema search path
> — this is equivalent to using * as the pattern. (An object is said to
> be visible if its containing schema is in the search path and no object
> of the same kind and name appears earlier in the search path. This is
> equivalent to the statement that the object can be referenced by name
> without explicit schema qualification.) To see all objects in the
> database regardless of visibility, use *.* as the pattern.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
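[Editor's illustration] The rule quoted from the docs ("no object of the same kind and name appears earlier in the search path") can be expressed in a few lines. This is an illustrative Python sketch, not psql's actual implementation.

```python
def visible_tables(search_path, tables):
    """Return the (schema, name) pairs visible under a search path.

    tables is a set of (schema, name) pairs; a name is visible only in
    the first schema on the path that contains it, shadowing any later
    schema's table of the same name.
    """
    seen = set()
    result = []
    for schema in search_path:
        for s, name in sorted(tables):
            if s == schema and name not in seen:
                seen.add(name)
                result.append((s, name))
    return result
```

With search_path = public, new and a sample table in each schema, only public.sample is returned, matching the single-row \dt output above; flipping the path order would surface new.sample instead.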
Re: [HACKERS] cataloguing NOT NULL constraints
> So after writing the code to handle named NOT NULL constraints for
> tables, I'm thinking that dumpConstraints needs to be fixed thusly:

@@ -12888,6 +12968,27 @@ dumpConstraint(Archive *fout, ConstraintInfo *coninfo)
 							   NULL, NULL);
 		}
 	}
+	else if (coninfo->contype == 'n' && tbinfo)
+	{
+		/* NOT NULL constraint on a table */
+		if (coninfo->separate)
+		{
+			write_msg(NULL, "NOT NULL constraints cannot be dumped separately from their owning table\n");
+			exit_nicely();
+		}
+	}
+	else if (coninfo->contype == 'n' && tbinfo == NULL)
+	{
+		/* NOT NULL constraint on a domain */
+		TypeInfo   *tyinfo = coninfo->condomain;
+
+		/* Ignore if not to be dumped separately */
+		if (coninfo->separate)
+		{
+			write_msg(NULL, "NOT NULL constraints cannot be dumped separately from their owning domain\n");
+			exit_nicely();
+		}
+	}
 	else
 	{
 		write_msg(NULL, "unrecognized constraint type: %c\n", coninfo->contype);

Some nit-picking. AFAICS above, we seem to be only using 'tbinfo' to
identify the object type here, 'table' vis-a-vis 'domain'. We could
probably reduce the above two elses to a single one and use the check of
tbinfo being not null to decide which object type name to spit out.

Although, it's difficult to see how we could end up marking NOT NULL
constraints as 'separate' ever. So this code will be rarely exercised,
if ever IMO.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Hi,

>> Any preferences for the name? connoinh, conisonly, constatic or
>> confixed?
>
> I'd probably pick "conisonly" from those choices.

The use of \d inside psql will show ONLY constraints without any
embellishments, similar to normal constraints. E.g.:

ALTER TABLE ONLY a ADD CONSTRAINT achk CHECK (FALSE);
ALTER TABLE a ADD CONSTRAINT bchk CHECK (b > 0);

psql=# \d a
      Table "public.a"
 Column |  Type   | Modifiers
--------+---------+-----------
 b      | integer |
Check constraints:
    "achk" CHECK (false)
    "bchk" CHECK (b > 0)

Is this acceptable? Or do we need to put in work into psql to show ONLY
somewhere in the description? If yes, "ONLY CHECK" sounds weird; maybe
we should use "LOCAL CHECK" or some such mention:

Check constraints:
    "achk" LOCAL CHECK (false)
    "bchk" CHECK (b > 0)

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
>> psql=# \d a
>>       Table "public.a"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  b      | integer |
>> Check constraints:
>>     "achk" CHECK (false)
>>     "bchk" CHECK (b > 0)
>>
>> Is this acceptable? Or we need to put in work into psql to show ONLY
>> somewhere in the description? If yes, ONLY CHECK sounds weird, maybe
>> we should use LOCAL CHECK or some such mention:
>>
>> Check constraints:
>>     "achk" LOCAL CHECK (false)
>>     "bchk" CHECK (b > 0)
>
> I think you need to stick with ONLY. Using two different words is just
> going to create confusion. You could fool around with where exactly
> you put it on the line, but switching to a different word seems like
> not a good idea.

Ok, maybe something like:

    "achk" (ONLY) CHECK (false)

> (Also, don't forget you need to hack pg_dump, too.)

Yeah, I have already hacked it a bit. This constraint now needs to be
spit out later as an ALTER command with ONLY attached to it
appropriately. Earlier, all CHECK constraints were generally emitted as
part of the table definition itself.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
>> Yeah, I have already hacked it a bit. This constraint now needs to be
>> spit out later as an ALTER command with ONLY attached to it
>> appropriately. Earlier, all CHECK constraints were generally emitted
>> as part of the table definition itself.
>
> Hrm. That doesn't seem so good. Maybe we've got the design wrong here.
> It doesn't seem like we want to lose the ability to define arbitrary
> constraints at table-creation time.

Well, the handling is different only for ONLY constraints; normal
constraints can still be attached at table-creation time.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Yeah, I have already hacked it a bit. This constraint now needs to be spit out later as an ALTER command with ONLY attached to it appropriately. Earlier all CHECK constraints were generally emitted as part of the table definition itself. IIRC, there's already support for splitting out a constraint that way, in order to deal with circular dependencies. You just need to treat this as an additional reason for splitting. Yeah, I have indeed followed the existing separate printing logic for ONLY constraints. Had to make the table dependent on this constraint to print the constraint *after* the table definition. Regards, Nikhils
Re: [HACKERS] Check constraints on partition parents only?
We could imagine doing something like CHECK ONLY (foo), but that seems quite non-orthogonal with (a) everything else in CREATE TABLE, and (b) ALTER TABLE ONLY. Yeah, I thought about CHECK ONLY support as part of table definition, but as you say - it appears to be too non-standard right now and we can always go back to this later if the need be felt. Regards, Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Hi all,

PFA, a patch which implements non-inheritable ONLY constraints. This has been achieved by introducing a new column, conisonly, in the pg_constraint catalog. Specifying ONLY in the ALTER TABLE ... ADD CONSTRAINT ... CHECK command sets this new column to true. Constraints which have this column set to true can never be inherited by present or future children.

The psql and pg_dump binaries have been modified to account for such persistent non-inheritable check constraints. This patch also has documentation changes along with relevant changes to the test cases. The regression runs pass fine with this patch applied.

Comments and further feedback, if any, appreciated.

Regards,
Nikhils

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5e5f8a7..683ad67 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1995,6 +1995,16 @@
      </row>

      <row>
+      <entry><structfield>conisonly</structfield></entry>
+      <entry><type>bool</type></entry>
+      <entry></entry>
+      <entry>
+       This constraint is defined locally for the relation.  It is a
+       non-inheritable constraint.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>conkey</structfield></entry>
       <entry><type>int2[]</type></entry>
       <entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 4c2a4cd..3ee3ec0 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -984,6 +984,14 @@ ALTER TABLE distributors ADD CONSTRAINT zipchk CHECK (char_length(zipcode) = 5);
  </para>

  <para>
+  To add a check constraint only to a table and not to its children:
+<programlisting>
+ALTER TABLE ONLY distributors ADD CONSTRAINT zipchk CHECK (char_length(zipcode) = 5);
+</programlisting>
+  (The check constraint will not be inherited by future children too.)
+ </para>
+
+ <para>
   To remove a check constraint from a table and all its children:
<programlisting>
ALTER TABLE distributors DROP CONSTRAINT zipchk;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4399493..1b382b8 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -98,10 +98,10 @@ static Oid AddNewRelationType(const char *typeName, Oid new_array_type);
 static void RelationRemoveInheritance(Oid relid);
 static void StoreRelCheck(Relation rel, char *ccname, Node *expr,
-			  bool is_validated, bool is_local, int inhcount);
+			  bool is_validated, bool is_local, int inhcount, bool is_only);
 static void StoreConstraints(Relation rel, List *cooked_constraints);
 static bool MergeWithExistingConstraint(Relation rel, char *ccname, Node *expr,
-							bool allow_merge, bool is_local);
+							bool allow_merge, bool is_local, bool is_only);
 static void SetRelationNumChecks(Relation rel, int numchecks);
 static Node *cookConstraint(ParseState *pstate, Node *raw_constraint,
@@ -1860,7 +1860,7 @@ StoreAttrDefault(Relation rel, AttrNumber attnum, Node *expr)
  */
 static void
 StoreRelCheck(Relation rel, char *ccname, Node *expr,
-			  bool is_validated, bool is_local, int inhcount)
+			  bool is_validated, bool is_local, int inhcount, bool is_only)
 {
 	char	   *ccbin;
 	char	   *ccsrc;
@@ -1943,7 +1943,8 @@ StoreRelCheck(Relation rel, char *ccname, Node *expr,
 						  ccbin,	/* Binary form of check constraint */
 						  ccsrc,	/* Source form of check constraint */
 						  is_local,	/* conislocal */
-						  inhcount);	/* coninhcount */
+						  inhcount,	/* coninhcount */
+						  is_only);	/* conisonly */

 	pfree(ccbin);
 	pfree(ccsrc);
@@ -1984,7 +1985,7 @@ StoreConstraints(Relation rel, List *cooked_constraints)
 				break;
 			case CONSTR_CHECK:
 				StoreRelCheck(rel, con->name, con->expr, !con->skip_validation,
-							  con->is_local, con->inhcount);
+							  con->is_local, con->inhcount, con->is_only);
 				numchecks++;
 				break;
 			default:
@@ -2100,6 +2101,7 @@ AddRelationNewConstraints(Relation rel,
 		cooked->skip_validation = false;
 		cooked->is_local = is_local;
 		cooked->inhcount =
Re: [HACKERS] Check constraints on partition parents only?
> Comments and further feedback, if any, appreciated.

Did you look at how this conflicts with my patch to add not null rows to pg_constraint?

https://commitfest.postgresql.org/action/patch_view?id=601

I was certainly not aware of this patch in the commitfest. Your patch has a larger footprint with more functional changes in it. IMHO, it will be easiest to queue this non-inheritable constraints patch behind your patch in the commitfest. There will be certain bitrot, which I can fix once your patch gets committed.

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
> Now that we have coninhcnt, conislocal etc... we can probably support ONLY. But I agree with Robert it's probably a bit more than an afternoon to crank out :-)

Heh, agreed :), I was just looking for some quick and early feedback.

So what we need is basically a way to indicate that a particular constraint is non-inheritable forever (meaning - even for future children), and that should work?

Right now, it seems that the ONLY usage in the SQL only translates to a recurse or no-recurse operation. For the parent, a constraint is marked with conislocal set to true (coninhcount is 0), and for children, coninhcount is used to indicate inheritance of that constraint, with conislocal being set to false.

What we need is to persist the information that a particular constraint is as specified - ONLY for this table. We could do that by adding a new column in pg_constraint like 'connoinh' or something, but I guess we would prefer not to get into the initdb business. Alternatively, we could bring about the same by using a combination of conislocal and coninhcount. For ONLY constraints, we will need to percolate this information down to the point where we define it in the code. We can then mark ONLY constraints to have conislocal set to TRUE and coninhcount set to a special value (-1).

So to summarize, what I am proposing is: introduce a new column connoinh (boolean) in pg_constraint, OR, in the existing infrastructure:

  Normal constraints:    conislocal = true,  coninhcount = 0       (inheritable like today)
  Inherited constraints: conislocal = false, coninhcount = n (> 0)
  ONLY constraints:      conislocal = true,  coninhcount = -1      (not inheritable)

With this arrangement, pg_dump will be able to easily identify and spit out ONLY specifications for specific constraints, and then they won't be blindly passed on to children tables under these new semantics.

Thoughts? Anything missing? Please let me know.

Regards,
Nikhils
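Under the proposed encoding (note: this coninhcount = -1 scheme was only a proposal; the thread later settles on a dedicated conisonly column instead), pg_dump could pick out ONLY constraints with a query along these lines - the table name is illustrative:

```sql
-- hypothetical: relies on the proposed coninhcount = -1 convention
SELECT conname, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'parent_table'::regclass
  AND conislocal
  AND coninhcount = -1;
```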
Re: [HACKERS] Check constraints on partition parents only?
> This approach certainly can't work, because a table can be both an inheritance parent and an inheritance child. It could have an ONLY constraint, and also inherit a copy of the same constraint from one or more parents. IOW, the fact that conislocal = true does not mean that coninhcount is irrelevant.

Oh I see.

> I think what you probably want to do is either (a) add a new column or (b) change conislocal to a char value and make it three-valued:
>
>   n = inherited constraint, no local definition
>   o = defined locally as an ONLY constraint
>   i = defined locally as a non-ONLY constraint
>
> I think I favor the latter approach as more space-efficient, but I hear Tom muttering about backward-compatibility...

Yeah, in your case too an initdb would be required, so might as well go down the route of a new column. Any preferences for the name?

  connoinh
  conisonly
  constatic or confixed

Others?

Regards,
Nikhils
Re: [HACKERS] Check constraints on partition parents only?
Hi,

> Yeah, but I think we need to take that chance. At the very least, we need to support the equivalent of a non-inherited CHECK (false) on parent tables.

> Indeed. I usually enforce that with a trigger that raises an exception, but of course that doesn't help at all with constraint exclusion, and I saw a result just a few weeks ago (I forget the exact details) where it appeared that the plan chosen was significantly worse because the parent table wasn't excluded, so there's a non-trivial downside from having this restriction.

The downside appears to be non-trivial indeed!

I cooked up the attached patch to try to allow ALTER ... ONLY ... CHECK (false) on parent tables. If this approach looks acceptable, I can provide a complete patch later with some documentation changes (I think we ought to mention this special case in the documentation) and a minor test case along with it (if the need is felt for a test case).

Although partitioning ought to be looked at from a different angle completely, maybe this small patch can help out a bit in the current scheme of things, even though this is indeed an unusual bit of special-casing...

Thoughts?

Regards,
Nikhils

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 82bb756..5340402 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5433,6 +5433,7 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	ListCell   *lcon;
 	List	   *children;
 	ListCell   *child;
+	bool		skip_children = false;

 	/* At top level, permission check was done in ATPrepCmd, else do it */
 	if (recursing)
@@ -5502,9 +5503,31 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 * tables; else the addition would put them out of step.
 	 */
 	if (children && !recurse)
-		ereport(ERROR,
+	{
+		/*
+		 * Try a bit harder and check if this is a CHECK(FALSE) kinda
+		 * constraint.  Allow if so, otherwise error out
+		 */
+		if (list_length(newcons) == 1)
+		{
+			CookedConstraint *cooked = linitial(newcons);
+
+			if (cooked->contype == CONSTR_CHECK && cooked->expr)
+			{
+				Node	   *expr = cooked->expr;
+
+				if (IsA(expr, Const) && ((Const *) expr)->consttype == BOOLOID &&
+					((Const *) expr)->constvalue == 0)
+				{
+					skip_children = true;
+				}
+			}
+		}
+
+		if (!skip_children)
+			ereport(ERROR,
 				(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
 				 errmsg("constraint must be added to child tables too")));
+	}

 	foreach(child, children)
 	{
@@ -5512,6 +5535,13 @@ ATAddCheckConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 		Relation	childrel;
 		AlteredTableInfo *childtab;

+		/*
+		 * Skipping the constraint should be good enough for the special case.
+		 * No need to even release the locks on the children immediately..
+		 */
+		if (skip_children)
+			break;
+
 		/* find_inheritance_children already got lock */
 		childrel = heap_open(childrelid, NoLock);
 		CheckTableNotInUse(childrel, "ALTER TABLE");
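To make the intent of the special case concrete, here is a sketch of the usage it is meant to unlock (the table and constraint names are illustrative, not from the patch):

```sql
-- parent acts as an empty routing table in a classic partitioning setup
CREATE TABLE measurements (ts timestamptz, val numeric);
CREATE TABLE measurements_2011
    (CHECK (ts >= '2011-01-01' AND ts < '2012-01-01'))
    INHERITS (measurements);

-- with the patch, this no longer fails with
-- "constraint must be added to child tables too";
-- constraint exclusion can then skip the provably empty parent
ALTER TABLE ONLY measurements ADD CONSTRAINT no_rows CHECK (false);
```

Note that the patch deliberately special-cases only a constant-false CHECK expression; any other constraint added via ALTER TABLE ONLY would still error out.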
Re: [HACKERS] Check constraints on partition parents only?
> 8.4 had this change:
>
>   * Force child tables to inherit CHECK constraints from parents (Alex Hunsaker, Nikhil Sontakke, Tom)
>
> You're not the only one who occasionally bangs his head against it.

Sorry for the occasional head bumps :)

> Yeah. I think it's good that there's a barrier to blindly dropping a constraint that may be important to have on children, but there should be a way to override that.

Hmmm, but then it does open up the possibility of naive users shooting themselves in the foot. It can be easy to conjure up a parent-only constraint that does not gel too well with its children. And that's precisely why this feature was added in the first place..

Regards,
Nikhils
Re: [HACKERS] VACUUM FULL deadlock with backend startup
> Patch applied, thanks!

Thanks Tom!

Regards,
Nikhils
Re: [HACKERS] VACUUM FULL deadlock with backend startup
Hi,

>> The fix is similar to the earlier commit by Tom. I tested this fix against 8.3.13. We lock the parent catalog now before calling index_open. Patch against git HEAD attached with this mail. I guess we will backpatch this? Tom's last commit was backpatched till 8.2 I think.
>
> Is it really worth slowing down the startup sequence for every connection to avoid deadlocking against a very rare maintenance operation?

Not really a performance issue AFAICS. If the relcache init file exists, then the phase2 of the catalog cache, which eventually calls the above code path, is avoided.

Regards,
Nikhils
[HACKERS] VACUUM FULL deadlock with backend startup
Hi,

We encountered a deadlock involving VACUUM FULL (surprise, surprise! :)) in PG 8.3.13 (and still not fixed in 9.0 AFAICS, although the window appears much smaller). The call spits out the following deadlock info:

ERROR:  SQLSTATE 40P01: deadlock detected
DETAIL:  Process 12479 waits for AccessExclusiveLock on relation 2663 of database 16384; blocked by process 14827.
Process 14827 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 12479.
LOCATION:  DeadLockReport, deadlock.c:918

It looked familiar, so I dug up the archives and found that Tom had committed a fix for a similar deadlock via git commit 715120e7. However, this current deadlock involved an index with oid 2663, which is ClassNameNspIndexId. Clearly this was another case of locking the index directly without taking a lock on the parent catalog. Further sleuthing revealed that the culprit function was InitCatCachePhase2, which directly calls index_open in the process startup phase.

Reproducing this was easy once you know the culprit (excruciatingly difficult if you do not know the exact race window). I added a sleep inside the InitCatCachePhase2 function before the call to index_open. Then I invoked a VACUUM FULL pg_class from another session, halting it in gdb just before it takes the exclusive lock via try_relation_open. When a new PG process sleeps inside InitCatCachePhase2, we then take the lock in the VF process, waiting just after it. When the startup continues after the sleep, it will take the ClassNameNspIndexId share lock, but hang trying to take a share lock on pg_class in RelationReloadIndexInfo. Simply continue the VF process in gdb, which will try to take the exclusive lock to vacuum the index. This will reproduce the deadlock in all its glory.

The fix is similar to the earlier commit by Tom. I tested this fix against 8.3.13. We lock the parent catalog now before calling index_open. Patch against git HEAD attached with this mail. I guess we will backpatch this? Tom's last commit was backpatched till 8.2, I think.

Regards,
Nikhils

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d0e364e..c9386aa 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -26,6 +26,7 @@
 #ifdef CATCACHE_STATS
 #include "storage/ipc.h"		/* for on_proc_exit */
 #endif
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/inval.h"
@@ -967,8 +968,16 @@ InitCatCachePhase2(CatCache *cache, bool touch_index)
 	{
 		Relation	idesc;

+		/*
+		 * We must lock the underlying catalog before locking the index to
+		 * avoid deadlock, since RelationReloadIndexInfo might well need to
+		 * read the catalog, and if anyone else is exclusive-locking this
+		 * catalog and index they'll be doing it in that order.
+		 */
+		LockRelationOid(cache->cc_reloid, AccessShareLock);
 		idesc = index_open(cache->cc_indexoid, AccessShareLock);
 		index_close(idesc, AccessShareLock);
+		UnlockRelationOid(cache->cc_reloid, AccessShareLock);
 	}
 }
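The lock ordering at the heart of the deadlock can be sketched as two concurrent sessions (the oids are the ones from the report: 1259 = pg_class, 2663 = ClassNameNspIndexId):

```sql
-- Session A:  VACUUM FULL pg_class;
--   1. takes AccessExclusiveLock on pg_class (1259)
--   2. then wants AccessExclusiveLock on its index (2663)

-- Session B:  new backend in InitCatCachePhase2, before the fix:
--   1. takes AccessShareLock on the index (2663) via index_open
--   2. then RelationReloadIndexInfo wants AccessShareLock on pg_class (1259)

-- A holds 1259 and waits on 2663; B holds 2663 and waits on 1259:
-- a classic ABBA deadlock.  The patch makes B lock 1259 before 2663,
-- so both sessions acquire the two locks in the same order.
```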
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi,

> To summarize, as I see it - the zeroed out block 523 should have been the second leftmost leaf and should have pointed to 522, thus re-establishing the index chain 524 -> 523 -> 522 -> 277 -> ...
>
> Was there a machine restart in the picture as well?

It seems there might have been a machine restart involved too. So I guess even WAL writing could have been impacted. But even if VF was ongoing at the time of restart, the WAL replay on restart should not do anything, since this will be a non-committed transaction?

Also, I was looking at ReadRecord and saw that it logs a message for failed CRC blocks, but the WAL replay just stops at that point since it returns NULL. Is there a way to find out if more blocks follow in the wake of this failed block (should be a matter of calling ReadRecord with NULL as the first argument, I think)? If so, maybe we can warn further that an error was encountered in the middle of WAL replay. However, the last block too could be a CRC-check-fail candidate...

BTW, is there a possibility of encountering trailing blocks with CRC failures regularly? For transactions that were ongoing at the time of shutdown and did not get a chance to commit or WAL-log properly?

Regards,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi,

> Of course, as you mentioned earlier, it's not impossible there's a bug in the recovery code.

Yeah, I was looking at the repair_frag function in 8.3.13 (yup, it's ugly!) and found out that the normal ExecInsertIndexTuples call is used to insert the index entries. That is standard index code used everywhere, so btree WAL bugs in this code path should be pretty rare, I would think..

> But if an OS crash is involved, another possibility is that something went wrong with the fsync - maybe there's a lying writeback cache between PG and the platter, for example.

Yup, plan to re-confirm this too. Thanks Robert!

Regards,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi Daniel,

> I have also, coincidentally, encountered corruption of a system catalog index -- 8.3.11 -- I have saved the file for forensics. Is it possible that I also receive a copy of this program?

Will it be possible for you to share the file/logs off-list with me? I can also try to do some analysis to compare if the situation is similar to the one we are faced with right now, and report it back here if so.

TIA,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
Hi,

>>> 1. Somebody inserts a bunch of new tuples into the relation, causing growth in the index.
>>
>> In case it's not obvious, VACUUM FULL would do precisely that.
>
> Oh, I didn't even think about that. Yeah, that could be it, too.

Thanks a lot Greg and Robert. This theory seems very plausible. VF must have carried out a rearrangement of heap tuples for compaction, and that might have caused new index entries, which might explain the extension of that many blocks.

> maybe VACUUM FULL -> crash before checkpoint -> problem with recovery?

Did I mention that an immediate shutdown was thrown into the mix just after the VF? That is almost close to a crash, and it means that the shared buffers were not written back to the index data file. So that should also account for all these pages still being zeroed out. So change the above to:

VACUUM FULL -> immediate shutdown -> problem with recovery?

But WAL replay should still have handled this. I would presume even an immediate shutdown ensures that WAL is flushed to disk properly? So that means that either there is a corner-case bug in VF which adds incorrect WAL logging in some specific btree layout scenarios, or there was indeed some bit flipping in the WAL, which caused the recovery to end prematurely during WAL replay. What are the scenarios that you would think can cause WAL bit flipping?

I was trying to repro this on the setup by repeatedly creating a table with large inserts, doing lotta deletes, running VF and then issuing an immediate shutdown. However, if I try to inspect the index data file at this point in the test case, it is inconsequential, as the file is largely out of sync since its dirty shared buffers have not been flushed. That leaves me with the option to restart and check the index data file again for problems. If we see problems after the restart, it should generally mean WAL logging errors (but we still cannot discount the bit-flipping case, I guess).

I guess a perusal of the WAL activity done by VF btree index activity is the need of the hour..

Regards,
Nikhils
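The repro loop described above might look roughly like this; the table definition and sizes are illustrative guesses, and the immediate shutdown is the pg_ctl step noted in the comments:

```sql
-- one iteration; afterwards run: pg_ctl stop -m immediate, then restart
-- and inspect the index data file for zeroed or inconsistent pages
DROP TABLE IF EXISTS t;
CREATE TABLE t (id int PRIMARY KEY, pad text);
INSERT INTO t SELECT g, repeat('x', 500) FROM generate_series(1, 100000) g;
DELETE FROM t WHERE id % 2 = 0;   -- lots of deletes to create dead space
VACUUM FULL t;                    -- issue the immediate shutdown right after
```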
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
>> VACUUM FULL -> immediate shutdown -> problem with recovery?
>
> An immediate shutdown == an intentional crash. OK, so you have the VACUUM FULL and the immediate shutdown just afterward. So we just need to figure out what happened during recovery.

Right.

>> But WAL replay should still have handled this. I would presume even an immediate shutdown ensures that WAL is flushed to disk properly?
>
> I'm not sure, but I doubt it. If the VACUUM FULL committed, then the WAL records should be on disk, but if the immediate shutdown happened while it was still running, then the WAL records might still be in wal_buffers, in which case I don't think they'll get written out, and thus zero pages in the index are to be expected. Now that doesn't explain any other corruption in the file, but I believe all-zeroes pages in a relation are an expected consequence of an unclean shutdown. But assuming the VF actually committed before the immediate shutdown, there must be something else going on, since by that point XLOG should have been flushed.

Oh yeah, so if VF committed, the xlog should have been ok too, but can't say the same about the shared buffers.

>> So that means that either there is a corner-case bug in VF which adds incorrect WAL logging in some specific btree layout scenarios, or there was indeed some bit flipping in the WAL, which caused the recovery to prematurely end during WAL replay. What are the scenarios that you would think can cause WAL bit flipping?
>
> Some kind of fluke hard drive malfunction, maybe? I know that the incidence of a hard drive being told to write A and actually writing B is very low, but it's probably not exactly zero. Do you have the logs from the recovery following the immediate shutdown? Anything interesting there?

Unfortunately we do not have the recovery logs. Would have loved to see some signs of issues in the WAL replay to confirm the theory about bit flipping..

> Or, as you say, there could be a corner-case VF bug.

Yeah, much harder to find by just eyeballing the code, I guess :)

Regards,
Nikhils
Re: [HACKERS] Fwd: index corruption in PG 8.3.13
>> Oh yeah, so if VF committed, the xlog should have been ok too, but can't say the same about the shared buffers.
>
> But there was a later block that *was* written out. What was the LSN on that block? Everything in the WAL log should have been fsynced up to that point when that buffer was flushed.

Which block are you referring to? After the holes from 279 to 518? We have:

  Deleted 519 (LSN: logid 29, recoff 0xd1fab5bc); its previous points to deleted 520.
  Deleted 520 (LSN: logid 29, recoff 0xd1fac918); its previous points to deleted 521.
  Deleted 521 (LSN: logid 29, recoff 0xd1fadc60); its previous points to 522.

Note that all the above deleted blocks have their next xid set to FrozenXid, meaning VF did this.

Live 522 (LSN: logid 29, recoff 0xd1fade3c) has its previous pointing to the zeroed-out block 523. Note that this seems to be the latest LSN in the data file.

The next of all these 4 blocks (519 to 522) points to live block 277 with LSN (logid 29, recoff 0xd1fadc60). The ROOT block also has this same LSN. So blocks 521, 277 and block 3 have this same LSN. The rest of the live blocks appear to have lower LSNs with lower logids.

The last block in the index data file, which also seems to be the current leftmost block, number 524, has an LSN of (logid 29, recoff 0xd1fa97b8) with its next sibling set to 523. One interesting observation is that there is another deleted block, 278 (again done by VF), with this same LSN and next pointing to 524. So this must have been the original leftmost block before 524 was promoted to that status.

To summarize, as I see it - the zeroed out block 523 should have been the second leftmost leaf and should have pointed to 522, thus re-establishing the index chain 524 -> 523 -> 522 -> 277 -> ...

Phew, I hope the above made some sense and was decipherable. Looking at my png alongside might help a bit more too. Also, maybe some bug is indeed hidden in the guts of VF.

> Was there a machine restart in the picture as well?

I don't think so.

Regards,
Nikhils