Re: [HACKERS] Speedup twophase transactions

2017-04-04 Thread Michael Paquier
On Tue, Mar 28, 2017 at 3:10 PM, Michael Paquier
 wrote:
> OK, done. I have just noticed that Simon has marked himself as a
> committer of this patch 24 hours ago.

For the archive's sake, this has been committed as 728bd991. Thanks Simon!
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2017-03-28 Thread Michael Paquier
On Tue, Mar 28, 2017 at 3:08 PM, Nikhil Sontakke
 wrote:
>>
>> I don't have anything else to say about this patch, so should we mark
>> it as ready for committer? There are still a couple of days left
>> until the end of the CF, and quite a lot has happened, so this could
>> get into PG10 in time.
>
>
> Yes, let's mark it "ready for committer". This patch/feature has been a long
> journey! :)

OK, done. I have just noticed that Simon has marked himself as a
committer of this patch 24 hours ago.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-28 Thread Nikhil Sontakke
>
>
> I don't have anything else to say about this patch, so should we mark
> it as ready for committer? There are still a couple of days left
> until the end of the CF, and quite a lot has happened, so this could
> get into PG10 in time.
>

Yes, let's mark it "ready for committer". This patch/feature has been a
long journey! :)

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speedup twophase transactions

2017-03-28 Thread Michael Paquier
On Tue, Mar 28, 2017 at 2:10 PM, Nikhil Sontakke
 wrote:
> The latest patch looks good. By now doing a single scan of shmem two phase
> data, we have removed the double loops in all the affected functions which
> is good.

Yup.

> My only question is if the added call to restoreTwoPhaseData() is good
> enough to handle all the 3 functions PrescanPreparedTransactions(),
> StandbyRecoverPreparedTransactions() and RecoverPreparedTransactions()
> appropriately? It looks as if it does, but we need to be doubly sure.

Yeah, I have spent a bit of time thinking about that. But as
restoreTwoPhaseData() does basically what those other three routines do,
just at an earlier stage, I cannot see a problem with it. Of course, I
may simply be lacking imagination.

> PFA, revised patch with a very minor typo fix and rebase against latest
> master. The test cases pass as needed.

Thanks!

> Oh, btw, while running TAP tests, I got a few errors in unrelated tests.
> [...]
> Again, not related to this recovery code path, but not sure if others see
> this as well.

Definitely not related to this patch, and I am unable to reproduce it.
Spurious errors merit attention, but even running those tests multiple
times daily I have not seen any such failure. That's mainly on OSX
10.11 though.

I don't have anything else to say about this patch, so should we mark
it as ready for committer? There are still a couple of days left
until the end of the CF, and quite a lot has happened, so this could
get into PG10 in time.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-27 Thread Nikhil Sontakke
Please ignore the reports about errors in other tests. They seem spurious.

Regards,
Nikhils

On 28 March 2017 at 10:40, Nikhil Sontakke  wrote:

> Hi Michael,
>
> The latest patch looks good. By now doing a single scan of shmem two phase
> data, we have removed the double loops in all the affected functions which
> is good.
>
> My only question is if the added call to restoreTwoPhaseData() is good
> enough to handle all the 3 functions PrescanPreparedTransactions(),
> StandbyRecoverPreparedTransactions() and RecoverPreparedTransactions()
> appropriately? It looks as if it does, but we need to be doubly sure.
>
> PFA, revised patch with a very minor typo fix and rebase against latest
> master. The test cases pass as needed.
>
> Oh, btw, while running TAP tests, I got a few errors in unrelated tests.
>
> "# testing connection parameter "target_session_attrs"
>
> not ok 5 - connect to node master if mode "read-write" and
> master,standby_1 listed
>
>
> #   Failed test 'connect to node master if mode "read-write" and
> master,standby_1 listed'
>
> #   at t/001_stream_rep.pl line 93.
>
> #  got: ''
>
> # expected: '1'
>
> not ok 6 - connect to node master if mode "read-write" and
> standby_1,master listed
>
>
> #   Failed test 'connect to node master if mode "read-write" and
> standby_1,master listed'
>
> #   at t/001_stream_rep.pl line 93.
>
> #  got: ''
>
> # expected: '1'
>
> not ok 7 - connect to node master if mode "any" and master,standby_1 listed
>
>
> #   Failed test 'connect to node master if mode "any" and master,standby_1
> listed'
>
> #   at t/001_stream_rep.pl line 93.
>
> #  got: ''
>
> # expected: '1'
>
> not ok 8 - connect to node standby_1 if mode "any" and standby_1,master
> listed"
>
> Again, not related to this recovery code path, but not sure if others see
> this as well.
>
> Regards,
> Nikhils
>
> On 27 March 2017 at 05:35, Michael Paquier 
> wrote:
>
>> On Sun, Mar 26, 2017 at 4:50 PM, Nikhil Sontakke
>>  wrote:
>> > I was away for a bit. I will take a look at this patch and get back to
>> you
>> > soon.
>>
>> No problem. Thanks for your time!
>> --
>> Michael
>>
>
>
>
> --
>  Nikhil Sontakke   http://www.2ndQuadrant.com/
>  PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
>



-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speedup twophase transactions

2017-03-27 Thread Nikhil Sontakke
Hi Michael,

The latest patch looks good. By now doing a single scan of shmem two phase
data, we have removed the double loops in all the affected functions which
is good.

My only question is if the added call to restoreTwoPhaseData() is good
enough to handle all the 3 functions PrescanPreparedTransactions(),
StandbyRecoverPreparedTransactions() and RecoverPreparedTransactions()
appropriately? It looks as if it does, but we need to be doubly sure.

PFA, revised patch with a very minor typo fix and rebase against latest
master. The test cases pass as needed.

Oh, btw, while running TAP tests, I got a few errors in unrelated tests.

"# testing connection parameter "target_session_attrs"

not ok 5 - connect to node master if mode "read-write" and master,standby_1
listed


#   Failed test 'connect to node master if mode "read-write" and
master,standby_1 listed'

#   at t/001_stream_rep.pl line 93.

#  got: ''

# expected: '1'

not ok 6 - connect to node master if mode "read-write" and standby_1,master
listed


#   Failed test 'connect to node master if mode "read-write" and
standby_1,master listed'

#   at t/001_stream_rep.pl line 93.

#  got: ''

# expected: '1'

not ok 7 - connect to node master if mode "any" and master,standby_1 listed


#   Failed test 'connect to node master if mode "any" and master,standby_1
listed'

#   at t/001_stream_rep.pl line 93.

#  got: ''

# expected: '1'

not ok 8 - connect to node standby_1 if mode "any" and standby_1,master
listed"

Again, not related to this recovery code path, but not sure if others see
this as well.

Regards,
Nikhils

On 27 March 2017 at 05:35, Michael Paquier 
wrote:

> On Sun, Mar 26, 2017 at 4:50 PM, Nikhil Sontakke
>  wrote:
> > I was away for a bit. I will take a look at this patch and get back to
> you
> > soon.
>
> No problem. Thanks for your time!
> --
> Michael
>



-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


twophase_recovery_shmem_280317.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2017-03-26 Thread Michael Paquier
On Sun, Mar 26, 2017 at 4:50 PM, Nikhil Sontakke
 wrote:
> I was away for a bit. I will take a look at this patch and get back to you
> soon.

No problem. Thanks for your time!
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-26 Thread Nikhil Sontakke
Thanks Michael,

I was away for a bit. I will take a look at this patch and get back to you
soon.

Regards,
Nikhils

On 22 March 2017 at 07:40, Michael Paquier 
wrote:

> On Fri, Mar 17, 2017 at 5:15 PM, Michael Paquier
>  wrote:
> > On Fri, Mar 17, 2017 at 5:00 PM, Nikhil Sontakke
> >  wrote:
> >> Michael, it looks like you are working on a final version of this
> >> patch? I will wait to review it from my end, then.
> >
> > I have to admit that I am beginning to get drawn into it...
>
> And here is what I got. I have found a couple of inconsistencies in
> the patch, roughly:
> - During recovery entries marked with ondisk = true should have their
> start and end LSN reset to InvalidXLogRecPtr. This was actually
> leading to some inconsistencies in MarkAsPreparing() for 2PC
> transactions staying around for more than 2 checkpoints.
> - RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
> and PrescanPreparedTransactions() doing both a scan of pg_twophase and
> the shared memory entries was way too complicated. I have changed
> things so that only memory entries are scanned by those routines, but an
> initial scan of pg_twophase is done before recovery.
> - Some inconsistencies in the comments and some typos found on the way.
> - Simplification of some routines used in redo, as well as simplified
> the set of routines made available to users.
>
> Tests are passing for me, an extra lookup would be nice.
> --
> Michael
>



-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speedup twophase transactions

2017-03-21 Thread Michael Paquier
On Fri, Mar 17, 2017 at 5:15 PM, Michael Paquier
 wrote:
> On Fri, Mar 17, 2017 at 5:00 PM, Nikhil Sontakke
>  wrote:
>> Michael, it looks like you are working on a final version of this patch? I
>> will wait to review it from my end, then.
>
> I have to admit that I am beginning to get drawn into it...

And here is what I got. I have found a couple of inconsistencies in
the patch, roughly:
- During recovery entries marked with ondisk = true should have their
start and end LSN reset to InvalidXLogRecPtr. This was actually
leading to some inconsistencies in MarkAsPreparing() for 2PC
transactions staying around for more than 2 checkpoints.
- RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
and PrescanPreparedTransactions() doing both a scan of pg_twophase and
the shared memory entries was way too complicated. I have changed
things so as only memory entries are scanned by those routines, but an
initial scan of pg_twophase is done before recovery.
- Some inconsistencies in the comments and some typos found on the way.
- Simplification of some routines used in redo, as well as simplified
the set of routines made available to users.

Tests are passing for me, an extra lookup would be nice.
-- 
Michael


twophase_recovery_shmem_michael.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2017-03-17 Thread Michael Paquier
On Fri, Mar 17, 2017 at 5:00 PM, Nikhil Sontakke
 wrote:
>> >
>> > Ok, we can do that and then yes, RecoverPreparedTransaction() can just
>> > have one loop going through the shmem entries. BUT, we cannot ignore
>> > "inredo"+"ondisk" entries. For such entries, we will have to read and
>> > recover from the corresponding 2PC files.
>>
>> Yes. About other things I found... In CheckPointTwoPhase(), I am
>> actually surprised that prepare_start_lsn and prepare_end_lsn are not
>> reset to InvalidXLogRecPtr when a shmem entry is flushed to disk after
>> ondisk is set to true, there is no need for them as the data does not
>> need to be fetched from WAL segments so we had better be consistent
>> (regression tests fail if I do that). And the two extra arguments in
>> MarkAsPreparing() are really unnecessary, they get set all the time to
>> InvalidXLogRecPtr.
>
>
> Michael, it looks like you are working on a final version of this patch? I
> will wait to review it from my end, then.

I have to admit that I am beginning to get drawn into it...
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-17 Thread Nikhil Sontakke
>> Ok, we can do that and then yes, RecoverPreparedTransaction() can just
>> have one loop going through the shmem entries. BUT, we cannot ignore
>> "inredo"+"ondisk" entries. For such entries, we will have to read and
>> recover from the corresponding 2PC files.
>
> Yes. About other things I found... In CheckPointTwoPhase(), I am
> actually surprised that prepare_start_lsn and prepare_end_lsn are not
> reset to InvalidXLogRecPtr when a shmem entry is flushed to disk after
> ondisk is set to true, there is no need for them as the data does not
> need to be fetched from WAL segments so we had better be consistent
> (regression tests fail if I do that). And the two extra arguments in
> MarkAsPreparing() are really unnecessary, they get set all the time to
> InvalidXLogRecPtr.
>

Michael, it looks like you are working on a final version of this patch? I
will wait to review it from my end, then.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speedup twophase transactions

2017-03-17 Thread Michael Paquier
On Fri, Mar 17, 2017 at 4:42 PM, Nikhil Sontakke
 wrote:
>> >
>> > I don't think this will work. We cannot replace pg_twophase with shmem
>> > entries + WAL pointers. This is because we cannot expect to have WAL
>> > entries around for long running prepared queries which survive across
>> > checkpoints.
>>
>> But at the beginning of recovery, we can mark such entries with ondisk
>> and inredo, in which case the WAL pointers stored in the shmem entries
>> do not matter because the data is already on disk.
>
> Ok, we can do that and then yes, RecoverPreparedTransaction() can just have
> one loop going through the shmem entries. BUT, we cannot ignore
> "inredo"+"ondisk" entries. For such entries, we will have to read and
> recover from the corresponding 2PC files.

Yes. About other things I found... In CheckPointTwoPhase(), I am
actually surprised that prepare_start_lsn and prepare_end_lsn are not
reset to InvalidXLogRecPtr when a shmem entry is flushed to disk after
ondisk is set to true, there is no need for them as the data does not
need to be fetched from WAL segments so we had better be consistent
(regression tests fail if I do that). And the two extra arguments in
MarkAsPreparing() are really unnecessary, they get set all the time to
InvalidXLogRecPtr.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-17 Thread Nikhil Sontakke
>> I don't think this will work. We cannot replace pg_twophase with shmem
>> entries + WAL pointers. This is because we cannot expect to have WAL
>> entries around for long running prepared queries which survive across
>> checkpoints.
>
> But at the beginning of recovery, we can mark such entries with ondisk
> and inredo, in which case the WAL pointers stored in the shmem entries
> do not matter because the data is already on disk.
>

Ok, we can do that and then yes, RecoverPreparedTransaction() can just have
one loop going through the shmem entries. BUT, we cannot ignore
"inredo"+"ondisk" entries. For such entries, we will have to read and
recover from the corresponding 2PC files.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speedup twophase transactions

2017-03-17 Thread Nikhil Sontakke
>
> Nikhil, do you mind if I try something like that? As we already know
> what is the first XID when beginning redo via
> ShmemVariableCache->nextXid it is possible to discard 2PC files that
> should not be here.


Yeah, that is ok.


> What makes me worry is the control of the maximum
> number of entries in shared memory. If there are legit 2PC files that
> are flushed on disk at checkpoint, you would finish with potentially
> more 2PC transactions than what should be possible (even if updates of
> max_prepared_xacts are WAL-logged).
>

The max_prepared_xacts number restricts the number of pending PREPARED
transactions *across* the 2PC files and shmem inredo entries. We can never
have more entries than this value.


Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speedup twophase transactions

2017-03-16 Thread Michael Paquier
On Thu, Mar 16, 2017 at 9:25 PM, Michael Paquier
 wrote:
> On Thu, Mar 16, 2017 at 7:18 PM, Nikhil Sontakke
>  wrote:
>>> + *  * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
>>> + *    and PrescanPreparedTransactions() have been modified to go through
>>> + *    gxact->inredo entries that have not made it to disk yet.
>>>
>>> It seems to me that there should be an initial scan of pg_twophase at
>>> the beginning of recovery, discarding on the way with a WARNING
>>> entries that are older than the checkpoint redo horizon. This should
>>> fill in shmem entries using something close to PrepareRedoAdd(), and
>>> mark those entries as inredo. Then, at the end of recovery,
>>> PrescanPreparedTransactions does not need to look at the entries in
>>> pg_twophase. And that's the case as well of
>>> RecoverPreparedTransaction(). I think that you could get the patch
>>> much simplified this way, as any 2PC data can be fetched directly from
>>> WAL segments and there is no need to rely on scans of pg_twophase,
>>> this is replaced by scans of entries in TwoPhaseState.
>>>
>>
>> I don't think this will work. We cannot replace pg_twophase with shmem
>> entries + WAL pointers. This is because we cannot expect to have WAL entries
>> around for long running prepared queries which survive across checkpoints.
>
> But at the beginning of recovery, we can mark such entries with ondisk
> and inredo, in which case the WAL pointers stored in the shmem entries
> do not matter because the data is already on disk.

Nikhil, do you mind if I try something like that? As we already know
what is the first XID when beginning redo via
ShmemVariableCache->nextXid it is possible to discard 2PC files that
should not be here. What makes me worry is the control of the maximum
number of entries in shared memory. If there are legit 2PC files that
are flushed on disk at checkpoint, you would finish with potentially
more 2PC transactions than what should be possible (even if updates of
max_prepared_xacts are WAL-logged).
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-16 Thread Michael Paquier
On Thu, Mar 16, 2017 at 7:18 PM, Nikhil Sontakke
 wrote:
>> + *  * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
>> + *    and PrescanPreparedTransactions() have been modified to go through
>> + *    gxact->inredo entries that have not made it to disk yet.
>>
>> It seems to me that there should be an initial scan of pg_twophase at
>> the beginning of recovery, discarding on the way with a WARNING
>> entries that are older than the checkpoint redo horizon. This should
>> fill in shmem entries using something close to PrepareRedoAdd(), and
>> mark those entries as inredo. Then, at the end of recovery,
>> PrescanPreparedTransactions does not need to look at the entries in
>> pg_twophase. And that's the case as well of
>> RecoverPreparedTransaction(). I think that you could get the patch
>> much simplified this way, as any 2PC data can be fetched directly from
>> WAL segments and there is no need to rely on scans of pg_twophase,
>> this is replaced by scans of entries in TwoPhaseState.
>>
>
> I don't think this will work. We cannot replace pg_twophase with shmem
> entries + WAL pointers. This is because we cannot expect to have WAL entries
> around for long running prepared queries which survive across checkpoints.

But at the beginning of recovery, we can mark such entries with ondisk
and inredo, in which case the WAL pointers stored in the shmem entries
do not matter because the data is already on disk.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-16 Thread Nikhil Sontakke
>
> + *  * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
> + *    and PrescanPreparedTransactions() have been modified to go through
> + *    gxact->inredo entries that have not made it to disk yet.
>
> It seems to me that there should be an initial scan of pg_twophase at
> the beginning of recovery, discarding on the way with a WARNING
> entries that are older than the checkpoint redo horizon. This should
> fill in shmem entries using something close to PrepareRedoAdd(), and
> mark those entries as inredo. Then, at the end of recovery,
> PrescanPreparedTransactions does not need to look at the entries in
> pg_twophase. And that's the case as well of
> RecoverPreparedTransaction(). I think that you could get the patch
> much simplified this way, as any 2PC data can be fetched directly from
> WAL segments and there is no need to rely on scans of pg_twophase,
> this is replaced by scans of entries in TwoPhaseState.
>
>
I don't think this will work. We cannot replace pg_twophase with shmem
entries + WAL pointers. This is because we cannot expect to have WAL
entries around for long running prepared queries which survive across
checkpoints.

Regards,
Nikhils


Re: [HACKERS] Speedup twophase transactions

2017-03-15 Thread Michael Paquier
On Wed, Mar 15, 2017 at 4:48 PM, Nikhil Sontakke
 wrote:
>> bool        valid;  /* TRUE if PGPROC entry is in proc array */
>> bool        ondisk; /* TRUE if prepare state file is on disk */
>> +   bool        inredo; /* TRUE if entry was added via xlog_redo */
>> We could have a set of flags here, that's the 3rd boolean of the
>> structure used for a status.
>
> This is more of a cleanup and does not need to be part of this patch. This
> can be a follow-on cleanup patch.

OK, that's fine for me. This patch is complicated enough anyway.

After some time thinking about it, I have finally put my finger on
what was itching me about this patch, and the answer is here:

+ *  Replay of twophase records happens by the following rules:
+ *
+ *  * On PREPARE redo we add the transaction to TwoPhaseState->prepXacts.
+ *    We set gxact->inredo to true for such entries.
+ *
+ *  * On Checkpoint we iterate through TwoPhaseState->prepXacts entries
+ *    that have gxact->inredo set and are behind the redo_horizon. We
+ *    save them to disk and also set gxact->ondisk to true.
+ *
+ *  * On COMMIT/ABORT we delete the entry from TwoPhaseState->prepXacts.
+ *    If gxact->ondisk is true, we delete the corresponding entry from
+ *    the disk as well.
+ *
+ *  * RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions()
+ *    and PrescanPreparedTransactions() have been modified to go through
+ *    gxact->inredo entries that have not made it to disk yet.

It seems to me that there should be an initial scan of pg_twophase at
the beginning of recovery, discarding on the way with a WARNING
entries that are older than the checkpoint redo horizon. This should
fill in shmem entries using something close to PrepareRedoAdd(), and
mark those entries as inredo. Then, at the end of recovery,
PrescanPreparedTransactions does not need to look at the entries in
pg_twophase. And that's the case as well of
RecoverPreparedTransaction(). I think that you could get the patch
much simplified this way, as any 2PC data can be fetched directly from
WAL segments and there is no need to rely on scans of pg_twophase,
this is replaced by scans of entries in TwoPhaseState.

> I also managed to do some perf testing.
>
> Modified Stas' earlier scripts slightly:
>
> \set naccounts 10 * :scale
> \set from_aid random(1, :naccounts)
> \set to_aid random(1, :naccounts)
> \set delta random(1, 100)
> \set scale :scale+1
> BEGIN;
> UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid =
> :from_aid;
> UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid =
> :to_aid;
> PREPARE TRANSACTION ':client_id.:scale';
> COMMIT PREPARED ':client_id.:scale';
>
> Created a base backup with scale factor 125 on an AWS t2.large instance. Set
> up archiving and did a 20 minute run with the above script saving the WALs
> in the archive.
>
> Then used recovery.conf to point to this WAL location and used the base
> backup to recover.
>
> With this patch applied: 20s
> Without patch: Stopped measuring after 5 minutes ;-)

And that's really nice.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-03-15 Thread Nikhil Sontakke
> Thanks for the new version. Let's head toward a final patch.

:-)

>> Have added a new function to do this now. It reads either from disk or
>> shared memory and produces error/log messages accordingly as well now.
>
> And that's ProcessTwoPhaseBufferAndReturn in the patch.
> ProcessTwoPhaseBuffer may be a better name.

Renamed to ProcessTwoPhaseBuffer()


>
>> Since we are using the TwoPhaseState structure to track redo entries, at
>> end of recovery, we will find existing entries. Please see my comments
>> above for RecoverPreparedTransactions()
>
> This is absolutely not good, because it is a direct result of the
> interactions of the first loop of RecoverPreparedTransaction() with
> its second loop, and because MarkAsPreparing() can finish by being
> called *twice* from the same transaction. I really think that this
> portion should be removed and that RecoverPreparedTransactions()
> should be more careful when scanning the entries in pg_twophase by
> looking up at what exists as well in shared memory, instead of doing
> that in MarkAsPreparing().
>
>
Ok. Modified MarkAsPreparing() to call a new MarkAsPreparingGuts()
function. This function takes in a "gxact' and works on it.

RecoverPreparedTransaction() now calls a newly added
RecoverFromTwoPhaseBuffer() function which checks if an entry already
exists via redo and calls the MarkAsPreparingGuts() function by passing in
that gxact. Otherwise the existing MarkAsPreparing() gets called.


> Here are some more comments:
>
> +   /*
> +    * Recreate its GXACT and dummy PGPROC
> +    */
> +   gxactnew = MarkAsPreparing(xid, gid,
> +   hdr->prepared_at,
> +   hdr->owner, hdr->database,
> +   gxact->prepare_start_lsn,
> +   gxact->prepare_end_lsn);
> +
> +   Assert(gxactnew == gxact);
> Here it would be better to set ondisk to false. This makes the code
> more consistent with the previous loop, and the intention clear.
>
>
Done.


> The first loop of RecoverPreparedTransactions() has a lot in common
> with its second loop. You may want to refactor a little bit more here.
>
>
Done. Added the new function RecoverFromTwoPhaseBuffer() as mentioned above.



> +/*
> + * PrepareRedoRemove
> + *
> + * Remove the corresponding gxact entry from TwoPhaseState. Also
> + * remove the 2PC file.
> + */
> This could be a bit more expanded. The removal of the 2PC does not
> happen after removing the in-memory data, it would happen if the
> in-memory data is not found.
>
>
Done


> +MarkAsPreparingInRedo(TransactionId xid, const char *gid,
> +   TimestampTz prepared_at, Oid owner, Oid databaseid,
> +   XLogRecPtr prepare_start_lsn, XLogRecPtr prepare_end_lsn)
> +{
> +   GlobalTransaction gxact;
> +
> +   LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> MarkAsPreparingInRedo is internal to twophase.c. There is no need to
> expose it externally and it is just used in PrepareRedoAdd so you
> could just group both.
>
>
Removed this  MarkAsPreparingInRedo() function and inlined the code in
PrepareRedoAdd().

> bool        valid;  /* TRUE if PGPROC entry is in proc array */
> bool        ondisk; /* TRUE if prepare state file is on disk */
> +   bool        inredo; /* TRUE if entry was added via xlog_redo */
> We could have a set of flags here, that's the 3rd boolean of the
> structure used for a status.
>

This is more of a cleanup and does not need to be part of this patch. This
can be a follow-on cleanup patch.

I also managed to do some perf testing.

Modified Stas' earlier scripts slightly:

\set naccounts 10 * :scale
\set from_aid random(1, :naccounts)
\set to_aid random(1, :naccounts)
\set delta random(1, 100)
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';

Created a base backup with scale factor 125 on an AWS t2.large instance.
Set up archiving and did a 20 minute run with the above script saving the
WALs in the archive.

Then used recovery.conf to point to this WAL location and used the base
backup to recover.

With this patch applied: 20s

Without patch: Stopped measuring after 5 minutes ;-)


Regards,

Nikhils

-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


twophase_recovery_shmem_150317.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2017-03-13 Thread Michael Paquier
On Sat, Mar 11, 2017 at 7:26 PM, Nikhil Sontakke
 wrote:
> Hi David and Michael,
>> It would be great to get this thread closed out after 14 months and many
>> commits.
>
> PFA, latest patch which addresses Michael's comments.

Thanks for the new version. Let's head toward a final patch.

>> +   ereport(WARNING,
>> +   (errmsg("removing future two-phase state data from
>> memory \"%u\"",
>> +   xid)));
>> +   PrepareRedoRemove(xid);
>> +   continue
>> Those are not normal (partially because unlink is atomic, but not
>> durable)... But they match the correct coding pattern regarding
>> incorrect 2PC entries... I'd really like to see those switched to a
>> FATAL with unlink() made durable for those calls.
>
> Hmm, not sure what exactly we need to do here. If you look at the prior
> checks, there we already skip on-disk entries. So, typically, the entries
> that we encounter here will be in shmem only.

As long as we don't have an alternative to offer a durable unlink,
let's do nothing then. This is as well consistent with the other code
paths handling corrupted or incorrect 2PC entries.

>> +   /* Deconstruct header */
>> +   hdr = (TwoPhaseFileHeader *) buf;
>> +   Assert(TransactionIdEquals(hdr->xid, xid));
>> +
>> +   if (TransactionIdPrecedes(xid, result))
>> +   result = xid;
>> This portion is repeated three times and could be easily refactored.
>> You could just have a routine that returns the oldest transaction ID
>> used, and ignore the result for StandbyRecoverPreparedTransactions()
>> by casting a (void) to let any kind of static analyzer understand that
>> we don't care about the result for example. Handling for subxids is
>> necessary as well depending on the code path. Splitting things into a
>> couple of sub-routines may be more readable as well. There are really a
>> couple of small parts that can be gathered and strengthened.
>
> Have added a new function to do this now. It reads either from disk or
> shared memory and produces error/log messages accordingly as well now.

And that's ProcessTwoPhaseBufferAndReturn in the patch.
ProcessTwoPhaseBuffer may be a better name.

>> +   /*
>> +* Recreate its GXACT and dummy PGPROC
>> +*/
>> +   gxactnew = MarkAsPreparing(xid, gid,
>> +   hdr->prepared_at,
>> +   hdr->owner, hdr->database,
>> +   gxact->prepare_start_lsn,
>> +   gxact->prepare_end_lsn);
>> MarkAsPreparing() does not need to be extended with two new arguments.
>> RecoverPreparedTransactions() is used only at the end of recovery,
>> where it is not necessary to look at the 2PC state files from the
>> records. In this code path inredo is also set to false :)
>
> That's not true. We will have entries with inredo set at the end of recovery
> as well. In fact, the MarkAsPreparing() call from
> RecoverPreparedTransactions() is the one which will remove these inredo
> entries and convert them into regular entries. We have optimized the
> recovery code path as well.
>
>> +   /* Delete TwoPhaseState gxact entry and/or 2PC file. */
>> +   PrepareRedoRemove(parsed.twophase_xid);
>> Both things should not be present, no? If the file is pushed to disk
>> it means that the checkpoint horizon has already moved.
>>
> PREPARE in redo, followed by a checkpoint, followed by a COMMIT/ROLLBACK. We
> can have both the bits set in this case.

Oh, I see where our thoughts don't overlap. I actually thought that
the shared memory entry and the on-disk file cannot co-exist (or if
you want a file flushed at checkpoint should have its shmem entry
removed). But you are right and I am wrong: both are needed to have the
error handling done properly if the maximum amount of 2PC transactions
is reached.

>> -   ereport(ERROR,
>> +   /* It's ok to find an entry in the redo/recovery case */
>> +   if (!gxact->inredo)
>> +   ereport(ERROR,
>> (errcode(ERRCODE_DUPLICATE_OBJECT),
>>  errmsg("transaction identifier \"%s\" is already in use",
>> gid)));
>> +   else
>> +   {
>> +   found = true;
>> +   break;
>> +   }
>> I would not have thought so.
>
> Since we are using the TwoPhaseState structure to track redo entries, at end
> of recovery, we will find existing entries. Please see my comments above for
> RecoverPreparedTransactions()

This is absolutely not good, because it is a direct result of the
interactions of the first loop of RecoverPreparedTransaction() with
its second loop, and because MarkAsPreparing() can end up being
called *twice* for the same transaction. I really think that this
portion should be removed and that RecoverPreparedTransactions()
should be more careful when scanning the entries 

Re: [HACKERS] Speedup twophase transactions

2017-03-11 Thread Nikhil Sontakke
Hi David and Michael,


> It would be great to get this thread closed out after 14 months and many
> commits.
>
>
PFA, latest patch which addresses Michael's comments.

> twophase.c: In function ‘PrepareRedoAdd’:
> twophase.c:2539:20: warning: variable ‘gxact’ set but not used
> [-Wunused-but-set-variable]
>   GlobalTransaction gxact;
> There is a warning at compilation.
>
>
Fixed.


> The comment at the top of PrescanPreparedTransactions() needs to be
> updated. Not only does this routine look for the contents of
> pg_twophase, but it does also look at the shared memory contents, at
> least those ones marked with inredo and not on_disk.
>
>
Changed comments at top of PrescanPreparedTransactions() ,
StandbyRecoverPreparedTransactions(), and RecoverPreparedTransactions().


> +   ereport(WARNING,
> +   (errmsg("removing future two-phase state data from memory \"%u\"",
> +   xid)));
> +   PrepareRedoRemove(xid);
> +   continue
> Those are not normal (partially because unlink is atomic, but not
> durable)... But they match the correct coding pattern regarding
> incorrect 2PC entries... I'd really like to see those switched to a
> FATAL with unlink() made durable for those calls.
>
>
Hmm, not sure what exactly we need to do here. If you look at the prior
checks, there we already skip on-disk entries. So, typically, the entries
that we encounter here will be in shmem only.


> +   /* Deconstruct header */
> +   hdr = (TwoPhaseFileHeader *) buf;
> +   Assert(TransactionIdEquals(hdr->xid, xid));
> +
> +   if (TransactionIdPrecedes(xid, result))
> +   result = xid;
> This portion is repeated three times and could be easily refactored.
> You could just have a routine that returns the oldest transaction ID
> used, and ignore the result for StandbyRecoverPreparedTransactions()
> by casting a (void) to let any kind of static analyzer understand that
> we don't care about the result for example. Handling for subxids is
> necessary as well depending on the code path. Splitting things into a
> couple of sub-routines may be more readable as well. There are really a
> couple of small parts that can be gathered and strengthened.
>
>
Have added a new function to do this now. It reads either from disk or
shared memory and produces error/log messages accordingly as well now.


> +   /*
> +* Recreate its GXACT and dummy PGPROC
> +*/
> +   gxactnew = MarkAsPreparing(xid, gid,
> +   hdr->prepared_at,
> +   hdr->owner, hdr->database,
> +   gxact->prepare_start_lsn,
> +   gxact->prepare_end_lsn);
> MarkAsPreparing() does not need to be extended with two new arguments.
> RecoverPreparedTransactions() is used only at the end of recovery,
> where it is not necessary to look at the 2PC state files from the
> records. In this code path inredo is also set to false :)
>
>
That's not true. We will have entries with inredo set at the end of
recovery as well. In fact, the MarkAsPreparing() call
from RecoverPreparedTransactions() is the one which will remove these
inredo entries and convert them into regular entries. We have optimized the
recovery code path as well.



> +   /* Delete TwoPhaseState gxact entry and/or 2PC file. */
> +   PrepareRedoRemove(parsed.twophase_xid);
> Both things should not be present, no? If the file is pushed to disk
> it means that the checkpoint horizon has already moved.
>
>
PREPARE in redo, followed by a checkpoint, followed by a COMMIT/ROLLBACK.
We can have both the bits set in this case.



> -   ereport(ERROR,
> +   /* It's ok to find an entry in the redo/recovery case */
> +   if (!gxact->inredo)
> +   ereport(ERROR,
> (errcode(ERRCODE_DUPLICATE_OBJECT),
>  errmsg("transaction identifier \"%s\" is already in use",
> gid)));
> +   else
> +   {
> +   found = true;
> +   break;
> +   }
> I would not have thought so.
>
Since we are using the TwoPhaseState structure to track redo entries, at
end of recovery, we will find existing entries. Please see my comments
above for RecoverPreparedTransactions()


> MarkAsPreparing and MarkAsPreparingInRedo really share the same code.
> What if the setup of the dummy PGPROC entry is made conditional?
>

I realized that MarkAsPreparingInRedo() does not need to do all the sanity
checking since it's going to be invoked during redo and everything that
comes in is kosher already. So its contents are much simplified in this
latest patch.

Tests pass with this latest patch.

Regards,
Nikhils


twophase_recovery_shmem_110317.patch
Description: Binary data


Re: [HACKERS] Speedup twophase transactions

2017-03-02 Thread David Steele
Nikhil,

On 2/27/17 12:19 AM, Nikhil Sontakke wrote:
> Hi Michael,
> 
> Thanks for taking a look at the patch.
> 
> twophase.c: In function ‘PrepareRedoAdd’:
> twophase.c:2539:20: warning: variable ‘gxact’ set but not used
> [-Wunused-but-set-variable]
>   GlobalTransaction gxact;
> There is a warning at compilation.
> 
> Will fix. 

<...>

Do you know when you will have a new patch ready?

It would be great to get this thread closed out after 14 months and many
commits.

Thanks,
-- 
-David
da...@pgmasters.net




Re: [HACKERS] Speedup twophase transactions

2017-02-26 Thread Nikhil Sontakke
Hi Michael,

Thanks for taking a look at the patch.



> twophase.c: In function ‘PrepareRedoAdd’:
> twophase.c:2539:20: warning: variable ‘gxact’ set but not used
> [-Wunused-but-set-variable]
>   GlobalTransaction gxact;
> There is a warning at compilation.
>
>
Will fix.


> The comment at the top of PrescanPreparedTransactions() needs to be
> updated. Not only does this routine look for the contents of
> pg_twophase, but it does also look at the shared memory contents, at
> least those ones marked with inredo and not on_disk.
>
>
Oh yes. Will change comments at top of
StandbyRecoverPreparedTransactions(), RecoverPreparedTransactions() as well.


> +   ereport(WARNING,
> +   (errmsg("removing future two-phase state data from memory \"%u\"",
> +   xid)));
> +   PrepareRedoRemove(xid);
> +   continue
> Those are not normal (partially because unlink is atomic, but not
> durable)... But they match the correct coding pattern regarding
> incorrect 2PC entries... I'd really like to see those switched to a
> FATAL with unlink() made durable for those calls.
>
>
Hmm, not sure what exactly we need to do here. If you look at the prior
checks, there we already skip on-disk entries. So, typically, the entries
that we encounter here will be in shmem only.


> +   /* Deconstruct header */
> +   hdr = (TwoPhaseFileHeader *) buf;
> +   Assert(TransactionIdEquals(hdr->xid, xid));
> +
> +   if (TransactionIdPrecedes(xid, result))
> +   result = xid;
> This portion is repeated three times and could be easily refactored.
> You could just have a routine that returns the oldest transaction ID
> used, and ignore the result for StandbyRecoverPreparedTransactions()
> by casting a (void) to let any kind of static analyzer understand that
> we don't care about the result for example. Handling for subxids is
> necessary as well depending on the code path. Splitting things into a
> couple of sub-routines may be more readable as well. There are really a
> couple of small parts that can be gathered and strengthened.
>
>
I will see if we can reduce this to a couple of function calls.


> +   /*
> +* Recreate its GXACT and dummy PGPROC
> +*/
> +   gxactnew = MarkAsPreparing(xid, gid,
> +   hdr->prepared_at,
> +   hdr->owner, hdr->database,
> +   gxact->prepare_start_lsn,
> +   gxact->prepare_end_lsn);
> MarkAsPreparing() does not need to be extended with two new arguments.
> RecoverPreparedTransactions() is used only at the end of recovery,
> where it is not necessary to look at the 2PC state files from the
> records. In this code path inredo is also set to false :)
>
>
That's not true. We will have entries with inredo set at the end of
recovery as well. In fact, the MarkAsPreparing() call
from RecoverPreparedTransactions() is the one which will remove these
inredo entries and convert them into regular entries. We have optimized the
recovery code path as well.


> +   {
> +   /*
> +* Entry could be on disk. Call with giveWarning=false
> +* since it can be expected during replay.
> +*/
> +   RemoveTwoPhaseFile(xid, false);
> +   }
> This would be removed at the end of recovery anyway as a stale entry,
> so that's not necessary.
>
>
Ok, will remove this.


> +   /* Delete TwoPhaseState gxact entry and/or 2PC file. */
> +   PrepareRedoRemove(parsed.twophase_xid);
> Both things should not be present, no? If the file is pushed to disk
> it means that the checkpoint horizon has already moved.
>
>
PREPARE in redo, followed by a checkpoint, followed by a COMMIT/ROLLBACK.
We can have both the bits set in this case.



> -   ereport(ERROR,
> +   /* It's ok to find an entry in the redo/recovery case */
> +   if (!gxact->inredo)
> +   ereport(ERROR,
> (errcode(ERRCODE_DUPLICATE_OBJECT),
>  errmsg("transaction identifier \"%s\" is already in use",
> gid)));
> +   else
> +   {
> +   found = true;
> +   break;
> +   }
> I would not have thought so.
>
Since we are using the TwoPhaseState structure to track redo entries, at
end of recovery, we will find existing entries. Please see my comments
above for RecoverPreparedTransactions()


> MarkAsPreparing and MarkAsPreparingInRedo really share the same code.
> What if the setup of the dummy PGPROC entry is made conditional?
>

I thought it was cleaner this ways. We can definitely add a bunch of
if-else in MarkAsPreparing() but it won't look pretty.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: [HACKERS] Speedup twophase transactions

2017-02-22 Thread Michael Paquier
On Thu, Feb 2, 2017 at 3:07 PM, Nikhil Sontakke  wrote:
>>> Yeah. Was thinking about this yesterday. How about adding entries in
>>> TwoPhaseState itself (which become valid later)? Only if it does not
>>> cause a lot of code churn.
>>
>> That's possible as well, yes.
>
> PFA a patch which does the above. It re-uses the TwoPhaseState gxact
> entries to track 2PC PREPARE/COMMIT in shared memory. The advantage
> being that CheckPointTwoPhase() becomes the only place where the fsync
> of 2PC files happens.
>
> A minor annoyance in the patch is the duplication of the code to add
> the 2nd while loop to go through these shared memory entries in
> PrescanPreparedTransactions, RecoverPreparedTransactions and
> StandbyRecoverPreparedTransactions.
>
> Other than this, I ran TAP tests and they succeed as needed.

Thanks for the new patch. Finally I am looking at it... The regression
tests already committed are all able to pass.

twophase.c: In function ‘PrepareRedoAdd’:
twophase.c:2539:20: warning: variable ‘gxact’ set but not used
[-Wunused-but-set-variable]
  GlobalTransaction gxact;
There is a warning at compilation.

The comment at the top of PrescanPreparedTransactions() needs to be
updated. Not only does this routine look for the contents of
pg_twophase, but it does also look at the shared memory contents, at
least those ones marked with inredo and not on_disk.

+   ereport(WARNING,
+   (errmsg("removing future two-phase state data from memory \"%u\"",
+   xid)));
+   PrepareRedoRemove(xid);
+   continue
Those are not normal (partially because unlink is atomic, but not
durable)... But they match the correct coding pattern regarding
incorrect 2PC entries... I'd really like to see those switched to a
FATAL with unlink() made durable for those calls.

+   /* Deconstruct header */
+   hdr = (TwoPhaseFileHeader *) buf;
+   Assert(TransactionIdEquals(hdr->xid, xid));
+
+   if (TransactionIdPrecedes(xid, result))
+   result = xid;
This portion is repeated three times and could be easily refactored.
You could just have a routine that returns the oldest transaction ID
used, and ignore the result for StandbyRecoverPreparedTransactions()
by casting a (void) to let any kind of static analyzer understand that
we don't care about the result for example. Handling for subxids is
necessary as well depending on the code path. Splitting things into a
couple of sub-routines may be more readable as well. There are really a
couple of small parts that can be gathered and strengthened.

+   /*
+* Recreate its GXACT and dummy PGPROC
+*/
+   gxactnew = MarkAsPreparing(xid, gid,
+   hdr->prepared_at,
+   hdr->owner, hdr->database,
+   gxact->prepare_start_lsn,
+   gxact->prepare_end_lsn);
MarkAsPreparing() does not need to be extended with two new arguments.
RecoverPreparedTransactions() is used only at the end of recovery,
where it is not necessary to look at the 2PC state files from the
records. In this code path inredo is also set to false :)

+   {
+   /*
+* Entry could be on disk. Call with giveWarning=false
+* since it can be expected during replay.
+*/
+   RemoveTwoPhaseFile(xid, false);
+   }
This would be removed at the end of recovery anyway as a stale entry,
so that's not necessary.

+   /* Delete TwoPhaseState gxact entry and/or 2PC file. */
+   PrepareRedoRemove(parsed.twophase_xid);
Both things should not be present, no? If the file is pushed to disk
it means that the checkpoint horizon has already moved.

-   ereport(ERROR,
+   /* It's ok to find an entry in the redo/recovery case */
+   if (!gxact->inredo)
+   ereport(ERROR,
(errcode(ERRCODE_DUPLICATE_OBJECT),
 errmsg("transaction identifier \"%s\" is already in use",
gid)));
+   else
+   {
+   found = true;
+   break;
+   }
I would not have thought so.

MarkAsPreparing and MarkAsPreparingInRedo really share the same code.
What if the setup of the dummy PGPROC entry is made conditional?
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-02-21 Thread Michael Paquier
On Wed, Feb 22, 2017 at 7:03 AM, Alvaro Herrera
 wrote:
> I have pushed the TAP test file, which is already passing, at least for
> me.  Let's see what the buildfarm says.

Thanks. I still have on my sheet to look at the latest 2PC patch.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-02-21 Thread Alvaro Herrera
I have pushed the TAP test file, which is already passing, at least for
me.  Let's see what the buildfarm says.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-02-03 Thread Robert Haas
On Wed, Feb 1, 2017 at 3:29 AM, Nikhil Sontakke  wrote:
>>> Shouldn’t the singular part of the message above be:
>>> "%u two-phase state file was written for a long-running prepared 
>>> transaction"
>>>
>>> But, then, English is not my native language, so I might be off here :-)
>>
>> If there's one file per long-running prepared transaction, which I
>> think is true, then I agree with you.
>
> PFA, small patch to fix this, then.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Speedup twophase transactions

2017-02-01 Thread Nikhil Sontakke
>> Yeah. Was thinking about this yesterday. How about adding entries in
>> TwoPhaseState itself (which become valid later)? Only if it does not
>> cause a lot of code churn.
>
> That's possible as well, yes.

PFA a patch which does the above. It re-uses the TwoPhaseState gxact
entries to track 2PC PREPARE/COMMIT in shared memory. The advantage
being that CheckPointTwoPhase() becomes the only place where the fsync
of 2PC files happens.

A minor annoyance in the patch is the duplication of the code to add
the 2nd while loop to go through these shared memory entries in
PrescanPreparedTransactions, RecoverPreparedTransactions and
StandbyRecoverPreparedTransactions.

Other than this, I ran TAP tests and they succeed as needed.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


twophase_recovery_shmem_020217.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2017-02-01 Thread Nikhil Sontakke
>> Shouldn’t the singular part of the message above be:
>> "%u two-phase state file was written for a long-running prepared transaction"
>>
>> But, then, English is not my native language, so I might be off here :-)
>
> If there's one file per long-running prepared transaction, which I
> think is true, then I agree with you.

PFA, small patch to fix this, then.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


twophase_typo.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2017-01-31 Thread Robert Haas
On Mon, Jan 23, 2017 at 7:00 AM, Nikhil Sontakke
 wrote:
> 4) Minor nit-pick on existing code.
>
> (errmsg_plural("%u two-phase state file was written "
>                "for long-running prepared transactions",
>                "%u two-phase state files were written "
>                "for long-running prepared transactions",
>                serialized_xacts,
>                serialized_xacts)
>
> Shouldn’t the singular part of the message above be:
> "%u two-phase state file was written for a long-running prepared transaction"
>
> But, then, English is not my native language, so I might be off here :-)

If there's one file per long-running prepared transaction, which I
think is true, then I agree with you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Speedup twophase transactions

2017-01-30 Thread Michael Paquier
On Tue, Jan 31, 2017 at 2:58 PM, Nikhil Sontakke
 wrote:
>>> CheckPointTwoPhase() in (5) does not sync this prepared transaction
>>> because the checkpointer's KnownPreparedList is empty.
>>
>> And that's why this needs to be stored in shared memory with a number
>> of elements made of max_prepared_xacts...
>
> Yeah. Was thinking about this yesterday. How about adding entries in
> TwoPhaseState itself (which become valid later)? Only if it does not
> cause a lot of code churn.

That's possible as well, yes.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-30 Thread Nikhil Sontakke
>> CheckPointTwoPhase() in (5) does not sync this prepared transaction
>> because the checkpointer's KnownPreparedList is empty.
>
> And that's why this needs to be stored in shared memory with a number
> of elements made of max_prepared_xacts...

Yeah. Was thinking about this yesterday. How about adding entries in
TwoPhaseState itself (which become valid later)? Only if it does not
cause a lot of code churn.

The current non-shmem list patch only needs to handle standby
shutdowns correctly. Other aspects like standby promotion/recovery are
handled ok AFAICS.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-30 Thread Michael Paquier
On Tue, Jan 31, 2017 at 2:34 PM, Nikhil Sontakke
 wrote:
>>> I wonder what's the best location for this in the common case when we
>>> do shutdown of standby.  We could add code in XLOG_CHECKPOINT_SHUTDOWN
>>> and XLOG_CHECKPOINT_ONLINE xlog_redo code path.
>>
>> ShutdownXLOG() calls CreateRestartPoint() when a standby shuts down,
>> so doing all the durability work in CheckPointTwoPhase() would take
>> care of any problems.
>>
>
> ShutdownXLOG() gets called from the checkpointer process. See comments
> above about the checkpointer not having access to the proper
> KnownPreparedList.
>
> The following test sequence will trigger the issue:
>
> 1) start master
> 2) start replica
> 3) prepare a transaction on master
> 4) shutdown master
> 5) shutdown replica
>
> CheckPointTwoPhase() in (5) does not sync this prepared transaction
> because the checkpointer's KnownPreparedList is empty.

And that's why this needs to be stored in shared memory with a number
of elements made of max_prepared_xacts...
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-30 Thread Nikhil Sontakke
>> Having CheckPointTwoPhase() do the flush would mean shifting the data
>> from KnownPreparedList into TwoPhaseState shmem.
>
> Er, no. For CheckPointTwoPhase(), at recovery what needs to be done is
> to have all the entries in KnownPreparedList() flushed to disk and
> have those entries removed while holding a shared memory lock.

The KnownPreparedList is constructed by the recovery process.
CheckPointTwoPhase() gets called by the checkpointer process. The
checkpointer does not have access to this valid KnownPreparedList.

>And for
> the rest we need to be careful to have PrescanPreparedTransactions,
> RecoverPreparedTransactions and StandbyRecoverPreparedTransactions
> aware of entries that are in KnownPreparedList().


Yeah, that part is straightforward. It does involve duplication of the
earlier while loops to work against KnownPreparedList. A smart single
while loop which loops over the 2PC files followed by the list would
help here :-)

> Let's leave the
> business of putting the information from KnownPreparedList to
> TwoPhaseState in RecoverPreparedTransactions, which needs to be aware
> of entries in KnownPreparedList() anyway. The only thing that differs
> is how the 2PC information is fetched: from the segments or from the
> files in pg_twophase.
>

Yeah.  This part is also ok. We also have to be careful to mark the
shmem gxact entry with "ondisk = false" and need to set
prepare_start_lsn/prepare_end_lsn properly as well.

>> I wonder what's the best location for this in the common case when we
>> do shutdown of standby.  We could add code in XLOG_CHECKPOINT_SHUTDOWN
>> and XLOG_CHECKPOINT_ONLINE xlog_redo code path.
>
> ShutdownXLOG() calls CreateRestartPoint() when a standby shuts down,
> so doing all the durability work in CheckPointTwoPhase() would take
> care of any problems.
>

ShutdownXLOG() gets called from the checkpointer process. See comments
above about the checkpointer not having access to the proper
KnownPreparedList.

The following test sequence will trigger the issue:

1) start master
2) start replica
3) prepare a transaction on master
4) shutdown master
5) shutdown replica

CheckPointTwoPhase() in (5) does not sync this prepared transaction
because the checkpointer's KnownPreparedList is empty.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-30 Thread Michael Paquier
On Tue, Jan 31, 2017 at 3:45 AM, Nikhil Sontakke
 wrote:
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
>>  (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
>>  checkPoint.ThisTimeLineID, ThisTimeLineID)));
>>
>> +KnownPreparedRecreateFiles(checkPoint.redo);
>>  RecoveryRestartPoint();
>>  }
>> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
>> files are not flushed to disk with this patch. This is a problem as a
>> new restart point is created... Having the flush in CheckpointTwoPhase
>> really makes the most sense.
>
> Having CheckPointTwoPhase() do the flush would mean shifting the data
> from KnownPreparedList into TwoPhaseState shmem.

Er, no. For CheckPointTwoPhase(), at recovery what needs to be done is
to have all the entries in KnownPreparedList() flushed to disk and
have those entries removed while holding a shared memory lock. And for
the rest we need to be careful to have PrescanPreparedTransactions,
RecoverPreparedTransactions and StandbyRecoverPreparedTransactions
aware of entries that are in KnownPreparedList(). Let's leave the
business of putting the information from KnownPreparedList to
TwoPhaseState in RecoverPreparedTransactions, which needs to be aware
of entries in KnownPreparedList() anyway. The only thing that differs
is how the 2PC information is fetched: from the segments or from the
files in pg_twophase.

> I wonder what's the best location for this in the common case when we
> do shutdown of standby.  We could add code in XLOG_CHECKPOINT_SHUTDOWN
> and XLOG_CHECKPOINT_ONLINE xlog_redo code path.

ShutdownXLOG() calls CreateRestartPoint() when a standby shuts down,
so doing all the durability work in CheckPointTwoPhase() would take
care of any problems.

(moving patch to CF 2017-03)
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-30 Thread Nikhil Sontakke
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
>  (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
>  checkPoint.ThisTimeLineID, ThisTimeLineID)));
>
> +KnownPreparedRecreateFiles(checkPoint.redo);
>  RecoveryRestartPoint();
>  }
> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
> files are not flushed to disk with this patch. This is a problem as a
> new restart point is created... Having the flush in CheckpointTwoPhase
> really makes the most sense.

Having CheckPointTwoPhase() do the flush would mean shifting the data
from KnownPreparedList into TwoPhaseState shmem.

I wonder what's the best location for this in the common case when we
do shutdown of standby.  We could add code in XLOG_CHECKPOINT_SHUTDOWN
and XLOG_CHECKPOINT_ONLINE xlog_redo code path.

Regards,
Nikhils




-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-27 Thread Nikhil Sontakke
>>> The xact_redo code will add prepared transactions to the
>>> KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
>>> file.
>>>
>>> At standby promote, the surviving (yet uncommitted) prepared
>>> transactions from KnownPreparedList need to be persisted, right?
>>

>> I don't see why, so please explain or show the error that can be
>> caused if we don't.
>
> I agree with Simon here. There is no point in fsyncing the 2PC files:
> in case of an immediate crash after promotion, replay will happen from
> the last checkpoint, aka the one before the promotion has been
> triggered. So there is no point in flushing them at promotion; they
> would be replayed anyway.

1) start master
2) start streaming replica
3) on master

begin;

create table test1(g int);

prepare transaction 'test';

4) Promote standby via trigger file or via "pg_ctl promote"

I thought if we don't fsync the KnownPreparedList then we might not
create the 2PC state properly on the standby.

However, I do realize that we will be calling
RecoverPreparedTransactions() eventually on promote. This function
will be modified to go through the KnownPreparedList to reload shmem
appropriately for 2PC. So, all good in that case.

Apologies for the digression.

Regards,
Nikhils



-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-27 Thread Michael Paquier
On Fri, Jan 27, 2017 at 8:23 PM, Simon Riggs  wrote:
> On 27 January 2017 at 11:01, Nikhil Sontakke  wrote:
>> The xact_redo code will add prepared transactions to the
>> KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
>> file.
>>
>> At standby promote, the surviving (yet uncommitted) prepared
>> transactions from KnownPreparedList need to be persisted, right?
>
> I don't see why, so please explain or show the error that can be
> caused if we don't.

I agree with Simon here. There is no point in fsyncing the 2PC files:
in case of an immediate crash after promotion, replay will happen from
the last checkpoint, aka the one before the promotion was triggered. So
there is no point in flushing them at promotion; they would be replayed
anyway.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-27 Thread Simon Riggs
On 27 January 2017 at 11:01, Nikhil Sontakke  wrote:
> On 27 January 2017 at 15:37, Simon Riggs  wrote:
>> On 27 January 2017 at 09:59, Nikhil Sontakke  wrote:
>>>>> But, I put the recovery process and the checkpointer process of the
>>>>> standby under gdb with breakpoints on these functions, but both did
>>>>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>>>>> a promote :-|

>>>> No end-of-recovery checkpoints happen at promotion since 9.3. You can
>>>> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
>>>> included) behavior.
>>>
>>> Ok, so that means, we also need to fsync out these 2PC XIDs at promote
>>> time as well for their durability.
>>
>> Why?
>
> The xact_redo code will add prepared transactions to the
> KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
> file.
>
> At standby promote, the surviving (yet uncommitted) prepared
> transactions from KnownPreparedList need to be persisted, right?

I don't see why, so please explain or show the error that can be
caused if we don't.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-27 Thread Nikhil Sontakke
On 27 January 2017 at 15:37, Simon Riggs  wrote:
> On 27 January 2017 at 09:59, Nikhil Sontakke  wrote:
>>>> But, I put the recovery process and the checkpointer process of the
>>>> standby under gdb with breakpoints on these functions, but both did
>>>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>>>> a promote :-|
>>>
>>> No end-of-recovery checkpoints happen at promotion since 9.3. You can
>>> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
>>> included) behavior.
>>
>> Ok, so that means, we also need to fsync out these 2PC XIDs at promote
>> time as well for their durability.
>
> Why?

The xact_redo code will add prepared transactions to the
KnownPreparedList in memory. Earlier it used to create the on-disk 2PC
file.

At standby promote, the surviving (yet uncommitted) prepared
transactions from KnownPreparedList need to be persisted, right?

Regards,
Nikhils




Re: [HACKERS] Speedup twophase transactions

2017-01-27 Thread Simon Riggs
On 27 January 2017 at 09:59, Nikhil Sontakke  wrote:
>>> But, I put the recovery process and the checkpointer process of the
>>> standby under gdb with breakpoints on these functions, but both did
>>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>>> a promote :-|
>>
>> No end-of-recovery checkpoints happen at promotion since 9.3. You can
>> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
>> included) behavior.
>
> Ok, so that means, we also need to fsync out these 2PC XIDs at promote
> time as well for their durability.

Why? The data files haven't been fsynced either at that point.

If there is a bug there it doesn't just affect 2PC.

What sequence of actions would cause data loss?

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-27 Thread Nikhil Sontakke
>> But, I put the recovery process and the checkpointer process of the
>> standby under gdb with breakpoints on these functions, but both did
>> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
>> a promote :-|
>
> No end-of-recovery checkpoints happen at promotion since 9.3. You can
> still use fallback_promote as promote file to trigger the pre-9.2 (9.2
> included) behavior.

Ok, so that means, we also need to fsync out these 2PC XIDs at promote
time as well for their durability.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-26 Thread Michael Paquier
On Thu, Jan 26, 2017 at 5:02 PM, Nikhil Sontakke
 wrote:
>>> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
>>> promote" code path.
>>
>> CreateRestartPoint() calls it via CheckPointGuts() while in recovery.
>
> May be that I am missing something.
>
> But, I put the recovery process and the checkpointer process of the
> standby under gdb with breakpoints on these functions, but both did
> not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
> a promote :-|

No end-of-recovery checkpoints happen at promotion since 9.3. You can
still use fallback_promote as promote file to trigger the pre-9.2 (9.2
included) behavior.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-26 Thread Nikhil Sontakke
>> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
>> promote" code path.
>
> CreateRestartPoint() calls it via CheckPointGuts() while in recovery.

May be that I am missing something.

But, I put the recovery process and the checkpointer process of the
standby under gdb with breakpoints on these functions, but both did
not hit CreateRestartPoint() as well as CheckPointGuts() when I issued
a promote :-|

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-26 Thread Stas Kelvich

> On 26 Jan 2017, at 10:34, Michael Paquier  wrote:
> 
> On Thu, Jan 26, 2017 at 4:09 PM, Nikhil Sontakke
>  wrote:
>>> I look at this patch from you and that's present for me:
>>> https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com
>> 
>>> --- a/src/backend/access/transam/xlog.c
>>> +++ b/src/backend/access/transam/xlog.c
>>> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
>>> (errmsg("unexpected timeline ID %u (should be %u)
>>> in checkpoint record",
>>> checkPoint.ThisTimeLineID, ThisTimeLineID)));
>>> 
>>> +KnownPreparedRecreateFiles(checkPoint.redo);
>>> RecoveryRestartPoint();
>>> }
>> 
>> Oh, sorry. I was asking about CheckpointTwoPhase(). I don't see a
>> function by this name. And now I see, the name is CheckPointTwoPhase()
>> :-)
> 
> My mistake then :D
> 
>>> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
>>> files are not flushed to disk with this patch. This is a problem as a
>>> new restart point is created... Having the flush in CheckpointTwoPhase
>>> really makes the most sense.
>> 
>> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
>> promote" code path.
> 
> CreateRestartPoint() calls it via CheckPointGuts() while in recovery.
> 

Huh, glad that this thread received a lot of attention.

> On 24 Jan 2017, at 17:26, Nikhil Sontakke  wrote:
> 
> We are talking about the recovery/promote code path. Specifically this
> call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
> 
> We write the files to disk and they get immediately read up in the
> following code. We could not write the files to disk and read
> KnownPreparedList in the code path that follows as well as elsewhere.

Thanks Nikhil, now I got that. Since we are talking about promotion we are on
a different timescale, and a 1-10 second lag matters a lot.

I have in mind a realistic scenario where the proposed recovery code path
will hit the worst case: Google Cloud. It has quite fast storage, but fsync
time is really big and can go up to 10-100ms (I suppose the storage is
network-attached). With, say, 300 prepared transactions, we could delay
promotion by up to half a minute.

So I think it is worth examining.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2017-01-25 Thread Michael Paquier
On Thu, Jan 26, 2017 at 4:09 PM, Nikhil Sontakke
 wrote:
>>I look at this patch from you and that's present for me:
>>https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com
>
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
>>  (errmsg("unexpected timeline ID %u (should be %u)
>> in checkpoint record",
>>  checkPoint.ThisTimeLineID, ThisTimeLineID)));
>>
>> +KnownPreparedRecreateFiles(checkPoint.redo);
>>  RecoveryRestartPoint();
>>  }
>
> Oh, sorry. I was asking about CheckpointTwoPhase(). I don't see a
> function by this name. And now I see, the name is CheckPointTwoPhase()
> :-)

My mistake then :D

>> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
>> files are not flushed to disk with this patch. This is a problem as a
>> new restart point is created... Having the flush in CheckpointTwoPhase
>> really makes the most sense.
>
> Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
> promote" code path.

CreateRestartPoint() calls it via CheckPointGuts() while in recovery.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-25 Thread Nikhil Sontakke
>I look at this patch from you and that's present for me:
>https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
>  (errmsg("unexpected timeline ID %u (should be %u)
> in checkpoint record",
>  checkPoint.ThisTimeLineID, ThisTimeLineID)));
>
> +KnownPreparedRecreateFiles(checkPoint.redo);
>  RecoveryRestartPoint();
>  }

Oh, sorry. I was asking about CheckpointTwoPhase(). I don't see a
function by this name. And now I see, the name is CheckPointTwoPhase()
:-)

> And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
> files are not flushed to disk with this patch. This is a problem as a
> new restart point is created... Having the flush in CheckpointTwoPhase
> really makes the most sense.

Umm, AFAICS, CheckPointTwoPhase() does not get called in the "standby
promote" code path.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-25 Thread Michael Paquier
On Thu, Jan 26, 2017 at 1:38 PM, Nikhil Sontakke
 wrote:
>> We should really try to do things right now, or we'll never come back
>> to it. 9.3 (if my memory does not fail me?) has reduced the time to do
>> promotion by removing the need of the end-of-recovery checkpoint,
>> while I agree that there should not be that many 2PC transactions at
>> this point, if there are for a reason or another, the time it takes to
>> complete promotion would be impacted. So let's refactor
>> PrescanPreparedTransactions() so as it is able to handle 2PC data from
>> a buffer extracted by XlogReadTwoPhaseData(), and we are good to go.
>
> Not quite. If we modify PrescanPreparedTransactions(), we also need to
> make RecoverPreparedTransactions() and
> StandbyRecoverPreparedTransactions() handle 2PC data via
> XlogReadTwoPhaseData().

Ah, right for both, even for RecoverPreparedTransactions() that
happens at the end of recovery. Thanks for noticing. The patch
mentions that as well:
+ ** At the end of recovery we move all known prepared transactions to disk.
+ *  This allows RecoverPreparedTransactions() and
+ *  StandbyRecoverPreparedTransactions() to do their work.
I need some strong coffee..

>> +   KnownPreparedRecreateFiles(checkPoint.redo);
>> RecoveryRestartPoint();
>> Looking again at this code, I think that this is incorrect. The
>> checkpointer should be in charge of doing this work and not the
>> startup process, so this should go into CheckpointTwoPhase() instead.
>
> I don't see a function by the above name in the code?

I look at this patch from you and that's present for me:
https://www.postgresql.org/message-id/CAMGcDxf8Bn9ZPBBJZba9wiyQq-Qk5uqq=vjomnrnw5s+fks...@mail.gmail.com
If I look as well at the last version of Stas it is here:
https://www.postgresql.org/message-id/becc988a-db74-48d5-b5d5-a54551a62...@postgrespro.ru

As this change:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9573,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
 (errmsg("unexpected timeline ID %u (should be %u)
in checkpoint record",
 checkPoint.ThisTimeLineID, ThisTimeLineID)));

+KnownPreparedRecreateFiles(checkPoint.redo);
 RecoveryRestartPoint();
 }
And actually, when a XLOG_CHECKPOINT_SHUTDOWN record is taken, 2PC
files are not flushed to disk with this patch. This is a problem as a
new restart point is created... Having the flush in CheckpointTwoPhase
really makes the most sense.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-25 Thread Nikhil Sontakke
>> The question remains whether saving off a few fsyncs/reads for these
>> long-lived prepared transactions is worth the additional code churn.
>> Even if we add code to go through the KnownPreparedList, we still will
>> have to go through the other on-disk 2PC transactions anyways. So,
>> maybe not.
>
> We should really try to do things right now, or we'll never come back
> to it. 9.3 (if my memory does not fail me?) has reduced the time to do
> promotion by removing the need of the end-of-recovery checkpoint,
> while I agree that there should not be that many 2PC transactions at
> this point, if there are for a reason or another, the time it takes to
> complete promotion would be impacted. So let's refactor
> PrescanPreparedTransactions() so as it is able to handle 2PC data from
> a buffer extracted by XlogReadTwoPhaseData(), and we are good to go.
>

Not quite. If we modify PrescanPreparedTransactions(), we also need to
make RecoverPreparedTransactions() and
StandbyRecoverPreparedTransactions() handle 2PC data via
XlogReadTwoPhaseData().


> +   /*
> +* Move prepared transactions, if any, from KnownPreparedList to 
> files.
> +* It is possible to skip this step and teach subsequent code about
> +* KnownPreparedList, but PrescanPreparedTransactions() happens once
> +* during end of recovery or on promote, so probably it isn't worth
> +* the additional code.
> +*/


This comment is misplaced; it does not make sense before this specific call.

> +   KnownPreparedRecreateFiles(checkPoint.redo);
> RecoveryRestartPoint();
> Looking again at this code, I think that this is incorrect. The
> checkpointer should be in charge of doing this work and not the
> startup process, so this should go into CheckpointTwoPhase() instead.

I don't see a function by the above name in the code?

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-25 Thread Michael Paquier
On Wed, Jan 25, 2017 at 11:55 PM, Nikhil Sontakke
 wrote:
>> We are talking about the recovery/promote code path. Specifically this
>> call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
>>
>> We write the files to disk and they get immediately read up in the
>> following code. We could not write the files to disk and read
>> KnownPreparedList in the code path that follows as well as elsewhere.
>
> Thinking more on this.
>
> The only optimization that's really remaining is handling of prepared
> transactions that have not been committed or will linger around for
> long. The short lived 2PC transactions have been optimized already via
> this patch.
>
> The question remains whether saving off a few fsyncs/reads for these
> long-lived prepared transactions is worth the additional code churn.
> Even if we add code to go through the KnownPreparedList, we still will
> have to go through the other on-disk 2PC transactions anyways. So,
> maybe not.

We should really try to do things right now, or we'll never come back
to it. 9.3 (if my memory does not fail me?) has reduced the time to do
promotion by removing the need of the end-of-recovery checkpoint,
while I agree that there should not be that many 2PC transactions at
this point, if there are for a reason or another, the time it takes to
complete promotion would be impacted. So let's refactor
PrescanPreparedTransactions() so as it is able to handle 2PC data from
a buffer extracted by XlogReadTwoPhaseData(), and we are good to go.

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9573,6 +9573,15 @@ xlog_redo(XLogReaderState *record)
(errmsg("unexpected timeline ID %u (should be %u)
in checkpoint record",
checkPoint.ThisTimeLineID, ThisTimeLineID)));

+
+   /*
+* Move prepared transactions, if any, from KnownPreparedList to files.
+* It is possible to skip this step and teach subsequent code about
+* KnownPreparedList, but PrescanPreparedTransactions() happens once
+* during end of recovery or on promote, so probably it isn't worth
+* the additional code.
+*/
+   KnownPreparedRecreateFiles(checkPoint.redo);
RecoveryRestartPoint();
Looking again at this code, I think that this is incorrect. The
checkpointer should be in charge of doing this work and not the
startup process, so this should go into CheckpointTwoPhase() instead.
At the end, we should be able to just live without
KnownPreparedRecreateFiles() and just rip it off from the patch.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2017-01-25 Thread Nikhil Sontakke
> We are talking about the recovery/promote code path. Specifically this
> call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
>
> We write the files to disk and they get immediately read up in the
> following code. We could not write the files to disk and read
> KnownPreparedList in the code path that follows as well as elsewhere.

Thinking more on this.

The only optimization that's really remaining is handling of prepared
transactions that have not been committed or will linger around for
long. The short lived 2PC transactions have been optimized already via
this patch.

The question remains whether saving off a few fsyncs/reads for these
long-lived prepared transactions is worth the additional code churn.
Even if we add code to go through the KnownPreparedList, we still will
have to go through the other on-disk 2PC transactions anyways. So,
maybe not.

Regards,
Nikhils




>
> Regards,
> Nikhils
>
>
>>> The difference between those two is likely noise.
>>>
>>> By the way, in those measurements, the OS cache is still filled with
>>> the past WAL segments, which is a rather best case, no? What happens
>>> if you do the same kind of tests on a box where memory is busy doing
>>> something else and replayed WAL segments get evicted from the OS cache
>>> more aggressively once the startup process switches to a new segment?
>>> This could be tested for example on a VM with few memory (say 386MB or
>>> less) so as the startup process needs to access again the past WAL
>>> segments to recover the 2PC information it needs to get them back
>>> directly from disk... One trick that you could use here would be to
>>> tweak the startup process so as it drops the OS cache once a segment
>>> is finished replaying, and see the effects of an aggressive OS cache
>>> eviction. This patch is showing really nice improvements with the OS
>>> cache backing up the data, still it would make sense to test things
>>> with a worse test case and see if things could be done better. The
>>> startup process now only reads records sequentially, not randomly
>>> which is a concept that this patch introduces.
>>>
>>> Anyway, perhaps this does not matter much, the non-recovery code path
>>> does the same thing as this patch, and the improvement is too much to
>>> be ignored. So for consistency's sake we could go with the approach
>>> proposed which has the advantage to not put any restriction on the
>>> size of the 2PC file contrary to what an implementation saving the
>>> contents of the 2PC files into memory would need to do.
>>
>> Maybe i’m missing something, but I don’t see how OS cache can affect 
>> something here.
>>
>> Total WAL size was 0x44 * 16 = 1088 MB, recovery time is about 20s. 
>> Sequential reading 1GB of data
>> is order of magnitude faster even on the old hdd, not speaking of ssd. Also 
>> you can take a look on flame graphs
>> attached to previous message — majority of time during recovery spent in 
>> pg_qsort while replaying
>> PageRepairFragmentation, while whole xact_redo_commit() takes about 1% of 
>> time. That amount can
>> grow in case of uncached disk read but taking into account total recovery 
>> time this should not affect much.
>>
>> If you are talking about uncached access only during checkpoint than here we 
>> are restricted with
>> max_prepared_transaction, so at max we will read about hundred of small 
>> files (usually fitting into one filesystem page) which will also
>> be barely noticeable comparing to recovery time between checkpoints. Also 
>> wal segments cache eviction during
>> replay doesn’t seems to me as standard scenario.
>>
>> Anyway i took the machine with hdd to slow down read speed and run tests 
>> again. During one of the runs i
>> launched in parallel bash loop that was dropping os cache each second (while 
>> wal fragment replay takes
>>  also about one second).
>>
>> 1.5M transactions
>>  start segment: 0x06
>>  last segment: 0x47
>>
>> patched, with constant cache_drop:
>>   total recovery time: 86s
>>
>> patched, without constant cache_drop:
>>total recovery time: 68s
>>
>> (while difference is significant, i bet that happens mostly because of 
>> database file segments should be re-read after cache drop)
>>
>> master, without constant cache_drop:
>>time to recover 35 segments: 2h 25m (after that I tired of waiting)
>>expected total recovery time: 4.5 hours
>>
>> --
>> Stas Kelvich
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>>
>>
>
>
>
> --
>  Nikhil Sontakke   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-24 Thread Nikhil Sontakke
> Thanks for review, Nikhil and Michael.
>
> I don’t follow here. We are moving data away from WAL to files on checkpoint
> because after checkpoint there is no guarantee that the WAL segment with our
> prepared tx will still be available.
>

We are talking about the recovery/promote code path. Specifically this
call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().

We write the files to disk and they get immediately read up in the
following code. We could not write the files to disk and read
KnownPreparedList in the code path that follows as well as elsewhere.

Regards,
Nikhils


>> The difference between those two is likely noise.
>>
>> By the way, in those measurements, the OS cache is still filled with
>> the past WAL segments, which is a rather best case, no? What happens
>> if you do the same kind of tests on a box where memory is busy doing
>> something else and replayed WAL segments get evicted from the OS cache
>> more aggressively once the startup process switches to a new segment?
>> This could be tested for example on a VM with few memory (say 386MB or
>> less) so as the startup process needs to access again the past WAL
>> segments to recover the 2PC information it needs to get them back
>> directly from disk... One trick that you could use here would be to
>> tweak the startup process so as it drops the OS cache once a segment
>> is finished replaying, and see the effects of an aggressive OS cache
>> eviction. This patch is showing really nice improvements with the OS
>> cache backing up the data, still it would make sense to test things
>> with a worse test case and see if things could be done better. The
>> startup process now only reads records sequentially, not randomly
>> which is a concept that this patch introduces.
>>
>> Anyway, perhaps this does not matter much, the non-recovery code path
>> does the same thing as this patch, and the improvement is too much to
>> be ignored. So for consistency's sake we could go with the approach
>> proposed which has the advantage to not put any restriction on the
>> size of the 2PC file contrary to what an implementation saving the
>> contents of the 2PC files into memory would need to do.
>
> Maybe i’m missing something, but I don’t see how OS cache can affect 
> something here.
>
> Total WAL size was 0x44 * 16 = 1088 MB, recovery time is about 20s. 
> Sequential reading 1GB of data
> is order of magnitude faster even on the old hdd, not speaking of ssd. Also 
> you can take a look on flame graphs
> attached to previous message — majority of time during recovery spent in 
> pg_qsort while replaying
> PageRepairFragmentation, while whole xact_redo_commit() takes about 1% of 
> time. That amount can
> grow in case of uncached disk read but taking into account total recovery 
> time this should not affect much.
>
> If you are talking about uncached access only during checkpoint than here we 
> are restricted with
> max_prepared_transaction, so at max we will read about hundred of small files 
> (usually fitting into one filesystem page) which will also
> be barely noticeable comparing to recovery time between checkpoints. Also wal 
> segments cache eviction during
> replay doesn’t seems to me as standard scenario.
>
> Anyway i took the machine with hdd to slow down read speed and run tests 
> again. During one of the runs i
> launched in parallel bash loop that was dropping os cache each second (while 
> wal fragment replay takes
>  also about one second).
>
> 1.5M transactions
>  start segment: 0x06
>  last segment: 0x47
>
> patched, with constant cache_drop:
>   total recovery time: 86s
>
> patched, without constant cache_drop:
>total recovery time: 68s
>
> (while difference is significant, i bet that happens mostly because of 
> database file segments should be re-read after cache drop)
>
> master, without constant cache_drop:
>time to recover 35 segments: 2h 25m (after that I tired of waiting)
>expected total recovery time: 4.5 hours
>
> --
> Stas Kelvich
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>



-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-24 Thread Stas Kelvich

> On 24 Jan 2017, at 09:42, Michael Paquier  wrote:
> 
> On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke
>  wrote:
>> Speeding up recovery or failover activity via a faster promote is a
>> desirable thing. So, maybe, we should look at teaching the relevant
>> code about using "KnownPreparedList"? I know that would increase the
>> size of this patch and would mean more testing, but this seems to be
>> last remaining optimization in this code path.
> 
> That's a good idea, worth having in this patch. Actually we may not
> want to call KnownPreparedRecreateFiles() here as promotion is not
> synonym of end-of-recovery checkpoint for a couple of releases now.

Thanks for review, Nikhil and Michael.

I don’t follow here. We are moving data away from WAL to files on checkpoint
because after checkpoint there is no guarantee that the WAL segment with our
prepared tx will still be available.

> The difference between those two is likely noise.
> 
> By the way, in those measurements, the OS cache is still filled with
> the past WAL segments, which is a rather best case, no? What happens
> if you do the same kind of tests on a box where memory is busy doing
> something else and replayed WAL segments get evicted from the OS cache
> more aggressively once the startup process switches to a new segment?
> This could be tested for example on a VM with few memory (say 386MB or
> less) so as the startup process needs to access again the past WAL
> segments to recover the 2PC information it needs to get them back
> directly from disk... One trick that you could use here would be to
> tweak the startup process so as it drops the OS cache once a segment
> is finished replaying, and see the effects of an aggressive OS cache
> eviction. This patch is showing really nice improvements with the OS
> cache backing up the data, still it would make sense to test things
> with a worse test case and see if things could be done better. The
> startup process now only reads records sequentially, not randomly
> which is a concept that this patch introduces.
> 
> Anyway, perhaps this does not matter much, the non-recovery code path
> does the same thing as this patch, and the improvement is too much to
> be ignored. So for consistency's sake we could go with the approach
> proposed which has the advantage to not put any restriction on the
> size of the 2PC file contrary to what an implementation saving the
> contents of the 2PC files into memory would need to do.

Maybe I’m missing something, but I don’t see how the OS cache can affect
anything here.

Total WAL size was 0x44 * 16 = 1088 MB, recovery time is about 20s. Sequentially
reading 1 GB of data is an order of magnitude faster even on an old HDD, not to
mention an SSD. Also you can take a look at the flame graphs attached to the
previous message — the majority of time during recovery is spent in pg_qsort
while replaying PageRepairFragmentation, while the whole xact_redo_commit()
takes about 1% of the time. That amount can grow in case of uncached disk
reads, but taking total recovery time into account this should not affect much.

If you are talking about uncached access only during checkpoint, then here we
are restricted by max_prepared_transactions, so at most we will read about a
hundred small files (usually fitting into one filesystem page), which will also
be barely noticeable compared to the recovery time between checkpoints. Also,
WAL segment cache eviction during replay doesn’t seem to me like a standard
scenario.

Anyway, I took a machine with an HDD to slow down read speed and ran the tests
again. During one of the runs I launched in parallel a bash loop that dropped
the OS cache each second (while replay of one WAL segment also takes about one
second).

1.5M transactions
 start segment: 0x06
 last segment: 0x47

patched, with constant cache_drop:
  total recovery time: 86s

patched, without constant cache_drop:
   total recovery time: 68s

(while the difference is significant, I bet that happens mostly because
database file segments must be re-read after a cache drop)

master, without constant cache_drop:
   time to recover 35 segments: 2h 25m (after that I tired of waiting)
   expected total recovery time: 4.5 hours

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2017-01-24 Thread Nikhil Sontakke
>> Speeding up recovery or failover activity via a faster promote is a
>> desirable thing. So, maybe, we should look at teaching the relevant
>> code about using "KnownPreparedList"? I know that would increase the
>> size of this patch and would mean more testing, but this seems to be
>> the last remaining optimization in this code path.
>
> That's a good idea, worth having in this patch. Actually we may not
> want to call KnownPreparedRecreateFiles() here as promotion is not
> synonym of end-of-recovery checkpoint for a couple of releases now.
>

Once implemented, a good way to performance-test this could be to set
checkpoint_timeout to a large value like an hour. Then, generate
enough 2PC WAL while ensuring that a checkpoint does not happen
automatically or otherwise.

We could then measure the time taken to recover on startup to see the efficacy.

> Most of the 2PC syncs just won't happen, such transactions normally
> don't last long, and the number you would get during a checkpoint is
> largely lower than what would happen between two checkpoints. When
> working on Postgres-XC, the number of 2PC was really more funky.
>

Yes, Postgres-XL is full of 2PC, so this optimization should hopefully
help a lot in that case as well.

Regards,
Nikhils
-- 
 Nikhil Sontakke   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Speedup twophase transactions

2017-01-23 Thread Michael Paquier
On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke
 wrote:
> Hi Stas,
>
>>
>> So, here is a brand new implementation of the same thing.
>>
>> Now instead of creating a pgproc entry for each prepared transaction during
>> recovery, I just store the recptr/xid correspondence in a separate 2L-list
>> and delete entries from that list if the redo process encounters a
>> commit/abort. In case of checkpoint or end of recovery, transactions
>> remaining in that list are dumped to files in pg_twophase.
>>
>> Seems that the current approach is way simpler and the patch has two times
>> fewer LOCs than the previous one.
>>
>
> The proposed approach and patch does appear to be much simpler than
> the previous one.
>
> I have some comments/questions on the patch (twophase_recovery_list_2.diff):
>
> 1)
> +   /*
> +* Move prepared transactions from KnownPreparedList to files, if any.
> +* It is possible to skip that step and teach subsequent code about
> +* KnownPreparedList, but whole PrescanPreparedTransactions() happens
> +* once during end of recovery or promote, so probably it isn't worth
> +* complications.
> +*/
> +   KnownPreparedRecreateFiles(InvalidXLogRecPtr);
> +
>
> Speeding up recovery or failover activity via a faster promote is a
> desirable thing. So, maybe, we should look at teaching the relevant
> code about using "KnownPreparedList"? I know that would increase the
> size of this patch and would mean more testing, but this seems to be
> the last remaining optimization in this code path.

That's a good idea, worth having in this patch. Actually we may not
want to call KnownPreparedRecreateFiles() here as promotion is not
synonym of end-of-recovery checkpoint for a couple of releases now.

> 3) We are pushing the fsyncing of 2PC files to the checkpoint replay
> activity with this patch. Now, typically, we would assume that PREPARE
> followed by COMMIT/ABORT would happen within a checkpoint replay
> cycle, if not, this would make checkpoint replays slower since the
> earlier spread out fsyncs are now getting bunched up. I had concerns
> about this but clearly your latest performance measurements don't show
> any negatives around this directly.

Most of the 2PC syncs just won't happen, such transactions normally
don't last long, and the number you would get during a checkpoint is
largely lower than what would happen between two checkpoints. When
working on Postgres-XC, the number of 2PC was really more funky.

> 6) Do we need to hold TwoPhaseStateLock lock in shared mode while
> calling KnownPreparedRecreateFiles()? I think it does not matter in
> the recovery/replay code path.

It does not matter as the only code path tracking that would be the
checkpointer in TwoPhaseCheckpoint(), and this 2PC recovery patch does
not touch TwoPhaseStateData anymore. Actually the patch makes no use
of it.

> 7) I fixed some typos and comments. PFA, patch attached.
>
> Other than this, I ran TAP tests and they succeed as needed.
>
> On 23 January 2017 at 16:56, Stas Kelvich  wrote:
>> I did tests with following setup:
>>
>> * Start postgres initialised with pgbench
>> * Start pg_receivexlog
>> * Take basebackup
>> * Perform 1.5 M transactions
>> * Stop everything and apply wal files stored by pg_receivexlog to base 
>> backup.
>>
>> All tests performed on a laptop with nvme ssd
>> number of transactions: 1.5M
>> start segment: 0x4
>>
>> -master non-2pc:
>> last segment: 0x1b
>> average recovery time per 16 wal files: 11.8s
>> average total recovery time: 17.0s
>>
>> -master 2pc:
>> last segment: 0x44
>> average recovery time per 16 wal files: 142s
>> average total recovery time: 568s
>>
>> -patched 2pc (previous patch):
>> last segment: 0x44
>> average recovery time per 16 wal files: 5.3s
>> average total recovery time: 21.2s
>>
>> -patched2 2pc (dlist_push_tail changed to dlist_push_head):
>> last segment: 0x44
>> average recovery time per 16 wal files: 5.2s
>> average total recovery time: 20.8s

The difference between those two is likely noise.

By the way, in those measurements, the OS cache is still filled with
the past WAL segments, which is a rather best case, no? What happens
if you do the same kind of tests on a box where memory is busy doing
something else and replayed WAL segments get evicted from the OS cache
more aggressively once the startup process switches to a new segment?
This could be tested for example on a VM with few memory (say 386MB or
less) so as the startup process needs to access again the past WAL
segments to recover the 2PC information it needs to get them back
directly from disk... One trick that you could use here would be to
tweak the startup process so as it drops the OS cache once a segment
is finished replaying, and see the effects of an aggressive OS cache
eviction. This patch is showing really nice improvements with the OS
cache backing up the data, still it would make sense to test things
with a worse test case and see if things could be done better. The
startup process now only reads records sequentially, not randomly,
which is a concept that this patch introduces.

Re: [HACKERS] Speedup twophase transactions

2017-01-23 Thread Nikhil Sontakke
Hi Stas,

>
> So, here is a brand new implementation of the same thing.
>
> Now instead of creating a pgproc entry for each prepared transaction during
> recovery, I just store the recptr/xid correspondence in a separate 2L-list
> and delete entries from that list if the redo process encounters a
> commit/abort. In case of checkpoint or end of recovery, transactions
> remaining in that list are dumped to files in pg_twophase.
>
> Seems that the current approach is way simpler and the patch has two times
> fewer LOCs than the previous one.
>

The proposed approach and patch do appear to be much simpler than
the previous one.
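For readers skimming the thread, the quoted design can be sketched in miniature as follows. This is a simplified, standalone C model, not the actual patch: a singly linked list stands in for PostgreSQL's dlist, a counter stands in for real pg_twophase state files, and the function names merely mirror those discussed here.

```c
/*
 * Toy model of the quoted design (an illustration, not the patch itself):
 * PREPARE redo only remembers the transaction in memory; COMMIT/ABORT redo
 * forgets it; a checkpoint (or end of recovery) flushes the survivors to
 * "files", here merely counted instead of written and fsynced.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

typedef struct KnownPreparedXact
{
    TransactionId xid;
    uint64_t prepare_lsn;           /* LSN of the PREPARE record in WAL */
    struct KnownPreparedXact *next;
} KnownPreparedXact;

static KnownPreparedXact *KnownPreparedList = NULL;

/* PREPARE redo: just remember the xid/LSN pair; no file, no fsync. */
void
KnownPreparedAdd(TransactionId xid, uint64_t lsn)
{
    KnownPreparedXact *x = malloc(sizeof(KnownPreparedXact));

    x->xid = xid;
    x->prepare_lsn = lsn;
    x->next = KnownPreparedList;
    KnownPreparedList = x;
}

/* COMMIT/ABORT redo: drop the entry if the prepare never reached disk. */
void
KnownPreparedRemove(TransactionId xid)
{
    KnownPreparedXact **p = &KnownPreparedList;

    while (*p != NULL)
    {
        if ((*p)->xid == xid)
        {
            KnownPreparedXact *dead = *p;

            *p = dead->next;
            free(dead);
            return;
        }
        p = &(*p)->next;
    }
}

/*
 * Checkpoint or end of recovery: dump whatever is still in the list
 * (modeled as a counter) and empty the list.
 */
int
KnownPreparedRecreateFiles(void)
{
    int nfiles = 0;

    while (KnownPreparedList != NULL)
    {
        KnownPreparedXact *x = KnownPreparedList;

        KnownPreparedList = x->next;
        nfiles++;               /* real code writes and fsyncs a state file */
        free(x);
    }
    return nfiles;
}
```

The point of the design is visible in the model: a PREPARE that is committed or aborted before the next checkpoint never causes any file I/O at all.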

I have some comments/questions on the patch (twophase_recovery_list_2.diff):

1)
+   /*
+* Move prepared transactions from KnownPreparedList to files, if any.
+* It is possible to skip that step and teach subsequent code about
+* KnownPreparedList, but whole PrescanPreparedTransactions() happens
+* once during end of recovery or promote, so probably it isn't worth
+* complications.
+*/
+   KnownPreparedRecreateFiles(InvalidXLogRecPtr);
+

Speeding up recovery or failover activity via a faster promote is a
desirable thing. So, maybe, we should look at teaching the relevant
code about using "KnownPreparedList"? I know that would increase the
size of this patch and would mean more testing, but this seems to be
the last remaining optimization in this code path.

2)
+   /*
+* Here we know that the file should be moved to disk. But aborting
+* recovery because of the absence of an unnecessary file doesn't seem
+* to be a good idea, so call remove with giveWarning=false.
+*/
+   RemoveTwoPhaseFile(xid, false);

We are going to call the above in case of COMMIT/ABORT. If so, we
should always find the "xid" entry either in the KnownPreparedList or
as a file. Does it make sense to call the above function with "false"
then?

3) We are pushing the fsyncing of 2PC files to the checkpoint replay
activity with this patch. Now, typically, we would assume that PREPARE
followed by COMMIT/ABORT would happen within a checkpoint replay
cycle, if not, this would make checkpoint replays slower since the
earlier spread out fsyncs are now getting bunched up. I had concerns
about this but clearly your latest performance measurements don't show
any negatives around this directly.


4) Minor nit-pick on existing code.

(errmsg_plural("%u two-phase state file was written "
               "for long-running prepared transactions",
               "%u two-phase state files were written "
               "for long-running prepared transactions",
               serialized_xacts,
               serialized_xacts)

Shouldn’t the singular part of the message above be:
"%u two-phase state file was written for a long-running prepared transaction"

But, then, English is not my native language, so I might be off here :-)

5) Why isn't KnownPreparedRecreateFiles() tracking which entries from
KnownPreparedList have been written to disk in an earlier iteration
like in the normal code path? Why is it removing the entry from
KnownPreparedList and not just marking it as saved on disk?

6) Do we need to hold TwoPhaseStateLock lock in shared mode while
calling KnownPreparedRecreateFiles()? I think it does not matter in
the recovery/replay code path.

7) I fixed some typos and comments. PFA, patch attached.

Other than this, I ran TAP tests and they succeed as needed.

Regards,
Nikhils

On 23 January 2017 at 16:56, Stas Kelvich  wrote:
>
>> On 27 Dec 2016, at 07:31, Michael Paquier  wrote:
>>
>> I think that it would be a good idea to actually test that in pure
>> recovery time, aka no client, and just use a base backup and make it
>> recover X prepared transactions that have created Y checkpoints after
>> dropping cache (or restarting server).
>
> I did tests with following setup:
>
> * Start postgres initialised with pgbench
> * Start pg_receivexlog
> * Take basebackup
> * Perform 1.5 M transactions
> * Stop everything and apply wal files stored by pg_receivexlog to base backup.
>
> All tests performed on a laptop with nvme ssd
> number of transactions: 1.5M
> start segment: 0x4
>
> -master non-2pc:
> last segment: 0x1b
> average recovery time per 16 wal files: 11.8s
> average total recovery time: 17.0s
>
> -master 2pc:
> last segment: 0x44
> average recovery time per 16 wal files: 142s
> average total recovery time: 568s
>
> -patched 2pc (previous patch):
> last segment: 0x44
> average recovery time per 16 wal files: 5.3s
> average total recovery time: 21.2s
>
> -patched2 2pc (dlist_push_tail changed to dlist_push_head):
> last segment: 0x44
> average recovery time per 16 wal files: 5.2s
> average total recovery time: 20.8s

Re: [HACKERS] Speedup twophase transactions

2016-12-26 Thread Michael Paquier
On Tue, Dec 27, 2016 at 12:59 PM, Stas Kelvich  wrote:
> Standard config with increased shared_buffers. I think the most significant
> impact on recovery speed here is on the client side, namely the time between
> prepare and commit. Right now I’m using a pgbench script that issues commit
> right after prepare. It’s also possible to put a sleep between prepare and
> commit and increase the number of connections to thousands. That will
> probably be the worst case — the majority of prepared transactions will be
> moved to files.

I think that it would be a good idea to actually test that in pure
recovery time, aka no client, and just use a base backup and make it
recover X prepared transactions that have created Y checkpoints after
dropping cache (or restarting server).
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-12-26 Thread Stas Kelvich

> On 22 Dec 2016, at 05:35, Michael Paquier  wrote:
> 
> True. The more spread the checkpoints and 2PC files, the more risk to
> require access to disk. Memory's cheap anyway. What was the system
> memory? How many checkpoints did you trigger for how many 2PC files
> created?

Standard config with increased shared_buffers. I think the most significant
impact on recovery speed here is on the client side, namely the time between
prepare and commit. Right now I’m using a pgbench script that issues commit
right after prepare. It’s also possible to put a sleep between prepare and
commit and increase the number of connections to thousands. That will probably
be the worst case — the majority of prepared transactions will be moved to files.
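The exact pgbench script is not shown in the thread; a custom script of the kind described would look roughly like the following (an assumed reconstruction, not the author's file; it requires max_prepared_transactions > 0 on the server):

```sql
-- twophase.sql: hypothetical pgbench custom script, run e.g. as
--   pgbench -n -f twophase.sql -c 100 -T 60
-- Each client has at most one prepared transaction open at a time,
-- so the GID 'px_:client_id' stays unique.
\set aid random(1, 100000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
PREPARE TRANSACTION 'px_:client_id';
COMMIT PREPARED 'px_:client_id';
```

Inserting a `\sleep` between the PREPARE and the COMMIT PREPARED lines would widen the window in which prepared transactions can be caught by a checkpoint, producing the worst case described above.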

> Perhaps it would be a good idea to look for the 2PC files
> from WAL records in a specific order. Did you try to use
> dlist_push_head instead of dlist_push_tail? This may make a difference
> on systems where WAL segments don't fit in system cache as the latest
> files generated would be looked at first for 2PC data.

Ouch! Good catch. I didn’t actually notice that the list is populated in the
opposite order with respect to traversal. I’ll fix that.
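As a trivial illustration of why the push direction matters (illustrative C, not PostgreSQL's actual dlist API): entries pushed at the head are visited first in a head-to-tail traversal, so the most recently replayed PREPARE records, whose WAL segments are most likely still in cache, get inspected first.

```c
/*
 * Minimal sketch of push_tail vs push_head on an array-backed list:
 * with push_tail the oldest entry comes first in traversal order;
 * with push_head the newest one does.
 */
#include <assert.h>

#define CAP 16

typedef struct
{
    int items[CAP];             /* items[0] is the head of the list */
    int n;
} List;

static void
push_tail(List *l, int v)
{
    l->items[l->n++] = v;
}

static void
push_head(List *l, int v)
{
    /* shift everything toward the tail, then place v at the head */
    for (int i = l->n; i > 0; i--)
        l->items[i] = l->items[i - 1];
    l->items[0] = v;
    l->n++;
}
```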

> On 27 Dec 2016, at 08:33, Michael Paquier  wrote:
> 
> Stas, have you also tested the impact on recovery time when
> WAL segments are very likely evicted from the OS cache? This could be
> a plausible scenario if a standby instance is heavily used for
> read-only transactions (say pgbench -S), and that the data quantity is
> higher than the amount of RAM available. It would not be complicated
> to test that: just drop_caches before beginning recovery. The maximum
> amount of 2PC transactions that need to have access to the past WAL
> segments is linearly related to the volume of WAL between two
> checkpoints, so max_wal_size does not really matter. What matters is
> the time it takes to recover the same amount of WAL. Increasing
> max_wal_size would give more room to reduce the overall noise between
> two measurements though.

Okay, I’ll perform such testing.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Speedup twophase transactions

2016-12-26 Thread Michael Paquier
On Thu, Dec 22, 2016 at 7:35 AM, Michael Paquier
 wrote:
> On Wed, Dec 21, 2016 at 10:37 PM, Stas Kelvich  
> wrote:
>> ISTM your reasoning about filesystem cache applies here as well, but just
>> without spending time on file creation.
>
> True. The more spread the checkpoints and 2PC files, the more risk to
> require access to disk. Memory's cheap anyway. What was the system
> memory? How many checkpoints did you trigger for how many 2PC files
> created? Perhaps it would be a good idea to look for the 2PC files
> from WAL records in a specific order. Did you try to use
> dlist_push_head instead of dlist_push_tail? This may make a difference
> on systems where WAL segments don't fit in system cache as the latest
> files generated would be looked at first for 2PC data.

Stas, have you also tested the impact on recovery time when
WAL segments are very likely evicted from the OS cache? This could be
a plausible scenario if a standby instance is heavily used for
read-only transactions (say pgbench -S), and that the data quantity is
higher than the amount of RAM available. It would not be complicated
to test that: just drop_caches before beginning recovery. The maximum
amount of 2PC transactions that need to have access to the past WAL
segments is linearly related to the volume of WAL between two
checkpoints, so max_wal_size does not really matter. What matters is
the time it takes to recover the same amount of WAL. Increasing
max_wal_size would give more room to reduce the overall noise between
two measurements though.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-12-21 Thread Michael Paquier
On Wed, Dec 21, 2016 at 10:37 PM, Stas Kelvich  wrote:
> On 21 Dec 2016, at 19:56, Michael Paquier  wrote:
>> That's indeed way simpler than before. Have you as well looked at the
>> simplest approach discussed? That would be just roughly replacing
>> the pg_fsync() calls currently in RecreateTwoPhaseFile() by a save
>> into a list as you are doing, then issuing them all at checkpoint. Even for
>> 2PC files that are created and then removed before the next
>> checkpoint, those will likely be in the system cache.
>
>   Yes, I tried that as well. But with such an approach another bottleneck
> arises — new file creation isn’t a very cheap operation itself. A dual-Xeon
> box with 100 backends quickly hits that, and OS routines for file creation
> occupy the first places in perf top. Probably that depends on the filesystem
> (I used ext4), but avoiding file creation when it isn’t necessary seems like
> the cleaner approach.
>   On the other hand it is possible to skip file creation by reusing files,
> for example naming them by dummy PGPROC offset, but that would require some
> changes to places that right now look only at filenames.

Interesting. Thanks for looking at it. Your latest approach looks more
promising based on that, then.

>> And this saves lookups at the WAL segments
>> still present in pg_xlog, making the operation at checkpoint much
>> faster with many 2PC files to process.
>
> ISTM your reasoning about filesystem cache applies here as well, but just
> without spending time on file creation.

True. The more spread the checkpoints and 2PC files, the more risk to
require access to disk. Memory's cheap anyway. What was the system
memory? How many checkpoints did you trigger for how many 2PC files
created? Perhaps it would be a good idea to look for the 2PC files
from WAL records in a specific order. Did you try to use
dlist_push_head instead of dlist_push_tail? This may make a difference
on systems where WAL segments don't fit in system cache as the latest
files generated would be looked at first for 2PC data.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-12-21 Thread Stas Kelvich

> On 21 Dec 2016, at 19:56, Michael Paquier  wrote:
> 
> That's indeed way simpler than before. Have you as well looked at the
> simplest approach discussed? That would be just roughly replacing
> the pg_fsync() calls currently in RecreateTwoPhaseFile() by a save
> into a list as you are doing, then issuing them all at checkpoint. Even for
> 2PC files that are created and then removed before the next
> checkpoint, those will likely be in the system cache.

  Yes, I tried that as well. But with such an approach another bottleneck arises —
new file creation isn’t a very cheap operation itself. A dual-Xeon box with 100
backends quickly hits that, and OS routines for file creation occupy the first
places in perf top. Probably that depends on the filesystem (I used ext4), but
avoiding file creation when it isn’t necessary seems like the cleaner approach.
  On the other hand it is possible to skip file creation by reusing files, for
example naming them by dummy PGPROC offset, but that would require some changes
to places that right now look only at filenames.

> This also removes
> the need to have XlogReadTwoPhaseData() work in crash recovery, which
> makes me a bit nervous.

Hm, do you have any particular bad scenario in mind for that case?

> And this saves lookups at the WAL segments
> still present in pg_xlog, making the operation at checkpoint much
> faster with many 2PC files to process.

ISTM your reasoning about filesystem cache applies here as well, but just 
without
spending time on file creation.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Speedup twophase transactions

2016-12-21 Thread Michael Paquier
On Fri, Dec 16, 2016 at 8:00 PM, Stas Kelvich  wrote:
> So, here is a brand new implementation of the same thing.
>
> Now instead of creating a pgproc entry for each prepared transaction during
> recovery, I just store the recptr/xid correspondence in a separate 2L-list
> and delete entries from that list if the redo process encounters a
> commit/abort. In case of checkpoint or end of recovery, transactions
> remaining in that list are dumped to files in pg_twophase.
>
> Seems that the current approach is way simpler and the patch has two times
> fewer LOCs than the previous one.

That's indeed way simpler than before. Have you as well looked at the
simplest approach discussed? That would be just roughly replacing
the pg_fsync() calls currently in RecreateTwoPhaseFile() by a save
into a list as you are doing, then issuing them all at checkpoint. Even for
2PC files that are created and then removed before the next
checkpoint, those will likely be in the system cache. This also removes
the need to have XlogReadTwoPhaseData() work in crash recovery, which
makes me a bit nervous. And this saves lookups at the WAL segments
still present in pg_xlog, making the operation at checkpoint much
faster with many 2PC files to process.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread Bruce Momjian
On Sat, Dec 17, 2016 at 07:38:46PM -0500, Bruce Momjian wrote:
> As Andres already stated, the problem is that there is a text/plain and
> text/html of the same email, and the diff is _inside_ the multipa/mixed
> HTML block.  I think it needs to be outside on its own.

Mutt shows text/plain by default.  I bet email readers that prefer to
display HTML see the patch.  Mystery solved by Andres!  :-)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread Bruce Momjian
On Sat, Dec 17, 2016 at 03:47:34PM -0800, Andres Freund wrote:
> On 2016-12-17 15:30:08 -0800, David Fetter wrote:
> > On Sat, Dec 17, 2016 at 05:54:04PM -0500, Bruce Momjian wrote:
> > > On Sun, Dec 18, 2016 at 07:41:50AM +0900, Michael Paquier wrote:
> > > > On Sun, Dec 18, 2016 at 6:42 AM, Bruce Momjian  wrote:
> > > > > Uh, did you mean to attached patch here?
> > > >
> > > > Strange. I can confirm that I have received the patch as attached, but
> > > > it is not on the archives.
> > >
> > > It must have been stripped by our email system.  You were a direct
> > > CC so you received it.
> >
> > I was neither, and I received it, so I don't think PostgreSQL's email
> > system stripped it.  It's pretty mystifying, though.
> 
> The MIME construction in the email is weird. The attachment is below a
> multipart/alternative and multipart/mixed, besides the text/html version
> of the plain text email.  Because the attachment is below the
> multipart/alternative (i.e. the switch between plain text / html), it'll
> not be considered a proper attachment by many mail readers.
> 
> It seems quite borked to put an attachment there - weird that Apple
> Mail does so.

Oh, I now see the attachment _is_ in the original email, it is just that
mutt doesn't show it.  Here is the hierarchy of the email as shown by
mutt:

  I 1                                   [multipa/alternativ, 7bit, 27K]
  I 2 ├─>                               [text/plain, quoted, us-ascii, 0.8K]
  I 3 └─>                               [multipa/mixed, 7bit, 26K]
  I 4   ├─>                             [text/html, quoted, us-ascii, 5.4K]
  A 5   ├─> twophase_recovery_list.diff [applica/octet-stre, 7bit, 20K]
  I 6   └─>                             [text/html, 7bit, us-ascii, 0.4K]

As Andres already stated, the problem is that there is a text/plain and
text/html of the same email, and the diff is _inside_ the multipa/mixed
HTML block.  I think it needs to be outside on its own.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread Andres Freund
On 2016-12-17 15:30:08 -0800, David Fetter wrote:
> On Sat, Dec 17, 2016 at 05:54:04PM -0500, Bruce Momjian wrote:
> > On Sun, Dec 18, 2016 at 07:41:50AM +0900, Michael Paquier wrote:
> > > On Sun, Dec 18, 2016 at 6:42 AM, Bruce Momjian  wrote:
> > > > Uh, did you mean to attached patch here?
> > >
> > > Strange. I can confirm that I have received the patch as attached, but
> > > it is not on the archives.
> >
> > It must have been stripped by our email system.  You were a direct
> > CC so you received it.
>
> I was neither, and I received it, so I don't think PostgreSQL's email
> system stripped it.  It's pretty mystifying, though.

The MIME construction in the email is weird. The attachment is below a
multipart/alternative and multipart/mixed, besides the text/html version
of the plain text email.  Because the attachment is below the
multipart/alternative (i.e. the switch between plain text / html), it'll
not be considered a proper attachment by many mail readers.

It seems quite borked to put an attachment there - weird that Apple
Mail does so.

Andres




Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread David Fetter
On Sat, Dec 17, 2016 at 05:54:04PM -0500, Bruce Momjian wrote:
> On Sun, Dec 18, 2016 at 07:41:50AM +0900, Michael Paquier wrote:
> > On Sun, Dec 18, 2016 at 6:42 AM, Bruce Momjian  wrote:
> > > Uh, did you mean to attached patch here?
> > 
> > Strange. I can confirm that I have received the patch as attached, but
> > it is not on the archives.
> 
> It must have been stripped by our email system.  You were a direct
> CC so you received it.

I was neither, and I received it, so I don't think PostgreSQL's email
system stripped it.  It's pretty mystifying, though.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread Stas Kelvich

> On 18 Dec 2016, at 01:54, Bruce Momjian  wrote:
> 
> On Sun, Dec 18, 2016 at 07:41:50AM +0900, Michael Paquier wrote:
>> 
>> 
>> Strange. I can confirm that I have received the patch as attached, but
>> it is not on the archives.
> 
> It must have been stripped by our email system.  You were a direct CC so
> you received it.
> 

Then, probably, my mail client did something strange. I’ll check.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread Bruce Momjian
On Sun, Dec 18, 2016 at 07:41:50AM +0900, Michael Paquier wrote:
> On Sun, Dec 18, 2016 at 6:42 AM, Bruce Momjian  wrote:
> > Uh, did you mean to attached patch here?
> 
> Strange. I can confirm that I have received the patch as attached, but
> it is not on the archives.

It must have been stripped by our email system.  You were a direct CC so
you received it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread Michael Paquier
On Sun, Dec 18, 2016 at 6:42 AM, Bruce Momjian  wrote:
> Uh, did you mean to attached patch here?

Strange. I can confirm that I have received the patch as attached, but
it is not on the archives.
-- 
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604..fb69646 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -45,8 +45,8 @@
  *   fsynced
  * * If COMMIT happens after checkpoint then backend reads state 
data from
  *   files
- * * In case of crash replay will move data from xlog to files, if 
that
- *   hasn't happened before. XXX TODO - move to shmem in replay 
also
+ * * Simplified version of the same scenario happens during 
recovery and
+ *   replication. See comments to KnownPreparedXact structure.
  *
  *-
  */
@@ -181,6 +181,35 @@ static GlobalTransaction MyLockedGxact = NULL;
 
 static bool twophaseExitRegistered = false;
 
+/*
+ * During replay and replication KnownPreparedList holds info about active 
prepared
+ * transactions that weren't moved to files yet. We will need that info by the 
end of
+ * recovery (including promote) to restore memory state of that transactions.
+ *
+ * Naive approach here is to move each PREPARE record to disk, fsync it and 
don't have
+ * that list at all, but that provokes a lot of unnecessary fsyncs on small 
files
+ * causing replica to be slower than master.
+ *
+ * Replay of twophase records happens by the following rules:
+ * * On PREPARE redo KnownPreparedAdd() is called to add that 
transaction to
+ *   KnownPreparedList and no more actions are taken.
+ * * On checkpoint redo we iterate through KnownPreparedList and 
move all prepare
+ *   records that behind redo_horizon to files and deleting them 
from list.
+ * * On COMMIT/ABORT we delete file or entry in KnownPreparedList.
+ * * At the end of recovery we move all known prepared 
transactions to disk
+ *   to allow 
RecoverPreparedTransactions/StandbyRecoverPreparedTransactions
+ *   do their work.
+ */
+typedef struct KnownPreparedXact
+{
+   TransactionId   xid;
+   XLogRecPtr  prepare_start_lsn;
+   XLogRecPtr  prepare_end_lsn;
+   dlist_node  list_node;
+} KnownPreparedXact;
+
+static dlist_head KnownPreparedList = DLIST_STATIC_INIT(KnownPreparedList);
+
 static void RecordTransactionCommitPrepared(TransactionId xid,
int nchildren,
TransactionId 
*children,
@@ -1241,9 +1270,9 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
  * Reads 2PC data from xlog. During checkpoint this data will be moved to
  * twophase files and ReadTwoPhaseFile should be used instead.
  *
- * Note clearly that this function accesses WAL during normal operation, 
similarly
- * to the way WALSender or Logical Decoding would do. It does not run during
- * crash recovery or standby processing.
+ * Note clearly that this function can access WAL during normal operation, 
similarly
+ * to the way WALSender or Logical Decoding would do.
+ *
  */
 static void
 XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
@@ -1252,8 +1281,6 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
XLogReaderState *xlogreader;
char   *errormsg;
 
-   Assert(!RecoveryInProgress());
-
xlogreader = XLogReaderAllocate(_local_xlog_page, NULL);
if (!xlogreader)
ereport(ERROR,
@@ -1691,6 +1718,15 @@ PrescanPreparedTransactions(TransactionId **xids_p, int 
*nxids_p)
int nxids = 0;
int allocsize = 0;
 
+   /*
+* Move prepared transactions from KnownPreparedList to files, if any.
+* It is possible to skip that step and teach subsequent code about
+* KnownPreparedList, but whole PrescanPreparedTransactions() happens
+* once during end of recovery or promote, so probably it isn't worth
+* complications.
+*/
+   KnownPreparedRecreateFiles(InvalidXLogRecPtr);
+
cldir = AllocateDir(TWOPHASE_DIR);
while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL)
{
@@ -2162,3 +2198,111 @@ RecordTransactionAbortPrepared(TransactionId xid,
 */
SyncRepWaitForLSN(recptr, false);
 }
+
+/*
+ * KnownPreparedAdd.
+ *
+ * Store correspondence of start/end lsn and xid in KnownPreparedList.
+ * This is called during redo of prepare record to have list of prepared
+ * transactions that aren't yet moved to 2PC files by the end of recovery.
+ */
+void
+KnownPreparedAdd(XLogReaderState *record)
+{
+   
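The replay rules described in the comment above can be modeled with a short, self-contained C sketch. This is an editorial illustration, not the patch's actual API: PostgreSQL's dlist and the twophase file I/O are mocked with a plain singly-linked list and a counter, and the signatures are simplified.

```c
/*
 * Sketch of the KnownPreparedList bookkeeping: PREPARE redo adds an
 * entry, COMMIT/ABORT redo drops it, and checkpoint redo (or the end of
 * recovery) "moves to file" every entry behind the redo horizon.
 * All names and signatures here are illustrative simplifications.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct KnownPreparedXact
{
	TransactionId xid;
	XLogRecPtr	prepare_start_lsn;
	XLogRecPtr	prepare_end_lsn;
	struct KnownPreparedXact *next;
} KnownPreparedXact;

static KnownPreparedXact *KnownPreparedList;

/* PREPARE redo: remember where the record lives in WAL; nothing else. */
void
KnownPreparedAdd(TransactionId xid, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
	KnownPreparedXact *xact = malloc(sizeof(KnownPreparedXact));

	xact->xid = xid;
	xact->prepare_start_lsn = start_lsn;
	xact->prepare_end_lsn = end_lsn;
	xact->next = KnownPreparedList;
	KnownPreparedList = xact;
}

/* COMMIT/ABORT redo: drop the in-memory entry; true if one was found. */
bool
KnownPreparedRemove(TransactionId xid)
{
	KnownPreparedXact **p;

	for (p = &KnownPreparedList; *p != NULL; p = &(*p)->next)
	{
		if ((*p)->xid == xid)
		{
			KnownPreparedXact *dead = *p;

			*p = dead->next;
			free(dead);
			return true;
		}
	}
	return false;
}

/*
 * Checkpoint redo / end of recovery: "write to file" every entry whose
 * prepare record lies behind the redo horizon.  InvalidXLogRecPtr flushes
 * everything, as the end of recovery requires.  The actual file write and
 * fsync are mocked out; returns the number of entries flushed.
 */
int
KnownPreparedRecreateFiles(XLogRecPtr redo_horizon)
{
	KnownPreparedXact **p = &KnownPreparedList;
	int			flushed = 0;

	while (*p != NULL)
	{
		if (redo_horizon == InvalidXLogRecPtr ||
			(*p)->prepare_end_lsn < redo_horizon)
		{
			KnownPreparedXact *dead = *p;

			*p = dead->next;	/* a real build would write+fsync here */
			free(dead);
			flushed++;
		}
		else
			p = &(*p)->next;
	}
	return flushed;
}

/* Helper for inspection: number of still-pending entries. */
int
KnownPreparedCount(void)
{
	int			n = 0;

	for (KnownPreparedXact *x = KnownPreparedList; x != NULL; x = x->next)
		n++;
	return n;
}
```

The point of the structure is visible in the flush step: transactions that commit or abort before the next checkpoint never touch the filesystem at all.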

Re: [HACKERS] Speedup twophase transactions

2016-12-17 Thread Bruce Momjian
On Fri, Dec 16, 2016 at 02:00:46PM +0300, Stas Kelvich wrote:
> 
> > On 27 Sep 2016, at 03:30, Michael Paquier  wrote:
> > 
> > OK. I am marking this patch as returned with feedback then. Looking
> > forward to seeing the next investigations.. At least this review has
> > taught us one thing or two.
> 
> So, here is brand new implementation of the same thing.
> 
> Now instead of creating pgproc entry for prepared transaction during recovery,
> I just store recptr/xid correspondence in separate 2L-list and deleting 
> entries in that
> list if redo process faced commit/abort. In case of checkpoint or end of 
> recovery
> transactions remaining in that list are dumped to files in pg_twophase.
> 
> Seems that current approach is way simpler and patch has two times fewer
> LOCs than previous one.

Uh, did you mean to attached patch here?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Speedup twophase transactions

2016-12-16 Thread Stas Kelvich
> On 27 Sep 2016, at 03:30, Michael Paquier  wrote:
>
> OK. I am marking this patch as returned with feedback then. Looking
> forward to seeing the next investigations.. At least this review has
> taught us one thing or two.

So, here is brand new implementation of the same thing.

Now instead of creating pgproc entry for prepared transaction during recovery,
I just store recptr/xid correspondence in separate 2L-list and deleting
entries in that list if redo process faced commit/abort. In case of checkpoint
or end of recovery transactions remaining in that list are dumped to files in
pg_twophase.

Seems that current approach is way simpler and patch has two times fewer
LOCs than previous one.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

twophase_recovery_list.diff
Description: Binary data


Re: [HACKERS] Speedup twophase transactions

2016-09-26 Thread Michael Paquier
On Thu, Sep 22, 2016 at 12:30 AM, Stas Kelvich  wrote:
> I'm not giving up yet, I'll write them) I still have in mind several other
> patches to 2PC handling in Postgres during this release cycle — logical
> decoding and a partitioned hash instead of the TwoPhaseState list.
>
> My bet is that the relative speed of those patches will depend on the
> filesystem used. As it was with the first patch in this mail thread, it is
> totally possible to sometimes hit filesystem limits on file creation speed.
> Otherwise both approaches should be more or less equal, I suppose.

OK. I am marking this patch as returned with feedback then. Looking
forward to seeing the next investigations.. At least this review has
taught us one thing or two.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-09-21 Thread Stas Kelvich
> On 21 Sep 2016, at 10:32, Michael Paquier  wrote:
> 
> On Tue, Sep 20, 2016 at 11:13 PM, Stas Kelvich  
> wrote:
>> 
>> Putting that before actual WAL replay is just following historical order of 
>> events.
>> Prepared files are pieces of WAL that happened before checkpoint, so ISTM
>> there is no conceptual problem in restoring their state before replay.
> 
> For wal_level = minimal there is no need to call that to begin with..
> And I'd think that it is better to continue with things the same way.
> 

Hm, why?

>> 
>> With this patch we are reusing one infrastructure for normal work, recovery 
>> and replay.
> 
> The mix between normal work and recovery is the scary part. Normal
> work and recovery are entirely two different things.
> 

Okay, agreed. A commit procedure that checks whether recovery is active or not
is quite a hacky solution.

>> So taking into account my comments what do you think? Should we keep current 
>> approach
>> or try to avoid replaying prepare records at all?
> 
> I'd really like to give a try to the idea you just mentioned, aka
> delay the fsync of the 2PC files from RecreateTwoPhaseFile to
> CreateRestartPoint, or get things into memory.. I could write one, or
> both of those patches. I don't mind.

I'm not giving up yet, I'll write them) I still have in mind several other
patches to 2PC handling in Postgres during this release cycle — logical
decoding and a partitioned hash instead of the TwoPhaseState list.

My bet is that the relative speed of those patches will depend on the
filesystem used. As it was with the first patch in this mail thread, it is
totally possible to sometimes hit filesystem limits on file creation speed.
Otherwise both approaches should be more or less equal, I suppose.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-09-21 Thread Michael Paquier
On Tue, Sep 20, 2016 at 11:13 PM, Stas Kelvich  wrote:
>> On 20 Sep 2016, at 09:40, Michael Paquier  wrote:
>> +   StartupCLOG();
>> +   StartupSUBTRANS(checkPoint.oldestActiveXid);
>> +   RecoverPreparedFromFiles();
>> [...]
>>/*
>> -* Startup commit log and subtrans only.  MultiXact and commit
>> -* timestamp have already been started up and other SLRUs are not
>> -* maintained during recovery and need not be started yet.
>> -*/
>> -   StartupCLOG();
>> -   StartupSUBTRANS(oldestActiveXID);
>> Something is definitely wrong in this patch if we begin to do that.
>
> Putting that before actual WAL replay is just following historical order of 
> events.
> Prepared files are pieces of WAL that happened before checkpoint, so ISTM
> there is no conceptual problem in restoring their state before replay.

For wal_level = minimal there is no need to call that to begin with..
And I'd think that it is better to continue with things the same way.

>> I have been thinking about this patch quite a bit, and the approach
>> taken looks really over-complicated to me. Basically what we are
>> trying to do here is to reuse as much as possible code paths that are
>> being used by non-recovery code paths during recovery, and then try to
>> adapt a bunch of things like SUBTRANS structures, CLOG, XID handling
>> and PGXACT things in sync to handle the 2PC information in memory
>> correctly. I am getting to think that this is *really* bug-prone in
>> the future and really going to make maintenance hard.
>
> With this patch we are reusing one infrastructure for normal work, recovery 
> and replay.

The mix between normal work and recovery is the scary part. Normal
work and recovery are entirely two different things.

>> Taking one step back, and I know that I am a noisy one, but looking at
>> this patch is making me aware of the fact that it is trying more and
>> more to patch things around the more we dig into issues, and I'd
>> suspect that trying to adapt for example sub-transaction and clog
>> handling is just the tip of the iceberg and that there are more things
>> that need to be taken care of if we continue to move on with this
>> approach.
>
> Again, it isn’t patching around to fix issues, it’s totally possible to keep 
> old interface.
> However it’s possible that current approach is wrong because of some aspect
> that i didn’t think of, but now I don’t see good counterarguments.

Any missing points could be costly here. If we have a way to make
things with the same performance

>> Coming to the point: why don't we simplify things? In short:
>
>> - Just use a static list and allocate a chunk of shared memory as
>> needed. DSM could come into play here to adapt to the size of a 2PC
>> status file, this way there is no need to allocate a chunk of memory
>> with an upper bound. Still we could use an hardcoded upper-bound of
>> say 2k with max_prepared_transactions, and just write the 2PC status
>> file to disk if its size is higher than what can stick into memory.
>> - save 2PC information in memory instead of writing it to a 2PC file
>> when XLOG_XACT_PREPARE shows up.
>> - flush information to disk when there is a valid restart point in
>> RecoveryRestartPoint(). We would need as well to tweak
>> PrescanPreparedTransactions accordingly than everything we are trying
>> to do here.
>> That would be way more simple than what's proposed here, and we'd
>> still keep all the performance benefits.
>
> So file arrives to replica through WAL and instead of directly reading it you 
> suggest
> to copy it to DSM if it is small enough, and to filesystem if not. Probably 
> that will
> allow us to stay only around reading/writing files, but:
>
> * That will be measurably slower than proposed approach because of unnecessary
> copying between WAL and DSM. Not to mention prepare records of arbitrary 
> length.
> * Different behaviour on replica and master — less tested code for replica.

Well, the existing code on HEAD is battle-tested as well. This
reduces the likelihood of bugs at replay with new features.

> * Almost the same behaviour can be achieved by delaying fsync on 2pc files 
> till
> checkpoint.

Oh, that's an idea here. It may be interesting to see if this meets
your performance goals... And that could result in a far smaller
patch.

> But if I reword your comment in a way that it is better to avoid replaying
> prepare record at all,
> like it happens now in master, then I see the point. And that is possible
> even without DSM, it is possible
> to allocate static sized array storing some info about tx, whether it is in 
> the WAL or in file, xid, gid.
> Some sort of PGXACT doppelganger only for replay purposes instead of using 
> normal one.

That's exactly what I meant. The easiest approach is just to allocate
a chunk of shared memory made of 2PC_STATUS_SIZE * max_prepared_xacts
with 2PC_STATUS_SIZE 
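The "delay the fsync till checkpoint" alternative raised in the exchange above can be sketched the same way: write the state file during PREPARE redo but defer the fsync, cancel the pending sync when the transaction completes first, and batch the surviving fsyncs at the restartpoint. Everything below is a mock — counters and a small table stand in for real file I/O, and the names are illustrative, not PostgreSQL's API.

```c
/*
 * Model of deferring 2PC state-file fsyncs to the next restartpoint.
 * A fixed table of xids stands in for the set of written-but-unsynced
 * files; real I/O is replaced by counters.
 */
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define MAX_PENDING 64

static TransactionId pending_sync[MAX_PENDING];
static int	npending;
static int	fsyncs_issued;

/* Replay of PREPARE: write the state file but do not fsync it yet. */
void
TwoPhaseFileWrite(TransactionId xid)
{
	if (npending < MAX_PENDING)
		pending_sync[npending++] = xid;
}

/*
 * Replay of COMMIT/ABORT before any restartpoint: the file is removed,
 * so its deferred fsync can be skipped entirely.
 */
void
TwoPhaseFileRemove(TransactionId xid)
{
	for (int i = 0; i < npending; i++)
	{
		if (pending_sync[i] == xid)
		{
			pending_sync[i] = pending_sync[--npending];
			return;
		}
	}
}

/*
 * Restartpoint: fsync whatever survived, in one batch.  Returns the
 * number of (mocked) fsyncs issued.
 */
int
RecoveryRestartPointSync(void)
{
	int			done = npending;

	fsyncs_issued += npending;	/* one fsync per surviving file */
	npending = 0;
	return done;
}

int
TwoPhaseFsyncCount(void)
{
	return fsyncs_issued;
}
```

This captures the performance argument from the thread: short-lived prepared transactions that finish before the restartpoint cost a write but never an fsync.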

Re: [HACKERS] Speedup twophase transactions

2016-09-20 Thread Stas Kelvich
> 
> On 20 Sep 2016, at 09:40, Michael Paquier  wrote:
> 

Thanks for digging into this.

> +   StartupCLOG();
> +   StartupSUBTRANS(checkPoint.oldestActiveXid);
> +   RecoverPreparedFromFiles();
> [...]
>/*
> -* Startup commit log and subtrans only.  MultiXact and commit
> -* timestamp have already been started up and other SLRUs are not
> -* maintained during recovery and need not be started yet.
> -*/
> -   StartupCLOG();
> -   StartupSUBTRANS(oldestActiveXID);
> Something is definitely wrong in this patch if we begin to do that.

Putting that before actual WAL replay is just following historical order of 
events.
Prepared files are pieces of WAL that happened before checkpoint, so ISTM 
there is no conceptual problem in restoring their state before replay.

Moreover, I think that this approach is better than the old one.
There is a kind of circular dependency between StartupSUBTRANS() and restoring
old prepared transactions: StartupSUBTRANS requires oldestActiveXID, but you
can get it only after recovering the 2pc files, and that recovery requires a
working SUBTRANS.

In the old code that was resolved by two passes through the 2pc files: the
first one finds oldestActiveXmin but doesn't restore their state; then subtrans
is started, and only after that RecoverPreparedTransactions() is called. And
that logic was repeated three times in xlog.c with the help of the following
functions:
PrescanPreparedTransactions() — 3 calls
StandbyRecoverPreparedTransactions() — 2 calls
RecoverPreparedTransactions() — 1 call

Now, since we know that 2pc files are written only on checkpoint, we can boil
all those cases down to one: get oldestActiveXID from the checkpoint and
restore it before WAL replay. So only one call of RecoverPreparedFromFiles()
and one call of CleanupPreparedTransactionsOnPITR() in case of PITR. And each
of them does only what its name states, without side-effects like advancing
nextXid as in previous versions.

So, to summarise, I think keeping the old interfaces here is the bigger evil,
because in each of the mentioned functions we would need to deal with both the
in-memory and the file 2pc states.

> There is no need to move those two calls normally, and I think that we
> had better continue to do that only for hot standbys just to improve
> 2PC handling…

We can simulate the old interface, it's not a big problem. But that will
complicate the 2pc code and keep useless code in xlog.c, because it was
written on the assumption that a 2pc file can be created before a checkpoint,
and now that isn't true.

> CleanupPreparedTransactionsOnPITR() cleans up pg_twophase, but isn't
> it necessary to do something as well for what is in memory?

Yes, that is necessary. Thanks, will fix. And probably add prepared tx to PITR 
test.

> I have been thinking about this patch quite a bit, and the approach
> taken looks really over-complicated to me. Basically what we are
> trying to do here is to reuse as much as possible code paths that are
> being used by non-recovery code paths during recovery, and then try to
> adapt a bunch of things like SUBTRANS structures, CLOG, XID handling
> and PGXACT things in sync to handle the 2PC information in memory
> correctly. I am getting to think that this is *really* bug-prone in
> the future and really going to make maintenance hard.

With this patch we are reusing one infrastructure for normal work, recovery and 
replay.
I don't think that we will gain much reliability if we split that into
different code paths.

> Taking one step back, and I know that I am a noisy one, but looking at
> this patch is making me aware of the fact that it is trying more and
> more to patch things around the more we dig into issues, and I'd
> suspect that trying to adapt for example sub-transaction and clog
> handling is just the tip of the iceberg and that there are more things
> that need to be taken care of if we continue to move on with this
> approach.

Again, it isn’t patching around to fix issues, it’s totally possible to keep 
old interface.
However it’s possible that current approach is wrong because of some aspect
that i didn’t think of, but now I don’t see good counterarguments.

> Coming to the point: why don't we simplify things? In short:

> - Just use a static list and allocate a chunk of shared memory as
> needed. DSM could come into play here to adapt to the size of a 2PC
> status file, this way there is no need to allocate a chunk of memory
> with an upper bound. Still we could use an hardcoded upper-bound of
> say 2k with max_prepared_transactions, and just write the 2PC status
> file to disk if its size is higher than what can stick into memory.
> - save 2PC information in memory instead of writing it to a 2PC file
> when XLOG_XACT_PREPARE shows up.
> - flush information to disk when there is a valid restart point in
> RecoveryRestartPoint(). We would need as well to tweak
> PrescanPreparedTransactions accordingly than everything we are trying

Re: [HACKERS] Speedup twophase transactions

2016-09-20 Thread Michael Paquier
On Sat, Sep 17, 2016 at 2:45 AM, Stas Kelvich  wrote:
> Looking through old version I've noted a few things that were bothering me:
>
> * In case of crash replay PREPARE redo accesses SUBTRANS, but
> StartupSUBTRANS() isn't called yet in StartupXLOG().
> * Several functions in twophase.c have to walk through both PGPROC and
> pg_twophase directory.
>
> Now I slightly changed the order of things in StartupXLOG() so the twophase
> state is loaded from file and StartupSUBTRANS is called before actual
> recovery starts. So in all other functions we can be sure that the files
> were already loaded in memory.
>
> Also, since 2pc transactions now are dumped to files only on checkpoint, we
> can get rid of PrescanPreparedTransactions() — all the necessary info can
> reside in the checkpoint itself. I've changed the behaviour of the
> oldestActiveXid write in the checkpoint, and that's probably a discussable
> topic, but ISTM that simplifies the recovery logic a lot in both twophase.c
> and xlog.c. More comments on that in the patch itself.

Finally I am back on this one..

First be more careful about typos in the comments of your code. Just
reading through the patch I found quite a couple of places that needed
to be fixed. cosequent, unavalable, SUBTRANCE are some examples among
many other things..

+   StartupCLOG();
+   StartupSUBTRANS(checkPoint.oldestActiveXid);
+   RecoverPreparedFromFiles();
[...]
/*
-* Startup commit log and subtrans only.  MultiXact and commit
-* timestamp have already been started up and other SLRUs are not
-* maintained during recovery and need not be started yet.
-*/
-   StartupCLOG();
-   StartupSUBTRANS(oldestActiveXID);
Something is definitely wrong in this patch if we begin to do that.
There is no need to move those two calls normally, and I think that we
had better continue to do that only for hot standbys just to improve
2PC handling...

CleanupPreparedTransactionsOnPITR() cleans up pg_twophase, but isn't
it necessary to do something as well for what is in memory?

I have been thinking about this patch quite a bit, and the approach
taken looks really over-complicated to me. Basically what we are
trying to do here is to reuse as much as possible code paths that are
being used by non-recovery code paths during recovery, and then try to
adapt a bunch of things like SUBTRANS structures, CLOG, XID handling
and PGXACT things in sync to handle the 2PC information in memory
correctly. I am getting to think that this is *really* bug-prone in
the future and really going to make maintenance hard.

Taking one step back, and I know that I am a noisy one, but looking at
this patch is making me aware of the fact that it is trying more and
more to patch things around the more we dig into issues, and I'd
suspect that trying to adapt for example sub-transaction and clog
handling is just the tip of the iceberg and that there are more things
that need to be taken care of if we continue to move on with this
approach. Coming to the point: why don't we simplify things? In short:
- Just use a static list and allocate a chunk of shared memory as
needed. DSM could come into play here to adapt to the size of a 2PC
status file, this way there is no need to allocate a chunk of memory
with an upper bound. Still we could use an hardcoded upper-bound of
say 2k with max_prepared_transactions, and just write the 2PC status
file to disk if its size is higher than what can stick into memory.
- save 2PC information in memory instead of writing it to a 2PC file
when XLOG_XACT_PREPARE shows up.
- flush information to disk when there is a valid restart point in
RecoveryRestartPoint(). We would need as well to tweak
PrescanPreparedTransactions accordingly than everything we are trying
to do here.
That would be way more simple than what's proposed here, and we'd
still keep all the performance benefits.
-- 
Michael
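The fixed-slot scheme proposed in the message above — an in-memory area sized 2PC_STATUS_SIZE * max_prepared_xacts, spilling oversized status files to disk — can be sketched as follows. This is an illustrative model only: the constants and names are stand-ins, and a static array replaces real shared memory.

```c
/*
 * Sketch of a fixed array of in-memory 2PC status slots with disk
 * spill-over.  A status record replayed from XLOG_XACT_PREPARE is kept
 * in a free slot if it fits; otherwise (too large, or slots exhausted)
 * it would be written to a state file immediately.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define TWOPHASE_SLOT_SIZE 2048		/* stand-in for the 2k upper bound */
#define MAX_PREPARED_XACTS 2		/* stand-in for max_prepared_transactions */

typedef enum TwoPhaseWhere
{
	TWOPHASE_IN_MEMORY,
	TWOPHASE_ON_DISK
} TwoPhaseWhere;

typedef struct TwoPhaseSlot
{
	bool		in_use;
	TransactionId xid;
	size_t		status_len;
} TwoPhaseSlot;

static TwoPhaseSlot slots[MAX_PREPARED_XACTS];

/*
 * On XLOG_XACT_PREPARE redo: keep the status record in memory if it fits
 * in a free slot, otherwise fall back to a state file on disk.
 */
TwoPhaseWhere
TwoPhaseStoreAtRedo(TransactionId xid, size_t status_len)
{
	if (status_len <= TWOPHASE_SLOT_SIZE)
	{
		for (int i = 0; i < MAX_PREPARED_XACTS; i++)
		{
			if (!slots[i].in_use)
			{
				slots[i].in_use = true;
				slots[i].xid = xid;
				slots[i].status_len = status_len;
				return TWOPHASE_IN_MEMORY;
			}
		}
	}
	return TWOPHASE_ON_DISK;
}

/* On COMMIT/ABORT redo: release the slot, if the xid occupied one. */
bool
TwoPhaseReleaseAtRedo(TransactionId xid)
{
	for (int i = 0; i < MAX_PREPARED_XACTS; i++)
	{
		if (slots[i].in_use && slots[i].xid == xid)
		{
			slots[i].in_use = false;
			return true;
		}
	}
	return false;
}
```

Flushing the occupied slots to disk at a valid restartpoint, as the proposal describes, would simply walk the array the same way the release function does.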




Re: [HACKERS] Speedup twophase transactions

2016-09-16 Thread Stas Kelvich
> On 07 Sep 2016, at 11:07, Stas Kelvich  wrote:
> 
>> On 07 Sep 2016, at 03:09, Michael Paquier  wrote:
>> 
 On 06 Sep 2016, at 12:03, Michael Paquier  
 wrote:
 
 On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  
 wrote:
> Oh, I was preparing new version of patch, after fresh look on it. 
> Probably, I should
> said that in this topic. I’ve found a bug in sub transaction handling and 
> now working
> on fix.
 
 What's the problem actually?
>>> 
>>> Handling of xids_p array in PrescanPreparedTransactions() is wrong for 
>>> prepared tx's in memory.
>>> Also I want to double-check and add comments to RecoveryInProgress() checks 
>>> in FinishGXact.
>>> 
>>> I’ll post reworked patch in several days.
>> 
>> Could you use as a base the version I just sent yesterday then? I
>> noticed many mistakes in the comments, for example many s/it's/its/
>> and did a couple of adjustments around the code, the goto next_file
>> was particularly ugly. That will be that much work not to do again
>> later.
> 
> Yes. Already merged branch with your version.

Here is updated version of patch.

Looking through old version I've noted a few things that were bothering me:

* In case of crash replay PREPARE redo accesses SUBTRANS, but
StartupSUBTRANS() isn't called yet in StartupXLOG().
* Several functions in twophase.c have to walk through both PGPROC and
pg_twophase directory.

Now I slightly changed the order of things in StartupXLOG() so the twophase
state is loaded from file and StartupSUBTRANS is called before actual
recovery starts. So in all other functions we can be sure that the files
were already loaded in memory.

Also, since 2pc transactions now are dumped to files only on checkpoint, we
can get rid of PrescanPreparedTransactions() — all the necessary info can
reside in the checkpoint itself. I've changed the behaviour of the
oldestActiveXid write in the checkpoint, and that's probably a discussable
topic, but ISTM that simplifies the recovery logic a lot in both twophase.c
and xlog.c. More comments on that in the patch itself.



twophase_replay.v7.patch
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Re: [HACKERS] Speedup twophase transactions

2016-09-07 Thread Stas Kelvich
> On 07 Sep 2016, at 03:09, Michael Paquier  wrote:
> 
>>> On 06 Sep 2016, at 12:03, Michael Paquier  wrote:
>>> 
>>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  
>>> wrote:
 Oh, I was preparing new version of patch, after fresh look on it. 
 Probably, I should
 said that in this topic. I’ve found a bug in sub transaction handling and 
 now working
 on fix.
>>> 
>>> What's the problem actually?
>> 
>> Handling of xids_p array in PrescanPreparedTransactions() is wrong for 
>> prepared tx's in memory.
>> Also I want to double-check and add comments to RecoveryInProgress() checks 
>> in FinishGXact.
>> 
>> I’ll post reworked patch in several days.
> 
> Could you use as a base the version I just sent yesterday then? I
> noticed many mistakes in the comments, for example many s/it's/its/
> and did a couple of adjustments around the code, the goto next_file
> was particularly ugly. That will be that much work not to do again
> later.

Yes. Already merged branch with your version.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Michael Paquier
>> On 06 Sep 2016, at 12:03, Michael Paquier  wrote:
>>
>> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  
>> wrote:
>>> Oh, I was preparing new version of patch, after fresh look on it. Probably, 
>>> I should
>>> said that in this topic. I’ve found a bug in sub transaction handling and 
>>> now working
>>> on fix.
>>
>> What's the problem actually?
>
> Handling of xids_p array in PrescanPreparedTransactions() is wrong for 
> prepared tx's in memory.
> Also I want to double-check and add comments to RecoveryInProgress() checks 
> in FinishGXact.
>
> I’ll post reworked patch in several days.

Could you use as a base the version I just sent yesterday then? I
noticed many mistakes in the comments, for example many s/it's/its/
and did a couple of adjustments around the code, the goto next_file
was particularly ugly. That will be that much work not to do again
later.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Stas Kelvich

> On 06 Sep 2016, at 12:09, Simon Riggs  wrote:
> 
> On 6 September 2016 at 09:58, Stas Kelvich  wrote:
>> 
>> I'll check it against my failure scenario with subtransactions and post 
>> results or updated patch here.
> 
> Make sure tests are added for that. It would have been better to say
> you knew there were bugs in it earlier.

I spotted that yesterday during the rebase. Sorry. Next time in the same
situation I'll write right away to save everyone's time.

> On 06 Sep 2016, at 12:03, Michael Paquier  wrote:
> 
> On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  wrote:
>> Oh, I was preparing new version of patch, after fresh look on it. Probably, 
>> I should
>> said that in this topic. I’ve found a bug in sub transaction handling and 
>> now working
>> on fix.
> 
> What's the problem actually?

Handling of xids_p array in PrescanPreparedTransactions() is wrong for prepared 
tx's in memory.

Also I want to double-check and add comments to RecoveryInProgress() checks in 
FinishGXact.

I’ll post reworked patch in several days.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Simon Riggs
On 6 September 2016 at 09:58, Stas Kelvich  wrote:
>> On 06 Sep 2016, at 04:41, Michael Paquier  wrote:
>>
>> On Sat, Sep 3, 2016 at 10:26 PM, Michael Paquier
>>  wrote:
>>> On Fri, Sep 2, 2016 at 5:06 AM, Simon Riggs  wrote:
 On 13 April 2016 at 15:31, Stas Kelvich  wrote:

> Fixed patch attached. There already was infrastructure to skip currently
> held locks during replay of standby_redo() and I’ve extended that with 
> check for
> prepared xids.

 Please confirm that everything still works on current HEAD for the new
 CF, so review can start.
>>>
>>> The patch does not apply cleanly. Stas, could you rebase? I am
>>> switching the patch to "waiting on author" for now.
>>
>> So, I have just done the rebase myself and did an extra round of
>> reviews of the patch. Here are couple of comments after this extra
>> lookup.

> Oh, I was preparing new version of patch, after fresh look on it. Probably, I 
> should
> said that in this topic. I’ve found a bug in sub transaction handling and now 
> working
> on fix.

Not replying has wasted time and effort.

>> After review the result is attached. Perhaps a committer could get a look at 
>> it?
>
> I'll check it against my failure scenario with subtransactions and post 
> results or updated patch here.

Make sure tests are added for that. It would have been better to say
you knew there were bugs in it earlier.

This has been buggy so far, so I am hesitant about this now. I suggest
we add significant docs to explain how it works, so everybody can
double-check the concepts. Please also do what you can to reduce the
patch complexity.

I'll look at this again in two weeks time. Help me to make sure it
gets committed that time by doing a full and complete patch. Thanks.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Michael Paquier
On Tue, Sep 6, 2016 at 5:58 PM, Stas Kelvich  wrote:
>> On 06 Sep 2016, at 04:41, Michael Paquier  wrote:
>> This overlaps with TwoPhaseGetGXact but I'd rather keep both things
>> separated: it does not seem worth complicating the existing interface,
>> and putting that in cache during recovery has no meaning.
>
> Oh, I was preparing a new version of the patch after a fresh look at it.
> Probably I should have said that in this thread. I've found a bug in
> subtransaction handling and am now working on a fix.

What's the problem actually?
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Stas Kelvich
> On 06 Sep 2016, at 04:41, Michael Paquier  wrote:
> 
> On Sat, Sep 3, 2016 at 10:26 PM, Michael Paquier
>  wrote:
>> On Fri, Sep 2, 2016 at 5:06 AM, Simon Riggs  wrote:
>>> On 13 April 2016 at 15:31, Stas Kelvich  wrote:
>>> 
 Fixed patch attached. There already was infrastructure to skip currently
 held locks during replay of standby_redo(), and I’ve extended that with a
 check for prepared xids.
>>> 
>>> Please confirm that everything still works on current HEAD for the new
>>> CF, so review can start.
>> 
>> The patch does not apply cleanly. Stas, could you rebase? I am
>> switching the patch to "waiting on author" for now.
> 
> So, I have just done the rebase myself and did an extra round of
> reviews of the patch. Here are a couple of comments after this extra
> look.
> 
> LockGXactByXid() is aimed to be used only in recovery, so it seems
> adapted to add an assertion with RecoveryInProgress(). Using
> this routine in non-recovery code paths is risky because it assumes
> that a PGXACT could be missing, which is fine in recovery as prepared
> transactions could be moved to twophase files because of a checkpoint,
> but not in normal cases. We could also add an assertion of the kind
> gxact->locking_backend == InvalidBackendId before locking the PGXACT
> but as this routine is just normally used by the startup process (in
> non-standalone mode!) that's fine without.
> 
> The handling of invalidation messages and removal of relfilenodes done
> in FinishGXact can be grouped together, checking only once for
> !RecoveryInProgress().
> 
> + *
> + * The same procedure happens during replication and crash recovery.
>  *
> "during WAL replay" is more generic and applies here.
> 
> +
> +next_file:
> +   continue;
> +
> That can be removed and replaced by a "continue;".
> 
> +   /*
> +* At first check prepared tx that wasn't yet moved to disk.
> +*/
> +   LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
> +   for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
> +   {
> +   GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
> +   PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
> +
> +   if (TransactionIdEquals(pgxact->xid, xid))
> +   {
> +   LWLockRelease(TwoPhaseStateLock);
> +   return true;
> +   }
> +   }
> +   LWLockRelease(TwoPhaseStateLock);
> This overlaps with TwoPhaseGetGXact but I'd rather keep both things
> separated: it does not seem worth complicating the existing interface,
> and putting that in cache during recovery has no meaning.

Oh, I was preparing a new version of the patch after a fresh look at it.
Probably I should have said that in this thread. I’ve found a bug in
subtransaction handling and am now working on a fix.

> 
> I have also reworked the test format, and fixed many typos and grammar
> problems in the patch as well as in the tests.

Thanks!

> 
> After review the result is attached. Perhaps a committer could get a look at 
> it?

I'll check it against my failure scenario with subtransactions and post results
or an updated patch here.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-09-06 Thread Simon Riggs
On 6 September 2016 at 02:41, Michael Paquier  wrote:

> After review the result is attached. Perhaps a committer could get a look at 
> it?

Yes, will do, but it will be a few more days yet.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2016-09-05 Thread Michael Paquier
On Sat, Sep 3, 2016 at 10:26 PM, Michael Paquier
 wrote:
> On Fri, Sep 2, 2016 at 5:06 AM, Simon Riggs  wrote:
>> On 13 April 2016 at 15:31, Stas Kelvich  wrote:
>>
>>> Fixed patch attached. There already was infrastructure to skip currently
>>> held locks during replay of standby_redo(), and I’ve extended that with a
>>> check for prepared xids.
>>
>> Please confirm that everything still works on current HEAD for the new
>> CF, so review can start.
>
> The patch does not apply cleanly. Stas, could you rebase? I am
> switching the patch to "waiting on author" for now.

So, I have just done the rebase myself and did an extra round of
reviews of the patch. Here are a couple of comments after this extra
look.

LockGXactByXid() is aimed to be used only in recovery, so it seems
adapted to add an assertion with RecoveryInProgress(). Using
this routine in non-recovery code paths is risky because it assumes
that a PGXACT could be missing, which is fine in recovery as prepared
transactions could be moved to twophase files because of a checkpoint,
but not in normal cases. We could also add an assertion of the kind
gxact->locking_backend == InvalidBackendId before locking the PGXACT
but as this routine is just normally used by the startup process (in
non-standalone mode!) that's fine without.

The handling of invalidation messages and removal of relfilenodes done
in FinishGXact can be grouped together, checking only once for
!RecoveryInProgress().

+ *
+ * The same procedure happens during replication and crash recovery.
  *
"during WAL replay" is more generic and applies here.

+
+next_file:
+   continue;
+
That can be removed and replaced by a "continue;".

+   /*
+* At first check prepared tx that wasn't yet moved to disk.
+*/
+   LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+   for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+   {
+   GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+   PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+
+   if (TransactionIdEquals(pgxact->xid, xid))
+   {
+   LWLockRelease(TwoPhaseStateLock);
+   return true;
+   }
+   }
+   LWLockRelease(TwoPhaseStateLock);
This overlaps with TwoPhaseGetGXact but I'd rather keep both things
separated: it does not seem worth complicating the existing interface,
and putting that in cache during recovery has no meaning.

I have also reworked the test format, and fixed many typos and grammar
problems in the patch as well as in the tests.

After review, the result is attached. Perhaps a committer could take a look at it?
-- 
Michael
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9f55adc..eb7c339 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -45,8 +45,8 @@
  *		  fsynced
  *		* If COMMIT happens after checkpoint then backend reads state data from
  *		  files
- *		* In case of crash replay will move data from xlog to files, if that
- *		  hasn't happened before. XXX TODO - move to shmem in replay also
+ *
+ *		The same procedure happens during WAL replay.
  *
  *-
  */
@@ -578,6 +578,45 @@ LockGXact(const char *gid, Oid user)
 }
 
 /*
+ * LockGXactByXid
+ *
+ * Find prepared transaction by xid and lock corresponding GXACT.
+ * This is used during recovery as an alternative to LockGXact(), and
+ * should only be used in recovery. No entries found means that a checkpoint
+ * has moved the searched prepared transaction data to a twophase file.
+ *
+ * Returns the transaction data if found, or NULL if nothing has been locked.
+ */
+static GlobalTransaction
+LockGXactByXid(TransactionId xid)
+{
+	int		i;
+	GlobalTransaction gxact = NULL;
+
+	Assert(RecoveryInProgress());
+
+	LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		PGXACT	   *pgxact;
+
+		gxact = TwoPhaseState->prepXacts[i];
+		pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+
+		if (TransactionIdEquals(xid, pgxact->xid))
+		{
+			/* ok to lock it */
+			gxact->locking_backend = MyBackendId;
+			MyLockedGxact = gxact;
+			break;
+		}
+	}
+	LWLockRelease(TwoPhaseStateLock);
+
+	return gxact;
+}
+
+/*
  * RemoveGXact
  *		Remove the prepared transaction from the shared memory array.
  *
@@ -1241,9 +1280,9 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
  * Reads 2PC data from xlog. During checkpoint this data will be moved to
  * twophase files and ReadTwoPhaseFile should be used instead.
  *
- * Note clearly that this function accesses WAL during normal operation, similarly
- * to the way WALSender or Logical Decoding would do. It does not run during
- * crash recovery or standby processing.
+ * Note that this function accesses WAL not only during recovery but also
+ * during normal operation, similarly to the way WALSender or 

Re: [HACKERS] Speedup twophase transactions

2016-09-03 Thread Michael Paquier
On Fri, Sep 2, 2016 at 5:06 AM, Simon Riggs  wrote:
> On 13 April 2016 at 15:31, Stas Kelvich  wrote:
>
>> Fixed patch attached. There already was infrastructure to skip currently
>> held locks during replay of standby_redo(), and I’ve extended that with a
>> check for prepared xids.
>
> Please confirm that everything still works on current HEAD for the new
> CF, so review can start.

The patch does not apply cleanly. Stas, could you rebase? I am
switching the patch to "waiting on author" for now.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-09-01 Thread Simon Riggs
On 13 April 2016 at 15:31, Stas Kelvich  wrote:

> Fixed patch attached. There already was infrastructure to skip currently
> held locks during replay of standby_redo(), and I’ve extended that with a
> check for prepared xids.

Please confirm that everything still works on current HEAD for the new
CF, so review can start.

Thanks

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2016-05-20 Thread Michael Paquier
On Fri, May 20, 2016 at 12:46 PM, Alvaro Herrera
 wrote:
> Jesper Pedersen wrote:
>> Having discussed this with Noah off-list, I think we should revisit it for
>> 9.6 due to the async replica lag shown in [1]. The performance improvement
>> for the master node is shown in [2].
>
> I gave a very quick look and it seems to me far more invasive than we
> would normally consider in the beta period.  I would just put it to
> sleep till release 10 opens up.

I share the same opinion. Improving 2PC is definitely a huge win
thanks to the first patch that got committed for WAL generation,
but how invasive the second patch is really concerns me,
and I looked at the patch at length a couple of weeks back.
As we care about stability now regarding 9.6, let's bump the second
portion to 10.0 and keep the improvement for WAL generation in
9.6.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-05-20 Thread Alvaro Herrera
Jesper Pedersen wrote:

> Having discussed this with Noah off-list, I think we should revisit it for
> 9.6 due to the async replica lag shown in [1]. The performance improvement
> for the master node is shown in [2].

I gave a very quick look and it seems to me far more invasive than we
would normally consider in the beta period.  I would just put it to
sleep till release 10 opens up.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2016-05-20 Thread Jesper Pedersen

Hi,

On 04/13/2016 10:31 AM, Stas Kelvich wrote:

On 13 Apr 2016, at 01:04, Michael Paquier  wrote:
That's too late for 9.6 unfortunately, don't forget to add that in the next CF!


Fixed patch attached. There already was infrastructure to skip currently
held locks during replay of standby_redo(), and I’ve extended that with a
check for prepared xids.

The reason I’m still active on this thread is that I see real problems in
deploying 9.6 in its current state. Let me stress my concern: the current
state of things _WILL_BREAK_ async replication under a substantial load of
two-phase transactions on the master. And a lot of J2EE apps fall into that
category, as they wrap most of their transactions in prepare/commit. The
slave server will just keep increasing its lag and will never catch up. It
is possible to deal with that by switching to synchronous replication or
inserting triggers with pg_sleep on the master, but that doesn’t look like
normal system behaviour.



Having discussed this with Noah off-list, I think we should revisit it for
9.6 due to the async replica lag shown in [1]. The performance improvement
for the master node is shown in [2].


As I see it, there are 3 options to resolve this (in one way or another):

* Leave as is, document the behaviour in release notes/documentation
* Apply the patch for slaves
* Revert the changes done to the twophase.c during the 9.6 cycle.

All have pros/cons for the release.

The latest slave patch by Stas is at [3].

Thoughts from others on the matter would be appreciated.

[1] 
http://www.postgresql.org/message-id/e7497864-de11-4099-83f5-89fb97af5...@postgrespro.ru

[2] http://www.postgresql.org/message-id/5693f703.3000...@redhat.com
[3] https://commitfest.postgresql.org/10/523/

Best regards,
 Jesper





Re: [HACKERS] Speedup twophase transactions

2016-04-13 Thread Stas Kelvich
> On 13 Apr 2016, at 01:04, Michael Paquier  wrote:
> 
> On Wed, Apr 13, 2016 at 1:53 AM, Stas Kelvich  
> wrote:
>>> On 12 Apr 2016, at 15:47, Michael Paquier  wrote:
>>> 
>>> It looks to be the case... The PREPARE phase replayed after the
>>> standby is restarted in recovery creates a series of exclusive locks
>>> on the table created and those are not taken on HEAD. Once those are
>>> replayed the LOCK_STANDBY record is conflicting with it. In the case
>>> of the failure, the COMMIT PREPARED record cannot be fetched from
>>> master via the WAL stream so the relation never becomes visible.
>> 
>> Yep, it is. It is okay for a prepared xact to hold locks for
>> created/changed tables, but the code in standby_redo() was written on the
>> assumption that there are no prepared xacts at the time of recovery. I’ll
>> look closer at the checkpointer code and will send an updated patch.
>> 
>> And thanks again.
> 
> That's too late for 9.6 unfortunately, don't forget to add that in the next 
> CF!

Fixed patch attached. There already was infrastructure to skip currently
held locks during replay of standby_redo(), and I’ve extended that with a
check for prepared xids.

The reason I’m still active on this thread is that I see real problems in
deploying 9.6 in its current state. Let me stress my concern: the current
state of things _WILL_BREAK_ async replication under a substantial load of
two-phase transactions on the master. And a lot of J2EE apps fall into that
category, as they wrap most of their transactions in prepare/commit. The
slave server will just keep increasing its lag and will never catch up. It
is possible to deal with that by switching to synchronous replication or
inserting triggers with pg_sleep on the master, but that doesn’t look like
normal system behaviour.
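
The concern above reduces to a simple queueing argument: if the standby's
sustainable 2PC replay rate is lower than the master's 2PC commit rate, the
backlog grows without bound. A toy sketch of that argument (the rates below
are made-up illustrative numbers, not measurements of PostgreSQL):

```python
# Toy model of async replication lag when the standby replays two-phase
# transactions more slowly than the master generates them. The rates are
# illustrative assumptions, not benchmark numbers.

MASTER_TPS = 1000   # prepared+committed transactions per second on the master
STANDBY_TPS = 400   # transactions the standby can replay per second (slower
                    # if every PREPARE is written and fsynced to a separate
                    # twophase state file during replay)

def lag_after(seconds):
    """Outstanding (not yet replayed) transactions after the given time."""
    produced = MASTER_TPS * seconds
    replayed = min(produced, STANDBY_TPS * seconds)
    return produced - replayed

# The backlog grows linearly and never shrinks while the load continues:
assert lag_after(1) == 600
assert lag_after(10) == 6000
assert lag_after(60) == 36000
```

Synchronous replication avoids the unbounded backlog only by throttling the
master down to the standby's rate, which is why it is mentioned as a
workaround rather than a fix.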



twophase_replay.v5.patch
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Re: [HACKERS] Speedup twophase transactions

2016-04-12 Thread Michael Paquier
On Wed, Apr 13, 2016 at 1:53 AM, Stas Kelvich  wrote:
>> On 12 Apr 2016, at 15:47, Michael Paquier  wrote:
>>
>> It looks to be the case... The PREPARE phase replayed after the
>> standby is restarted in recovery creates a series of exclusive locks
>> on the table created and those are not taken on HEAD. Once those are
>> replayed the LOCK_STANDBY record is conflicting with it. In the case
>> of the failure, the COMMIT PREPARED record cannot be fetched from
>> master via the WAL stream so the relation never becomes visible.
>
> Yep, it is. It is okay for a prepared xact to hold locks for
> created/changed tables, but the code in standby_redo() was written on the
> assumption that there are no prepared xacts at the time of recovery. I’ll
> look closer at the checkpointer code and will send an updated patch.
>
> And thanks again.

That's too late for 9.6 unfortunately, don't forget to add that in the next CF!
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-04-12 Thread Stas Kelvich
> On 12 Apr 2016, at 15:47, Michael Paquier  wrote:
> 
> On Mon, Apr 11, 2016 at 7:16 PM, Stas Kelvich wrote:
>> Michael, it looks like you are the only person who can reproduce that
>> bug. I’ve tried on a bunch of OSes and didn’t observe that behaviour;
>> also, looking at your backtraces, I can’t tell who is holding this lock
>> (and all of that happens before the first prepare record is replayed).
> 
> Where did you try it? FWIW, I can reproduce that on Linux and OSX, though
> only manually:

Thanks a lot, Michael! Now I was able to reproduce that. It seems that when
you were performing the manual setup, the master instance issued a checkpoint,
but in my script that didn’t happen because of the shorter timing. There are
tests with a checkpoint between prepare/commit in the proposed test suite,
but none of them was issuing DDL.

> It looks to be the case... The PREPARE phase replayed after the
> standby is restarted in recovery creates a series of exclusive locks
> on the table created and those are not taken on HEAD. Once those are
> replayed the LOCK_STANDBY record is conflicting with it. In the case
> of the failure, the COMMIT PREPARED record cannot be fetched from
> master via the WAL stream so the relation never becomes visible.

Yep, it is. It is okay for a prepared xact to hold locks for created/changed
tables, but the code in standby_redo() was written on the assumption that
there are no prepared xacts at the time of recovery. I’ll look closer at the
checkpointer code and will send an updated patch.

And thanks again.

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-04-12 Thread Michael Paquier
On Mon, Apr 11, 2016 at 7:16 PM, Stas Kelvich wrote:
> Michael, it looks like you are the only person who can reproduce that
> bug. I’ve tried on a bunch of OSes and didn’t observe that behaviour;
> also, looking at your backtraces, I can’t tell who is holding this lock
> (and all of that happens before the first prepare record is replayed).

Where did you try it? FWIW, I can reproduce that on Linux and OSX, though
only manually:
1) Create a master and a streaming standby.
2) Run the following on master:
BEGIN;
CREATE TABLE foo (a int);
PREPARE TRANSACTION 'tx';
3) stop -m immediate the standby
4) COMMIT PREPARED 'tx' on master
5) Restart standby, and the node will wait for a lock

> Can you investigate it more? Particularly find out who holds the lock?

OK, so if I look at this backtrace that's a standby LOCK record, but
we already know that:
frame #9: 0x000107600383
postgres`LockAcquireExtended(locktag=0x7fff58a4b228, lockmode=8,
sessionLock='\x01', dontWait='\0', reportMemoryError='\0') + 2819 at
lock.c:998
frame #10: 0x0001075f9cd6
postgres`StandbyAcquireAccessExclusiveLock(xid=867, dbOid=16384,
relOid=16385) + 358 at standby.c:627
  * frame #11: 0x0001075fa23b
postgres`standby_redo(record=0x7f90a9841e38) + 251 at
standby.c:809
frame #12: 0x000107298d37 postgres`StartupXLOG + 9351 at xlog.c:6871

Here is the record pointer:
(lldb) p *record
(XLogReaderState) $4 = {
  read_page = 0x00010729b3c0 (postgres`XLogPageRead at xlog.c:10973)
  system_identifier = 6272572355656465658
  private_data = 0x7fff58a4bf40
  ReadRecPtr = 50424128
  EndRecPtr = 50424176
  decoded_record = 0x7f90a9843178
  main_data = 0x7f90a9804b78 "\x01"
  main_data_len = 16
  main_data_bufsz = 784
  record_origin = 0
  blocks = {
[0] = {
  in_use = '\0'
  rnode = (spcNode = 1663, dbNode = 16384, relNode = 2674)
  forknum = MAIN_FORKNUM
  blkno = 22
  flags = '\x10'
  has_image = '\0'
  bkp_image = 0x7f90a984826b "\x01"
  hole_offset = 892
  hole_length = 2076
  bimg_len = 6116
  bimg_info = '\x01'
  has_data = '\0'
  data = 0x7f90a98595d8 "\a"
  data_len = 0
  data_bufsz = 154
}

And in our case this corresponds to the record with LSN 0/03016940
that cannot take an exclusive LOCK:
rmgr: Transaction len (rec/tot):784/   813, tx:867, lsn:
0/03016610, prev 0/030165D8, desc: PREPARE
rmgr: Standby len (rec/tot): 16/42, tx:  0, lsn:
0/03016940, prev 0/03016610, desc: LOCK xid 867 db 16384 rel 16385
rmgr: Standby len (rec/tot): 28/54, tx:  0, lsn:
0/03016970, prev 0/03016940, desc: RUNNING_XACTS nextXid 868
latestCompletedXid 866 oldestRunningXid 867; 1 xacts: 867

There are two XID locks taken before that:
rmgr: Standby len (rec/tot): 16/42, tx:867, lsn:
0/03016578, prev 0/03014D40, desc: LOCK xid 867 db 16384 rel 16385
rmgr: Standby len (rec/tot): 16/42, tx:  0, lsn:
0/030165A8, prev 0/03016578, desc: LOCK xid 867 db 16384 rel 16385

And pg_locks on the standby is reporting that another lock has been
taken but not released:
=# select locktype, pid, mode, granted, fastpath from pg_locks where
relation = 16385;
 locktype |  pid  |mode | granted | fastpath
--+---+-+-+--
 relation | 68955 | AccessExclusiveLock | f   | f
 relation |  null | AccessExclusiveLock | t   | f
(2 rows)
In this case 68955 is the startup process trying to take the lock for
the LOCK record and it is not granted yet:
ioltas 68955   0.0  0.0  2593064   3228   ??  TXs   4:44PM   0:00.05
postgres: startup process   recovering 00010003
waiting

Now there is already a lock that has been taken and granted,
conflicting here... As the relation is only PREPARE'd and not yet
COMMIT PREPARED at this stage of the replay, isn't this the lock taken
during the PREPARE phase, which is then not released by your new code
paths?

[One LOCK_DEBUG build later...]

It looks to be the case... The PREPARE phase, replayed after the
standby is restarted in recovery, creates a series of exclusive locks
on the created table, and those are not taken on HEAD. Once those are
replayed, the LOCK_STANDBY record conflicts with them. In the case
of the failure, the COMMIT PREPARED record cannot be fetched from the
master via the WAL stream, so the relation never becomes visible.
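
The conflict described here can be modelled as a tiny lock table: the
exclusive lock re-acquired while replaying the PREPARE record is held under
one entry, so the later standalone LOCK record for the same relation cannot
be granted. This is only an illustrative model of the reported behaviour,
not PostgreSQL's lock manager, and all names below are invented:

```python
# Toy model of the reported standby lock conflict. Replaying the PREPARE
# record re-acquires the transaction's AccessExclusiveLock under one holder
# entry, so the standalone LOCK record for the same relation, replayed by
# the startup process, cannot be granted. Illustrative only.

class StandbyLockTable:
    def __init__(self):
        self.held = {}  # (db oid, rel oid) -> holder tag

    def acquire(self, db, rel, holder):
        """Return True if granted; False if the caller would have to wait."""
        key = (db, rel)
        if key in self.held and self.held[key] != holder:
            return False  # conflicting holder: the startup process blocks here
        self.held[key] = holder
        return True

locks = StandbyLockTable()
# PREPARE replay takes the exclusive lock on db 16384 / rel 16385:
assert locks.acquire(16384, 16385, holder="prepared-xact-867")
# The LOCK xid 867 record then cannot be granted under a different entry:
assert not locks.acquire(16384, 16385, holder="startup-process")
```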

Attached are two log files from a couple of runs; they are the logs of
the standby after being restarted in crash recovery:
- 2pc_master_logs.log, for HEAD.
- 2pc_patch_logs.log, with your last patch applied.
Feel free to have a look at them.
Regards,
-- 
Michael


2pc_master_logs.log
Description: Binary data


2pc_patch_logs.log
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2016-04-11 Thread Stas Kelvich
> On 08 Apr 2016, at 16:09, Stas Kelvich  wrote:
> 
> Tried on linux and os x 10.11 and 10.4.
> 
> Still can’t reproduce, but have played around with your backtrace.
> 
> I see in xlogdump on the slave the following sequence of records:
> 
> rmgr: Storage len (rec/tot): 16/42, tx:  0, lsn: 
> 0/03015AF0, prev 0/03015958, desc: CREATE base/12669/16387
> rmgr: Heaplen (rec/tot):  3/  1518, tx:867, lsn: 
> 0/03015B20, prev 0/03015AF0, desc: INSERT off 8, blkref #0: rel 
> 1663/12669/1247 blk 8 FPW
> <...>
> rmgr: Btree   len (rec/tot):  2/72, tx:867, lsn: 
> 0/03019CD0, prev 0/03019C88, desc: INSERT_LEAF off 114, blkref #0: rel 
> 1663/12669/2674 blk 22
> rmgr: Standby len (rec/tot): 16/42, tx:867, lsn: 
> 0/03019D18, prev 0/03019CD0, desc: LOCK xid 867 db 12669 rel 16387 
> rmgr: Transaction len (rec/tot):784/   813, tx:867, lsn: 
> 0/03019D48, prev 0/03019D18, desc: PREPARE 
> rmgr: Transaction len (rec/tot):380/   409, tx:  0, lsn: 
> 0/0301A090, prev 0/03019D48, desc: COMMIT_PREPARED 867: 2016-04-08 
> 14:38:33.347851 MSK;
> 
> It looks like you got stuck on the LOCK xid 867 record even before replaying
> the PREPARE record, so I don't have many ideas on how that could be caused
> by changing the procedure of PREPARE replay.
> 
> Just to keep things sane, here is my current diff:
> 
> 

Michael, it looks like you are the only person who can reproduce that bug.
I’ve tried on a bunch of OSes and didn’t observe that behaviour; also,
looking at your backtraces, I can’t tell who is holding this lock (and all
of that happens before the first prepare record is replayed).

Can you investigate it more? Particularly find out who holds the lock?

Here is the latest version of the patch:



twophase_replay.v4.patch
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Noah Misch
On Fri, Apr 08, 2016 at 02:57:00PM -0400, Jesper Pedersen wrote:
> On 04/08/2016 02:37 PM, Robert Haas wrote:
> >On Fri, Apr 8, 2016 at 8:49 AM, Jesper Pedersen  
> >wrote:
> >>Should we create an entry for the open item list [0] for this, due to the
> >>replication lag [1] ?
> >>
> >>CommitFest entry [2]
> >>Original commit [3]
> >>
> >>Cc'ed RMT.
> >
> >If there is something you think needs to be fixed that is a new issue
> >in 9.6, then yes you should.  I don't quite understand what thing is
> >from reading this, so please make sure to describe it clearly.
> >
> 
> Michael, you seem to have the necessary permission for this. Could you add
> an entry ?

Everyone may edit the list; follow
https://wiki.postgresql.org/wiki/WikiEditing to set up access.




Fwd: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Stas Kelvich

> On 08 Apr 2016, at 21:55, Jesper Pedersen  wrote:
> 
> On 04/08/2016 02:42 PM, Robert Haas wrote:
>> On Tue, Jan 26, 2016 at 7:43 AM, Stas Kelvich  
>> wrote:
>>> Hi,
>>> 
>>> Thanks for reviews and commit!
>> 
>> I apologize for being clueless here, but was this patch committed?
>> It's still marked as "Needs Review" in the CommitFest application.
>> 
> 
> There are 2 parts to this - both in the same email thread.
> 
> Part 1 [0] dealt with 2-phase commits on the master node. Part 2 [1] deals 
> with replaying on slaves, which currently shows lag.
> 
> There is still an open item found by Michael, so part 2 isn't ready to be 
> moved to "Ready for Committer" yet.
> 
> [0] 
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e
> [1] https://commitfest.postgresql.org/9/523/
> 
> Best regards,
> Jesper
> 

By the way, Jesper, can you please try to run the tests in the diff that I’ve
sent today? It includes the test scenario that was causing problems for
Michael, but works fine on all systems that I have access to.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Stas Kelvich

> On 08 Apr 2016, at 21:42, Robert Haas  wrote:
> 
> On Tue, Jan 26, 2016 at 7:43 AM, Stas Kelvich  
> wrote:
>> Hi,
>> 
>> Thanks for reviews and commit!
> 
> I apologize for being clueless here, but was this patch committed?
> It's still marked as "Needs Review" in the CommitFest application.

There was a patch to skip twophase file creation when there was no checkpoint
between PREPARE and COMMIT, and that patch was committed.
But that patch didn’t touch anything in replay, so 2PC replay is significantly
slower than 2PC in normal mode, and that can cause constantly increasing
replication lag for async replication.
After that I wrote a new patch introducing the same behaviour in replay and
used the same mail thread. Now Michael has found a (heisen)bug in the second
patch that I can’t reproduce.
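
The division of labour being described can be sketched as a small state
model: 2PC state stays in WAL (and shmem) until a checkpoint moves it to a
twophase state file, and COMMIT PREPARED checks the in-memory side first
(on the master per the committed patch, and during replay per the proposed
one). The class and names below are invented for illustration; the real
code lives in src/backend/access/transam/twophase.c:

```python
# Sketch of where 2PC state data lives under the scheme discussed in this
# thread. Invented illustration, not the actual PostgreSQL data structures.

class TwoPhaseStore:
    def __init__(self):
        self.wal = {}    # xid -> state still only in WAL (and shmem)
        self.files = {}  # xid -> state moved to a twophase file

    def prepare(self, xid, data):
        # PREPARE: state is logged to WAL; no state file is created yet.
        self.wal[xid] = data

    def checkpoint(self):
        # CHECKPOINT: move state of still-pending transactions to files,
        # so the WAL segments holding it can be recycled.
        self.files.update(self.wal)
        self.wal.clear()

    def commit_prepared(self, xid):
        # COMMIT PREPARED: use the WAL/shmem copy if no checkpoint
        # intervened, otherwise fall back to the twophase file. The same
        # lookup order would apply during replay under the proposed patch.
        if xid in self.wal:
            return self.wal.pop(xid)
        return self.files.pop(xid)

store = TwoPhaseStore()
store.prepare(867, "state-a")
assert store.commit_prepared(867) == "state-a"  # fast path: no file read

store.prepare(868, "state-b")
store.checkpoint()                              # forces file creation
assert store.commit_prepared(868) == "state-b"  # slow path: file read
```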

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Jesper Pedersen

On 04/08/2016 02:37 PM, Robert Haas wrote:

On Fri, Apr 8, 2016 at 8:49 AM, Jesper Pedersen
 wrote:

On 04/07/2016 02:29 AM, Michael Paquier wrote:

So recovery is conflicting here. My guess is that this patch is
missing some lock cleanup.

With the test case attached, in my case the COMMIT PREPARED record does
not even get replayed.



Should we create an entry for the open item list [0] for this, due to the
replication lag [1]?

CommitFest entry [2]
Original commit [3]

Cc'ed RMT.


If there is something you think needs to be fixed that is a new issue
in 9.6, then yes you should.  I don't quite understand what thing is
from reading this, so please make sure to describe it clearly.



Michael, you seem to have the necessary permission for this. Could you
add an entry?


Best regards,
 Jesper





Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Jesper Pedersen

On 04/08/2016 02:42 PM, Robert Haas wrote:

On Tue, Jan 26, 2016 at 7:43 AM, Stas Kelvich  wrote:

Hi,

Thanks for reviews and commit!


I apologize for being clueless here, but was this patch committed?
It's still marked as "Needs Review" in the CommitFest application.



There are 2 parts to this - both in the same email thread.

Part 1 [0] dealt with 2-phase commits on the master node. Part 2 [1] 
deals with replaying on slaves, which currently shows lag.


There is still an open item found by Michael, so part 2 isn't ready to 
be moved to "Ready for Committer" yet.


[0] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e

[1] https://commitfest.postgresql.org/9/523/

Best regards,
 Jesper




