Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Robert Haas
On Tue, Jan 26, 2016 at 7:43 AM, Stas Kelvich  wrote:
> Hi,
>
> Thanks for reviews and commit!

I apologize for being clueless here, but was this patch committed?
It's still marked as "Needs Review" in the CommitFest application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Robert Haas
On Fri, Apr 8, 2016 at 8:49 AM, Jesper Pedersen
 wrote:
> On 04/07/2016 02:29 AM, Michael Paquier wrote:
>> So recovery is conflicting here. My guess is that this patch is
>> missing some lock cleanup.
>>
>> With the test case attached in my case the COMMIT PREPARED record does
>> not even get replayed.
>>
>
> Should we create an entry in the open items list [0] for this, due to the
> replication lag [1]?
>
> CommitFest entry [2]
> Original commit [3]
>
> Cc'ed RMT.

If there is something you think needs to be fixed that is a new issue
in 9.6, then yes, you should.  I don't quite understand what that thing is
from reading this, so please make sure to describe it clearly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Stas Kelvich
> On 08 Apr 2016, at 16:36, Simon Riggs  wrote:
> 
> On 7 April 2016 at 07:29, Michael Paquier  wrote:
> 
> With the test case attached in my case the COMMIT PREPARED record does
> not even get replayed.
> 
> I was surprised to see this in the test...
> 
>sleep 2; # wait for changes to arrive on slave 
> 
> I think the test framework needs a WaitForLSN function to allow us to know 
> for certain that something has been delivered.

Yes, the test framework already has that function. That was just a quick script
to reproduce the bug that Michael faced.
If there is a deterministic way to reproduce that bug, I'll rework it and
move it to 00X_twophase.pl

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company






Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Simon Riggs
On 7 April 2016 at 07:29, Michael Paquier  wrote:

With the test case attached in my case the COMMIT PREPARED record does
> not even get replayed.
>

I was surprised to see this in the test...

   sleep 2; # wait for changes to arrive on slave

I think the test framework needs a WaitForLSN function to allow us to know
for certain that something has been delivered.
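For illustration, such a helper could poll the standby's replay location until it passes a target LSN captured on the master. This is only a sketch against the 9.6-era PostgresNode TAP API; the name wait_for_lsn and the polling details are assumptions, not committed code.

```perl
# Hypothetical helper: block until $standby has replayed up to $target_lsn.
# Assumes the 9.6-era PostgresNode::psql(dbname, sql, stdout => \$out) API.
sub wait_for_lsn
{
	my ($standby, $target_lsn) = @_;
	my $caught_up = '';
	while (1)
	{
		$standby->psql('postgres',
			"SELECT pg_last_xlog_replay_location() >= '$target_lsn'::pg_lsn",
			stdout => \$caught_up);
		last if $caught_up eq 't';
		sleep 1;    # Time::HiRes::usleep would allow a finer interval
	}
}

# Usage in a test, replacing a bare "sleep 2":
# $node_master->psql('postgres', 'SELECT pg_current_xlog_location()',
# 	stdout => \$lsn);
# wait_for_lsn($node_slave, $lsn);
```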

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Stas Kelvich
> On 07 Apr 2016, at 09:29, Michael Paquier  wrote:
> 
> relOid=16385) + 358 at standby.c:627
>frame #11: 0x0001023dac6b
> postgres`standby_redo(record=0x7fde50841e38) + 251 at
> standby.c:809
> 
> (LOCALLOCK) $1 = {
>  tag = {
>lock = {
>  locktag_field1 = 16384
>  locktag_field2 = 16385
>  locktag_field3 = 0
>  locktag_field4 = 0
>  locktag_type = '\0'
>  locktag_lockmethodid = '\x01'
>}
>mode = 8
>  }
> 
> =# select relname, oid from pg_class where oid > 16000;
> relname |  oid

Tried on Linux and OS X 10.11 and 10.4.

Still can't reproduce it, but I have played around with your backtrace.

In the pg_xlogdump output on the slave I see the following sequence of records:

rmgr: Storage len (rec/tot): 16/42, tx:  0, lsn: 
0/03015AF0, prev 0/03015958, desc: CREATE base/12669/16387
rmgr: Heap        len (rec/tot):  3/  1518, tx:867, lsn: 
0/03015B20, prev 0/03015AF0, desc: INSERT off 8, blkref #0: rel 1663/12669/1247 
blk 8 FPW
<...>
rmgr: Btree   len (rec/tot):  2/72, tx:867, lsn: 
0/03019CD0, prev 0/03019C88, desc: INSERT_LEAF off 114, blkref #0: rel 
1663/12669/2674 blk 22
rmgr: Standby len (rec/tot): 16/42, tx:867, lsn: 
0/03019D18, prev 0/03019CD0, desc: LOCK xid 867 db 12669 rel 16387 
rmgr: Transaction len (rec/tot):784/   813, tx:867, lsn: 
0/03019D48, prev 0/03019D18, desc: PREPARE 
rmgr: Transaction len (rec/tot):380/   409, tx:  0, lsn: 
0/0301A090, prev 0/03019D48, desc: COMMIT_PREPARED 867: 2016-04-08 
14:38:33.347851 MSK;

It looks like you got stuck on the LOCK xid 867 record even before replaying the
PREPARE record, so I don't have many ideas on how that could be caused by
changing the procedures of PREPARE replay.

Just to keep things sane, here is my current diff:



twophase_replay.v4.patch
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Re: [HACKERS] Speedup twophase transactions

2016-04-08 Thread Jesper Pedersen

On 04/07/2016 02:29 AM, Michael Paquier wrote:

So recovery is conflicting here. My guess is that this patch is
missing some lock cleanup.

With the test case attached in my case the COMMIT PREPARED record does
not even get replayed.



Should we create an entry in the open items list [0] for this, due to
the replication lag [1]?


CommitFest entry [2]
Original commit [3]

Cc'ed RMT.

[0] https://wiki.postgresql.org/wiki/PostgreSQL_9.6_Open_Items
[1] 
http://www.postgresql.org/message-id/e7497864-de11-4099-83f5-89fb97af5...@postgrespro.ru

[2] https://commitfest.postgresql.org/9/523/
[3] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e


Best regards,
 Jesper





Re: [HACKERS] Speedup twophase transactions

2016-04-07 Thread Michael Paquier
On Wed, Apr 6, 2016 at 6:47 PM, Stas Kelvich  wrote:
>> On Apr 2, 2016, at 3:14 AM, Michael Paquier  
>> wrote:
>>
>> On Fri, Apr 1, 2016 at 10:53 PM, Stas Kelvich  
>> wrote:
>>> I wrote:
 While testing the patch, I found a bug in the recovery conflict code
 path. You can do the following to reproduce it:
 1) Start a master with a standby
 2) prepare a transaction on master
 3) Stop immediate on standby to force replay
 4) commit prepare transaction on master
 5) When starting the standby, it remains stuck here:
>>>
>>> Hm, I wasn’t able to reproduce that. Do you mean the following scenario, or am I 
>>> missing something?
>>>
>>> (async replication)
>>>
>>> $node_master->psql('postgres', "
>>>begin;
>>>insert into t values (1);
>>>prepare transaction 'x';
>>> ");
>>> $node_slave->teardown_node;
>>> $node_master->psql('postgres',"commit prepared 'x'");
>>> $node_slave->start;
>>> $node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", 
>>> stdout => \$psql_out);
>>> is($psql_out, '0', "Commit prepared on master while slave is down.");
>>
>> Actually, not exactly, the transaction prepared on master created a
>> table. Sorry for the lack of precisions in my review.
>
> Sorry for delay.
>
> Actually I can’t reproduce that again; tried with the following tx:

Well not for me, here are more details, with a test case attached:
* thread #1: tid = 0x50c5b, 0x7fff93822f96
libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread',
stop reason = signal SIGSTOP
  * frame #0: 0x7fff93822f96 libsystem_kernel.dylib`poll + 10
frame #1: 0x0001023cdda3
postgres`WaitEventSetWaitBlock(set=0x7fde50858f28, cur_timeout=-1,
occurred_events=0x7fff5dc87cf8, nevents=1) + 51 at latch.c:1102
frame #2: 0x0001023cd786
postgres`WaitEventSetWait(set=0x7fde50858f28, timeout=-1,
occurred_events=0x7fff5dc87cf8, nevents=1) + 390 at latch.c:935
frame #3: 0x0001023cd227
postgres`WaitLatchOrSocket(latch=0x00010b9b12e4, wakeEvents=1,
sock=-1, timeout=-1) + 343 at latch.c:347
frame #4: 0x0001023cd0ca
postgres`WaitLatch(latch=0x00010b9b12e4, wakeEvents=1, timeout=0)
+ 42 at latch.c:302
frame #5: 0x0001023eb306 postgres`ProcWaitForSignal + 38 at proc.c:1731
frame #6: 0x0001023da24b
postgres`ResolveRecoveryConflictWithLock(locktag=LOCKTAG at
0x7fff5dc87db8) + 187 at standby.c:391
frame #7: 0x0001023ea408
postgres`ProcSleep(locallock=0x7fde5083dac8,
lockMethodTable=0x00010286e278) + 1128 at proc.c:1215
frame #8: 0x0001023e25e6
postgres`WaitOnLock(locallock=0x7fde5083dac8,
owner=0x) + 358 at lock.c:1703
frame #9: 0x0001023e0cf3
postgres`LockAcquireExtended(locktag=0x7fff5dc88208, lockmode=8,
sessionLock='\x01', dontWait='\0', reportMemoryError='\0') + 2819 at
lock.c:998
frame #10: 0x0001023da706
postgres`StandbyAcquireAccessExclusiveLock(xid=867, dbOid=16384,
relOid=16385) + 358 at standby.c:627
frame #11: 0x0001023dac6b
postgres`standby_redo(record=0x7fde50841e38) + 251 at
standby.c:809

(LOCALLOCK) $1 = {
  tag = {
lock = {
  locktag_field1 = 16384
  locktag_field2 = 16385
  locktag_field3 = 0
  locktag_field4 = 0
  locktag_type = '\0'
  locktag_lockmethodid = '\x01'
}
mode = 8
  }

=# select relname, oid from pg_class where oid > 16000;
 relname |  oid
---------+-------
 aa      | 16385
(1 row)
So recovery is conflicting here. My guess is that this patch is
missing some lock cleanup.

With the test case attached in my case the COMMIT PREPARED record does
not even get replayed.
-- 
Michael
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 2;

# Setup master node
my $node_master = get_new_node("master");
$node_master->init(allows_streaming => 1);
$node_master->append_conf('postgresql.conf', qq(
max_prepared_transactions = 10
));
$node_master->start;
$node_master->backup('master_backup');
$node_master->psql('postgres', "create table t(id int)");

# Setup slave node
my $node_slave = get_new_node('slave');
$node_slave->init_from_backup($node_master, 'master_backup', has_streaming => 1);
$node_slave->start;

my $psql_out = '';
my $psql_rc = '';

$node_master->psql('postgres', "
	begin;
	create table t2(id int);
	prepare transaction 'x';
");
sleep 2; # wait for changes to arrive on slave
$node_slave->teardown_node;
$node_master->psql('postgres',"commit prepared 'x'");
$node_slave->start;
sleep 10; # wait for changes to arrive on slave
$node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", stdout => \$psql_out);
is($psql_out, '0', "Commit prepared on master while slave is down.");
$node_slave->psql('postgres',"select sum(id) from t2", stdout => \$psql_out);
is($psql_out, '0', "Check that tx changes are visible.");
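The two sleep calls are the nondeterministic part of this test. As a hedged sketch (assuming the same PostgresNode API as the script above), the second one could be replaced by polling the standby until the COMMIT PREPARED record has actually been replayed:

```perl
# Hedged sketch: instead of "sleep 10", poll until the standby has no
# prepared transactions left, i.e. COMMIT PREPARED has been replayed.
# The 60-iteration cap is an arbitrary safety limit, not original code.
my $n_prepared = '';
foreach my $attempt (1 .. 60)
{
	$node_slave->psql('postgres',
		'SELECT count(*) FROM pg_prepared_xacts',
		stdout => \$n_prepared);
	last if $n_prepared eq '0';
	sleep 1;
}
# ...then run the existing is() checks as before.
```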


Re: [HACKERS] Speedup twophase transactions

2016-04-06 Thread Stas Kelvich
> On Apr 2, 2016, at 3:14 AM, Michael Paquier  wrote:
> 
> On Fri, Apr 1, 2016 at 10:53 PM, Stas Kelvich  
> wrote:
>> I wrote:
>>> While testing the patch, I found a bug in the recovery conflict code
>>> path. You can do the following to reproduce it:
>>> 1) Start a master with a standby
>>> 2) prepare a transaction on master
>>> 3) Stop immediate on standby to force replay
>>> 4) commit prepare transaction on master
>>> 5) When starting the standby, it remains stuck here:
>> 
>> Hm, I wasn’t able to reproduce that. Do you mean the following scenario, or am I 
>> missing something?
>> 
>> (async replication)
>> 
>> $node_master->psql('postgres', "
>>begin;
>>insert into t values (1);
>>prepare transaction 'x';
>> ");
>> $node_slave->teardown_node;
>> $node_master->psql('postgres',"commit prepared 'x'");
>> $node_slave->start;
>> $node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", 
>> stdout => \$psql_out);
>> is($psql_out, '0', "Commit prepared on master while slave is down.");
> 
> Actually, not exactly, the transaction prepared on master created a
> table. Sorry for the lack of precisions in my review.

Sorry for delay.

Actually I can’t reproduce that again; tried with the following tx:

begin;
insert into t values(0);
create table t1(id int);
insert into t1 values(1);
create table t2(id int);
insert into t2 values(2);
savepoint s1;
drop table t1;
select * from t for update;
select * from t2 for share;
prepare transaction 'x’;

use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 2;

# Setup master node
my $node_master = get_new_node("master");
$node_master->init(allows_streaming => 1);
$node_master->append_conf('postgresql.conf', qq(
max_prepared_transactions = 10
));
$node_master->start;
$node_master->backup('master_backup');
$node_master->psql('postgres', "create table t(id int)");

# Setup slave node
my $node_slave = get_new_node('slave');
$node_slave->init_from_backup($node_master, 'master_backup', has_streaming => 1);
$node_slave->start;

my $psql_out = '';
my $psql_rc = '';

$node_master->psql('postgres', "
	begin;
	insert into t values(0);
	create table t1(id int);
	insert into t1 values(1);
	create table t2(id int);
	insert into t2 values(2);
	savepoint s1;
	drop table t1;
	select * from t for update;
	select * from t2 for share;
	prepare transaction 'x';
");
sleep 2; # wait for changes to arrive on slave
$node_slave->teardown_node;
$node_master->psql('postgres',"commit prepared 'x'");
$node_slave->start;
$node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", stdout => \$psql_out);
is($psql_out, '0', "Commit prepared on master while slave is down.");
$node_slave->psql('postgres',"select sum(id) from t2", stdout => \$psql_out);
is($psql_out, '2', "Check that tx changes are visible.");


-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company





Re: [HACKERS] Speedup twophase transactions

2016-04-01 Thread Michael Paquier
On Fri, Apr 1, 2016 at 10:53 PM, Stas Kelvich  wrote:
> I wrote:
>> While testing the patch, I found a bug in the recovery conflict code
>> path. You can do the following to reproduce it:
>> 1) Start a master with a standby
>> 2) prepare a transaction on master
>> 3) Stop immediate on standby to force replay
>> 4) commit prepare transaction on master
>> 5) When starting the standby, it remains stuck here:
>
> Hm, I wasn’t able to reproduce that. Do you mean the following scenario, or am I 
> missing something?
>
> (async replication)
>
> $node_master->psql('postgres', "
> begin;
> insert into t values (1);
> prepare transaction 'x';
> ");
> $node_slave->teardown_node;
> $node_master->psql('postgres',"commit prepared 'x'");
> $node_slave->start;
> $node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", stdout 
> => \$psql_out);
> is($psql_out, '0', "Commit prepared on master while slave is down.");

Actually, not exactly, the transaction prepared on master created a
table. Sorry for the lack of precisions in my review.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-04-01 Thread Stas Kelvich
> On Apr 1, 2016, at 10:04 AM, Michael Paquier  
> wrote:
> 
> I would suggest the following name modifications; node names were introduced
> to help track each node's log:
> - Candie => master
> - Django => slave or just standby
> There is no need for complication! And each node's log file is
> prefixed by the test number. Note that in other tests no promotions or
> fallbacks are done, but we stick with node names that represent the
> initial state of the cluster.

Ok, let’s reflect the initial state in the node names. So master and standby then.

> 
>> +# Switch to synchronous replication
>> +$node_master->append_conf('postgresql.conf', qq(
>> +synchronous_standby_names = '*'
>> +));
>> +$node_master->restart;
>> Reloading would be fine.
>> 
>> Good catch, done.
> 
> +$node_master->psql('postgres', "select pg_reload_conf()");
> 
> It would be cleaner to introduce a new routine in PostgresNode that
> calls pg_ctl reload (mentioned in the N-sync patch as well, that would
> be useful for many purposes).

Okay.

> 
>> All accesses to 2pc dummies in ProcArray are covered with TwoPhaseStateLock,
>> so I think that’s safe. Also I’ve deleted the comment above that block; probably
>> it’s more confusing than descriptive.
> 
> OK, you removed the use of allProcs. Though, reading the code again, just
> holding TwoPhaseStateLock is actually fine.
> 
> The patch needs a small cleanup:
> $ git diff master --check
> src/test/recovery/t/006_twophase.pl:224: new blank line at EOF.
> 
> 006_twophase.pl should be renamed to 007. It keeps its license to
> kill, and gains in being famous.

Huh, eventually there will be a Fleming reference instead of a Tarantino one
in the node names.

> 
> - * Scan the pg_twophase directory and setup all the required information to
> - * allow standby queries to treat prepared transactions as still active.
> - * This is never called at the end of recovery - we use
> - * RecoverPreparedTransactions() at that point.
> + * It's a caller responsibility to call MarkAsPrepared() on returned gxact.
>  *
> Wouldn't it be more simple to just call MarkAsPrepared at the end of
> RecoverPreparedFromBuffer?

I did that intentionally to allow modification of gxact before unlock.

RecoverPreparedFromXLOG:
gxact = RecoverPreparedFromBuffer((char *) XLogRecGetData(record), 
false);
gxact->prepare_start_lsn = record->ReadRecPtr;
gxact->prepare_end_lsn = record->EndRecPtr;
MarkAsPrepared(gxact);

RecoverPreparedFromFiles:
gxact = RecoverPreparedFromBuffer(buf, forceOverwriteOK);
gxact->ondisk = true;
MarkAsPrepared(gxact);

While both functions are only called during recovery, I think it is better to
write it in a way that also makes it usable in a multiprocess environment.

> 
> While testing the patch, I found a bug in the recovery conflict code
> path. You can do the following to reproduce it:
> 1) Start a master with a standby
> 2) prepare a transaction on master
> 3) Stop immediate on standby to force replay
> 4) commit prepare transaction on master
> 5) When starting the standby, it remains stuck here:

Hm, I wasn’t able to reproduce that. Do you mean the following scenario, or am I 
missing something?

(async replication)

$node_master->psql('postgres', "
begin;
insert into t values (1);
prepare transaction 'x';
");
$node_slave->teardown_node;
$node_master->psql('postgres',"commit prepared 'x'");
$node_slave->start;
$node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", stdout 
=> \$psql_out);
is($psql_out, '0', "Commit prepared on master while slave is down.");


> * thread #1: tid = 0x229b4, 0x7fff8e2d4f96
> libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread',
> stop reason = signal SIGSTOP
>  * frame #0: 0x7fff8e2d4f96 libsystem_kernel.dylib`poll + 10
>frame #1: 0x000107e5e043
> postgres`WaitEventSetWaitBlock(set=0x7f90c20596a8, cur_timeout=-1,
> occurred_events=0x7fff581efd28, nevents=1) + 51 at latch.c:1102
>frame #2: 0x000107e5da26
> postgres`WaitEventSetWait(set=0x7f90c20596a8, timeout=-1,
> occurred_events=0x7fff581efd28, nevents=1) + 390 at latch.c:935
>frame #3: 0x000107e5d4c7
> postgres`WaitLatchOrSocket(latch=0x000111432464, wakeEvents=1,
> sock=-1, timeout=-1) + 343 at latch.c:347
>frame #4: 0x000107e5d36a
> postgres`WaitLatch(latch=0x000111432464, wakeEvents=1, timeout=0)
> + 42 at latch.c:302
>frame #5: 0x000107e7b5a6 postgres`ProcWaitForSignal + 38 at proc.c:1731
>frame #6: 0x000107e6a4eb
> postgres`ResolveRecoveryConflictWithLock(locktag=LOCKTAG at
> 0x7fff581efde8) + 187 at standby.c:391
>frame #7: 0x000107e7a6a8
> postgres`ProcSleep(locallock=0x7f90c203dac8,
> lockMethodTable=0x0001082f6218) + 1128 at proc.c:1215
>frame #8: 0x000107e72886
> postgres`WaitOnLock(locallock=0x7f90c203dac8,
> owner=0x) + 358 at 

Re: [HACKERS] Speedup twophase transactions

2016-04-01 Thread Michael Paquier
On Wed, Mar 30, 2016 at 10:19 PM, Stas Kelvich  wrote:
> Hm, it’s hard to create descriptive names because test changes master/slave
> roles for that nodes several times during test.

Really? The names used in the patch help even less, then.

> It’s possible to call them
> “node1” and “node2” but I’m not sure that improves anything. But anyway I’m
> not insisting on those particular names and will agree with any reasonable
> suggestion.

I would suggest the following name modifications; node names were introduced
to help track each node's log:
- Candie => master
- Django => slave or just standby
There is no need for complication! And each node's log file is
prefixed by the test number. Note that in other tests no promotions or
fallbacks are done, but we stick with node names that represent the
initial state of the cluster.

> +# Switch to synchronous replication
> +$node_master->append_conf('postgresql.conf', qq(
> +synchronous_standby_names = '*'
> +));
> +$node_master->restart;
> Reloading would be fine.
>
> Good catch, done.

+$node_master->psql('postgres', "select pg_reload_conf()");

It would be cleaner to introduce a new routine in PostgresNode that
calls pg_ctl reload (mentioned in the N-sync patch as well, that would
be useful for many purposes).
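Such a routine could look roughly like this. It is a hedged sketch modeled on the existing PostgresNode start/restart methods; the method name and internals are assumptions, not an actual patch.

```perl
# Hypothetical PostgresNode method: signal the server to reload its
# configuration via "pg_ctl reload" instead of a full restart.
sub reload
{
	my ($self) = @_;
	my $pgdata = $self->data_dir;
	my $name   = $self->name;
	print "### Reloading node \"$name\"\n";
	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, 'reload');
}

# A test would then simply call:
# $node_master->reload;
```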

> All accesses to 2pc dummies in ProcArray are covered with TwoPhaseStateLock,
so I think that’s safe. Also I’ve deleted the comment above that block; probably
it’s more confusing than descriptive.

OK, you removed the use of allProcs. Though, reading the code again, just
holding TwoPhaseStateLock is actually fine.

The patch needs a small cleanup:
$ git diff master --check
src/test/recovery/t/006_twophase.pl:224: new blank line at EOF.

006_twophase.pl should be renamed to 007. It keeps its license to
kill, and gains in being famous.

- * Scan the pg_twophase directory and setup all the required information to
- * allow standby queries to treat prepared transactions as still active.
- * This is never called at the end of recovery - we use
- * RecoverPreparedTransactions() at that point.
+ * It's a caller responsibility to call MarkAsPrepared() on returned gxact.
  *
Wouldn't it be more simple to just call MarkAsPrepared at the end of
RecoverPreparedFromBuffer?

While testing the patch, I found a bug in the recovery conflict code
path. You can do the following to reproduce it:
1) Start a master with a standby
2) prepare a transaction on master
3) Stop immediate on standby to force replay
4) commit prepare transaction on master
5) When starting the standby, it remains stuck here:
* thread #1: tid = 0x229b4, 0x7fff8e2d4f96
libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread',
stop reason = signal SIGSTOP
  * frame #0: 0x7fff8e2d4f96 libsystem_kernel.dylib`poll + 10
frame #1: 0x000107e5e043
postgres`WaitEventSetWaitBlock(set=0x7f90c20596a8, cur_timeout=-1,
occurred_events=0x7fff581efd28, nevents=1) + 51 at latch.c:1102
frame #2: 0x000107e5da26
postgres`WaitEventSetWait(set=0x7f90c20596a8, timeout=-1,
occurred_events=0x7fff581efd28, nevents=1) + 390 at latch.c:935
frame #3: 0x000107e5d4c7
postgres`WaitLatchOrSocket(latch=0x000111432464, wakeEvents=1,
sock=-1, timeout=-1) + 343 at latch.c:347
frame #4: 0x000107e5d36a
postgres`WaitLatch(latch=0x000111432464, wakeEvents=1, timeout=0)
+ 42 at latch.c:302
frame #5: 0x000107e7b5a6 postgres`ProcWaitForSignal + 38 at proc.c:1731
frame #6: 0x000107e6a4eb
postgres`ResolveRecoveryConflictWithLock(locktag=LOCKTAG at
0x7fff581efde8) + 187 at standby.c:391
frame #7: 0x000107e7a6a8
postgres`ProcSleep(locallock=0x7f90c203dac8,
lockMethodTable=0x0001082f6218) + 1128 at proc.c:1215
frame #8: 0x000107e72886
postgres`WaitOnLock(locallock=0x7f90c203dac8,
owner=0x) + 358 at lock.c:1703
frame #9: 0x000107e70f93
postgres`LockAcquireExtended(locktag=0x7fff581f0238, lockmode=8,
sessionLock='\x01', dontWait='\0', reportMemoryError='\0') + 2819 at
lock.c:998
frame #10: 0x000107e6a9a6
postgres`StandbyAcquireAccessExclusiveLock(xid=863, dbOid=16384,
relOid=16385) + 358 at standby.c:627
frame #11: 0x000107e6af0b
postgres`standby_redo(record=0x7f90c2041e38) + 251 at
standby.c:809
frame #12: 0x000107b0e227 postgres`StartupXLOG + 9351 at xlog.c:6871
It seems that the replay of the in-memory state of the PREPARE transaction
is conflicting directly with replay.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-03-30 Thread Jesper Pedersen

On 03/30/2016 09:19 AM, Stas Kelvich wrote:

> +++ b/src/test/recovery/t/006_twophase.pl
> @@ -0,0 +1,226 @@
> +# Checks for recovery_min_apply_delay
> +use strict;
> This description is wrong, this file has been copied from 005.

Yep, done.

>
> +my $node_master = get_new_node("Candie");
> +my $node_slave = get_new_node('Django');
> Please let's use node names that are more descriptive.

Hm, it’s hard to create descriptive names because the test changes master/slave
roles for those nodes several times during the test. It’s possible to call them
“node1” and “node2” but I’m not sure that improves anything. But anyway I’m not
insisting on those particular names and will agree with any reasonable 
suggestion.

>
> +# Switch to synchronous replication
> +$node_master->append_conf('postgresql.conf', qq(
> +synchronous_standby_names = '*'
> +));
> +$node_master->restart;
> Reloading would be fine.

Good catch, done.

>
> +   /* During replay that lock isn't really necessary, but let's take
> it anyway */
> +   LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> +   for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
> +   {
> +   gxact = TwoPhaseState->prepXacts[i];
> +   proc = &ProcGlobal->allProcs[gxact->pgprocno];
> +   pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
> +
> +   if (TransactionIdEquals(xid, pgxact->xid))
> +   {
> +   gxact->locking_backend = MyBackendId;
> +   MyLockedGxact = gxact;
> +   break;
> +   }
> +   }
> +   LWLockRelease(TwoPhaseStateLock);
> Not taking ProcArrayLock here?

All accesses to 2pc dummies in ProcArray are covered with TwoPhaseStateLock, so
I think that’s safe. Also I’ve deleted the comment above that block; probably
it’s more confusing than descriptive.

>
> The comment at the top of XlogReadTwoPhaseData is incorrect.

Yep, fixed.

>
> RecoverPreparedFromXLOG and RecoverPreparedFromFiles have a lot of
> code in common, having this duplication is not good, and you could
> simplify your patch.

I reworked patch to avoid duplicated code between
RecoverPreparedFromXLOG/RecoverPreparedFromFiles and also
between FinishPreparedTransaction/XlogRedoFinishPrepared.




Patch applies with hunks, and test cases are passing.

Best regards,
 Jesper





Re: [HACKERS] Speedup twophase transactions

2016-03-30 Thread Stas Kelvich
On Mar 29, 2016, at 6:04 PM, David Steele  wrote:
> It looks like you should post a new patch or respond to Michael's comments.
> Marked as "waiting on author".

Yep, here it is.

On Mar 22, 2016, at 4:20 PM, Michael Paquier  wrote:
> Looking at this patch…

Thanks.

> +++ b/src/test/recovery/t/006_twophase.pl
> @@ -0,0 +1,226 @@
> +# Checks for recovery_min_apply_delay
> +use strict;
> This description is wrong, this file has been copied from 005.

Yep, done.

> +my $node_master = get_new_node("Candie");
> +my $node_slave = get_new_node('Django');
> Please let's use node names that are more descriptive.

Hm, it’s hard to create descriptive names because the test changes master/slave
roles for those nodes several times during the test. It’s possible to call them
“node1” and “node2” but I’m not sure that improves anything. But anyway I’m not
insisting on those particular names and will agree with any reasonable
suggestion.

> +# Switch to synchronous replication
> +$node_master->append_conf('postgresql.conf', qq(
> +synchronous_standby_names = '*'
> +));
> +$node_master->restart;
> Reloading would be fine.

Good catch, done.

> +   /* During replay that lock isn't really necessary, but let's take it anyway */
> +   LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
> +   for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
> +   {
> +   gxact = TwoPhaseState->prepXacts[i];
> +   proc = &ProcGlobal->allProcs[gxact->pgprocno];
> +   pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
> +
> +   if (TransactionIdEquals(xid, pgxact->xid))
> +   {
> +   gxact->locking_backend = MyBackendId;
> +   MyLockedGxact = gxact;
> +   break;
> +   }
> +   }
> +   LWLockRelease(TwoPhaseStateLock);
> Not taking ProcArrayLock here?

All accesses to 2pc dummies in ProcArray are covered with TwoPhaseStateLock,
so I think that’s safe. Also I’ve deleted the comment above that block;
probably it’s more confusing than descriptive.

> The comment at the top of XlogReadTwoPhaseData is incorrect.

Yep, fixed.

> RecoverPreparedFromXLOG and RecoverPreparedFromFiles have a lot of
> code in common, having this duplication is not good, and you could
> simplify your patch.

I reworked the patch to avoid duplicated code between
RecoverPreparedFromXLOG/RecoverPreparedFromFiles and also between
FinishPreparedTransaction/XlogRedoFinishPrepared.

twophase_replay.v3.diff
Description: Binary data

-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Speedup twophase transactions

2016-03-29 Thread David Steele

Hi Stas,

On 3/22/16 9:20 AM, Michael Paquier wrote:


Not taking ProcArrayLock here?

The comment at the top of XlogReadTwoPhaseData is incorrect.

RecoverPreparedFromXLOG and RecoverPreparedFromFiles have a lot of
code in common, having this duplication is not good, and you could
simplify your patch.


It looks like you should post a new patch or respond to Michael's 
comments.  Marked as "waiting on author".


Thanks,
--
-David
da...@pgmasters.net




Re: [HACKERS] Speedup twophase transactions

2016-03-22 Thread Michael Paquier
On Tue, Mar 22, 2016 at 1:53 AM, Jesper Pedersen
 wrote:
> On 03/18/2016 12:50 PM, Stas Kelvich wrote:
>>>
>>> On 11 Mar 2016, at 19:41, Jesper Pedersen 
>>> wrote:
>>>
>>
>> Thanks for review, Jesper.
>>
>>> Some comments:
>>>
>>> * The patch needs a rebase against the latest TwoPhaseFileHeader change
>>
>>
>> Done.
>>
>>> * Rework the check.sh script into a TAP test case (src/test/recovery), as
>>> suggested by Alvaro and Michael down thread
>>
>>
>> Done. Originally I thought about reducing the number of tests (11 right now),
>> but now, after some debugging, I’m more convinced that it is better to
>> include them all, as they are really testing different code paths.
>>
>>> * Add documentation for RecoverPreparedFromXLOG
>>
>>
>> Done.
>
>
> Thanks, Stas.
>
> I have gone over this version, and tested with --enable-tap-tests + make
> check in src/test/recovery, which passes.
>
> Simon, do you want to move this entry to "Ready for Committer" and take the
> 2REVIEWER section as part of that, or leave it in "Needs Review" and update
> the thread ?

Looking at this patch

+++ b/src/test/recovery/t/006_twophase.pl
@@ -0,0 +1,226 @@
+# Checks for recovery_min_apply_delay
+use strict;
This description is wrong, this file has been copied from 005.

+my $node_master = get_new_node("Candie");
+my $node_slave = get_new_node('Django');
Please let's use node names that are more descriptive.

+# Switch to synchronous replication
+$node_master->append_conf('postgresql.conf', qq(
+synchronous_standby_names = '*'
+));
+$node_master->restart;
Reloading would be fine.

+   /* During replay that lock isn't really necessary, but let's take
it anyway */
+   LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+   for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+   {
+   gxact = TwoPhaseState->prepXacts[i];
+   proc = &ProcGlobal->allProcs[gxact->pgprocno];
+   pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+
+   if (TransactionIdEquals(xid, pgxact->xid))
+   {
+   gxact->locking_backend = MyBackendId;
+   MyLockedGxact = gxact;
+   break;
+   }
+   }
+   LWLockRelease(TwoPhaseStateLock);
Not taking ProcArrayLock here?

The comment at the top of XlogReadTwoPhaseData is incorrect.

RecoverPreparedFromXLOG and RecoverPreparedFromFiles have a lot of
code in common, having this duplication is not good, and you could
simplify your patch.
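For illustration, the shape of the suggested deduplication could be roughly this (a minimal sketch with toy names and types, not the actual twophase.c API): both entry points parse the same record layout, so the shared work can live in one helper that takes an already-read buffer, with thin wrappers deciding where that buffer comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the 2PC record header; the real code parses a
 * TwoPhaseFileHeader and rebuilds GXACT/PROC/PGPROC state instead. */
typedef struct ToyRecord
{
    unsigned int xid;
} ToyRecord;

static unsigned int last_recovered_xid = 0;

/* Common body shared by both recovery entry points. */
static int
recover_prepared_common(const char *buf, size_t len)
{
    ToyRecord rec;

    if (len < sizeof(rec))
        return -1;              /* truncated record */
    memcpy(&rec, buf, sizeof(rec));
    last_recovered_xid = rec.xid;   /* stands in for recreating shmem state */
    return 0;
}

/* Wrapper: buffer came from a pg_twophase state file. */
static int
recover_prepared_from_files(const char *statefile_buf, size_t len)
{
    return recover_prepared_common(statefile_buf, len);
}

/* Wrapper: buffer came from reading the WAL record back. */
static int
recover_prepared_from_xlog(const char *wal_buf, size_t len)
{
    return recover_prepared_common(wal_buf, len);
}
```

The real helper would of course validate the CRC and rebuild the in-memory transaction state rather than just recording an xid.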
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-03-21 Thread Jesper Pedersen

On 03/18/2016 12:50 PM, Stas Kelvich wrote:

On 11 Mar 2016, at 19:41, Jesper Pedersen  wrote:



Thanks for review, Jesper.


Some comments:

* The patch needs a rebase against the latest TwoPhaseFileHeader change


Done.


* Rework the check.sh script into a TAP test case (src/test/recovery), as 
suggested by Alvaro and Michael down thread


Done. Originally I thought about reducing number of tests (11 right now), but 
now, after some debugging, I’m more convinced that it is better to include them 
all, as they are really testing different code paths.


* Add documentation for RecoverPreparedFromXLOG


Done.


Thanks, Stas.

I have gone over this version, and tested with --enable-tap-tests + make 
check in src/test/recovery, which passes.


Simon, do you want to move this entry to "Ready for Committer" and take 
the 2REVIEWER section as part of that, or leave it in "Needs Review" 
and update the thread ?


Best regards,
 Jesper





Re: [HACKERS] Speedup twophase transactions

2016-03-19 Thread David Steele

On 1/26/16 3:39 PM, Stas Kelvich wrote:

Agree, I had the same idea in my mind when was writing that script.

I will migrate it to TAP suite and write a review for Michael Paquier's patch.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



On 26 Jan 2016, at 20:20, Alvaro Herrera  wrote:

Stas Kelvich wrote:


While this patch touches quite a sensitive part of postgres replay and some 
rarely used code paths, I wrote a shell script to set up master/slave 
replication and test different failure scenarios that can happen with the 
instances. I am attaching this file to show the test scenarios that I have 
tested and, more importantly, what I didn't test. In particular, I failed to 
reproduce the situation where StandbyTransactionIdIsPrepared() is called; 
maybe somebody can suggest a way to force its usage. Also I'm not too sure 
about the necessity of calling cache invalidation callbacks during 
XlogRedoFinishPrepared(); I've marked this place in the patch with a 
2REVIEWER comment.


I think this is the third thread in which I say this: We need to push
Michael Paquier's recovery test framework, then convert your test script
to that.  That way we can put your tests in core.


It seems this thread has been waiting quite a while on a new patch.  If 
one doesn't appear by Monday I will mark this "returned with feedback".


--
-David
da...@pgmasters.net




Re: [HACKERS] Speedup twophase transactions

2016-03-18 Thread Stas Kelvich

> On 11 Mar 2016, at 19:41, Jesper Pedersen  wrote:
> 

Thanks for review, Jesper.

> Some comments:
> 
> * The patch needs a rebase against the latest TwoPhaseFileHeader change

Done.

> * Rework the check.sh script into a TAP test case (src/test/recovery), as 
> suggested by Alvaro and Michael down thread

Done. Originally I thought about reducing number of tests (11 right now), but 
now, after some debugging, I’m more convinced that it is better to include them 
all, as they are really testing different code paths.

> * Add documentation for RecoverPreparedFromXLOG

Done.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



twophase_replay.v2.diff
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2016-03-11 Thread Jesper Pedersen

On 01/26/2016 07:43 AM, Stas Kelvich wrote:

Thanks for reviews and commit!

   As Simon and Andres already mentioned in this thread, replay of a twophase 
transaction is significantly slower than the same operations in normal mode. 
The major reason is that each state file is fsynced during replay, and while 
that is not a problem for recovery, it is a problem for replication. Under a 
high 2PC update load the lag between master and async replica is constantly 
increasing (see graph below).

   One way to improve things is to move fsyncs to restartpoints, but as we saw 
previously that is a half-measure, and frequent fopen calls alone can cause a 
bottleneck.

   Another option is to use for replay the same scenario that is already used 
in non-recovery mode: read state files into memory during replay of the 
prepare, and if a checkpoint/restartpoint occurs between prepare and commit, 
move the data to files. On commit we can read the xlog or the files. So here 
is the patch that implements this scenario for replay.

   The patch is quite straightforward. During replay of prepare records 
RecoverPreparedFromXLOG() is called to create the in-memory state in GXACT, 
PROC, and PGPROC; on commit XlogRedoFinishPrepared() is called to clean up 
that state. Also, several functions (PrescanPreparedTransactions, 
StandbyTransactionIdIsPrepared) assumed that during replay all prepared xacts 
have files in pg_twophase, so I have extended them to check GXACT too.
   A side effect of that behaviour is that prepared xacts become visible in 
the pg_prepared_xacts view on the slave.

While this patch touches quite a sensitive part of postgres replay and some 
rarely used code paths, I wrote a shell script to set up master/slave 
replication and test different failure scenarios that can happen with the 
instances. I am attaching this file to show the test scenarios that I have 
tested and, more importantly, what I didn't test. In particular, I failed to 
reproduce the situation where StandbyTransactionIdIsPrepared() is called; 
maybe somebody can suggest a way to force its usage. Also I'm not too sure 
about the necessity of calling cache invalidation callbacks during 
XlogRedoFinishPrepared(); I've marked this place in the patch with a 
2REVIEWER comment.

Tests show that this patch increases the speed of 2PC replay to the level 
where the replica can keep pace with the master.

Graph: replica lag under a pgbench run for 200 seconds with 2PC update transactions (80 
connections, one update per 2PC tx, two servers with 12 cores each, 10GbE interconnect) 
on current master and with the suggested patch. Replica lag was measured with "select 
sent_location-replay_location as delay from pg_stat_replication;" each second.



Some comments:

* The patch needs a rebase against the latest TwoPhaseFileHeader change
* Rework the check.sh script into a TAP test case (src/test/recovery), 
as suggested by Alvaro and Michael down thread

* Add documentation for RecoverPreparedFromXLOG

+* that xlog record. We need just to clen up memmory state.

'clean' + 'memory'

+* This is usually called after end-of-recovery checkpoint, so all 2pc
+* files moved xlog to files. But if we restart slave when master is
+* switched off this function will be called before checkpoint ans we 
need
+* to check PGXACT array as it can contain prepared transactions that
+* didn't created any state files yet.

=>

"We need to check the PGXACT array for prepared transactions that 
doesn't have any state file in case of a slave restart with the master 
being off."


+* prepare xlog resords in shared memory in the same way as it 
happens

'records'

+* We need such behaviour because speed of 2PC replay on 
replica should
+* be at least not slower than 2PC tx speed on master.

=>

"We need this behaviour because the speed of the 2PC replay on the 
replica should be at least the same as the 2PC transaction speed of the 
master."


I'll leave the 2REVIEWER section to Simon.

Best regards,
 Jesper





Re: [HACKERS] Speedup twophase transactions

2016-01-26 Thread Alvaro Herrera
Stas Kelvich wrote:

> While this patch touches quite a sensitive part of postgres replay and some 
> rarely used code paths, I wrote a shell script to set up master/slave 
> replication and test different failure scenarios that can happen with the 
> instances. I am attaching this file to show the test scenarios that I have 
> tested and, more importantly, what I didn't test. In particular, I failed to 
> reproduce the situation where StandbyTransactionIdIsPrepared() is called; 
> maybe somebody can suggest a way to force its usage. Also I'm not too sure 
> about the necessity of calling cache invalidation callbacks during 
> XlogRedoFinishPrepared(); I've marked this place in the patch with a 
> 2REVIEWER comment.

I think this is the third thread in which I say this: We need to push
Michael Paquier's recovery test framework, then convert your test script
to that.  That way we can put your tests in core.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2016-01-26 Thread Stas Kelvich
Agree, I had the same idea in my mind when was writing that script.

I will migrate it to TAP suite and write a review for Michael Paquier's patch.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 26 Jan 2016, at 20:20, Alvaro Herrera  wrote:
> 
> Stas Kelvich wrote:
> 
>> While this patch touches quite a sensitive part of postgres replay and some 
>> rarely used code paths, I wrote a shell script to set up master/slave 
>> replication and test different failure scenarios that can happen with the 
>> instances. I am attaching this file to show the test scenarios that I have 
>> tested and, more importantly, what I didn't test. In particular, I failed to 
>> reproduce the situation where StandbyTransactionIdIsPrepared() is called; 
>> maybe somebody can suggest a way to force its usage. Also I'm not too sure 
>> about the necessity of calling cache invalidation callbacks during 
>> XlogRedoFinishPrepared(); I've marked this place in the patch with a 
>> 2REVIEWER comment.
> 
> I think this is the third thread in which I say this: We need to push
> Michael Paquier's recovery test framework, then convert your test script
> to that.  That way we can put your tests in core.
> 
> -- 
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] Speedup twophase transactions

2016-01-26 Thread Michael Paquier
On Wed, Jan 27, 2016 at 5:39 AM, Stas Kelvich  wrote:
> Agree, I had the same idea in my mind when was writing that script.
> I will migrate it to TAP suite and write a review for Michael Paquier's patch.

Yeah, please! And you have won a free-hug coupon that I can give in
person next week.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-01-26 Thread Stas Kelvich
Hi,

Thanks for reviews and commit!

  As Simon and Andres already mentioned in this thread, replay of a twophase 
transaction is significantly slower than the same operations in normal mode. 
The major reason is that each state file is fsynced during replay, and while 
that is not a problem for recovery, it is a problem for replication. Under a 
high 2PC update load the lag between master and async replica is constantly 
increasing (see graph below).

  One way to improve things is to move fsyncs to restartpoints, but as we saw 
previously that is a half-measure, and frequent fopen calls alone can cause a 
bottleneck.

  Another option is to use for replay the same scenario that is already used 
in non-recovery mode: read state files into memory during replay of the 
prepare, and if a checkpoint/restartpoint occurs between prepare and commit, 
move the data to files. On commit we can read the xlog or the files. So here 
is the patch that implements this scenario for replay.

  The patch is quite straightforward. During replay of prepare records 
RecoverPreparedFromXLOG() is called to create the in-memory state in GXACT, 
PROC, and PGPROC; on commit XlogRedoFinishPrepared() is called to clean up 
that state. Also, several functions (PrescanPreparedTransactions, 
StandbyTransactionIdIsPrepared) assumed that during replay all prepared xacts 
have files in pg_twophase, so I have extended them to check GXACT too.
  A side effect of that behaviour is that prepared xacts become visible in 
the pg_prepared_xacts view on the slave.

While this patch touches quite a sensitive part of postgres replay and some 
rarely used code paths, I wrote a shell script to set up master/slave 
replication and test different failure scenarios that can happen with the 
instances. I am attaching this file to show the test scenarios that I have 
tested and, more importantly, what I didn't test. In particular, I failed to 
reproduce the situation where StandbyTransactionIdIsPrepared() is called; 
maybe somebody can suggest a way to force its usage. Also I'm not too sure 
about the necessity of calling cache invalidation callbacks during 
XlogRedoFinishPrepared(); I've marked this place in the patch with a 
2REVIEWER comment.

Tests show that this patch increases the speed of 2PC replay to the level 
where the replica can keep pace with the master.

Graph: replica lag under a pgbench run for 200 seconds with 2PC update 
transactions (80 connections, one update per 2PC tx, two servers with 12 cores 
each, 10GbE interconnect) on current master and with the suggested patch. 
Replica lag was measured with "select sent_location-replay_location as delay 
from pg_stat_replication;" each second.
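The replay flow described above can be modelled with a minimal self-contained sketch (toy structures only; the real code works on GXACT, PROC and PGPROC and takes the appropriate locks): replaying a PREPARE record creates an in-memory entry instead of fsyncing a state file, and replaying the matching COMMIT/ABORT PREPARED removes it again.

```c
#include <assert.h>

/* Toy model of the in-memory prepared-transaction table kept during replay.
 * Names and sizes are illustrative, not the actual twophase.c structures. */
#define MAX_PREPARED 8

typedef unsigned int TransactionId;

typedef struct ToyGXact
{
    TransactionId xid;
    int           valid;        /* slot in use? */
} ToyGXact;

static ToyGXact prepared[MAX_PREPARED];
static int      num_prepared = 0;

/* Called while replaying an XLOG_XACT_PREPARE record. */
static int
toy_recover_prepared_from_wal(TransactionId xid)
{
    if (num_prepared >= MAX_PREPARED)
        return -1;              /* out of slots */
    prepared[num_prepared].xid = xid;
    prepared[num_prepared].valid = 1;
    num_prepared++;
    return 0;
}

/* Called while replaying COMMIT PREPARED / ABORT PREPARED. */
static int
toy_finish_prepared(TransactionId xid)
{
    for (int i = 0; i < num_prepared; i++)
    {
        if (prepared[i].valid && prepared[i].xid == xid)
        {
            prepared[i] = prepared[num_prepared - 1];
            num_prepared--;
            return 0;
        }
    }
    return -1;                  /* unknown xid */
}

/* What pg_prepared_xacts on the slave would consult in this model. */
static int
toy_is_prepared(TransactionId xid)
{
    for (int i = 0; i < num_prepared; i++)
        if (prepared[i].valid && prepared[i].xid == xid)
            return 1;
    return 0;
}
```

In this model a checkpoint/restartpoint between the two calls is where entries would be spilled to pg_twophase files; nothing here is fsynced per record, which is the point of the patch.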



twophase_replay.diff
Description: Binary data


check.sh
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 12 Jan 2016, at 22:57, Simon Riggs  wrote:
> 
> On 12 January 2016 at 18:14, Andres Freund  wrote:
> Hi,
> 
> Thank you for the additional review.
>  
> On 2016-01-11 19:39:14 +, Simon Riggs wrote:
> > Currently, the patch reuses all of the code related to reading/write state
> > files, so it is the minimal patch that can implement the important things
> > for performance. The current patch succeeds in its goal to improve
> > performance, so I personally see no further need for code churn.
> 
> Sorry, I don't buy that argument. This approach leaves us with a bunch
> of code related to statefiles that's barely ever going to be exercised,
> and leaves the performance bottleneck on WAL replay in place.
> 
> I raised the issue of WAL replay performance before you were involved, as has 
> been mentioned already. I don't see it as a blocker for this patch. I have 
> already requested it from Stas and he has already agreed to write that.
> 
> Anyway, we know the statefile code works, so I'd prefer to keep it, rather 
> than write a whole load of new code that would almost certainly fail. 
> Whatever the code looks like, the frequency of usage is the same. As I 
> already said, you can submit a patch for the new way if you wish; the reality 
> is that this code works and there's no additional performance gain from doing 
> it a different way.
>  
> > As you suggest, we could also completely redesign the state file mechanism
> > and/or put it in WAL at checkpoint. That's all very nice but is much more
> > code and doesn't do anything more for performance, since the current mainline
> > path writes ZERO files at checkpoint.
> 
> Well, on the primary, yes.
> 
> Your changes proposed earlier wouldn't change performance on the standby.
>  
> > If you want that for some other reason or refactoring, I won't stop
> > you, but its a separate patch for a separate purpose.
> 
> Maintainability/complexity very much has to be considered during review
> and can't just be argued away with "but this is what we implemented".
> 
> ;-) ehem, please don't make the mistake of thinking that because your 
> judgement differs to mine 

Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Simon Riggs
On 12 January 2016 at 06:35, Michael Paquier wrote:

> On Tue, Jan 12, 2016 at 4:57 AM, Simon Riggs wrote:
> > Performance tests for me show that the patch is effective; my results
> match
> > Jesper's roughly in relative numbers.
> >
> > My robustness review is that the approach and implementation are safe.
> >
> > It's clear there are various additional tuning opportunities, but the
> > objective of the current patch to improve performance is very, very
> clearly
> > met, so I'm aiming to commit *this* patch soon.
>
> -   /* initialize LSN to 0 (start of WAL) */
> -   gxact->prepare_lsn = 0;
> +   /* initialize LSN to InvalidXLogRecPtr */
> +   gxact->prepare_start_lsn = InvalidXLogRecPtr;
> +   gxact->prepare_end_lsn = InvalidXLogRecPtr;
>

OK


> I think that it would be better to explicitly initialize gxact->ondisk
> to false here.
>
> +   xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page,
> NULL);
> +   if (!xlogreader)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_OUT_OF_MEMORY),
> +errmsg("out of memory"),
> +errdetail("Failed while allocating an
> XLog reading processor.")));
> Depending on something that is part of logical decoding to decode WAL
> is not a good idea.


Well, if you put it like that, it sounds wrong, clearly; that's not how I
saw it, when reviewed.

I think any fuss can be avoided simply by renaming
logical_read_local_xlog_page() to read_local_xlog_page()


> If you want to move on with this approach, you
> should have a dedicated function.


The code is exactly what we need, apart from the point that the LSN is
always known flushed by the time we execute it, for 2PC.


> Even better, it would be nice to
> come up with a generic function used by both 2PC and logical decoding.
>

Surely that is exactly what has been done?

A specific function could have been written, which would simply have
duplicated about 160 lines of code. Reusing the existing code makes the
code generic. So lets just rename the function, as mentioned above.

Should we just move the code somewhere just to imply it is generic? Seems
pointless refactoring to me.

The code is clearly due for refactoring once we can share elog with client
programs, as described in comments on the functions.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Simon Riggs
On 12 January 2016 at 06:41, Michael Paquier wrote:

>
> +   if (log_checkpoints && n > 0)
> +   ereport(LOG,
> +   (errmsg("%u two-phase state files were
> written "
> +   "for long-running
> prepared transactions",
> +   n)));
> This would be better as an independent change. That looks useful for
> debugging, and I guess that's why you added it.
>

The typical case is that no LOG message would be written at all, since that
only happens minutes after a prepared transaction is created and then not
terminated. Restarting a transaction manager likely won't take that long,
so it implies a crash or emergency shutdown of the transaction manager.

I think it is sensible and useful to be notified of this as a condition the
operator would wish to know about. The message doesn't recur every
checkpoint, it occurs only once at the point the files are created, so it's
not log spam either.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Michael Paquier
On Tue, Jan 12, 2016 at 5:21 PM, Simon Riggs  wrote:
> Should we just move the code somewhere just to imply it is generic? Seems
> pointless refactoring to me.

Er, why not xlogutils.c? Having the 2PC code depending directly on
something that is within logicalfuncs.c is weird.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Alvaro Herrera
Michael Paquier wrote:
> On Tue, Jan 12, 2016 at 5:21 PM, Simon Riggs  wrote:
> > Should we just move the code somewhere just to imply it is generic? Seems
> > pointless refactoring to me.
> 
> Er, why not xlogutils.c? Having the 2PC code depending directly on
> something that is within logicalfuncs.c is weird.

Yes, I agree with Michael -- it's better to place code in its logical
location than keep it somewhere else just because historically it was
there.  That way, future coders can find the function more easily.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Andres Freund
Hi,

On 2016-01-11 19:39:14 +, Simon Riggs wrote:
> Currently, the patch reuses all of the code related to reading/write state
> files, so it is the minimal patch that can implement the important things
> for performance. The current patch succeeds in its goal to improve
> performance, so I personally see no further need for code churn.

Sorry, I don't buy that argument. This approach leaves us with a bunch
of code related to statefiles that's barely ever going to be exercised,
and leaves the performance bottleneck on WAL replay in place.

> As you suggest, we could also completely redesign the state file mechanism
> and/or put it in WAL at checkpoint. That's all very nice but is much more
> code and doesn't do anything more for performance, since the current mainline
> path writes ZERO files at checkpoint.

Well, on the primary, yes.

> If you want that for some other reason or refactoring, I won't stop
> you, but its a separate patch for a separate purpose.

Maintainability/complexity very much has to be considered during review
and can't just be argued away with "but this is what we implemented".


> - *   In order to survive crashes and shutdowns, all prepared
> - *   transactions must be stored in permanent storage. This includes
> - *   locking information, pending notifications etc. All that state
> - *   information is written to the per-transaction state file in
> - *   the pg_twophase directory.
> + *   Information to recover prepared transactions in case of crash is
> + *   now stored in WAL for the common case. In some cases there will 
> be
> + *   an extended period between preparing a GXACT and commit/abort, 
> in

Absolutely minor: The previous lines were differently indented (two tabs
before, one space + two tabs now), which will probably mean pgindent
will yank all of it around, besides looking confusing with different tab
settings.


> * * In case of crash replay will move data from xlog to files, if 
> that
> *   hasn't happened before. XXX TODO - move to shmem in replay 
> also

This is a bit confusing - causing my earlier confusion about using
XlogReadTwoPhaseData in recovery - because what this actually means is
that we get the data from normal WAL replay, not our new way of getting
things from the WAL.


> @@ -772,7 +769,7 @@ TwoPhaseGetGXact(TransactionId xid)
>* During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be 
> called
>* repeatedly for the same XID.  We can save work with a simple cache.
>*/
> - if (xid == cached_xid)
> + if (xid == cached_xid && cached_gxact)
>   return cached_gxact;

What's that about? When can cached_xid be equal to xid and cached_gxact
not set? And why did that change with this patch?
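For context, the cache in question is a single-entry xid cache; sketched with toy types, the extra pointer test only changes behaviour if something can reset cached_gxact while cached_xid still holds the old value (the assumption here is that removing an entry during replay clears the pointer but leaves the xid stale):

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-entry cache, modelling the TwoPhaseGetGXact fast path.
 * GXactStub is a stand-in, not the real GlobalTransactionData. */
typedef unsigned int TransactionId;
#define InvalidTransactionId 0

typedef struct GXactStub
{
    TransactionId xid;
} GXactStub;

static TransactionId cached_xid = InvalidTransactionId;
static GXactStub    *cached_gxact = NULL;

/* The slow path: scan the whole table. */
static GXactStub *
lookup_slow(TransactionId xid, GXactStub *table, int n)
{
    for (int i = 0; i < n; i++)
        if (table[i].xid == xid)
            return &table[i];
    return NULL;
}

static GXactStub *
get_gxact(TransactionId xid, GXactStub *table, int n)
{
    /* fast path: both the xid and the pointer must still be valid */
    if (xid == cached_xid && cached_gxact)
        return cached_gxact;

    cached_gxact = lookup_slow(xid, table, n);
    if (cached_gxact)
        cached_xid = xid;
    return cached_gxact;
}

/* e.g. when an entry is removed during replay: only the pointer is reset. */
static void
invalidate_gxact_cache(void)
{
    cached_gxact = NULL;        /* cached_xid deliberately left stale */
}
```

Without the added `cached_gxact` test, a lookup after invalidation would return the dangling pointer for the stale xid instead of falling back to the slow path.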


> /*
>  * Finish preparing state file.
>  *
>  * Calculates CRC and writes state file to WAL and in pg_twophase directory.
>  */
> void
> EndPrepare(GlobalTransaction gxact)

In contrast to that comment we're not writing to pg_twophase anymore.


>   /*
>* If the file size exceeds MaxAllocSize, we won't be able to read it in
>* ReadTwoPhaseFile. Check for that now, rather than fail at commit 
> time.
>*/
>   if (hdr->total_len > MaxAllocSize)
>   ereport(ERROR,
>   (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>errmsg("two-phase state file maximum length 
> exceeded")));
> 

Outdated comment.


> +/*
> + * Reads 2PC data from xlog. During checkpoint this data will be moved to
> + * twophase files and ReadTwoPhaseFile should be used instead.
> + */
> +static void
> +XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
> +{
> + XLogRecord *record;
> + XLogReaderState *xlogreader;
> + char   *errormsg;
> +
> + xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
> + if (!xlogreader)
> + ereport(ERROR,
> + (errcode(ERRCODE_OUT_OF_MEMORY),
> +  errmsg("out of memory"),
> +  errdetail("Failed while allocating an XLog 
> reading processor.")));

Creating and deleting an xlogreader for every 2pc transaction isn't
particularly efficient. Reading the 2pc state from WAL will often also
mean hitting disk if there's significant WAL volume (we even hint that
we want the cache to be thrown away for low wal_level!).
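A cheap way to avoid the per-transaction allocate/free cycle would be to keep one lazily allocated reader around, along these lines (a toy stand-in for an XLogReaderState; it assumes single-backend use and ignores that the real reader would also need to be freed on shutdown):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy reader object; the real code would hold an XLogReaderState here. */
typedef struct ToyReader
{
    int in_use;
} ToyReader;

static ToyReader *shared_reader = NULL;
static int        allocations = 0;      /* instrumentation for the sketch */

/* Allocate the reader once, on first use, and reuse it afterwards. */
static ToyReader *
get_reader(void)
{
    if (shared_reader == NULL)
    {
        shared_reader = malloc(sizeof(ToyReader));
        if (shared_reader)
        {
            shared_reader->in_use = 0;
            allocations++;
        }
    }
    return shared_reader;
}

/* Each 2PC record read reuses the cached reader instead of
 * going through an allocate/read/free cycle. */
static int
read_twophase_record(void)
{
    ToyReader *r = get_reader();

    if (r == NULL)
        return -1;              /* out of memory */
    /* ... XLogReadRecord() on the cached reader would go here ... */
    return 0;
}
```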


If we really go this way, we really need a) a comment here explaining
why timelines are never an issue b) an assert, preventing to be called
during recovery.


> + record = XLogReadRecord(xlogreader, lsn, &errormsg);
> + if (record == NULL ||
> + XLogRecGetRmid(xlogreader) != RM_XACT_ID ||
> + (XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != 
> XLOG_XACT_PREPARE)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> +   

Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Michael Paquier
On Tue, Jan 12, 2016 at 5:26 PM, Simon Riggs  wrote:
> On 12 January 2016 at 06:41, Michael Paquier wrote:
>>
>>
>> +   if (log_checkpoints && n > 0)
>> +   ereport(LOG,
>> +   (errmsg("%u two-phase state files were
>> written "
>> +   "for long-running
>> prepared transactions",
>> +   n)));
>> This would be better as an independent change. That looks useful for
>> debugging, and I guess that's why you added it.
>
>
> The typical case is that no LOG message would be written at all, since that
> only happens minutes after a prepared transaction is created and then not
> terminated. Restarting a transaction manager likely won't take that long, so
> it implies a crash or emergency shutdown of the transaction manager.

Thanks for the detailed explanation.

> I think it is sensible and useful to be notified of this as a condition the
> operator would wish to know about. The message doesn't recur every
> checkpoint, it occurs only once at the point the files are created, so it's
> not log spam either.

Well, I am not saying that this is bad, quite the contrary actually.
It is just that this seems unrelated to this patch and would still be
useful even now with CheckPointTwoPhase.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Simon Riggs
On 12 January 2016 at 18:14, Andres Freund  wrote:

> Hi,


Thank you for the additional review.


> On 2016-01-11 19:39:14 +, Simon Riggs wrote:
> > Currently, the patch reuses all of the code related to reading/write
> state
> > files, so it is the minimal patch that can implement the important things
> > for performance. The current patch succeeds in its goal to improve
> > performance, so I personally see no further need for code churn.
>
> Sorry, I don't buy that argument. This approach leaves us with a bunch
> of code related to statefiles that's barely ever going to be exercised,
> and leaves the performance bottleneck on WAL replay in place.


I raised the issue of WAL replay performance before you were involved, as
has been mentioned already. I don't see it as a blocker for this patch. I
have already requested it from Stas and he has already agreed to write that.

Anyway, we know the statefile code works, so I'd prefer to keep it, rather
than write a whole load of new code that would almost certainly fail.
Whatever the code looks like, the frequency of usage is the same. As I
already said, you can submit a patch for the new way if you wish; the
reality is that this code works and there's no additional performance gain
from doing it a different way.


> > As you suggest, we could also completely redesign the state file
> mechanism
> > and/or put it in WAL at checkpoint. That's all very nice but is much more
> > code and doesn't do anything more for performance, since the current
> mainline
> > path writes ZERO files at checkpoint.
>
> Well, on the primary, yes.


Your changes proposed earlier wouldn't change performance on the standby.


> > If you want that for some other reason or refactoring, I won't stop
> > you, but its a separate patch for a separate purpose.
>
> Maintainability/complexity very much has to be considered during review
> and can't just be argued away with "but this is what we implemented".
>

;-) ehem, please don't make the mistake of thinking that because your
judgement differs from mine you can claim that you are the only one who
has thought about maintainability and complexity.

I'm happy to do some refactoring if you and Michael think it necessary.


> > - *   In order to survive crashes and shutdowns, all prepared
> > - *   transactions must be stored in permanent storage. This
> includes
> > - *   locking information, pending notifications etc. All that
> state
> > - *   information is written to the per-transaction state file in
> > - *   the pg_twophase directory.
> > + *   Information to recover prepared transactions in case of
> crash is
> > + *   now stored in WAL for the common case. In some cases there
> will be
> > + *   an extended period between preparing a GXACT and
> commit/abort, in
>
> Absolutely minor: The previous lines were differently indented (two tabs
> before, one space + two tabs now), which will probably mean pgindent
> will yank all of it around, besides looking confusing with different tab
> settings.
>
>
> > * * In case of crash replay will move data from xlog to
> files, if that
> > *   hasn't happened before. XXX TODO - move to shmem in
> replay also
>
> This is a bit confusing - causing my earlier confusion about using
> XlogReadTwoPhaseData in recovery - because what this actually means is
> that we get the data from normal WAL replay, not our new way of getting
> things from the WAL.
>
>
> > @@ -772,7 +769,7 @@ TwoPhaseGetGXact(TransactionId xid)
> >* During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be
> called
> >* repeatedly for the same XID.  We can save work with a simple
> cache.
> >*/
> > - if (xid == cached_xid)
> > + if (xid == cached_xid && cached_gxact)
> >   return cached_gxact;
>
> What's that about? When can cached_xid be equal to xid and cached_gxact
> not set? And why did that change with this patch?
>
>
> > /*
> >  * Finish preparing state file.
> >  *
> >  * Calculates CRC and writes state file to WAL and in pg_twophase
> directory.
> >  */
> > void
> > EndPrepare(GlobalTransaction gxact)
>
> In contrast to that comment we're not writing to pg_twophase anymore.
>
>
> >   /*
> >* If the file size exceeds MaxAllocSize, we won't be able to read
> it in
> >* ReadTwoPhaseFile. Check for that now, rather than fail at
> commit time.
> >*/
> >   if (hdr->total_len > MaxAllocSize)
> >   ereport(ERROR,
> >   (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
> >errmsg("two-phase state file maximum
> length exceeded")));
> >
>
> Outdated comment.


Ack all above.


> > +/*
> > + * Reads 2PC data from xlog. During checkpoint this data will be moved
> to
> > + * twophase files and ReadTwoPhaseFile should be used instead.
> > + */
> > +static void
> > 

Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Simon Riggs
On 12 January 2016 at 12:53, Michael Paquier 
wrote:

> On Tue, Jan 12, 2016 at 5:21 PM, Simon Riggs 
> wrote:
> > Should we just move the code somewhere just to imply it is generic? Seems
> > pointless refactoring to me.
>
> Er, why not xlogutils.c? Having the 2PC code depending directly on
> something that is within logicalfuncs.c is weird.
>

If that sounds better, I'm happy to move the code there.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-12 Thread Stas Kelvich
My +1 for moving the function to xlogutils.c too.

Now the call to this function goes through a series of callbacks, so it is hard to 
find.
Personally, I found it only after I had implemented the same function myself 
(based on the code in pg_xlogdump).

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 12 Jan 2016, at 18:56, Alvaro Herrera  wrote:
> 
> Michael Paquier wrote:
>> On Tue, Jan 12, 2016 at 5:21 PM, Simon Riggs  wrote:
>>> Should we just move the code somewhere just to imply it is generic? Seems
>>> pointless refactoring to me.
>> 
>> Er, why not xlogutils.c? Having the 2PC code depending directly on
>> something that is within logicalfuncs.c is weird.
> 
> Yes, I agree with Michael -- it's better to place code in its logical
> location than keep it somewhere else just because historically it was
> there.  That way, future coders can find the function more easily.
> 
> -- 
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Stas Kelvich
> 
> On 11 Jan 2016, at 21:40, Jesper Pedersen  wrote:
> 
> I have done a run with the patch and it looks really great.
> 
> Attached is the TPS graph - with a 1pc run too - and the perf profile as a 
> flame graph (28C/56T w/ 256Gb mem, 2 x RAID10 SSD).
> 

Thanks for testing and especially for the flame graph. That is somewhat in 
between the cases that I have tested. On a commodity server with dual Xeons (6C 
each) 2PC speed is about 80% of 1PC speed, but on a 60C/120T system the patch 
didn't make a significant difference because the main bottleneck changes from file 
access to locks on the array of running global transactions.

How did you generate the names for your PREPAREs? One funny thing that I've 
spotted is that the tx rate increased when I was using an incrementing counter as 
the GID instead of a random string.

And can you also share the flame graph for the 1PC workload?

> 
> On 11 Jan 2016, at 21:43, Simon Riggs  wrote:
> 
> Have you measured lwlocking as a problem?
> 


Yes. GXACT locks that weren't even in the perf top 10 on a dual Xeon move to the 
first places when running on a 60-core system. But Jesper's flame graph on a 24-core 
system shows a different picture.

> On 12 Jan 2016, at 01:24, Andres Freund  wrote:
> 
> Currently recovery of 2pc often already is a bigger bottleneck than the 
> workload on the master, because replay has to execute the fsyncs implied by 
> statefile  re-creation serially, whereas on the master they'll usually be 
> executed in parallel.

That’s an interesting observation. Simon already pointed me to this problem in 2pc 
replay, but I didn’t think it was so slow. I’m now working on that.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Andres Freund
Hi,

On January 11, 2016 10:46:01 PM GMT+01:00, Simon Riggs  
wrote:
>On 11 January 2016 at 20:10, Andres Freund  wrote:
>
>> On January 11, 2016 8:57:58 PM GMT+01:00, Simon Riggs
>>  wrote:
>> >On 11 January 2016 at 18:43, Simon Riggs 
>wrote:
>>
>> >It's clear there are various additional tuning opportunities, but
>the
>> >objective of the current patch to improve performance is very, very
>> >clearly
>> >met, so I'm aiming to commit *this* patch soon.
>>
>> Again, the WAL read routine used doesn't deal with timeline changes.
>
>
>Not relevant: The direct WAL read routine is never used during replay,
>so
>your comment is not relevant since we don't change timelines on the
>master.

Hm, OK.   But, isn't this actually a bad sign? Currently recovery of 2pc often 
already is a bigger bottleneck than the workload on the master, because replay 
has to execute the fsyncs implied by statefile  re-creation serially, whereas 
on the master they'll usually be executed in parallel. So, if I understand 
correctly this patch would widen that gap?

Anyway, as evidenced here, review on a phone isn't efficient, and that's all I 
have access to right now. Please wait till at least tomorrow evening, so I can 
have a meaningful look.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 20:10, Andres Freund  wrote:

> On January 11, 2016 8:57:58 PM GMT+01:00, Simon Riggs
>  wrote:
> >On 11 January 2016 at 18:43, Simon Riggs  wrote:
>
> >It's clear there are various additional tuning opportunities, but the
> >objective of the current patch to improve performance is very, very
> >clearly
> >met, so I'm aiming to commit *this* patch soon.
>
> Again, the WAL read routine used doesn't deal with timeline changes.


Not relevant: The direct WAL read routine is never used during replay, so
your comment is not relevant since we don't change timelines on the master.

> So no, it's not ready to be committed.

I will update the comment on that function to explain its usage and its
limitations for future use, to make that clearer.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 22:24, Andres Freund  wrote:


> Please wait till at least tomorrow evening, so I can have a meaningful
> look.
>

No problem, make sure to look at 2pc_optimize.v4.patch

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 23:11, Stas Kelvich  wrote:


> >
>
> On 11 Jan 2016, at 21:43, Simon Riggs  wrote:
> >
> > Have you measured lwlocking as a problem?
> >
>
> Yes. GXACT locks that wasn’t even in perf top 10 on dual Xeon moves to the
> first places when running on 60 core system. But Jesper’s flame graph on 24
> core system shows different picture.


I think we can use a shmem hash table to identify the GID by name during
LockGxact and avoid duplicates during prepare. Hashing on the first 16
bytes of the GID should be sufficient in most cases; the worst case would
be the same as it is now, all depending on how people use the GID name
field. The hash value can be calculated outside of the lock. We can also
partition the lock without risk, just adds a little extra code.

We can also optimize final removal (sketch of how to do that attached).

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


2pc_remove_prepXacts.v1.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Michael Paquier
On Tue, Jan 12, 2016 at 4:57 AM, Simon Riggs  wrote:
> Performance tests for me show that the patch is effective; my results match
> Jesper's roughly in relative numbers.
>
> My robustness review is that the approach and implementation are safe.
>
> It's clear there are various additional tuning opportunities, but the
> objective of the current patch to improve performance is very, very clearly
> met, so I'm aiming to commit *this* patch soon.

-   /* initialize LSN to 0 (start of WAL) */
-   gxact->prepare_lsn = 0;
+   /* initialize LSN to InvalidXLogRecPtr */
+   gxact->prepare_start_lsn = InvalidXLogRecPtr;
+   gxact->prepare_end_lsn = InvalidXLogRecPtr;

I think that it would be better to explicitly initialize gxact->ondisk
to false here.

+   xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
+   if (!xlogreader)
+   ereport(ERROR,
+   (errcode(ERRCODE_OUT_OF_MEMORY),
+errmsg("out of memory"),
+errdetail("Failed while allocating an
XLog reading processor.")));
Depending on something that is part of logical decoding to decode WAL
is not a good idea. If you want to move on with this approach, you
should have a dedicated function. Even better, it would be nice to
come up with a generic function used by both 2PC and logical decoding.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Michael Paquier
On Tue, Jan 12, 2016 at 3:35 PM, Michael Paquier
 wrote:
> On Tue, Jan 12, 2016 at 4:57 AM, Simon Riggs  wrote:
>> Performance tests for me show that the patch is effective; my results match
>> Jesper's roughly in relative numbers.
>>
>> My robustness review is that the approach and implementation are safe.
>>
>> It's clear there are various additional tuning opportunities, but the
>> objective of the current patch to improve performance is very, very clearly
>> met, so I'm aiming to commit *this* patch soon.
>
> -   /* initialize LSN to 0 (start of WAL) */
> -   gxact->prepare_lsn = 0;
> +   /* initialize LSN to InvalidXLogRecPtr */
> +   gxact->prepare_start_lsn = InvalidXLogRecPtr;
> +   gxact->prepare_end_lsn = InvalidXLogRecPtr;
>
> I think that it would be better to explicitly initialize gxact->ondisk
> to false here.
>
> +   xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page, NULL);
> +   if (!xlogreader)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_OUT_OF_MEMORY),
> +errmsg("out of memory"),
> +errdetail("Failed while allocating an
> XLog reading processor.")));
> Depending on something that is part of logical decoding to decode WAL
> is not a good idea. If you want to move on with this approach, you
> should have a dedicated function. Even better, it would be nice to
> come up with a generic function used by both 2PC and logical decoding.


+   if (log_checkpoints && n > 0)
+   ereport(LOG,
+   (errmsg("%u two-phase state files were written "
+   "for long-running
prepared transactions",
+   n)));
This would be better as an independent change. That looks useful for
debugging, and I guess that's why you added it.
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Stas Kelvich

> On 10 Jan 2016, at 12:15, Simon Riggs  wrote:
> 
> So we've only optimized half the usage? We're still going to cause 
> replication delays.

Yes, the replica will go through the old procedure of moving data to and from files.

> We can either
> 
> 1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during restartpoints

From what I’ve seen with the old 2PC code, the main performance bottleneck was 
caused by frequent creation of files. So it’s better to avoid files if possible.

> 
> 2) Copy the contents to shmem and then write them at restartpoint as we do 
> for checkpoint
> (preferred)

The problem with shared memory is that we can’t really predict the size of the 
state data, and anyway it isn’t faster than reading the data from WAL
(I have tested that while preparing the original patch).

We can just apply the same logic on the replica as on the master: do not do anything 
special on prepare, and just read the data from WAL.
If a checkpoint occurs during recovery/replay, the existing code will probably handle 
moving the data to files.

I will update the patch to address this issue.

> I think padding will negate the effects of the additional bool.
> 
> If we want to reduce the size of the array GIDSIZE is currently 200, but XA 
> says maximum 128 bytes.
> 
> Anybody know why that is set to 200?

Good catch about GID size.

If we talk about further optimisations, I see two ways:

1) Optimising access to GXACT. Here we can try to shrink it; introduce more 
granular locks, 
e.g. move GIDs out of GXACT and lock the GIDs array only once while checking new 
GID uniqueness; try to lock only part of GXACT by hash; etc.

2) Be optimistic about a subsequent COMMIT PREPARED. In a normal workload the next 
command after PREPARE will be COMMIT/ROLLBACK, so we can save the
transaction context and release it only if the next command isn’t our designated 
COMMIT/ROLLBACK. But that is a big amount of work and requires
changes to the whole transaction pipeline in postgres.

Anyway, I suggest that we consider that as a separate task.

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 12:58, Stas Kelvich  wrote:

>
> > On 10 Jan 2016, at 12:15, Simon Riggs  wrote:
> >
> > So we've only optimized half the usage? We're still going to cause
> replication delays.
>
> Yes, the replica will go through the old procedure of moving data to and from
> files.
>
> > We can either
> >
> > 1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during
> restartpoints
>
> From what I’ve seen with the old 2PC code, the main performance bottleneck was
> caused by frequent creation of files. So it’s better to avoid files if possible.
>
> >
> > 2) Copy the contents to shmem and then write them at restartpoint as we
> do for checkpoint
> > (preferred)
>
> The problem with shared memory is that we can’t really predict the size of the
> state data, and anyway it isn’t faster than reading the data from WAL
> (I have tested that while preparing the original patch).
>
> We can just apply the same logic on the replica as on the master: do not do
> anything special on prepare, and just read the data from WAL.
> If a checkpoint occurs during recovery/replay, the existing code will probably
> handle moving the data to files.
>
> I will update the patch to address this issue.
>

I'm looking to commit what we have now, so lets do that as a separate but
necessary patch please.


> > I think padding will negate the effects of the additional bool.
> >
> > If we want to reduce the size of the array GIDSIZE is currently 200, but
> XA says maximum 128 bytes.
> >
> > Anybody know why that is set to 200?
>
> Good catch about GID size.
>

I'll apply that as a separate patch also.


> If we talk about further optimisations, I see two ways:
>
> 1) Optimising access to GXACT. Here we can try to shrink it; introduce
> more granular locks,
> e.g. move GIDs out of GXACT and lock the GIDs array only once while checking
> new GID uniqueness; try to lock only part of GXACT by hash; etc.
>

Have you measured lwlocking as a problem?


> 2) Be optimistic about a subsequent COMMIT PREPARED. In a normal workload the
> next command after PREPARE will be COMMIT/ROLLBACK, so we can save the
> transaction context and release it only if the next command isn’t our
> designated COMMIT/ROLLBACK. But that is a big amount of work and requires
> changes to the whole transaction pipeline in postgres.
>

We'd need some way to force session pools to use that correctly, but yes,
agreed.


> Anyway I suggest that we should consider that as a separate task.


Definitely. From the numbers, I can see there is still considerable
performance gain to be had.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Andres Freund
Hi,

On 2016-01-09 15:29:11 +, Simon Riggs wrote:
> Hmm, I was just preparing this for commit.

Just read downthread that you want to commit this soon. Please hold of
for a while, this doesn't really look ready to me. I don't have time for
a real review right now, but I'll try to get to it asap.

> +
> +/*
> + * Reads 2PC data from xlog. During checkpoint this data will be moved to
> + * twophase files and ReadTwoPhaseFile should be used instead.
> + */
> +static void
> +XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
> +{
> + XLogRecord *record;
> + XLogReaderState *xlogreader;
> + char   *errormsg;
> +
> + xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page,
> NULL);

logical_read_local_xlog_page isn't really suitable for the use
here. Besides the naming issue, at the very least it'll be wrong during
WAL replay in the presence of promotions on an upstream node - it
doesn't deal with timelines.

More generally, I'm doubtful that the approach of reading data from WAL
as proposed here is a very good idea. It seems better to "just" dump the
entire 2pc state into *one* file at checkpoint time.

Greetings,

Andres Freund




Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Andres Freund
On 2016-01-11 20:03:18 +0100, Andres Freund wrote:
> More generally, I'm doubtful that the approach of reading data from WAL
> as proposed here is a very good idea. It seems better to "just" dump the
> entire 2pc state into *one* file at checkpoint time.

Or better: After determining the checkpoint redo location, insert a WAL
record representing the entire 2PC state as of that moment. That way it
can easily be restored during WAL replay and nothing special has to be done
on a standby. This way we'll need no extra wal flushes and fsyncs.




Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 19:07, Andres Freund  wrote:

> On 2016-01-11 20:03:18 +0100, Andres Freund wrote:
> > More generally, I'm doubtful that the approach of reading data from WAL
> > as proposed here is a very good idea. It seems better to "just" dump the
> > entire 2pc state into *one* file at checkpoint time.
>
> Or better: After determining the checkpoint redo location, insert a WAL
> record representing the entire 2PC state as of that moment. That way it
> can easily be restored during WAL replay and nothing special has to be done
> on a standby. This way we'll need no extra wal flushes and fsyncs.
>

Feel free to submit a patch that does that.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 18:51, Jesper Pedersen 
wrote:

> On 01/10/2016 04:15 AM, Simon Riggs wrote:
>
>> One concern that came to my mind while reading the updated
>>> patch is about creating an extra bool field in the GlobalTransactionData
>>> structure. While this improves readability, it
>>> also increases the size of that structure, and that size has an impact on
>>> performance on systems with many cores
>>> (say 60-80). Probably one byte will not make a measurable difference,
>>> but I think it is a good idea to keep
>>> GXact as small as possible. As far as I understand, the same logic was
>>> behind the split of
>>> PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166)
>>>
>>
>>
>> I think padding will negate the effects of the additional bool.
>>
>> If we want to reduce the size of the array GIDSIZE is currently 200, but
>> XA
>> says maximum 128 bytes.
>>
>> Anybody know why that is set to 200?
>>
>>
> Even though GlobalTransactionId and BranchQualifier have a maximum of 64
> bytes each, external clients may choose to encode the information, and thereby
> need more space,
>
>
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/xa/RecoveredXid.java#L66-L70
>
> http://docs.oracle.com/javaee/7/api/javax/transaction/xa/Xid.html
>
> which in this case adds up to a maximum of 189 characters.
>

OK, thanks for those references.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Andres Freund
On 2016-01-11 19:15:23 +, Simon Riggs wrote:
> On 11 January 2016 at 19:03, Andres Freund  wrote:
> 
> > Hi,
> >
> > On 2016-01-09 15:29:11 +, Simon Riggs wrote:
> > > Hmm, I was just preparing this for commit.
> >
> > Just read downthread that you want to commit this soon. Please hold of
> > for a while, this doesn't really look ready to me. I don't have time for
> > a real review right now, but I'll try to get to it asap.
> >
> 
> "A real review"? Huh.

All I meant was that my email didn't consist of a real review, but
was just a quick scan

> > More generally, I'm doubtful that the approach of reading data from WAL
> > as proposed here is a very good idea. It seems better to "just" dump the
> > entire 2pc state into *one* file at checkpoint time.
> >
> 
>  I think you misunderstand the proposed approach. This isn't just to do
> with reading things back at checkpoint.

Sure, the main purpose is not to write 2pc state files in the common
path - or is that not the main purpose?

Greetings,

Andres Freund




Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 18:43, Simon Riggs  wrote:


> I'm looking to commit what we have now.
>

Here is the patch in its "final" state after my minor additions, edits and
review.

Performance tests for me show that the patch is effective; my results match
Jesper's roughly in relative numbers.

My robustness review is that the approach and implementation are safe.

It's clear there are various additional tuning opportunities, but the
objective of the current patch to improve performance is very, very clearly
met, so I'm aiming to commit *this* patch soon.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


2pc_optimize.v4.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Andres Freund
On January 11, 2016 8:57:58 PM GMT+01:00, Simon Riggs  
wrote:
>On 11 January 2016 at 18:43, Simon Riggs  wrote:

>It's clear there are various additional tuning opportunities, but the
>objective of the current patch to improve performance is very, very
>clearly
>met, so I'm aiming to commit *this* patch soon.

Again, the WAL read routine used doesn't deal with timeline changes. So no,  
it's not ready to be committed.

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] Speedup twophase transactions

2016-01-11 Thread Simon Riggs
On 11 January 2016 at 19:03, Andres Freund  wrote:

> Hi,
>
> On 2016-01-09 15:29:11 +, Simon Riggs wrote:
> > Hmm, I was just preparing this for commit.
>
> Just read downthread that you want to commit this soon. Please hold of
> for a while, this doesn't really look ready to me. I don't have time for
> a real review right now, but I'll try to get to it asap.
>

"A real review"? Huh.


> > +
> > +/*
> > + * Reads 2PC data from xlog. During checkpoint this data will be moved
> to
> > + * twophase files and ReadTwoPhaseFile should be used instead.
> > + */
> > +static void
> > +XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
> > +{
> > + XLogRecord *record;
> > + XLogReaderState *xlogreader;
> > + char   *errormsg;
> > +
> > + xlogreader = XLogReaderAllocate(&logical_read_local_xlog_page,
> > NULL);
>
> logical_read_local_xlog_page isn't really suitable for the use
> here. Besides the naming issue, at the very least it'll be wrong during
> WAL replay in the presence of promotions on an upstream node - it
> doesn't deal with timelines.
>

I'm aware of that, though note that it isn't being used in that way here.


> More generally, I'm doubtful that the approach of reading data from WAL
> as proposed here is a very good idea. It seems better to "just" dump the
> entire 2pc state into *one* file at checkpoint time.
>

 I think you misunderstand the proposed approach. This isn't just to do
with reading things back at checkpoint.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-10 Thread Simon Riggs
On 9 January 2016 at 20:28, Stas Kelvich  wrote:

> Thanks a lot for your edits, now the patch is much cleaner.
>
> > Your comments say
> >
> >   "In case of crash replay will move data from xlog to files, if that
> hasn't happened before."
> >
> > but I don't see that in code. Can you show me where that happens?
>
> xact.c calls RecreateTwoPhaseFile in xact_redo() function (xact.c:5596)


So we've only optimized half the usage? We're still going to cause
replication delays.

Sounds like we should be fixing both.

We can either

1) Skip fsyncing the RecreateTwoPhaseFile and then fsync during
restartpoints

2) Copy the contents to shmem and then write them at restartpoint as we do
for checkpoint
(preferred)


> > On 09 Jan 2016, at 18:29, Simon Riggs  wrote:
> >
> > Hmm, I was just preparing this for commit.
> >
> > Please have a look at my mild edits and extended comments.
>
>
> One concern that came to my mind while reading the updated
> patch is about creating an extra bool field in the GlobalTransactionData
> structure. While this improves readability, it
> also increases the size of that structure, and that size has an impact on
> performance on systems with many cores
> (say 60-80). Probably one byte will not make a measurable difference,
> but I think it is a good idea to keep
> GXact as small as possible. As far as I understand, the same logic was
> behind the split of
> PGPROC into PGPROC+PGXACT in 9.2 (comment in proc.h:166)


I think padding will negate the effects of the additional bool.

If we want to reduce the size of the array GIDSIZE is currently 200, but XA
says maximum 128 bytes.

Anybody know why that is set to 200?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Simon Riggs
On 9 January 2016 at 12:26, Stas Kelvich  wrote:


> I’ve updated the patch and written a description of the things that happen
> with 2PC state data at the beginning of the file. I think this patch
> is now well documented,
> but if somebody points me to places that probably require more detailed
> description I’m ready
> to extend it.
>

Hmm, I was just preparing this for commit.

Please have a look at my mild edits and extended comments.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


2pc_optimize.v2.patch
Description: Binary data



Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Stas Kelvich
Hi.

I’ve updated the patch and written a description of the things that happen
with 2PC state data at the beginning of the file. I think this patch is now 
well documented,
but if somebody points me to places that probably require more detailed 
description I’m ready
to extend it.



2pc_xlog.diff
Description: Binary data


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 08 Jan 2016, at 19:29, Alvaro Herrera  wrote:
> 
> Simon Riggs wrote:
> 
>> I think we could do better still, but this looks like the easiest 80% and
>> actually removes code.
>> 
>> The lack of substantial comments on the patch is a problem though - the
>> details above should go in the patch. I'll have a go at reworking this for
>> you, this time.
> 
> Is someone submitting an updated patch here?
> 
> -- 
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 




Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Simon Riggs
On 9 January 2016 at 12:26, Stas Kelvich  wrote:


> I’ve updated the patch and written a description of the things that happen
> with 2PC state data at the beginning of the file. I think this patch
> is now well documented,
> but if somebody points me to places that probably require more detailed
> description I’m ready
> to extend it.
>

Your comments say

  "In case of crash replay will move data from xlog to files, if
that hasn't happened before."

but I don't see that in code. Can you show me where that happens?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2016-01-08 Thread Alvaro Herrera
Simon Riggs wrote:

> I think we could do better still, but this looks like the easiest 80% and
> actually removes code.
> 
> The lack of substantial comments on the patch is a problem though - the
> details above should go in the patch. I'll have a go at reworking this for
> you, this time.

Is someone submitting an updated patch here?

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Speedup twophase transactions

2015-12-10 Thread Simon Riggs
On 9 December 2015 at 18:44, Stas Kelvich  wrote:


> In this patch I’ve changed this procedures to following:
> * on prepare backend writes data only to xlog and store pointer to the
> start of the xlog record
> * if commit occurs before checkpoint then backend reads data from xlog by
> this pointer
> * on checkpoint 2pc data copied to files and fsynced
> * if commit happens after checkpoint then backend reads files
> * in case of crash replay will move data from xlog to files (as it was
> before patch)
>

This looks sound to me.

I think we could do better still, but this looks like the easiest 80% and
actually removes code.

The lack of substantial comments on the patch is a problem though - the
details above should go in the patch. I'll have a go at reworking this for
you, this time.

-- 
Simon Riggs  http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Speedup twophase transactions

2015-12-10 Thread Stas Kelvich
Michael, Jeff, thanks for reviewing and testing.

> On 10 Dec 2015, at 02:16, Michael Paquier  wrote:
> 
> This has better be InvalidXLogRecPtr if unused.


Yes, that’s better. Changed.

> On 10 Dec 2015, at 02:16, Michael Paquier  wrote:

> +if (gxact->prepare_lsn)
> +{
> +XlogReadTwoPhaseData(gxact->prepare_xlogptr, , NULL);
> +}
> Perhaps you mean prepare_xlogptr here?


Yes, my bad. But funnily enough, I made this error an even number of times: the code in 
CheckPointTwoPhase also uses prepare_lsn instead of xlogptr, so overall this 
was working well; that’s why it survived my own tests and probably Jeff’s tests.
I think that’s bad variable naming: for example, lsn in pg_xlogdump 
points to the start of the record, but here the start is used as xlogptr and the end as lsn.
So I changed both variables to prepare_start_lsn and prepare_end_lsn.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:
> I've tested this through my testing harness which forces the database
> to go through endless runs of crash recovery and checks for
> consistency, and so far it has survived perfectly.


Cool! I think the patch is most vulnerable to the following type of workload: 
prepare a transaction, do a lot of work in the database to force checkpoints (or 
even recovery cycles), and then commit it.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:

> Can you give the full command line?  -j, -c, etc.


pgbench -h testhost -i && pgbench -h testhost -f 2pc.pgb -T 300 -P 1 -c 64 -j 
16 -r

where 2pc.pgb as in previous message.

Also, all of this applies to hosts with uniform memory. I tried to run patched 
postgres on a NUMA machine with 60 physical cores and the patch didn’t change anything. Perf 
top shows that the main bottleneck there is access to gxact, but on an ordinary host with 
one or two CPUs that access is not even among the top ten heaviest routines.

> On 10 Dec 2015, at 09:48, Jeff Janes  wrote:

> Why are you incrementing :scale ?


That’s a funny part: overall 2PC speed depends on how you name your 
prepared transactions. Concretely, I tried using random numbers for gids and it 
was slower than a constantly incrementing gid. Probably that happens due 
to the linear search by gid in the gxact array on commit. So I used :scale just as a 
counter, because it is initialised on pgbench start and a line like “\set scale 
:scale+1” works well. (Maybe there is a different way to do this in pgbench.)

> I very rapidly reach a point where most of the updates are against
> tuples that don't exist, and then get integer overflow problems.

Hmm, that’s strange. Probably you set scale to a big value, so that 10*:scale 
is bigger than int4? But I thought that pgbench would change the aid columns to 
bigint if scale is more than 2.



2pc_xlog.v2.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




[HACKERS] Speedup twophase transactions

2015-12-09 Thread Stas Kelvich
Hello.

While working with cluster stuff (DTM, tsDTM) we noticed that postgres 2PC 
transactions are approximately two times slower than an ordinary commit on a 
workload with fast transactions: a few single-row updates followed by COMMIT or 
PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel on 
fopen/fclose, so it is worth trying to reduce file operations in the 2PC path.

Now 2PC in postgres does the following:
* on prepare, 2pc data (subxacts, commitrels, abortrels, invalmsgs) is saved to 
xlog and to a file, but the file is not fsynced
* on commit, the backend reads the data from the file
* if a checkpoint occurs before commit, then the files are fsynced during the checkpoint
* in case of crash, replay will move the data from xlog to files

In this patch I’ve changed these procedures to the following:
* on prepare, the backend writes the data only to xlog and stores a pointer to the start of 
the xlog record
* if commit occurs before a checkpoint, then the backend reads the data from xlog by this 
pointer
* on checkpoint, 2pc data is copied to files and fsynced
* if commit happens after a checkpoint, then the backend reads the files
* in case of crash, replay will move the data from xlog to files (as it was before 
the patch)

Most of these ideas were already mentioned in a 2009 thread by Michael Paquier 
http://www.postgresql.org/message-id/c64c5f8b0908062031k3ff48428j824a9a46f2818...@mail.gmail.com
 where he suggested storing 2pc data in shared memory. 
At that time the patch was declined because no significant speedup was observed. 
Now I see a performance improvement from my patch of about 60%. Probably the old 
benchmark's overall tps was lower, so it was harder to hit filesystem 
fopen/fclose limits.

The benchmark results are as follows (dual 6-core Xeon server):

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Current master with 2PC: ~36 ktps

The benchmark was done with the following script:

\set naccounts 10 * :scale
\setrandom from_aid 1 :naccounts
\setrandom to_aid 1 :naccounts
\setrandom delta 1 100
\set scale :scale+1
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
PREPARE TRANSACTION ':client_id.:scale';
COMMIT PREPARED ':client_id.:scale';



2pc_xlog.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Speedup twophase transactions

2015-12-09 Thread Kevin Grittner
On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich  wrote:

> Now 2PC in postgres does following:
> * on prepare 2pc data (subxacts, commitrels, abortrels, invalmsgs) saved to 
> xlog and to file, but file not is not fsynced
> * on commit backend reads data from file
> * if checkpoint occurs before commit, then files are fsynced during checkpoint
> * if case of crash replay will move data from xlog to files
>
> In this patch I’ve changed this procedures to following:
> * on prepare backend writes data only to xlog and store pointer to the start 
> of the xlog record
> * if commit occurs before checkpoint then backend reads data from xlog by 
> this pointer
> * on checkpoint 2pc data copied to files and fsynced
> * if commit happens after checkpoint then backend reads files
> * in case of crash replay will move data from xlog to files (as it was before 
> patch)

That sounds like a very good plan to me.

> Now results of benchmark are following (dual 6-core xeon server):
>
> Current master without 2PC: ~42 ktps
> Current master with 2PC: ~22 ktps
> Current master with 2PC: ~36 ktps

I assume that last one should have been *Patched master with 2PC"?

Please add this to the January CommitFest.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Speedup twophase transactions

2015-12-09 Thread Stas Kelvich
Thanks, Kevin.

> I assume that last one should have been *Patched master with 2PC”?

Yes, this list should look like this:

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Patched master with 2PC: ~36 ktps

And I created a CommitFest entry for this patch.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 10 Dec 2015, at 00:37, Kevin Grittner  wrote:
> 
> On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich  
> wrote:
> 
>> Now 2PC in postgres does following:
>> * on prepare 2pc data (subxacts, commitrels, abortrels, invalmsgs) saved to 
>> xlog and to file, but file not is not fsynced
>> * on commit backend reads data from file
>> * if checkpoint occurs before commit, then files are fsynced during 
>> checkpoint
>> * if case of crash replay will move data from xlog to files
>> 
>> In this patch I’ve changed this procedures to following:
>> * on prepare backend writes data only to xlog and store pointer to the start 
>> of the xlog record
>> * if commit occurs before checkpoint then backend reads data from xlog by 
>> this pointer
>> * on checkpoint 2pc data copied to files and fsynced
>> * if commit happens after checkpoint then backend reads files
>> * in case of crash replay will move data from xlog to files (as it was 
>> before patch)
> 
> That sounds like a very good plan to me.
> 
>> Now results of benchmark are following (dual 6-core xeon server):
>> 
>> Current master without 2PC: ~42 ktps
>> Current master with 2PC: ~22 ktps
>> Current master with 2PC: ~36 ktps
> 
> I assume that last one should have been *Patched master with 2PC"?
> 
> Please add this to the January CommitFest.
> 
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company





Re: [HACKERS] Speedup twophase transactions

2015-12-09 Thread Michael Paquier
On Thu, Dec 10, 2015 at 3:44 AM, Stas Kelvich  wrote:
> Most of that ideas was already mentioned in 2009 thread by Michael Paquier 
> http://www.postgresql.org/message-id/c64c5f8b0908062031k3ff48428j824a9a46f2818...@mail.gmail.com
>  where he suggested to store 2pc data in shared memory.
> At that time patch was declined because no significant speedup were observed. 
> Now I see performance improvements by my patch at about 60%. Probably old 
> benchmark overall tps was lower and it was harder to hit filesystem 
> fopen/fclose limits.

Glad to see this patch is given a second life 6 years later.

> Now results of benchmark are following (dual 6-core xeon server):
>
> Current master without 2PC: ~42 ktps
> Current master with 2PC: ~22 ktps
> Current master with 2PC: ~36 ktps

That's nice.

+    XLogRecPtr    prepare_xlogptr;    /* XLOG offset of prepare record start
+                                       * or NULL if twophase data moved to file
+                                       * after checkpoint.
+                                       */
This has better be InvalidXLogRecPtr if unused.

+    if (gxact->prepare_lsn)
+    {
+        XlogReadTwoPhaseData(gxact->prepare_xlogptr, , NULL);
+    }
Perhaps you mean prepare_xlogptr here?
-- 
Michael




Re: [HACKERS] Speedup twophase transactions

2015-12-09 Thread Jeff Janes
On Wed, Dec 9, 2015 at 10:44 AM, Stas Kelvich  wrote:
> Hello.
>
> While working with cluster stuff (DTM, tsDTM) we noted that postgres 2pc 
> transactions is approximately two times slower than an ordinary commit on 
> workload with fast transactions — few single-row updates and COMMIT or 
> PREPARE/COMMIT. Perf top showed that a lot of time is spent in kernel on 
> fopen/fclose, so it worth a try to reduce file operations with 2pc tx.
>

I've tested this through my testing harness which forces the database
to go through endless runs of crash recovery and checks for
consistency, and so far it has survived perfectly.

...

>
> Now results of benchmark are following (dual 6-core xeon server):
>
> Current master without 2PC: ~42 ktps
> Current master with 2PC: ~22 ktps
> Current master with 2PC: ~36 ktps

Can you give the full command line?  -j, -c, etc.

>
> Benchmark done with following script:
>
> \set naccounts 10 * :scale
> \setrandom from_aid 1 :naccounts
> \setrandom to_aid 1 :naccounts
> \setrandom delta 1 100
> \set scale :scale+1

Why are you incrementing :scale ?

I very rapidly reach a point where most of the updates are against
tuples that don't exist, and then get integer overflow problems.

> BEGIN;
> UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = 
> :from_aid;
> UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
> PREPARE TRANSACTION ':client_id.:scale';
> COMMIT PREPARED ':client_id.:scale';
>

Cheers,

Jeff



