Re: [HACKERS] Slow synchronous logical replication

2017-10-12 Thread Craig Ringer
On 12 October 2017 at 16:09, Konstantin Knizhnik
 wrote:
>

> Is the CREATE TABLE and INSERT done in the same transaction?
>
> No. Table was create in separate transaction.
> Moreover  the same effect will take place if table is create before start of 
> replication.
> The problem in this case seems to be caused by spilling decoded transaction 
> to the file by ReorderBufferSerializeTXN.

Yeah. That's known to perform sub-optimally, and it also uses way more
memory than it should.

Your design compounds that by spilling transactions it will then
discard, and doing so multiple times.

To make your design viable you likely need some kind of cache of
serialized reorder buffer transactions, where you don't rebuild one if
it's already been generated. And likely a fair bit of optimisation on
the serialisation.

Or you might want a table- and even a row-filter that can be run
during decoding, before appending to the ReorderBuffer, to let you
skip changes early. Right now this can only be done at the transaction
level, based on replication origin. Of course, if you do this you
can't do the caching thing.

> Unfortunately it is not quite clear how to make wal-sender smarter and let
> him skip transaction not affecting its publication.

You'd need more hooks to be implemented by the output plugin.

> I really not sure that it is possible to skip over WAL. But the particular 
> problem with invalidation records etc  can be solved by always processing 
> this records by WAl sender.
> I.e. if backend is inserting invalidation record or some other record which 
> always should be processed by WAL sender, it can always promote LSN of this 
> record to WAL sender.
> So WAl sender will skip only those WAl records which is safe to skip 
> (insert/update/delete records not affecting this publication).

That sounds like a giant layering violation too.

I suggest focusing on reducing the amount of work done when reading
WAL, not trying to jump over whole ranges of WAL.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-12 Thread Konstantin Knizhnik



On 12.10.2017 04:23, Craig Ringer wrote:

On 12 October 2017 at 00:57, Konstantin Knizhnik
 wrote:


The reason of such behavior is obvious: wal sender has to decode huge
transaction generate by insert although it has no relation to this
publication.

It does. Though I wouldn't expect anywhere near the kind of drop you
report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction?


No. Table was create in separate transaction.
Moreover  the same effect will take place if table is create before 
start of replication.
The problem in this case seems to be caused by spilling decoded 
transaction to the file by ReorderBufferSerializeTXN.

Please look at two profiles:
http://garret.ru/lr1.svg  corresponds to normal work if pgbench with 
synchronous replication to one replica,
http://garret.ru/lr2.svg - the with concurrent execution of huge insert 
statement.


And here is output of pgbench (at fifth second insert is started):

progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat .578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361

It seems to be effect of large transactions.
Presence of several channels of synchronous logical replication reduce 
performance, but not so much.

Below are results at another machine and pgbench with scale 10.

Configuraion
standalone
1 async logical replica
1 sync logical replca
3 async logical replicas
3 syn logical replicas
TPS
15k
13k
10k
13k
8k





Only partly true. The output plugin can register a transaction origin
filter and use that to say it's entirely uninterested in a
transaction. But this only works based on filtering by origins. Not
tables.
Yes I know about origin filtering mechanism (and we are using it in 
multimaster).
But I am speaking about standard pgoutput.c output plugin. it's 
pgoutput_origin_filter

always returns false.




I imagine we could call another hook in output plugins, "do you care
about this table", and use it to skip some more work for tuples that
particular decoding session isn't interested in. Skip adding them to
the reorder buffer, etc. No such hook currently exists, but it'd be an
interesting patch for Pg11 if you feel like working on it.


Unfortunately it is not quite clear how to make wal-sender smarter and let
him skip transaction not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally
skip over WAL, since we need to process various invalidations etc. See
ReorderBufferSkip.
The problem is that before end of transaction we do not know whether it 
touch this publication or not.

So filtering by origin will not work in this case.

I really not sure that it is possible to skip over WAL. But the 
particular problem with invalidation records etc  can be solved by 
always processing this records by WAl sender.
I.e. if backend is inserting invalidation record or some other record 
which always should be processed by WAL sender, it can always promote 
LSN of this record to WAL sender.
So WAl sender will skip only those WAl records which is safe to skip 
(insert/update/delete records not affecting this publication).


I wonder if there can be some other problems with skipping part of 
transaction by WAL sender.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Slow synchronous logical replication

2017-10-11 Thread Craig Ringer
On 12 October 2017 at 00:57, Konstantin Knizhnik
 wrote:

> The reason of such behavior is obvious: wal sender has to decode huge
> transaction generate by insert although it has no relation to this
> publication.

It does. Though I wouldn't expect anywhere near the kind of drop you
report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction? Because
that's a known pathological case for logical replication, it has to do
a LOT of extra work when it's in a transaction that has done DDL. I'm
sure there's room for optimisation there, but the general
recommendation for now is "don't do that".

> Filtering of insert records of this transaction is done only inside output
> plug-in.

Only partly true. The output plugin can register a transaction origin
filter and use that to say it's entirely uninterested in a
transaction. But this only works based on filtering by origins. Not
tables.

I imagine we could call another hook in output plugins, "do you care
about this table", and use it to skip some more work for tuples that
particular decoding session isn't interested in. Skip adding them to
the reorder buffer, etc. No such hook currently exists, but it'd be an
interesting patch for Pg11 if you feel like working on it.

> Unfortunately it is not quite clear how to make wal-sender smarter and let
> him skip transaction not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally
skip over WAL, since we need to process various invalidations etc. See
ReorderBufferSkip.

It's not so simple by table since we don't know early enough whether
the xact affects tables of interest or not. But you could definitely
do some selective skipping. Making it efficient could be the
challenge.

> Once of the possible solutions is to let backend inform wal-sender about
> smallest LSN it should wait for (backend knows which table is affected by
> current operation,
> so which publications are interested in this operation and so can point wal
> -sender to the proper LSN without decoding huge part of WAL.
> But it seems to be not so easy to implement.

Sounds like confusing layering violations to me.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-11 Thread Konstantin Knizhnik



On 11.10.2017 10:07, Craig Ringer wrote:

On 9 October 2017 at 15:37, Konstantin Knizhnik
 wrote:

Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.


It will be not so easy to provide some reproducing scenario, because
actually it involves many components (postgres_fdw, pg_pasthman,
pg_shardman, LR,...)

So simplify it to a test case that doesn't.

The simplest reproducing scenario is the following:
1. Start two Posgtgres instances: synchronous_commit=on, fsync=off
2. Initialize pgbench database at both instances: pgbench -i
3. Create publication for pgbench_accounts table at one node
4. Create correspondent subscription at another node with 
copy_data=false parameter

5. Add subscription to synchronous_standby_names at first node.
6. Start pgbench -c 8 -N -T 100 -P 1 at first node. At my systems 
results are the following:

standalone postgres: 8600 TPS
asynchronous replication: 6600 TPS
synchronous replication:   5600 TPS
Quite good results.
7. Create some dummy table and perform bulk insert in it:
create table dummy(x integer primary key);
insert into dummy values (generate_series(1,1000));

pgbench almost stuck: until end of insert performance drops almost 
to zero.


The reason of such behavior is obvious: wal sender has to decode huge 
transaction generate by insert although it has no relation to this 
publication.
Filtering of insert records of this transaction is done only inside 
output plug-in.
Unfortunately it is not quite clear how to make wal-sender smarter and 
let him skip transaction not affecting its publication.
Once of the possible solutions is to let backend inform wal-sender about 
smallest LSN it should wait for (backend knows which table is affected 
by current operation,
so which publications are interested in this operation and so can point 
wal -sender to the proper LSN without decoding huge part of WAL.

But it seems to be not so easy to implement.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-11 Thread Andres Freund
Hi,

On 2017-10-09 10:37:01 +0300, Konstantin Knizhnik wrote:
> So we have implement sharding - splitting data between several remote tables
> using pg_pathman and postgres_fdw.
> It means that insert or update of parent table  cause insert or update of
> some derived partitions which is forwarded by postgres_fdw to the
> correspondent node.
> Number of shards is significantly larger than number of nodes, i.e. for 5
> nodes we have 50 shards. Which means that at each onde we have 10 shards.
> To provide fault tolerance each shard is replicated using logical
> replication to one or more nodes. Right now we considered only redundancy
> level 1 - each shard has only one replica.
> So from each node we establish 10 logical replication channels.

Isn't that part of the pretty fundamental problem? There shouldn't be 10
different replication channels per node. There should be one.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-11 Thread Craig Ringer
On 9 October 2017 at 15:37, Konstantin Knizhnik
 wrote:
> Thank you for explanations.
>
> On 08.10.2017 16:00, Craig Ringer wrote:
>>
>> I think it'd be helpful if you provided reproduction instructions,
>> test programs, etc, making it very clear when things are / aren't
>> related to your changes.
>
>
> It will be not so easy to provide some reproducing scenario, because
> actually it involves many components (postgres_fdw, pg_pasthman,
> pg_shardman, LR,...)

So simplify it to a test case that doesn't.

> I have checked syncrepl.c file, particularly SyncRepGetSyncRecPtr function.
> Each wal sender independently calculates minimal LSN among all synchronous
> replicas and wakeup backends waiting for this LSN. It means that transaction
> performing update of data in one shard will actually wait confirmation from
> replication channels for all shards.

That's expected for the current sync rep design, yes. Because it's
based on lsn, and was designed for physical rep where there's no
question about whether we're sending some data to some peers and not
others.

So all backends will wait for the slowest-responding peer, including
peers that don't need to actually do anything for this xact. You could
possibly hack around that by having the output plugin advance the slot
position when it sees that it just processed an empty xact.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-11 Thread Masahiko Sawada
On Mon, Oct 9, 2017 at 4:37 PM, Konstantin Knizhnik
 wrote:
> Thank you for explanations.
>
> On 08.10.2017 16:00, Craig Ringer wrote:
>>
>> I think it'd be helpful if you provided reproduction instructions,
>> test programs, etc, making it very clear when things are / aren't
>> related to your changes.
>
>
> It will be not so easy to provide some reproducing scenario, because
> actually it involves many components (postgres_fdw, pg_pasthman,
> pg_shardman, LR,...)
> and requires multinode installation.
> But let me try to explain what going on:
> So we have implement sharding - splitting data between several remote tables
> using pg_pathman and postgres_fdw.
> It means that insert or update of parent table  cause insert or update of
> some derived partitions which is forwarded by postgres_fdw to the
> correspondent node.
> Number of shards is significantly larger than number of nodes, i.e. for 5
> nodes we have 50 shards. Which means that at each onde we have 10 shards.
> To provide fault tolerance each shard is replicated using logical
> replication to one or more nodes. Right now we considered only redundancy
> level 1 - each shard has only one replica.
> So from each node we establish 10 logical replication channels.
>
> We want commit to wait until data is actually stored at all replicas, so we
> are using synchronous replication:
> So we set synchronous_commit option to "on" and include all ten 10
> subscriptions in synchronous_standby_names list.
>
> In this setup commit latency is very large (about 100msec and most of the
> time is actually spent in commit) and performance is very bad - pgbench
> shows about 300 TPS for optimal number of clients (about 10, for larger
> number performance is almost the same). Without logical replication at the
> same setup we get about 6000 TPS.
>
> I have checked syncrepl.c file, particularly SyncRepGetSyncRecPtr function.
> Each wal sender independently calculates minimal LSN among all synchronous
> replicas and wakeup backends waiting for this LSN. It means that transaction
> performing update of data in one shard will actually wait confirmation from
> replication channels for all shards.
> If some shard is updated rarely than other or is not updated at all (for
> example because communication channels between this node is broken), then
> all backens will stuck.
> Also all backends are competing for the single SyncRepLock, which also can
> be a contention point.
>

IIUC, I guess you meant to say that in current synchronous logical
replication a transaction has to wait for updated table data to be
replicated even on servers that don't subscribe for the table. If we
change it so that a transaction needs to wait for only the server that
are subscribing for the table it would be more efficiency, for at
least your use case.
We send at least the begin and commit data to all subscriptions and
then wait for the reply from them but can we skip to wait them, for
example, when the walsender actually didn't send any data modified by
the transaction?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-09 Thread Konstantin Knizhnik

Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.


It will be not so easy to provide some reproducing scenario, because 
actually it involves many components (postgres_fdw, pg_pasthman, 
pg_shardman, LR,...)

and requires multinode installation.
But let me try to explain what going on:
So we have implement sharding - splitting data between several remote 
tables using pg_pathman and postgres_fdw.
It means that insert or update of parent table  cause insert or update 
of some derived partitions which is forwarded by postgres_fdw to the 
correspondent node.
Number of shards is significantly larger than number of nodes, i.e. for 
5 nodes we have 50 shards. Which means that at each onde we have 10 shards.
To provide fault tolerance each shard is replicated using logical 
replication to one or more nodes. Right now we considered only 
redundancy level 1 - each shard has only one replica.

So from each node we establish 10 logical replication channels.

We want commit to wait until data is actually stored at all replicas, so 
we are using synchronous replication:
So we set synchronous_commit option to "on" and include all ten 10 
subscriptions in synchronous_standby_names list.


In this setup commit latency is very large (about 100msec and most of 
the time is actually spent in commit) and performance is very bad - 
pgbench shows about 300 TPS for optimal number of clients (about 10, for 
larger number performance is almost the same). Without logical 
replication at the same setup we get about 6000 TPS.


I have checked syncrepl.c file, particularly SyncRepGetSyncRecPtr 
function. Each wal sender independently calculates minimal LSN among all 
synchronous replicas and wakeup backends waiting for this LSN. It means 
that transaction performing update of data in one shard will actually 
wait confirmation from replication channels for all shards.
If some shard is updated rarely than other or is not updated at all (for 
example because communication channels between this node is broken), 
then all backens will stuck.
Also all backends are competing for the single SyncRepLock, which also 
can be a contention point.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-08 Thread Craig Ringer
On 8 October 2017 at 03:58, Konstantin Knizhnik
 wrote:

> The question was about logical replication mechanism in mainstream version
> of Postgres.

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.

> I think that most of people are using asynchronous logical replication and
> synchronous LR is something exotic and not well tested and investigated.
> It will be great if I am wrong:)

I doubt it's widely used. That said, a lot of people use synchronous
replication with BDR and pglogical, which are ancestors of the core
logical rep code and design.

I think you actually need to collect some proper timings and
diagnostics here, rather than hand-waving about it being "slow". A
good starting point might be setting some custom 'perf' tracepoints,
or adding some 'elog()'ing for timestamps. Then scrape the results and
build a latency graph.

That said, if I had to guess why it's slow, I'd say that you're facing
a number of factors:

* By default, logical replication in PostgreSQL does not do an
immediate flush to disk after downstream commit. In the interests of
faster apply performance it instead delays sending flush confirmations
until the next time WAL is flushed out. See the docs for CREATE
SUBSCRIPTION, notably the synchronous_commit option. This will
obviously greatly increase latencies on sync commit.

* Logical decoding doesn't *start* streaming a transaction until the
origin node finishes the xact and writes a COMMIT, then the xlogreader
picks it up.

* As a consequence of the above, a big xact holds up commit
confirmations of smaller ones by a LOT more than is the case for
streaming physical replication.

Hopefully that gives you something to look into, anyway. Maybe you'll
be inspired to work on parallelized logical decoding :)

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-07 Thread Konstantin Knizhnik

On 10/07/2017 10:42 PM, Andres Freund wrote:

Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:

In our sharded cluster project we are trying to use logical relication for 
providing HA (maintaining redundant shard copies).
Using asynchronous logical replication has not so much sense in context of HA. 
This is why we try to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and level of 
redundancy=1 (just one copy) cluster is 20 times slower then without logical 
replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason of such bad performance is that synchronous replication 
mechanism was originally developed for streaming replication, when all replicas have the same 
content and LSNs. When it is used for logical replication, it behaves very inefficiently. 
Commit has to wait confirmations from all receivers mentioned in 
"synchronous_standby_names" list. So we are waiting not only for our own single 
logical replication standby, but all other standbys as well. Number of synchronous standbyes is 
equal to number of shards divided by number of nodes. To provide uniform distribution number of 
shards should >> than number of nodes, for example for 10 nodes we usually create 100 
shards. As a result we get awful performance and blocking of any replication channel blocks all 
backends.

So my question is whether my understanding is correct and synchronous logical 
replication can not be efficiently used in such manner.
If so, the next question is how difficult it will be to make synchronous 
replication mechanism for logical replication more efficient and are there some 
plans to  work in this direction?

This seems to be a question that is a) about a commercial project we
don't know much about b) hasn't received a lot of investigation.


Sorry, If I was not clear.
The question was about logical replication mechanism in mainstream version of 
Postgres.
I think that most of people are using asynchronous logical replication and 
synchronous LR is something exotic and not well tested and investigated.
It will be great if I am wrong:)

Concerning our sharded cluster (pg_shardman) - it is not a commercial product 
yet, it is in development phase.
We are going to open its sources when it will be more or less stable.
But unlike multimaster, this sharded cluster is mostly built from existed 
components: pg_pathman  + postgres_fdw + logical replication.
So we are just trying to combine them all into some integrated system.
But currently the most obscure point is logical replication.

And the main goal of my e-mail was to know the opinion of authors and users of 
LR whether it is good idea to use LR to provide fault tolerance in sharded 
cluster.
Or some other approaches, for example sharding with redundancy or using 
streaming replication are preferable?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-07 Thread Andres Freund
Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
> In our sharded cluster project we are trying to use logical relication for 
> providing HA (maintaining redundant shard copies).
> Using asynchronous logical replication has not so much sense in context of 
> HA. This is why we try to use synchronous logical replication.
> Unfortunately it shows very bad performance. With 50 shards and level of 
> redundancy=1 (just one copy) cluster is 20 times slower then without logical 
> replication.
> With asynchronous replication it is "only" two times slower.
> 
> As far as I understand, the reason of such bad performance is that 
> synchronous replication mechanism was originally developed for streaming 
> replication, when all replicas have the same content and LSNs. When it is 
> used for logical replication, it behaves very inefficiently. Commit has to 
> wait confirmations from all receivers mentioned in 
> "synchronous_standby_names" list. So we are waiting not only for our own 
> single logical replication standby, but all other standbys as well. Number of 
> synchronous standbyes is equal to number of shards divided by number of 
> nodes. To provide uniform distribution number of shards should >> than number 
> of nodes, for example for 10 nodes we usually create 100 shards. As a result 
> we get awful performance and blocking of any replication channel blocks all 
> backends.
> 
> So my question is whether my understanding is correct and synchronous logical 
> replication can not be efficiently used in such manner.
> If so, the next question is how difficult it will be to make synchronous 
> replication mechanism for logical replication more efficient and are there 
> some plans to  work in this direction?

This seems to be a question that is a) about a commercial project we
don't know much about b) hasn't received a lot of investigation.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Slow synchronous logical replication

2017-10-07 Thread konstantin knizhnik
In our sharded cluster project we are trying to use logical relication for 
providing HA (maintaining redundant shard copies).
Using asynchronous logical replication has not so much sense in context of HA. 
This is why we try to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and level of 
redundancy=1 (just one copy) cluster is 20 times slower then without logical 
replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason of such bad performance is that synchronous 
replication mechanism was originally developed for streaming replication, when 
all replicas have the same content and LSNs. When it is used for logical 
replication, it behaves very inefficiently. Commit has to wait confirmations 
from all receivers mentioned in "synchronous_standby_names" list. So we are 
waiting not only for our own single logical replication standby, but all other 
standbys as well. Number of synchronous standbyes is equal to number of shards 
divided by number of nodes. To provide uniform distribution number of shards 
should >> than number of nodes, for example for 10 nodes we usually create 100 
shards. As a result we get awful performance and blocking of any replication 
channel blocks all backends.

So my question is whether my understanding is correct and synchronous logical 
replication can not be efficiently used in such manner.
If so, the next question is how difficult it will be to make synchronous 
replication mechanism for logical replication more efficient and are there some 
plans to  work in this direction?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers