Re: Synchronous replay take III

2019-02-03 Thread Thomas Munro
On Mon, Feb 4, 2019 at 4:47 PM Michael Paquier  wrote:
> Last patch set fails to apply properly, so moved to next CF waiting on
> author for a rebase.

Thanks.  Rebased.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Synchronous-replay-mode-for-avoiding-stale-reads-v11.patch
Description: Binary data


Re: Synchronous replay take III

2019-02-03 Thread Michael Paquier
On Fri, Feb 01, 2019 at 09:34:49AM -0500, Robert Haas wrote:
> Maybe I'm misunderstanding the terminology here, but if not, I find
> this theory wildly implausible.  *Most* people want read-your-writes
> behavior.  *Few* people want to wait for a dead standby.  The only
> application of the latter is when even a tiny risk of transaction loss
> is unacceptable, but the former has all kinds of clustering-related
> uses.

Last patch set fails to apply properly, so moved to next CF waiting on
author for a rebase.
--
Michael


signature.asc
Description: PGP signature


Re: Synchronous replay take III

2019-02-01 Thread Robert Haas
On Thu, Jan 24, 2019 at 6:42 PM Thomas Munro
 wrote:
> Yes, this is essentially the same thing that you were arguing against
> above.  Perhaps you are right, and there are no people who would want
> synchronous replay, but not synchronous commit.

Maybe I'm misunderstanding the terminology here, but if not, I find
this theory wildly implausible.  *Most* people want read-your-writes
behavior.  *Few* people want to wait for a dead standby.  The only
application of the latter is when even a tiny risk of transaction loss
is unacceptable, but the former has all kinds of clustering-related
uses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Synchronous replay take III

2019-01-30 Thread Michail Nikolaev
Hello,

Sorry, I missed the email.

>> In our case we have implemented a kind of "replication barrier"
>> functionality based on a table with counters (one counter per application
>> backend in the simple case).
>> Each application backend has a dedicated connection to each replica, and a
>> background process selects its counter value a few times (2-100) per second
>> from each replica (depending on how often the replication barrier is used).

> Interesting approach. Why don't you sample pg_last_wal_replay_lsn()
> on all the standbys instead, so you don't have to generate extra write
> traffic?

Replay LSN was the first approach I tried: I was sampling 'select
replay_lsn from pg_stat_replication' on the master to get information about
the replay position on the replicas.
However, for some unknown reason I was not able to get it to work: even
after replay_lsn was reached, the standby was unable to see the data.
I know that should not happen. I spent a few days on debugging... And since
I was required to ping the replicas anyway (to check whether one has become
the master, and to monitor ping, locks, connections, etc.), I decided to
introduce the table for now.
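
For reference, what I was trying looked roughly like the sketch below
(Python/psycopg2, simplified; the helper name and timings are only
illustrative, not our real code): run "SELECT pg_current_wal_lsn()" on the
master right after the commit, then poll pg_stat_replication until every
replica's replay_lsn reaches that value.

    import time
    import psycopg2

    def wait_for_replay(master_conn, commit_lsn, timeout=5.0):
        # commit_lsn is the result of "SELECT pg_current_wal_lsn()" executed
        # on the master right after the write transaction committed.  Poll
        # pg_stat_replication until every standby's replay_lsn has reached it.
        deadline = time.monotonic() + timeout
        with master_conn.cursor() as cur:
            while time.monotonic() < deadline:
                cur.execute(
                    "SELECT count(*) FROM pg_stat_replication "
                    "WHERE replay_lsn IS NULL OR replay_lsn < %s::pg_lsn",
                    (commit_lsn,))
                lagging, = cur.fetchone()
                master_conn.rollback()  # don't stay idle-in-transaction
                if lagging == 0:
                    return True
                time.sleep(0.01)
        return False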

>> Once the application has committed a transaction, it may want to join the
>> replication barrier before returning new data to the user. So it increments
>> its counter in the table and waits until all replicas have replayed that
>> value, according to the background monitoring process. Of course, timeouts,
>> replica health checks and a few optimizations and circuit breakers are used.

> I'm interested in how you handle failure (taking too long to respond
> or to see the new counter value, connectivity failure etc).
> Specifically, if the writer decides to give up on a certain standby
> (timeout, circuit breaker etc), how should a client that is connected
> directly to that standby now or soon afterwards know that this standby
> has been 'dropped' from the replication barrier and it's now at risk
> of seeing stale data?

Each standby has some health flags attached to it. Health is "red" when:
* can't connect to replica, or all connections are in use
* replica lag according to pg_last_xact_replay_timestamp is more than 3000ms
* replica lag according to pg_last_xact_replay_timestamp was more than
3000ms some time ago (1ms)
* replica is new master now
* etc.

In the case of the replication barrier, we wait only for "green" replicas,
and for at most 5000ms. If we are still not able to see the new counter value
on some replicas, it is up to the client to decide how to handle it. In our
case it means the replica is lagging by more than 3000ms, so it is "red" now,
and the next client request will be dispatched to another "green" replica.
This is done by a special connection pool with a balancer inside.
I'm not sure it is all 100% correct, but we could just proceed in our case.
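
A simplified sketch of the kind of per-replica lag check described above
(Python/psycopg2; the threshold and connection handling are only
illustrative):

    import psycopg2

    MAX_LAG_MS = 3000   # illustrative threshold

    def replica_is_green(replica_dsn):
        # A replica is "green" if we can connect, it is still in recovery
        # (not promoted), and its apparent replay lag is under the threshold.
        # Note that pg_last_xact_replay_timestamp() goes stale on an idle
        # primary, so a real check needs the other signals listed above too.
        try:
            with psycopg2.connect(replica_dsn, connect_timeout=1) as conn:
                with conn.cursor() as cur:
                    cur.execute(
                        "SELECT pg_is_in_recovery(), "
                        "       1000 * EXTRACT(EPOCH FROM "
                        "         (now() - pg_last_xact_replay_timestamp()))")
                    in_recovery, lag_ms = cur.fetchone()
                    if not in_recovery:
                        return False            # it is the new master now
                    return lag_ms is not None and lag_ms < MAX_LAG_MS
        except psycopg2.Error:
            return False                        # can't connect => "red"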

> someone can successfully execute queries on a standby
> that hasn't applied a transaction that you know to be committed on the
> primary.

>> The nice thing here is the constant number of connections involved, even
>> if a lot of threads are joining the replication barrier at the same moment,
>> and even if some replicas are lagging.
>>
>> With the implementation described in this thread, a 2-5 second lag on some
>> replica would lead to an out-of-connections issue within a few milliseconds.

> Right, if a standby is lagging more than the allowed amount, in my
> patch the lease is cancelled and it will refuse to handle requests if
> the GUC is on, with a special new error code, and then it's up to the
> client to decide what to do. Probably find another node.

> Could you please elaborate? What could you do that would be better?
> If the answer is that you just want to know that you might be seeing
> stale data but for some reason you don't want to have to find a new
> node, the reader is welcome to turn synchronous_replay off and try
> again (giving up data freshness guarantees). Not sure when that would
> be useful though.

The main problem I see here is master connection usage. We have about 10,000
RPS on the master, so a small lag on some replica (we have six of them) would
leave all master connections waiting for replay on the stale replica until
the timeout. That is an outage for the whole system, even if the replica
lagged for only 200-300ms (in the real world it could lag for seconds on a
regular basis).

If we set synchronous_replay_max_lag to 10-20ms, standbys would be cancelled
all the time.
In our case, we use a constant number of connections. In addition, client
requests wait for standby replay inside the application backend thread
without blocking a master connection. I think this is the main difference.

> I've attached a small shell script that starts up a primary and N
> replicas with synchronous_replay configured, in the hope of
> encouraging you to try it out.

Thanks - will try and report.


Re: Synchronous replay take III

2019-01-24 Thread Thomas Munro
On Tue, Jan 15, 2019 at 11:17 PM Masahiko Sawada  wrote:
> Regarding the current (v10 patch) design I have some questions and
> comments.

Hi Sawada-san,

Thanks for your testing and feedback.

> The patch introduces a new GUC parameter, synchronous_replay. We can set
> synchronous_commit = off while setting synchronous_replay = on. With
> this setting, the backend will synchronously wait for standbys to
> replay. I'm concerned that having two separate GUC parameters
> controlling the transaction commit behaviour would confuse users. It's
> just an idea, but maybe we can use 'remote_apply' for the synchronous
> replay purpose and introduce a new parameter for standby servers,
> something like allow_stale_read.

That is an interesting idea.  That choice means that the new mode
always implies synchronous_commit = on (since remote_apply is a
"higher" level).  I wanted them to be independent, so you could
express your durability requirement separately from your visibility
requirement.

Concretely, if none of your potential sync replay standbys are keeping
up and they are all dropped to "unavailable", then you'd be able to
see a difference: with your proposal we'd still have a synchronous
commit wait, but with mine that could independently be on or off.

Generally, I think we are too restrictive in our durability levels,
and there was some discussion about whether it's OK to have a strict
linear knob (which your idea extends):

https://www.postgresql.org/message-id/flat/CAEepm%3D3FFaanSS4sugG%2BApzq2tCVjEYCO2wOQBod2d7GWb%3DDvA%40mail.gmail.com

Hmm, perhaps your way would be better for now anyway, just because
it's simpler to understand and explain.  Perhaps you wouldn't need a
separate "allow_stale_read" GUC, you could just set synchronous_commit
to a lower level when talking to the standby.  (That is, give
synchronous_commit a meaning on standbys, whereas currently it has no
effect there.)

> If, while a transaction is waiting for all standbys to replay, they
> become unavailable, should the waiter be released? The patch
> seems not to release the waiter. Similarly, WAL senders are not aware
> of postgresql.conf changes while waiting for synchronous replay. I think we
> should call SyncReplayPotentialStandby() in SyncRepInitConfig().

Good point about the postgresql.conf change.

If all the standbys go to unavailable state, then a waiter should be
released once they have all either acknowledged that they are
unavailable (ie acknowledged that their lease has been revoked, via a
reply message with a serial number matching the revocation message),
or if that doesn't happen (due to lost network connection, crashed
process etc), once any leases that have been issued have expired
(ie a few seconds).  Is that not what you see?

> With the setting synchronous_standby_names = '' and
> synchronous_replay_standby_names = '*' we would get the standby's
> status in pg_stat_replication, sync_state = 'async' and sync_replay =
> 'available'. It looks odd to me. Yes, this status is correct in
> principle. But considering the architecture of PostgreSQL replication
> this status is impossible.

Yes, this is essentially the same thing that you were arguing against
above.  Perhaps you are right, and there are no people who would want
synchronous replay, but not synchronous commit.

> The synchronous_replay_standby_names = '*' setting means that the
> backend waits for all standbys connected to the master server to
> replay, is that right? In my test, even when some of the synchronous
> replay standby servers got stuck and therefore had their leases
> revoked, the backend could proceed with transactions.

It means that it waits for all standbys that are "available" to
replay.  It doesn't wait for the "unavailable" ones.  Most of the
patch deals with the transitions between those states.  During an
available->revoking->unavailable transition, we also wait for the
standby to know that it is unavailable (so that it begins to raise
errors), and during an unavailable->joining->available transition we
also wait for the standby to replay the transition LSN (so that it
stops raising errors).  That way clients on the standby can rely on
the error (or lack of error) to tell them whether their snapshot
definitely contains every commit that has returned control on the
primary.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Synchronous replay take III

2019-01-15 Thread Masahiko Sawada
On Sat, Dec 1, 2018 at 10:49 AM Thomas Munro
 wrote:
>
> On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > Unfortunately, cfbot says that the patch can't be applied without
> > conflicts; could you please post a rebased version and address the
> > comments from Masahiko?
>
> Right, it conflicted with 4c703369 and cfdf4dc4.  While rebasing on
> top of those, I found myself wondering why syncrep.c thinks it needs
> special treatment for postmaster death.  I don't see any reason why we
> shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
> this new version.  If you kill -9 the postmaster, I don't see any
> reason to think that the existing coding is more correct than simply
> exiting immediately.
>
> On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada  wrote:
> > On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
> >  wrote:
> > > I was pinged off-list by a fellow -hackers denizen interested in the
> > > synchronous replay feature and wanting a rebased patch to test.  Here
> > > it goes, just in time for a Commitfest.  Please skip to the bottom of
> > > this message for testing notes.
> >
> > Thank you for working on this. The overview and your summary were
> > helpful for me to understand this feature. I've started to review this
> > patch for PostgreSQL 12. I've tested it and found some issues, but let
> > me ask you questions about the high-level design first. Sorry if these
> > have already been discussed.
>
> Thanks for your interest in this work!
>
> > > This is a design choice favouring read-mostly workloads at the expense
> > > of write transactions.  Hot standbys' whole raison for existing is to
> > > move *some* read-only workloads off the primary server.  This proposal
> > > is for users who are prepared to trade increased primary commit
> > > latency for a guarantee about visibility on the standbys, so that
> > > *all* read-only work could be moved to hot standbys.
> >
> > To be clear, what did you mean by read-mostly workloads?
>
> I mean workloads where only a small percentage of transactions perform
> a write.  If you need write-scalability, then hot_standby is not the
> solution for you (with or without this patch).
>
> The kind of user who would be interested in this feature is someone
> who already uses some kind of heuristics to move some queries to
> read-only standbys.  For example, some people send transactions for
> logged-in users to the primary database (because only logged-in users
> generate write queries), and all the rest to standby servers (for
> example "public" users who can only read content).  Another technique
> I have seen is to keep user sessions "pinned" on the primary server
> for N minutes after they perform a write transaction.  These types of
> load balancing policies are primitive ways of achieving
> read-your-writes consistency, but they are conservative and
> pessimistic: they probably send too many queries to the primary node.
>
> This proposal is much more precise, allowing you to run the minimum
> number of transactions on the primary node (ie transactions that
> actually need to perform a write), and the maximum number of
> transactions on the hot standbys.
>
> As discussed, making reads wait for a token would be a useful
> alternative (and I am willing to help make that work too), but:
>
> 1.  For users that do many more reads than writes, would you
> rather make (say) 80% of transactions slower or 20%?  (Or 99% vs 1% as
> the case may be, depending on your application.)
>
> 2.  If you are also using synchronous_commit = on for increased
> durability, then you are already making writers wait, and you might be
> able to tolerate a small increase.
>
> Peter Eisentraut expressed an interesting point of view against this
> general line of thinking:
>
> https://www.postgresql.org/message-id/5643933F.4010701%40gmx.net
>
> My questions are:  Why do we have hot_standby mode?  Is load balancing
> a style of usage we want to support?  Do we want a technology that
> lets you do more of it?
>
> > I think there are two kinds of reads on standbys: a read that happens
> > after writes, and a direct read (e.g. reporting). The former usually
> > requires causal reads, as you mentioned, in order to read its own
> > writes, but the latter might be different: it often wants to read the
> > latest data on the master at that time. IIUC, even if we send a
> > read-only query directly to a synchronous replay server we could get a
> > stale result if the standby is delayed by less than
> > synchronous_replay_max_lag. So this synchronous replay feature would
> > be helpful for the former case (i.e. a few writes and many reads that
> > want to see them), whereas for the latter case perhaps keeping the
> > reads waiting on the standby seems a reasonable solution.
>
> I agree 100% that this is not a solution for all users.  But I also
> suspect a token system would be quite complicated, and can't be done
> in a way that is transparent to applications 

Re: Synchronous replay take III

2018-12-27 Thread Thomas Munro
Hello,

Here is a rebased patch, and separate replies to Michael and Michail.

On Sat, Dec 1, 2018 at 4:57 PM Michael Paquier  wrote:
> On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:
> > Right, it conflicted with 4c703369 and cfdf4dc4.  While rebasing on
> > top of those, I found myself wondering why syncrep.c thinks it needs
> > special treatment for postmaster death.  I don't see any reason why we
> > shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
> > this new version.  If you kill -9 the postmaster, I don't see any
> > reason to think that the existing coding is more correct than simply
> > exiting immediately.
>
> Hm.  This stuff runs under many assumptions, so I think that we should
> be careful here with any changes as the very recent history has proved
> (4c70336).  If we were to switch WAL senders on postmaster death, I
> think that this could be a change independent of what is proposed here.

Fair point.  I think the effect should be the same with less code:
either way you see the server hang up without sending a COMMIT tag,
but maybe I'm missing something.  Change reverted; let's discuss that
another time.

On Mon, Dec 3, 2018 at 9:01 AM Michail Nikolaev
 wrote:
> It is really nice feature. I am working on the project which heavily reads 
> from replicas (6 of them).

Thanks for your feedback.

> In our case we have implemented a kind of "replication barrier"
> functionality based on a table with counters (one counter per application
> backend in the simple case).
> Each application backend has a dedicated connection to each replica, and a
> background process selects its counter value a few times (2-100) per second
> from each replica (depending on how often the replication barrier is used).

Interesting approach.  Why don't you sample pg_last_wal_replay_lsn()
on all the standbys instead, so you don't have to generate extra write
traffic?
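
Just to illustrate what I mean, a minimal sketch (Python/psycopg2, with
made-up connection strings): capture pg_current_wal_lsn() on the primary
right after the commit, then poll pg_last_wal_replay_lsn() on each standby
until it catches up, instead of writing and reading a counter.

    import time
    import psycopg2

    STANDBY_DSNS = ["host=standby1 dbname=app",
                    "host=standby2 dbname=app"]   # made-up DSNs

    def connect_autocommit(dsn):
        conn = psycopg2.connect(dsn)
        conn.autocommit = True        # avoid idle-in-transaction between polls
        return conn

    def commit_lsn(primary_conn):
        # Run on the primary immediately after COMMIT of the write transaction.
        with primary_conn.cursor() as cur:
            cur.execute("SELECT pg_current_wal_lsn()")
            return cur.fetchone()[0]

    def has_replayed(standby_conn, lsn):
        with standby_conn.cursor() as cur:
            cur.execute("SELECT pg_last_wal_replay_lsn() >= %s::pg_lsn", (lsn,))
            return cur.fetchone()[0]

    def wait_for_standbys(lsn, timeout=5.0):
        # Poll every standby until it has replayed up to `lsn`, or time out.
        conns = [connect_autocommit(dsn) for dsn in STANDBY_DSNS]
        try:
            deadline = time.monotonic() + timeout
            pending = list(conns)
            while pending and time.monotonic() < deadline:
                pending = [c for c in pending if not has_replayed(c, lsn)]
                if pending:
                    time.sleep(0.01)
            return not pending        # True if every standby caught up in time
        finally:
            for c in conns:
                c.close()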

> Once the application has committed a transaction, it may want to join the
> replication barrier before returning new data to the user. So it increments
> its counter in the table and waits until all replicas have replayed that
> value, according to the background monitoring process. Of course, timeouts,
> replica health checks and a few optimizations and circuit breakers are used.

I'm interested in how you handle failure (taking too long to respond
or to see the new counter value, connectivity failure etc).
Specifically, if the writer decides to give up on a certain standby
(timeout, circuit breaker etc), how should a client that is connected
directly to that standby now or soon afterwards know that this standby
has been 'dropped' from the replication barrier and it's now at risk
of seeing stale data?  My patch handles this by cancelling standbys'
leases explicitly and waiting for a response (if possible), but
otherwise waiting for them to expire (say if connectivity is lost or
standby has gone crazy or stopped responding), so that there is no
scenario where someone can successfully execute queries on a standby
that hasn't applied a transaction that you know to be committed on the
primary.

> The nice thing here is the constant number of connections involved, even if
> a lot of threads are joining the replication barrier at the same moment, and
> even if some replicas are lagging.
>
> With the implementation described in this thread, a 2-5 second lag on some
> replica would lead to an out-of-connections issue within a few milliseconds.

Right, if a standby is lagging more than the allowed amount, in my
patch the lease is cancelled and it will refuse to handle requests if
the GUC is on, with a special new error code, and then it's up to the
client to decide what to do.  Probably find another node.

> I think that may be the weak part of the patch, at least for our case.

Could you please elaborate?  What could you do that would be better?
If the answer is that you just want to know that you might be seeing
stale data but for some reason you don't want to have to find a new
node, the reader is welcome to turn synchronous_replay off and try
again (giving up data freshness guarantees).  Not sure when that would
be useful though.

> But it could possibly be used to eliminate the odd counter table in my case
> (if the setting can be changed per transaction).

Yes, the behaviour can be activated per transaction, using the usual
GUC scoping rules.  The setting synchronous_replay must be on in both
the write transaction and the following read transaction for the logic
to work (ie for the writer to wait, and for the reader to make sure
that it has a valid lease or raise an error).
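
To illustrate the intended usage, here is a rough sketch (Python/psycopg2;
the table and connection strings are made up, and 40P02 is the new error
code introduced by the patch):

    import psycopg2

    # Writer: with synchronous_replay on, COMMIT returns only once every
    # "available" standby has replayed the transaction (or lost its lease).
    wconn = psycopg2.connect("host=primary dbname=app")        # made-up DSN
    with wconn.cursor() as cur:
        cur.execute("SET synchronous_replay = on")
        cur.execute("INSERT INTO orders (total) VALUES (42)")  # made-up table
    wconn.commit()
    wconn.close()

    # Reader: with synchronous_replay on, the standby either has a valid
    # lease (so the row above is visible) or raises the new error (40P02).
    rconn = psycopg2.connect("host=standby1 dbname=app")       # made-up DSN
    try:
        with rconn.cursor() as cur:
            cur.execute("SET synchronous_replay = on")
            cur.execute("SELECT count(*) FROM orders")
            print(cur.fetchone()[0])
    except psycopg2.Error as e:
        if e.pgcode == "40P02":
            print("standby lost its lease; retry on another node")
        else:
            raise
    finally:
        rconn.close()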

It sounds like my synchronous_replay GUC is quite similar to your
replication barrier system, except that it has a way to handle node
failure and excessive lag without abandoning the guarantee.

I've attached a small shell script that starts up a primary and N
replicas with synchronous_replay configured, in the hope of
encouraging you to try it out.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Synchronous replay take III

2018-12-02 Thread Michail Nikolaev
Hello.

This is a really nice feature. I am working on a project which reads heavily
from replicas (6 of them).

In our case we have implemented a kind of "replication barrier"
functionality based on a table with counters (one counter per application
backend in the simple case).
Each application backend has a dedicated connection to each replica, and a
background process selects its counter value a few times (2-100) per second
from each replica (depending on how often the replication barrier is used).

Once the application has committed a transaction, it may want to join the
replication barrier before returning new data to the user. So it increments
its counter in the table and waits until all replicas have replayed that
value, according to the background monitoring process. Of course, timeouts,
replica health checks and a few optimizations and circuit breakers are used.
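
A simplified sketch of that flow (Python/psycopg2; the table name and timings
are only illustrative, and in reality the sampling happens in a background
process rather than inline):

    import time
    import psycopg2

    def bump_barrier(master_conn, backend_id):
        # Increment this backend's counter on the master; the commit is then
        # replicated and eventually replayed on every replica.
        with master_conn.cursor() as cur:
            cur.execute(
                "UPDATE replication_barrier SET counter = counter + 1 "
                "WHERE backend_id = %s RETURNING counter", (backend_id,))
            value, = cur.fetchone()
        master_conn.commit()
        return value

    def replica_reached(replica_conn, backend_id, value):
        with replica_conn.cursor() as cur:
            cur.execute(
                "SELECT counter >= %s FROM replication_barrier "
                "WHERE backend_id = %s", (value, backend_id))
            row = cur.fetchone()
        replica_conn.rollback()   # don't sit idle-in-transaction between polls
        return bool(row and row[0])

    def join_barrier(replica_conns, backend_id, value, timeout=5.0):
        # Wait until every (healthy) replica has replayed the counter update.
        deadline = time.monotonic() + timeout
        pending = list(replica_conns)
        while pending and time.monotonic() < deadline:
            pending = [c for c in pending
                       if not replica_reached(c, backend_id, value)]
            if pending:
                time.sleep(0.01)
        return not pending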

The nice thing here is the constant number of connections involved, even if
a lot of threads are joining the replication barrier at the same moment, and
even if some replicas are lagging.

With the implementation described in this thread, a 2-5 second lag on some
replica would lead to an out-of-connections issue within a few milliseconds.
I think that may be the weak part of the patch, at least for our case. But it
could possibly be used to eliminate the odd counter table in my case (if the
setting can be changed per transaction).

Thanks a lot,
Michail.


Re: Synchronous replay take III

2018-11-30 Thread Michael Paquier
On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:
> On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>> Unfortunately, cfbot says that the patch can't be applied without
>> conflicts; could you please post a rebased version and address the
>> comments from Masahiko?
> 
> Right, it conflicted with 4c703369 and cfdf4dc4.  While rebasing on
> top of those, I found myself wondering why syncrep.c thinks it needs
> special treatment for postmaster death.  I don't see any reason why we
> shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
> this new version.  If you kill -9 the postmaster, I don't see any
> reason to think that the existing coding is more correct than simply
> exiting immediately.

Hm.  This stuff runs under many assumptions, so I think that we should
be careful here with any changes as the very recent history has proved
(4c70336).  If we were to switch WAL senders on postmaster death, I
think that this could be a change independent of what is proposed here.
--
Michael


signature.asc
Description: PGP signature


Re: Synchronous replay take III

2018-11-30 Thread Thomas Munro
On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Unfortunately, cfbot says that the patch can't be applied without
> conflicts; could you please post a rebased version and address the
> comments from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4.  While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death.  I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version.  If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada  wrote:
> On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
>  wrote:
> > I was pinged off-list by a fellow -hackers denizen interested in the
> > synchronous replay feature and wanting a rebased patch to test.  Here
> > it goes, just in time for a Commitfest.  Please skip to the bottom of
> > this message for testing notes.
>
> Thank you for working on this. The overview and your summary were
> helpful for me to understand this feature. I've started to review this
> patch for PostgreSQL 12. I've tested it and found some issues, but let
> me ask you questions about the high-level design first. Sorry if these
> have already been discussed.

Thanks for your interest in this work!

> > This is a design choice favouring read-mostly workloads at the expense
> > of write transactions.  Hot standbys' whole raison for existing is to
> > move *some* read-only workloads off the primary server.  This proposal
> > is for users who are prepared to trade increased primary commit
> > latency for a guarantee about visibility on the standbys, so that
> > *all* read-only work could be moved to hot standbys.
>
> To be clear, what did you mean by read-mostly workloads?

I mean workloads where only a small percentage of transactions perform
a write.  If you need write-scalability, then hot_standby is not the
solution for you (with or without this patch).

The kind of user who would be interested in this feature is someone
who already uses some kind of heuristics to move some queries to
read-only standbys.  For example, some people send transactions for
logged-in users to the primary database (because only logged-in users
generate write queries), and all the rest to standby servers (for
example "public" users who can only read content).  Another technique
I have seen is to keep user sessions "pinned" on the primary server
for N minutes after they perform a write transaction.  These types of
load balancing policies are primitive ways of achieving
read-your-writes consistency, but they are conservative and
pessimistic: they probably send too many queries to the primary node.

This proposal is much more precise, allowing you to run the minimum
number of transactions on the primary node (ie transactions that
actually need to perform a write), and the maximum number of
transactions on the hot standbys.

As discussed, making reads wait for a token would be a useful
alternative (and I am willing to help make that work too), but:

1.  For users that do many more reads than writes, would you
rather make (say) 80% of transactions slower or 20%?  (Or 99% vs 1% as
the case may be, depending on your application.)

2.  If you are also using synchronous_commit = on for increased
durability, then you are already making writers wait, and you might be
able to tolerate a small increase.

Peter Eisentraut expressed an interesting point of view against this
general line of thinking:

https://www.postgresql.org/message-id/5643933F.4010701%40gmx.net

My questions are:  Why do we have hot_standby mode?  Is load balancing
a style of usage we want to support?  Do we want a technology that
lets you do more of it?

> I think there are two kinds of reads on standbys: a read that happens
> after writes, and a direct read (e.g. reporting). The former usually
> requires causal reads, as you mentioned, in order to read its own
> writes, but the latter might be different: it often wants to read the
> latest data on the master at that time. IIUC, even if we send a
> read-only query directly to a synchronous replay server we could get a
> stale result if the standby is delayed by less than
> synchronous_replay_max_lag. So this synchronous replay feature would
> be helpful for the former case (i.e. a few writes and many reads that
> want to see them), whereas for the latter case perhaps keeping the
> reads waiting on the standby seems a reasonable solution.

I agree 100% that this is not a solution for all users.  But I also
suspect a token system would be quite complicated, and can't be done
in a way that is transparent to applications without giving up
performance advantages.  I wrote about my understanding of the
trade-offs here:

https://www.postgresql.org/message-id/CAEepm%3D0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ%40mail.gmail.com

> Also I think it's 

Re: Synchronous replay take III

2018-11-30 Thread Dmitry Dolgov
> On Thu, Nov 15, 2018 at 6:34 AM Masahiko Sawada  wrote:
> > On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro 
> >  wrote:
> >
> > In previous threads[1][2][3] I called this feature proposal "causal
> > reads".  That was a terrible name, borrowed from MySQL.  While it is
> > probably a useful term of art, for one thing people kept reading it as
> > "casual"

Yeah, it was rather annoying that I couldn't get rid of this while playing
with the "take II" version :)

> To be clear, what did you mean by read-mostly workloads?
>
> I think there are two kinds of reads on standbys: a read that happens
> after writes, and a direct read (e.g. reporting). The former usually
> requires causal reads, as you mentioned, in order to read its own
> writes, but the latter might be different: it often wants to read the
> latest data on the master at that time. IIUC, even if we send a
> read-only query directly to a synchronous replay server we could get a
> stale result if the standby is delayed by less than
> synchronous_replay_max_lag. So this synchronous replay feature would
> be helpful for the former case (i.e. a few writes and many reads that
> want to see them), whereas for the latter case perhaps keeping the
> reads waiting on the standby seems a reasonable solution.
>
> Also, I think it's worth considering the cost of both causal reads *and*
> non-causal reads.
>
> I've considered a mixed workload (transactions requiring causal reads
> and transactions not requiring them) under the current design. IIUC,
> the current design seems to create something like a consistent-reads
> group by specifying servers. For example, if a transaction doesn't need
> causal reads it can send its query to any server with
> synchronous_replay = off, but if it does, it should select a
> synchronous replay server. It also means that client applications or
> routing middleware such as pgpool are required to be aware of the
> available synchronous replay standbys. That is, this design puts the
> cost on the read-only transactions requiring causal reads. On the other
> hand, with token-based causal reads we can send a read-only query to
> any standby if we can wait for the change to be replayed. Of course, if
> we don't want to wait forever we can time out and switch to either
> another standby or the master to execute the query, but we don't need
> to choose from a particular set of standby servers.

Unfortunately, cfbot says that the patch can't be applied without conflicts;
could you please post a rebased version and address the comments from
Masahiko?



Re: Synchronous replay take III

2018-11-14 Thread Masahiko Sawada
On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
 wrote:
>
> Hi hackers,
>
> I was pinged off-list by a fellow -hackers denizen interested in the
> synchronous replay feature and wanting a rebased patch to test.  Here
> it goes, just in time for a Commitfest.  Please skip to the bottom of
> this message for testing notes.

Thank you for working on this. The overview and your summary were
helpful for me to understand this feature. I've started to review this
patch for PostgreSQL 12. I've tested it and found some issues, but let
me ask you questions about the high-level design first. Sorry if these
have already been discussed.

>
> In previous threads[1][2][3] I called this feature proposal "causal
> reads".  That was a terrible name, borrowed from MySQL.  While it is
> probably a useful term of art, for one thing people kept reading it as
> "casual", which it ain't, and more importantly this patch is only one
> way to achieve read-follows-write causal consistency.  Several others
> are proposed or exist in forks (user managed wait-for-LSN, global
> transaction manager, ...).
>
> OVERVIEW
>
> For writers, it works a bit like RAID mirroring: when you commit a
> write transaction, it waits until the data has become visible on all
> elements of the array, and if an array element is not responding fast
> enough it is kicked out of the array.  For readers, it's a little
> different because you're connected directly to the array elements
> (rather than going through a central controller), so it uses a system
> > of leases allowing read transactions to know instantly whether
> they are running on an element that is currently in the array and are
> therefore able to service synchronous_replay transactions, or should
> raise an error telling you to go and ask some other element.
>
> This is a design choice favouring read-mostly workloads at the expense
> of write transactions.  Hot standbys' whole raison for existing is to
> move *some* read-only workloads off the primary server.  This proposal
> is for users who are prepared to trade increased primary commit
> latency for a guarantee about visibility on the standbys, so that
> *all* read-only work could be moved to hot standbys.

To be clear, what did you mean by read-mostly workloads?

I think there are two kinds of reads on standbys: a read that happens
after writes, and a direct read (e.g. reporting). The former usually
requires causal reads, as you mentioned, in order to read its own
writes, but the latter might be different: it often wants to read the
latest data on the master at that time. IIUC, even if we send a
read-only query directly to a synchronous replay server we could get a
stale result if the standby is delayed by less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case (i.e. a few writes and many reads that
want to see them), whereas for the latter case perhaps keeping the
reads waiting on the standby seems a reasonable solution.

Also, I think it's worth considering the cost of both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring them) under the current design. IIUC,
the current design seems to create something like a consistent-reads
group by specifying servers. For example, if a transaction doesn't need
causal reads it can send its query to any server with
synchronous_replay = off, but if it does, it should select a
synchronous replay server. It also means that client applications or
routing middleware such as pgpool are required to be aware of the
available synchronous replay standbys. That is, this design puts the
cost on the read-only transactions requiring causal reads. On the other
hand, with token-based causal reads we can send a read-only query to
any standby if we can wait for the change to be replayed. Of course, if
we don't want to wait forever we can time out and switch to either
another standby or the master to execute the query, but we don't need
to choose from a particular set of standby servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Synchronous replay take III

2018-10-22 Thread Thomas Munro
On Mon, Sep 24, 2018 at 10:39 AM Thomas Munro
 wrote:
> On Mon, Jul 2, 2018 at 12:39 PM Thomas Munro
>  wrote:
> > On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
> >  wrote:
> > > Thanks Thomas, appreciate the rebase and the work you've done on this.
> > > I should have some time to test this out over the weekend.
> >
> > Rebased.
> Rebased.
Rebased.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Synchronous-replay-mode-for-avoiding-stale-reads--v8.patch
Description: Binary data


Re: Synchronous replay take III

2018-09-23 Thread Thomas Munro
On Mon, Jul 2, 2018 at 12:39 PM Thomas Munro
 wrote:
> On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
>  wrote:
> > Thanks Thomas, appreciate the rebase and the work you've done on this.
> > I should have some time to test this out over the weekend.
>
> Rebased.

Rebased.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Synchronous-replay-mode-for-avoiding-stale-reads--v7.patch
Description: Binary data


Re: Synchronous replay take III

2018-07-01 Thread Thomas Munro
On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
 wrote:
> Thanks Thomas, appreciate the rebase and the work you've done on this.
> I should have some time to test this out over the weekend.

Rebased.  Moved to September.  I still need to provide a TAP test and
explain that weirdness reported by Dmitry Dolgov, but I didn't get to
that in time for the bonus early commitfest that we're now in.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Synchronous-replay-mode-for-avoiding-stale-reads--v6.patch
Description: Binary data


Re: Synchronous replay take III

2018-03-02 Thread Adam Brusselback
Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

-Adam



Re: Synchronous replay take III

2018-03-01 Thread Michael Paquier
On Thu, Mar 01, 2018 at 06:55:18PM +1300, Thomas Munro wrote:
> On Thu, Mar 1, 2018 at 2:39 PM, Thomas Munro
>  wrote:
>> I was pinged off-list by a fellow -hackers denizen interested in the
>> synchronous replay feature and wanting a rebased patch to test.  Here
>> it goes, just in time for a Commitfest.  Please skip to the bottom of
>> this message for testing notes.
> 
> Moved to next CF based on
> https://www.postgresql.org/message-id/24193.1519882945%40sss.pgh.pa.us

Thanks, Thomas.  This looks like the right move to me.
--
Michael


signature.asc
Description: PGP signature


Re: Synchronous replay take III

2018-02-28 Thread Thomas Munro
On Thu, Mar 1, 2018 at 2:39 PM, Thomas Munro
 wrote:
> I was pinged off-list by a fellow -hackers denizen interested in the
> synchronous replay feature and wanting a rebased patch to test.  Here
> it goes, just in time for a Commitfest.  Please skip to the bottom of
> this message for testing notes.

Moved to next CF based on
https://www.postgresql.org/message-id/24193.1519882945%40sss.pgh.pa.us
.

-- 
Thomas Munro
http://www.enterprisedb.com



Synchronous replay take III

2018-02-28 Thread Thomas Munro
Hi hackers,

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test.  Here
it goes, just in time for a Commitfest.  Please skip to the bottom of
this message for testing notes.

In previous threads[1][2][3] I called this feature proposal "causal
reads".  That was a terrible name, borrowed from MySQL.  While it is
probably a useful term of art, for one thing people kept reading it as
"casual", which it ain't, and more importantly this patch is only one
way to achieve read-follows-write causal consistency.  Several others
are proposed or exist in forks (user managed wait-for-LSN, global
transaction manager, ...).

OVERVIEW

For writers, it works a bit like RAID mirroring: when you commit a
write transaction, it waits until the data has become visible on all
elements of the array, and if an array element is not responding fast
enough it is kicked out of the array.  For readers, it's a little
different because you're connected directly to the array elements
(rather than going through a central controller), so it uses a system
of leases allowing read transactions to know instantly whether
they are running on an element that is currently in the array and are
therefore able to service synchronous_replay transactions, or should
raise an error telling you to go and ask some other element.

This is a design choice favouring read-mostly workloads at the expense
of write transactions.  Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server.  This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

The guarantee is: When two transactions tx1, tx2 are run with
synchronous_replay set to on and tx1 reports successful commit before
tx2 begins, then tx2 is guaranteed either to see tx1 or to raise a new
error 40P02 if it is run on a hot standby.  I have joked that that
error means "snapshot too young".  You could handle it the same way
you handle deadlocks and serialization failures: by retrying, except
in this case you might want to avoid that node for a while.
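
For example, client-side handling might look something like this sketch
(Python/psycopg2; the connection strings and retry policy are made up, only
the 40P02 SQLSTATE comes from this patch):

    import psycopg2

    STANDBY_DSNS = ["host=standby1 dbname=app",
                    "host=standby2 dbname=app"]   # made-up connection strings

    def run_fresh_read(sql, params=None):
        # Try each standby in turn; treat the new 40P02 error ("snapshot too
        # young") like a serialization failure and move on to another node.
        last_error = None
        for dsn in STANDBY_DSNS:
            try:
                conn = psycopg2.connect(dsn)
                try:
                    with conn.cursor() as cur:
                        cur.execute("SET synchronous_replay = on")
                        cur.execute(sql, params)
                        return cur.fetchall()
                finally:
                    conn.close()
            except psycopg2.Error as e:
                # pgcode is None for connection failures; 40P02 means this
                # standby's lease was revoked.  Either way, try the next node.
                if e.pgcode is not None and e.pgcode != "40P02":
                    raise
                last_error = e
        raise last_error or RuntimeError("no usable standby")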

Note that this feature is concerned with transaction visibility.  It
is not concerned with transaction durability.  It will happily kick
all of your misbehaving or slow standbys out of the array so that you
fall back to single-node commit durability.  You can express your
durability requirement (ie I must have N copies of the data on
disk before I tell any external party about a transaction) separately,
by configuring regular synchronous replication alongside this feature.
I suspect that this feature would be most popular with people who are
already using regular synchronous replication though, because they
already tolerate higher commit latency.

STATUS

Here's a quick summary of the status of this proposal as I see it:

* Simon Riggs, as the committer most concerned with the areas this
proposal touches -- namely streaming replication and specifically
syncrep -- has not so far appeared to be convinced by the value of
this approach, and has expressed a preference for pursuing client-side
or middleware tracked LSN tokens exclusively.  I am perceptive enough
to see that failing to sell the idea to Simon is probably fatal to the
proposal.  The main task therefore is to show convincingly that there
is a real use case for this high-level design and its set of
trade-offs, and that it justifies its maintenance burden.

* I have tried to show that there are already many users who route
their read-only queries to hot standby databases (not just "reporting
queries"), and libraries and tools to help people do that using
heuristics like "logged in users need fresh data, so primary only" or
"this session has written in the past N minutes, so primary only".
This proposal would provide a way for those users to do something
based on a guarantee instead of such flimsy heuristics.  I have tried
to show that the libraries used by Python, Ruby, Java etc to achieve
that sort of load balancing should easily be able to handle finding
read-only nodes, routing read-only queries and dealing with the new
error.  I do also acknowledge that such libraries could also be used
to provide transparent read-my-writes support by tracking LSNs and
injecting wait-for-LSN directives with alternative proposals, but that
is weaker than a global reads-follow-writes guarantee and the
difference can matter.

* I have argued that token-based systems are in fact rather
complicated[4] and by no means a panacea.  As usual, there are a whole
bunch of trade-offs.  I suspect that this proposal AND fully
user-managed causality tokens (no middleware) are both valuable sweet
spots for a non-GTM system.

* Ants Aasma pointed out that this proposal doesn't provide a
read-follows-read guarantee.  He is right, and I'm not sure to what
extent that is a