Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-05-04 Thread Andres Freund
On 2016-05-05 13:30:42 +1200, Thomas Munro wrote:
> That was a red herring.  I was confused because SUSv2 and POSIX call
> this argument 'errorfds' and say that sockets *also* tell you about
> errors this way.  (Many/most real OSs call the argument 'exceptfds'
> instead and only use it to tell you about out-of-band data and
> possibly implementation specific events for devices, pseudo-terminals
> etc.  If you want to know about errors on a socket it's enough to have
> it in readfds/writefds, and insufficient to have it only in
> errorfds/exceptfds unless you can find a computer that actually
> conforms to POSIX.)

Correct, exceptfds is pretty much meaningless for anything we do in
postgres. We rely on select returning a socket as read/writeable if the
socket has hung up.  That's been the case *before* the recent
WaitEventSet refactoring, so I think we're fairly solid on relying on
that.
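
For illustration, a minimal sketch (not PostgreSQL code, and not the
actual libpq_select/WaitEventSet implementation) of the pattern relied
on above: put the socket in readfds only, and let the subsequent read
report the error, since a hung-up or errored socket shows up as readable
on the platforms we care about.

    #include <sys/select.h>

    /*
     * Wait until 'sock' is readable.  A peer hangup or socket error is
     * reported as readability; the following recv()/read() then returns
     * 0 (EOF) or -1 with errno set, which is where the error actually
     * surfaces.  No exceptfds set is needed for this.
     */
    static int
    wait_for_socket(int sock)
    {
        fd_set      readfds;

        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);

        /* NULL timeout: block until the socket is readable or hung up */
        return select(sock + 1, &readfds, NULL, NULL, NULL);
    }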

Greetings,

Andres Freund




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-05-04 Thread Thomas Munro
On Tue, Mar 29, 2016 at 8:17 PM, Michael Paquier
 wrote:
> On Tue, Mar 29, 2016 at 1:11 PM, Thomas Munro  
> wrote:
>> (BTW, isn't the select call in libpq_select
>> lacking an exceptfds set, and can't it therefore block forever when
>> there is an error condition on the socket and no timeout?)
>
> Hm. I think you're right here when timeout is NULL... It would loop 
> infinitely.
> @Andres (in CC): your thoughts on that regarding the new
> WaitEventSetWaitBlock()? The same pattern is used there.

That was a red herring.  I was confused because SUSv2 and POSIX call
this argument 'errorfds' and say that sockets *also* tell you about
errors this way.  (Many/most real OSs call the argument 'exceptfds'
instead and only use it to tell you about out-of-band data and
possibly implementation specific events for devices, pseudo-terminals
etc.  If you want to know about errors on a socket it's enough to have
it in readfds/writefds, and insufficient to have it only in
errorfds/exceptfds unless you can find a computer that actually
conforms to POSIX.)

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-04-05 Thread Robert Haas
On Tue, Apr 5, 2016 at 7:21 PM, Thomas Munro
 wrote:
> Thanks.  I see a way to move that loop and change the overflow
> behaviour along those lines but due to other commitments I won't be
> able to post a well tested patch and still leave time for reviewers
> and committer before the looming deadline.  After the freeze I will
> post an updated version that addresses these problems for the next CF.

OK, I've marked this Returned with Feedback for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-04-05 Thread Thomas Munro
On Tue, Apr 5, 2016 at 4:17 AM, Robert Haas  wrote:
> On Wed, Mar 30, 2016 at 2:22 AM, Thomas Munro
>  wrote:
>> On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas  wrote:
>>> OK, I committed this, with a few tweaks.  In particular, I added a
>>> flag variable instead of relying on "latch set" == "need to send
>>> reply"; the other changes were cosmetic.
>>>
>>> I'm not sure how much more of this we can realistically get into 9.6;
>>> the latter patches haven't had much review yet.  But I'll set this
>>> back to Needs Review in the CommitFest and we'll see where we end up.
>>> But even if we don't get anything more than this, it's still rather
>>> nice: remote_apply turns out to be only slightly slower than remote
>>> flush, and it's a guarantee that a lot of people are looking for.
>>
>> Thank you Michael and Robert!
>>
>> Please find attached the rest of the patch series, rebased against
>> master.  The goal of the 0002 patch is to provide an accurate
>> indication of the current replay lag on each standby, visible to users
>> like this:
>>
>> postgres=# select application_name, replay_lag from pg_stat_replication;
>>  application_name │   replay_lag
>> ──────────────────┼─────────────────
>>  replica1         │ 00:00:00.000299
>>  replica2         │ 00:00:00.000323
>>  replica3         │ 00:00:00.000319
>>  replica4         │ 00:00:00.000303
>> (4 rows)
>>
>> It works by maintaining a buffer of (end of WAL, time now) samples
>> received from the primary, and then eventually feeding those times
>> back to the primary when the recovery process replays the
>> corresponding locations.
>>
>> Compared to approaches based on commit timestamps, this approach has
>> the advantage of providing non-misleading information between commits.
>> For example, if you run a batch load job that takes 1 minute to insert
>> the whole phonebook and no other transactions run, you will see
>> replay_lag updating regularly throughout that minute, whereas typical
>> commit timestamp-only approaches will show an increasing lag time
>> until a commit record is eventually applied.  Compared to simple LSN
>> location comparisons, it reports in time rather than bytes of WAL,
>> which can be more meaningful for DBAs.
>>
>> When the standby is entirely caught up and there is no write activity,
>> the reported time effectively represents the ping time between the
>> servers, and is updated every wal_sender_timeout / 2, when keepalive
>> messages are sent.  While new WAL traffic is arriving, the walreceiver
>> records timestamps at most once per second in a circular buffer, and
>> then sends back replies containing the recorded timestamps as fast as
>> the recovery process can apply the corresponding xlog.  The lag number
>> you see is computed by the primary server comparing two timestamps
>> generated by its own system clock, one of which has been on a journey
>> to the standby and back.
>>
>> Accurate lag estimates are a prerequisite for the 0004 patch (about
>> which more later), but I believe users would find this valuable as a
>> feature on its own.
>
> Well, one problem with this is that you can't put a loop inside of a
> spinlock-protected critical section.
>
> In general, I think this is a pretty reasonable way of attacking this
> problem, but I'd say it's significantly under-commented.  Where should
> someone go to get a general overview of this mechanism?  The answer is
> not "at place XXX within the patch".  (I think it might merit some
> more extensive documentation, too, although I'm not exactly sure what
> that should look like.)
>
> When you overflow the buffer, you could thin it out in a smarter way,
> like by throwing away every other entry instead of the oldest one.  I
> guess you'd need to be careful how you coded that, though, because
> replaying an entry with a timestamp invalidates some of the saved
> entries without formally throwing them out.
>
> Conceivably, 0002 could be split into two patches, one of which
> computes "stupid replay lag" considering only records that naturally
> carry timestamps, and a second adding the circular buffer to handle
> the case where much time passes without finding such a record.

Thanks.  I see a way to move that loop and change the overflow
behaviour along those lines but due to other commitments I won't be
able to post a well tested patch and still leave time for reviewers
and committer before the looming deadline.  After the freeze I will
post an updated version that addresses these problems for the next CF.
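
For readers following along, here is a rough sketch of the sampling
scheme described in the quoted text above.  The structure and names are
illustrative only, not the patch's actual code, and assume the usual
PostgreSQL definitions of XLogRecPtr and TimestampTz.

    /* Illustrative (end-of-WAL, timestamp) sample buffer on the standby. */
    #define LAG_SAMPLES 1024

    typedef struct LagSample
    {
        XLogRecPtr  lsn;    /* end of WAL when the sample was recorded */
        TimestampTz time;   /* primary's timestamp carried in the message */
    } LagSample;

    static LagSample samples[LAG_SAMPLES];
    static int  sample_head = 0;    /* next slot to write */
    static int  sample_tail = 0;    /* oldest sample not yet reported */

    /* Walreceiver side: record a sample (at most once per second). */
    static void
    record_sample(XLogRecPtr end_lsn, TimestampTz primary_time)
    {
        int     next = (sample_head + 1) % LAG_SAMPLES;

        if (next == sample_tail)
            return;             /* full; overflow policy is discussed above */
        samples[sample_head].lsn = end_lsn;
        samples[sample_head].time = primary_time;
        sample_head = next;
    }

    /*
     * Once replay has reached 'replayed_lsn', pop the newest recorded
     * timestamp whose LSN has been applied.  The walreceiver sends it back,
     * and the primary computes lag as now() minus that timestamp, so both
     * timestamps come from the primary's own clock.
     */
    static bool
    pop_applied_sample(XLogRecPtr replayed_lsn, TimestampTz *apply_time)
    {
        bool        found = false;

        while (sample_tail != sample_head &&
               samples[sample_tail].lsn <= replayed_lsn)
        {
            *apply_time = samples[sample_tail].time;
            sample_tail = (sample_tail + 1) % LAG_SAMPLES;
            found = true;
        }
        return found;
    }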

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-04-04 Thread Robert Haas
On Wed, Mar 30, 2016 at 2:22 AM, Thomas Munro
 wrote:
> On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas  wrote:
>> OK, I committed this, with a few tweaks.  In particular, I added a
>> flag variable instead of relying on "latch set" == "need to send
>> reply"; the other changes were cosmetic.
>>
>> I'm not sure how much more of this we can realistically get into 9.6;
>> the latter patches haven't had much review yet.  But I'll set this
>> back to Needs Review in the CommitFest and we'll see where we end up.
>> But even if we don't get anything more than this, it's still rather
>> nice: remote_apply turns out to be only slightly slower than remote
>> flush, and it's a guarantee that a lot of people are looking for.
>
> Thank you Michael and Robert!
>
> Please find attached the rest of the patch series, rebased against
> master.  The goal of the 0002 patch is to provide an accurate
> indication of the current replay lag on each standby, visible to users
> like this:
>
> postgres=# select application_name, replay_lag from pg_stat_replication;
>  application_name │   replay_lag
> ──────────────────┼─────────────────
>  replica1         │ 00:00:00.000299
>  replica2         │ 00:00:00.000323
>  replica3         │ 00:00:00.000319
>  replica4         │ 00:00:00.000303
> (4 rows)
>
> It works by maintaining a buffer of (end of WAL, time now) samples
> received from the primary, and then eventually feeding those times
> back to the primary when the recovery process replays the
> corresponding locations.
>
> Compared to approaches based on commit timestamps, this approach has
> the advantage of providing non-misleading information between commits.
> For example, if you run a batch load job that takes 1 minute to insert
> the whole phonebook and no other transactions run, you will see
> replay_lag updating regularly throughout that minute, whereas typical
> commit timestamp-only approaches will show an increasing lag time
> until a commit record is eventually applied.  Compared to simple LSN
> location comparisons, it reports in time rather than bytes of WAL,
> which can be more meaningful for DBAs.
>
> When the standby is entirely caught up and there is no write activity,
> the reported time effectively represents the ping time between the
> servers, and is updated every wal_sender_timeout / 2, when keepalive
> messages are sent.  While new WAL traffic is arriving, the walreceiver
> records timestamps at most once per second in a circular buffer, and
> then sends back replies containing the recorded timestamps as fast as
> the recovery process can apply the corresponding xlog.  The lag number
> you see is computed by the primary server comparing two timestamps
> generated by its own system clock, one of which has been on a journey
> to the standby and back.
>
> Accurate lag estimates are a prerequisite for the 0004 patch (about
> which more later), but I believe users would find this valuable as a
> feature on its own.

Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.

In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented.  Where should
someone go to get a general overview of this mechanism?  The answer is
not "at place XXX within the patch".  (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)

When you overflow the buffer, you could thin it out in a smarter way,
like by throwing away every other entry instead of the oldest one.  I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.
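
A rough sketch of that thinning idea, shown for a flat array of entries
for simplicity (a circular buffer needs the equivalent modular
arithmetic; the LagSample entry type is the illustrative one sketched
earlier in the thread, not the patch's actual structure):

    /*
     * Halve a full sample buffer by keeping every second entry: the
     * surviving samples still span the same LSN range, just at half the
     * time resolution, instead of discarding only the oldest entry.
     */
    static void
    thin_samples(LagSample *entries, int *nentries)
    {
        int     i, j;

        for (i = 0, j = 0; i < *nentries; i += 2, j++)
            entries[j] = entries[i];
        *nentries = j;
    }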

Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-29 Thread Robert Haas
On Tue, Mar 29, 2016 at 5:47 PM, Thomas Munro
 wrote:
> On Wed, Mar 30, 2016 at 6:04 AM, Robert Haas  wrote:
>> On Tue, Mar 29, 2016 at 3:17 AM, Michael Paquier
>>  wrote:
>>> OK, so I am switching this patch as "Ready for committer", for 0001.
>>> It is in better shape now.
>>
>> Well...  I have a few questions yet.
>>
>> The new argument to SyncRepWaitForLSN is called "bool commit", but
>> RecordTransactionAbortPrepared passes true.  Either it should be
>> passing false, or the parameter is misnamed or at the least in need of
>> a better comment.
>>
>> I don't understand why this patch is touching the abort paths at all.
>> XactLogAbortRecord sets XACT_COMPLETION_SYNC_APPLY_FEEDBACK, and
>> xact_redo_abort honors it.  But surely it makes no sense to wait for
>> an abort to become visible.
>
> You're right, that was totally unnecessary.  Here is a version that
> removes that (ie XactLogAbortRecord doesn't request apply feedback
> from the standby, xact_redo_abort doesn't send apply feedback to the
> primary and RecordTransactionAbortPrepared now passes false to
> SyncRepWaitForLSN so it doesn't wait for apply feedback from the
> standby).  Also I fixed a silly bug in SyncRepWaitForLSN when capping
> the mode.  I have also renamed  XACT_COMPLETION_SYNC_APPLY_FEEDBACK to
> the more general XACT_COMPLETION_APPLY_FEEDBACK, because the later
> 0004 patch will use it for a more general purpose than
> synchronous_commit.

OK, I committed this, with a few tweaks.  In particular, I added a
flag variable instead of relying on "latch set" == "need to send
reply"; the other changes were cosmetic.

I'm not sure how much more of this we can realistically get into 9.6;
the latter patches haven't had much review yet.  But I'll set this
back to Needs Review in the CommitFest and we'll see where we end up.
But even if we don't get anything more than this, it's still rather
nice: remote_apply turns out to be only slightly slower than remote
flush, and it's a guarantee that a lot of people are looking for.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-29 Thread Thomas Munro
On Wed, Mar 30, 2016 at 6:04 AM, Robert Haas  wrote:
> On Tue, Mar 29, 2016 at 3:17 AM, Michael Paquier
>  wrote:
>> OK, so I am switching this patch as "Ready for committer", for 0001.
>> It is in better shape now.
>
> Well...  I have a few questions yet.
>
> The new argument to SyncRepWaitForLSN is called "bool commit", but
> RecordTransactionAbortPrepared passes true.  Either it should be
> passing false, or the parameter is misnamed or at the least in need of
> a better comment.
>
> I don't understand why this patch is touching the abort paths at all.
> XactLogAbortRecord sets XACT_COMPLETION_SYNC_APPLY_FEEDBACK, and
> xact_redo_abort honors it.  But surely it makes no sense to wait for
> an abort to become visible.

You're right, that was totally unnecessary.  Here is a version that
removes that (ie XactLogAbortRecord doesn't request apply feedback
from the standby, xact_redo_abort doesn't send apply feedback to the
primary and RecordTransactionAbortPrepared now passes false to
SyncRepWaitForLSN so it doesn't wait for apply feedback from the
standby).  Also I fixed a silly bug in SyncRepWaitForLSN when capping
the mode.  I have also renamed  XACT_COMPLETION_SYNC_APPLY_FEEDBACK to
the more general XACT_COMPLETION_APPLY_FEEDBACK, because the later
0004 patch will use it for a more general purpose than
synchronous_commit.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-remote-apply-v10.patch
Description: Binary data



Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-29 Thread Robert Haas
On Tue, Mar 29, 2016 at 3:17 AM, Michael Paquier
 wrote:
> OK, so I am switching this patch as "Ready for committer", for 0001.
> It is in better shape now.

Well...  I have a few questions yet.

The new argument to SyncRepWaitForLSN is called "bool commit", but
RecordTransactionAbortPrepared passes true.  Either it should be
passing false, or the parameter is misnamed or at the least in need of
a better comment.

I don't understand why this patch is touching the abort paths at all.
XactLogAbortRecord sets XACT_COMPLETION_SYNC_APPLY_FEEDBACK, and
xact_redo_abort honors it.  But surely it makes no sense to wait for
an abort to become visible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-29 Thread Michael Paquier
On Tue, Mar 29, 2016 at 1:11 PM, Thomas Munro
 wrote:
> On Tue, Mar 29, 2016 at 2:28 AM, Michael Paquier
>  wrote:
>> I definitely prefer that, that's neater! libpq_select could be
>> simplified because a timeout does not matter much.
>
> Ok, here is a new version that exits the streaming loop correctly when
> endofwal becomes true.  To hit that codepath you have to set up a
> cascading standby with recovery_target_timeline = 'latest', and then
> promote the standby it's talking to.  I also got rid of the
> PostmasterIsAlive() check which became superfluous.

Yes, I can see the difference.

> You're right that libpq_select is now only ever called with timeout =
> -1 so could theoretically lose the parameter, but I decided against
> cluttering this patch up by touching that for now.  It seems like the
> only reason it's used by libpqrcv_PQexec is something to do with
> interrupts on Windows, which I'm not able to test so that was another
> reason not to touch it.

OK. I don't mind if the first patch is bare-bone. That's additional cleanup.

> (BTW, isn't the select call in libpq_select
> lacking an exceptfds set, and can't it therefore block forever when
> there is an error condition on the socket and no timeout?)

Hm. I think you're right here when timeout is NULL... It would loop infinitely.
@Andres (in CC): your thoughts on that regarding the new
WaitEventSetWaitBlock()? The same pattern is used there.

-bool walrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
-
-Retrieve any message available through the connection, blocking for
Oh, the description of walrcv_receive is actually incorrect in
src/backend/replication/README from the beginning... I am sure you
noticed that as well. Perhaps that's worth fixing in the back-branches
(I think it does matter). Thoughts from others?

OK, so I am switching this patch as "Ready for committer", for 0001.
It is in better shape now.
-- 
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-28 Thread Michael Paquier
On Mon, Mar 28, 2016 at 10:08 PM, Thomas Munro
 wrote:
> On Tue, Mar 29, 2016 at 1:56 AM, Thomas Munro
>  wrote:
>> On Mon, Mar 28, 2016 at 8:54 PM, Michael Paquier
>>  wrote:
>>> I have been also thinking a lot about this patch, and the fact that
>>> the WAL receiver latch is being used within the internals of
>>> libpqwalreceiver has been bugging me a lot, because this makes the
>>> wait phase that happens within libpqwalreceiver depend on something
>>> that only the WAL receiver had sole control over until now (among the
>>> options considered: having a second latch for libpqwalreceiver, having an
>>> event interface for libpqwalreceiver, switching libpq_receive to being
>>> asynchronous...).
>>
>> Yeah, it bugs me too.  Do you prefer this?
>>
>> int walrcv_receive(char **buffer, int *wait_fd);
>>
>> Return value -1 means end-of-copy as before, return value 0 means "no
>> data available now, please call me again when *wait_fd is ready to
>> read".  Then walreceiver.c can look after the WaitLatchOrSocket call
>> and deal with socket readiness, postmaster death, timeout and latch,
>> and libpqwalreceiver.c doesn't know anything about all that stuff
>> anymore, but it is now part of the interface that it must expose a
>> file descriptor for readiness testing when it doesn't have data
>> available.
>>
>> Please find attached a new patch series which does it that way.
>
> Oops, there is a bug in the primary disconnection case when len == 1
> and it breaks out of the loop and wait_fd is invalid.  I'll follow up
> on that tomorrow, but I'm interested to hear your thoughts (and anyone
> else's!) on that interface change and general approach.

I definitely prefer that, that's neater! libpq_select could be
simplified because a timeout does not matter much.
-- 
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-28 Thread Thomas Munro
On Tue, Mar 29, 2016 at 1:56 AM, Thomas Munro
 wrote:
> On Mon, Mar 28, 2016 at 8:54 PM, Michael Paquier
>  wrote:
>> I have been also thinking a lot about this patch, and the fact that
>> the WAL receiver latch is being used within the internals of
>> libpqwalreceiver has been bugging me a lot, because this makes the
>> wait phase that happens within libpqwalreceiver depend on something
>> that only the WAL receiver had sole control over until now (among the
>> options considered: having a second latch for libpqwalreceiver, having an
>> event interface for libpqwalreceiver, switching libpq_receive to being
>> asynchronous...).
>
> Yeah, it bugs me too.  Do you prefer this?
>
> int walrcv_receive(char **buffer, int *wait_fd);
>
> Return value -1 means end-of-copy as before, return value 0 means "no
> data available now, please call me again when *wait_fd is ready to
> read".  Then walreceiver.c can look after the WaitLatchOrSocket call
> and deal with socket readiness, postmaster death, timeout and latch,
> and libpqwalreceiver.c doesn't know anything about all that stuff
> anymore, but it is now part of the interface that it must expose a
> file descriptor for readiness testing when it doesn't have data
> available.
>
> Please find attached a new patch series which does it that way.

Oops, there is a bug in the primary disconnection case when len == 1
and it breaks out of the loop and wait_fd is invalid.  I'll follow up
on that tomorrow, but I'm interested to hear your thoughts (and anyone
else's!) on that interface change and general approach.
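
To make the proposed division of labour concrete, here is a rough sketch
of what the caller side in walreceiver.c could look like under this
interface.  The flag and helper names are taken from the existing tree
from memory and the exact WaitLatchOrSocket signature varies by release,
so treat this as illustrative rather than the patch's actual code.

    for (;;)
    {
        char       *buf;
        pgsocket    wait_fd;
        int         len;

        len = walrcv_receive(&buf, &wait_fd);

        if (len > 0)
        {
            /* Data available: hand the message to the existing machinery. */
            XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
        }
        else if (len == 0)
        {
            /* No data: wait for socket readiness, latch, timeout or death. */
            int         rc;

            rc = WaitLatchOrSocket(&WalRcv->latch,
                                   WL_LATCH_SET | WL_SOCKET_READABLE |
                                   WL_POSTMASTER_DEATH | WL_TIMEOUT,
                                   wait_fd,
                                   NAPTIME_PER_CYCLE);

            if (rc & WL_POSTMASTER_DEATH)
                proc_exit(1);
            if (rc & WL_LATCH_SET)
            {
                ResetLatch(&WalRcv->latch);
                /* e.g. send an immediate apply-position reply */
            }
        }
        else
        {
            /* Negative return value: end of the COPY stream. */
            break;
        }
    }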

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-28 Thread Michael Paquier
On Sun, Mar 27, 2016 at 7:30 AM, Thomas Munro
 wrote:
> On Sat, Mar 26, 2016 at 2:48 AM, Michael Paquier
>  wrote:
>> Should we be worried about potential backward-incompatibilities with the
>> new return values of walrcv_receive?
>
> There are three changes to the walrcv_receive interface:
>
> 1.  It now takes a latch pointer, which may be NULL.
>
> 2.  If the latch pointer is non-NULL, the existing function might
> return a new sentinel value WALRCV_RECEIVE_LATCH_SET.  (The
> pre-existing sentinel value -1 is still in use and has the same value
> and meaning as before, but it now has a name:
> WALRCV_RECEIVE_COPY_ENDED.)
>
> 3.  It will no longer return when the process is signalled (a latch
> should be used to ask it to return instead).
>
> Effectively, any client code would need to change at least to add NULL
> or possibly a latch if it needs to ask it to return, and any
> alternative implementation of the WAL receiver interface would need to
> use WaitEventSet (or WaitLatchOrSocket) as its event loop instead of
> whatever it might be using now so that it can detect a latch's state.
> But in any case, any such code would fail to compile against 9.6 due
> to the new argument, and then you'd only be able to get the new return
> value if you asked for it by passing in a latch.  What affected code
> are we aware of -- either users of libpqwalreceiver.so or other WAL
> receiver implementations?

Any negative value returned by walrcv_receive would have stopped the
replication stream, not only -1. And as you say, it's actually a good
thing that the interface of walrcv_receive is changing with this
patch: that way compilation will simply fail rather than failures going
unnoticed. That's too low-level to be mentioned in the release
notes either way, so just a compilation failure is acceptable to me.

>> Do you have numbers to share regarding how the latch-based approach
>> performs compared to the approach that used SIGUSR2 when
>> remote_apply is used?
>
> I couldn't measure any significant change (Linux, all on localhost, 128 
> cores):
>
> pgbench -c 1 -N bench2 -T 600
>
> 0001-remote-apply-v5.patch (signals), remote_apply -> 449 TPS
> 0001-remote-apply-v6.patch (latches), remote_apply -> 452 TPS
>
> pgbench -c 64 -j 32 -N bench2 -T 600
>
> 0001-remote-apply-v5.patch (signals), remote_apply -> 8536 TPS
> 0001-remote-apply-v6.patch (latches), remote_apply -> 8534 TPS

So both approaches give about the same performance. We are
likely seeing noise.

> Incidentally, I also did some testing on what happens when you signal
> a process that is busily writing and fsyncing.  I tested a few
> different kernels, write sizes and disk latencies and saw that things
> were fine on all of them up to 10k signals/sec but after that some
> systems' fsync performance started to reduce.  Only Linux on Power was
> still happily fsyncing at around 3/4 of full speed when signalled with
> a 2MHz+ tight kill loop (!), while FreeBSD and Linux on Intel weren't
> able to make much progress at all under such adversity.  So I suppose
> that if you could get the commit rate up into 5 figures you might be
> able to measure an improvement for the latch version due to
> latch-collapsing, though I noticed a large amount of signal-collapsing
> going on at the OS level on all systems I tested anyway, so maybe it
> wouldn't make a difference.  I attach that test program for interest.

Interesting results.

> Also, I updated the comment for the declaration of the latch in
> walreceiver.h to say something about the new usage.
>
> New patch series attached.

 static void WalRcvQuickDieHandler(SIGNAL_ARGS);

-
 static void
 ProcessWalRcvInterrupts(void)
Noise here.

+   ret = WaitLatchOrSocket(latch, events, PQsocket(streamConn), timeout_ms);
+
+   if (ret & WL_POSTMASTER_DEATH)
+   exit(0);
Exiting within libpqwalreceiver.so is no good. I think that even in
the case of a postmaster cleanup we should allow things to be cleaned
up.

 /*
+ * Wake up the walreceiver if it happens to be blocked in walrcv_receive,
+ * and tell it that a commit record has been applied.
+ *
+ * This is called by the startup process whenever interesting xlog records
+ * are applied, so that walreceiver can check if it needs to send an apply
+ * notification back to the master which may be waiting in a COMMIT with
+ * synchronous_commit = remote_apply.
+ */
+void
+WalRcvWakeup(void)
+{
+   SetLatch(&WalRcv->latch);
+}
I think here that it would be good to add an assertion telling that
this can just be called by the startup process while in recovery,
WalRcv->latch is not protected by a mutex lock.
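
One possible shape for that assertion, assuming the usual
AmStartupProcess() and RecoveryInProgress() helpers are available
(illustrative, not the reviewed patch's code):

    void
    WalRcvWakeup(void)
    {
        /*
         * Only the startup process should poke the walreceiver's latch this
         * way, and only while recovery is in progress, since WalRcv->latch
         * is accessed here without any lock.
         */
        Assert(AmStartupProcess() && RecoveryInProgress());
        SetLatch(&WalRcv->latch);
    }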

+maximum of 'timeout' ms. If a message was successfully read, returns
+its length. Otherwise returns 0 for timeout, WALRCV_RECEIVE_COPY_ENDED
+for disconnection or WALRCV_RECEIVE_LATCH_SET. On success, a pointer
Having an assigned constant name for timeout would be good for
consistency with the rest.

I have been also 

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-25 Thread Michael Paquier
On Fri, Mar 25, 2016 at 4:51 PM, Thomas Munro
 wrote:
> On Thu, Mar 24, 2016 at 12:34 AM, Thomas Munro
>  wrote:
>> On Wed, Mar 23, 2016 at 12:37 PM, Robert Haas  wrote:
>>> +static void WalRcvUnblockSigUsr2(void)
>>>
>>> And again here.
>>
>> Fixed.
>>
>>> +WalRcvUnblockSigUsr2();
>>>  len = walrcv_receive(NAPTIME_PER_CYCLE, );
>>> +WalRcvBlockSigUsr2();
>>>
>>> This does not seem like it will be cheap on all operating systems.  I
>>> think you should try to rejigger this somehow so that it can just set
>>> the process latch and the wal receiver figures it out from looking at
>>> shared memory.  Like maybe a flag in WalRcvData?  An advantage of this
>>> is that it should cut down on the number of signals significantly,
>>> because it won't need to send SIGUSR1 when the latch is already set.
>>
>> Still experimenting with a latch here.  I will come back on this point soon.
>
> Here is a latch-based version.

Thanks for the updated version. This looks pretty nice.

I find the routine name libpqrcv_wait to be a bit confusing. This is
not a routine aimed at being exposed externally, as walrcv_send or
walrcv_receive are. I would recommend changing the name to something like
waitForWALStream or similar.

Should we be worried about potential backward-incompatibilities with the
new return values of walrcv_receive?

Do you have numbers to share regarding how the latch-based approach
performs compared to the approach that used SIGUSR2 when
remote_apply is used?
--
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-24 Thread Michael Paquier
On Thu, Mar 24, 2016 at 11:20 PM, Robert Haas  wrote:
> On Thu, Mar 24, 2016 at 2:11 AM, Michael Paquier
>  wrote:
>> -   SyncRepWaitForLSN(gxact->prepare_end_lsn);
>> +   SyncRepWaitForLSN(gxact->prepare_end_lsn, false);
>> Isn't it important to ensure that a PREPARE LSN is applied as well on
>> the standby with remote_apply? Say if an application prepares a
>> transaction, it would commit locally but its LSN may not be applied on
>> the standby with this patch. That would be a surprising behavior for
>> the user.
>
> You need to wait for COMMIT PREPARED, but I don't see why you need to
> wait for PREPARE itself.

Multi-master conflict resolution. Knowing in time that a prepared
transaction has been applied is useful on another node for lock
conflict resolution. Say a PRIMARY KEY insert is prepared, and we
want to know at application level that its prepared state is visible
everywhere.
-- 
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-24 Thread Robert Haas
On Thu, Mar 24, 2016 at 2:11 AM, Michael Paquier
 wrote:
> On Wed, Mar 23, 2016 at 11:32 PM, Robert Haas  wrote:
>> Well, I wouldn't go that far.  It seems pretty clear that remote_apply
>> by itself is useful - I can't imagine anybody seriously arguing the
>> contrary, whatever they think of this implementation.  My view,
>> though, is that by itself that's pretty limiting: you can only have
>> one standby, and if that standby falls over then you lose
>> availability.  Causal reads fixes both of those problems - admittedly
>> that requires some knowledge in the application or the pooler, but
>> it's no worse than SSI in that regard.  Still, half a loaf is better
>> than none, and I imagine even just getting remote_apply would make a
>> few people quite happy.
>
> OK, let's do so then, even if causal reads don't get into 9.6 users
> could get advantage of remote_apply on multiple nodes if the N-sync
> patch gets in.
>
> Just looking at 0001.
>
> -remote_write, local, and off.
> +remote_write, remote_apply,
> local, and off.
>  The default, and safe, setting
> I imagine that a run of pgindent would be welcome for such large lines.

I didn't think pgindent touched the docs.  But I agree lines over 80
characters should be wrapped if they're being modified anyway.

> +#define XactCompletionSyncApplyFeedback(xinfo) \
> +   (!!(xinfo & XACT_COMPLETION_SYNC_APPLY_FEEDBACK))
> That's not directly something this patch should take care of, but the
> notation "!!" had better be avoided (see the stdbool thread with VS2015).

+1.

> -   SyncRepWaitForLSN(gxact->prepare_end_lsn);
> +   SyncRepWaitForLSN(gxact->prepare_end_lsn, false);
> Isn't it important to ensure that a PREPARE LSN is applied as well on
> the standby with remote_apply? Say if an application prepares a
> transaction, it would commit locally but its LSN may not be applied on
> the standby with this patch. That would be a surprising behavior for
> the user.

You need to wait for COMMIT PREPARED, but I don't see why you need to
wait for PREPARE itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-24 Thread Michael Paquier
On Wed, Mar 23, 2016 at 11:32 PM, Robert Haas  wrote:
> Well, I wouldn't go that far.  It seems pretty clear that remote_apply
> by itself is useful - I can't imagine anybody seriously arguing the
> contrary, whatever they think of this implementation.  My view,
> though, is that by itself that's pretty limiting: you can only have
> one standby, and if that standby falls over then you lose
> availability.  Causal reads fixes both of those problems - admittedly
> that requires some knowledge in the application or the pooler, but
> it's no worse than SSI in that regard.  Still, half a loaf is better
> than none, and I imagine even just getting remote_apply would make a
> few people quite happy.

OK, let's do so then, even if causal reads don't get into 9.6 users
could get advantage of remote_apply on multiple nodes if the N-sync
patch gets in.

Just looking at 0001.

-remote_write, local, and off.
+remote_write, remote_apply,
local, and off.
 The default, and safe, setting
I imagine that a run of pgindent would be welcome for such large lines.

+#define XactCompletionSyncApplyFeedback(xinfo) \
+   (!!(xinfo & XACT_COMPLETION_SYNC_APPLY_FEEDBACK))
That's not directly something this patch should take care of, but the
notation "!!" had better be avoided (see the stdbool thread with VS2015).

-   SyncRepWaitForLSN(gxact->prepare_end_lsn);
+   SyncRepWaitForLSN(gxact->prepare_end_lsn, false);
Isn't it important to ensure that a PREPARE LSN is applied as well on
the standby with remote_apply? Say if an application prepares a
transaction, it would commit locally but its LSN may not be applied on
the standby with this patch. That would be a surprising behavior for
the user.

(not commenting on the latch and SIGUSR2 handling, you are still
working on it per your last update).
-- 
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-23 Thread Robert Haas
On Wed, Mar 23, 2016 at 8:43 AM, Michael Paquier
 wrote:
> On Wed, Mar 23, 2016 at 8:39 PM, Robert Haas  wrote:
>> On Wed, Mar 23, 2016 at 7:34 AM, Thomas Munro
>>  wrote:
>>> synchronous_commit      TPS
>>> ------------------     ----
>>> off                    9234
>>> local                  1223
>>> remote_write            907
>>> on                      587
>>> remote_apply            555
>>>
>>> synchronous_commit      TPS
>>> ------------------     ----
>>> off                    3937
>>> local                  1984
>>> remote_write           1701
>>> on                     1373
>>> remote_apply           1349
>>
>> Hmm, so "remote_apply" is barely more expensive than "on".  That's 
>> encouraging.
>
> Indeed, interesting. This is perhaps proving that just having the
> possibility to have remote_apply (with feedback messages managed by
> signals, which is the best thing proposed for this release) is what we
> need to ensure the consistency of reads across nodes, and what would
> satisfy most of the user's requirements. Getting a slightly lower TPS
> may be worth the cost for some users if they can ensure that reads
> across nodes are accessible after a local commit, and there is no need
> to have an error management layer at application level to take care of
> errors caused by causal read timeouts.

Well, I wouldn't go that far.  It seems pretty clear that remote_apply
by itself is useful - I can't imagine anybody seriously arguing the
contrary, whatever they think of this implementation.  My view,
though, is that by itself that's pretty limiting: you can only have
one standby, and if that standby falls over then you lose
availability.  Causal reads fixes both of those problems - admittedly
that requires some knowledge in the application or the pooler, but
it's no worse than SSI in that regard.  Still, half a loaf is better
than none, and I imagine even just getting remote_apply would make a
few people quite happy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-23 Thread Michael Paquier
On Wed, Mar 23, 2016 at 8:39 PM, Robert Haas  wrote:
> On Wed, Mar 23, 2016 at 7:34 AM, Thomas Munro
>  wrote:
>> synchronous_commit      TPS
>> ------------------     ----
>> off                    9234
>> local                  1223
>> remote_write            907
>> on                      587
>> remote_apply            555
>>
>> synchronous_commit      TPS
>> ------------------     ----
>> off                    3937
>> local                  1984
>> remote_write           1701
>> on                     1373
>> remote_apply           1349
>
> Hmm, so "remote_apply" is barely more expensive than "on".  That's 
> encouraging.

Indeed, interesting. This is perhaps proving that just having the
possibility to have remote_apply (with feedback messages managed by
signals, which is the best thing proposed for this release) is what we
need to ensure the consistency of reads across nodes, and what would
satisfy most of the user's requirements. Getting a slightly lower TPS
may be worth the cost for some users if they can ensure that reads
across nodes are accessible after a local commit, and there is no need
to have an error management layer at application level to take care of
errors caused by causal read timeouts.
-- 
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-23 Thread Robert Haas
On Wed, Mar 23, 2016 at 7:34 AM, Thomas Munro
 wrote:
> synchronous_commit      TPS
> ------------------     ----
> off                    9234
> local                  1223
> remote_write            907
> on                      587
> remote_apply            555
>
> synchronous_commit      TPS
> ------------------     ----
> off                    3937
> local                  1984
> remote_write           1701
> on                     1373
> remote_apply           1349

Hmm, so "remote_apply" is barely more expensive than "on".  That's encouraging.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-22 Thread Robert Haas
On Mon, Mar 14, 2016 at 2:38 PM, Thomas Munro
 wrote:
>> What's with the extra block?
>
> Yeah, that's silly, thanks.  Tidied up for the next version.

Some more comments on 0001:

+remote_write,  remote_apply,
local, and off.

Extra space.

+ * apply when the transaction eventuallly commits or aborts.

Spelling.

+if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_FLUSH, NULL);
+
+SyncRepWaitForLSN(gxact->prepare_end_lsn);
+
+if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_APPLY, NULL);

You can't do this.  Directly changing the value of a variable that is
backing a GUC is verboten, and doing it through the thin veneer of
calling the assign-hook will not avoid the terrible wrath of the
powers that dwell in the outer dark, and/or Pittsburgh.  You probably
need a dance here similar to the way forcePageWrites/fullPageWrites
work.
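
One way to avoid mutating the GUC at all, roughly the direction later
versions in this thread take with the "bool commit" argument to
SyncRepWaitForLSN, is to cap the wait level inside that function.  A
sketch only, reusing the existing SyncRepWaitMode / SYNC_REP_WAIT_FLUSH
names, not the actual committed code:

    static int
    SyncRepEffectiveWaitMode(bool commit)
    {
        /*
         * For non-commit records such as PREPARE, never wait for remote
         * apply: cap the level at remote flush without touching the
         * synchronous_commit GUC or its assign hook.
         */
        if (commit)
            return SyncRepWaitMode;
        return Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
    }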

 /*
+ * Check if the caller would like to ask standbys for immediate feedback
+ * once this commit is applied.
+*/

Whitespace.

+/*
+ * Check if the caller would like to ask standbys for immediate feedback
+ * once this abort is applied.
+*/

Whitespace again.

 /*
+ * doRequestWalReceiverReply is used by recovery code to ask the main recovery
+ * loop to trigger a walreceiver reply.
+ */
+static bool doRequestWalReceiverReply;

This is the sort of comment that leads me to ask "why even bother
writing a comment?".  Try to say something that's not obvious from the
variable name. The comment for XLogRequestWalReceiverReply has a
similar issue.

+static void WalRcvBlockSigUsr2(void)

Style - newline after void.

+static void WalRcvUnblockSigUsr2(void)

And again here.

+WalRcvUnblockSigUsr2();
 len = walrcv_receive(NAPTIME_PER_CYCLE, );
+WalRcvBlockSigUsr2();

This does not seem like it will be cheap on all operating systems.  I
think you should try to rejigger this somehow so that it can just set
the process latch and the wal receiver figures it out from looking at
shared memory.  Like maybe a flag in WalRcvData?  An advantage of this
is that it should cut down on the number of signals significantly,
because it won't need to send SIGUSR1 when the latch is already set.
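
Roughly what that could look like, where force_reply is a hypothetical
flag in WalRcvData and XLogWalRcvSendReply is the existing reply routine
in walreceiver.c (a sketch, not the actual patch):

    /* Startup process: request a prompt reply instead of sending a signal. */
    static void
    RequestWalReceiverReply(void)
    {
        WalRcv->force_reply = true;     /* hypothetical flag in WalRcvData */
        SetLatch(&WalRcv->latch);
    }

    /* Walreceiver main loop: called after the latch is found set and reset. */
    static void
    MaybeSendForcedReply(void)
    {
        if (WalRcv->force_reply)
        {
            WalRcv->force_reply = false;
            XLogWalRcvSendReply(true, false);   /* send status immediately */
        }
    }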

+ * Although only "on", "off", "remote_apply", "remote_write", and "local" are
+ * documented, we accept all the likely variants of "on" and "off".

Maybe switch to listing the undocumented values.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-14 Thread Thomas Munro
On Tue, Mar 15, 2016 at 6:58 AM, Robert Haas  wrote:
> On Sun, Mar 13, 2016 at 11:50 PM, Thomas Munro
>  wrote:
>> The last patches I posted don't apply today due to changes in master,
>> so here's a freshly merged patch series.
>
> +from the current synchronous stanbyindicates it has received the
>
> Uh, no.

Oops, thanks, fixed.  I'll wait for some more feedback or a conflict
with master before sending a new version.

> -SyncRepWaitForLSN(gxact->prepare_end_lsn);
> +{
> +/*
> + * Don't wait for the prepare to be applied remotely in remote_apply
> + * mode, just wait for it to be flushed to the WAL.  We will wait for
> + * apply when the transaction eventuallly commits or aborts.
> + */
> +if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY)
> +assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_FLUSH, NULL);
> +
> +SyncRepWaitForLSN(gxact->prepare_end_lsn);
> +
> +if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY)
> +assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_APPLY, NULL);
> +}
>
> What's with the extra block?

Yeah, that's silly, thanks.  Tidied up for the next version.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-14 Thread Robert Haas
On Sun, Mar 13, 2016 at 11:50 PM, Thomas Munro
 wrote:
> The last patches I posted don't apply today due to changes in master,
> so here's a freshly merged patch series.

+from the current synchronous stanbyindicates it has received the

Uh, no.

-SyncRepWaitForLSN(gxact->prepare_end_lsn);
+{
+/*
+ * Don't wait for the prepare to be applied remotely in remote_apply
+ * mode, just wait for it to be flushed to the WAL.  We will wait for
+ * apply when the transaction eventuallly commits or aborts.
+ */
+if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_FLUSH, NULL);
+
+SyncRepWaitForLSN(gxact->prepare_end_lsn);
+
+if (synchronous_commit == SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+assign_synchronous_commit(SYNCHRONOUS_COMMIT_REMOTE_APPLY, NULL);
+}

What's with the extra block?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-08 Thread Thomas Munro
On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier
 wrote:
> WAL replay for 2PC should also
> call XLogRequestWalReceiverReply() when needed.

Ah yes, I missed this important sentence.  I will address that in the
next version after some testing.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-06 Thread Thomas Munro
On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier
 wrote:
> On Tue, Mar 1, 2016 at 11:53 AM, Thomas Munro
>  wrote:
>> On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote
>>  wrote:
>>>
>>> Hi,
>>>
>>> On 2016/02/29 18:05, Thomas Munro wrote:
 On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
> + servers.  A transaction that is run with
> causal_reads set
> + to on is guaranteed either to see the effects of all
> + completed transactions run on the primary with the setting on, 
> or to
> + receive an error "standby is not available for causal reads".
>
> "A transaction that is run" means "A transaction that is run on a
> standby", right?

 Well, it could be any server, standby or primary.  Of course standbys
 are the interesting case since it was already true that if you run
 two sequential transactions on the primary, the second can see the
 effect of the first, but I like the idea of a general rule that
 applies anywhere, allowing you not to care which server it is.
>>>
>>> I meant actually in context of that sentence only.
>>
>> Ok, here's a new version that includes that change, fixes a conflict
>> with recent commit 10b48522 and removes an accidental duplicate copy
>> of the README file.
>
> Finally I got my eyes on this patch.
>
> To put it short, this patch introduces two different concepts:
> - addition of a new value, remote_apply, for synchronous_commit, which
> is actually where things overlap a bit with the N-sync patch, because
> by combining the addition of remote_apply with the N-sync patch, it is
> possible to ensure that an LSN is applied to multiple targets instead
> of one now. In this case, as the master will wait for the LSN to be
> applied on the sync standby, there is no need for the application to
> have error handling in case a read transaction is happening on the
> standby as the change is already visible on the standby when
> committing it on master. However the cost here is that if the standby
> node lags behind, it puts some extra wait as well on the master side.
> - causal reads, which makes the master able to drop standbys that are
> lagging too far behind and lets callers of standbys know when a standby
> is lagging too far behind and cannot satisfy causal reads. In this case
> error handling is needed by the application in case a standby is
> lagging too far behind.
>
> That's one of my concerns about this patch now: it is trying to do too
> much. I think that there is definitely a need for both things:
> applications may be fine paying the lag price when remote_apply is
> used, in exchange for not needing an extra error handling layer, or they
> cannot accept a perhaps large amount of lag and are ready to have an
> extra error handling layer to manage read failures on standbys. So I would
> recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit, which would be
> quite a simple patch, and I am convinced that there are benefits in
> having that. Compared to the previous patch sent, a standby message is
> sent back to the master once COMMIT has been applied, accelerating
> things a bit.
> 2) Patch for causal reads, with all its extra parametrization logic
> and stuff to select standby candidates.

Thanks.  Yes, this makes a lot of sense.  I have done some work on
splitting this up and will post the result soon, along with my
responses to your other feedback.

> Here is as well a more detailed review...
>
> Regression tests are failing, rules.sql is generating diffs because
> pg_stat_replication is changed.
>
> CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE,
> COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also
> call XLogRequestWalReceiverReply() when needed.
>
> The new parameters are missing from postgresql.conf.sample.
>
> +static bool
> +SyncRepCheckEarlyExit(void)
> +{
> Doing this refactoring would actually make sense as a separate patch I
> think, and that would simplify the core patch for causal reads.
>
> +For this reason we say that the causal reads guarantee only holds as
> +long as the absolute difference between the system clocks of the
> +machines is no more than max_clock_skew.  The theory is that NTP makes
> +it possible to reason about the maximum possible clock difference
> +between machines and choose a value that allows for a much larger
> +difference.  However, we do make a best effort attempt to detect
> +misconfigured systems as described above, to catch the case of servers
> +not running ntp a correctly configured ntp daemon, or with a clock so
> +far out of whack that ntp refuses to fix it.
> Just wondering how this reacts when standby and master are on
> different timezones. I know of two ways to measure WAL lag: one is
> what you are doing, by using a timestamp and relying on the two servers
> to be in sync at clock level. 

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-02 Thread Michael Paquier
On Thu, Mar 3, 2016 at 3:34 PM, Michael Paquier
 wrote:
> That's one of my concerns about this patch now: it is trying to do too
> much. I think that there is definitely a need for both things:
> applications may be fine paying the lag price when remote_apply is
> used, in exchange for not needing an extra error handling layer, or they
> cannot accept a perhaps large amount of lag and are ready to have an
> extra error handling layer to manage read failures on standbys. So I would
> recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit, which would be
> quite a simple patch, and I am convinced that there are benefits in
> having that. Compared to the previous patch sent, a standby message is
> sent back to the master once COMMIT has been applied, accelerating
> things a bit.

Hm. Looking now at
http://www.postgresql.org/message-id/canp8+j+jcpnoojc-kqltt4pdyox2sq6wywqcsy6aahwkvna...@mail.gmail.com
it would be nice to get a clear solution for it first, though the use
of signals to wake up the WAL receiver and force it to send a new
LSN apply position back to the master to unlock it asap does not look
very appealing. Seeing that no patch has been sent for 9.6 regarding
that, it would be better to simply drop this code from the causal-read
patch perhaps...
-- 
Michael




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-03-02 Thread Michael Paquier
On Tue, Mar 1, 2016 at 11:53 AM, Thomas Munro
 wrote:
> On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote
>  wrote:
>>
>> Hi,
>>
>> On 2016/02/29 18:05, Thomas Munro wrote:
>>> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
 + servers.  A transaction that is run with
 causal_reads set
 + to on is guaranteed either to see the effects of all
 + completed transactions run on the primary with the setting on, 
 or to
 + receive an error "standby is not available for causal reads".

 "A transaction that is run" means "A transaction that is run on a
 standby", right?
>>>
>>> Well, it could be any server, standby or primary.  Of course standbys
>>> are the interesting case since it was already true that if you run
>>> two sequential transactions on the primary, the second can see the
>>> effect of the first, but I like the idea of a general rule that
>>> applies anywhere, allowing you not to care which server it is.
>>
>> I meant actually in context of that sentence only.
>
> Ok, here's a new version that includes that change, fixes a conflict
> with recent commit 10b48522 and removes an accidental duplicate copy
> of the README file.

Finally I got my eyes on this patch.

To put it short, this patch introduces two different concepts:
- addition of a new value, remote_apply, for synchronous_commit, which
is actually where things overlap a bit with the N-sync patch, because
by combining the addition of remote_apply with the N-sync patch, it is
possible to ensure that an LSN is applied to multiple targets instead
of one now. In this case, as the master will wait for the LSN to be
applied on the sync standby, there is no need for the application to
have error handling in case a read transaction is happening on the
standby as the change is already visible on the standby when
committing it on master. However the cost here is that if the standby
node lags behind, it puts some extra wait as well on the master side.
- causal reads, which makes the master able to drop standbys that are
lagging too far behind and lets callers of standbys know when a standby
is lagging too far behind and cannot satisfy causal reads. In this case
error handling is needed by the application in case a standby is
lagging too far behind.

That's one of my concerns about this patch now: it is trying to do too
much. I think that there is definitely a need for both things:
applications may be fine paying the lag price when remote_apply is
used, in exchange for not needing an extra error handling layer, or they
cannot accept a perhaps large amount of lag and are ready to have an
extra error handling layer to manage read failures on standbys. So I would
recommend a split to begin with:
1) Addition of remote_apply in synchronous_commit, which would be
quite a simple patch, and I am convinced that there are benefits in
having that. Compared to the previous patch sent, a standby message is
sent back to the master once COMMIT has been applied, accelerating
things a bit.
2) Patch for causal reads, with all its extra parametrization logic
and stuff to select standby candidates.

Here is as well a more detailed review...

Regression tests are failing, rules.sql is generating diffs because
pg_stat_replication is changed.

CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE,
COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also
call XLogRequestWalReceiverReply() when needed.

The new parameters are missing from postgresql.conf.sample.

+static bool
+SyncRepCheckEarlyExit(void)
+{
Doing this refactoring would actually make sense as a separate patch I
think, and that would simplify the core patch for causal reads.

+For this reason we say that the causal reads guarantee only holds as
+long as the absolute difference between the system clocks of the
+machines is no more than max_clock_skew.  The theory is that NTP makes
+it possible to reason about the maximum possible clock difference
+between machines and choose a value that allows for a much larger
+difference.  However, we do make a best effort attempt to detect
+misconfigured systems as described above, to catch the case of servers
+not running a correctly configured ntp daemon, or with a clock so
+far out of whack that ntp refuses to fix it.
Just wondering how this reacts when standby and master are in
different timezones. I know of two ways to measure WAL lag: one is
what you are doing, by using a timestamp and relying on the two
servers to be in sync at the clock level. The second is in bytes, as a
WAL quantity -- say max_wal_lag or similar, expressed as the number of
WAL segments the standby is lagging behind -- which is less
user-friendly to set up, but the concerns regarding clock sync across
nodes go away. To put it short, I find the timestamp approach easy to
set up and understand for the user, but not really solid as it depends
much on the state dependency 

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-29 Thread Thomas Munro
On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote
 wrote:
>
> Hi,
>
> On 2016/02/29 18:05, Thomas Munro wrote:
>> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
>>> + servers.  A transaction that is run with causal_reads set
>>> + to on is guaranteed either to see the effects of all
>>> + completed transactions run on the primary with the setting on, or to
>>> + receive an error "standby is not available for causal reads".
>>>
>>> "A transaction that is run" means "A transaction that is run on a
>>> standby", right?
>>
>> Well, it could be any server, standby or primary.  Of course standbys
>> are the interesting case since it was already true that if you run
>> two sequential transactions on the primary, the second can see the
>> effect of the first, but I like the idea of a general rule that
>> applies anywhere, allowing you not to care which server it is.
>
> I meant actually in context of that sentence only.

Ok, here's a new version that includes that change, fixes a conflict
with recent commit 10b48522 and removes an accidental duplicate copy
of the README file.

-- 
Thomas Munro
http://www.enterprisedb.com


causal-reads-v8.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-29 Thread Amit Langote

Hi,

On 2016/02/29 18:05, Thomas Munro wrote:
> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
>> + servers.  A transaction that is run with causal_reads set
>> + to on is guaranteed either to see the effects of all
>> + completed transactions run on the primary with the setting on, or to
>> + receive an error "standby is not available for causal reads".
>>
>> "A transaction that is run" means "A transaction that is run on a
>> standby", right?
> 
> Well, it could be any server, standby or primary.  Of course standbys
> are the interesting case since it was already true that if you run
> two sequential transactions on the primary, the second can see the
> effect of the first, but I like the idea of a general rule that
> applies anywhere, allowing you not to care which server it is.

I meant actually in context of that sentence only.

>> By the way, is there some discussion in our existing
>> documentation to refer to about causal consistency in single node case?  I
>> don't know maybe that will help ease into the new feature.  Grepping the
>> existing source tree doesn't reveal the term "causal", so maybe even a
>> single line in the patch mentioning "single node operation trivially
>> implies (or does it?) causal consistency" would help.  Thoughts?
> 
> Hmm.  Where should such a thing go?  I probably haven't introduced the
> term well enough.  I thought for a moment about putting something
> here:
> 
> http://www.postgresql.org/docs/devel/static/sql-commit.html
> 
> "All changes made by the transaction become visible to others ..." --
> which others?  But I backed out, that succinct account of COMMIT is 20
> years old, and in any case visibility is tied to committing, not
> specifically to the COMMIT command.  But perhaps this patch really
> should include something there that refers back to the causal reads
> section.

I see.  I agree this is not exactly material for the COMMIT page.  Perhaps
somewhere under "Chapter 13. Concurrency Control" with cross-reference
to/from "25.5. Hot Standby".  Might be interesting to hear from others as
well.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-29 Thread Michael Paquier
On Mon, Feb 29, 2016 at 6:05 PM, Thomas Munro
 wrote:
> "All changes made by the transaction become visible to others ..." --
> which others?  But I backed out, that succinct account of COMMIT is 20
> years old, and in any case visibility is tied to committing, not
> specifically to the COMMIT command.  But perhaps this patch really
> should include something there that refers back to the causal reads
> section.

Luckily enough, read uncommitted behaves like read-committed in PG,
making this true :)
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-29 Thread Thomas Munro
On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote
 wrote:
>
> Hi Thomas,
>
> On 2016/02/29 15:20, Thomas Munro wrote:
>> Thanks for looking at the patch!  Here is a new version with the
>> following changes:
>>
>> 1.  Some draft user documentation has been added, as requested.
>
> Just to clarify, in:
>
> + servers.  A transaction that is run with causal_reads set
> + to on is guaranteed either to see the effects of all
> + completed transactions run on the primary with the setting on, or to
> + receive an error "standby is not available for causal reads".
>
> "A transaction that is run" means "A transaction that is run on a
> standby", right?

Well, it could be any server, standby or primary.  Of course standbys
are the interesting case since it was already true that if you run
two sequential transactions on the primary, the second can see the
effect of the first, but I like the idea of a general rule that
applies anywhere, allowing you not to care which server it is.

> By the way, is there some discussion in our existing
> documentation to refer to about causal consistency in single node case?  I
> don't know maybe that will help ease into the new feature.  Grepping the
> existing source tree doesn't reveal the term "causal", so maybe even a
> single line in the patch mentioning "single node operation trivially
> implies (or does it?) causal consistency" would help.  Thoughts?

Hmm.  Where should such a thing go?  I probably haven't introduced the
term well enough.  I thought for a moment about putting something
here:

http://www.postgresql.org/docs/devel/static/sql-commit.html

"All changes made by the transaction become visible to others ..." --
which others?  But I backed out, that succinct account of COMMIT is 20
years old, and in any case visibility is tied to committing, not
specifically to the COMMIT command.  But perhaps this patch really
should include something there that refers back to the causal reads
section.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-29 Thread Amit Langote

Hi Thomas,

On 2016/02/29 15:20, Thomas Munro wrote:
> Thanks for looking at the patch!  Here is a new version with the
> following changes:
> 
> 1.  Some draft user documentation has been added, as requested.

Just to clarify, in:

+ servers.  A transaction that is run with causal_reads set
+ to on is guaranteed either to see the effects of all
+ completed transactions run on the primary with the setting on, or to
+ receive an error "standby is not available for causal reads".

"A transaction that is run" means "A transaction that is run on a
standby", right?  By the way, is there some discussion in our existing
documentation to refer to about causal consistency in single node case?  I
don't know maybe that will help ease into the new feature.  Grepping the
existing source tree doesn't reveal the term "causal", so maybe even a
single line in the patch mentioning "single node operation trivially
implies (or does it?) causal consistency" would help.  Thoughts?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-27 Thread Thom Brown
On 27 February 2016 at 13:20, Michael Paquier  wrote:
> On Mon, Feb 22, 2016 at 9:39 AM, Thom Brown  wrote:
>> On 21 February 2016 at 23:18, Thomas Munro
>>  wrote:
>> The replay_lag is particularly cool.  Didn't think it was possible to
>> glean this information on the primary, but the timings are correct in
>> my tests.
>>
>> +1 for this patch.  Looks like this solves the problem that
>> semi-synchronous replication tries to solve, although arguably in a
>> more sensible way.
>
> Yeah, having extra logic at application layer to check if a certain
> LSN position has been applied or not is doable, but if we can avoid it
> that's a clear plus.
>
> This patch has no documentation. I will try to figure out by myself
> how the new parameters interact with the rest of the syncrep code
> while looking at it but if we want to move on to get something
> committable for 9.6 it would be good to get some documentation soon.

Could we rename "apply" to "remote_apply"?  It seems more consistent
with "remote_write", and matches its own enum entry too.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-27 Thread Michael Paquier
On Mon, Feb 22, 2016 at 9:39 AM, Thom Brown  wrote:
> On 21 February 2016 at 23:18, Thomas Munro
>  wrote:
> The replay_lag is particularly cool.  Didn't think it was possible to
> glean this information on the primary, but the timings are correct in
> my tests.
>
> +1 for this patch.  Looks like this solves the problem that
> semi-synchronous replication tries to solve, although arguably in a
> more sensible way.

Yeah, having extra logic at application layer to check if a certain
LSN position has been applied or not is doable, but if we can avoid it
that's a clear plus.

This patch has no documentation. I will try to figure out by myself
how the new parameters interact with the rest of the syncrep code
while looking at it but if we want to move on to get something
committable for 9.6 it would be good to get some documentation soon.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-21 Thread Thom Brown
On 21 February 2016 at 23:18, Thomas Munro
 wrote:
> On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown  wrote:
>> On 3 February 2016 at 10:46, Thomas Munro  
>> wrote:
>>> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote
>>>  wrote:
 There seems to be a copy-pasto there - shouldn't that be:

 + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
>>>
>>> Indeed, thanks!  New patch attached.
>>
>> I've given this a test drive, and it works exactly as described.
>
> Thanks for trying it out!
>
>> But one thing which confuses me is when a standby, with causal_reads
>> enabled, has just finished starting up.  I can't connect to it
>> because:
>>
>> FATAL:  standby is not available for causal reads
>>
>> However, this is the same message when I'm successfully connected, but
>> it's lagging, and the primary is still waiting for the standby to
>> catch up:
>>
>> ERROR:  standby is not available for causal reads
>>
>> What is the difference here?  The problem being reported appears to be
>> identical, but in the first case I can't connect, but in the second
>> case I can (although I still can't issue queries).
>
> Right, you get the error at login before it has managed to connect to
> the primary, and for a short time after while it's in 'joining' state,
> or potentially longer if there is a backlog of WAL to apply.
>
> The reason is that when causal_reads = on in postgresql.conf (as
> opposed to being set for an individual session or role), causal reads
> logic is used for snapshots taken during authentication (in fact the
> error is generated when trying to take a snapshot slightly before
> authentication proper begins, in InitPostgres).  I think that's a
> desirable feature: if you have causal reads on and you create/alter a
> database/role (for example setting a new password) and commit, and
> then you immediately try to connect to that database/role on a standby
> where you have causal reads enabled system-wide, then you get the
> causal reads guarantee during authentication: you either see the
> effects of your earlier transaction or you see the error.  As you have
> discovered, there is a small window after a standby comes up where it
> will show the error because it hasn't got a lease yet so it can't let
> you log in yet because it could be seeing a stale catalog (your user
> may not exist on the standby yet, or have been altered in some way, or
> your database may not exist yet, etc).
>
> Does that make sense?

Ah, all clear.  Yes, that makes sense now.  I've been trying to break
it for the past few days, and this was the only thing which I wasn't
clear on.  The parameters all work as described.

The replay_lag is particularly cool.  Didn't think it was possible to
glean this information on the primary, but the timings are correct in
my tests.

+1 for this patch.  Looks like this solves the problem that
semi-synchronous replication tries to solve, although arguably in a
more sensible way.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-21 Thread Thomas Munro
On Mon, Feb 22, 2016 at 2:10 AM, Thom Brown  wrote:
> On 3 February 2016 at 10:46, Thomas Munro  
> wrote:
>> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote
>>  wrote:
>>> There seems to be a copy-pasto there - shouldn't that be:
>>>
>>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
>>
>> Indeed, thanks!  New patch attached.
>
> I've given this a test drive, and it works exactly as described.

Thanks for trying it out!

> But one thing which confuses me is when a standby, with causal_reads
> enabled, has just finished starting up.  I can't connect to it
> because:
>
> FATAL:  standby is not available for causal reads
>
> However, this is the same message when I'm successfully connected, but
> it's lagging, and the primary is still waiting for the standby to
> catch up:
>
> ERROR:  standby is not available for causal reads
>
> What is the difference here?  The problem being reported appears to be
> identical, but in the first case I can't connect, but in the second
> case I can (although I still can't issue queries).

Right, you get the error at login before it has managed to connect to
the primary, and for a short time after while it's in 'joining' state,
or potentially longer if there is a backlog of WAL to apply.

The reason is that when causal_reads = on in postgresql.conf (as
opposed to being set for an individual session or role), causal reads
logic is used for snapshots taken during authentication (in fact the
error is generated when trying to take a snapshot slightly before
authentication proper begins, in InitPostgres).  I think that's a
desirable feature: if you have causal reads on and you create/alter a
database/role (for example setting a new password) and commit, and
then you immediately try to connect to that database/role on a standby
where you have causal reads enabled system-wide, then you get the
causal reads guarantee during authentication: you either see the
effects of your earlier transaction or you see the error.  As you have
discovered, there is a small window after a standby comes up where it
will show the error because it hasn't got a lease yet so it can't let
you log in yet because it could be seeing a stale catalog (your user
may not exist on the standby yet, or have been altered in some way, or
your database may not exist yet, etc).
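
In libpq terms the scenario looks roughly like this (a sketch only;
the role name and connection strings are made up):

/*
 * Create a role on the primary with causal_reads = on, then
 * immediately authenticate as it on a standby that has causal_reads
 * enabled in postgresql.conf.  The connection attempt either sees the
 * new role or fails with "standby is not available for causal reads",
 * rather than authenticating against a stale catalog.
 */
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *primary = PQconnectdb("host=primary dbname=postgres");
	PGconn	   *standby;

	PQclear(PQexec(primary, "SET causal_reads = on"));
	PQclear(PQexec(primary, "CREATE ROLE app_user LOGIN PASSWORD 'secret'"));

	standby = PQconnectdb("host=standby1 dbname=postgres "
						  "user=app_user password=secret");
	if (PQstatus(standby) == CONNECTION_BAD)
	{
		/* no lease yet (or still joining): try another standby */
	}

	PQfinish(standby);
	PQfinish(primary);
	return 0;
}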

Does that make sense?

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-21 Thread Thom Brown
On 3 February 2016 at 10:46, Thomas Munro  wrote:
> On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote
>  wrote:
>> There seems to be a copy-pasto there - shouldn't that be:
>>
>> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
>
> Indeed, thanks!  New patch attached.

I've given this a test drive, and it works exactly as described.

But one thing which confuses me is when a standby, with causal_reads
enabled, has just finished starting up.  I can't connect to it
because:

FATAL:  standby is not available for causal reads

However, this is the same message when I'm successfully connected, but
it's lagging, and the primary is still waiting for the standby to
catch up:

ERROR:  standby is not available for causal reads

What is the difference here?  The problem being reported appears to be
identical, but in the first case I can't connect, but in the second
case I can (although I still can't issue queries).

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-03 Thread Amit Langote

Hi Thomas,

On 2016/01/20 13:12, Thomas Munro wrote:
> That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so
> here is a new version.

-if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
+if (is_highest_priority_sync_standby)

[ ... ]

-if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
-{
-walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
-numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);

[ ... ]

+if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->write)
+{
+walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
+numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH,
+MyWalSnd->flush);

There seems to be a copy-pasto there - shouldn't that be:

+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-02-03 Thread Thomas Munro
On Wed, Feb 3, 2016 at 10:59 PM, Amit Langote
 wrote:
> There seems to be a copy-pasto there - shouldn't that be:
>
> + if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)

Indeed, thanks!  New patch attached.

-- 
Thomas Munro
http://www.enterprisedb.com


causal-reads-v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-01-19 Thread Michael Paquier
On Wed, Jan 20, 2016 at 1:12 PM, Thomas Munro
 wrote:
> On Wed, Dec 30, 2015 at 5:15 PM, Thomas Munro
>  wrote:
>> On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro
>>  wrote:
>>> Here is a new version of the patch with a few small improvements:
>>> ...
>>> [causal-reads-v3.patch]
>>
>> That didn't apply after 6e7b3359 (which fixed a typo in a comment that I
>> moved).  Here is a new version that does.
>
> That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so
> here is a new version.

You should try to register it to a CF, though it may be too late for
9.6 if that's rather intrusive.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-01-19 Thread Thomas Munro
On Wed, Dec 30, 2015 at 5:15 PM, Thomas Munro
 wrote:
> On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro
>  wrote:
>> Here is a new version of the patch with a few small improvements:
>> ...
>> [causal-reads-v3.patch]
>
> That didn't apply after 6e7b3359 (which fixed a typo in a comment that I
> moved).  Here is a new version that does.

That one conflicts with b1a9bad9e744857291c7d5516080527da8219854, so
here is a new version.

-- 
Thomas Munro
http://www.enterprisedb.com


causal-reads-v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2016-01-07 Thread Joel Jacobson
+1 to both the feature and the concept of how it's implemented.
Haven't looked at the code though.

This feature would be very useful for us at Trustly.
This would mean we got get rid of an entire system component in our
architecture (=memcached) which we only use to write data which must
be immediately readable at the sync slave after the master commits.
The only such data we currently have is the backoffice sessionid which
must be readable on the slave, otherwise the read-only calls which we
route to the slave might fail because it's missing.

On Wed, Nov 11, 2015 at 6:37 AM, Thomas Munro
 wrote:
>
> Hi hackers,
>
> Many sites use hot standby servers to spread read-heavy workloads over more 
> hardware, or at least would like to.  This works well today if your 
> application can tolerate some time lag on standbys.  The problem is that 
> there is no guarantee of when a particular commit will become visible for 
> clients connected to standbys.  The existing synchronous commit feature is no 
> help here because it guarantees only that the WAL has been flushed on another 
> server before commit returns.  It says nothing about whether it has been 
> applied or whether it has been applied on the standby that you happen to be 
> talking to.
>
> A while ago I posted a small patch[1] to allow synchronous_commit to wait for 
> remote apply on the current synchronous standby, but (as Simon Riggs rightly 
> pointed out in that thread) that part isn't the main problem.  It seems to me 
> that the main problem for a practical 'writer waits' system is how to support 
> a dynamic set of servers, gracefully tolerating failures and timing out 
> laggards, while also providing a strong guarantee during any such 
> transitions.  Now I would like to propose something to do that, and share a 
> proof-of-concept patch.
>
>
> === PROPOSAL ===
>
> The working name for the proposed feature is "causal reads", because it 
> provides a form of "causal consistency"[2] (and "read-your-writes" 
> consistency) no matter which server the client is connected to.  There is a 
> similar feature by the same name in another product (albeit implemented 
> differently -- 'reader waits'; more about that later).  I'm not wedded to the 
> name.
>
> The feature allows arbitrary read-only transactions to be run on any hot 
> standby, with a specific guarantee about the visibility of preceding 
> transactions.  The guarantee is that if you set a new GUC "causal_reads = on" 
> in any pair of consecutive transactions (tx1, tx2) where tx2 begins after tx1 
> successfully returns, then tx2 will either see tx1 or fail with a new error 
> "standby is not available for causal reads", no matter which server it runs 
> on.  A discovery mechanism is also provided, giving an instantaneous snapshot 
> of the set of standbys that are currently available for causal reads (ie 
> won't raise the error), in the form of a new column in pg_stat_replication.
>
> For example, a web server might run tx1 to insert a new row representing a 
> message in a discussion forum on the primary server, and then send the user 
> to another web page that runs tx2 to load all messages in the forum on an 
> arbitrary hot standby server.  If causal_reads = on in both tx1 and tx2 (for 
> example, because it's on globally), then tx2 is guaranteed to see the new 
> post, or get a (hopefully rare) error telling the client to retry on another 
> server.
>
> Very briefly, the approach is:
> 1.  The primary tracks apply lag on each standby (including between commits).
> 2.  The primary deems standbys 'available' for causal reads if they are 
> applying WAL and replying to keepalives fast enough, and periodically sends 
> the standby an authorization to consider itself available for causal reads 
> until a time in the near future.
> 3.  Commit on the primary with "causal_reads = on" waits for all 'available' 
> standbys either to apply the commit record, or to cease to be 'available' and 
> begin raising the error if they are still alive (because their authorizations 
> have expired).
> 4.  Standbys can start causal reads transactions only while they have an 
> authorization with an expiry time in the future; otherwise they raise an 
> error when an initial snapshot is taken.
>
> In a follow-up email I can write about the design trade-offs considered 
> (mainly 'writer waits' vs 'reader waits'), comparison with some other 
> products, method of estimating replay lag, wait and timeout logic and how it 
> maintains the guarantee in various failure scenarios, logic for standbys 
> joining and leaving, implications of system clock skew between servers, or 
> any other questions you may have, depending on feedback/interest (but see 
> comments in the attached patch for some of those subjects).  For now I didn't 
> want to clog up the intertubes with too large a wall of text.
>
>
> === PROOF-OF-CONCEPT ===
>
> Please see the POC patch attached.  It adds two new GUCs. 

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-12-29 Thread Thomas Munro
On Wed, Nov 18, 2015 at 11:50 PM, Thomas Munro
 wrote:
> Here is a new version of the patch with a few small improvements:
> ...
> [causal-reads-v3.patch]

That didn't apply after 6e7b3359 (which fixed a typo in a comment that I
moved).  Here is a new version that does.

-- 
Thomas Munro
http://www.enterprisedb.com


causal-reads-v4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-18 Thread Thomas Munro
Hi,

Here is a new version of the patch with a few small improvements:

1.  Adopted the term '[read] lease', replacing various hand-wavy language
in the comments and code.  That seems to be the established term for this
approach[1].

2.  Reduced the stalling time on failure.  When things go wrong with a
standby (such as losing contact with it), instead of stalling for a
conservative amount of time longer than any lease that might have been
granted, the primary now stalls only until the expiry of the last lease
that actually was granted to a given dropped standby, which should be
sooner.

3.  Fixed a couple of bugs that showed up in testing and review (some bad
flow control in the signal handling, and a bug in a circular buffer), and
changed the recovery->walreceiver wakeup signal handling to block the
signal except while waiting in walrcv_receive (it didn't seem a good idea
to interrupt arbitrary syscalls in walreceiver so I thought that would be a
improvement; but of course that area's going to be reworked by Simon's
patch anyway, as discussed elsewhere).

Restating the central idea using the new terminology:  So long as they are
replaying fast enough, the primary grants a series of causal reads leases
to standbys allowing them to handle causal reads queries locally without
any inter-node communication for a limited time.  Leases are promises that
the primary will wait for the standby to apply commit records OR be dropped
from the set of available causal reads standbys and know that it has been
dropped, before the primary returns from commit, in order to uphold the
causal reads guarantee.  In the worst case it can do that by waiting for
the most recently granted lease to expire.

I've also attached a couple of things which might be useful when trying the
patch out: test-causal-reads.c which can be used to test performance and
causality under various conditions, and test-causal-reads.sh which can be
used to bring up a primary and a bunch of local hot standbys to talk to.
 (In the hope of encouraging people to take the patch for a spin...)

[1] Originally from a well known 1989 paper on caching, but in the context
of databases and synchronous replication see for example the recent papers
on "Niobe" and "Paxos Quorum Leases" (especially the reference to Google
Megastore).  Of course a *lot* more is going on in those very different
algorithms, but at some level "read leases" are being used to allow
local-node-only reads for a limited time while upholding some kind of
global consistency guarantee, in some of those consensus database systems.
I spent a bit of time talking about consistency levels to database guru and
former colleague Alex Scotti who works on a Paxos-based system, and he gave
me the initial idea to try out a lease-based consistency system for
Postgres streaming rep.  It seems like a very useful point in the space of
trade-offs to me.

-- 
Thomas Munro
http://www.enterprisedb.com


test-causal-reads.sh
Description: Bourne shell script
/*
 * A simple test program to test performance and visibility with the causal
 * reads patch.
 *
 * Each test loop updates a row on the primary, and then optionally checks if
 * it can see that change immediately on a standby.  If you do this with
 * standard async replication, you should occasionally see an assertion fail
 * if run with --check (depending on the vagaries of timing -- I can
 * reproduce this very reliably on my system).  If you do it with traditional
 * sync rep, it becomes a little bit less likely (but it's still reliably
 * reproducible on my system).  If you do it with traditional sync rep set up,
 * and "--synchronous-commit apply" then it should no longer be possible to
 * trigger than assertion, but that's just a straw-man mode.  If you do it
 * with --causal-reads then you should not be able to reproduce it, no matter
 * which standby you connect to.  If you're using --check and the standby gets
 * dropped (perhaps because you break/disconnect/pause it etc) you should
 * never see that assertion fail (= SELECT running but seeing stale data),
 * instead you should see an error when running the SELECT.
 *
 * Arguments:
 *
 *  --primary  how to connect to the primary
 *  --standby  how to connect to the standby to check
 *  --check   check that the update is visible on standby
 *  --causal-readsenable causal reads
 *  --synchronous-commit LEVELset synchronous_commit to LEVEL
 *  --loops COUNT how many loops to run through
 *  --verbose chatter
 */

#include <libpq-fe.h>

#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	PGconn *primary;
	PGconn *standby;
	PGresult *result;
	int i;
	int loops = 1;
	char buffer[1024];
	const char *synchronous_commit = "on";
	bool causal_reads = false;
	const char *primary_connstr = "dbname=postgres port=5432";
	const char *standby_connstr = "dbname=postgres port=5442";
	bool 

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-16 Thread Thomas Munro
On Tue, Nov 17, 2015 at 12:44 AM, Simon Riggs  wrote:

> On 15 November 2015 at 10:41, Simon Riggs  wrote:
>
>
>>  So anyway, consider me nudged to finish my patch to provide capability
>> for that by 1 Jan.
>>
>
> My earlier patch aimed to allow WALReceiver to wait on both a latch and a
> socket as well as allow WALWriter to be active, so that WALReceiver could
> reply more quickly and handle greater workload. As I explained previously
> when we discussed that in recent posts, it is necessary infrastructure to
> have anybody wait on anything higher than remote-fsync. I aim to complete
> that by 1 Jan.
>

Right, handing write/fsync work off to WALWriter in standbys makes a lot of
sense for any kind of writer-waits system, so that WALReceiver doesn't
spend time in long syscalls which wouldn't play nicely with signals
(whether from 'kill' or SetLatch) and can deal with network IO with the
lowest possible latency.  I would like to help test/review that, if that
could be useful.

The SIGUSR1 code in the WalReceiverMain and WalRecvWakeup in this patch
works well enough for now for proof-of-concept purposes until then.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-16 Thread Robert Haas
On Mon, Nov 16, 2015 at 5:44 AM, Simon Riggs  wrote:
> On 15 November 2015 at 14:50, Robert Haas  wrote:
>> On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs 
>> wrote:
>> > Hmm, if that's where we're at, I'll summarize my thoughts.
>> >
>> > All of this discussion presupposes we are distributing/load balancing
>> > queries so that reads and writes might occur on different nodes.
>>
>> Agreed.  I think that's a pretty common pattern, though certainly not
>> the only one.
> It looks to me this functionality is only of use in a pooler. Please explain
> how else this would be used.

I think you're right.  I mean, you could have the pooling built into
the application, but some place connections have to be being farmed
out to different nodes, or there's no point to using this feature.
Some people may not want to use this feature, but those who do are
using some form of pooling at some level.

>> > Your option (2) is wider but also worse in some ways. It can be
>> > implemented
>> > in a pooler.
>> >
>> > Your option (3) doesn't excite me much. You've got a load of stuff that
>> > really should happen in a pooler. And at its core we have
>> > synchronous_commit
>> > = apply but with a timeout rather than a wait.
>>
>> I don't see how either option (2) or option (3) could be implemented
>> in a pooler.  How would that work?
>
> My starting thought was that (1) was the only way forwards. Through
> discussion, I now see that its not the best solution for the general case.
>
> The pooler knows which statements are reads and writes, it also knows about
> transaction boundaries, so it is possible for it to perform the waits for
> either (2) or (3).

As Craig says, it may not: pgbouncer, for example, won't.  pgpool
will, except when it's wrong because some apparently read-only
function is actually writing data.  But even if the pooler does know,
that isn't enough for it to perform the waits for (2) or (3) without
some support from the server.  If it wants to wait for a particular
transaction to be applied on the standby, it needs to know how long to
wait, and without some support from the server, it has no way of
knowing.  Now that could be done by doing (1) and then having the
pooler perform the waits, but now every pooler has to be taught how to
do that.  pgpool needs to know, pgbouncer needs to know, every
JDBC-based connection pooler needs to know.  Uggh.  Thomas's solution
is nice because it works with any pooler.

The other point I'd make, which I think may be what you said before
but I think is worth making very explicit, is that (1) supposes that
we know which reads are dependent on which previous writes.  In the
real world, that's probably frequently untrue.  If someone does SELECT
sum(salary) FROM emp on the standby, there's no particular write to
the emp table that they want to wait for: they want to wait for ALL
such writes previously acknowledged as committed.  Now, even when the
dependent writes can be identified, it may be convenient to just wait
for all of them instead of a particular subset that we know are
related.  But I bet there will be many cases where identification is
impractical or impossible, and thus I suspect (1) won't be very
useful.

I think (2) and (3) both have good prospects for being useful, but I
suspect that the performance consequences of (3), which is what Thomas
actually implemented, although possibly severe, are still likely to be
only a fraction of the cost of (2).  Having to potentially wait every
time a standby takes a snapshot just sounds awful to me.

> I would like to see a load balancing pooler in Postgres.

Me, too, but I don't expect that to be possible in the near future,
and I think this is awfully useful until it does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-16 Thread Simon Riggs
On 15 November 2015 at 10:41, Simon Riggs  wrote:


>  So anyway, consider me nudged to finish my patch to provide capability
> for that by 1 Jan.
>

My earlier patch aimed to allow WALReceiver to wait on both a latch and a
socket as well as allow WALWriter to be active, so that WALReceiver could
reply more quickly and handle greater workload. As I explained previously
when we discussed that in recent posts, it is necessary infrastructure to
have anybody wait on anything higher than remote-fsync. I aim to complete
that by 1 Jan.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-16 Thread Simon Riggs
On 15 November 2015 at 14:50, Robert Haas  wrote:

> On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs 
> wrote:
> > Hmm, if that's where we're at, I'll summarize my thoughts.
> >
> > All of this discussion presupposes we are distributing/load balancing
> > queries so that reads and writes might occur on different nodes.
>
> Agreed.  I think that's a pretty common pattern, though certainly not
> the only one.
>

It looks to me this functionality is only of use in a pooler. Please
explain how else this would be used.


> > Your option (2) is wider but also worse in some ways. It can be
> implemented
> > in a pooler.
> >
> > Your option (3) doesn't excite me much. You've got a load of stuff that
> > really should happen in a pooler. And at its core we have
> synchronous_commit
> > = apply but with a timeout rather than a wait.
>
> I don't see how either option (2) or option (3) could be implemented
> in a pooler.  How would that work?
>

My starting thought was that (1) was the only way forwards. Through
discussion, I now see that its not the best solution for the general case.

The pooler knows which statements are reads and writes, it also knows about
transaction boundaries, so it is possible for it to perform the waits for
either (2) or (3). The pooler *needs* to know which nodes it can route
queries to, so it looks to me that the pooler is the best place to put
waits and track status of nodes, no matter when we wait. I don't see any
benefit in having other nodes keep track of node status since that will
just replicate work that *must* be performed in the pooler.

I would like to see a load balancing pooler in Postgres.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-16 Thread Simon Riggs
On 16 November 2015 at 11:01, Craig Ringer  wrote:

> On 16 November 2015 at 18:44, Simon Riggs  wrote:
>
>>
>> The pooler knows which statements are reads and writes
>>
>
> I think that's an iffy assumption.
>

It's not an assumption, it's a requirement. If it can't do this in some
manner then you can't use a load balancing pooler.

Randomly submitting things works as well, since it leads to a write error
when you try to write data on a read only server, so you do then learn
whether it is a read or a write. Once you know it's a write, you submit to
master. But you still need to be careful of other effects, so that isn't
recommended.

> It's one we tend to make because otherwise read/write pooling won't work,
> but in PostgreSQL there's really no way to know when calling a function.
>
> What does
>
> SELECT get_user_stats()
>
> do? The pooler has _no_ _idea_ unless manually configured with knowledge
> about particular user defined functions.
>
> In the absence of such knowledge it can:
>
> - send the work to a replica and report the ERROR to the user if it fails
> due to an attempted write;
> - send the work to a replica, capture an ERROR due to attempted write, and
> retry on the master;
> - send everything it's not sure about to the master
>
> If a pooler had insight into the catalogs and if we had readonly /
> readwrite attributes on functions, it could be smarter.
>

pgpool supports white/black function listing for exactly this reason.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-16 Thread Craig Ringer
On 16 November 2015 at 18:44, Simon Riggs  wrote:

>
> The pooler knows which statements are reads and writes
>

I think that's an iffy assumption. It's one we tend to make because
otherwise read/write pooling won't work, but in PostgreSQL there's really
no way to know when calling a function.

What does

SELECT get_user_stats()

do? The pooler has _no_ _idea_ unless manually configured with knowledge
about particular user defined functions.

In the absence of such knowledge it can:

- send the work to a replica and report the ERROR to the user if it fails
due to an attempted write;
- send the work to a replica, capture an ERROR due to attempted write, and
retry on the master;
- send everything it's not sure about to the master

If a pooler had insight into the catalogs and if we had readonly /
readwrite attributes on functions, it could be smarter.


> I would like to see a load balancing pooler in Postgres.
>

Given the number of times I say "no, no, don't raise max_connections to
2000 to solve your performance problems, lower it to around 100 and put
pgbouncer in front if your application doesn't support connection pooling
internally"  yes!

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-16 Thread Jim Nasby

On 11/12/15 1:11 PM, Thomas Munro wrote:

It's true that a pooling system/middleware could spy on your sessions
and insert causality token handling imposing a global ordering of
visibility for you, so that naive users don't have to deal with them.
Whenever it sees a COMMIT result (assuming they are taught to return
LSNs), it could update a highest-LSN-seen variable, and transparently
insert a wait for that LSN into every transaction that it sees
beginning.  But then you would have to push all your queries through a
single point that can see everything across all Postgres servers, and
maintain this global high LSN.


I think that depends on what you're doing. Frequently you don't care 
about anyone else's writes, just your own. In that case, there's no need 
for a shared connection pooler, you just have to come back to the same one.


There's also a 4th option: until a commit has made it out to some number 
of slaves, re-direct all reads from a session back to the master. That 
might sound horrible for master performance, but in reality I think it'd 
normally be fine. Generally, you only care about this when you're going 
to read data that you've just written, which means the data's still in 
shared buffers.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-15 Thread Robert Haas
On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs  wrote:
> Hmm, if that's where we're at, I'll summarize my thoughts.
>
> All of this discussion presupposes we are distributing/load balancing
> queries so that reads and writes might occur on different nodes.

Agreed.  I think that's a pretty common pattern, though certainly not
the only one.

> We need a good balancer. Any discussion of this that ignores the balancer
> component is only talking about half the solution. What we need to do is
> decide whether functionality should live in the balancer or the core.

I'm all in favor of having a load-balancer in core, but that seems
completely unrelated to the patch at hand.

> Your option (1) is viable, but only in certain cases. We could add support
> for some token/wait mechanism but as you say, this would require application
> changes not pooler changes.

Agreed.

> Your option (2) is wider but also worse in some ways. It can be implemented
> in a pooler.
>
> Your option (3) doesn't excite me much. You've got a load of stuff that
> really should happen in a pooler. And at its core we have synchronous_commit
> = apply but with a timeout rather than a wait. So anyway, consider me nudged
> to finish my patch to provide capability for that by 1 Jan.

I don't see how either option (2) or option (3) could be implemented
in a pooler.  How would that work?

To be frank, it's starting to seem to me like you are just trying to
block this patch so you can have time to develop your own version
instead.  I hope that's not the case, because it would be quite
unfair.  When Thomas originally posted the patch, you complained that
"This causes every writer to wait. What we want is to isolate the wait
only to people performing a write-read sequence, so I think it should
be readers that wait. Let's have that debate up front before we start
reviewing the patch."  Now, you seem to be saying that's OK, because
you want to post a patch to do exactly the same thing under the name
synchronous_commit=apply, but you want it to be your own patch,
leaving out the other stuff that Thomas has put into this one.

That could be the right thing to do, but how about we discuss it a
bit?  The timeout stuff that Thomas has added here is really useful.
Without that, if a machine goes down, we wait forever.  That's the
right thing to do if we're replicating to make sure transactions can
never be lost, but it's a bad idea if we're replicating for load
balancing.  In the load balancing case, you want to drop sync slaves
quickly to ensure the cluster remains available, and you need them to
know they are out of sync so that the load balancer doesn't get
confused.  That's exactly what is implemented here.

If you have an idea for a simpler implementation, great, but I think
we need something.  I don't see how it's going to work to make it
entirely the pooler's job to figure out whether the cluster is in sync
- it needs a push from the core server.  Here, that push is easy to
find: if a particular replica starts returning the "i'm out of sync"
error when you query it, then stop routing queries to that replica
until the error clears (which the pooler can find out by polling it
with some trivial query).  That's a great deal more useful than
synchronous_commit=apply without such a feature.
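
A pooler-side health check along those lines could be as simple as the
following sketch (connection handling and polling interval are left
out; it relies only on the error text proposed in the patch):

/*
 * Load balancer probe: run a trivial query with causal_reads = on and
 * stop routing queries to this standby while it reports that it is
 * not available for causal reads.
 */
#include <stdbool.h>
#include <string.h>
#include <libpq-fe.h>

static bool
standby_available_for_causal_reads(PGconn *standby)
{
	PGresult   *res;
	bool		available;

	PQclear(PQexec(standby, "SET causal_reads = on"));
	res = PQexec(standby, "SELECT 1");
	available = (PQresultStatus(res) == PGRES_TUPLES_OK);
	if (!available &&
		strstr(PQerrorMessage(standby), "causal reads") == NULL)
	{
		/* some other failure (connection lost etc): also unavailable */
	}
	PQclear(res);
	return available;
}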

> On a related note, any further things like "GUC causal_reads_standby_names"
> should be implemented by Node Registry as a named group of nodes. We can
> have as many arbitrary groups of nodes as we want. If that sounds strange
> look back at exactly why GUCs are called GUCs.

I think a node registry is a good idea, and my impression from the
session in Vienna is that quite a few other hackers do, too.  But I also
don't think it's remotely reasonable to make that a precondition for
accepting this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-15 Thread Simon Riggs
On 12 November 2015 at 18:25, Thomas Munro 
wrote:


>  I don't want to get bogged down in details, while we're talking about the
> 30,000 foot view).
>

Hmm, if that's where we're at, I'll summarize my thoughts.

All of this discussion presupposes we are distributing/load balancing
queries so that reads and writes might occur on different nodes.

We need a good balancer. Any discussion of this that ignores the balancer
component is only talking about half the solution. What we need to do is
decide whether functionality should live in the balancer or the core.

Your option (1) is viable, but only in certain cases. We could add support
for some token/wait mechanism but as you say, this would require
application changes not pooler changes.

Your option (2) is wider but also worse in some ways. It can be implemented
in a pooler.

Your option (3) doesn't excite me much. You've got a load of stuff that
really should happen in a pooler. And at its core we have
synchronous_commit = apply but with a timeout rather than a wait. So
anyway, consider me nudged to finish my patch to provide capability for
that by 1 Jan.

On a related note, any further things like "GUC causal_reads_standby_names"
should be implemented by Node Registry as a named group of nodes. We can
have as many arbitrary groups of nodes as we want. If that sounds strange
look back at exactly why GUCs are called GUCs.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-15 Thread Thomas Munro
On Sun, Nov 15, 2015 at 11:41 PM, Simon Riggs  wrote:

> On 12 November 2015 at 18:25, Thomas Munro 
> wrote:
>
>
>>  I don't want to get bogged down in details, while we're talking about
>> the 30,000 foot view).
>>
>
> Hmm, if that's where we're at, I'll summarize my thoughts.
>
> All of this discussion presupposes we are distributing/load balancing
> queries so that reads and writes might occur on different nodes.
>
> We need a good balancer. Any discussion of this that ignores the balancer
> component is only talking about half the solution. What we need to do is
> decide whether functionality should live in the balancer or the core.
>
> Your option (1) is viable, but only in certain cases. We could add support
> for some token/wait mechanism but as you say, this would require
> application changes not pooler changes.
>
> Your option (2) is wider but also worse in some ways. It can be
> implemented in a pooler.
>
> Your option (3) doesn't excite me much. You've got a load of stuff that
> really should happen in a pooler. And at its core we have
> synchronous_commit = apply but with a timeout rather than a wait. So
> anyway, consider me nudged to finish my patch to provide capability for
> that by 1 Jan.
>

Just to be clear, this patch doesn't use a "timeout rather than a wait".
It always waits for the current set of available causal reads standbys to
apply the commit.  It's just that nodes get kicked out of that set pretty
soon if they don't keep up, a bit like a RAID controller dropping a failing
disk.  And it does so using a protocol that ensures that the dropped
standby starts raising the error, even if contact has been lost with it, so
the causal reads guarantee is maintained at all times for all clients.

On a related note, any further things like "GUC causal_reads_standby_names"
> should be implemented by Node Registry as a named group of nodes. We can
> have as many arbitrary groups of nodes as we want. If that sounds strange
> look back at exactly why GUCs are called GUCs.
>

Agreed, the application_name whitelist stuff is clunky.  I left it out of
the first version I posted, not wanting the focus of this proposal to be
side-tracked.  But as Ants Aasma pointed out, some users might need
something like that, so I posted a 2nd version that follows the established
example, again not wanting to distract with anything new in that area.  Of
course that would eventually be replaced/improved as part of a future node
topology management project.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-12 Thread Simon Riggs
On 11 November 2015 at 09:22, Thomas Munro 
wrote:


> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
> BerkeleyDB does in "read-your-writes" mode.  It means that application
> developers have the responsibility for correctly identifying transactions
> with causal dependencies and dealing with LSNs (or whatever equivalent
> tokens), potentially even passing them to other processes where the
> transactions are causally dependent but run by multiple communicating
> clients (for example, communicating microservices).  This makes it
> difficult to retrofit load balancing to pre-existing applications and (like
> anything involving concurrency) difficult to reason about as applications
> grow in size and complexity.  It is efficient if done correctly, but it is
> a tax on application complexity.
>

Agreed. This works if you have a single transaction connected thru a pool
that does statement-level load balancing, so it works in both session and
transaction mode.

I was in favour of a scheme like this myself, earlier, but have more
thoughts now.

We must also consider the need for serialization across sessions or
transactions.

In transaction pooling mode, an application could get assigned a different
session, so a token would be much harder to pass around.

2.  Reader waits for a conservatively chosen LSN.  This is roughly what
> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =
> 1" modes.  Read transactions would start off by finding the current end
> of WAL on the primary, since that must be later than any commit that
> already completed, and then waiting for that to apply locally.  That means
> every read transaction waits for a complete replication lag period,
> potentially unnecessarily.  This is tax on readers with unnecessary waiting.
>

This tries to make it easier for users by forcing all users to experience a
causality delay. Given the whole purpose of multi-node load balancing is
performance, referencing the master again simply defeats any performance
gain, so you couldn't ever use it for all sessions. It could be a USERSET
parameter, so could be turned off in most cases that didn't need it.  But
its easier to use than (1).

Though this should be implemented in the pooler.

3.  Writer waits, as proposed.  In this model, there is no tax on readers
> (they have zero overhead, aside from the added complexity of dealing with
> the possibility of transactions being rejected when a standby falls behind
> and is dropped from 'available' status; but database clients must already
> deal with certain types of rare rejected queries/failures such as
> deadlocks, serialization failures, server restarts etc).  This is a tax on
> writers.
>

This would seem to require that all readers must first check with the
master as to which standbys are now considered available, so it looks like
(2).

The alternative is that we simply send readers to any standby and allow the
pool to work out separately whether the standby is still available, which
mostly works, but it doesn't handle sporadic slow downs on particular
standbys very well (if at all).

I think we need to look at whether this does actually give us anything, or
whether we are missing the underlying Heisenberg reality.

More later.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-12 Thread Thomas Munro
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs  wrote:

> On 11 November 2015 at 09:22, Thomas Munro 
> wrote:
>
>
>> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
>> BerkeleyDB does in "read-your-writes" mode.  It means that application
>> developers have the responsibility for correctly identifying transactions
>> with causal dependencies and dealing with LSNs (or whatever equivalent
>> tokens), potentially even passing them to other processes where the
>> transactions are causally dependent but run by multiple communicating
>> clients (for example, communicating microservices).  This makes it
>> difficult to retrofit load balancing to pre-existing applications and (like
>> anything involving concurrency) difficult to reason about as applications
>> grow in size and complexity.  It is efficient if done correctly, but it is
>> a tax on application complexity.
>>
>
> Agreed. This works if you have a single transaction connected thru a pool
> that does statement-level load balancing, so it works in both session and
> transaction mode.
>
> I was in favour of a scheme like this myself, earlier, but have more
> thoughts now.
>
> We must also consider the need for serialization across sessions or
> transactions.
>
> In transaction pooling mode, an application could get assigned a different
> session, so a token would be much harder to pass around.
>

Sorry for the double reply, I just wanted to add a couple more thoughts.

As discussed elsewhere in the thread, I think it makes absolute sense to
offer some kind of support for causality tokens, but I don't see that on its
own as enough for most users.  (At the least, it would be good to have
pg_wait_for_xlog_replay_location(lsn, timeout), but perhaps explicit BEGIN
syntax as suggested by Heikki, or a new field in the libpq protocol which
can be attached to any statement, and likewise for the commit LSN of
results).

It's true that a pooling system/middleware could spy on your sessions and
insert causality token handling imposing a global ordering of visibility
for you, so that naive users don't have to deal with them.  Whenever it
sees a COMMIT result (assuming they are taught to return LSNs), it could
update a highest-LSN-seen variable, and transparently insert a wait for
that LSN into every transaction that it sees beginning.  But then you would
have to push all your queries through a single point that can see
everything across all Postgres servers, and maintain this global high LSN.

In contrast, my writer-waits proposal makes different trade-offs to provide
causal reads as a built-in feature without an external single point
observer of all transactions.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-12 Thread Thomas Munro
On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs  wrote:

> On 11 November 2015 at 09:22, Thomas Munro 
> wrote:
>
>
>> 1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
>> BerkeleyDB does in "read-your-writes" mode.  It means that application
>> developers have the responsibility for correctly identifying transactions
>> with causal dependencies and dealing with LSNs (or whatever equivalent
>> tokens), potentially even passing them to other processes where the
>> transactions are causally dependent but run by multiple communicating
>> clients (for example, communicating microservices).  This makes it
>> difficult to retrofit load balancing to pre-existing applications and (like
>> anything involving concurrency) difficult to reason about as applications
>> grow in size and complexity.  It is efficient if done correctly, but it is
>> a tax on application complexity.
>>
>
> Agreed. This works if you have a single transaction connected thru a pool
> that does statement-level load balancing, so it works in both session and
> transaction mode.
>
> I was in favour of a scheme like this myself, earlier, but have more
> thoughts now.
>
> We must also consider the need for serialization across sessions or
> transactions.
>
> In transaction pooling mode, an application could get assigned a different
> session, so a token would be much harder to pass around.
>
>> 2.  Reader waits for a conservatively chosen LSN.  This is roughly what
>> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =
>> 1" modes.  Read transactions would start off by finding the current end
>> of WAL on the primary, since that must be later than any commit that
>> already completed, and then waiting for that to apply locally.  That means
>> every read transaction waits for a complete replication lag period,
>> potentially unnecessarily.  This is a tax on readers with unnecessary waiting.
>>
>
> This tries to make it easier for users by forcing all users to experience
> a causality delay. Given the whole purpose of multi-node load balancing is
> performance, referencing the master again simply defeats any performance
> gain, so you couldn't ever use it for all sessions. It could be a USERSET
> parameter, so could be turned off in most cases that didn't need it.  But
> it's easier to use than (1).
>
> Though this should be implemented in the pooler.
>
>> 3.  Writer waits, as proposed.  In this model, there is no tax on readers
>> (they have zero overhead, aside from the added complexity of dealing with
>> the possibility of transactions being rejected when a standby falls behind
>> and is dropped from 'available' status; but database clients must already
>> deal with certain types of rare rejected queries/failures such as
>> deadlocks, serialization failures, server restarts etc).  This is a tax on
>> writers.
>>
>
> This would seem to require that all readers must first check with the
> master as to which standbys are now considered available, so it looks like
> (2).
>

No -- in (3), that is this proposal, standbys don't check with the primary
when you run a transaction.  Instead, the primary sends a constant stream
of authorizations (in the form of keepalives sent every
causal_reads_timeout / 2 in the current patch) to the standby, allowing it
to consider itself available for a short time into the future (currently
now + causal_reads_timeout - max_tolerable_clock_skew to be specific -- I
can elaborate on that logic in a separate email).  At the start of a
transaction in causal reads mode (the first call to GetTransactionSnapshot to be
specific), the standby knows immediately without communicating with the
primary whether it can proceed or must raise the error.  In the happy case,
the reader simply compares the most recently received authorization's
expiry time with the system clock and proceeds.  In the worst case, when
contact is lost between primary and standby, the primary must stall
causal_reads commits for causal_reads_timeout (see CausalReadsBeginStall).
Doing that makes sure that no causal reads commit can return (see
CausalReadsCommitCanReturn) before the lost standby has definitely started
raising the error for causal_reads queries (because its most recent
authorization has expired), in case it is still alive and handling requests
from clients.
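
To put illustrative numbers on that (made-up values, purely for the sake of
the example): suppose causal_reads_timeout = 4s and the maximum tolerable
clock skew is 1s.  A keepalive sent at time T then authorizes the standby to
consider itself available only until about T + 3s by its own clock.  If the
primary loses contact with that standby at T, it stalls causal_reads commits
until T + 4s, so the stall outlives the lease by the full skew allowance;
even if the two clocks disagree by up to 1s, the lease has expired and the
standby is already raising the error before any of those commits can return.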

It is not at all like (2), which introduces a conservative wait at the
start of every read transaction, slowing all readers down.  In (3), readers
don't wait, they run (or are rejected) as fast as possible, but instead the
primary has to do extra things.  Hence my categorization of (2) as a 'tax
on readers', and of (3) as a 'tax on writers'.  The idea is that a site
with a high ratio of reads to writes would prefer zero-overhead reads.


> The alternative is that we simply send readers to any standby and allow
> the pool to work out separately whether the standby is still available,
> which mostly works, but it 

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Thomas Munro
On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas  wrote:

> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>
>> On 11 November 2015 at 05:37, Thomas Munro wrote:
>>
>>> Many sites use hot standby servers to spread read-heavy workloads over more
>>> hardware, or at least would like to.  This works well today if your
>>> application can tolerate some time lag on standbys.  The problem is that
>>> there is no guarantee of when a particular commit will become visible for
>>> clients connected to standbys.  The existing synchronous commit feature
>>> is
>>> no help here because it guarantees only that the WAL has been flushed on
>>> another server before commit returns.  It says nothing about whether it
>>> has
>>> been applied or whether it has been applied on the standby that you
>>> happen
>>> to be talking to.
>>>
>>
>> Thanks for working on this issue.
>>
>
> +1.
>
>>> 3.  Commit on the primary with "causal_reads = on" waits for all
>>> 'available' standbys either to apply the commit record, or to cease to be
>>> 'available' and begin raising the error if they are still alive (because
>>> their authorizations have expired).
>>>
>>>
>> This causes every writer to wait.
>>
>> What we want is to isolate the wait only to people performing a write-read
>> sequence, so I think it should be readers that wait. Let's have that
>> debate
>> up front before we start reviewing the patch.
>>
>
> Agreed. And in the write-read sequence, you don't need to wait at the
> write either, it's enough that you wait just before you start doing the
> read. An application might do a lot of other things between the two, so
> that in most cases, there would in fact be no waiting as the record is
> already applied when you perform the read.
>
> I'm thinking the client should get some kind of a token back from the
> commit, and it could use the token on the standby, to wait for that commit
> to be applied. The token could be just the XID, or the LSN of the commit
> record. Or the application could generate the token and pass it to the
> server in the commit, similar to how 2PC works. So the interaction would be
> something like:
>
> In master:
> BEGIN;
> INSERT INTO FOO ...;
> COMMIT;
> Server returns: COMMITted with token 1234
>
> Later, in standby:
> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
> SELECT * FROM foo;
> ...


I thought about this question, and considered three different approaches:

1.  Reader waits with exposed LSNs, as Heikki suggests.  This is what
BerkeleyDB does in "read-your-writes" mode.  It means that application
developers have the responsibility for correctly identifying transactions
with causal dependencies and dealing with LSNs (or whatever equivalent
tokens), potentially even passing them to other processes where the
transactions are causally dependent but run by multiple communicating
clients (for example, communicating microservices).  This makes it
difficult to retrofit load balancing to pre-existing applications and (like
anything involving concurrency) difficult to reason about as applications
grow in size and complexity.  It is efficient if done correctly, but it is
a tax on application complexity.

2.  Reader waits for a conservatively chosen LSN.  This is roughly what
MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait = 1"
modes.  Read transactions would start off by finding the current end of WAL
on the primary, since that must be later than any commit that already
completed, and then waiting for that to apply locally.  That means every
read transaction waits for a complete replication lag period, potentially
unnecessarily.  This is a tax on readers with unnecessary waiting.

3.  Writer waits, as proposed.  In this model, there is no tax on readers
(they have zero overhead, aside from the added complexity of dealing with
the possibility of transactions being rejected when a standby falls behind
and is dropped from 'available' status; but database clients must already
deal with certain types of rare rejected queries/failures such as
deadlocks, serialization failures, server restarts etc).  This is a tax on
writers.

My thinking was that the reason for wanting to load balance over a set of
hot standbys is because you have a very read-heavy workload, so it makes
sense to tax the writers and leave the many dominant readers unburdened, so
(3) should be better than (2) for the majority of users who want such a
configuration.  (Note also that it's not a requirement to tax every write;
with this proposal you can set causal_reads to off for those transactions
where you know there is no possibility of a causally dependent read).
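
To make that concrete, a write that you know has no causally dependent readers
could opt out along these lines (just a sketch -- it assumes the GUC can be
overridden at transaction level with SET LOCAL):

  BEGIN;
  SET LOCAL causal_reads = off;   -- assumed per-transaction override
  INSERT INTO FOO ...;
  COMMIT;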

As for (1), my thinking was that most application developers would probably
prefer not to have to deal with that type of interface.  For users who do
want to do that, it would be comparatively simple to make that possible,
and would not conflict with this proposal.  This proposal could be used by
people 

Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Ants Aasma
On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro
 wrote:
> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas  wrote:
>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>> Thanks for working on this issue.
>>
>> +1.

+1. I have seen a lot of interest for something along these lines.

>> I'm thinking the client should get some kind of a token back from the
>> commit, and it could use the token on the standby, to wait for that commit
>> to be applied. The token could be just the XID, or the LSN of the commit
>> record. Or the application could generate the token and pass it to the
>> server in the commit, similar to how 2PC works. So the interaction would be
>> something like:
>>
>> In master:
>> BEGIN;
>> INSERT INTO FOO ...;
>> COMMIT;
>> Server returns: COMMITted with token 1234
>>
>> Later, in standby:
>> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
>> SELECT * FROM foo;
>> ...

To avoid read anomalies (backwards timetravel) it should also be
possible to receive a token from read-only transactions based on the
latest snapshot used.
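
(As a crude sketch of what I mean, a read-only transaction on a standby could
run something like
  SELECT pg_last_xlog_replay_location();
just before it ends and hand that out as its token, though a value derived
from the snapshot it actually used would be more precise.)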

> My thinking was that the reason for wanting to load balance over a set of
> hot standbys is because you have a very read-heavy workload, so it makes
> sense to tax the writers and leave the many dominant readers unburdened, so
> (3) should be better than (2) for the majority of users who want such a
> configuration.  (Note also that it's not a requirement to tax every write;
> with this proposal you can set causal_reads to off for those transactions
> where you know there is no possibility of a causally dependent read).
>
> As for (1), my thinking was that most application developers would probably
> prefer not to have to deal with that type of interface.  For users who do
> want to do that, it would be comparatively simple to make that possible, and
> would not conflict with this proposal.  This proposal could be used by
> people retrofitting load balancing to existing applications with relative
> ease, or simply not wanting to have to deal with LSNs and complexity.  (I
> have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout)
> separately, which could be called on a standby with the lsn obtained from
> pg_current_xlog_location() on the primary any time after a COMMIT completes,
> but I was thinking of that as a different feature addressing a different
> user base: people prepared to do more work to squeeze out some extra
> performance.)

Although I still think that 1) is the correct long term solution I
must say that I agree with the reasoning presented. I think we should
review the API in the light that in the future we might have a mix of
clients, some clients that are able to keep track of causality tokens
and either want to wait when a read request arrives, or pick a host to
use based on the token, and then there are "dumb" clients that want to
use write side waits.

Also, it should be possible to configure which standbys are considered
for waiting on. Otherwise a reporting slave will occasionally catch up
enough to be considered "available" and then cause a latency peak when
a long query blocks apply again.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Atri Sharma
> I'm thinking the client should get some kind of a token back from the
> commit, and it could use the token on the standby, to wait for that commit
> to be applied. The token could be just the XID, or the LSN of the commit
> record. Or the application could generate the token and pass it to the
> server in the commit, similar to how 2PC works. So the interaction would be
> something like:
>
> In master:
> BEGIN;
> INSERT INTO FOO ...;
> COMMIT;
> Server returns: COMMITted with token 1234
>
> Later, in standby:
> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
> SELECT * FROM foo;

+1.

The LSN should be good enough IMO.


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Heikki Linnakangas

On 11/11/2015 10:23 AM, Simon Riggs wrote:

On 11 November 2015 at 05:37, Thomas Munro 
wrote:

Many sites use hot standby servers to spread read-heavy workloads over more
hardware, or at least would like to.  This works well today if your
application can tolerate some time lag on standbys.  The problem is that
there is no guarantee of when a particular commit will become visible for
clients connected to standbys.  The existing synchronous commit feature is
no help here because it guarantees only that the WAL has been flushed on
another server before commit returns.  It says nothing about whether it has
been applied or whether it has been applied on the standby that you happen
to be talking to.


Thanks for working on this issue.


+1.


3.  Commit on the primary with "causal_reads = on" waits for all
'available' standbys either to apply the commit record, or to cease to be
'available' and begin raising the error if they are still alive (because
their authorizations have expired).



This causes every writer to wait.

What we want is to isolate the wait only to people performing a write-read
sequence, so I think it should be readers that wait. Let's have that debate
up front before we start reviewing the patch.


Agreed. And in the write-read sequence, you don't need to wait at the 
write either, it's enough that you wait just before you start doing the 
read. An application might do a lot of other things between the two, so 
that in most cases, there would in fact be no waiting as the record is 
already applied when you perform the read.


I'm thinking the client should get some kind of a token back from the 
commit, and it could use the token on the standby, to wait for that 
commit to be applied. The token could be just the XID, or the LSN of the 
commit record. Or the application could generate the token and pass it 
to the server in the commit, similar to how 2PC works. So the 
interaction would be something like:


In master:
BEGIN;
INSERT INTO FOO ...;
COMMIT;
Server returns: COMMITted with token 1234

Later, in standby:
BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
SELECT * FROM foo;
...

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Robert Haas
On Wed, Nov 11, 2015 at 3:23 AM, Simon Riggs  wrote:
> This causes every writer to wait.
>
> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait. Let's have that debate
> up front before we start reviewing the patch.

One advantage of having writers wait is that the master and its read
slaves can't ever get too far apart.  Suppose the master is generating
WAL much faster than the read slaves (or one of them) can replay it.
You might say it sucks to slow down the master to the speed the slaves
can keep up with, and that's true.  On the other hand, if the master
is allowed to run ahead, then a process that sends a read query to a
standby which has gotten far behind might have to wait minutes or
hours for it to catch up.  I think a lot of people are enabling
synchronous replication today just for the purpose of avoiding this
problem - keeping the two machines "together in time" makes the
overall system behavior a lot more predictable.

Also, if we made readers wait, wouldn't that require a network
roundtrip to the master every time a query on a reader wanted a new
snapshot?  That seems like it would be unbearably expensive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Thomas Munro
On Thu, Nov 12, 2015 at 12:10 AM, Ants Aasma  wrote:

> On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro
>  wrote:
> > On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas wrote:
> >> On 11/11/2015 10:23 AM, Simon Riggs wrote:
> >>> Thanks for working on this issue.
> >>
> >> +1.
>
> +1. I have seen a lot of interest for something along these lines.
>
> >> I'm thinking the client should get some kind of a token back from the
> >> commit, and it could use the token on the standby, to wait for that commit
> >> to be applied. The token could be just the XID, or the LSN of the commit
> >> record. Or the application could generate the token and pass it to the
> >> server in the commit, similar to how 2PC works. So the interaction would be
> >> something like:
> >>
> >> In master:
> >> BEGIN;
> >> INSERT INTO FOO ...;
> >> COMMIT;
> >> Server returns: COMMITted with token 1234
> >>
> >> Later, in standby:
> >> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
> >> SELECT * FROM foo;
> >> ...
>
> To avoid read anomalies (backwards timetravel) it should also be
> possible to receive a token from read-only transactions based on the
> latest snapshot used.
>
> > My thinking was that the reason for wanting to load balance over a set of
> > hot standbys is because you have a very read-heavy workload, so it makes
> > sense to tax the writers and leave the many dominant readers unburdened,
> so
> > (3) should be better than (2) for the majority of users who want such a
> > configuration.  (Note also that it's not a requirement to tax every
> write;
> > with this proposal you can set causal_reads to off for those transactions
> > where you know there is no possibility of a causally dependent read).
> >
> > As for (1), my thinking was that most application developers would probably
> > prefer not to have to deal with that type of interface.  For users who do
> > want to do that, it would be comparatively simple to make that possible, and
> > would not conflict with this proposal.  This proposal could be used by
> > people retrofitting load balancing to existing applications with relative
> > ease, or simply not wanting to have to deal with LSNs and complexity.  (I
> > have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout)
> > separately, which could be called on a standby with the lsn obtained from
> > pg_current_xlog_location() on the primary any time after a COMMIT completes,
> > but I was thinking of that as a different feature addressing a different
> > user base: people prepared to do more work to squeeze out some extra
> > performance.)
>
> Although I still think that 1) is the correct long term solution I
> must say that I agree with the reasoning presented. I think we should
> review the API in the light that in the future we might have a mix of
> clients, some clients that are able to keep track of causality tokens
> and either want to wait when a read request arrives, or pick a host to
> use based on the token, and then there are "dumb" clients that want to
> use write side waits.
>

Exactly!

I see the causality tokens approach (thank you for that terminology) not so
much as a "long term" solution, but rather as an expert feature likely to
interest a small number of sophisticated users willing to take on more
responsibility in exchange for greater control.  We should definitely add
support for that, and I expect the patch would be fairly simple and short.

But I believe the vast majority of users would like to be able to run new
and existing plain SQL on any node and see the data they just wrote, with
graceful failure modes, and without extra conceptual load or invasive code
changes.  So I think we should cater for that mode of usage too.

Also, it should be possible to configure which standbys are considered
> for waiting on. Otherwise a reporting slave will occasionally catch up
> enough to be considered "available" and then cause a latency peak when
> a long query blocks apply again.
>

Good point.  Here's a new version which adds the GUC
causal_reads_standby_names, defaulting to '*' (but as before, the feature
is not activated until you set causal_reads_timeout).  Now you can list
standby names explicitly if you want a way to exclude certain standbys.
Also, I noticed that cascaded standbys shouldn't be available for causal
reads, so I added a check for that.
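
For example, with the v2 patch a primary that should only ever wait for two
particular standbys might have something like this in postgresql.conf (the
standby names are placeholders and must match the standbys' application_name):

  causal_reads_timeout = 4s
  causal_reads_standby_names = 'standby1, standby2'    # default is '*'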

-- 
Thomas Munro
http://www.enterprisedb.com


causal-reads-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Peter Eisentraut
On 11/11/15 4:22 AM, Thomas Munro wrote:
> My thinking was that the reason for wanting to load balance over a set
> of hot standbys is because you have a very read-heavy workload, so it
> makes sense to tax the writers and leave the many dominant readers
> unburdened, so (3) should be better than (2) for the majority of users
> who want such a configuration.

One problem I can see is that even if you have a read-heavy workload,
the writes can still be a bottleneck, since they are necessarily bound
to one node.  And so if the feature proposal is, we can make your reads
more consistent but the writes will become slower, then that's not a
good deal.

More generally, no matter whether you pick the writers or the readers to
wait, if you assume that read-only slaves are an application performance
feature, then it's questionable how much better such applications will
perform overall when network-bound waits are introduced in the system.

I think in practice applications that are busy enough to worry about
this don't really work like that anyway.  For example, the writes should
go to a message queue and be written out whenever, with a copy kept in
a cache for display in the meantime.  Maybe there could be additional
features to make managing this easier.

I think there are a lot of different variations of this in practice, not
only depending on the workload and other measurables, but also
business-dependent decisions on application behavior and degradability.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Simon Riggs
On 11 November 2015 at 05:37, Thomas Munro 
wrote:

> Many sites use hot standby servers to spread read-heavy workloads over more
> hardware, or at least would like to.  This works well today if your
> application can tolerate some time lag on standbys.  The problem is that
> there is no guarantee of when a particular commit will become visible for
> clients connected to standbys.  The existing synchronous commit feature is
> no help here because it guarantees only that the WAL has been flushed on
> another server before commit returns.  It says nothing about whether it has
> been applied or whether it has been applied on the standby that you happen
> to be talking to.
>

Thanks for working on this issue.


> 3.  Commit on the primary with "causal_reads = on" waits for all
> 'available' standbys either to apply the commit record, or to cease to be
> 'available' and begin raising the error if they are still alive (because
> their authorizations have expired).
>

This causes every writer to wait.

What we want is to isolate the wait only to people performing a write-read
sequence, so I think it should be readers that wait. Let's have that debate
up front before we start reviewing the patch.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


[HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-10 Thread Thomas Munro
Hi hackers,

Many sites use hot standby servers to spread read-heavy workloads over more
hardware, or at least would like to.  This works well today if your
application can tolerate some time lag on standbys.  The problem is that
there is no guarantee of when a particular commit will become visible for
clients connected to standbys.  The existing synchronous commit feature is
no help here because it guarantees only that the WAL has been flushed on
another server before commit returns.  It says nothing about whether it has
been applied or whether it has been applied on the standby that you happen
to be talking to.

A while ago I posted a small patch[1] to allow synchronous_commit to wait
for remote apply on the current synchronous standby, but (as Simon Riggs
rightly pointed out in that thread) that part isn't the main problem.  It
seems to me that the main problem for a practical 'writer waits' system is
how to support a dynamic set of servers, gracefully tolerating failures and
timing out laggards, while also providing a strong guarantee during any
such transitions.  Now I would like to propose something to do that, and
share a proof-of-concept patch.


=== PROPOSAL ===

The working name for the proposed feature is "causal reads", because it
provides a form of "causal consistency"[2] (and "read-your-writes"
consistency) no matter which server the client is connected to.  There is a
similar feature by the same name in another product (albeit implemented
differently -- 'reader waits'; more about that later).  I'm not wedded to
the name.

The feature allows arbitrary read-only transactions to be run on any hot
standby, with a specific guarantee about the visibility of preceding
transactions.  The guarantee is that if you set a new GUC "causal_reads =
on" in any pair of consecutive transactions (tx1, tx2) where tx2 begins
after tx1 successfully returns, then tx2 will either see tx1 or fail with a
new error "standby is not available for causal reads", no matter which
server it runs on.  A discovery mechanism is also provided, giving an
instantaneous snapshot of the set of standbys that are currently available
for causal reads (ie won't raise the error), in the form of a new column in
pg_stat_replication.

For example, a web server might run tx1 to insert a new row representing a
message in a discussion forum on the primary server, and then send the user
to another web page that runs tx2 to load all messages in the forum on an
arbitrary hot standby server.  If causal_reads = on in both tx1 and tx2
(for example, because it's on globally), then tx2 is guaranteed to see the
new post, or get a (hopefully rare) error telling the client to retry on
another server.

Very briefly, the approach is:
1.  The primary tracks apply lag on each standby (including between
commits).
2.  The primary deems standbys 'available' for causal reads if they are
applying WAL and replying to keepalives fast enough, and periodically sends
the standby an authorization to consider itself available for causal reads
until a time in the near future.
3.  Commit on the primary with "causal_reads = on" waits for all
'available' standbys either to apply the commit record, or to cease to be
'available' and begin raising the error if they are still alive (because
their authorizations have expired).
4.  Standbys can start causal reads transactions only while they have an
authorization with an expiry time in the future; otherwise they raise an
error when an initial snapshot is taken.

In a follow-up email I can write about the design trade-offs considered
(mainly 'writer waits' vs 'reader waits'), comparison with some other
products, method of estimating replay lag, wait and timeout logic and how
it maintains the guarantee in various failure scenarios, logic for standbys
joining and leaving, implications of system clock skew between servers, or
any other questions you may have, depending on feedback/interest (but see
comments in the attached patch for some of those subjects).  For now I
didn't want to clog up the intertubes with too large a wall of text.


=== PROOF-OF-CONCEPT ===

Please see the POC patch attached.  It adds two new GUCs.  After setting up
one or more hot standbys as per usual, simply add "causal_reads_timeout =
4s" to the primary's postgresql.conf and restart.  Now, you can set
"causal_reads = on" in some/all sessions to get guaranteed causal
consistency.  Expected behaviour: the causal reads guarantee is maintained
at all times, even when you overwhelm, kill, crash, disconnect, restart,
pause, add and remove standbys, and the primary drops them from the set it
waits for in a timely fashion.  You can monitor the system with the
replay_lag and causal_reads_status columns in pg_stat_replication and some state
transition LOG messages on the primary.  (The patch also supports
"synchronous_commit = apply", but it's not clear how useful that is in
practice, as already discussed.)
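
For example, to watch the standbys from another session, a query along these
lines (replay_lag and causal_reads_status being the columns the patch adds)
shows which of them are currently available:

  SELECT application_name, replay_lag, causal_reads_status
  FROM pg_stat_replication;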

Lastly, a few notes about how this feature relates to some other