Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-10-02 Thread Simon Riggs

On Thu, 2008-09-11 at 17:58 +0300, Heikki Linnakangas wrote:

 BTW, we haven't talked about how to acquire a snapshot in the slave. 
 You'll somehow need to know which transactions have not yet committed,
 but will in the future. In the master, we keep track of in-progress 
 transaction in the ProcArray, so I suppose we'll need to do the same
 in the slave. Very similar to prepared transactions, actually. I
 believe the Abort records, which are not actually needed for normal
 operation, become critical here. The slave will need to put an entry
 to ProcArray for any new XLogRecord.xl_xid it sees in the WAL, and
 remove the entry at a Commit and Abort record. And clear them all at a
 shutdown record.

Although I said differently earlier, it seems cleaner to make recovery
snapshots work by emulating ProcArray entries as you suggest. Fatal
errors were my problem there, but I think that's solvable now.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-25 Thread Simon Riggs

On Wed, 2008-09-24 at 21:19 -0400, Bruce Momjian wrote:
 Simon Riggs wrote:
  2. Master ignores Standby's OldestXmin
  Effects:
  * Long running queries on standby...
 Have no effect on primary
 Can delay apply of WAL records on standby
  * Queries on standby give consistent answers in all cases.
 
 Just for clarification, if you set a max_slave_delay it means it is the
 maximum amount of time WAL replay can be delayed on the slave, _and_ it
 is the maximum amount of time a query/snapshot can be guaranteed to run
 without the possibility of being canceled.  My point is that these two
 concepts are linked to the same setting.

I didn't even see them as separate, but now that you mention it they
could be.

Startup process may need to wait up to max_slave_delay before applying
WAL, if it sees that it must cancel a query to continue. max_slave_delay
is set in postgresql.conf and honoured by the startup process.

What is not yet defined is whether max_slave_delay is a setting per
blockage event, or a total time that could accrue from a single long
blockage or many smaller ones. The latter makes more sense, but it would
only be sensible to calculate once we have WAL streaming, using
calculations similar to the bgwriter delay calculations. That definition
makes sense for the sysadmin because it is the amount of extra time a
standby could take to start up after a failover.

If we take max_slave_delay to mean the latter then there is no guarantee
for user queries, since we may already have used up all our time waiting
on one query and there may be no wait time left for the current user
query. It would depend heavily on the rows accessed, so queries might
easily run to completion even though they exceed max_slave_delay.
That makes sense because if you are asking a question like "what is the
current total of widgets available?" you understand that the answer changes
quickly over time, whereas "how many widgets were reordered on day
X?" doesn't change over time at all.
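
As a minimal C sketch of that "total time" interpretation (every name
here - query_conflicts_with, cancel_conflicting_queries, sleep_usecs,
delay_budget_us - is invented for illustration, not the actual server
API), the startup process would spend one shared budget across all
blockage events:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Assumed helpers; names are illustrative only. */
    extern bool query_conflicts_with(TransactionId latest_removed_xid);
    extern void cancel_conflicting_queries(TransactionId latest_removed_xid);
    extern void sleep_usecs(int64_t usecs);

    /* Remaining budget, initialised from max_slave_delay and shared
     * across every blockage event (the "total time" interpretation). */
    static int64_t delay_budget_us;

    static void
    resolve_replay_conflict(TransactionId latest_removed_xid)
    {
        const int64_t step_us = 100 * 1000;     /* poll every 100 ms */

        while (query_conflicts_with(latest_removed_xid))
        {
            if (delay_budget_us <= 0)
            {
                /* budget spent on earlier waits; no grace left here */
                cancel_conflicting_queries(latest_removed_xid);
                break;
            }
            sleep_usecs(step_us);
            delay_budget_us -= step_us;         /* spent budget never returns */
        }
    }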

If you define this as a per-wait-event setting it provides a guarantee
to the user queries, but also puts WAL apply into free-fall, since you
can't control the number of blockage points queries might cause.

From what Merlin was saying, it would be sensible to have multiple
slaves: one with a max_slave_delay of 30 seconds to be your HA
fast-startup standby and another where we set max_slave_delay to 5 hours
to guarantee execution time for large reporting queries. Anyway, it's
clear that max_slave_delay needs to be settable while running.

From my perspective, all I can say is all things are possible and I
appreciate the need for options to satisfy different use cases. I'll
build the basic mechanics and then we can add fine tuning over time. So
I'll get the basic mechanics in place, suggest we observe how that works
and then decide on the control mechanisms because they sound relatively
simple to add.

Another important note is Hannu's suggestion of using snapshot
filesystems with Postgres. We can include that feature very quickly and
it will complement Hot Standby very well.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-25 Thread Zeugswetter Andreas OSB sIT

 Simon Riggs wrote:
  2. Master ignores Standby's OldestXmin
  Effects:
  * Long running queries on standby...
 Have no effect on primary
 Can delay apply of WAL records on standby
  * Queries on standby give consistent answers in all cases.

 Just for clarification, if you set a max_slave_delay it means it is the
 maximum amount of time WAL replay can be delayed on the slave, _and_ it
 is the maximum amount of time a query/snapshot can be guaranteed to run
 without the possibility of being canceled.  My point is that these two
 concepts are linked to the same setting.

I wonder whether the cancel can be delayed until a tuple/page is actually
accessed that shows a too-new xid.

The procedure would be like this:
Instead of a cancel, the backend gets a message with a lsn_horizon.
From there on, whenever the backend reads a page/tuple with an LSN >
lsn_horizon it cancels.
I think that should allow some more queries to complete, especially ones
that target static tables, or static parts of large tables using
appropriate btree index ranges that are also static.
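
As a minimal sketch of the backend-side test this implies (hypothetical
names; cancel_myself stands in for whatever cancellation mechanism is
used):

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    /* Hypothetical backend-local state, set when the startup process
     * sends the lsn_horizon message instead of an immediate cancel. */
    static XLogRecPtr my_lsn_horizon = UINT64_MAX;  /* no horizon yet */

    extern void cancel_myself(const char *reason);  /* assumed helper */

    /* Called on each page read: rows this snapshot needs may already
     * have been removed from any page stamped with a newer LSN. */
    static void
    check_lsn_horizon(XLogRecPtr page_lsn)
    {
        if (page_lsn > my_lsn_horizon)
            cancel_myself("page modified past lsn_horizon; snapshot too old");
    }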

Andreas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-25 Thread Simon Riggs

On Thu, 2008-09-25 at 11:14 +0200, Zeugswetter Andreas OSB sIT wrote:
  Simon Riggs wrote:
   2. Master ignores Standby's OldestXmin
   Effects:
   * Long running queries on standby...
  Have no effect on primary
  Can delay apply of WAL records on standby
   * Queries on standby give consistent answers in all cases.
 
  Just for clarification, if you set a max_slave_delay it means it is the
  maximum amount of time WAL replay can be delayed on the slave, _and_ it
  is the maximum amount of time a query/snapshot can be guaranteed to run
  without the possibility of being canceled.  My point is that these two
  concepts are linked to the same setting.
 
 I wonder whether the cancel can be delayed until a tuple/page is actually 
 accessed
 that shows a too new xid.

Yes, it's feasible and is now part of the design.

This is all about what happens *if* we need to remove rows that a query
can still see.

We might also make HOT and VACUUM slightly less aggressive at removing
rows on the master. We can't use anything later than OldestXmin for row
removal, but we might choose to use something earlier. That would be an
alternative to using the standby's OldestXmin.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-25 Thread Simon Riggs

On Wed, 2008-09-24 at 21:22 -0400, Bruce Momjian wrote:
 Bruce Momjian wrote:
  Simon Riggs wrote:
   2. Master ignores Standby's OldestXmin
   Effects:
   * Long running queries on standby...
  Have no effect on primary
  Can delay apply of WAL records on standby
   * Queries on standby give consistent answers in all cases.
  
  Just for clarification, if you set a max_slave_delay it means it is the
  maximum amount of time WAL replay can be delayed on the slave, _and_ it
  is the maximum amount of time a query/snapshot can be guaranteed to run
  without the possibility of being canceled.  My point is that these two
  concepts are linked to the same setting.
 
 FYI, max_slave_delay does not relate to the amount of time of data loss
 in the case of master failure, assuming the WAL files are stored on the
 slave in some way during the delay.

Another way to handle row removal would be to check whether any users
have access to particular tables or not. If we could issue a permission
to prevent access to security definer functions, then we'd be able to
work out with certainty whether a row removal would ever be visible to
certain users. That would allow us to isolate fast changing tables from
slow changing ones.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-25 Thread Zeugswetter Andreas OSB sIT

  I wonder whether the cancel can be delayed until a tuple/page is actually 
  accessed
  that shows a too new xid.

 Yes, it's feasible and is now part of the design.

 This is all about what happens *if* we need to remove rows that a query
 can still see.

I was describing a procedure for exactly that case.

If a slave backend has a snapshot that we cannot guarantee any more
(because max_slave_delay has been exceeded):

  Instead of cancel, the backend gets a message with a lsn_horizon.
  From there on, whenever the backend reads a page/tuple with an LSN >
  lsn_horizon it cancels.

but not before (at the time max_slave_delay has been reached), as described 
earlier.

Andreas

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-25 Thread Simon Riggs

On Thu, 2008-09-25 at 12:34 +0200, Zeugswetter Andreas OSB sIT wrote:
   I wonder whether the cancel can be delayed until a tuple/page is actually 
   accessed
   that shows a too new xid.
 
  Yes, it's feasible and is now part of the design.
 
  This is all about what happens *if* we need to remove rows that a query
  can still see.
 
 I was describing a procedure for exactly that case.

Ok, I see, sorry.

 If a slave backend has a snapshot that we cannot guarantee any more
 (because max_slave_delay has been exceeded):
 
   Instead of cancel, the backend gets a message with a lsn_horizon.
   From there on, whenever the backend reads a page/tuple with an LSN >
   lsn_horizon it cancels.
 
 but not before (at the time max_slave_delay has been reached), as described 
 earlier.

Like that.

OK, so in full: 

Each WAL record that cleans tuples has a latestxid on it. If latestxid
is later than the xmin of a running query on the standby then we wait.
When we stop waiting we tell all at-risk queries the LSN of the first WAL
record that has potentially removed tuples they can see. If they see a
block with a higher LSN they cancel *themselves*.

This works OK, since SeqScans never read blocks at end of file that
didn't exist when they started, so long queries need not be cancelled
when they access growing tables.
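
A rough sketch of the startup-process side of that combined scheme, with
any_snapshot_sees, wait_with_budget and broadcast_lsn_horizon as
illustrative stand-ins rather than real server functions:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint64_t XLogRecPtr;

    /* Assumed helpers; names are illustrative only. */
    extern bool any_snapshot_sees(TransactionId latestxid);
    extern bool wait_with_budget(void);                 /* false once budget spent */
    extern void broadcast_lsn_horizon(XLogRecPtr lsn);  /* notify at-risk backends */

    /* Before replaying a cleanup record: wait while any standby snapshot
     * can still see the removed rows; once we give up waiting, publish
     * the record's LSN so queries cancel themselves only if they
     * actually touch a newer page. */
    static void
    before_replay_cleanup(TransactionId latestxid, XLogRecPtr record_lsn)
    {
        while (any_snapshot_sees(latestxid))
        {
            if (!wait_with_budget())
            {
                broadcast_lsn_horizon(record_lsn);
                break;                      /* replay proceeds immediately */
            }
        }
    }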

That combines all the suggested approaches into one. It still leaves the
possibility of passing the standby's OldestXmin to the primary, but does
not rely upon it.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-24 Thread Bruce Momjian
Simon Riggs wrote:
 2. Master ignores Standby's OldestXmin
 Effects:
 * Long running queries on standby...
Have no effect on primary
Can delay apply of WAL records on standby
 * Queries on standby give consistent answers in all cases.

Just for clarification, if you set a max_slave_delay it means it is the
maximum amount of time WAL replay can be delayed on the slave, _and_ it
is the maximum amount of time a query/snapshot can be guaranteed to run
without the possibility of being canceled.  My point is that these two
concepts are linked to the same setting.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-24 Thread Bruce Momjian
Bruce Momjian wrote:
 Simon Riggs wrote:
  2. Master ignores Standby's OldestXmin
  Effects:
  * Long running queries on standby...
 Have no effect on primary
 Can delay apply of WAL records on standby
  * Queries on standby give consistent answers in all cases.
 
 Just for clarification, if you set a max_slave_delay it means it is the
 maximum amount of time WAL replay can be delayed on the slave, _and_ it
 is the maximum amount of time a query/snapshot can be guaranteed to run
 without the possibility of being canceled.  My point is that these two
 concepts are linked to the same setting.

FYI, max_slave_delay does not relate to the amount of time of data loss
in the case of master failure, assuming the WAL files are stored on the
slave in some way during the delay.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Simon Riggs

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:

 The main idea was to invert the meaning of the xid array in the snapshot
 struct - instead of storing all the xid's between xmin and xmax that are
 to be considered in-progress, the array contained all the xid's > xmin
 that are to be considered completed.

 The downside is that the size of the read-only snapshot is theoretically
 unbounded, which poses a bit of a problem if it's supposed to live
 inside shared memory...

Why do it inverted? That clearly has problems.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Simon Riggs

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:

 The "current" read-only snapshot ("current" meaning the
 corresponding state on the master at the time the last replayed WAL
 record was generated) was maintained in shared memory. Its xmin field
 was continually updated with the (newly added) XLogRecord.xl_xmin
 field, which contained the xid of the oldest running query on the
 master, with a pruning step after each ReadOnlySnapshot.xmin update to
 remove all entries < xmin from the xid array. If a commit was seen for
 an xid, that xid was added to the ReadOnlySnapshot.xid array.
 
 The advantage of this concept is that it handles snapshotting on the
 slave without too much additional work for the master (The only change
 is the addition of the xl_xmin field to XLogRecord). It especially
 removes the need to track ShmemVariableCache->nextXid.

Snapshots only need to know which transactions are currently running
during WAL apply. The standby can't remove any tuples itself, so it
doesn't need to know what the master's OldestXmin is. 

So passing xl_xmin from master to standby seems unnecessary to me. The
standby's OldestXmin needs to be passed through to the master, not the
other way around, so that WAL records for tuple removal are not
generated.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Gregory Stark

Simon Riggs [EMAIL PROTECTED] writes:

 So passing xl_xmin from master to standby seems not necessary to me. The
 standby's OldestXmin needs to be passed through to the master, not the
 other way around so that WAL records for tuple removal are not
 generated.

I think most people were pretty leery of doing it that way because it means
activity on the standby database can cause the master to bloat. The consensus
seemed to be headed towards having WAL replay on the standby stall if it meets
a tuple removal record for a tuple that's visible to a query running on it.
Probably with a mechanism to configure a maximum amount of time it can be
stalled before shooting those queries.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Simon Riggs

On Mon, 2008-09-15 at 13:13 +0100, Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  So passing xl_xmin from master to standby seems not necessary to me. The
  standby's OldestXmin needs to be passed through to the master, not the
  other way around so that WAL records for tuple removal are not
  generated.
 
 I think most people were pretty leery of doing it that way because it means
 activity on the standby database can cause the master to bloat. The consensus
 seemed to be headed towards having WAL replay on the standby stall if it meets
 a tuple removal record for a tuple that's visible to a query running on it.
 Probably with a mechanism to configure a maximum amount of time it can be
 stalled before shooting those queries.

Well, my impression from all inputs is there is no single preferred
route. Any one of the approaches seems to have a possible objection,
depending upon the needs of the user. So I'm going to give options.

In any case it's not just a two-horse race. There are other options,
favoured by some people, that you've not personally commented on in any
of your summaries (that's up to you, of course).

And Standby causing master to bloat is not such a big problem. It's no
different to running queries directly on the master. But please don't
take it that I don't see the problem or think it the best solution in
all cases.

Certainly, halting WAL replay for any length of time might mean it can
never catch up again, so that won't be acceptable for many cases.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Merlin Moncure
On Mon, Sep 15, 2008 at 8:40 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 On Mon, 2008-09-15 at 13:13 +0100, Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:

  So passing xl_xmin from master to standby seems not necessary to me. The
  standby's OldestXmin needs to be passed through to the master, not the
  other way around so that WAL records for tuple removal are not
  generated.

 I think most people were pretty leery of doing it that way because it means
 activity on the standby database can cause the master to bloat. The consensus
 seemed to be headed towards having WAL replay on the standby stall if it 
 meets
 a tuple removal record for a tuple that's visible to a query running on it.
 Probably with a mechanism to configure a maximum amount of time it can be
 stalled before shooting those queries.

 Well, my impression from all inputs is there is no single preferred
 route. Any one of the approaches seems to have a possible objection,
 depending upon the needs of the user. So I'm going to give options.

 In any case it's not just a two horse race. There are other options,
 favoured by some people, that you've not personally commented on in any
  of your summaries (that's up to you, of course).

 And Standby causing master to bloat is not such a big problem. It's no
 different to running queries directly on the master. But please don't
 take it that I don't see the problem or think it the best solution in
 all cases.

It's not a problem, but a relative disadvantage.  What makes warm
standby really attractive relative to other data transfer solutions is
that there are no side effects on the master outside of setting the
archive command...communication is one way.  Any 'master bloat' style
approach seems to be increasingly fragile if you want to increase the
number of standby servers, if it's even possible to do that.

I'm not saying it should be the only way to do it (and it may not even
be possible)...but it would be very attractive to be able to run a hot
standby much the same as a warm standby runs today...I could, for
example, easily script a second standby but keep it a week behind.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Heikki Linnakangas

Gregory Stark wrote:

Simon Riggs [EMAIL PROTECTED] writes:


So passing xl_xmin from master to standby seems not necessary to me. The
standby's OldestXmin needs to be passed through to the master, not the
other way around so that WAL records for tuple removal are not
generated.


I think most people were pretty leery of doing it that way because it means
activity on the standby database can cause the master to bloat. The consensus
seemed to be headed towards having WAL replay on the standby stall if it meets
a tuple removal record for a tuple that's visible to a query running on it.
Probably with a mechanism to configure a maximum amount of time it can be
stalled before shooting those queries.


Yes, but we can quite easily provide an option on top of that to 
advertise the slave xmin back to the master, for those who prefer some 
bloat on master over stalling replay or killing queries in the slave. In 
fact, I think a lot of people will choose some compromise, where the 
slave xmin is advertised back to the master, but the master will obey it 
only up to some limit, after which the slave will need to stall or kill 
queries again.


It's not something that needs to be done in the first phase, but should 
be straightforward to add after the basics are working. In any case, 
we'll need the capability in the slave to notice when it's about to 
remove a tuple that's still visible to a snapshot in the slave.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Simon Riggs

On Mon, 2008-09-15 at 16:26 +0300, Heikki Linnakangas wrote:
 In any case, we'll need the capability in the slave to notice when
 it's about to remove a tuple that's still visible to a snapshot in the
 slave.

Agreed.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Simon Riggs

On Mon, 2008-09-15 at 09:07 -0400, Merlin Moncure wrote:
  Well, my impression from all inputs is there is no single preferred
  route. Any one of the approaches seems to have a possible objection,
  depending upon the needs of the user. So I'm going to give options.
 
  In any case it's not just a two horse race. There are other options,
  favoured by some people, that you've not personally commented on in
 any
  of your summaries (that's up to you, of course).
 
  And Standby causing master to bloat is not such a big problem. It's no
  different to running queries directly on the master. But please don't
  take it that I don't see the problem or think it the best solution in
  all cases.
 
 It's not a problem, but a relative disadvantage.  

In some circumstances, yes; in others, not. But it's not the only
consideration, and different people attach different weightings in their
decision making. 

 What makes warm
 standby really attractive relative to other data transfer solutions is
 that there are no side effects on the master outside of setting the
 archive command...communication is one way.  Any 'master bloat' style
 approach seems to be increasingly fragile if you want to increase the
 number of standby servers, if it's even possible to do that.
 
 I'm not saying it should be the only way to do it (and it may not even
 be possible)...but it would be very attractive to be able to run a hot
 standby much the same as a warm standby is running today...I could,
 for example easily script a second standby but keep it a week behind
 for example.

I hope to provide options for all users, not just a single approach. If
you choose never to use the option to link standby and master, I respect
that and understand. I've listened to your requirements and believe I'm
including things to allow it to work for you the way you've said.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Simon Riggs

On Mon, 2008-09-15 at 16:26 +0300, Heikki Linnakangas wrote:
 In any case, 
 we'll need the capability in the slave to notice when it's about to 
 remove a tuple that's still visible to a snapshot in the slave.

Looks like what I'll do is this:

Alter functions in pruneheap.c so that we bubble up the latest xid that
is being removed as part of block cleaning. We then add that xid into
the WAL record for cleaning.

If latest xid of clean is ahead of oldestxmin of running queries on
standby then Startup process needs to take action, of some kind.
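
A simplified sketch of the bubbling-up step, using stand-in types rather
than the real pruneheap.c structures, and ignoring xid wraparound:

    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)

    /* Simplified stand-in for a dead tuple's header. */
    typedef struct
    {
        TransactionId xmax;     /* xid that deleted the tuple */
    } DeadTupleStub;

    /* While collecting removable tuples, bubble up the latest xmax so it
     * can be stamped on the cleanup WAL record and compared against
     * standby queries at replay (wraparound ignored here). */
    static TransactionId
    prune_latest_removed_xid(const DeadTupleStub *dead, int ndead)
    {
        TransactionId latest = InvalidTransactionId;

        for (int i = 0; i < ndead; i++)
            if (dead[i].xmax > latest)
                latest = dead[i].xmax;

        return latest;          /* goes into the WAL record for the clean */
    }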

Re-examining the tuples during WAL apply seems a bad plan, since we'd
have to touch stuff in the block twice and juggle the locks.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Robert Treat
On Friday 12 September 2008 07:44:36 Csaba Nagy wrote:
 And then
 the actually interesting question: what will the slave do with views,
 rules, triggers ? I guess triggers are out of the question to be
 executed, what about rules ? Probably must be also ignored... user
 functions will probably get errors if they try to update something...
 Views should probably function correctly.


If we don't have rules, we don't get views, so those need to be
supported. Really we should allow anything that would work in the context
of a read-only transaction (i.e. functions that don't change anything
should be fine to call).

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Florian G. Pflug

Simon Riggs wrote:

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:


The main idea was to invert the meaning of the xid array in the snapshot
struct - instead of storing all the xid's between xmin and xmax that are
to be considered in-progress, the array contained all the xid's > xmin
that are to be considered completed.



The downside is that the size of the read-only snapshot is theoretically
unbounded, which poses a bit of a problem if it's supposed to live
inside shared memory...


Why do it inverted? That clearly has problems.


Because it solves the problem of spontaneously appearing XIDs in the
WAL. At least prior to 8.3 with virtual xids, a transaction might have
allocated its xid long before actually writing anything to disk, and
therefore long before this XID ever shows up in the WAL. And with a
non-inverted snapshot such an XID would be considered to be completed
by transactions on the slave... So, one either needs to periodically log
a snapshot on the master or log XID allocations, which both seem to cause
considerable additional load on the master. With an inverted snapshot,
it's sufficient to log the current RecentXmin - a value that is readily
available on the master, and therefore the cost amounts to just one
additional 4-byte field per xlog entry.


regards, Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Florian G. Pflug

Simon Riggs wrote:

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:


The "current" read-only snapshot ("current" meaning the
corresponding state on the master at the time the last replayed WAL
record was generated) was maintained in shared memory. Its xmin field
was continually updated with the (newly added) XLogRecord.xl_xmin
field, which contained the xid of the oldest running query on the
master, with a pruning step after each ReadOnlySnapshot.xmin update to
remove all entries < xmin from the xid array. If a commit was seen for
an xid, that xid was added to the ReadOnlySnapshot.xid array.

The advantage of this concept is that it handles snapshotting on the
slave without too much additional work for the master (The only change
is the addition of the xl_xmin field to XLogRecord). It especially
removes the need to track ShmemVariableCache->nextXid.


Snapshots only need to know which transactions are currently running
during WAL apply. The standby can't remove any tuples itself, so it
doesn't need to know what the master's OldestXmin is. 


I used the logged xmin value to track the shared snapshot's xmin, which
in turn allowed me to prune the xid array, eliminating all xids < that xmin.


regards, Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Simon Riggs

On Mon, 2008-09-15 at 19:20 +0100, Florian G. Pflug wrote:
 Simon Riggs wrote:
  On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:
  
  The main idea was to invert the meaning of the xid array in the snapshot
  struct - instead of storing all the xid's between xmin and xmax that are
  to be considered in-progress, the array contained all the xid's > xmin
  that are to be considered completed.
  
  The downside is that the size of the read-only snapshot is theoretically
  unbounded, which poses a bit of a problem if it's supposed to live
  inside shared memory...
  
  Why do it inverted? That clearly has problems.
 
 Because it solves the problem of spontaneously appearing XIDs in the
 WAL. At least prior to 8.3 with virtual xids, a transaction might have
 allocated its xid long before actually writing anything to disk, and
 therefore long before this XID ever shows up in the WAL. And with a
 non-inverted snapshot such an XID would be considered to be completed
 by transactions on the slave... So, one either needs to periodically log
 a snapshot on the master or log XID allocations, which both seem to cause
 considerable additional load on the master. With an inverted snapshot,
 it's sufficient to log the current RecentXmin - a value that is readily
 available on the master, and therefore the cost amounts to just one
 additional 4-byte field per xlog entry.

I think I understand what you're saying now, though I think it
mostly/only applies before your brilliant idea in 8.3.

If we have a transaction history that looks like this:

ReadA, WriteB, WriteA (both commit in either order)

then pre-8.3 we would have xidA < xidB, whereas at 8.3 and above we see
that xidA is actually higher than xidB. Now, TransactionIds are assigned
in the order of their first page write and *not* in the order of
transaction start as was previously the case, which isn't very obvious.

So when we replay WAL, we know that WAL is only written with locks held,
so WriteA and WriteB must be independent of each other. That means the
locks held by ReadA can be ignored and we can assume that the above
history is the same as

WriteB, WriteA

So if we took a snapshot in the middle of WriteB we would be safe to say
that only transaction B was in progress and that transaction A was not
yet started. So the snapshot we derive on the standby is different to
the one we would have derived on the client, yet the serializable order
is the same. In general, this means that all reads on a transaction
prior to the first write can be reordered later so that they can be
assumed to occur exactly prior to the first write of a transaction.
(Please shoot me down, if incorrect).

So when we see the first WAL record of a transaction we know that there
are no in progress transactions with a *lower* xid that we have not yet
seen. So we cannot be confused about whether a transaction is
in-progress, or not.

Almost. Now having written all of that I see there is an obvious race
condition between assignment of an xid and actions that result in the
acquisition of WALInsertLock. So even though the above seems mostly
correct, there is still a gap to plug, but a much smaller one.

So when writing the first WAL record for xid=48, it is possible that
xid=47 has just been assigned and is also just about to write a WAL
record. Thus the absence of a WAL record for xid=47 is not evidence that
xid=47 is complete because it was read-only.

We might handle this with the inverted technique you describe, but there
should be an easier way to track dense packing of xid sequence.

We expect xids to be written to WAL in the order assigned, so we might
check whether a newly assigned xid is 1 higher than the highest value yet
inserted into WAL. If it is not, then we can write a short WAL record to
inform readers of WAL that the missing xids in the sequence are in
progress also. So readers of WAL will see xids in the correct sequence
and are thus able to construct valid snapshots directly from WAL.
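
As a rough sketch (illustrative names, wraparound ignored), the check at
a transaction's first WAL write might look like:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Assumed WAL writer for the gap record; illustrative name. */
    extern void xlog_mark_xids_in_progress(TransactionId from, TransactionId to);

    /* Highest xid seen in WAL so far (wraparound ignored). */
    static TransactionId latest_xid_in_wal = 0;

    /* On a transaction's first WAL write: announce any skipped xids as
     * in-progress, so WAL readers always see a dense xid sequence. */
    static void
    note_first_wal_write(TransactionId xid)
    {
        if (xid > latest_xid_in_wal + 1)
            xlog_mark_xids_in_progress(latest_xid_in_wal + 1, xid - 1);
        if (xid > latest_xid_in_wal)
            latest_xid_in_wal = xid;
    }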

I think I should measure how often that occurs to see what problem or
overhead this might cause, if any.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-13 Thread Florian G. Pflug

Heikki Linnakangas wrote:

BTW, we haven't talked about how to acquire a snapshot in the slave.
 You'll somehow need to know which transactions have not yet
committed, but will in the future. In the master, we keep track of
in-progress transaction in the ProcArray, so I suppose we'll need to
do the same in the slave. Very similar to prepared transactions,
actually. I believe the Abort records, which are not actually needed
for normal operation, become critical here. The slave will need to
put an entry to ProcArray for any new XLogRecord.xl_xid it sees in
the WAL, and remove the entry at a Commit and Abort record. And clear
them all at a shutdown record.


For reference, here is how I solved the snapshot problem in my
Summer-of-Code project last year, which dealt exactly with executing
read-only queries on PITR slaves (But sadly never came out of alpha
stage due to both my and Simon's lack of time)

The main idea was to invert the meaning of the xid array in the snapshot
struct - instead of storing all the xid's between xmin and xmax that are
to be considered in-progress, the array contained all the xid's > xmin
that are to be considered completed.
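
A minimal sketch of that inverted visibility rule (struct and names
invented for illustration); the useful property is that an xid the slave
has never heard of defaults to in-progress rather than completed:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Inverted snapshot: xids above xmin listed here are COMPLETED;
     * any other xid above xmin counts as in-progress. */
    typedef struct
    {
        TransactionId  xmin;
        int            nxids;
        TransactionId *completed_xids;
    } InvertedSnapshot;

    static bool
    xid_is_completed(const InvertedSnapshot *snap, TransactionId xid)
    {
        if (xid < snap->xmin)
            return true;                /* older than every running xact */
        for (int i = 0; i < snap->nxids; i++)
            if (snap->completed_xids[i] == xid)
                return true;
        /* Unknown xids default to in-progress - which is what makes xids
         * that appear "out of nowhere" in the WAL harmless. */
        return false;
    }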

The "current" read-only snapshot ("current" meaning the
corresponding state on the master at the time the last replayed WAL
record was generated) was maintained in shared memory. Its xmin field
was continually updated with the (newly added) XLogRecord.xl_xmin
field, which contained the xid of the oldest running query on the
master, with a pruning step after each ReadOnlySnapshot.xmin update to
remove all entries < xmin from the xid array. If a commit was seen for
an xid, that xid was added to the ReadOnlySnapshot.xid array.

The advantage of this concept is that it handles snapshotting on the
slave without too much additional work for the master (The only change
is the addition of the xl_xmin field to XLogRecord). It especially
removes the need to track ShmemVariableCache->nextXid.

The downside is that the size of the read-only snapshot is theoretically
unbounded, which poses a bit of a problem if it's supposed to live
inside shared memory...

regards, Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Thu, 2008-09-11 at 17:58 +0300, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  So part of the handshake between
  primary and standby must be "what is your recentxmin?". The primary will
  then use the lower/earliest of the two.
 
 Even then, the master might already have vacuumed away tuples that are 
 visible to an already running transaction in the slave, before the slave 
 connects. Presumably the master doesn't wait for the slave to connect 
 before starting to accept new connections.

Yep, OK.

  As you mentioned, the options there are to defer applying WAL, or cancel 
  queries. I think both options need the same ability to detect when 
  you're about to remove a tuple that's still visible to some snapshot, 
  just the action is different. We should probably provide a GUC to 
  control which you want.
  
  I don't see any practical way of telling whether a tuple removal will
  affect a snapshot or not. Each removed row would need to be checked
  against each standby snapshot. Even if those were available, it would be
  too costly. 
 
 How about using the same method as we use in HeapTupleSatisfiesVacuum? 
 Before replaying a vacuum record, look at the xmax of the tuple 
 (assuming it committed). If it's < slave's OldestXmin, it can be 
 removed. Otherwise not. Like HeapTupleSatisfiesVacuum, it's 
 conservative, but doesn't require any extra bookkeeping.
 
 And vice versa: if we implement the more precise book-keeping, with all 
 snapshots in shared memory or something, we might as well use it in 
 HeapTupleSatisfiesVacuum. That has been discussed before, but it's a 
 separate project.

Tuple removals earlier than the slave's OldestXmin are easy, that's true.
I'm not sure what you had in mind for "Otherwise not"?

Maybe you mean stop applying WAL until the slave's OldestXmin is > the
tuple removal xid. Not sure, reading other subthreads of this post.
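
The conservative replay-time test Heikki describes reduces to one
comparison; a sketch with an assumed accessor for the standby's oldest
xmin, wraparound ignored:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Assumed accessor for the oldest xmin of any standby snapshot. */
    extern TransactionId standby_oldest_xmin(void);

    /* A removal whose committed xmax precedes every standby snapshot is
     * safe to apply now; otherwise the startup process must wait or
     * cancel queries. */
    static bool
    removal_safe_to_replay(TransactionId tuple_xmax)
    {
        return tuple_xmax < standby_oldest_xmin();
    }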

I think it's possible to defer removal actions on specific blocks only,
but that is an optimisation that's best left for a while.

BTW, tuple removals would need a cleanup lock on the block, just as they
do on the master server. So WAL apply can be delayed momentarily by
pin holders anyway, whatever we do.

  It was also suggested we might take the removed rows and put them in a
  side table, but that makes me think of the earlier ideas for HOT and so
  I've steered clear of that.
 
 Yeah, that's non-trivial. Basically a whole new, different 
 implementation of MVCC, but without changing any on-disk formats.
 
 BTW, we haven't talked about how to acquire a snapshot in the slave. 
 You'll somehow need to know which transactions have not yet committed, 
 but will in the future. In the master, we keep track of in-progress 
 transaction in the ProcArray, so I suppose we'll need to do the same in 
 the slave. Very similar to prepared transactions, actually. I believe 
 the Abort records, which are not actually needed for normal operation, 
 become critical here. The slave will need to put an entry to ProcArray 
 for any new XLogRecord.xl_xid it sees in the WAL, and remove the entry 
 at a Commit and Abort record. And clear them all at a shutdown record.

I wouldn't do it like that.

I was going to maintain a current snapshot in shared memory, away from
the PROCARRAY. Each time we see a TransactionId we check whether it's
already been seen; if not, we insert it. When a transaction commits or
aborts we remove the stated xid. If we see a shutdown checkpoint we
clear the array completely. When query backends want a snapshot they
just read the array. It doesn't matter whether queries commit or abort,
since those changes can't be seen anyway by queries until commit.
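
A bare-bones sketch of that array's maintenance (fixed size, no locking,
all names invented for illustration):

    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define MAX_TRACKED_XIDS 1024

    /* Shared-memory array of xids seen in WAL and not yet ended;
     * size, names and (absent) locking are illustrative only. */
    static TransactionId known_running[MAX_TRACKED_XIDS];
    static int           nknown = 0;

    static void
    saw_xid(TransactionId xid)          /* any WAL record's xl_xid */
    {
        for (int i = 0; i < nknown; i++)
            if (known_running[i] == xid)
                return;                 /* already tracked */
        if (nknown < MAX_TRACKED_XIDS)  /* overflow handling omitted */
            known_running[nknown++] = xid;
    }

    static void
    xid_ended(TransactionId xid)        /* commit or abort record */
    {
        for (int i = 0; i < nknown; i++)
            if (known_running[i] == xid)
            {
                known_running[i] = known_running[--nknown];
                return;
            }
    }

    static void
    saw_shutdown_checkpoint(void)       /* nothing can still be running */
    {
        nknown = 0;
    }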

Reason for doing it this way is PROCARRAY may be full of query backends,
so having dummy backends in there as well sounds confusing.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Thu, 2008-09-11 at 11:38 +0300, Hannu Krosing wrote:
 On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
 
  I like the idea of acquiring snapshots locally in the slave much more. 
  As you mentioned, the options there are to defer applying WAL, or cancel 
  queries. 
 
 More exotic ways to defer applying WAL include using some smart
 filesystems to get per-backend data snapshots, using either
 copy-on-write overlay filesystems and filesystem or disk level
 snapshots.

That's certainly possible.

That would mean we maintain a single consistent viewpoint of the whole
database for a period of time. The frozen viewpoint could move forwards
by operator command, or we might keep multiple frozen views.

We can have a LookasideHash table in memory that keeps track of which
blocks have had rows removed from them since the frozen view was
taken.

If you request a block, we check to see whether there is a lookaside
copy of it prior to the tuple removals. We then redirect the block
request to a viewpoint relation's block. Each viewpoint gets a separate
relfilenode. We do the switcheroo while holding cleanup lock on block.
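
A small sketch of the redirect step, with invented types;
lookaside_lookup stands in for the LookasideHash probe:

    #include <stdint.h>
    #include <stddef.h>

    typedef uint32_t BlockNumber;

    /* Invented types: a block address is (relfilenode, block number). */
    typedef struct
    {
        uint32_t    relfilenode;
        BlockNumber blkno;
    } BlockTag;

    /* Assumed probe into the LookasideHash: returns the viewpoint
     * relation's block holding the pre-removal copy, or NULL. */
    extern const BlockTag *lookaside_lookup(const BlockTag *tag);

    /* If rows were removed from the requested block since the frozen
     * view was taken, serve the saved copy from the viewpoint relation. */
    static BlockTag
    resolve_block(BlockTag requested)
    {
        const BlockTag *redirect = lookaside_lookup(&requested);

        return redirect ? *redirect : requested;
    }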

So effectively we would be adding frozen snapshot technology to
Postgres, although we would want to do copy-on-clean rather than
copy-on-write, which could mean significantly better performance.

That might be encumbered by patent in some way.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
 Gregory Stark wrote:
  b) vacuum on the server which cleans up a tuple the slave has in scope has to
 block WAL replay on the slave (which I suppose defeats the purpose of having
 a live standby for users concerned more with fail-over latency).
 
 One problem with this, BTW, is that if there's a continuous stream of 
 medium-length transaction in the slave, each new snapshot taken will 
 prevent progress in the WAL replay, so the WAL replay will advance in 
 baby steps, and can fall behind indefinitely. As soon as there's a 
 moment that there's no active snapshot, it can catch up, but if the 
 slave is seriously busy, that might never happen.

It should be possible to do mixed mode.

Stall WAL apply for up to X seconds, then cancel queries. Some people
may want X=0 or low, others might find X = very high acceptable (Merlin
et al).

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Hannu Krosing
On Fri, 2008-09-12 at 09:45 +0100, Simon Riggs wrote:
 On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
  Gregory Stark wrote:
   b) vacuum on the server which cleans up a tuple the slave has in scope has to
  block WAL replay on the slave (which I suppose defeats the purpose of having
  a live standby for users concerned more with fail-over latency).
  
  One problem with this, BTW, is that if there's a continuous stream of 
  medium-length transaction in the slave, each new snapshot taken will 
  prevent progress in the WAL replay, so the WAL replay will advance in 
  baby steps, and can fall behind indefinitely. As soon as there's a 
  moment that there's no active snapshot, it can catch up, but if the 
  slave is seriously busy, that might never happen.
 
 It should be possible to do mixed mode.
 
 Stall WAL apply for up to X seconds, then cancel queries. Some people
 may want X=0 or low, others might find X = very high acceptable (Merlin
 et al).

Or an even milder version (see the sketch after this list):

* Stall WAL apply for up to X seconds, 
* then stall new queries, let old ones run to completion (with optional
fallback to canceling after Y sec), 
* apply WAL. 
* Repeat.
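
A rough C sketch of that cycle, with every helper assumed rather than
real (X and Y as above):

    #include <stdbool.h>

    /* All helpers assumed; names are illustrative only. */
    extern void sleep_seconds(int secs);
    extern void block_new_queries(void);
    extern bool old_queries_done_within(int secs);
    extern void cancel_old_queries(void);
    extern void apply_pending_wal(void);
    extern void unblock_new_queries(void);

    static void
    mixed_mode_cycle(int x_secs, int y_secs)
    {
        for (;;)
        {
            sleep_seconds(x_secs);              /* stall WAL apply up to X */
            block_new_queries();                /* no new snapshots taken */
            if (!old_queries_done_within(y_secs))
                cancel_old_queries();           /* optional fallback after Y */
            apply_pending_wal();
            unblock_new_queries();              /* and repeat */
        }
    }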

-
Hannu



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Thu, 2008-09-11 at 17:04 +0300, Heikki Linnakangas wrote:
 Csaba Nagy wrote:
  and that means in fact that if you have
  continuously overlapping small transactions, the blocking horizon
  could be even blocked forever, as there'll always be a query running,
  and the new queries will always have the snapshot of the currently
  running ones because WAL recovery is stalled... 
 
 Hmm, no I don't think the WAL recovery can become completely stalled. To 
 completely stop progressing, we'd need to take a new snapshot that 
 includes transaction X, and at the same time be blocked on a vacuum 
 record that vacuums a tuple that's visible to transaction X. I don't 
 think that can happen, because for such a scenario to arise, in the 
 corresponding point in time in the master, there would've been a 
 scenario where the vacuum would've removed a tuple that would have been 
   visible to a newly starting transaction. Which can't happen. I think..

ISTM Csaba is correct. If WAL apply blocks, the xids don't change, so the
snapshots never change, and WAL stays blocked. 

The only way out of that is to store up removals for particular blocks,
but that's complicated.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Thu, 2008-09-11 at 01:07 +0100, Simon Riggs wrote:
 Transaction snapshots is probably the most difficult problem for Hot
 Standby to resolve. 

In summary of thread so far:

When queries on standby run for significantly longer than longest
queries on primary, some problems can occur. Various people have argued
for these responses to the problems:

1. Master uses Standby's OldestXmin
Effects: 
* Long running queries on standby...
   Can delay row removal on primary
   Do not delay apply of WAL records on standby
* Queries on standby give consistent answers in all cases.

2. Master ignores Standby's OldestXmin
Effects:
* Long running queries on standby...
   Have no effect on primary
   Can delay apply of WAL records on standby
* Queries on standby give consistent answers in all cases.

3. Ignore problem
Effects:
* Long running queries on standby...
   Have no effect on primary
   Do not delay apply of WAL records on standby
* Queries on standby give inconsistent answers in some cases, though it
doesn't generate any messages to show that inconsistency occurred.
Acceptable for read-only and insert-only tables only.

Hot Standby should provide all 3 responses as options.

(1) would be implemented by sending the Standby's OldestXmin to the
primary. Snapshots would not be sent from the primary; they will be
derived locally from transactions currently being applied.

(2) would be implemented by setting a timer. When the Startup process has
waited for more than redo_max_delay/max_lag_delay (SIGHUP) we cancel
queries. If the timeout is 0 we aggressively cancel queries without
waiting.

(3) would be implemented using read_consistency = on (default) | off, a
USERSET parameter. When read_consistency = off we ignore the backend's
xmin when deciding whether to wait before applying WAL or not.
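
As a sketch of how the three options might converge on one decision
point before a cleanup record is replayed (enum and helpers invented for
illustration):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef enum
    {
        USE_STANDBY_XMIN,   /* option 1: primary honours standby's xmin */
        TIMED_CANCEL,       /* option 2: wait, then cancel queries      */
        IGNORE_PROBLEM      /* option 3: read_consistency = off         */
    } StandbyConflictMode;

    /* Assumed helpers; names invented for illustration. */
    extern bool wait_for_snapshots(TransactionId latestxid, int max_delay_ms);
    extern void cancel_queries_seeing(TransactionId latestxid);

    /* Under option 1 conflicting removals are prevented on the primary,
     * so there is nothing to do here either. */
    static void
    maybe_wait_or_cancel(StandbyConflictMode mode, TransactionId latestxid,
                         int redo_max_delay_ms)
    {
        switch (mode)
        {
            case USE_STANDBY_XMIN:
            case IGNORE_PROBLEM:
                return;                     /* just apply the record */
            case TIMED_CANCEL:
                if (!wait_for_snapshots(latestxid, redo_max_delay_ms))
                    cancel_queries_seeing(latestxid);
                return;
        }
    }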

Summary OK for everyone?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Thu, 2008-09-11 at 15:17 +0200, Jochem van Dieten wrote:
 On Thu, Sep 11, 2008 at 2:07 AM, Simon Riggs wrote:
  Transaction snapshots is probably the most difficult problem for Hot
  Standby to resolve.
 
   * remotely from primary node
   * locally on the standby node
 
  If we derive a snapshot locally, then we will end up with a situation
  where the xmin of the local snapshot precedes the xmin of the primary
  node. When this occurs it will then be possible for WAL records to
  arrive on the standby that request removal of rows that a transaction
  might wish to see. Preventing that situation can be done by either
  deferring WAL apply or by cancelling queries.
 
 Which operations can request row removal? Isn't that just specific
 operations that have their own 'is this safe to remove' calculations
 anyway (i.e. vacuum and HOT prune)?
 
 What I am thinking about is a design where the primary node were to
 regularly push an OldestXMin into the WAL, the WAL apply process on
 the standby nodes pushes it into shared memory and the backends keep
 an OldestMasterXMin in shared memory. The standby nodes then regularly
 pushes back the oldest OldestMasterXMin from all backends to the
 master. Vacuum and HOT prune could then base their calculations on an
 OldestXMin that is not the OldestXMin of the master itself, but of the
 master and the standby nodes. That way removal of records that are
 still visible on one of the standby nodes is prevented on the master
 instead of worked around on the standby nodes.
 
 The obvious downside would be bloat on the master (which could get out
 of hand if a slave is a few days behind due to a long report), but I
 think in terms of visibility and consistency this would work. Or am I
 completely misunderstanding the problem?

Yes, just sending the xmin from standby to primary is the best way of
having primary and standby work together. I will include this, thanks.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Hannu Krosing
On Fri, 2008-09-12 at 11:21 +0200, Csaba Nagy wrote:
 On Fri, 2008-09-12 at 09:38 +0100, Simon Riggs wrote:
  If you request a block, we check to see whether there is a lookaside
  copy of it prior to the tuple removals. We then redirect the block
  request to a viewpoint relation's block. Each viewpoint gets a separate
  relfilenode. We do the switcheroo while holding cleanup lock on block.
 
 Wouldn't it make sense to also have a hint bit on the pages which are
 copied away ? 

There is no need for a hint bit (and one bit would not be enough anyway,
as we may need to keep multiple versions).

The LSN of the last row-removal (VACUUM or HOT-pruning) would serve as a
hint to start digging around in the hash tables.

It seems, though, that you may have to look into several disk snapshots
to find the page you need.

 Then instead of looking up a hash table, you first would
 look up that bit, and if not set you won't look up the hash table at
 all.  Then when you clean up the lookaside copies you clear those bits
 too...
 
 That would probably perform somewhat better for reading than always
 looking up a potentially big hash table, and the cost of setting the
 hint is probably a lot less than copying away the page in the first
 place. Resetting the hint bit might be a bit more expensive.
 
 Cheers,
 Csaba.
 
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Csaba Nagy
On Fri, 2008-09-12 at 09:38 +0100, Simon Riggs wrote:
 If you request a block, we check to see whether there is a lookaside
 copy of it prior to the tuple removals. We then redirect the block
 request to a viewpoint relation's block. Each viewpoint gets a separate
 relfilenode. We do the switcheroo while holding cleanup lock on block.

Wouldn't it make sense to also have a hint bit on the pages which are
copied away ? Then instead of looking up a hash table, you first would
look up that bit, and if not set you won't look up the hash table at
all.  Then when you clean up the lookaside copies you clear those bits
too...

That would probably perform somewhat better for reading than always
looking up a potentially big hash table, and the cost of setting the
hint is probably a lot less than copying away the page in the first
place. Resetting the hint bit might be a bit more expensive.

Cheers,
Csaba.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Thu, 2008-09-11 at 12:18 +0100, Gregory Stark wrote:

 Is there any middle ground or brilliant ways to get the best of both worlds?

Possibly. Nobody has commented yet on the other ideas on the post
itself.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Hannu Krosing
On Fri, 2008-09-12 at 11:19 +0100, Simon Riggs wrote:
 On Thu, 2008-09-11 at 01:07 +0100, Simon Riggs wrote:
  Transaction snapshots is probably the most difficult problem for Hot
  Standby to resolve. 
 
 In summary of thread so far:
 
 When queries on standby run for significantly longer than longest
 queries on primary, some problems can occur. Various people have argued
 for these responses to the problems:
 
 1. Master uses Standby's OldestXmin
...
 2. Master ignores Standby's OldestXmin
...
 3. Ignore problem
...

 Summary OK for everyone?

Maybe we should at least mention option 4.

4. Slave keeps copies of removed pages or rows when WAL apply removes
old versions.

Possible ways to do this

 * inside Slave - have some backup store tied to OldestXmin intervals

  * variant 1 - have one global store, accessed through shared mem
  * variant 2 - present removed pages to interested backends and 
let them (decide to) keep them 

 * outside Slave - having file system keep old snapshots as long as
needed, still must tie to OldestXmin intervals, but most of work done by
storage layer (SAN or overlay file system).

-
Hannu



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Hannu Krosing
On Fri, 2008-09-12 at 13:53 +0300, Hannu Krosing wrote:

 4. Slave keeps copies of removed pages or rows when WAL apply removes
 old versions .
 
 Possible ways to do this
 
  * inside Slave - have some backup store tied to OldestXmin intervals
 
   * variant 1 - have one global store, accessed through shared mem
   * variant 2 - present removed pages to interested backends and 
 let them (decide to) keep them 
 
  * outside Slave - having file system keep old snapshots as long as
 needed, still must tie to OldestXmin intervals, but most of work done by
 storage layer (SAN or overlay file system).

Possible options for outside-Slave filesystem snapshotting -

ZFS ( http://en.wikipedia.org/wiki/ZFS ) is very likely usable 

Linux LVM + XFS may be usable -
http://arstechnica.com/articles/columns/linux/linux-20041013.ars

Possibly also http://en.wikipedia.org/wiki/Btrfs .


Hannu





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Richard Huxton
Gregory Stark wrote:
 In that case the problem is dealing with different usage patterns on different
 tables. There might be a way to solve just that use case such as deferring WAL
 records for those tables. That doesn't guarantee inter-table data consistency
 if there were other queries which read from those tables and updated other
 tables based on that data though. Perhaps there's a solution for that too
 though.

There was a suggestion (Simon - from you?) of a transaction voluntarily
restricting itself to a set of tables. That would obviously reduce the
impact of all the options where the accessed tables weren't being
updated (where update = vacuum + HOT if I've got this straight).

-- 
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 3. Ignore problem
 Effects:
 * Long running queries on standby...
Have no effect on primary
Do not delay apply of WAL records on standby
 * Queries on standby give inconsistent answers in some cases, though
 doesn't generate any messages to show inconsistency occurred. Acceptable
 for read-only and insert only tables only.

This seems like a non-starter. 

Your comment about read-only and insert-only tuples only seems to make sense
if you assume there are other tables being updated simultaneously. Otherwise
of course there would be no WAL records for tuple removals.

In that case the problem is dealing with different usage patterns on different
tables. There might be a way to solve just that use case such as deferring WAL
records for those tables. That doesn't guarantee inter-table data consistency
if there were other queries which read from those tables and updated other
tables based on that data though. Perhaps there's a solution for that too
though.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Hannu Krosing
On Fri, 2008-09-12 at 12:31 +0300, Hannu Krosing wrote:
 On Fri, 2008-09-12 at 09:45 +0100, Simon Riggs wrote:
  On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
   Gregory Stark wrote:
b) vacuum on the server which cleans up a tuple the slave has in scope has to
   block WAL replay on the slave (which I suppose defeats the purpose of having
   a live standby for users concerned more with fail-over latency).
   
   One problem with this, BTW, is that if there's a continuous stream of 
   medium-length transactions in the slave, each new snapshot taken will 
   prevent progress in the WAL replay, so the WAL replay will advance in 
   baby steps, and can fall behind indefinitely. As soon as there's a 
   moment that there's no active snapshot, it can catch up, but if the 
   slave is seriously busy, that might never happen.
  
  It should be possible to do mixed mode.
  
  Stall WAL apply for up to X seconds, then cancel queries. Some people
  may want X=0 or low, others might find X = very high acceptable (Merlin
  et al).
 
 Or even milder version.
 
 * Stall WAL apply for up to X seconds, 
 * then stall new queries, let old ones run to completion (with optional
 fallback to canceling after Y sec), 
 * apply WAL. 
 * Repeat.

Now that I have thought a little more about delegating keeping old
versions to filesystem level (ZFS , XFS+LVM) snapshots I'd like to
propose the following:

 0. run queries and apply WAL freely until WAL application would 
remove old rows.

 1. stall applying WAL for up to N seconds

 2. stall starting new queries for up to M seconds

 3.  if some backends are still running long queries, then 

   3.1. make filesystem level snapshot (FS snapshot), 
   3.2. mount the FS snapshot somewhere (maybe as data.at.OldestXmin 
in parallel to $PGDATA) and 
   3.3 hand this mounted FS snapshot over to those backends

 4. apply WAL

 5. GoTo 0.

Of course we need to do the filesystem level snapshots in 3. only if the
long-running queries don't already have one given to them. Queries
running in READ COMMITTED mode that have acquired a new PG snapshot
since they got their FS snapshot may also need a new one.

Also, snapshots need to be reference counted, so we can unmount and
destroy them once all their users have finished.
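
In rough C, the loop that steps 0.-5. above describe might look like this; every helper name here is illustrative only, with the timeouts standing in for N and M:

    #include <stdbool.h>

    /* All helpers are illustrative placeholders. */
    static bool NextRecordRemovesOldRows(void) { return false; }
    static void ApplyNextWalRecord(void) {}
    static bool WaitForOldSnapshotsGone(int max_secs) { return true; }
    static void BlockNewQueries(void) {}
    static void UnblockNewQueries(void) {}
    static void GiveFsSnapshotToLongRunners(void) {}   /* steps 3.1-3.3 */

    static void
    RecoveryLoop(int stall_apply_secs /* N */, int stall_queries_secs /* M */)
    {
        for (;;)
        {
            /* 0. apply freely until a record would remove old rows */
            while (!NextRecordRemovesOldRows())
                ApplyNextWalRecord();

            /* 1. stall WAL apply, hoping old snapshots drain away */
            if (!WaitForOldSnapshotsGone(stall_apply_secs))
            {
                /* 2. stall new queries too, for a bounded time */
                BlockNewQueries();
                if (!WaitForOldSnapshotsGone(stall_queries_secs))
                    GiveFsSnapshotToLongRunners();  /* 3. FS snapshot */
                UnblockNewQueries();
            }

            /* 4. apply the row-removing record, then 5. repeat */
            ApplyNextWalRecord();
        }
    }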

I think that enabling long-running queries this way is both low-hanging
fruit (or at least medium-height-hanging ;) ) and also consistent with
the PostgreSQL philosophy of not duplicating effort. As an example, we
trust the OS's file system cache and don't try to write our own.


Hannu












Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Csaba Nagy
On Fri, 2008-09-12 at 12:31 +0100, Richard Huxton wrote:
 There was a suggestion (Simon - from you?) of a transaction voluntarily
 restricting itself to a set of tables.

While thinking about how easy it would be for the DBA to specify the set
of tables a single query is accessing, first I thought that it should be
straightforward enough to look at the query itself for that. Then I thought
what about views, rules, triggers, user functions etc. ? All those have
the potential to access more than you see in the query itself. And then
the actually interesting question: what will the slave do with views,
rules, triggers ? I guess triggers are out of the question to be
executed, what about rules ? Probably must be also ignored... user
functions will probably get errors if they try to update something...
Views should probably function correctly.

So in any case the functionality available for querying slaves would be
less than for the primary. This is probably good enough for most
purposes...

Cheers,
Csaba.





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Csaba Nagy
 I think that enabling long-running queries this way is both
 low-hanging
 fruit (or at least medium-height-hanging ;) ) and also consistent with
 the PostgreSQL philosophy of not duplicating effort. As an example we trust
 OS's file system cache and don't try to write our own.

I have again questions (unfortunately I only have questions usually):

* how will the buffers keep 2 different versions of the same page ?
* how will you handle the creation of snapshots ? I guess there's no portable 
and universal API for that (just guessing), or there is some POSIX thing which 
is supported or not by the specific FS ? So if the FS is not supporting it, you 
skip the snapshot step ? And if there's no universal API, will it be handled by 
plugins providing a specified API for snapshotting the FS ?

I hope my continuous questioning is not too annoying...

Cheers,
Csaba.





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Hannu Krosing
On Fri, 2008-09-12 at 13:54 +0200, Csaba Nagy wrote:
  I think that enabling long-running queries this way is both
  low-hanging
  fruit (or at least medium-height-hanging ;) ) and also consistent with
  the PostgreSQL philosophy of not duplicating effort. As an example we trust
  OS's file system cache and don't try to write our own.
 
 I have again questions (unfortunately I only have questions usually):
 
 * how will the buffers keep 2 different versions of the same page ?

As the FS snapshot is mounted as a different directory, it will have
its own buffer pages.

To conserve RAM, one could go to the FS snapshot files only in case the
main pages have an LSN too big to be trusted.

 * how will you handle the creation of snapshots ? 

probably an external command, possibly shell script. 
similar to current archive_command for wal copying

maybe 'create_fs_snapshot_command' and 'destroy_fs_snapshot_command'

 I guess there's no portable and universal API for that (just guessing), 
 or there is some POSIX thing which is supported or not by the specific FS ? 
 So if the FS is not supporting it, you skip the snapshot step ? 

Yes, if FS snapshots are not supported, we fall back to either
inconsistent reads or killing long-running queries.

 And if there's no universal API, will it be handled by plugins providing 
 a specified API for snapshotting the FS ?

Yes, the simplest one being external command. As FS snapshots are
supposed to happen not-too often, at least not every second, just having
external commands may be enough.
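
As a sketch, the invocation could look much like the way archive_command is shelled out today; the GUC value and the %s substitution below are purely illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical GUC string, e.g. set in postgresql.conf as
         create_fs_snapshot_command = 'zfs snapshot pool/pgdata@%s'   */
    static const char *create_fs_snapshot_command =
        "zfs snapshot pool/pgdata@%s";

    /* Substitute the snapshot name and shell out, much as
       archive_command is executed; returns nonzero on success. */
    static int
    CreateFsSnapshot(const char *snap_name)
    {
        char cmd[1024];

        snprintf(cmd, sizeof(cmd), create_fs_snapshot_command, snap_name);
        return system(cmd) == 0;
    }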

 I hope my continuous questioning is not too annoying...

On the contrary, much appreciated. :)


Hannu





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Csaba Nagy
On Fri, 2008-09-12 at 15:08 +0300, Hannu Krosing wrote:
  * how will the buffers keep 2 different versions of the same page ?
 
 As the FS snapshot is mounted as a different directory, it will have
 its own buffer pages.

Lack of knowledge about this shows my ignorance about the implementation
of the page buffers...

  * how will you handle the creation of snapshots ? 
 
 probably an external command, possibly shell script. 
 similar to current archive_command for wal copying
 
 maybe 'create_fs_snapshot_command' and 'destroy_fs_snapshot_command'
[snip]
 Yes, the simplest one being external command. As FS snapshots are
 supposed to happen not-too often, at least not every second, just having
 external commands may be enough.

You could restrict the creation of snapshots to some minimum amount of
time between them, and maybe also restrict the maximum number of
concurrent snapshots possible. Then if the time limit (as calculated
from the last open snapshot) is currently not met, any new query could
reuse that last snapshot. The time intervals do not need to be evenly
distributed, BTW; the minimum interval could be a function of the number
of already opened snapshots, for example increasing exponentially with
it. That would help tie more long-running queries to just a few
snapshots.
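
For instance (a sketch; all numbers illustrative):

    /* Minimum seconds to wait before taking another FS snapshot,
       doubling with each snapshot already open: 10, 20, 40, ...
       capped at an hour. */
    static int
    MinSnapshotIntervalSecs(int snapshots_open)
    {
        int interval = 10;

        while (snapshots_open-- > 0 && interval < 3600)
            interval *= 2;
        return interval < 3600 ? interval : 3600;
    }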

  I hope my continuous questioning is not too annoying...
 
 On the contrary, much appreciated. :)

Ok, then I'll continue :-) I would like to see this feature succeed, but
there's slim chance I'll ever code well in C...

Cheers,
Csaba.





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Fri, 2008-09-12 at 12:25 +0100, Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  3. Ignore problem
  Effects:
  * Long running queries on standby...
 Have no effect on primary
 Do not delay apply of WAL records on standby
  * Queries on standby give inconsistent answers in some cases, though
  doesn't generate any messages to show inconsistency occurred. Acceptable
  for read-only and insert only tables only.
 
 This seems like a non-starter. 

It works, and is proposed as a non-default option since a number of
people have independently said to me that this would be
acceptable/preferred.

 Your comment about read-only and insert-only tuples only seems to make sense
 if you assume there are other tables being updated simultaneously. Otherwise
 of course there would be no WAL records for tuple removals.

Yeh, you got it.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-12 Thread Simon Riggs

On Fri, 2008-09-12 at 15:08 +0300, Hannu Krosing wrote:
 On Fri, 2008-09-12 at 13:54 +0200, Csaba Nagy wrote:
   I think that enabling long-running queries this way is both
   low-hanging
   fruit (or at least medium-height-hanging ;) ) and also consistent with
   the PostgreSQL philosophy of not duplicating effort. As an example we trust
   OS's file system cache and don't try to write our own.
  
  I have again questions (unfortunately I only have questions usually):
  
  * how will the buffers keep 2 different versions of the same page ?
 
 As the FS snapshot is mounted as a different directory, it will have
 its own buffer pages.

RelFileNode has a spcNode which can be redirected to a temporary
filesystem snapshot. So it's relatively easy to imagine redirecting
access to a table from its normal tablespace to the snapshot one.

 To conserve RAM, one could go to FS snapshot files only in case main
 pages have LSN too big to be trusted.

That would mean you'd need to do two I/Os: one to read the newly changed
page to get its LSN, and another to get the old COW copy. We might waste
buffer space with that technique also. Since we'd be trying to avoid
caching bigger tables anyway (since 8.3), it seems easier to just go
straight to the COW copy.

So I think it's fairly straightforward to support temporary snapshots in
Postgres, with creation/destruction handled in the way you say.
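
As a sketch of that redirection (RelFileNode's shape is real; the function and parameter are illustrative):

    #include <stdint.h>

    typedef uint32_t Oid;

    /* Same shape as PostgreSQL's RelFileNode. */
    typedef struct RelFileNode
    {
        Oid spcNode;   /* tablespace */
        Oid dbNode;    /* database */
        Oid relNode;   /* relation */
    } RelFileNode;

    /* Illustrative: point a block request at the tablespace under which
       the FS snapshot is mounted, leaving database and relation alone. */
    static RelFileNode
    RedirectToSnapshotTablespace(RelFileNode rnode, Oid snapshotSpcNode)
    {
        rnode.spcNode = snapshotSpcNode;
        return rnode;
    }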

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Simon Riggs wrote:

Taking snapshots from primary has a few disadvantages

 ...
  * snapshots on primary prevent row removal (but this was also an
advantage of this technique!)


That makes it an awful solution for high availability. A backend hung in 
transaction-in-progress state in the slave will prevent row removal on 
the master. Isolating the master from queries performed in the 
slave is exactly the reason why people use hot standby. And running long 
reporting queries in the standby is again a very typical use case.


And still we can't escape the scenario that the slave receives a WAL 
record that vacuums away a tuple that's still visible according to a 
snapshot used in the slave. Even with the proposed scheme, this can happen:


1. Slave receives a snapshot from master
2. A long-running transaction begins on the slave, using that snapshot
3. Network connection is lost
4. Master hits a timeout, and decides to discard the snapshot it sent to 
the slave
5. A tuple visible to the snapshot is vacuumed
6. Network connection is re-established
7. Slave receives the vacuum WAL record, even though the long-running 
transaction still needs the tuple.


I like the idea of acquiring snapshots locally in the slave much more. 
As you mentioned, the options there are to defer applying WAL, or cancel 
queries. I think both options need the same ability to detect when 
you're about to remove a tuple that's still visible to some snapshot, 
just the action is different. We should probably provide a GUC to 
control which you want.


However, if we still want to provide the behavior that as long as the 
network connection works, the master will not remove tuples still needed 
in the slave as an option, a lot simpler implementation is to 
periodically send the slave's oldest xmin to master. Master can take 
that into account when calculating its own oldest xmin. That requires a 
lot less communication than the proposed scheme to send snapshots back 
and forth. A softer version of that is also possible, where the master 
obeys the slave's oldest xmin, but only up to a point.
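
That softer version might amount to no more than this on the master; apart from the TransactionIdPrecedes idea (simplified here, without the special-xid handling), all the names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;

    /* Simplified wraparound-aware "xid1 < xid2". */
    static bool
    TransactionIdPrecedes(TransactionId id1, TransactionId id2)
    {
        return (int32_t) (id1 - id2) < 0;
    }

    /* Honour the slave's xmin, but never hold back row removal by more
       than xmin_limit transactions -- the "up to a point". */
    static TransactionId
    EffectiveOldestXmin(TransactionId local_xmin,
                        TransactionId slave_xmin,
                        uint32_t xmin_limit)
    {
        TransactionId floor_xid = local_xmin - xmin_limit;

        if (TransactionIdPrecedes(slave_xmin, floor_xid))
            return floor_xid;    /* slave too far behind: cap the damage */
        if (TransactionIdPrecedes(slave_xmin, local_xmin))
            return slave_xmin;   /* obey the slave */
        return local_xmin;
    }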


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Hannu Krosing
On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:

 I like the idea of acquiring snapshots locally in the slave much more. 
 As you mentioned, the options there are to defer applying WAL, or cancel 
 queries. 

More exotic ways to defer applying WAL include using some smart
filesystems to get per-backend data snapshots, using either
copy-on-write overlay filesystems or filesystem- or disk-level
snapshots.

At least disk-level snapshots exist in SANs, with the aim of easing
backups, though I'm not sure they are effective for the intended hot
standby use.

Using any of those needs detecting and bypassing shared buffers if they
hold too new data pages and reading these pages directly from disk
snapshot.

 I think both options need the same ability to detect when 
 you're about to remove a tuple that's still visible to some snapshot, 
 just the action is different. We should probably provide a GUC to 
 control which you want.

We probably need to have two LSNs per page to make maximal use of our
MVCC in the Hot Standby situation, so we can distinguish an addition to a
page, which implies no data loss, from a row removal, which does.
Currently only vacuum and HOT pruning can cause row removal.
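
A sketch of what the second LSN would buy (pd_prune_lsn is a hypothetical field, and the LSN type is simplified to an integer):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogRecPtrSketch;   /* simplified LSN */

    typedef struct PageHeaderSketch
    {
        XLogRecPtrSketch pd_lsn;        /* last change of any kind (exists) */
        XLogRecPtrSketch pd_prune_lsn;  /* hypothetical: last row removal */
    } PageHeaderSketch;

    /* Additions after the snapshot's LSN are merely invisible to it;
       only removals after that point can break the query's view. */
    static bool
    PageSafeForSnapshot(const PageHeaderSketch *page,
                        XLogRecPtrSketch snap_lsn)
    {
        return page->pd_prune_lsn <= snap_lsn;
    }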

 However, if we still want to provide the behavior that as long as the 
 network connection works, the master will not remove tuples still needed 
 in the slave as an option, a lot simpler implementation is to 
 periodically send the slave's oldest xmin to master. Master can take 
 that into account when calculating its own oldest xmin. That requires a 
 lot less communication than the proposed scheme to send snapshots back 
 and forth. A softer version of that is also possible, where the master 
 obeys the slave's oldest xmin, but only up to a point.

That point could be statement_timeout or (currently missing)
transaction_timeout

Also, decision to advance xmin should probably be sent to slave as well,
even though it is not something that is needed in local WAL logs.

--
Hannu





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Simon Riggs

On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  Taking snapshots from primary has a few disadvantages
  
   ...
* snapshots on primary prevent row removal (but this was also an
  advantage of this technique!)
 
 That makes it an awful solution for high availability. 

Please be careful about making such statements. People might think you
were opposing the whole idea of Hot Standby, rather than giving an
opinion about one suggestion out of many implementation proposals.

Looks like you've got some good additional suggestions later in the
post. I'll reply later to those, so thanks for that.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Richard Huxton
Heikki Linnakangas wrote:
 Simon Riggs wrote:
 Taking snapshots from primary has a few disadvantages

  ...
   * snapshots on primary prevent row removal (but this was also an
 advantage of this technique!)
 
 That makes it an awful solution for high availability. A backend hung in
 transaction-in-progress state in the slave will prevent row removal on
 the master. Isolating the master from queries performed in the
 slave is exactly the reason why people use hot standby. And running long
 reporting queries in the standby is again a very typical use case.

I have to say I agree with Heikki here. Blocking the master based on
what the slave is doing seems to make hot standby less useful than warm.

 I like the idea of acquiring snapshots locally in the slave much more.

It's the option that I can see people (well, me) understanding the
easiest. All the others sound like ways to get things wrong. As for
inconsistent query-results - that way madness lies. How on earth will
anyone be able to diagnose or report bugs when they occur?

 As you mentioned, the options there are to defer applying WAL, or cancel
 queries. I think both options need the same ability to detect when
 you're about to remove a tuple that's still visible to some snapshot,
 just the action is different. We should probably provide a GUC to
 control which you want.

I think there's only one value here: hot standby wal delay time before
cancelling query. Might be a shorter name.

-- 
  Richard Huxton
  Archonet Ltd



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Simon Riggs

On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:

 I have to say I agree with Heikki here. Blocking the master based on
 what the slave is doing seems to make hot standby less useful than warm.

I agree also, that's why I flagged it up for discussion.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Gregory Stark

Simon Riggs [EMAIL PROTECTED] writes:

 On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:

 I have to say I agree with Heikki here. Blocking the master based on
 what the slave is doing seems to make hot standby less useful than warm.

 I agree also, that's why I flagged it up for discussion.

So as far as I can see there are basically two option here. Either

a) transactions with live snapshots on the slave prevent the master from being
   able to vacuum tuples (which defeats the purpose of having a live standby
   server for some users).

or

b) vacuum on the server which cleans up a tuple the slave has in scope has to
   block WAL replay on the slave (which I suppose defeats the purpose of having
   a live standby for users concerned more with fail-over latency).

Is there any middle ground or brilliant ways to get the best of both worlds?

If not, it seems to me the latter is preferable, since at least the consequence
of having a long-running query on the slave occurs on the same machine running
the query. And the work-around -- killing the long-running query -- requires
taking action on the same machine as the consequences. Also, when you take
action it fixes the problem immediately, as WAL replay can recommence, which
seems like a better deal than a bloated database.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Merlin Moncure
On Thu, Sep 11, 2008 at 7:18 AM, Gregory Stark [EMAIL PROTECTED] wrote:
 a) transactions with live snapshots on the slave prevent the master from being
   able to vacuum tuples (which defeats the purpose of having a live standby
   server for some users).

 or

 b) vacuum on the server which cleans up a tuple the slave has in scope has to
   block WAL replay on the slave (which I suppose defeats the purpose of having
   a live standby for users concerned more with fail-over latency).

There is nothing stopping you from setting up two (or more) slave
servers, with one designated as failover that doesn't serve queries,
right?  Option b seems pretty reasonable to me, although I'd prefer to
block WAL replay vs cancelling queries... although it might be nice to
manually be able to force WAL replay 'with query cancel' via a
checkpoint.

merlin



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Merlin Moncure wrote:

There is nothing stopping you from setting up two (or more) slave
servers, with one designated as failover that doesn't serve queries,
right?


I'd imagine that even if applying the WAL on the slave is blocked, it's 
still streamed from the master to the slave, and in case of failover the 
slave will fast-forward before starting up as the new master. Of course, 
if it has fallen 3 days behind because of a giant reporting query, it 
can take a while to replay all the WAL that has accumulated.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Csaba Nagy
On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
 I'd imagine that even if applying the WAL on the slave is blocked, it's 
 still streamed from the master to the slave, and in case of failover the 
 slave will fast-forward before starting up as the new master.

Which begs the question: what happens with a query which is running on
the slave in the moment when the slave switches from recovery mode and
starts up ? Should the running queries be canceled if they are blocking
applying of WAL, to allow start-up, or let them finish ?

Cheers,
Csaba.





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Gregory Stark wrote:

b) vacuum on the server which cleans up a tuple the slave has in scope has to
   block WAL replay on the slave (which I suppose defeats the purpose of having
   a live standby for users concerned more with fail-over latency).


One problem with this, BTW, is that if there's a continuous stream of 
medium-length transactions in the slave, each new snapshot taken will 
prevent progress in the WAL replay, so the WAL replay will advance in 
baby steps, and can fall behind indefinitely. As soon as there's a 
moment that there's no active snapshot, it can catch up, but if the 
slave is seriously busy, that might never happen.


Nevertheless, I think it's a much nicer approach.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Csaba Nagy wrote:

On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
I'd imagine that even if applying the WAL on the slave is blocked, it's 
still streamed from the master to the slave, and in case of failover the 
slave will fast-forward before starting up as the new master.


Which begs the question: what happens with a query which is running on
the slave in the moment when the slave switches from recovery mode and
starts up ? Should the running queries be canceled if they are blocking
applying of WAL, to allow start-up, or let them finish ?


Depends on application, I'd say. I guess we'll need both, like the smart 
and fast shutdown modes.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Csaba Nagy
On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
 One problem with this, BTW, is that if there's a continuous stream of 
 medium-length transactions in the slave, each new snapshot taken will 
 prevent progress in the WAL replay, so the WAL replay will advance in 
 baby steps, and can fall behind indefinitely.

Why would it fall behind indefinitely? It should only fall behind to
the blocking horizon, which should be the start of the longest
currently running transaction... which should be continually advancing
and not too far in the past if there are only medium-length transactions
involved. Isn't normal WAL recovery also doing baby-steps, one WAL
record at a time? ;-)

Cheers,
Csaba.





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Jochem van Dieten
On Thu, Sep 11, 2008 at 2:07 AM, Simon Riggs wrote:
 Transaction snapshots is probably the most difficult problem for Hot
 Standby to resolve.

  * remotely from primary node
  * locally on the standby node

 If we derive a snapshot locally, then we will end up with a situation
 where the xmin of the local snapshot precedes the xmin of the primary
 node. When this occurs it will then be possible for WAL records to
 arrive on the standby that request removal of rows that a transaction
 might wish to see. Preventing that situation can be done by either
 deferring WAL apply or by cancelling queries.

Which operations can request row removal? Isn't that just specific
operations that have their own 'is this safe to remove' calculations
anyway (i.e. vacuum and HOT prune)?

What I am thinking about is a design where the primary node would
regularly push an OldestXMin into the WAL, the WAL apply process on
the standby nodes pushes it into shared memory and the backends keep
an OldestMasterXMin in shared memory. The standby nodes then regularly
pushes back the oldest OldestMasterXMin from all backends to the
master. Vacuum and HOT prune could then base their calculations on an
OldestXMin that is not the OldestXMin of the master itself, but of the
master and the standby nodes. That way removal of records that are
still visible on one of the standby nodes is prevented on the master
instead of worked around on the standby nodes.
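
A sketch of the standby-side bookkeeping this implies; all names and the fixed-size array are illustrative:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    #define MAX_BACKENDS 100

    /* Shared memory, sketch: the WAL apply process stores each
       OldestXMin it sees in the WAL; every backend records the value it
       is currently relying on. */
    typedef struct StandbyXminShmem
    {
        TransactionId latest_master_xmin;           /* from WAL */
        TransactionId backend_xmin[MAX_BACKENDS];   /* 0 = unused slot */
    } StandbyXminShmem;

    /* The value periodically reported back to the master: the oldest
       xmin any backend here still depends on. */
    static TransactionId
    OldestXminToReport(const StandbyXminShmem *shm)
    {
        TransactionId oldest = shm->latest_master_xmin;

        for (int i = 0; i < MAX_BACKENDS; i++)
            if (shm->backend_xmin[i] != 0 &&
                shm->backend_xmin[i] < oldest)   /* wraparound ignored */
                oldest = shm->backend_xmin[i];
        return oldest;
    }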

The obvious downside would be bloat on the master (which could get out
of hand if a slave is a few days behind due to a long report), but I
think in terms of visibility and consistency this would work. Or am I
completely misunderstanding the problem?

Jochem



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Csaba Nagy wrote:

On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
One problem with this, BTW, is that if there's a continuous stream of 
medium-length transactions in the slave, each new snapshot taken will 
prevent progress in the WAL replay, so the WAL replay will advance in 
baby steps, and can fall behind indefinitely.


Why would it fall behind indefinitely ? It only should fall behind to
the blocking horizon, which should be the start of the longest
currently running transaction... which should be continually advancing
and not too far in the past if there are only medium length transactions
involved.


Well, yes, but you can fall behind indefinitely that way. Imagine that 
each transaction on the slave lasts, say 10 minutes, with a new 
transaction starting every 5 minutes. On the master, there's a table 
that's being vacuumed (or HOT-updated) frequently, say after each 
transaction for simplicity. What can happen is that every transaction 
that finishes on the slave will only let the WAL replay advance by one 
XID before blocking on the snapshot of the next slave transaction. The 
WAL replay will advance at a rate of 0.2 TPM, while the master is 
generating 1.0 TPM.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Robert Haas
 I'd imagine that even if applying the WAL on the slave is blocked, it's
 still streamed from the master to the slave, and in case of failover the
 slave will fast-forward before starting up as the new master. Of course, if
 it has fallen 3 days behind because of a giant reporting query, it can take
 a while to replay all the WAL that has accumulated.

Yes, and meanwhile, any other queries that are running on that box are
seeing three-day old data as well.

In an ideal world, it would be nice if the slave could keep tuples
around even after they are dead and vacuumed on the master.  Pushing
Xmin from the slave to the master creates the possibility of bloating
the master due to activity on the slave, which is not going to improve
reliability.  OTOH, not pushing Xmin leads to several pathological
query behaviors on the slave: cancelled queries, inconsistent reads,
and falling way behind on WAL application.  Either way, it seems to me
a massive and uncomfortable violation of the POLA.

If it were possible for tuples that had been vacuumed on the master to
stick around on the slave for as long as the slave still needed them,
then you'd have the best of both worlds, but I have a feeling
someone's going to say that that's just about impossible to implement.
 Against that, all I can say is that neither of the behaviors
described thus far sounds very appealing as a feature, though I'm
certain there are some people who, with sufficient jiggering, could
make effective use of them.

...Robert



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Csaba Nagy
On Thu, 2008-09-11 at 16:19 +0300, Heikki Linnakangas wrote:
 Well, yes, but you can fall behind indefinitely that way. Imagine that 
 each transaction on the slave lasts, say 10 minutes, with a new 
 transaction starting every 5 minutes. On the master, there's a table 
 that's being vacuumed (or HOT-updated) frequently, say after each 
 transaction for simplicity. What can happen is that every transaction 
 that finishes on the slave will only let the WAL replay advance by one 
 XID before blocking on the snapshot of the next slave transaction. The 
 WAL replay will advance at a rate of 0.2 TPM, while the master is 
 generating 1.0 TPM.

Aha, now I see where I was mistaken... I thought in terms of time and
not transaction IDs. So the time distance between the slave transactions
does not matter at all; only the distance in recovered XIDs matters for
the blocking horizon... and if the WAL recovery is blocked, the
blocking horizon is stalled as well, so the next transaction on the
slave will in fact require the same blocking horizon as all currently
running ones. Now I got it... and that means in fact that if you have
continuously overlapping small transactions, the blocking horizon
could even be blocked forever, as there'll always be a query running,
and the new queries will always have the snapshot of the currently
running ones because WAL recovery is stalled... or at least that's what
I understand from the whole thing...

Cheers,
Csaba.





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Dimitri Fontaine
On Thursday, 11 September 2008, Heikki Linnakangas wrote:
 Well, yes, but you can fall behind indefinitely that way. Imagine that
 each transaction on the slave lasts, say 10 minutes, with a new
 transaction starting every 5 minutes. On the master, there's a table
 that's being vacuumed (or HOT-updated) frequently, say after each
 transaction for simplicity. What can happen is that every transaction
 that finishes on the slave will only let the WAL replay advance by one
 XID before blocking on the snapshot of the next slave transaction. The
 WAL replay will advance at a rate of 0.2 TPM, while the master is
 generating 1.0 TPM.

What would forbid the slave from choosing to replay all currently lagging WAL
each time it's given the chance to advance a little?
-- 
dim




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Csaba Nagy
On Thu, 2008-09-11 at 15:33 +0200, Dimitri Fontaine wrote:
 What would forbid the slave to choose to replay all currently lagging WALs 
 each time it's given the choice to advance a little?

Well now that I think I understand what Heikki meant, I also think the
problem is that there's no choice at all to advance, because the new
queries will simply have the same snapshot as currently running ones as
long as WAL replay is blocked... further blocking the WAL replay. When
saying this I suppose that the snapshot is in fact based on the last
recovered XID, and not on any slave-local XID. In that case once WAL
recovery is blocked, the snapshot is stalled too, further blocking WAL
recovery, and so on...

Cheers,
Csaba.





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Dimitri Fontaine
On Thursday, 11 September 2008, Csaba Nagy wrote:
 Well now that I think I understand what Heikki meant, I also think the
 problem is that there's no choice at all to advance, because the new
 queries will simply have the same snapshot as currently running ones as
 long as WAL replay is blocked... further blocking the WAL replay. When
 saying this I suppose that the snapshot is in fact based on the last
 recovered XID, and not on any slave-local XID. In that case once WAL
 recovery is blocked, the snapshot is stalled too, further blocking WAL
 recovery, and so on...

Well, it may be possible to instruct the WAL replay daemon to stop being 
polite sometimes: when a given max_lag_delay is reached, it could take locks 
and as soon as it obtains them, replay all remaining WAL.
The max_lag_delay would be a GUC allowing one to set the threshold between 
continuing the replay and running queries.

There could be some other nice ideas without inventing yet another GUC, but this 
one was easier for me to think about ;)
-- 
dim




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Csaba Nagy wrote:

and that means in fact that if you have
continuously overlapping small transactions, the blocking horizon
could even be blocked forever, as there'll always be a query running,
and the new queries will always have the snapshot of the currently
running ones because WAL recovery is stalled... 


Hmm, no I don't think the WAL recovery can become completely stalled. To 
completely stop progressing, we'd need to take a new snapshot that 
includes transaction X, and at the same time be blocked on a vacuum 
record that vacuums a tuple that's visible to transaction X. I don't 
think that can happen, because for such a scenario to arise, at the 
corresponding point in time on the master, there would've been a 
scenario where the vacuum would've removed a tuple that would have been 
visible to a newly starting transaction. Which can't happen. I think..


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Simon Riggs

Thanks for the detailed thinking. At least one very good new idea here,
some debate on other points.


On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:

 And still we can't escape the scenario that the slave receives a WAL 
 record that vacuums away a tuple that's still visible according to a 
 snapshot used in the slave. Even with the proposed scheme, this can happen:
 
 1. Slave receives a snapshot from master
 2. A long-running transaction begins on the slave, using that snapshot
 3. Network connection is lost
 4. Master hits a timeout, and decides to discard the snapshot it sent to 
 the slave
 5. A tuple visible to the snapshot is vacuumed
 6. Network connection is re-established
 7. Slave receives the vacuum WAL record, even though the long-running 
 transaction still needs the tuple.

Interesting point. (4) is a problem, though not for the reason you
suggest. If we were to stop and start the master, that would be sufficient
to discard the snapshot that the standby is using and so cause problems.
So the standby *must* tell the master the recentxmin it is using, as you
suggest later, so good thinking. So part of the handshake between
primary and standby must be "what is your recentxmin?". The primary will
then use the lower/earliest of the two.

 I like the idea of acquiring snapshots locally in the slave much more.

Me too. We just need to know how, if at all.
 
 As you mentioned, the options there are to defer applying WAL, or cancel 
 queries. I think both options need the same ability to detect when 
 you're about to remove a tuple that's still visible to some snapshot, 
 just the action is different. We should probably provide a GUC to 
 control which you want.

I don't see any practical way of telling whether a tuple removal will
affect a snapshot or not. Each removed row would need to be checked
against each standby snapshot. Even if those were available, it would be
too costly. And even if we can do that, ISTM that neither option is
acceptable: if we cancel queries then touching a frequently updated
table is nearly impossible, or if we delay applying WAL then the standby
could fall behind, impairing its ability for use in HA. (If there was a
way, yes, we should have a parameter for it).

It was also suggested we might take the removed rows and put them in a
side table, but that makes me think of the earlier ideas for HOT and so
I've steered clear of that.

You might detect blocks that have had tuples removed from them *after* a
query started by either
* keeping a hash table of changed blocks - it would be a very big data
structure and hard to keep clean
* adding an additional last cleaned LSN onto every data block
* keeping an extra LSN on the bufhdr for each of the shared_buffers,
plus keeping a hash table of blocks that have been cleaned and then
paged out
Once detected, your only option is to cancel the query.

ISTM if we want to try to avoid making recentxmin same on both primary
and standby then the only viable options are the 3 on the original post.

 However, if we still want to provide the behavior that as long as the 
 network connection works, the master will not remove tuples still needed 
 in the slave as an option, a lot simpler implementation is to 
 periodically send the slave's oldest xmin to master. Master can take 
 that into account when calculating its own oldest xmin. That requires a 
 lot less communication than the proposed scheme to send snapshots back 
 and forth. A softer version of that is also possible, where the master 
 obeys the slave's oldest xmin, but only up to a point.

I like this very much. Much simpler implementation and no need for a
delay in granting snapshots. I'll go for this as the default
implementation. Thanks for the idea.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Simon Riggs wrote:

So part of the handshake between
primary and standby must be "what is your recentxmin?". The primary will
then use the lower/earliest of the two.


Even then, the master might already have vacuumed away tuples that are 
visible to an already running transaction in the slave, before the slave 
connects. Presumably the master doesn't wait for the slave to connect 
before starting to accept new connections.


As you mentioned, the options there are to defer applying WAL, or cancel 
queries. I think both options need the same ability to detect when 
you're about to remove a tuple that's still visible to some snapshot, 
just the action is different. We should probably provide a GUC to 
control which you want.


I don't see any practical way of telling whether a tuple removal will
affect a snapshot or not. Each removed row would need to be checked
against each standby snapshot. Even if those were available, it would be
too costly. 


How about using the same method as we use in HeapTupleSatisfiesVacuum? 
Before replaying a vacuum record, look at the xmax of the tuple 
(assuming it committed). If it's < the slave's OldestXmin, it can be 
removed. Otherwise not. Like HeapTupleSatisfiesVacuum, it's 
conservative, but doesn't require any extra bookkeeping.
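
In sketch form (simplified wraparound handling, made-up function name):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;

    /* Simplified wraparound-aware "xid1 < xid2". */
    static bool
    TransactionIdPrecedes(TransactionId id1, TransactionId id2)
    {
        return (int32_t) (id1 - id2) < 0;
    }

    /* Gate applied before replaying a tuple-removal WAL record on the
       standby: conservative, like HeapTupleSatisfiesVacuum's OldestXmin
       test, but against the standby's own oldest xmin. */
    static bool
    StandbyMayRemoveTuple(TransactionId tuple_xmax,
                          TransactionId slave_oldest_xmin)
    {
        return TransactionIdPrecedes(tuple_xmax, slave_oldest_xmin);
    }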


And vice versa: if we implement the more precise book-keeping, with all 
snapshots in shared memory or something, we might as well use it in 
HeapTupleSatisfiesVacuum. That has been discussed before, but it's a 
separate project.



It was also suggested we might take the removed rows and put them in a
side table, but that makes me think of the earlier ideas for HOT and so
I've steered clear of that.


Yeah, that's non-trivial. Basically a whole new, different 
implementation of MVCC, but without changing any on-disk formats.


BTW, we haven't talked about how to acquire a snapshot in the slave. 
You'll somehow need to know which transactions have not yet committed, 
but will in the future. In the master, we keep track of in-progress 
transaction in the ProcArray, so I suppose we'll need to do the same in 
the slave. Very similar to prepared transactions, actually. I believe 
the Abort records, which are not actually needed for normal operation, 
become critical here. The slave will need to put an entry to ProcArray 
for any new XLogRecord.xl_xid it sees in the WAL, and remove the entry 
at a Commit and Abort record. And clear them all at a shutdown record.
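
In rough C, that replay-side tracking is just a switch over record kinds; the enum and helpers here are illustrative stand-ins (and KnownXidAdd is assumed idempotent):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Illustrative operations on a shared set of xids that have been
       seen in WAL but not yet resolved. */
    static void KnownXidAdd(TransactionId xid) { (void) xid; }
    static void KnownXidRemove(TransactionId xid) { (void) xid; }
    static void KnownXidClearAll(void) {}

    typedef enum
    {
        REC_COMMIT, REC_ABORT, REC_SHUTDOWN, REC_OTHER
    } RecKindSketch;

    /* Called for every WAL record seen during recovery. */
    static void
    TrackXid(RecKindSketch kind, TransactionId xl_xid)
    {
        switch (kind)
        {
            case REC_COMMIT:
            case REC_ABORT:
                KnownXidRemove(xl_xid);   /* transaction resolved */
                break;
            case REC_SHUTDOWN:
                KnownXidClearAll();       /* clean shutdown: none left */
                break;
            default:
                KnownXidAdd(xl_xid);      /* first sight => in progress */
                break;
        }
    }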


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Gregory Stark
Heikki Linnakangas [EMAIL PROTECTED] writes:

 BTW, we haven't talked about how to acquire a snapshot in the slave. You'll
 somehow need to know which transactions have not yet committed, but will
 in the future.

I'm not sure why you need to know which ones will commit in the future. ISTM
you need the same information you normally have which is just which ones have
committed as of the point of WAL replay you're at.

However it does occur to me that if you know in advance that a transaction
will abort in the future you could skip applying its WAL records when you see
them. I'm not sure it's worth adding such an optimization though and it might
get weird around vacuum.

 In the master, we keep track of in-progress transaction in the ProcArray, so
 I suppose we'll need to do the same in the slave. Very similar to prepared
 transactions, actually. I believe the Abort records, which are not actually
 needed for normal operation, become critical here. The slave will need to
 put an entry to ProcArray for any new XLogRecord.xl_xid it sees in the WAL,
 and remove the entry at a Commit and Abort record. And clear them all at a
 shutdown record.

That's how I envisioned it when the topic came up, but another solution was
also bandied about -- I'm not sure who originally proposed it. The alternative
method is to have the master periodically insert a data structure describing a
snapshot in the WAL. The slave then keeps a copy of the last snapshot it saw
in the WAL and any new query which starts up uses that.
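
The periodic record could be as simple as this sketch (the layout is purely illustrative):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Hypothetical payload the master writes into WAL every so often:
       enough for the slave to reconstruct a snapshot locally. */
    typedef struct SnapshotWalRecordSketch
    {
        TransactionId xmin;     /* oldest xid still running */
        TransactionId xmax;     /* next xid to be assigned */
        uint32_t      xcnt;     /* number of in-progress xids */
        TransactionId xip[1];   /* xcnt entries follow */
    } SnapshotWalRecordSketch;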

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Heikki Linnakangas

Gregory Stark wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:


BTW, we haven't talked about how to acquire a snapshot in the slave. You'll
somehow need to know which transactions have not yet committed, but will in the
future. 


I'm not sure why you need to know which ones will commit in the future.


Hmm, I phrased that badly. We need to know which transactions *might* 
commit in the future, IOW, are still in progress. Because we want to 
mark those as in-progress in the snapshots that are taken in the slave. 
Otherwise, when they do commit, they will suddenly become visible in the 
snapshots that didn't know that they were in progress.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-11 Thread Chris Browne
[EMAIL PROTECTED] (Heikki Linnakangas) writes:
 Simon Riggs wrote:
 Taking snapshots from primary has a few disadvantages

  ...
   * snapshots on primary prevent row removal (but this was also an
 advantage of this technique!)

 That makes it an awful solution for high availability. A backend hung
 in transaction-in-progress state in the slave will prevent row removal
 on the master. Isolating the master from queries performed in the
 slave is exactly the reason why people use hot standby. And running
 long reporting queries in the standby is again a very typical use case.

I agree that this is a demerit to this approach.

Whether or not, on balance, it makes it an 'awful solution for high
availability' is much more in the eye of the beholder, and NOT obvious
on the face of it.
-- 
let name=cbbrowne and tld=linuxdatabases.info in name ^ @ ^ tld;;
http://linuxfinances.info/info/sgml.html
Question: How many surrealists does it take to change a light bulb?

Answer: Two, one to hold the giraffe, and the other to fill the bathtub
with brightly colored machine tools.



[HACKERS] Transaction Snapshots and Hot Standby

2008-09-10 Thread Simon Riggs
Transaction snapshots is probably the most difficult problem for Hot
Standby to resolve. We *must* have a transaction snapshot to access
table data in a consistent way, or we must accept some inconsistency,
though that is not simple either.

We can derive transaction snapshots

  * remotely from primary node
  * locally on the standby node

If we derive a snapshot locally, then we will end up with a situation
where the xmin of the local snapshot precedes the xmin of the primary
node. When this occurs it will then be possible for WAL records to
arrive on the standby that request removal of rows that a transaction
might wish to see. Preventing that situation can be done by either
deferring WAL apply or by cancelling queries. 

We can defer WAL apply for particular tables only, but this could
significantly complicate the apply process and is not a suggested option
for the first release of this feature.

We might control query cancellation by tracking which tables have had
rows removed that would have been visible to particular queries. Again,
possible but suffers from the problem that tables on which HOT is
frequently active would be almost unusable. So not a general solution.

Simply ignoring WAL removal has been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php
Explicitly defining the tables a transaction wishes to see has also been
discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php

So the most generally applicable and simplest initial solution for
generating snapshots is to take them from the remote primary node. The
proposal for this follows:

Transaction snapshots will be managed by a new process, Snapproc.
Snapproc will start only in recovery mode and will exit when normal
processing mode begins. The Snapshot process will connect to the remote
primary node, export snapshot data and copy it to shared memory on
the standby node. When all standby backends have run
UnregisterSnapshot(), the snapshot will be unregistered on the
remote primary node also.

The standby must not think a transaction is visible until all changes
made by it have been applied. So snapshots from the primary cannot
be used on the standby until the LSN at which they were taken has been
reached by WAL apply on the standby. Snapshots don't normally have an
LSN, so we must derive that information as well when we take a Snapshot.
With asynchronous commits a transaction can be committed but not yet
written to disk, so we cannot rely on the current WAL write pointer.
Taking the WAL insert pointer is the safest thing to do, though most
heavily contended. We don't want to hold ProcArrayLock while requesting
WALInsertLock, so we will get the LSN of the WAL insert pointer *after*
the Snapshot has been derived (it doesn't matter that much, as long as
it's not before the snapshot).
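
In outline, with stubs throughout (only the ordering is the point):

    #include <stdint.h>

    typedef uint64_t XLogRecPtrSketch;

    static XLogRecPtrSketch ReadInsertPointer(void) { return 0; } /* primary */
    static XLogRecPtrSketch LastReplayedLsn(void) { return 0; }   /* standby */
    static void SleepAWhile(void) {}

    /* Primary: derive the snapshot first, *then* read the insert
       pointer, so the stamped LSN is never older than the snapshot. */
    static XLogRecPtrSketch
    StampSnapshotLsn(void)
    {
        /* ... snapshot already built under ProcArrayLock ... */
        return ReadInsertPointer();
    }

    /* Standby: a backend may only use the snapshot once WAL apply has
       reached the stamped LSN. */
    static void
    WaitUntilSnapshotUsable(XLogRecPtrSketch snap_lsn)
    {
        while (LastReplayedLsn() < snap_lsn)
            SleepAWhile();
    }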

So snapshots from the primary take time before they can be used. The
delay is exactly the current processing delay from primary to standby,
which is another reason why we do not wish to fall behind.

Taking snapshots from primary has a few disadvantages

  * snapshots take time before they are usable
  * requesting new snapshots is via remote request
  * snapshots on primary prevent row removal (but this was also an
advantage of this technique!)

If the primary and standby nodes are connected by private ethernet then the
snapshot request time will be ~10ms, though that is probably 100 times
slower than current snapshot access. If primary and standby are opposite
sides of an ocean then times could be as high as 500ms. Taking snapshots
directly could be annoyingly slow for small read-only statements, so we
need to buffer this process in various ways. For larger queries, this
may not be a problem at all, but we do not wish to limit or assume the
uses of the standby node. 

First of all, each backend will have the opportunity to reuse previous
snapshots both within transactions and across them. A userset parameter
snapshot_reuse_window = 0..6ms will define the time window in which
any request for a new snapshot will simply result in being fed the last
snapshot again. When the window on a snapshot has expired a newer
snapshot will be presented instead. This idea is similar to serializable
transactions, which continually reuse the same snapshot. This is a
useful parameter for normal processing as well, since it will reduce
contention on the ProcArrayLock for many applications. 
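
A sketch of the reuse test (illustrative names, opaque snapshot type, and a made-up default for the window):

    #include <stddef.h>
    #include <time.h>

    typedef struct SnapshotDataSketch SnapshotDataSketch;  /* opaque */

    static int snapshot_reuse_window_ms = 1000;   /* illustrative value */
    static SnapshotDataSketch *last_snapshot = NULL;
    static time_t last_snapshot_time;

    static SnapshotDataSketch *
    TakeFreshSnapshot(void)
    {
        return NULL;   /* stub for the (remote, expensive) request */
    }

    /* Hand the previous snapshot back while the window is open; only go
       to the source when it has expired. */
    static SnapshotDataSketch *
    GetSnapshotWithReuse(void)
    {
        time_t now = time(NULL);

        if (last_snapshot != NULL &&
            (now - last_snapshot_time) * 1000 < snapshot_reuse_window_ms)
            return last_snapshot;

        last_snapshot = TakeFreshSnapshot();
        last_snapshot_time = now;
        return last_snapshot;
    }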

Snapshots can be reused across transactions in recovery mode, since they
are held in shared memory. Amount of shared memory dedicated to storing
snapshots will be max_connections * max size of snapshots.

Since there is a delay between obtaining a new snapshot and it becoming
usable, the Snapshot process will buffer them until they become mature,
like a good whiskey. The Snapshot process will take regular snapshots and
pre-age them so that when a backend requests a snapshot it will be