Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Andres Freund
On Monday, June 25, 2012 08:50:54 PM Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
  We most particularly *don't* want DDL to replicate automatically,
  because the schema changes are deployed along with related
  software changes, and we like to pilot any changes for at least a
  few days.  Depending on the release, the rollout may take a
  couple months, or we may slam it in everywhere a few days after
  the first pilot deployment.
  
  That's sensible for your use-case - but I do not think that's
  the appropriate behaviour for anything which is somewhat
  out-of-the-box...

 Right.  We currently consider the issues involved in a change and
 script it by hand.  I think we want to continue to do that.  The
 point was that, given the variety of timings and techniques for
 distributing schema changes, maybe it was only worth doing that
 automatically for the most constrained and controlled cases.
Agreed.

  So you could certainly punt all of this for any release as far as
  Wisconsin Courts are concerned.  We need to know table and column
  names, before and after images, and some application-supplied
  metadata.
  
  I am not sure we're going to get all that into 9.3.
 
 Sure, that was more related to why I was questioning how much these
 use cases even *could* integrate -- whether it even paid to
 *consider* these use cases at this point.  Responses from Robert and
 you have pretty much convinced me that I was being overly
 pessimistic on that.
I think it's an important question to ask, otherwise we might just end up with 
infrastructure unusable for all our goals... Or usable but unfinished 
infrastructure because it's too complex to build in a sensible time.

 One fine point regarding before and after images -- if a value
 doesn't change in an UPDATE, there's no reason to include it in both
 the BEFORE and AFTER tuple images, as long as we have the null
 column bitmaps -- or some other way of distinguishing unchanged from
 NULL.  (This could be especially important when the unchanged column
 was a 50 MB bytea.)  It doesn't matter to me which way this is
 optimized -- in our trigger-based system we chose to keep the full
 BEFORE tuple and just show AFTER values which were distinct from the
 corresponding BEFORE values, but it would be trivial to adapt the
 code to the other way.
I don't think doing that is worth the trouble in the first incarnation. There 
is enough detail hidden in there to make it non-trivial, so I wouldn't 
worry about it until the rest of the infrastructure exists. That part of the 
code will definitely be version specific so I see no problem improving on that 
incrementally.

 Sorry for that bout of pessimism.
Oh, no reason for that. I have some doubts about that myself, so...

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Simon Riggs
On 25 June 2012 17:42, Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 This is clearly going to depend on the topology.  You would
 definitely want to try to replicate the DDL for the case on which
 Simon is focused (which seems to me to be essentially physical
 replication of catalogs with logical replication of data changes
 from any machine to all others).

Just to remove any doubt: I'm not trying to support a single use case.

The overall proposals include a variety of design patterns. Each of
those covers many reasons for doing it, but they end up with the same
architecture.

1) Single master replication, with options not possible with physical
2) Multimaster
3) Many to One: data aggregation
4) Online upgrade

I don't think it will be possible to support all of those in one
release. Each has different challenges.

3 and 4 will not be worked on until 9.4, unless someone else is
willing to work on them. That isn't meant to be harsh, just an
explanation of practical reality that I hope people can accept without
needing to argue it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Robert Haas
On Mon, Jun 25, 2012 at 3:17 PM, Andres Freund and...@2ndquadrant.com wrote:
 I suppose the main reason we haven't done it already is that it
 increases the period of time during which we're using 2X the disk
 space.
 I find that an acceptable price if it's optional. Making it such doesn't seem
 to be a problem for me.

+1.

 I think there is absolutely nothing wrong with doing extra things in
 ALTER TABLE when logical replication is enabled.  We've got code
 that's conditional on Hot Standby being enabled in many places in the
 system; why should logical replication be any different?  If we set
 the bar for logical replication at "the system can't do anything
 differently when logical replication is enabled", then I cheerfully
 submit that we are doomed.  You've already made WAL format changes to
 support logging the pre-image of the tuple, which is a hundred times
 more likely to cause a performance problem than any monkeying around
 we might want to do in ALTER TABLE.

 I am deeply skeptical that we need to look inside of transactions that
 do full-table rewrites.  But even if we do, I don't see that what I'm
 proposing precludes it.  For example, I think we could have ALTER
 TABLE emit WAL records specifically for logical replication that allow
 us to disentangle which tuple descriptor to use at which point in the
 transaction.  I don't see that that would even be very difficult to
 set up.
 Sorry, I was imprecise above: I have no problem doing some changes during
 ALTER TABLE if logical rep is enabled. I am worried though that to make that
 robust you would need loads of places that emit additional information:
 * ALTER TABLE
 * ALTER FUNCTION
 * ALTER OPERATOR
 * ALTER/CREATE CAST
 * TRUNCATE
 * CLUSTER
 * ...

 I have the feeling that if we want to do that, the full amount of required
 information would be rather high and end up being essentially the catalog.
 Just having an intermediate tupledesc doesn't help that much if you have e.g.
 record_out doing type lookups of its own.

 There also is the issue you have talked about before, that a user-type might
 depend on values in other tables. Unless we're ready to break at least
 transactional behaviour for those (for now...) I don't see how decoding outside
 of the transaction is ever going to be valid. I wouldn't have a big problem
 declaring that as broken for now...

I've been thinking about this a lot.  My thinking's still evolving
somewhat, but consider the following case.  A user defines a type T
with an input function I and an output function O.  They create a
table which uses type T and insert a bunch of data, which is duly
parsed using I; then, they replace I with a new input function I' and
O with a new output function O'.  Now, clearly, if we process the
inserts using the catalogs that were in effect at the time the inserts
were done, we could theoretically get different output than if we use
the catalogs that were in effect after the I/O functions were
replaced.  But is the latter output wrong, or merely different?  My
first thought when we started talking about this was that it's wrong, but
the more I think about it, the less convinced I am...

...because it can't possibly be right to suppose that it's impossible
to decode heap tuples using any catalog contents other than the ones
that were in effect at the time the tuples got inserted.  If that were
true, then we wouldn't be able to read a table after adding or
dropping a column, which of course we can.  It seems to me that it
might be sufficient to guarantee that we'll decode using the same
*types* that were in effect at the time the inserts happened.  If the
user yanks the rug out from under us by changing the type definition,
maybe we simply define that as a situation in which they get to keep
both pieces.  After all, if you replace the type definition in a way
that makes sensible decoding of the table impossible, you've pretty
much shot yourself in the foot whether logical replication enters the
picture or not.

In the enum case, for example, we go to great pains to make sure that
the table contents are always decodable not only under the current
version of SnapshotNow, but also any successor version.  We do that by
prohibiting ALTER TYPE .. ADD VALUE from running inside a transaction
block - because if we inserted a row into pg_enum and then inserted
dependent rows into some user table, a rollback could leave us with
rows that we can't decode.  For the same reason, we don't allow ALTER
TYPE .. DROP VALUE.  I think that we can infer a general principle
from this: while I/O functions may refer to catalog contents, they may
not do so in a way that could be invalidated by subsequent commits or
rollbacks.  If they did, they'd be breaking the ability of subsequent
SELECT statements to read the table.
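
To make that dependency concrete, here is a minimal sketch (not from
any patch; the helper is invented) of what turning a decoded column
into text has to do - it always goes through an output function found
via the catalogs:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/lsyscache.h"

    /*
     * Hypothetical helper: turn one decoded Datum into its text form.
     * getTypeOutputInfo() consults pg_type under SnapshotNow, and the
     * output function it returns may do further catalog lookups of its
     * own (enums, composites, ...), which is exactly the hazard being
     * discussed here.
     */
    static char *
    datum_to_cstring(Oid typid, Datum value)
    {
        Oid     typoutput;
        bool    typisvarlena;

        getTypeOutputInfo(typid, &typoutput, &typisvarlena);
        return OidOutputFunctionCall(typoutput, value);
    }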

An interesting case that is arguably an exception to this rule is the
regwhatever types, which will cheerfully output their value as an OID
if it can't be decoded to text, but will bail on 

Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread David Fetter
On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
  
  I am not sure we're going to get all that into 9.3.
  
 Sure, that was more related to why I was questioning how much these
 use cases even *could* integrate -- whether it even paid to
 *consider* these use cases at this point.  Responses from Robert and
 you have pretty much convinced me that I was being overly
 pessimistic on that.
  
 One fine point regarding before and after images -- if a value
 doesn't change in an UPDATE, there's no reason to include it in both
 the BEFORE and AFTER tuple images, as long as we have the null
 column bitmaps -- or some other way of distinguishing unchanged from
 NULL.  (This could be especially important when the unchanged column
 was a 50 MB bytea.)

How about two bitmaps: one telling which columns are actually there,
the other with NULLs?
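
For illustration only, a hypothetical wire layout along those lines
(every name here is invented, nothing from an actual patch) could look
like this:

    #include "postgres.h"

    /*
     * One tuple image inside a logical change record.  "present" says
     * whether a column is included at all (so unchanged columns can be
     * omitted from the AFTER image); "nulls" distinguishes a column
     * that is present but NULL from one that was simply left out.
     */
    typedef struct LogicalTupleImage
    {
        uint16  natts;      /* number of columns described */
        uint8  *present;    /* bitmap: is the column included? */
        uint8  *nulls;      /* bitmap: included column is NULL? */
        Datum  *values;     /* values of included, non-NULL columns */
    } LogicalTupleImage;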

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Kevin Grittner
David Fetter da...@fetter.org wrote:
 On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
 
 One fine point regarding before and after images -- if a value
 doesn't change in an UPDATE, there's no reason to include it in
 both the BEFORE and AFTER tuple images, as long as we have the
 null column bitmaps -- or some other way of distinguishing
 unchanged from NULL.  (This could be especially important when
 the unchanged column was a 50 MB bytea.)
 
 How about two bitmaps: one telling which columns are actually
 there, the other with NULLs?
 
There are quite a few ways that could be done, but I suspect
Álvaro's idea is best:
 
http://archives.postgresql.org/message-id/1340654533-sup-5...@alvh.no-ip.org
 
In any event, it sounds like Andres wants to keep it as simple as
possible for the moment, and just include both tuples in their
entirety.  Hopefully that is something which can be revisited before
the last CF.
 
-Kevin



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread David Fetter
On Tue, Jun 26, 2012 at 05:05:27PM -0500, Kevin Grittner wrote:
 David Fetter da...@fetter.org wrote:
  On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
  
  One fine point regarding before and after images -- if a value
  doesn't change in an UPDATE, there's no reason to include it in
  both the BEFORE and AFTER tuple images, as long as we have the
  null column bitmaps -- or some other way of distinguishing
  unchanged from NULL.  (This could be especially important when
  the unchanged column was a 50 MB bytea.)
  
  How about two bitmaps: one telling which columns are actually
  there, the other with NULLs?
  
 There are quite a few ways that could be done, but I suspect
 Álvaro's idea is best:
  
 http://archives.postgresql.org/message-id/1340654533-sup-5...@alvh.no-ip.org

Looks great (or at least way better than mine) to me :)

 In any event, it sounds like Andres wants to keep it as simple as
 possible for the moment, and just include both tuples in their
 entirety.  Hopefully that is something which can be revisited before
 the last CF.

I hope so, too...

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Andres Freund
On Monday, June 25, 2012 03:08:51 AM Robert Haas wrote:
 On Sun, Jun 24, 2012 at 5:11 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  There are some interesting problems related to locking and snapshots
  here. Not sure if they are resolvable:
  
  We need to restrict SnapshotNow to represent the view it had back when
  the wal record we're currently decoding was created. Otherwise we would possibly
  get wrong column types and similar. As we're working in the past, locking
  doesn't protect us against much here. I have that (mostly and
  inefficiently).
  
  One interesting problem are table rewrites (truncate, cluster, some ALTER
  TABLE's) and dropping tables. Because we nudge SnapshotNow to the past
  view it had back when the wal record was created we get the old
  relfilenode. Which might have been dropped as part of the transaction
  cleanup...
  With most types that's not a problem. Even things like records and arrays
  aren't problematic. More interesting cases include VACUUM FULL $systable
  (e.g. pg_enum) and vacuum full'ing a table which is used in the *_out
  function of a type (like a user level pg_enum implementation).
  
  The only theoretical way I see against that problem would be to postpone
  all relation unlinks until everything that could possibly read them has
  finished. Doesn't seem too alluring, although it would be needed if we
  ever move more things off SnapshotNow.
  
  Input/Ideas/Opinions?
 
 Yeah, this is slightly nasty.  I'm not sure whether or not there's a
 way to make it work.
Postponing all non-rollback unlinks to the next logical checkpoint is the 
only thing I can think of...

 I had another idea.  Suppose decoding happens directly on the primary,
 because I'm still hoping there's a way to swing that.  Suppose further
 that we handle DDL by insisting that (1) any backend which wants to
 add columns or change the types of existing columns must first wait
 for logical replication to catch up and (2) if a backend which has
 added columns or changed the types of existing columns then writes to
 the modified table, decoding of those writes will be postponed until
 transaction commit.  I think that's enough to guarantee that the
 decoding process can just use the catalogs as they stand, with plain
 old SnapshotNow.
I don't think it's that easy. If you e.g. have multiple ALTER's in the same 
transaction interspersed with inserted rows they will all have different 
TupleDesc's.
I don't see how that's resolvable without either replicating DDL to the target 
system or changing what SnapshotNow does...

 The downside of this approach is that it makes certain kinds of DDL
 suck worse if logical replication is in use and behind.  But I don't
 necessarily see that as prohibitive because (1) logical replication
 being behind is likely to suck for a lot of other reasons too and (2)
 adding or retyping columns isn't a terribly frequent operation and
 people already expect a hit when they do it.  Also, I suspect that we
 could find ways to loosen those restrictions at least in common cases
 in some future version; meanwhile, less work now.
Agreed.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Robert Haas
On Mon, Jun 25, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com wrote:
  The only theoretical way I see against that problem would be to postpone
  all relation unlinks until everything that could possibly read them has
  finished. Doesn't seem too alluring, although it would be needed if we
  ever move more things off SnapshotNow.
 
  Input/Ideas/Opinions?

 Yeah, this is slightly nasty.  I'm not sure whether or not there's a
 way to make it work.
 Postponing all non-rollback unlinks to the next logical checkpoint is the
 only thing I can think of...

There are a number of cool things we could do if we postponed unlinks.
 Like, why can't we allow concurrent read-only queries while a CLUSTER
operation is in progress?  Well, two reasons.  The first is that we
currently can't do ANY DDL with less than a full table lock because of
SnapshotNow-related race conditions.  The second is that people might
still need to look at the old heap after the CLUSTER transaction
commits.  Some kind of delayed unlink facility where we
garbage-collect relation backing files when their refcount falls to
zero would solve the second problem - not that that's any help by
itself without a solution to the first one, but hey.
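
A purely hypothetical sketch of such a facility (names invented, no
crash safety, just to make the refcounting idea concrete):

    #include "postgres.h"
    #include "storage/relfilenode.h"

    /*
     * Entry for a relation backing file whose unlink has been
     * postponed.  The file is only removed once nothing that might
     * still read it - a concurrent scan, or a decoding process working
     * on older WAL - holds a reference any more.
     */
    typedef struct PendingUnlink
    {
        RelFileNode node;            /* which physical relation file */
        int         refcount;        /* readers that may still need it */
        bool        unlink_requested;
    } PendingUnlink;

    static void
    release_pending_unlink(PendingUnlink *entry)
    {
        if (--entry->refcount == 0 && entry->unlink_requested)
        {
            /* actually remove the file here, e.g. via the smgr layer */
        }
    }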

 I had another idea.  Suppose decoding happens directly on the primary,
 because I'm still hoping there's a way to swing that.  Suppose further
 that we handle DDL by insisting that (1) any backend which wants to
 add columns or change the types of existing columns must first wait
 for logical replication to catch up and (2) if a backend which has
 added columns or changed the types of existing columns then writes to
 the modified table, decoding of those writes will be postponed until
 transaction commit.  I think that's enough to guarantee that the
 decoding process can just use the catalogs as they stand, with plain
 old SnapshotNow.
 I don't think it's that easy. If you e.g. have multiple ALTER's in the same
 transaction interspersed with inserted rows they will all have different
 TupleDesc's.

If new columns were added, then tuples created with those older
tuple-descriptors can still be interpreted with the latest
tuple-descriptor.
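
That falls out of how tuple deforming already behaves; roughly
(hypothetical helper, header names as in the 9.3-era tree):

    #include "postgres.h"
    #include "access/htup_details.h"

    /*
     * A tuple written before ALTER TABLE .. ADD COLUMN stores fewer
     * attributes than the table's current descriptor.  Deforming it
     * with the newest descriptor simply reports the missing trailing
     * columns as NULL, so decoding with the latest catalogs is fine
     * for the pure column-addition case.
     */
    static void
    decode_with_latest_tupdesc(HeapTuple oldtup, TupleDesc newest)
    {
        Datum  *values = palloc(newest->natts * sizeof(Datum));
        bool   *isnull = palloc(newest->natts * sizeof(bool));

        heap_deform_tuple(oldtup, newest, values, isnull);
        /* columns added after oldtup was written come back with
         * isnull[i] == true */
    }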

Columns that are dropped or retyped are a little trickier, but
honestly... how much do we care about those cases?  How practical is
it to suppose we're going to be able to handle them sanely anyway?
Suppose that the user defines a type which works just like int4 except
that the output functions writes out each number in pig latin (and the
input function parses pig latin).  The user defines the types as
binary coercible to each other and then does ALTER TABLE on a large
table with an int4 column, transforming it into an int4piglatin
column.  Due to Noah Misch's fine work, we will conclude that no table
rewrite is needed.  But if logical replication is in use, then in
theory we should scan the whole table and generate an LCR for each row
saying the row with primary key X was updated, and column Y, which
used to contain 42, now contains ourty-two-fay.  Otherwise, if we're
doing heterogeneous replication into a system that just stores that
column as text, it'll end up with the wrong contents.  On the other
hand, if we're trying to ship data to another PostgreSQL instance
where the column hasn't yet been updated, then all of those LCRs are
just going to error out when we try to apply them.

A more realistic scenario where you have the same problem is with
something like ALTER TABLE .. ADD COLUMN .. DEFAULT.   If you add a
column with a default in a single step (as opposed to first adding the
column and then setting its default), we rewrite the table and set
every row to the default value.  Should that generate LCRs showing
every row being updated to add that new value, or should we generate
no LCRs and assume that the DBA will independently do the same
operation on the remote side?  Either answer could be correct,
depending on how the LCRs are being used.  If you're just rewriting
with a constant default, then perhaps the sensible thing is to
generate no LCRs, since it will be more efficient to mimic the
operation on the remote side than to replay the changes row-by-row.
But what if the default isn't a constant, like maybe it's
nextval('new_synthetic_pkey_seq') or even something like now().  In
those cases, it seems quite likely that if you don't generate LCRs,
manual user intervention will be required to get things back on track.
 On the other hand, if you do generate LCRs, the remote side will
become horribly bloated on replay, unless the LCRs also instruct the
far side that they should be applied via a full-table rewrite.

Can we just agree to punt all this complexity for version 1 (and maybe
versions 2, 3, and 4)?  I'm not sure what Slony does in situations
like this but I bet for a lot of replication systems, the answer is
do a full resync.  In other words, we either forbid the operation
outright when the table is enabled for logical replication, or else we
emit an LCR that 

Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 I bet for a lot of replication systems, the answer is do a full
 resync.  In other words, we either forbid the operation outright
 when the table is enabled for logical replication, or else we emit
 an LCR that says, in effect, transaction 12345 monkeyed with the
 table, please resync.  It strikes me that it's really the job of
 some higher-level control logic to decide what the correct
 behavior is in these cases; the decoding process doesn't really
 have enough information about what the user is trying to do to
 make a sensible decision anyway.
 
This is clearly going to depend on the topology.  You would
definitely want to try to replicate the DDL for the case on which
Simon is focused (which seems to me to be essentially physical
replication of catalogs with logical replication of data changes
from any machine to all others).  What you do about transactions in
flight is the hard part.  You could try to suppress concurrent DML
of the same objects or have some complex matrix of rules for trying
to resolve the transactions in flight.  I don't see how the latter
could ever be 100% accurate.
 
In our shop it is much easier.  We always have one database which is
the only valid source for any tuple, although rows from many such
databases can be in one table, and one row might replicate to many
databases.  Thus, we don't want automatic replication of DDL.
 
 - When a column is going to be added to the source machines, we
   first add it to the targets, with either a default or as
   NULL-capable.
 
 - When a column is going to be deleted from the source machines, we
   make sure it is NULL-capable or has a default on the replicas. 
   We drop it from all replicas after it is gone from all sources.
 
 - If a column is changing name or is changing to a fundamentally
   different type we need to give the new column a new name, have
   triggers to convert old to new (and vice versa) on the replicas,
   and drop the old after all sources are updated.
 
 - If a column is changing in a minor way, like its precision, we
   make sure the replicas can accept either format until all sources
   have been converted.  We update the replicas to match the sources
   after all sources are converted.
 
We most particularly *don't* want DDL to replicate automatically,
because the schema changes are deployed along with related software
changes, and we like to pilot any changes for at least a few days. 
Depending on the release, the rollout may take a couple months, or
we may slam it in everywhere a few days after the first pilot
deployment.
 
So you could certainly punt all of this for any release as far as
Wisconsin Courts are concerned.  We need to know table and column
names, before and after images, and some application-supplied
metadata.
 
I don't know that what we're looking for is any easier (although I
doubt that it's any harder), but I'm starting to wonder how much
mechanism they can really share.  The 2Q code is geared toward page
format OIDs and data values for automated DDL distribution and
faster replication, while we're looking for something which works
between releases, architectures, and OSes.  We keep coming back to
the idea of one mechanism because both WAL and a logical transaction
stream would have "after" tuples, although they need them in
different formats.
 
I think the need for truly logical replication is obvious, since so
many different people have developed trigger-based versions of that.
And it sure seems like 2Q has clients who are willing to pay for the
other.
 
Perhaps the first question is: Is there enough in common between
logical replication (and all the topologies that might be created
with that) and the proposal on the table (which seems to be based
around one particular topology with a vague notion of bolting
logical replication on to it after the fact) to try to resolve the
differences in one feature?  Or should the "identical schema with
multiple identical copies" case be allowed to move forward more or
less in isolation, with logical replication having its own design if
and when someone wants to take it on?  Two non-compromised features
might be cleaner -- I'm starting to feel like we're trying to design
a toaster which can also water your garden.
 
-Kevin



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Robert Haas
On Mon, Jun 25, 2012 at 12:42 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Perhaps the first question is: Is there enough in common between
 logical replication (and all the topologies that might be created
 with that) and the proposal on the table (which seems to be based
 around one particular topology with a vague notion of bolting
 logical replication on to it after the fact) to try to resolve the
 differences in one feature?  Or should the "identical schema with
 multiple identical copies" case be allowed to move forward more or
 less in isolation, with logical replication having its own design if
 and when someone wants to take it on?  Two non-compromised features
 might be cleaner -- I'm starting to feel like we're trying to design
 a toaster which can also water your garden.

I think there are a number of shared pieces.  Being able to read WAL
and do something with it is a general need that both solutions share;
I think actually that might be the piece that we should try to get
committed first.  I suspect that there are a number of applications
for just that and nothing more - for example, it might allow a contrib
module that reads WAL as it's generated and prints out a debug trace,
which I can imagine being useful.
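
As a strawman for the shape of that shared building block (the
interface below is entirely made up, purely to illustrate a "read WAL
and hand records to a plugin" API):

    #include "postgres.h"
    #include "access/xlogdefs.h"

    /* Callback invoked for every WAL record that is read. */
    typedef void (*wal_record_cb) (XLogRecPtr lsn,
                                   uint8 rmid,   /* resource manager */
                                   uint8 info,
                                   const char *data,
                                   Size len,
                                   void *private_data);

    /*
     * Hypothetical driver: walk WAL from 'start', calling 'cb' for
     * each record.  A debug-trace contrib module would pass a callback
     * that just elog()s the record; changeset extraction would pass
     * its decoder.
     */
    extern void walk_wal_from(XLogRecPtr start,
                              wal_record_cb cb,
                              void *private_data);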

Also, I think that even for MMR there will be a need for control
logic, resynchronization, and similar mechanisms.  I mean, suppose you
have four servers in an MMR configuration.  Now, you want to deploy a
schema change that adds a new column and which, as it so happens,
requires a table rewrite to add the default.  It is very possible that
you do NOT want that to automatically replicate around the cluster.
Instead, you likely want to redirect load to the remaining three
servers, do the change on the fourth, put it back into the ring and
take out a different one, do the change on that one, and so on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Andres Freund
On Monday, June 25, 2012 05:34:13 PM Robert Haas wrote:
 On Mon, Jun 25, 2012 at 9:43 AM, Andres Freund and...@2ndquadrant.com 
wrote:
   The only theoretical way I see against that problem would be to
    postpone all relation unlinks until everything that could possibly
    read them has finished. Doesn't seem too alluring, although it would be
    needed if we ever move more things off SnapshotNow.
   
   Input/Ideas/Opinions?
  
  Yeah, this is slightly nasty.  I'm not sure whether or not there's a
  way to make it work.
  
  Postponing all non-rollback unlinks to the next logical checkpoint is
  the only thing I can think of...
 There are a number of cool things we could do if we postponed unlinks.
  Like, why can't we allow concurrent read-only queries while a CLUSTER
 operation is in progress?  Well, two reasons.  The first is that we
 currently can't do ANY DDL with less than a full table lock because of
 SnapshotNow-related race conditions.  The second is that people might
 still need to look at the old heap after the CLUSTER transaction
 commits.  Some kind of delayed unlink facility where we
 garbage-collect relation backing files when their refcount falls to
 zero would solve the second problem - not that that's any help by
 itself without a solution to the first one, but hey.
It's an argument why related infrastructure would be interesting to more than 
that patch, and that's not bad.
If the garbage collecting is done in a very simplistic manner it doesn't sound 
too hard... The biggest problem is probably crash-recovery of that knowledge 
and how to hook in the knowledge that logical rep still needs that data...

  I had another idea.  Suppose decoding happens directly on the primary,
  because I'm still hoping there's a way to swing that.  Suppose further
  that we handle DDL by insisting that (1) any backend which wants to
  add columns or change the types of existing columns must first wait
  for logical replication to catch up and (2) if a backend which has
  added columns or changed the types of existing columns then writes to
  the modified table, decoding of those writes will be postponed until
  transaction commit.  I think that's enough to guarantee that the
  decoding process can just use the catalogs as they stand, with plain
  old SnapshotNow.
  
  I don't think it's that easy. If you e.g. have multiple ALTER's in the
  same transaction interspersed with inserted rows they will all have
  different TupleDesc's.
 
 If new columns were added, then tuples created with those older
 tuple-descriptors can still be interpreted with the latest
 tuple-descriptor.
But you need to figure that out. If you have just the before-after images of 
the tupledescs you don't know what happened in there... That would mean either 
doing special things on catalog changes or reassembling the meaning from the 
changed pg_* rows. Neither seems enticing.

 Columns that are dropped or retyped are a little trickier, but
 honestly... how much do we care about those cases?  How practical is
 it to suppose we're going to be able to handle them sanely anyway?
 Suppose that the user defines a type which works just like int4 except
 that the output functions writes out each number in pig latin (and the
 input function parses pig latin).  The user defines the types as
 binary coercible to each other and then does ALTER TABLE on a large
 table with an int4 column, transforming it into an int4piglatin
 column.  Due to Noah Misch's fine work, we will conclude that no table
 rewrite is needed.  But if logical replication is in use, then in
 theory we should scan the whole table and generate an LCR for each row
 saying the row with primary key X was updated, and column Y, which
 used to contain 42, now contains ourty-two-fay.  Otherwise, if we're
 doing heterogeneous replication into a system that just stores that
 column as text, it'll end up with the wrong contents.  On the other
 hand, if we're trying to ship data to another PostgreSQL instance
 where the column hasn't yet been updated, then all of those LCRs are
 just going to error out when we try to apply them.

 A more realistic scenario where you have the same problem is with
 something like ALTER TABLE .. ADD COLUMN .. DEFAULT.   If you add a
 column with a default in a single step (as opposed to first adding the
 column and then setting its default), we rewrite the table and set
 every row to the default value.  Should that generate LCRs showing
 every row being updated to add that new value, or should we generate
 no LCRs and assume that the DBA will independently do the same
 operation on the remote side?  Either answer could be correct,
 depending on how the LCRs are being used.  If you're just rewriting
 with a constant default, then perhaps the sensible thing is to
 generate no LCRs, since it will be more efficient to mimic the
 operation on the remote side than to replay the changes row-by-row.
 But what if the default isn't a constant, like maybe it's
 

Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Robert Haas
On Mon, Jun 25, 2012 at 1:50 PM, Andres Freund and...@2ndquadrant.com wrote:
 It's an argument why related infrastructure would be interesting to more than
 that patch, and that's not bad.
 If the garbage collecting is done in a very simplistic manner it doesn't sound
 too hard... The biggest problem is probably crash-recovery of that knowledge
 and how to hook in the knowledge that logical rep still needs that data...

I suppose the main reason we haven't done it already is that it
increases the period of time during which we're using 2X the disk
space.

  I don't think it's that easy. If you e.g. have multiple ALTER's in the
  same transaction interspersed with inserted rows they will all have
  different TupleDesc's.

 If new columns were added, then tuples created with those older
 tuple-descriptors can still be interpreted with the latest
 tuple-descriptor.
 But you need to figure that out. If you have just the before-after images of
 the tupledescs you don't know what happened in there... That would mean either
 doing special things on catalog changes or reassembling the meaning from the
 changed pg_* rows. Neither seems enticing.

I think there is absolutely nothing wrong with doing extra things in
ALTER TABLE when logical replication is enabled.  We've got code
that's conditional on Hot Standby being enabled in many places in the
system; why should logical replication be any different?  If we set
the bar for logical replication at the system can't do anything
differently when logical replication is enabled then I cheerfully
submit that we are doomed.  You've already made WAL format changes to
support logging the pre-image of the tuple, which is a hundred times
more likely to cause a performance problem than any monkeying around
we might want to do in ALTER TABLE.

 Can we just agree to punt all this complexity for version 1 (and maybe
 versions 2, 3, and 4)?  I'm not sure what Slony does in situations
 like this but I bet for a lot of replication systems, the answer is
 do a full resync.  In other words, we either forbid the operation
 outright when the table is enabled for logical replication, or else we
 emit an LCR that says, in effect, transaction 12345 monkeyed with the
 table, please resync.  It strikes me that it's really the job of some
 higher-level control logic to decide what the correct behavior is in
 these cases; the decoding process doesn't really have enough
 information about what the user is trying to do to make a sensible
 decision anyway.  It would be nice to be able to support some simple
 cases like adding a column that has no default or dropping a
 column without punting, but going much further than that seems like
 it will require embedding policy decisions that should really be
 happening at a higher level.
 I am totally fine with saying that we do not support everything from the
 start. But we need to choose an architecture where it's possible to add that
 support gradually, and I don't think "without looking inside transactions" makes
 that possible.

I am deeply skeptical that we need to look inside of transactions that
do full-table rewrites.  But even if we do, I don't see that what I'm
proposing precludes it.  For example, I think we could have ALTER
TABLE emit WAL records specifically for logical replication that allow
us to disentangle which tuple descriptor to use at which point in the
transaction.  I don't see that that would even be very difficult to
set up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Andres Freund
Hi,

(munching the mail from Robert and Kevin together)

On Monday, June 25, 2012 06:42:41 PM Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:
  I bet for a lot of replication systems, the answer is do a full
  resync.  In other words, we either forbid the operation outright
  when the table is enabled for logical replication, or else we emit
  an LCR that says, in effect, transaction 12345 monkeyed with the
  table, please resync.  It strikes me that it's really the job of
  some higher-level control logic to decide what the correct
  behavior is in these cases; the decoding process doesn't really
  have enough information about what the user is trying to do to
  make a sensible decision anyway.
 
 This is clearly going to depend on the topology.  You would
 definitely want to try to replicate the DDL for the case on which
 Simon is focused (which seems to me to be essentially physical
 replication of catalogs with logical replication of data changes
 from any machine to all others).  What you do about transactions in
 flight is the hard part.  You could try to suppress concurrent DML
 of the same objects or have some complex matrix of rules for trying
 to resolve the transactions in flight.  I don't see how the latter
 could ever be 100% accurate.
Yes. That's why I dislike that proposal. I don't think that's going to be 
understandable and robust enough.

If we really look inside transactions (options 3b and 1) that shouldn't be a problem 
though. So I think it really has to be one of those.


 In our shop it is much easier.  We always have one database which is
 the only valid source for any tuple, although rows from many such
 databases can be in one table, and one row might replicate to many
 databases.  Thus, we don't want automatic replication of DDL.
 
  - When a column is going to be added to the source machines, we
first add it to the targets, with either a default or as
NULL-capable.
 
  - When a column is going to be deleted from the source machines, we
make sure it is NULL-capable or has a default on the replicas.
We drop it from all replicas after it is gone from all sources.
 
  - If a column is changing name or is changing to a fundamentally
different type we need to give the new column a new name, have
triggers to convert old to new (and vice versa) on the replicas,
and drop the old after all sources are updated.
 
  - If a column is changing in a minor way, like its precision, we
make sure the replicas can accept either format until all sources
have been converted.  We update the replicas to match the sources
after all sources are converted.

 We most particularly *don't* want DDL to replicate automatically,
 because the schema changes are deployed along with related software
 changes, and we like to pilot any changes for at least a few days.
 Depending on the release, the rollout may take a couple months, or
 we may slam it in everywhere a few days after the first pilot
 deployment.
That's sensible for your use-case - but I do not think that's the 
appropriate behaviour for anything which is somewhat out-of-the-box...

 So you could certainly punt all of this for any release as far as
 Wisconsin Courts are concerned.  We need to know table and column
 names, before and after images, and some application-supplied
 metadata.
I am not sure we're going to get all that into 9.3. More on that below.

On Monday, June 25, 2012 07:09:38 PM Robert Haas wrote:
 On Mon, Jun 25, 2012 at 12:42 PM, Kevin Grittner wrote:
  I don't know that what we're looking for is any easier (although I
  doubt that it's any harder), but I'm starting to wonder how much
  mechanism they can really share.  The 2Q code is geared toward page
  format OIDs and data values for automated DDL distribution and
  faster replication, while we're looking for something which works
  between releases, architectures, and OSes.  We keep coming back to
  the idea of one mechanism because both WAL and a logical transaction
  stream would have "after" tuples, although they need them in
  different formats.
  
  I think the need for truly logical replication is obvious, since so
  many different people have developed trigger-based versions of that.
  And it sure seems like 2Q has clients who are willing to pay for the
  other.
 
  Perhaps the first question is: Is there enough in common between
  logical replication (and all the topologies that might be created
  with that) and the proposal on the table (which seems to be based
  around one particular topology with a vague notion of bolting
  logical replication on to it after the fact) to try to resolve the
  differences in one feature?  Or should the "identical schema with
  multiple identical copies" case be allowed to move forward more or
  less in isolation, with logical replication having its own design if
  and when someone wants to take it on?  Two non-compromised features
  might be cleaner -- I'm starting to feel like we're trying 

Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:
 
 We most particularly *don't* want DDL to replicate automatically,
 because the schema changes are deployed along with related
 software changes, and we like to pilot any changes for at least a
 few days.  Depending on the release, the rollout may take a
 couple months, or we may slam it in everywhere a few days after
 the first pilot deployment.
 That's sensible for your use-case - but I do not think that's
 the appropriate behaviour for anything which is somewhat
 out-of-the-box...
 
Right.  We currently consider the issues involved in a change and
script it by hand.  I think we want to continue to do that.  The
point was that, given the variety of timings and techniques for
distributing schema changes, maybe it was only worth doing that
automatically for the most constrained and controlled cases.
 
 So you could certainly punt all of this for any release as far as
 Wisconsin Courts are concerned.  We need to know table and column
 names, before and after images, and some application-supplied
 metadata.
 I am not sure we're going to get all that into 9.3.
 
Sure, that was more related to why I was questioning how much these
use cases even *could* integrate -- whether it even paid to
*consider* these use cases at this point.  Responses from Robert and
you have pretty much convinced me that I was being overly
pessimistic on that.
 
One fine point regarding before and after images -- if a value
doesn't change in an UPDATE, there's no reason to include it in both
the BEFORE and AFTER tuple images, as long as we have the null
column bitmaps -- or some other way of distinguishing unchanged from
NULL.  (This could be especially important when the unchanged column
was a 50 MB bytea.)  It doesn't matter to me which way this is
optimized -- in our trigger-based system we chose to keep the full
BEFORE tuple and just show AFTER values which were distinct from the
corresponding BEFORE values, but it would be trivial to adapt the
code to the other way.
 
Sorry for that bout of pessimism.
 
-Kevin



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Andres Freund
Hi,

On Monday, June 25, 2012 08:13:54 PM Robert Haas wrote:
 On Mon, Jun 25, 2012 at 1:50 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  It's an argument why related infrastructure would be interesting to more
  than that patch, and that's not bad.
  If the garbage collecting is done in a very simplistic manner it doesn't
  sound too hard... The biggest problem is probably crash-recovery of that
  knowledge and how to hook in the knowledge that logical rep still needs that
  data...
 I suppose the main reason we haven't done it already is that it
 increases the period of time during which we're using 2X the disk
 space.
I find that an acceptable price if it's optional. Making it such doesn't seem 
to be a problem for me.

   I don't think it's that easy. If you e.g. have multiple ALTER's in the
   same transaction interspersed with inserted rows they will all have
   different TupleDesc's.
  
  If new columns were added, then tuples created with those older
  tuple-descriptors can still be interpreted with the latest
  tuple-descriptor.
  
  But you need to figure that out. If you have just the before-after images
  of the tupledescs you don't know what happened in there... That would
  mean either doing special things on catalog changes or reassembling the
  meaning from the changed pg_* rows. Neither seems enticing.
 
 I think there is absolutely nothing wrong with doing extra things in
 ALTER TABLE when logical replication is enabled.  We've got code
 that's conditional on Hot Standby being enabled in many places in the
 system; why should logical replication be any different?  If we set
  the bar for logical replication at "the system can't do anything
  differently when logical replication is enabled", then I cheerfully
 submit that we are doomed.  You've already made WAL format changes to
 support logging the pre-image of the tuple, which is a hundred times
 more likely to cause a performance problem than any monkeying around
 we might want to do in ALTER TABLE.

 I am deeply skeptical that we need to look inside of transactions that
 do full-table rewrites.  But even if we do, I don't see that what I'm
 proposing precludes it.  For example, I think we could have ALTER
 TABLE emit WAL records specifically for logical replication that allow
 us to disentangle which tuple descriptor to use at which point in the
 transaction.  I don't see that that would even be very difficult to
 set up.
Sorry, I was imprecise above: I have no problem doing some changes during 
ALTER TABLE if logical rep is enabled. I am worried though that to make that 
robust you would need loads of places that emit additional information:
* ALTER TABLE
* ALTER FUNCTION
* ALTER OPERATOR
* ALTER/CREATE CAST
* TRUNCATE
* CLUSTER
* ...

I have the feeling that if we want to do that, the full amount of required 
information would be rather high and end up being essentially the catalog. 
Just having an intermediate tupledesc doesn't help that much if you have e.g. 
record_out doing type lookups of its own.
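
To illustrate that last point, this is roughly what textifying a
composite datum involves (heavily abbreviated from the spirit of
record_out(), error handling and memory management omitted; field and
header names as in the 9.3-era tree):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "utils/lsyscache.h"
    #include "utils/typcache.h"

    static void
    composite_output_lookups(HeapTupleHeader rec)
    {
        /* row type is looked up via the type cache -> catalog access */
        TupleDesc   tupdesc =
            lookup_rowtype_tupdesc(HeapTupleHeaderGetTypeId(rec),
                                   HeapTupleHeaderGetTypMod(rec));
        int         i;

        for (i = 0; i < tupdesc->natts; i++)
        {
            Oid     typoutput;
            bool    typisvarlena;

            /* per-column output functions -> more catalog access */
            getTypeOutputInfo(tupdesc->attrs[i]->atttypid,
                              &typoutput, &typisvarlena);
        }
        ReleaseTupleDesc(tupdesc);
    }

So whatever tupledesc the decoder hands over, the datatype code goes 
back to the catalogs anyway.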

There also is the issue you have talked about before, that a user-type might 
depend on values in other tables. Unless we're ready to break at least 
transactional behaviour for those (for now...) I don't see how decoding outside 
of the transaction is ever going to be valid. I wouldn't have a big problem 
declaring that as broken for now...

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-25 Thread Alvaro Herrera

Excerpts from Kevin Grittner's message of Mon Jun 25 14:50:54 -0400 2012:

 One fine point regarding before and after images -- if a value
 doesn't change in an UPDATE, there's no reason to include it in both
 the BEFORE and AFTER tuple images, as long as we have the null
 column bitmaps -- or some other way of distinguishing unchanged from
 NULL.  (This could be especially important when the unchanged column
 was a 50 MB bytea.)

Yeah, probably the best is to have the whole thing in BEFORE, and just
send AFTER values for those columns that changed, and include the
'replace' bool array (probably packed as a bitmap), so that the update
can be trivially constructed at the other end just like in
heap_modify_tuple.
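
A rough sketch of the apply side under that scheme (hypothetical
function; it assumes the BEFORE tuple, the changed AFTER values and
the 'replace' bitmap have already been unpacked from the change
record):

    #include "postgres.h"
    #include "access/htup_details.h"

    /*
     * Rebuild the full new tuple from the old image plus only the
     * changed columns, the same way heap_modify_tuple() is already
     * used elsewhere in the backend.
     */
    static HeapTuple
    apply_partial_update(HeapTuple before, TupleDesc tupdesc,
                         Datum *newvals, bool *newnulls, bool *replace)
    {
        return heap_modify_tuple(before, tupdesc,
                                 newvals, newnulls, replace);
    }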

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-24 Thread Andres Freund
On Thursday, June 21, 2012 01:41:25 PM Andres Freund wrote:
 Below are two possible implementation strategies for that concept
 
 Advantages:
 * Decoding is done on the master in an asynchronous fashion
 * low overhead during normal DML execution, not much additional code in
 that path
 * can be very efficient if architecture/version are the same
 * version/architecture compatibility can be done transparently by falling
 back to textual versions on mismatch
 
 Disadvantages:
 * decoding probably has to happen on the master which might not be what
 people want performancewise

 3b)
 Ensure that enough information in the catalog remains by fudging the xmin 
 horizon. Then reassemble an appropriate snapshot to read the catalog as
 the tuple in question has seen it.
 
 Advantages:
 * should be implementable with low impact to general code
 
 Disadvantages:
 * requires some complex code for assembling snapshots
 * it might be hard to guarantee that we always have enough information to 
 reassemble a snapshot (subxid overflows ...)
 * impacts vacuum if replication to some site is slow
There are some interesting problems related to locking and snapshots here. Not 
sure if they are resolvable:

We need to restrict SnapshotNow to represent the view it had back when the 
WAL record we're currently decoding was created. Otherwise we would possibly get wrong 
column types and similar. As we're working in the past, locking doesn't protect 
us against much here. I have that (mostly and inefficiently).
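
For the archives, a drastically simplified sketch of the kind of test
this needs (invented function; it ignores subtransactions, combo CIDs,
hint bits, multixacts and the decoded transaction's own changes, which
is where the real difficulty lies; macro names as in the 9.3-era tree):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "access/transam.h"

    /*
     * Is this catalog tuple visible as of the point in WAL currently
     * being decoded?  decode_next_xid stands for the xid horizon at
     * that WAL position: transactions at or beyond it had not yet
     * committed when the record being decoded was written.
     */
    static bool
    tuple_visible_at_decode_point(HeapTupleHeader tup,
                                  TransactionId decode_next_xid)
    {
        TransactionId xmin = HeapTupleHeaderGetXmin(tup);
        TransactionId xmax = HeapTupleHeaderGetRawXmax(tup);

        /* the inserter must have committed before the decode position */
        if (!TransactionIdPrecedes(xmin, decode_next_xid) ||
            !TransactionIdDidCommit(xmin))
            return false;

        /* visible unless a deleter also committed before that position */
        if (!TransactionIdIsValid(xmax))
            return true;
        if (!TransactionIdPrecedes(xmax, decode_next_xid) ||
            !TransactionIdDidCommit(xmax))
            return true;

        return false;
    }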

One interesting problem is table rewrites (truncate, cluster, some ALTER 
TABLE's) and dropping tables. Because we nudge SnapshotNow to the past view it 
had back when the WAL record was created we get the old relfilenode. Which 
might have been dropped as part of the transaction cleanup...
With most types that's not a problem. Even things like records and arrays 
aren't problematic. More interesting cases include VACUUM FULL $systable (e.g. 
pg_enum) and vacuum full'ing a table which is used in the *_out function of a 
type (like a user level pg_enum implementation).

The only theoretical way I see against that problem would be to postpone all 
relation unlinks until everything that could possibly read them has finished. 
Doesn't seem too alluring, although it would be needed if we ever move more 
things off SnapshotNow.

Input/Ideas/Opinions?

Greetings,

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-24 Thread Simon Riggs
On 24 June 2012 22:11, Andres Freund and...@2ndquadrant.com wrote:

 One interesting problem is table rewrites (truncate, cluster, some ALTER
 TABLE's) and dropping tables. Because we nudge SnapshotNow to the past view it
 had back when the WAL record was created we get the old relfilenode. Which
 might have been dropped as part of the transaction cleanup...
 With most types that's not a problem. Even things like records and arrays
 aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
 pg_enum) and vacuum full'ing a table which is used in the *_out function of a
 type (like a user level pg_enum implementation).

That's only a problem if you are generating changes to the relfilenode
rather than the relid.

ISTM that this step differs depending upon whether we are generating
portable SQL, or whether we are generating changes for immediate
apply. If it is the latter, then it should never actually happen
because if a table rewrite occurred and then committed we would never
need to re-read earlier WAL.

So treating this as a generic problem leads to some weird results that
we don't need to worry about cos they can't actually happen.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-24 Thread Simon Riggs
On 22 June 2012 20:30, Andres Freund and...@2ndquadrant.com wrote:

 The problem making replacement of SnapshotNow.satisfies useful is that there
 is no convenient way to represent subtransactions of the current transaction
 which already have committed according to the TransactionLog but aren't yet
 visible at the current LSN because they only started afterwards. It's
 relatively easy to fake this in an MVCC snapshot but way harder for
 SnapshotNow because you cannot mark transactions as in-progress.

I'm starting to like these ideas now.

We can solve many things by emitting a new WAL record type in any
subtransaction that issues catalog changes. That wasn't possible in
Hot Standby for performance reasons, but since we only care about
catalog changes those things are much rarer and wouldn't be a problem.
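
Something as small as the following would do for the record contents
(hypothetical layout, just to make the suggestion concrete):

    #include "postgres.h"

    /*
     * Emitted the first time a (sub)transaction modifies a system
     * catalog, so the decoding side knows it has to take that
     * subtransaction into account when it rebuilds the catalog
     * snapshot for a given point in WAL.
     */
    typedef struct xl_catalog_change
    {
        TransactionId top_xid;   /* top-level transaction */
        TransactionId sub_xid;   /* subtransaction doing the change */
        Oid           catalog;   /* pg_class OID of the catalog touched */
    } xl_catalog_change;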

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-24 Thread Andres Freund
On Sunday, June 24, 2012 11:37:26 PM Simon Riggs wrote:
 On 24 June 2012 22:11, Andres Freund and...@2ndquadrant.com wrote:
  One interesting problem is table rewrites (truncate, cluster, some ALTER
  TABLE's) and dropping tables. Because we nudge SnapshotNow to the past
  view it had back when the WAL record was created we get the old
  relfilenode. Which might have been dropped as part of the transaction
  cleanup...
  With most types that's not a problem. Even things like records and arrays
  aren't problematic. More interesting cases include VACUUM FULL $systable
  (e.g. pg_enum) and vacuum full'ing a table which is used in the *_out
  function of a type (like a user level pg_enum implementation).

 That's only a problem if you are generating changes to the relfilenode
 rather than the relid.
Hm. I can't follow so far. Could you paraphrase?

 ISTM that this step differs depending upon whether we are generating
 portable SQL, or whether we are generating changes for immediate
 apply.
I fear that generating changes only for immediate, low-level apply isn't going to 
fly, given the various interests people have voiced.

 If it is the latter, then it should never actually happen because if a table
 rewrite occurred and then committed we would never need to re-read earlier
 WAL.
 So treating this as a generic problem leads to some weird results that
 we don't need to worry about cos they can't actually happen.
Well, even if it were true that we don't need to worry about the state before 
a full-table rewrite - I don't think it is - we still need to be able to cope 
with CLUSTER or VACUUM FULL...

Greetings,

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-24 Thread Simon Riggs
On 24 June 2012 22:50, Andres Freund and...@2ndquadrant.com wrote:
 On Sunday, June 24, 2012 11:37:26 PM Simon Riggs wrote:
 On 24 June 2012 22:11, Andres Freund and...@2ndquadrant.com wrote:
  One interesting problem are table rewrites (truncate, cluster, some ALTER
  TABLE's) and dropping tables. Because we nudge SnapshotNow to the past
  view it had back when the wal record was created we get the old
  relfilenode. Which might have been dropped in part of the transaction
  cleanup...
  With most types thats not a problem. Even things like records and arrays
  aren't problematic. More interesting cases include VACUUM FULL $systable
  (e.g. pg_enum) and vacuum full'ing a table which is used in the *_out
  function of a type (like a user level pg_enum implementation).

 That's only a problem if you are generating changes to the relfilenode
 rather than the relid.
 Hm. I can't follow so far. Could you paraphrase?

Basically, whether we generate SQL or not.

 ISTM that this step differs depending upon whether we are generating
 portable SQL, or whether we are generating changes for immediate
 apply.
 I fear that generating changes only for immediate, low-level apply isn't going to
 fly, given the various interests people have voiced.

I'm OK with that, just checking what the objectives are.

 If it is the latter, then it should never actually happen because if a table
 rewrite occurred and then committed we would never need to re-read earlier
 WAL.
 So treating this as a generic problem leads to some weird results that
 we don't need to worry about cos they can't actually happen.
 Well, even if it were true that we don't need to worry about the state before
 a full-table rewritte - I don't think it is - we still need to be able to cope
 with CLUSTER or VACUUM FULL...

If you have a WAL record for a new relfilenode, then you don't need to
read the catalog at all.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-24 Thread Robert Haas
On Sun, Jun 24, 2012 at 5:11 PM, Andres Freund and...@2ndquadrant.com wrote:
 There are some interesting problems related to locking and snapshots here. Not
 sure if they are resolvable:

 We need to restrict SnapshotNow to represent to the view it had back when the
 wal record were currently decoding had. Otherwise we would possibly get wrong
 column types and similar. As were working in the past locking doesn't protect
 us against much here. I have that (mostly and inefficiently).

 One interesting problem are table rewrites (truncate, cluster, some ALTER
 TABLE's) and dropping tables. Because we nudge SnapshotNow to the past view it
 had back when the wal record was created we get the old relfilenode. Which
 might have been dropped in part of the transaction cleanup...
 With most types thats not a problem. Even things like records and arrays
 aren't problematic. More interesting cases include VACUUM FULL $systable (e.g.
 pg_enum) and vacuum full'ing a table which is used in the *_out function of a
 type (like a user level pg_enum implementation).

 The only theoretical way I see against that problem would be to postpone all
 relation unlinks untill everything that could possibly read them has finished.
 Doesn't seem to alluring although it would be needed if we ever move more
 things of SnapshotNow.

 Input/Ideas/Opinions?

Yeah, this is slightly nasty.  I'm not sure whether or not there's a
way to make it work.

I had another idea.  Suppose decoding happens directly on the primary,
because I'm still hoping there's a way to swing that.  Suppose further
that we handle DDL by insisting that (1) any backend which wants to
add columns or change the types of existing columns must first wait
for logical replication to catch up and (2) if a backend which has
added columns or changed the types of existing columns then writes to
the modified table, decoding of those writes will be postponed until
transaction commit.  I think that's enough to guarantee that the
decoding process can just use the catalogs as they stand, with plain
old SnapshotNow.
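
Very roughly, (1) could look like this on the DDL side. This is only a
sketch; every function name below is invented, none of this exists:

static void
WaitForLogicalDecodingToCatchUp(void)
{
	/* WAL position just before the catalog-changing DDL starts */
	XLogRecPtr	target = GetCurrentWALInsertPosition();		/* invented */

	/*
	 * Block until every logical decoding consumer has processed WAL up to
	 * 'target'; only then may the DDL take its lock and modify the catalog,
	 * so the decoder never interprets old rows using the new catalog.
	 */
	while (GetSlowestDecoderPosition() < target)			/* invented */
		pg_usleep(10000L);	/* 10ms; real code would sleep on a latch */
}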

The downside of this approach is that it makes certain kinds of DDL
suck worse if logical replication is in use and behind.  But I don't
necessarily see that as prohibitive because (1) logical replication
being behind is likely to suck for a lot of other reasons too and (2)
adding or retyping columns isn't a terribly frequent operation and
people already expect a hit when they do it.  Also, I suspect that we
could find ways to loosen those restrictions at least in common cases
in some future version; meanwhile, less work now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-24 Thread Amit Kapila
-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
Sent: Monday, June 25, 2012 6:39 AM
To: Andres Freund
Cc: pgsql-hackers@postgresql.org; Florian Pflug; Simon Riggs
Subject: Re: [HACKERS] Catalog/Metadata consistency during changeset
extraction from wal

On Sun, Jun 24, 2012 at 5:11 PM, Andres Freund and...@2ndquadrant.com
wrote:
 There are some interesting problems related to locking and snapshots here.
Not
 sure if they are resolvable:

 We need to restrict SnapshotNow to represent to the view it had back when
the
 wal record were currently decoding had. Otherwise we would possibly get
wrong
 column types and similar. As were working in the past locking doesn't
protect
 us against much here. I have that (mostly and inefficiently).

 One interesting problem are table rewrites (truncate, cluster, some ALTER
 TABLE's) and dropping tables. Because we nudge SnapshotNow to the past
view it
 had back when the wal record was created we get the old relfilenode. Which
 might have been dropped in part of the transaction cleanup...
 With most types thats not a problem. Even things like records and arrays
 aren't problematic. More interesting cases include VACUUM FULL $systable
(e.g.
 pg_enum) and vacuum full'ing a table which is used in the *_out function
of a
 type (like a user level pg_enum implementation).

 The only theoretical way I see against that problem would be to postpone
all
 relation unlinks untill everything that could possibly read them has
finished.
 Doesn't seem to alluring although it would be needed if we ever move more
 things of SnapshotNow.

 Input/Ideas/Opinions?

 Yeah, this is slightly nasty.  I'm not sure whether or not there's a
 way to make it work.

 I had another idea.  Suppose decoding happens directly on the primary,
 because I'm still hoping there's a way to swing that.  Suppose further
 that we handle DDL by insisting that (1) any backend which wants to
 add columns or change the types of existing columns must first wait
 for logical replication to catch up and (2) if a backend which has
 added columns or changed the types of existing columns then writes to
 the modified table, decoding of those writes will be postponed until
 transaction commit.  I think that's enough to guarantee that the
 decoding process can just use the catalogs as they stand, with plain
 old SnapshotNow.

How can we make DDL wait for logical replication to catch up?
One way could be to look at the WAL location up to which logical records have
been generated.
If we do this while holding table-level locks (which the DDL will require
anyway), then other DML will also get blocked.
And if we do it without holding locks and then decide to go ahead, other WAL
can be generated in the meantime and create a problem.

Or is there another way to make DDL wait?


With Regards,
Amit Kapila.




Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-22 Thread Amit Kapila
Andres Freund Sent: Thursday, June 21, 2012 5:11 PM
 4.)
 Log enough information in the walstream to make decoding possible using
only 
 the walstream.

What I understood is that enough information here means the catalog data. Is
that right, or is it something else?

 Advantages:
 * Decoding can optionally be done on the master
 * No catalog syncing/access required
 * its possible to make this architecture independent

 Disadvantage:
 * high to very high implementation overhead depending on efficiency aims
 * high space overhead in the wal because at least all the catalog
information 

In the multi-versioned catalog approach there will also be space overhead to
maintain the multiple versions, whichever variant of it is used.

  needs to be logged in a transactional manner repeatedly
Why does it need to be logged repeatedly? Once we log the catalog
information in WAL, we can disallow/block other DDL during that time, and
after that any changes to the catalog information can be retrieved from the
WAL alone.

 * misuses wal far more than other methods
What is the misuse in this? I believe it can later be used for log-mining
purposes as well.

 * significant new complexity in somewhat cricital code paths (heapam.c)
 * insanely high space overhead if the decoding should be possible
architecture 
  independent

Option 4 seems better than the others as a top-level approach to solving the
problem. Some other databases also use a similar approach for use cases like
the ones you have described.

With Regards,
Amit Kapila.




Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-22 Thread Simon Riggs
On 21 June 2012 12:41, Andres Freund and...@2ndquadrant.com wrote:

 2.)
 Keep the decoding site up2date by replicating the catalog via normal HS
 recovery
 mechanisms.

 Advantages:
 * most of the technology is already there
 * minimal overhead (space, performance)
 * no danger of out of sync catalogs
 * no support for command triggers required that can keep a catalog in sync,
 including oids

 Disadvantages:
 * driving the catalog recovery that way requires some somewhat intricate code
 as it needs to be done in lockstep with decoding the wal-stream
 * requires an additional feature to guarantee HS always has enough information
 to be queryable after a crash/shutdown
 * some complex logic/low-level fudging required to keep the transactional
 behaviour sensible when querying the catalog
 * full version/architecture compatibility required
 * the decoding site will always ever be only readable

My initial reaction was this won't work, but that requires
qualification since this is a complex topic: you can use this
approach as long as you realise that the catalog it gives you can never be
rewound.

So the generic approach to construct me a catalog as of this LSN
would need to start with a base backup of the catalog and then roll
forward to the appropriate LSN. Which means a generic user of this
approach would need to be able to construct an initial catalog using a
PITR.

Constructing a decoding site requires you to
a)  take a partial base backup of the catalog
b) apply WAL records to bring that forwards to the correct LSN, which
would require some alteration of the recovery code to skip the files
missing in a)

So taking the approach of a decoding site means we have to modify
recovery code, and even when we do that we still end up with a
difficult-to-deploy option in the real world. Difficult to deploy
because we need a whole new instance of Postgres, plus we need all of
the WAL files, which could easily be impractical.

The overall approach is good, but the architecture is wrong. What we
need is a catalog base backup and a means of rolling forward to the
appropriate LSN. Rolling forward using WAL is too bulky, so we need a
separate log of DDL changes to the catalog. So what we need is a
catalog base backup plus a ddl-log.

And we need to be able to reconstruct the correct catalog on the
target server directly.

To translate the WAL we maintain a secondary set of catalog tables,
which only exist for logical replication. These are normal,
non-versioned tables, but held in a new schema pg_logical or similar.
One reason why this must be a secondary catalog is to allow the
translation to take place on the target server, and to allow
translation of WAL from a prior version of the catalog - so we can
allow online upgrades across different catalog versions (and possibly
major versions).

The secondary catalog is used in place of the normal catalog during
InitCatalogCache() in the apply process. All the normal caches exist,
they just point to secondary relations rather than the normal ones.

When we initialise replication we take a copy of the appropriate
tables, columns and rows in a catalog-base-backup, using something
like pg_dump. Overall, this is much smaller than normal catalog since
it avoids temp tables, and anything not related to WAL translation.

On each non-temp change to the database we record changes as SQL in
the ddl-log, together with the LSN of the change.

When the number of changes in the ddl-log hits a limit we take a new
catalog-base-backup. This process is similar to a checkpoint, but much
less frequent; let's call it a ddl-checkpoint.

When we start to read WAL logs to translate them, we start by
truncating/re-bootstrapping and reloading the secondary catalog from
the base backup. We then apply all changes from the ddl-log (which is
just a sequence of SQL statements) up until the LSN at the start of
WAL. The secondary catalog is then an exact copy of the catalog as of
that LSN.

As we read through WAL we apply further changes to the secondary catalog
so it stays in lockstep with the WAL we are currently reading.

Having the catalog-base-backup and ddl-log allows reconstruction of the
catalog without needing to put the whole catalog into WAL at each checkpoint.
We can truncate old WAL segments and yet still recreate the DDL needed
to translate current WAL data. As a result, ddl-checkpoints are much
less frequent, perhaps weekly or monthly, rather than every few
minutes.

The whole process is similar in algorithm to recovery, but is just
normal userspace tables and SQL.
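
In code-ish terms the translation side would start roughly like this. A
sketch of the above design only; every name here is invented:

/* Bring the pg_logical.* secondary catalog to its state as of 'start_lsn'. */
static void
PrepareSecondaryCatalog(XLogRecPtr start_lsn)
{
	DdlLogEntry *entry;

	/* 1. wipe pg_logical.* and reload it from the newest catalog-base-backup
	 *    taken at or before start_lsn (a plain data load) */
	LoadCatalogBaseBackup(FindBaseBackupBefore(start_lsn));

	/* 2. replay the ddl-log: plain SQL statements, each tagged with the LSN
	 *    at which the original DDL took effect */
	for (entry = FirstDdlLogEntry();
		 entry != NULL && entry->lsn <= start_lsn;
		 entry = NextDdlLogEntry(entry))
		ExecuteOnSecondaryCatalog(entry->sql);

	/* pg_logical.* now matches the real catalog as of start_lsn; while WAL is
	 * read, later ddl-log entries are applied in lockstep as their LSN is
	 * reached */
}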

Constructing the correct catalog seems to be the heart of this
problem, so it is likely to take a while and look complex. Getting the
HS initial state was around 50% of the effort in making it all work,
so I guess it's similar here.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-22 Thread Andres Freund
Hi,

On Friday, June 22, 2012 08:48:41 AM Simon Riggs wrote:
 On 21 June 2012 12:41, Andres Freund and...@2ndquadrant.com wrote:
  2.)
  Keep the decoding site up2date by replicating the catalog via normal HS
  recovery
  mechanisms.
  
  Advantages:
  * most of the technology is already there
  * minimal overhead (space, performance)
  * no danger of out of sync catalogs
  * no support for command triggers required that can keep a catalog in
  sync, including oids
  
  Disadvantages:
  * driving the catalog recovery that way requires some somewhat intricate
  code as it needs to be done in lockstep with decoding the wal-stream
  * requires an additional feature to guarantee HS always has enough
  information to be queryable after a crash/shutdown
  * some complex logic/low-level fudging required to keep the transactional
  behaviour sensible when querying the catalog
  * full version/architecture compatibility required
  * the decoding site will always ever be only readable
 
 My initial reaction was this wont work, but that requires
 qualification since this is a complex topic:  You can use this
 approach as long as you realise that the catalog it gives can never be
 rewound.
Well, only as far as the min recovery point has been advanced. That's advanced 
less frequently than we apply xlog records.

 So the generic approach to construct me a catalog as of this LSN
 would need to start with a base backup of the catalog and then roll
 forward to the appropriate LSN. Which means a generic user of this
 approach would need to be able to construct an initial catalog using a
 PITR.

 Constructing a decoding site requires you to
 a)  take a partial base backup of the catalog
 b) apply WAL records to bring that forwards to the correct LSN, which
 would require some alteration of the recovery code to skip the files
 missing in a)
 
 So taking the approach of a decoding site means we have to modify
 recovery code, and even when we do that we still end up with a
 difficult to deploy option in the real world. Difficult to deploy
 becaus we need a whole new instance of Postgres, plus we need all of
 the WAL files, which could easily be impractical.
 
 The overall approach is good, but the architecture is wrong. What we
 need is a catalog base backup and a means of rolling forward to the
 appropriate LSN. Rolling forward using WAL is too bulky, so we need a
 separate log of DDL changes to the catalog. So what we need is a
 catalog base backup plus a ddl-log.
The idea was to store the applycache to disk every time UpdateMinRecoveryPoint 
is called. That way you wouldn't have to scroll back, even if the database 
crashes/is stopped hard.
But I agree, I don't like the architecture that much either.

 To translate the WAL we maintain a secondary set of catalog tables,
 which only exist for logical replication. These are normal,
 non-versioned tables, but held in a new schema pg_logical or similar.
 One reason why this must be a secondary catalog is to allow the
 translation to take place on the target server, and to allow
 translation of WAL from a prior version of the catalog - so we can
 allow online upgrades across different catalog versions (and possibly
 major versions).
 The secondary catalog is used in place of the normal catalog during
 InitCatalogCache() in the apply process. All the normal caches exist,
 they just point to secondary relations rather than the normal ones.
 When we initialise replication we take a copy of the appropriate
 tables, columns and rows in a catalog-base-backup, using something
 like pg_dump. Overall, this is much smaller than normal catalog since
 it avoids temp tables, and anything not related to WAL translation.
 
 On each non-temp change to the database we record changes as SQL in
 the ddl-log, together with the LSN of the change.
 
 When number of changes in ddl-log hits a limit we take a new
 catalog-base-backup. This process is similar to a checkpoint, but much
 less frequent, lets call it a ddl-checkpoint.
 
 When we start to read WAL logs to translate them, we start by
 truncating/re-bootstrapping and reloading the secondary catalog from
 the base backup. We then apply all changes from the ddl-log (which is
 just a sequence of SQL statements) up until the LSN at the start of
 WAL. The secondary catalog is then an exact copy of the catalog as of
 that LSN.
 
 As we read through WAL we apply further changes to secondary catalog
 so it maintains in lock step with the WAL we currently read.
I can't see how that's going to fly because the *_out functions use the 
syscache and also plain access to catalog tables. We would have to completely 
map oids to the alternative catalog.
For one, I think that mapping would involve far too many places (shared 
catalogs/relmapper, smgr, fd, syscache, ...). For another, you need to access 
those tables in a completely normal fashion from non-recovery backends, which 
means that we cannot just have duplicated oids hidden away somewhere.


 Constructing the 

Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-22 Thread Andres Freund
On Thursday, June 21, 2012 05:40:08 PM Andres Freund wrote:
 On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:
  On Jun21, 2012, at 13:41 , Andres Freund wrote:
   3b)
   Ensure that enough information in the catalog remains by fudging the
   xmin horizon. Then reassemble an appropriate snapshot to read the
   catalog as the tuple in question has seen it.
  
  The ComboCID machinery makes that quite a bit harder, I fear. If a tuple
  is updated multiple times by the same transaction, you cannot decide
  whether a tuple was visible in a certain snapshot unless you have access
  to the updating backend's ComboCID hash.
 
 Thats a very good point. Not sure how I forgot that.
 
 It think it might be possible to reconstruct a sensible combocid mapping
 from the walstream. Let me think about it for a while...
I have a very, very preliminary thing which seems to work somewhat. I just log 
(cmin, cmax) additionally for every modified catalog tuple into the wal and so 
far that seems to be enough.
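
For the record, what gets logged is roughly this; the struct name and layout
are made up for illustration, it is just what my hack writes next to the
normal heap record:

typedef struct xl_heap_catalog_cids
{
	RelFileNode		node;		/* catalog relation the tuple lives in */
	ItemPointerData	tid;		/* tuple that was inserted/updated/deleted */
	CommandId		cmin;		/* inserting command of the current xact */
	CommandId		cmax;		/* deleting command, or InvalidCommandId */
} xl_heap_catalog_cids;
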
Do you happen to have suggestions for other problematic things to look into 
before I put more time into it?

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-22 Thread Andres Freund
On Friday, June 22, 2012 03:22:03 PM Andres Freund wrote:
 On Thursday, June 21, 2012 05:40:08 PM Andres Freund wrote:
  On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:
   On Jun21, 2012, at 13:41 , Andres Freund wrote:
3b)
Ensure that enough information in the catalog remains by fudging the
xmin horizon. Then reassemble an appropriate snapshot to read the
catalog as the tuple in question has seen it.
   
   The ComboCID machinery makes that quite a bit harder, I fear. If a
   tuple is updated multiple times by the same transaction, you cannot
   decide whether a tuple was visible in a certain snapshot unless you
   have access to the updating backend's ComboCID hash.
  
  Thats a very good point. Not sure how I forgot that.
  
  It think it might be possible to reconstruct a sensible combocid mapping
  from the walstream. Let me think about it for a while...
 
 I have a very, very preliminary thing which seems to work somewhat. I just
 log (cmin, cmax) additionally for every modified catalog tuple into the
 wal and so far that seems to be enough.
 Do you happen to have suggestions for other problematic things to look into
 before I put more time into it?
I'm continuing to play around with this. The tricky bit so far is 
subtransaction handling in transactions which modify the catalog (plus possibly 
tables which are marked as being required for decoding, like a pg_enum 
equivalent).

Would somebody fundamentally object to one the following things:
1.
replace

#define IsMVCCSnapshot(snapshot)  \
	((snapshot)->satisfies == HeapTupleSatisfiesMVCC)

with something like

#define IsMVCCSnapshot(snapshot)  \
	((snapshot)->satisfies == HeapTupleSatisfiesMVCC || \
	 (snapshot)->satisfies == HeapTupleSatisfiesMVCCDuringDecode)

The define is only used sparingly and none of the code paths look so hot that 
this could make a difference.

2.
Set SnapshotNowData.satisfies to HeapTupleSatisfiesNowDuringRecovery while 
reading the catalog for decoding.

It's possible to go on without both, but the faking up of data gets quite a bit 
more complex.

The problem with making the replacement of SnapshotNow.satisfies useful is that 
there is no convenient way to represent subtransactions of the current 
transaction which have already committed according to the transaction log but 
aren't yet visible at the current lsn because they only started afterwards. It's 
relatively easy to fake this in an MVCC snapshot but way harder for 
SnapshotNow because you cannot mark transactions as in-progress.
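
For illustration, with an MVCC-style snapshot the faking looks about like
this (heavily simplified; the helper functions and
HeapTupleSatisfiesMVCCDuringDecode are the hypothetical pieces from 1. above):

static void
BuildDecodingSnapshot(Snapshot snap, XLogRecPtr lsn, TransactionId decoded_xid)
{
	snap->satisfies = HeapTupleSatisfiesMVCCDuringDecode;
	snap->xmin = OldestRunningXidAt(lsn);			/* invented helper */
	snap->xmax = FirstXidStartedAfter(lsn);			/* invented helper */

	/*
	 * Subtransactions of decoded_xid that have committed according to the
	 * clog, but whose changes lie after 'lsn', are simply listed in subxip,
	 * i.e. treated as still in progress, so their catalog changes stay
	 * invisible.  There is no equivalent trick for SnapshotNow.
	 */
	snap->subxcnt = CollectNotYetVisibleSubxids(decoded_xid, lsn,
												&snap->subxip);	/* invented */
}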

Thanks,

Andres

-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Florian Pflug
On Jun21, 2012, at 13:41 , Andres Freund wrote:
 3b)
 Ensure that enough information in the catalog remains by fudging the xmin 
 horizon. Then reassemble an appropriate snapshot to read the catalog as the 
 tuple in question has seen it.

The ComboCID machinery makes that quite a bit harder, I fear. If a tuple is
updated multiple times by the same transaction, you cannot decide whether a
tuple was visible in a certain snapshot unless you have access to the updating
backend's ComboCID hash.

best regards,
Florian Pflug




Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Florian Pflug
On Jun21, 2012, at 13:41 , Andres Freund wrote:
 5.)
 The actually good idea. Yours?

What about a mixture of (3b) and (4), which writes the data not to the WAL
but to a separate logical replication log? More specifically:

There's a per-backend queue of change notifications.

Whenever a non-catalog tuple is modified, we queue a TUPLE_MODIFIED
record containing (xid, databaseoid, tableoid, old xmin, old ctid, new ctid)

Whenever a table (or something that a table depends on) is modified we
wait until all references to that table's oid have vanished from the queue,
then queue a DDL record containing (xid, databaseoid, tableoid, text). Other
backends cannot concurrently add further TUPLE_MODIFIED records since we already
hold an exclusive lock on the table at that point.

A background process continually processes these queues. If the front of the
queue is a TUPLE_MODIFIED record, it fetches the old and the new tuple
based on their ctids and writes the old tuple's PK and the full new tuple
to the logical replication log. Since table modifications always wait for
all previously queued TUPLE_MODIFIED records referencing that table to be
processed *before* altering the catalog, tuples can always be interpreted
according to the current (SnapshotNow) catalog contents.

Upon transaction COMMIT and ROLLBACK, we queue COMMIT and ROLLBACK records,
which are also written to the log by the background process. The background
process may decide to wait until a backend commits before processing that
backend's log. In that case, rolled-back transactions don't leave a trace in
the logical replication log. Should a backend, however, issue a DDL statement,
the background process *must* process that backend's queue immediately, since
otherwise there's a deadlock.

The background process also maintains a value in shared memory which
contains the oldest value in any of the queues' xid or old xmin fields.
VACUUM and the like must not remove tuples whose xmin is >= that value.
Hint bits *may* be set for those newer tuples though, provided that the
background process ignores hint bits when fetching the old and new tuples.
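
For clarity, the queue entries would contain roughly this (names invented,
illustration only):

typedef enum QueueEntryKind
{
	QE_TUPLE_MODIFIED,
	QE_DDL,
	QE_COMMIT,
	QE_ROLLBACK
} QueueEntryKind;

typedef struct QueueEntry
{
	QueueEntryKind	kind;
	TransactionId	xid;
	Oid				databaseoid;
	Oid				tableoid;		/* unused for COMMIT/ROLLBACK */
	TransactionId	old_xmin;		/* TUPLE_MODIFIED only */
	ItemPointerData	old_ctid;		/* TUPLE_MODIFIED only */
	ItemPointerData	new_ctid;		/* TUPLE_MODIFIED only */
	char		   *ddl_text;		/* DDL only */
} QueueEntry;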

best regards,
Florian Pflug




Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Simon Riggs
On 21 June 2012 12:41, Andres Freund and...@2ndquadrant.com wrote:

 3)
 Multi-Versioned catalog

 Below are two possible implementation strategies for that concept

 Advantages:
 * Decoding is done on the master in an asynchronous fashion
 * low overhead during normal DML execution, not much additional code in that
 path
 * can be very efficient if architecture/version are the same
 * version/architecture compatibility can be done transparently by falling back
 to textual versions on mismatch

 Disadvantages:
 * decoding probably has to happen on the master which might not be what people
 want performancewise

 3a)
 Change the system catalogs to be versioned

 Advantages.
 * catalog access is easy
 * might be interesting for other users

 Disadvantages:
 * catalog versioning is complex to implement
 * space overhead for all users, even without using logical replication
 * I can't see -hackers signing off

Hmm, there's all sorts of stuff mixed up there in your description.

ISTM we should maintain a lookup table on target system that has the
minimal required information on it.

There is no need to version the whole catalog. (Complete overkill - I
would oppose it ;-)

If we keep the lookup table on the target as a normal table, we can
insert new rows into it as changes occur. If we need to perform
recovery then the earlier-version rows will still be there and we just
use those. Versioning is easy to implement: just use the LSN as an additional
key in the table, then look up based on key and LSN. If a transaction
that makes DDL changes aborts, then the changes will be automatically
backed out.

Only keep the lookup table if using logical replication, so zero
overhead otherwise. We just need to set up the initial state carefully,
so it matches what's in the database, but that sounds OK.
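
To spell out the lookup: each row carries the LSN from which its definition
is valid, and the apply process takes the newest row at or before the LSN it
is currently decoding. In C-ish pseudocode (illustrative only, names
invented, versions assumed sorted by that LSN):

typedef struct TableDefVersion
{
	XLogRecPtr	valid_from;		/* LSN of the DDL that created this version */
	void	   *definition;		/* whatever the apply process needs */
} TableDefVersion;

static void *
LookupDefinitionAsOf(TableDefVersion *versions, int nversions, XLogRecPtr lsn)
{
	void	   *result = NULL;
	int			i;

	for (i = 0; i < nversions && versions[i].valid_from <= lsn; i++)
		result = versions[i].definition;

	return result;		/* newest version with valid_from <= lsn, or NULL */
}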

So I don't see any of the disadvantages you have there. It's just darn
simple, and hence will probably work. It's also a very similar
solution to the other lookups required in memory by the apply process.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Andres Freund
Hi,

On Thursday, June 21, 2012 04:39:21 PM Simon Riggs wrote:
 On 21 June 2012 12:41, Andres Freund and...@2ndquadrant.com wrote:
  3)
  Multi-Versioned catalog
  
  Below are two possible implementation strategies for that concept
  
  Advantages:
  * Decoding is done on the master in an asynchronous fashion
  * low overhead during normal DML execution, not much additional code in
  that path
  * can be very efficient if architecture/version are the same
  * version/architecture compatibility can be done transparently by falling
  back to textual versions on mismatch
  
  Disadvantages:
  * decoding probably has to happen on the master which might not be what
  people want performancewise
  
  3a)
  Change the system catalogs to be versioned
  
  Advantages.
  * catalog access is easy
  * might be interesting for other users
  
  Disadvantages:
  * catalog versioning is complex to implement
  * space overhead for all users, even without using logical replication
  * I can't see -hackers signing off
 
 Hmm, there's all sorts of stuff mixed up there in your description.
Sure, I tried to compress a complex topic discussed in a long thread ;)

 ISTM we should maintain a lookup table on target system that has the
 minimal required information on it.
You need just about the whole catalog because the *_out procs might need to 
look up types, operators and such again. 
Unless you want to rewrite those functions you need to provide a normal 
execution environment.

I don't see how your idea works because of that. Am I missing something?

Yes, that would be easier if we didn't want to support conversion to text and 
similar, but I don't see that flying. And even if that were acceptable, you 
would need to have enough information to construct a btree ScanKey, which means 
you already need a lot of the catalogs.
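
Just to illustrate the ScanKey point: even the most trivial catalog probe
already needs the comparison strategy and the comparison function's oid,
i.e. catalog knowledge, before it can read anything. A simplified fragment
('typeoid' stands in for whatever is being looked up):

ScanKeyData skey;

ScanKeyInit(&skey,
			ObjectIdAttributeNumber,	/* key on the row's oid column */
			BTEqualStrategyNumber,		/* btree equality strategy */
			F_OIDEQ,					/* pg_proc oid of oideq() */
			ObjectIdGetDatum(typeoid));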

 There is no need to version the whole catalog. (Complete overkill - I
 would oppose it ;-)
Hey, that originally was your idea :P. But I definitely agree, it's not a good 
idea.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Simon Riggs
On 21 June 2012 15:53, Andres Freund and...@2ndquadrant.com wrote:

 ISTM we should maintain a lookup table on target system that has the
 minimal required information on it.

 You need just about the whole catalog because the *_out procs might need to
 lookup types, operators and such again.
 Unless you want to rewrite those functions you need to provide a normal
 execution environment.

OK, so it's more tables than I first thought, but it's not all rows and
columns of all catalog tables.


 I don't see how your idea works because of that? Am I missing something?

Why does the number/size of the tables required make that not work?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Andres Freund
On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:
 On 21 June 2012 15:53, Andres Freund and...@2ndquadrant.com wrote:
  ISTM we should maintain a lookup table on target system that has the
  minimal required information on it.
  
  You need just about the whole catalog because the *_out procs might need
  to lookup types, operators and such again.
  Unless you want to rewrite those functions you need to provide a normal
  execution environment.
 
 OK, so its more tables than I first thought, but its not all rows and
 columns of all catalog tables.
Sure, there are a few you probably can leave out (pg_database, pg_auth*, 
pg_*acl, pg_(sh)?depend, pg_tablespace, ...) but it's not many.

  I don't see how your idea works because of that? Am I missing something?
 Why does the number/size of the tables required make that not work?
The number of tables itself isn't a fundamental problem although it would make 
stuff harder.
The problem is that the out functions expect a normal operating environment 
and might e.g. do catalog lookups themselves. I don't see how we can do 
anything here without providing a (nearly) full catalog.
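
enum_out is the simplest example; from memory it is essentially just this
(paraphrased, not the literal source):

Datum
enum_out(PG_FUNCTION_ARGS)
{
	Oid			enumval = PG_GETARG_OID(0);
	HeapTuple	tup;
	char	   *result;

	/* a syscache lookup, which needs a catalog consistent with the data */
	tup = SearchSysCache1(ENUMOID, ObjectIdGetDatum(enumval));
	if (!HeapTupleIsValid(tup))
		elog(ERROR, "cache lookup failed for enum %u", enumval);

	result = pstrdup(NameStr(((Form_pg_enum) GETSTRUCT(tup))->enumlabel));
	ReleaseSysCache(tup);

	PG_RETURN_CSTRING(result);
}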

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Simon Riggs
On 21 June 2012 16:13, Andres Freund and...@2ndquadrant.com wrote:
 On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:
 On 21 June 2012 15:53, Andres Freund and...@2ndquadrant.com wrote:
  ISTM we should maintain a lookup table on target system that has the
  minimal required information on it.
 
  You need just about the whole catalog because the *_out procs might need
  to lookup types, operators and such again.
  Unless you want to rewrite those functions you need to provide a normal
  execution environment.

 OK, so its more tables than I first thought, but its not all rows and
 columns of all catalog tables.
 Sure, there are a few you probably can leave out (pg_database, pg_auth*,
 pg_*acl, pg_(sh)?depend, pg_database, pg_tablespace, ...) but its not many.

That's a start. Leaving out the shared catalogs makes me smile already.


  I don't see how your idea works because of that? Am I missing something?
 Why does the number/size of the tables required make that not work?
 The number of tables itself isn't a fundamental problem although it would make
 stuff harder.
 The problem is that the out functions expect a normal operating environment
 and might e.g. do catalog lookups themselves. I don't see how we can do
 anything here without providing a (nearly) full catalog.

I accept that there could be pathological functions in there. We're
not trying to make it work with any conceivable datatype/operator, so
forcing logical replication to follow sensible rules makes sense. Are
there any out functions that anybody uses that do that?

It's too much change to actually version the main catalog. Keeping a
separate copy of a versioned catalog for use by replication sounds
much more likely to fly.


In any case, I think we'll have to go back through the list and do
more work on evaluation. When the options look like that, it's typical
to have ruled out the final winner early on, but that doesn't mean it
isn't in there somewhere.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Andres Freund
On Thursday, June 21, 2012 05:25:41 PM Simon Riggs wrote:
 On 21 June 2012 16:13, Andres Freund and...@2ndquadrant.com wrote:
  On Thursday, June 21, 2012 05:05:04 PM Simon Riggs wrote:
  On 21 June 2012 15:53, Andres Freund and...@2ndquadrant.com wrote:
   ISTM we should maintain a lookup table on target system that has the
   minimal required information on it.
   
   You need just about the whole catalog because the *_out procs might
   need to lookup types, operators and such again.
   Unless you want to rewrite those functions you need to provide a
   normal execution environment.
  
  OK, so its more tables than I first thought, but its not all rows and
  columns of all catalog tables.
  
  Sure, there are a few you probably can leave out (pg_database, pg_auth*,
  pg_*acl, pg_(sh)?depend, pg_database, pg_tablespace, ...) but its not
  many.
 
 That's a start. Leaving out the shared catalogs makes me smile already.
 
   I don't see how your idea works because of that? Am I missing
   something?
  
  Why does the number/size of the tables required make that not work?
  
  The number of tables itself isn't a fundamental problem although it would
  make stuff harder.
  The problem is that the out functions expect a normal operating
  environment and might e.g. do catalog lookups themselves. I don't see
  how we can do anything here without providing a (nearly) full catalog.
 
 I accept that there could be pathological functions in there. We're
 not trying to make it work with any conceivable datatype/operator, so
 forcing logical replication to follow sensible rules makes sense. Are
 there any out functions that anybody uses that do that?
Loads. enum_out, record_out, array_out are examples I can think of without 
even looking. I am pretty sure there are more. But imo this list already shows 
it's prohibitive.

 It's too much change to actually version the main catalog. Keeping a
 separate copy of a versioned catalog for use by replication sounds
 much more likely to fly.
I don't yet see how that should work, given that oids and everything are quite 
possibly hardcoded in those functions. You could start switching out the 
catalogs on a lower level but I think at that point it's getting too ugly.

 In any case, I think we'll have to go back through the list and do
 more work on evaluation. When the options look like that, its typical
 to have ruled out the final winner early on, but that doesn't mean it
 isn't in there somewhere.
I hope we have but I am not convinced that there is an elegant solution...

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Andres Freund
On Thursday, June 21, 2012 03:56:54 PM Florian Pflug wrote:
 On Jun21, 2012, at 13:41 , Andres Freund wrote:
  3b)
  Ensure that enough information in the catalog remains by fudging the xmin
  horizon. Then reassemble an appropriate snapshot to read the catalog as
  the tuple in question has seen it.
 
 The ComboCID machinery makes that quite a bit harder, I fear. If a tuple is
 updated multiple times by the same transaction, you cannot decide whether a
 tuple was visible in a certain snapshot unless you have access to the
 updating backend's ComboCID hash.
That's a very good point. Not sure how I forgot that.

I think it might be possible to reconstruct a sensible combocid mapping from 
the walstream. Let me think about it for a while...

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-21 Thread Andres Freund
On Thursday, June 21, 2012 04:05:54 PM Florian Pflug wrote:
 On Jun21, 2012, at 13:41 , Andres Freund wrote:
  5.)
  The actually good idea. Yours?
 
 What about a mixure of (3b) and (4), which writes the data not to the WAL
 but to a separate logical replication log. More specifically:
 
 There's a per-backend queue of change notifications.
 
 Whenever a non-catalog tuple is modified, we queue a TUPLE_MODIFIED
 record containing (xid, databaseoid, tableoid, old xmin, old ctid, new
 ctid)
 
 Whenever a table (or something that a table depends on) is modified we
 wait until all references to that table's oid have vanished from the queue,
 then queue a DDL record containing (xid, databaseoid, tableoid, text).
 Other backend cannot concurrently add further TUPLE_MODIFIED records since
 we alreay hold an exclusive lock on the table at that point.
 
 A background process continually processes these queues. If the front of
 the queue is a TUPLE_MODIFIED record, it fetches the old and the new tuple
 based on their ctids and writes the old tuple's PK and the full new tuple
 to the logical replication log. Since table modifications always wait for
 all previously queued TUPLE_MODIFIED records referencing that table to be
 processes *before* altering the catalog, tuples can always be interpreted
 according to the current (SnapshotNow) catalog contents.
 
 Upon transaction COMMIT and ROLLBACK, we queue COMMIT and ROLLBACK records,
 which are also written to the log by the background process. The background
 process may decide to wait until a backend commits before processing that
 backend's log. In that case, rolled back transaction don't leave a trace in
 the logical replication log. Should a backend, however, issue a DDL
 statement, the background process *must* process that backend's queue
 immediately, since otherwise there's a dead lock.
 
 The background process also maintains a value in shared memory which
 contains the oldest value in any of the queue's xid or old xmin fields.
 VACUUM and the like must not remove tuples whose xmin is = that value.
 Hit bits *may* be set for newest tuples though, provided that the
 background process ignores hint bits when fetching the old and new tuples.
I think that's too complicated to fly. Getting that to recover cleanly in case 
of a crash would mean you'd need another WAL.

I think if it comes to that, going for 1) is more realistic...

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
