Re: [HACKERS] Changeset Extraction Interfaces

2014-01-08 Thread Andres Freund
On 2014-01-07 17:54:21 +0100, Andres Freund wrote:
 On 2013-12-12 16:49:33 +0100, Andres Freund wrote:
  On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
   On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund and...@2ndquadrant.com 
   wrote:
As far as I have been thinking of, this would be another catalog table 
like
pg_decoding_plugin(oid, dpname name, dpload regproc).
   
   Instead of adding another catalog table, I think we should just define
   a new type.  Again, please look at the way that foreign data wrappers
   do this:
  
  I don't really see what the usage of a special type has to do with this,
  but I think that's besides your main point. What you're saying is that
  the output plugin is just defined by a function name, possibly schema
  prefixed. That has an elegance to it. +1
 
 Ok, so I've implemented this, but I am not so sure it's sufficient;
 there's an issue:
 Currently a logical replication slot has a plugin assigned, previously
 that has just been identified by the basename of a .so. But with the
 above proposal the identifier is pointing to a function, currently via
 its oid. But what happens if somebody drops or recreates the function?
 We can't make pg_depend entries or anything since that won't work on a
 standby.
 Earlier, if somebody removed the .so we'd just error out, but pg's
 dependency tracking always only mattered to things inside the catalogs.
 
 I see the following possible solutions for this:
 
 1) accept that fact, and throw an error if the function doesn't exist
 anymore, or has an unsuitable signature. We can check the return type of
 output_plugin_callbacks, so that's a pretty specific test.
 
 2) Create a pg_output_plugin catalog and prevent DROP OUTPUT PLUGIN (or
 similar) when there's a slot defined. But how'd that work if the slot is
 only defined on standbys? We could have the redo routine block and/or
 kill the slot if necessary?
 
 3) Don't assign a specific output plugin to a slot, but have it
 specified every time data is streamed, not just when a slot is
 created. Currently that wouldn't be a problem, but I am afraid it will
 constrict some future optimizations.
 
 Good ideas?

So, Robert and I had an IM discussion about this. Neither of us was
particularly happy with the proposed solutions.

So, what we've concluded is that using a function as the handler doesn't
work out well enough given the constraints (primarily the inability to
create dependency records on a HS node). The best way forward is a
variant of the current implementation where the output plugin is
specified as a dynamic library:
CREATE_REPLICATION_SLOT slot_name LOGICAL OUTPUT_PLUGIN library_name;
but in contrast to the current code, where each individual output plugin
callback is dlsym()ed via a fixed function name, only a
_PG_output_plugin_init() function is looked up & called, which fills out
a struct containing the individual callbacks. Additionally, the init
and cleanup output plugin callbacks will be renamed to
startup/shutdown to avoid possible confusion.
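
To make that concrete, here is a minimal sketch of what such a library
could export; the struct layout and callback names below are
illustrative assumptions, not the actual interface from the patch:

/* Sketch only: struct layout and callback names are assumptions. */
typedef struct OutputPluginCallbacks
{
    void        (*startup_cb) (void);
    void        (*change_cb) (const char *change);
    void        (*shutdown_cb) (void);
} OutputPluginCallbacks;

static void
my_startup(void)
{
    /* set up per-stream state for this plugin */
}

static void
my_change(const char *change)
{
    /* format and emit one decoded change */
}

static void
my_shutdown(void)
{
    /* tear down per-stream state again */
}

/*
 * The single symbol the backend looks up via dlsym(); it merely fills
 * out the struct with the plugin's individual callbacks.
 */
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->startup_cb = my_startup;
    cb->change_cb = my_change;
    cb->shutdown_cb = my_shutdown;
}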

This unfortunately still prohibits implementing output plugins within
the core postgres binary, but that can be solved by shipping
core-provided output plugins - should they ever exist - as shared
objects, like it's already done for libpqwalreceiver.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2014-01-07 Thread Andres Freund
On 2013-12-12 16:49:33 +0100, Andres Freund wrote:
 On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
  On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
   As far as I have been thinking of, this would be another catalog table 
   like
   pg_decoding_plugin(oid, dpname name, dpload regproc).
  
  Instead of adding another catalog table, I think we should just define
  a new type.  Again, please look at the way that foreign data wrappers
  do this:
 
 I don't really see what the usage of a special type has to do with this,
 but I think that's besides your main point. What you're saying is that
 the output plugin is just defined by a function name, possibly schema
 prefixed. That has an elegance to it. +1

Ok, so I've implemented this, but I am not so sure it's sufficient;
there's an issue:
Currently a logical replication slot has a plugin assigned, previously
that has just been identified by the basename of a .so. But with the
above proposal the identifier is pointing to a function, currently via
its oid. But what happens if somebody drops or recreates the function?
We can't make pg_depend entries or anything since that won't work on a
standby.
Earlier, if somebody removed the .so we'd just error out, but pg's
dependency tracking always only mattered to things inside the catalogs.

I see the following possible solutions for this:

1) accept that fact, and throw an error if the function doesn't exist
anymore, or has an unsuitable signature. We can check the return type of
output_plugin_callbacks, so that's a pretty specific test.

2) Create a pg_output_plugin catalog and prevent DROP OUTPUT PLUGIN (or
similar) when there's a slot defined. But how'd that work if the slot is
only defined on standbys? We could have the redo routine block and/or
kill the slot if necessary?

3) Don't assign a specific output plugin to a slot, but have it
specified every time data is streamed, not just when a slot is
created. Currently that wouldn't be a problem, but I am afraid it will
constrict some future optimizations.

Good ideas?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2014-01-04 Thread Robert Haas
On Fri, Jan 3, 2014 at 10:12 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
 On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  I think there'll always be a bit of a difference between slots for
  physical and logical data, even if 90% of the implementation is the
  same. We can signal that difference by specifying logical/physical as an
  option or having two different sets of commands.
 
  Maybe?
 
  ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
  ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
  -- already exists without slot, PHYSICAL arguments
  START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
  START_REPLICATION SLOT LOGICAL slot plugin_options
  RELEASE_REPLICATION_SLOT slot_name

 I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
 but basically this seems OK to me.

 When writing the code for this, I decided that I need to renege a bit on
 those names - they don't work nicely enough on the C level for
 me. Specifically during a START_REPLICATION we need to temporarily mark
 the slot as being actively used and mark it unused again
 afterwards. That's much more Acquire/Release like than the persistent
 Acquire/Release above for me.

 The C names in the version I am working on currently are:
 extern void ReplicationSlotCreate(const char *name);
 extern void ReplicationSlotDrop(const char *name);
 extern void ReplicationSlotAcquire(const char *name);
 extern void ReplicationSlotRelease(void);
 extern void ReplicationSlotSave(void);

 which would make the walsender ones

 CREATE_REPLICATION_SLOT ...
 START_REPLICATION [SLOT slot] [LOGICAL | PHYSICAL] ...
 DROP_REPLICATION_SLOT ...

 where START_REPLICATION internally does acquire/release on the passed
 SLOT.

 Does that work for you?

Yep, no objections.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction Interfaces

2014-01-03 Thread Andres Freund
On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
 On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund and...@2ndquadrant.com wrote:
  I think there'll always be a bit of a difference between slots for
  physical and logical data, even if 90% of the implementation is the
  same. We can signal that difference by specifying logical/physical as an
  option or having two different sets of commands.
 
  Maybe?
 
  ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
  ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
  -- already exists without slot, PHYSICAL arguments
  START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
  START_REPLICATION SLOT LOGICAL slot plugin_options
  RELEASE_REPLICATION_SLOT slot_name
 
 I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
 but basically this seems OK to me.

When writing the code for this, I decided that I need to renege a bit on
those names - they don't work nicely enough on the C level for
me. Specifically during a START_REPLICATION we need to temporarily mark
the slot as being actively used and mark it unused again
afterwards. That's much more Acquire/Release like than the persistent
Acquire/Release above for me.

The C names in the version I am working on currently are:
extern void ReplicationSlotCreate(const char *name);
extern void ReplicationSlotDrop(const char *name);
extern void ReplicationSlotAcquire(const char *name);
extern void ReplicationSlotRelease(void);
extern void ReplicationSlotSave(void);

which would make the walsender ones

CREATE_REPLICATION_SLOT ...
START_REPLICATION [SLOT slot] [LOGICAL | PHYSICAL] ...
DROP_REPLICATION_SLOT ...

where START_REPLICATION internally does acquire/release on the passed
SLOT.
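
As a rough sketch of that control flow (error handling omitted;
everything except the slot functions above is a made-up name):

/* StreamChangesToClient() is a hypothetical stand-in for the real
 * streaming loop; the slot functions are the ones declared above. */
extern void StreamChangesToClient(void);

static void
StartLogicalReplication(const char *slotname)
{
    /* mark the slot as actively used for the duration of streaming */
    ReplicationSlotAcquire(slotname);

    /* stream decoded changes until the client goes away */
    StreamChangesToClient();

    /* mark it as unused again; error paths would have to do the same */
    ReplicationSlotRelease();
}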

Does that work for you?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-17 Thread Andres Freund
On 2013-12-16 23:01:16 -0500, Robert Haas wrote:
 On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
  Well, it still seems to me that the right way to think about this is
  that the change stream begins at a certain point, and then once you
  cross a certain threshold (all transactions in progress at that time
  have ended) any subsequent snapshot is a possible point from which to
  roll forward.
 
  Unfortunately it's not possible to build exportable snapshots at any
  time - it requires keeping far more state around since we need to care
  about all transactions, not just transactions touching the
  catalog. Currently you can only export the snapshot in the one point we
  become consistent, after that we stop maintaining that state.
 
 I don't get it.  Once all the old transactions are gone, I don't see
 why you need any state at all to build an exportable snapshot.  Just
 take a snapshot.

The state we're currently decoding, somewhere in already fsynced WAL,
won't correspond to the state in the procarray. There might be
situations where it will, but we can't guarantee that we ever reach that
point without taking locks that will be problematic.

 The part that you're expressing willingness to do sounds entirely
 satisfactory to me.  As I mentioned on the other thread, I'm perhaps
 even willing to punt that feature entirely provided that we have a
 clear design for how to add it later, but I think it'd be nicer to get
 it done now.

We'll see what the next version looks like. Not sure on that myself yet
;)

 And just for the record, I think the idea that I am holding this patch
 hostage is absurd.  I have devoted a large amount of time and energy
 to moving this forward and plan to devote more.  Because of that work,
 big chunks of what is needed here are already committed. If my secret
 plan is to make it as difficult as possible for you to get this
 committed, I'm playing a deep game.

I am not saying at all that you're planning to stop the patch from
getting in. You've delivered pretty clear proof that that's not the
case.
But that doesn't prevent us from arguing over details and disagreeing
whether they are dealbreakers or not, does it ;)

I think you know that I am hugely grateful for the work you've put into
the topic.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-17 Thread Robert Haas
On Tue, Dec 17, 2013 at 4:31 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-16 23:01:16 -0500, Robert Haas wrote:
 On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
  Well, it still seems to me that the right way to think about this is
  that the change stream begins at a certain point, and then once you
  cross a certain threshold (all transactions in progress at that time
  have ended) any subsequent snapshot is a possible point from which to
  roll forward.
 
  Unfortunately it's not possible to build exportable snapshots at any
  time - it requires keeping far more state around since we need to care
  about all transactions, not just transactions touching the
  catalog. Currently you can only export the snapshot in the one point we
  become consistent, after that we stop maintaining that state.

 I don't get it.  Once all the old transactions are gone, I don't see
 why you need any state at all to build an exportable snapshot.  Just
 take a snapshot.

 The state we're currently decoding, somewhere in already fsynced WAL,
 won't correspond to the state in the procarray. There might be
 situations where it will, but we can't guarantee that we ever reach that
 point without taking locks that will be problematic.

You don't need to guarantee that.  Just take a current snapshot and
then throw away (or don't decode in the first place) any transactions
that would be visible to that snapshot.  This is simpler and more
flexible, and possibly more performant, too, because with your design
you'll have to hold back xmin to the historical snapshot you build
while copying the table rather than to a current snapshot.

I really think we should consider whether we can't get by with ripping
out the build-an-exportable-snapshot code altogether.  I don't see
that it's really buying us much.  We need a way for the client to know
when decoding has reached the point where it is guaranteed complete -
i.e. all transactions in progress at the time decoding was initiated
have ended.  We also need a way for a backend performing decoding to
take a current MVCC snapshot, export it, and send the identifier to
the client.  And we need a way for the client to know whether any
given one of those snapshots includes a particular XID we may have
decoded.  But I think all of that might still be simpler than what you
have now, and it's definitely more flexible.
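
A sketch of just the filtering rule that approach implies; both helper
functions here are hypothetical and only stand in for the idea:

#include "postgres.h"
#include "utils/snapshot.h"

/* hypothetical helpers */
extern bool SnapshotSeesXidAsCommitted(Snapshot snap, TransactionId xid);
extern void replay_changes_for_xid(TransactionId xid);

/*
 * When rolling forward from an exported snapshot, skip decoded
 * transactions whose effects that snapshot already sees: they are
 * already contained in the base copy.
 */
static void
apply_decoded_xact(Snapshot exported_snap, TransactionId xid)
{
    if (SnapshotSeesXidAsCommitted(exported_snap, xid))
        return;

    replay_changes_for_xid(xid);
}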

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-16 Thread Robert Haas
On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
 On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  But can you imagine those users needing an exported snapshot? I can
  think of several short-lived usages, but all of those are unlikely to
  need a consistent view of the overall database. And those are less
  likely to be full blown replication solutions.
  I.e. it's not the DBA making that decision but the developer making the
  decision based on whether he requires the snapshot or not.

 Well, it still seems to me that the right way to think about this is
 that the change stream begins at a certain point, and then once you
 cross a certain threshold (all transactions in progress at that time
 have ended) any subsequent snapshot is a possible point from which to
 roll forward.

 Unfortunately it's not possible to build exportable snapshots at any
 time - it requires keeping far more state around since we need to care
 about all transactions, not just transactions touching the
 catalog. Currently you can only export the snapshot in the one point we
 become consistent, after that we stop maintaining that state.

I don't get it.  Once all the old transactions are gone, I don't see
why you need any state at all to build an exportable snapshot.  Just
take a snapshot.

 1. Slots.  We know we need physical slots as well as logical slots,
 but the patch as currently constituted only offers logical slots.

 Well, then tell me the way you want to go forward on that end. I can
 make the slot interface more generic if we know exactly what we need,
 but it doesn't seem fair to take this patch hostage until I develop a
 separate, not so small, feature. Why is that my task?
 Because I think it's important, and because by now I know the related
 code pretty well, I am willing to provide the parts that prevent
 required WAL from being deleted, peg xmin and report the current
 state to SQL, but somebody else is going to have to do the rest.

The part that you're expressing willingness to do sounds entirely
satisfactory to me.  As I mentioned on the other thread, I'm perhaps
even willing to punt that feature entirely provided that we have a
clear design for how to add it later, but I think it'd be nicer to get
it done now.

And just for the record, I think the idea that I am holding this patch
hostage is absurd.  I have devoted a large amount of time and energy
to moving this forward and plan to devote more.  Because of that work,
big chunks of what is needed here are already committed.  If my secret
plan is to make it as difficult as possible for you to get this
committed, I'm playing a deep game.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-14 Thread Robert Haas
On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 If you imagine a scenario where somebody establishes a replication
 slot and then keeps it forever, not often.  But if you're trying to do
 something more ad hoc, where replication slots might be used just for
 short periods of time and then abandoned, I think it could come up
 pretty frequently.

 But can you imagine those users needing an exported snapshot? I can
 think of several short-lived usages, but all of those are unlikely to
 need a consistent view of the overall database. And those are less
 likely to be full blown replication solutions.
 I.e. it's not the DBA making that decision but the developer making the
 decision based on whether he requires the snapshot or not.

Well, it still seems to me that the right way to think about this is
that the change stream begins at a certain point, and then once you
cross a certain threshold (all transactions in progress at that time
have ended) any subsequent snapshot is a possible point from which to
roll forward.  You'll need to avoid applying any transactions that are
already included during the snapshot, but I don't really think that's
any great matter.  You're focusing on the first point at which the
consistent snapshot can be taken, and on throwing away any logical
changes that might have been available before that point so that they
don't have to be ignored in the application code, but I think that's
myopic.

For example, suppose somebody is replicating tables from node A to node
B.  And then they decide to replicate some of the same tables to node
C.  Well, one way to do this is to have node C connect to node A and
acquire its own slot, but that means decoding everything twice.
Alternatively, you could reuse the same change stream, but you'll need
a new snapshot to roll forward from.  That doesn't seem like a problem
unless the API makes it a problem.

 Generally, I think you're being too dismissive of the stuff I'm
 complaining about here.  If we just can't get this, well then I
 suppose we can't.

 I think I am just scared of needing to add more features before getting
 the basics done and in consequence overrunning 9.4...

I am sensitive to that.  On the other hand, this API is going to be a
lot harder to change once it's released, so we really need to avoid
painting ourselves into a corner with v1.  As far as high-level design
concerns go, there are three things that I'm not happy with:

1. Slots.  We know we need physical slots as well as logical slots,
but the patch as currently constituted only offers logical slots.
2. Snapshot Management.  This issue.
3. Incremental Decoding.  So that we can begin applying a really big
transaction speculatively before it's actually committed.

I'm willing to completely punt #3 as far as 9.4 is concerned, because
I see a pretty clear path to fixing that later.  I am not yet
convinced that either of the other two can or should be postponed.

 Right.  I think your idea is good, but maybe there should also be a
 version of the function that never confirms receipt even if the
 transaction commits.  That would be useful for ad-hoc poking at the
 queue.

 Ok, that sounds easy enough, maybe
 pg_decoding_slot_get_[binary_]changes()
 pg_decoding_slot_peek_[binary_]changes()
 ?

s/pg_decoding_slot/pg_logical_stream/?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-14 Thread Andres Freund
On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
 On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote:
  But can you imagine those users needing an exported snapshot? I can
  think of several short-lived usages, but all of those are unlikely to
  need a consistent view of the overall database. And those are less
  likely to be full blown replication solutions.
  I.e. it's not the DBA making that decision but the developer making the
  decision based on whether he requires the snapshot or not.
 
 Well, it still seems to me that the right way to think about this is
 that the change stream begins at a certain point, and then once you
 cross a certain threshold (all transactions in progress at that time
 have ended) any subsequent snapshot is a possible point from which to
 roll forward.

Unfortunately it's not possible to build exportable snapshots at any
time - it requires keeping far more state around since we need to care
about all transactions, not just transactions touching the
catalog. Currently you can only export the snapshot in the one point we
become consistent, after that we stop maintaining that state.

I see pretty much no chance to change that without major effort that's
imo clearly out of scope for 9.4. Maybe we want to provide that at some
point, but it's going to be a bit.
Also note that in order to import a snapshot, the transaction that
exported it still has to be alive and holding that snapshot. So, until
the snapshot has been imported, replication cannot progress. Those are
existing limitations of the snapshot import code, but the reasoning
behind them makes it unlikely that we can easily change them.

Under that light, usecases like your example won't be able to benefit
from getting changes before the snapshot, or do you still see usecases
for that?

I've already prototyped the quickstart option that doesn't allow
exporting a snapshot, although I don't know how the UI for it is going
to look.

  I think I am just scared of needing to add more features before getting
  the basics done and in consequence overrunning 9.4...
 
 I am sensitive to that.  On the other hand, this API is going to be a
 lot harder to change once it's released, so we really need to avoid
 painting ourselves into a corner with v1.

I think we need to accept the fact that quite possibly we won't get
everything right in 9.4. Obviously we should strive to avoid that, but
trying to design things perfectly usually ends in not getting there.
Yes, the API changes for FDWs caused pain. But there's no way we would
have the current features if we'd tried to introduce the correct API
allowing them from the get go.

  As far as high-level design
 concerns go, there are three things that I'm not happy with:
 
 1. Slots.  We know we need physical slots as well as logical slots,
 but the patch as currently constituted only offers logical slots.

Well, then tell me the way you want to go forward on that end. I can
make the slot interface more generic if we know exactly what we need,
but it doesn't seem fair to take this patch hostage until I develop a
separate, not so small, feature. Why is that my task?
Because I think it's important, and because by now I know the related
code pretty well, I am willing to provide the parts that prevent
required WAL from being deleted, peg xmin and report the current
state to SQL, but somebody else is going to have to do the rest.

 s/pg_decoding_slot/pg_logical_stream/?

Works for me.

Regards,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-14 Thread Jim Nasby

On 12/12/13 11:13 AM, Robert Haas wrote:

I think it sucks (that's the technical term) to have to wait for all
currently-running transactions to terminate before being able to begin
streaming changes, because that could take a long time.  And you might
well know that the long-running transaction which is rolling up
enormous table A that you don't care about is never going to touch
table B which you actually want to replicate.  Now, ideally, the DBA
would have a way to ignore that long-running transaction and force
replication to start, perhaps with the caveat that if that
long-running transaction actually does touch B after all then we have
to resync.  Your model's fine when we want to replicate the whole
database, but a big part of why I want this feature is to allow
finer-grained replication, down to the table level, or even slices of
tables.


I know you're not going to attempt this for 9.4, but I want to mention a 
related case here. I've often wanted the ability to limit the tables a 
transaction can touch, so that it will not interfere with vacuuming other 
tables.

This would be useful when you have some tables that see very frequent 
updates/deletes in a database that also has to support long-running 
transactions that don't hit those tables. You'd explicitly limit the tables 
your long-running transaction will touch and that way vacuum can ignore the 
long-running XID when calculating minimum tuple age for the heavy-hit tables.

If we had that capability it could also be used to improve the time required to 
get a snapshot for a limited set of tables.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-13 Thread Robert Haas
On Thu, Dec 12, 2013 at 1:52 PM, Andres Freund and...@2ndquadrant.com wrote:
 Puh. I honestly have zero confidence in DBAs making an informed decision
 about something like this. Honestly, for a replication solution, how
 often do you think this will be an issue?

If you imagine a scenario where somebody establishes a replication
slot and then keeps it forever, not often.  But if you're trying to do
something more ad hoc, where replication slots might be used just for
short periods of time and then abandoned, I think it could come up
pretty frequently.  Generally, I think you're being too dismissive of
the stuff I'm complaining about here.  If we just can't get this, well
then I suppose we can't.  But I think the amount of time that it takes
Hot Standby to open for connections is an issue, precisely because
it's got to wait until certain criteria are met before it can
establish a snapshot, and sometimes that takes an unpleasantly long
time.  I think it unlikely that we can export that logic to this case
also and experience no pain as a result.

In fact, I think that even restricting things to streaming changes
from transactions started after we initiate replication is going to be
an annoying amount of delay for some purposes.  People will accept it
because, no matter how you slice it, this is an awesome new
capability.  Full stop.  That having been said, I don't find it at all
hard to imagine someone wanting to jump into the replication stream at
an arbitrary point in time and see changes from every transaction that
*commits* after that point, even if it began earlier, or even to see
changes from transactions that have not yet committed as they happen.
I realize that's asking for a pony, and I'm not saying you have to go
off and do that right now in order for this to move forward, or indeed
that it will ever happen at all.  What I am saying is that I find it
entirely likely that people are going to push the limits of this
thing, that this is one of the limits I expect them to push, and that
the more we can do to put policy in the hands of the user without
pre-judging the sanity of what they're trying to do, the happier we
(and our users) will be.

  It's not too difficult to provide an option to do that. What I've been
  thinking of was to correlate the confirmation of consumption with the
  transaction the SRF is running in. So, confirm the data as consumed if
  it commits, and don't if not. I think we could do that relatively easily
  by registering a XACT_EVENT_COMMIT.

 That's a bit too accident-prone for my taste.  I'd rather the DBA had
 some equivalent of peek_at_replication(nchanges int).

 One point for my suggested behaviour is that it closes a bigger
 race condition. Currently, as soon as start_logical_replication() has
 finished building the tuplestore it marks the end position as
 received. But we very well can fail before the user has received all
 those changes.

Right.  I think your idea is good, but maybe there should also be a
version of the function that never confirms receipt even if the
transaction commits.  That would be useful for ad-hoc poking at the
queue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-13 Thread Andres Freund
On 2013-12-13 08:30:41 -0500, Robert Haas wrote:
 On Thu, Dec 12, 2013 at 1:52 PM, Andres Freund and...@2ndquadrant.com wrote:
  Puh. I honestly have zero confidence in DBAs making an informed decision
  about something like this. Honestly, for a replication solution, how
  often do you think this will be an issue?
 
 If you imagine a scenario where somebody establishes a replication
 slot and then keeps it forever, not often.  But if you're trying to do
 something more ad hoc, where replication slots might be used just for
 short periods of time and then abandoned, I think it could come up
 pretty frequently.

But can you imagine those users needing an exported snapshot? I can
think of several short-lived usages, but all of those are unlikely to
need a consistent view of the overall database. And those are less
likely to be full blown replication solutions.
I.e. it's not the DBA making that decision but the developer making the
decision based on whether he requires the snapshot or not.

 Generally, I think you're being too dismissive of the stuff I'm
 complaining about here.  If we just can't get this, well then I
 suppose we can't.

I think I am just scared of needing to add more features before getting
the basics done and in consequence overrunning 9.4...

 But I think the amount of time that it takes
 Hot Standby to open for connections is an issue, precisely because
 it's got to wait until certain criteria are met before it can
 establish a snapshot, and sometimes that takes an unpleasantly long
 time.

The HS situation is quite different though. There the problem is hit on
every restart, not just the initial start; suboverflowed xl_running_xacts
records can delay HS startup for a long time; and we don't start up at the
exact point where we know an xl_running_xacts record has just been logged.

FWIW, I think we need to fix HS to store its visibility state at
restartpoints, so we get rid of that problem. Not sure who will have
time to do that though.

 That having been said, I don't find it at all
 hard to imagine someone wanting to jump into the replication stream at
 an arbitrary point in time and see changes from every transaction that
 *commits* after that point, even if it began earlier, or even to see
 changes from transactions that have not yet committed as they happen.

I agree they are desirable, but all those will require additional state
to be kept about running transactions. That's not saying it
shouldn't/cannot be done, to the contrary, just that it requires some
engineering effort.

 I realize that's asking for a pony, and I'm not saying you have to go
 off and do that right now in order for this to move forward, or indeed
 that it will ever happen at all.  What I am saying is that I find it
 entirely likely that people are going to push the limits of this
 thing, that this is one of the limits I expect them to push, and that
 the more we can do to put policy in the hands of the user without
 pre-judging the sanity of what they're trying to do, the happier we
 (and our users) will be.

Completely agreed. I really think this is the most basic building block,
missing many important features. And we'll be busy for some time adding
those.

  One point for my suggested behaviour is that it closes a bigger
  race condition. Currently, as soon as start_logical_replication() has
  finished building the tuplestore it marks the end position as
  received. But we very well can fail before the user has received all
  those changes.
 
 Right.  I think your idea is good, but maybe there should also be a
 version of the function that never confirms receipt even if the
 transaction commits.  That would be useful for ad-hoc poking at the
 queue.

Ok, that sounds easy enough, maybe
pg_decoding_slot_get_[binary_]changes()
pg_decoding_slot_peek_[binary_]changes()
?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Andres Freund
Hello Robert,

On 2013-12-11 22:29:46 -0500, Robert Haas wrote:
 On Wed, Dec 4, 2013 at 7:15 PM, Andres Freund and...@2ndquadrant.com wrote:
  There's basically three major 'verbs' that can be performed on a
  stream, currently named (walsender names):
  * INIT_LOGICAL_REPLICATION name output_plugin
  * START_LOGICAL_REPLICATION name last_received (option_name value,...)
  * FREE_LOGICAL_REPLICATION name
 
  The SQL variant currently has:
  * init_logical_replication(name, plugin)
  * start_logical_replication(name, stream_upto, options[])
  * stop_logical_replication(name)
 
  You might have noticed the slight inconsistency...
 
 I think this naming is probably not the greatest.

Completely agreed.

 When I hear init,
 I don't think permanently allocate a resource that will never be
 released unless I explicitly throw it away, and when I hear stop, I
 don't think free that resource.  I suggest naming based around
 create/drop, register/unregister, or acquire/release.  Since, as
 previously noted, I'm gunning for these slots to apply to ordinary
 replication as well, I kind of like ACQUIRE_REPLICATION_SLOT and
 RELEASE_REPLICATION_SLOT.  If we're going to make them specific to
 logical replication, then your suggestion of CREATE_DECODING_SLOT and
 DROP_DECODING_SLOT, or maybe ACQUIRE_DECODING_SLOT and
 RELEASE_DECODING_SLOT, sounds fine.

I think there'll always be a bit of a difference between slots for
physical and logical data, even if 90% of the implementation is the
same. We can signal that difference by specifying logical/physical as an
option or having two different sets of commands.

Maybe?

ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
-- already exists without slot, PHYSICAL arguments
START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
START_REPLICATION SLOT LOGICAL slot plugin_options
RELEASE_REPLICATION_SLOT slot_name

 It also strikes me that just as it's possible to stream WAL without
 allocating a slot first (since we don't at present have slots),
 perhaps it ought also to be possible to stream logical replication
 data without acquiring a slot first.  You could argue that it was a
 mistake not to introduce slots in the first place, but the stateless
 nature of WAL streaming definitely has some benefits, and it's unclear
 to me why you shouldn't be able to do the same thing with logical
 decoding.

I think it would be quite a bit harder for logical decoding. The
difference is that, from the perspective of the walsender, for plain WAL
streaming, all that needs to be checked is whether the WAL is still
there. For decoding though, we need to be sure that a) the catalog xmin
is still low enough and has been all along, and b) that we are able to
instantly build a historical MVCC snapshot from the point we want to start
streaming.
Both a) and b) are solved by keeping the xmin and the point where to
reread WAL from in the slot data and by serializing data about
historical snapshots to disk. But those are removed if there isn't a
slot around requiring them...
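
Roughly, a slot has to persist something like the following across
restarts for decoding to remain restartable (field names here are
illustrative assumptions, not the patch's actual struct):

#include "postgres.h"
#include "access/xlogdefs.h"

typedef struct LogicalSlotStateSketch
{
    NameData        name;           /* slot identifier */
    NameData        plugin;         /* output plugin to load */
    TransactionId   catalog_xmin;   /* a): keeps needed catalog rows alive */
    XLogRecPtr      restart_lsn;    /* b): oldest WAL to reread on restart */
    XLogRecPtr      confirmed_flush;    /* position the consumer confirmed */
} LogicalSlotStateSketch;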

So what you could get is something that starts streaming you changes
sometime after you asked it to start streaming, without a guarantee that
you can restart at exactly the position you stopped. If that's useful,
we can do it, but I am not sure what the usecase would be?

  b) Decide which of the SQL functions should be in a contrib module, and
 which in core. Currently init_logical_replication() and
 stop_logical_replication() are in core, whereas
 start_logical_replication() is in the 'test_logical_decoding'
 extension. The reasoning behind that is that init/stop ones are
 important to the DBA and the start_logical_replication() SRF isn't
 all that useful in the real world because our SRFs don't support
 streaming changes out.
 
 Seems pretty arbitrary to me.  If start_logical_replication() is
 usable with any output plugin, then it ought to be in core.

Ok, I certainly have no problem with that.

 I think
 the name isn't great, though; the actual functionality seems to be
 more or less decode-from-last-position-up-to-present, which doesn't
 sound much like start.

The name originally comes from the START_REPLICATION walsender command,
but I agree, there's not much point in trying to keep that name.

I am also open to different behaviour for the SRF, but I am not sure
what that could be. There's just no sensible way to stream data on the
SQL level afaics.

What about pg_decoding_slot_get_[binary_]changes()?

  c) Which data-types does start_logical_replication() return. Currently
  it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
  column text has some obvious disadvantages though - there's obvious
  usecases for output plugins that return binary data. But making it bytea
  sucks, because the output is harder to read by default...

 I think having two functions might be sensible.  I'm not sure what
 happens if the text 

Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Robert Haas
On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund and...@2ndquadrant.com wrote:
 I think there'll always be a bit of a difference between slots for
 physical and logical data, even if 90% of the implementation is the
 same. We can signal that difference by specifying logical/physical as an
 option or having two different sets of commands.

 Maybe?

 ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
 ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
 -- already exists without slot, PHYSICAL arguments
 START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
 START_REPLICATION SLOT LOGICAL slot plugin_options
 RELEASE_REPLICATION_SLOT slot_name

I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
but basically this seems OK to me.  I hadn't realized that the options
were going to be different for logical vs. physical.  So you could
also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
not sure whether that's better.

 It also strikes me that just as it's possible to stream WAL without
 allocating a slot first (since we don't at present have slots),
 perhaps it ought also to be possible to stream logical replication
 data without acquiring a slot first.  You could argue that it was a
 mistake not to introduce slots in the first place, but the stateless
 nature of WAL streaming definitely has some benefits, and it's unclear
 to me why you shouldn't be able to do the same thing with logical
 decoding.

 I think it would be quite a bit harder for logical decoding. The
 difference is that, from the perspective of the walsender, for plain WAL
 streaming, all that needs to be checked is whether the WAL is still
 there. For decoding though, we need to be sure that a) the catalog xmin
 is still low enough and has been all along, and b) that we are able to
 instantly build a historical MVCC snapshot from the point we want to start
 streaming.
 Both a) and b) are solved by keeping the xmin and the point where to
 reread WAL from in the slot data and by serializing data about
 historical snapshots to disk. But those are removed if there isn't a
 slot around requiring them...

 So what you could get is something that starts streaming you changes
 sometime after you asked it to start streaming, without a guarantee that
 you can restart at exactly the position you stopped. If that's useful,
 we can do it, but I am not sure what the usecase would be?

I haven't yet looked closely at the snapshot-building stuff, but my
thought is that you ought to be able to decode any transactions that
start after you make the connection.  You might not be able to decode
transactions that are already in progress at that point, because you
might have already missed XID assignment records, catalog changes,
etc. that they've performed.  But transactions that begin after that
point ought to be OK.  I have a feeling you're going to tell me it
doesn't work like that, but maybe it should, because there's a whole
lot of benefit in having decoding start up quickly, and a whole lot of
benefit also to having the rules for that be easy to understand.

Now if you have that, then I think ad-hoc decoding is potentially
useful.  Granted, you're not going to want to build a full-fledged
replication solution that way, but you might want to just connect and
watch the world stream by... or you might imagine an application that
opens a replication connection and a regular connection, copies a
table, and then applies the stream of changes made to that table after
the fact.  When that completes, the table is sync'd between the two
machines as of the end of the copy.  Useful enough to bother with?  I
don't know.  But not obviously useless.

 I am also open to different behaviour for the SRF, but I am not sure
 what that could be. There's just no sensible way to stream data on the
 SQL level afaics.

I don't have a problem with the behavior.  Seems useful.  One useful
addition might be to provide an option to stream out up to X changes
but without consuming them, so that the DBA can peek at the
replication stream.  I think it's a safe bet DBAs will want to do
things like that, so it'd be nice to make it easy, if we can.

 What about pg_decoding_slot_get_[binary_]changes()?

Sounds about right, but I think we need to get religion about figuring
out what terminology to use.  At the moment it seems to vary quite a
bit between logical, logical decoding, and decoding.  Not sure
how to nail that down.

As a more abstract linguistic question, what do we think the
difference is between logical *replication* and logical *decoding*?
Are they the same or different?  If different, how?

 I wonder if we should let the output plugin tell us whether it will
 output data in binary? I think it generally would be a good idea to let
 the output plugin's _init() function return some configuration
 data. That will make extending the interface to support more features
 easier.

Maybe, but 

Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Andres Freund
On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
 On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund and...@2ndquadrant.com wrote:
  I think there'll always be a bit of a difference between slots for
  physical and logical data, even if 90% of the implementation is the
  same. We can signal that difference by specifying logical/physical as an
  option or having two different sets of commands.
 
  Maybe?
 
  ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
  ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
  -- already exists without slot, PHYSICAL arguments
  START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
  START_REPLICATION SLOT LOGICAL slot plugin_options
  RELEASE_REPLICATION_SLOT slot_name
 
 I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
 but basically this seems OK to me.

Uh, yes.

 I hadn't realized that the options were going to be different for
 logical vs. physical.

I don't see how we could avoid that, there just are some differences
between both.

 So you could
 also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
 START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
 not sure whether that's better.

Not sure either, but I slightly favor keeping the toplevel slot
commands the same. I think we'll want one namespace for both and
possibly similar reporting functions, and that seems less surprising if
they are treated more similarly.

  So what you could get is something that starts streaming you changes
  sometime after you asked it to start streaming, without a guarantee that
  you can restart at exactly the position you stopped. If that's useful,
  we can do it, but I am not sure what the usecase would be?
 
 I haven't yet looked closely at the snapshot-building stuff, but my
 thought is that you ought to be able to decode any transactions that
 start after you make the connection.  You might not be able to decode
 transactions that are already in progress at that point, because you
 might have already missed XID assignment records, catalog changes,
 etc. that they've performed.  But transactions that begin after that
 point ought to be OK.

It works mostly like that, yes. At least on primaries. When we start
decoding, we jot down the current xlog insertion pointer to know where
to start decoding from, then trigger an xl_running_xacts record to be
logged so we have enough information. Then we start reading from that
point onwards. On standbys the process is the same, just that we have to
wait for the primary to issue an xl_running_xacts record.
(I had considered starting with information from the procarray, but it
turns out that's hard to do without race conditions.)
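
Compressed into code, the primary-side startup sequence looks roughly
like this (GetXLogInsertRecPtr() and LogStandbySnapshot() are existing
backend functions; error handling and the standby wait are omitted):

#include "postgres.h"
#include "access/xlog.h"
#include "storage/standby.h"

static XLogRecPtr
start_decoding_sketch(void)
{
    XLogRecPtr  startptr;

    /* remember where decoding will start reading WAL */
    startptr = GetXLogInsertRecPtr();

    /* force an xl_running_xacts record so we know which transactions
     * are currently in progress */
    (void) LogStandbySnapshot();

    /* ... hand WAL records from startptr onwards to the decoder ... */
    return startptr;
}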

We only decode changes in transactions that commit after the last
transaction that was in progress when we started observing has finished,
though. That allows us to export a snapshot at the point the last
still-running transaction finished, showing a state of the database that
can be rolled forward exactly by the changes contained in the change
stream. I think that's a useful property for the majority of cases.

If we were to start out streaming changes before the last running
transaction has finished, they would be visible in that exported
snapshot and you couldn't use it to roll forward from anymore.

It'd be pretty easy to optionally decode the transactions we currently
skip if we want that feature later. That would remove the option to
export a snapshot in many cases though (think suboverflowed snapshots).

 I have a feeling you're going to tell me it
 doesn't work like that, but maybe it should, because there's a whole
 lot of benefit in having decoding start up quickly, and a whole lot of
 benefit also to having the rules for that be easy to understand.

I am not sure if the above qualifies as "doesn't work like that"; if
not, sometimes the correct thing isn't the immediately obvious thing. I
think "all transactions that were running when decoding was initiated
need to finish" is reasonably easy to explain.

  I am also open to different behaviour for the SRF, but I am not sure
  what that could be. There's just no sensible way to stream data on the
  SQL level afaics.

 I don't have a problem with the behavior.  Seems useful.  One useful
 addition might be to provide an option to stream out up to X changes
 but without consuming them, so that the DBA can peek at the
 replication stream.  I think it's a safe bet DBAs will want to do
 things like that, so it'd be nice to make it easy, if we can.

It's not too difficult to provide an option to do that. What I've been
thinking of was to correlate the confirmation of consumption with the
transaction the SRF is running in. So, confirm the data as consumed if
it commits, and don't if not. I think we could do that relatively easily
by registering a XACT_EVENT_COMMIT.
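
A rough sketch of that idea; RegisterXactCallback() and
XACT_EVENT_COMMIT are the existing backend facilities, while
ConfirmDecodingUpTo() and the rest are hypothetical names used only for
illustration:

#include "postgres.h"
#include "access/xact.h"
#include "access/xlogdefs.h"

/* hypothetical: would advance the slot's confirmed position */
extern void ConfirmDecodingUpTo(XLogRecPtr lsn);

/* set by the SRF once it has finished building its tuplestore */
static XLogRecPtr pending_confirm = InvalidXLogRecPtr;

static void
decoding_consumption_callback(XactEvent event, void *arg)
{
    /* only mark the streamed changes as consumed if the surrounding
     * transaction actually commits */
    if (event == XACT_EVENT_COMMIT && pending_confirm != InvalidXLogRecPtr)
    {
        ConfirmDecodingUpTo(pending_confirm);
        pending_confirm = InvalidXLogRecPtr;
    }
}

/* registered once, e.g. the first time the SRF runs:
 *     RegisterXactCallback(decoding_consumption_callback, NULL);
 */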

  What about pg_decoding_slot_get_[binary_]changes()?

 Sounds about right, but I think we need to get religion about figuring
 out what terminology to use.  At the 

Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Robert Haas
On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund and...@2ndquadrant.com wrote:
 I hadn't realized that the options were going to be different for
 logical vs. physical.

 I don't see how we could avoid that, there just are some differences
 between both.

Right, I'm not complaining, just observing that it was a point I had overlooked.

 So you could
 also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
 START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
 not sure whether that's better.

 Not sure either, but I slightly favor keeping the toplevel slot
 commands the same. I think we'll want one namespace for both and
 possibly similar reporting functions, and that seems less surprising if
 they are treated more similarly.

OK.

 If we were to start out streaming changes before the last running
 transaction has finished, they would be visible in that exported
 snapshot and you couldn't use it to roll forward from anymore.

Actually, you could.  You'd just have to throw away any transactions
whose XIDs are visible to the exported snapshot.  In other words, you
begin replication at time T0, and all transactions which begin after
that time are included in the change stream.  At some later time T1,
all transactions in progress at time T0 have ended, and now you can
export a snapshot at that time, or any later time, from which you can
roll forward.  Any change-stream entries for XIDs which would be
visible to that snapshot shouldn't be replayed when rolling forward
from it, though.

I think it sucks (that's the technical term) to have to wait for all
currently-running transactions to terminate before being able to begin
streaming changes, because that could take a long time.  And you might
well know that the long-running transaction which is rolling up
enormous table A that you don't care about is never going to touch
table B which you actually want to replicate.  Now, ideally, the DBA
would have a way to ignore that long-running transaction and force
replication to start, perhaps with the caveat that if that
long-running transaction actually does touch B after all then we have
to resync.  Your model's fine when we want to replicate the whole
database, but a big part of why I want this feature is to allow
finer-grained replication, down to the table level, or even slices of
tables.

So imagine this.  After initiating logical replication, a replication
solution either briefly x-locks a table it wants to replicate, so that
there can't be anyone else touching it, or it observes who has a lock
>= RowExclusiveLock and waits for all of those locks to drop away.  At
that point, it knows that no currently-in-progress transaction can
have modified the table prior to the start of replication, and begins
copying the table.  If a transaction that began before the start of
replication subsequently modifies the table, a WAL record will be
written, and the core logical decoding support could let the plugin
know by means of an optional callback (hey, btw, a change I can't
decode just hit table XYZ).  The plugin will need to respond by
recopying the table, which sucks, but it was the plugin's decision to
be optimistic in the first place, and that will in many cases be a
valid policy decision.  If no such callback arrives before the
safe-snapshot point, then the plugin made the right bet and will reap
the just rewards of its optimism.

 I don't have a problem with the behavior.  Seems useful.  One useful
 addition might be to provide an option to stream out up to X changes
 but without consuming them, so that the DBA can peek at the
 replication stream.  I think it's a safe bet DBAs will want to do
 things like that, so it'd be nice to make it easy, if we can.

 It's not too difficult to provide an option to do that. What I've been
 thinking of was to correlate the confirmation of consumption with the
 transaction the SRF is running in. So, confirm the data as consumed if
 it commits, and don't if not. I think we could do that relatively easily
 by registering a XACT_EVENT_COMMIT.

That's a bit too accident-prone for my taste.  I'd rather the DBA had
some equivalent of peek_at_replication(nchanges int).

 Sounds about right, but I think we need to get religion about figuring
 out what terminology to use.  At the moment it seems to vary quite a
 bit between logical, logical decoding, and decoding.  Not sure
 how to nail that down.

 Agreed. Perhaps we should just avoid both logical and decoding entirely
 and go for changestream or similar?

So wal_level=changestream?  Not feeling it.  Of course we don't have
to be 100% rigid about this, but we should try to make our terminology
correspond with natural semantic boundaries.  Maybe we should call
the process logical decoding, and the results logical streams, or
something like that.

 As a more abstract linguistic question, what do we think the
 difference is between logical *replication* and logical *decoding*?
 Are they the same or different?  If different, how?

Re: [HACKERS] Changeset Extraction Interfaces

2013-12-12 Thread Andres Freund
On 2013-12-12 12:13:24 -0500, Robert Haas wrote:
 On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  If we were to start out streaming changes before the last running
  transaction has finished, they would be visible in that exported
  snapshot and you couldn't use it to roll forward from anymore.
 
 Actually, you could.  You'd just have to throw away any transactions
 whose XIDs are visible to the exported snapshot.  In other words, you
 begin replication at time T0, and all transactions which begin after
 that time are included in the change stream.  At some later time T1,
 all transactions in progress at time T0 have ended, and now you can
 export a snapshot at that time, or any later time, from which you can
 roll forward.  Any change-stream entries for XIDs which would be
 visible to that snapshot shouldn't be replayed when rolling forward
 from it, though.
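
(Purely for illustration, the rule described in the quoted paragraph might
look like this; XidVisibleInExportedSnapshot is a hypothetical stand-in, not
an existing function:)

#include "postgres.h"
#include "utils/snapshot.h"

/* hypothetical visibility test, standing in for whatever the real check is */
extern bool XidVisibleInExportedSnapshot(TransactionId xid, Snapshot snap);

/*
 * Roll-forward rule: replay a decoded transaction only if its xid is not
 * already visible to the snapshot exported at T1.
 */
static bool
should_replay(TransactionId xid, Snapshot exported_snapshot)
{
    return !XidVisibleInExportedSnapshot(xid, exported_snapshot);
}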

But that would become too complex an interface, imo without a
corresponding benefit. If you skip the changes when rolling forward,
there's no point in streaming them out in the first place.

 I think it sucks (that's the technical term) to have to wait for all
 currently-running transactions to terminate before being able to begin
 streaming changes, because that could take a long time.

I don't think there's much of an alternative for replication solutions;
for other usecases, we may want to add an option to skip the wait. It's
not like that's something you do all the time. As soon as a slot has
been acquired, there are no further waits.

 And you might
 well know that the long-running transaction which is rolling up
 enormous table A that you don't care about is never going to touch
 table B which you actually want to replicate.  Now, ideally, the DBA
 would have a way to ignore that long-running transaction and force
 replication to start, perhaps with the caveat that if that
 long-running transaction actually does touch B after all then we have
 to resync.

Phew. I honestly have zero confidence in DBAs making an informed decision
about something like this. Honestly, for a replication solution, how
often do you think this will be an issue?

 So imagine this.  After initiating logical replication, a replication
 solution either briefly x-locks a table it wants to replicate, so that
 there can't be anyone else touching it, or it observes who has a lock
 >= RowExclusiveLock and waits for all of those locks to drop away.  At
 that point, it knows that no currently-in-progress transaction can
 have modified the table prior to the start of replication, and begins
 copying the table.  If a transaction that began before the start of
 replication subsequently modifies the table, a WAL record will be
 written, and the core logical decoding support could let the plugin
 know by means of an optional callback (hey, btw, a change I can't
 decode just hit table XYZ).  The plugin will need to respond by
 recopying the table, which sucks, but it was the plugin's decision to
 be optimistic in the first place, and that will in many cases be a
 valid policy decision.  If no such callback arrives before the
 safe-snapshot point, then the plugin made the right bet and will reap
 the just rewards of its optimism.

Sure, all that's possible. But hell, it's complicated to use. If reality
proves people want this, let's go there, but let's get the basics right
and committed first.

All the logic around whether to decode a transaction is:
void
SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
   int nsubxacts, TransactionId *subxacts)
...
    if (builder->state < SNAPBUILD_CONSISTENT)
    {
        /* ensure that only commits after this are getting replayed */
        if (builder->transactions_after < lsn)
            builder->transactions_after = lsn;
and then

/*
 * Should the contents of a transaction ending at 'ptr' be decoded?
 */
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr <= builder->transactions_after;
}

so it's not like it will require all too many changes.
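
(Sketch of the consumer side of that test, following the functions quoted
above; the wrapper name is illustrative, not the patch's actual code:)

/*
 * When a commit record is decoded, the whole transaction is dropped if it
 * ended before the slot became consistent.
 */
static void
decode_commit(SnapBuild *builder, XLogRecPtr commit_lsn)
{
    if (SnapBuildXactNeedsSkip(builder, commit_lsn))
        return;                 /* ended too early: nothing gets streamed out */

    /* otherwise hand the reorder-buffered changes to the output plugin */
}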

What I can see as possibly getting into 9.4 is a FASTSTART option that
doesn't support exporting a snapshot, but doesn't have to wait for the
SNAPBUILD_CONSISTENT state in return. That's fine for some usecases,
although I don't think for any of the major ones.

  It's not too difficult to provide an option to do that. What I've been
  thinking of was to correlate the confirmation of consumption with the
  transaction the SRF is running in. So, confirm the data as consumed if
  it commits, and don't if not. I think we could do that relatively easily
  by registering an XACT_EVENT_COMMIT callback.
 
 That's a bit too accident-prone for my taste.  I'd rather the DBA had
 some equivalent of peek_at_replication(nchanges int).

One point for my suggested behaviour is that it closes a bigger race
condition. Currently as soon as start_logical_replication() has
finished 

Re: [HACKERS] Changeset Extraction Interfaces

2013-12-11 Thread Robert Haas
On Wed, Dec 4, 2013 at 7:15 PM, Andres Freund and...@2ndquadrant.com wrote:
 There's basically three major 'verbs' that can be performed on a
 stream, currently named (walsender names):
 * INIT_LOGICAL_REPLICATION name output_plugin
 * START_LOGICAL_REPLICATION name last_received (option_name value,...)
 * FREE_LOGICAL_REPLICATION name

 The SQL variant currently has:
 * init_logical_replication(name, plugin)
 * start_logical_replication(name, stream_upto, options[])
 * stop_logical_replication(name)

 You might have noticed the slight inconsistency...

I think this naming is probably not the greatest.  When I hear "init",
I don't think "permanently allocate a resource that will never be
released unless I explicitly throw it away", and when I hear "stop", I
don't think "free that resource".  I suggest naming based around
create/drop, register/unregister, or acquire/release.  Since, as
previously noted, I'm gunning for these slots to apply to ordinary
replication as well, I kind of like ACQUIRE_REPLICATION_SLOT and
RELEASE_REPLICATION_SLOT.  If we're going to make them specific to
logical replication, then your suggestion of CREATE_DECODING_SLOT and
DROP_DECODING_SLOT, or maybe ACQUIRE_DECODING_SLOT and
RELEASE_DECODING_SLOT, sounds fine.

It also strikes me that just as it's possible to stream WAL without
allocating a slot first (since we don't at present have slots),
perhaps it ought also to be possible to stream logical replication
data without acquiring a slot first.  You could argue that it was a
mistake not to introduce slots in the first place, but the stateless
nature of WAL streaming definitely has some benefits, and it's unclear
to me why you shouldn't be able to do the same thing with logical
decoding.

 b) Decide which of the SQL functions should be in a contrib module, and
which in core. Currently init_logical_replication() and
stop_logical_replication() are in core, whereas
start_logical_replication() is in the 'test_logical_decoding'
extension. The reasoning behind that is that init/stop ones are
important to the DBA and the start_logical_replication() SRF isn't
all that useful in the real world because our SRFs don't support
streaming changes out.

Seems pretty arbitrary to me.  If start_logical_replication() is
usable with any output plugin, then it ought to be in core.  I think
the name isn't great, though; the actual functionality seems to be
more or less decode-from-last-position-up-to-present, which doesn't
sound much like start.

 c) Which data types does start_logical_replication() return? Currently
 it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
 column text has some obvious disadvantages though - there are obvious
 usecases for output plugins that return binary data. But making it bytea
 sucks, because the output is harder to read by default...

I think having two functions might be sensible.  I'm not sure what
happens if the text function is used and the plugin outputs something
that's not valid in the database encoding, though.  I guess you'd
better check for that and error out.
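
(The backend already has a check suitable for that; a sketch of how the
text-returning variant could guard the plugin's output using the existing
pg_verifymbstr():)

#include "postgres.h"
#include "mb/pg_wchar.h"

/*
 * Before handing plugin output back as text, verify that it is valid in
 * the database encoding; with 'false' as the third argument,
 * pg_verifymbstr() errors out on invalid input.
 */
static void
check_plugin_output_is_text(const char *data, int len)
{
    (void) pg_verifymbstr(data, len, false);
}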

 d) is my main question, and Robert, Peter G. and I previously argued
 about it a fair bit. I know of the following alternatives:

 I) The output plugin that's specified in INIT_LOGICAL_REPLICATION is
 actually a library name, and we simply look up the fixed symbol names in
 it. That's what's currently implemented.
 The advantage is that it's pretty easy to implement, works on a HS
 standby without involving the primary, and doesn't have a problem if the
 library is used in shared_preload_libraries.
 The disadvantages are: All output plugins need to be shared libraries
 and there can only be one output plugin per shared library (although you
 could route differently, via options, but ugh).

I still dislike this.

 II) Keep the output plugin a library, but only look up a
 _PG_init_output_plugin() which registers/returns the callbacks. Pretty
 much the same tradeoffs as I)

 III) Keep the output plugin a library, but simply rely on _PG_init()
 calling a function to register all callbacks. Imo it's worse than I) and
 II) because it basically prohibits using the library in
 shared_preload_libraries as well, because then its _PG_init() doesn't
 get called when starting to stream, and another library might have
 registered other callbacks.

I don't understand the disadvantage that you're describing here.  What
I'm imagining is that you have some struct that looks like this:

struct output_plugin
{
    char name[64];
    void (*begin)(args);
    void (*change)(args);
    void (*commit)(args);
};

Now you provide a function RegisterOutputPlugin(output_plugin *).  If
there are any output plugins built into core, core will call
RegisterOutputPlugin once for each one.  If a shared library
containing an output plugin is loaded, the library's _PG_init function
does the same thing.  When someone tries to use a plugin, they ask for
it by name.  We go iterate through the data 
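
(A minimal sketch of the registry being described, using the struct above;
the fixed-size table, limits, and lookup function name are illustrative:)

#include "postgres.h"

#define MAX_OUTPUT_PLUGINS 32       /* arbitrary limit for the sketch */

static struct output_plugin registered_plugins[MAX_OUTPUT_PLUGINS];
static int  n_registered_plugins = 0;

void
RegisterOutputPlugin(struct output_plugin *plugin)
{
    if (n_registered_plugins >= MAX_OUTPUT_PLUGINS)
        elog(ERROR, "too many output plugins");
    registered_plugins[n_registered_plugins++] = *plugin;
}

static struct output_plugin *
lookup_output_plugin(const char *name)
{
    int         i;

    for (i = 0; i < n_registered_plugins; i++)
        if (strcmp(registered_plugins[i].name, name) == 0)
            return &registered_plugins[i];
    return NULL;                /* not registered: error, or try loading a library */
}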

[HACKERS] Changeset Extraction Interfaces

2013-12-04 Thread Andres Freund
Hi,

Short recap:

From the perspective of the user interface, the changeset extraction
feature consists of two abstract interfaces that the user has to
deal with:

1) The slot or changestream management interface, which manages
individual streams of changes. The user can create and destroy a
changestream and, most importantly, stream the changes.

Simplified, a logical replication slot is a position in the WAL and a
bunch of state associated with it. As long as a slot exists, the user
can ask for all changes that happened since the last time he asked to
be streamed out.

It is abstract, because different usecases require the changes to be
streamed out via different methods. The series contains two
implementations of that interface:
I) One integrated into walsender that allows for efficient streaming,
   including support for synchronous replication.
II) Another that is accessible via SQL functions, very useful for
writing pg_regress/isolationtester tests.

It is, with a relatively low amount of code, possible to add other such
interfaces without touching core code. One example, which has been asked
for by a number of people, is consuming the changestream in a background
worker without involving SQL or connecting to a walsender.

There's basically three major 'verbs' that can be performed on a
stream, currently named (walsender names):
* INIT_LOGICAL_REPLICATION name output_plugin
* START_LOGICAL_REPLICATION name last_received (option_name value,...)
* FREE_LOGICAL_REPLICATION name

The SQL variant currently has:
* init_logical_replication(name, plugin)
* start_logical_replication(name, stream_upto, options[])
* stop_logical_replication(name)

You might have noticed the slight inconsistency...

2) The output plugin interface, which transforms a changestream
(begin, change, commit) into the desired target format.

There are 5 callbacks, 3 of them obligatory:
* pg_decode_init(context, is_initial) [optional]
* pg_decode_begin(context, txn)
* pg_decode_change(context, txn, relation, change)
* pg_decode_commit(context, txn)
* pg_decode_cleanup(context) [optional]

Every output plugin can be used from every slot management
interface.
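
(A bare-bones skeleton following that callback list; the argument types are
placeholders, since the message only names them, so treat the signatures as
a sketch:)

#include "postgres.h"
#include "utils/rel.h"

/* pg_decode_init / pg_decode_cleanup omitted here; they are optional */

static void
pg_decode_begin(void *context, void *txn)
{
    /* emit whatever starts a transaction in the target format */
}

static void
pg_decode_change(void *context, void *txn, Relation relation, void *change)
{
    /* transform a single INSERT/UPDATE/DELETE into the target format */
}

static void
pg_decode_commit(void *context, void *txn)
{
    /* emit whatever ends a transaction in the target format */
}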

The current pain points that I'd like to discuss are:
a) Better naming for the slot management between walsender, SQL and
   possible future interfaces.

b) Decide which of the SQL functions should be in a contrib module, and
   which in core. Currently init_logical_replication() and
   stop_logical_replication() are in core, whereas
   start_logical_replication() is in the 'test_logical_decoding'
   extension. The reasoning behind that is that init/stop ones are
   important to the DBA and the start_logical_replication() SRF isn't
   all that useful in the real world because our SRFs don't support
   streaming changes out.

c) Which data types does start_logical_replication() return? Currently
it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
column text has some obvious disadvantages though - there are obvious
usecases for output plugins that return binary data. But making it bytea
sucks, because the output is harder to read by default...

d) How does a slot acquire the callbacks of an output plugin.

For a), my current feeling is to name them:
* LOGICAL_DECODING_SLOT_CREATE/pg_logical_decoding_slot_create()
* LOGICAL_DECODING_SLOT_STREAM/pg_logical_decoding_slot_extract()
* LOGICAL_DECODING_SLOT_DESTROY/pg_logical_decoding_slot_destroy()
with an intentional discrepancy between 'stream' and 'extract', to make the
difference obvious. One day we might have the facility - which would be
rather cool - to do the streaming from SQL as well.

Better ideas? Leave out the "logical"?

For b), I am happy with that split; I would just like others to comment.

For c), I have no better idea than two functions.

d) is my main question, and Robert, Peter G. and I previously argued
about it a fair bit. I know of the following alternatives:

I) The output plugin that's specified in INIT_LOGICAL_REPLICATION is
actually a library name, and we simply look up the fixed symbol names in
it. That's what's currently implemented.
The advantage is that it's pretty easy to implement, works on a HS
standby without involving the primary, and doesn't have a problem if the
library is used in shared_preload_libraries.
The disadvantages are: All output plugins need to be shared libraries
and there can only be one output plugin per shared library (although you
could route differently, via options, but ugh).

II) Keep the output plugin a library, but only look up a
_PG_init_output_plugin() which registers/returns the callbacks. Pretty
much the same tradeoffs as I)
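
(Roughly what alternative II would look like on the library side, reusing
struct output_plugin from Robert's message further up and the pg_decode_*
skeleton above; all of it is a sketch, not an existing API:)

#include "postgres.h"

/*
 * The library exposes one known symbol that fills in the callback struct;
 * the plugin name and callback names are illustrative.
 */
void
_PG_init_output_plugin(struct output_plugin *cb)
{
    snprintf(cb->name, sizeof(cb->name), "example_plugin");
    cb->begin = pg_decode_begin;
    cb->change = pg_decode_change;
    cb->commit = pg_decode_commit;
}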

III) Keep the output plugin a library, but simply rely on _PG_init()
calling a function to register all callbacks. Imo it's worse than I) and
II) because it basically prohibits using the library in
shared_preload_libraries as well, because then its _PG_init() doesn't
get called when starting to stream, and another library might have