Re: [HACKERS] logical changeset generation v5

2013-09-05 Thread Andres Freund
Hi,

On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
 On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  0005 wal_decoding: Log xl_running_xact's at a higher frequency than 
  checkpoints are done
  * benefits hot standby startup

I tried to update the patch to address the comments you made.

  0003 wal_decoding: Allow walsender's to connect to a specific database
  * biggest problem is how to specify the connection we connect
to. Currently with the patch walsender connects to a database if it's
not named replication (via dbname). Perhaps it's better to invent a
replication_dbname parameter?

I've updated the patch so it extends the replication startup parameter
to not only specify a boolean but also database. In the latter case it
will connect to the database specified in dbname.
As discussed downthread, this patch doesn't have an immediate advantage
for users until the changeset extraction patch itself is
applied. Whether or not it should be applied separately is unclear.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 2aa39548f5990e9663e95f011f25a89a0dc8d8a1 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 19 Aug 2013 13:24:30 +0200
Subject: [PATCH 2/9] wal_decoding: Allow walsender's to connect to a specific
 database

Extend the existing 'replication' parameter to not only allow a boolean value
but also database. If the latter is specified we connect to the database
specified in 'dbname'.

This is useful for future walsender commands which need database interaction,
e.g. changeset extraction.
---
 doc/src/sgml/protocol.sgml | 24 +---
 src/backend/postmaster/postmaster.c| 23 ++--
 .../libpqwalreceiver/libpqwalreceiver.c|  4 +-
 src/backend/replication/walsender.c| 43 +++---
 src/backend/utils/init/postinit.c  |  5 +++
 src/bin/pg_basebackup/pg_basebackup.c  |  4 +-
 src/bin/pg_basebackup/pg_receivexlog.c |  4 +-
 src/bin/pg_basebackup/receivelog.c |  4 +-
 src/include/replication/walsender.h|  1 +
 9 files changed, 89 insertions(+), 23 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 0b2e60e..2ea14e5 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1301,10 +1301,13 @@
 
 para
 To initiate streaming replication, the frontend sends the
-literalreplication/ parameter in the startup message. This tells the
-backend to go into walsender mode, wherein a small set of replication commands
-can be issued instead of SQL statements. Only the simple query protocol can be
-used in walsender mode.
+literalreplication/ parameter in the startup message. A boolean value
+of literaltrue/ tells the backend to go into walsender mode, wherein a
+small set of replication commands can be issued instead of SQL statements. Only
+the simple query protocol can be used in walsender mode.
+Passing a literaldatabase/ as the value instructs walsender to connect to
+the database specified in the literaldbname/ paramter which will in future
+allow some additional commands to the ones specified below to be run.
 
 The commands accepted in walsender mode are:
 
@@ -1314,7 +1317,7 @@ The commands accepted in walsender mode are:
 listitem
  para
   Requests the server to identify itself. Server replies with a result
-  set of a single row, containing three fields:
+  set of a single row, containing four fields:
  /para
 
  para
@@ -1356,6 +1359,17 @@ The commands accepted in walsender mode are:
   /listitem
   /varlistentry
 
+  varlistentry
+  term
+   dbname
+  /term
+  listitem
+  para
+   Database connected to or NULL.
+  /para
+  /listitem
+  /varlistentry
+
   /variablelist
  /para
 /listitem
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 01d2618..a31b01d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1894,10 +1894,21 @@ retry1:
 port-cmdline_options = pstrdup(valptr);
 			else if (strcmp(nameptr, replication) == 0)
 			{
-if (!parse_bool(valptr, am_walsender))
+/*
+ * Due to backward compatibility concerns replication is a
+ * bybrid beast which allows the value to be either a boolean
+ * or the string 'database'. The latter connects to a specific
+ * database which is e.g. required for changeset extraction.
+ */
+if (strcmp(valptr, database) == 0)
+{
+	am_walsender = true;
+	am_db_walsender = true;
+}
+else if (!parse_bool(valptr, am_walsender))
 	ereport(FATAL,
 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-			 errmsg(invalid value for boolean option \replication\)));
+		

Re: [HACKERS] logical changeset generation v5

2013-09-04 Thread Andres Freund
On 2013-09-04 10:02:05 -0400, Robert Haas wrote:
 On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund and...@2ndquadrant.com wrote:
  I don't think it particularly needs to be configurable, but I wonder
  if we can't be a bit smarter about when we do it.  For example,
  suppose we logged it every 15 s but only until we log a non-overflowed
  snapshot.
 
  There's actually more benefits than just overflowed snapshots (pruning
  of the known xids machinery, exclusive lock cleanup).

 I know that, but I thought the master and slave could only lose sync
 on those things after a master crash and that once per checkpoint
 cycle was enough for those other benefits.  Am I wrong?

The xid tracking can keep track without the additional records but it
sometimes needs a good bit more memory to do so if the primary burns to
xids quite fast.  Everytime we see an running xacts record we can do
cleanup (that's the ExpireOldKnownAssignedTransactionIds() in
ProcArrayApplyRecoveryInfo()).

  The problem with using dbname=replication as a trigger for anything is
  that we actually allow databases to be created with that name. Perhaps
  that was a design mistake.
 
 It seemed like a good idea at the time, but maybe it wasn't.  I'm not
 sure where to go with it at this point; a forcible backward
 compatibility break would probably screw things up for a lot of
 people.

Yes, breaking things now doesn't seem like a good idea.

  I wondered about turning replication from a boolean into something like
  off|0, on|1, database. dbname= gets only used in the latter
  variant. That would be compatible with previous versions and would even
  support using old tools (since all of them seem to do replication=1).
 
 I don't love that, but I don't hate it, either.

Ok. Will update the patch that way. Seems better than it's current state.

 But it still doesn't
 answer the following question, which I think is important: if I (or
 someone else) commits this patch, how will that make things better for
 users?  At the moment it's just adding a knob that doesn't do anything
 for you when you twist it.

I am not sure it's a good idea to commit it before we're sure were going
to commit the changeset extraction. It's an independently reviewable and
testable piece of code that's simple enough to understand quickly in
contrast to the large changeset extraction patch. That's why I kept it
separate.
On the other hand, as you know, it's not without precedent to commit
pieces of infrastructure that aren't really useful for the enduser
(think FDW).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v5

2013-09-04 Thread Robert Haas
On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't think it particularly needs to be configurable, but I wonder
 if we can't be a bit smarter about when we do it.  For example,
 suppose we logged it every 15 s but only until we log a non-overflowed
 snapshot.

 There's actually more benefits than just overflowed snapshots (pruning
 of the known xids machinery, exclusive lock cleanup).

I know that, but I thought the master and slave could only lose sync
on those things after a master crash and that once per checkpoint
cycle was enough for those other benefits.  Am I wrong?

 The patch as-is only writes if there has been WAL written since the last
 time it logged a running_xacts. I think it's not worth building more
 smarts than that?

Hmm, maybe.

 Because I don't see any reason to believe that this WAL record is any
 more important or urgent than any other WAL record that might get
 logged.

 I can't follow the logic behind that statement. Just about all WAL
 records are either pretty immediately flushed afterwards or are done in
 the context of a transaction where we flush (or do a
 XLogSetAsyncXactLSN) at transaction commit.

 XLogBackgroundFlush() won't necessarily flush the running_xacts
 record.

OK, this was the key point I was missing.

 It seems we need some more design there.  Perhaps entering replication
 mode could be triggered by writing either dbname=replication or
 replication=yes.  But then, do the replication commands simply become
 SQL commands?  I've certainly seen hackers use them that way.  And I
 can imagine that being a sensible approach, but this patch seems like
 it's only covering a fairly small fraction of what really ought to be
 a single commit.

 Yes. I think you're right that we need more input/design here. I've
 previously started threads about it, but nobody replied :(.

 The problem with using dbname=replication as a trigger for anything is
 that we actually allow databases to be created with that name. Perhaps
 that was a design mistake.

It seemed like a good idea at the time, but maybe it wasn't.  I'm not
sure where to go with it at this point; a forcible backward
compatibility break would probably screw things up for a lot of
people.

 I wondered about turning replication from a boolean into something like
 off|0, on|1, database. dbname= gets only used in the latter
 variant. That would be compatible with previous versions and would even
 support using old tools (since all of them seem to do replication=1).

I don't love that, but I don't hate it, either.  But it still doesn't
answer the following question, which I think is important: if I (or
someone else) commits this patch, how will that make things better for
users?  At the moment it's just adding a knob that doesn't do anything
for you when you twist it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v5

2013-09-03 Thread Robert Haas
On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund and...@2ndquadrant.com wrote:
 0005 wal_decoding: Log xl_running_xact's at a higher frequency than 
 checkpoints are done
 * benefits hot standby startup

Review:

1. I think more comments are needed here to explain why we need this.
I don't know if the comments should go into the functions modified by
this patch or in some other location, but I don't find what's here now
adequate for understanding.

2. I think the variable naming could be better.  If nothing else, I'd
spell out snapshot rather than abbreviating it to snap.  I'd also
add comments explaining what each of those variables does.  And why
isn't log_snap_interval_ms a #define rather than a variable?  (Don't
even talk to me about using gdb on a running instance.  If you're even
thinking about that, this needs to be a GUC.)

3. Why does LogCurrentRunningXacts() need to call
XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
sync'd in a reasonably timely fashion; I can't see off-hand why this
one should need special handling.

 0003 wal_decoding: Allow walsender's to connect to a specific database
 * biggest problem is how to specify the connection we connect
   to. Currently with the patch walsender connects to a database if it's
   not named replication (via dbname). Perhaps it's better to invent a
   replication_dbname parameter?

I understand why logical replication needs to connect to a database,
but I don't understand why any other walsender would need to connect
to a database.  Absent a clear use case for such a thing, I don't
think we should allow it.  Ignorant suggestion: perhaps the database
name could be stored in the logical replication slot.

 0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public
 * Pretty trivial and boring.

Seems fine.

 0007 wal_decoding: Add information about a tables primary key to struct 
 RelationData
 * Could be used in the matview refresh code

I think you and Kevin should discuss whether this is actually the
right way to do this.  ISTM that if logical replication and
materialized views end up selecting different approaches to this
problem, everybody loses.

 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new 
 maximum for CommandCounterIncrement

I'm still unconvinced we want this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v5

2013-09-03 Thread Andres Freund
On 2013-09-03 11:40:57 -0400, Robert Haas wrote:
 On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  0005 wal_decoding: Log xl_running_xact's at a higher frequency than 
  checkpoints are done
  * benefits hot standby startup
 
 Review:
 
 1. I think more comments are needed here to explain why we need this.
 I don't know if the comments should go into the functions modified by
 this patch or in some other location, but I don't find what's here now
 adequate for understanding.

Hm. What information are you actually missing? I guess the
XLogSetAsyncXactLSN() needs a bit more context based on your question,
what else?
Not sure if it makes sense to explain in detail why it helps us to get
into a consistent state faster?

 2. I think the variable naming could be better.  If nothing else, I'd
 spell out snapshot rather than abbreviating it to snap.  I'd also
 add comments explaining what each of those variables does.

Ok.

 And why
 isn't log_snap_interval_ms a #define rather than a variable?  (Don't
 even talk to me about using gdb on a running instance.  If you're even
 thinking about that, this needs to be a GUC.)

Ugh. It certainly doesn't have anything to do with wanting to change it
on a running system using gdb. Brrr.

I think I wanted it to be a constant variable but forgot the const. I
personally prefer 'static const' to #define's if its legal C, but I
guess the project's style differs, so I'll change that.

 3. Why does LogCurrentRunningXacts() need to call
 XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
 sync'd in a reasonably timely fashion; I can't see off-hand why this
 one should need special handling.

No, we don't force writing out wal records in a timely fashion if
there's no pressure in wal_buffers, basically only on commits and
various XLogFlush()es. It doesn't make much of a difference if the
entire system is busy, but if it's not the wal writer will sleep. The
alternative would be to XLogFlush() the record, but that would actually
block, which isn't really what we want/need.

  0003 wal_decoding: Allow walsender's to connect to a specific database
  * biggest problem is how to specify the connection we connect
to. Currently with the patch walsender connects to a database if it's
not named replication (via dbname). Perhaps it's better to invent a
replication_dbname parameter?

 I understand why logical replication needs to connect to a database,
 but I don't understand why any other walsender would need to connect
 to a database.

Well, logical replication actually streams out data using the walsender,
so that's the reason why I want to add it there. But there have been
cases in the past where we wanted to do stuff in the walsender that need
database access, but we couldn't do so because you cannot connect to
one.

  Absent a clear use case for such a thing, I don't
 think we should allow it.  Ignorant suggestion: perhaps the database
 name could be stored in the logical replication slot.

The problem is that you need to InitPostgres() with a database. You
cannot do that again, after connecting with an empty database which we
do in a plain walsender.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v5

2013-09-03 Thread Robert Haas
On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund and...@2ndquadrant.com wrote:
 1. I think more comments are needed here to explain why we need this.
 I don't know if the comments should go into the functions modified by
 this patch or in some other location, but I don't find what's here now
 adequate for understanding.

 Hm. What information are you actually missing? I guess the
 XLogSetAsyncXactLSN() needs a bit more context based on your question,
 what else?
 Not sure if it makes sense to explain in detail why it helps us to get
 into a consistent state faster?

Well, we must have had some idea in mind when the original Hot Standby
patch went in that doing this once per checkpoint was good enough.
Now we think we need it every 15 seconds, but not more or less often.
So, why the change of heart?  To my way of thinking, it seems as
though we ought to always begin replay at a checkpoint, so the standby
ought always to see one of these records immediately.  Obviously
that's not good enough, but why not?  And why is every 15 seconds good
enough?

 3. Why does LogCurrentRunningXacts() need to call
 XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
 sync'd in a reasonably timely fashion; I can't see off-hand why this
 one should need special handling.

 No, we don't force writing out wal records in a timely fashion if
 there's no pressure in wal_buffers, basically only on commits and
 various XLogFlush()es. It doesn't make much of a difference if the
 entire system is busy, but if it's not the wal writer will sleep. The
 alternative would be to XLogFlush() the record, but that would actually
 block, which isn't really what we want/need.

The WAL writer is supposed to call XLogBackgroundFlush() every time
WalWriterDelay expires.  Yeah, it can hibernate, but if it's
hibernating, then we should respect that decision for this WAL record
type also.

  0003 wal_decoding: Allow walsender's to connect to a specific database
  * biggest problem is how to specify the connection we connect
to. Currently with the patch walsender connects to a database if it's
not named replication (via dbname). Perhaps it's better to invent a
replication_dbname parameter?

 I understand why logical replication needs to connect to a database,
 but I don't understand why any other walsender would need to connect
 to a database.

 Well, logical replication actually streams out data using the walsender,
 so that's the reason why I want to add it there. But there have been
 cases in the past where we wanted to do stuff in the walsender that need
 database access, but we couldn't do so because you cannot connect to
 one.

Could you be more specific?

  Absent a clear use case for such a thing, I don't
 think we should allow it.  Ignorant suggestion: perhaps the database
 name could be stored in the logical replication slot.

 The problem is that you need to InitPostgres() with a database. You
 cannot do that again, after connecting with an empty database which we
 do in a plain walsender.

Are you saying that the logical replication slot can't be read before
calling InitPostgres()?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v5

2013-09-03 Thread Andres Freund
On 2013-09-03 12:22:22 -0400, Robert Haas wrote:
 On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund and...@2ndquadrant.com wrote:
  1. I think more comments are needed here to explain why we need this.
  I don't know if the comments should go into the functions modified by
  this patch or in some other location, but I don't find what's here now
  adequate for understanding.
 
  Hm. What information are you actually missing? I guess the
  XLogSetAsyncXactLSN() needs a bit more context based on your question,
  what else?
  Not sure if it makes sense to explain in detail why it helps us to get
  into a consistent state faster?

 Well, we must have had some idea in mind when the original Hot Standby
 patch went in that doing this once per checkpoint was good enough.
 Now we think we need it every 15 seconds, but not more or less often.
 So, why the change of heart?

I think the primary reason for that was that it's was a pretty
complicated patchset and we needed to start somewhere. By now we do have
reports of standbys taking their time to get consistent.

 To my way of thinking, it seems as though we ought to always begin
 replay at a checkpoint, so the standby ought always to see one of
 these records immediately.  Obviously that's not good enough, but why
 not?

We always see one after the checkpoint (well, actually before the
checkpoint record, but ...), correct. The problem is just that reading a
single xact_running record doesn't automatically make you consistent. If
there's a single suboverflowed transaction running on the primary when
the xl_runing_xacts is logged we won't be able to switch to
consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some fun
and some optimizations.
Since the only place where we currently have the information to
potentially become consistent is ProcArrayApplyRecoveryInfo() we will
have to wait checkpoint_timeout time till we get consistent. Which
sucks as there are good arguments to set that to 1h.
That especially sucks as you loose consistency everytime you restart the
standby...

 And why is every 15 seconds good enough?

Waiting 15s to become consistent instead of checkpoint_timeout seems to
be ok to me and to be a good tradeoff between overhead and waiting. We
can certainly discuss other values or making it configurable. The latter
seemed to be unnecessary to me, but I have don't have a problem
implementing it. I just don't want to document it :P

  3. Why does LogCurrentRunningXacts() need to call
  XLogSetAsyncXactLSN()?  Hopefully any WAL record is going to get
  sync'd in a reasonably timely fashion; I can't see off-hand why this
  one should need special handling.
 
  No, we don't force writing out wal records in a timely fashion if
  there's no pressure in wal_buffers, basically only on commits and
  various XLogFlush()es. It doesn't make much of a difference if the
  entire system is busy, but if it's not the wal writer will sleep. The
  alternative would be to XLogFlush() the record, but that would actually
  block, which isn't really what we want/need.

 The WAL writer is supposed to call XLogBackgroundFlush() every time
 WalWriterDelay expires.  Yeah, it can hibernate, but if it's
 hibernating, then we should respect that decision for this WAL record
 type also.

Why should we respect it? There is work to be done and the wal writer
has no way of knowing that without us telling it? Normally we rely on
commit records and XLogFlush()es to trigger the wal writer.
Alternatively we can start a transaction and set synchronous_commit =
off, but that seems like a complication to me.

  I understand why logical replication needs to connect to a database,
  but I don't understand why any other walsender would need to connect
  to a database.
 
  Well, logical replication actually streams out data using the walsender,
  so that's the reason why I want to add it there. But there have been
  cases in the past where we wanted to do stuff in the walsender that need
  database access, but we couldn't do so because you cannot connect to
  one.

 Could you be more specific?

I only remember 3959.1349384...@sss.pgh.pa.us but I think it has come up
before.

   Absent a clear use case for such a thing, I don't
  think we should allow it.  Ignorant suggestion: perhaps the database
  name could be stored in the logical replication slot.
 
  The problem is that you need to InitPostgres() with a database. You
  cannot do that again, after connecting with an empty database which we
  do in a plain walsender.
 
 Are you saying that the logical replication slot can't be read before
 calling InitPostgres()?

The slot can be read just fine, but we won't know that we should do
that. Walsender accepts commands via PostgresMain()'s mainloop which has
done a InitPostgres(dbname) before. Which we need to do because we need
the environment it sets up.

The database is stored in the slots btw (as oid, not as a name though) ;)

Greetings,

Andres Freund

-- 
 Andres Freund  

Re: [HACKERS] logical changeset generation v5

2013-09-03 Thread Robert Haas
On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote:
 To my way of thinking, it seems as though we ought to always begin
 replay at a checkpoint, so the standby ought always to see one of
 these records immediately.  Obviously that's not good enough, but why
 not?

 We always see one after the checkpoint (well, actually before the
 checkpoint record, but ...), correct. The problem is just that reading a
 single xact_running record doesn't automatically make you consistent. If
 there's a single suboverflowed transaction running on the primary when
 the xl_runing_xacts is logged we won't be able to switch to
 consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some fun
 and some optimizations.
 Since the only place where we currently have the information to
 potentially become consistent is ProcArrayApplyRecoveryInfo() we will
 have to wait checkpoint_timeout time till we get consistent. Which
 sucks as there are good arguments to set that to 1h.
 That especially sucks as you loose consistency everytime you restart the
 standby...

Right, OK.

 And why is every 15 seconds good enough?

 Waiting 15s to become consistent instead of checkpoint_timeout seems to
 be ok to me and to be a good tradeoff between overhead and waiting. We
 can certainly discuss other values or making it configurable. The latter
 seemed to be unnecessary to me, but I have don't have a problem
 implementing it. I just don't want to document it :P

I don't think it particularly needs to be configurable, but I wonder
if we can't be a bit smarter about when we do it.  For example,
suppose we logged it every 15 s but only until we log a non-overflowed
snapshot.  I realize that the overhead of a WAL record every 15
seconds is fairly small, but the load on some systems is all but
nonexistent.  It would be nice not to wake up the HD unnecessarily.

 The WAL writer is supposed to call XLogBackgroundFlush() every time
 WalWriterDelay expires.  Yeah, it can hibernate, but if it's
 hibernating, then we should respect that decision for this WAL record
 type also.

 Why should we respect it?

Because I don't see any reason to believe that this WAL record is any
more important or urgent than any other WAL record that might get
logged.

  I understand why logical replication needs to connect to a database,
  but I don't understand why any other walsender would need to connect
  to a database.
 
  Well, logical replication actually streams out data using the walsender,
  so that's the reason why I want to add it there. But there have been
  cases in the past where we wanted to do stuff in the walsender that need
  database access, but we couldn't do so because you cannot connect to
  one.

 Could you be more specific?

 I only remember 3959.1349384...@sss.pgh.pa.us but I think it has come up
 before.

It seems we need some more design there.  Perhaps entering replication
mode could be triggered by writing either dbname=replication or
replication=yes.  But then, do the replication commands simply become
SQL commands?  I've certainly seen hackers use them that way.  And I
can imagine that being a sensible approach, but this patch seems like
it's only covering a fairly small fraction of what really ought to be
a single commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v5

2013-09-03 Thread Andres Freund
On 2013-09-03 15:56:15 -0400, Robert Haas wrote:
 On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote:
  And why is every 15 seconds good enough?
 
  Waiting 15s to become consistent instead of checkpoint_timeout seems to
  be ok to me and to be a good tradeoff between overhead and waiting. We
  can certainly discuss other values or making it configurable. The latter
  seemed to be unnecessary to me, but I have don't have a problem
  implementing it. I just don't want to document it :P
 
 I don't think it particularly needs to be configurable, but I wonder
 if we can't be a bit smarter about when we do it.  For example,
 suppose we logged it every 15 s but only until we log a non-overflowed
 snapshot.

There's actually more benefits than just overflowed snapshots (pruning
of the known xids machinery, exclusive lock cleanup).

 I realize that the overhead of a WAL record every 15
 seconds is fairly small, but the load on some systems is all but
 nonexistent.  It would be nice not to wake up the HD unnecessarily.

The patch as-is only writes if there has been WAL written since the last
time it logged a running_xacts. I think it's not worth building more
smarts than that?

  The WAL writer is supposed to call XLogBackgroundFlush() every time
  WalWriterDelay expires.  Yeah, it can hibernate, but if it's
  hibernating, then we should respect that decision for this WAL record
  type also.
 
  Why should we respect it?
 
 Because I don't see any reason to believe that this WAL record is any
 more important or urgent than any other WAL record that might get
 logged.

I can't follow the logic behind that statement. Just about all WAL
records are either pretty immediately flushed afterwards or are done in
the context of a transaction where we flush (or do a
XLogSetAsyncXactLSN) at transaction commit.

XLogBackgroundFlush() won't necessarily flush the running_xacts
record. Unless you've set the async xact lsn it will only flush complete
blocks. So what can happen (I've seen that more than once in testing,
took me a while to debug) that a checkpoint is started in a busy period
but nothing happens after it finished. Since the checkpoint triggered
running_xact is triggered *before* we do the smgr flush it still is
overflowed. Then, after activity has died down, the bgwriter issues the
running xact record, but it's filling a block and thus it never get's
flushed.

To me the alternatives are to do a XLogSetAsyncXactLSN() or an
XLogFlush(). The latter is more aggressive and can block for a
measurable amount of time, which is why I don't want to do it in the
bgwriter.

   I understand why logical replication needs to connect to a database,
   but I don't understand why any other walsender would need to connect
   to a database.
  
   Well, logical replication actually streams out data using the walsender,
   so that's the reason why I want to add it there. But there have been
   cases in the past where we wanted to do stuff in the walsender that need
   database access, but we couldn't do so because you cannot connect to
   one.
 
  Could you be more specific?
 
  I only remember 3959.1349384...@sss.pgh.pa.us but I think it has come up
  before.
 
 It seems we need some more design there.  Perhaps entering replication
 mode could be triggered by writing either dbname=replication or
 replication=yes.  But then, do the replication commands simply become
 SQL commands?  I've certainly seen hackers use them that way.  And I
 can imagine that being a sensible approach, but this patch seems like
 it's only covering a fairly small fraction of what really ought to be
 a single commit.

Yes. I think you're right that we need more input/design here. I've
previously started threads about it, but nobody replied :(.

The problem with using dbname=replication as a trigger for anything is
that we actually allow databases to be created with that name. Perhaps
that was a design mistake.

I wondered about turning replication from a boolean into something like
off|0, on|1, database. dbname= gets only used in the latter
variant. That would be compatible with previous versions and would even
support using old tools (since all of them seem to do replication=1).

 But then, do the replication commands simply become
 SQL commands?  I've certainly seen hackers use them that way.

I don't think that it's a good way at this point to make them to plain
SQL. There is more infrastructure (signal handlers, permissions,
different timeouts)  memory required for walsenders, so using plain SQL
there seems beyond the scope of this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v5

2013-08-30 Thread Andres Freund
Hi,

I've attached a couple of the preliminary patches to $subject which I've
recently cleaned up in the hope that we can continue improving on those
in a piecemal fashion.
I am preparing submission of a newer version of the major patch but
unfortunately progress on that is slower than I'd like...

In the order of chance of applying them individuall they are:

0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints 
are done
* benefits hot standby startup
0003 wal_decoding: Allow walsender's to connect to a specific database
* biggest problem is how to specify the connection we connect
  to. Currently with the patch walsender connects to a database if it's
  not named replication (via dbname). Perhaps it's better to invent a
  replication_dbname parameter?
0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public
* Pretty trivial and boring.
0007 wal_decoding: Add information about a tables primary key to struct 
RelationData
* Could be used in the matview refresh code
0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new 
maximum for CommandCounterIncrement

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 3442c3a4e44c5a64efbe651b745a6f86f69cfdab Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 19 Aug 2013 13:24:30 +0200
Subject: [PATCH 02/13] wal_decoding: Introduce InvalidCommandId and declare
 that to be the new maximum for CommandCounterIncrement

This is useful to be able to represent a CommandId thats invalid. There was no
such value before.

This decreases the possible number of subtransactions by one which seems
unproblematic. Its also not a problem for pg_upgrade because cmin/cmax are
never looked at outside the context of their own transaction (spare timetravel
access, but thats new anyway).
---
 src/backend/access/transam/xact.c | 4 ++--
 src/include/c.h   | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 31e868d..0591f3f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -766,12 +766,12 @@ CommandCounterIncrement(void)
 	if (currentCommandIdUsed)
 	{
 		currentCommandId += 1;
-		if (currentCommandId == FirstCommandId) /* check for overflow */
+		if (currentCommandId == InvalidCommandId)
 		{
 			currentCommandId -= 1;
 			ereport(ERROR,
 	(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-	 errmsg(cannot have more than 2^32-1 commands in a transaction)));
+	 errmsg(cannot have more than 2^32-2 commands in a transaction)));
 		}
 		currentCommandIdUsed = false;
 
diff --git a/src/include/c.h b/src/include/c.h
index 5961183..14bfdcd 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -368,6 +368,7 @@ typedef uint32 MultiXactOffset;
 typedef uint32 CommandId;
 
 #define FirstCommandId	((CommandId) 0)
+#define InvalidCommandId	(~(CommandId)0)
 
 /*
  * Array indexing support
-- 
1.8.3.251.g1462b67

From ac48fc2f5c5f0031494cfabb0bca46f0bbef47d2 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 19 Aug 2013 13:24:30 +0200
Subject: [PATCH 03/13] wal_decoding: Allow walsender's to connect to a
 specific database

Currently the decision whether to connect to a database or not is made by
checking whether the passed dbname parameter is replication. Unfortunately
this makes it impossible to connect a to a database named
replication... Possibly it would be better to use a separate connection
parameter like replication_dbname=xxx?

This is useful for future walsender commands which need database interaction.
---
 doc/src/sgml/protocol.sgml |  5 +++-
 src/backend/postmaster/postmaster.c| 13 +--
 .../libpqwalreceiver/libpqwalreceiver.c|  4 ++--
 src/backend/replication/walsender.c| 27 ++
 src/backend/utils/init/postinit.c  |  5 
 src/bin/pg_basebackup/pg_basebackup.c  |  4 ++--
 src/bin/pg_basebackup/pg_receivexlog.c |  4 ++--
 src/bin/pg_basebackup/receivelog.c |  4 ++--
 8 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 0b2e60e..51b4435 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1304,7 +1304,10 @@ To initiate streaming replication, the frontend sends the
 literalreplication/ parameter in the startup message. This tells the
 backend to go into walsender mode, wherein a small set of replication commands
 can be issued instead of SQL statements. Only the simple query protocol can be
-used in walsender mode.
+used in walsender mode. A literaldbname/ of literalreplication/ will
+start a walsender not connected to any database, specifying any other database
+will connect to 

[HACKERS] logical changeset generation v5

2013-06-14 Thread Andres Freund
Hi!

I am rather pleased to announce the next version of the changeset
extraction patchset. Thanks to help from a large number of people I
think we are slowly getting to the point where it is getting
committable.

Since the last submitted version
(20121115002746.ga7...@awork2.anarazel.de) a large number of fixes and
the result of good amount of review has been added to the tree. All
bugs known to me have been fixed.

Fixes include:
* synchronous replication support
* don't peg the xmin for user tables, do it only for catalog ones.
* arbitrarily large transaction support by spilling large transactions
  to disk
* spill snapshots to disk, so we can restart without waiting for a new
  snapshot to be built
* Don't read all WAL from the establishment of a logical slot
* tests via SQL interface to changeset extraction

The todo list includes:
* morph the logical slot interface into being replication slots that
  can also be used by streaming replication
* move some more code from snapbuild.c to decode.c to remove a largely
  duplicated switch
* do some more header/comment cleanup  clarification
* move pg_receivellog into its own directory in src/bin or contrib/.
* user/developer level documentation

The patch series currently has two interfaces to logical decoding. One -
which is primarily useful for pg_regress style tests and playing around
- is SQL based, the other one uses a walsender replication connection.

A quick demonstration of the SQL interface (server needs to be started
with wal_level = logical and max_logical_slots  0):
=# CREATE EXTENSION test_logical_decoding;
=# SELECT * FROM init_logical_replication('regression_slot', 'test_decoding');
slotname | xlog_position 
-+---
 regression_slot | 0/17D5908
(1 row)

=# CREATE TABLE foo(id serial primary key, data text);

=# INSERT INTO foo(data) VALUES(1);

=# UPDATE foo SET id = -id, data = ':'||data;

=# DELETE FROM foo;

=# DROP TABLE foo;

=# SELECT * FROM start_logical_replication('regression_slot', 'now', 
'hide-xids', '0');
 location  | xid |  data
---+-+
 0/17D59B8 | 695 | BEGIN
 0/17D59B8 | 695 | COMMIT
 0/17E8B58 | 696 | BEGIN
 0/17E8B58 | 696 | table foo: INSERT: id[int4]:1 data[text]:1
 0/17E8B58 | 696 | COMMIT
 0/17E8CA8 | 697 | BEGIN
 0/17E8CA8 | 697 | table foo: UPDATE: old-pkey: id[int4]:1 new-tuple: 
id[int4]:-1 data[text]::1
 0/17E8CA8 | 697 | COMMIT
 0/17E8E50 | 698 | BEGIN
 0/17E8E50 | 698 | table foo: DELETE: id[int4]:-1
 0/17E8E50 | 698 | COMMIT
 0/17E9058 | 699 | BEGIN
 0/17E9058 | 699 | COMMIT
(13 rows)

=# SELECT * FROM pg_stat_logical_decoding ;
slot_name|plugin | database | active | xmin | 
restart_decoding_lsn 
-+---+--++--+--
 regression_slot | test_decoding |12042 | f  |  695 | 0/17D58D0
(1 row)

=# SELECT * FROM stop_logical_replication('regression_slot');
 stop_logical_replication
--
0

The walsender interface has the same calls
INIT_LOGICAL_REPLICATION 'slot' 'plugin';
START_LOGICAL_REPLICATION 'slot' restart_lsn [(option value)*];
STOP_LOGICAL_REPLICATION 'slot';

The only difference is that START_LOGICAL_REPLICATION can stream changes
and it can support synchronous replication.

The output seen in the 'data' column is produced by a so called 'output
plugin' which users of the facility can write to suit their needs. They
can be written by implementing 5 functions in the shared object that's
passed to init_logical_replication() above:
* pg_decode_init (optional)
* pg_decode_begin_txn
* pg_decode_change
* pg_decode_commit_txn
* pg_decode_cleanup (optional)

The most interesting function pg_decode_change get's passed a structure
containing old/new versions of the row, the 'struct Relation' belonging
to it and metainformation about the transaction.

The output plugin can rely on syscache lookups et al. to decode the
changed tuple in whatever fashion it wants.

I'd like to invite reviewers to first look at:
* the output plugin interface
* the walsender/SRF interface
* patch 12 which contains most of the code

When reading the code, the information flow during decoding might be
interesting:
---
  +---+
  | XLogReader|
  +---+
  |
XLOG Records
  |
  v
  +---+
  | decode.c  |
  +---+
 |   |
 |   |
 v   |
+---+|
| snapbuild.c   |  HeapTupleData
+---+|
 |   |
  catalog snapshots  |
 |   |
 v   v
  +---+
  |reorderbuffer.c|
  +---+
 |
HeapTuple  Metadata