Re: [HACKERS] logical changeset generation v5
Hi, On 2013-09-03 11:40:57 -0400, Robert Haas wrote: On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund and...@2ndquadrant.com wrote: 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done * benefits hot standby startup I tried to update the patch to address the comments you made. 0003 wal_decoding: Allow walsender's to connect to a specific database * biggest problem is how to specify the connection we connect to. Currently with the patch walsender connects to a database if it's not named replication (via dbname). Perhaps it's better to invent a replication_dbname parameter? I've updated the patch so it extends the replication startup parameter to not only specify a boolean but also database. In the latter case it will connect to the database specified in dbname. As discussed downthread, this patch doesn't have an immediate advantage for users until the changeset extraction patch itself is applied. Whether or not it should be applied separately is unclear. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 2aa39548f5990e9663e95f011f25a89a0dc8d8a1 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Mon, 19 Aug 2013 13:24:30 +0200 Subject: [PATCH 2/9] wal_decoding: Allow walsender's to connect to a specific database Extend the existing 'replication' parameter to not only allow a boolean value but also database. If the latter is specified we connect to the database specified in 'dbname'. This is useful for future walsender commands which need database interaction, e.g. changeset extraction. --- doc/src/sgml/protocol.sgml | 24 +--- src/backend/postmaster/postmaster.c| 23 ++-- .../libpqwalreceiver/libpqwalreceiver.c| 4 +- src/backend/replication/walsender.c| 43 +++--- src/backend/utils/init/postinit.c | 5 +++ src/bin/pg_basebackup/pg_basebackup.c | 4 +- src/bin/pg_basebackup/pg_receivexlog.c | 4 +- src/bin/pg_basebackup/receivelog.c | 4 +- src/include/replication/walsender.h| 1 + 9 files changed, 89 insertions(+), 23 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 0b2e60e..2ea14e5 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -1301,10 +1301,13 @@ para To initiate streaming replication, the frontend sends the -literalreplication/ parameter in the startup message. This tells the -backend to go into walsender mode, wherein a small set of replication commands -can be issued instead of SQL statements. Only the simple query protocol can be -used in walsender mode. +literalreplication/ parameter in the startup message. A boolean value +of literaltrue/ tells the backend to go into walsender mode, wherein a +small set of replication commands can be issued instead of SQL statements. Only +the simple query protocol can be used in walsender mode. +Passing a literaldatabase/ as the value instructs walsender to connect to +the database specified in the literaldbname/ paramter which will in future +allow some additional commands to the ones specified below to be run. The commands accepted in walsender mode are: @@ -1314,7 +1317,7 @@ The commands accepted in walsender mode are: listitem para Requests the server to identify itself. Server replies with a result - set of a single row, containing three fields: + set of a single row, containing four fields: /para para @@ -1356,6 +1359,17 @@ The commands accepted in walsender mode are: /listitem /varlistentry + varlistentry + term + dbname + /term + listitem + para + Database connected to or NULL. + /para + /listitem + /varlistentry + /variablelist /para /listitem diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 01d2618..a31b01d 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1894,10 +1894,21 @@ retry1: port-cmdline_options = pstrdup(valptr); else if (strcmp(nameptr, replication) == 0) { -if (!parse_bool(valptr, am_walsender)) +/* + * Due to backward compatibility concerns replication is a + * bybrid beast which allows the value to be either a boolean + * or the string 'database'. The latter connects to a specific + * database which is e.g. required for changeset extraction. + */ +if (strcmp(valptr, database) == 0) +{ + am_walsender = true; + am_db_walsender = true; +} +else if (!parse_bool(valptr, am_walsender)) ereport(FATAL, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg(invalid value for boolean option \replication\))); +
Re: [HACKERS] logical changeset generation v5
On 2013-09-04 10:02:05 -0400, Robert Haas wrote: On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund and...@2ndquadrant.com wrote: I don't think it particularly needs to be configurable, but I wonder if we can't be a bit smarter about when we do it. For example, suppose we logged it every 15 s but only until we log a non-overflowed snapshot. There's actually more benefits than just overflowed snapshots (pruning of the known xids machinery, exclusive lock cleanup). I know that, but I thought the master and slave could only lose sync on those things after a master crash and that once per checkpoint cycle was enough for those other benefits. Am I wrong? The xid tracking can keep track without the additional records but it sometimes needs a good bit more memory to do so if the primary burns to xids quite fast. Everytime we see an running xacts record we can do cleanup (that's the ExpireOldKnownAssignedTransactionIds() in ProcArrayApplyRecoveryInfo()). The problem with using dbname=replication as a trigger for anything is that we actually allow databases to be created with that name. Perhaps that was a design mistake. It seemed like a good idea at the time, but maybe it wasn't. I'm not sure where to go with it at this point; a forcible backward compatibility break would probably screw things up for a lot of people. Yes, breaking things now doesn't seem like a good idea. I wondered about turning replication from a boolean into something like off|0, on|1, database. dbname= gets only used in the latter variant. That would be compatible with previous versions and would even support using old tools (since all of them seem to do replication=1). I don't love that, but I don't hate it, either. Ok. Will update the patch that way. Seems better than it's current state. But it still doesn't answer the following question, which I think is important: if I (or someone else) commits this patch, how will that make things better for users? At the moment it's just adding a knob that doesn't do anything for you when you twist it. I am not sure it's a good idea to commit it before we're sure were going to commit the changeset extraction. It's an independently reviewable and testable piece of code that's simple enough to understand quickly in contrast to the large changeset extraction patch. That's why I kept it separate. On the other hand, as you know, it's not without precedent to commit pieces of infrastructure that aren't really useful for the enduser (think FDW). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v5
On Tue, Sep 3, 2013 at 7:10 PM, Andres Freund and...@2ndquadrant.com wrote: I don't think it particularly needs to be configurable, but I wonder if we can't be a bit smarter about when we do it. For example, suppose we logged it every 15 s but only until we log a non-overflowed snapshot. There's actually more benefits than just overflowed snapshots (pruning of the known xids machinery, exclusive lock cleanup). I know that, but I thought the master and slave could only lose sync on those things after a master crash and that once per checkpoint cycle was enough for those other benefits. Am I wrong? The patch as-is only writes if there has been WAL written since the last time it logged a running_xacts. I think it's not worth building more smarts than that? Hmm, maybe. Because I don't see any reason to believe that this WAL record is any more important or urgent than any other WAL record that might get logged. I can't follow the logic behind that statement. Just about all WAL records are either pretty immediately flushed afterwards or are done in the context of a transaction where we flush (or do a XLogSetAsyncXactLSN) at transaction commit. XLogBackgroundFlush() won't necessarily flush the running_xacts record. OK, this was the key point I was missing. It seems we need some more design there. Perhaps entering replication mode could be triggered by writing either dbname=replication or replication=yes. But then, do the replication commands simply become SQL commands? I've certainly seen hackers use them that way. And I can imagine that being a sensible approach, but this patch seems like it's only covering a fairly small fraction of what really ought to be a single commit. Yes. I think you're right that we need more input/design here. I've previously started threads about it, but nobody replied :(. The problem with using dbname=replication as a trigger for anything is that we actually allow databases to be created with that name. Perhaps that was a design mistake. It seemed like a good idea at the time, but maybe it wasn't. I'm not sure where to go with it at this point; a forcible backward compatibility break would probably screw things up for a lot of people. I wondered about turning replication from a boolean into something like off|0, on|1, database. dbname= gets only used in the latter variant. That would be compatible with previous versions and would even support using old tools (since all of them seem to do replication=1). I don't love that, but I don't hate it, either. But it still doesn't answer the following question, which I think is important: if I (or someone else) commits this patch, how will that make things better for users? At the moment it's just adding a knob that doesn't do anything for you when you twist it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v5
On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund and...@2ndquadrant.com wrote: 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done * benefits hot standby startup Review: 1. I think more comments are needed here to explain why we need this. I don't know if the comments should go into the functions modified by this patch or in some other location, but I don't find what's here now adequate for understanding. 2. I think the variable naming could be better. If nothing else, I'd spell out snapshot rather than abbreviating it to snap. I'd also add comments explaining what each of those variables does. And why isn't log_snap_interval_ms a #define rather than a variable? (Don't even talk to me about using gdb on a running instance. If you're even thinking about that, this needs to be a GUC.) 3. Why does LogCurrentRunningXacts() need to call XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get sync'd in a reasonably timely fashion; I can't see off-hand why this one should need special handling. 0003 wal_decoding: Allow walsender's to connect to a specific database * biggest problem is how to specify the connection we connect to. Currently with the patch walsender connects to a database if it's not named replication (via dbname). Perhaps it's better to invent a replication_dbname parameter? I understand why logical replication needs to connect to a database, but I don't understand why any other walsender would need to connect to a database. Absent a clear use case for such a thing, I don't think we should allow it. Ignorant suggestion: perhaps the database name could be stored in the logical replication slot. 0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public * Pretty trivial and boring. Seems fine. 0007 wal_decoding: Add information about a tables primary key to struct RelationData * Could be used in the matview refresh code I think you and Kevin should discuss whether this is actually the right way to do this. ISTM that if logical replication and materialized views end up selecting different approaches to this problem, everybody loses. 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement I'm still unconvinced we want this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v5
On 2013-09-03 11:40:57 -0400, Robert Haas wrote: On Fri, Aug 30, 2013 at 11:19 AM, Andres Freund and...@2ndquadrant.com wrote: 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done * benefits hot standby startup Review: 1. I think more comments are needed here to explain why we need this. I don't know if the comments should go into the functions modified by this patch or in some other location, but I don't find what's here now adequate for understanding. Hm. What information are you actually missing? I guess the XLogSetAsyncXactLSN() needs a bit more context based on your question, what else? Not sure if it makes sense to explain in detail why it helps us to get into a consistent state faster? 2. I think the variable naming could be better. If nothing else, I'd spell out snapshot rather than abbreviating it to snap. I'd also add comments explaining what each of those variables does. Ok. And why isn't log_snap_interval_ms a #define rather than a variable? (Don't even talk to me about using gdb on a running instance. If you're even thinking about that, this needs to be a GUC.) Ugh. It certainly doesn't have anything to do with wanting to change it on a running system using gdb. Brrr. I think I wanted it to be a constant variable but forgot the const. I personally prefer 'static const' to #define's if its legal C, but I guess the project's style differs, so I'll change that. 3. Why does LogCurrentRunningXacts() need to call XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get sync'd in a reasonably timely fashion; I can't see off-hand why this one should need special handling. No, we don't force writing out wal records in a timely fashion if there's no pressure in wal_buffers, basically only on commits and various XLogFlush()es. It doesn't make much of a difference if the entire system is busy, but if it's not the wal writer will sleep. The alternative would be to XLogFlush() the record, but that would actually block, which isn't really what we want/need. 0003 wal_decoding: Allow walsender's to connect to a specific database * biggest problem is how to specify the connection we connect to. Currently with the patch walsender connects to a database if it's not named replication (via dbname). Perhaps it's better to invent a replication_dbname parameter? I understand why logical replication needs to connect to a database, but I don't understand why any other walsender would need to connect to a database. Well, logical replication actually streams out data using the walsender, so that's the reason why I want to add it there. But there have been cases in the past where we wanted to do stuff in the walsender that need database access, but we couldn't do so because you cannot connect to one. Absent a clear use case for such a thing, I don't think we should allow it. Ignorant suggestion: perhaps the database name could be stored in the logical replication slot. The problem is that you need to InitPostgres() with a database. You cannot do that again, after connecting with an empty database which we do in a plain walsender. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v5
On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund and...@2ndquadrant.com wrote: 1. I think more comments are needed here to explain why we need this. I don't know if the comments should go into the functions modified by this patch or in some other location, but I don't find what's here now adequate for understanding. Hm. What information are you actually missing? I guess the XLogSetAsyncXactLSN() needs a bit more context based on your question, what else? Not sure if it makes sense to explain in detail why it helps us to get into a consistent state faster? Well, we must have had some idea in mind when the original Hot Standby patch went in that doing this once per checkpoint was good enough. Now we think we need it every 15 seconds, but not more or less often. So, why the change of heart? To my way of thinking, it seems as though we ought to always begin replay at a checkpoint, so the standby ought always to see one of these records immediately. Obviously that's not good enough, but why not? And why is every 15 seconds good enough? 3. Why does LogCurrentRunningXacts() need to call XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get sync'd in a reasonably timely fashion; I can't see off-hand why this one should need special handling. No, we don't force writing out wal records in a timely fashion if there's no pressure in wal_buffers, basically only on commits and various XLogFlush()es. It doesn't make much of a difference if the entire system is busy, but if it's not the wal writer will sleep. The alternative would be to XLogFlush() the record, but that would actually block, which isn't really what we want/need. The WAL writer is supposed to call XLogBackgroundFlush() every time WalWriterDelay expires. Yeah, it can hibernate, but if it's hibernating, then we should respect that decision for this WAL record type also. 0003 wal_decoding: Allow walsender's to connect to a specific database * biggest problem is how to specify the connection we connect to. Currently with the patch walsender connects to a database if it's not named replication (via dbname). Perhaps it's better to invent a replication_dbname parameter? I understand why logical replication needs to connect to a database, but I don't understand why any other walsender would need to connect to a database. Well, logical replication actually streams out data using the walsender, so that's the reason why I want to add it there. But there have been cases in the past where we wanted to do stuff in the walsender that need database access, but we couldn't do so because you cannot connect to one. Could you be more specific? Absent a clear use case for such a thing, I don't think we should allow it. Ignorant suggestion: perhaps the database name could be stored in the logical replication slot. The problem is that you need to InitPostgres() with a database. You cannot do that again, after connecting with an empty database which we do in a plain walsender. Are you saying that the logical replication slot can't be read before calling InitPostgres()? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v5
On 2013-09-03 12:22:22 -0400, Robert Haas wrote: On Tue, Sep 3, 2013 at 12:05 PM, Andres Freund and...@2ndquadrant.com wrote: 1. I think more comments are needed here to explain why we need this. I don't know if the comments should go into the functions modified by this patch or in some other location, but I don't find what's here now adequate for understanding. Hm. What information are you actually missing? I guess the XLogSetAsyncXactLSN() needs a bit more context based on your question, what else? Not sure if it makes sense to explain in detail why it helps us to get into a consistent state faster? Well, we must have had some idea in mind when the original Hot Standby patch went in that doing this once per checkpoint was good enough. Now we think we need it every 15 seconds, but not more or less often. So, why the change of heart? I think the primary reason for that was that it's was a pretty complicated patchset and we needed to start somewhere. By now we do have reports of standbys taking their time to get consistent. To my way of thinking, it seems as though we ought to always begin replay at a checkpoint, so the standby ought always to see one of these records immediately. Obviously that's not good enough, but why not? We always see one after the checkpoint (well, actually before the checkpoint record, but ...), correct. The problem is just that reading a single xact_running record doesn't automatically make you consistent. If there's a single suboverflowed transaction running on the primary when the xl_runing_xacts is logged we won't be able to switch to consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some fun and some optimizations. Since the only place where we currently have the information to potentially become consistent is ProcArrayApplyRecoveryInfo() we will have to wait checkpoint_timeout time till we get consistent. Which sucks as there are good arguments to set that to 1h. That especially sucks as you loose consistency everytime you restart the standby... And why is every 15 seconds good enough? Waiting 15s to become consistent instead of checkpoint_timeout seems to be ok to me and to be a good tradeoff between overhead and waiting. We can certainly discuss other values or making it configurable. The latter seemed to be unnecessary to me, but I have don't have a problem implementing it. I just don't want to document it :P 3. Why does LogCurrentRunningXacts() need to call XLogSetAsyncXactLSN()? Hopefully any WAL record is going to get sync'd in a reasonably timely fashion; I can't see off-hand why this one should need special handling. No, we don't force writing out wal records in a timely fashion if there's no pressure in wal_buffers, basically only on commits and various XLogFlush()es. It doesn't make much of a difference if the entire system is busy, but if it's not the wal writer will sleep. The alternative would be to XLogFlush() the record, but that would actually block, which isn't really what we want/need. The WAL writer is supposed to call XLogBackgroundFlush() every time WalWriterDelay expires. Yeah, it can hibernate, but if it's hibernating, then we should respect that decision for this WAL record type also. Why should we respect it? There is work to be done and the wal writer has no way of knowing that without us telling it? Normally we rely on commit records and XLogFlush()es to trigger the wal writer. Alternatively we can start a transaction and set synchronous_commit = off, but that seems like a complication to me. I understand why logical replication needs to connect to a database, but I don't understand why any other walsender would need to connect to a database. Well, logical replication actually streams out data using the walsender, so that's the reason why I want to add it there. But there have been cases in the past where we wanted to do stuff in the walsender that need database access, but we couldn't do so because you cannot connect to one. Could you be more specific? I only remember 3959.1349384...@sss.pgh.pa.us but I think it has come up before. Absent a clear use case for such a thing, I don't think we should allow it. Ignorant suggestion: perhaps the database name could be stored in the logical replication slot. The problem is that you need to InitPostgres() with a database. You cannot do that again, after connecting with an empty database which we do in a plain walsender. Are you saying that the logical replication slot can't be read before calling InitPostgres()? The slot can be read just fine, but we won't know that we should do that. Walsender accepts commands via PostgresMain()'s mainloop which has done a InitPostgres(dbname) before. Which we need to do because we need the environment it sets up. The database is stored in the slots btw (as oid, not as a name though) ;) Greetings, Andres Freund -- Andres Freund
Re: [HACKERS] logical changeset generation v5
On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote: To my way of thinking, it seems as though we ought to always begin replay at a checkpoint, so the standby ought always to see one of these records immediately. Obviously that's not good enough, but why not? We always see one after the checkpoint (well, actually before the checkpoint record, but ...), correct. The problem is just that reading a single xact_running record doesn't automatically make you consistent. If there's a single suboverflowed transaction running on the primary when the xl_runing_xacts is logged we won't be able to switch to consistent. Check procarray.c:ProcArrayApplyRecoveryInfo() for some fun and some optimizations. Since the only place where we currently have the information to potentially become consistent is ProcArrayApplyRecoveryInfo() we will have to wait checkpoint_timeout time till we get consistent. Which sucks as there are good arguments to set that to 1h. That especially sucks as you loose consistency everytime you restart the standby... Right, OK. And why is every 15 seconds good enough? Waiting 15s to become consistent instead of checkpoint_timeout seems to be ok to me and to be a good tradeoff between overhead and waiting. We can certainly discuss other values or making it configurable. The latter seemed to be unnecessary to me, but I have don't have a problem implementing it. I just don't want to document it :P I don't think it particularly needs to be configurable, but I wonder if we can't be a bit smarter about when we do it. For example, suppose we logged it every 15 s but only until we log a non-overflowed snapshot. I realize that the overhead of a WAL record every 15 seconds is fairly small, but the load on some systems is all but nonexistent. It would be nice not to wake up the HD unnecessarily. The WAL writer is supposed to call XLogBackgroundFlush() every time WalWriterDelay expires. Yeah, it can hibernate, but if it's hibernating, then we should respect that decision for this WAL record type also. Why should we respect it? Because I don't see any reason to believe that this WAL record is any more important or urgent than any other WAL record that might get logged. I understand why logical replication needs to connect to a database, but I don't understand why any other walsender would need to connect to a database. Well, logical replication actually streams out data using the walsender, so that's the reason why I want to add it there. But there have been cases in the past where we wanted to do stuff in the walsender that need database access, but we couldn't do so because you cannot connect to one. Could you be more specific? I only remember 3959.1349384...@sss.pgh.pa.us but I think it has come up before. It seems we need some more design there. Perhaps entering replication mode could be triggered by writing either dbname=replication or replication=yes. But then, do the replication commands simply become SQL commands? I've certainly seen hackers use them that way. And I can imagine that being a sensible approach, but this patch seems like it's only covering a fairly small fraction of what really ought to be a single commit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v5
On 2013-09-03 15:56:15 -0400, Robert Haas wrote: On Tue, Sep 3, 2013 at 12:57 PM, Andres Freund and...@2ndquadrant.com wrote: And why is every 15 seconds good enough? Waiting 15s to become consistent instead of checkpoint_timeout seems to be ok to me and to be a good tradeoff between overhead and waiting. We can certainly discuss other values or making it configurable. The latter seemed to be unnecessary to me, but I have don't have a problem implementing it. I just don't want to document it :P I don't think it particularly needs to be configurable, but I wonder if we can't be a bit smarter about when we do it. For example, suppose we logged it every 15 s but only until we log a non-overflowed snapshot. There's actually more benefits than just overflowed snapshots (pruning of the known xids machinery, exclusive lock cleanup). I realize that the overhead of a WAL record every 15 seconds is fairly small, but the load on some systems is all but nonexistent. It would be nice not to wake up the HD unnecessarily. The patch as-is only writes if there has been WAL written since the last time it logged a running_xacts. I think it's not worth building more smarts than that? The WAL writer is supposed to call XLogBackgroundFlush() every time WalWriterDelay expires. Yeah, it can hibernate, but if it's hibernating, then we should respect that decision for this WAL record type also. Why should we respect it? Because I don't see any reason to believe that this WAL record is any more important or urgent than any other WAL record that might get logged. I can't follow the logic behind that statement. Just about all WAL records are either pretty immediately flushed afterwards or are done in the context of a transaction where we flush (or do a XLogSetAsyncXactLSN) at transaction commit. XLogBackgroundFlush() won't necessarily flush the running_xacts record. Unless you've set the async xact lsn it will only flush complete blocks. So what can happen (I've seen that more than once in testing, took me a while to debug) that a checkpoint is started in a busy period but nothing happens after it finished. Since the checkpoint triggered running_xact is triggered *before* we do the smgr flush it still is overflowed. Then, after activity has died down, the bgwriter issues the running xact record, but it's filling a block and thus it never get's flushed. To me the alternatives are to do a XLogSetAsyncXactLSN() or an XLogFlush(). The latter is more aggressive and can block for a measurable amount of time, which is why I don't want to do it in the bgwriter. I understand why logical replication needs to connect to a database, but I don't understand why any other walsender would need to connect to a database. Well, logical replication actually streams out data using the walsender, so that's the reason why I want to add it there. But there have been cases in the past where we wanted to do stuff in the walsender that need database access, but we couldn't do so because you cannot connect to one. Could you be more specific? I only remember 3959.1349384...@sss.pgh.pa.us but I think it has come up before. It seems we need some more design there. Perhaps entering replication mode could be triggered by writing either dbname=replication or replication=yes. But then, do the replication commands simply become SQL commands? I've certainly seen hackers use them that way. And I can imagine that being a sensible approach, but this patch seems like it's only covering a fairly small fraction of what really ought to be a single commit. Yes. I think you're right that we need more input/design here. I've previously started threads about it, but nobody replied :(. The problem with using dbname=replication as a trigger for anything is that we actually allow databases to be created with that name. Perhaps that was a design mistake. I wondered about turning replication from a boolean into something like off|0, on|1, database. dbname= gets only used in the latter variant. That would be compatible with previous versions and would even support using old tools (since all of them seem to do replication=1). But then, do the replication commands simply become SQL commands? I've certainly seen hackers use them that way. I don't think that it's a good way at this point to make them to plain SQL. There is more infrastructure (signal handlers, permissions, different timeouts) memory required for walsenders, so using plain SQL there seems beyond the scope of this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v5
Hi, I've attached a couple of the preliminary patches to $subject which I've recently cleaned up in the hope that we can continue improving on those in a piecemal fashion. I am preparing submission of a newer version of the major patch but unfortunately progress on that is slower than I'd like... In the order of chance of applying them individuall they are: 0005 wal_decoding: Log xl_running_xact's at a higher frequency than checkpoints are done * benefits hot standby startup 0003 wal_decoding: Allow walsender's to connect to a specific database * biggest problem is how to specify the connection we connect to. Currently with the patch walsender connects to a database if it's not named replication (via dbname). Perhaps it's better to invent a replication_dbname parameter? 0006 wal_decoding: copydir: move fsync_fname to fd.[c.h] and make it public * Pretty trivial and boring. 0007 wal_decoding: Add information about a tables primary key to struct RelationData * Could be used in the matview refresh code 0002 wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 3442c3a4e44c5a64efbe651b745a6f86f69cfdab Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Mon, 19 Aug 2013 13:24:30 +0200 Subject: [PATCH 02/13] wal_decoding: Introduce InvalidCommandId and declare that to be the new maximum for CommandCounterIncrement This is useful to be able to represent a CommandId thats invalid. There was no such value before. This decreases the possible number of subtransactions by one which seems unproblematic. Its also not a problem for pg_upgrade because cmin/cmax are never looked at outside the context of their own transaction (spare timetravel access, but thats new anyway). --- src/backend/access/transam/xact.c | 4 ++-- src/include/c.h | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 31e868d..0591f3f 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -766,12 +766,12 @@ CommandCounterIncrement(void) if (currentCommandIdUsed) { currentCommandId += 1; - if (currentCommandId == FirstCommandId) /* check for overflow */ + if (currentCommandId == InvalidCommandId) { currentCommandId -= 1; ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), - errmsg(cannot have more than 2^32-1 commands in a transaction))); + errmsg(cannot have more than 2^32-2 commands in a transaction))); } currentCommandIdUsed = false; diff --git a/src/include/c.h b/src/include/c.h index 5961183..14bfdcd 100644 --- a/src/include/c.h +++ b/src/include/c.h @@ -368,6 +368,7 @@ typedef uint32 MultiXactOffset; typedef uint32 CommandId; #define FirstCommandId ((CommandId) 0) +#define InvalidCommandId (~(CommandId)0) /* * Array indexing support -- 1.8.3.251.g1462b67 From ac48fc2f5c5f0031494cfabb0bca46f0bbef47d2 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Mon, 19 Aug 2013 13:24:30 +0200 Subject: [PATCH 03/13] wal_decoding: Allow walsender's to connect to a specific database Currently the decision whether to connect to a database or not is made by checking whether the passed dbname parameter is replication. Unfortunately this makes it impossible to connect a to a database named replication... Possibly it would be better to use a separate connection parameter like replication_dbname=xxx? This is useful for future walsender commands which need database interaction. --- doc/src/sgml/protocol.sgml | 5 +++- src/backend/postmaster/postmaster.c| 13 +-- .../libpqwalreceiver/libpqwalreceiver.c| 4 ++-- src/backend/replication/walsender.c| 27 ++ src/backend/utils/init/postinit.c | 5 src/bin/pg_basebackup/pg_basebackup.c | 4 ++-- src/bin/pg_basebackup/pg_receivexlog.c | 4 ++-- src/bin/pg_basebackup/receivelog.c | 4 ++-- 8 files changed, 51 insertions(+), 15 deletions(-) diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index 0b2e60e..51b4435 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -1304,7 +1304,10 @@ To initiate streaming replication, the frontend sends the literalreplication/ parameter in the startup message. This tells the backend to go into walsender mode, wherein a small set of replication commands can be issued instead of SQL statements. Only the simple query protocol can be -used in walsender mode. +used in walsender mode. A literaldbname/ of literalreplication/ will +start a walsender not connected to any database, specifying any other database +will connect to
[HACKERS] logical changeset generation v5
Hi! I am rather pleased to announce the next version of the changeset extraction patchset. Thanks to help from a large number of people I think we are slowly getting to the point where it is getting committable. Since the last submitted version (20121115002746.ga7...@awork2.anarazel.de) a large number of fixes and the result of good amount of review has been added to the tree. All bugs known to me have been fixed. Fixes include: * synchronous replication support * don't peg the xmin for user tables, do it only for catalog ones. * arbitrarily large transaction support by spilling large transactions to disk * spill snapshots to disk, so we can restart without waiting for a new snapshot to be built * Don't read all WAL from the establishment of a logical slot * tests via SQL interface to changeset extraction The todo list includes: * morph the logical slot interface into being replication slots that can also be used by streaming replication * move some more code from snapbuild.c to decode.c to remove a largely duplicated switch * do some more header/comment cleanup clarification * move pg_receivellog into its own directory in src/bin or contrib/. * user/developer level documentation The patch series currently has two interfaces to logical decoding. One - which is primarily useful for pg_regress style tests and playing around - is SQL based, the other one uses a walsender replication connection. A quick demonstration of the SQL interface (server needs to be started with wal_level = logical and max_logical_slots 0): =# CREATE EXTENSION test_logical_decoding; =# SELECT * FROM init_logical_replication('regression_slot', 'test_decoding'); slotname | xlog_position -+--- regression_slot | 0/17D5908 (1 row) =# CREATE TABLE foo(id serial primary key, data text); =# INSERT INTO foo(data) VALUES(1); =# UPDATE foo SET id = -id, data = ':'||data; =# DELETE FROM foo; =# DROP TABLE foo; =# SELECT * FROM start_logical_replication('regression_slot', 'now', 'hide-xids', '0'); location | xid | data ---+-+ 0/17D59B8 | 695 | BEGIN 0/17D59B8 | 695 | COMMIT 0/17E8B58 | 696 | BEGIN 0/17E8B58 | 696 | table foo: INSERT: id[int4]:1 data[text]:1 0/17E8B58 | 696 | COMMIT 0/17E8CA8 | 697 | BEGIN 0/17E8CA8 | 697 | table foo: UPDATE: old-pkey: id[int4]:1 new-tuple: id[int4]:-1 data[text]::1 0/17E8CA8 | 697 | COMMIT 0/17E8E50 | 698 | BEGIN 0/17E8E50 | 698 | table foo: DELETE: id[int4]:-1 0/17E8E50 | 698 | COMMIT 0/17E9058 | 699 | BEGIN 0/17E9058 | 699 | COMMIT (13 rows) =# SELECT * FROM pg_stat_logical_decoding ; slot_name|plugin | database | active | xmin | restart_decoding_lsn -+---+--++--+-- regression_slot | test_decoding |12042 | f | 695 | 0/17D58D0 (1 row) =# SELECT * FROM stop_logical_replication('regression_slot'); stop_logical_replication -- 0 The walsender interface has the same calls INIT_LOGICAL_REPLICATION 'slot' 'plugin'; START_LOGICAL_REPLICATION 'slot' restart_lsn [(option value)*]; STOP_LOGICAL_REPLICATION 'slot'; The only difference is that START_LOGICAL_REPLICATION can stream changes and it can support synchronous replication. The output seen in the 'data' column is produced by a so called 'output plugin' which users of the facility can write to suit their needs. They can be written by implementing 5 functions in the shared object that's passed to init_logical_replication() above: * pg_decode_init (optional) * pg_decode_begin_txn * pg_decode_change * pg_decode_commit_txn * pg_decode_cleanup (optional) The most interesting function pg_decode_change get's passed a structure containing old/new versions of the row, the 'struct Relation' belonging to it and metainformation about the transaction. The output plugin can rely on syscache lookups et al. to decode the changed tuple in whatever fashion it wants. I'd like to invite reviewers to first look at: * the output plugin interface * the walsender/SRF interface * patch 12 which contains most of the code When reading the code, the information flow during decoding might be interesting: --- +---+ | XLogReader| +---+ | XLOG Records | v +---+ | decode.c | +---+ | | | | v | +---+| | snapbuild.c | HeapTupleData +---+| | | catalog snapshots | | | v v +---+ |reorderbuffer.c| +---+ | HeapTuple Metadata