Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-10-06 Thread Fujii Masao
Hi,

On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I've pushed that to 'replication-orig' branch in my git
 repository, attached is the same as a diff against your SR_0914.patch.

The following changes related to crossing an xlogid boundary seem wrong;
they would break the management of some XLOG positions.

 ! /* Update state for read */
 ! tmp = recptr.xrecoff + byteswritten;
 ! if (tmp < recptr.xrecoff)
 ! recptr.xlogid++; /* overflow */
 ! recptr.xrecoff = tmp;

 ! endptr.xrecoff += MAX_SEND_SIZE;
 ! if (endptr.xrecoff < startptr.xrecoff)
 ! endptr.xlogid++; /* xrecoff overflowed */

 ! if (endptr.xlogid != startptr.xlogid)
   {
 ! Assert(endptr.xlogid == startptr.xlogid + 1);
 ! nbytes = (0xffffffff - endptr.xrecoff) + startptr.xrecoff;
 ! }

The size of a logical XLOG file is 0xff000000. So, even if xrecoff has
not overflowed yet, we might need to cross an xlogid boundary. The
xrecoff should be compared with XLogFileSize, I think. Can I fix those?
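For illustration, a minimal standalone sketch of the kind of fix being
suggested: advance an XLOG position by comparing xrecoff against
XLogFileSize (0xff000000) rather than relying on uint32 wraparound. The
helper name and exact arithmetic are illustrative, not the patch's code.

    #include <stdint.h>

    #define XLogFileSize 0xff000000U    /* size of a logical XLOG file */

    typedef struct XLogRecPtr
    {
        uint32_t    xlogid;     /* logical log file #, 0-based */
        uint32_t    xrecoff;    /* byte offset within the logical file */
    } XLogRecPtr;

    /* Advance recptr by nbytes, crossing an xlogid boundary whenever
     * xrecoff reaches XLogFileSize, not when the uint32 overflows. */
    static void
    XLByteAdvance(XLogRecPtr *recptr, uint32_t nbytes)
    {
        uint64_t    tmp = (uint64_t) recptr->xrecoff + nbytes;

        if (tmp >= XLogFileSize)
        {
            recptr->xlogid++;
            recptr->xrecoff = (uint32_t) (tmp - XLogFileSize);
        }
        else
            recptr->xrecoff = (uint32_t) tmp;
    }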

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-10-06 Thread Alvaro Herrera
Fujii Masao escribió:
 On Thu, Sep 17, 2009 at 5:08 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
  Walreceiver is really a slave to the startup process. The startup
  process decides when it's launched, and it's the startup process that
  then waits for it to advance. But the way it's set up at the moment, the
  startup process needs to ask the postmaster to start it up, and it
  doesn't look very robust to me. For example, if launching walreceiver
  fails for some reason, startup process will just hang waiting for it.
 
 I changed the postmaster to report a failure to fork the walreceiver
 to the startup process by resetting WalRcv->in_progress, which prevents
 the startup process from getting stuck when launching walreceiver fails.
 http://archives.postgresql.org/pgsql-hackers/2009-09/msg01996.php
 
 Do you have other concerns about the robustness? If so, I'll address them.

Hmm.  Without looking at the patch at all, this seems similar to how
autovacuum does things: autovac launcher signals postmaster that a
worker needs to be started.  Postmaster proceeds to fork a worker.  This
could obviously fail for a lot of reasons.

Now, there is code in place to notify the user when forking fails, and
this is seen in the wild quite a bit more than one would like :-(  I
think it would be a good idea to have a retry mechanism in the
walreceiver startup path so that recovery does not get stuck due to
transient problems.
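A hedged sketch of what such a retry loop in the startup process could
look like; the wait helper and the exact signal name are assumptions for
illustration, not code from the patch:

    #define MAX_LAUNCH_RETRIES 3

    int         retries;

    for (retries = 0; retries < MAX_LAUNCH_RETRIES; retries++)
    {
        /* ask the postmaster to fork a walreceiver */
        SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER);

        /* hypothetical helper: wait until walreceiver reports itself
         * running, giving up after a timeout */
        if (WaitForWalRcvStart(10 /* seconds */))
            break;                  /* walreceiver is up */

        pg_usleep(5000000L);        /* transient failure: nap, then retry */
    }
    if (retries == MAX_LAUNCH_RETRIES)
        ereport(FATAL,
                (errmsg("could not start walreceiver after %d attempts",
                        MAX_LAUNCH_RETRIES)));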

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-10-06 Thread Fujii Masao
Hi,

On Tue, Oct 6, 2009 at 10:42 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Hmm.  Without looking at the patch at all, this seems similar to how
 autovacuum does things: autovac launcher signals postmaster that a
 worker needs to be started.  Postmaster proceeds to fork a worker.  This
 could obviously fail for a lot of reasons.

Yeah, I drew upon the autovac code.

 Now, there is code in place to notify the user when forking fails, and
 this is seen in the wild quite a bit more than one would like :-(  I
 think it would be a good idea to have a retry mechanism in the
 walreceiver startup path so that recovery does not get stuck due to
 transient problems.

Agreed. The latest patch provides the retry mechanism.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-30 Thread Fujii Masao
On Thu, Sep 17, 2009 at 5:08 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Walreceiver is really a slave to the startup process. The startup
 process decides when it's launched, and it's the startup process that
 then waits for it to advance. But the way it's set up at the moment, the
 startup process needs to ask the postmaster to start it up, and it
 doesn't look very robust to me. For example, if launching walreceiver
 fails for some reason, startup process will just hang waiting for it.

I changed the postmaster to report a failure to fork the walreceiver
to the startup process by resetting WalRcv->in_progress, which prevents
the startup process from getting stuck when launching walreceiver fails.
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01996.php

Do you have other concerns about the robustness? If so, I'll address them.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-25 Thread Fujii Masao
On Thu, Sep 24, 2009 at 7:57 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Anyway, I'll change walreceiver to retry connecting to the primary
 after an error occurs in PQstartXLogStreaming()/PQgetXLogData()/
 PQputXLogRecPtr(). Should we set an upper limit on the number of
 retries?

 I don't think we need an upper limit.

Without an upper limit, a misconfigured primary_conninfo, for example,
would make walreceiver repeat PQstartXLogStreaming() forever. Is this OK?

 - I know I said we should have just asynchronous replication at first,
 but looking ahead, how would you do synchronous?

 As the previous patch did, I'm going to make walsender read the latest
 XLOG from wal_buffers, introduce the signaling between a backend
 and walsender, and keep a backend waiting until the specified XLOG
 has been written or fsynced in the standby.

 Ok. I don't think walsender needs to access wal_buffers even then,
 though. Once the backend has written the WAL, walsender can well read it
 from disk (it will surely be in OS cache still).

For performance reasons, I think that walsender should not delay sending
the XLOG until the backend has written it. Otherwise, XLOG write and send
are performed serially, which would increase response time. Shouldn't
those be performed in parallel?

 What kind of signaling
 is needed between walreceiver and startup process for that?

 I was thinking that the synchronization mode in which a client waits
 until XLOG has been applied is not necessary right now, so no
 signaling is required between those processes yet. But HS requires
 this capability?

 Yeah, I think it will be important with hot standby. It's a much more
 useful guarantee that once COMMIT returns, the transaction is visible
 in the standby, than that it's merely fsync'd to disk in the standby.

 (don't need to solve it now, let's do just asynchronous mode now, but
 it's something to keep in mind)

Okay.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-25 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Thu, Sep 24, 2009 at 7:57 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 - I know I said we should have just asynchronous replication at first,
 but looking ahead, how would you do synchronous?
 As the previous patch did, I'm going to make walsender read the latest
 XLOG from wal_buffers, introduce the signaling between a backend
 and walsender, and keep a backend waiting until the specified XLOG
 has been written or fsynced in the standby.
 Ok. I don't think walsender needs to access wal_buffers even then,
 though. Once the backend has written the WAL, walsender can well read it
 from disk (it will surely be in OS cache still).
 
 For performance reasons, I think that walsender should not delay sending
 the XLOG until the backend has written it. Otherwise, XLOG write and send
 are performed serially, which would increase response time. Shouldn't
 those be performed in parallel?

Well, sure, performance is good, but let's keep it simple for now. The
write() to disk should normally be absorbed by the OS cache and return
quickly, so it's not a big delay.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-25 Thread Fujii Masao
Hi,

On Fri, Sep 25, 2009 at 7:10 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Fujii Masao wrote:
 On Thu, Sep 24, 2009 at 7:57 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 - I know I said we should have just asynchronous replication at first,
 but looking ahead, how would you do synchronous?
 As the previous patch did, I'm going to make walsender read the latest
 XLOG from wal_buffers, introduce the signaling between a backend
 and walsender, and keep a backend waiting until the specified XLOG
 has been written or fsynced in the standby.
 Ok. I don't think walsender needs to access wal_buffers even then,
 though. Once the backend has written the WAL, walsender can well read it
 from disk (it will surely be in OS cache still).

 For performance reasons, I think that walsender should not delay sending
 the XLOG until the backend has written it. Otherwise, XLOG write and send
 are performed serially, which would increase response time. Shouldn't
 those be performed in parallel?

 Well, sure, performance is good, but let's keep it simple for now. The
 write() to disk should normally be absorbed by the OS cache and return
 quickly, so it's not a big delay.

Umm... at the least, a backend should tell walsender the location up to
which it has written XLOG before issuing fsync. In the current XLogWrite(),
XLogCtl->LogwrtResult.Write is updated only after the fsync has been
performed.
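A sketch of the ordering change being suggested inside XLogWrite(); the
surrounding code is simplified, but the volatile/spinlock idiom is the
one xlog.c already uses for LogwrtResult:

    /* ... after write() of the dirty XLOG pages has completed ... */
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;

        SpinLockAcquire(&xlogctl->info_lck);
        /* publish the write position *before* flushing, so walsender
         * can start sending while we fsync */
        xlogctl->LogwrtResult.Write = LogwrtResult.Write;
        SpinLockRelease(&xlogctl->info_lck);
    }
    issue_xlog_fsync();         /* then make it durable */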

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-24 Thread Fujii Masao
Hi,

Sorry for the delay.

On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Having gone through the patch now in more detail, I think it's in pretty
 good shape. I'm happy with the overall design, except that I haven't
 been able to make up my mind if walreceiver should indeed be a
 stand-alone program as discussed, or a postmaster child process as in
 the patch you submitted. Putting that question aside for a moment,
 here's some minor things, in no particular order:

Thanks for the comments.

 - The async API in PQgetXLogData is quite different from the other
 commands. It's close to the API from PQgetCopyData(), but doesn't return
 a malloc'd buffer like PQgetCopyData does. I presume that's to optimize
 away the extra memcpy step?

Yes. This is to avoid an extra memcpy.

 I don't think that's really necessary, I
 don't recall any complaints about that in PQgetCopyData(), and if it
 does become an issue, it could be optimized away by mallocing the buffer
 first and reading directly to that.

OK. I'll change PQgetXLogData() to return a malloc'd buffer, and will
remove PQmarkConsumed().

 - Can we avoid sprinkling XLogStreamingAllowed() calls to places where
 we check if WAL-logging is required (nbtsort.c, copy.c etc.). I think we
 need a new macro to encapsulate (XLogArchivingActive() ||
 XLogStreamingAllowed()).

Yes. I'll introduce a new macro XLogIsNeeded() which encapsulates
(XLogArchivingActive() || XLogStreamingAllowed()).
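In other words, roughly:

    /* true whenever complete WAL must be produced, for archiving or
     * for streaming to a standby */
    #define XLogIsNeeded() (XLogArchivingActive() || XLogStreamingAllowed())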

 - Is O_DIRECT ever a good idea in walreceiver? If it's really direct and
 doesn't get cached, the startup process will need to read from disk.

Good point. I agree that O_DIRECT is useless if walreceiver works
with the startup process. It might be useful only if a stand-alone
walreceiver program is executed in the standby.

 - Can we replace read/write_conninfo with just a long-enough field in
 shared mem? Would be simpler. (this is moot if we go with the
 stand-alone walreceiver program and pass it as a command-line argument)

Yes, if we can decide on the length of conninfo. Since I could not decide
that, I used read/write_conninfo to pass the conninfo to walreceiver. Is
a fixed size of 1024 bytes enough for conninfo?
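A hedged sketch of what that shared-memory area could look like; apart
from in_progress and conninfo, which are discussed in this thread, the
field names are illustrative:

    #define MAXCONNINFO 1024

    typedef struct
    {
        bool        in_progress;        /* is walreceiver running or
                                         * requested to start? */
        XLogRecPtr  receivedUpto;       /* how far WAL has been received
                                         * and written in the standby */
        char        conninfo[MAXCONNINFO];  /* fixed-size connection string
                                             * to the primary, replacing
                                             * read/write_conninfo */
    } WalRcvData;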

 - walreceiver shouldn't die on connection error, just to be restarted by
 startup process. Can we add error handling a la bgwriter and have a
 retry loop within walreceiver? (again, if we go with a stand-alone
 walreceiver program, it's probably better to have startup process
 responsible to restart walreceiver, as it is now)

Error handling a la bgwriter? You mean that PG_exception_stack
should be set up to handle an ERROR exception?

Anyway, I'll change walreceiver to retry connecting to the primary
after an error occurs in PQstartXLogStreaming()/PQgetXLogData()/
PQputXLogRecPtr(). Should we set an upper limit on the number of
retries?
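A hedged sketch of the bgwriter-style error handling combined with that
retry loop; conn and the streaming step are placeholders:

    sigjmp_buf  local_sigjmp_buf;

    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        /* An ERROR was thrown below: report it, drop the broken
         * connection, and nap before reconnecting instead of exiting. */
        EmitErrorReport();
        FlushErrorState();
        if (conn != NULL)
            PQfinish(conn);
        pg_usleep(5000000L);
    }
    PG_exception_stack = &local_sigjmp_buf;

    for (;;)
    {
        conn = PQconnectdb(conninfo);   /* (re)connect to the primary */
        /* ... stream WAL until an error is thrown, which longjmps back
         * to the handler above ... */
    }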

 - pq_wait in backend waits until you can read or write at least 1 byte.
 There is no guarantee that you can send or read the whole message
 without blocking. We'd have to put the socket in non-blocking mode for
 that. I'm not sure what the implications of this are.

Umm... AFAIK, poll and select guarantee that at least the subsequent
recv will not block. If there is only 1 byte available in the buffer,
recv would read that 1 byte and return immediately. I'm not sure whether
send can still block after poll succeeds. In my environment (RHEL5),
send does not seem to block.
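A minimal standalone illustration of that point:

    #include <poll.h>
    #include <sys/socket.h>

    /* Block until the socket is readable, then read whatever is there.
     * poll() reporting POLLIN only guarantees that >= 1 byte is
     * available; the recv() then returns at once with up to len bytes
     * rather than waiting for a whole message. */
    static ssize_t
    read_available(int sock, char *buf, size_t len)
    {
        struct pollfd pfd = { .fd = sock, .events = POLLIN };

        if (poll(&pfd, 1, -1) < 0)
            return -1;
        return recv(sock, buf, len, 0);
    }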

 - we should include system_identifier somewhere in the replication
 startup handshake. Otherwise you can connect to server from a different
 system and have logs shipped, if they happen to be roughly at the same
 point in WAL. Replay will almost certainly fail, but we should error
 earlier.

Agreed. I'll do that.

 - I know I said we should have just asynchronous replication at first,
 but looking ahead, how would you do synchronous?

As the previous patch did, I'm going to make walsender read the latest
XLOG from wal_buffers, introduce the signaling between a backend
and walsender, and keep a backend waiting until the specified XLOG
has been written or fsynced in the standby.

 What kind of signaling
 is needed between walreceiver and startup process for that?

I was thinking that the synchronization mode which a client waits
until XLOG has been applied is not necessary right now, so no
signaling is also not required between those processes yet. But,
HS requires this capability?

 - 'replication' shouldn't be a real database.

Agreed. I'll remove that.

 I found the paging logic in walsender confusing, and didn't like the
 idea that walsender needs to set the XLOGSTREAM_END_SEG flag. Surely
 walreceiver knows how to split the WAL into files without such a flag. I
 reworked that logic, I think it's easier to understand now. I kept the
 support for the flag in libpq and the protocol for now, but it should be
 removed too, or repurposed to indicate that pg_switch_xlog() was done in
 the master. I've pushed that to 'replication-orig' branch in my git
 repository, attached is the same as a diff against your SR_0914.patch.

Re: walreceiver settings Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-24 Thread Fujii Masao
Hi,

On Mon, Sep 21, 2009 at 1:55 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 The startup process could capture stderr from walreceiver and forward it
 with elog(LOG).

The startup process should also obtain the message level in some way
(a pipe?), and handle the messages according to it. It's confusing if
every message is output at LOG level.

 The startup process could kill and restart walreceiver to reload. If
 reloading is really required, that is.

I think that it's confusing that SIGHUP kills a process. If walreceiver is
being monitored, the monitoring tool might raise a false alert.

 Which GUC parameters are we
 concerned about? The ones related to logging you mentioned, but if we
 handle logging via a pipe to the startup process, that won't be an issue.

wal_sync_method and fsync. At least I'd like to use fdatasync instead of
fsync for performance improvement.
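For reference, the difference being asked for, in miniature; whether the
standby's walreceiver gets to make this choice is exactly the open
question:

    #include <stdbool.h>
    #include <unistd.h>

    /* fdatasync() may skip flushing the file's metadata (e.g. mtime),
     * which for preallocated WAL segments can save a disk write per
     * flush compared with fsync(). */
    static int
    sync_wal_file(int fd, bool use_fdatasync)
    {
        return use_fdatasync ? fdatasync(fd) : fsync(fd);
    }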

 Sounds complicated..

 One option that you might well want to change on the fly is the
 connection info string in recovery.conf. Neither of the above really
 cater for that, unless we make walreceiver read recovery.conf as well. I
 think we should keep walreceiver quite dumb.

Agreed.

 4) Change walreceiver back to a child process of postmaster.

 Yeah, that's not out of the question either.

I like this simple approach. But, as you pointed out, the way the
original patch launches walreceiver is not robust. We need to add some
code, following the example of the autovacuum launcher and worker.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-24 Thread Fujii Masao
Hi,

On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I found the paging logic in walsender confusing, and didn't like the
 idea that walsender needs to set the XLOGSTREAM_END_SEG flag. Surely
 walreceiver knows how to split the WAL into files without such a flag. I
 reworked that logic, I think it's easier to understand now. I kept the
 support for the flag in libpq and the protocol for now, but it should be
 removed too, or repurposed to indicate that pg_switch_xlog() was done in
 the master. I've pushed that to 'replication-orig' branch in my git
 repository, attached is the same as a diff against your SR_0914.patch.

In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
file after receiving new XLOG records, before writing them. This would
increase the backend's wait for replication in the synchronous case.
Shouldn't walreceiver fsync the XLOG file after sending the ACK (if
needed), before receiving the next XLOG records?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-24 Thread Heikki Linnakangas
Fujii Masao wrote:
 In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
 file after receiving new XLOG records, before writing them. This would
 increase the backend's wait for replication in the synchronous case.
 Shouldn't walreceiver fsync the XLOG file after sending the ACK (if
 needed), before receiving the next XLOG records?

I don't follow. Walreceiver does fsync the file just after writing it if
the fsync_requested flag was set in the message. Surely that would be
set in synchronous mode, that's what the flag is for, right?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-24 Thread Fujii Masao
Hi,

On Thu, Sep 24, 2009 at 7:41 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Fujii Masao wrote:
 In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
 file after receiving new XLOG records, before writing them. This would
 increase the backend's wait for replication in the synchronous case.
 Shouldn't walreceiver fsync the XLOG file after sending the ACK (if
 needed), before receiving the next XLOG records?

 I don't follow. Wareceiver does fsync the file just after writing it if
  the fsync_requested flag was set in the message. Surely that would be
 set in synchronous mode, that's what the flag is for, right?

I mean the case where fsync is issued at the end of a segment.
In that case, since the fsync_requested flag is not set,
walreceiver doesn't perform fsync in that loop. Only after the
next XLOG arrives does walreceiver fsync the previous file,
in XLogWalRcvWrite().

Am I missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-24 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Mon, Sep 21, 2009 at 4:51 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 - Can we replace read/write_conninfo with just a long-enough field in
 shared mem? Would be simpler. (this is moot if we go with the
 stand-alone walreceiver program and pass it as a command-line argument)
 
 Yes, if we can decide on the length of conninfo. Since I could not decide
 that, I used read/write_conninfo to pass the conninfo to walreceiver. Is
 a fixed size of 1024 bytes enough for conninfo?

Yeah, that should be plenty.

 - walreceiver shouldn't die on connection error, just to be restarted by
 startup process. Can we add error handling a la bgwriter and have a
 retry loop within walreceiver? (again, if we go with a stand-alone
 walreceiver program, it's probably better to have startup process
 responsible to restart walreceiver, as it is now)
 
 Error handling a la bgwriter? You mean that PG_exception_stack
 should be set up to handle an ERROR exception?

Yep.

 Anyway, I'll change walreceiver to retry connecting to the primary
 after an error occurs in PQstartXLogStreaming()/PQgetXLogData()/
 PQputXLogRecPtr(). Should we set an upper limit on the number of
 retries?

I don't think we need an upper limit.

 - pq_wait in backend waits until you can read or write at least 1 byte.
 There is no guarantee that you can send or read the whole message
 without blocking. We'd have to put the socket in non-blocking mode for
 that. I'm not sure what the implications of this are.
 
 Umm... AFAIK, poll and select guarantee that at least the subsequent
 recv will not block. If there is only 1 byte available in the buffer,
 recv would read that 1 byte and return immediately. I'm not sure whether
 send can still block after poll succeeds. In my environment (RHEL5),
 send does not seem to block.

Hmm, I guess you're right.

 - I know I said we should have just asynchronous replication at first,
 but looking ahead, how would you do synchronous?
 
 As the previous patch did, I'm going to make walsender read the latest
 XLOG from wal_buffers, introduce the signaling between a backend
 and walsender, and keep a backend waiting until the specified XLOG
 has been written or fsynced in the standby.

Ok. I don't think walsender needs to access wal_buffers even then,
though. Once the backend has written the WAL, walsender can well read it
from disk (it will surely be in OS cache still).

 What kind of signaling
 is needed between walreceiver and startup process for that?
 
 I was thinking that the synchronization mode in which a client waits
 until XLOG has been applied is not necessary right now, so no
 signaling is required between those processes yet. But HS requires
 this capability?

Yeah, I think it will be important with hot standby. It's a much more
useful guarantee that once COMMIT returns, the transaction is visible
in the standby, than that it's merely fsync'd to disk in the standby.

(don't need to solve it now, let's do just asynchronous mode now, but
it's something to keep in mind)

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-24 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Thu, Sep 24, 2009 at 7:41 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Fujii Masao wrote:
 In the 'replication-orig' branch, walreceiver fsyncs the previous XLOG
 file after receiving new XLOG records, before writing them. This would
 increase the backend's wait for replication in the synchronous case.
 Shouldn't walreceiver fsync the XLOG file after sending the ACK (if
 needed), before receiving the next XLOG records?
 I don't follow. Walreceiver does fsync the file just after writing it if
 the fsync_requested flag was set in the message. Surely that would be
 set in synchronous mode, that's what the flag is for, right?
 
 I mean the case where fsync is issued at the end of a segment.
 In that case, since the fsync_requested flag is not set,
 walreceiver doesn't perform fsync in that loop. Only after the
 next XLOG arrives does walreceiver fsync the previous file,
 in XLogWalRcvWrite().

Ok. I don't see anything wrong with that. If the primary didn't set
fsync_requested, it's not in a hurry to get an acknowledgment.

I guess we could check *after* writing whether we just finished filling
the segment. If we did, we could fsync immediately, since we're going to
fsync anyway as soon as we receive the next message. Not sure if it's
worth the trouble.
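A sketch of that check; XLogWalRcvWrite() is the function named earlier
in the thread, while the flush helper and the flag are illustrative:

    /* write the records just received, ending at recptr */
    XLogWalRcvWrite(buf, nbytes, recptr);

    if (fsync_requested)
        XLogWalRcvFlush();      /* the primary asked for a flush */
    else if (recptr.xrecoff % XLogSegSize == 0)
        XLogWalRcvFlush();      /* just filled a segment: fsync now
                                 * instead of lazily before the next
                                 * write */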

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-21 Thread Heikki Linnakangas
Having gone through the patch now in more detail, I think it's in pretty
good shape. I'm happy with the overall design, except that I haven't
been able to make up my mind if walreceiver should indeed be a
stand-alone program as discussed, or a postmaster child process as in
the patch you submitted. Putting that question aside for a moment,
here's some minor things, in no particular order:

- The async API in PQgetXLogData is quite different from the other
commands. It's close to the API from PQgetCopyData(), but doesn't return
a malloc'd buffer like PQgetCopyData does. I presume that's to optimize
away the extra memcpy step? I don't think that's really necessary, I
don't recall any complaints about that in PQgetCopyData(), and if it
does become an issue, it could be optimized away by mallocing the buffer
first and reading directly to that.

- Can we avoid sprinkling XLogStreamingAllowed() calls to places where
we check if WAL-logging is required (nbtsort.c, copy.c etc.). I think we
need a new macro to encapsulate (XLogArchivingActive() ||
XLogStreamingAllowed()).

- Is O_DIRECT ever a good idea in walreceiver? If it's really direct and
doesn't get cached, the startup process will need to read from disk.

- Can we replace read/write_conninfo with just a long-enough field in
shared mem? Would be simpler. (this is moot if we go with the
stand-alone walreceiver program and pass it as a command-line argument)

- walreceiver shouldn't die on connection error, just to be restarted by
startup process. Can we add error handling a la bgwriter and have a
retry loop within walreceiver? (again, if we go with a stand-alone
walreceiver program, it's probably better to have startup process
responsible to restart walreceiver, as it is now)

- pq_wait in backend waits until you can read or write at least 1 byte.
There is no guarantee that you can send or read the whole message
without blocking. We'd have to put the socket in non-blocking mode for
that. I'm not sure what the implications of this are.

- we should include system_identifier somewhere in the replication
startup handshake. Otherwise you can connect to server from a different
system and have logs shipped, if they happen to be roughly at the same
point in WAL. Replay will almost certainly fail, but we should error
earlier.

- I know I said we should have just asynchronous replication at first,
but looking ahead, how would you do synchronous? What kind of signaling
is needed between walreceiver and startup process for that?

- 'replication' shouldn't be a real database.


I found the paging logic in walsender confusing, and didn't like the
idea that walsender needs to set the XLOGSTREAM_END_SEG flag. Surely
walreceiver knows how to split the WAL into files without such a flag. I
reworked that logic, I think it's easier to understand now. I kept the
support for the flag in libpq and the protocol for now, but it should be
removed too, or repurposed to indicate that pg_switch_xlog() was done in
the master. I've pushed that to 'replication-orig' branch in my git
repository, attached is the same as a diff against your SR_0914.patch.

I need a break from this patch, so I'll take a closer look at Simon's
hot standby now. Meanwhile, can you work on the above items and submit a
new version, please?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/transam/recovery.conf.sample
--- b/src/backend/access/transam/recovery.conf.sample
***
*** 2,10 
  # PostgreSQL recovery config file
  # ---
  #
! # Edit this file to provide the parameters that PostgreSQL
! # needs to perform an archive recovery of a database, or
! # a log-streaming replication.
  #
  # If recovery.conf is present in the PostgreSQL data directory, it is
  # read on postmaster startup.  After successful recovery, it is renamed
--- 2,10 
  # PostgreSQL recovery config file
  # ---
  #
! # Edit this file to provide the parameters that PostgreSQL needs to
! # perform an archive recovery of a database, or to act as a log-streaming
! # replication standby.
  #
  # If recovery.conf is present in the PostgreSQL data directory, it is
  # read on postmaster startup.  After successful recovery, it is renamed
***
*** 83,89 
  #---
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # the standby. It tries to connect to the primary according to the
  # connection settings primary_conninfo, and receives XLOG records
  # continuously.
  #
--- 83,89 
  #---
  #
  # When standby_mode is enabled, the PostgreSQL server will work as
! # a standby. It tries to connect to the primary according to the
  # connection settings primary_conninfo, and receives XLOG records
  # continuously.
  #
*** a/src/backend/access/transam/xlog.c

Re: walreceiver settings Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-20 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Fri, Sep 18, 2009 at 7:34 PM, Fujii Masao masao.fu...@gmail.com wrote:
 This approach is OK as long as the startup process manages the
 stand-alone walreceiver the way the postmaster manages a child process:

 * Handling of some interrupts: SIGHUP, SIGTERM?, SIGINT, SIGQUIT...
   For example, the startup process would need to forward the interrupt
   from the postmaster to walreceiver.

 * Communication with other child processes: stats collector? syslogger?...
   For example, the log messages generated by walreceiver should also
   be collected by the syslogger if requested.
 
 Also, we should consider how to pass GUC parameters to the stand-alone
 walreceiver. In the initial patch, since walreceiver was a child process
 of the postmaster, it could easily read any GUC parameter. But it's not
 so easy to pass GUC parameters to a stand-alone program.

Yes, good point.

 * some parameters for logging
   I think that the log messages generated by walreceiver should be
   treated the same as other postgres messages. For example, I'd like
   log_line_prefix to apply to walreceiver as well.

Hmm, agreed, although we already have the same problem with
archive_command, and pg_standby in particular. I could live with that
for now.

The startup process could capture stderr from walreceiver and forward it
with elog(LOG).
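A hedged sketch of that forwarding loop, assuming the startup process
holds the read end of the child's stderr pipe in walrcv_stderr_fd:

    char    line[1024];
    FILE   *fp = fdopen(walrcv_stderr_fd, "r");

    while (fgets(line, sizeof(line), fp) != NULL)
    {
        /* strip the trailing newline and re-log the child's message */
        line[strcspn(line, "\n")] = '\0';
        elog(LOG, "walreceiver: %s", line);
    }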

 There are several approaches for passing a GUC parameter to walreceiver.
 Which is best?

 1) Give a parameter as a command-line argument of the stand-alone
 walreceiver. This is a straightforward approach, but wouldn't cover
 reloading parameters.

The startup process could kill and restart walreceiver to reload. If
reloading is really required, that is. Which GUC parameters are we
concerned about? The ones related to logging you mentioned, but if we
handle logging via a pipe to the startup process, that won't be an issue.

 2) Give a parameter via a pipe between the startup process and walreceiver.

 3) Change walreceiver to read a configuration file. The problem is that
 the postmaster's command-line arguments wouldn't affect walreceiver.
 A combination of 1) and 3) might be required.

Sounds complicated..

One option that you might well want to change on the fly is the
connection info string in recovery.conf. Neither of the above really
cater for that, unless we make walreceiver read recovery.conf as well. I
think we should keep walreceiver quite dumb.

 4) Change walreceiver back to a child process of postmaster.

Yeah, that's not out of the question either.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



walreceiver settings Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-19 Thread Fujii Masao
Hi,

On Fri, Sep 18, 2009 at 7:34 PM, Fujii Masao masao.fu...@gmail.com wrote:
 This approach is OK as long as the startup process manages the
 stand-alone walreceiver the way the postmaster manages a child process:

 * Handling of some interrupts: SIGHUP, SIGTERM?, SIGINT, SIGQUIT...
   For example, the startup process would need to forward the interrupt
   from the postmaster to walreceiver.

 * Communication with other child processes: stats collector? syslogger?...
   For example, the log messages generated by walreceiver should also
   be collected by the syslogger if requested.

Also, we should consider how to pass GUC parameters to the stand-alone
walreceiver. In the initial patch, since walreceiver was a child process
of the postmaster, it could easily read any GUC parameter. But it's not
so easy to pass GUC parameters to a stand-alone program.

I think that at least the following parameters should affect walreceiver:

* wal_sync_method
  I want walreceiver to use fdatasync instead of fsync for better
  performance. And other DBAs might want to choose another method.

* fsync
  I wouldn't be surprised if someone wants to disable fsync in the standby.

* some parameters for logging
  I think that the log messages generated by walreceiver should be
  treated the same as other postgres messages. For example, I'd like
  log_line_prefix to apply to walreceiver as well.

There are several approaches for passing a GUC parameter to walreceiver.
Which is best?

1) Give a parameter as a command-line argument of the stand-alone
walreceiver. This is a straightforward approach, but wouldn't cover
reloading parameters.

2) Give a parameter via a pipe between the startup process and walreceiver.

3) Change walreceiver to read a configuration file. The problem is that
the postmaster's command-line arguments wouldn't affect walreceiver.
A combination of 1) and 3) might be required.

4) Change walreceiver back to a child process of postmaster.

Do you have a better approach?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-18 Thread Heikki Linnakangas
Heikki Linnakangas wrote:
 Heikki Linnakangas wrote:
 I'm thinking that walreceiver should be a stand-alone program that the
 startup process launches, similar to how it invokes restore_command in
 PITR recovery. Instead of using system(), though, it would use
 fork+exec, and a pipe to communicate.
 
 Here's a WIP patch to do that, over your latest posted patch. I've also
 pushed this to my git repository at
 git://git.postgresql.org/git/users/heikki/postgres.git, replication
 branch.
 
 I'll continue reviewing...

BTW, my modified patch doesn't correctly zero-fill new WAL segments.
Needs to be fixed...

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-18 Thread Fujii Masao
Hi,

On Fri, Sep 18, 2009 at 2:47 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Heikki Linnakangas wrote:
 I'm thinking that walreceiver should be a stand-alone program that the
 startup process launches, similar to how it invokes restore_command in
 PITR recovery. Instead of using system(), though, it would use
 fork+exec, and a pipe to communicate.

 Here's a WIP patch to do that, over your latest posted patch. I've also
 pushed this to my git repository at
 git://git.postgresql.org/git/users/heikki/postgres.git, replication
 branch.

In my environment, I cannot use the git protocol for some reason.
Could you export your repository so that it can also be accessed via http?
BTW, I seem to be able to access http://git.postgresql.org/git/bucardo.git.
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#exporting-via-http

How should we advance development of SR?
Should I concentrate on the primary side and leave the standby side to you?
When I change something, should I make a patch against the latest SR
source in your git repo and submit it?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-18 Thread Heikki Linnakangas
Fujii Masao wrote:
 Hi,
 
 On Fri, Sep 18, 2009 at 2:47 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 Heikki Linnakangas wrote:
 I'm thinking that walreceiver should be a stand-alone program that the
 startup process launches, similar to how it invokes restore_command in
 PITR recovery. Instead of using system(), though, it would use
 fork+exec, and a pipe to communicate.
 Here's a WIP patch to do that, over your latest posted patch. I've also
 pushed this to my git repository at
 git://git.postgresql.org/git/users/heikki/postgres.git, replication
 branch.
 
 In my environment, I cannot use git protocol for some reason.
 Could you export your repository so that it can be accessed also via http?

Sure, it should be accessible via HTTP as well:
http://git.postgresql.org/git/users/heikki/postgres.git

 How should we advance development of SR?
 Should I concentrate on the primary side and leave the standby side to you?
 When I change something, should I make a patch against the latest SR
 source in your git repo and submit it?

Hmm, yeah, let's do that.

Right now, I'm trying to understand the page boundary stuff and partial
page handling in ReadRecord and walsender.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-18 Thread Fujii Masao
Hi,

On Thu, Sep 17, 2009 at 5:08 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I'm thinking that walreceiver should be a stand-alone program that the
 startup process launches, similar to how it invokes restore_command in
 PITR recovery. Instead of using system(), though, it would use
 fork+exec, and a pipe to communicate.

This approach is OK as long as the startup process manages the
stand-alone walreceiver the way the postmaster manages a child process:

* Handling of some interrupts: SIGHUP, SIGTERM?, SIGINT, SIGQUIT...
   For example, the startup process would need to forward the interrupt
   from the postmaster to walreceiver.

* Communication with other child processes: stats collector? syslogger?...
   For example, the log messages generated by walreceiver should also
   be collected by the syslogger if requested.

For now, I think that a pipe is enough for communication between the
startup process and walreceiver. There was the idea of passing XLOG to
the startup process via wal_buffers, for which a pipe is not suitable,
but I think that is overkill.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-17 Thread Heikki Linnakangas
Fujii Masao wrote:
 On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 After playing with this a little bit, I think we need logic in the slave
 to reconnect to the master if the connection is broken for some reason,
 or can't be established in the first place. At the moment, that is
 considered as the end of recovery, and the slave starts up. You have the
 trigger file mechanism to stop that, but it only gives you a chance to
 manually kill and restart the slave before it chooses a new timeline and
 starts up, it doesn't reconnect automatically.
 
 I was thinking that the automatic reconnection capability is a TODO item
 for a later CF. The infrastructure for it has already been introduced in
 the current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
 postmaster/walreceiver.c). This is the maximum number of times to retry
 walreceiver. In the current version, this is a fixed value, but we can
 make it user-configurable (a recovery.conf parameter would be suitable,
 I think).

Ah, I see.

Robert Haas suggested a while ago that walreceiver could be a
stand-alone utility, not requiring postmaster at all. That would allow
you to set up streaming replication as another way to implement WAL
archiving. Looking at how the processes interact, there really isn't
much communication between walreceiver and the rest of the system, so
that sounds pretty attractive.

Walreceiver only needs access to shared memory so that it can tell the
startup process how far it has replicated already. Even when we add the
synchronous capability, I don't think we need any more inter-process
communication. Only if we wanted to acknowledge to the master when a
piece of WAL has been successfully replayed would the startup process
need to tell walreceiver about it, but I think we're going to settle
for acknowledging when a piece of log has been fsync'd to disk.

Walreceiver is really a slave to the startup process. The startup
process decides when it's launched, and it's the startup process that
then waits for it to advance. But the way it's set up at the moment, the
startup process needs to ask the postmaster to start it up, and it
doesn't look very robust to me. For example, if launching walreceiver
fails for some reason, startup process will just hang waiting for it.

I'm thinking that walreceiver should be a stand-alone program that the
startup process launches, similar to how it invokes restore_command in
PITR recovery. Instead of using system(), though, it would use
fork+exec, and a pipe to communicate.
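A minimal sketch of that launch sequence; the program name and argument
list are illustrative:

    int     pipefd[2];
    pid_t   pid;

    if (pipe(pipefd) < 0)
        elog(FATAL, "could not create pipe: %m");

    pid = fork();
    if (pid == 0)
    {
        /* child: wire stderr to the pipe and become walreceiver */
        close(pipefd[0]);
        dup2(pipefd[1], STDERR_FILENO);
        execl("walreceiver", "walreceiver", conninfo, (char *) NULL);
        _exit(1);               /* only reached if exec failed */
    }
    close(pipefd[1]);           /* parent keeps the read end; unlike
                                 * with system(), the startup process
                                 * keeps running while the child works */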

Also, when we get around to implementing the feature to fetch the base
backup automatically via the TCP connection, we can't use walreceiver
as it is now for that, because there's no hope of starting the system up
that far without a base backup. I'm not sure if it can or should be
merged with the walreceiver program, but it can't be a postmaster child
process, that's for sure.

Thoughts?

 Also, a parameter like retries_interval might be necessary. This
 parameter would specify the interval between reconnection attempts.

Yeah, maybe, although a hard-coded interval of a few seconds should be
enough to get us started.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-17 Thread Magnus Hagander
On Thu, Sep 17, 2009 at 10:08, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Fujii Masao wrote:
 On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 After playing with this a little bit, I think we need logic in the slave
 to reconnect to the master if the connection is broken for some reason,
 or can't be established in the first place. At the moment, that is
 considered as the end of recovery, and the slave starts up. You have the
 trigger file mechanism to stop that, but it only gives you a chance to
 manually kill and restart the slave before it chooses a new timeline and
 starts up, it doesn't reconnect automatically.

 I was thinking that the automatic reconnection capability is a TODO item
 for a later CF. The infrastructure for it has already been introduced in
 the current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
 postmaster/walreceiver.c). This is the maximum number of times to retry
 walreceiver. In the current version, this is a fixed value, but we can
 make it user-configurable (a recovery.conf parameter would be suitable,
 I think).

 Ah, I see.

 Robert Haas suggested a while ago that walreceiver could be a
 stand-alone utility, not requiring postmaster at all. That would allow
 you to set up streaming replication as another way to implement WAL
 archiving. Looking at how the processes interact, there really isn't
 much communication between walreceiver and the rest of the system, so
 that sounds pretty attractive.

Yes, that would be very very useful.


 Walreceiver is really a slave to the startup process. The startup
 process decides when it's launched, and it's the startup process that
 then waits for it to advance. But the way it's set up at the moment, the
 startup process needs to ask the postmaster to start it up, and it
 doesn't look very robust to me. For example, if launching walreceiver
 fails for some reason, startup process will just hang waiting for it.

 I'm thinking that walreceiver should be a stand-alone program that the
 startup process launches, similar to how it invokes restore_command in
 PITR recovery. Instead of using system(), though, it would use
 fork+exec, and a pipe to communicate.

Not having looked at all into the details, that sounds like a nice
improvement :-)


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-17 Thread Heikki Linnakangas
Some random comments:

I don't think we need the new PM_SHUTDOWN_3 postmaster state. We can
treat walsenders the same as the archive process, and kill and wait for
both of them to die in PM_SHUTDOWN_2 state.

I think there's something wrong with the napping in walsender. When I
perform pg_switch_xlog(), it takes surprisingly long for it to trickle
to the standby. When I put a little proxy program in between the master
and slave that delays all messages from the slave to the master by one
second, it got worse, even though I would expect the master to still
keep sending WAL at full speed. I get logs like this:

2009-09-17 14:13:16.876 EEST LOG:  xlog send request 0/3800; send
0/376C; write 0/376C
2009-09-17 14:13:16.877 EEST LOG:  xlog read request 0/3701; send
0/3701; write 0/376C
2009-09-17 14:13:17.077 EEST LOG:  xlog send request 0/3800; send
0/3701; write 0/376C
2009-09-17 14:13:17.077 EEST LOG:  xlog read request 0/3702; send
0/3702; write 0/376C
2009-09-17 14:13:17.078 EEST LOG:  xlog read request 0/3703; send
0/3703; write 0/376C
2009-09-17 14:13:17.278 EEST LOG:  xlog send request 0/3800; send
0/3703; write 0/376C
2009-09-17 14:13:17.279 EEST LOG:  xlog read request 0/3704; send
0/3704; write 0/376C
...
2009-09-17 14:13:22.796 EEST LOG:  xlog read request 0/37FD; send
0/37FD; write 0/376D
2009-09-17 14:13:22.896 EEST LOG:  xlog send request 0/3800; send
0/37FD; write 0/376D
2009-09-17 14:13:22.896 EEST LOG:  xlog read request 0/37FE; send
0/37FE; write 0/376D
2009-09-17 14:13:22.896 EEST LOG:  xlog read request 0/37FF; send
0/37FF; write 0/376D
2009-09-17 14:13:22.897 EEST LOG:  xlog read request 0/3800; send
0/3800; write 0/376D
2009-09-17 14:14:09.932 EEST LOG:  xlog send request 0/38000428; send
0/3800; write 0/3800
2009-09-17 14:14:09.932 EEST LOG:  xlog read request 0/38000428; send
0/38000428; write 0/3800

It looks like it's having 100 or 200 ms naps in between. Also, I
wouldn't expect to see so many read request acknowledgments from the
slave. The master doesn't really need to know how far the slave is,
except in synchronous replication when it has requested a flush to the
slave. Another reason why the master needs to know is so that it can
recycle old log files, but for that we'd really only need an
acknowledgment once per WAL file or even less.

Why does XLogSend() care about page boundaries? Perhaps it's a leftover
from the old approach that read from wal_buffers?

Do we really need the support for asynchronous backend libpq commands?
Could walsender just keep blasting WAL to the slave, and only try to
read an acknowledgment after it has requested one by setting the
XLOGSTREAM_FLUSH flag? Or maybe we should be putting the socket into
non-blocking mode.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-17 Thread Csaba Nagy
On Thu, 2009-09-17 at 10:08 +0200, Heikki Linnakangas wrote:
 Robert Haas suggested a while ago that walreceiver could be a
 stand-alone utility, not requiring postmaster at all. That would allow
 you to set up streaming replication as another way to implement WAL
 archiving. Looking at how the processes interact, there really isn't
 much communication between walreceiver and the rest of the system, so
 that sounds pretty attractive.

Just a small comment in this direction: what if the archive would be
itself a postgres DB, and it would collect the WALs in some special
place (together with some meta data, snapshots, etc), and then a slave
could connect to it just like to any other master ? (except maybe it
could specify which snapshot to to start with and possibly choosing
between different archived WAL streams).

Maybe it is completely stupid what I'm saying, but I see the archive as
just another form of a postgres server, with the same protocol from the
POV of a slave. While I don't have the clue to implement such a thing, I
thought it might be interesting as an idea while discussing the
walsender/receiver interface...

Cheers,
Csaba.





Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-17 Thread Fujii Masao
Hi,

On Thu, Sep 17, 2009 at 8:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Some random comments:

Thanks for the comments.

 I don't think we need the new PM_SHUTDOWN_3 postmaster state. We can
 treat walsenders the same as the archive process, and kill and wait for
 both of them to die in PM_SHUTDOWN_2 state.

OK, I'll use PM_SHUTDOWN_2 for walsender instead of PM_SHUTDOWN_3.

 I think there's something wrong with the napping in walsender. When I
 perform pg_switch_xlog(), it takes surprisingly long for it to trickle
 to the standby. When I put a little proxy program in between the master
 and slave that delays all messages from the slave to the master by one
 second, it got worse, even though I would expect the master to still
 keep sending WAL at full speed. I get logs like this:

Probably this is because XLOG records following XLOG_SWITCH are
sent to the standby, too. Though those records are obviously not used
for recovery, they are sent because walsender doesn't know where
XLOG_SWITCH is.

The difficulty is that there might be many XLOG_SWITCHes in the XLOG
files which are going to be sent by walsender. How would walsender get
to know those locations? One possible solution is to make walsender
parse the XLOG files and search for XLOG_SWITCH records, but that is
overkill, I think.

I don't think an XLOG switch is requested often, or is sensitive to
response time in many cases. So it's not worth changing walsender
to skip the XLOG following XLOG_SWITCH, I think. Thoughts?

 2009-09-17 14:14:09.932 EEST LOG:  xlog send request 0/38000428; send
 0/3800; write 0/3800
 2009-09-17 14:14:09.932 EEST LOG:  xlog read request 0/38000428; send
 0/38000428; write 0/3800

 It looks like it's having 100 or 200 ms naps in between. Also, I
 wouldn't expect to see so many read request acknowledgments from the
 slave. The master doesn't really need to know how far the slave is,
 except in synchronous replication when it has requested a flush to
 slave. Another reason why master needs to know is so that the master can
 recycle old log files, but for that we'd really only need an
 acknowledgment once per WAL file or even less.

You mean that a new protocol for asking the standby about the completion
location of replication is required? In the synchronous case, the backend
should not have to wait for one acknowledgement per XLOG file, for
performance reasons.

 Why does XLogSend() care about page boundaries? Perhaps it's a leftover
 from the old approach that read from wal_buffers?

That is to avoid sending a partially-filled XLOG *record*, which
simplifies the logic by which the startup process waits for the next
available XLOG record; i.e., the startup process doesn't need to deal
with a partially-sent record.

 Do we really need the support for asynchronous backend libpq commands?
 Could walsender just keep blasting WAL to the slave, and only try to
 read an acknowledgment after it has requested one, by setting
 XLOGSTREAM_FLUSH flag. Or maybe we should be putting the socket into
 non-blocking mode.

Yes, it is required, especially for synchronous replication. Receiving
the acknowledgement should not keep subsequent XLOG sends waiting.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-16 Thread Fujii Masao
Hi,

On Wed, Sep 16, 2009 at 11:37 AM, Fujii Masao masao.fu...@gmail.com wrote:
 I was thinking that the automatic reconnection capability is a TODO item
 for a later CF. The infrastructure for it has already been introduced in the
 current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
 postmaster/walreceiver.c). This is the maximum number of times to retry
 starting walreceiver. In the current version, this is a fixed value, but we
 can make it user-configurable (a recovery.conf parameter would be suitable,
 I think).

 Also, a parameter like retries_interval might be necessary. This parameter
 would specify the interval between reconnection attempts.

 Do you think that these parameters should be introduced right now, or
 in a later CF?
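
For what it's worth, the loop the two parameters would control is
trivial -- a sketch using the proposed names (only MAX_WALRCV_RETRIES
exists in the patch today; the helpers here are invented, pg_usleep()
is the stock backend one):

    int attempt;

    for (attempt = 0; attempt < max_walrcv_retries; attempt++)
    {
        if (walrcv_connect(conninfo))       /* hypothetical */
            break;                          /* connected; resume streaming */
        pg_usleep(retries_interval * 1000000L); /* seconds -> microseconds */
    }
    if (attempt >= max_walrcv_retries)
        proc_exit(1);       /* give up; let the startup process decide */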

I updated the TODO list on the wiki, and marked the items that I'm going to
develop for a later CommitFest.
http://wiki.postgresql.org/wiki/Streaming_Replication#Todo_and_Claim

Do you have any other TODO items? How high is their priority?
And is there an already-listed TODO item which should be developed right
now (CommitFest 2009-09)?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-15 Thread Fujii Masao
Hi,

On Tue, Sep 15, 2009 at 2:54 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 The first thing that caught my eye is that I don't think "replication"
 should be a real database. Rather, it should be a keyword in
 pg_hba.conf, like the existing "all", "sameuser", "samerole" keywords
 that you can put into the database column.

I'll try that! It might only be necessary to prevent walsender from accessing
pg_database to check whether the target database is present, in InitPostgres().
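
So under that proposal a pg_hba.conf entry for a standby might look
something like this (illustrative only; the user and address are made
up):

    # TYPE  DATABASE      USER       CIDR-ADDRESS        METHOD
    host    replication   rep_user   192.168.1.0/24      md5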

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-15 Thread Heikki Linnakangas
Kevin Grittner wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 Kevin Grittner wrote:
  
 IMO, it would be best if the status could be sent via NOTIFY.
 To where?
  
 To registered listeners?
  
 I guess I should have worded that as "it would be best if a change in
 replication status could be signaled via NOTIFY" -- does that satisfy,
 or am I missing your point entirely?

Ok, makes more sense now.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-15 Thread Heikki Linnakangas
After playing with this a little bit, I think we need logic in the slave
to reconnect to the master if the connection is broken for some reason,
or can't be established in the first place. At the moment, that is
considered the end of recovery, and the slave starts up. You have the
trigger file mechanism to stop that, but it only gives you a chance to
manually kill and restart the slave before it chooses a new timeline and
starts up; it doesn't reconnect automatically.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-15 Thread Fujii Masao
Hi,

On Tue, Sep 15, 2009 at 7:53 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 After playing with this a little bit, I think we need logic in the slave
 to reconnect to the master if the connection is broken for some reason,
 or can't be established in the first place. At the moment, that is
 considered the end of recovery, and the slave starts up. You have the
 trigger file mechanism to stop that, but it only gives you a chance to
 manually kill and restart the slave before it chooses a new timeline and
 starts up; it doesn't reconnect automatically.

I was thinking that the automatic reconnection capability is a TODO item
for a later CF. The infrastructure for it has already been introduced in the
current patch. Please see the macro MAX_WALRCV_RETRIES (backend/
postmaster/walreceiver.c). This is the maximum number of times to retry
starting walreceiver. In the current version, this is a fixed value, but we
can make it user-configurable (a recovery.conf parameter would be suitable,
I think).

Also, a parameter like retries_interval might be necessary. This parameter
would specify the interval between reconnection attempts.

Do you think that these parameters should be introduced right now, or
in a later CF?

BTW, these parameters are provided in MySQL replication.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Heikki Linnakangas
Greg Smith wrote:
 Putting on my DBA hat for a minute, the first question I see people
 asking is "how do I measure how far behind the slaves are?".  Presumably
 you can get that out of pg_controldata; my first question is whether
 that's complete enough information.  If not, what else should be monitored?
 
 I don't think running that program is going to fly for a production-quality
 integrated replication setup, though.  The UI admins are going to want is
 one that allows querying this easily via a standard database query.  Most
 monitoring systems can issue psql queries but not necessarily run a
 remote binary.  I think that parts of pg_controldata need to get
 exposed via some number of built-in UDFs instead, and whatever new
 internal state makes sense too.  I could help out writing those, if
 someone more familiar with the replication internals can help me nail
 down a spec on what to watch.

Yep, assuming for a moment that hot standby goes into 8.5, status
functions that return such information are the natural interface. It
should be trivial to write them as soon as hot standby and streaming
replication are in place.
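
Something along these lines, presumably (pg_current_xlog_location()
already exists; the standby-side function name below is purely
illustrative, not anything in the patch):

    -- On the master:
    SELECT pg_current_xlog_location();

    -- On the standby (hypothetical function, for illustration):
    SELECT pg_last_received_xlog_location();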

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Andrew Dunstan



Greg Smith wrote:
This is looking really neat now; making async replication really solid 
first before even trying to move on to sync is the right way to go 
here, IMHO.


I agree with both of those sentiments.

One question I have is: what is the level of traffic involved between the 
master and the slave? I know a number of people have found the traffic 
involved in shipping log files to be a pain, and thus we get things 
like pglesslog.


cheers

andrew



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Kevin Grittner
Greg Smith gsm...@gregsmith.com wrote:
 
 Putting on my DBA hat for a minute, the first question I see people
 asking is "how do I measure how far behind the slaves are?". 
 Presumably you can get that out of pg_controldata; my first question
 is whether that's complete enough information.  If not, what else
 should be monitored?
 
 I don't think running that program is going to fly for a
 production-quality integrated replication setup, though.  The UI admins
 are going to want is one that allows querying this easily via a
 standard database query.  Most monitoring systems can issue psql
 queries but not necessarily run a remote binary.  I think that parts of
 pg_controldata need to get exposed via some number of built-in UDFs
 instead, and whatever new internal state makes sense too.  I could
 help out writing those, if someone more familiar with the
 replication internals can help me nail down a spec on what to watch.
 
IMO, it would be best if the status could be sent via NOTIFY.  In my
experience, this results in monitoring which both has less overhead
and is more current.  We tend to be almost as interested in metrics on
throughput as on lag.  Backlogged volume can be interesting, too, if it's
available.
 
-Kevin



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Greg Smith
This is looking really neat now; making async replication really solid 
first before even trying to move on to sync is the right way to go here, 
IMHO.  I just cleaned up the docs on the Wiki page; when this patch is 
closer to being committed I officially volunteer to do the same on the 
internal SGML docs.  Someone should nudge me when the patch is at that 
point if I don't take care of it before then.


Putting on my DBA hat for a minute, the first question I see people asking 
is "how do I measure how far behind the slaves are?".  Presumably you can 
get that out of pg_controldata; my first question is whether that's 
complete enough information.  If not, what else should be monitored?


I don't think running that program is going to fly for a production-quality 
integrated replication setup, though.  The UI admins are going to want is 
one that allows querying this easily via a standard database query.  Most 
monitoring systems can issue psql queries but not necessarily run a remote 
binary.  I think that parts of pg_controldata need to get exposed via 
some number of built-in UDFs instead, and whatever new internal state 
makes sense too.  I could help out writing those, if someone more familiar 
with the replication internals can help me nail down a spec on what to 
watch.


--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Heikki Linnakangas
Kevin Grittner wrote:
 Greg Smith gsm...@gregsmith.com wrote:
 I don't think running that program is going to fly for a
 production-quality integrated replication setup, though.  The UI admins
 are going to want is one that allows querying this easily via a
 standard database query.  Most monitoring systems can issue psql
 queries but not necessarily run a remote binary.  I think that parts of
 pg_controldata need to get exposed via some number of built-in UDFs
 instead, and whatever new internal state makes sense too.  I could
 help out writing those, if someone more familiar with the
 replication internals can help me nail down a spec on what to watch.
  
 IMO, it would be best if the status could be sent via NOTIFY.

To where?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Kevin Grittner
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 Kevin Grittner wrote:
 
 IMO, it would be best if the status could be sent via NOTIFY.
 
 To where?
 
To registered listeners?
 
I guess I should have worded that as "it would be best if a change in
replication status could be signaled via NOTIFY" -- does that satisfy,
or am I missing your point entirely?
 
-Kevin



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Heikki Linnakangas
Fujii Masao wrote:
 Here is the latest version of Streaming Replication (SR) patch.

The first thing that caught my eye is that I don't think "replication"
should be a real database. Rather, it should be a keyword in
pg_hba.conf, like the existing "all", "sameuser", "samerole" keywords
that you can put into the database column.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Simon Riggs

On Mon, 2009-09-14 at 20:24 +0900, Fujii Masao wrote:

 The latest patch has overcome those problems:

Well done. I hope to look at it myself in a few days' time.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Fujii Masao
Hi,

On Tue, Sep 15, 2009 at 12:47 AM, Greg Smith gsm...@gregsmith.com wrote:
 Putting on my DBA hat for a minute, the first question I see people asking
 is "how do I measure how far behind the slaves are?".  Presumably you can
 get that out of pg_controldata; my first question is whether that's complete
 enough information.  If not, what else should be monitored?

Currently the progress of replication is shown only in the PS display. So,
the following three steps are necessary to measure the gap between the
servers:

1. execute pg_current_xlog_location() to check how far the primary has
written WAL.
2. execute 'ps' to check how far the standby has written WAL.
3. compare the above results.

This is very messy. A more user-friendly monitoring feature is necessary,
and developing one is a TODO item for a later CommitFest.

I'm thinking of something like pg_standbys_xlog_location(), which returns
one row per standby server, showing the pid of the walsender, the host
name/port number/user OID of the standby, and the location where the
standby has written/flushed WAL. A DBA could measure the gap from the
combination of pg_current_xlog_location() and pg_standbys_xlog_location()
via one query on the primary. Thoughts?
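
The gap check would then be a single query on the primary, something
like the following (pg_standbys_xlog_location() is only proposed here,
and its column names are invented for illustration):

    SELECT s.pid,
           s.host,
           pg_current_xlog_location() AS primary_write_location,
           s.flush_location           AS standby_flush_location
    FROM   pg_standbys_xlog_location() AS s;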

But the problem might be what happens after the primary has fallen
down. The current write location of the primary cannot be checked via
pg_current_xlog_location() then, and might need to be calculated from
the WAL files on the primary. Is a tool which performs such a
calculation necessary?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication patch for CommitFest 2009-09

2009-09-14 Thread Fujii Masao
Hi,

On Tue, Sep 15, 2009 at 1:06 AM, Andrew Dunstan and...@dunslane.net wrote:
 One question I have is: what is the level of traffic involved between the
 master and the slave? I know a number of people have found the traffic
 involved in shipping log files to be a pain, and thus we get things like
 pglesslog.

That is almost the same as the WAL write traffic on the primary. In fact,
the contents of the WAL files written on the standby are exactly the same
as those on the primary. Currently SR provides no compression capability
for the traffic. Should we introduce something like
walsender_hook/walreceiver_hook to cooperate with an add-on program
for compression, like pglesslog?
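
Such a hook could follow the existing *_hook conventions in the
backend -- very roughly (invented name and signature, nothing like
this is in the patch):

    /*
     * Sketch only: a plugin-supplied filter that may rewrite the
     * outgoing WAL batch, e.g. stripping full-page images the way
     * pglesslog does.  Returns the number of bytes placed in outbuf.
     */
    typedef int (*walsender_hook_type) (char *buf, int nbytes,
                                        char *outbuf, int outbufsize);
    extern walsender_hook_type walsender_hook;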

If you always use PITR instead of normal recovery, full_page_writes = off
might be another solution.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
