Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-03-01 Thread Heikki Linnakangas

On 26.02.2011 16:58, Robert Haas wrote:

It sounds like the only thing we have definite agreement about from
all this is that apply_location should be renamed to replay_location
in pg_stat_replication, a point fairly incidental to the what this
patch is about.  It seems fairly unsatisfying to just change that and
punt the rest of this, but I'm not sure what the alternative is.


After reading the discussions, I don't see any actual opposition to 
Fujii-san's patch. And I think it makes sense, the new definition makes 
sense for the purpose Fujii mentioned in the mail that started this 
thread: determining which standby is most up-to-date.


There has been a lot of good suggestions, like verifying the received 
WAL before overwriting existing WAL. But we're not going to start doing 
bigger code changes this late in the release cycle. And even if we did, 
this patch would still make sense - we still wouldn't want 
pg_last_xlog_receive_location() to move backwards.


So, committed.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-26 Thread Robert Haas
On Wed, Feb 16, 2011 at 7:06 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Feb 16, 2011 at 12:59 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao masao.fu...@gmail.com wrote:
 You suggest that the shared variable Stream tracks the WAL write location,
 after it's set to the replication starting position? I don't think
 that the write
 location needs to be tracked in the shmem because other processes than
 walreceiver don't use it.

 Well, my proposal was to expose it, on the theory that it's useful.
 As we stream the WAL, we write it, so I think for all intents and
 purposes write == stream.  But using it to convey the starting
 position makes more sense if you call it stream than it does if you
 call it write.

 Umm.. I could not find any use case to expose the WAL write location
 besides flush one. So I'm not sure if it's really useful to track the
 write location in the shmem besides the walreceiver-local memory.
 What use case do you think of?

 Well, we're currently exposing that on the master via
 pg_stat_replication.  I guess we could rip that out, but I think that
 if nothing else we're imagining eventually supporting a sync rep mode
 where the standby acknowledges WAL upon receipt rather than upon
 write.  And the lag between the write and flush positions can be up to
 16MB, so it doesn't seem entirely academic.  Basically, the write
 position is the most WAL that could be on disk on standby and the
 flush is the most WAL that we're SURE is on disk on the standby.

 Personally the term stream sounds more ambiguous than write.
 I cannot imagine what location the pg_last_xlog_stream_location or
 stream_location actually returns, from the function name;  WAL
 location that has been received? written? flushed? replayed?
 Since the write sounds cleaner, I like it.

 Well, the problem with receivedUpto is that it's really being used for
 two different things, neither of which is how much WAL has been
 received.  One is where streaming is to start (hence, stream) and the
 other is how much we've flushed to disk (hence, flush).  So you might
 think there were four positions: streaming start, write, flush, apply.
  But I think the first two are really the same: once you've received
 at least one byte, the position that you're streaming from and the
 write position are the same, so I think the name stream can span both
 concepts.  OTOH, stream-start and flush are clearly NOT the same -
 there is a small but potentially significant delay between
 stream/write and flush.

It sounds like the only thing we have definite agreement about from
all this is that apply_location should be renamed to replay_location
in pg_stat_replication, a point fairly incidental to the what this
patch is about.  It seems fairly unsatisfying to just change that and
punt the rest of this, but I'm not sure what the alternative is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-16 Thread Robert Haas
On Wed, Feb 16, 2011 at 12:59 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao masao.fu...@gmail.com wrote:
 You suggest that the shared variable Stream tracks the WAL write location,
 after it's set to the replication starting position? I don't think
 that the write
 location needs to be tracked in the shmem because other processes than
 walreceiver don't use it.

 Well, my proposal was to expose it, on the theory that it's useful.
 As we stream the WAL, we write it, so I think for all intents and
 purposes write == stream.  But using it to convey the starting
 position makes more sense if you call it stream than it does if you
 call it write.

 Umm.. I could not find any use case to expose the WAL write location
 besides flush one. So I'm not sure if it's really useful to track the
 write location in the shmem besides the walreceiver-local memory.
 What use case do you think of?

Well, we're currently exposing that on the master via
pg_stat_replication.  I guess we could rip that out, but I think that
if nothing else we're imagining eventually supporting a sync rep mode
where the standby acknowledges WAL upon receipt rather than upon
write.  And the lag between the write and flush positions can be up to
16MB, so it doesn't seem entirely academic.  Basically, the write
position is the most WAL that could be on disk on standby and the
flush is the most WAL that we're SURE is on disk on the standby.

 Personally the term stream sounds more ambiguous than write.
 I cannot imagine what location the pg_last_xlog_stream_location or
 stream_location actually returns, from the function name;  WAL
 location that has been received? written? flushed? replayed?
 Since the write sounds cleaner, I like it.

Well, the problem with receivedUpto is that it's really being used for
two different things, neither of which is how much WAL has been
received.  One is where streaming is to start (hence, stream) and the
other is how much we've flushed to disk (hence, flush).  So you might
think there were four positions: streaming start, write, flush, apply.
 But I think the first two are really the same: once you've received
at least one byte, the position that you're streaming from and the
write position are the same, so I think the name stream can span both
concepts.  OTOH, stream-start and flush are clearly NOT the same -
there is a small but potentially significant delay between
stream/write and flush.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-15 Thread Robert Haas
On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao masao.fu...@gmail.com wrote:
 You suggest that the shared variable Stream tracks the WAL write location,
 after it's set to the replication starting position? I don't think
 that the write
 location needs to be tracked in the shmem because other processes than
 walreceiver don't use it.

Well, my proposal was to expose it, on the theory that it's useful.
As we stream the WAL, we write it, so I think for all intents and
purposes write == stream.  But using it to convey the starting
position makes more sense if you call it stream than it does if you
call it write.

 You propose to rename LogstreamResult.Write to .Stream, and
 merge it and receiveStart?

Yeah, or probably change recieveStart to be called Stream.  It's
representing the same thing, just in shmem instead of backend-local,
so why name it differently?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-15 Thread Fujii Masao
On Tue, Feb 15, 2011 at 9:41 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Feb 15, 2011 at 12:34 AM, Fujii Masao masao.fu...@gmail.com wrote:
 You suggest that the shared variable Stream tracks the WAL write location,
 after it's set to the replication starting position? I don't think
 that the write
 location needs to be tracked in the shmem because other processes than
 walreceiver don't use it.

 Well, my proposal was to expose it, on the theory that it's useful.
 As we stream the WAL, we write it, so I think for all intents and
 purposes write == stream.  But using it to convey the starting
 position makes more sense if you call it stream than it does if you
 call it write.

Umm.. I could not find any use case to expose the WAL write location
besides flush one. So I'm not sure if it's really useful to track the
write location in the shmem besides the walreceiver-local memory.
What use case do you think of?

Personally the term stream sounds more ambiguous than write.
I cannot imagine what location the pg_last_xlog_stream_location or
stream_location actually returns, from the function name;  WAL
location that has been received? written? flushed? replayed?
Since the write sounds cleaner, I like it.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-14 Thread Fujii Masao
On Sat, Feb 12, 2011 at 11:32 PM, Robert Haas robertmh...@gmail.com wrote:
 So, what if we did some renaming?  I'd be inclined to start by
 renaming receivedUpTo to Flush, and add a new position called
 Stream.  When walreciever is started, we set Stream to the position at
 which streaming is going to begin (which can rewind) and leave Flush
 alone (so it never rewinds). We then change the walreceiver feedback
 mechanism to use the term stream_location rather than write_location;
 and we could consider replacing pg_last_xlog_receive_location() with
 pg_last_xlog_stream_location() and pg_last_xlog_flush_location().

You suggest that the shared variable Stream tracks the WAL write location,
after it's set to the replication starting position? I don't think
that the write
location needs to be tracked in the shmem because other processes than
walreceiver don't use it.

What I proposed is:

Walreceiver-local variables
==
1. LogstreamResult.Write
 - Indicates the location of recently written WAL record
 - Can rewind
 - pg_stat_replication.write_location returns this

2. LogstreamResult.Flush
 - Indicates the location of recently flushed WAL record
 - Can rewind
 - pg_stat_replication.flush_location returns this

Shmem variables
===
3. WalRcv-receiveStart
 - Indicates the replication starting location
 - Updated only when walreceiver is started
 - Doesn't exist at the moment, so I propose to add this

4. WalRcv-receivedUpto
 - Indicates the latest location of all the flushed WAL records
 - Never rewinds
(Can rewind at the moment, so I propose to prevent the rewind)
 - pg_last_xlog_receive_location returns this

You propose to rename LogstreamResult.Write to .Stream, and
merge it and receiveStart?

 I'd also be inclined to go to the walreceiver code and and rename the
 apply_location to replay_location, so that it matches
 pg_last_xlog_replay_location().  The latter is in 9.0, but the former
 is new to 9.1, so we can still fix it to be consistent without a
 backward compatibility break.

+1

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-13 Thread Jeff Janes
On Sun, Jan 30, 2011 at 11:12 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 I do not understand what doing so gets us.

 Say we previously received 2/3 of a WAL file, and replayed most of it.
 So now the shared buffers have data that has been synced to that WAL
 file already, and some of those dirty shared buffers have been written
 to disk and some have not.  At this point, we need the data in the first
 2/3 of the WAL file in order to reach a consistent state.  But now we
 lose the connection to the master, and then we restore it.  Now we
 request the entire file from the start rather than from where it
 left off.

 Either of two things happens.  Most likely, the newly received WAL file
 matches the file it is overwriting, in which case there was no
 point in asking for it.

 Less likely, the master is feeding us gibberish.  By requesting the
 full WAL file, we check the header and detect that the master is feeding
 us gibberish.  Unfortunately, we only detect that fact *after* we have
 replaced a critical part of our own (previously good) copy of the WAL
 file with said gibberish.  The standby is now in an unrecoverable state.

 Right. To avoid this problem completely, IMO, walreceiver should validate
 the received WAL data before writing it. Or, walreceiver should write the
 WAL to the transient file, and the startup process should rename it to the
 correct name after replaying it.

 We should do something like the above?

I don't think we should overwrite local WAL that has already been
applied, unless we already know for sure that it is bad (and suspect
re-streaming might make it right.)

I think the better approach would be to check the existence, and the
segment header, of the WAL segment we already (think we) have in
pg_xlog.  And then request re-streaming of that file from the
beginning only if it is either missing or has the wrong header in the
local copy.  That might make the code a lot more complex, though.  At
least I don't see a simple way to implement it.


 With a bit of malicious engineering, I have created this situation.
 I don't know how likely it is that something like that could happen
 accidentally, say with a corrupted file system.  I have been unable
 to engineer a situation where checking the header actually does
 any good.  It has either done nothing, or done harm.

 OK, I seem to have to consider again why the code which retreats the
 replication starting location exists.

 At first, I added the code to prevent a half-baked WAL file. For example,
 please imagine the case where you start the standby server with no WAL
 files in pg_xlog. In this case, if replication starts from the middle of WAL
 file, the received WAL file is obviously broken (i.e., with no WAL data in
 the first half of file). This broken WAL file might cause trouble when we
 restart the standby and even when we just promote it (because the last
 applied WAL file is re-fetched at the end of recovery).

 OTOH, we can start replication from the middle of WAL file if we can
 *ensure* that the first half of WAL file already exists. At least, when the
 standby reconnects to the master, we might be able to ensure that and
 start from the middle.

 OK, thanks for the explanation.  Is there a race condition here?  That is,
 it seems that with your patch this part of the code could get executed
 after streaming is restarted, but before the streaming ever actually received
 and wrote anything.  So it would be checking the old header, not the newly
 about-to-be received header.

 As far as I read the code again, there is no such a race condition. When
 the received WAL is read just after streaming is restarted, that part of the
 code seems to always be executed.

 Maybe I'm missing something. Could you explain about the problematic
 scenario?

OK, I think you are right.  I was thinking that if apply is well
behind receive when the connection is lost, that a new connection
could immediately pass the havedata test an so proceed to the header
check perhaps before anything was received.  But now I see that we
never try to re-connect until after apply has caught up with receive,
so that race condition cannot happen.  Sorry for the false alarm.

By the way, the patch no longer applies cleanly to HEAD, as of commit
d309acf201ab2c5 (Feb 11th).  But it looks like a trivial conflict (Two
different patches both correct the same doc typo).

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-12 Thread Robert Haas
On Fri, Feb 11, 2011 at 12:52 PM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 Actually... wait a minute.  Now that I'm thinking about this a little
 more, I'm not so convinced.  If we leave this the way is, and just
 paper over the problem using the method Stephen and I both had in
 mind, then we could be streaming from a position that precedes the
 so-called current position.  And worse, the newly introduced replies
 to the master will still show the position going backward, so the
 master will reflect a position that is earlier that the position the
 slave reports for itself, not because of any real difference but
 because of a reporting difference.  That sure doesn't seem good.

 I'm really not sure it's as bad as all that...  The slave and the master
 are only going to be out of sync wrt position until the slave sends
 the request for update to the master, gets back the result, and checks
 the XLOG header, right?

It'll restream the whole segment up to wherever it was, but basically, yes.

I think a big part of the problem here is that we have wildly
inconsistent terminology for no especially good reason.  The standby
reports three XLOG positions to the master, currently called write,
flush, and apply.  The walreceiver reports positions called
recievedUpTo and lastChunkStart.  receivedUpTo is actually the FLUSH
position, and lastChunkStart is the previous flush position, except
when the walreceiver first starts, when it's the position at which
streaming is to begin.

So, what if we did some renaming?  I'd be inclined to start by
renaming receivedUpTo to Flush, and add a new position called
Stream.  When walreciever is started, we set Stream to the position at
which streaming is going to begin (which can rewind) and leave Flush
alone (so it never rewinds).  We then change the walreceiver feedback
mechanism to use the term stream_location rather than write_location;
and we could consider replacing pg_last_xlog_receive_location() with
pg_last_xlog_stream_location() and pg_last_xlog_flush_location().
That's a backward compatibility break, but maybe it's worth it for the
sake of terminological consistency, not to mention accuracy.

I'd also be inclined to go to the walreceiver code and and rename the
apply_location to replay_location, so that it matches
pg_last_xlog_replay_location().  The latter is in 9.0, but the former
is new to 9.1, so we can still fix it to be consistent without a
backward compatibility break.

Thoughts, comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-12 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 I think a big part of the problem here is that we have wildly
 inconsistent terminology for no especially good reason.  

Agreed.

 Thoughts, comments?

I thought about this for a bit and agree w/ your suggestions.

So, +1 from me.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-12 Thread Simon Riggs
On Sat, 2011-02-12 at 09:32 -0500, Robert Haas wrote:
 On Fri, Feb 11, 2011 at 12:52 PM, Stephen Frost sfr...@snowman.net wrote:
  * Robert Haas (robertmh...@gmail.com) wrote:
  Actually... wait a minute.  Now that I'm thinking about this a little
  more, I'm not so convinced.  If we leave this the way is, and just
  paper over the problem using the method Stephen and I both had in
  mind, then we could be streaming from a position that precedes the
  so-called current position.  And worse, the newly introduced replies
  to the master will still show the position going backward, so the
  master will reflect a position that is earlier that the position the
  slave reports for itself, not because of any real difference but
  because of a reporting difference.  That sure doesn't seem good.
 
  I'm really not sure it's as bad as all that...  The slave and the master
  are only going to be out of sync wrt position until the slave sends
  the request for update to the master, gets back the result, and checks
  the XLOG header, right?
 
 It'll restream the whole segment up to wherever it was, but basically, yes.
 
 I think a big part of the problem here is that we have wildly
 inconsistent terminology for no especially good reason.  The standby
 reports three XLOG positions to the master, currently called write,
 flush, and apply.  The walreceiver reports positions called
 recievedUpTo and lastChunkStart.  receivedUpTo is actually the FLUSH
 position, and lastChunkStart is the previous flush position, except
 when the walreceiver first starts, when it's the position at which
 streaming is to begin.
 
 So, what if we did some renaming?  I'd be inclined to start by
 renaming receivedUpTo to Flush, and add a new position called
 Stream.  When walreciever is started, we set Stream to the position at
 which streaming is going to begin (which can rewind) and leave Flush
 alone (so it never rewinds).

OK

   We then change the walreceiver feedback
 mechanism to use the term stream_location rather than write_location;

OK

 and we could consider replacing pg_last_xlog_receive_location() with
 pg_last_xlog_stream_location() and pg_last_xlog_flush_location().
 That's a backward compatibility break, but maybe it's worth it for the
 sake of terminological consistency, not to mention accuracy.

Don't see a need to break anything. OK with two new function names.

 I'd also be inclined to go to the walreceiver code and and rename the
 apply_location to replay_location, so that it matches
 pg_last_xlog_replay_location().  The latter is in 9.0, but the former
 is new to 9.1, so we can still fix it to be consistent without a
 backward compatibility break.

Any renaming of things should be done as cosmetic fixes after important
patches are in.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-12 Thread Simon Riggs
On Mon, 2011-01-31 at 16:12 +0900, Fujii Masao wrote:
 On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes jeff.ja...@gmail.com wrote:
  I do not understand what doing so gets us.
 
  Say we previously received 2/3 of a WAL file, and replayed most of it.
  So now the shared buffers have data that has been synced to that WAL
  file already, and some of those dirty shared buffers have been written
  to disk and some have not.  At this point, we need the data in the first
  2/3 of the WAL file in order to reach a consistent state.  But now we
  lose the connection to the master, and then we restore it.  Now we
  request the entire file from the start rather than from where it
  left off.
 
  Either of two things happens.  Most likely, the newly received WAL file
  matches the file it is overwriting, in which case there was no
  point in asking for it.
 
  Less likely, the master is feeding us gibberish.  By requesting the
  full WAL file, we check the header and detect that the master is feeding
  us gibberish.  Unfortunately, we only detect that fact *after* we have
  replaced a critical part of our own (previously good) copy of the WAL
  file with said gibberish.  The standby is now in an unrecoverable state.
 
 Right. To avoid this problem completely, IMO, walreceiver should validate
 the received WAL data before writing it. Or, walreceiver should write the
 WAL to the transient file, and the startup process should rename it to the
 correct name after replaying it.
 
 We should do something like the above?
 
  With a bit of malicious engineering, I have created this situation.
  I don't know how likely it is that something like that could happen
  accidentally, say with a corrupted file system.  I have been unable
  to engineer a situation where checking the header actually does
  any good.  It has either done nothing, or done harm.
 
 OK, I seem to have to consider again why the code which retreats the
 replication starting location exists.
 
 At first, I added the code to prevent a half-baked WAL file. For example,
 please imagine the case where you start the standby server with no WAL
 files in pg_xlog. In this case, if replication starts from the middle of WAL
 file, the received WAL file is obviously broken (i.e., with no WAL data in
 the first half of file). This broken WAL file might cause trouble when we
 restart the standby and even when we just promote it (because the last
 applied WAL file is re-fetched at the end of recovery).
 
 OTOH, we can start replication from the middle of WAL file if we can
 *ensure* that the first half of WAL file already exists. At least, when the
 standby reconnects to the master, we might be able to ensure that and
 start from the middle.

Some important points here, but it seems complex.

AFAICS we need to do two things
* check header of WAL file matches when we start streaming
* start streaming from last received location

So why not do them separately, rather than rewinding the received
location and kludging things?

Seems easier to send all required info in a separate data packet, then
validate the existing WAL file against that. Then start streaming from
last received location. That way we don't have any going backwards
logic at all.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-11 Thread Stephen Frost
Fujii, all,

* Fujii Masao (masao.fu...@gmail.com) wrote:
 That logic exists because we'd like to check that newly-received WAL
 data is consistent with previous one by validating the header of new
 WAL file. So since we need the header of new WAL file, we retreat the
 replication starting location to the beginning of the WAL file when
 reconnecting to the primary.

Thanks for that explanation, but I can't help but wonder why it doesn't
make more sense to introduce a new variable to track the value you want
rather than reusing an existing one and then adding a variable to
represent what the old variable was already doing.  In other words, why
not invent

XLogRecPtr  newestReceviedUpto; /* or something */

and update that as necessary and have the function you want changed
return that, and leave receivedUpto alone..?  It seems like it'd be a
lot easier to prove to ourselves that your patch didn't break anything,
presuming the function we're talking about changing the return value of
isn't called anywhere (seems rather unlikely to me..).

Also, you really should reformat the docs properly when you change them,
rather than leaving the lines ragged..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-11 Thread Robert Haas
On Fri, Feb 11, 2011 at 11:52 AM, Stephen Frost sfr...@snowman.net wrote:
 Fujii, all,

 * Fujii Masao (masao.fu...@gmail.com) wrote:
 That logic exists because we'd like to check that newly-received WAL
 data is consistent with previous one by validating the header of new
 WAL file. So since we need the header of new WAL file, we retreat the
 replication starting location to the beginning of the WAL file when
 reconnecting to the primary.

 Thanks for that explanation, but I can't help but wonder why it doesn't
 make more sense to introduce a new variable to track the value you want
 rather than reusing an existing one and then adding a variable to
 represent what the old variable was already doing.

+1.

It seems like there may be some more significant changes in this area
needed; however, for now I think the best fix is the one with the
least chance of breaking anything.

 Also, you really should reformat the docs properly when you change them,
 rather than leaving the lines ragged..

It's OK to leave them a little ragged, I think.  It eases back-patching.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-11 Thread Robert Haas
On Fri, Feb 11, 2011 at 12:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Feb 11, 2011 at 11:52 AM, Stephen Frost sfr...@snowman.net wrote:
 Fujii, all,

 * Fujii Masao (masao.fu...@gmail.com) wrote:
 That logic exists because we'd like to check that newly-received WAL
 data is consistent with previous one by validating the header of new
 WAL file. So since we need the header of new WAL file, we retreat the
 replication starting location to the beginning of the WAL file when
 reconnecting to the primary.

 Thanks for that explanation, but I can't help but wonder why it doesn't
 make more sense to introduce a new variable to track the value you want
 rather than reusing an existing one and then adding a variable to
 represent what the old variable was already doing.

 +1.

 It seems like there may be some more significant changes in this area
 needed; however, for now I think the best fix is the one with the
 least chance of breaking anything.

Actually... wait a minute.  Now that I'm thinking about this a little
more, I'm not so convinced.  If we leave this the way is, and just
paper over the problem using the method Stephen and I both had in
mind, then we could be streaming from a position that precedes the
so-called current position.  And worse, the newly introduced replies
to the master will still show the position going backward, so the
master will reflect a position that is earlier that the position the
slave reports for itself, not because of any real difference but
because of a reporting difference.  That sure doesn't seem good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-02-11 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 Actually... wait a minute.  Now that I'm thinking about this a little
 more, I'm not so convinced.  If we leave this the way is, and just
 paper over the problem using the method Stephen and I both had in
 mind, then we could be streaming from a position that precedes the
 so-called current position.  And worse, the newly introduced replies
 to the master will still show the position going backward, so the
 master will reflect a position that is earlier that the position the
 slave reports for itself, not because of any real difference but
 because of a reporting difference.  That sure doesn't seem good.

I'm really not sure it's as bad as all that...  The slave and the master
are only going to be out of sync wrt position until the slave sends
the request for update to the master, gets back the result, and checks
the XLOG header, right?

Here's my question- we're talking about if the master dies here, as I
understand it.  If the slave comes up and can't get to the master, is he
going to be reporting the older position, or his current one?  The
answer to that should be, I believe, the *current* one.  He'd only go
backwards *if* the master is up and he gets his request over to the
master to get the old log.  In fact, if he comes up and gets his request
off to the master and the master never replies, in my view, he should be
able to be restarted into 'master' mode and come all the way up to
'current' (which would be later than what he was trying to ask the
master for).  *That* is the value, I think, that Fujii is looking for-
if this slave started up as a master, where would he be?  That should
always be increasing.  The fact that we back up a little, as a
double-check to make sure we didn't get wildly out of sync, and because
we want the master to only be producing full XLOGs, is an implementation
detail, really, that probably doesn't really need to be exposed..

That may mean that we want to change what the slave is reporting for
itself though, if it's currently allowed to be seen going backwards,
but I don't think that would be terribly difficult to change...

I havn't really dug into the SR/HS stuff, so I could be way off too..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-30 Thread Fujii Masao
On Sun, Jan 30, 2011 at 10:44 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 I do not understand what doing so gets us.

 Say we previously received 2/3 of a WAL file, and replayed most of it.
 So now the shared buffers have data that has been synced to that WAL
 file already, and some of those dirty shared buffers have been written
 to disk and some have not.  At this point, we need the data in the first
 2/3 of the WAL file in order to reach a consistent state.  But now we
 lose the connection to the master, and then we restore it.  Now we
 request the entire file from the start rather than from where it
 left off.

 Either of two things happens.  Most likely, the newly received WAL file
 matches the file it is overwriting, in which case there was no
 point in asking for it.

 Less likely, the master is feeding us gibberish.  By requesting the
 full WAL file, we check the header and detect that the master is feeding
 us gibberish.  Unfortunately, we only detect that fact *after* we have
 replaced a critical part of our own (previously good) copy of the WAL
 file with said gibberish.  The standby is now in an unrecoverable state.

Right. To avoid this problem completely, IMO, walreceiver should validate
the received WAL data before writing it. Or, walreceiver should write the
WAL to the transient file, and the startup process should rename it to the
correct name after replaying it.

We should do something like the above?

 With a bit of malicious engineering, I have created this situation.
 I don't know how likely it is that something like that could happen
 accidentally, say with a corrupted file system.  I have been unable
 to engineer a situation where checking the header actually does
 any good.  It has either done nothing, or done harm.

OK, I seem to have to consider again why the code which retreats the
replication starting location exists.

At first, I added the code to prevent a half-baked WAL file. For example,
please imagine the case where you start the standby server with no WAL
files in pg_xlog. In this case, if replication starts from the middle of WAL
file, the received WAL file is obviously broken (i.e., with no WAL data in
the first half of file). This broken WAL file might cause trouble when we
restart the standby and even when we just promote it (because the last
applied WAL file is re-fetched at the end of recovery).

OTOH, we can start replication from the middle of WAL file if we can
*ensure* that the first half of WAL file already exists. At least, when the
standby reconnects to the master, we might be able to ensure that and
start from the middle.

 OK, thanks for the explanation.  Is there a race condition here?  That is,
 it seems that with your patch this part of the code could get executed
 after streaming is restarted, but before the streaming ever actually received
 and wrote anything.  So it would be checking the old header, not the newly
 about-to-be received header.

As far as I read the code again, there is no such a race condition. When
the received WAL is read just after streaming is restarted, that part of the
code seems to always be executed.

Maybe I'm missing something. Could you explain about the problematic
scenario?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-29 Thread Jeff Janes
On Tue, Jan 25, 2011 at 6:38 PM, Fujii Masao masao.fu...@gmail.com wrote:

 The third seems more problematic.  In the XLogPageRead,
 it checks to see if more records have been received beyond what
 has been applied.  By using the non-retreating value here, it seems
 like the xlog replay could start replaying records that the wal
 receiver is in the process of overwriting.  Now, I've argued to myself
 that this is not a problem, because the receiver is overwriting them
 with identical data to that which is already there.

 Yes. I don't think that it's a problem, too.

 But by that logic, why does any part of it (walrcv-receiveStart in
 the patch, walrcv-receivedUpto unpatched) need to retreat?  From
 the previous discussion, I understand that the concern is that we don't
 want to retrieve and write out partial xlog files.  What I don't understand
 is how we could get our selves into the position in which we are doing
 that, other than by someone going in and doing impermissible things to
 the PGDATA directory behind our backs.

 That logic exists because we'd like to check that newly-received WAL
 data is consistent with previous one by validating the header of new
 WAL file.

I do not understand what doing so gets us.

Say we previously received 2/3 of a WAL file, and replayed most of it.
So now the shared buffers have data that has been synced to that WAL
file already, and some of those dirty shared buffers have been written
to disk and some have not.  At this point, we need the data in the first
2/3 of the WAL file in order to reach a consistent state.  But now we
lose the connection to the master, and then we restore it.  Now we
request the entire file from the start rather than from where it
left off.

Either of two things happens.  Most likely, the newly received WAL file
matches the file it is overwriting, in which case there was no
point in asking for it.

Less likely, the master is feeding us gibberish.  By requesting the
full WAL file, we check the header and detect that the master is feeding
us gibberish.  Unfortunately, we only detect that fact *after* we have
replaced a critical part of our own (previously good) copy of the WAL
file with said gibberish.  The standby is now in an unrecoverable state.

With a bit of malicious engineering, I have created this situation.
I don't know how likely it is that something like that could happen
accidentally, say with a corrupted file system.  I have been unable
to engineer a situation where checking the header actually does
any good.  It has either done nothing, or done harm.



 So since we need the header of new WAL file, we retreat the
 replication starting location to the beginning of the WAL file when
 reconnecting to the primary.

 The following code (in XLogPageRead) validates the header of new
 WAL file.

 --
        if (switched_segment  targetPageOff != 0)
        {
                /*
                 * Whenever switching to a new WAL segment, we read the first 
 page of
                 * the file and validate its header, even if that's not where 
 the
                 * target record is.  This is so that we can check the 
 additional
                 * identification info that is present in the first page's 
 long
                 * header.
                 */
                readOff = 0;
                if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
                {
                        ereport(emode_for_corrupt_record(emode, *RecPtr),
                                        (errcode_for_file_access(),
                                         errmsg(could not read from log file 
 %u, segment %u, offset %u: %m,
                                                        readId, readSeg, 
 readOff)));
                        goto next_record_is_invalid;
                }
                if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
                        goto next_record_is_invalid;
        }


OK, thanks for the explanation.  Is there a race condition here?  That is,
it seems that with your patch this part of the code could get executed
after streaming is restarted, but before the streaming ever actually received
and wrote anything.  So it would be checking the old header, not the newly
about-to-be received header.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-25 Thread Jeff Janes
On Mon, Jan 24, 2011 at 11:25 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao masao.fu...@gmail.com wrote:
 So I'm thinking to change pg_last_xlog_receive_location not to
 move backwards.

 The attached patch does that.

 It looks to me like this is changing more than just the return value
 of pg_last_xlog_receive_location.  receivedUpto isn't only used for
 that one function, and there's no explanation in your email or in the
 patch of why the new behavior is correct and/or better for the other
 places where it's used.

I believe the new walrcv-receiveStart was introduced to divide up those
behaviors that should go backwards from those that should not.

The non-retreating value is used in 3 places (via GetWalRcvWriteRecPtr)
in xlog.c.  (And some other places I haven't examined yet.)

One place is to decide whether to remove/recycle XLog files, and I think the
non-retreating value is at least as suitable for this usage.

One is to feed the pg_last_xlog_receive_location() function.

The third seems more problematic.  In the XLogPageRead,
it checks to see if more records have been received beyond what
has been applied.  By using the non-retreating value here, it seems
like the xlog replay could start replaying records that the wal
receiver is in the process of overwriting.  Now, I've argued to myself
that this is not a problem, because the receiver is overwriting them
with identical data to that which is already there.

But by that logic, why does any part of it (walrcv-receiveStart in
the patch, walrcv-receivedUpto unpatched) need to retreat?  From
the previous discussion, I understand that the concern is that we don't
want to retrieve and write out partial xlog files.  What I don't understand
is how we could get our selves into the position in which we are doing
that, other than by someone going in and doing impermissible things to
the PGDATA directory behind our backs.

I've been spinning my wheels on that part for a while.  Since I don't understand
what we are defending against in the original code, I can't evaluate
if the patch
maintains that defense.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-25 Thread Fujii Masao
Thanks for the review!

On Wed, Jan 26, 2011 at 3:40 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 I believe the new walrcv-receiveStart was introduced to divide up those
 behaviors that should go backwards from those that should not.

Yes.

 The non-retreating value is used in 3 places (via GetWalRcvWriteRecPtr)
 in xlog.c.  (And some other places I haven't examined yet.)

Yes.

 The third seems more problematic.  In the XLogPageRead,
 it checks to see if more records have been received beyond what
 has been applied.  By using the non-retreating value here, it seems
 like the xlog replay could start replaying records that the wal
 receiver is in the process of overwriting.  Now, I've argued to myself
 that this is not a problem, because the receiver is overwriting them
 with identical data to that which is already there.

Yes. I don't think that it's a problem, too.

 But by that logic, why does any part of it (walrcv-receiveStart in
 the patch, walrcv-receivedUpto unpatched) need to retreat?  From
 the previous discussion, I understand that the concern is that we don't
 want to retrieve and write out partial xlog files.  What I don't understand
 is how we could get our selves into the position in which we are doing
 that, other than by someone going in and doing impermissible things to
 the PGDATA directory behind our backs.

That logic exists because we'd like to check that newly-received WAL
data is consistent with previous one by validating the header of new
WAL file. So since we need the header of new WAL file, we retreat the
replication starting location to the beginning of the WAL file when
reconnecting to the primary.

The following code (in XLogPageRead) validates the header of new
WAL file.

--
if (switched_segment  targetPageOff != 0)
{
/*
 * Whenever switching to a new WAL segment, we read the first 
page of
 * the file and validate its header, even if that's not where 
the
 * target record is.  This is so that we can check the 
additional
 * identification info that is present in the first page's 
long
 * header.
 */
readOff = 0;
if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
ereport(emode_for_corrupt_record(emode, *RecPtr),
(errcode_for_file_access(),
 errmsg(could not read from log file 
%u, segment %u, offset %u: %m,
readId, readSeg, 
readOff)));
goto next_record_is_invalid;
}
if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
goto next_record_is_invalid;
}
--

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-24 Thread Robert Haas
On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao masao.fu...@gmail.com wrote:
 So I'm thinking to change pg_last_xlog_receive_location not to
 move backwards.

 The attached patch does that.

It looks to me like this is changing more than just the return value
of pg_last_xlog_receive_location.  receivedUpto isn't only used for
that one function, and there's no explanation in your email or in the
patch of why the new behavior is correct and/or better for the other
places where it's used.

This email from Heikki seems to indicate that there's a reason for the
current behavior:

http://archives.postgresql.org/pgsql-hackers/2010-06/msg00586.php

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-24 Thread Fujii Masao
On Tue, Jan 25, 2011 at 4:25 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jan 13, 2011 at 2:40 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao masao.fu...@gmail.com wrote:
 So I'm thinking to change pg_last_xlog_receive_location not to
 move backwards.

 The attached patch does that.

 It looks to me like this is changing more than just the return value
 of pg_last_xlog_receive_location.  receivedUpto isn't only used for
 that one function, and there's no explanation in your email or in the
 patch of why the new behavior is correct and/or better for the other
 places where it's used.

 This email from Heikki seems to indicate that there's a reason for the
 current behavior:

 http://archives.postgresql.org/pgsql-hackers/2010-06/msg00586.php

Yes, so I didn't change that behavior, i.e., even with the patch, SR still
always starts from the head of the WAL segment, not the middle of that.
What I changed is only the return value of pg_last_xlog_receive_location.
Am I missing something?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-12 Thread Fujii Masao
Hi,

In the case where there are multiple standbys, when a failover
happens, we usually calculate most advanced standby by using
pg_last_xlog_receive_location or pg_last_xlog_replay_location,
and promote it to new master. The problem is that neither
function might return the right last location of WAL available in
the standby. So we cannot use them to check which standby
is most ahead.

Since pg_last_xlog_receive_location moves backwards when
the standby attempts to reconnect to the primary, it might fall
behind the last location of WAL available. OTOH,
pg_last_xlog_replay_location might also fall behind because
of Hot Standby query conflict.

So I'm thinking to change pg_last_xlog_receive_location not to
move backwards. There is no need to move it backwards when
the standby reconnects to the primary. So we can do that.

BTW, the related discussion was done before:
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00576.php

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change pg_last_xlog_receive_location not to move backwards

2011-01-12 Thread Fujii Masao
On Thu, Jan 13, 2011 at 4:24 PM, Fujii Masao masao.fu...@gmail.com wrote:
 So I'm thinking to change pg_last_xlog_receive_location not to
 move backwards.

The attached patch does that.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


receive_location_not_back_off_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers