Fujii Masao wrote:
* Small code changes to handling of failedSources, inspired by your
comment. No change in functionality.
This is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch xlogchanges
I looked at the patch and was not able to find
On Wed, Mar 31, 2010 at 1:28 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Fujii Masao wrote:
* Small code changes to handling of failedSources, inspired by your
comment. No change in functionality.
This is also available in my git repository at
On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
PANICing won't change the situation, so it just destroys server
availability. If we had 1 master and 42 slaves then this behaviour would
take down almost the whole
Tom Lane wrote:
Fujii Masao masao.fu...@gmail.com writes:
OK. How about making the startup process emit WARNING, stop WAL replay and
wait for the presence of trigger file, when an invalid record is found?
Which keeps the server up for readonly queries. And if the trigger file is
found, I
On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
PANICing won't change the situation, so it just destroys server
availability. If we had 1 master and 42 slaves then this behaviour would
take down almost the whole
On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
PANIC seems like the appropriate solution for now.
It definitely is not. Think some more.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
And if the trigger file is
found, I think that the startup process should emit a FATAL, i.e., the
server should exit immediately, to prevent the server from becoming the
primary in a half-finished state.
Please
(cc'ing docs list)
Simon Riggs wrote:
The lack of docs begins to show a lack of coherent high-level design
here.
Yeah, I think you're right. It's becoming hard to keep track of how it's
supposed to behave.
By now, I've forgotten what this thread was even about. The major
design decision in
Simon Riggs wrote:
On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
PANIC seems like the appropriate solution for now.
It definitely is not. Think some more.
Well, what happens now in previous versions with pg_standby et al is
that the standby starts up. That doesn't seem
Heikki Linnakangas wrote:
Simon Riggs wrote:
On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
PANIC seems like the appropriate solution for now.
It definitely is not. Think some more.
Well, what happens now in previous versions with pg_standby et al is
that the standby starts
Fujii Masao wrote:
sources = ~failedSources;
failedSources |= readSource;
The above lines in XLogPageRead() seem not to be required in normal
recovery case (i.e., standby_mode = off). So how about the attached
patch?
*** 9050,9056 next_record_is_invalid:
--- 9047,9056
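To make the failedSources bookkeeping discussed above easier to follow, here is a minimal sketch of a bitmask of WAL sources. It is not the code from xlog.c: the flag values, XLOG_FROM_ALL and the helper remaining_sources() are assumptions made for illustration, and resetting the mask once every source has failed is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>

/* Bit flags for the places WAL can be read from (values assumed). */
#define XLOG_FROM_ARCHIVE  (1 << 0)
#define XLOG_FROM_PG_XLOG  (1 << 1)
#define XLOG_FROM_STREAM   (1 << 2)
#define XLOG_FROM_ALL \
    (XLOG_FROM_ARCHIVE | XLOG_FROM_PG_XLOG | XLOG_FROM_STREAM)

/*
 * Hypothetical helper: record that readSource just failed and return the
 * mask of sources that have not failed yet. When every source has failed,
 * the failure mask is cleared so the next attempt starts over.
 */
int remaining_sources(int *failedSources, int readSource)
{
    int sources;

    assert(failedSources != NULL);

    *failedSources |= readSource;
    sources = XLOG_FROM_ALL & ~*failedSources;
    if (sources == 0)
    {
        *failedSources = 0;
        sources = XLOG_FROM_ALL;
    }
    return sources;
}
```

A caller would keep one failure mask per record being read and try whichever bit is still set in the returned mask.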
Fujii Masao wrote:
On second thought, the following lines seem to be necessary just after
calling XLogPageRead() since it reads new WAL file from another source.
if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
emode = PANIC;
else
On Thu, Mar 25, 2010 at 8:55 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
* If a corrupt WAL record is found in archive or streamed from master in
standby mode, throw WARNING instead of PANIC, and keep trying. In
archive recovery (ie. standby_mode=off) it's still a PANIC.
On Thu, 2010-03-25 at 12:15 +0200, Heikki Linnakangas wrote:
(cc'ing docs list)
Simon Riggs wrote:
The lack of docs begins to show a lack of coherent high-level design
here.
Yeah, I think you're right. It's becoming hard to keep track of how it's
supposed to behave.
Thank you for
On Thu, 2010-03-25 at 12:26 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Thu, 2010-03-25 at 10:11 +0200, Heikki Linnakangas wrote:
PANIC seems like the appropriate solution for now.
It definitely is not. Think some more.
Well, what happens now in previous versions with
On Thu, Mar 25, 2010 at 9:55 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
* Fix the bug of a spurious PANIC in archive recovery, if the WAL ends
in the middle of a WAL record that continues over a WAL segment boundary.
* If a corrupt WAL record is found in archive or
Fujii Masao wrote:
But in the current (v8.4 or before) behavior, recovery ends normally
when an invalid record is found in an archived WAL file. Otherwise,
the server would never be able to start normal processing when there
is a corrupted archived file for some reasons. So, that invalid
On Wed, Mar 24, 2010 at 9:31 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Hmm, true, this changes behavior over previous releases. I tend to think
that it's always an error if there's a corrupt file in the archive,
though, and PANIC is appropriate. If the administrator
On Wed, Mar 24, 2010 at 10:20 PM, Fujii Masao masao.fu...@gmail.com wrote:
Thanks. That's easily fixable (applies over the previous patch):
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3773,7 +3773,7 @@ retry:
pagelsn.xrecoff = 0;
On Wed, 2010-03-24 at 14:31 +0200, Heikki Linnakangas wrote:
Fujii Masao wrote:
But in the current (v8.4 or before) behavior, recovery ends normally
when an invalid record is found in an archived WAL file. Otherwise,
the server would never be able to start normal processing when there
is
On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
PANICing won't change the situation, so it just destroys server
availability. If we had 1 master and 42 slaves then this behaviour would
take down almost the whole server farm at once. Very uncool.
You might have reason
Fujii Masao masao.fu...@gmail.com writes:
OK. How about making the startup process emit WARNING, stop WAL replay and
wait for the presence of trigger file, when an invalid record is found?
Which keeps the server up for readonly queries. And if the trigger file is
found, I think that the
Sorry for the delay.
On Fri, Mar 19, 2010 at 8:37 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Here's a patch I've been playing with.
Thanks! I'm reading the patch.
The idea is that in standby mode,
the server keeps trying to make progress in the recovery by:
a)
On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:
I agree that this is a bigger problem. Since the standby always starts
walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
to receive the WAL files following the REDO starting point even if they
have already been in
Simon Riggs wrote:
On Thu, 2010-03-18 at 23:27 +0900, Fujii Masao wrote:
I agree that this is a bigger problem. Since the standby always starts
walreceiver before replaying any WAL files in pg_xlog, walreceiver tries
to receive the WAL files following the REDO starting point even if they
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
Simon Riggs wrote:
We might also have written half a file many times. The files in pg_xlog
are suspect whereas the files in the archive are not. If we have both we
should prefer the archive.
Yep.
Really? That will result in a
Tom Lane wrote:
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
Simon Riggs wrote:
We might also have written half a file many times. The files in pg_xlog
are suspect whereas the files in the archive are not. If we have both we
should prefer the archive.
Yep.
Really?
Heikki Linnakangas escribió:
When recovery reaches an invalid WAL record, typically caused by a
half-written WAL file, it closes the file and moves to the next source.
If an error is found in a file restored from archive or in a portion
just streamed from master, however, a PANIC is thrown,
Alvaro Herrera wrote:
Heikki Linnakangas escribió:
When recovery reaches an invalid WAL record, typically caused by a
half-written WAL file, it closes the file and moves to the next source.
If an error is found in a file restored from archive or in a portion
just streamed from master,
On Wed, Mar 17, 2010 at 7:35 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Fujii Masao wrote:
I found another missing feature in new file-based log shipping (i.e.,
standby_mode is enabled and 'cp' is used as restore_command).
After the trigger file is found, the startup
Fujii Masao wrote:
I found another missing feature in new file-based log shipping (i.e.,
standby_mode is enabled and 'cp' is used as restore_command).
After the trigger file is found, the startup process with pg_standby
tries to replay all of the WAL files in both pg_xlog and the archive.
On Wed, 2010-03-17 at 12:35 +0200, Heikki Linnakangas wrote:
Looking into this, I realized that we have a bigger problem...
A lot of this would be easier if you do the docs first, then work
through the problems. The new system is more complex, since it has two
modes rather than one and also
On Fri, Feb 12, 2010 at 2:29 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
So the only major feature we're missing is the ability to clean up old
files.
I found another missing feature in new file-based log shipping (i.e.,
standby_mode is enabled and 'cp' is used as
On Sat, Feb 13, 2010 at 1:10 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Are you thinking of a scenario where remove_command gets stuck, and
prevents bgwriter from performing restartpoints while it's stuck?
Yes. If there is the archive in the remote server and the network
On Fri, 2010-02-12 at 14:38 +0900, Fujii Masao wrote:
On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Simon Riggs wrote:
Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server
Simon Riggs wrote:
In 8.4 it is pg_standby that was responsible for clearing down the
archive, which is why I suggested using pg_standby for that again. I
agree that will not work. The important thing is not pg_standby but that
we have a valid mechanism for clearing down the archive.
Good
On Fri, 2010-02-12 at 12:54 +, Simon Riggs wrote:
So I suggest that you have a new action that gets called after every
checkpoint to clear down the archive. It will remove all files from the
archive prior to %r. We can implement that as a sequence of unlink()s
from within the server, or
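The "remove all files from the archive prior to %r" rule can be sketched as a filename test. This is only an illustration under stated assumptions: is_removable() is a hypothetical helper, and comparing the names while skipping the 8-character timeline prefix is a design choice of this sketch, not something settled in the thread. It relies only on WAL segment file names being 24 uppercase hex characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define WAL_FNAME_LEN   24  /* timeline (8) + log (8) + segment (8) */
#define TLI_PREFIX_LEN   8  /* leading timeline ID, skipped when comparing */

/*
 * Hypothetical helper: return true if "name" looks like a WAL segment file
 * that sorts strictly before "restart_file" (the value passed as %r).
 * History and backup-label files fail the length/hex test and are kept.
 */
bool is_removable(const char *name, const char *restart_file)
{
    assert(name != NULL && restart_file != NULL);
    assert(strlen(restart_file) == WAL_FNAME_LEN);

    if (strlen(name) != WAL_FNAME_LEN)
        return false;
    if (strspn(name, "0123456789ABCDEF") != WAL_FNAME_LEN)
        return false;

    return strcmp(name + TLI_PREFIX_LEN,
                  restart_file + TLI_PREFIX_LEN) < 0;
}
```

Whether the unlink() calls then happen inside the server or in an external command is exactly the question being debated; the test itself is the same either way.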
On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
So I suggest that you have a new action that gets called after every
checkpoint to clear down the archive. It will remove all files from the
archive prior to %r. We can implement that as a sequence
Fujii Masao wrote:
On Fri, Feb 12, 2010 at 10:10 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
So I suggest that you have a new action that gets called after every
checkpoint to clear down the archive. It will remove all files from the
archive prior to %r. We can implement
Simon Riggs si...@2ndquadrant.com writes:
Attached patch implements pg_standby for use as an
archive_cleanup_command, reusing existing code with new -a option.
Happy to add the archive_cleanup_command into main server as well, if
you like. Won't take long.
Would it be possible to have the
So I don't like having the server doing the cleanup, because it doesn't
necessarily have the whole picture. It doesn't know if it is the only
replica reading these log files or if the site policy is to keep them for
disaster recovery purposes.
I like having this as an return val command though.
On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
Fujii Masao wrote:
As I pointed out previously, the standby might restore a partially-filled
WAL file that is being archived by the primary, and cause a FATAL error.
And this happened in my box when I was testing the SR.
Simon Riggs wrote:
On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
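The size check proposed here could look roughly like the following. This is a sketch, not the patch: restored_file_is_complete() and ASSUMED_WAL_SEG_SIZE are names invented for illustration, and the 16 MB figure is the default WAL segment size, which a given build may have changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Default WAL segment size; assumed here, since it is configurable at build time. */
#define ASSUMED_WAL_SEG_SIZE ((off_t) 16 * 1024 * 1024)

/*
 * Hypothetical helper: after restore_command has reported success, check
 * that the restored file is not shorter than expected. A short or missing
 * file is reported like a failed restore, so the caller can retry on the
 * next iteration.
 */
bool restored_file_is_complete(const char *path, off_t expected_size)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) != 0)
        return false;           /* file missing: treat as a failed restore */

    if (st.st_size < expected_size)
    {
        fprintf(stderr,
                "restored file \"%s\" is too short (%ld of %ld bytes), will retry\n",
                path, (long) st.st_size, (long) expected_size);
        return false;
    }
    return true;
}
```

In standby mode the caller would pass ASSUMED_WAL_SEG_SIZE and, on a false result, behave as if restore_command had returned non-zero.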
On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it
Simon Riggs wrote:
On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned
On Thu, 2010-02-11 at 14:44 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:
Hmm, so after running restore_command, check the file size and if
Simon Riggs wrote:
If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.
pg_standby cannot be used with
Simon Riggs si...@2ndquadrant.com writes:
If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.
Let me
On Thu, 2010-02-11 at 15:28 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
On Thu, 2010-02-11 at 14:41 +0100, Dimitri Fontaine wrote:
Simon Riggs si...@2ndquadrant.com writes:
If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by
Simon Riggs wrote:
One question then: how do we ensure that the archive does not grow too
big? pg_standby cleans down the archive using %R. That function appears
to not exist anymore.
You can still use %R. Of course, plain 'cp' won't know what to do with
it, so a script will then be required.
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100211 08:29]:
To support a restore_command that does the sleeping itself, like
pg_standby, would require a major rearchitecting of the retry logic. And
I don't see why that'd be desirable anyway. It's easier for the admin to
set up
On Thu, 2010-02-11 at 15:55 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
One question then: how do we ensure that the archive does not grow too
big? pg_standby cleans down the archive using %R. That function appears
to not exist anymore.
You can still use %R. Of course, plain
Aidan Van Dyk wrote:
But colour me confused, I'm still not understanding why this is any
different than with normal PITR recovery.
So even with a plain cp in your recovery command instead of a
sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
solutions out there), I'm not
Simon Riggs wrote:
Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each case.
That would work too, but it doesn't seem any simpler to me. On the contrary.
--
Heikki
On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each case.
That would work too, but
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100211 09:17]:
If the file is just being copied to the archive when restore_command
('cp', say) is launched, it will copy a half file. That's not a problem
for PITR, because PITR will end at the end of valid WAL anyway, but
returning a
Heikki Linnakangas wrote:
Simon Riggs wrote:
Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each case.
That would work too, but it doesn't seem any simpler to
Simon Riggs wrote:
On Thu, 2010-02-11 at 16:22 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each case.
That would
Simon Riggs escreveu:
It would mean that pg_standby would act appropriately according to the
setting of standby_mode. So you wouldn't need multiple examples of use,
it would all just work whatever the setting of standby_mode. Nice simple
entry in the docs.
+1. I like the %s idea. IMHO fixing
Aidan Van Dyk wrote:
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100211 09:17]:
If the file is just being copied to the archive when restore_command
('cp', say) is launched, it will copy a half file. That's not a problem
for PITR, because PITR will end at the end of valid WAL
Aidan Van Dyk wrote:
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100211 09:17]:
Yeah, if you're careful about that, then this change isn't required. But
pg_standby protects against that, so I think it'd be reasonable to have
the same level of protection built-in. It's not a lot
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100211 12:04]:
But it can be a problem - without the last WAL (or at least enough of
it) the master switched and archived, you have no guarantee of having
being consistent again (I'm thinking specifically of recovering from a
fresh
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
-1. it isn't necessary for PITR. It's a new requirement for
standby_mode='on', unless we add the file size check into the backend. I
think we should add the file size check to the backend instead and save
admins the headache.
I
Aidan Van Dyk wrote:
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100211 12:04]:
But it can be a problem - without the last WAL (or at least enough of
it) the master switched and archived, you have no guarantee of having
being consistent again (I'm thinking specifically of
On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote:
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
-1. it isn't necessary for PITR. It's a new requirement for
standby_mode='on', unless we add the file size check into the backend. I
think we should add the file size check to
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
I think 'rsync' has the same problem.
There is a switch you can use to create the problem under rsync, but
by default rsync copies to a temporary file name and moves the
completed file to the target name.
-Kevin
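The copy-to-a-temporary-name-then-move behaviour described above is what keeps a reader from ever seeing a half-written file under the final name. A minimal sketch of the same idea in C follows; copy_then_rename() and the ".tmp" suffix are assumptions for illustration, not anything rsync or PostgreSQL actually uses, and the sketch does not fsync the data before renaming.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy src to "<dst>.tmp", then rename() it to dst.
 * A concurrent reader of dst sees either no file or the complete file,
 * provided both names are on the same filesystem.
 * Returns 0 on success, -1 on failure.
 */
int copy_then_rename(const char *src, const char *dst)
{
    char    tmp[1024];
    char    buf[8192];
    size_t  n;
    FILE   *in;
    FILE   *out;
    int     len;
    int     ok = 1;

    assert(src != NULL && dst != NULL);

    len = snprintf(tmp, sizeof tmp, "%s.tmp", dst);
    if (len < 0 || len >= (int) sizeof tmp)
        return -1;              /* destination path too long */

    in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    out = fopen(tmp, "wb");
    if (out == NULL)
    {
        fclose(in);
        return -1;
    }

    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
    {
        if (fwrite(buf, 1, n, out) != n)
        {
            ok = 0;
            break;
        }
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    /* Only a fully written temporary file is moved into place. */
    if (!ok || rename(tmp, dst) != 0)
    {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

A plain 'cp' writes directly to the target name, which is why a restore_command can pick up a partial file from an archive populated that way.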
On Thu, Feb 11, 2010 at 01:22:44PM -0500, Kevin Grittner wrote:
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
I think 'rsync' has the same problem.
There is a switch you can use to create the problem under rsync, but
by default rsync copies to a temporary file name and
On Thu, 2010-02-11 at 19:29 +0200, Heikki Linnakangas wrote:
Aidan Van Dyk wrote:
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100211 09:17]:
Yeah, if you're careful about that, then this change isn't required. But
pg_standby protects against that, so I think it'd be
On Thu, Feb 11, 2010 at 11:22 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Simon Riggs wrote:
Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each
Simon Riggs wrote:
On Thu, 2010-02-11 at 13:08 -0500, Tom Lane wrote:
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
-1. it isn't necessary for PITR. It's a new requirement for
standby_mode='on', unless we add the file size check into the backend. I
think we should add the
On Wed, Feb 10, 2010 at 4:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
Yes, only in standby mode case. OTOH I think that normal
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100210 02:33]:
Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it
Aidan Van Dyk wrote:
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100210 02:33]:
Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for
On Thu, Jan 28, 2010 at 12:27 AM, Heikki Linnakangas
hei...@postgresql.org wrote:
Log Message:
---
Make standby server continuously retry restoring the next WAL segment with
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover
Fujii Masao wrote:
As I pointed out previously, the standby might restore a partially-filled
WAL file that is being archived by the primary, and cause a FATAL error.
And this happened in my box when I was testing the SR.
sby [20088] FATAL: archive file 00010087 has
wrong