Re: [HACKERS] Probable problem with pg_standby

2008-11-04 Thread Detlef Ulherr

Fujii Masao wrote:
> On Tue, Nov 4, 2008 at 8:09 PM, Detlef Ulherr [EMAIL PROTECTED] wrote:
>> All I did was force the primary, during a recovery, to generate a new
>> timeline. The installed version was 8.3.4, but the problem is the same
>> with earlier versions as well; it also occurred in 8.2. The problem is
>> reproducible every time. For my agent code I implemented a workaround
>> which guarantees that during a resilvering process the primary and the
>> standby start on the same timeline. But my feeling is that the standby
>> should switch to the same timeline as the primary, without disruption,
>> when it receives the history file, and by all means it should never stop
>> the recovery without cause. Stopping makes a full resynchronization
>> necessary, and with larger databases this may cause major downtime.
>
> I agree with you only for the normal archive recovery case (no
> recovery_target_xid/time specified). But in the point-in-time recovery
> case, the standby cannot continue to redo without stopping. The DBA has
> to reconstruct the standby (take a new online backup with the new
> timeline ID, place it on the standby and restart recovery).
>
> Or should we deal with normal archive recovery and point-in-time
> recovery separately?
>
> Regards,

Agreed, a point-in-time recovery can put the primary behind the
standby, but this should not happen with a normal archive recovery, so
separating the two cases would be a big improvement. A meaningful error
message in the log would also help the poor DBA; currently there is
nothing in the standby's log. It just stops the recovery.

In my case it was a normal archive recovery, and definitely not a
point-in-time recovery.
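
To make the distinction between the two cases concrete, here is a minimal sketch of how a script could tell them apart by looking at recovery.conf. It assumes only what is said above: a recovery counts as point-in-time when recovery_target_time or recovery_target_xid is set. The helper name and the sample target time are hypothetical, and this is not how the server itself decides.

```python
import re

# Settings that, per the discussion above, turn an archive recovery into a
# point-in-time recovery.
PITR_KEYS = {"recovery_target_time", "recovery_target_xid"}


def is_point_in_time_recovery(conf_text):
    """Return True if recovery.conf text sets a recovery target.

    Hypothetical helper for illustration only.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"(\w+)\s*=", line)
        if match and match.group(1) in PITR_KEYS:
            return True
    return False


# Sample inputs: the restore_command is the one from the primary's log;
# the target time is an invented value for the example.
normal_conf = "restore_command = 'cp /pgs/83_walarchives/%f %p'\n"
pitr_conf = normal_conf + "recovery_target_time = '2008-10-29 14:07:40 CET'\n"

print(is_point_in_time_recovery(normal_conf))  # False
print(is_point_in_time_recovery(pitr_conf))    # True
```

A check like this could let an agent decide whether a standby is expected to follow a new timeline or has to be rebuilt.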


Regards,

--

*
Detlef Ulherr
Staff Engineer              Tel: (++49 6103) 752-248
Availability Engineering    Fax: (++49 6103) 752-167
Sun Microsystems GmbH
Amperestr. 6                mailto:[EMAIL PROTECTED]
63225 Langen                http://www.sun.de/
*

Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551
Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering

*



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Probable problem with pg_standby

2008-11-04 Thread Fujii Masao
On Tue, Nov 4, 2008 at 8:09 PM, Detlef Ulherr [EMAIL PROTECTED] wrote:
> All I did was force the primary, during a recovery, to generate a new
> timeline. The installed version was 8.3.4, but the problem is the same
> with earlier versions as well; it also occurred in 8.2. The problem is
> reproducible every time. For my agent code I implemented a workaround
> which guarantees that during a resilvering process the primary and the
> standby start on the same timeline. But my feeling is that the standby
> should switch to the same timeline as the primary, without disruption,
> when it receives the history file, and by all means it should never stop
> the recovery without cause. Stopping makes a full resynchronization
> necessary, and with larger databases this may cause major downtime.

I agree with you only for the normal archive recovery case (no
recovery_target_xid/time specified). But in the point-in-time recovery
case, the standby cannot continue to redo without stopping. The DBA has
to reconstruct the standby (take a new online backup with the new
timeline ID, place it on the standby and restart recovery).

Or should we deal with normal archive recovery and point-in-time
recovery separately?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



[HACKERS] Probable problem with pg_standby

2008-11-04 Thread Detlef Ulherr

Hi,

First, to introduce myself: I work in Sun Cluster engineering and am
responsible for the integration (the agent) between PostgreSQL and
Sun Cluster. The PostgreSQL agent provides a feature which uses WAL file
shipping and pg_standby as a replacement for shared storage.

Now to the problem. Whenever the primary database server selects a new
timeline, the standby server, which is running pg_standby, stops applying
logs to its database. It gets even worse: after a while pg_standby
terminates recovery mode, and now both the primary and the standby are
accepting requests. No trigger file was created, nor was a signal sent
manually to pg_standby.


Here is some debugging output of pg_standby.

running restore : OK
removing /pgs/83_walarchives/001A00B5
LOG:  restored log file 001F00D4 from archive
LOG:  record with zero length at 0/D460
LOG:  redo done at 0/D420

Trigger file: /pgs/data/failover
Waiting for WAL file: 001F00D4
WAL file path   : /pgs/83_walarchives/001F00D4
Restoring to... : pg_xlog/RECOVERYXLOG
Sleep interval  : 5 seconds
Max wait interval   : 0 forever
Command for restore : cp /pgs/83_walarchives/001F00D4 pg_xlog/RECOVERYXLOG

Keep archive history: 001F00B6 and later
running restore : OK
LOG:  restored log file 001F00D4 from archive

Trigger file: /pgs/data/failover
Waiting for WAL file: 0020.history
WAL file path   : /pgs/83_walarchives/0020.history
Restoring to... : pg_xlog/RECOVERYHISTORY
Sleep interval  : 5 seconds
Max wait interval   : 0 forever
Command for restore : cp /pgs/83_walarchives/0020.history pg_xlog/RECOVERYHISTORY

Keep archive history: No cleanup required
running restore : OK
LOG:  restored log file 0020.history from archive


Trigger file: /pgs/data/failover
Waiting for WAL file: 0021.history
WAL file path   : /pgs/83_walarchives/0021.history
Restoring to... : pg_xlog/RECOVERYHISTORY
Sleep interval  : 5 seconds
Max wait interval   : 0 forever
Command for restore : cp /pgs/83_walarchives/0021.history pg_xlog/RECOVERYHISTORY

Keep archive history: No cleanup required
running restore :cp: cannot access /pgs/83_walarchives/0021.history

cp: cannot access /pgs/83_walarchives/0021.history
cp: cannot access /pgs/83_walarchives/0021.history
not restored: history file not found
LOG:  selected new timeline ID: 33

Trigger file: /pgs/data/failover
Waiting for WAL file: 001F.history
WAL file path   : /pgs/83_walarchives/001F.history
Restoring to... : pg_xlog/RECOVERYHISTORY
Sleep interval  : 5 seconds
Max wait interval   : 0 forever
Command for restore : cp /pgs/83_walarchives/001F.history pg_xlog/RECOVERYHISTORY

Keep archive history: No cleanup required
running restore : OK
LOG:  restored log file 001F.history from archive

LOG:  archive recovery complete
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
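
For readers following the output above: the sequence of history-file requests looks like a probe for the first unused timeline, and the sketch below reproduces that arithmetic. It rests on two assumptions: that the standby was on timeline 0x1F (the last history file it restores), and that the file names are exactly as printed in this log (they may be shortened here). The function names are hypothetical; this illustrates the observed behaviour and is not the server's code.

```python
def timeline_id(history_name):
    """Return the timeline ID encoded in a '<hex>.history' file name."""
    stem, _, suffix = history_name.partition(".")
    if suffix != "history":
        raise ValueError(f"not a history file name: {history_name!r}")
    return int(stem, 16)


def select_new_timeline(start_tli, archived_names):
    """Probe timelines upward from start_tli and return the first one that
    has no history file in the archive (illustrative only)."""
    existing = {timeline_id(name) for name in archived_names}
    probe = start_tli + 1
    while probe in existing:
        probe += 1
    return probe


# History files that were found in the archive according to the log above.
archived = ["001F.history", "0020.history"]

start = timeline_id("001F.history")          # 31
print(select_new_timeline(start, archived))  # 33
```

Under those assumptions 0020.history (32) is found, 0021.history (33) is missing, and 33 is the value reported in "selected new timeline ID: 33".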

And here are the corresponding logs from the primary database server.

LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
building file list ... done
001D00D1

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
001F.history

sent 2248 bytes  received 42 bytes  4580.00 bytes/sec
total size is 2119  speedup is 0.93
building file list ... done
001F00D2

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
LOG:  database system was shut down at 2008-10-29 14:07:40 CET
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
building file list ... done
001F00D3

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
building file list ... done
001D00D2

sent 16779397 bytes  received 42 bytes  11186292.67 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
001E00D2

sent 16779397 bytes  received 42 bytes  6711775.60 bytes/sec
total size is 16777216  speedup is 1.00
LOG:  database system was shut down at 2008-10-29 14:10:59 CET
LOG:  starting archive recovery
LOG:  restore_command = 'cp /pgs/83_walarchives/%f %p'