Re: [GENERAL] postgresql 9.3 failover time

2015-12-14 Thread David Steele
On 12/12/15 2:08 PM, Shay Cohavi wrote:
> *I have postgresql 9.3 setup with 2 nodes (active/standby with streaming
> replication & continuos archiving).*
> *I have created 2 failover & failback script in order to perform a
> switchover between the DB servers:*
> *1. failover - create a trigger file in order to promote the new primary.*
> *2. failback - perform a base backup as mentions in :*
> *   a. start backup on the primary.*
> *   b. stop the failed node .*
> *   *didn't delete the DB directory on the failed node
> *   c. performing rsync between the nodes.*

If you use rsync here be sure to use checksums.  The clusters are very
similar to each other and rsync timestamp resolution could become a problem.

> *   d.stopping the backup on the primary.*
> *   e.performing rsync on the pg_xlog.*
> *   f. creating a recovery.conf*
> /
> /standby_mode = 'on'/
> /primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'/
> /restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'/
> /trigger_file = '/home/postgres/databases/fabrix/trigger'/
> /archive_cleanup_command = 'ssh 10.50.1.153
> /home/postgres/pg_utils/archive_cleanup.sh %r'/
> *
> *   g. starting the failed node as secondary.*
> *
> *the switchover method:*
> *1. stop the primary node.*
> *2. promote the secondary node (failover.sh).*
> *3. perform failback on the failed node.*
> *4. start the failed node.*
> *
> *this method works great! *
> *
> *but if I perform multiple switchovers (>20), each time the new primary
> gets promoted (trigger file) - it takes longer because it searches the
> timelines on the archive. *

This is an indication that your backup/restore process is not working
correctly.  Postgres should only look for timelines that are greater
than the current timeline.

> *for example:*
> 
> /[2015-12-12 20:35:10.769 IST] LOG:  trigger file found:
> /home/postgres/databases/fabrix/trigger/
> /[2015-12-12 20:35:10.769 IST] FATAL:  terminating walreceiver process
> due to administrator command/
> /scp: /home/postgres/archive/0094000200DC: No such file or
> directory/
> /[2015-12-12 20:35:10.893 IST] LOG:  record with zero length at 2/DC000168/
> /[2015-12-12 20:35:10.893 IST] LOG:  redo done at 2/DC000100/
> /scp: /home/postgres/archive/0094000200DC: No such file or
> directory/
> /scp: /home/postgres/archive/0093000200DC: No such file or
> directory/
> /scp: /home/postgres/archive/0092000200DC: No such file or
> directory/
> /./
> /./
> /./
> /
> /
> /scp: /home/postgres/archive/0091000200DC: No such file or
> directory/
> /scp: /home/postgres/archive/009200DC: No such file or
> directory/
> /scp: /home/postgres/archive/0095.history: No such file or directory/
> /[2015-12-12 20:35:11.801 IST] LOG:  selected new timeline ID: 149/
> /[2015-12-12 20:35:11.931 IST] LOG:  restored log file
> "0094.history" from archive/
> /[2015-12-12 20:35:12.173 IST] LOG:  archive recovery complete/
> /[2015-12-12 20:35:12.181 IST] LOG:  database system is ready to accept
> connections/
> /[2015-12-12 20:35:12.181 IST] LOG:  autovacuum launcher started/

It's not clear to me how you got to timeline 149.  Some lines have been
removed - did the history log requests go all the way to 148?

Rsync is possibly your issue here - maybe pg_control is not being copied
because the timestamp is the same on both systems (rsync only has a 1
second time resolution so this is very possible between a master and a
streaming replica).  Try rsync with checksums (--checksum) and see if
that makes a difference.

-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


[GENERAL] postgresql 9.3 failover time

2015-12-12 Thread Shay Cohavi
*Hi,*
*I have postgresql 9.3 setup with 2 nodes (active/standby with streaming
replication & continuos archiving).*
*I have created 2 failover & failback script in order to perform a
switchover between the DB servers:*
*1. failover - create a trigger file in order to promote the new primary.*
*2. failback - perform a base backup as mentions in :*
*   a. start backup on the primary.*
*   b. stop the failed node .*
   didn't delete the DB directory on the failed node
*   c. performing rsync between the nodes.*
*   d.stopping the backup on the primary.*
*   e.performing rsync on the pg_xlog.*
*   f. creating a recovery.conf*

*standby_mode = 'on'*
*primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'*
*restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'*
*trigger_file = '/home/postgres/databases/fabrix/trigger'*
*archive_cleanup_command = 'ssh 10.50.1.153
/home/postgres/pg_utils/archive_cleanup.sh %r'*

*   g. starting the failed node as secondary.*

*the switchover method:*
*1. stop the primary node.*
*2. promote the secondary node (failover.sh).*
*3. perform failback on the failed node.*
*4. start the failed node.*

*this method works great! *


*but if I perform multiple switchovers (>20), each time the new primary
gets promoted (trigger file) - it takes longer because it searches the
timelines on the archive. *

*for example:*

*[2015-12-12 20:35:10.769 IST] LOG:  trigger file found:
/home/postgres/databases/fabrix/trigger*
*[2015-12-12 20:35:10.769 IST] FATAL:  terminating walreceiver process due
to administrator command*
*scp: /home/postgres/archive/0094000200DC: No such file or
directory*
*[2015-12-12 20:35:10.893 IST] LOG:  record with zero length at 2/DC000168*
*[2015-12-12 20:35:10.893 IST] LOG:  redo done at 2/DC000100*
*scp: /home/postgres/archive/0094000200DC: No such file or
directory*
*scp: /home/postgres/archive/0093000200DC: No such file or
directory*
*scp: /home/postgres/archive/0092000200DC: No such file or
directory*
*.*
*.*
*.*

*scp: /home/postgres/archive/0091000200DC: No such file or
directory*
*scp: /home/postgres/archive/009200DC: No such file or
directory*
*scp: /home/postgres/archive/0095.history: No such file or directory*
*[2015-12-12 20:35:11.801 IST] LOG:  selected new timeline ID: 149*
*[2015-12-12 20:35:11.931 IST] LOG:  restored log file "0094.history"
from archive*
*[2015-12-12 20:35:12.173 IST] LOG:  archive recovery complete*
*[2015-12-12 20:35:12.181 IST] LOG:  database system is ready to accept
connections*
*[2015-12-12 20:35:12.181 IST] LOG:  autovacuum launcher started*

*this could take for a least 1 min.or more.*

*is there any way to skip the timeline searching in order to decrease the
promotion?*


*Thanks,*
*ShayC*