Re: [GENERAL] postgresql 9.3 failover time
On 12/12/15 2:08 PM, Shay Cohavi wrote: > *I have postgresql 9.3 setup with 2 nodes (active/standby with streaming > replication & continuos archiving).* > *I have created 2 failover & failback script in order to perform a > switchover between the DB servers:* > *1. failover - create a trigger file in order to promote the new primary.* > *2. failback - perform a base backup as mentions in :* > * a. start backup on the primary.* > * b. stop the failed node .* > * *didn't delete the DB directory on the failed node > * c. performing rsync between the nodes.* If you use rsync here be sure to use checksums. The clusters are very similar to each other and rsync timestamp resolution could become a problem. > * d.stopping the backup on the primary.* > * e.performing rsync on the pg_xlog.* > * f. creating a recovery.conf* > / > /standby_mode = 'on'/ > /primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'/ > /restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'/ > /trigger_file = '/home/postgres/databases/fabrix/trigger'/ > /archive_cleanup_command = 'ssh 10.50.1.153 > /home/postgres/pg_utils/archive_cleanup.sh %r'/ > * > * g. starting the failed node as secondary.* > * > *the switchover method:* > *1. stop the primary node.* > *2. promote the secondary node (failover.sh).* > *3. perform failback on the failed node.* > *4. start the failed node.* > * > *this method works great! * > * > *but if I perform multiple switchovers (>20), each time the new primary > gets promoted (trigger file) - it takes longer because it searches the > timelines on the archive. * This is an indication that your backup/restore process is not working correctly. Postgres should only look for timelines that are greater than the current timeline. > *for example:* > > /[2015-12-12 20:35:10.769 IST] LOG: trigger file found: > /home/postgres/databases/fabrix/trigger/ > /[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process > due to administrator command/ > /scp: /home/postgres/archive/0094000200DC: No such file or > directory/ > /[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168/ > /[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100/ > /scp: /home/postgres/archive/0094000200DC: No such file or > directory/ > /scp: /home/postgres/archive/0093000200DC: No such file or > directory/ > /scp: /home/postgres/archive/0092000200DC: No such file or > directory/ > /./ > /./ > /./ > / > / > /scp: /home/postgres/archive/0091000200DC: No such file or > directory/ > /scp: /home/postgres/archive/009200DC: No such file or > directory/ > /scp: /home/postgres/archive/0095.history: No such file or directory/ > /[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149/ > /[2015-12-12 20:35:11.931 IST] LOG: restored log file > "0094.history" from archive/ > /[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete/ > /[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept > connections/ > /[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started/ It's not clear to me how you got to timeline 149. Some lines have been removed - did the history log requests go all the way to 148? Rsync is possibly your issue here - maybe pg_control is not being copied because the timestamp is the same on both systems (rsync only has a 1 second time resolution so this is very possible between a master and a streaming replica). Try rsync with checksums (--checksum) and see if that makes a difference. -- -David da...@pgmasters.net signature.asc Description: OpenPGP digital signature
[GENERAL] postgresql 9.3 failover time
*Hi,* *I have postgresql 9.3 setup with 2 nodes (active/standby with streaming replication & continuos archiving).* *I have created 2 failover & failback script in order to perform a switchover between the DB servers:* *1. failover - create a trigger file in order to promote the new primary.* *2. failback - perform a base backup as mentions in :* * a. start backup on the primary.* * b. stop the failed node .* didn't delete the DB directory on the failed node * c. performing rsync between the nodes.* * d.stopping the backup on the primary.* * e.performing rsync on the pg_xlog.* * f. creating a recovery.conf* *standby_mode = 'on'* *primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'* *restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'* *trigger_file = '/home/postgres/databases/fabrix/trigger'* *archive_cleanup_command = 'ssh 10.50.1.153 /home/postgres/pg_utils/archive_cleanup.sh %r'* * g. starting the failed node as secondary.* *the switchover method:* *1. stop the primary node.* *2. promote the secondary node (failover.sh).* *3. perform failback on the failed node.* *4. start the failed node.* *this method works great! * *but if I perform multiple switchovers (>20), each time the new primary gets promoted (trigger file) - it takes longer because it searches the timelines on the archive. * *for example:* *[2015-12-12 20:35:10.769 IST] LOG: trigger file found: /home/postgres/databases/fabrix/trigger* *[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process due to administrator command* *scp: /home/postgres/archive/0094000200DC: No such file or directory* *[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168* *[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100* *scp: /home/postgres/archive/0094000200DC: No such file or directory* *scp: /home/postgres/archive/0093000200DC: No such file or directory* *scp: /home/postgres/archive/0092000200DC: No such file or directory* *.* *.* *.* *scp: /home/postgres/archive/0091000200DC: No such file or directory* *scp: /home/postgres/archive/009200DC: No such file or directory* *scp: /home/postgres/archive/0095.history: No such file or directory* *[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149* *[2015-12-12 20:35:11.931 IST] LOG: restored log file "0094.history" from archive* *[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete* *[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept connections* *[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started* *this could take for a least 1 min.or more.* *is there any way to skip the timeline searching in order to decrease the promotion?* *Thanks,* *ShayC*