Re: [HACKERS] testing HS/SR - invalid magic number
On Wed, April 14, 2010 08:23, Heikki Linnakangas wrote: Erik Rijkers wrote: This replication test that was working well earlier (it ran daily), stopped working This is probably because of this change: date: 2010/04/12 09:52:29; author: heikki; state: Exp; lines: +71 -23 Change the logic to decide when to delete old WAL segments, so that it doesn't take into account how far the WAL senders are. This way a hung WAL sender doesn't prevent old WAL segments from being recycled/removed in the primary, ultimately causing the disk to fill up. Instead add standby_keep_segments setting to control how many old WAL segments are kept in the primary. This also makes it more reliable to use streaming replication without WAL archiving, assuming that you set standby_keep_segments high enough. If you generate enough WAL records in the master that the standby can't keep up, the primary will eventually delete a WAL segment that hasn't been streamed to the standby yet, hence the requested WAL segment 00010032 has already been removed error. ah, I hadn't seen that change, and setting standby_keep_segments to some higher value (I chose 1 for now) prevents these errors. It shouldn't remove the segment before it's successfully archived, though, and your logs show that the standby can't find that file in the archive either. Is archiving set up properly? It wasn't really - I was archiving on the primary server, instead of copying to the standby server. I'll fix that, thanks. Btw, typo alert: documentation 18.5.4, subhead 'standby_keep_segments', says: 'replciation' for 'replication'. http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] testing HS/SR - invalid magic number
Erik Rijkers wrote: Btw, typo alert: documentation 18.5.4, subhead 'standby_keep_segments', says: 'replciation' for 'replication'. http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html#RUNTIME-CONFIG-REPLICATION Thanks, fixed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] testing HS/SR - invalid magic number
Erik Rijkers wrote: On Wed, April 14, 2010 02:34, Erik Rijkers wrote: This replication test that was working well earlier (it ran daily), stopped working after reinstall of new instances of cvs HEAD. I think the change must have been today (or at least recent). This test copies a schema from a 8.4.3 instance to the 9.0 primary: pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \ | psql -1qtA -h /tmp -p 7575 -d replicas The above does indeed not work anymore, but the logfiles I sent a little earlier were from a different pg_dump statements: pg_dump -c -h /tmp -p 5432 -t public.mytable --no-owner --no-privileges ms \ | psql -1qtA -h /tmp -p 7575 -d replicas ... copying table by table to see if that improved matters, which it didn't. Streaming replication is completely oblivious of what's in the WAL it streams, you would probably get the same result with bulk insert or update. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] testing HS/SR - invalid magic number
Erik Rijkers wrote: This replication test that was working well earlier (it ran daily), stopped working after reinstall of new instances of cvs HEAD. I think the change must have been today (or at least recent). ... -- logfile standby: ... 2010-04-14 02:21:11 CEST 5601 start=2010-04-14 02:18:22 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00010032 has already been removed cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/00010032': No such file or directory 2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number in log file 0, segment 50, offset 13795328 cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/00010032': No such file or directory 2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST LOG: streaming replication successfully connected to primary This is probably because of this change: date: 2010/04/12 09:52:29; author: heikki; state: Exp; lines: +71 -23 Change the logic to decide when to delete old WAL segments, so that it doesn't take into account how far the WAL senders are. This way a hung WAL sender doesn't prevent old WAL segments from being recycled/removed in the primary, ultimately causing the disk to fill up. Instead add standby_keep_segments setting to control how many old WAL segments are kept in the primary. This also makes it more reliable to use streaming replication without WAL archiving, assuming that you set standby_keep_segments high enough. If you generate enough WAL records in the master that the standby can't keep up, the primary will eventually delete a WAL segment that hasn't been streamed to the standby yet, hence the requested WAL segment 00010032 has already been removed error. It shouldn't remove the segment before it's successfully archived, though, and your logs show that the standby can't find that file in the archive either. Is archiving set up properly? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] testing HS/SR - invalid magic number
This replication test that was working well earlier (it ran daily), stopped working after reinstall of new instances of cvs HEAD. I think the change must have been today (or at least recent). This test copies a schema from a 8.4.3 instance to the 9.0 primary: pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \ | psql -1qtA -h /tmp -p 7575 -d replicas I tried it twice, with a complete reinstall from the cvs repository; last one just now (created 20100414 0128), and once a few hours earlier. -- logfile primary: LOG: database system was shut down at 2010-04-14 02:15:01 CEST LOG: autovacuum launcher started LOG: database system is ready to accept connections LOG: received smart shutdown request LOG: autovacuum launcher shutting down LOG: shutting down LOG: database system is shut down 2010-04-14 02:15:14 CEST 14756 start=2010-04-14 02:15:14 CEST LOG: database system was shut down at 2010-04-14 02:15:12 CEST 2010-04-14 02:15:14 CEST 14752 start=2010-04-14 02:15:13 CEST LOG: database system is ready to accept connections 2010-04-14 02:18:23 CEST 15100 [unknown] rijkers start=2010-04-14 02:18:22 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=55248 [...] 2010-04-14 02:21:10 CEST 15100 [unknown] rijkers start=2010-04-14 02:18:22 CEST FATAL: requested WAL segment 00010032 has already been removed 2010-04-14 02:21:11 CEST 15377 [unknown] rijkers start=2010-04-14 02:21:11 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50869 2010-04-14 02:21:11 CEST 15377 [unknown] rijkers start=2010-04-14 02:21:11 CEST FATAL: requested WAL segment 00010032 has already been removed 2010-04-14 02:21:16 CEST 15384 [unknown] rijkers start=2010-04-14 02:21:16 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50870 2010-04-14 02:21:16 CEST 15384 [unknown] rijkers start=2010-04-14 02:21:16 CEST FATAL: requested WAL segment 00010032 has already been removed 2010-04-14 02:21:21 CEST 15400 [unknown] rijkers start=2010-04-14 02:21:21 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50871 2010-04-14 02:21:22 CEST 15400 [unknown] rijkers start=2010-04-14 02:21:21 CEST FATAL: requested WAL segment 00010032 has already been removed 2010-04-14 02:21:26 CEST 15406 [unknown] rijkers start=2010-04-14 02:21:26 CEST LOG: replication connection authorized: user=rijkers host=127.0.0.1 port=50872 etc. -- logfile standby: 2010-04-14 02:18:22 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: database system was interrupted; last known up at 2010-04-14 02:17:54 CEST 2010-04-14 02:18:22 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: entering standby mode cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/00010001': No such file or directory 2010-04-14 02:18:23 CEST 5601 start=2010-04-14 02:18:22 CEST LOG: streaming replication successfully connected to primary 2010-04-14 02:18:24 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: redo starts at 0/120 2010-04-14 02:18:24 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: consistent recovery state reached at 0/200 2010-04-14 02:18:24 CEST 5597 start=2010-04-14 02:18:22 CEST LOG: database system is ready to accept read only connections 2010-04-14 02:21:11 CEST 5601 start=2010-04-14 02:18:22 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00010032 has already been removed cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/00010032': No such file or directory 2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number in log file 0, segment 50, offset 13795328 cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/00010032': No such file or directory 2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST LOG: streaming replication successfully connected to primary 2010-04-14 02:21:11 CEST 5784 start=2010-04-14 02:21:11 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00010032 has already been removed cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/00010032': No such file or directory 2010-04-14 02:21:11 CEST 5598 start=2010-04-14 02:18:22 CEST LOG: invalid magic number in log file 0, segment 50, offset 13795328 cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/00010032': No such file or directory 2010-04-14 02:21:16 CEST 5790 start=2010-04-14 02:21:16 CEST LOG: streaming replication successfully connected to primary 2010-04-14 02:21:16 CEST 5790 start=2010-04-14 02:21:16 CEST FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 00010032 has already been removed
Re: [HACKERS] testing HS/SR - invalid magic number
On Wed, April 14, 2010 02:34, Erik Rijkers wrote: This replication test that was working well earlier (it ran daily), stopped working after reinstall of new instances of cvs HEAD. I think the change must have been today (or at least recent). This test copies a schema from a 8.4.3 instance to the 9.0 primary: pg_dump -c -h /tmp -p 5432 -n public --no-owner --no-privileges ms \ | psql -1qtA -h /tmp -p 7575 -d replicas The above does indeed not work anymore, but the logfiles I sent a little earlier were from a different pg_dump statements: pg_dump -c -h /tmp -p 5432 -t public.mytable --no-owner --no-privileges ms \ | psql -1qtA -h /tmp -p 7575 -d replicas ... copying table by table to see if that improved matters, which it didn't. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers