Re: Refactor pg_rewind code and make it work against a standby
On 20/11/2020 19:14, Andres Freund wrote: Hi, On 2020-11-20 16:19:03 +0200, Heikki Linnakangas wrote: Pushed a fix similar to your patch, but I put the wait_for_catchup() before running pg_rewind. The point of inserting the 'in A, after C was promoted' row is that it's present in B when pg_rewind runs. Hm - don't we possibly need *both*? Since post pg_rewind recovery starts at the previous checkpoint, it's quite possible for C to get ready to answer queries before that record has been replayed? No, C will not reach consistent state until all the WAL in the source system has been replayed. pg_rewind will set minRecoveryPoint to the minRecoveryPoint of the source system, after copying all the files. (Or its insert point, if it's not a standby server, but in this case it is). Same as when taking an online backup. - Heikki
Re: Refactor pg_rewind code and make it work against a standby
Hi, On 2020-11-20 16:19:03 +0200, Heikki Linnakangas wrote: > Pushed a fix similar to your patch, but I put the wait_for_catchup() before > running pg_rewind. The point of inserting the 'in A, after C was promoted' > row is that it's present in B when pg_rewind runs. Hm - don't we possibly need *both*? Since post pg_rewind recovery starts at the previous checkpoint, it's quite possible for C to get ready to answer queries before that record has been replayed? Thanks, Andres Freund
Re: Refactor pg_rewind code and make it work against a standby
On 20/11/2020 02:38, Andres Freund wrote: I locally, on a heavily modified branch (AIO support), started to get consistent failures in this test. I *suspect*, but am not sure, that it's the test's fault, not the fault of modifications. As far as I can tell, after the pg_rewind call, there's no guarantee that node_c has fully caught up to the 'in A, after C was promoted' insertion on node a. Thus at the check_query() I sometimes get just 'in A, before promotion' back. After adding a wait that problem seems to be fixed. Here's what I did diff --git i/src/bin/pg_rewind/t/007_standby_source.pl w/src/bin/pg_rewind/t/007_standby_source.pl index f6abcc2d987..48898bef2f5 100644 --- i/src/bin/pg_rewind/t/007_standby_source.pl +++ w/src/bin/pg_rewind/t/007_standby_source.pl @@ -88,6 +88,7 @@ $node_c->safe_psql('postgres', "checkpoint"); # - you need to rewind. $node_a->safe_psql('postgres', "INSERT INTO tbl1 VALUES ('in A, after C was promoted')"); +$lsn = $node_a->lsn('insert'); # Also insert a new row in the standby, which won't be present in the # old primary. @@ -142,6 +143,8 @@ $node_primary = $node_c; # Run some checks to verify that C has been successfully rewound, # and connected back to follow B. +$node_b->wait_for_catchup('node_c', 'replay', $lsn); + check_query( 'SELECT * FROM tbl1', qq(in A Yes, I was able to reproduced that by inserting a strategic sleep in the test and pausing replication by attaching gdb to the walsender process. Pushed a fix similar to your patch, but I put the wait_for_catchup() before running pg_rewind. The point of inserting the 'in A, after C was promoted' row is that it's present in B when pg_rewind runs. Thanks! - Heikki
Re: Refactor pg_rewind code and make it work against a standby
Hi, On 2020-11-15 17:10:53 +0200, Heikki Linnakangas wrote: > Yep, quite right. Fixed that way, thanks for the debugging! I locally, on a heavily modified branch (AIO support), started to get consistent failures in this test. I *suspect*, but am not sure, that it's the test's fault, not the fault of modifications. As far as I can tell, after the pg_rewind call, there's no guarantee that node_c has fully caught up to the 'in A, after C was promoted' insertion on node a. Thus at the check_query() I sometimes get just 'in A, before promotion' back. After adding a wait that problem seems to be fixed. Here's what I did diff --git i/src/bin/pg_rewind/t/007_standby_source.pl w/src/bin/pg_rewind/t/007_standby_source.pl index f6abcc2d987..48898bef2f5 100644 --- i/src/bin/pg_rewind/t/007_standby_source.pl +++ w/src/bin/pg_rewind/t/007_standby_source.pl @@ -88,6 +88,7 @@ $node_c->safe_psql('postgres', "checkpoint"); # - you need to rewind. $node_a->safe_psql('postgres', "INSERT INTO tbl1 VALUES ('in A, after C was promoted')"); +$lsn = $node_a->lsn('insert'); # Also insert a new row in the standby, which won't be present in the # old primary. @@ -142,6 +143,8 @@ $node_primary = $node_c; # Run some checks to verify that C has been successfully rewound, # and connected back to follow B. +$node_b->wait_for_catchup('node_c', 'replay', $lsn); + check_query( 'SELECT * FROM tbl1', qq(in A - Andres
Re: Refactor pg_rewind code and make it work against a standby
On 15/11/2020 09:07, Tom Lane wrote: I wrote: Not sure if you noticed, but piculet has twice failed the 007_standby_source.pl test that was added by 9c4f5192f: ... Now, I'm not sure what to make of that, but I can't help noticing that piculet uses --disable-atomics while francolin uses --disable-spinlocks. That leads the mind towards some kind of low-level synchronization bug ... Or, maybe it's less mysterious than that. The failure looks like we have not waited long enough for the just-inserted row to get replicated to node C. That wait is implemented as $lsn = $node_a->lsn('insert'); $node_b->wait_for_catchup('node_c', 'write', $lsn); which looks fishy ... shouldn't wait_for_catchup be told to wait for replay of that LSN, not just write-the-WAL? Yep, quite right. Fixed that way, thanks for the debugging! - Heikki
Re: Refactor pg_rewind code and make it work against a standby
I wrote: > Not sure if you noticed, but piculet has twice failed the > 007_standby_source.pl test that was added by 9c4f5192f: > ... > Now, I'm not sure what to make of that, but I can't help noticing that > piculet uses --disable-atomics while francolin uses --disable-spinlocks. > That leads the mind towards some kind of low-level synchronization > bug ... Or, maybe it's less mysterious than that. The failure looks like we have not waited long enough for the just-inserted row to get replicated to node C. That wait is implemented as $lsn = $node_a->lsn('insert'); $node_b->wait_for_catchup('node_c', 'write', $lsn); which looks fishy ... shouldn't wait_for_catchup be told to wait for replay of that LSN, not just write-the-WAL? regards, tom lane
Re: Refactor pg_rewind code and make it work against a standby
Not sure if you noticed, but piculet has twice failed the 007_standby_source.pl test that was added by 9c4f5192f: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=2020-11-15%2006%3A00%3A11 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=2020-11-13%2011%3A20%3A10 and francolin failed once: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin=2020-11-12%2018%3A57%3A33 These failures look the same: # Failed test 'table content after rewind and insert: query result matches' # at t/007_standby_source.pl line 160. # got: 'in A # in A, before promotion # in A, after C was promoted # ' # expected: 'in A # in A, before promotion # in A, after C was promoted # in A, after rewind # ' # Looks like you failed 1 test of 3. [11:27:01] t/007_standby_source.pl ... Dubious, test returned 1 (wstat 256, 0x100) Failed 1/3 subtests Now, I'm not sure what to make of that, but I can't help noticing that piculet uses --disable-atomics while francolin uses --disable-spinlocks. That leads the mind towards some kind of low-level synchronization bug ... regards, tom lane
Re: Refactor pg_rewind code and make it work against a standby
On 04/11/2020 11:23, Heikki Linnakangas wrote: I read through the patches one more time, fixed a bunch of typos and such, and pushed patches 1-4. I'm going to spend some more time on testing the last patch. It allows using a standby server as the source, and we don't have any tests for that yet. Thanks for the review! Did some more testing, fixed one bug, and pushed. To test this, I set up a cluster with one primary, a standby, and a cascaded standby. I launched a test workload against the primary that creates tables, inserts rows, and drops tables continuously. In another shell, I promoted the cascaded standby, run some updates on the promoted server, and finally, run pg_rewind pointed at the standby, and start it again as a cascaded standby. Repeat. Attached are the scripts I used. I edited them between test runs to test slightly different scenarios. I don't expect them to be very useful to anyone else, but the Internet is my backup. I did find one bug in the patch with that, so the time was well spent: the code in process_queued_fetch_requests() got confused and errored out, if a file was removed in the source system while pg_rewind was running. There was code to deal with that, but it was broken. Fixed that. - Heikki rewind-cascading-test.tar.gz Description: application/gzip
Re: Refactor pg_rewind code and make it work against a standby
On 25/09/2020 02:56, Soumyadeep Chakraborty wrote: On Thu, Sep 24, 2020 at 10:27 AM Heikki Linnakangas wrote: 7. Please address the FIXME for the symlink case: /* FIXME: Check if it points to the same target? */ It's not a new issue. Would be nice to fix, of course. I'm not sure what the right thing to do would be. If you have e.g. replaced postgresql.conf with a symlink that points outside the data directory, would it be appropriate to overwrite it? Or perhaps we should throw an error? We also throw an error if a file is a symlink in the source but a regular file in the target, or vice versa. Hmm, I can imagine a use case for 2 different symlink targets on the source and target clusters. For example the primary's pg_wal directory can have a different symlink target as compared to a standby's (different mount points on the same network maybe?). An end user might not desire pg_rewind meddling with that setup or may desire pg_rewind to treat the source as a source-of-truth with respect to this as well and would want pg_rewind to overwrite the target's symlink. Maybe doing a check and emitting a warning with hint/detail is prudent here while taking no action. We have special handling for 'pg_wal' to pretend that it's a regular directory (see process_source_file()), so that's taken care of. But if you did a something similar with some other subdirectory, that would be a problem. 14. queue_overwrite_range(), finish_overwrite() instead of queue_fetch_range(), finish_fetch()? Similarly update\ *_fetch_file_range() and *_finish_fetch() 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c Good idea! And fetch.h -> rewind_source.h. +1. You might have missed the changes to rename "fetch" -> "overwrite" as was mentioned in 14. I preferred the "fetch" nomenclature in those function names. They fetch and overwrite the file ranges, so 'fetch' still seems appropriate. "fetch" -> "overwrite" would make sense if you wanted to emphasize the "overwrite" part more. Or we could rename it to "fetch_and_overwrite". But overall I think "fetch" is fine. 16. conn = PQconnectdb(connstr_source); if (PQstatus(conn) == CONNECTION_BAD) pg_fatal("could not connect to server: %s", PQerrorMessage(conn)); if (showprogress) pg_log_info("connected to server"); The above hunk should be part of init_libpq_source(). Consequently, init_libpq_source() should take a connection string instead of a conn. The libpq connection is also needed by WriteRecoveryConfig(), that's why it's not fully encapsulated in libpq_source. Ah. I find it pretty weird why we need to specify --source-server to have write-recovery-conf work. From the code, we only need the conn for calling PQserverVersion(), something we can easily get by slurping pg_controldata on the source side? Maybe we can remove this limitation? Yeah, perhaps. In another patch :-). I read through the patches one more time, fixed a bunch of typos and such, and pushed patches 1-4. I'm going to spend some more time on testing the last patch. It allows using a standby server as the source, and we don't have any tests for that yet. Thanks for the review! - Heikki
Re: Refactor pg_rewind code and make it work against a standby
On Thu, Sep 24, 2020 at 10:27 AM Heikki Linnakangas wrote: > >> /* > >> * If this is a relation file, copy the modified blocks. > >> * > >> * This is in addition to any other changes. > >> */ > >> iter = datapagemap_iterate(>target_modified_pages); > >> while (datapagemap_next(iter, )) > >> { > >> offset = blkno * BLCKSZ; > >> > >> source->queue_fetch_range(source, entry->path, offset, BLCKSZ); > >> } > >> pg_free(iter); > > > > Can we put this hunk into a static function overwrite_pages()? > > Meh, it's only about 10 lines of code, and one caller. Fair. > > > 7. Please address the FIXME for the symlink case: > > /* FIXME: Check if it points to the same target? */ > > It's not a new issue. Would be nice to fix, of course. I'm not sure what > the right thing to do would be. If you have e.g. replaced > postgresql.conf with a symlink that points outside the data directory, > would it be appropriate to overwrite it? Or perhaps we should throw an > error? We also throw an error if a file is a symlink in the source but a > regular file in the target, or vice versa. > Hmm, I can imagine a use case for 2 different symlink targets on the source and target clusters. For example the primary's pg_wal directory can have a different symlink target as compared to a standby's (different mount points on the same network maybe?). An end user might not desire pg_rewind meddling with that setup or may desire pg_rewind to treat the source as a source-of-truth with respect to this as well and would want pg_rewind to overwrite the target's symlink. Maybe doing a check and emitting a warning with hint/detail is prudent here while taking no action. > > 8. > > > > * it anyway. But there's no harm in copying it now.) > > > > and > > > > * copy them here. But we don't know which scenario we're > > * dealing with, and there's no harm in copying the missing > > * blocks now, so do it now. > > > > Could you add a line or two explaining why there is "no harm" in these > > two hunks above? > > The previous sentences explain that there's a WAL record covering them. > So they will be overwritten by WAL replay anyway. Does it need more > explanation? Yeah you are right, that is reason enough. > > 14. queue_overwrite_range(), finish_overwrite() instead of > > queue_fetch_range(), finish_fetch()? Similarly update\ > > *_fetch_file_range() and *_finish_fetch() > > > > > > 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c > > Good idea! And fetch.h -> rewind_source.h. +1. You might have missed the changes to rename "fetch" -> "overwrite" as was mentioned in 14. > > I also moved the code in execute_file_actions() function to pg_rewind.c, > into a new function: perform_rewind(). It felt a bit silly to have just > execute_file_actions() in a file of its own. perform_rewind() now does > all the modifications to the data directory, writing the backup file. > Except for writing the recovery config: that also needs to be when > there's no rewind to do, so it makes sense to keep it separate. What do > you think? I don't mind inlining that functionality into perform_rewind(). +1. Heads up: The function declaration for execute_file_actions() is still there in rewind_source.h. > > 16. > > > >> conn = PQconnectdb(connstr_source); > >> > >> if (PQstatus(conn) == CONNECTION_BAD) > >> pg_fatal("could not connect to server: %s", > >> PQerrorMessage(conn)); > >> > >> if (showprogress) > >> pg_log_info("connected to server"); > > > > > > The above hunk should be part of init_libpq_source(). Consequently, > > init_libpq_source() should take a connection string instead of a conn. > > The libpq connection is also needed by WriteRecoveryConfig(), that's why > it's not fully encapsulated in libpq_source. Ah. I find it pretty weird why we need to specify --source-server to have write-recovery-conf work. From the code, we only need the conn for calling PQserverVersion(), something we can easily get by slurping pg_controldata on the source side? Maybe we can remove this limitation? Regards, Soumyadeep (VMware)
Re: Refactor pg_rewind code and make it work against a standby
On 20/09/2020 23:44, Soumyadeep Chakraborty wrote: Before getting into the code review for the patch, I wanted to know why we don't use a Bitmapset for target_modified_pages? Bitmapset is not available in client code. Perhaps it could be moved to src/common with some changes, but doesn't seem worth it until there's more client code that would need it. I'm not sure that a bitmap is the best data structure for tracking the changed blocks in the first place. A hash table might be better if there are only a few changed blocks, or something like src/backend/lib/integerset.c if there are many. But as long as the simple bitmap gets the job done, let's keep it simple. 2. Rename target_modified_pages to target_pages_to_overwrite? target_modified_pages can lead to confusion as to whether it includes pages that were modified on the target but not even present in the source and the other exclusionary cases. target_pages_to_overwrite is much clearer. Agreed, I'll rename it. Conceptually, while we're scanning source WAL, we're just making note of the modified blocks. The decision on what to do with them happens only later, in decide_file_action(). The difference is purely theoretical, though. There is no real decision to be made, all the modified blocks will be overwritten. So on the whole, I agree 'target_page_to_overwrite' is better. /* * If this is a relation file, copy the modified blocks. * * This is in addition to any other changes. */ iter = datapagemap_iterate(>target_modified_pages); while (datapagemap_next(iter, )) { offset = blkno * BLCKSZ; source->queue_fetch_range(source, entry->path, offset, BLCKSZ); } pg_free(iter); Can we put this hunk into a static function overwrite_pages()? Meh, it's only about 10 lines of code, and one caller. 4. * block that have changed in the target system. It makes note of all the * changed blocks in the pagemap of the file. Can we replace the above with: * block that has changed in the target system. It decides if the given blkno in the target relfile needs to be overwritten from the source. Ok. Again conceptually though, process_target_wal_block_change() just collects information, and the decisions are made later. But you're right that we do leave out truncated-away blocks here, so we are doing more than just noting all the changed blocks. /* * Doesn't exist in either server. Why does it have an entry in the * first place?? */ return FILE_ACTION_NONE; Can we delete the above hunk and add the following assert to the very top of decide_file_action(): Assert(entry->target_exists || entry->source_exists); I would like to keep the check even when assertions are not enabled. I'll add an Assert(false) there. 7. Please address the FIXME for the symlink case: /* FIXME: Check if it points to the same target? */ It's not a new issue. Would be nice to fix, of course. I'm not sure what the right thing to do would be. If you have e.g. replaced postgresql.conf with a symlink that points outside the data directory, would it be appropriate to overwrite it? Or perhaps we should throw an error? We also throw an error if a file is a symlink in the source but a regular file in the target, or vice versa. 8. * it anyway. But there's no harm in copying it now.) and * copy them here. But we don't know which scenario we're * dealing with, and there's no harm in copying the missing * blocks now, so do it now. Could you add a line or two explaining why there is "no harm" in these two hunks above? The previous sentences explain that there's a WAL record covering them. So they will be overwritten by WAL replay anyway. Does it need more explanation? 14. queue_overwrite_range(), finish_overwrite() instead of queue_fetch_range(), finish_fetch()? Similarly update\ *_fetch_file_range() and *_finish_fetch() 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c Good idea! And fetch.h -> rewind_source.h. I also moved the code in execute_file_actions() function to pg_rewind.c, into a new function: perform_rewind(). It felt a bit silly to have just execute_file_actions() in a file of its own. perform_rewind() now does all the modifications to the data directory, writing the backup file. Except for writing the recovery config: that also needs to be when there's no rewind to do, so it makes sense to keep it separate. What do you think? 16. conn = PQconnectdb(connstr_source); if (PQstatus(conn) == CONNECTION_BAD) pg_fatal("could not connect to server: %s", PQerrorMessage(conn)); if (showprogress) pg_log_info("connected to server"); The above hunk should be part of init_libpq_source(). Consequently, init_libpq_source() should take a connection string instead of a conn. The libpq connection is also needed by WriteRecoveryConfig(), that's why it's not fully encapsulated in libpq_source. 19. typedef struct { const char *path; /* path relative to data directory root */ uint64 offset;
Re: Refactor pg_rewind code and make it work against a standby
Thanks for the review! I'll post a new version shortly, with your comments incorporated. More detailed response to a few of them below: On 18/09/2020 10:41, Kyotaro Horiguchi wrote: I don't think filemap_finalize needs to iterate over filemap twice. True, but I thought it's more clear that way, doing one thing at a time. hash_string_pointer is a copy of that of pg_verifybackup.c. Is it worth having in hashfn.h or .c? I think it's fine for now. Maybe in the future if more copies crop up. --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c ... + filemap_t *filemap; .. + filemap_init(); ... + filemap = filemap_finalize(); I'm a bit confused by this, and realized that what filemap_init initializes is *not* the filemap, but the filehash. So for example, the name of the functions might should be something like this? filehash_init() filemap = filehash_finalyze()/create_filemap() My thinking was that filemap_* is the prefix for the operations in filemap.c, hence filemap_init(). I can see the confusion, though, and I think you're right that renaming would be good. I renamed them to filehash_init(), and decide_file_actions(). I think those names make the calling code in pg_rewind.c quite clear. /* * Also check that full_page_writes is enabled. We can get torn pages if * a page is modified while we read it with pg_read_binary_file(), and we * rely on full page images to fix them. */ str = run_simple_query(conn, "SHOW full_page_writes"); if (strcmp(str, "on") != 0) pg_fatal("full_page_writes must be enabled in the source server"); pg_free(str); This is a part not changed by this patch set. But If we allow to connect to a standby, this check can be tricked by setting off on the primary and "on" on the standby (FWIW, though). Some protection measure might be necessary. Good point, the value in the primary is what matters. In fact, even when connected to the primary, the value might change while pg_rewind is running. I'm not sure how we could tighten that up. + if (chunksize > rq->length) + { + pg_fatal("received more than requested for file \"%s\"", +rq->path); + /* receiving less is OK, though */ Don't we need to truncate the target file, though? If a file is truncated in the source while pg_rewind is running, there should be a WAL record about the truncation that gets replayed when you start the server after pg_rewind has finished. We could truncate the file if we wanted to, but it's not necessary. I'll add a comment. - Heikki
Re: Refactor pg_rewind code and make it work against a standby
Hey Heikki, Thanks for refactoring and making the code much easier to read! Before getting into the code review for the patch, I wanted to know why we don't use a Bitmapset for target_modified_pages? Code review: 1. We need to update the comments for process_source_file and process_target_file. We don't decide the action on the file until later. 2. Rename target_modified_pages to target_pages_to_overwrite? target_modified_pages can lead to confusion as to whether it includes pages that were modified on the target but not even present in the source and the other exclusionary cases. target_pages_to_overwrite is much clearer. 3. > /* > * If this is a relation file, copy the modified blocks. > * > * This is in addition to any other changes. > */ > iter = datapagemap_iterate(>target_modified_pages); > while (datapagemap_next(iter, )) > { > offset = blkno * BLCKSZ; > > source->queue_fetch_range(source, entry->path, offset, BLCKSZ); > } > pg_free(iter); Can we put this hunk into a static function overwrite_pages()? 4. > * block that have changed in the target system. It makes note of all the > * changed blocks in the pagemap of the file. Can we replace the above with: > * block that has changed in the target system. It decides if the given blkno in the target relfile needs to be overwritten from the source. 5. > /* > * Doesn't exist in either server. Why does it have an entry in the > * first place?? > */ > return FILE_ACTION_NONE; Can we delete the above hunk and add the following assert to the very top of decide_file_action(): Assert(entry->target_exists || entry->source_exists); 6. > pg_fatal("unexpected page modification for directory or symbolic link \"%s\"", > entry->path); Can we replace above with: pg_fatal("unexpected page modification for non-regular file \"%s\"", entry->path); This way we can address the undefined file type. 7. Please address the FIXME for the symlink case: /* FIXME: Check if it points to the same target? */ 8. * it anyway. But there's no harm in copying it now.) and * copy them here. But we don't know which scenario we're * dealing with, and there's no harm in copying the missing * blocks now, so do it now. Could you add a line or two explaining why there is "no harm" in these two hunks above? 9. Can we add pg_control, /pgsql_tmp/... and .../pgsql_tmp.* and PG_VERSION files to check_file_excluded()? 10. - * block that have changed in the target system. It makes note of all the + * block that have changed in the target system. It makes note of all the Whitespace typo 11. > * If the block is beyond the EOF in the source system, or the file doesn't > * doesn'exist Typo: Two doesnt's 12. > /* > * This represents the final decisions on what to do with each file. > * 'actions' array contains an entry for each file, sorted in the order > * that their actions should executed. > */ > typedef struct filemap_t > { > /* Summary information, filled by calculate_totals() */ > uint64 total_size; /* total size of the source cluster */ > uint64 fetch_size; /* number of bytes that needs to be copied */ > > int nactions; /* size of 'actions' array */ > file_entry_t *actions[FLEXIBLE_ARRAY_MEMBER]; > } filemap_t; Replace nactions/actions with nentries/entries..clearer in intent as it is easier to reconcile the modified pages stuff to an entry rather than an action. It could be: /* * This contains the final decisions on what to do with each file. * 'entries' array contains an entry for each file, sorted in the order * that their actions should executed. */ typedef struct filemap_t { /* Summary information, filled by calculate_totals() */ uint64 total_size; /* total size of the source cluster */ uint64 fetch_size; /* number of bytes that needs to be copied */ int nentries; /* size of 'entries' array */ file_entry_t *entries[FLEXIBLE_ARRAY_MEMBER]; } filemap_t; 13. > filehash = filehash_create(1000, NULL); Use a constant for the 1000 in (FILEMAP_INITIAL_SIZE): 14. queue_overwrite_range(), finish_overwrite() instead of queue_fetch_range(), finish_fetch()? Similarly update\ *_fetch_file_range() and *_finish_fetch() 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c 16. > conn = PQconnectdb(connstr_source); > > if (PQstatus(conn) == CONNECTION_BAD) > pg_fatal("could not connect to server: %s", > PQerrorMessage(conn)); > > if (showprogress) > pg_log_info("connected to server"); The above hunk should be part of init_libpq_source(). Consequently, init_libpq_source() should take a connection string instead of a conn. 17. > if (conn) > { > PQfinish(conn); > conn = NULL; > } The hunk above should be part of libpq_destroy() 18. > /* > * Files are fetched max CHUNK_SIZE bytes at a time, and with a > * maximum of MAX_CHUNKS_PER_QUERY chunks in a single query. > */ > #define CHUNK_SIZE (1024 * 1024) Can we rename CHUNK_SIZE to MAX_CHUNK_SIZE and update the comment? 19. > typedef struct > { > const char
Re: Refactor pg_rewind code and make it work against a standby
Hello. It needed rebasing. (Attached) At Tue, 25 Aug 2020 16:32:02 +0300, Heikki Linnakangas wrote in > On 20/08/2020 11:32, Kyotaro Horiguchi wrote: > > 0002: Rewording that old->target and new->source makes the meaning far > > Good idea! I changed the patch that way. Looks Good. > > 0003: Thomas is propsing sort template. It could be used if committed. +* If the block is beyond the EOF in the source system, or the file doesn't +* doesn'exist in the source at all, we're going to truncate/remove it away "the file doesn't doesn'exist" I don't think filemap_finalize needs to iterate over filemap twice. hash_string_pointer is a copy of that of pg_verifybackup.c. Is it worth having in hashfn.h or .c? > --- a/src/bin/pg_rewind/pg_rewind.c > +++ b/src/bin/pg_rewind/pg_rewind.c > ... > + filemap_t *filemap; > .. > + filemap_init(); > ... > + filemap = filemap_finalize(); I'm a bit confused by this, and realized that what filemap_init initializes is *not* the filemap, but the filehash. So for example, the name of the functions might should be something like this? filehash_init() filemap = filehash_finalyze()/create_filemap() > > 0004: > > The names of many of the functions gets an additional word "local" > > but I don't get the meaning clearly. but its about linguistic sense > > and I'm not fit to that.. > > -rewind_copy_file_range(const char *path, off_t begin, off_t end, bool > > -trunc) > > +local_fetch_file_range(rewind_source *source, const char *path, > > uint64 off, > > The function actually copying the soruce range to the target file. So > > "fetch" might give somewhat different meaning, but its about > > linguistic (omitted..). > > Hmm. It is "fetching" the range from the source server, and writing it > to the target. The term makes more sense with a libpq source. Perhaps > this function should be called "local_copy_range" or something, but > it'd also be nice to have "fetch" in the name because the function > pointer it's assigned to is called "queue_fetch_range". Thanks. Yeah libpq_fetch_file makes sense. I agree to the name. The refactoring looks good to me. > > I'm going to continue reviewing this later. > > Thanks! Attached is a new set of patches. The only meaningful change > is in the 2nd patch, which I modified per your suggestion. Also, I > moved the logic to decide each file's fate into a new subroutine > called decide_file_action(). That's looks good. 0005: + /* +* We don't intend do any updates. Put the connection in read-only mode +* to keep us honest. +*/ run_simple_command(conn, "SET default_transaction_read_only = off"); The comment is wrong since the time it was added by 0004 but that's not a problem since it was to be fixed by 0005. However, we need the variable turned on in order to be really honest:p > /* > * Also check that full_page_writes is enabled. We can get torn pages if > * a page is modified while we read it with pg_read_binary_file(), and we > * rely on full page images to fix them. > */ > str = run_simple_query(conn, "SHOW full_page_writes"); > if (strcmp(str, "on") != 0) > pg_fatal("full_page_writes must be enabled in the source server"); > pg_free(str); This is a part not changed by this patch set. But If we allow to connect to a standby, this check can be tricked by setting off on the primary and "on" on the standby (FWIW, though). Some protection measure might be necessary. (Maybe standby should be restricted to have the same value with the primary.) + thislen = Min(len, CHUNK_SIZE - prev->length); + src->request_queue[src->num_requests - 1].length += thislen; prev == >request_queue[src->num_requests - 1] here. + if (chunksize > rq->length) + { + pg_fatal("received more than requested for file \"%s\"", +rq->path); + /* receiving less is OK, though */ Don't we need to truncate the target file, though? +* Source is a local data directory. It should've shut down cleanly, +* and we must to the latest shutdown checkpoint. "and we must to the" => "and we must replay to the" ? regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 6cfbb5a686e5b87159aed9769b1e52859bae02b4 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 19 Aug 2020 15:34:35 +0300 Subject: [PATCH v3 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c For consistency. All the other low-level functions that operate on the target directory are in file_ops.c. --- src/bin/pg_rewind/file_ops.c | 19 +++ src/bin/pg_rewind/file_ops.h | 1 + src/bin/pg_rewind/pg_rewind.c | 22 +- src/bin/pg_rewind/pg_rewind.h | 1 + 4 files changed, 22 insertions(+), 21 deletions(-) diff --git a/src/bin/pg_rewind/file_ops.c
Re: Refactor pg_rewind code and make it work against a standby
On Tue, Aug 25, 2020 at 04:32:02PM +0300, Heikki Linnakangas wrote: > On 20/08/2020 11:32, Kyotaro Horiguchi wrote: > > 0002: Rewording that old->target and new->source makes the meaning far > > clearer. Moving decisions core code into filemap_finalize is > > reasonable. > > > > By the way, some of the rules are remaining in > > process_source/target_file. For example, pg_wal that is a symlink, > > or tmporary directories. and excluded files. The number of > > excluded files doesn't seem so large so it doesn't seem that the > > exclusion give advantage so much. They seem to me movable to > > filemap_finalize, and we can get rid of the callbacks by doing > > so. Is there any reason that the remaining rules should be in the > > callbacks? > > Good idea! I changed the patch that way. > > > 0003: Thomas is propsing sort template. It could be used if committed. > > > > 0004: > > > > The names of many of the functions gets an additional word "local" > > but I don't get the meaning clearly. but its about linguistic sense > > and I'm not fit to that.. > > -rewind_copy_file_range(const char *path, off_t begin, off_t end, bool > > trunc) > > +local_fetch_file_range(rewind_source *source, const char *path, uint64 off, > > > > The function actually copying the soruce range to the target file. So > > "fetch" might give somewhat different meaning, but its about > > linguistic (omitted..). > > Hmm. It is "fetching" the range from the source server, and writing it to > the target. The term makes more sense with a libpq source. Perhaps this > function should be called "local_copy_range" or something, but it'd also be > nice to have "fetch" in the name because the function pointer it's assigned > to is called "queue_fetch_range". > > > I'm going to continue reviewing this later. > > Thanks! Attached is a new set of patches. The only meaningful change is in > the 2nd patch, which I modified per your suggestion. Also, I moved the logic > to decide each file's fate into a new subroutine called > decide_file_action(). The patch set fails to apply from 0002~, so this needs a rebase. I have not looked at all that in details, but no objections to apply 0001 from me. It makes sense to move the sync subroutine for the target folder to file_ops.c. -- Michael signature.asc Description: PGP signature
Re: Refactor pg_rewind code and make it work against a standby
On 20/08/2020 11:32, Kyotaro Horiguchi wrote: 0002: Rewording that old->target and new->source makes the meaning far clearer. Moving decisions core code into filemap_finalize is reasonable. By the way, some of the rules are remaining in process_source/target_file. For example, pg_wal that is a symlink, or tmporary directories. and excluded files. The number of excluded files doesn't seem so large so it doesn't seem that the exclusion give advantage so much. They seem to me movable to filemap_finalize, and we can get rid of the callbacks by doing so. Is there any reason that the remaining rules should be in the callbacks? Good idea! I changed the patch that way. 0003: Thomas is propsing sort template. It could be used if committed. 0004: The names of many of the functions gets an additional word "local" but I don't get the meaning clearly. but its about linguistic sense and I'm not fit to that.. -rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc) +local_fetch_file_range(rewind_source *source, const char *path, uint64 off, The function actually copying the soruce range to the target file. So "fetch" might give somewhat different meaning, but its about linguistic (omitted..). Hmm. It is "fetching" the range from the source server, and writing it to the target. The term makes more sense with a libpq source. Perhaps this function should be called "local_copy_range" or something, but it'd also be nice to have "fetch" in the name because the function pointer it's assigned to is called "queue_fetch_range". I'm going to continue reviewing this later. Thanks! Attached is a new set of patches. The only meaningful change is in the 2nd patch, which I modified per your suggestion. Also, I moved the logic to decide each file's fate into a new subroutine called decide_file_action(). - Heikki >From 1effdaeabb843b40575ae898ec9de57ffc37f301 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 19 Aug 2020 15:34:35 +0300 Subject: [PATCH v2 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c For consistency. All the other low-level functions that operate on the target directory are in file_ops.c. --- src/bin/pg_rewind/file_ops.c | 19 +++ src/bin/pg_rewind/file_ops.h | 1 + src/bin/pg_rewind/pg_rewind.c | 22 +- src/bin/pg_rewind/pg_rewind.h | 1 + 4 files changed, 22 insertions(+), 21 deletions(-) diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c index b3bf091c546..55439db20ba 100644 --- a/src/bin/pg_rewind/file_ops.c +++ b/src/bin/pg_rewind/file_ops.c @@ -19,6 +19,7 @@ #include #include "common/file_perm.h" +#include "common/file_utils.h" #include "file_ops.h" #include "filemap.h" #include "pg_rewind.h" @@ -266,6 +267,24 @@ remove_target_symlink(const char *path) dstpath); } +/* + * Sync target data directory to ensure that modifications are safely on disk. + * + * We do this once, for the whole data directory, for performance reasons. At + * the end of pg_rewind's run, the kernel is likely to already have flushed + * most dirty buffers to disk. Additionally fsync_pgdata uses a two-pass + * approach (only initiating writeback in the first pass), which often reduces + * the overall amount of IO noticeably. + */ +void +sync_target_dir(void) +{ + if (!do_sync || dry_run) + return; + + fsync_pgdata(datadir_target, PG_VERSION_NUM); +} + /* * Read a file into memory. The file to be read is /. diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h index 025f24141c9..d8466385cf5 100644 --- a/src/bin/pg_rewind/file_ops.h +++ b/src/bin/pg_rewind/file_ops.h @@ -19,6 +19,7 @@ extern void remove_target_file(const char *path, bool missing_ok); extern void truncate_target_file(const char *path, off_t newsize); extern void create_target(file_entry_t *t); extern void remove_target(file_entry_t *t); +extern void sync_target_dir(void); extern char *slurpFile(const char *datadir, const char *path, size_t *filesize); diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index 23fc749e445..c9b9e480c0f 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -20,7 +20,6 @@ #include "catalog/pg_control.h" #include "common/controldata_utils.h" #include "common/file_perm.h" -#include "common/file_utils.h" #include "common/restricted_token.h" #include "common/string.h" #include "fe_utils/recovery_gen.h" @@ -38,7 +37,6 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, static void digestControlFile(ControlFileData *ControlFile, char *source, size_t size); -static void syncTargetDirectory(void); static void getRestoreCommand(const char *argv0); static void sanityChecks(void); static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex); @@ -455,7 +453,7 @@ main(int argc, char **argv) if
Re: Refactor pg_rewind code and make it work against a standby
Hello. At Wed, 19 Aug 2020 15:50:16 +0300, Heikki Linnakangas wrote in > Hi, > > I started to hack on making pg_rewind crash-safe (see [1]), but I > quickly got side-tracked into refactoring and tidying up up the code > in general. I ended up with this series of patches: ^^; > The first four patches are just refactoring that make the code and the > logic more readable. Tom Lane commented about the messy comments > earlier (see [2]), and I hope these patches will alleviate that > confusion. See commit messages for details. 0001: It looks fine. The new location is reasonable but adding one extern is a bit annoying. But I don't object it. 0002: Rewording that old->target and new->source makes the meaning far clearer. Moving decisions core code into filemap_finalize is reasonable. By the way, some of the rules are remaining in process_source/target_file. For example, pg_wal that is a symlink, or tmporary directories. and excluded files. The number of excluded files doesn't seem so large so it doesn't seem that the exclusion give advantage so much. They seem to me movable to filemap_finalize, and we can get rid of the callbacks by doing so. Is there any reason that the remaining rules should be in the callbacks? 0003: Thomas is propsing sort template. It could be used if committed. 0004: The names of many of the functions gets an additional word "local" but I don't get the meaning clearly. but its about linguistic sense and I'm not fit to that.. -rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc) +local_fetch_file_range(rewind_source *source, const char *path, uint64 off, The function actually copying the soruce range to the target file. So "fetch" might give somewhat different meaning, but its about linguistic (omitted..). > The last patch refactors the logic in libpq_fetch.c, so that it no > longer uses a temporary table in the source system. That allows using > a hot standby server as the pg_rewind source. That sounds nice. > This doesn't do anything about pg_rewind's crash-safety yet, but I'll > try work on that after these patches. > > [1] > https://www.postgresql.org/message-id/d8dcc760-8780-5084-f066-6d663801d2e2%40iki.fi > > [2] > https://www.postgresql.org/message-id/30255.1522711675%40sss.pgh.pa.us > > - Heikki I'm going to continue reviewing this later. reagards. -- Kyotaro Horiguchi NTT Open Source Software Center
Refactor pg_rewind code and make it work against a standby
Hi, I started to hack on making pg_rewind crash-safe (see [1]), but I quickly got side-tracked into refactoring and tidying up up the code in general. I ended up with this series of patches: The first four patches are just refactoring that make the code and the logic more readable. Tom Lane commented about the messy comments earlier (see [2]), and I hope these patches will alleviate that confusion. See commit messages for details. The last patch refactors the logic in libpq_fetch.c, so that it no longer uses a temporary table in the source system. That allows using a hot standby server as the pg_rewind source. This doesn't do anything about pg_rewind's crash-safety yet, but I'll try work on that after these patches. [1] https://www.postgresql.org/message-id/d8dcc760-8780-5084-f066-6d663801d2e2%40iki.fi [2] https://www.postgresql.org/message-id/30255.1522711675%40sss.pgh.pa.us - Heikki >From f2057f662e9c52c7392491831338a720a21a8ca3 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 19 Aug 2020 15:34:35 +0300 Subject: [PATCH 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c For consistency. All the other low-level functions that operate on the target directory are in file_ops.c. --- src/bin/pg_rewind/file_ops.c | 19 +++ src/bin/pg_rewind/file_ops.h | 1 + src/bin/pg_rewind/pg_rewind.c | 22 +- src/bin/pg_rewind/pg_rewind.h | 1 + 4 files changed, 22 insertions(+), 21 deletions(-) diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c index b3bf091c546..55439db20ba 100644 --- a/src/bin/pg_rewind/file_ops.c +++ b/src/bin/pg_rewind/file_ops.c @@ -19,6 +19,7 @@ #include #include "common/file_perm.h" +#include "common/file_utils.h" #include "file_ops.h" #include "filemap.h" #include "pg_rewind.h" @@ -266,6 +267,24 @@ remove_target_symlink(const char *path) dstpath); } +/* + * Sync target data directory to ensure that modifications are safely on disk. + * + * We do this once, for the whole data directory, for performance reasons. At + * the end of pg_rewind's run, the kernel is likely to already have flushed + * most dirty buffers to disk. Additionally fsync_pgdata uses a two-pass + * approach (only initiating writeback in the first pass), which often reduces + * the overall amount of IO noticeably. + */ +void +sync_target_dir(void) +{ + if (!do_sync || dry_run) + return; + + fsync_pgdata(datadir_target, PG_VERSION_NUM); +} + /* * Read a file into memory. The file to be read is /. diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h index 025f24141c9..d8466385cf5 100644 --- a/src/bin/pg_rewind/file_ops.h +++ b/src/bin/pg_rewind/file_ops.h @@ -19,6 +19,7 @@ extern void remove_target_file(const char *path, bool missing_ok); extern void truncate_target_file(const char *path, off_t newsize); extern void create_target(file_entry_t *t); extern void remove_target(file_entry_t *t); +extern void sync_target_dir(void); extern char *slurpFile(const char *datadir, const char *path, size_t *filesize); diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index 23fc749e445..c9b9e480c0f 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -20,7 +20,6 @@ #include "catalog/pg_control.h" #include "common/controldata_utils.h" #include "common/file_perm.h" -#include "common/file_utils.h" #include "common/restricted_token.h" #include "common/string.h" #include "fe_utils/recovery_gen.h" @@ -38,7 +37,6 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, static void digestControlFile(ControlFileData *ControlFile, char *source, size_t size); -static void syncTargetDirectory(void); static void getRestoreCommand(const char *argv0); static void sanityChecks(void); static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex); @@ -455,7 +453,7 @@ main(int argc, char **argv) if (showprogress) pg_log_info("syncing target data directory"); - syncTargetDirectory(); + sync_target_dir(); if (writerecoveryconf && !dry_run) WriteRecoveryConfig(conn, datadir_target, @@ -803,24 +801,6 @@ digestControlFile(ControlFileData *ControlFile, char *src, size_t size) checkControlFile(ControlFile); } -/* - * Sync target data directory to ensure that modifications are safely on disk. - * - * We do this once, for the whole data directory, for performance reasons. At - * the end of pg_rewind's run, the kernel is likely to already have flushed - * most dirty buffers to disk. Additionally fsync_pgdata uses a two-pass - * approach (only initiating writeback in the first pass), which often reduces - * the overall amount of IO noticeably. - */ -static void -syncTargetDirectory(void) -{ - if (!do_sync || dry_run) - return; - - fsync_pgdata(datadir_target, PG_VERSION_NUM); -} - /* * Get value of GUC parameter restore_command from the target cluster. * diff --git