Re: Refactor pg_rewind code and make it work against a standby

2020-11-20 Thread Heikki Linnakangas

On 20/11/2020 19:14, Andres Freund wrote:

Hi,

On 2020-11-20 16:19:03 +0200, Heikki Linnakangas wrote:

Pushed a fix similar to your patch, but I put the wait_for_catchup() before
running pg_rewind. The point of inserting the 'in A, after C was promoted'
row is that it's present in B when pg_rewind runs.


Hm - don't we possibly need *both*? Since post pg_rewind recovery starts
at the previous checkpoint, it's quite possible for C to get ready to
answer queries before that record has been replayed?


No, C will not reach consistent state until all the WAL in the source 
system has been replayed. pg_rewind will set minRecoveryPoint to the 
minRecoveryPoint of the source system, after copying all the files. (Or 
its insert point, if it's not a standby server, but in this case it is). 
Same as when taking an online backup.


- Heikki




Re: Refactor pg_rewind code and make it work against a standby

2020-11-20 Thread Andres Freund
Hi,

On 2020-11-20 16:19:03 +0200, Heikki Linnakangas wrote:
> Pushed a fix similar to your patch, but I put the wait_for_catchup() before
> running pg_rewind. The point of inserting the 'in A, after C was promoted'
> row is that it's present in B when pg_rewind runs.

Hm - don't we possibly need *both*? Since post pg_rewind recovery starts
at the previous checkpoint, it's quite possible for C to get ready to
answer queries before that record has been replayed?

Thanks,

Andres Freund




Re: Refactor pg_rewind code and make it work against a standby

2020-11-20 Thread Heikki Linnakangas

On 20/11/2020 02:38, Andres Freund wrote:

I locally, on a heavily modified branch (AIO support), started to get
consistent failures in this test. I *suspect*, but am not sure, that
it's the test's fault, not the fault of modifications.

As far as I can tell, after the pg_rewind call, there's no guarantee
that node_c has fully caught up to the 'in A, after C was promoted'
insertion on node a. Thus at the check_query() I sometimes get just 'in
A, before promotion' back.

After adding a wait that problem seems to be fixed. Here's what I did

diff --git i/src/bin/pg_rewind/t/007_standby_source.pl 
w/src/bin/pg_rewind/t/007_standby_source.pl
index f6abcc2d987..48898bef2f5 100644
--- i/src/bin/pg_rewind/t/007_standby_source.pl
+++ w/src/bin/pg_rewind/t/007_standby_source.pl
@@ -88,6 +88,7 @@ $node_c->safe_psql('postgres', "checkpoint");
  # - you need to rewind.
  $node_a->safe_psql('postgres',
  "INSERT INTO tbl1 VALUES ('in A, after C was promoted')");
+$lsn = $node_a->lsn('insert');
  
  # Also insert a new row in the standby, which won't be present in the

  # old primary.
@@ -142,6 +143,8 @@ $node_primary = $node_c;
  # Run some checks to verify that C has been successfully rewound,
  # and connected back to follow B.
  
+$node_b->wait_for_catchup('node_c', 'replay', $lsn);

+
  check_query(
  'SELECT * FROM tbl1',
  qq(in A


Yes, I was able to reproduced that by inserting a strategic sleep in the 
test and pausing replication by attaching gdb to the walsender process.


Pushed a fix similar to your patch, but I put the wait_for_catchup() 
before running pg_rewind. The point of inserting the 'in A, after C was 
promoted' row is that it's present in B when pg_rewind runs.


Thanks!

- Heikki




Re: Refactor pg_rewind code and make it work against a standby

2020-11-19 Thread Andres Freund
Hi,

On 2020-11-15 17:10:53 +0200, Heikki Linnakangas wrote:
> Yep, quite right. Fixed that way, thanks for the debugging!

I locally, on a heavily modified branch (AIO support), started to get
consistent failures in this test. I *suspect*, but am not sure, that
it's the test's fault, not the fault of modifications.

As far as I can tell, after the pg_rewind call, there's no guarantee
that node_c has fully caught up to the 'in A, after C was promoted'
insertion on node a. Thus at the check_query() I sometimes get just 'in
A, before promotion' back.

After adding a wait that problem seems to be fixed. Here's what I did

diff --git i/src/bin/pg_rewind/t/007_standby_source.pl 
w/src/bin/pg_rewind/t/007_standby_source.pl
index f6abcc2d987..48898bef2f5 100644
--- i/src/bin/pg_rewind/t/007_standby_source.pl
+++ w/src/bin/pg_rewind/t/007_standby_source.pl
@@ -88,6 +88,7 @@ $node_c->safe_psql('postgres', "checkpoint");
 # - you need to rewind.
 $node_a->safe_psql('postgres',
 "INSERT INTO tbl1 VALUES ('in A, after C was promoted')");
+$lsn = $node_a->lsn('insert');
 
 # Also insert a new row in the standby, which won't be present in the
 # old primary.
@@ -142,6 +143,8 @@ $node_primary = $node_c;
 # Run some checks to verify that C has been successfully rewound,
 # and connected back to follow B.
 
+$node_b->wait_for_catchup('node_c', 'replay', $lsn);
+
 check_query(
 'SELECT * FROM tbl1',
 qq(in A


- Andres




Re: Refactor pg_rewind code and make it work against a standby

2020-11-15 Thread Heikki Linnakangas

On 15/11/2020 09:07, Tom Lane wrote:

I wrote:

Not sure if you noticed, but piculet has twice failed the
007_standby_source.pl test that was added by 9c4f5192f:
...
Now, I'm not sure what to make of that, but I can't help noticing that
piculet uses --disable-atomics while francolin uses --disable-spinlocks.
That leads the mind towards some kind of low-level synchronization
bug ...


Or, maybe it's less mysterious than that.  The failure looks like we
have not waited long enough for the just-inserted row to get replicated
to node C.  That wait is implemented as

$lsn = $node_a->lsn('insert');
$node_b->wait_for_catchup('node_c', 'write', $lsn);

which looks fishy ... shouldn't wait_for_catchup be told to
wait for replay of that LSN, not just write-the-WAL?


Yep, quite right. Fixed that way, thanks for the debugging!

- Heikki




Re: Refactor pg_rewind code and make it work against a standby

2020-11-14 Thread Tom Lane
I wrote:
> Not sure if you noticed, but piculet has twice failed the
> 007_standby_source.pl test that was added by 9c4f5192f:
> ...
> Now, I'm not sure what to make of that, but I can't help noticing that
> piculet uses --disable-atomics while francolin uses --disable-spinlocks.
> That leads the mind towards some kind of low-level synchronization
> bug ...

Or, maybe it's less mysterious than that.  The failure looks like we
have not waited long enough for the just-inserted row to get replicated
to node C.  That wait is implemented as

$lsn = $node_a->lsn('insert');
$node_b->wait_for_catchup('node_c', 'write', $lsn);

which looks fishy ... shouldn't wait_for_catchup be told to
wait for replay of that LSN, not just write-the-WAL?

regards, tom lane




Re: Refactor pg_rewind code and make it work against a standby

2020-11-14 Thread Tom Lane
Not sure if you noticed, but piculet has twice failed the
007_standby_source.pl test that was added by 9c4f5192f:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=2020-11-15%2006%3A00%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet=2020-11-13%2011%3A20%3A10

and francolin failed once:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin=2020-11-12%2018%3A57%3A33

These failures look the same:

#   Failed test 'table content after rewind and insert: query result matches'
#   at t/007_standby_source.pl line 160.
#  got: 'in A
# in A, before promotion
# in A, after C was promoted
# '
# expected: 'in A
# in A, before promotion
# in A, after C was promoted
# in A, after rewind
# '
# Looks like you failed 1 test of 3.
[11:27:01] t/007_standby_source.pl ... 
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests 

Now, I'm not sure what to make of that, but I can't help noticing that
piculet uses --disable-atomics while francolin uses --disable-spinlocks.
That leads the mind towards some kind of low-level synchronization
bug ...

regards, tom lane




Re: Refactor pg_rewind code and make it work against a standby

2020-11-12 Thread Heikki Linnakangas

On 04/11/2020 11:23, Heikki Linnakangas wrote:

I read through the patches one more time, fixed a bunch of typos and
such, and pushed patches 1-4. I'm going to spend some more time on
testing the last patch. It allows using a standby server as the source,
and we don't have any tests for that yet. Thanks for the review!


Did some more testing, fixed one bug, and pushed.

To test this, I set up a cluster with one primary, a standby, and a 
cascaded standby. I launched a test workload against the primary that 
creates tables, inserts rows, and drops tables continuously. In another 
shell, I promoted the cascaded standby, run some updates on the promoted 
server, and finally, run pg_rewind pointed at the standby, and start it 
again as a cascaded standby. Repeat.


Attached are the scripts I used. I edited them between test runs to test 
slightly different scenarios. I don't expect them to be very useful to 
anyone else, but the Internet is my backup.


I did find one bug in the patch with that, so the time was well spent: 
the code in process_queued_fetch_requests() got confused and errored 
out, if a file was removed in the source system while pg_rewind was 
running. There was code to deal with that, but it was broken. Fixed that.


- Heikki


rewind-cascading-test.tar.gz
Description: application/gzip


Re: Refactor pg_rewind code and make it work against a standby

2020-11-04 Thread Heikki Linnakangas

On 25/09/2020 02:56, Soumyadeep Chakraborty wrote:

On Thu, Sep 24, 2020 at 10:27 AM Heikki Linnakangas  wrote:

7. Please address the FIXME for the symlink case:
/* FIXME: Check if it points to the same target? */


It's not a new issue. Would be nice to fix, of course. I'm not sure what
the right thing to do would be. If you have e.g. replaced
postgresql.conf with a symlink that points outside the data directory,
would it be appropriate to overwrite it? Or perhaps we should throw an
error? We also throw an error if a file is a symlink in the source but a
regular file in the target, or vice versa.


Hmm, I can imagine a use case for 2 different symlink targets on the
source and target clusters. For example the primary's pg_wal directory
can have a different symlink target as compared to a standby's
(different mount points on the same network maybe?). An end user might
not desire pg_rewind meddling with that setup or may desire pg_rewind to
treat the source as a source-of-truth with respect to this as well and
would want pg_rewind to overwrite the target's symlink. Maybe doing a
check and emitting a warning with hint/detail is prudent here while
taking no action.


We have special handling for 'pg_wal' to pretend that it's a regular 
directory (see process_source_file()), so that's taken care of. But if 
you did a something similar with some other subdirectory, that would be 
a problem.



14. queue_overwrite_range(), finish_overwrite() instead of
queue_fetch_range(), finish_fetch()? Similarly update\
*_fetch_file_range() and *_finish_fetch()


15. Let's have local_source.c and libpq_source.c instead of *_fetch.c


Good idea! And fetch.h -> rewind_source.h.


+1. You might have missed the changes to rename "fetch" -> "overwrite"
as was mentioned in 14.


I preferred the "fetch" nomenclature in those function names. They fetch 
and overwrite the file ranges, so 'fetch' still seems appropriate. 
"fetch" -> "overwrite" would make sense if you wanted to emphasize the 
"overwrite" part more. Or we could rename it to "fetch_and_overwrite". 
But overall I think "fetch" is fine.



16.


conn = PQconnectdb(connstr_source);

if (PQstatus(conn) == CONNECTION_BAD)
pg_fatal("could not connect to server: %s",
PQerrorMessage(conn));

if (showprogress)
pg_log_info("connected to server");



The above hunk should be part of init_libpq_source(). Consequently,
init_libpq_source() should take a connection string instead of a conn.


The libpq connection is also needed by WriteRecoveryConfig(), that's why
it's not fully encapsulated in libpq_source.


Ah. I find it pretty weird why we need to specify --source-server to
have write-recovery-conf work. From the code, we only need the conn
for calling PQserverVersion(), something we can easily get by slurping
pg_controldata on the source side? Maybe we can remove this limitation?


Yeah, perhaps. In another patch :-).

I read through the patches one more time, fixed a bunch of typos and 
such, and pushed patches 1-4. I'm going to spend some more time on 
testing the last patch. It allows using a standby server as the source, 
and we don't have any tests for that yet. Thanks for the review!


- Heikki




Re: Refactor pg_rewind code and make it work against a standby

2020-09-24 Thread Soumyadeep Chakraborty
On Thu, Sep 24, 2020 at 10:27 AM Heikki Linnakangas  wrote:
> >> /*
> >>   * If this is a relation file, copy the modified blocks.
> >>   *
> >>   * This is in addition to any other changes.
> >>   */
> >> iter = datapagemap_iterate(>target_modified_pages);
> >> while (datapagemap_next(iter, ))
> >> {
> >> offset = blkno * BLCKSZ;
> >>
> >> source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
> >> }
> >> pg_free(iter);
> >
> > Can we put this hunk into a static function overwrite_pages()?
>
> Meh, it's only about 10 lines of code, and one caller.

Fair.

>
> > 7. Please address the FIXME for the symlink case:
> > /* FIXME: Check if it points to the same target? */
>
> It's not a new issue. Would be nice to fix, of course. I'm not sure what
> the right thing to do would be. If you have e.g. replaced
> postgresql.conf with a symlink that points outside the data directory,
> would it be appropriate to overwrite it? Or perhaps we should throw an
> error? We also throw an error if a file is a symlink in the source but a
> regular file in the target, or vice versa.
>

Hmm, I can imagine a use case for 2 different symlink targets on the
source and target clusters. For example the primary's pg_wal directory
can have a different symlink target as compared to a standby's
(different mount points on the same network maybe?). An end user might
not desire pg_rewind meddling with that setup or may desire pg_rewind to
treat the source as a source-of-truth with respect to this as well and
would want pg_rewind to overwrite the target's symlink. Maybe doing a
check and emitting a warning with hint/detail is prudent here while
taking no action.


> > 8.
> >
> > * it anyway. But there's no harm in copying it now.)
> >
> > and
> >
> > * copy them here. But we don't know which scenario we're
> > * dealing with, and there's no harm in copying the missing
> > * blocks now, so do it now.
> >
> > Could you add a line or two explaining why there is "no harm" in these
> > two hunks above?
>
> The previous sentences explain that there's a WAL record covering them.
> So they will be overwritten by WAL replay anyway. Does it need more
> explanation?

Yeah you are right, that is reason enough.

> > 14. queue_overwrite_range(), finish_overwrite() instead of
> > queue_fetch_range(), finish_fetch()? Similarly update\
> > *_fetch_file_range() and *_finish_fetch()
> >
> >
> > 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c
>
> Good idea! And fetch.h -> rewind_source.h.

+1. You might have missed the changes to rename "fetch" -> "overwrite"
as was mentioned in 14.

>
> I also moved the code in execute_file_actions() function to pg_rewind.c,
> into a new function: perform_rewind(). It felt a bit silly to have just
> execute_file_actions() in a file of its own. perform_rewind() now does
> all the modifications to the data directory, writing the backup file.
> Except for writing the recovery config: that also needs to be when
> there's no rewind to do, so it makes sense to keep it separate. What do
> you think?

I don't mind inlining that functionality into perform_rewind(). +1.
Heads up: The function declaration for execute_file_actions() is still
there in rewind_source.h.

> > 16.
> >
> >> conn = PQconnectdb(connstr_source);
> >>
> >> if (PQstatus(conn) == CONNECTION_BAD)
> >> pg_fatal("could not connect to server: %s",
> >> PQerrorMessage(conn));
> >>
> >> if (showprogress)
> >> pg_log_info("connected to server");
> >
> >
> > The above hunk should be part of init_libpq_source(). Consequently,
> > init_libpq_source() should take a connection string instead of a conn.
>
> The libpq connection is also needed by WriteRecoveryConfig(), that's why
> it's not fully encapsulated in libpq_source.

Ah. I find it pretty weird why we need to specify --source-server to
have write-recovery-conf work. From the code, we only need the conn
for calling PQserverVersion(), something we can easily get by slurping
pg_controldata on the source side? Maybe we can remove this limitation?

Regards,
Soumyadeep (VMware)




Re: Refactor pg_rewind code and make it work against a standby

2020-09-24 Thread Heikki Linnakangas

On 20/09/2020 23:44, Soumyadeep Chakraborty wrote:

Before getting into the code review for the patch, I wanted to know why
we don't use a Bitmapset for target_modified_pages?


Bitmapset is not available in client code. Perhaps it could be moved to 
src/common with some changes, but doesn't seem worth it until there's 
more client code that would need it.


I'm not sure that a bitmap is the best data structure for tracking the 
changed blocks in the first place. A hash table might be better if there 
are only a few changed blocks, or something like 
src/backend/lib/integerset.c if there are many. But as long as the 
simple bitmap gets the job done, let's keep it simple.



2. Rename target_modified_pages to target_pages_to_overwrite?
target_modified_pages can lead to confusion as to whether it includes pages
that were modified on the target but not even present in the source and
the other exclusionary cases. target_pages_to_overwrite is much clearer.


Agreed, I'll rename it.

Conceptually, while we're scanning source WAL, we're just making note of 
the modified blocks. The decision on what to do with them happens only 
later, in decide_file_action(). The difference is purely theoretical, 
though. There is no real decision to be made, all the modified blocks 
will be overwritten. So on the whole, I agree 'target_page_to_overwrite' 
is better.



/*
  * If this is a relation file, copy the modified blocks.
  *
  * This is in addition to any other changes.
  */
iter = datapagemap_iterate(>target_modified_pages);
while (datapagemap_next(iter, ))
{
offset = blkno * BLCKSZ;

source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
}
pg_free(iter);


Can we put this hunk into a static function overwrite_pages()?


Meh, it's only about 10 lines of code, and one caller.


4.


* block that have changed in the target system.  It makes note of all the
* changed blocks in the pagemap of the file.


Can we replace the above with:


* block that has changed in the target system.  It decides if the given

blkno in the target relfile needs to be overwritten from the source.


Ok. Again conceptually though, process_target_wal_block_change() just 
collects information, and the decisions are made later. But you're right 
that we do leave out truncated-away blocks here, so we are doing more 
than just noting all the changed blocks.



/*
  * Doesn't exist in either server. Why does it have an entry in the
  * first place??
  */
return FILE_ACTION_NONE;


Can we delete the above hunk and add the following assert to the very
top of decide_file_action():

Assert(entry->target_exists || entry->source_exists);


I would like to keep the check even when assertions are not enabled. 
I'll add an Assert(false) there.



7. Please address the FIXME for the symlink case:
/* FIXME: Check if it points to the same target? */


It's not a new issue. Would be nice to fix, of course. I'm not sure what 
the right thing to do would be. If you have e.g. replaced 
postgresql.conf with a symlink that points outside the data directory, 
would it be appropriate to overwrite it? Or perhaps we should throw an 
error? We also throw an error if a file is a symlink in the source but a 
regular file in the target, or vice versa.



8.

* it anyway. But there's no harm in copying it now.)

and

* copy them here. But we don't know which scenario we're
* dealing with, and there's no harm in copying the missing
* blocks now, so do it now.

Could you add a line or two explaining why there is "no harm" in these
two hunks above?


The previous sentences explain that there's a WAL record covering them. 
So they will be overwritten by WAL replay anyway. Does it need more 
explanation?



14. queue_overwrite_range(), finish_overwrite() instead of
queue_fetch_range(), finish_fetch()? Similarly update\
*_fetch_file_range() and *_finish_fetch()


15. Let's have local_source.c and libpq_source.c instead of *_fetch.c


Good idea! And fetch.h -> rewind_source.h.

I also moved the code in execute_file_actions() function to pg_rewind.c, 
into a new function: perform_rewind(). It felt a bit silly to have just 
execute_file_actions() in a file of its own. perform_rewind() now does 
all the modifications to the data directory, writing the backup file. 
Except for writing the recovery config: that also needs to be when 
there's no rewind to do, so it makes sense to keep it separate. What do 
you think?



16.


conn = PQconnectdb(connstr_source);

if (PQstatus(conn) == CONNECTION_BAD)
pg_fatal("could not connect to server: %s",
PQerrorMessage(conn));

if (showprogress)
pg_log_info("connected to server");



The above hunk should be part of init_libpq_source(). Consequently,
init_libpq_source() should take a connection string instead of a conn.


The libpq connection is also needed by WriteRecoveryConfig(), that's why 
it's not fully encapsulated in libpq_source.



19.


typedef struct
{
const char *path; /* path relative to data directory root */
uint64 offset;

Re: Refactor pg_rewind code and make it work against a standby

2020-09-24 Thread Heikki Linnakangas
Thanks for the review! I'll post a new version shortly, with your 
comments incorporated. More detailed response to a few of them below:


On 18/09/2020 10:41, Kyotaro Horiguchi wrote:

I don't think filemap_finalize needs to iterate over filemap twice.


True, but I thought it's more clear that way, doing one thing at a time.


hash_string_pointer is a copy of that of pg_verifybackup.c. Is it
worth having in hashfn.h or .c?


I think it's fine for now. Maybe in the future if more copies crop up.


--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
...
+   filemap_t  *filemap;
..
+   filemap_init();
...
+   filemap = filemap_finalize();


I'm a bit confused by this, and realized that what filemap_init
initializes is *not* the filemap, but the filehash. So for example,
the name of the functions might should be something like this?

filehash_init()
filemap = filehash_finalyze()/create_filemap()


My thinking was that filemap_* is the prefix for the operations in 
filemap.c, hence filemap_init(). I can see the confusion, though, and I 
think you're right that renaming would be good. I renamed them to 
filehash_init(), and decide_file_actions(). I think those names make the 
calling code in pg_rewind.c quite clear.



/*
  * Also check that full_page_writes is enabled.  We can get torn pages if
  * a page is modified while we read it with pg_read_binary_file(), and we
  * rely on full page images to fix them.
  */
str = run_simple_query(conn, "SHOW full_page_writes");
if (strcmp(str, "on") != 0)
pg_fatal("full_page_writes must be enabled in the source server");
pg_free(str);


This is a part not changed by this patch set. But If we allow to
connect to a standby, this check can be tricked by setting off on the
primary and "on" on the standby (FWIW, though). Some protection
measure might be necessary.


Good point, the value in the primary is what matters. In fact, even when 
connected to the primary, the value might change while pg_rewind is 
running. I'm not sure how we could tighten that up.



+   if (chunksize > rq->length)
+   {
+   pg_fatal("received more than requested for file \"%s\"",
+rq->path);
+   /* receiving less is OK, though */

Don't we need to truncate the target file, though?


If a file is truncated in the source while pg_rewind is running, there 
should be a WAL record about the truncation that gets replayed when you 
start the server after pg_rewind has finished. We could truncate the 
file if we wanted to, but it's not necessary. I'll add a comment.


- Heikki




Re: Refactor pg_rewind code and make it work against a standby

2020-09-20 Thread Soumyadeep Chakraborty
Hey Heikki,

Thanks for refactoring and making the code much easier to read!

Before getting into the code review for the patch, I wanted to know why
we don't use a Bitmapset for target_modified_pages?

Code review:


1. We need to update the comments for process_source_file and
process_target_file. We don't decide the action on the file until later.


2. Rename target_modified_pages to target_pages_to_overwrite?
target_modified_pages can lead to confusion as to whether it includes pages
that were modified on the target but not even present in the source and
the other exclusionary cases. target_pages_to_overwrite is much clearer.


3.

> /*
>  * If this is a relation file, copy the modified blocks.
>  *
>  * This is in addition to any other changes.
>  */
> iter = datapagemap_iterate(>target_modified_pages);
> while (datapagemap_next(iter, ))
> {
> offset = blkno * BLCKSZ;
>
> source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
> }
> pg_free(iter);

Can we put this hunk into a static function overwrite_pages()?


4.

> * block that have changed in the target system.  It makes note of all the
> * changed blocks in the pagemap of the file.

Can we replace the above with:

> * block that has changed in the target system.  It decides if the given
blkno in the target relfile needs to be overwritten from the source.


5.

> /*
>  * Doesn't exist in either server. Why does it have an entry in the
>  * first place??
>  */
> return FILE_ACTION_NONE;

Can we delete the above hunk and add the following assert to the very
top of decide_file_action():

Assert(entry->target_exists || entry->source_exists);


6.

> pg_fatal("unexpected page modification for directory or symbolic link \"%s\"",
> entry->path);

Can we replace above with:

pg_fatal("unexpected page modification for non-regular file \"%s\"",
entry->path);

This way we can address the undefined file type.


7. Please address the FIXME for the symlink case:
/* FIXME: Check if it points to the same target? */


8.

* it anyway. But there's no harm in copying it now.)

and

* copy them here. But we don't know which scenario we're
* dealing with, and there's no harm in copying the missing
* blocks now, so do it now.

Could you add a line or two explaining why there is "no harm" in these
two hunks above?


9. Can we add pg_control, /pgsql_tmp/... and .../pgsql_tmp.* and PG_VERSION
files to check_file_excluded()?


10.

- * block that have changed in the target system. It makes note of all the
+ * block that have changed in the target system.  It makes note of all the

Whitespace typo


11.

> * If the block is beyond the EOF in the source system, or the file doesn't
> * doesn'exist

Typo: Two doesnt's


12.

> /*
>  * This represents the final decisions on what to do with each file.
>  * 'actions' array contains an entry for each file, sorted in the order
>  * that their actions should executed.
>  */
> typedef struct filemap_t
> {
> /* Summary information, filled by calculate_totals() */
> uint64 total_size; /* total size of the source cluster */
> uint64 fetch_size; /* number of bytes that needs to be copied */
>
> int nactions; /* size of 'actions' array */
> file_entry_t *actions[FLEXIBLE_ARRAY_MEMBER];
> } filemap_t;

Replace nactions/actions with nentries/entries..clearer in intent as
it is easier to reconcile the modified pages stuff to an entry rather
than an action. It could be:

/*
 * This contains the final decisions on what to do with each file.
 * 'entries' array contains an entry for each file, sorted in the order
 * that their actions should executed.
 */
typedef struct filemap_t
{
/* Summary information, filled by calculate_totals() */
uint64 total_size; /* total size of the source cluster */
uint64 fetch_size; /* number of bytes that needs to be copied */
int nentries; /* size of 'entries' array */
file_entry_t *entries[FLEXIBLE_ARRAY_MEMBER];
} filemap_t;


13.

> filehash = filehash_create(1000, NULL);

Use a constant for the 1000 in (FILEMAP_INITIAL_SIZE):


14. queue_overwrite_range(), finish_overwrite() instead of
queue_fetch_range(), finish_fetch()? Similarly update\
*_fetch_file_range() and *_finish_fetch()


15. Let's have local_source.c and libpq_source.c instead of *_fetch.c


16.

> conn = PQconnectdb(connstr_source);
>
> if (PQstatus(conn) == CONNECTION_BAD)
> pg_fatal("could not connect to server: %s",
> PQerrorMessage(conn));
>
> if (showprogress)
> pg_log_info("connected to server");


The above hunk should be part of init_libpq_source(). Consequently,
init_libpq_source() should take a connection string instead of a conn.


17.

> if (conn)
> {
> PQfinish(conn);
> conn = NULL;
> }

The hunk above should be part of libpq_destroy()


18.

> /*
>  * Files are fetched max CHUNK_SIZE bytes at a time, and with a
>  * maximum of MAX_CHUNKS_PER_QUERY chunks in a single query.
>  */
> #define CHUNK_SIZE (1024 * 1024)

Can we rename CHUNK_SIZE to MAX_CHUNK_SIZE and update the comment?


19.

> typedef struct
> {
> const char 

Re: Refactor pg_rewind code and make it work against a standby

2020-09-18 Thread Kyotaro Horiguchi
Hello.

It needed rebasing. (Attached)

At Tue, 25 Aug 2020 16:32:02 +0300, Heikki Linnakangas  wrote 
in 
> On 20/08/2020 11:32, Kyotaro Horiguchi wrote:
> > 0002: Rewording that old->target and new->source makes the meaning far
> 
> Good idea! I changed the patch that way.

Looks Good.

> > 0003: Thomas is propsing sort template. It could be used if committed.


+* If the block is beyond the EOF in the source system, or the file 
doesn't
+* doesn'exist in the source at all, we're going to truncate/remove it 
away

"the file doesn't doesn'exist"

I don't think filemap_finalize needs to iterate over filemap twice.

hash_string_pointer is a copy of that of pg_verifybackup.c. Is it
worth having in hashfn.h or .c?

> --- a/src/bin/pg_rewind/pg_rewind.c
> +++ b/src/bin/pg_rewind/pg_rewind.c
> ...
> + filemap_t  *filemap;
> ..
> + filemap_init();
> ...
> + filemap = filemap_finalize();

I'm a bit confused by this, and realized that what filemap_init
initializes is *not* the filemap, but the filehash. So for example,
the name of the functions might should be something like this?

filehash_init()
filemap = filehash_finalyze()/create_filemap()


> > 0004:
> >   The names of many of the functions gets an additional word "local"
> >   but I don't get the meaning clearly. but its about linguistic sense
> >   and I'm not fit to that..
> >   -rewind_copy_file_range(const char *path, off_t begin, off_t end, bool
> >   -trunc)
> > +local_fetch_file_range(rewind_source *source, const char *path,
> > uint64 off,
> >   The function actually copying the soruce range to the target file. So
> >   "fetch" might give somewhat different meaning, but its about
> >   linguistic (omitted..).
> 
> Hmm. It is "fetching" the range from the source server, and writing it
> to the target. The term makes more sense with a libpq source. Perhaps
> this function should be called "local_copy_range" or something, but
> it'd also be nice to have "fetch" in the name because the function
> pointer it's assigned to is called "queue_fetch_range".

Thanks. Yeah libpq_fetch_file makes sense. I agree to the name.
The refactoring looks good to me.

> > I'm going to continue reviewing this later.
> 
> Thanks! Attached is a new set of patches. The only meaningful change
> is in the 2nd patch, which I modified per your suggestion. Also, I
> moved the logic to decide each file's fate into a new subroutine
> called decide_file_action().

That's looks good.

0005:

+   /*
+* We don't intend do any updates.  Put the connection in read-only mode
+* to keep us honest.
+*/
run_simple_command(conn, "SET default_transaction_read_only = off");

The comment is wrong since the time it was added by 0004 but that's
not a problem since it was to be fixed by 0005. However, we need the
variable turned on in order to be really honest:p

> /*
>  * Also check that full_page_writes is enabled.  We can get torn pages if
>  * a page is modified while we read it with pg_read_binary_file(), and we
>  * rely on full page images to fix them.
>  */
> str = run_simple_query(conn, "SHOW full_page_writes");
> if (strcmp(str, "on") != 0)
>   pg_fatal("full_page_writes must be enabled in the source server");
> pg_free(str);

This is a part not changed by this patch set. But If we allow to
connect to a standby, this check can be tricked by setting off on the
primary and "on" on the standby (FWIW, though). Some protection
measure might be necessary. (Maybe standby should be restricted to
have the same value with the primary.)


+   thislen = Min(len, CHUNK_SIZE - prev->length);
+   src->request_queue[src->num_requests - 1].length += 
thislen;

prev == >request_queue[src->num_requests - 1] here.


+   if (chunksize > rq->length)
+   {
+   pg_fatal("received more than requested for file \"%s\"",
+rq->path);
+   /* receiving less is OK, though */

Don't we need to truncate the target file, though?


+* Source is a local data directory. It should've shut down 
cleanly,
+* and we must to the latest shutdown checkpoint.

"and we must to the" => "and we must replay to the" ?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 6cfbb5a686e5b87159aed9769b1e52859bae02b4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 19 Aug 2020 15:34:35 +0300
Subject: [PATCH v3 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c

For consistency. All the other low-level functions that operate on the
target directory are in file_ops.c.
---
 src/bin/pg_rewind/file_ops.c  | 19 +++
 src/bin/pg_rewind/file_ops.h  |  1 +
 src/bin/pg_rewind/pg_rewind.c | 22 +-
 src/bin/pg_rewind/pg_rewind.h |  1 +
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c 

Re: Refactor pg_rewind code and make it work against a standby

2020-09-16 Thread Michael Paquier
On Tue, Aug 25, 2020 at 04:32:02PM +0300, Heikki Linnakangas wrote:
> On 20/08/2020 11:32, Kyotaro Horiguchi wrote:
> > 0002: Rewording that old->target and new->source makes the meaning far
> > clearer. Moving decisions core code into filemap_finalize is
> > reasonable.
> > 
> >  By the way, some of the rules are remaining in
> >  process_source/target_file. For example, pg_wal that is a symlink,
> >  or tmporary directories. and excluded files.  The number of
> >  excluded files doesn't seem so large so it doesn't seem that the
> >  exclusion give advantage so much.  They seem to me movable to
> >  filemap_finalize, and we can get rid of the callbacks by doing
> >  so. Is there any reason that the remaining rules should be in the
> >  callbacks?
> 
> Good idea! I changed the patch that way.
> 
> > 0003: Thomas is propsing sort template. It could be used if committed.
> > 
> > 0004:
> > 
> >   The names of many of the functions gets an additional word "local"
> >   but I don't get the meaning clearly. but its about linguistic sense
> >   and I'm not fit to that..
> > -rewind_copy_file_range(const char *path, off_t begin, off_t end, bool 
> > trunc)
> > +local_fetch_file_range(rewind_source *source, const char *path, uint64 off,
> > 
> >   The function actually copying the soruce range to the target file. So
> >   "fetch" might give somewhat different meaning, but its about
> >   linguistic (omitted..).
> 
> Hmm. It is "fetching" the range from the source server, and writing it to
> the target. The term makes more sense with a libpq source. Perhaps this
> function should be called "local_copy_range" or something, but it'd also be
> nice to have "fetch" in the name because the function pointer it's assigned
> to is called "queue_fetch_range".
> 
> > I'm going to continue reviewing this later.
> 
> Thanks! Attached is a new set of patches. The only meaningful change is in
> the 2nd patch, which I modified per your suggestion. Also, I moved the logic
> to decide each file's fate into a new subroutine called
> decide_file_action().

The patch set fails to apply from 0002~, so this needs a rebase.  I
have not looked at all that in details, but no objections to apply
0001 from me.  It makes sense to move the sync subroutine for the
target folder to file_ops.c.
--
Michael


signature.asc
Description: PGP signature


Re: Refactor pg_rewind code and make it work against a standby

2020-08-25 Thread Heikki Linnakangas

On 20/08/2020 11:32, Kyotaro Horiguchi wrote:

0002: Rewording that old->target and new->source makes the meaning far
clearer. Moving decisions core code into filemap_finalize is
reasonable.

 By the way, some of the rules are remaining in
 process_source/target_file. For example, pg_wal that is a symlink,
 or tmporary directories. and excluded files.  The number of
 excluded files doesn't seem so large so it doesn't seem that the
 exclusion give advantage so much.  They seem to me movable to
 filemap_finalize, and we can get rid of the callbacks by doing
 so. Is there any reason that the remaining rules should be in the
 callbacks?


Good idea! I changed the patch that way.


0003: Thomas is propsing sort template. It could be used if committed.

0004:

  The names of many of the functions gets an additional word "local"
  but I don't get the meaning clearly. but its about linguistic sense
  and I'm not fit to that..
  
-rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc)

+local_fetch_file_range(rewind_source *source, const char *path, uint64 off,

  The function actually copying the soruce range to the target file. So
  "fetch" might give somewhat different meaning, but its about
  linguistic (omitted..).


Hmm. It is "fetching" the range from the source server, and writing it 
to the target. The term makes more sense with a libpq source. Perhaps 
this function should be called "local_copy_range" or something, but it'd 
also be nice to have "fetch" in the name because the function pointer 
it's assigned to is called "queue_fetch_range".



I'm going to continue reviewing this later.


Thanks! Attached is a new set of patches. The only meaningful change is 
in the 2nd patch, which I modified per your suggestion. Also, I moved 
the logic to decide each file's fate into a new subroutine called 
decide_file_action().


- Heikki
>From 1effdaeabb843b40575ae898ec9de57ffc37f301 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 19 Aug 2020 15:34:35 +0300
Subject: [PATCH v2 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c

For consistency. All the other low-level functions that operate on the
target directory are in file_ops.c.
---
 src/bin/pg_rewind/file_ops.c  | 19 +++
 src/bin/pg_rewind/file_ops.h  |  1 +
 src/bin/pg_rewind/pg_rewind.c | 22 +-
 src/bin/pg_rewind/pg_rewind.h |  1 +
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index b3bf091c546..55439db20ba 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -19,6 +19,7 @@
 #include 
 
 #include "common/file_perm.h"
+#include "common/file_utils.h"
 #include "file_ops.h"
 #include "filemap.h"
 #include "pg_rewind.h"
@@ -266,6 +267,24 @@ remove_target_symlink(const char *path)
  dstpath);
 }
 
+/*
+ * Sync target data directory to ensure that modifications are safely on disk.
+ *
+ * We do this once, for the whole data directory, for performance reasons.  At
+ * the end of pg_rewind's run, the kernel is likely to already have flushed
+ * most dirty buffers to disk.  Additionally fsync_pgdata uses a two-pass
+ * approach (only initiating writeback in the first pass), which often reduces
+ * the overall amount of IO noticeably.
+ */
+void
+sync_target_dir(void)
+{
+	if (!do_sync || dry_run)
+		return;
+
+	fsync_pgdata(datadir_target, PG_VERSION_NUM);
+}
+
 
 /*
  * Read a file into memory. The file to be read is /.
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 025f24141c9..d8466385cf5 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -19,6 +19,7 @@ extern void remove_target_file(const char *path, bool missing_ok);
 extern void truncate_target_file(const char *path, off_t newsize);
 extern void create_target(file_entry_t *t);
 extern void remove_target(file_entry_t *t);
+extern void sync_target_dir(void);
 
 extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
 
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e445..c9b9e480c0f 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -20,7 +20,6 @@
 #include "catalog/pg_control.h"
 #include "common/controldata_utils.h"
 #include "common/file_perm.h"
-#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/string.h"
 #include "fe_utils/recovery_gen.h"
@@ -38,7 +37,6 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile, char *source,
 			  size_t size);
-static void syncTargetDirectory(void);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -455,7 +453,7 @@ main(int argc, char **argv)
 
 	if 

Re: Refactor pg_rewind code and make it work against a standby

2020-08-20 Thread Kyotaro Horiguchi
Hello.

At Wed, 19 Aug 2020 15:50:16 +0300, Heikki Linnakangas  wrote 
in 
> Hi,
> 
> I started to hack on making pg_rewind crash-safe (see [1]), but I
> quickly got side-tracked into refactoring and tidying up up the code
> in general. I ended up with this series of patches:

^^;

> The first four patches are just refactoring that make the code and the
> logic more readable. Tom Lane commented about the messy comments
> earlier (see [2]), and I hope these patches will alleviate that
> confusion. See commit messages for details.

0001: It looks fine. The new location is reasonable but adding one
extern is a bit annoying. But I don't object it.

0002: Rewording that old->target and new->source makes the meaning far
   clearer. Moving decisions core code into filemap_finalize is
   reasonable.

By the way, some of the rules are remaining in
process_source/target_file. For example, pg_wal that is a symlink,
or tmporary directories. and excluded files.  The number of
excluded files doesn't seem so large so it doesn't seem that the
exclusion give advantage so much.  They seem to me movable to
filemap_finalize, and we can get rid of the callbacks by doing
so. Is there any reason that the remaining rules should be in the
callbacks?

0003: Thomas is propsing sort template. It could be used if committed.

0004:

 The names of many of the functions gets an additional word "local"
 but I don't get the meaning clearly. but its about linguistic sense
 and I'm not fit to that..
 
-rewind_copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
+local_fetch_file_range(rewind_source *source, const char *path, uint64 off,

 The function actually copying the soruce range to the target file. So
 "fetch" might give somewhat different meaning, but its about
 linguistic (omitted..).


> The last patch refactors the logic in libpq_fetch.c, so that it no
> longer uses a temporary table in the source system. That allows using
> a hot standby server as the pg_rewind source.

That sounds nice.

> This doesn't do anything about pg_rewind's crash-safety yet, but I'll
> try work on that after these patches.
> 
> [1]
> https://www.postgresql.org/message-id/d8dcc760-8780-5084-f066-6d663801d2e2%40iki.fi
> 
> [2]
> https://www.postgresql.org/message-id/30255.1522711675%40sss.pgh.pa.us
> 
> - Heikki

I'm going to continue reviewing this later.

reagards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Refactor pg_rewind code and make it work against a standby

2020-08-19 Thread Heikki Linnakangas

Hi,

I started to hack on making pg_rewind crash-safe (see [1]), but I 
quickly got side-tracked into refactoring and tidying up up the code in 
general. I ended up with this series of patches:


The first four patches are just refactoring that make the code and the 
logic more readable. Tom Lane commented about the messy comments earlier 
(see [2]), and I hope these patches will alleviate that confusion. See 
commit messages for details.


The last patch refactors the logic in libpq_fetch.c, so that it no 
longer uses a temporary table in the source system. That allows using a 
hot standby server as the pg_rewind source.


This doesn't do anything about pg_rewind's crash-safety yet, but I'll 
try work on that after these patches.


[1] 
https://www.postgresql.org/message-id/d8dcc760-8780-5084-f066-6d663801d2e2%40iki.fi


[2] https://www.postgresql.org/message-id/30255.1522711675%40sss.pgh.pa.us

- Heikki
>From f2057f662e9c52c7392491831338a720a21a8ca3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 19 Aug 2020 15:34:35 +0300
Subject: [PATCH 1/5] pg_rewind: Move syncTargetDirectory() to file_ops.c

For consistency. All the other low-level functions that operate on the
target directory are in file_ops.c.
---
 src/bin/pg_rewind/file_ops.c  | 19 +++
 src/bin/pg_rewind/file_ops.h  |  1 +
 src/bin/pg_rewind/pg_rewind.c | 22 +-
 src/bin/pg_rewind/pg_rewind.h |  1 +
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index b3bf091c546..55439db20ba 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -19,6 +19,7 @@
 #include 
 
 #include "common/file_perm.h"
+#include "common/file_utils.h"
 #include "file_ops.h"
 #include "filemap.h"
 #include "pg_rewind.h"
@@ -266,6 +267,24 @@ remove_target_symlink(const char *path)
  dstpath);
 }
 
+/*
+ * Sync target data directory to ensure that modifications are safely on disk.
+ *
+ * We do this once, for the whole data directory, for performance reasons.  At
+ * the end of pg_rewind's run, the kernel is likely to already have flushed
+ * most dirty buffers to disk.  Additionally fsync_pgdata uses a two-pass
+ * approach (only initiating writeback in the first pass), which often reduces
+ * the overall amount of IO noticeably.
+ */
+void
+sync_target_dir(void)
+{
+	if (!do_sync || dry_run)
+		return;
+
+	fsync_pgdata(datadir_target, PG_VERSION_NUM);
+}
+
 
 /*
  * Read a file into memory. The file to be read is /.
diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h
index 025f24141c9..d8466385cf5 100644
--- a/src/bin/pg_rewind/file_ops.h
+++ b/src/bin/pg_rewind/file_ops.h
@@ -19,6 +19,7 @@ extern void remove_target_file(const char *path, bool missing_ok);
 extern void truncate_target_file(const char *path, off_t newsize);
 extern void create_target(file_entry_t *t);
 extern void remove_target(file_entry_t *t);
+extern void sync_target_dir(void);
 
 extern char *slurpFile(const char *datadir, const char *path, size_t *filesize);
 
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e445..c9b9e480c0f 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -20,7 +20,6 @@
 #include "catalog/pg_control.h"
 #include "common/controldata_utils.h"
 #include "common/file_perm.h"
-#include "common/file_utils.h"
 #include "common/restricted_token.h"
 #include "common/string.h"
 #include "fe_utils/recovery_gen.h"
@@ -38,7 +37,6 @@ static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
 
 static void digestControlFile(ControlFileData *ControlFile, char *source,
 			  size_t size);
-static void syncTargetDirectory(void);
 static void getRestoreCommand(const char *argv0);
 static void sanityChecks(void);
 static void findCommonAncestorTimeline(XLogRecPtr *recptr, int *tliIndex);
@@ -455,7 +453,7 @@ main(int argc, char **argv)
 
 	if (showprogress)
 		pg_log_info("syncing target data directory");
-	syncTargetDirectory();
+	sync_target_dir();
 
 	if (writerecoveryconf && !dry_run)
 		WriteRecoveryConfig(conn, datadir_target,
@@ -803,24 +801,6 @@ digestControlFile(ControlFileData *ControlFile, char *src, size_t size)
 	checkControlFile(ControlFile);
 }
 
-/*
- * Sync target data directory to ensure that modifications are safely on disk.
- *
- * We do this once, for the whole data directory, for performance reasons.  At
- * the end of pg_rewind's run, the kernel is likely to already have flushed
- * most dirty buffers to disk.  Additionally fsync_pgdata uses a two-pass
- * approach (only initiating writeback in the first pass), which often reduces
- * the overall amount of IO noticeably.
- */
-static void
-syncTargetDirectory(void)
-{
-	if (!do_sync || dry_run)
-		return;
-
-	fsync_pgdata(datadir_target, PG_VERSION_NUM);
-}
-
 /*
  * Get value of GUC parameter restore_command from the target cluster.
  *
diff --git