Re: new option to allow pg_rewind to run without full_page_writes

2022-11-07 Thread Andres Freund
Hi,

On 2022-11-08 00:07:09 +0100, Jérémie Grauer wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> > > Currently pg_rewind refuses to run if full_page_writes is off. This is to
> > > prevent it from running into a torn page during operation.
> > >
> > > This is usually a good call, but some file systems like ZFS are naturally
> > > immune to torn pages (maybe btrfs too, but I don't know for sure for this
> > > one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.

> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Unfortunately not relevant for pg_rewind due to the issues mentioned
subsequently.


> > Right now, that definitely allows for torn reads, because of the way
> > pg_read_binary_file() is implemented.  We only ensure a 4k read size from
> > the view of our code, which obviously can lead to torn 8k page reads, no
> > matter what the filesystem guarantees.
> >
> > Also, for reasons I don't understand we use C streaming IO for
> > pg_read_binary_file(), so you'd also need to ensure that the buffer size
> > used by the stream implementation can't cause the reads to happen in
> > smaller chunks.  Afaict we really shouldn't use file streams here, then
> > we'd at least have control over that aspect.
> >
> >
> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are needed.

> I may be missing something here: how does full_page_writes prevent
> short _reads_?

Yes.


> Presumably, if we do something like read the first 4K of a file, then change
> the file, then read the next 4K, the second 4K may be a torn read.

Correct.


> But I fail to see how full_page_writes prevents this since it only acts
> on writes.

It ensures the damage is later repaired during WAL replay. Which can only
happen if the WAL contains the necessary information to do so - the full page
writes.
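
For illustration, this is roughly how a redo routine ends up using those
images (a simplified sketch, not the code of any particular rmgr; the
function name is made up):

#include "postgres.h"
#include "access/xlogutils.h"   /* XLogReadBufferForRedo() */
#include "storage/bufmgr.h"

/*
 * Hypothetical sketch: how redo consumes a full page image when the WAL
 * record carries one.
 */
static void
example_redo(XLogReaderState *record)
{
    Buffer      buffer;

    if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_RESTORED)
    {
        /*
         * The record carried a full page image; the page was overwritten
         * wholesale, so any torn page left on disk is repaired here.
         */
    }
    else
    {
        /* otherwise the record's incremental change would be applied */
    }

    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);
}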


I suspect to avoid the need for this we'd need to atomically read all the
pages involved in a WAL record (presumably by locking the pages against
IO). That'd then safely allow skipping replay of WAL records based on the LSN.
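
A hypothetical server-side helper along those lines could look roughly
like this (just a sketch using the buffer-manager API; the function name
and the omitted SQL-callable wrapper are made up, and error handling is
left out):

#include "postgres.h"
#include "access/relation.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch only: copy one block of a relation while holding the buffer
 * content lock, so a concurrent writer cannot tear the copy.  If the
 * copy needs to be usable on its own, WAL would also have to be flushed
 * up to the page's LSN before handing it out.
 */
static void
copy_block_locked(Oid relid, BlockNumber blkno, char *dst)
{
    Relation    rel = relation_open(relid, AccessShareLock);
    Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                         RBM_NORMAL, NULL);

    LockBuffer(buf, BUFFER_LOCK_SHARE);         /* blocks concurrent modification */
    memcpy(dst, BufferGetPage(buf), BLCKSZ);    /* consistent 8k image */
    UnlockReleaseBuffer(buf);

    relation_close(rel, AccessShareLock);
}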

A slightly easier thing would be to force-enable full page writes just for the
duration of a rewind, similar to what we do during base backups. But that'd
still require a bunch more work than done here.


> > This isn't a fundamental issue - we could have a version of
> > pg_read_binary_file() for relation data that prevents the page being written
> > out concurrently by locking the buffer page. In addition it could often
> > avoid needing to read the page from the OS / disk, if present in shared
> > buffers (perhaps minus cases where we haven't flushed the WAL yet, but we
> > could also flush the WAL in those).

> I agree, but this would need a different patch, which may be beyond my
> skills.
> > Greetings,
> >
> > Andres Freund
> Anyway, ZFS will act like full_page_writes is always active, so isn't the
> proposed modification to pg_rewind valid?

No. This really isn't about the crash safety aspects of full page writes, so
the fact that ZFS is used is just not really relevant.

Regards,

Andres




Re: new option to allow pg_rewind to run without full_page_writes

2022-11-07 Thread Thomas Munro
On Tue, Nov 8, 2022 at 12:07 PM Jérémie Grauer wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> >> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> >> prevent it from running into a torn page during operation.
> >>
> >> This is usually a good call, but some file systems like ZFS are naturally
> >> immune to torn pages (maybe btrfs too, but I don't know for sure for this
> >> one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.

> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Just to spell out the distinction Andres was making, and maybe try to
answer a couple of questions if I can, there are two completely
different phenomena here:

1.  Generally full_page_writes is for handling a lack of atomic writes
on power loss, but ZFS already does that itself by virtue of its COW
design and data-logging in certain cases.

2.  Here we are using full_page_writes to handle lack of atomicity
when there are concurrent reads and writes to the same file from
different threads.  Basically, by turning on full_page_writes we say
that we don't trust any block that might have been written to during
the copying.  Again, ZFS already handles that for itself: it uses
range locking in the read and write paths (see zfs_rangelock_enter()
in zfs_write() etc), BUT that's only going to work if the actual
pread()/pwrite() system calls that reach ZFS are aligned with
PostgreSQL's pages.

Every now and then a discussion breaks out about WTF POSIX actually
requires WRT concurrent read/write, but it's trivial to show  that the
most popular Linux filesystem exposes randomly mashed-up data from old
and new versions of even small writes if you read while a write is
concurrently in progress[1], while many others don't.  That's what the
2nd thing is protecting against.  I think it must be possible to show
that breaking on ZFS too, *if* the file regions arriving into system
calls are NOT correctly aligned.  As Andres points out, 
buffered IO streams create a risk there: we have no idea what system
calls are reaching ZFS, so it doesn't seem safe to turn off full page
writes unless you also fix that.
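
To make that concrete, here is an illustrative sketch of such a test (not
the program from [1]): one thread keeps overwriting the same 8k block with
all-'A' or all-'B' pages, another reads the block in single 8k pread()
calls and checks whether it ever sees a mix of the two.  Whether it
reports a torn read depends entirely on the filesystem; build with
"cc -pthread".

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

static int fd;

/* keep rewriting the same block with alternating byte patterns */
static void *
writer(void *arg)
{
    char        page[BLCKSZ];

    for (long i = 0;; i++)
    {
        memset(page, (i & 1) ? 'A' : 'B', BLCKSZ);
        if (pwrite(fd, page, BLCKSZ, 0) != BLCKSZ)
            exit(1);
    }
    return NULL;
}

int
main(void)
{
    pthread_t   tid;
    char        page[BLCKSZ];

    fd = open("tornread.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || pwrite(fd, memset(page, 'A', BLCKSZ), BLCKSZ, 0) != BLCKSZ)
        return 1;
    pthread_create(&tid, NULL, writer, NULL);

    for (long i = 0; i < 10000000; i++)
    {
        if (pread(fd, page, BLCKSZ, 0) != BLCKSZ)
            return 1;
        if (page[0] != page[BLCKSZ - 1])
        {
            printf("torn read after %ld reads: '%c' ... '%c'\n",
                   i, page[0], page[BLCKSZ - 1]);
            return 0;
        }
    }
    printf("no torn read observed\n");
    return 0;
}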

> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are needed.
> I may be missing something here: how does full_page_writes prevent
> short _reads_?

I don't know, but I think the paranoid approach would be that if you
get a short read, you go back and pread() at least that whole page, so
all your system calls are fully aligned.  Then I think you'd be safe?
Because zfs_read() does:

/*
 * Lock the range against changes.
 */
zfs_locked_range_t *lr = zfs_rangelock_enter(&zp->z_rangelock,
    zfs_uio_offset(uio), zfs_uio_resid(uio), RL_READER);

So it should be possible to make a safe version of this patch, by
teaching the file-reading code to require BLCKSZ integrity for all
reads.
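
If one went that route, the read side could be as simple as something
like this (a sketch, assuming the target filesystem makes a single
page-aligned, page-sized pread() atomic, as ZFS's range lock appears to):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Sketch: read exactly one block at a block-aligned offset, restarting
 * from the beginning of the block on any short read, so that no page is
 * ever assembled from two separate system calls.
 */
int
read_full_block(int fd, off_t blkoff, char *buf)
{
    for (;;)
    {
        ssize_t     nread = pread(fd, buf, BLCKSZ, blkoff);

        if (nread == BLCKSZ)
            return 0;               /* whole page read in one call */
        if (nread < 0 && errno == EINTR)
            continue;               /* interrupted: retry the whole page */
        if (nread > 0)
            continue;               /* short read: retry the whole page */
        return -1;                  /* EOF or hard error */
    }
}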

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2B19bZKidSiWmMsDmgUVe%3D_rr0m57LfR%2BnAbWprVDd_cw%40mail.gmail.com




Re: new option to allow pg_rewind to run without full_page_writes

2022-11-07 Thread Jérémie Grauer

Hello,

First, thank you for reviewing.

ZFS writes files in increments of its configured recordsize for the 
current filesystem dataset.


So with a recordsize configured to be a multiple of 8K, you can't get 
torn pages on writes; that is why full_page_writes can be safely 
deactivated on ZFS (the usual advice is to configure ZFS with a 
recordsize of 8K for postgres, but on some workloads it can actually be 
beneficial to go to a higher multiple of 8K).
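
As a purely illustrative sanity check (this relies on an assumption, not
on anything ZFS documents as a contract): st_blksize of a file that has
grown past one record is expected to reflect the dataset's recordsize, so
a tool could at least warn when it isn't a multiple of the page size.

#include <stdio.h>
#include <sys/stat.h>

#define BLCKSZ 8192             /* PostgreSQL page size */

int
main(int argc, char **argv)
{
    struct stat st;

    if (argc != 2 || stat(argv[1], &st) != 0)
        return 1;

    /* st_blksize is assumed (not guaranteed) to equal the ZFS recordsize */
    if (st.st_blksize % BLCKSZ != 0)
        fprintf(stderr, "warning: I/O block size %ld is not a multiple of %d\n",
                (long) st.st_blksize, BLCKSZ);
    else
        printf("I/O block size %ld is a multiple of the page size\n",
               (long) st.st_blksize);
    return 0;
}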


On 06/11/2022 03:38, Andres Freund wrote:

> Hi,
>
> On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
>
>> Currently pg_rewind refuses to run if full_page_writes is off. This is to
>> prevent it from running into a torn page during operation.
>>
>> This is usually a good call, but some file systems like ZFS are naturally
>> immune to torn pages (maybe btrfs too, but I don't know for sure for this
>> one).
>
> Note that this isn't about torn pages in case of crashes, but about reading
> pages while they're being written to.

Like I wrote above, ZFS will prevent torn pages on writes, like 
full_page_writes does.

> Right now, that definitely allows for torn reads, because of the way
> pg_read_binary_file() is implemented.  We only ensure a 4k read size from the
> view of our code, which obviously can lead to torn 8k page reads, no matter
> what the filesystem guarantees.
>
> Also, for reasons I don't understand we use C streaming IO for
> pg_read_binary_file(), so you'd also need to ensure that the buffer size used
> by the stream implementation can't cause the reads to happen in smaller
> chunks.  Afaict we really shouldn't use file streams here, then we'd at least
> have control over that aspect.
>
> Does ZFS actually guarantee that there never can be short reads? As soon as
> they are possible, full page writes are needed.

I may be missing something here: how does full_page_writes prevent 
short _reads_?

Presumably, if we do something like read the first 4K of a file, then 
change the file, then read the next 4K, the second 4K may be a torn 
read. But I fail to see how full_page_writes prevents this since it only 
acts on writes.

> This isn't a fundamental issue - we could have a version of
> pg_read_binary_file() for relation data that prevents the page being written
> out concurrently by locking the buffer page. In addition it could often avoid
> needing to read the page from the OS / disk, if present in shared buffers
> (perhaps minus cases where we haven't flushed the WAL yet, but we could also
> flush the WAL in those).

I agree, but this would need a different patch, which may be beyond my 
skills.

> Greetings,
>
> Andres Freund

Anyway, ZFS will act like full_page_writes is always active, so isn't 
the proposed modification to pg_rewind valid?


You'll find attached a second version of the patch, which is cleaner 
(removed double negation).


Regards,
Jérémie Grauer

From 309649dfa2eea9d696e43271057da7e643d771b4 Mon Sep 17 00:00:00 2001
From: Jeremie Grauer 
Date: Mon, 7 Nov 2022 23:50:43 +0100
Subject: [PATCH] adds the option --no-ensure-full-page-writes to pg_rewind

---
 doc/src/sgml/ref/pg_rewind.sgml  | 27 +++-
 src/bin/pg_rewind/libpq_source.c | 22 ++--
 src/bin/pg_rewind/pg_rewind.c| 43 +++-
 src/bin/pg_rewind/pg_rewind.h|  1 +
 4 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 69d6924b3a..64ee600c2b 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -99,7 +99,13 @@ PostgreSQL documentation
in postgresql.conf or data checksums enabled when
the cluster was initialized with initdb.  Neither of these
are currently on by default.  
-   must also be set to on, but is enabled by default.
+   should also be set to on (which is the default) when the
+   underlying file system can write data in an increment of less than the page size.
+   Some file systems (like ZFS when configured with a recordsize as a multiple of
+   the page size) don't need this parameter since they will never write less than a
+   complete page to disk. If you are running such a system, you can add the option
+   --no-enforce-full-page-writes so that pg_rewind will run
+   even when  is set to off.
   
 
   
@@ -283,6 +289,25 @@ PostgreSQL documentation
   
  
 
+ 
+  --no-ensure-full-page-writes
+  
+   
+pg_rewind by default requires that 
+is set to on. This ensures that pg_rewind do
+not run into torn pages while running. This option is necessary most of
+the time since very few filesystems enforce the flush of full page to disk.
+One such system is ZFS where the page is always flushed to disk in its
+entirety. Other storage system may also have the same warranty. This option
+is here to allow pg_rewind to run on such systems without enforcing that
+ is on. It's important to know what
+you are doing before setting this setting because in case your 

Re: new option to allow pg_rewind to run without full_page_writes

2022-11-05 Thread Andres Freund
Hi,

On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> prevent it from running into a torn page during operation.
>
> This is usually a good call, but some file systems like ZFS are naturally
> immune to torn pages (maybe btrfs too, but I don't know for sure for this
> one).

Note that this isn't about torn pages in case of crashes, but about reading
pages while they're being written to.

Right now, that definitely allows for torn reads, because of the way
pg_read_binary_file() is implemented.  We only ensure a 4k read size from the
view of our code, which obviously can lead to torn 8k page reads, no matter
what the filesystem guarantees.

Also, for reasons I don't understand we use C streaming IO for
pg_read_binary_file(), so you'd also need to ensure that the buffer size used
by the stream implementation can't cause the reads to happen in smaller
chunks.  Afaict we really shouldn't use file streams here, then we'd at least
have control over that aspect.


Does ZFS actually guarantee that there never can be short reads? As soon as
they are possible, full page writes are needed.



This isn't a fundamental issue - we could have a version of
pg_read_binary_file() for relation data that prevents the page being written
out concurrently by locking the buffer page. In addition it could often avoid
needing to read the page from the OS / disk, if present in shared buffers
(perhaps minus cases where we haven't flushed the WAL yet, but we could also
flush the WAL in those).

Greetings,

Andres Freund




new option to allow pg_rewind to run without full_page_writes

2022-11-03 Thread Jérémie Grauer

Hello,

Currently pg_rewind refuses to run if full_page_writes is off. This is 
to prevent it from running into a torn page during operation.


This is usually a good call, but some file systems like ZFS are 
naturally immune to torn pages (maybe btrfs too, but I don't know for 
sure for this one).


Having the option to use pg_rewind without the cost associated with 
full_page_writes when using a system immune to torn pages is beneficial: 
it gives increased performance and more compact WAL.


This patch adds a new option "--no-ensure-full-page-writes" to pg_rewind 
for this situation, together with updated documentation.


Regards,
Jeremie Grauer
From 4f973b7e2d4bd9c04b6b6ce2c825dfdab4dbad11 Mon Sep 17 00:00:00 2001
From: Jeremie Grauer 
Date: Thu, 3 Nov 2022 16:23:49 +0100
Subject: [PATCH] adds the option --no-ensure-full-page-writes to pg_rewind

---
 doc/src/sgml/ref/pg_rewind.sgml  | 25 +-
 src/bin/pg_rewind/libpq_source.c | 44 ++--
 src/bin/pg_rewind/pg_rewind.c| 43 ++-
 src/bin/pg_rewind/pg_rewind.h|  1 +
 4 files changed, 75 insertions(+), 38 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 69d6924b3a..3dbdb503cd 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -99,7 +99,11 @@ PostgreSQL documentation
in postgresql.conf or data checksums enabled when
the cluster was initialized with initdb.  Neither of these
are currently on by default.  
-   must also be set to on, but is enabled by default.
+   should also be set to on (which is the default) except when the
+   underlying system is not succeptible to torn pages. ZFS filesystem for instance doesn't
+   need this parameter. If you are running such a system, you need to add the option
+   --no-enforce-full-page-writes else pg_rewind will refuse
+   to run.
   
 
   
@@ -283,6 +287,25 @@ PostgreSQL documentation
   
  
 
+ 
+  --no-ensure-full-page-writes
+  
+   
+pg_rewind by default requires that 
+is set to on. This ensures that pg_rewind do
+not run into torn pages while running. This option is necessary most of
+the time since very few filesystems enforce the flush of full page to disk.
+One such system is ZFS where the page is always flushed to disk in its
+entirety. Other storage system may also have the same warranty. This option
+is here to allow pg_rewind to run on such systems without enforcing that
+ is on. It's important to know what
+you are doing before setting this setting because in case your system does
+not really warrants that the page is persisted to disk in its entirety, you
+may run into data corruption.
+   
+  
+ 
+
  
   -V
   --version
diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 011c9cce6e..7b0c1d54f2 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -132,26 +132,32 @@ init_libpq_conn(PGconn *conn)
 	PQclear(res);
 
 	/*
-	 * Also check that full_page_writes is enabled.  We can get torn pages if
-	 * a page is modified while we read it with pg_read_binary_file(), and we
-	 * rely on full page images to fix them.
+	 * If --no-ensure-full-page-write is false (the default), check that
+	 * full_page_writes is enabled. The goal here is to not run into a torn
+	 * pages. Torn page can happend if a page is modified while we read it
+	 * with pg_read_binary_file(). Some file systems are immune to torn page,
+	 * this is why there is an option to disable this check. Outside of those
+	 * specific systems, we rely on full_page_writes to avoid torn page.
 	 */
-	str = run_simple_query(conn, "SHOW full_page_writes");
-	if (strcmp(str, "on") != 0)
-		pg_fatal("full_page_writes must be enabled in the source server");
-	pg_free(str);
-
-	/* Prepare a statement we'll use to fetch files */
-	res = PQprepare(conn, "fetch_chunks_stmt",
-	"SELECT path, begin,\n"
-	"  pg_read_binary_file(path, begin, len, true) AS chunk\n"
-	"FROM unnest ($1::text[], $2::int8[], $3::int4[]) as x(path, begin, len)",
-	3, NULL);
-
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pg_fatal("could not prepare statement to fetch file contents: %s",
- PQresultErrorMessage(res));
-	PQclear(res);
+	if (!no_ensure_full_page_writes)
+	{
+		str = run_simple_query(conn, "SHOW full_page_writes");
+		if (strcmp(str, "on") != 0)
+			pg_fatal("full_page_writes must be enabled in the source server");
+		pg_free(str);
+
+		/* Prepare a statement we'll use to fetch files */
+		res = PQprepare(conn, "fetch_chunks_stmt",
+		"SELECT path, begin,\n"
+		"  pg_read_binary_file(path, begin, len, true) AS chunk\n"
+		"FROM unnest ($1::text[], $2::int8[], $3::int4[]) as x(path, begin, len)",
+		3, NULL);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+			pg_fatal("could not