Re: patch to allow disable of WAL recycling

2019-04-02 Thread Jerry Jelinek
On Mon, Apr 1, 2019 at 7:48 PM Thomas Munro  wrote:

> On Sat, Mar 30, 2019 at 11:18 AM Jerry Jelinek 
> wrote:
> > I went through your new version of the patch and it all looks great to
> me.
>
> I moved the error handling logic around a bit so we'd capture errno
> immediately after the syscalls.  I also made a couple of further
> tweaks to comments and removed some unnecessary casts.
>
> I suspect we took so long on this because of lack of ZFS knowledge and
> uncertainty about the precise reason for the current coding in terms
> of crash safety in general.  After learning more, I now suspect the
> claim about fdatasync(2) and indirect blocks in the comments may be
> incorrect (it may stem from buggy behaviour on older Linux kernels),
> but I'm not sure and it's not this patch's job to change that.
>
> Pushed.  Thanks for the patch!


 Thanks a lot for getting this integrated, and thanks to everyone else who
gave me so much feedback and assistance during this process!
Jerry


Re: patch to allow disable of WAL recycling

2019-04-01 Thread Thomas Munro
On Sat, Mar 30, 2019 at 11:18 AM Jerry Jelinek  wrote:
> I went through your new version of the patch and it all looks great to me.

I moved the error handling logic around a bit so we'd capture errno
immediately after the syscalls.  I also made a couple of further
tweaks to comments and removed some unnecessary casts.

I suspect we took so long on this because of lack of ZFS knowledge and
uncertainty about the precise reason for the current coding in terms
of crash safety in general.  After learning more, I now suspect the
claim about fdatasync(2) and indirect blocks in the comments may be
incorrect (it may stem from buggy behaviour on older Linux kernels),
but I'm not sure and it's not this patch's job to change that.

Pushed.  Thanks for the patch!

-- 
Thomas Munro
https://enterprisedb.com




Re: patch to allow disable of WAL recycling

2019-03-30 Thread Tomas Vondra

On Fri, Mar 29, 2019 at 01:09:46PM +1300, Thomas Munro wrote:


...

I still don't know why exactly this happens, but it's clearly a real
phenomenon.  As for why Tomas Vondra couldn't see it, I'm guessing
that stacks more RAM and ~500k IOPS help a lot (essentially the
opposite end of the memory, CPU, IO spectrum from this little
machine), and Joyent's systems may be somewhere in between?



If needed, I guess I can rerun the tests with data on the SATA RAID, but
this time limit the amount of RAM to something much lower.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: patch to allow disable of WAL recycling

2019-03-29 Thread Jerry Jelinek
On Thu, Mar 28, 2019 at 6:10 PM Thomas Munro  wrote:

> On Fri, Mar 29, 2019 at 10:47 AM Thomas Munro 
> wrote:
> > On Fri, Mar 29, 2019 at 8:59 AM Robert Haas 
> wrote:
> > > On Tue, Mar 26, 2019 at 3:24 PM Jerry Jelinek <
> jerry.jeli...@joyent.com> wrote:
> > > > The latest patch is rebased, builds clean, and passes some basic
> testing. Please let me know if there is anything else I could do on this.
> > >
> > > I agree with Thomas Munro's earlier critique of the documentation.
> > > The documentation of the new parameters makes an assumption,
> > > completely unsupported in my view, about when those parameters should
> > > be set, yet at the same time gives almost no information about what
> > > they actually do.  I don't like that.
> > >
> > > The patch needs a visit from pgindent, too.
> >
> > I would like to fix these problems and commit the patch.  First, I'm
> > going to go and do some project-style tidying, write some proposed doc
> > tweaks, and retest these switches on the machine where I saw
> > beneficial effects from the patch before.  I'll post a new version
> > shortly to see if anyone has objections.
>
> Here's a new version of the patch.
>
> Last time I ran the test, I was using FreeBSD 11.2, but now I'm on
> FreeBSD 12.0, and I suspect something changed about how it respects
> the arc size sysctls causing it to behave very badly, so this time I
> didn't change them from their defaults.  Also the disks have changed
> from 7200RPM drives to 5400RPM drives since last time.  The machine
> has 2 underpowered cores and 6GB of RAM.  What can I say, it's a super
> low end storage/backup box.  What's interesting is that it does show
> the reported problem.  Actually I often test stuff relating to OS
> caching on this box precisely because the IO sticks out so much.
>
> Some OS set-up steps run as root:
>
> zfs create zroot/tmp/test
> zfs set mountpoint=/tmp/test zroot/tmp/test
> zfs set compression=off zroot/tmp/test
> zfs set recordsize=8192 zroot/tmp/test
> chown tmunro:tmunro /tmp/test
>
> Now as my regular user:
>
> initdb -D /tmp/test
> cat <<EOF >> /tmp/test/postgresql.conf
> fsync=off
> max_wal_size = 600MB
> min_wal_size = 600MB
> EOF
>
> I started postgres -D /tmp/test and I set up pgbench:
>
> pgbench -i -s 100 postgres
>
> Then I ran each test as follows:
>
> tar cvf /dev/null /tmp/test # make sure all data files are pre-warmed into
> arc
> for i in 1 2 3 ; do
>   pgbench -M prepared -c 4 -j 4 -T 120 postgres
> done
>
> I did that with all 4 GUC permutations and got the following TPS numbers:
>
> wal_recycle=off, wal_init_zero=off: 2668, 1873, 2166
> wal_recycle=on,  wal_init_zero=off: 1936, 1350, 1552
> wal_recycle=off, wal_init_zero=on : 2213, 1360, 1539
> wal_recycle=on,  wal_init_zero=on : 1539, 1007, 1252
>
> Finally, conscious that 2 minutes isn't really enough, I did a 10
> minute run with both settings on and both off, again with the tar
> command first to try to give them the same initial conditions (really
> someone should write a "drop-caches-now" patch for FreeBSD that
> affects the page cache and the ZFS ARC, but I digress) and got:
>
> wal_recycle=on,  wal_init_zero=on : 1468
> wal_recycle=off, wal_init_zero=off: 2046
>
> I still don't know why exactly this happens, but it's clearly a real
> phenomenon.  As for why Tomas Vondra couldn't see it, I'm guessing
> that stacks more RAM and ~500k IOPS help a lot (essentially the
> opposite end of the memory, CPU, IO spectrum from this little
> machine), and Joyent's systems may be somewhere in between?
>
> --
> Thomas Munro
> https://enterprisedb.com


Thomas,

Thanks for doing all of this work on the patch. I like your new
documentation a lot better than what I wrote. Also, sorry for messing up
the style here. I was trying to be consistent with the existing style, but
I clearly failed. Thanks for cleaning all of that up. I went through your
new version of the patch and it all looks great to me.

Thanks again,
Jerry


Re: patch to allow disable of WAL recycling

2019-03-29 Thread Peter Eisentraut
On 2019-03-29 01:09, Thomas Munro wrote:
>> I would like to fix these problems and commit the patch.  First, I'm
>> going to go and do some project-style tidying, write some proposed doc
>> tweaks, and retest these switches on the machine where I saw
>> beneficial effects from the patch before.  I'll post a new version
>> shortly to see if anyone has objections.
> Here's a new version of the patch.

I like the way the documentation is written in this patch version.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: patch to allow disable of WAL recycling

2019-03-28 Thread Thomas Munro
On Fri, Mar 29, 2019 at 10:47 AM Thomas Munro  wrote:
> On Fri, Mar 29, 2019 at 8:59 AM Robert Haas  wrote:
> > On Tue, Mar 26, 2019 at 3:24 PM Jerry Jelinek  
> > wrote:
> > > The latest patch is rebased, builds clean, and passes some basic testing. 
> > > Please let me know if there is anything else I could do on this.
> >
> > I agree with Thomas Munro's earlier critique of the documentation.
> > The documentation of the new parameters makes an assumption,
> > completely unsupported in my view, about when those parameters should
> > be set, yet at the same time gives almost no information about what
> > they actually do.  I don't like that.
> >
> > The patch needs a visit from pgindent, too.
>
> I would like to fix these problems and commit the patch.  First, I'm
> going to go and do some project-style tidying, write some proposed doc
> tweaks, and retest these switches on the machine where I saw
> beneficial effects from the patch before.  I'll post a new version
> shortly to see if anyone has objections.

Here's a new version of the patch.

Last time I ran the test, I was using FreeBSD 11.2, but now I'm on
FreeBSD 12.0, and I suspect something changed about how it respects
the arc size sysctls causing it to behave very badly, so this time I
didn't change them from their defaults.  Also the disks have changed
from 7200RPM drives to 5400RPM drives since last time.  The machine
has 2 underpowered cores and 6GB of RAM.  What can I say, it's a super
low end storage/backup box.  What's interesting is that it does show
the reported problem.  Actually I often test stuff relating to OS
caching on this box precisely because the IO sticks out so much.

Some OS set-up steps run as root:

zfs create zroot/tmp/test
zfs set mountpoint=/tmp/test zroot/tmp/test
zfs set compression=off zroot/tmp/test
zfs set recordsize=8192 zroot/tmp/test
chown tmunro:tmunro /tmp/test

Now as my regular user:

initdb -D /tmp/test
cat <<EOF >> /tmp/test/postgresql.conf
fsync=off
max_wal_size = 600MB
min_wal_size = 600MB
EOF

I started postgres -D /tmp/test and I set up pgbench:

pgbench -i -s 100 postgres

Then I ran each test as follows:

tar cvf /dev/null /tmp/test # make sure all data files are pre-warmed into arc
for i in 1 2 3 ; do
  pgbench -M prepared -c 4 -j 4 -T 120 postgres
done
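
A sketch (not from the original post) of one way those four permutations could be
cycled, assuming the patch's wal_recycle / wal_init_zero GUCs and using ALTER SYSTEM
plus a restart to flip them:

for recycle in on off; do
  for init_zero in on off; do
    psql -X postgres -c "ALTER SYSTEM SET wal_recycle = $recycle"
    psql -X postgres -c "ALTER SYSTEM SET wal_init_zero = $init_zero"
    pg_ctl -D /tmp/test restart
    tar cvf /dev/null /tmp/test   # pre-warm the data files into the ARC, as above
    for i in 1 2 3 ; do
      pgbench -M prepared -c 4 -j 4 -T 120 postgres
    done
  done
done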

I did that with all 4 GUC permutations and got the following TPS numbers:

wal_recycle=off, wal_init_zero=off: 2668, 1873, 2166
wal_recycle=on,  wal_init_zero=off: 1936, 1350, 1552
wal_recycle=off, wal_init_zero=on : 2213, 1360, 1539
wal_recycle=on,  wal_init_zero=on : 1539, 1007, 1252

Finally, conscious that 2 minutes isn't really enough, I did a 10
minute run with both settings on and both off, again with the tar
command first to try to give them the same initial conditions (really
someone should write a "drop-caches-now" patch for FreeBSD that
affects the page cache and the ZFS ARC, but I digress) and got:

wal_recycle=on,  wal_init_zero=on : 1468
wal_recycle=off, wal_init_zero=off: 2046

I still don't know why exactly this happens, but it's clearly a real
phenomenon.  As for why Tomas Vondra couldn't see it, I'm guessing
that stacks more RAM and ~500k IOPS help a lot (essentially the
opposite end of the memory, CPU, IO spectrum from this little
machine), and Joyent's systems may be somewhere in between?

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-wal_recycle-and-wal_init_zero-GUCs.patch
Description: Binary data


Re: patch to allow disable of WAL recycling

2019-03-28 Thread Thomas Munro
On Fri, Mar 29, 2019 at 8:59 AM Robert Haas  wrote:
> On Tue, Mar 26, 2019 at 3:24 PM Jerry Jelinek  
> wrote:
> > The latest patch is rebased, builds clean, and passes some basic testing. 
> > Please let me know if there is anything else I could do on this.
>
> I agree with Thomas Munro's earlier critique of the documentation.
> The documentation of the new parameters makes an assumption,
> completely unsupported in my view, about when those parameters should
> be set, yet at the same time gives almost no information about what
> they actually do.  I don't like that.
>
> The patch needs a visit from pgindent, too.

I would like to fix these problems and commit the patch.  First, I'm
going to go and do some project-style tidying, write some proposed doc
tweaks, and retest these switches on the machine where I saw
beneficial effects from the patch before.  I'll post a new version
shortly to see if anyone has objections.

-- 
Thomas Munro
https://enterprisedb.com




Re: patch to allow disable of WAL recycling

2019-03-28 Thread Robert Haas
On Tue, Mar 26, 2019 at 3:24 PM Jerry Jelinek  wrote:
> The latest patch is rebased, builds clean, and passes some basic testing. 
> Please let me know if there is anything else I could do on this.

I agree with Thomas Munro's earlier critique of the documentation.
The documentation of the new parameters makes an assumption,
completely unsupported in my view, about when those parameters should
be set, yet at the same time gives almost no information about what
they actually do.  I don't like that.

The patch needs a visit from pgindent, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: patch to allow disable of WAL recycling

2019-03-26 Thread Jerry Jelinek
On Thu, Mar 7, 2019 at 6:26 PM Thomas Munro  wrote:

> On Fri, Mar 8, 2019 at 12:35 PM Jerry Jelinek 
> wrote:
> > On Thu, Mar 7, 2019 at 3:09 PM Thomas Munro 
> wrote:
> >> My understanding is that it's not really the COW-ness that makes it
> >> not necessary, it's the fact that fdatasync() doesn't do anything
> >> different from fsync() on ZFS and there is no situation where
> >> fdatasync() succeeds, you lose power, you come back up and find that
> >> the file size is wrong or a hole in the middle of the file has come
> >> back from the dead, and you lost the data.  The whole concept of "data
> >> sync" implies that file meta-data and file contents are cached and
> >> synchronized separately and you can deliberately ask for weaker
> >> coherency to cut down on IOs; *that's* the thing that ZFS doesn't
> >> have, and couldn't benefit from because it's just going to write stuff
> >> in its tidy sequential log in the form of all-or-nothing transactions
> >> anyway.  I don't know if that's also true for eg BTRFS or any other
> >> COW filesystem that might be out there, but I don't know why you'd
> >> want to mention COW instead of wal_sync_method as the motivation when
> >> the source code comments know better.
> >
> >
> > Hopefully I am not misinterpreting your comment here, but I'm not sure I
> fully agree with that assessment. I can't speak for other filesystems, but
> for ZFS, none of the zero-filled blocks will actually be written to disk,
> but that determination happens fairly late in the process, after the
> thousands of write system calls have been processed. So on ZFS, these
> writes are basically useless, except as a way to increase the size of the
> file. No disk space is actually allocated. However, on any COW filesystem,
> any write to any of these zero-filled blocks will have to allocate a new
> block, so nothing about "preallocating space" has been accomplished by all
> of these system calls. At least, preallocating space is my understanding of
> why the zero-fill is currently being performed.
>
> It seems like you're focusing on the performance and I'm focusing on
> the safety.  Obviously it's a complete waste of time to try to
> "preallocate" space on COW filesystems since they will not reuse that
> space anyway by definition.  My point was that it may be unsafe to
> turn it off when configured to use fdatasync() for later writes to the
> file on filesystems that make fewer durability guarantees with
> fdatasync() than with a full fsync(), and that seems like another
> newsworthy angle on this for end users to know about.  I dunno, maybe
> those things are so closely linked that it's OK to write just "only
> turn it off on COW filesystems", but I'm wondering why we don't
> mention the actual reason for the feature when we make that claim in
> the comments.
>
> Hmm... digging a bit further.  So, those comments in xlog.c date from
> 2013/2014 when this stuff was going down:
>
> https://lkml.org/lkml/2012/9/3/83
>
> https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
> http://www.openldap.org/lists/openldap-devel/201411/msg2.html
>
> So that was not actually the intended behaviour of fdatasync(), but
> rather a bug in ext3/4 that's been fixed now.  POSIX says "File
> attributes that are not necessary for data retrieval (access time,
> modification time, status change time) need not be successfully
> transferred prior to returning to the calling process.", and the Linux
> man page says that it "does not flush modified metadata unless that
> metadata is needed in order to allow a subsequent data retrieval to be
> correctly handled", so... if there are no OS bugs, the comments in
> xlog.c are overly pessimistic and the only effect of using fdatasync()
> instead of fsync() is to avoid extra IOs for mtime etc.
>
> I still like the pessimism in the code.  But OK, I withdraw my
> complaint about that sentence in the documentation for now!  :-)


 I haven't heard any additional feedback in the last couple of weeks, so I
wanted to see if there is anything else needed for this patch? I did update
the code to use pg_pwrite. A new version of the patch is attached. The only
difference from the previous version is this diff:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3283,8 +3283,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 */
errno = 0;
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-   fail = lseek(fd, wal_segment_size - 1, SEEK_SET) < (off_t) 0 ||
-   (int) write(fd, zbuffer.data, 1) != (int) 1;
+   fail = pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) !=
+   (ssize_t) 1;
pgstat_report_wait_end();
}

The latest patch is rebased, builds clean, and passes some basic testing.
Please let me know if there is anything else I could do on this.

Thanks,
Jerry


0001

Re: patch to allow disable of WAL recycling

2019-03-07 Thread Thomas Munro
On Fri, Mar 8, 2019 at 12:35 PM Jerry Jelinek  wrote:
> On Thu, Mar 7, 2019 at 3:09 PM Thomas Munro  wrote:
>> My understanding is that it's not really the COW-ness that makes it
>> not necessary, it's the fact that fdatasync() doesn't do anything
>> different from fsync() on ZFS and there is no situation where
>> fdatasync() succeeds, you lose power, you come back up and find that
>> the file size is wrong or a hole in the middle of the file has come
>> back from the dead, and you lost the data.  The whole concept of "data
>> sync" implies that file meta-data and file contents are cached and
>> synchronized separately and you can deliberately ask for weaker
>> coherency to cut down on IOs; *that's* the thing that ZFS doesn't
>> have, and couldn't benefit from because it's just going to write stuff
>> in its tidy sequential log in the form of all-or-nothing transactions
>> anyway.  I don't know if that's also true for eg BTRFS or any other
>> COW filesystem that might be out there, but I don't know why you'd
>> want to mention COW instead of wal_sync_method as the motivation when
>> the source code comments know better.
>
>
> Hopefully I am not misinterpreting your comment here, but I'm not sure I 
> fully agree with that assessment. I can't speak for other filesystems, but 
> for ZFS, none of the zero-filled blocks will actually be written to disk, but 
> that determination happens fairly late in the process, after the thousands of 
> write system calls have been processed. So on ZFS, these writes are basically 
> useless, except as a way to increase the size of the file. No disk space is 
> actually allocated. However, on any COW filesystem, any write to any of these 
> zero-filled blocks will have to allocate a new block, so nothing about 
> "preallocating space" has been accomplished by all of these system calls. At 
> least, preallocating space is my understanding of why the zero-fill is 
> currently being performed.

It seems like you're focusing on the performance and I'm focusing on
the safety.  Obviously it's a complete waste of time to try to
"preallocate" space on COW filesystems since they will not reuse that
space anyway by definition.  My point was that it may be unsafe to
turn it off when configured to use fdatasync() for later writes to the
file on filesystems that make fewer durability guarantees with
fdatasync() than with a full fsync(), and that seems like another
newsworthy angle on this for end users to know about.  I dunno, maybe
those things are so closely linked that it's OK to write just "only
turn it off on COW filesystems", but I'm wondering why we don't
mention the actual reason for the feature when we make that claim in
the comments.

Hmm... digging a bit further.  So, those comments in xlog.c date from
2013/2014 when this stuff was going down:

https://lkml.org/lkml/2012/9/3/83
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
http://www.openldap.org/lists/openldap-devel/201411/msg2.html

So that was not actually the intended behaviour of fdatasync(), but
rather a bug in ext3/4 that's been fixed now.  POSIX says "File
attributes that are not necessary for data retrieval (access time,
modification time, status change time) need not be successfully
transferred prior to returning to the calling process.", and the Linux
man page says that it "does not flush modified metadata unless that
metadata is needed in order to allow a subsequent data retrieval to be
correctly handled", so... if there are no OS bugs, the comments in
xlog.c are overly pessimistic and the only effect of using fdatasync()
instead of fsync() is to avoid extra IOs for mtime etc.
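
For readers who want the call pattern spelled out, here is a tiny, self-contained C
sketch (illustrative only; it is not PostgreSQL code, and the file name and function
are made up) of the write-then-fdatasync sequence whose durability semantics are
being discussed:

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Append a record at a given offset and make it durable the way
 * wal_sync_method = fdatasync would: POSIX allows fdatasync() to skip
 * metadata that is not needed to read the data back, while fsync()
 * also flushes metadata such as mtime. */
static int
append_record(int fd, const void *rec, size_t len, off_t offset)
{
	if (pwrite(fd, rec, len, offset) != (ssize_t) len)
		return -1;
	return fdatasync(fd);
}

int
main(void)
{
	int			fd = open("/tmp/fdatasync_demo", O_RDWR | O_CREAT, 0600);
	const char	rec[] = "dummy WAL record";

	if (fd < 0 || append_record(fd, rec, sizeof(rec), 0) != 0)
	{
		perror("append_record");
		return 1;
	}
	return close(fd);
}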

I still like the pessimism in the code.  But OK, I withdraw my
complaint about that sentence in the documentation for now!  :-)

-- 
Thomas Munro
https://enterprisedb.com



Re: patch to allow disable of WAL recycling

2019-03-07 Thread Jerry Jelinek
Thomas,

Responses in-line.

On Thu, Mar 7, 2019 at 3:09 PM Thomas Munro  wrote:

> On Fri, Mar 8, 2019 at 10:13 AM Jerry Jelinek 
> wrote:
> >  I have attached a new version of the patch that implements the changes
> we've discussed over the past couple of days. Let me know if there are any
> comments or suggestions.
>
> +fail = lseek(fd, wal_segment_size - 1, SEEK_SET) < (off_t) 0 ||
> +(int) write(fd, zbuffer.data, 1) != (int) 1;
>
> BTW we now have pg_pwrite() to do this in one syscall.
>

Thanks for the pointer, I'll take a look at that.


>
> +Disabling this option prevents zero-filling new WAL files.
> +This parameter should only be set to off when the WAL
> +resides on a Copy-On-Write (COW)
> +filesystem.
>
> Hmm.  The comments in the source give the actual motivation for this
> preallocation logic... I wonder why we don't come out and say the same
> thing in the documentation, instead of this vague language about COW
> filesystems.
>
> Here's a suggestion: "Zero-filling new segment files ensures that it
> is safe to use wal_sync_method = fdatasync or wal_sync_method =
> open_datasync on filesystems that synchronize file meta-data and data
> separately.  It is not necessary on some filesystems such as ZFS."
>
> My understanding is that it's not really the COW-ness that makes it
> not necessary, it's the fact that fdatasync() doesn't do anything
> different from fsync() on ZFS and there is no situation where
> fdatasync() succeeds, you lose power, you come back up and find that
> the file size is wrong or a hole in the middle of the file has come
> back from the dead, and you lost the data.  The whole concept of "data
> sync" implies that file meta-data and file contents are cached and
> synchronized separately and you can deliberately ask for weaker
> coherency to cut down on IOs; *that's* the thing that ZFS doesn't
> have, and couldn't benefit from because it's just going to write stuff
> in its tidy sequential log in the form of all-or-nothing transactions
> anyway.  I don't know if that's also true for eg BTRFS or any other
> COW filesystem that might be out there, but I don't know why you'd
> want to mention COW instead of wal_sync_method as the motivation when
> the source code comments know better.
>

Hopefully I am not misinterpreting your comment here, but I'm not sure I
fully agree with that assessment. I can't speak for other filesystems, but
for ZFS, none of the zero-filled blocks will actually be written to disk,
but that determination happens fairly late in the process, after the
thousands of write system calls have been processed. So on ZFS, these
writes are basically useless, except as a way to increase the size of the
file. No disk space is actually allocated. However, on any COW filesystem,
any write to any of these zero-filled blocks will have to allocate a new
block, so nothing about "preallocating space" has been accomplished by all
of these system calls. At least, preallocating space is my understanding of
why the zero-fill is currently being performed.


> +Disabling this option prevents WAL file recycling.
> +This parameter should only be set to off when the WAL
> +resides on a COW filesystem.
>
> Would it be better to say what it's for, rather than when to set it?
> To make clear that it's a performance setting, not a safety one.
> "Setting this option to off may increase
> performance on copy-on-write filesystems."
>

That sounds good to me, I'll change the wording and post a new patch after
I wait a little while to see if there is any other feedback.

Thanks for taking a look,
Jerry


Re: patch to allow disable of WAL recycling

2019-03-07 Thread Thomas Munro
On Fri, Mar 8, 2019 at 10:13 AM Jerry Jelinek  wrote:
>  I have attached a new version of the patch that implements the changes we've 
> discussed over the past couple of days. Let me know if there are any comments 
> or suggestions.

+fail = lseek(fd, wal_segment_size - 1, SEEK_SET) < (off_t) 0 ||
+(int) write(fd, zbuffer.data, 1) != (int) 1;

BTW we now have pg_pwrite() to do this in one syscall.
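
For context: pg_pwrite() is PostgreSQL's portability wrapper with the same signature
as POSIX pwrite(), so the two wrapped lines above collapse into a single positional
write. A minimal sketch of the equivalent form (shown with plain pwrite() and without
error reporting; essentially what the updated patch later in this thread does):

    /* write the segment's final byte in one syscall; pwrite() leaves the
     * file offset unchanged */
    fail = pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != (ssize_t) 1;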

+Disabling this option prevents zero-filling new WAL files.
+This parameter should only be set to off when the WAL
+resides on a Copy-On-Write (COW)
+filesystem.

Hmm.  The comments in the source give the actual motivation for this
preallocation logic... I wonder why we don't come out and say the same
thing in the documentation, instead of this vague language about COW
filesystems.

Here's a suggestion: "Zero-filling new segment files ensures that it
is safe to use wal_sync_method = fdatasync or wal_sync_method =
open_datasync on filesystems that synchronize file meta-data and data
separately.  It is not necessary on some filesystems such as ZFS."

My understanding is that it's not really the COW-ness that makes it
not necessary, it's the fact that fdatasync() doesn't do anything
different from fsync() on ZFS and there is no situation where
fdatasync() succeeds, you lose power, you come back up and find that
the file size is wrong or a hole in the middle of the file has come
back from the dead, and you lost the data.  The whole concept of "data
sync" implies that file meta-data and file contents are cached and
synchronized separately and you can deliberately ask for weaker
coherency to cut down on IOs; *that's* the thing that ZFS doesn't
have, and couldn't benefit from because it's just going to write stuff
in its tidy sequential log in the form of all-or-nothing transactions
anyway.  I don't know if that's also true for eg BTRFS or any other
COW filesystem that might be out there, but I don't know why you'd
want to mention COW instead of wal_sync_method as the motivation when
the source code comments know better.

+Disabling this option prevents WAL file recycling.
+This parameter should only be set to off when the WAL
+resides on a COW filesystem.

Would it be better to say what it's for, rather than when to set it?
To make clear that it's a performance setting, not a safety one.
"Setting this option to off may increase
performance on copy-on-write filesystems."

--
Thomas Munro
https://enterprisedb.com



Re: patch to allow disable of WAL recycling

2019-03-07 Thread Jerry Jelinek
On Wed, Mar 6, 2019 at 4:14 PM Jerry Jelinek 
wrote:

>
> It sounds like everyone is in agreement that I should get rid of the
> single COW GUC tunable and provide two different tunables instead. I will
> update the patch to go back to the original name (wal_recycle) for the
> original WAL recycling behavior. The default value of that will be true to
> provide the existing behavior. This matches my original proposal from last
> year. I will add a new tunable (wal_init_zero) which will control the
> zero-fill behavior for the WAL file. Again, the default value will be true
> and provide the existing behavior. Both of these could (should) be set to
> false for a COW filesystem like ZFS.
>
> If anyone objects to this new approach, let me know, otherwise I'll start
> preparing an updated patch.
>

 I have attached a new version of the patch that implements the changes
we've discussed over the past couple of days. Let me know if there are any
comments or suggestions.

Thanks,
Jerry


0001-wal_recycle-and-wal_init_zero.patch
Description: Binary data


Re: patch to allow disable of WAL recycling

2019-03-06 Thread Jerry Jelinek
On Wed, Mar 6, 2019 at 11:02 AM Alvaro Herrera 
wrote:

> On 2019-Mar-06, Robert Haas wrote:
>
> > On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera 
> wrote:
> > > I want your dictating software.
> >
> > I'm afraid this is just me and a keyboard, but sadly for me you're not
> > the first person to accuse me of producing giant walls of text.
>
> Well, I don't have a problem reading long texts; my problem is that I'm
> unable to argue as quickly.
>
> I do buy your argument, though (if reluctantly); in particular I was
> worried to offer a parameter (to turn off zero-filling of segments) that
> would enable dangerous behavior, but then I realized we also have
> fsync=off of which the same thing can be said.  So I agree we should
> have two GUCs, properly explained, with a warning where appropriate.
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

It sounds like everyone is in agreement that I should get rid of the
single COW GUC tunable and provide two different tunables instead. I will
update the patch to go back to the original name (wal_recycle) for the
original WAL recycling behavior. The default value of that will be true to
provide the existing behavior. This matches my original proposal from last
year. I will add a new tunable (wal_init_zero) which will control the
zero-fill behavior for the WAL file. Again, the default value will be true
and provide the existing behavior. Both of these could (should) be set to
false for a COW filesystem like ZFS.
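
Put concretely, when pg_wal sits on ZFS (or a similar COW filesystem) the relevant
postgresql.conf excerpt would look something like this sketch, using the parameter
names proposed above (both default to on):

wal_recycle = off       # creating new WAL segments is cheaper than recycling on COW
wal_init_zero = off     # zero-filling does not preallocate anything on COW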

If anyone objects to this new approach, let me know, otherwise I'll start
preparing an updated patch.

Thanks for all of the feedback,
Jerry


Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 1:02 PM Alvaro Herrera  wrote:
> Well, I don't have a problem reading long texts; my problem is that I'm
> unable to argue as quickly.

That's my secret weapon... except that it's not much of a secret.

> I do buy your argument, though (if reluctantly); in particular I was
> worried to offer a parameter (to turn off zero-filling of segments) that
> would enable dangerous behavior, but then I realized we also have
> fsync=off of which the same thing can be said.  So I agree we should
> have two GUCs, properly explained, with a warning where appropriate.

OK, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera  
> wrote:
> > I want your dictating software.
> 
> I'm afraid this is just me and a keyboard, but sadly for me you're not
> the first person to accuse me of producing giant walls of text.

Well, I don't have a problem reading long texts; my problem is that I'm
unable to argue as quickly.

I do buy your argument, though (if reluctantly); in particular I was
worried to offer a parameter (to turn off zero-filling of segments) that
would enable dangerous behavior, but then I realized we also have
fsync=off of which the same thing can be said.  So I agree we should
have two GUCs, properly explained, with a warning where appropriate.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera  wrote:
> I want your dictating software.

I'm afraid this is just me and a keyboard, but sadly for me you're not
the first person to accuse me of producing giant walls of text.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
I want your dictating software.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 11:41 AM Alvaro Herrera  wrote:
> I can understand this argument.  Is there really a reason to change
> those two behaviors separately?

See my previous reply to Andrew, but also, I think you're putting the
burden of proof in the wrong place.  You could equally well ask "Is
there really a reason for work_mem to be different for sorts and index
builds?  For sorts and hashes?  For foreground and background
vacuums?"  Well, now we've got work_mem, maintenance_work_mem,
autovacuum_work_mem, and at least in my experience, that's not
necessarily fine-grained enough -- people can't predict whether their
maintenance_work_mem setting is OK because they don't know if
somebody's going to be running a foreground VACUUM or a CREATE INDEX
while it's in flight.  See also the bit about hash_mem in
https://www.postgresql.org/message-id/CAH2-WzmNwV=LfDRXPsmCqgmm91mp=2b4FvXNF=ccvmrb8yf...@mail.gmail.com
-- see also commit a1b395b6a26ae80cde17fdfd2def8d351872f399's
introduction of pending_list_cleanup_size, yet another place where we
started to decouple something that was inadvisably tied to work_mem.

There have been other cases, too, where we've bound unrelated things
together into a single parameter, and my feeling is that most of those
have turned out a mess.  Separate behaviors ought to be controlled by
separate settings, even though it means we'll end up with more
settings.  Two settings each of which does one clear and well-defined
thing can even be easier to understand than one setting that does
multiple loosely-related things.

> The reason I wrote the documentation
> weasely is that it seems pointless to have to update it whenever we
> implement more things controlled by the same GUC option (which we might,
> if we learn new things about how to use COW filesystems later on).
> AFAIR Jerry's wording was more precise about what the parameter did.  If
> the only reason to change those behaviors is to make WAL work better on
> COW filesystems, then I don't see the point in splitting the GUC in two,
> or documenting in minute detail what it does.

Really?  What about somebody who has a different experience from
Jerry?  They turn the parameter on in release N and it's good and then
the behavior changes in release N+1 and now it sucks and they read the
documentation and it tells them nothing about what has actually
changed.  They can neither get the behavior back that they liked nor
can they understand what behavior they're actually getting that is
causing a problem, because it's not documented.

I do not think our track record is very good when it comes to deciding
which things users "need to know about."  Users need to know what's
really happening.  The idea that we're just going to have a magic flag
here that is going to change all of the things that you want changed
when you're running on a copy-on-write filesystem and it's all going
to work great so that nobody cares about the details does not sound
very likely to be correct.  We don't even know that the same
combination of behavior is performant or safe on every filesystem out
there, let alone that future things that come along are going to have
similar properties.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 11:37 AM Andrew Dunstan
 wrote:
> Well, let's put the question another way. Is there any reason to allow
> skipping zero filling if we are recycling? That seems possibly
> dangerous. I can imagine turning off recycling but leaving on
> zero-filling, although I don't have a concrete use case for it ATM.

I think the short answer is that we don't know.  Any filesystem where
just writing the last byte of the file is good enough to guarantee
that all the intervening space is allocated can skip zero-filling.
Any system where creating new WAL files is faster than recycling old
ones can choose to do it that way.  I don't know how you can make a
categorical argument that there can't be a system where one of those
things -- either one -- is true and the other is false.  At least to
me, they seem like basically unrelated issues.
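
To make the two behaviors concrete, here is a rough, self-contained C sketch of the
two initialization strategies under discussion (simplified and illustrative only;
this is not the xlog.c code, and the constants and function names are invented):

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE	(16 * 1024 * 1024)	/* default WAL segment size */
#define BLK_SIZE	8192

/* wal_init_zero = on: write zeros through the whole segment so that a
 * non-COW filesystem allocates every block up front. */
static int
init_with_zero_fill(int fd)
{
	char		zbuffer[BLK_SIZE];
	off_t		off;

	memset(zbuffer, 0, sizeof(zbuffer));
	for (off = 0; off < SEG_SIZE; off += sizeof(zbuffer))
	{
		if (write(fd, zbuffer, sizeof(zbuffer)) != (ssize_t) sizeof(zbuffer))
			return -1;
	}
	return 0;
}

/* wal_init_zero = off: write only the final byte, which gives the file its
 * full size without preallocating anything -- enough on a COW filesystem,
 * which never overwrites blocks in place anyway. */
static int
init_by_extending(int fd)
{
	char		zero = 0;

	if (pwrite(fd, &zero, 1, SEG_SIZE - 1) != 1)
		return -1;
	return 0;
}

int
main(void)
{
	int			fd1 = open("/tmp/seg_zerofill", O_RDWR | O_CREAT | O_TRUNC, 0600);
	int			fd2 = open("/tmp/seg_extended", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd1 < 0 || fd2 < 0 ||
		init_with_zero_fill(fd1) != 0 || init_by_extending(fd2) != 0)
		return 1;
	close(fd1);
	close(fd2);
	return 0;
}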

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  
> wrote:
> > I think the idea of it being a generic tunable for assorted behavior
> > changes, rather than specific to WAL recycling, is a good one.  I'm
> > unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
> 
> I *really* dislike this.  For one thing, it means that users don't
> have control over the behaviors individually.  For another, the
> documentation is now quite imprecise about what the option actually
> does, while expecting users to figure out whether those behaviors are
> acceptable or preferable in their environment.  It lists recycling of
> WAL files and zero-filling of those files as examples of behavior
> changes, but it does not say that those are the only changes, or even
> that they are made in all cases.

I can understand this argument.  Is there really a reason to change
those two behaviors separately?  The reason I wrote the documentation
weasely is that it seems pointless to have to update it whenever we
implement more things controlled by the same GUC option (which we might,
if we learn new things about how to use COW filesystems later on).
AFAIR Jerry's wording was more precise about what the parameter did.  If
the only reason to change those behaviors is to make WAL work better on
COW filesystems, then I don't see the point in splitting the GUC in two,
or documenting in minute detail what it does.

That said -- if there *is* such a reason, we can certainly split them up
and indicate to COW-filesystem users to change them both together.  I
don't think it's a big deal, but OTOH I see no reason to complicate
matters needlessly.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Andrew Dunstan


On 3/6/19 11:30 AM, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 10:55 AM Andrew Dunstan
>  wrote:
>>> I *really* dislike this.  For one thing, it means that users don't
>>> have control over the behaviors individually.  For another, the
>>> documentation is now quite imprecise about what the option actually
>>> does, while expecting users to figure out whether those behaviors are
>>> acceptable or preferable in their environment.  It lists recycling of
>>> WAL files and zero-filling of those files as examples of behavior
>>> changes, but it does not say that those are the only changes, or even
>>> that they are made in all cases.
>> So you want two options, like wal_recycle_files and wal_zero_fill, both
>> defaulting to true? Is there a reasonable use case for turning one off
>> without the other?
> I don't know whether there's a use case for that, and that's one of
> the things that worries me.  I know, though, that if we have two
> parameters, then if there is a use case for it, people will be able to
> meet that use case without submitting a patch.  On the other hand, if
> we had convincing evidence that those two things should always go
> together, that would be fine, too.  But I don't see that anyone has
> made an argument that such a thing is necessarily true outside of ZFS.
>
> I actually wouldn't find it very surprising if disabling WAL recycling
> is sometimes beneficial even on ext4.  The fact that we haven't found
> such cases on this thread doesn't mean they don't exist.  On the other
> hand I think the wal_zero_fill behavior is not about performance but
> about reliability, so you can't afford to turn that off just because
> non-recycling happens to be faster on your machine.
>
>


Well, let's put the question another way. Is there any reason to allow
skipping zero filling if we are recycling? That seems possibly
dangerous. I can imagine turning off recycling but leaving on
zero-filling, although I don't have a concrete use case for it ATM.


cheers


andrew


-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 10:55 AM Andrew Dunstan
 wrote:
> > I *really* dislike this.  For one thing, it means that users don't
> > have control over the behaviors individually.  For another, the
> > documentation is now quite imprecise about what the option actually
> > does, while expecting users to figure out whether those behaviors are
> > acceptable or preferable in their environment.  It lists recycling of
> > WAL files and zero-filling of those files as examples of behavior
> > changes, but it does not say that those are the only changes, or even
> > that they are made in all cases.
>
> So you want two options, like wal_recycle_files and wal_zero_fill, both
> defaulting to true? Is there a reasonable use case for turning one off
> without the other?

I don't know whether there's a use case for that, and that's one of
the things that worries me.  I know, though, that if we have two
parameters, then if there is a use case for it, people will be able to
meet that use case without submitting a patch.  On the other hand, if
we had convincing evidence that those two things should always go
together, that would be fine, too.  But I don't see that anyone has
made an argument that such a thing is necessarily true outside of ZFS.

I actually wouldn't find it very surprising if disabling WAL recycling
is sometimes beneficial even on ext4.  The fact that we haven't found
such cases on this thread doesn't mean they don't exist.  On the other
hand I think the wal_zero_fill behavior is not about performance but
about reliability, so you can't afford to turn that off just because
non-recycling happens to be faster on your machine.

> Alternatively, we could remove the 'for example" wording, which I agree
> is unfortunate.

Yeah.  We seem to sometimes like to avoid documenting specifics for
fear that, should they change, we'd have to update the documentation.
But I think that just makes the documentation less useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Andrew Dunstan


On 3/6/19 10:38 AM, Robert Haas wrote:
> On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  
> wrote:
>> I think the idea of it being a generic tunable for assorted behavior
>> changes, rather than specific to WAL recycling, is a good one.  I'm
>> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
> I *really* dislike this.  For one thing, it means that users don't
> have control over the behaviors individually.  For another, the
> documentation is now quite imprecise about what the option actually
> does, while expecting users to figure out whether those behaviors are
> acceptable or preferable in their environment.  It lists recycling of
> WAL files and zero-filling of those files as examples of behavior
> changes, but it does not say that those are the only changes, or even
> that they are made in all cases.
>

So you want two options, like wal_recycle_files and wal_zero_fill, both
defaulting to true? Is there a reasonable use case for turning one off
without the other?


Alternatively, we could remove the 'for example" wording, which I agree
is unfortunate.


cheers


andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  wrote:
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?

I *really* dislike this.  For one thing, it means that users don't
have control over the behaviors individually.  For another, the
documentation is now quite imprecise about what the option actually
does, while expecting users to figure out whether those behaviors are
acceptable or preferable in their environment.  It lists recycling of
WAL files and zero-filling of those files as examples of behavior
changes, but it does not say that those are the only changes, or even
that they are made in all cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-05 Thread Alvaro Herrera
Jerry,

On 2019-Mar-05, Jerry Jelinek wrote:

> Thanks again for your review. I went through your proposed patch diffs and
> applied most of them to my original changes. I did a few things slightly
> differently since I wanted to keep to 80 columns for the source code,
> but I can revisit that if it is not an issue.

Yeah, in the places where I exceeded the limit, it is largely considered
not an issue.  Brace curling *is* an issue, though :-)  I would prefer
to go with my version (which is largely just stylistic changes over
yours), applying your subsequent changes on top of that.

I can't remember now if I pgindented it or just changed manually ... I
should do that.

> I also cleaned up the confusing wording around "allocating blocks". I
> ran a clean build and make check passes. The new patch is attached.

Cool.  Can you confirm that it still fixes the performance issue for
you?  (It should, since functionally it should be the same thing as
yours.)

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-05 Thread Jerry Jelinek
Alvaro,

Thanks again for your review. I went through your proposed patch diffs and
applied most of them to my original changes. I did a few things slightly
differently since I wanted to keep to 80 columns for the source code,
but I can revisit that if it is not an issue. I also cleaned up the
confusing wording around "allocating blocks". I ran a clean build and make
check passes. The new patch is attached.

Thanks,
Jerry


On Wed, Feb 27, 2019 at 4:12 PM Alvaro Herrera 
wrote:

> On 2019-Feb-05, Jerry Jelinek wrote:
>
> > First, since last fall, we have found another performance problem related
> > to initializing WAL files. I've described this issue in more detail
> below,
> > but in order to handle this new problem, I decided to generalize the
> patch
> > so the tunable refers to running on a Copy-On-Write filesystem instead of
> > just being specific to WAL recycling. Specifically, I renamed the GUC
> > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> > more obvious what is being tuned and will also be more flexible if there
> > are other problems in the future which are related to running on a COW
> > filesystem. I'm happy to choose a different name for the tunable if
> people
> > don't like 'wal_cow_fs'.
>
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
>
> I'm rewording your doc addition a little bit.  Here's my proposal:
>
>
> This parameter should only be set to on when the WAL
> resides on a Copy-On-Write (COW) filesystem.
> Enabling this option adjusts behavior to take advantage of the
> filesystem characteristics (for example, recycling WAL files and
> zero-filling new WAL files are disabled).
>
> This part sounds good enough to me -- further suggestions welcome.
>
> I'm less sure about this phrase:
>
> This setting is only appropriate for filesystems which
> allocate new disk blocks on every write.
>
> Is "... which allocate new disk blocks on every write" a technique
> distinct from CoW itself?  I'm confused as to what it means, or how can
> the user tell whether they are on such a filesystem.
>
> Obviously you're thinking that ZFS is such a filesystem and everybody
> who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
> -- should they turn this option on?  Browsing the wikipedia, I find that
> Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
> I don't think either Btrfs or ReFS are realistic options to put pg_wal
> on, so let's just list the common filesystems for which users are
> supposed to enable this option ... which I think nowadays is just ZFS.
> All in all, I would replace this phrase with something like: "This
> setting should be enabled when pg_wal resides on a ZFS filesystem or
> similar." That should be weasely enough that it's clear that we expect
> users to do the homework when on unusual systems, while actively pointing
> out the most common use case.
>
> > Finally, the patch now includes bypassing the zero-fill for new WAL files
> > when wal_cow_fs is true.
>
> That makes sense.  I think all these benchmarks Tomas Vondra ran are not
> valid anymore ...
>
> The attached v2 has assorted cosmetic cleanups.  If you can validate it,
> I would appreciate it.
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


0001-cow-filesystem.patch
Description: Binary data


Re: patch to allow disable of WAL recycling

2019-03-04 Thread Jerry Jelinek
Alvaro,

Thanks for taking a look at the new patch. I'll update the patch to change
the name of the tunable to match your suggestion and I'll also go through
the cleanup you suggested. Finally, I'll try to rewrite the doc to
eliminate the confusion around the wording about allocating new blocks on
every write. I'll send out a new draft of the patch once all of these
changes are done.

Thanks again,
Jerry


On Wed, Feb 27, 2019 at 4:12 PM Alvaro Herrera 
wrote:

> On 2019-Feb-05, Jerry Jelinek wrote:
>
> > First, since last fall, we have found another performance problem related
> > to initializing WAL files. I've described this issue in more detail
> below,
> > but in order to handle this new problem, I decided to generalize the
> patch
> > so the tunable refers to running on a Copy-On-Write filesystem instead of
> > just being specific to WAL recycling. Specifically, I renamed the GUC
> > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> > more obvious what is being tuned and will also be more flexible if there
> > are other problems in the future which are related to running on a COW
> > filesystem. I'm happy to choose a different name for the tunable if
> people
> > don't like 'wal_cow_fs'.
>
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
>
> I'm rewording your doc addition a little bit.  Here's my proposal:
>
>
> This parameter should only be set to on when the WAL
> resides on a Copy-On-Write (COW) filesystem.
> Enabling this option adjusts behavior to take advantage of the
> filesystem characteristics (for example, recycling WAL files and
> zero-filling new WAL files are disabled).
>
> This part sounds good enough to me -- further suggestions welcome.
>
> I'm less sure about this phrase:
>
> This setting is only appropriate for filesystems which
> allocate new disk blocks on every write.
>
> Is "... which allocate new disk blocks on every write" a technique
> distinct from CoW itself?  I'm confused as to what it means, or how can
> the user tell whether they are on such a filesystem.
>
> Obviously you're thinking that ZFS is such a filesystem and everybody
> who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
> -- should they turn this option on?  Browsing the wikipedia, I find that
> Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
> I don't think either Btrfs or ReFS are realistic options to put pg_wal
> on, so let's just list the common filesystems for which users are
> supposed to enable this option ... which I think nowadays is just ZFS.
> All in all, I would replace this phrase with something like: "This
> setting should be enabled when pg_wal resides on a ZFS filesystem or
> similar." That should be weasely enough that it's clear that we expect
> users to do the homework when on unusual systems, while actively pointing
> out the most common use case.
>
> > Finally, the patch now includes bypassing the zero-fill for new WAL files
> > when wal_cow_fs is true.
>
> That makes sense.  I think all these benchmarks Tomas Vondra ran are not
> valid anymore ...
>
> The attached v2 has assorted cosmetic cleanups.  If you can validate it,
> I would appreciate it.
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2019-02-27 Thread Alvaro Herrera
On 2019-Feb-05, Jerry Jelinek wrote:

> First, since last fall, we have found another performance problem related
> to initializing WAL files. I've described this issue in more detail below,
> but in order to handle this new problem, I decided to generalize the patch
> so the tunable refers to running on a Copy-On-Write filesystem instead of
> just being specific to WAL recycling. Specifically, I renamed the GUC
> tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> more obvious what is being tuned and will also be more flexible if there
> are other problems in the future which are related to running on a COW
> filesystem. I'm happy to choose a different name for the tunable if people
> don't like 'wal_cow_fs'.

I think the idea of it being a generic tunable for assorted behavior
changes, rather than specific to WAL recycling, is a good one.  I'm
unsure about your proposed name -- maybe "wal_cow_filesystem" is better?

I'm rewording your doc addition a little bit.  Here's my proposal:

   
This parameter should only be set to on when the WAL
resides on a Copy-On-Write (COW) filesystem.
Enabling this option adjusts behavior to take advantage of the
filesystem characteristics (for example, recycling WAL files and
zero-filling new WAL files are disabled).

This part sounds good enough to me -- further suggestions welcome.

I'm less sure about this phrase:

This setting is only appropriate for filesystems which
allocate new disk blocks on every write.

Is "... which allocate new disk blocks on every write" a technique
distinct from CoW itself?  I'm confused as to what it means, or how can
the user tell whether they are on such a filesystem.

Obviously you're thinking that ZFS is such a filesystem and everybody
who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
-- should they turn this option on?  Browsing the wikipedia, I find that
Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
I don't think either Btrfs or ReFS are realistic options to put pg_wal
on, so let's just list the common filesystems for which users are
supposed to enable this option ... which I think nowadays is just ZFS.
All in all, I would replace this phrase with something like: "This
setting should be enabled when pg_wal resides on a ZFS filesystem or
similar." That should be weasely enough that it's clear that we expect
users to do the homework when on unusual systems, while actively pointing
out the most common use case.

> Finally, the patch now includes bypassing the zero-fill for new WAL files
> when wal_cow_fs is true.

That makes sense.  I think all these benchmarks Tomas Vondra ran are not
valid anymore ...

The attached v2 has assorted cosmetic cleanups.  If you can validate it,
I would appreciate it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From a4ce02f5cc8ad983c34712083f9cba7fda6d5b38 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 27 Feb 2019 19:41:05 -0300
Subject: [PATCH v2] pg_wal on COW fs

---
 doc/src/sgml/config.sgml  |  20 
 src/backend/access/transam/xlog.c | 101 --
 src/backend/utils/misc/guc.c  |  13 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h |   1 +
 5 files changed, 102 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b2..60a873273aa 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2959,6 +2959,26 @@ include_dir 'conf.d'
   
  
 
+ 
+  wal_cow_filesystem (boolean)
+  
+   wal_cow_filesystem configuration parameter
+  
+  
+  
+   
+This parameter should only be set to on when the WAL
+resides on a Copy-On-Write (COW)
+filesystem.
+Enabling this option adjusts some behavior to take advantage of the
+filesystem characteristics (for example, recycling WAL files and
+zero-filling new WAL files are disabled).
+This setting is only appropriate for filesystems which
+allocate new disk blocks on every write.
+   
+  
+ 
+
  
  
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53ae..1acce1c70d6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -94,6 +94,7 @@ bool		wal_log_hints = false;
 bool		wal_compression = false;
 char	   *wal_consistency_checking_string = NULL;
 bool	   *wal_consistency_checking = NULL;
+bool		wal_cow_filesystem = false;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -3216,6 +3217,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 

Re: patch to allow disable of WAL recycling

2019-02-05 Thread Jerry Jelinek
On Mon, Oct 1, 2018 at 7:16 PM Michael Paquier  wrote:

> On Thu, Sep 13, 2018 at 02:56:42PM -0600, Jerry Jelinek wrote:
> > I'll take a look at that. I had been trying to keep the patch as minimal
> as
> > possible, but I'm happy to work through this.
>
> (Please be careful with top-posting)
>
> Jerry, the last status was from three weeks ago with the patch waiting
> on the author, so I am marking it as returned with feedback.
> --
> Michael
>

I'd like to see if I can get this patch moving forward again. I have made
several changes to the patch since the last time this was discussed.

First, since last fall, we have found another performance problem related
to initializing WAL files. I've described this issue in more detail below,
but in order to handle this new problem, I decided to generalize the patch
so the tunable refers to running on a Copy-On-Write filesystem instead of
just being specific to WAL recycling. Specifically, I renamed the GUC
tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
more obvious what is being tuned and will also be more flexible if there
are other problems in the future which are related to running on a COW
filesystem. I'm happy to choose a different name for the tunable if people
don't like 'wal_cow_fs'.
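
For illustration only, the rename would presumably turn the guc.c entry
quoted elsewhere in this thread (the wal_recycle one) into something along
these lines; the description string and flags here are just a sketch, not
the actual patch:

/* illustrative sketch only -- mirrors the wal_recycle entry, not the patch */
{
{"wal_cow_fs", PGC_SUSET, WAL_SETTINGS,
gettext_noop("WAL resides on a copy-on-write filesystem."),
NULL
},
&wal_cow_fs,
false,
NULL, NULL, NULL
},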

Second, I've modified the WAL recycling code change as requested earlier.

Third, this new patch is rebased onto the current code base.

Finally, the patch now includes bypassing the zero-fill for new WAL files
when wal_cow_fs is true. Hopefully it should be obvious why this is
unnecessary for a COW filesystem, but here is some more information about
how this can cause a performance problem, at least on ZFS. As background,
ZFS internally skips allocating zero-filled blocks, but that happens fairly
late in the filesystem code path, so Postgres still issues all of the
thousands of 8k initialization write system calls for each new WAL file.
ZFS will throttle writes under some circumstances. We found that all of the
writes from XLogFileInit, issued while Postgres is also doing an autovacuum,
will trigger write throttling due to the large amount of write traffic
induced by the autovacuum. This problem occurs even when WAL files are being
recycled; that seems to be because the checkpointer, which is responsible
for WAL file recycling, is taking so long that it falls behind its own
estimate for WAL activity.
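
To make the syscall difference concrete, here is a small self-contained
sketch (illustrative only, not the patch): it contrasts zero-filling a
16MB segment-sized file with 8k writes against simply extending the file
to its final length, the kind of shortcut that is safe on a COW filesystem.
File names and sizes are made up for the example.

/*
 * Illustrative sketch only -- not PostgreSQL code.  A 16MB zero-fill takes
 * 2048 separate 8k write() calls; on a COW filesystem the same effect can
 * be had by just setting the file length, since data blocks are reallocated
 * on every later overwrite anyway.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)	/* 16MB, like a default WAL segment */
#define BLCKSZ   8192			/* 8k blocks, like XLOG_BLCKSZ */

static void
zero_fill(const char *path)
{
	char		buf[BLCKSZ];
	size_t		written;
	int			fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0) { perror("open"); exit(1); }
	memset(buf, 0, sizeof(buf));
	/* thousands of individual 8k writes of zeroes */
	for (written = 0; written < SEG_SIZE; written += BLCKSZ)
		if (write(fd, buf, BLCKSZ) != BLCKSZ) { perror("write"); exit(1); }
	close(fd);
}

static void
extend_only(const char *path)
{
	int			fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0) { perror("open"); exit(1); }
	/* one metadata-only syscall to give the file its final size */
	if (ftruncate(fd, SEG_SIZE) != 0) { perror("ftruncate"); exit(1); }
	close(fd);
}

int
main(void)
{
	zero_fill("seg_zero_filled");
	extend_only("seg_extended");
	return 0;
}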

The revised patch is attached. Please let me know if there are any comments.

Thanks,
Jerry


0001-cow-filesystem.patch
Description: Binary data


Re: patch to allow disable of WAL recycling

2018-10-01 Thread Michael Paquier
On Thu, Sep 13, 2018 at 02:56:42PM -0600, Jerry Jelinek wrote:
> I'll take a look at that. I had been trying to keep the patch as minimal as
> possible, but I'm happy to work through this.

(Please be careful with top-posting)

Jerry, the last status was from three weeks ago with the patch waiting
on the author, so I am marking it as returned with feedback.
--
Michael


signature.asc
Description: PGP signature


Re: patch to allow disable of WAL recycling

2018-09-13 Thread Jerry Jelinek
Hi Peter,

I'll take a look at that. I had been trying to keep the patch as minimal as
possible, but I'm happy to work through this.

Thanks,
Jerry


On Tue, Sep 11, 2018 at 9:39 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 10/09/2018 16:10, Jerry Jelinek wrote:
> > Thank you again for running all of these tests on your various hardware
> > configurations. I was not aware of the convention that the commented
> > example in the config file is expected to match the default value, so I
> > was actually trying to show what to use if you didn't want the default,
> > but I am happy to update the patch so the comment matches the default.
> > Beyond that, I am unsure what the next steps are for this proposal.
>
> Could you organize the code so that the block below
>
> /*
>  * Initialize info about where to try to recycle to.
>  */
>
> isn't executed if recycling is off, since we don't need it.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-09-11 Thread Peter Eisentraut
On 10/09/2018 16:10, Jerry Jelinek wrote:
> Thank you again for running all of these tests on your various hardware
> configurations. I was not aware of the convention that the commented
> example in the config file is expected to match the default value, so I
> was actually trying to show what to use if you didn't want the default,
> but I am happy to update the patch so the comment matches the default.
> Beyond that, I am unsure what the next steps are for this proposal.

Could you organize the code so that the block below

/*
 * Initialize info about where to try to recycle to.
 */

isn't executed if recycling is off, since we don't need it.
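
For reference, the reorganization being asked for would presumably look
roughly like the sketch below. It reuses the identifiers visible in the
xlog.c hunks quoted elsewhere in this thread, is not compilable on its own,
and is not necessarily what gets committed:

	if (wal_recycle)
	{
		/*
		 * Initialize info about where to try to recycle to.  This is only
		 * needed when recycling is enabled; otherwise old segments are
		 * simply unlinked and the recycle target is never consulted.
		 */
		XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
		recycleSegNo = XLOGfileslop(PriorRedoPtr);
	}
	else
		recycleSegNo = 0;	/* unused when recycling is off */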

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-09-10 Thread Jerry Jelinek
Tomas,

Thank you again for running all of these tests on your various hardware
configurations. I was not aware of the convention that the commented
example in the config file is expected to match the default value, so I was
actually trying to show what to use if you didn't want the default, but I
am happy to update the patch so the comment matches the default. Beyond
that, I am unsure what the next steps are for this proposal.

Thanks again,
Jerry


On Tue, Sep 4, 2018 at 12:41 PM, Tomas Vondra 
wrote:

> Hi,
>
> So here is the last set of benchmark results, this time from ext4 on a
> small SATA-based RAID (3 x 7.2k). As before, I'm only attaching PDFs
> with the simple charts, full results are available in the git repository
> [1]. Overall the numbers are rather boring, with almost no difference
> between the two setups.
>
> That being said, I'm not opposed to introducing the GUC. I'm not going
> to pretend my tests represents all possible HW configs and workloads,
> and I have no trouble believing that it may be quite beneficial in some
> cases.
>
> The one comment about the code is that we usually use the actual default
> value in the config sample. But the patch does this:
>
> +#wal_recycle = off # do not recycle WAL files
>
> while the GUC is defined like this:
>
> {
> {"wal_recycle", PGC_SUSET, WAL_SETTINGS,
> gettext_noop("WAL recycling enabled."),
> NULL
> },
> &wal_recycle,
> true,
> NULL, NULL, NULL
> },
>
> So the default is actually "on" which makes the commented-out config
> sample rather confusing.
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-09-04 Thread Tomas Vondra
Hi,

So here is the last set of benchmark results, this time from ext4 on a
small SATA-based RAID (3 x 7.2k). As before, I'm only attaching PDFs
with the simple charts, full results are available in the git repository
[1]. Overall the numbers are rather boring, with almost no difference
between the two setups.

That being said, I'm not opposed to introducing the GUC. I'm not going
to pretend my tests represent all possible HW configs and workloads,
and I have no trouble believing that it may be quite beneficial in some
cases.

The one comment about the code is that we usually use the actual default
value in the config sample. But the patch does this:

+#wal_recycle = off # do not recycle WAL files

while the GUC is defined like this:

{
{"wal_recycle", PGC_SUSET, WAL_SETTINGS,
gettext_noop("WAL recycling enabled."),
NULL
},
&wal_recycle,
true,
NULL, NULL, NULL
},

So the default is actually "on" which makes the commented-out config
sample rather confusing.
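
Presumably, if the default stays "on", the commented sample would need to
read something like this instead (the comment wording here is only
illustrative):

#wal_recycle = on			# whether to recycle WAL files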


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


ext4-sata.pdf
Description: Adobe PDF document


Re: patch to allow disable of WAL recycling

2018-08-31 Thread Tomas Vondra

On 08/27/2018 03:59 AM, Thomas Munro wrote:
> On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> zfs (Linux)
>> ---
>> On scale 200, there's pretty much no difference.
> 
> Speculation: It could be that the dnode and/or indirect blocks that
> point to data blocks are falling out of memory in my test setup[1] but
> not in yours.  I don't know, but I guess those blocks compete with
> regular data blocks in the ARC?  If so it might come down to ARC size
> and the amount of other data churning through it.
> 

Not sure, but I'd expect this to matter on the largest scale. The
machine has 64GB of RAM, and scale 8000 is ~120GB with mostly random
access. I've repeated the tests with scale 6000 to give ZFS a bit more
free space and prevent the issues when there's less than 20% of free
space (results later), but I still don't see any massive improvement.

> Further speculation:  Other filesystems have equivalent data structures,
> but for example XFS jams that data into the inode itself in a compact
> "extent list" format[2] if it can, to avoid the need for an external
> btree.  Hmm, I wonder if that format tends to be used for our segment
> files.  Since cached inodes are reclaimed in a different way than cached
> data pages, I wonder if that makes them more sticky in the face of high
> data churn rates (or I guess less, depending on your Linux
> vfs_cache_pressure setting and number of active files).  I suppose the
> combination of those two things, sticky inodes with internalised extent
> lists, might make it more likely that we can overwrite an old file
> without having to fault anything in.
> 

That's possible. The question is how that affects the cases in which it's
worth disabling the WAL reuse, and why you observe better performance
while I don't.

> One big difference between your test rig and mine is that your Optane
> 900P claims to do about half a million random IOPS.  That is about half
> a million more IOPS than my spinning disks.  (Actually I used my 5400RPM
> steam powered machine deliberately for that test: I disabled fsync so
> that commit rate wouldn't be slowed down but cache misses would be
> obvious.  I guess Joyent's storage is somewhere between these two
> extremes...)
> 

Yeah. It seems very much like a CPU vs. I/O trade-off, where disabling
the WAL reuse saves a bit of I/O but increases the CPU cost. On the SSD
the reduced number of I/O requests is not noticeable, but the extra CPU
cost does matter (thanks to the high tps values). On slower devices the
I/O savings will probably matter more.


>> On scale 2000, the
>> throughput actually decreased a bit, by about 5% - from the chart it
>> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
>> for some reason.
> 
> Huh.
> 

Not sure what's causing this. It's not visible in the SATA results, though.

>> I have no idea what happened at the largest scale (8000) - on master
>> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
>> minutes (but not fully). Without WAL reuse there's no such drop,
>> although there seems to be some degradation after ~220 minutes (i.e. at
> about the same time the master partially recovers). I'm not sure what to
>> think about this, I wonder if it might be caused by almost filling the
>> disk space, or something like that. I'm rerunning this with scale 600.
> 
> There are lots of reports of ZFS performance degrading when free space
> gets below something like 20%.
> 


I've repeated the benchmarks on the Optane SSD with the largest scale
reduced to 6000, to see if it prevents the performance drop with less
than 20% of free space. It apparently does (see zfs2.pdf), although it
does not change the behavior - with WAL reuse disabled it's still a bit
slower.

I've also done the tests with SATA devices (3x 7.2k drives), to see if
it changes the behavior due to I/O vs. CPU trade-off. And it seems to be
the case (see zfs-sata.pdf), to some extent. For the smallest scale
(200) there's not much difference. For medium (2000) there seems to be a
clear improvement, although the behavior is not particularly smooth. On
the largest scale (8000) there seems to be a slight improvement, or at
least it's not slower like before.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


zfs-sata.pdf
Description: Adobe PDF document


zfs2.pdf
Description: Adobe PDF document


Re: patch to allow disable of WAL recycling

2018-08-29 Thread Jerry Jelinek
Tomas,

This is really interesting data, thanks a lot for collecting all of it and
formatting the helpful graphs.

Jerry


On Sun, Aug 26, 2018 at 4:14 PM, Tomas Vondra 
wrote:

>
>
> On 08/25/2018 12:11 AM, Jerry Jelinek wrote:
> > Alvaro,
> >
> > I have previously posted ZFS numbers for SmartOS and FreeBSD to this
> > thread, although not with the exact same benchmark runs that Tomas did.
> >
> > I think the main purpose of running the benchmarks is to demonstrate
> > that there is no significant performance regression with wal recycling
> > disabled on a COW filesystem such as ZFS (which might just be intuitive
> > for a COW filesystem). I've tried to be sure it is clear in the doc
> > change with this patch that this tunable is only applicable to COW
> > filesystems. I do not think the benchmarks will be able to recreate the
> > problematic performance state that was originally described in Dave's
> > email thread here:
> >
> > https://www.postgresql.org/message-id/flat/
> CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#
> cacukrjo7djvub8e2aijoayj8bfkk3xxbtwu3kkaritr67m3...@mail.gmail.com
> >
>
> I agree - the benchmarks are valuable both to show improvement and lack
> of regression. I do have some numbers from LVM/ext4 (with snapshot
> recreated every minute, to trigger COW-like behavior, and without the
> snapshots), and from ZFS (on Linux, using zfsonlinux 0.7.9 on kernel
> 4.17.17).
>
> Attached are PDFs with summary charts, more detailed results are
> available at
>
>   https://bitbucket.org/tvondra/wal-recycle-test-xeon/src/master/
>
>
>
> lvm/ext4 (no snapshots)
> ---
> This pretty much behaves like plain ext4, at least for scales 200 and
> 2000. I don't have results for scale 8000, because the test ran out of
> disk space (I've used part of the device for snapshots, and it was
> enough to trigger the disk space issue).
>
>
> lvm/ext4 (snapshots)
> -
> On the smallest scale (200), there's no visible difference. On scale
> 2000 disabling WAL reuse gives about 10% improvement (21468 vs. 23517
> tps), although it's not obvious from the chart. On the largest scale
> (6000, to prevent the disk space issues) the improvement is about 10%
> again, but it's much clearer.
>
>
> zfs (Linux)
> ---
> On scale 200, there's pretty much no difference. On scale 2000, the
> throughput actually decreased a bit, by about 5% - from the chart it
> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
> for some reason.
>
> I have no idea what happened at the largest scale (8000) - on master
> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
> minutes (but not fully). Without WAL reuse there's no such drop,
> although there seems to be some degradation after ~220 minutes (i.e. at
> about the same time the master partially recovers). I'm not sure what to
> think about this, I wonder if it might be caused by almost filling the
> disk space, or something like that. I'm rerunning this with scale 600.
>
> I'm also not sure how much we can extrapolate this to other ZFS configs
> (I mean, this is a ZFS on a single SSD device, while I'd generally
> expect ZFS on multiple devices, etc.).
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-08-26 Thread Thomas Munro
On Mon, Aug 27, 2018 at 10:14 AM Tomas Vondra 
wrote:
> zfs (Linux)
> ---
> On scale 200, there's pretty much no difference.

Speculation: It could be that the dnode and/or indirect blocks that point
to data blocks are falling out of memory in my test setup[1] but not in
yours.  I don't know, but I guess those blocks compete with regular data
blocks in the ARC?  If so it might come down to ARC size and the amount of
other data churning through it.

Further speculation:  Other filesystems have equivalent data structures,
but for example XFS jams that data into the inode itself in a compact
"extent list" format[2] if it can, to avoid the need for an external
btree.  Hmm, I wonder if that format tends to be used for our segment
files.  Since cached inodes are reclaimed in a different way than cached
data pages, I wonder if that makes them more sticky in the face of high
data churn rates (or I guess less, depending on your Linux
vfs_cache_pressure setting and number of active files).  I suppose the
combination of those two things, sticky inodes with internalised extent
lists, might make it more likely that we can overwrite an old file without
having to fault anything in.

One big difference between your test rig and mine is that your Optane 900P
claims to do about half a million random IOPS.  That is about half a
million more IOPS than my spinning disks.  (Actually I used my 5400RPM
steam powered machine deliberately for that test: I disabled fsync so that
commit rate wouldn't be slowed down but cache misses would be obvious.  I
guess Joyent's storage is somewhere between these two extremes...)

> On scale 2000, the
> throughput actually decreased a bit, by about 5% - from the chart it
> seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
> for some reason.

Huh.

> I have no idea what happened at the largest scale (8000) - on master
> there's a huge drop after ~120 minutes, which somewhat recovers at ~220
> minutes (but not fully). Without WAL reuse there's no such drop,
> although there seems to be some degradation after ~220 minutes (i.e. at
> about the same time the master partially recovers. I'm not sure what to
> think about this, I wonder if it might be caused by almost filling the
> disk space, or something like that. I'm rerunning this with scale 600.

There are lots of reports of ZFS performance degrading when free space gets
below something like 20%.

[1]
https://www.postgresql.org/message-id/CAEepm%3D2pypg3nGgBDYyG0wuCH%2BxTWsAJddvJUGBNsDiyMhcXaQ%40mail.gmail.com
[2]
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure/tmp/en-US/html/Data_Extents.html

-- 
Thomas Munro
http://www.enterprisedb.com


Re: patch to allow disable of WAL recycling

2018-08-26 Thread Tomas Vondra


On 08/25/2018 12:11 AM, Jerry Jelinek wrote:
> Alvaro,
> 
> I have previously posted ZFS numbers for SmartOS and FreeBSD to this
> thread, although not with the exact same benchmark runs that Tomas did.
> 
> I think the main purpose of running the benchmarks is to demonstrate
> that there is no significant performance regression with wal recycling
> disabled on a COW filesystem such as ZFS (which might just be intuitive
> for a COW filesystem). I've tried to be sure it is clear in the doc
> change with this patch that this tunable is only applicable to COW
> filesystems. I do not think the benchmarks will be able to recreate the
> problematic performance state that was originally described in Dave's
> email thread here:
> 
> https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#cacukrjo7djvub8e2aijoayj8bfkk3xxbtwu3kkaritr67m3...@mail.gmail.com
> 

I agree - the benchmarks are valuable both to show improvement and lack
of regression. I do have some numbers from LVM/ext4 (with snapshot
recreated every minute, to trigger COW-like behavior, and without the
snapshots), and from ZFS (on Linux, using zfsonlinux 0.7.9 on kernel
4.17.17).

Attached are PDFs with summary charts, more detailed results are
available at

  https://bitbucket.org/tvondra/wal-recycle-test-xeon/src/master/



lvm/ext4 (no snapshots)
---
This pretty much behaves like plain ext4, at least for scales 200 and
2000. I don't have results for scale 8000, because the test ran out of
disk space (I've used part of the device for snapshots, and it was
enough to trigger the disk space issue).


lvm/ext4 (snapshots)
-
On the smallest scale (200), there's no visible difference. On scale
2000 disabling WAL reuse gives about 10% improvement (21468 vs. 23517
tps), although it's not obvious from the chart. On the largest scale
(6000, to prevent the disk space issues) the improvement is about 10%
again, but it's much clearer.


zfs (Linux)
---
On scale 200, there's pretty much no difference. On scale 2000, the
throughput actually decreased a bit, by about 5% - from the chart it
seems disabling the WAL reuse somewhat amplifies impact of checkpoints,
for some reason.

I have no idea what happened at the largest scale (8000) - on master
there's a huge drop after ~120 minutes, which somewhat recovers at ~220
minutes (but not fully). Without WAL reuse there's no such drop,
although there seems to be some degradation after ~220 minutes (i.e. at
about the same time the master partially recovers). I'm not sure what to
think about this, I wonder if it might be caused by almost filling the
disk space, or something like that. I'm rerunning this with scale 600.

I'm also not sure how much we can extrapolate this to other ZFS configs
(I mean, this is a ZFS on a single SSD device, while I'd generally
expect ZFS on multiple devices, etc.).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


lvm-ext4-snapshots.pdf
Description: Adobe PDF document


zfs.pdf
Description: Adobe PDF document


lvm-ext4.pdf
Description: Adobe PDF document


Re: patch to allow disable of WAL recycling

2018-08-24 Thread Jerry Jelinek
Alvaro,

I have previously posted ZFS numbers for SmartOS and FreeBSD to this
thread, although not with the exact same benchmark runs that Tomas did.

I think the main purpose of running the benchmarks is to demonstrate that
there is no significant performance regression with wal recycling disabled
on a COW filesystem such as ZFS (which might just be intuitive for a COW
filesystem). I've tried to be sure it is clear in the doc change with this
patch that this tunable is only applicable to COW filesystems. I do not
think the benchmarks will be able to recreate the problematic performance
state that was originally described in Dave's email thread here:

https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#cacukrjo7djvub8e2aijoayj8bfkk3xxbtwu3kkaritr67m3...@mail.gmail.com

Thanks,
Jerry


On Wed, Aug 22, 2018 at 8:41 AM, Alvaro Herrera 
wrote:

> On 2018-Aug-22, Andres Freund wrote:
>
> > On 2018-08-22 11:06:17 -0300, Alvaro Herrera wrote:
>
> > > I suppose that the use case that was initially proposed (ZFS) has not
> > > yet been tested so we shouldn't reject this patch immediately, but
> > > perhaps what Joyent people should be doing now is running Tomas' test
> > > script on ZFS and see what the results look like.
> >
> > IDK, I would see it less negatively. Yes, we should put a BIG FAT
> > warning to never use this on non COW filesystems. And IMO ZFS (and also
> > btrfs) sucks badly here, even though they really shouldn't. But given
> > the positive impact for zfs & btrfs, and the low code complexity, I
> > think it's not insane to provide this tunable.
>
> Yeah, but let's see some ZFS numbers first :-)
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-08-22 Thread Alvaro Herrera
On 2018-Aug-22, Andres Freund wrote:

> On 2018-08-22 11:06:17 -0300, Alvaro Herrera wrote:

> > I suppose that the use case that was initially proposed (ZFS) has not
> > yet been tested so we shouldn't reject this patch immediately, but
> > perhaps what Joyent people should be doing now is running Tomas' test
> > script on ZFS and see what the results look like.
> 
> IDK, I would see it less negatively. Yes, we should put a BIG FAT
> warning to never use this on non COW filesystems. And IMO ZFS (and also
> btrfs) sucks badly here, even though they really shouldn't. But given
> the positive impact for zfs & btrfs, and the low code complexity, I
> think it's not insane to provide this tunable.

Yeah, but let's see some ZFS numbers first :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-08-22 Thread Andres Freund
On 2018-08-22 11:06:17 -0300, Alvaro Herrera wrote:
> On 2018-Aug-21, Jerry Jelinek wrote:
> 
> > Tomas,
> > 
> > Thanks for doing all of this testing. Your testing and results are much
> > more detailed than anything I did. Please let me know if there is any
> > follow-up that I should attempt.
> 
> Either I completely misread these charts, or there is practically no
> point to disabling WAL recycling (except on btrfs, but then nobody in
> their right minds would use it for Postgres given these numbers anyway).
> I suppose that the use case that was initially proposed (ZFS) has not
> yet been tested so we shouldn't reject this patch immediately, but
> perhaps what Joyent people should be doing now is running Tomas' test
> script on ZFS and see what the results look like.

IDK, I would see it less negatively. Yes, we should put a BIG FAT
warning to never use this on non COW filesystems. And IMO ZFS (and also
btrfs) sucks badly here, even though they really shouldn't. But given
the positive impact for zfs & btrfs, and the low code complexity, I
think it's not insane to provide this tunable.

Greetings,

Andres Freund



Re: patch to allow disable of WAL recycling

2018-08-22 Thread Alvaro Herrera
On 2018-Aug-21, Jerry Jelinek wrote:

> Tomas,
> 
> Thanks for doing all of this testing. Your testing and results are much
> more detailed than anything I did. Please let me know if there is any
> follow-up that I should attempt.

Either I completely misread these charts, or there is practically no
point to disabling WAL recycling (except on btrfs, but then nobody in
their right minds would use it for Postgres given these numbers anyway).
I suppose that the use case that was initially proposed (ZFS) has not
yet been tested so we shouldn't reject this patch immediately, but
perhaps what Joyent people should be doing now is running Tomas' test
script on ZFS and see what the results look like.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-08-21 Thread Jerry Jelinek
Tomas,

Thanks for doing all of this testing. Your testing and results are much
more detailed than anything I did. Please let me know if there is any
follow-up that I should attempt.

Thanks again,
Jerry


On Thu, Aug 16, 2018 at 3:43 PM, Tomas Vondra 
wrote:

> On 07/22/2018 10:50 PM, Tomas Vondra wrote:
> > On 07/21/2018 12:04 AM, Jerry Jelinek wrote:
> >> Thomas,
> >>
> >> Thanks for your offer to run some tests on different OSes and
> >> filesystems that you have. Anything you can provide here would be much
> >> appreciated. I don't have anything other than our native SmartOS/ZFS
> >> based configurations, but I might be able to setup some VMs and get
> >> results that way. I should be able to setup a VM running FreeBSD. If you
> >> have a chance to collect some data, just let me know the exact
> >> benchmarks you ran and I'll run the same things on the FreeBSD VM.
> >> Obviously you're under no obligation to do any of this, so if you don't
> >> have time, just let me know and I'll see what I can do on my own.
> >>
> >
> > Sounds good. I plan to start with the testing in a couple of days - the
> > boxes are currently running some other tests at the moment. Once I have
> > some numbers I'll share them here, along with the test scripts etc.
> >
>
> I do have initial results from one of the boxes. It's not complete, and
> further tests are still running, but I suppose it's worth sharing what I
> have at this point.
>
> As usual, the full data and ugly scripts are available in a git repo:
>
>https://bitbucket.org/tvondra/wal-recycle-test-xeon/src/master/
>
> Considering the WAL recycling only kicks in after a while, I've decided
> to do a single long (6-hour) pgbench run for each scale, instead of the
> usual "multiple short runs" approach.
>
> So far I've tried on these filesystems:
>
> * btrfs
> * ext4 / delayed allocation enabled (default)
> * ext4 / delayed allocation disabled
> * xfs
>
> The machine has 64GB of RAM, so I've chosen scales 200 (fits into
> shared_buffers), 2000 (in RAM) and 8000 (exceeds RAM), to trigger
> different I/O patterns. I've used the per-second aggregated logging,
> with the raw data available in the git repo. The charts attached to this
> message are per-minute tps averages, to demonstrate the overall impact
> on throughtput which would otherwise be hidden in jitter.
>
> All these tests are done on Optane 900P 280GB SSD, which is pretty nice
> storage but the limited size is somewhat tight for the scale 8000 test.
>
> For the traditional filesystems (ext4, xfs) the WAL recycling seems to
> be clearly beneficial - for the in-memory datasets the difference seems
> to be negligible, but for the largest scale it gives maybe +20% benefit.
> The delalloc/nodellalloc on ext4 makes pretty much no difference here,
> and both xfs and ext4 perform almost exactly the same here - the main
> difference seems to be that on ext4 the largest scale ran out of disk
> space while xfs managed to keep running. Clearly there's a difference in
> free space management, but that's unrelated to this patch.
>
> On BTRFS, the results on the two smaller scales show about the same
> behavior (minimal difference between WAL recycling and not recycling),
> except that the throughput is perhaps 25-50% of ext4/xfs. Fair enough, a
> different type of filesystem, and LVM snapshots would likely have the
> same impact. But no clear win with recycling disabled. On the largest
> scale, the device ran out of space after 10-20 minutes, which makes it
> impossible to make any reasonable conclusions :-(
>
>
> I plan to do some more tests with zfsonlinux, and LVM with snapshots. I
> wonder if those will show some benefit of disabling the WAL recycling.
> And then, if time permits, I'll redo some of those tests with a small
> SATA-based RAID array (aka spinning rust). Mostly out of curiosity.
>
> FWIW I've planned to do these tests on another machine, but I ran
> into some strange data corruption issues on it, and I've spent quite a
> bit of time investigating that and trying to reproduce it, which delayed
> these tests a bit. And of course, once I added elog(PANIC) to the right
> place it stopped happening :-/
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-08-03 Thread Jerry Jelinek
After I posted my previous FreeBSD results, I had a private request to run
the test for a longer period and on a larger VM.

I setup a new 8 CPU, 16 GB VM. This is the largest I can create and is on a
different machine from the previous VM, so the results cannot be directly
compared. I reran the same pgbench run but for an hour. Here are the
aggregated results

recycling on
avg tps: 470.3
avg lat: 8.5 ms

recycling off
avg tps: 472.4
avg lat: 8.5 ms

I think this still shows that there is no regression on FreeBSD/ZFS with
WAL recycling off.

Thanks,
Jerry


On Fri, Jul 27, 2018 at 1:32 PM, Jerry Jelinek 
wrote:

> I've setup FreeBSD 11.1 in a VM and I setup a ZFS filesystem to use for
> the Postgres DB. I ran the following simple benchmark.
>
> pgbench -M prepared -c 4 -j 4 -T 60 postgres
>
> Since it is in a VM and I can't control what else might be happening on
> the box, I ran this several times at different times of the day and
> averaged the results. Here is the average TPS and latency with WAL
> recycling on (the default) and off.
>
> recycling on
> avg tps: 407.4
> avg lat: 9.8 ms
>
> recycling off
> avg tps: 425.7
> avg lat: 9.4 ms
>
> Given my uncertainty about what else is running on the box, I think it is
> reasonable to say these are essentially equal, but I can collect more data
> across more different times if necessary. I'm also happy to collect more
> data if people have suggestions for different parameters on the pgbench run.
>
> Thanks,
> Jerry
>
>
> On Fri, Jul 20, 2018 at 4:04 PM, Jerry Jelinek 
> wrote:
>
>> Thomas,
>>
>> Thanks for your offer to run some tests on different OSes and filesystems
>> that you have. Anything you can provide here would be much appreciated. I
>> don't have anything other than our native SmartOS/ZFS based configurations,
>> but I might be able to setup some VMs and get results that way. I should be
>> able to setup a VM running FreeBSD. If you have a chance to collect some
>> data, just let me know the exact benchmarks you ran and I'll run the same
>> things on the FreeBSD VM. Obviously you're under no obligation to do any of
>> this, so if you don't have time, just let me know and I'll see what I can
>> do on my own.
>>
>> Thanks again,
>> Jerry
>>
>>
>> On Tue, Jul 17, 2018 at 2:47 PM, Tomas Vondra <
>> tomas.von...@2ndquadrant.com> wrote:
>>
>>> On 07/17/2018 09:12 PM, Peter Eisentraut wrote:
>>> > On 17.07.18 00:04, Jerry Jelinek wrote:
>>> >> There have been quite a few comments since last week, so at this
>>> point I
>>> >> am uncertain how to proceed with this change. I don't think I saw
>>> >> anything concrete in the recent emails that I can act upon.
>>> >
>>> > The outcome of this could be multiple orthogonal patches that affect
>>> the
>>> > WAL file allocation behavior somehow.  I think your original idea of
>>> > skipping recycling on a COW file system is sound.  But I would rather
>>> > frame the option as "preallocating files is obviously useless on a COW
>>> > file system" rather than "this will make things mysteriously faster
>>> with
>>> > uncertain trade-offs".
>>> >
>>>
>>> Makes sense, I guess. But I think many claims made in this thread are
>>> mostly just assumptions at this point, based on our beliefs how CoW or
>>> non-CoW filesystems work. The results from ZFS (showing positive impact)
>>> are an exception, but that's about it. I'm sure those claims are based
>>> on real-world experience and are likely true, but it'd be good to have
>>> data from a wider range of filesystems / configurations etc. so that we
>>> can give better recommendations to users, for example.
>>>
>>> That's something I can help with, assuming we agree on what tests we
>>> want to do. I'd say the usual battery of write-only pgbench tests with
>>> different scales (fits into s_b, fits into RAM, larger than RAM) on
>>> common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and
>>> different types of storage would be enough. I don't have any freebsd box
>>> available, unfortunately.
>>>
>>>
>>> regards
>>>
>>> --
>>> Tomas Vondra  http://www.2ndQuadrant.com
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>>
>>
>>
>


Re: patch to allow disable of WAL recycling

2018-07-31 Thread Robert Haas
On Mon, Jul 30, 2018 at 4:43 AM, Peter Eisentraut
 wrote:
> On 19/07/2018 05:59, Kyotaro HORIGUCHI wrote:
>> My result is that we cannot disable recycling perfectly just by
>> setting min/max_wal_size.
>
> Maybe the behavior of min_wal_size should be rethought?  Elsewhere in
> this thread, there was also a complaint that max_wal_size isn't actually
> a max.  It seems like there might be some interest in making these
> settings more accurate.
>
> I mean, what is the point of the min_wal_size setting if not controlling
> this very thing?

See the logic in XLOGfileslop().  The number of segments that the
server recycles (by renaming) after a checkpoint is bounded to not
less than min_wal_size and not more than max_wal_size, but the actual
value fluctuates between those two extremes based on the number of
segments the server believes will be required before the next
checkpoint completes.  Logically, min_wal_size = 0 would mean that the
number of recycled segments could be as small as zero.  However, what
is being requested here is to force the number of recycled segments to
never be larger than zero, which is different.
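
Restating that bound as a tiny self-contained example (illustrative only;
the real computation lives in XLOGfileslop() and works in terms of segment
numbers derived from min_wal_size and max_wal_size, and the segment counts
below are made up):

#include <stdio.h>

/* Clamp the per-checkpoint estimate between the min/max_wal_size bounds. */
static long
clamp_recycle_target(long estimated_segs, long min_segs, long max_segs)
{
	long		target = estimated_segs;

	if (target < min_segs)		/* never recycle fewer than min_wal_size */
		target = min_segs;
	if (target > max_segs)		/* never recycle more than max_wal_size */
		target = max_segs;
	return target;
}

int
main(void)
{
	printf("%ld\n", clamp_recycle_target(20, 5, 64));	/* 20: estimate wins */
	printf("%ld\n", clamp_recycle_target(2, 5, 64));	/* 5: min_wal_size floor */
	printf("%ld\n", clamp_recycle_target(90, 5, 64));	/* 64: max_wal_size cap */
	return 0;
}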

As far as the logic in XLOGfileslop() is concerned, that would
correspond to max_wal_size = 0, not min_wal_size = 0.  However, that's
an impractical setting because max_wal_size is also used in other
places, like CalculateCheckpointSegments().

In other words, min_wal_size = 0 logically means that we MIGHT NOT
recycle any WAL segments, but the desired behavior here is that we DO
NOT recycle any WAL segments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2018-07-30 Thread Kyotaro HORIGUCHI
At Mon, 30 Jul 2018 10:43:20 +0200, Peter Eisentraut 
 wrote in 

> On 19/07/2018 05:59, Kyotaro HORIGUCHI wrote:
> > My result is that we cannot disable recycling perfectly just by
> > setting min/max_wal_size.
> 
> Maybe the behavior of min_wal_size should be rethought?  Elsewhere in
> this thread, there was also a complaint that max_wal_size isn't actually
> a max.  It seems like there might be some interest in making these
> settings more accurate.
> 
> I mean, what is the point of the min_wal_size setting if not controlling
> this very thing?

Sorry, I have forgotten to mention it.

The definition of the variable is "we won't reduce the segments to
fewer than this many segments (but specified in MB) even if we don't
need that many segments until the next checkpoint". I couldn't find a
proper value for it to express the behavior "I don't want to keep
(recycle) preallocated segments even for the expected checkpoint
interval". In short, it didn't seem intuitive at the time.

Reconsidering the candidate values:

0 seems to keep segments for the next checkpoint interval.

-1 seems to just disable segment reduction (which is the same as
setting it to the same value as max_wal_size?)

Maybe we could use -1 for this purpose.

guc.c
| {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
| gettext_noop("Sets the minimum size to shrink the WAL to."),
+ gettext_noop("-1 turns off WAL recycling."),

# This seems somewhat.. out-of-the-blue?

wal-configuration.html

| The number of WAL segment files in pg_wal directory depends on
| min_wal_size, max_wal_size and the amount of WAL generated in
| previous checkpoint cycles. When old log segment files are no
| longer needed, they are removed or recycled (that is, renamed
| to become future segments in the numbered sequence). If, due to
...
| extent. min_wal_size puts a minimum on the amount of WAL files
| recycled for future usage; that much WAL is always recycled for
| future use, even if the system is idle and the WAL usage
| estimate suggests that little WAL is needed.
+ If you don't need the recycling feature, setting min_wal_size to -1
+ disables the feature and WAL files are created on demand.

# I'm not sure this makes sense for readers.

Besides the above, I suppose that this also turns off
preallocation of a whole segment at the first use, which could
cause problems here and there...

If we allowed a string value like 'no-prealloc' for min_wal_size,
it might be more comprehensible?

# Sorry for the scattered thoughts

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: patch to allow disable of WAL recycling

2018-07-30 Thread Peter Eisentraut
On 19/07/2018 05:59, Kyotaro HORIGUCHI wrote:
> My result is that we cannot disable recycling perfectly just by
> setting min/max_wal_size.

Maybe the behavior of min_wal_size should be rethought?  Elsewhere in
this thread, there was also a complaint that max_wal_size isn't actually
a max.  It seems like there might be some interest in making these
settings more accurate.

I mean, what is the point of the min_wal_size setting if not controlling
this very thing?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-27 Thread Jerry Jelinek
I've setup FreeBSD 11.1 in a VM and I setup a ZFS filesystem to use for the
Postgres DB. I ran the following simple benchmark.

pgbench -M prepared -c 4 -j 4 -T 60 postgres

Since it is in a VM and I can't control what else might be happening on the
box, I ran this several times at different times of the day and averaged
the results. Here is the average TPS and latency with WAL recycling on (the
default) and off.

recycling on
avg tps: 407.4
avg lat: 9.8 ms

recycling off
avg tps: 425.7
avg lat: 9.4 ms

Given my uncertainty about what else is running on the box, I think it is
reasonable to say these are essentially equal, but I can collect more data
across more different times if necessary. I'm also happy to collect more
data if people have suggestions for different parameters on the pgbench run.

Thanks,
Jerry


On Fri, Jul 20, 2018 at 4:04 PM, Jerry Jelinek 
wrote:

> Thomas,
>
> Thanks for your offer to run some tests on different OSes and filesystems
> that you have. Anything you can provide here would be much appreciated. I
> don't have anything other than our native SmartOS/ZFS based configurations,
> but I might be able to setup some VMs and get results that way. I should be
> able to setup a VM running FreeBSD. If you have a chance to collect some
> data, just let me know the exact benchmarks you ran and I'll run the same
> things on the FreeBSD VM. Obviously you're under no obligation to do any of
> this, so if you don't have time, just let me know and I'll see what I can
> do on my own.
>
> Thanks again,
> Jerry
>
>
> On Tue, Jul 17, 2018 at 2:47 PM, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>> On 07/17/2018 09:12 PM, Peter Eisentraut wrote:
>> > On 17.07.18 00:04, Jerry Jelinek wrote:
>> >> There have been quite a few comments since last week, so at this point
>> I
>> >> am uncertain how to proceed with this change. I don't think I saw
>> >> anything concrete in the recent emails that I can act upon.
>> >
>> > The outcome of this could be multiple orthogonal patches that affect the
>> > WAL file allocation behavior somehow.  I think your original idea of
>> > skipping recycling on a COW file system is sound.  But I would rather
>> > frame the option as "preallocating files is obviously useless on a COW
>> > file system" rather than "this will make things mysteriously faster with
>> > uncertain trade-offs".
>> >
>>
>> Makes sense, I guess. But I think many claims made in this thread are
>> mostly just assumptions at this point, based on our beliefs how CoW or
>> non-CoW filesystems work. The results from ZFS (showing positive impact)
>> are an exception, but that's about it. I'm sure those claims are based
>> on real-world experience and are likely true, but it'd be good to have
>> data from a wider range of filesystems / configurations etc. so that we
>> can give better recommendations to users, for example.
>>
>> That's something I can help with, assuming we agree on what tests we
>> want to do. I'd say the usual battery of write-only pgbench tests with
>> different scales (fits into s_b, fits into RAM, larger than RAM) on
>> common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and
>> different types of storage would be enough. I don't have any freebsd box
>> available, unfortunately.
>>
>>
>> regards
>>
>> --
>> Tomas Vondra  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>
>


Re: patch to allow disable of WAL recycling

2018-07-22 Thread Tomas Vondra
On 07/21/2018 12:04 AM, Jerry Jelinek wrote:
> Thomas,
> 
> Thanks for your offer to run some tests on different OSes and
> filesystems that you have. Anything you can provide here would be much
> appreciated. I don't have anything other than our native SmartOS/ZFS
> based configurations, but I might be able to setup some VMs and get
> results that way. I should be able to setup a VM running FreeBSD. If you
> have a chance to collect some data, just let me know the exact
> benchmarks you ran and I'll run the same things on the FreeBSD VM.
> Obviously you're under no obligation to do any of this, so if you don't
> have time, just let me know and I'll see what I can do on my own.
> 

Sounds good. I plan to start with the testing in a couple of days - the
boxes are currently running some other tests at the moment. Once I have
some numbers I'll share them here, along with the test scripts etc.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-20 Thread Jerry Jelinek
 Thomas,

Thanks for your offer to run some tests on different OSes and filesystems
that you have. Anything you can provide here would be much appreciated. I
don't have anything other than our native SmartOS/ZFS based configurations,
but I might be able to setup some VMs and get results that way. I should be
able to setup a VM running FreeBSD. If you have a chance to collect some
data, just let me know the exact benchmarks you ran and I'll run the same
things on the FreeBSD VM. Obviously you're under no obligation to do any of
this, so if you don't have time, just let me know and I'll see what I can
do on my own.

Thanks again,
Jerry


On Tue, Jul 17, 2018 at 2:47 PM, Tomas Vondra 
wrote:

> On 07/17/2018 09:12 PM, Peter Eisentraut wrote:
> > On 17.07.18 00:04, Jerry Jelinek wrote:
> >> There have been quite a few comments since last week, so at this point I
> >> am uncertain how to proceed with this change. I don't think I saw
> >> anything concrete in the recent emails that I can act upon.
> >
> > The outcome of this could be multiple orthogonal patches that affect the
> > WAL file allocation behavior somehow.  I think your original idea of
> > skipping recycling on a COW file system is sound.  But I would rather
> > frame the option as "preallocating files is obviously useless on a COW
> > file system" rather than "this will make things mysteriously faster with
> > uncertain trade-offs".
> >
>
> Makes sense, I guess. But I think many claims made in this thread are
> mostly just assumptions at this point, based on our beliefs how CoW or
> non-CoW filesystems work. The results from ZFS (showing positive impact)
> are an exception, but that's about it. I'm sure those claims are based
> on real-world experience and are likely true, but it'd be good to have
> data from a wider range of filesystems / configurations etc. so that we
> can give better recommendations to users, for example.
>
> That's something I can help with, assuming we agree on what tests we
> want to do. I'd say the usual battery of write-only pgbench tests with
> different scales (fits into s_b, fits into RAM, larger than RAM) on
> common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and
> different types of storage would be enough. I don't have any freebsd box
> available, unfortunately.
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-07-20 Thread Jerry Jelinek
Peter,

Thanks for your feedback. I'm happy to change the name of the tunable or to
update the man page in any way.  I have already posted an updated patch
with changes to the man page which I think may address your concerns there,
but please let me know if that still needs more work. It looks like Kyotaro
already did some exploration, and tuning the min/max for the WAL size won't
solve this problem.  Just let me know if there is anything else here which
you think I should look into.

Thanks again,
Jerry


On Tue, Jul 17, 2018 at 1:12 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 17.07.18 00:04, Jerry Jelinek wrote:
> > There have been quite a few comments since last week, so at this point I
> > am uncertain how to proceed with this change. I don't think I saw
> > anything concrete in the recent emails that I can act upon.
>
> The outcome of this could be multiple orthogonal patches that affect the
> WAL file allocation behavior somehow.  I think your original idea of
> skipping recycling on a COW file system is sound.  But I would rather
> frame the option as "preallocating files is obviously useless on a COW
> file system" rather than "this will make things mysteriously faster with
> uncertain trade-offs".
>
> The actual implementation could use another round of consideration.  I
> wonder how this should interact with min_wal_size.  Wouldn't
> min_wal_size = 0 already do what we need (if you could set it to 0,
> which is currently not possible)?  Should the new setting be something
> like min_wal_size = -1?  Or even if it's a new setting, it might be
> better to act on it in XLOGfileslop(), so these things are kept closer
> together.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-07-20 Thread Jerry Jelinek
Hi Robert,

I'm new to the Postgresql community, so I'm not familiar with how patches
are accepted here. Thanks for your detailed explanation. I do want to keep
pushing on this. I'll respond separately to Peter and to Tomas regarding
their emails.

Thanks again,
Jerry


On Wed, Jul 18, 2018 at 1:43 PM, Robert Haas  wrote:

> On Wed, Jul 18, 2018 at 3:22 PM, Jerry Jelinek 
> wrote:
> > I've gotten a wide variety of feedback on the proposed patch. The
> comments
> > range from rough approval through various discussion about alternative
> > solutions. At this point I am unsure if this patch is rejected or if it
> > would be accepted once I had the updated man page changes that were
> > discussed last week.
> >
> > I have attached an updated patch which does incorporate man page
> changes, in
> > case that is the blocker. However, if this patch is simply rejected, I'd
> > appreciate it if I could get a definitive statement to that effect.
>
> 1. There's no such thing as a definitive statement of the community's
> opinion, generally speaking, because as a rule the community consists
> of many different people who rarely all agree on anything but the most
> uncontroversial of topics.  We could probably all agree that the sun
> rises in the East, or at least has historically done so, and that,
> say, typos are bad.
>
> 2. You can't really expect somebody else to do the work of forging
> consensus on your behalf.  Sure, that may happen, if somebody else
> takes an interest in the problem.  But, really, since you started the
> thread, most likely you're the one most interested.  If you're not
> willing to take the time to discuss the issues with the individual
> people who have responded, promote your own views, investigate
> proposed alternatives, etc., it's unlikely anybody else is going to do
> it.
>
> 3. It's not unusual for a patch of this complexity to take months to
> get committed; it's only been a few weeks.  If it's important to you,
> don't give up now.
>
> It seems to me that there are several people in favor of this patch,
> some others with questions and concerns, and pretty much nobody
> adamantly opposed.  So I would guess that this has pretty good odds in
> the long run.  But you're not going to get anywhere by pushing for a
> commit-or-reject-right-now.  It's been less than 24 hours since Tomas
> proposed to do further benchmarking if we could agree on what to test
> (you haven't made any suggestions in response) and it's also been less
> than 24 hours since Peter and I both sent emails about whether it
> should be controlled by its own GUC or in some other way.  The
> discussion is very much actively continuing.  It's too soon to decide
> on the conclusion, but it would be a good idea for you to keep
> participating.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: patch to allow disable of WAL recycling

2018-07-18 Thread Kyotaro HORIGUCHI
At Thu, 19 Jul 2018 12:59:26 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180719.125926.257896670.horiguchi.kyot...@lab.ntt.co.jp>
> At Thu, 19 Jul 2018 12:37:26 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20180719.123726.00899102.horiguchi.kyot...@lab.ntt.co.jp>
> > At Tue, 17 Jul 2018 21:01:03 -0400, Robert Haas  
> > wrote in 
> > 
> > > On Tue, Jul 17, 2018 at 3:12 PM, Peter Eisentraut
> > >  wrote:
> > > > The actual implementation could use another round of consideration.  I
> > > > wonder how this should interact with min_wal_size.  Wouldn't
> > > > min_wal_size = 0 already do what we need (if you could set it to 0,
> > > > which is currently not possible)?
> > > 
> > > Hmm, would that actually disable recycling, or just make it happen only 
> > > rarely?
> > 
> > It doesn't. Instead setting max_wal_size smaller than checkpoint
> > interval should do that.
> 
> And that's wrong. It makes checkpoint unreasonably frequent.
> 
> My result is that we cannot disable recycling perfectly just by
> setting min/max_wal_size.

s/result/conclusion/;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: patch to allow disable of WAL recycling

2018-07-18 Thread Kyotaro HORIGUCHI
At Thu, 19 Jul 2018 12:37:26 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180719.123726.00899102.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 17 Jul 2018 21:01:03 -0400, Robert Haas  wrote 
> in 
> > On Tue, Jul 17, 2018 at 3:12 PM, Peter Eisentraut
> >  wrote:
> > > The actual implementation could use another round of consideration.  I
> > > wonder how this should interact with min_wal_size.  Wouldn't
> > > min_wal_size = 0 already do what we need (if you could set it to 0,
> > > which is currently not possible)?
> > 
> > Hmm, would that actually disable recycling, or just make it happen only 
> > rarely?
> 
> It doesn't. Instead setting max_wal_size smaller than checkpoint
> interval should do that.

And that's wrong. It makes checkpoint unreasonably frequent.

My result is that we cannot disable recycling perfectly just by
setting min/max_wal_size.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: patch to allow disable of WAL recycling

2018-07-18 Thread Kyotaro HORIGUCHI
At Thu, 19 Jul 2018 12:37:26 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180719.123726.00899102.horiguchi.kyot...@lab.ntt.co.jp>
> While considering this, I found a bug in 4b0d28de06, which
> removed prior checkpoint from control file. It actually trims the
> segments before the last checkpoint's redo segment but recycling
> is still considered based on the *previous* checkpoint. As a
> result, min_wal_size doesn't work as documented.  Specifically, setting
> min/max_wal_size to 48MB and advance four or more segments then
> two checkpoints leaves just one segment, which is less than
> min_wal_size.
> 
> The attached patch fixes that. One arguable point on this would
> be the removal of the behavior when RemoveXLogFile(name,
> InvalidXLogRecPtr, ..).
> 
> The only place that calls the function with that parameter is
> timeline switching. Previously, 10 segments were unconditionally
> recycled after the switchpoint, because we didn't have the information
> about the previous checkpoint at hand at that point. But now we can use
> the timeline switch point as an approximation of the last checkpoint's
> redo point, which allows us to use min/max_wal_size properly there.

Fixed a comment in the patch, which was unreadable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From f2b1a0b6360263d4ddf725075daf4b56800e3e18 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 19 Jul 2018 12:13:56 +0900
Subject: [PATCH] Fix calculation base of WAL recycling

The commit 4b0d28de06 removed the prior checkpoint and related things
but that leaves WAL recycling based on the prior checkpoint. This
makes max_wal_size and min_wal_size work incorrectly. This patch makes
WAL recycling be based on the last checkpoint.
---
 src/backend/access/transam/xlog.c | 37 +
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4049deb968..d7a61af8f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2287,7 +2287,7 @@ assign_checkpoint_completion_target(double newval, void *extra)
  * XLOG segments? Returns the highest segment that should be preallocated.
  */
 static XLogSegNo
-XLOGfileslop(XLogRecPtr PriorRedoPtr)
+XLOGfileslop(XLogRecPtr RedoRecPtr)
 {
 	XLogSegNo	minSegNo;
 	XLogSegNo	maxSegNo;
@@ -2299,9 +2299,9 @@ XLOGfileslop(XLogRecPtr PriorRedoPtr)
 	 * correspond to. Always recycle enough segments to meet the minimum, and
 	 * remove enough segments to stay below the maximum.
 	 */
-	minSegNo = PriorRedoPtr / wal_segment_size +
+	minSegNo = RedoRecPtr / wal_segment_size +
 		ConvertToXSegs(min_wal_size_mb, wal_segment_size) - 1;
-	maxSegNo = PriorRedoPtr / wal_segment_size +
+	maxSegNo = RedoRecPtr / wal_segment_size +
 		ConvertToXSegs(max_wal_size_mb, wal_segment_size) - 1;
 
 	/*
@@ -2316,7 +2316,7 @@ XLOGfileslop(XLogRecPtr PriorRedoPtr)
 	/* add 10% for good measure. */
 	distance *= 1.10;
 
-	recycleSegNo = (XLogSegNo) ceil(((double) PriorRedoPtr + distance) /
+	recycleSegNo = (XLogSegNo) ceil(((double) RedoRecPtr + distance) /
 	wal_segment_size);
 
 	if (recycleSegNo < minSegNo)
@@ -3896,12 +3896,12 @@ RemoveTempXlogFiles(void)
 /*
  * Recycle or remove all log files older or equal to passed segno.
  *
- * endptr is current (or recent) end of xlog, and PriorRedoRecPtr is the
- * redo pointer of the previous checkpoint. These are used to determine
+ * endptr is current (or recent) end of xlog, and RedoRecPtr is the
+ * redo pointer of the last checkpoint. These are used to determine
  * whether we want to recycle rather than delete no-longer-wanted log files.
  */
 static void
-RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
+RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 {
 	DIR		   *xldir;
 	struct dirent *xlde;
@@ -3944,7 +3944,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 /* Update the last removed location in shared memory first */
 UpdateLastRemovedPtr(xlde->d_name);
 
-RemoveXlogFile(xlde->d_name, PriorRedoPtr, endptr);
+RemoveXlogFile(xlde->d_name, RedoRecPtr, endptr);
 			}
 		}
 	}
@@ -4006,9 +4006,11 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 			 * remove it yet. It should be OK to remove it - files that are
 			 * not part of our timeline history are not required for recovery
 			 * - but seems safer to let them be archived and removed later.
+			 * Here, switchpoint is a good approximate of RedoRecPtr for
+			 * RemoveXlogFile since we have just done timeline switching.
 			 */
 			if (!XLogArchiveIsReady(xlde->d_name))
-RemoveXlogFile(xlde->d_name, InvalidXLogRecPtr, switchpoint);
+RemoveXlogFile(xlde->d_name, switchpoint, switchpoint);
 		}
 	}
 
@@ -4018,14 +4020,12 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 

Re: patch to allow disable of WAL recycling

2018-07-18 Thread Kyotaro HORIGUCHI
At Tue, 17 Jul 2018 21:01:03 -0400, Robert Haas  wrote 
in 
> On Tue, Jul 17, 2018 at 3:12 PM, Peter Eisentraut
>  wrote:
> > The actual implementation could use another round of consideration.  I
> > wonder how this should interact with min_wal_size.  Wouldn't
> > min_wal_size = 0 already do what we need (if you could set it to 0,
> > which is currently not possible)?
> 
> Hmm, would that actually disable recycling, or just make it happen only 
> rarely?

It doesn't. Instead setting max_wal_size smaller than checkpoint
interval should do that.

While considering this, I found a bug in 4b0d28de06, which
removed prior checkpoint from control file. It actually trims the
segments before the last checkpoint's redo segment but recycling
is still considered based on the *previous* checkpoint. As a
result, min_wal_size doesn't work as documented.  Specifically, setting
min/max_wal_size to 48MB and advance four or more segments then
two checkpoints leaves just one segment, which is less than
min_wal_size.
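
To make the arithmetic concrete, here is a small standalone sketch of
the minSegNo computation from XLOGfileslop(): the 16MB segment size is
the default, the 48MB limit matches the example above, and the two
redo-pointer positions are made up purely for illustration.

/*
 * Standalone sketch, not PostgreSQL code: why computing the recycle
 * floor from the *previous* checkpoint's redo pointer keeps too few
 * segments once removal is based on the *last* checkpoint.
 */
#include <stdio.h>
#include <stdint.h>

#define SEG_SIZE  (16 * 1024 * 1024ULL)   /* default wal_segment_size */
#define MIN_SEGS  3ULL                    /* ConvertToXSegs(48MB, 16MB) */

static void
show_floor(const char *label, uint64_t redo_ptr)
{
    uint64_t minSegNo = redo_ptr / SEG_SIZE + MIN_SEGS - 1;

    printf("%s: redo in segment %llu, recycle segments up to %llu\n",
           label, (unsigned long long) (redo_ptr / SEG_SIZE),
           (unsigned long long) minSegNo);
}

int
main(void)
{
    /* hypothetical redo pointers, one checkpoint apart */
    show_floor("previous checkpoint (old code)", 4 * SEG_SIZE);
    show_floor("last checkpoint (patched)", 7 * SEG_SIZE);

    /*
     * Segments below the last redo segment (7) are removed, so with the
     * old floor (segment 6) nothing is left to recycle and fewer than
     * the 3 segments implied by min_wal_size survive.
     */
    return 0;
}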

The attached patch fixes that. One arguable point on this would
be the removal of the behavior when RemoveXLogFile(name,
InvalidXLogRecPtr, ..).

The only place that calls the function with that parameter is
timeline switching. Previously, 10 segments were unconditionally
recycled after the switchpoint, because we didn't have the information
about the previous checkpoint at hand at that point. But now we can use
the timeline switch point as an approximation of the last checkpoint's
redo point, which allows us to use min/max_wal_size properly there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 2a59a0fb21c0272a445fe7f05fb68ea1aafb3e21 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 19 Jul 2018 12:13:56 +0900
Subject: [PATCH] Fix calculation base of WAL recycling

The commit 4b0d28de06 removed the prior checkpoint and related things
but that leaves WAL recycling based on the prior checkpoint. This
makes max_wal_size and min_wal_size work incorrectly. This patch makes
WAL recycling be based on the last checkpoint.
---
 src/backend/access/transam/xlog.c | 37 +
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4049deb968..fdc21df122 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2287,7 +2287,7 @@ assign_checkpoint_completion_target(double newval, void *extra)
  * XLOG segments? Returns the highest segment that should be preallocated.
  */
 static XLogSegNo
-XLOGfileslop(XLogRecPtr PriorRedoPtr)
+XLOGfileslop(XLogRecPtr RedoRecPtr)
 {
 	XLogSegNo	minSegNo;
 	XLogSegNo	maxSegNo;
@@ -2299,9 +2299,9 @@ XLOGfileslop(XLogRecPtr PriorRedoPtr)
 	 * correspond to. Always recycle enough segments to meet the minimum, and
 	 * remove enough segments to stay below the maximum.
 	 */
-	minSegNo = PriorRedoPtr / wal_segment_size +
+	minSegNo = RedoRecPtr / wal_segment_size +
 		ConvertToXSegs(min_wal_size_mb, wal_segment_size) - 1;
-	maxSegNo = PriorRedoPtr / wal_segment_size +
+	maxSegNo = RedoRecPtr / wal_segment_size +
 		ConvertToXSegs(max_wal_size_mb, wal_segment_size) - 1;
 
 	/*
@@ -2316,7 +2316,7 @@ XLOGfileslop(XLogRecPtr PriorRedoPtr)
 	/* add 10% for good measure. */
 	distance *= 1.10;
 
-	recycleSegNo = (XLogSegNo) ceil(((double) PriorRedoPtr + distance) /
+	recycleSegNo = (XLogSegNo) ceil(((double) RedoRecPtr + distance) /
 	wal_segment_size);
 
 	if (recycleSegNo < minSegNo)
@@ -3896,12 +3896,12 @@ RemoveTempXlogFiles(void)
 /*
  * Recycle or remove all log files older or equal to passed segno.
  *
- * endptr is current (or recent) end of xlog, and PriorRedoRecPtr is the
- * redo pointer of the previous checkpoint. These are used to determine
+ * endptr is current (or recent) end of xlog, and RedoRecPtr is the
+ * redo pointer of the last checkpoint. These are used to determine
  * whether we want to recycle rather than delete no-longer-wanted log files.
  */
 static void
-RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
+RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 {
 	DIR		   *xldir;
 	struct dirent *xlde;
@@ -3944,7 +3944,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 /* Update the last removed location in shared memory first */
 UpdateLastRemovedPtr(xlde->d_name);
 
-RemoveXlogFile(xlde->d_name, PriorRedoPtr, endptr);
+RemoveXlogFile(xlde->d_name, RedoRecPtr, endptr);
 			}
 		}
 	}
@@ -4006,9 +4006,11 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 			 * remove it yet. It should be OK to remove it - files that are
 			 * not part of our timeline history are not required for recovery
 			 * - but seems safer to let them be archived and removed later.
+			 * Recycling based on the point gives good approximate since we
+			 * have just done timeline swi

Re: patch to allow disable of WAL recycling

2018-07-18 Thread Robert Haas
On Tue, Jul 17, 2018 at 4:47 PM, Tomas Vondra
 wrote:
> Makes sense, I guess. But I think many claims made in this thread are
> mostly just assumptions at this point, based on our beliefs how CoW or
> non-CoW filesystems work. The results from ZFS (showing positive impact)
> are an exception, but that's about it. I'm sure those claims are based
> on real-world experience and are likely true, but it'd be good to have
> data from a wider range of filesystems / configurations etc. so that we
> can give better recommendations to users, for example.

I agree that there's a lot of assuming going on.

> That's something I can help with, assuming we agree on what tests we
> want to do. I'd say the usual battery of write-only pgbench tests with
> different scales (fits into s_b, fits into RAM, larger than RAM) on
> common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and
> different types of storage would be enough. I don't have any freebsd box
> available, unfortunately.

Those sound like reasonable tests.  I also don't think we need to have
perfect recommendations.  Some general guidance is good enough for a
start and we can refine it as we know more.  IMHO, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2018-07-18 Thread Robert Haas
On Wed, Jul 18, 2018 at 3:22 PM, Jerry Jelinek  wrote:
> I've gotten a wide variety of feedback on the proposed patch. The comments
> range from rough approval through various discussion about alternative
> solutions. At this point I am unsure if this patch is rejected or if it
> would be accepted once I had the updated man page changes that were
> discussed last week.
>
> I have attached an updated patch which does incorporate man page changes, in
> case that is the blocker. However, if this patch is simply rejected, I'd
> appreciate it if I could get a definitive statement to that effect.

1. There's no such thing as a definitive statement of the community's
opinion, generally speaking, because as a rule the community consists
of many different people who rarely all agree on anything but the most
uncontroversial of topics.  We could probably all agree that the sun
rises in the East, or at least has historically done so, and that,
say, typos are bad.

2. You can't really expect somebody else to do the work of forging
consensus on your behalf.  Sure, that may happen, if somebody else
takes an interest in the problem.  But, really, since you started the
thread, most likely you're the one most interested.  If you're not
willing to take the time to discuss the issues with the individual
people who have responded, promote your own views, investigate
proposed alternatives, etc., it's unlikely anybody else is going to do
it.

3. It's not unusual for a patch of this complexity to take months to
get committed; it's only been a few weeks.  If it's important to you,
don't give up now.

It seems to me that there are several people in favor of this patch,
some others with questions and concerns, and pretty much nobody
adamantly opposed.  So I would guess that this has pretty good odds in
the long run.  But you're not going to get anywhere by pushing for a
commit-or-reject-right-now.  It's been less than 24 hours since Tomas
proposed to do further benchmarking if we could agree on what to test
(you haven't made any suggestions in response) and it's also been less
than 24 hours since Peter and I both sent emails about whether it
should be controlled by its own GUC or in some other way.  The
discussion is very much actively continuing.  It's too soon to decide
on the conclusion, but it would be a good idea for you to keep
participating.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2018-07-18 Thread Jerry Jelinek
I've gotten a wide variety of feedback on the proposed patch. The comments
range from rough approval through various discussion about alternative
solutions. At this point I am unsure if this patch is rejected or if it
would be accepted once I had the updated man page changes that were
discussed last week.

I have attached an updated patch which does incorporate man page changes,
in case that is the blocker. However, if this patch is simply rejected, I'd
appreciate it if I could get a definitive statement to that effect.

Thanks,
Jerry


On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek 
wrote:

> Hello All,
>
> Attached is a patch to provide an option to disable WAL recycling. We have
> found that this can help performance by eliminating read-modify-write
> behavior on old WAL files that are no longer resident in the filesystem
> cache. There is a lot more detail on the background of the motivation for
> this in the following thread.
>
> https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3
> XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#CACukRjO7DJvub8e2AijOayj8BfKK3
> xxbtwu3kkaritr67m3...@mail.gmail.com
>
> A similar change has been tested against our 9.6 branch that we're
> currently running, but the attached patch is against master.
>
> Thanks,
> Jerry
>
>


0001-option-to-disable-WAL-recycling.patch
Description: Binary data


Re: patch to allow disable of WAL recycling

2018-07-17 Thread Robert Haas
On Tue, Jul 17, 2018 at 3:12 PM, Peter Eisentraut
 wrote:
> The actual implementation could use another round of consideration.  I
> wonder how this should interact with min_wal_size.  Wouldn't
> min_wal_size = 0 already do what we need (if you could set it to 0,
> which is currently not possible)?

Hmm, would that actually disable recycling, or just make it happen only rarely?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2018-07-17 Thread Tomas Vondra
On 07/17/2018 09:12 PM, Peter Eisentraut wrote:
> On 17.07.18 00:04, Jerry Jelinek wrote:
>> There have been quite a few comments since last week, so at this point I
>> am uncertain how to proceed with this change. I don't think I saw
>> anything concrete in the recent emails that I can act upon.
> 
> The outcome of this could be multiple orthogonal patches that affect the
> WAL file allocation behavior somehow.  I think your original idea of
> skipping recycling on a COW file system is sound.  But I would rather
> frame the option as "preallocating files is obviously useless on a COW
> file system" rather than "this will make things mysteriously faster with
> uncertain trade-offs".
> 

Makes sense, I guess. But I think many claims made in this thread are
mostly just assumptions at this point, based on our beliefs how CoW or
non-CoW filesystems work. The results from ZFS (showing positive impact)
are an exception, but that's about it. I'm sure those claims are based
on real-world experience and are likely true, but it'd be good to have
data from a wider range of filesystems / configurations etc. so that we
can give better recommendations to users, for example.

That's something I can help with, assuming we agree on what tests we
want to do. I'd say the usual battery of write-only pgbench tests with
different scales (fits into s_b, fits into RAM, larger than RAM) on
common Linux filesystems (ext4, xfs, btrfs) and zfsonlinux, and
different types of storage would be enough. I don't have any freebsd box
available, unfortunately.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-17 Thread Peter Eisentraut
On 17.07.18 00:04, Jerry Jelinek wrote:
> There have been quite a few comments since last week, so at this point I
> am uncertain how to proceed with this change. I don't think I saw
> anything concrete in the recent emails that I can act upon.

The outcome of this could be multiple orthogonal patches that affect the
WAL file allocation behavior somehow.  I think your original idea of
skipping recycling on a COW file system is sound.  But I would rather
frame the option as "preallocating files is obviously useless on a COW
file system" rather than "this will make things mysteriously faster with
uncertain trade-offs".

The actual implementation could use another round of consideration.  I
wonder how this should interact with min_wal_size.  Wouldn't
min_wal_size = 0 already do what we need (if you could set it to 0,
which is currently not possible)?  Should the new setting be something
like min_wal_size = -1?  Or even if it's a new setting, it might be
better to act on it in XLOGfileslop(), so these things are kept closer
together.
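
For concreteness, a minimal sketch of what acting on it in
XLOGfileslop() could look like with a separate boolean setting; the
wal_recycle name, the simplified types, and the collapsed window are
all hypothetical, not taken from the posted patch.

/*
 * Hypothetical sketch only -- not the committed behaviour and not the
 * posted patch.  It shows the shape of the suggestion above: let
 * XLOGfileslop() report no recycle headroom when a (made-up)
 * wal_recycle setting is off, so old segments get unlinked instead of
 * renamed for reuse.  Types and globals are simplified stand-ins for
 * the real ones in xlog.c.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;
typedef uint64_t XLogRecPtr;

static uint64_t wal_segment_size = 16 * 1024 * 1024;
static int      min_wal_size_mb  = 80;
static bool     wal_recycle      = true;    /* hypothetical GUC */

#define ConvertToXSegs(mb, segsize) \
    ((uint64_t) (mb) * 1024 * 1024 / (segsize))

XLogSegNo
XLOGfileslop_sketch(XLogRecPtr RedoRecPtr)
{
    XLogSegNo redoSegNo = RedoRecPtr / wal_segment_size;

    /*
     * With recycling disabled, reserve nothing for reuse: any segment
     * older than the redo point is simply removed, and new segments
     * are created on demand.
     */
    if (!wal_recycle)
        return redoSegNo;

    /* otherwise the usual minimum window (heavily simplified here) */
    return redoSegNo + ConvertToXSegs(min_wal_size_mb, wal_segment_size) - 1;
}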

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-16 Thread Michael Paquier
On Mon, Jul 16, 2018 at 10:38:14AM -0400, Robert Haas wrote:
> It's been a few years since I tested this, but my recollection is that
> if you fill up pg_xlog, the system will PANIC and die on a vanilla
> Linux install.  Sure, you can set max_wal_size, but that's a soft
> limit, not a hard limit, and if you generate WAL faster than the
> system can checkpoint, you can overrun that value and force allocation
> of additional WAL files.  So I'm not sure we have any working
> ENOSPC-panic protection today.  Given that, I'm doubtful that we
> should prioritize maintaining whatever partially-working protection we
> may have today over raw performance.  If we want to fix ENOSPC on
> pg_wal = PANIC, and I think that would be a good thing to fix, then we
> should do it either by finding a way to make the WAL insertion ERROR
> out instead of panicking, or throttle WAL generation as we get close
> to disk space exhaustion so that the checkpoint has time to complete,
> as previously proposed by Heroku.

I would personally prefer to see max_wal_size switched to a hard
limit, and to make that tunable.  I am wondering if that's the case for
other people on this list, but every couple of weeks I see people
complaining that Postgres is not able to maintain a hard guarantee
around the value of max_wal_size.  In some upgrade scenarios, I had to
tell such folks to throttle their insert load and also manually issue
checkpoints to allow the system to stay up and continue with the
upgrade process.  So there are definitely cases where throttling is
useful, and if the hard limit is reached I would rather see WAL
generation from other backends simply stopped, so that the system can
finish its checkpoint, instead of risking the system going down.  And
sometimes this also happens with a SQL dump, where throttling the load
at the application level means a more complex dump strategy with things
split between multiple files, for example.
--
Michael


signature.asc
Description: PGP signature


Re: patch to allow disable of WAL recycling

2018-07-16 Thread Jerry Jelinek
There have been quite a few comments since last week, so at this point I am
uncertain how to proceed with this change. I don't think I saw anything
concrete in the recent emails that I can act upon.

I would like to respond to the comment about trying to "self-tune" the
behavior based on inferences made about caching during setup. I can't speak
for many other filesystems, but for ZFS, the ARC size is not fixed and will
vary based on the memory demands against the machine. Also, what files are
cached will vary based upon the workloads running on the machine. Thus, I
do not think there is a valid way to make inferences about future caching
behavior based upon a point-in-time observation.

I am still happy to update the man pages to explain the new tunable better
if that is acceptable.

Thanks,
Jerry

On Sun, Jul 15, 2018 at 6:32 PM, Robert Haas  wrote:

> On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund  wrote:
> > This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE*
> > performance impact on non-COW filesystems, and very likely even negative
> > impacts in a number of COWed scenarios (when there's enough memory to
> > keep all WAL files in memory).
> >
> > I still think that fixing this another way would be preferrable. This'll
> > be too much of a magic knob that depends on the fs, hardware and
> > workload.
>
> I tend to agree with you, but unless we have a pretty good idea what
> that other way would be, I think we should probably accept the patch.
>
> Could we somehow make this self-tuning?  On any given
> filesystem/hardware/workload, either creating a new 16MB file is
> faster, or recycling an old file is faster.  If the old file is still
> cached, recycling it figures to win on almost any hardware.  If not,
> it seems like something of a toss-up.  I suppose we could try to keep
> a running average of how long it is taking us to recycle WAL files and
> how long it is taking us to create new ones; if we do each one of
> those things at least sometimes, then we'll eventually get an idea of
> which one is quicker.  But it's not clear to me that such data would
> be very reliable unless we tried to make sure that we tried both
> things fairly regularly under circumstances where we could have chosen
> to do the other one.
>
> I think part of the problem here is that whether a WAL segment is
> likely to be cached depends on a host of factors which we don't track
> very carefully, like whether it's been streamed or decoded recently.
> If we knew that a particular WAL segment hadn't been accessed for
> any purpose in 10+ minutes, it would probably be fairly safe to guess
> that it's no longer in cache; if we knew that it had been accessed <15
> seconds ago, then it's probably still in cache.  But we have no idea.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: patch to allow disable of WAL recycling

2018-07-16 Thread Robert Haas
On Mon, Jul 16, 2018 at 10:12 AM, Tom Lane  wrote:
> But anyway, this means we have two nearly independent issues to
> investigate: whether recycling/renaming old files is cheaper than
> constantly creating and deleting them, and whether to use physical
> file zeroing versus some "just set the EOF please" filesystem call
> when first creating a file.  The former does seem like it's purely
> a performance question, but the latter involves a tradeoff of
> performance against an ENOSPC-panic protection feature that in
> reality only works on some filesystems.

It's been a few years since I tested this, but my recollection is that
if you fill up pg_xlog, the system will PANIC and die on a vanilla
Linux install.  Sure, you can set max_wal_size, but that's a soft
limit, not a hard limit, and if you generate WAL faster than the
system can checkpoint, you can overrun that value and force allocation
of additional WAL files.  So I'm not sure we have any working
ENOSPC-panic protection today.  Given that, I'm doubtful that we
should prioritize maintaining whatever partially-working protection we
may have today over raw performance.  If we want to fix ENOSPC on
pg_wal = PANIC, and I think that would be a good thing to fix, then we
should do it either by finding a way to make the WAL insertion ERROR
out instead of panicking, or throttle WAL generation as we get close
to disk space exhaustion so that the checkpoint has time to complete,
as previously proposed by Heroku.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2018-07-16 Thread Tom Lane
Andres Freund  writes:
> On 2018-07-15 20:55:38 -0400, Tom Lane wrote:
>> That's not the way to think about it.  On a COW file system, we don't
>> want to "create 16MB files" at all --- we should just fill WAL files
>> on-the-fly, because the pre-fill activity isn't actually serving the
>> intended purpose of reserving disk space.  It's just completely useless
>> overhead :-(.  So we can't really make a direct comparison between the
>> two approaches; there's no good way to net out the cost of constructing
>> the WAL data we need to write.

> We probably should still allocate them in 16MB segments. We rely on the
> size being fixed in a number of places.

Reasonable point.  I was supposing that it'd be okay if a partially
written segment were shorter than 16MB, but you're right that that
would require vetting a lot of code to be sure about it.

> But it's probably worthwhile to
> just do a posix_fadvise or such. Also, if we continually increase the
> size with each write we end up doing a lot more metadata transactions,
> which'll essentially serve to increase journalling overhead further.

Hm.  What you're claiming is that on these FSen, extending a file involves
more/different metadata activity than allocating new space for a COW
overwrite of an existing area within the file.  Is that really true?
The former case would be far more common in typical usage, and somehow
I doubt the FS authors would have been too stupid to optimize things so
that the same journal entry can record both the space allocation and the
logical-EOF change.

But anyway, this means we have two nearly independent issues to
investigate: whether recycling/renaming old files is cheaper than
constantly creating and deleting them, and whether to use physical
file zeroing versus some "just set the EOF please" filesystem call
when first creating a file.  The former does seem like it's purely
a performance question, but the latter involves a tradeoff of
performance against an ENOSPC-panic protection feature that in
reality only works on some filesystems.

regards, tom lane



Re: patch to allow disable of WAL recycling

2018-07-16 Thread Andres Freund
Hi,

On 2018-07-15 20:55:38 -0400, Tom Lane wrote:
> That's not the way to think about it.  On a COW file system, we don't
> want to "create 16MB files" at all --- we should just fill WAL files
> on-the-fly, because the pre-fill activity isn't actually serving the
> intended purpose of reserving disk space.  It's just completely useless
> overhead :-(.  So we can't really make a direct comparison between the
> two approaches; there's no good way to net out the cost of constructing
> the WAL data we need to write.

We probably should still allocate them in 16MB segments. We rely on the
size being fixed in a number of places. But it's probably worthwhile to
just do a posix_fadvise or such. Also, if we continually increase the
size with each write we end up doing a lot more metadata transactions,
which'll essentially serve to increase journalling overhead further.
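
To illustrate that direction, here is a minimal standalone sketch (not
PostgreSQL code, and not necessarily what was meant by "posix_fadvise
or such") that reserves a full 16MB segment with posix_fallocate() in
one call, rather than zero-filling it or growing it write by write.
Note that on CoW filesystems the reservation may not actually guarantee
that later writes will succeed.

/*
 * Minimal sketch, not taken from PostgreSQL: reserve a WAL-sized file
 * in one allocation call instead of zero-filling it or extending it
 * write by write.  The file name is made up for the example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)

int
main(void)
{
    int fd = open("000000010000000000000001.sketch",
                  O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }

    /* one allocation + one fsync, instead of many small extensions */
    int rc = posix_fallocate(fd, 0, SEG_SIZE);
    if (rc != 0)
    {
        fprintf(stderr, "posix_fallocate: error %d\n", rc);
        return EXIT_FAILURE;
    }
    if (fsync(fd) != 0)
    {
        perror("fsync");
        return EXIT_FAILURE;
    }
    close(fd);
    return EXIT_SUCCESS;
}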

Greetings,

Andres Freund



Re: patch to allow disable of WAL recycling

2018-07-16 Thread Andres Freund
On 2018-07-15 20:32:39 -0400, Robert Haas wrote:
> On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund  wrote:
> > This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE*
> > performance impact on non-COW filesystems, and very likely even negative
> > impacts in a number of COWed scenarios (when there's enough memory to
> > keep all WAL files in memory).
> >
> > I still think that fixing this another way would be preferrable. This'll
> > be too much of a magic knob that depends on the fs, hardware and
> > workload.
> 
> I tend to agree with you, but unless we have a pretty good idea what
> that other way would be, I think we should probably accept the patch.

I don't think I've argued against that - I just want there to be
sufficient caveats to make clear it's going to hurt on very common OS &
FS combinations.


> I think part of the problem here is that whether a WAL segment is
> likely to be cached depends on a host of factors which we don't track
> very carefully, like whether it's been streamed or decoded recently.
> If we knew that a particular WAL segment hadn't been accessed for
> any purpose in 10+ minutes, it would probably be fairly safe to guess
> that it's no longer in cache; if we knew that it had been accessed <15
> seconds ago, then it's probably still in cache.  But we have no idea.

True. Additionally we don't know whether, even if cold cache,
re-initializing files isn't worse performance-wise than recycling files.

Greetings,

Andres Freund



Re: patch to allow disable of WAL recycling

2018-07-16 Thread Tomas Vondra

On 07/16/2018 04:54 AM, Stephen Frost wrote:

Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:

I think that the right basic idea is to have a GUC that chooses between
the two implementations, but whether it can be set automatically is not
clear to me.  Can initdb perhaps investigate what kind of filesystem the
WAL directory is sitting on, and set the default value from hard-wired
knowledge about that?


Maybe..  but I think we'd still need a way to change it because people
often start with their database system minimally configured (including
having WAL in the default location of the data directory) and only later
realize that was a bad idea and change it later.  I wouldn't be at all
surprised if that "change it later" meant moving it to a different
filesystem, and having to re-initdb to take advantage of that would be
particularly unfriendly.



I'm not sure the detection can be made sufficiently reliable for initdb. 
For example, it's not that uncommon to do initdb and then move the WAL 
to a different filesystem using symlink. Also, I wonder how placing the 
filesystem on LVM with snapshotting (which kinda makes it CoW) affects 
the system behavior.


But maybe those are not issues, as long as the result is predictable.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-15 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> I think that the right basic idea is to have a GUC that chooses between
> the two implementations, but whether it can be set automatically is not
> clear to me.  Can initdb perhaps investigate what kind of filesystem the
> WAL directory is sitting on, and set the default value from hard-wired
> knowledge about that?

Maybe..  but I think we'd still need a way to change it because people
often start with their database system minimally configured (including
having WAL in the default location of the data directory) and only later
realize that was a bad idea and change it later.  I wouldn't be at all
surprised if that "change it later" meant moving it to a different
filesystem, and having to re-initdb to take advantage of that would be
particularly unfriendly.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: patch to allow disable of WAL recycling

2018-07-15 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund  wrote:
>> This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE*
>> performance impact on non-COW filesystems, and very likely even negative
>> impacts in a number of COWed scenarios (when there's enough memory to
>> keep all WAL files in memory).
>> 
>> I still think that fixing this another way would be preferrable. This'll
>> be too much of a magic knob that depends on the fs, hardware and
>> workload.

> I tend to agree with you, but unless we have a pretty good idea what
> that other way would be, I think we should probably accept the patch.

> Could we somehow make this self-tuning?  On any given
> filesystem/hardware/workload, either creating a new 16MB file is
> faster, or recycling an old file is faster.

That's not the way to think about it.  On a COW file system, we don't
want to "create 16MB files" at all --- we should just fill WAL files
on-the-fly, because the pre-fill activity isn't actually serving the
intended purpose of reserving disk space.  It's just completely useless
overhead :-(.  So we can't really make a direct comparison between the
two approaches; there's no good way to net out the cost of constructing
the WAL data we need to write.

Moreover, a raw speed comparison isn't the whole story; a DBA might
choose write-without-prefill because it's faster for him, even though
he's taking a bigger chance of trouble on out-of-disk-space.

I think that the right basic idea is to have a GUC that chooses between
the two implementations, but whether it can be set automatically is not
clear to me.  Can initdb perhaps investigate what kind of filesystem the
WAL directory is sitting on, and set the default value from hard-wired
knowledge about that?
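
On Linux, the kind of probe initdb might do could look like the sketch
below, using statfs(2) and comparing f_type against known filesystem
magic numbers.  The constants are assumptions to verify against
<linux/magic.h> and the ZFS-on-Linux headers, and other platforms
(e.g. FreeBSD's f_fstypename) would need their own variant.

/*
 * Hedged, Linux-only sketch of a filesystem probe for the WAL
 * directory.  The magic numbers are assumptions, not proposed initdb
 * code.
 */
#include <stdio.h>
#include <sys/vfs.h>

#define BTRFS_MAGIC 0x9123683EUL
#define ZFS_MAGIC   0x2FC12FC1UL

static int
wal_dir_is_cow(const char *path)
{
    struct statfs fs;

    if (statfs(path, &fs) != 0)
        return -1;      /* unknown; caller should default to recycling */

    return (unsigned long) fs.f_type == BTRFS_MAGIC ||
           (unsigned long) fs.f_type == ZFS_MAGIC;
}

int
main(int argc, char **argv)
{
    const char *dir = argc > 1 ? argv[1] : "pg_wal";

    printf("%s: cow=%d\n", dir, wal_dir_is_cow(dir));
    return 0;
}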

regards, tom lane



Re: patch to allow disable of WAL recycling

2018-07-15 Thread Robert Haas
On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund  wrote:
> This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE*
> performance impact on non-COW filesystems, and very likely even negative
> impacts in a number of COWed scenarios (when there's enough memory to
> keep all WAL files in memory).
>
> I still think that fixing this another way would be preferrable. This'll
> be too much of a magic knob that depends on the fs, hardware and
> workload.

I tend to agree with you, but unless we have a pretty good idea what
that other way would be, I think we should probably accept the patch.

Could we somehow make this self-tuning?  On any given
filesystem/hardware/workload, either creating a new 16MB file is
faster, or recycling an old file is faster.  If the old file is still
cached, recycling it figures to win on almost any hardware.  If not,
it seems like something of a toss-up.  I suppose we could try to keep
a running average of how long it is taking us to recycle WAL files and
how long it is taking us to create new ones; if we do each one of
those things at least sometimes, then we'll eventually get an idea of
which one is quicker.  But it's not clear to me that such data would
be very reliable unless we tried to make sure that we tried both
things fairly regularly under circumstances where we could have chosen
to do the other one.
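
A minimal sketch of what that bookkeeping could look like (hypothetical
names, not from any patch; it keeps an exponential moving average of
each path's cost and occasionally samples the slower one so both
averages stay meaningful):

/*
 * Hypothetical sketch of the "running average" idea above -- not code
 * from any patch.  Track a smoothed cost for recycling vs. creating a
 * segment and prefer whichever has been cheaper, while still sampling
 * the other path occasionally.
 */
#include <stdbool.h>

typedef struct WalFileCostEMA
{
    double  recycle_ms;     /* smoothed cost of recycling a segment */
    double  create_ms;      /* smoothed cost of creating a new segment */
    int     counter;        /* used to force periodic re-sampling */
} WalFileCostEMA;

#define EMA_ALPHA       0.2     /* smoothing factor */
#define RESAMPLE_EVERY  32      /* take the "losing" path 1 time in 32 */

void
record_cost(double *ema, double elapsed_ms)
{
    if (*ema == 0.0)
        *ema = elapsed_ms;
    else
        *ema = EMA_ALPHA * elapsed_ms + (1.0 - EMA_ALPHA) * *ema;
}

bool
should_recycle(WalFileCostEMA *st)
{
    bool recycle_cheaper = st->recycle_ms <= st->create_ms;

    /* occasionally take the other branch so both averages stay fresh */
    if (++st->counter % RESAMPLE_EVERY == 0)
        return !recycle_cheaper;

    return recycle_cheaper;
}

The caller would time whichever operation it actually performs and feed
the elapsed time back through record_cost(); as noted above, the
averages are only trustworthy if both paths keep getting exercised.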

I think part of the problem here is that whether a WAL segment is
likely to be cached depends on a host of factors which we don't track
very carefully, like whether it's been streamed or decoded recently.
If we knew that a particular WAL segment hadn't been accessed for
any purpose in 10+ minutes, it would probably be fairly safe to guess
that it's no longer in cache; if we knew that it had been accessed <15
seconds ago, then it's probably still in cache.  But we have no idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2018-07-13 Thread Jerry Jelinek
Thanks to everyone who has taken the time to look at this patch and provide
all of the feedback.

I'm going to wait another day to see if there are any more comments. If
not, then first thing next week, I will send out a revised patch with
improvements to the man page change as requested. If anyone has specific
things they want to be sure are covered, please just let me know.

Thanks again,
Jerry


On Thu, Jul 12, 2018 at 7:06 PM, Thomas Munro  wrote:

> On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
>  wrote:
> > I don't follow Alvaro's reasoning, TBH. There's a couple of things that
> > confuse me ...
> >
> > I don't quite see how reusing WAL segments actually protects against full
> > filesystem? On "traditional" filesystems I would not expect any
> difference
> > between "unlink+create" and reusing an existing file. On CoW filesystems
> > (like ZFS or btrfs) the space management works very differently and
> reusing
> > an existing file is unlikely to save anything.
>
> Yeah, I had the same thoughts.
>
> > But even if it reduces the likelihood of ENOSPC, it does not eliminate it
> > entirely. max_wal_size is not a hard limit, and the disk may be filled by
> > something else (when WAL is not on a separate device, when there is thin
> > provisioning, etc.). So it's not a protection against data corruption we
> > could rely on. (And as was discussed in the recent fsync thread, ENOSPC
> is a
> > likely source of past data corruption issues on NFS and possibly other
> > filesystems.)
>
> Right.  That ENOSPC discussion was about checkpointing though, not
> WAL.  IIUC the hypothesis was that there may be stacks (possibly
> involving NFS or thin provisioning, or perhaps historical versions of
> certain local filesystems that had reservation accounting bugs, on a
> certain kernel) that could let you write() a buffer, and then later
> when the checkpointer calls fsync() the filesystem says ENOSPC, the
> kernel reports that and throws away the dirty page, and then at next
> checkpoint fsync() succeeds but the checkpoint is a lie and the data
> is smoke.
>
> We already PANIC on any errno except EINTR in XLogWrite(), as seen
> in Jerry's nearby stack trace, so that failure mode seems to be
> covered already for WAL, no?
>
> > AFAICS the original reason for reusing WAL segments was the belief that
> > overwriting an existing file is faster than writing a new file. That
> might
> > have been true in the past, but the question is if it's still true on
> > current filesystems. The results posted here suggest it's not true on
> ZFS,
> > at least.
>
> Yeah.
>
> The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
> comments about the documentation; we should make sure that the 'off'
> setting isn't accidentally recommended to the wrong audience) and I
> vote we take it.
>
> Just by the way, if I'm not mistaken ZFS does avoid faulting when
> overwriting whole blocks, just like other filesystems:
>
> https://github.com/freebsd/freebsd/blob/master/sys/cddl/
> contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034
>
> So then where are those faults coming from?  Perhaps the tree page
> that holds the block pointer, of which there must be many when the
> recordsize is small?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Re: patch to allow disable of WAL recycling

2018-07-12 Thread Thomas Munro
On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
 wrote:
> I don't follow Alvaro's reasoning, TBH. There's a couple of things that
> confuse me ...
>
> I don't quite see how reusing WAL segments actually protects against full
> filesystem? On "traditional" filesystems I would not expect any difference
> between "unlink+create" and reusing an existing file. On CoW filesystems
> (like ZFS or btrfs) the space management works very differently and reusing
> an existing file is unlikely to save anything.

Yeah, I had the same thoughts.

> But even if it reduces the likelihood of ENOSPC, it does not eliminate it
> entirely. max_wal_size is not a hard limit, and the disk may be filled by
> something else (when WAL is not on a separate device, when there is thin
> provisioning, etc.). So it's not a protection against data corruption we
> could rely on. (And as was discussed in the recent fsync thread, ENOSPC is a
> likely source of past data corruption issues on NFS and possibly other
> filesystems.)

Right.  That ENOSPC discussion was about checkpointing though, not
WAL.  IIUC the hypothesis was that there may be stacks (possibly
involving NFS or thin provisioning, or perhaps historical versions of
certain local filesystems that had reservation accounting bugs, on a
certain kernel) that could let you write() a buffer, and then later
when the checkpointer calls fsync() the filesystem says ENOSPC, the
kernel reports that and throws away the dirty page, and then at next
checkpoint fsync() succeeds but the checkpoint is a lie and the data
is smoke.

We already PANIC on any errno except EINTR in XLogWrite(), as seen
in Jerry's nearby stack trace, so that failure mode seems to be
covered already for WAL, no?

> AFAICS the original reason for reusing WAL segments was the belief that
> overwriting an existing file is faster than writing a new file. That might
> have been true in the past, but the question is if it's still true on
> current filesystems. The results posted here suggest it's not true on ZFS,
> at least.

Yeah.

The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
comments about the documentation; we should make sure that the 'off'
setting isn't accidentally recommended to the wrong audience) and I
vote we take it.

Just by the way, if I'm not mistaken ZFS does avoid faulting when
overwriting whole blocks, just like other filesystems:

https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034

So then where are those faults coming from?  Perhaps the tree page
that holds the block pointer, of which there must be many when the
recordsize is small?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: patch to allow disable of WAL recycling

2018-07-12 Thread Jerry Jelinek
I was asked to perform two different tests:
1) A benchmarksql run with WAL recycling on and then off, for comparison
2) A test when the filesystem fills up

For #1, I did two 15 minute benchmarksql runs and here are the results.
wal_recycle=on
--
Term-00, Running Average tpmTOTAL: 299.84Current tpmTOTAL: 29412
 Memory U14:49:02,470 [Thread-1] INFO   jTPCC : Term-00,
14:49:02,470 [Thread-1] INFO   jTPCC : Term-00,
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Measured tpmC (NewOrders) =
136.49
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Measured tpmTOTAL = 299.78
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Session Start =
2018-07-12 14:34:02
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Session End   =
2018-07-12 14:49:02
14:49:02,471 [Thread-1] INFO   jTPCC : Term-00, Transaction Count = 4497

wal_recycle=off
---
Term-00, Running Average tpmTOTAL: 299.85Current tpmTOTAL: 29520
 Memory U15:10:15,712 [Thread-1] INFO   jTPCC : Term-00,
15:10:15,712 [Thread-1] INFO   jTPCC : Term-00,
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Measured tpmC (NewOrders) =
135.89
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Measured tpmTOTAL = 299.79
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Session Start =
2018-07-12 14:55:15
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Session End   =
2018-07-12 15:10:15
15:10:15,713 [Thread-1] INFO   jTPCC : Term-00, Transaction Count = 4497

As can be seen, disabling WAL recycling does not cause any performance
regression.

For #2, I ran the test with WAL recycling on (the current behavior as well
as the default with this patch) since the behavior of postgres is
orthogonal to WAL recycling when the filesystem fills up.

I capped the filesystem with 32MB of free space. I setup a configuration
with wal_keep_segments=50 and started a long benchmarksql run. I had 4 WAL
files already in existence when the run started.

As the filesystem fills up, the performance of postgres gets slower and
slower, as would be expected. This is due to the COW nature of the
filesystem and the fact that all writes need to find space.

When a new WAL file is created, this essentially consumes no space since it
is a zero-filled file, so no filesystem space is consumed, except for a
little metadata for the file. However, as writes occur to the WAL
file, space is being consumed. Eventually all space in the filesystem is
consumed. I could not tell if this occurred during a write to an existing
WAL file or a write to the database itself. As other people have observed,
WAL file creation in a COW filesystem is not the problematic operation when
the filesystem fills up. It is the writes to existing files that will fail.
When postgres core dumped there were 6 WAL files in the pg_wal directory
(well short of the 50 configured).

When the filesystem filled up, postgres core dumped and benchmarksql
emitted a bunch of java debug information which I could provide if anyone
is interested.

Here is some information for the postgres core dump. It looks like postgres
aborted itself, but since the filesystem is full, there is nothing in the
log file.
> ::status
debugging core file of postgres (64-bit) from
f6c22f98-38aa-eb51-80d2-811ed25bed6b
file: /zones/f6c22f98-38aa-eb51-80d2-811ed25bed6b/local/pgsql/bin/postgres
initial argv: /usr/local/pgsql/bin/postgres -D /home/postgres/data
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=76019 uid=1001 code=-1
> $C
f9dfa4b0 libc.so.1`_lwp_kill+0xa()
f9dfa4e0 libc.so.1`raise+0x20(6)
f9dfa530 libc.so.1`abort+0x98()
f9dfa560 errfinish+0x230()
f9dfa5e0 XLogWrite+0x294()
f9dfa610 XLogBackgroundFlush+0x18d()
f9dfaa50 WalWriterMain+0x1a8()
f9dfaab0 AuxiliaryProcessMain+0x3ff()
f9dfab40 0x7b5566()
f9dfab90 reaper+0x60a()
f9dfaba0 libc.so.1`__sighndlr+6()
f9dfac30 libc.so.1`call_user_handler+0x1db(12, 0, f9dfaca0)
f9dfac80 libc.so.1`sigacthandler+0x116(12, 0, f9dfaca0)
f9dfb0f0 libc.so.1`__pollsys+0xa()
f9dfb220 libc.so.1`pselect+0x26b(7, f9dfdad0, 0, 0,
f9dfb230, 0)
f9dfb270 libc.so.1`select+0x5a(7, f9dfdad0, 0, 0,
f9dfb6c0)
f9dffb00 ServerLoop+0x289()
f9dffb70 PostmasterMain+0xcfa()
f9dffba0 main+0x3cd()
f9dffbd0 _start_crt+0x83()
f9dffbe0 _start+0x18()

Let me know if there is any other information I could provide.

Thanks,
Jerry


On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek 
wrote:

> Hello All,
>
> Attached is a patch to provide an option to disable WAL recycling. We have
> found that this can help performance by eliminating read-modify-write
> behavior on old WAL files that are no longer resident in the filesystem
> cache. There is a lot more detail on the background of the motivation for
> this in the following thread.
>
> https://www.postgresql.org/message-id/flat/CACukRjO7

Re: patch to allow disable of WAL recycling

2018-07-12 Thread Tomas Vondra




On 07/12/2018 02:25 AM, David Pacheco wrote:
On Tue, Jul 10, 2018 at 1:34 PM, Alvaro Herrera  wrote:


On 2018-Jul-10, Jerry Jelinek wrote:

> 2) Disabling WAL recycling reduces reliability, even on COW filesystems.

I think the problem here is that WAL recycling in normal filesystems
helps protect the case where filesystem gets full.  If you remove it,
that protection goes out the window.  You can claim that people need to
make sure to have available disk space, but this does become a problem
in practice.  I think the thing to do is verify what happens with
recycling off when the disk gets full; is it possible to recover
afterwards?  Is there any corrupt data?  What happens if the disk gets
full just as the new WAL file is being created -- is there a Postgres
PANIC or something?  As I understand, with recycling on it is easy (?)
to recover, there is no PANIC crash, and no data corruption results.



If the result of hitting ENOSPC when creating or writing to a WAL file 
was that the database could become corrupted, then wouldn't that risk 
already be present (a) on any system, for the whole period from database 
init until the maximum number of WAL files was created, and (b) all the 
time on any copy-on-write filesystem?




I don't follow Alvaro's reasoning, TBH. There's a couple of things that 
confuse me ...


I don't quite see how reusing WAL segments actually protects against 
full filesystem? On "traditional" filesystems I would not expect any 
difference between "unlink+create" and reusing an existing file. On CoW 
filesystems (like ZFS or btrfs) the space management works very 
differently and reusing an existing file is unlikely to save anything.


But even if it reduces the likelihood of ENOSPC, it does not eliminate 
it entirely. max_wal_size is not a hard limit, and the disk may be 
filled by something else (when WAL is not on a separate device, when 
there is thin provisioning, etc.). So it's not a protection against
data corruption we could rely on. (And as was discussed in the recent 
fsync thread, ENOSPC is a likely source of past data corruption issues 
on NFS and possibly other filesystems.)


I might be missing something, of course.

AFAICS the original reason for reusing WAL segments was the belief that 
overwriting an existing file is faster than writing a new file. That 
might have been true in the past, but the question is if it's still true 
on current filesystems. The results posted here suggest it's not true on 
ZFS, at least.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-11 Thread David Pacheco
On Tue, Jul 10, 2018 at 10:32 PM, Thomas Munro <
thomas.mu...@enterprisedb.com> wrote:

> On Wed, Jul 11, 2018 at 8:25 AM, Joshua D. Drake 
> wrote:
> > On 07/10/2018 01:15 PM, Jerry Jelinek wrote:
> >>
> >> Thanks to everyone who took the time to look at the patch and send me
> >> feedback.  I'm happy to work on improving the documentation of this new
> >> tunable to clarify when it should be used and the implications. I'm
> trying
> >> to understand more specifically what else needs to be done next. To
> >> summarize, I think the following general concerns were brought up.
> >>
> >> For #6, there is no feasible way for us to recreate our workload on
> other
> >> operating systems or filesystems. Can anyone expand on what performance
> data
> >> is needed?
> >
> > I think a simple way to prove this would be to run BenchmarkSQL against
> > PostgreSQL in a default configuration with pg_xlog/pg_wal on a filesystem
> > that is COW (zfs) and then run another test where pg_xlog/pg_wal is
> patched
> > with your patch and new behavior and then run the test again.
> BenchmarkSQL
> > is a more thorough benchmarking tool than something like pg_bench and is
> > very easy to setup.
>
> I have a lowly but trusty HP Microserver running FreeBSD 11.2 with ZFS
> on spinning rust.  It occurred to me that such an anaemic machine
> might show this effect easily because its cold reads are as slow as a
> Lada full of elephants going uphill.  Let's see...
>
> # os setup
> sysctl vfs.zfs.arc_min=134217728
> sysctl vfs.zfs.arc_max=134217728
> zfs create zroot/data/test
> zfs set mountpoint=/data/test zroot/data/test
> zfs set compression=off zroot/data/test
> zfs set recordsize=8192 zroot/data/test
>
> # initdb into /data/test/pgdata, then set postgresql.conf up like this:
> fsync=off
> max_wal_size = 600MB
> min_wal_size = 600MB
>
> # small scale test, we're only interested in producing WAL, not db size
> pgbench -i -s 100 postgres
>
> # do this a few times first, to make sure we have lots of WAL segments
> pgbench -M prepared -c 4 -j 4 -T 60 postgres
>
> # now test...
>
> With wal_recycle=on I reliably get around 1100TPS and vmstat -w 10
> shows numbers like this:
>
> procs  memory   pagedisks faults cpu
> r b w  avm   fre   flt  re  pi  pofr   sr ad0 ad1   insycs us
> sy id
> 3 0 3 1.2G  3.1G  4496   0   0   052   76 144 138  607 84107 29713 55
> 17 28
> 4 0 3 1.2G  3.1G  2955   0   0   084   77 134 130  609 82942 34324 61
> 17 22
> 4 0 3 1.2G  3.1G  2327   0   0   0 0   77 114 125  454 83157 29638 68
> 15 18
> 5 0 3 1.2G  3.1G  1966   0   0   082   77  86  81  335 84480 25077 74
> 13 12
> 3 0 3 1.2G  3.1G  1793   0   0   0   533   74  72  68  310 127890 31370 77
> 16  7
> 4 0 3 1.2G  3.1G  1113   0   0   0   151   73  95  94  363 128302 29827 74
> 18  8
>
> With wal_recycle=off I reliably get around 1600TPS and vmstat -w 10
> shows numbers like this:
>
> procs  memory   pagedisks faults cpu
> r b w  avm   fre   flt  re  pi  pofr   sr ad0 ad1   insycs us
> sy id
> 0 0 3 1.2G  3.1G   148   0   0   0   402   71  38  38  153 16668  5656 10
> 3 87
> 5 0 3 1.2G  3.1G  4527   0   0   050   73  28  27  123 123986 23373 68
> 15 17
> 5 0 3 1.2G  3.1G  3036   0   0   0   151   73  47  49  181 148014 29412 83
> 16  0
> 4 0 3 1.2G  3.1G  2063   0   0   0   233   73  56  54  200 143018 28699 81
> 17  2
> 4 0 3 1.2G  3.1G  1202   0   0   095   73  48  49  189 147276 29196 81
> 18  1
> 4 0 3 1.2G  3.1G   732   0   0   0 0   73  56  55  207 146805 29265 82
> 17  1
>
> I don't have time to investigate further for now and my knowledge of
> ZFS is superficial, but the patch seems to have a clear beneficial
> effect, reducing disk IOs and page faults on my little storage box.
> Obviously this isn't representative of a proper server environment, or
> some other OS, but it's a clue.  That surprised me... I was quietly
> hoping it was going to be 'oh, if you turn off
> compression and use 8kb it doesn't happen because the pages line up'.
> But nope.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
>
Hi Thomas,

Thanks for testing!  It's validating that you saw the same results.

-- Dave


Re: patch to allow disable of WAL recycling

2018-07-11 Thread Andres Freund
Hi,

On 2018-07-10 14:15:30 -0600, Jerry Jelinek wrote:
>  Thanks to everyone who took the time to look at the patch and send me
> feedback.  I'm happy to work on improving the documentation of this new
> tunable to clarify when it should be used and the implications. I'm trying
> to understand more specifically what else needs to be done next. To
> summarize, I think the following general concerns were brought up.
> 
> 1) Disabling WAL recycling could have a negative performance impact on a
> COW filesystem if all WAL files could be kept in the filesystem cache.

> For #1, #2 and #3, I don't understand these concerns. It would be helpful
> if these could be more specific

We perform more writes (new files are zeroed, which needs to be
fsynced), and increase metadata traffic (creation of files), when not
recycling.
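
For reference, a simplified standalone sketch of the work involved in
initializing a brand-new segment (zero-fill in 8kB blocks, then fsync);
this illustrates the extra write and metadata traffic being described,
and is not the actual XLogFileInit() code.

/* Simplified sketch (not xlog.c): zero-fill and fsync a new 16MB segment. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)
#define BLCKSZ   8192

int
main(void)
{
    char    zbuf[BLCKSZ];
    int     fd;

    memset(zbuf, 0, sizeof(zbuf));

    fd = open("newseg.sketch", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* 2048 block-sized writes just to put zeroes on disk ... */
    for (int i = 0; i < SEG_SIZE / BLCKSZ; i++)
    {
        if (write(fd, zbuf, BLCKSZ) != BLCKSZ)
        {
            perror("write");
            return 1;
        }
    }

    /* ... plus an fsync, and the metadata for a newly created file */
    if (fsync(fd) != 0)
    {
        perror("fsync");
        return 1;
    }
    close(fd);
    return 0;
}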

Regards,

Andres



Re: patch to allow disable of WAL recycling

2018-07-11 Thread David Pacheco
On Tue, Jul 10, 2018 at 1:34 PM, Alvaro Herrera 
wrote:

> On 2018-Jul-10, Jerry Jelinek wrote:
>
> > 2) Disabling WAL recycling reduces reliability, even on COW filesystems.
>
> I think the problem here is that WAL recycling in normal filesystems
> helps protect the case where filesystem gets full.  If you remove it,
> that protection goes out the window.  You can claim that people need to
> make sure to have available disk space, but this does become a problem
> in practice.  I think the thing to do is verify what happens with
> recycling off when the disk gets full; is it possible to recover
> afterwards?  Is there any corrupt data?  What happens if the disk gets
> full just as the new WAL file is being created -- is there a Postgres
> PANIC or something?  As I understand, with recycling on it is easy (?)
> to recover, there is no PANIC crash, and no data corruption results.
>


If the result of hitting ENOSPC when creating or writing to a WAL file was
that the database could become corrupted, then wouldn't that risk already
be present (a) on any system, for the whole period from database init until
the maximum number of WAL files was created, and (b) all the time on any
copy-on-write filesystem?

Thanks,
Dave


Re: patch to allow disable of WAL recycling

2018-07-11 Thread Jerry Jelinek
Alvaro,

I'll perform several test runs with various combinations and post the
results.

Thanks,
Jerry


On Tue, Jul 10, 2018 at 2:34 PM, Alvaro Herrera 
wrote:

> On 2018-Jul-10, Jerry Jelinek wrote:
>
> > 2) Disabling WAL recycling reduces reliability, even on COW filesystems.
>
> I think the problem here is that WAL recycling in normal filesystems
> helps protect the case where filesystem gets full.  If you remove it,
> that protection goes out the window.  You can claim that people needs to
> make sure to have available disk space, but this does become a problem
> in practice.  I think the thing to do is verify what happens with
> recycling off when the disk gets full; is it possible to recover
> afterwards?  Is there any corrupt data?  What happens if the disk gets
> full just as the new WAL file is being created -- is there a Postgres
> PANIC or something?  As I understand, with recycling on it is easy (?)
> to recover, there is no PANIC crash, and no data corruption results.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: patch to allow disable of WAL recycling

2018-07-10 Thread Thomas Munro
On Wed, Jul 11, 2018 at 8:25 AM, Joshua D. Drake  wrote:
> On 07/10/2018 01:15 PM, Jerry Jelinek wrote:
>>
>> Thanks to everyone who took the time to look at the patch and send me
>> feedback.  I'm happy to work on improving the documentation of this new
>> tunable to clarify when it should be used and the implications. I'm trying
>> to understand more specifically what else needs to be done next. To
>> summarize, I think the following general concerns were brought up.
>>
>> For #6, there is no feasible way for us to recreate our workload on other
>> operating systems or filesystems. Can anyone expand on what performance data
>> is needed?
>
> I think a simple way to prove this would be to run BenchmarkSQL against
> PostgreSQL in a default configuration with pg_xlog/pg_wal on a filesystem
> that is COW (zfs) and then run another test where pg_xlog/pg_wal is patched
> with your patch and new behavior and then run the test again. BenchmarkSQL
> is a more thorough benchmarking tool than something like pg_bench and is
> very easy to set up.

I have a lowly but trusty HP Microserver running FreeBSD 11.2 with ZFS
on spinning rust.  It occurred to me that such an anaemic machine
might show this effect easily because its cold reads are as slow as a
Lada full of elephants going uphill.  Let's see...

# os setup
sysctl vfs.zfs.arc_min=134217728
sysctl vfs.zfs.arc_max=134217728
zfs create zroot/data/test
zfs set mountpoint=/data/test zroot/data/test
zfs set compression=off zroot/data/test
zfs set recordsize=8192 zroot/data/test

# initdb into /data/test/pgdata, then set postgresql.conf up like this:
fsync=off
max_wal_size = 600MB
min_wal_size = 600MB

# small scale test, we're only interested in producing WAL, not db size
pgbench -i -s 100 postgres

# do this a few times first, to make sure we have lots of WAL segments
pgbench -M prepared -c 4 -j 4 -T 60 postgres

# now test...

With wal_recycle=on I reliably get around 1100TPS and vmstat -w 10
shows numbers like this:

procs  memory   pagedisks faults cpu
r b w  avm   fre   flt  re  pi  pofr   sr ad0 ad1   insycs us sy id
3 0 3 1.2G  3.1G  4496   0   0   052   76 144 138  607 84107 29713 55 17 28
4 0 3 1.2G  3.1G  2955   0   0   084   77 134 130  609 82942 34324 61 17 22
4 0 3 1.2G  3.1G  2327   0   0   0 0   77 114 125  454 83157 29638 68 15 18
5 0 3 1.2G  3.1G  1966   0   0   082   77  86  81  335 84480 25077 74 13 12
3 0 3 1.2G  3.1G  1793   0   0   0   533   74  72  68  310 127890 31370 77 16  7
4 0 3 1.2G  3.1G  1113   0   0   0   151   73  95  94  363 128302 29827 74 18  8

With wal_recycle=off I reliably get around 1600TPS and vmstat -w 10
shows numbers like this:

procs  memory   pagedisks faults cpu
r b w  avm   fre   flt  re  pi  pofr   sr ad0 ad1   insycs us sy id
0 0 3 1.2G  3.1G   148   0   0   0   402   71  38  38  153 16668  5656 10  3 87
5 0 3 1.2G  3.1G  4527   0   0   050   73  28  27  123 123986 23373 68 15 17
5 0 3 1.2G  3.1G  3036   0   0   0   151   73  47  49  181 148014 29412 83 16  0
4 0 3 1.2G  3.1G  2063   0   0   0   233   73  56  54  200 143018 28699 81 17  2
4 0 3 1.2G  3.1G  1202   0   0   095   73  48  49  189 147276 29196 81 18  1
4 0 3 1.2G  3.1G   732   0   0   0 0   73  56  55  207 146805 29265 82 17  1

I don't have time to investigate further for now and my knowledge of
ZFS is superficial, but the patch seems to have a clear beneficial
effect, reducing disk IOs and page faults on my little storage box.
Obviously this isn't representative of a proper server environment, or
some other OS, but it's a clue.  That surprised me... I was quietly
hoping it was going to be 'oh, if you turn off
compression and use 8kb it doesn't happen because the pages line up'.
But nope.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: patch to allow disable of WAL recycling

2018-07-10 Thread Alvaro Herrera
On 2018-Jul-10, Jerry Jelinek wrote:

> 2) Disabling WAL recycling reduces reliability, even on COW filesystems.

I think the problem here is that WAL recycling in normal filesystems
helps protect the case where filesystem gets full.  If you remove it,
that protection goes out the window.  You can claim that people needs to
make sure to have available disk space, but this does become a problem
in practice.  I think the thing to do is verify what happens with
recycling off when the disk gets full; is it possible to recover
afterwards?  Is there any corrupt data?  What happens if the disk gets
full just as the new WAL file is being created -- is there a Postgres
PANIC or something?  As I understand, with recycling on it is easy (?)
to recover, there is no PANIC crash, and no data corruption results.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-10 Thread Joshua D. Drake

On 07/10/2018 01:15 PM, Jerry Jelinek wrote:
> Thanks to everyone who took the time to look at the patch and send me
> feedback.  I'm happy to work on improving the documentation of this
> new tunable to clarify when it should be used and the implications.
> I'm trying to understand more specifically what else needs to be done
> next. To summarize, I think the following general concerns were
> brought up.
>
> For #6, there is no feasible way for us to recreate our workload on
> other operating systems or filesystems. Can anyone expand on what
> performance data is needed?




I think a simple way to prove this would be to run BenchmarkSQL against 
PostgreSQL in a default configuration with pg_xlog/pg_wal on a 
filesystem that is COW (zfs) and then run another test where 
pg_xlog/pg_wal is patched with your patch and new behavior and then run 
the test again. BenchmarkSQL is a more thorough benchmarking tool than 
something like pg_bench and is very easy to set up.


The reason you would use a default configuration is because it will 
cause a huge amount of wal churn, although a test with a proper wal 
configuration would also be good.


Thanks,

JD



--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *




Re: patch to allow disable of WAL recycling

2018-07-10 Thread Jerry Jelinek
 Thanks to everyone who took the time to look at the patch and send me
feedback.  I'm happy to work on improving the documentation of this new
tunable to clarify when it should be used and the implications. I'm trying
to understand more specifically what else needs to be done next. To
summarize, I think the following general concerns were brought up.

1) Disabling WAL recycling could have a negative performance impact on a
COW filesystem if all WAL files could be kept in the filesystem cache.
2) Disabling WAL recycling reduces reliability, even on COW filesystems.
3) Using something like posix_fadvise to reload recycled WAL files into the
filesystem cache is better even for a COW filesystem.
4) There are "several" other purposes for WAL recycling which this tunable
would impact.
5) A WAL recycling tunable is too specific and a more general solution is
needed.
6) Need more performance data.

For #1, #2 and #3, I don't understand these concerns. It would be helpful
if these could be made more specific.

For #4, can anybody enumerate these other purposes for WAL recycling?

For #5, perhaps I am making an incorrect assumption about what the original
response was requesting, but I understand that WAL recycling is just one
aspect of WAL file creation/allocation. However, the creation of a new WAL
file is not a problem we've ever observed. In general, any modern
filesystem should do a good job of caching recently accessed files. We've
never observed a problem with the allocation of a new WAL file slightly
before it is needed. The problem we have observed is specifically around
WAL file recycling when we have to access old files that are long gone from
the filesystem cache. The semantics around recycling seem pretty crisp as
compared to some other tunable which would completely change how WAL files
are created. Given that a change like that is also much more intrusive, it
seems better to provide a tunable to disable WAL recycling vs. some other
kind of tunable for which we can't articulate any improvement except in the
recycling scenario.

For #6, there is no feasible way for us to recreate our workload on other
operating systems or filesystems. Can anyone expand on what performance
data is needed?

I'd like to restate the original problem we observed.

When PostgreSQL decides to reuse an old WAL file whose contents have been
evicted from the cache (because they haven't been used in hours), this
turns what should be a workload bottlenecked by synchronous write
performance (that can be well-optimized with an SSD log device) into a
random read workload (that's much more expensive for any system). What's
significantly worse is that we saw this on synchronous standbys. When that
happened, the WAL receiver was blocked on a random read from disk, and
since it's single-threaded, all write queries on the primary stop until the
random read finishes. This is particularly bad for us when the sync is
doing other I/O (e.g., for an autovacuum or a database backup) that causes
disk reads to take hundreds of milliseconds.

To summarize, recycling old WAL files seems like an optimization designed
for certain filesystems that allocate disk blocks up front. Given that the
existing behavior is already filesystem specific, are there specific reasons
why we can't provide a tunable to disable this behavior for filesystems
which don't behave that way?
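
For what it's worth, the knob only changes one decision at segment-cleanup
time. A very simplified standalone sketch of that decision (not the actual
xlog.c code, which also enforces recycle limits; the segment names here are
hypothetical):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool wal_recycle = true;   /* the proposed setting, default on */

static void remove_old_segment(const char *oldpath, const char *futurepath)
{
    if (wal_recycle)
    {
        /* keep the file around under a future segment name so it can be
         * overwritten later instead of allocating a new one */
        if (rename(oldpath, futurepath) != 0)
            perror("rename");
    }
    else
    {
        /* COW-friendly path: just remove it; a fresh segment gets created
         * on demand when it is actually needed */
        if (unlink(oldpath) != 0)
            perror("unlink");
    }
}

int main(void)
{
    remove_old_segment("pg_wal/000000010000000000000010",
                       "pg_wal/000000010000000000000020");
    return 0;
}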

Thanks again,
Jerry


On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek 
wrote:

> Hello All,
>
> Attached is a patch to provide an option to disable WAL recycling. We have
> found that this can help performance by eliminating read-modify-write
> behavior on old WAL files that are no longer resident in the filesystem
> cache. There is a lot more detail on the background of the motivation for
> this in the following thread.
>
> https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3
> XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#CACukRjO7DJvub8e2AijOayj8BfKK3
> xxbtwu3kkaritr67m3...@mail.gmail.com
>
> A similar change has been tested against our 9.6 branch that we're
> currently running, but the attached patch is against master.
>
> Thanks,
> Jerry
>
>


Re: patch to allow disable of WAL recycling

2018-07-06 Thread Jerry Jelinek
Thomas,

We're using a zfs recordsize of 8k to match the PG blocksize of 8k, so what
you're describing is not the issue here.

Thanks,
Jerry


On Thu, Jul 5, 2018 at 3:44 PM, Thomas Munro 
wrote:

> On Fri, Jul 6, 2018 at 3:37 AM, Jerry Jelinek 
> wrote:
> >> If the problem is specifically the file system caching behavior, then we
> >> could also consider using the dreaded posix_fadvise().
> >
> > I'm not sure that solves the problem for non-cached files, which is where
> > we've observed the performance impact of recycling, where what should be
> a
> > write intensive workload turns into a read-modify-write workload because
> > we're now reading an old WAL file that is many hours, or even days, old
> and
> > has thus fallen out of the memory-cached data for the filesystem. The
> disk
> > reads still have to happen.
>
> What ZFS record size are you using?  PostgreSQL's XLOG_BLCKSZ is usually
> 8192 bytes.  When XLogWrite() calls write(some multiple of XLOG_BLCKSZ), on
> a traditional filesystem the kernel will say 'oh, that's overwriting whole
> pages exactly, so I have no need to read it from disk' (for example in
> FreeBSD ffs_vnops.c ffs_write() see the comment "We must perform a
> read-before-write if the transfer size does not cover the entire buffer").
> I assume ZFS has a similar optimisation, but it uses much larger records
> than the traditional 4096 byte pages, defaulting to 128KB.  Is that the
> reason for this?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Re: patch to allow disable of WAL recycling

2018-07-05 Thread Thomas Munro
On Fri, Jul 6, 2018 at 3:37 AM, Jerry Jelinek 
wrote:
>> If the problem is specifically the file system caching behavior, then we
>> could also consider using the dreaded posix_fadvise().
>
> I'm not sure that solves the problem for non-cached files, which is where
> we've observed the performance impact of recycling, where what should be a
> write intensive workload turns into a read-modify-write workload because
> we're now reading an old WAL file that is many hours, or even days, old
and
> has thus fallen out of the memory-cached data for the filesystem. The disk
> reads still have to happen.

What ZFS record size are you using?  PostgreSQL's XLOG_BLCKSZ is usually
8192 bytes.  When XLogWrite() calls write(some multiple of XLOG_BLCKSZ), on
a traditional filesystem the kernel will say 'oh, that's overwriting whole
pages exactly, so I have no need to read it from disk' (for example in
FreeBSD ffs_vnops.c ffs_write() see the comment "We must perform a
read-before-write if the transfer size does not cover the entire buffer").
I assume ZFS has a similar optimisation, but it uses much larger records
than the traditional 4096 byte pages, defaulting to 128KB.  Is that the
reason for this?
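
For illustration, a tiny standalone program (not PostgreSQL code; the
segment name is hypothetical and the file has to exist already) showing the
write pattern in question: one XLOG_BLCKSZ-sized write at an 8 kB-aligned
offset, which a traditional filesystem can apply without reading the old
contents, while a 128 kB ZFS record would still need to be read in first:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192

int main(void)
{
    char  page[XLOG_BLCKSZ];
    int   fd = open("000000010000000000000001", O_WRONLY);  /* hypothetical old segment */
    off_t offset = 5 * XLOG_BLCKSZ;                          /* 8 kB-aligned offset */

    if (fd < 0) { perror("open"); return 1; }
    memset(page, 0xAB, sizeof(page));
    /* overwrites exactly one 8 kB-aligned block: no read is needed if the
     * filesystem block/record size is <= 8 kB, read-modify-write otherwise */
    if (pwrite(fd, page, sizeof(page), offset) != sizeof(page))
    { perror("pwrite"); return 1; }
    close(fd);
    return 0;
}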

-- 
Thomas Munro
http://www.enterprisedb.com


Re: patch to allow disable of WAL recycling

2018-07-05 Thread Andres Freund
Hi,

On 2018-06-26 07:35:57 -0600, Jerry Jelinek wrote:
> + 
> +  wal_recycle (boolean)
> +  
> +   wal_recycle configuration parameter
> +  
> +  
> +  
> +   
> +When this parameter is on, past log file segments
> +in the pg_wal directory are recycled for future
> +use.
> +   
> +
> +   
> +Turning this parameter off causes past log file segments to be deleted
> +when no longer needed. This can improve performance by eliminating
> +read-modify-write operations on old files which are no longer in the
> +filesystem cache.
> +   
> +  
> + 

This is formulated *WAY* too positive. It'll have a dramatic *NEGATIVE*
performance impact on non-COW filesystems, and very likely even negative
impacts in a number of COWed scenarios (when there's enough memory to
keep all WAL files in memory).

I still think that fixing this another way would be preferable. This'll
be too much of a magic knob that depends on the fs, hardware and
workload.

Greetings,

Andres Freund



Re: patch to allow disable of WAL recycling

2018-07-05 Thread Peter Eisentraut
On 05.07.18 17:37, Jerry Jelinek wrote:
>> Your patch describes this feature as a performance feature.  We would
>> need to see more measurements about what this would do on other
>> platforms and file systems than your particular one.  Also, we need to
>> be careful with user options that trade off reliability for performance
>> and describe them in much more detail.
> 
> I don't think this change really impacts the reliability of PG, since PG
> doesn't actually preallocate all of the WAL files. I think PG will
> allocate WAL files as it runs, up to the  wal_keep_segments limit, at
> which point it would start recycling. If the filesystem fills up before
> that limit is reached, PG would have to handle the filesystem being full
> when attempting to allocate a new WAL file (as it would with my change
> if WAL recycling is disabled). Of course once all of the WAL files have
> finally been allocated, then PG won't need additional space on a non-COW
> filesystem. I'd be happy to add more details to the man page change
> describing this new option and the implications if the underlying
> filesystem fills up.

The point is, the WAL recycling has a purpose, perhaps several.  If it
didn't have one, we wouldn't do it.  So if we add an option to turn it
off to get performance gains, we have to do some homework.

>> If the problem is specifically the file system caching behavior, then we
>> could also consider using the dreaded posix_fadvise().
> 
> I'm not sure that solves the problem for non-cached files, which is
> where we've observed the performance impact of recycling, where what
> should be a write intensive workload turns into a read-modify-write
> workload because we're now reading an old WAL file that is many hours,
> or even days, old and has thus fallen out of the memory-cached data for
> the filesystem. The disk reads still have to happen.

But they could happen ahead of time.
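
A minimal sketch of that "read ahead of time" idea (not PostgreSQL code; the
path is hypothetical, and posix_fadvise() is only a hint that some
filesystems ignore): ask the kernel to pull an about-to-be-recycled segment
back into cache before it gets overwritten:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "pg_wal/000000010000000000000042";  /* hypothetical segment */
    int fd = open(path, O_RDONLY);

    if (fd < 0) { perror("open"); return 1; }
    /* hint: we'll need the whole file soon; offset 0, length 0 = entire file */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) != 0)
        fprintf(stderr, "posix_fadvise hint not honored here\n");
    close(fd);
    return 0;
}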

>> Then again, I can understand that turning off WAL recycling is sensible
>> on ZFS, since there is no point in preallocating space that will never
>> be used.  But then we should also turn off all other preallocation of
>> WAL files, including the creation of new (non-recycled) ones.
> 
> I don't think we'd see any benefit from that (since the newly allocated
> file is certainly cached), and the change would be much more intrusive,
> so I opted for the trivial change in the patch I proposed.

The change would be more invasive, but I think it would ultimately make
the code more clear and maintainable and the user interfaces more
understandable in the long run.  I think that would be better than a
slightly ad hoc knob that fixed one particular workload once upon a time.

But we're probably not there yet.  We should start with a more detailed
performance analysis of the originally proposed patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2018-07-05 Thread Jerry Jelinek
Peter,

Thanks for taking a look at this. I have a few responses in line. I am not a
PG expert, so if there is something here that I've misunderstood, please
let me know.

On Sun, Jul 1, 2018 at 6:54 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 26.06.18 15:35, Jerry Jelinek wrote:
> > Attached is a patch to provide an option to disable WAL recycling. We
> > have found that this can help performance by eliminating
> > read-modify-write behavior on old WAL files that are no longer resident
> > in the filesystem cache. There is a lot more detail on the background of
> > the motivation for this in the following thread.
>
> Your patch describes this feature as a performance feature.  We would
> need to see more measurements about what this would do on other
> platforms and file systems than your particular one.  Also, we need to
> be careful with user options that trade off reliability for performance
> and describe them in much more detail.
>

I don't think this change really impacts the reliability of PG, since PG
doesn't actually preallocate all of the WAL files. I think PG will allocate
WAL files as it runs, up to the  wal_keep_segments limit, at which point it
would start recycling. If the filesystem fills up before that limit is
reached, PG would have to handle the filesystem being full when attempting
to allocate a new WAL file (as it would with my change if WAL recycling is
disabled). Of course once all of the WAL files have finally been allocated,
then PG won't need additional space on a non-COW filesystem. I'd be happy
to add more details to the man page change describing this new option and
the implications if the underlying filesystem fills up.


> If the problem is specifically the file system caching behavior, then we
> could also consider using the dreaded posix_fadvise().
>

I'm not sure that solves the problem for non-cached files, which is where
we've observed the performance impact of recycling, where what should be a
write intensive workload turns into a read-modify-write workload because
we're now reading an old WAL file that is many hours, or even days, old and
has thus fallen out of the memory-cached data for the filesystem. The disk
reads still have to happen.


>
> Then again, I can understand that turning off WAL recycling is sensible
> on ZFS, since there is no point in preallocating space that will never
> be used.  But then we should also turn off all other preallocation of
> WAL files, including the creation of new (non-recycled) ones.
>

I don't think we'd see any benefit from that (since the newly allocated
file is certainly cached), and the change would be much more intrusive, so
I opted for the trivial change in the patch I proposed.


>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Thanks again,
Jerry


Re: patch to allow disable of WAL recycling

2018-07-01 Thread Peter Eisentraut
On 26.06.18 15:35, Jerry Jelinek wrote:
> Attached is a patch to provide an option to disable WAL recycling. We
> have found that this can help performance by eliminating
> read-modify-write behavior on old WAL files that are no longer resident
> in the filesystem cache. There is a lot more detail on the background of
> the motivation for this in the following thread.

Your patch describes this feature as a performance feature.  We would
need to see more measurements about what this would do on other
platforms and file systems than your particular one.  Also, we need to
be careful with user options that trade off reliability for performance
and describe them in much more detail.

If the problem is specifically the file system caching behavior, then we
could also consider using the dreaded posix_fadvise().

Then again, I can understand that turning off WAL recycling is sensible
on ZFS, since there is no point in preallocating space that will never
be used.  But then we should also turn off all other preallocation of
WAL files, including the creation of new (non-recycled) ones.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services