Re: adding 'zstd' as a compression algorithm

2022-02-19 Thread Andres Freund
On 2022-02-18 09:52:52 -0500, Robert Haas wrote:
> plus a second patch to change the occurrences above to use AC_CHECK_HEADER()
> and remove all traces of the corresponding preprocessor symbol.

LGTM.  I'm not entirely sure the gssapi one is a real improvement, because we
kind of test for that branch, via the #else in HAVE_GSSAPI_H. But on balance,
I'd probably go for it.




Re: adding 'zstd' as a compression algorithm

2022-02-18 Thread Robert Haas
On Fri, Feb 18, 2022 at 9:52 AM Robert Haas  wrote:
> So here's a revised patch for zstd support that uses
> AC_CHECK_HEADER(), plus a second patch to change the occurrences above
> to use AC_CHECK_HEADER() and remove all traces of the corresponding
> preprocessor symbol.

And I've now committed the first of these.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-18 Thread Robert Haas
On Fri, Feb 18, 2022 at 9:24 AM Robert Haas  wrote:
> On Fri, Feb 18, 2022 at 9:02 AM Robert Haas  wrote:
> > Oh wait ... you want it the other way. Yeah, that seems harmless to
> > change. I wonder how many others there are that could be changed
> > similarly...
>
> I went through configure.ac looking for instances of
> AC_CHECK_HEADERS() where the corresponding symbol was not used. I
> found four:
>
> AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is
> required for LZ4])])
> AC_CHECK_HEADERS(gssapi/gssapi.h, [], [AC_MSG_ERROR([gssapi.h header
> file is required for GSSAPI])])])
> AC_CHECK_HEADERS(ldap.h, [], [AC_MSG_ERROR([header file  is
> required for LDAP])]
> AC_CHECK_HEADER(winldap.h, [], ...stuff...)
>
> I guess we could clean all of those up similarly.

So here's a revised patch for zstd support that uses
AC_CHECK_HEADER(), plus a second patch to change the occurrences above
to use AC_CHECK_HEADER() and remove all traces of the corresponding
preprocessor symbol.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v4-0001-Add-support-for-building-with-ZSTD.patch
Description: Binary data


v4-0002-Demote-AC_CHECK_HEADERS-calls-to-AC_CHECK_HEADER-.patch
Description: Binary data


Re: adding 'zstd' as a compression algorithm

2022-02-18 Thread Robert Haas
On Fri, Feb 18, 2022 at 9:02 AM Robert Haas  wrote:
> Oh wait ... you want it the other way. Yeah, that seems harmless to
> change. I wonder how many others there are that could be changed
> similarly...

I went through configure.ac looking for instances of
AC_CHECK_HEADERS() where the corresponding symbol was not used. I
found four:

AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is
required for LZ4])])
AC_CHECK_HEADERS(gssapi/gssapi.h, [], [AC_MSG_ERROR([gssapi.h header
file is required for GSSAPI])])])
AC_CHECK_HEADERS(ldap.h, [], [AC_MSG_ERROR([header file  is
required for LDAP])]
AC_CHECK_HEADER(winldap.h, [], ...stuff...)

I guess we could clean all of those up similarly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-18 Thread Robert Haas
On Fri, Feb 18, 2022 at 8:48 AM Robert Haas  wrote:
> On Thu, Feb 17, 2022 at 8:36 PM Andres Freund  wrote:
> > On 2022-02-17 13:34:08 +0900, Michael Paquier wrote:
> > > %define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
> > > this version fails the sanity check between pg_config.h.in and the
> > > MSVC scripts checking that all flags exist.
> >
> > Do we really need all three defines? How about using AC_CHECK_HEADER() 
> > instead
> > of AC_CHECK_HEADERS()? That wouldn't define HAVE_ZSTD_H. Cases where we 
> > error
> > out if a header isn't found make it a bit pointless to then still define
> > HAVE_*_H.  Plenty other cases in configure.ac just use AC_CHECK_HEADER.
>
> I have to admit to being somewhat confused by the apparent lack of
> consistency in the way we do configure checks. The ZSTD check we added
> here is just based on the LZ4 check just above it, which was the
> result of my commit of Dilip's patch to add LZ4 TOAST compression. So
> if we want to do something different we should change them both. But
> that just begs the question of why the LZ4 support looks the way it
> does, and to be honest I don't recall. The zlib and XSLT formulas just
> above are much simpler, but for some reason what we're doing here
> seems to be based on the more-complex formula we use for XML support
> instead of either of those.
>
> But having said that, the proposed patch adds no new call to
> AC_CHECK_HEADER(), and does add a new call to AC_CHECK_HEADERS(), so I
> don't understand the specifics of your complaint here.

Oh wait ... you want it the other way. Yeah, that seems harmless to
change. I wonder how many others there are that could be changed
similarly...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-18 Thread Robert Haas
On Thu, Feb 17, 2022 at 8:36 PM Andres Freund  wrote:
> On 2022-02-17 13:34:08 +0900, Michael Paquier wrote:
> > %define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
> > this version fails the sanity check between pg_config.h.in and the
> > MSVC scripts checking that all flags exist.
>
> Do we really need all three defines? How about using AC_CHECK_HEADER() instead
> of AC_CHECK_HEADERS()? That wouldn't define HAVE_ZSTD_H. Cases where we error
> out if a header isn't found make it a bit pointless to then still define
> HAVE_*_H.  Plenty other cases in configure.ac just use AC_CHECK_HEADER.

I have to admit to being somewhat confused by the apparent lack of
consistency in the way we do configure checks. The ZSTD check we added
here is just based on the LZ4 check just above it, which was the
result of my commit of Dilip's patch to add LZ4 TOAST compression. So
if we want to do something different we should change them both. But
that just begs the question of why the LZ4 support looks the way it
does, and to be honest I don't recall. The zlib and XSLT formulas just
above are much simpler, but for some reason what we're doing here
seems to be based on the more-complex formula we use for XML support
instead of either of those.

But having said that, the proposed patch adds no new call to
AC_CHECK_HEADER(), and does add a new call to AC_CHECK_HEADERS(), so I
don't understand the specifics of your complaint here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-18 Thread Robert Haas
On Thu, Feb 17, 2022 at 8:18 PM Michael Paquier  wrote:
> On Thu, Feb 17, 2022 at 09:40:13AM -0500, Robert Haas wrote:
> > Ah, OK, cool. It seems like we have consensus on this being a good
> > direction, so I plan to commit this later today unless objections or
> > additional review comments turn up.
>
> So, will there be a part of the system where we'll make use of that in
> 15?  pg_basebackup for server-side and client-side compression, at
> least, as far as I can guess?

That's my plan, yes. I think the patches only need a small amount of
work at this point, but I'd like to get this part committed first.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-17 Thread Andres Freund
Hi,

On 2022-02-17 13:34:08 +0900, Michael Paquier wrote:
> %define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
> this version fails the sanity check between pg_config.h.in and the
> MSVC scripts checking that all flags exist.

Do we really need all three defines? How about using AC_CHECK_HEADER() instead
of AC_CHECK_HEADERS()? That wouldn't define HAVE_ZSTD_H. Cases where we error
out if a header isn't found make it a bit pointless to then still define
HAVE_*_H.  Plenty other cases in configure.ac just use AC_CHECK_HEADER.

Greetings,

Andres Freund




Re: adding 'zstd' as a compression algorithm

2022-02-17 Thread Michael Paquier
On Thu, Feb 17, 2022 at 09:40:13AM -0500, Robert Haas wrote:
> Ah, OK, cool. It seems like we have consensus on this being a good
> direction, so I plan to commit this later today unless objections or
> additional review comments turn up.

So, will there be a part of the system where we'll make use of that in
15?  pg_basebackup for server-side and client-side compression, at
least, as far as I can guess?
--
Michael


signature.asc
Description: PGP signature


Re: adding 'zstd' as a compression algorithm

2022-02-17 Thread Robert Haas
On Wed, Feb 16, 2022 at 11:34 PM Michael Paquier  wrote:
> Thanks.  This looks pretty much right, except for two things that I
> have taken the freedom to fix as of the v3 attached.

Ah, OK, cool. It seems like we have consensus on this being a good
direction, so I plan to commit this later today unless objections or
additional review comments turn up.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-16 Thread Andrey Borodin



> 15 февр. 2022 г., в 23:20, Robert Haas  написал(а):
> 
> Anyway, those are my thoughts. What are yours?

+1 for adding Zstd everywhere. Quite similar patches are already a part of 
"libpq compression" and "zstd for pg_dump" commitfest entries, so pushing this 
will simplify those entries IMV.

Also I like the idea of defaulting FPI compression to lz4 and adding Zstd as 
wal_compression option.
I don't think Zstd is obvious choice for default FPI compression: there are HW 
setups when disks are almost as fast as RAM. In this case lz4 can improve 
performance, while Zstd might make things slower.

Thanks!

Best regards, Andrey Borodin.



Re: adding 'zstd' as a compression algorithm

2022-02-16 Thread Michael Paquier
On Wed, Feb 16, 2022 at 10:40:01AM -0500, Robert Haas wrote:
> On Tue, Feb 15, 2022 at 9:33 PM Michael Paquier  wrote:
>> Yes, the patch misses the fact that each ./configure switch is
>> documented.  For consistency, I think that you should also add that in
>> the MSVC scripts in the first version.  It is really straight-forward
>> to do so, and it should be just a matter of grepping for the code
>> paths of lz4, then adjust things for zstd.
> 
> Makes sense. Here's an attempt by me to do that.

Thanks.  This looks pretty much right, except for two things that I
have taken the freedom to fix as of the v3 attached.

%define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
this version fails the sanity check between pg_config.h.in and the
MSVC scripts checking that all flags exist.

@@ -351,6 +351,7 @@ XGETTEXT = @XGETTEXT@
 GZIP   = gzip
 BZIP2  = bzip2
 LZ4= @LZ4@
+ZSTD   = @ZSTD@
A similar refresh is needed in vcregress.pl.

+   $proj->AddLibrary($self->{options}->{zstd} . '\lib\libzstd.lib');
The upstream code is also using this library name, so that should be
fine.
--
Michael
From 54485782e931c9536de2fdf7dbdfd0bb5c0c632d Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Wed, 16 Feb 2022 10:36:36 -0500
Subject: [PATCH v3] Add support for building with ZSTD.

This commit doesn't actually add anything that uses ZSTD; that will be
done separately. It just puts the basic infrastructure into place.

Jeevan Ladhe and Robert Haas
---
 src/include/pg_config.h.in|   9 +
 doc/src/sgml/install-windows.sgml |   9 +
 doc/src/sgml/installation.sgml|   9 +
 configure | 271 ++
 configure.ac  |  33 
 src/Makefile.global.in|   1 +
 src/tools/msvc/Solution.pm|  15 ++
 src/tools/msvc/config_default.pl  |   1 +
 src/tools/msvc/vcregress.pl   |   1 +
 9 files changed, 349 insertions(+)

diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 28a1f0e9f0..1912cf35de 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -352,6 +352,9 @@
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
+/* Define to 1 if you have the `zstd' library (-lzstd). */
+#undef HAVE_LIBZSTD
+
 /* Define to 1 if you have the `link' function. */
 #undef HAVE_LINK
 
@@ -718,6 +721,9 @@
 /* Define to 1 if the assembler supports X86_64's POPCNTQ instruction. */
 #undef HAVE_X86_64_POPCNTQ
 
+/* Define to 1 if you have the  header file. */
+#undef HAVE_ZSTD_H
+
 /* Define to 1 if the system has the type `_Bool'. */
 #undef HAVE__BOOL
 
@@ -949,6 +955,9 @@
 /* Define to select Win32-style shared memory. */
 #undef USE_WIN32_SHARED_MEMORY
 
+/* Define to 1 to build with ZSTD support. (--with-zstd) */
+#undef USE_ZSTD
+
 /* Define to 1 if `wcstombs_l' requires . */
 #undef WCSTOMBS_L_IN_XLOCALE
 
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index 30dd0c7f75..d2f63db3f2 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -307,6 +307,15 @@ $ENV{MSBFLAGS}="/m";
  
 
 
+
+ ZSTD
+ 
+  Required for supporting ZSTD compression
+  method. Binaries and source can be downloaded from
+  https://github.com/facebook/zstd/releases;>.
+ 
+
+
 
  OpenSSL
  
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 655095f3b1..c6190f6955 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -989,6 +989,15 @@ build-postgresql:

   
 
+  
+   --with-zstd
+   
+
+ Build with ZSTD compression support.
+
+   
+  
+
   
--with-ssl=LIBRARY

diff --git a/configure b/configure
index 930658..f07f689f1a 100755
--- a/configure
+++ b/configure
@@ -650,6 +650,7 @@ CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
 have_win32_dbghelp
 LIBOBJS
+ZSTD
 LZ4
 UUID_LIBS
 LDAP_LIBS_BE
@@ -700,6 +701,9 @@ with_gnu_ld
 LD
 LDFLAGS_SL
 LDFLAGS_EX
+ZSTD_LIBS
+ZSTD_CFLAGS
+with_zstd
 LZ4_LIBS
 LZ4_CFLAGS
 with_lz4
@@ -869,6 +873,7 @@ with_libxslt
 with_system_tzdata
 with_zlib
 with_lz4
+with_zstd
 with_gnu_ld
 with_ssl
 with_openssl
@@ -898,6 +903,8 @@ XML2_CFLAGS
 XML2_LIBS
 LZ4_CFLAGS
 LZ4_LIBS
+ZSTD_CFLAGS
+ZSTD_LIBS
 LDFLAGS_EX
 LDFLAGS_SL
 PERL
@@ -1577,6 +1584,7 @@ Optional Packages:
   use system time zone data in DIR
   --without-zlib  do not use Zlib
   --with-lz4  build with LZ4 support
+  --with-zstd build with ZSTD support
   --with-gnu-ld   assume the C compiler uses GNU ld [default=no]
   --with-ssl=LIB  use LIB for SSL/TLS support (openssl)
   --with-openssl  obsolete spelling of --with-ssl=openssl
@@ -1606,6 +1614,8 @@ Some influential environment variables:
   XML2_LIBS   linker flags for XML2, overriding pkg-config
   LZ4_CFLAGS  C compiler flags for LZ4, overriding 

Re: adding 'zstd' as a compression algorithm

2022-02-16 Thread Michael Paquier
On Tue, Feb 15, 2022 at 06:24:13PM -0800, Andres Freund wrote:
> For backups it's pretty obviously zstd imo. At the lower levels it achieves
> considerably higher compression ratios while still being vastly faster than
> gzip. Without even using the threaded compression support the library has.

Noted.

> For something like wal_compression it'd be a harder question.

FWIW, I have done some measurements for wal_compression with zstd, as
of:
https://www.postgresql.org/message-id/YMmlvyVyAFlxZ+/h...@paquier.xyz

The result is not surprising, a bit more CPU for zstd with more
compression compared to LZ4, both outclassing easily zlib.  I am not
sure which one would be more adapted as default as FPI patterns depend
on the workload, for one, and this is just one corner case.
--
Michael


signature.asc
Description: PGP signature


Re: adding 'zstd' as a compression algorithm (initdb/lz4)

2022-02-16 Thread Justin Pryzby
On Tue, Feb 15, 2022 at 11:54:10AM -0800, Andres Freund wrote:
> > Isn't it an incontrovertible fact that LZ4 is superior to pglz in
> > every way? LZ4 is pretty much its successor. And so it seems totally
> > fine to assume that users will always want to use the clearly better
> > option, and that that option will be generally available going
> > forward. TOAST compression is applied selectively already, based on
> > various obscure implementation details, some of which are quite
> > arbitrary.
> 
> Yea, we should really default to lz4 in initdb when available.

This patch intends to implement that.  I have no particular interest in this,
but if anyone wants, I will add it to the next CF (or the one after that).

commit 2a3c5950e625ccfaebc49bbf71b8db16dc143cd2
Author: Justin Pryzby 
Date:   Tue Feb 15 19:14:33 2022 -0600

initdb: default_toast_compression=lz4 if available

TODO: consider same for wal_compression

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc63172efde..f9cd2ef7229 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8537,7 +8537,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH 
csv;
 The supported compression methods are pglz and
 (if PostgreSQL was compiled with
 --with-lz4) lz4.
-The default is pglz.
+The default is lz4 if available at the time 
+PostgreSQL was compiled, otherwise
+pglz.

   
  
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f505413a7f9..6b6f6efaba1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4726,7 +4726,11 @@ static struct config_enum ConfigureNamesEnum[] =
NULL
},
_toast_compression,
+#ifdef USE_LZ4
+   TOAST_LZ4_COMPRESSION,
+#else
TOAST_PGLZ_COMPRESSION,
+#endif
default_toast_compression_options,
NULL, NULL, NULL
},
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index d78e8e67b8d..bb7c57e00fa 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1185,6 +1185,12 @@ setup_config(void)
  
"#update_process_title = off");
 #endif
 
+#ifdef USE_LZ4
+   conflines = replace_token(conflines,
+ 
"#default_toast_compression = 'pglz'",
+ 
"#default_toast_compression = 'lz4'");
+#endif
+
/*
 * Change password_encryption setting to md5 if md5 was chosen as an
 * authentication method, unless scram-sha-256 was also chosen.




Re: adding 'zstd' as a compression algorithm

2022-02-16 Thread Robert Haas
On Tue, Feb 15, 2022 at 9:33 PM Michael Paquier  wrote:
> On Tue, Feb 15, 2022 at 02:33:32PM -0600, Justin Pryzby wrote:
> > What I mean is that any patches which *use* the zstd support should update
> > those for completeness.
> >
> >  doc/src/sgml/config.sgml  | 14 +-
> >  doc/src/sgml/install-windows.sgml |  2 +-
> >  doc/src/sgml/installation.sgml|  5 +++--
>
> Yes, the patch misses the fact that each ./configure switch is
> documented.  For consistency, I think that you should also add that in
> the MSVC scripts in the first version.  It is really straight-forward
> to do so, and it should be just a matter of grepping for the code
> paths of lz4, then adjust things for zstd.

Makes sense. Here's an attempt by me to do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Add-support-for-building-with-ZSTD.patch
Description: Binary data


Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Andres Freund
Hi,m

On 2022-02-15 20:56:15 -0500, Tom Lane wrote:
> Maybe we have a bit more flexibility for TOAST, not sure.

It'd be nice to at least add it as an option for initdb. Afaics there's no way
to change the default at that point. initdb itself is measurably
faster. Although sadly it's a bigger difference for optimized builds.

I think it matter a bit even after initdb, pg_rewrite is most of the
compressed data and some of the views are regularly used...

- Andres




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Michael Paquier
On Tue, Feb 15, 2022 at 02:33:32PM -0600, Justin Pryzby wrote:
> What I mean is that any patches which *use* the zstd support should update
> those for completeness.
> 
>  doc/src/sgml/config.sgml  | 14 +-
>  doc/src/sgml/install-windows.sgml |  2 +-
>  doc/src/sgml/installation.sgml|  5 +++--

Yes, the patch misses the fact that each ./configure switch is
documented.  For consistency, I think that you should also add that in
the MSVC scripts in the first version.  It is really straight-forward
to do so, and it should be just a matter of grepping for the code
paths of lz4, then adjust things for zstd.
--
Michael


signature.asc
Description: PGP signature


Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Andres Freund
Hi,

On 2022-02-16 09:58:44 +0900, Michael Paquier wrote:
> On Tue, Feb 15, 2022 at 01:42:55PM -0800, Andres Freund wrote:
> > I think well before removing stuff we should default to decent compression
> > algorithms. E.g. -Z/--compress should probably not continue to use gzip if
> > better things are available.
> 
> Which one of lz4 or zstd would it be though?  LZ4 is light on CPU, and
> compresses less than zstd which is more CPU intensive with more
> compression, so the choice does not look that straight-forward to me.

For backups it's pretty obviously zstd imo. At the lower levels it achieves
considerably higher compression ratios while still being vastly faster than
gzip. Without even using the threaded compression support the library has.

For something like wal_compression it'd be a harder question.

Greetings,

Andres Freund




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Tom Lane
Robert Haas  writes:
> On Tue, Feb 15, 2022 at 7:09 PM David Steele  wrote:
>> This is even more true for zstd since it is not as ubiquitous as lz4. In
>> fact, it is not even included with base RHEL8. You need to install EPEL
>> to get zstd.

> Yeah, I agree. One thing I was thinking about but didn't include in
> the previous email is that if we did choose to make something like LZ4
> the default, it would presumably only be the default on builds that
> include LZ4 support. Other builds would need to use something else,
> unless we just chose to make LZ4 a hard requirement, which would be
> bolder than we usually are. And that has the consequence that you
> mention. It's something we should consider as we think about changing
> defaults.

Yeah.  I'm +1 on adding zstd as an option, but I think we need to
move *very* slowly on changing any defaults for user-accessible
data (like backup files).  Maybe we have a bit more flexibility
for TOAST, not sure.

regards, tom lane




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Robert Haas
On Tue, Feb 15, 2022 at 7:09 PM David Steele  wrote:
> I'm not sure that this is entirely true. lz4 is available on most
> non-EOL distros but you don't need to look to far to find a distro that
> does not have it. So, choosing lz4 by default could impact binary
> portability. Of course, there are lots of other factors that impact
> binary portability, e.g. collations, but this would add one more.
>
> This is even more true for zstd since it is not as ubiquitous as lz4. In
> fact, it is not even included with base RHEL8. You need to install EPEL
> to get zstd.

Yeah, I agree. One thing I was thinking about but didn't include in
the previous email is that if we did choose to make something like LZ4
the default, it would presumably only be the default on builds that
include LZ4 support. Other builds would need to use something else,
unless we just chose to make LZ4 a hard requirement, which would be
bolder than we usually are. And that has the consequence that you
mention. It's something we should consider as we think about changing
defaults.

> Having said all that, I'm absolutely in favor of adding zstd to Postgres
> as an optional compression format.

Cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Michael Paquier
On Tue, Feb 15, 2022 at 01:42:55PM -0800, Andres Freund wrote:
> I think well before removing stuff we should default to decent compression
> algorithms. E.g. -Z/--compress should probably not continue to use gzip if
> better things are available.

Which one of lz4 or zstd would it be though?  LZ4 is light on CPU, and
compresses less than zstd which is more CPU intensive with more
compression, so the choice does not look that straight-forward to me.
Both of them are better than zlib, still they are less available on
the platforms Postgres needs to provide support for.  It does not seem
really intuitive to me to have different defaults depending on the
build options provided, either :/

Saying that, I'd like to think that zstd would still be a better
default choice than LZ4.
--
Michael


signature.asc
Description: PGP signature


Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread David Steele

On 2/15/22 16:10, Peter Geoghegan wrote:

On Tue, Feb 15, 2022 at 12:00 PM Robert Haas  wrote:

I'm not sure I completely follow this. There are cases where we use
compression algorithms for internal purposes, and we can change the
defaults without users knowing or caring. For example, we could decide
that LZ4-compressing FPIs in WAL records is a sensible default and
just start doing it, and users need not care. 


I'm not sure that this is entirely true. lz4 is available on most 
non-EOL distros but you don't need to look to far to find a distro that 
does not have it. So, choosing lz4 by default could impact binary 
portability. Of course, there are lots of other factors that impact 
binary portability, e.g. collations, but this would add one more.


This is even more true for zstd since it is not as ubiquitous as lz4. In 
fact, it is not even included with base RHEL8. You need to install EPEL 
to get zstd.



But backups are
different, because when you pg_basebackup -Ft, you get .tar or .tar.gz
or .tar.lz4 files which we don't give you tools to extract.


What I meant is that you're buying into an ecosystem by choosing to
use a tool like pgBackrest. That might not be the right thing to do
universally, but it comes pretty close. You as a user are largely
deferring to the maintainers and their choice of defaults. Not
entirely, of course, but to a significant degree, especially in
matters like this. There aren't that many dimensions to the problem
once the question of compatibility is settled (it's settled here
because you've already bought into a tool that requires the library as
standard, say).


As much as we would like to, we have not changed the default compression 
for pgBackRest. lz4 and zstd are both optional at build time since they 
are not always available (though lz4 is getting close). So we have left 
gzip as the default, though a lot of that has to do with maintaining 
compatibility with prior versions of pgBackRest.


Having said all that, I'm absolutely in favor of adding zstd to Postgres 
as an optional compression format.


Regards,
David




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Peter Geoghegan
On Tue, Feb 15, 2022 at 12:00 PM Robert Haas  wrote:
> I'm not sure I completely follow this. There are cases where we use
> compression algorithms for internal purposes, and we can change the
> defaults without users knowing or caring. For example, we could decide
> that LZ4-compressing FPIs in WAL records is a sensible default and
> just start doing it, and users need not care. But backups are
> different, because when you pg_basebackup -Ft, you get .tar or .tar.gz
> or .tar.lz4 files which we don't give you tools to extract.

What I meant is that you're buying into an ecosystem by choosing to
use a tool like pgBackrest. That might not be the right thing to do
universally, but it comes pretty close. You as a user are largely
deferring to the maintainers and their choice of defaults. Not
entirely, of course, but to a significant degree, especially in
matters like this. There aren't that many dimensions to the problem
once the question of compatibility is settled (it's settled here
because you've already bought into a tool that requires the library as
standard, say).

A lot of things are *not* like that, but ISTM that backup compression
really is -- we have the luxury of a constrained problem space, for
once. There aren't that many metrics to consider, because it must be
lossless compression in any case, and because the requirements are
relatively homogenous. The truly important thing (once compatibility
is accounted for) is to use something basically reasonable, with no
noticeable weaknesses relative to any of the alternatives.

> I pretty much agree with all of that. Nevertheless, there's a lot of
> pglz-compressed data that's already on a disk someplace which people
> are not necessarily keen to rewrite, and that will probably remain
> true for the foreseable future. If we change the default to something
> that's not pglz, and then wait 10 years, we MIGHT be able to get rid
> of it without pushback. But I wouldn't be surprised to learn that even
> then there are a lot of people who have just pg_upgrade'd the same
> database over and over again. And honestly I think that's fine.

I think that it's fine too. It's unlikely that anybody is going to go
to any trouble to get better compression, certainly, but we should
give them every opportunity to use better alternatives. Ideally
without anybody having to even think about it.

-- 
Peter Geoghegan




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Andres Freund
Hi,

On 2022-02-15 15:09:37 -0500, Robert Haas wrote:
> On Tue, Feb 15, 2022 at 2:54 PM Andres Freund  wrote:
> > There's a difference between downloading a source tarball of 1.5x the size,
> > and 3x the decompression cost (made up numbers), because the cost of either 
> > is
> > not a relevant factor. I think basebackups are a different story.
> 
> To be clear, I'm not saying that people shouldn't choose to adopt the
> new stuff. I'm just saying that many of them won't, for various
> reasons, including inertia. There may come a point when we want to
> make a push to remove obsolete stuff, but I think what is far more
> important right now is making the new stuff available.

I think well before removing stuff we should default to decent compression
algorithms. E.g. -Z/--compress should probably not continue to use gzip if
better things are available.

Greetings,

Andres Freund




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Justin Pryzby
On Tue, Feb 15, 2022 at 03:23:30PM -0500, Robert Haas wrote:
> On Tue, Feb 15, 2022 at 1:55 PM Justin Pryzby  wrote:
> > Regarding the patch:
> >
> > I suppose it should support windows, and whatever patches use zstd should
> > update the install instructions.  See 9ca40dcd4, 02a93e7ef, 4035cd5d4.
> 
> Thanks, this is helpful. To me, it looks like we need something
> similar to 9ca40dcd4, plus something like the last hunk of 02a93e7ef.

If you mean a patch which only adds the configure options, I don't think 02a9
is applicable at all.

What I mean is that any patches which *use* the zstd support should update
those for completeness.

 doc/src/sgml/config.sgml  | 14 +-
 doc/src/sgml/install-windows.sgml |  2 +-
 doc/src/sgml/installation.sgml|  5 +++--

> The rest of 02a93e7ef and all of 4035cd5d4 seem like they address the
> application of lz4 rather than just adding support for it, so I don't
> see the need for those elements in this patch. Am I missing something?




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Robert Haas
On Tue, Feb 15, 2022 at 1:55 PM Justin Pryzby  wrote:
> Regarding the patch:
>
> I suppose it should support windows, and whatever patches use zstd should
> update the install instructions.  See 9ca40dcd4, 02a93e7ef, 4035cd5d4.

Thanks, this is helpful. To me, it looks like we need something
similar to 9ca40dcd4, plus something like the last hunk of 02a93e7ef.
The rest of 02a93e7ef and all of 4035cd5d4 seem like they address the
application of lz4 rather than just adding support for it, so I don't
see the need for those elements in this patch. Am I missing something?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Robert Haas
On Tue, Feb 15, 2022 at 2:54 PM Andres Freund  wrote:
> There's a difference between downloading a source tarball of 1.5x the size,
> and 3x the decompression cost (made up numbers), because the cost of either is
> not a relevant factor. I think basebackups are a different story.

To be clear, I'm not saying that people shouldn't choose to adopt the
new stuff. I'm just saying that many of them won't, for various
reasons, including inertia. There may come a point when we want to
make a push to remove obsolete stuff, but I think what is far more
important right now is making the new stuff available.

> Yea, we should really default to lz4 in initdb when available. Expecting users
> to know about magic incantation to get often vastly superior performance isn't
> a good approach. And it even makes initdb itself faster, because it turns out
> we compress enough that the cpu cycles for it are a measurable factor.

I don't mind if you want to propose a patch for that, but it seems
like a separate topic from this thread.

> IMO zstd has pretty much won the "gzip" type usecase of compression. Even
> prior lz4 uses have even been converted to it because it's often cheap enough.

You know more about this than I do...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Robert Haas
On Tue, Feb 15, 2022 at 2:40 PM Peter Geoghegan  wrote:
> On Tue, Feb 15, 2022 at 10:20 AM Robert Haas  wrote:
> > In general, deciding on new compression algorithms can feel a bit like
> > debating the merits of vi vs. emacs, or one political party vs.
> > another.
>
> Really? I don't get that impression myself. (That's not important, though.)

Well, that's a good thing. Maybe we're getting better at keeping
things constructive...

> > What I imagine if this patch is accepted is that we (or our users)
> > will end up using lz4 for places where compression needs to be very
> > lightweight, and zstd for places where it's acceptable or even
> > desirable to spend more CPU cycles in exchange for better compression.
>
> Sounds reasonable to me.

Cool.

> > Likewise, I still download the .tar.gz version of anything
> > that gives me that option, basically because I'm familiar with the
> > format and it's easy for me to just carry on using it -- and in a
> > similar way I expect a lot of people will be happy to continue to
> > compress backups with gzip for many years to come.
>
> Maybe, but it seems more likely that they'll usually do whatever the
> easiest reasonable-seeming thing is. Perhaps we need to distinguish
> between "container format" (e.g., a backup that is produced by a tool
> like pgBackrest, including backup manifest) and compression algorithm.

I'm not sure I completely follow this. There are cases where we use
compression algorithms for internal purposes, and we can change the
defaults without users knowing or caring. For example, we could decide
that LZ4-compressing FPIs in WAL records is a sensible default and
just start doing it, and users need not care. But backups are
different, because when you pg_basebackup -Ft, you get .tar or .tar.gz
or .tar.lz4 files which we don't give you tools to extract. You have
to be able to extract them with OS tools. In cases like that, people
are going to be more likely to pick formats with which they are
familiar even if they are not technically best, and I don't think we
really dare change the defaults. That seems OK, though.

> Isn't it an incontrovertible fact that LZ4 is superior to pglz in
> every way? LZ4 is pretty much its successor. And so it seems totally
> fine to assume that users will always want to use the clearly better
> option, and that that option will be generally available going
> forward. TOAST compression is applied selectively already, based on
> various obscure implementation details, some of which are quite
> arbitrary.

I pretty much agree with all of that. Nevertheless, there's a lot of
pglz-compressed data that's already on a disk someplace which people
are not necessarily keen to rewrite, and that will probably remain
true for the foreseable future. If we change the default to something
that's not pglz, and then wait 10 years, we MIGHT be able to get rid
of it without pushback. But I wouldn't be surprised to learn that even
then there are a lot of people who have just pg_upgrade'd the same
database over and over again. And honestly I think that's fine. Yeah,
lz4 is better. But, on some data sets, there's very little real
difference in the compression ratio you get. Until keeping pglz around
starts costing us something tangible, there's no reason to spend any
effort worrying about it.

> I know less about zstd, but I imagine that a similar dynamic exists
> there. Overall, everything that you've said makes sense to me.

Great, thanks for chiming in.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Andres Freund
Hi,

On 2022-02-15 11:40:20 -0800, Peter Geoghegan wrote:
> On Tue, Feb 15, 2022 at 10:20 AM Robert Haas  wrote:
> > Likewise, I still download the .tar.gz version of anything
> > that gives me that option, basically because I'm familiar with the
> > format and it's easy for me to just carry on using it -- and in a
> > similar way I expect a lot of people will be happy to continue to
> > compress backups with gzip for many years to come.

There's a difference between downloading a source tarball of 1.5x the size,
and 3x the decompression cost (made up numbers), because the cost of either is
not a relevant factor. I think basebackups are a different story.


> Isn't it an incontrovertible fact that LZ4 is superior to pglz in
> every way? LZ4 is pretty much its successor. And so it seems totally
> fine to assume that users will always want to use the clearly better
> option, and that that option will be generally available going
> forward. TOAST compression is applied selectively already, based on
> various obscure implementation details, some of which are quite
> arbitrary.

Yea, we should really default to lz4 in initdb when available. Expecting users
to know about magic incantation to get often vastly superior performance isn't
a good approach. And it even makes initdb itself faster, because it turns out
we compress enough that the cpu cycles for it are a measurable factor.


> I know less about zstd, but I imagine that a similar dynamic exists
> there. Overall, everything that you've said makes sense to me.

IMO zstd has pretty much won the "gzip" type usecase of compression. Even
prior lz4 uses have even been converted to it because it's often cheap enough.

Greetings,

Andres Freund




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Peter Geoghegan
On Tue, Feb 15, 2022 at 10:20 AM Robert Haas  wrote:
> In general, deciding on new compression algorithms can feel a bit like
> debating the merits of vi vs. emacs, or one political party vs.
> another.

Really? I don't get that impression myself. (That's not important, though.)

> What I imagine if this patch is accepted is that we (or our users)
> will end up using lz4 for places where compression needs to be very
> lightweight, and zstd for places where it's acceptable or even
> desirable to spend more CPU cycles in exchange for better compression.

Sounds reasonable to me.

> Likewise, I still download the .tar.gz version of anything
> that gives me that option, basically because I'm familiar with the
> format and it's easy for me to just carry on using it -- and in a
> similar way I expect a lot of people will be happy to continue to
> compress backups with gzip for many years to come.

Maybe, but it seems more likely that they'll usually do whatever the
easiest reasonable-seeming thing is. Perhaps we need to distinguish
between "container format" (e.g., a backup that is produced by a tool
like pgBackrest, including backup manifest) and compression algorithm.

Isn't it an incontrovertible fact that LZ4 is superior to pglz in
every way? LZ4 is pretty much its successor. And so it seems totally
fine to assume that users will always want to use the clearly better
option, and that that option will be generally available going
forward. TOAST compression is applied selectively already, based on
various obscure implementation details, some of which are quite
arbitrary.

I know less about zstd, but I imagine that a similar dynamic exists
there. Overall, everything that you've said makes sense to me.

-- 
Peter Geoghegan




Re: adding 'zstd' as a compression algorithm

2022-02-15 Thread Justin Pryzby
I think the WAL patch (4035cd5d4) should support zstd if library support is
added.  A supplementary patch for that already exists.
https://www.postgresql.org/message-id/ynqwd2gsmrnqw...@paquier.xyz

There's also some in-progress patches:

 - Konstantin and Daniil have a patch to add libpq compression, which ultimately
   wants to use zstd.
   https://commitfest.postgresql.org/37/3499/

 - I had a patch to add zstd to pg_dump, but I ended up storing backups to
   ZFS+zstd rather than waiting to progress the patch.
   https://commitfest.postgresql.org/32/2888/

Regarding the patch:

I suppose it should support windows, and whatever patches use zstd should
update the install instructions.  See 9ca40dcd4, 02a93e7ef, 4035cd5d4.




adding 'zstd' as a compression algorithm

2022-02-15 Thread Robert Haas
Hi,

Over on the rather-long "refactoring basebackup.c" thread, there is a
proposal, which I endorse, to add base-backup compression via zstd. To
do that, we'd need to patch configure.ac to create a new --with-zstd
flag and appropriate supporting infrastructure. My colleague Jeevan
Ladhe included that in a patch posted over there; I've extracted just
the part adding configure support for libzstd and attach it here. I
thought it would be good to have a new thread specifically devoted to
the topic of whether zstd is a thing that PostgreSQL ought to support
in general.

In general, deciding on new compression algorithms can feel a bit like
debating the merits of vi vs. emacs, or one political party vs.
another. Everyone has their own favorites, for reasons that can often
seem idiosyncratic. One advantage of zstd is that it is already being
used by other prominent open-source projects, including the Linux
kernel.[1] This means that it is unlikely to just dry up and vanish,
and it also reduces the risk of legal issues. On a technical level,
zstd offers compression ratios similar to or better than gzip, but
with much faster compression speed. Furthermore, the zstd library has
built-in multi-threaded compression which we may be able to leverage
for even better performance. In fact, parallel zstd might be able to
compress faster than lz4, which is already extremely fast.

What I imagine if this patch is accepted is that we (or our users)
will end up using lz4 for places where compression needs to be very
lightweight, and zstd for places where it's acceptable or even
desirable to spend more CPU cycles in exchange for better compression.
I think that gzip and pglz are really only of historical interest -
and I don't say that to mean that we shouldn't continue to support
them or that they won't get use. Lots of people are perfectly happy
with TOAST compression using pglz, and I'm perfectly happy if they
continue to do that forever, even though I'm glad LZ4 is now an
option. Likewise, I still download the .tar.gz version of anything
that gives me that option, basically because I'm familiar with the
format and it's easy for me to just carry on using it -- and in a
similar way I expect a lot of people will be happy to continue to
compress backups with gzip for many years to come. But I think there
is value in supporting newer and better technology, too. I realize
that we don't want to support every new and shiny thing that shows up,
but I don't think that's what I am proposing here.

Anyway, those are my thoughts. What are yours?

Thanks,

[1] https://en.wikipedia.org/wiki/Zstd#Usage

-- 
Robert Haas
EDB: http://www.enterprisedb.com


0001-Add-support-for-building-with-ZSTD.patch
Description: Binary data