Re: basebackup/lz4 crash

2022-03-31 Thread Dipesh Pandit
Hi,

> I think your proposed change is OK, modulo some comments. But I think
> maybe we ought to delete all the stuff related to compressed_bound
> from bbstreamer_lz4_compressor_new() as well, because I don't see that
> there's any point. And then I think we should also add logic similar
> to what you've added here to bbstreamer_lz4_compressor_finalize(), so
> that we're not making the assumption that the buffer will get enlarged
> at some earlier point.
>
> Thoughts?
I agree that we should remove the compression bound logic from
bbstreamer_lz4_compressor_new() and instead fix
bbstreamer_lz4_compressor_content() and bbstreamer_lz4_compressor_finalize()
to enlarge the buffer whenever it falls short of the compression bound.

Patch attached.

Thanks,
Dipesh
From ee9b5e4739bf78b39dc34c9fcc76fc731b09b788 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Thu, 31 Mar 2022 12:19:31 +0530
Subject: [PATCH] fix crash with lz4

We cannot fix the size of the output buffer on the client during lz4
compression. The client receives chunks from the server, and the size of
these chunks varies depending on the data sent by the server. The output
buffer capacity on the client must therefore be adjusted based on the size
of each chunk received from the server.
---
 src/bin/pg_basebackup/bbstreamer_lz4.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/src/bin/pg_basebackup/bbstreamer_lz4.c b/src/bin/pg_basebackup/bbstreamer_lz4.c
index 67f841d..2ffe224 100644
--- a/src/bin/pg_basebackup/bbstreamer_lz4.c
+++ b/src/bin/pg_basebackup/bbstreamer_lz4.c
@@ -73,7 +73,6 @@ bbstreamer_lz4_compressor_new(bbstreamer *next, bc_specification *compress)
 	bbstreamer_lz4_frame   *streamer;
 	LZ4F_errorCode_t		ctxError;
 	LZ4F_preferences_t	   *prefs;
-	size_t	compressed_bound;
 
 	Assert(next != NULL);
 
@@ -92,17 +91,6 @@ bbstreamer_lz4_compressor_new(bbstreamer *next, bc_specification *compress)
 	if ((compress->options & BACKUP_COMPRESSION_OPTION_LEVEL) != 0)
 		prefs->compressionLevel = compress->level;
 
-	/*
-	 * Find out the compression bound, it specifies the minimum destination
-	 * capacity required in worst case for the success of compression operation
-	 * (LZ4F_compressUpdate) based on a given source size and preferences.
-	 */
-	compressed_bound = LZ4F_compressBound(streamer->base.bbs_buffer.maxlen, prefs);
-
-	/* Enlarge buffer if it falls short of compression bound. */
-	if (streamer->base.bbs_buffer.maxlen < compressed_bound)
-		enlargeStringInfo(&streamer->base.bbs_buffer, compressed_bound);
-
 	ctxError = LZ4F_createCompressionContext(&streamer->cctx, LZ4F_VERSION);
 	if (LZ4F_isError(ctxError))
 			pg_log_error("could not create lz4 compression context: %s",
@@ -170,7 +158,6 @@ bbstreamer_lz4_compressor_content(bbstreamer *streamer,
 	 * forward the content to next streamer and empty the buffer.
 	 */
 	out_bound = LZ4F_compressBound(len, &mystreamer->prefs);
-	Assert(mystreamer->base.bbs_buffer.maxlen >= out_bound);
 	if (avail_out < out_bound)
 	{
 			bbstreamer_content(mystreamer->base.bbs_next, member,
@@ -178,6 +165,10 @@ bbstreamer_lz4_compressor_content(bbstreamer *streamer,
 			   mystreamer->bytes_written,
 			   context);
 
+			/* Enlarge buffer if it falls short of out bound. */
+			if (mystreamer->base.bbs_buffer.maxlen < out_bound)
+				enlargeStringInfo(&mystreamer->base.bbs_buffer, out_bound);
+
 			avail_out = mystreamer->base.bbs_buffer.maxlen;
 			mystreamer->bytes_written = 0;
 			next_out = (uint8 *) mystreamer->base.bbs_buffer.data;
@@ -218,7 +209,6 @@ bbstreamer_lz4_compressor_finalize(bbstreamer *streamer)
 
 	/* Find out the footer bound and update the output buffer. */
 	footer_bound = LZ4F_compressBound(0, &mystreamer->prefs);
-	Assert(mystreamer->base.bbs_buffer.maxlen >= footer_bound);
 	if ((mystreamer->base.bbs_buffer.maxlen - mystreamer->bytes_written) <
 		footer_bound)
 	{
@@ -227,6 +217,10 @@ bbstreamer_lz4_compressor_finalize(bbstreamer *streamer)
 			   mystreamer->bytes_written,
 			   BBSTREAMER_UNKNOWN);
 
+			/* Enlarge buffer if it falls short of footer bound. */
+			if (mystreamer->base.bbs_buffer.maxlen < footer_bound)
+				enlargeStringInfo(&mystreamer->base.bbs_buffer, footer_bound);
+
 			avail_out = mystreamer->base.bbs_buffer.maxlen;
 			mystreamer->bytes_written = 0;
 			next_out = (uint8 *) mystreamer->base.bbs_buffer.data;
-- 
1.8.3.1



Re: fixing a few backup compression goofs

2022-03-25 Thread Dipesh Pandit
Hi,

The changes look good to me.

Thanks,
Dipesh


Re: refactoring basebackup.c

2022-03-14 Thread Dipesh Pandit
Hi,

I tried to implement support for parallel ZSTD compression. The
library provides an option (ZSTD_c_nbWorkers) to specify the
number of compression workers. The number of parallel workers can
be set as a compression parameter, and if this option is specified
the library performs parallel compression with the given number
of workers.

The user can specify the number of parallel workers as part of the
--compress option by appending an integer value after an at sign (@).
(-Z, --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL][@WORKERS])

Please find the attached patch v1 with the above changes.

Note: ZSTD library version 1.5.x supports parallel compression by
default; with an older library version, parallel compression is
available only if the library was compiled with the build macro
ZSTD_MULTITHREAD. If the linked library does not support parallel
compression, then setting the ZSTD_c_nbWorkers parameter to a value
other than 0 is rejected with an error.

Thanks,
Dipesh
From 688ad1e3f9b43bf911e8c3837497a874e4a6937f Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Mon, 14 Mar 2022 18:39:02 +0530
Subject: [PATCH] support parallel zstd compression

---
 doc/src/sgml/ref/pg_basebackup.sgml   | 11 ++-
 src/backend/replication/basebackup.c  | 14 +++-
 src/backend/replication/basebackup_zstd.c | 15 -
 src/bin/pg_basebackup/bbstreamer.h|  3 +-
 src/bin/pg_basebackup/bbstreamer_zstd.c   | 11 ++-
 src/bin/pg_basebackup/pg_basebackup.c | 97 +--
 src/bin/pg_verifybackup/t/008_untar.pl|  8 +++
 src/bin/pg_verifybackup/t/010_client_untar.pl |  8 +++
 src/include/replication/basebackup_sink.h |  2 +-
 9 files changed, 142 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 4a630b5..87feca0 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -399,9 +399,9 @@ PostgreSQL documentation
 
  
   -Z level
-  -Z [{client|server}-]method[:level]
+  -Z [{client|server}-]method[:level][@workers]
   --compress=level
-  --compress=[{client|server}-]method[:level]
+  --compress=[{client|server}-]method[:level][@workers]
   

 Requests compression of the backup. If client or
@@ -428,6 +428,13 @@ PostgreSQL documentation
 the level is 0.


+Compression workers can be specified optionally by appending the
+number of workers after an at sign (@). It 
+defines the degree of parallelism while compressing the archive.
+Currently, parallel compression is supported only for
+zstd compressed archives.
+   
+   
 When the tar format is used with gzip,
 lz4, or zstd, the suffix
 .gz, .lz4, or
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 2378ce5..8217fa9 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -82,6 +82,7 @@ typedef struct
 	backup_manifest_option manifest;
 	basebackup_compression_type	compression;
 	int			compression_level;
+	int			compression_workers;
 	pg_checksum_type manifest_checksum_type;
 } basebackup_options;
 
@@ -718,6 +719,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 	char	   *target_str = "compat";	/* placate compiler */
 	bool		o_compression = false;
 	bool		o_compression_level = false;
+	bool		o_compression_workers = false;
 
 	MemSet(opt, 0, sizeof(*opt));
 	opt->target = BACKUP_TARGET_CLIENT;
@@ -925,6 +927,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 			opt->compression_level = defGetInt32(defel);
 			o_compression_level = true;
 		}
+		else if (strcmp(defel->defname, "compression_workers") == 0)
+		{
+			if (o_compression_workers)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("duplicate option \"%s\"", defel->defname)));
+			opt->compression_workers = defGetInt32(defel);
+			o_compression_workers = true;
+		}
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
@@ -1030,7 +1041,8 @@ SendBaseBackup(BaseBackupCmd *cmd)
 	else if (opt.compression == BACKUP_COMPRESSION_LZ4)
 		sink = bbsink_lz4_new(sink, opt.compression_level);
 	else if (opt.compression == BACKUP_COMPRESSION_ZSTD)
-		sink = bbsink_zstd_new(sink, opt.compression_level);
+		sink = bbsink_zstd_new(sink, opt.compression_level,
+			   opt.compression_workers);
 
 	/* Set up progress reporting. */
 	sink = bbsink_progress_new(sink, opt.progress);
diff --git a/src/backend/replication/basebackup_zstd.c b/src/backend/replication/basebackup_zstd.c
index e3f9b1d..54b91eb 100644
--- a/src/backend/replication/basebackup_zstd.c
+++ b/src/backend/replication/basebackup_zstd.c
@@ -28,6 +28,9 @@ typedef struct bbsink_zstd
 	/* Compression lev

Re: refactoring basebackup.c

2022-03-04 Thread Dipesh Pandit
Hi,

> > It will be good if we can also fix
> > CreateWalTarMethod to support LZ4 and ZSTD.
> Ok we will see, either Dipesh or I will take care of it.

I took a look at CreateWalTarMethod with a view to supporting LZ4
compression for WAL files. The current implementation backs up a WAL file
to a tar archive in three steps. For each file:

   1. It first writes the header in tar_open_for_write(), flushes the
   contents of the tar to disk, and stores the header offset.
   2. Next, the contents of the WAL file are written to the tar archive.
   3. Finally, it recalculates the checksum in tar_close() and overwrites
   the header at the offset stored in step #1.

The need to overwrite the header in CreateWalTarMethod arises mainly from
partial WAL files, where the size of the WAL file is less than WalSegSize.
The file is padded, and the checksum is recalculated after the pad bytes
are added.

If we go ahead and implement LZ4 support for CreateWalTarMethod, then we
have a problem at step #3. To achieve a better compression ratio,
compressed LZ4 blocks are linked to each other, and these blocks must be
decoded sequentially. If we overwrite the header as part of step #3, we
corrupt the link between the compressed LZ4 blocks. LZ4 does provide an
option to write each compressed block independently (blockMode set to
LZ4F_blockIndependent), but that is still a problem, because we don't
know whether overwriting the header after recalculating the checksum
will overlap the boundary of the next block.

GZIP manages to overcome this problem because it can turn compression on
and off on the fly while writing a compressed archive, with the help of
the zlib library function deflateParams(). The current gzip implementation
for CreateWalTarMethod uses this function to turn off compression just
before step #1 and writes the uncompressed header, of size TAR_BLOCK_SIZE.
It uses the same function to turn compression back on for writing the
contents of the WAL file in step #2, and turns it off again just before
step #3 to overwrite the header. The header is overwritten at the same
offset, again with size TAR_BLOCK_SIZE.

Since GZIP provides this option to enable and disable compression, it is
possible to control the size of the data being written to the compressed
archive. Even if we overwrite an already-written block in the compressed
archive, there is no risk of overlapping the boundary of the next block.
This mechanism is not available in LZ4 or ZSTD.

To support LZ4 and ZSTD compression in CreateWalTarMethod, we may need to
refactor this code, unless I am missing something. We would somehow need
to add the padding bytes to a partial WAL file before sending it to the
compressed archive. That way, every file being compressed requires no
further padding, since its size is always exactly WalSegSize. There is
then no need to recalculate the checksum, and we can avoid overwriting
the header in step #3.

Thoughts?

Thanks,
Dipesh


Re: refactoring basebackup.c

2022-02-11 Thread Dipesh Pandit
> Sure, please find the rebased patch attached.

Thanks, I have validated v2 patch on top of rebased patch.

Thanks,
Dipesh


Re: refactoring basebackup.c

2022-02-11 Thread Dipesh Pandit
Hi,

Thanks for the feedback, I have incorporated the suggestions
and updated a new patch. PFA v2 patch.

> I think similar to bbstreamer_lz4_compressor_content() in
> bbstreamer_lz4_decompressor_content() we can change len to avail_in.

In bbstreamer_lz4_decompressor_content(), we are modifying avail_in
based on the number of bytes decompressed in each iteration. I think
we cannot replace it with "len" here.

Jeevan, your v12 patch does not apply on HEAD; it requires a rebase. I
have applied it on commit 400fc6b6487ddf16aa82c9d76e5cfbe64d94f660
to validate my v2 patch.

Thanks,
Dipesh
From 47a0ef4348747ffa61eccd7954e00f3cf5fc7222 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Thu, 3 Feb 2022 18:31:03 +0530
Subject: [PATCH] support client side compression and decompression using LZ4

---
 src/bin/pg_basebackup/Makefile|   1 +
 src/bin/pg_basebackup/bbstreamer.h|   3 +
 src/bin/pg_basebackup/bbstreamer_lz4.c| 431 ++
 src/bin/pg_basebackup/pg_basebackup.c |  32 +-
 src/bin/pg_verifybackup/t/009_extract.pl  |   7 +-
 src/bin/pg_verifybackup/t/010_client_untar.pl | 111 +++
 src/tools/msvc/Mkvcbuild.pm   |   1 +
 7 files changed, 580 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_basebackup/bbstreamer_lz4.c
 create mode 100644 src/bin/pg_verifybackup/t/010_client_untar.pl

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index ada3a5a..1d0db4f 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -43,6 +43,7 @@ BBOBJS = \
 	bbstreamer_file.o \
 	bbstreamer_gzip.o \
 	bbstreamer_inject.o \
+	bbstreamer_lz4.o \
 	bbstreamer_tar.o
 
 all: pg_basebackup pg_receivewal pg_recvlogical
diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h
index fe49ae3..c2de77b 100644
--- a/src/bin/pg_basebackup/bbstreamer.h
+++ b/src/bin/pg_basebackup/bbstreamer.h
@@ -206,6 +206,9 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath,
 			void (*report_output_file) (const char *));
 
 extern bbstreamer *bbstreamer_gzip_decompressor_new(bbstreamer *next);
+extern bbstreamer *bbstreamer_lz4_compressor_new(bbstreamer *next,
+ int compresslevel);
+extern bbstreamer *bbstreamer_lz4_decompressor_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next);
diff --git a/src/bin/pg_basebackup/bbstreamer_lz4.c b/src/bin/pg_basebackup/bbstreamer_lz4.c
new file mode 100644
index 000..f0bc226
--- /dev/null
+++ b/src/bin/pg_basebackup/bbstreamer_lz4.c
@@ -0,0 +1,431 @@
+/*-
+ *
+ * bbstreamer_lz4.c
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  src/bin/pg_basebackup/bbstreamer_lz4.c
+ *-
+ */
+
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#ifdef HAVE_LIBLZ4
+#include <lz4frame.h>
+#endif
+
+#include "bbstreamer.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "common/string.h"
+
+#ifdef HAVE_LIBLZ4
+typedef struct bbstreamer_lz4_frame
+{
+	bbstreamer	base;
+
+	LZ4F_compressionContext_t	cctx;
+	LZ4F_decompressionContext_t	dctx;
+	LZ4F_preferences_t			prefs;
+
+	size_t		bytes_written;
+	bool		header_written;
+} bbstreamer_lz4_frame;
+
+static void bbstreamer_lz4_compressor_content(bbstreamer *streamer,
+			  bbstreamer_member *member,
+			  const char *data, int len,
+			  bbstreamer_archive_context context);
+static void bbstreamer_lz4_compressor_finalize(bbstreamer *streamer);
+static void bbstreamer_lz4_compressor_free(bbstreamer *streamer);
+
+const bbstreamer_ops bbstreamer_lz4_compressor_ops = {
+	.content = bbstreamer_lz4_compressor_content,
+	.finalize = bbstreamer_lz4_compressor_finalize,
+	.free = bbstreamer_lz4_compressor_free
+};
+
+static void bbstreamer_lz4_decompressor_content(bbstreamer *streamer,
+bbstreamer_member *member,
+const char *data, int len,
+bbstreamer_archive_context context);
+static void bbstreamer_lz4_decompressor_finalize(bbstreamer *streamer);
+static void bbstreamer_lz4_decompressor_free(bbstreamer *streamer);
+
+const bbstreamer_ops bbstreamer_lz4_decompressor_ops = {
+	.content = bbstreamer_lz4_decompressor_content,
+	.finalize = bbstreamer_lz4_decompressor_finalize,
+	.free = bbstreamer_lz4_decompressor_free
+};
+#endif
+
+/*
+ * Create a new base backup streamer that performs lz4 compression of tar
+ * blocks.
+ */
+bbstreamer *
+bbstreamer_lz4_compressor_new(bbstreamer *next, int compresslevel)
+{
+#ifdef HAVE_LIBLZ4
+	bbstreamer_lz4_frame   *streamer;
+	LZ4F_errorCode_t		ctxEr

Re: refactoring basebackup.c

2022-02-10 Thread Dipesh Pandit
Hi,

> On Mon, Jan 31, 2022 at 4:41 PM Jeevan Ladhe <jeevan.la...@enterprisedb.com> wrote:

> Hi Robert,
>
> I had an offline discussion with Dipesh, and he will be working on the
> lz4 client side decompression part.
>

Please find the attached patch to support client side compression
and decompression using lz4.

Added a new lz4 bbstreamer to compress the archive chunks at the
client when the user has specified the --compress=client-lz4:[LEVEL]
option in pg_basebackup. The new streamer accepts archive chunks,
compresses them, and forwards them to the plain writer.

Similarly, if the user has requested a server-compressed lz4 archive
with a plain format (-F p) backup, the compressed archive chunks must
be decompressed before being forwarded to the tar extractor.
Added a new bbstreamer to decompress the compressed archive
and forward it to the tar extractor.

Note: This patch can be applied on Jeevan Ladhe's v12 patch
for lz4 compression.

Thanks,
Dipesh
From 67e47579e119897c66e6f5f7a5e5e9542399072f Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Thu, 3 Feb 2022 18:31:03 +0530
Subject: [PATCH] support client side compression and decompression using LZ4

---
 src/bin/pg_basebackup/Makefile|   1 +
 src/bin/pg_basebackup/bbstreamer.h|   3 +
 src/bin/pg_basebackup/bbstreamer_lz4.c| 436 ++
 src/bin/pg_basebackup/pg_basebackup.c |  32 +-
 src/bin/pg_verifybackup/t/009_extract.pl  |   7 +-
 src/bin/pg_verifybackup/t/010_client_untar.pl | 111 +++
 src/tools/msvc/Mkvcbuild.pm   |   1 +
 7 files changed, 585 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_basebackup/bbstreamer_lz4.c
 create mode 100644 src/bin/pg_verifybackup/t/010_client_untar.pl

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index ada3a5a..1d0db4f 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -43,6 +43,7 @@ BBOBJS = \
 	bbstreamer_file.o \
 	bbstreamer_gzip.o \
 	bbstreamer_inject.o \
+	bbstreamer_lz4.o \
 	bbstreamer_tar.o
 
 all: pg_basebackup pg_receivewal pg_recvlogical
diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h
index fe49ae3..c2de77b 100644
--- a/src/bin/pg_basebackup/bbstreamer.h
+++ b/src/bin/pg_basebackup/bbstreamer.h
@@ -206,6 +206,9 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath,
 			void (*report_output_file) (const char *));
 
 extern bbstreamer *bbstreamer_gzip_decompressor_new(bbstreamer *next);
+extern bbstreamer *bbstreamer_lz4_compressor_new(bbstreamer *next,
+ int compresslevel);
+extern bbstreamer *bbstreamer_lz4_decompressor_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next);
diff --git a/src/bin/pg_basebackup/bbstreamer_lz4.c b/src/bin/pg_basebackup/bbstreamer_lz4.c
new file mode 100644
index 000..9055a23
--- /dev/null
+++ b/src/bin/pg_basebackup/bbstreamer_lz4.c
@@ -0,0 +1,436 @@
+/*-
+ *
+ * bbstreamer_lz4.c
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  src/bin/pg_basebackup/bbstreamer_lz4.c
+ *-
+ */
+
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#ifdef HAVE_LIBLZ4
+#include <lz4frame.h>
+#endif
+
+#include "bbstreamer.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "common/string.h"
+
+#ifdef HAVE_LIBLZ4
+typedef struct bbstreamer_lz4_frame
+{
+	bbstreamer	base;
+
+	LZ4F_compressionContext_t	cctx;
+	LZ4F_decompressionContext_t	dctx;
+	LZ4F_preferences_t			prefs;
+
+	size_t		bytes_written;
+	bool		header_written;
+} bbstreamer_lz4_frame;
+
+static void bbstreamer_lz4_compressor_content(bbstreamer *streamer,
+			  bbstreamer_member *member,
+			  const char *data, int len,
+			  bbstreamer_archive_context context);
+static void bbstreamer_lz4_compressor_finalize(bbstreamer *streamer);
+static void bbstreamer_lz4_compressor_free(bbstreamer *streamer);
+
+const bbstreamer_ops bbstreamer_lz4_compressor_ops = {
+	.content = bbstreamer_lz4_compressor_content,
+	.finalize = bbstreamer_lz4_compressor_finalize,
+	.free = bbstreamer_lz4_compressor_free
+};
+
+static void bbstreamer_lz4_decompressor_content(bbstreamer *streamer,
+bbstreamer_member *member,
+const char *data, int len,
+bbstreamer_archive_context context);
+static void bbstreamer_lz4_decompressor_finalize(bbstreamer *streamer);
+static void bbstreamer_lz4_decompressor_free(bbstreamer *streamer);
+
+const bbstreamer_ops bbstreamer_lz4_decompressor_ops = {
+	.content = bbstreamer_lz4_decompressor_content,
+	.finalize = bbstre

Re: refactoring basebackup.c

2022-01-28 Thread Dipesh Pandit
Hi,

> I made a pass over these patches today and made a bunch of minor
> corrections. New version attached. The two biggest things I changed
> are (1) s/gzip_extractor/gzip_compressor/, because I feel like you
> extract an archive like a tarfile, but that is not what is happening
> here, this is not an archive and (2) I took a few bits of out of the
> test case that didn't seem to be necessary. There wasn't any reason
> that I could see why testing for PG_VERSION needed to be skipped when
> the compression method is 'none', so my first thought was to just take
> out the 'if' statement around that, but then after more thought that
> test and the one for pg_verifybackup are certainly going to fail if
> those files are not present, so why have an extra test? It might make
> sense if we were only conditionally able to run pg_verifybackup and
> wanted to have some test coverage even when we can't, but that's not
> the case here, so I see no point.

Thanks. This makes sense.

+#ifdef HAVE_LIBZ
+   /*
+    * If the user has requested a server compressed archive along with
+    * archive extraction at client then we need to decompress it.
+    */
+   if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
+       compressloc == COMPRESS_LOCATION_SERVER)
+       streamer = bbstreamer_gzip_decompressor_new(streamer);
+#endif

I think it is not required to have the HAVE_LIBZ check in pg_basebackup.c
while creating a new gzip writer/decompressor. This check is already
in place in bbstreamer_gzip_writer_new() and
bbstreamer_gzip_decompressor_new(), which throw an error if the build
does not have the required library support. I have removed this check
from pg_basebackup.c and prepared a delta patch. The patch can be
applied on top of the v5 patch.

Thanks,
Dipesh
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 46ab60d..1f81bbf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1199,7 +1199,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 			compressloc != COMPRESS_LOCATION_CLIENT)
 			streamer = bbstreamer_plain_writer_new(archive_filename,
    archive_file);
-#ifdef HAVE_LIBZ
 		else if (compressmethod == COMPRESSION_GZIP)
 		{
 			strlcat(archive_filename, ".gz", sizeof(archive_filename));
@@ -1207,7 +1206,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
   archive_file,
   compresslevel);
 		}
-#endif
 		else
 		{
 			Assert(false);		/* not reachable */
@@ -1256,7 +1254,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 	else if (expect_unterminated_tarfile)
 		streamer = bbstreamer_tar_terminator_new(streamer);
 
-#ifdef HAVE_LIBZ
 	/*
 	 * If the user has requested a server compressed archive along with archive
 	 * extraction at client then we need to decompress it.
@@ -1264,7 +1261,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 	if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
 			compressloc == COMPRESS_LOCATION_SERVER)
 		streamer = bbstreamer_gzip_decompressor_new(streamer);
-#endif
 
 	/* Return the results. */
 	*manifest_inject_streamer_p = manifest_inject_streamer;


Re: refactoring basebackup.c

2022-01-26 Thread Dipesh Pandit
Hi,

> It only needed trivial rebasing; I have committed it after doing that.

I have updated the patches to support server compression (gzip) for
plain format backup. Please find attached v4 patches.

Thanks,
Dipesh
From 4d0c84d6fac841aafb757535cc0e48334a251581 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Mon, 24 Jan 2022 15:28:48 +0530
Subject: [PATCH 1/2] Support for extracting gzip compressed archive

pg_basebackup can support server side compression using gzip. In
order to support plain format backup with option '-Fp' we need to
add support for decompressing the compressed blocks at client. This
patch addresses the extraction of gzip compressed blocks at client.
---
 doc/src/sgml/ref/pg_basebackup.sgml |   8 +-
 src/bin/pg_basebackup/Makefile  |   1 +
 src/bin/pg_basebackup/bbstreamer.h  |   1 +
 src/bin/pg_basebackup/bbstreamer_file.c | 182 
 src/bin/pg_basebackup/bbstreamer_gzip.c | 376 
 src/bin/pg_basebackup/pg_basebackup.c   |  19 +-
 6 files changed, 401 insertions(+), 186 deletions(-)
 create mode 100644 src/bin/pg_basebackup/bbstreamer_gzip.c

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 1d0df34..19849be 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -428,8 +428,12 @@ PostgreSQL documentation


 When the tar format is used, the suffix .gz will
-automatically be added to all tar filenames. Compression is not
-available in plain format.
+automatically be added to all tar filenames.
+   
+   
+Server compression can be specified with plain format backup. It
+enables compression of the archive at the server and extraction of the
+compressed archive at the client.

   
  
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 5b18851..78d96c6 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -38,6 +38,7 @@ OBJS = \
 BBOBJS = \
 	pg_basebackup.o \
 	bbstreamer_file.o \
+	bbstreamer_gzip.o \
 	bbstreamer_inject.o \
 	bbstreamer_tar.o
 
diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h
index fc88b50..270b0df 100644
--- a/src/bin/pg_basebackup/bbstreamer.h
+++ b/src/bin/pg_basebackup/bbstreamer.h
@@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath,
 			const char *(*link_map) (const char *),
 			void (*report_output_file) (const char *));
 
+extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next);
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 77ca222..d721f87 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -11,10 +11,6 @@
 
 #include "postgres_fe.h"
 
-#ifdef HAVE_LIBZ
-#include <zlib.h>
-#endif
-
 #include <unistd.h>
 
 #include "bbstreamer.h"
@@ -30,15 +26,6 @@ typedef struct bbstreamer_plain_writer
 	bool		should_close_file;
 } bbstreamer_plain_writer;
 
-#ifdef HAVE_LIBZ
-typedef struct bbstreamer_gzip_writer
-{
-	bbstreamer	base;
-	char	   *pathname;
-	gzFile		gzfile;
-}			bbstreamer_gzip_writer;
-#endif
-
 typedef struct bbstreamer_extractor
 {
 	bbstreamer	base;
@@ -62,22 +49,6 @@ const bbstreamer_ops bbstreamer_plain_writer_ops = {
 	.free = bbstreamer_plain_writer_free
 };
 
-#ifdef HAVE_LIBZ
-static void bbstreamer_gzip_writer_content(bbstreamer *streamer,
-		   bbstreamer_member *member,
-		   const char *data, int len,
-		   bbstreamer_archive_context context);
-static void bbstreamer_gzip_writer_finalize(bbstreamer *streamer);
-static void bbstreamer_gzip_writer_free(bbstreamer *streamer);
-static const char *get_gz_error(gzFile gzf);
-
-const bbstreamer_ops bbstreamer_gzip_writer_ops = {
-	.content = bbstreamer_gzip_writer_content,
-	.finalize = bbstreamer_gzip_writer_finalize,
-	.free = bbstreamer_gzip_writer_free
-};
-#endif
-
 static void bbstreamer_extractor_content(bbstreamer *streamer,
 		 bbstreamer_member *member,
 		 const char *data, int len,
@@ -196,159 +167,6 @@ bbstreamer_plain_writer_free(bbstreamer *streamer)
 }
 
 /*
- * Create a bbstreamer that just compresses data using gzip, and then writes
- * it to a file.
- *
- * As in the case of bbstreamer_plain_writer_new, pathname is always used
- * for error reporting purposes; if file is NULL, it is also the opened and
- * closed so that the data may be written there.
- */
-bbstreamer *
-bbstreamer_gzip_writer_new(char *pathname, FILE *file, int compresslevel)
-{
-#ifdef HAVE_LIBZ
-	bbstreamer_gzip_writer *streamer;
-
-	streamer = palloc0(sizeof(bbstreamer_gzip_writer));
-	

Re: refactoring basebackup.c

2022-01-24 Thread Dipesh Pandit
Hi,

> Here is a more detailed review.

Thanks for the feedback, I have incorporated the suggestions
and updated a new version of the patch (v3-0001).

The required documentation changes are also incorporated in
updated patch (v3-0001).

> Interesting approach. This unfortunately has the effect of making that
> test case file look a bit incoherent -- the comment at the top of the
> file isn't really accurate any more, for example, and the plain_format
> flag does more than just cause us to use -Fp; it also causes us NOT to
> use --target server:X. However, that might be something we can figure
> out a way to clean up. Alternatively, we could have a new test case
> file that is structured like 002_algorithm.pl but looping over
> compression methods rather than checksum algorithms, and testing each
> one with --server-compress and -Fp. It might be easier to make that
> look nice (but I'm not 100% sure).

Added a new test case file "009_extract.pl" to test server compressed plain
format backup (v3-0002).

> I committed the base backup target patch yesterday, and today I
> updated the remaining code in light of Michael Paquier's commit
> 5c649fe153367cdab278738ee4aebbfd158e0546. Here is the resulting patch.

The v13 patch does not apply on the latest HEAD; it requires a rebase.
I have applied it on commit dc43fc9b3aa3e0fa9c84faddad6d301813580f88
to validate the gzip decompression patches.

Thanks,
Dipesh
From 9ec2efcc908e988409cd9ba19ea64a50012163a2 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Mon, 24 Jan 2022 15:28:48 +0530
Subject: [PATCH 1/2] Support for extracting gzip compressed archive

pg_basebackup can support server side compression using gzip. In
order to support plain format backup with option '-Fp' we need to
add support for decompressing the compressed blocks at client. This
patch addresses the extraction of gzip compressed blocks at client.
---
 doc/src/sgml/ref/pg_basebackup.sgml |   8 +-
 src/bin/pg_basebackup/Makefile  |   1 +
 src/bin/pg_basebackup/bbstreamer.h  |   1 +
 src/bin/pg_basebackup/bbstreamer_file.c | 182 
 src/bin/pg_basebackup/bbstreamer_gzip.c | 376 
 src/bin/pg_basebackup/pg_basebackup.c   |  19 +-
 6 files changed, 401 insertions(+), 186 deletions(-)
 create mode 100644 src/bin/pg_basebackup/bbstreamer_gzip.c

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 1d0df34..19849be 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -428,8 +428,12 @@ PostgreSQL documentation


 When the tar format is used, the suffix .gz will
-automatically be added to all tar filenames. Compression is not
-available in plain format.
+automatically be added to all tar filenames.
+   
+   
+Server compression can be specified with plain format backup. It
+enables compression of the archive at the server and extraction of
+the compressed archive at the client.

   
  
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 5b18851..78d96c6 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -38,6 +38,7 @@ OBJS = \
 BBOBJS = \
 	pg_basebackup.o \
 	bbstreamer_file.o \
+	bbstreamer_gzip.o \
 	bbstreamer_inject.o \
 	bbstreamer_tar.o
 
diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h
index fc88b50..270b0df 100644
--- a/src/bin/pg_basebackup/bbstreamer.h
+++ b/src/bin/pg_basebackup/bbstreamer.h
@@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath,
 			const char *(*link_map) (const char *),
 			void (*report_output_file) (const char *));
 
+extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next);
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 77ca222..d721f87 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -11,10 +11,6 @@
 
 #include "postgres_fe.h"
 
-#ifdef HAVE_LIBZ
-#include 
-#endif
-
 #include 
 
 #include "bbstreamer.h"
@@ -30,15 +26,6 @@ typedef struct bbstreamer_plain_writer
 	bool		should_close_file;
 } bbstreamer_plain_writer;
 
-#ifdef HAVE_LIBZ
-typedef struct bbstreamer_gzip_writer
-{
-	bbstreamer	base;
-	char	   *pathname;
-	gzFile		gzfile;
-}			bbstreamer_gzip_writer;
-#endif
-
 typedef struct bbstreamer_extractor
 {
 	bbstreamer	base;
@@ -62,22 +49,6 @@ const bbstreamer_ops bbstreamer_plain_writer_ops = {
 	.free = bbstreamer_plain_writer_free
 };
 
-#ifdef HAVE_LIBZ
-static void bbstreamer_gzip_writer_content(bbstreamer *streamer,
-			

Re: refactoring basebackup.c

2022-01-20 Thread Dipesh Pandit
Hi,

Thanks for the feedback. I have incorporated the suggestions and
attached an updated patch, v2.

> I spent some time thinking about test coverage for the server-side
> backup code today and came up with the attached (v12-0003). It does an
> end-to-end test that exercises server-side backup and server-side
> compression and then untars the backup and validity-checks it using
> pg_verifybackup. In addition to being good test coverage for these
> patches, it also plugs a gap in the test coverage of pg_verifybackup,
> which currently has no test case that untars a tar-format backup and
> then verifies the result. I couldn't figure out a way to do that back
> at the time I was working on pg_verifybackup, because I didn't think
> we had any existing precedent for using 'tar' from a TAP test. But it
> was pointed out to me that we do, so I used that as the model for this
> test. It should be easy to generalize this test case to test lz4 and
> zstd as well, I think. But I guess we'll still need something
> different to test what your patch is doing.

I tried to add test coverage for server-side gzip compression with
plain-format backup using pg_verifybackup. I have modified the test
to use a flag specific to plain format: if this flag is set, the test
takes a plain-format backup (with server compression enabled) and
verifies it using pg_verifybackup. I have updated (v2-0002) with this
test coverage.

> It's going to need some documentation changes, too.
Yes, I am working on it.

Note: Before applying the patches, please apply Robert's v12 version
of the patches 0001, 0002 and 0003.

Thanks,
Dipesh
From 826a1cbb639afb7e10a20955d3ec64b1bab1fa80 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Thu, 20 Jan 2022 16:38:36 +0530
Subject: [PATCH 1/2] Support for extracting gzip compressed archive

pg_basebackup supports server-side compression using gzip. In order
to support plain-format backups with the option '-Fp', we need to add
support for decompressing the compressed blocks at the client. This
patch adds the extraction of gzip-compressed blocks at the client.
---
 src/bin/pg_basebackup/Makefile  |   1 +
 src/bin/pg_basebackup/bbstreamer.h  |   1 +
 src/bin/pg_basebackup/bbstreamer_file.c | 182 ---
 src/bin/pg_basebackup/bbstreamer_gzip.c | 377 
 src/bin/pg_basebackup/pg_basebackup.c   |  43 +++-
 5 files changed, 419 insertions(+), 185 deletions(-)
 create mode 100644 src/bin/pg_basebackup/bbstreamer_gzip.c

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index 5b18851..78d96c6 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -38,6 +38,7 @@ OBJS = \
 BBOBJS = \
 	pg_basebackup.o \
 	bbstreamer_file.o \
+	bbstreamer_gzip.o \
 	bbstreamer_inject.o \
 	bbstreamer_tar.o
 
diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h
index fc88b50..270b0df 100644
--- a/src/bin/pg_basebackup/bbstreamer.h
+++ b/src/bin/pg_basebackup/bbstreamer.h
@@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath,
 			const char *(*link_map) (const char *),
 			void (*report_output_file) (const char *));
 
+extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next);
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 77ca222..d721f87 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -11,10 +11,6 @@
 
 #include "postgres_fe.h"
 
-#ifdef HAVE_LIBZ
-#include 
-#endif
-
 #include 
 
 #include "bbstreamer.h"
@@ -30,15 +26,6 @@ typedef struct bbstreamer_plain_writer
 	bool		should_close_file;
 } bbstreamer_plain_writer;
 
-#ifdef HAVE_LIBZ
-typedef struct bbstreamer_gzip_writer
-{
-	bbstreamer	base;
-	char	   *pathname;
-	gzFile		gzfile;
-}			bbstreamer_gzip_writer;
-#endif
-
 typedef struct bbstreamer_extractor
 {
 	bbstreamer	base;
@@ -62,22 +49,6 @@ const bbstreamer_ops bbstreamer_plain_writer_ops = {
 	.free = bbstreamer_plain_writer_free
 };
 
-#ifdef HAVE_LIBZ
-static void bbstreamer_gzip_writer_content(bbstreamer *streamer,
-		   bbstreamer_member *member,
-		   const char *data, int len,
-		   bbstreamer_archive_context context);
-static void bbstreamer_gzip_writer_finalize(bbstreamer *streamer);
-static void bbstreamer_gzip_writer_free(bbstreamer *streamer);
-static const char *get_gz_error(gzFile gzf);
-
-const bbstreamer_ops bbstreamer_gzip_writer_ops = {
-	.content = bbstreamer_gzip_writer_content,
-	.finalize = bbstreamer_gzip_writer_finalize,
-	.free = bbstreamer_gzip_writer_free
-};
-#endif
-
 static void bbstreamer_extractor_content(bbstreamer *stream

Re: refactoring basebackup.c

2022-01-19 Thread Dipesh Pandit
Hi,

I have added support for decompressing a gzip-compressed tar file
at the client. With this change, pg_basebackup can enable server-side
compression for plain-format backups.

Added a gzip extractor which decompresses the compressed archive
and forwards it to the next streamer. I have done initial testing and
am working on updating the test coverage.
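The extractor described above follows zlib's streaming pattern: a chunk is fed in through next_in/avail_in, and each time the next_out/avail_out window fills up the buffer is handed to the next streamer and reset. Here is a self-contained sketch of that loop shape; the names (toy_stream, extract_chunk) and the byte-copy stand-in for decompression are hypothetical, with zlib's z_stream and inflate() playing those roles in the real code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy stand-in for z_stream: "decompression" here is a plain byte copy,
 * bounded by both the remaining input and the remaining output space.
 */
typedef struct toy_stream
{
	const char *next_in;
	size_t		avail_in;
	char	   *next_out;
	size_t		avail_out;
} toy_stream;

static void
toy_inflate(toy_stream *zs)
{
	size_t		n = zs->avail_in < zs->avail_out ? zs->avail_in : zs->avail_out;

	memcpy(zs->next_out, zs->next_in, n);
	zs->next_in += n;
	zs->avail_in -= n;
	zs->next_out += n;
	zs->avail_out -= n;
}

static char outbuf[4];			/* deliberately tiny output buffer */
static size_t sink_calls;		/* times we handed data downstream */

static void
sink(const char *data, size_t len)
{
	(void) data;
	(void) len;
	sink_calls++;				/* a real sink would call bbstreamer_content() */
}

/*
 * Consume one input chunk, flushing the output buffer downstream each
 * time it fills up (or when the input is exhausted).  Returns the total
 * number of bytes produced.
 */
static size_t
extract_chunk(const char *data, size_t len)
{
	toy_stream	zs = {data, len, outbuf, sizeof(outbuf)};
	size_t		total = 0;

	while (zs.avail_in > 0)
	{
		toy_inflate(&zs);

		if (zs.avail_out == 0 || zs.avail_in == 0)
		{
			size_t		produced = sizeof(outbuf) - zs.avail_out;

			sink(outbuf, produced);
			total += produced;

			/* reset the output window for the next round */
			zs.next_out = outbuf;
			zs.avail_out = sizeof(outbuf);
		}
	}
	return total;
}
```

The key property, mirrored from the real extractor, is that the output buffer is flushed and reused as many times as needed per input chunk, so its size never has to match the decompressed size of the data.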

Note: Before applying the patch, please apply Robert's v11 version
of the patches 0001 and 0002.

Thanks,
Dipesh
From 737badce26ed05b5cdb64d9ffd1735fef9acbbf8 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 19 Jan 2022 17:11:45 +0530
Subject: [PATCH] Support for extracting gzip compressed archive

pg_basebackup supports server-side compression using gzip. In order
to support plain-format backups with the option '-Fp', we need to add
support for decompressing the compressed blocks at the client. This
patch adds the extraction of gzip-compressed blocks at the client.
---
 src/bin/pg_basebackup/bbstreamer.h  |   1 +
 src/bin/pg_basebackup/bbstreamer_file.c | 175 
 src/bin/pg_basebackup/pg_basebackup.c   |  58 +--
 3 files changed, 225 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h
index fc88b50..270b0df 100644
--- a/src/bin/pg_basebackup/bbstreamer.h
+++ b/src/bin/pg_basebackup/bbstreamer.h
@@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath,
 			const char *(*link_map) (const char *),
 			void (*report_output_file) (const char *));
 
+extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next);
 extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next);
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 77ca222..350af1d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -37,6 +37,13 @@ typedef struct bbstreamer_gzip_writer
 	char	   *pathname;
 	gzFile		gzfile;
 }			bbstreamer_gzip_writer;
+
+typedef struct bbstreamer_gzip_extractor
+{
+	bbstreamer	base;
+	z_stream	zstream;
+	size_t		bytes_written;
+} bbstreamer_gzip_extractor;
 #endif
 
 typedef struct bbstreamer_extractor
@@ -76,6 +83,21 @@ const bbstreamer_ops bbstreamer_gzip_writer_ops = {
 	.finalize = bbstreamer_gzip_writer_finalize,
 	.free = bbstreamer_gzip_writer_free
 };
+
+static void bbstreamer_gzip_extractor_content(bbstreamer *streamer,
+			  bbstreamer_member *member,
+			  const char *data, int len,
+			  bbstreamer_archive_context context);
+static void bbstreamer_gzip_extractor_finalize(bbstreamer *streamer);
+static void bbstreamer_gzip_extractor_free(bbstreamer *streamer);
+static void *gzip_palloc(void *opaque, unsigned items, unsigned size);
+static void gzip_pfree(void *opaque, void *address);
+
+const bbstreamer_ops bbstreamer_gzip_extractor_ops = {
+	.content = bbstreamer_gzip_extractor_content,
+	.finalize = bbstreamer_gzip_extractor_finalize,
+	.free = bbstreamer_gzip_extractor_free
+};
 #endif
 
 static void bbstreamer_extractor_content(bbstreamer *streamer,
@@ -349,6 +371,159 @@ get_gz_error(gzFile gzf)
 #endif
 
 /*
+ * Create a new base backup streamer that performs decompression of gzip
+ * compressed blocks.
+ */
+bbstreamer *
+bbstreamer_gzip_extractor_new(bbstreamer *next)
+{
+#ifdef HAVE_LIBZ
+	bbstreamer_gzip_extractor	*streamer;
+	z_stream *zs;
+
+	Assert(next != NULL);
+
+	streamer = palloc0(sizeof(bbstreamer_gzip_extractor));
+	*((const bbstreamer_ops **) &streamer->base.bbs_ops) =
+		&bbstreamer_gzip_extractor_ops;
+
+	streamer->base.bbs_next = next;
+	initStringInfo(&streamer->base.bbs_buffer);
+
+	/* Initialize internal stream state for decompression */
+	zs = &streamer->zstream;
+	zs->zalloc = gzip_palloc;
+	zs->zfree = gzip_pfree;
+	zs->next_out = (uint8 *) streamer->base.bbs_buffer.data;
+	zs->avail_out = streamer->base.bbs_buffer.maxlen;
+
+	/*
+	 * Data compression was initialized using deflateInit2 to request a gzip
+	 * header. Similarly, we use inflateInit2 to initialize data
+	 * decompression.
+	 *
+	 * "windowBits" must be greater than or equal to the "windowBits" value
+	 * provided to deflateInit2 while compressing.
+	 */
+	if (inflateInit2(zs, 15 + 16) != Z_OK)
+	{
+		pg_log_error("could not initialize compression library");
+		exit(1);
+
+	}
+
+	return &streamer->base;
+#else
+	pg_log_error("this build does not support compression");
+	exit(1);
+#endif
+}
+
+#ifdef HAVE_LIBZ
+/*
+ * Decompress the input data into the output buffer until we run out of
+ * input data. Each time the output buffer fills up, invoke
+ * bbstreamer_content() to pass the decompressed data on to the next
+ * streamer.
+ */
+static void
+bbstreamer_gzip_extractor_content(bbstreamer *streamer,
+  bbstreamer_member *member,
+ 

Re: .ready and .done files considered harmful

2021-09-20 Thread Dipesh Pandit
Hi,

> 1. I've removed several calls to PgArchForceDirScan() in favor of
> calling it at the top of pgarch_ArchiverCopyLoop().  I believe
> there is some disagreement about this change, but I don't think
> we gain enough to justify the complexity.  The main reason we
> exit pgarch_ArchiverCopyLoop() should ordinarily be that we've
> run out of files to archive, so incurring a directory scan the
> next time it is called doesn't seem like it would normally be too
> bad.  I'm sure there are exceptions (e.g., lots of .done files,
> archive failures), but the patch is still not making things any
> worse than they presently are for these cases.

Yes, I think that when the archiver is lagging behind, a call to force
a directory scan at the top of pgarch_ArchiverCopyLoop() does not have
any impact. It may result in a directory scan in the next cycle only
when the archiver is ahead or in sync, but in that case the directory
scan should not incur much cost either, precisely because the archiver
is ahead. I agree that we can replace the separate calls that force a
directory scan in failure scenarios with a single PgArchForceDirScan()
call at the top of pgarch_ArchiverCopyLoop().

> 2. I removed all the logic that attempted to catch out-of-order
> .ready files.  Instead, XLogArchiveNotify() only forces a
> directory scan for files other than regular WAL files, and we
> depend on our periodic directory scans to pick up anything that's
> been left behind.
> 3. I moved the logic that forces directory scans every once in a
> while.  We were doing that in the checkpoint/restartpoint logic,
> which, upon further thought, might not be the best idea.  The
> checkpoint interval can vary widely, and IIRC we won't bother
> creating checkpoints at all if database activity stops.  Instead,
> I've added logic in pgarch_readyXlog() that forces a directory
> scan if one hasn't happened in a few minutes.
> 4. Finally, I've tried to ensure comments are clear and that the
> logic is generally easy to reason about.
>
> What do you think?

I agree. If we force a periodic directory scan, then we may not
require any special logic in XLogArchiveNotify() for handling
scenarios where a .ready file is created out of order. We need to
force a directory scan only for non-regular WAL files in
XLogArchiveNotify().

Overall, I think the periodic directory scan simplifies the patch and
makes sure that any missing file gets archived within a few minutes.
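The time-based fallback discussed above can be sketched as follows. The interval value and the function names are hypothetical, and the real state lives in archiver shared memory with appropriate locking; this just shows the decision logic:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical interval; the patch aims for "a few minutes". */
#define DIR_SCAN_INTERVAL 180	/* seconds */

static bool force_dir_scan = false;
static time_t last_dir_scan = 0;	/* 0 = no scan has happened yet */

/* Request a scan explicitly, in the spirit of PgArchForceDirScan(). */
static void
request_dir_scan(void)
{
	force_dir_scan = true;
}

/*
 * Decide, on each pgarch_readyXlog() call, whether to fall back to a
 * full directory scan: either one was explicitly requested, or too
 * much time has passed since the last one.
 */
static bool
need_dir_scan(time_t now)
{
	if (force_dir_scan || last_dir_scan == 0 ||
		now - last_dir_scan >= DIR_SCAN_INTERVAL)
	{
		force_dir_scan = false;
		last_dir_scan = now;
		return true;
	}
	return false;
}
```

Because the timer is checked inside the archiver's own loop rather than in checkpoint code, it keeps firing even when database activity (and therefore checkpointing) stops, which is the point made in the quoted item 3.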

Thanks,
Dipesh


Re: .ready and .done files considered harmful

2021-09-15 Thread Dipesh Pandit
Hi,

Thanks for the feedback.

> I wonder if this can be simplified even further.  If we don't bother
> trying to catch out-of-order .ready files in XLogArchiveNotify() and
> just depend on the per-checkpoint/restartpoint directory scans, we can
> probably remove lastReadySegNo from archiver state completely.

If we agree that some extra delay in archiving these files is
acceptable, then we don't require any special handling for this
scenario; otherwise we may need to handle it separately.

> +   /* Initialize the current state of archiver */
> +   xlogState.lastSegNo = MaxXLogSegNo;
> +   xlogState.lastTli = MaxTimeLineID;
>
> It looks like we have two ways to force a directory scan.  We can
> either set force_dir_scan to true, or lastSegNo can be set to
> MaxXLogSegNo.  Why not just set force_dir_scan to true here so that we
> only have one way to force a directory scan?

Makes sense; I have updated it.

> Don't we also need to force a directory scan in the other cases we
> return early from pgarch_ArchiverCopyLoop()?  We will have already
> advanced the archiver state in pgarch_readyXlog(), so I think we'd end
> up skipping files if we didn't.  For example, if archive_command isn't
> set, we'll just return, and the next call to pgarch_readyXlog() might
> return the next file.

I agree; we should do this for all early-return paths.

> nitpick: I think we should just call PgArchForceDirScan() here.

Yes, that's right.

> > > This is an interesting idea, but the "else" block here seems prone to
> > > race conditions.  I think we'd have to hold arch_lck to prevent that.
> > > But as I mentioned above, if we are okay with depending on the
> > > fallback directory scans, I think we can remove the "else" block
> > > completely.

Oh, I didn't realize there was a race condition here. The competing
processes can read the same value of lastReadySegNo.

> > Thinking further, we probably need to hold a lock even when we are
> > creating the .ready file to avoid race conditions.
>
> The race condition surely happens, but even if that happens, all
> competing processes except one of them detect out-of-order and will
> enforce directory scan.  But I'm not sure how it behaves under more
> complex situation so I'm not sure I like that behavior.
>
> We could just use another lock for the logic there, but instead
> couldn't we merge PgArchGetLastReadySegNo and PgArchSetLastReadySegNo
> into one atomic test-and-(check-and-)set function?  Like this.

I agree that we can merge the existing "Get" and "Set" functions into
an atomic test-and-check-and-set function to avoid the race condition.

I have incorporated these changes into an updated patch; please find
it attached.

Thanks,
Dipesh
From f8983c7b80ff0f8cbc415d4d9a3ad4992925e775 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 8 Sep 2021 21:47:16 +0530
Subject: [PATCH] keep trying the next file approach

---
 src/backend/access/transam/xlog.c|  20 +++
 src/backend/access/transam/xlogarchive.c |  16 ++
 src/backend/postmaster/pgarch.c  | 249 +++
 src/include/access/xlogdefs.h|   2 +
 src/include/postmaster/pgarch.h  |   4 +
 5 files changed, 259 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a7..7dd4b96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per checkpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
@@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per restartpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..110bee4 1006

Re: .ready and .done files considered harmful

2021-09-14 Thread Dipesh Pandit
Thanks for the feedback.

> The latest post on this thread contained a link to this one, and it
> made me want to rewind to this point in the discussion. Suppose we
> have the following alternative scenario:
>
> Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist
> yet.  The following directory scan ends up finding 12.ready.  Just
> before we update the PgArch state, XLogArchiveNotify() is called and
> creates 11.ready.  However, pg_readyXlog() has already decided to
> return WAL segment 12 and update the state to look for 13 next.
>
> Now, if I'm not mistaken, using <= doesn't help at all.
>
> In my opinion, the problem here is that the natural way to ask "is
> this file being archived out of order?" is to ask yourself "is the
> file that I'm marking as ready for archiving now the one that
> immediately follows the last one I marked as ready for archiving?" and
> then invert the result. That is, if I last marked 10 as ready, and now
> I'm marking 11 as ready, then it's in order, but if I'm now marking
> anything else whatsoever, then it's out of order. But that's not what
> this does. Instead of comparing what it's doing now to what it did
> last, it compares what it did now to what the archiver did last.

I agree that when we are creating a .ready file, we should compare it
with the last .ready file to check whether this file is being created
out of order. We can store the state of the last .ready file in shared
memory and compare it with the current one. I believe the
archiver-specific shared memory area can be used to store this state,
unless I am missing something and it needs to live in a separate
shared memory area.

With this change, we also gain the flexibility to move the current
archiver state out of shared memory and keep it local to the archiver.
I have incorporated these changes into an updated patch.
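The out-of-order check described here — "is the file I'm marking ready the immediate successor of the last one I marked ready?", with the get and set merged into one atomic step — can be sketched with a C11 atomic exchange. The names are hypothetical, and the actual patch uses a spinlock-protected test-and-check-and-set in archiver shared memory; this only illustrates the comparison:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;

/* Last segment for which we created a .ready file; 0 = none yet. */
static _Atomic XLogSegNo last_ready_segno;

/*
 * Atomically record that a .ready file is being created for
 * this_segno, and report whether it is out of order, i.e. not the
 * immediate successor of the previous .ready segment.
 */
static bool
ready_file_out_of_order(XLogSegNo this_segno)
{
	XLogSegNo	prev = atomic_exchange(&last_ready_segno, this_segno);

	return prev != 0 && this_segno != prev + 1;
}
```

Note that this compares against what the *notifier* did last, not what the archiver did last, which is exactly the distinction Robert draws in the quoted text.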


> > And it's really not obvious that that's correct. I think that the
> > above argument actually demonstrates a flaw in the logic, but even if
> > not, or even if it's too small a flaw to be a problem in practice, it
> > seems a lot harder to reason about.
>
> I certainly agree that it's harder to reason about.  If we were to go
> the keep-trying-the-next-file route, we could probably minimize a lot
> of the handling for these rare cases by banking on the "fallback"
> directory scans.  Provided we believe these situations are extremely
> rare, some extra delay for an archive every once in a while might be
> acceptable.

+1. Forcing a directory scan at each checkpoint makes sure that any
missing file gets archived within the checkpoint boundaries.

Please find the attached patch.

Thanks,
Dipesh
From f05b223b368e40594f0ed8440c0704fb7b970ee0 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 8 Sep 2021 21:47:16 +0530
Subject: [PATCH] keep trying the next file approach

---
 src/backend/access/transam/xlog.c|  20 +++
 src/backend/access/transam/xlogarchive.c |  25 
 src/backend/postmaster/pgarch.c  | 234 ++-
 src/include/access/xlogdefs.h|   2 +
 src/include/postmaster/pgarch.h  |   5 +
 5 files changed, 254 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a7..7dd4b96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per checkpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
@@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per restartpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..7756b87 100644
--- a/sr

Re: .ready and .done files considered harmful

2021-09-13 Thread Dipesh Pandit
Hi,

Thanks for the feedback.

> + * by checking the availability of next WAL file. "xlogState" specifies the
> + * segment number and timeline ID corresponding to the next WAL file.
>
> "xlogState" probably needs to be updated here.

Yes, I updated the comment.

> As noted before [0], I think we need to force a directory scan at the
> beginning of pgarch_MainLoop() and when pgarch_ArchiverCopyLoop()
> returns before we exit the "while" loop.  Else, there's probably a
> risk that we skip archiving a file until the next directory scan.  IMO
> forcing a directory scan at the beginning of pgarch_ArchiverCopyLoop()
> is a simpler way to do roughly the same thing.  I'm skeptical that
> persisting the next-anticipated state between calls to
> pgarch_ArchiverCopyLoop() is worth the complexity.

I think if we force a directory scan in pgarch_ArchiverCopyLoop() when
it returns before exiting the "while" loop, or outside the loop, then
it may result in a directory scan for every WAL file in one scenario
that I can think of.

There are two possible scenarios: one in which the archiver is always
lagging, and one in which the archiver is in sync with, or ahead of,
the rate at which WAL files are generated.

Focusing on the second scenario, consider a case where the archiver
has just archived 1.ready and is about to check the availability of
2.ready, but 2.ready is not yet present in the archive_status
directory. The archiver performs a directory scan as a fallback
mechanism and goes into its wait state. (The current implementation
relies on notifying the archiver by creating a .ready file on disk; a
file may be ready for archival, but because the notification mechanism
is slow, the archiver goes to wait in the meantime.) When 2.ready is
created on disk, the archiver is notified, wakes up, and calls
pgarch_ArchiverCopyLoop(). Now, if we unconditionally force a
directory scan in pgarch_ArchiverCopyLoop(), it may result in a
directory scan for every WAL file in this scenario. In this case we
already have the next anticipated log segment number and can avoid the
additional directory scan. I tested this with a small setup by
creating ~2000 WAL files, and it resulted in a directory scan for each
file.

I agree that the failure scenario discussed in [0] would require a WAL
file to wait until the next directory scan. However, this can be
avoided by forcing a directory scan in pgarch_ArchiverCopyLoop() only
in the failure case. That makes sure that when the archiver wakes up
for the next cycle it performs a full directory scan, leaving no risk
of missing a file due to an archive failure, while also avoiding the
additional directory scans described above.

I have incorporated the changes and attached an updated patch.

Thanks,
Dipesh

[0]
https://www.postgresql.org/message-id/AC78607B-9DA6-41F4-B253-840D3DD964BF%40amazon.com
From 7bb0794f3a41dacfada4863c26189ec01e3bee63 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 8 Sep 2021 21:47:16 +0530
Subject: [PATCH] keep trying the next file approach

---
 src/backend/access/transam/xlog.c|  20 +++
 src/backend/access/transam/xlogarchive.c |  23 
 src/backend/postmaster/pgarch.c  | 203 +++
 src/include/access/xlogdefs.h|   2 +
 src/include/postmaster/pgarch.h  |   4 +
 5 files changed, 225 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a7..7dd4b96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per checkpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
@@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per restartpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of th

Re: .ready and .done files considered harmful

2021-09-08 Thread Dipesh Pandit
> > I guess we still have to pick one or the other, but I don't really
> > know how to do that, since both methods seem to be relatively fine,
> > and the scenarios where one is better than the other all feel a little
> > bit contrived. I guess if no clear consensus emerges in the next week
> > or so, I'll just pick one and commit it. Not quite sure yet how I'll
> > do the picking, but we seem to all agree that something is better than
> > nothing, so hopefully nobody will be too sad if I make an arbitrary
> > decision. And if some clear agreement emerges before then, even
> > better.
>
> I will be happy to see this fixed either way.

+1

> > I agree, but it should probably be something like DEBUG3 instead of
> > LOG.
>
> I will update it in the next patch.

Updated the log level to DEBUG3 and rebased the patch; please find it
attached.

Thanks,
Dipesh
From e0ce2b85f9122963d820d005ca093e6c667ca7e6 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 8 Sep 2021 21:47:16 +0530
Subject: [PATCH] keep trying the next file approach

---
 src/backend/access/transam/xlog.c|  20 
 src/backend/access/transam/xlogarchive.c |  23 
 src/backend/postmaster/pgarch.c  | 195 ++-
 src/include/access/xlogdefs.h|   2 +
 src/include/postmaster/pgarch.h  |   4 +
 5 files changed, 217 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a7..7dd4b96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per checkpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
@@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per restartpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..0969f42 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -489,6 +489,29 @@ XLogArchiveNotify(const char *xlog)
 		return;
 	}
 
+	/* Force a directory scan if we are archiving anything but a regular
+	 * WAL file or if this WAL file is being created out-of-order.
+	 */
+	if (!IsXLogFileName(xlog))
+		PgArchForceDirScan();
+	else
+	{
+		TimeLineID tli;
+		XLogSegNo this_segno;
+		XLogSegNo arch_segno;
+
+		XLogFromFileName(xlog, &tli, &this_segno, wal_segment_size);
+		arch_segno = PgArchGetCurrentSegno();
+
+		/*
+		 * We must use <= because the archiver may have just completed a
+		 * directory scan and found a later segment (but hasn't updated
+		 * shared memory yet).
+		 */
+		if (this_segno <= arch_segno)
+			PgArchForceDirScan();
+	}
+
 	/* Notify archiver that it's got something to do */
 	if (IsUnderPostmaster)
 		PgArchWakeup();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..84ba1e0 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -47,6 +47,7 @@
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
+#include "storage/spin.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -76,6 +77,20 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Forces a directory scan in pgarch_readyXlog().  Protected by
+	 * arch_lck.
+	 */
+	bool		force_dir_scan;
+
+	/*
+	 * Current archiver state.  Protected by arch_lck.
+	 */
+	TimeLineID	lastTli;
+	XLogSegNo	lastSegNo;
+
+	slock_t		arch_lck;
 } PgArchData;
 
 
@@ -103,6 +118,7 @@ static bool pgarch_readyXlog(char *xlog);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
+static bool higher_arch_priority(const char *a, const char

Re: .ready and .done files considered harmful

2021-09-03 Thread Dipesh Pandit
Hi,

Thanks for the feedback.

> Which approach do you think we should use?  I think we have decent
> patches for both approaches at this point, so perhaps we should see if
> we can get some additional feedback from the community on which one we
> should pursue further.

In my opinion both approaches improve on the current implementation.
The keep-trying-the-next-file approach handles all of the rare,
specific scenarios that require forcing a directory scan to archive
the desired files. In addition, with the recent change to force a
directory scan at each checkpoint, we avoid an infinite wait for a
file that is still missed despite that special-case handling. It is
also more efficient in the extreme scenarios discussed in this thread.
However, the multiple-files-per-readdir approach is cleaner and
retains the resilience of the current implementation.

I agree that we should decide on which approach to pursue further based on
additional feedback from the community.

> The problem I see with this is that pgarch_archiveXlog() might end up
> failing.  If it does, we won't retry archiving the file until we do a
> directory scan.  I think we could try to avoid forcing a directory
> scan outside of these failure cases and archiver startup, but I'm not
> sure it's really worth it.  When pgarch_readyXlog() returns false, it
> most likely means that there are no .ready files present, so I'm not
> sure we are gaining a whole lot by avoiding a directory scan in that
> case.  I guess it might help a bit if there are a ton of .done files,
> though.

Yes, I think it will be useful when we have a bunch of .done files and
the frequency of .ready files is such that the archiver enters its
wait state before the next WAL file is ready for archival.

> I agree, but it should probably be something like DEBUG3 instead of
> LOG.

I will update it in the next patch.

Thanks,
Dipesh


Re: .ready and .done files considered harmful

2021-09-02 Thread Dipesh Pandit
Hi,

Thanks for the feedback.

> I attached two patches that demonstrate what I'm thinking this change
> should look like.  One is my take on the keep-trying-the-next-file
> approach, and the other is a new version of the multiple-files-per-
> readdir approach (with handling for "cheating" archive commands).  I
> personally feel that the multiple-files-per-readdir approach winds up
> being a bit cleaner and more resilient than the keep-trying-the-next-
> file approach.  However, the keep-trying-the-next-file approach will
> certainly be more efficient (especially for the extreme cases
> discussed in this thread), and I don't have any concrete concerns with
> this approach that seem impossible to handle.

I agree that multiple-files-per-readdir is cleaner and has the
resilience of the current implementation. However, I have a few
suggestions on the keep-trying-the-next-file patch shared earlier in
the thread.

+   /* force directory scan the first time we call pgarch_readyXlog() */
+   PgArchForceDirScan();
+

We should not force a directory scan in pgarch_ArchiverCopyLoop(). It
gets called whenever the archiver wakes up from its wait state, so
this would make the archiver perform a full directory scan despite
having accurate information about the next anticipated log segment.
Instead, we can check in pgarch_readyXlog() whether lastSegNo is
initialized and keep doing directory scans until it is.

+   return lastSegNo;
We should return "true" here.

I am wondering if we could add a log message for files that are
archived as part of a directory scan. That would be useful for
diagnosing whether the desired files get archived via directory scan
in the special scenarios. I also think we should add a few comments in
pgarch_readyXlog().

I have incorporated these changes and attached a patch
v1-0001-keep-trying-the-next-file-approach.patch.

+   /*
+* We must use <= because the archiver may have just completed a
+* directory scan and found a later segment (but hasn't updated
+* shared memory yet).
+*/
+   if (this_segno <= arch_segno)
+   PgArchForceDirScan();

I still think we should use the '<' operator here, because arch_segno
represents the segment number of the most recent .ready file found by
the archiver, and it is updated in shared memory only after the
archiver has successfully found a .ready file. In the normal case
this_segno will be greater than arch_segno, whereas when a .ready file
is created out of order this_segno may be less than arch_segno. Unless
I am missing something, I don't see a scenario where arch_segno is
equal to this_segno.
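For clarity, the difference between the two predicates can be sketched
as a standalone toy (with XLogSegNo stood in by a plain uint64_t; the
function names are made up for illustration, not actual pgarch.c code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;		/* stand-in for PostgreSQL's XLogSegNo */

/*
 * The <= variant: force a scan when the new .ready file's segment number
 * is at or below the last segment the archiver found, because the archiver
 * may have found a later segment but not yet published it in shared memory.
 */
static bool
needs_dir_scan_conservative(XLogSegNo this_segno, XLogSegNo arch_segno)
{
	return this_segno <= arch_segno;
}

/*
 * The < variant argued for above: arch_segno is only updated after the
 * archiver has successfully found a .ready file, so equality should be
 * impossible and a strict comparison suffices.
 */
static bool
needs_dir_scan_strict(XLogSegNo this_segno, XLogSegNo arch_segno)
{
	return this_segno < arch_segno;
}
```

The only input on which the two disagree is the equality case, which is
exactly the window the comment in the patch is worried about.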

Thanks,
Dipesh
From 4fb41105ba4ed9d02a2a551bb4f6cf693ec5c31e Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Thu, 2 Sep 2021 17:16:20 +0530
Subject: [PATCH] keep trying the next file approach

---
 src/backend/access/transam/xlog.c|  20 
 src/backend/access/transam/xlogarchive.c |  23 
 src/backend/postmaster/pgarch.c  | 195 ++-
 src/include/access/xlogdefs.h|   1 +
 src/include/postmaster/pgarch.h  |   4 +
 5 files changed, 216 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab..49caa61 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9483,6 +9483,16 @@ CreateCheckPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per checkpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
@@ -9848,6 +9858,16 @@ CreateRestartPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind.  Calling it here
+	 * ensures we do a directory scan at least once per restartpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index b9c19b2..44630e7 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -492,

Re: .ready and .done files considered harmful

2021-08-25 Thread Dipesh Pandit
> If a .ready file is created out of order, the directory scan logic
> will pick it up about as soon as possible based on its priority.  If
> the archiver is keeping up relatively well, there's a good chance such
> a file will have the highest archival priority and will be picked up
> the next time the archiver looks for a file to archive.  With the
> patch proposed in this thread, an out-of-order .ready file has no such
> guarantee.  As long as the archiver never has to fall back to a
> directory scan, it won't be archived.  The proposed patch handles the
> case where RemoveOldXlogFiles() creates missing .ready files by
> forcing a directory scan, but I'm not sure this is enough.  I think we
> have to check the archiver state each time we create a .ready file to
> see whether we're creating one out-of-order.

We can handle the scenario where a .ready file is created out of order
in XLogArchiveNotify(). That way we avoid having to call the
enable-directory-scan function explicitly from every code path that
might create an out-of-order .ready file.

The archiver can store the segment number of the last (most recent)
.ready file it has found. When a .ready file is created in
XLogArchiveNotify(), the log segment number of the new .ready file can
be compared with that stored segment number to detect whether the file
was created out of order, and a directory scan can be forced if
required.

I have incorporated these changes in patch v11.

> While this may be an extremely rare problem in practice, archiving
> something after the next checkpoint completes seems better than never
> archiving it at all.  IMO this isn't an area where there is much space
> to take risks.

An alternative approach could be to force a directory scan at each
checkpoint to break the infinite wait for a .ready file that is being
missed because it was created out of order. This makes sure the file
gets archived within the checkpoint boundaries.

Thoughts?

Please find attached patch v11.

Thanks,
Dipesh
From 9392fd1b82ade933e8127845013bb2940239af68 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Tue, 24 Aug 2021 12:17:34 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL
file that needs to be archived. This directory scan can be minimised
by maintaining the log segment number of the current WAL file being
archived and incrementing it by '1' to get the next WAL file. The
archiver checks the availability of that next file and, if it is not
available, falls back to a directory scan to get the oldest WAL file.

There are some special scenarios, like a timeline switch, a backup, or
a .ready file created out of order, that require the archiver to
perform a full directory scan, as archiving these files takes
precedence over regular WAL files.
---
 src/backend/access/transam/xlogarchive.c |  23 
 src/backend/postmaster/pgarch.c  | 211 ---
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/postmaster/pgarch.h  |   4 +
 4 files changed, 224 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index b9c19b2..cb73895 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -465,12 +465,16 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
  *
  * Optionally, nudge the archiver process so that it'll notice the file we
  * create.
+ *
+ * Also, notifies archiver to enable directory scan to handle a few special
+ * scenarios.
  */
 void
 XLogArchiveNotify(const char *xlog, bool nudge)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	FILE	   *fd;
+	bool		fileOutOfOrder = false;
 
 	/* insert an otherwise empty file called .ready */
 	StatusFilePath(archiveStatusPath, xlog, ".ready");
@@ -492,6 +496,25 @@ XLogArchiveNotify(const char *xlog, bool nudge)
 		return;
 	}
 
+	/* Check if .ready file is created out of order */
+	if (IsXLogFileName(xlog))
+	{
+		XLogSegNo	curSegNo;
+		TimeLineID	tli;
+
+		XLogFromFileName(xlog, &tli, &curSegNo, wal_segment_size);
+
+		fileOutOfOrder = PgArchIsBrokenReadyFileOrder(curSegNo);
+	}
+
+	/*
+	 * History files or a .ready file created out of order requires archiver to
+	 * perform a full directory scan.
+	 */
+	if (IsTLHistoryFileName(xlog) || IsBackupHistoryFileName(xlog) ||
+			fileOutOfOrder)
+		PgArchEnableDirScan();
+
 	/* If caller requested, let archiver know it's got work to do */
 	if (nudge)
 		PgArchWakeup();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..a33648a 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,31 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disab

Re: .ready and .done files considered harmful

2021-08-24 Thread Dipesh Pandit
Thanks for the feedback.

> > > IIUC partial WAL files are handled because the next file in the
> > > sequence with the given TimeLineID won't be there, so we will fall
> > > back to a directory scan and pick it up.  Timeline history files are
> > > handled by forcing a directory scan, which should work because they
> > > always have the highest priority.  Backup history files, however, do
> > > not seem to be handled.  I think one approach to fixing that is to
> > > also treat backup history files similarly to timeline history files.
> > > If one is created, we force a directory scan, and the directory scan
> > > logic will consider backup history files as higher priority than
> > > everything but timeline history files.
> >
> > Backup history files are (currently) just informational and they are
> > finally processed at the end of a bulk-archiving performed by the fast
> > path.  However, I feel that it is cleaner to trigger a directory scan
> > every time we add an other-than-a-regular-WAL-file, as base-backup or
> - promotion are not supposed happen so infrequently.
> + promotion are not supposed happen so frequently.

I have incorporated the changes to trigger a directory scan in case of
a backup history file. I also updated the archiver to prioritize a
backup history file over regular WAL files during a directory scan, to
make sure the backup history file gets archived before the directory
scan gets disabled upon archiving a regular WAL file.

> > I've been looking at the v9 patch with fresh eyes, and I still think
> > we should be able to force the directory scan as needed in
> > XLogArchiveNotify().  Unless the file to archive is a regular WAL file
> > that is > our stored location in archiver memory, we should force a
> > directory scan.  I think it needs to be > instead of >= because we
> > don't know if the archiver has just completed a directory scan and
> > found a later segment to use to update the archiver state (but hasn't
> > yet updated the state in shared memory).
>
> I'm afraid that it can be seen as a violation of modularity. I feel
> that wal-emitter side should not be aware of that datail of
> archiving. Instead, I would prefer to keep directory scan as far as it
> found an smaller segment id than the next-expected segment id ever
> archived by the fast-path (if possible).  This would be
> less-performant in the case out-of-order segments are frequent but I
> think the overall objective of the original patch will be kept.

The archiver selects the file with the lowest segment number during a
directory scan, and the next expected segment number is reset based on
this file. It starts a new sequence from there and checks the
availability of the next file. If there are holes, it keeps falling
back to directory scans until it finds the next in-order sequence. I
think this is already handled, unless I am missing something.
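The fallback behaviour described above can be sketched as a standalone
toy (the names and the array-based "directory" are illustrative
assumptions, not the actual pgarch.c implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;		/* stand-in for PostgreSQL's XLogSegNo */

/* toy stand-in for checking whether <segno>.ready exists */
static bool
ready_exists(const XLogSegNo *ready, size_t n, XLogSegNo segno)
{
	for (size_t i = 0; i < n; i++)
		if (ready[i] == segno)
			return true;
	return false;
}

/* toy stand-in for a full directory scan: find the oldest .ready segment */
static bool
dir_scan_oldest(const XLogSegNo *ready, size_t n, XLogSegNo *out)
{
	bool		found = false;

	for (size_t i = 0; i < n; i++)
		if (!found || ready[i] < *out)
		{
			*out = ready[i];
			found = true;
		}
	return found;
}

/*
 * One pgarch_readyXlog()-style decision: try last_segno + 1 first, and
 * fall back to a directory scan when the anticipated segment is missing.
 */
static bool
next_to_archive(const XLogSegNo *ready, size_t n,
				XLogSegNo last_segno, XLogSegNo *out)
{
	if (ready_exists(ready, n, last_segno + 1))
	{
		*out = last_segno + 1;	/* fast path */
		return true;
	}
	return dir_scan_oldest(ready, n, out);	/* fall-back path */
}
```

A hole in the sequence simply forces another scan, which restarts the
sequence at the oldest remaining segment, matching the behaviour
described above.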

> > Also, I think we need to make sure to set PgArch->dirScan back to true
> > at the end of pgarch_readyXlog() unless we've found a new regular WAL
> > file that we can use to reset the archiver's stored location.  This
> > ensures that we'll keep doing directory scans as long as there are
> > timeline/backup history files to process.
>
> Right.

Done.

Please find the attached patch v10.

Thanks,
Dipesh
From c0a4bf3937934e576db49f7a30530dcf71a068d6 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Tue, 24 Aug 2021 12:17:34 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL
file that needs to be archived. This directory scan can be minimised
by maintaining the log segment number of the current WAL file being
archived and incrementing it by '1' to get the next WAL file. The
archiver checks the availability of that next file and, if it is not
available, falls back to a directory scan to get the oldest WAL file.

There are some special scenarios, like a timeline switch, a backup, or
a .ready file created out of order, that require the archiver to
perform a full directory scan, as archiving these files takes
precedence over regular WAL files.
---
 src/backend/access/transam/xlogarchive.c |  23 -
 src/backend/postmaster/pgarch.c  | 170 ---
 src/include/postmaster/pgarch.h  |   1 +
 3 files changed, 178 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index b9c19b2..ececd10 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -465,6 +465,8 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
  *
  * Optionally, nudge the archiver pro

Re: .ready and .done files considered harmful

2021-08-19 Thread Dipesh Pandit
Hi,

Thanks for the feedback.

> Should we have XLogArchiveNotify(), writeTimeLineHistory(), and
> writeTimeLineHistoryFile() enable the directory scan instead?  Else,
> we have to exhaustively cover all such code paths, which may be
> difficult to maintain.  Another reason I am bringing this up is that
> my patch for adjusting .ready file creation [0] introduces more
> opportunities for .ready files to be created out-of-order.

XLogArchiveNotify() notifies the archiver that a log segment is ready
for archival by creating a .ready file. It is called for every log
segment, so placing a call to enable directory scans here would result
in a directory scan for every log segment.

We can have writeTimeLineHistory() and writeTimeLineHistoryFile() to
enable directory scan to handle the scenarios related to timeline switch.

However, in other scenarios I think we have to call
PgArchEnableDirScan() explicitly to enable a directory scan.
PgArchEnableDirScan() takes care of waking up the archiver, so its
callers do not have to nudge the archiver themselves.

> +	/*
> +	 * This is a fall-back path, check if we are here due to the unavailability
> +	 * of next anticipated log segment or the archiver is being forced to
> +	 * perform a full directory scan. Reset the flag in shared memory only if
> +	 * it has been enabled to force a full directory scan and then proceed with
> +	 * directory scan.
> +	 */
> +	if (PgArch->dirScan)
> +		PgArch->dirScan = false;

> Why do we need to check that the flag is set before we reset it?  I
> think we could just always reset it since we are about to do a
> directory scan anyway

Yes, I agree.

> > If there is a race condition here with setting the flag, then an
> > alternative design would be to use a counter - either a plain old
> > uint64 or perhaps pg_atomic_uint64 - and have the startup process
> > increment the counter when it wants to trigger a scan. In this design,
> > the archiver would never modify the counter itself, but just remember
> > the last value that it saw. If it later sees a different value it
> > knows that a full scan is required. I think this kind of system is
> > extremely robust against the general class of problems that you're
> > talking about here, but I'm not sure whether we need it, because I'm
> > not sure whether there is a race with just the bool.

> I'm not sure, either.  Perhaps it would at least be worth adding a
> pg_memory_barrier() after setting dirScan to false to avoid the
> scenario I mentioned (which may or may not be possible).  IMO this
> stuff would be much easier to reason about if we used a lock instead,
> even if the synchronization was not strictly necessary.  However, I
> don't want to hold this patch up too much on this point.

There is one possible scenario where it may run into a race condition.
If the archiver has just finished archiving all .ready files and the
next anticipated log segment is not available, it takes the fall-back
path and scans the directory, resetting the flag before the scan
begins. Now, if a timeline switch or an out-of-order .ready file
enables a directory scan concurrently with that reset, the request
could be missed. But even in that case the archiver will eventually
perform another directory scan, and the desired file will be archived
as part of it. Apart from this I can't think of any other scenario
that may result in a race condition, unless I am missing something.
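For reference, the counter-based design Robert described might look
roughly like this, sketched with C11 atomics standing in for
PostgreSQL's pg_atomic_uint64 (all names here are hypothetical):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Requesters only ever increment the counter; the archiver only remembers
 * the last value it saw.  A request that arrives while a scan is already
 * in progress changes the counter again, so it can never be lost the way
 * a plain boolean reset can lose it.
 */
static atomic_uint_least64_t force_dir_scan_req;	/* lives in shared memory */

/* backend/startup side: request a full directory scan */
static void
request_dir_scan(void)
{
	atomic_fetch_add(&force_dir_scan_req, 1);
}

/* archiver side: has anyone requested a scan since we last checked? */
static bool
dir_scan_requested(uint_least64_t *last_seen)
{
	uint_least64_t cur = atomic_load(&force_dir_scan_req);

	if (cur != *last_seen)
	{
		*last_seen = cur;
		return true;
	}
	return false;
}
```

Two requests arriving between archiver checks still collapse into one
scan, which is the desired behaviour; what the counter buys over the
boolean is that a request can never be silently cleared.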

I have incorporated the suggestions and updated a new patch. PFA patch v9.

Thanks,
Dipesh
From 04312fa668a56d9860547a02260d68c339747fa8 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL
file that needs to be archived. This directory scan can be minimised
by maintaining the log segment number of the file currently being
archived and incrementing it by '1' to get the next WAL file. The
archiver checks the availability of that next file and, if it is not
available, falls back to a directory scan to get the oldest WAL file.

If there is a timeline switch, the archiver performs a full directory
scan to make sure that archiving the history file takes precedence
over archiving WAL files on the older timeline.
---
 src/backend/access/transam/timeline.c|  10 ++
 src/backend/access/transam/xlogarchive.c |  11 +++
 src/backend/postmaster/pgarch.c  | 162 ---
 src/include/postmaster/pgarch.h  |   1 +
 4 files changed, 171 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c..f70cede 100644
-

Re: .ready and .done files considered harmful

2021-08-18 Thread Dipesh Pandit
Hi,

Thanks for the feedback. I have incorporated the suggestion to use an
unsynchronized boolean flag to force a directory scan. The flag is set
on a timeline switch or when a .ready file is created out of order,
and the archiver resets it, if set, before it begins a directory scan.

PFA patch v8.

Thanks,
Dipesh
From 1248da2cd96394b632e9853d8ff1f23f123a7813 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL
file that needs to be archived. This directory scan can be minimised
by maintaining the log segment number of the file currently being
archived and incrementing it by '1' to get the next WAL file. The
archiver checks the availability of that next file and, if it is not
available, falls back to a directory scan to get the oldest WAL file.

If there is a timeline switch, the archiver performs a full directory
scan to make sure that archiving the history file takes precedence
over archiving WAL files on the older timeline.
---
 src/backend/access/transam/xlog.c|  15 +++
 src/backend/access/transam/xlogarchive.c |  12 +++
 src/backend/postmaster/pgarch.c  | 155 ---
 src/include/postmaster/pgarch.h  |   1 +
 4 files changed, 170 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a7..c30ef16 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -50,6 +50,7 @@
 #include "port/atomics.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/pgarch.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/basebackup.h"
@@ -7564,6 +7565,13 @@ StartupXLOG(void)
 	 */
 	if (AllowCascadeReplication())
 		WalSndWakeup();
+
+	/*
+	 * Switched to a new timeline, notify archiver to enable
+	 * directory scan.
+	 */
+	if (XLogArchivingActive())
+		PgArchEnableDirScan();
 }
 
 /* Exit loop if we reached inclusive recovery target */
@@ -7806,6 +7814,13 @@ StartupXLOG(void)
 			 EndRecPtr, reason);
 
 		/*
+		 * Switched to a new timeline, notify archiver to enable directory
+		 * scan.
+		 */
+		if (XLogArchivingActive())
+			PgArchEnableDirScan();
+
+		/*
 		 * Since there might be a partial WAL segment named RECOVERYXLOG, get
 		 * rid of it.
 		 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..94c74f8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -609,6 +609,18 @@ XLogArchiveCheckDone(const char *xlog)
 
 	/* Retry creation of the .ready file */
 	XLogArchiveNotify(xlog);
+
+	/*
+	 * This .ready file is created out of order, notify archiver to perform
+	 * a full directory scan to archive corresponding WAL file.
+	 */
+	StatusFilePath(archiveStatusPath, xlog, ".ready");
+	if (stat(archiveStatusPath, &stat_buf) == 0)
+	{
+		PgArchEnableDirScan();
+		PgArchWakeup();
+	}
+
 	return false;
 }
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..9bfcfda 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,25 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disable directory scan. If this flag is set then it
+	 * forces archiver to perform a full directory scan to get the next log
+	 * segment. It is not required to synchronize this flag as it guarantees
+	 * directory scan for the next cycle even if it is being missed in current
+	 * cycle.
+	 */
+	bool		dirScan;
 } PgArchData;
 
+/*
+ * Segment number and timeline ID to identify the next file in a WAL sequence
+ */
+typedef struct readyXLogState
+{
+	XLogSegNo	lastSegNo;
+	TimeLineID	lastTLI;
+} readyXLogState;
 
 /* --
  * Local data
@@ -97,9 +114,9 @@ static volatile sig_atomic_t ready_to_stop = false;
  */
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
-static void pgarch_ArchiverCopyLoop(void);
+static void pgarch_ArchiverCopyLoop(readyXLogState *xlogState);
 static bool pgarch_archiveXlog(char *xlog);
-static bool pgarch_readyXlog(char *xlog);
+static bool pgarch_readyXlog(char *xlog, readyXLogState *xlogState);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
@@ -221,6 +238,15 @@ PgArchWakeup(void)
 		SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
 }
 
+/*
+ * Set dirScan flag in shared memory. Backend notifies archiver in case if an
+ * action requires full directory scan to get the next log segment.
+ */
+void
+PgArchEnableDirS

Re: .ready and .done files considered harmful

2021-08-17 Thread Dipesh Pandit
Thanks for the feedback.

> +   StatusFilePath(archiveStatusPath, xlog, ".ready");
> +   if (stat(archiveStatusPath, &stat_buf) == 0)
> +   PgArchEnableDirScan();

> We may want to call PgArchWakeup() after setting the flag.

Yes, added a call to wake up archiver.

> > +  * - The next anticipated log segment is not available.
> >
> > I wonder if we really need to perform a directory scan in this case.
> > Unless there are other cases where the .ready files are created out of
> > order, I think this just causes an unnecessary directory scan every
> > time the archiver catches up.

> Thinking further, I suppose this is necessary for when lastSegNo gets
> reset after processing an out-of-order .ready file.

Also, this is necessary when lastTLI gets reset after switching to a new
timeline.

> +   pg_atomic_flag dirScan;

> I personally don't think it's necessary to use an atomic here.  A
> spinlock or LWLock would probably work just fine, as contention seems
> unlikely.  If we use a lock, we also don't have to worry about memory
> barriers.

A history file should be archived as soon as it gets created. The
atomic flag makes sure there is no reordering of read/write
instructions while accessing the flag in shared memory. The archiver
needs to read this flag at the beginning of each cycle; writes to the
atomic flag are synchronized, and reads are lockless. I think an
atomic flag is an efficient choice here, unless I am missing
something.
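As a rough sketch of those semantics, with C11 atomics standing in for
pg_atomic_flag and all names illustrative: the setter publishes the
request with a sequentially consistent store, and the archiver consumes
it atomically at the start of its cycle, so a request that races with
the reset is at worst deferred to the next cycle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* shared-memory flag: true means a full directory scan has been requested */
static atomic_bool dir_scan;

/* backend/startup side: request a directory scan (e.g. on timeline switch) */
static void
enable_dir_scan(void)
{
	atomic_store(&dir_scan, true);	/* seq_cst store, no reordering */
}

/*
 * archiver side, once per cycle: atomically read and reset the flag.
 * A request arriving after the exchange simply survives for the next cycle.
 */
static bool
consume_dir_scan(void)
{
	return atomic_exchange(&dir_scan, false);
}
```

An atomic exchange is one way to make the read-and-reset a single
operation; the patch instead clears the flag just before scanning, which
leaves the small race window discussed earlier in the thread.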

Please find the attached patch v7.

Thanks,
Dipesh
From 55c42f851176a75881a55b1c75d624248169b876 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL
file that needs to be archived. This directory scan can be minimised
by maintaining the log segment number of the file currently being
archived and incrementing it by '1' to get the next WAL file. The
archiver checks the availability of that next file and, if it is not
available, falls back to a directory scan to get the oldest WAL file.

If there is a timeline switch, the archiver performs a full directory
scan to make sure that archiving the history file takes precedence
over archiving WAL files on the older timeline.
---
 src/backend/access/transam/xlog.c|  15 +++
 src/backend/access/transam/xlogarchive.c |  12 +++
 src/backend/postmaster/pgarch.c  | 163 ---
 src/include/postmaster/pgarch.h  |   1 +
 4 files changed, 178 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb..088ab43 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -50,6 +50,7 @@
 #include "port/atomics.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/pgarch.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/basebackup.h"
@@ -7555,6 +7556,13 @@ StartupXLOG(void)
 	 */
 	if (AllowCascadeReplication())
 		WalSndWakeup();
+
+	/*
+	 * Switched to a new timeline, notify archiver to enable
+	 * directory scan.
+	 */
+	if (XLogArchivingActive())
+		PgArchEnableDirScan();
 }
 
 /* Exit loop if we reached inclusive recovery target */
@@ -7797,6 +7805,13 @@ StartupXLOG(void)
 			 EndRecPtr, reason);
 
 		/*
+		 * Switched to a new timeline, notify archiver to enable directory
+		 * scan.
+		 */
+		if (XLogArchivingActive())
+			PgArchEnableDirScan();
+
+		/*
 		 * Since there might be a partial WAL segment named RECOVERYXLOG, get
 		 * rid of it.
 		 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..94c74f8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -609,6 +609,18 @@ XLogArchiveCheckDone(const char *xlog)
 
 	/* Retry creation of the .ready file */
 	XLogArchiveNotify(xlog);
+
+	/*
+	 * This .ready file is created out of order, notify archiver to perform
+	 * a full directory scan to archive corresponding WAL file.
+	 */
+	StatusFilePath(archiveStatusPath, xlog, ".ready");
+	if (stat(archiveStatusPath, &stat_buf) == 0)
+	{
+		PgArchEnableDirScan();
+		PgArchWakeup();
+	}
+
 	return false;
 }
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..e5ea7a6 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,23 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disable directory scan. If this flag is set then it
+	 * forces archiver to perform a full directory scan to get the next log
+	 * segment.

Re: .ready and .done files considered harmful

2021-08-12 Thread Dipesh Pandit
Hi,

Thanks for the feedback.

The possible path that the archiver can take in each cycle is either a
fast path or a fall-back path. The fast path involves checking the
availability of the next anticipated log segment and deciding the next
target for archival; the fall-back path involves a full directory scan
to get the next log segment. We need a mechanism that enables the
archiver to select the desired path for each cycle.

This can be achieved by maintaining a shared memory flag. If this flag
is set, the archiver should take the fall-back path; otherwise it
should continue with the fast path.

This flag can be set by a backend when an action like a timeline
switch or a .ready file created out of order requires the archiver to
perform a full directory scan.

I have incorporated these changes and updated a new patch. PFA patch v6.

Thanks,
Dipesh
From c636dffab006c15cf3a8656982679fbe6a6c6440 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimized by
maintaining the log segment number of the file currently being
archived and incrementing it by one to get the next WAL file. The
archiver can check the availability of the next file and, if the file
is not available, fall back to a directory scan to get the oldest WAL
file.

If there is a timeline switch, the archiver performs a full directory
scan to make sure that archiving the history file takes precedence
over archiving WAL files on the older timeline.
---
 src/backend/access/transam/xlog.c|  15 +++
 src/backend/access/transam/xlogarchive.c |   9 ++
 src/backend/postmaster/pgarch.c  | 163 ---
 src/include/postmaster/pgarch.h  |   1 +
 4 files changed, 175 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb..088ab43 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -50,6 +50,7 @@
 #include "port/atomics.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/pgarch.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/basebackup.h"
@@ -7555,6 +7556,13 @@ StartupXLOG(void)
 	 */
 	if (AllowCascadeReplication())
 		WalSndWakeup();
+
+	/*
+	 * Switched to a new timeline, notify archiver to enable
+	 * directory scan.
+	 */
+	if (XLogArchivingActive())
+		PgArchEnableDirScan();
 }
 
 /* Exit loop if we reached inclusive recovery target */
@@ -7797,6 +7805,13 @@ StartupXLOG(void)
 			 EndRecPtr, reason);
 
 		/*
+		 * Switched to a new timeline, notify archiver to enable directory
+		 * scan.
+		 */
+		if (XLogArchivingActive())
+			PgArchEnableDirScan();
+
+		/*
 		 * Since there might be a partial WAL segment named RECOVERYXLOG, get
 		 * rid of it.
 		 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..bccbbb6 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -609,6 +609,15 @@ XLogArchiveCheckDone(const char *xlog)
 
 	/* Retry creation of the .ready file */
 	XLogArchiveNotify(xlog);
+
+	/*
+	 * This .ready file is created out of order, notify archiver to perform
+	 * a full directory scan to archive corresponding WAL file.
+	 */
+	StatusFilePath(archiveStatusPath, xlog, ".ready");
+	if (stat(archiveStatusPath, &stat_buf) == 0)
+		PgArchEnableDirScan();
+
 	return false;
 }
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..e5ea7a6 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,23 @@
 typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disable directory scan. If this flag is set then it
+	 * forces archiver to perform a full directory scan to get the next log
+	 * segment.
+	 */
+	pg_atomic_flag dirScan;
 } PgArchData;
 
+/*
+ * Segment number and timeline ID to identify the next file in a WAL sequence
+ */
+typedef struct readyXLogState
+{
+	XLogSegNo	lastSegNo;
+	TimeLineID	lastTLI;
+} readyXLogState;
 
 /* --
  * Local data
@@ -97,12 +112,13 @@ static volatile sig_atomic_t ready_to_stop = false;
  */
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
-static void pgarch_ArchiverCopyLoop(void);
+static void pgarch_ArchiverCopyLoop(readyXLogState *xlogState);
 static bool pgarch_archiveXlog(char *xlog);
-static bool pgarch_readyXlog(char *xlog);
+static bool pgarch_readyXlog(char *xlog, readyXLogState *xlogState);
 static void pgarch_archiveDone(char *xlog);
static void pgarch_die(int code, Datum arg);

Re: .ready and .done files considered harmful

2021-08-05 Thread Dipesh Pandit
> I'm not sure. I think we need the value to be accurate during
> recovery, so I'm not sure whether replayEndTLI would get us there.
> Another approach might be to set ThisTimeLineID on standbys also.
> Actually just taking a fast look at the code I'm not quite sure why
> that isn't happening already. Do you have any understanding of that?

During investigation I found that the current timeline ID (ThisTimeLineID)
gets updated in XLogCtl's ThisTimeLineID once it is finalised as part
of archive recovery.

/*
 * Write the timeline history file, and have it archived. After this
 * point (or rather, as soon as the file is archived), the timeline
 * will appear as "taken" in the WAL archive and to any standby
 * servers.  If we crash before actually switching to the new
 * timeline, standby servers will nevertheless think that we switched
 * to the new timeline, and will try to connect to the new timeline.
 * To minimize the window for that, try to do as little as possible
 * between here and writing the end-of-recovery record.
 */

In case of Standby this happens only when it gets promoted.

If the standby is in recovery mode then replayEndTLI points to the
most recent TLI corresponding to the replayed records. Also, if
replaying a record causes a timeline switch, replayEndTLI gets updated
with the new timeline. So as long as the standby is in recovery mode,
replayEndTLI should point to its current timeline ID. Thoughts?

Thanks,
Dipesh


Re: .ready and .done files considered harmful

2021-08-05 Thread Dipesh Pandit
Hi,

> I don't really understand why you are storing something in shared
> memory specifically for the archiver. Can't we use XLogCtl's
> ThisTimeLineID instead of storing another copy of the information?

Yes, we can avoid storing another copy of the information. We can use
XLogCtl's ThisTimeLineID on the primary. However, XLogCtl's
ThisTimeLineID is not set to the current timeline ID on a standby
server; its value is set to '0'. Can we use XLogCtl's replayEndTLI on
the standby server to get the current timeline ID?

Thanks,
Dipesh


Re: .ready and .done files considered harmful

2021-08-02 Thread Dipesh Pandit
Hi,

> I think what you are saying is true before v14, but not in v14 and master.
Yes, we can use archiver specific shared memory. Thanks.

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()? I guess there could be a concern about the
> expensive of that, because we'd probably be taking a spinlock or an
> lwlock for every cycle, but I don't think it's probably that bad,
> because I doubt we can archive much more than a double-digit number of
> files per second even with a very fast archive_command, and contention
> on a lock generally requires a five digit number of acquisitions per
> second. It would be worth testing to see if we can see a problem here,
> but I'm fairly hopeful that it's not an issue. If we do feel that it's
> important to avoid repeatedly taking a lock, let's see if we can find
> a way to do it without dedicating a signal to this purpose.

We can maintain the current timeline ID in archiver-specific shared
memory. If we switch to a new timeline, the backend process can update
the new timeline ID in shared memory. The archiver can keep track of
the current timeline ID, and if it finds that a timeline switch has
occurred, it can perform a full directory scan to make sure that
archiving history files takes precedence over WAL files. Access to the
shared memory area can be protected by adding a WALArchiverLock. With
this approach, no dedicated signal is needed to notify a timeline
switch.

> The problem with all this is that you can't understand either function
> in isolation. Unless you read them both together and look at all of
> the ways these three variables are manipulated, you can't really
> understand the logic. And there's really no reason why that needs to
> be true. The job of cleaning timeline_switch and setting dirScan could
> be done entirely within pgarch_readyXlog(), and so could the job of
> incrementing nextLogSegNo, because we're not going to again call
> pgarch_readyXlog() unless archiving succeeded.

> Also note that the TLI which is stored in curFileTLI corresponds to
> the segment number stored in nextLogSegNo, yet one of them has "cur"
> for "current" in the name and the other has "next". It would be easier
> to read the code if the names were chosen more consistently.

> My tentative idea as to how to clean this up is: declare a new struct
> with a name like readyXlogState and members lastTLI and lastSegNo.
> Have pgarch_ArchiverCopyLoop() declare a variable of this type, zero
> it, pass it as a parameter to pgarch_readyXlog(), and otherwise leave
> it alone. Then let pgarch_readyXlog() do all of the manipulation of
> the values stored therein.

Makes sense; we can move the entire logic to a single function,
pgarch_readyXlog(), and declare a new struct readyXLogState.

I think we cannot declare a variable of this type in
pgarch_ArchiverCopyLoop(), because that function is called every time
the archiver wakes up. Initializing readyXLogState there would reset
the next anticipated log segment number whenever the archiver wakes up
from a wait state. We can instead declare and initialize it in
pgarch_MainLoop(), which avoids resetting the next anticipated log
segment number on each wakeup.

> You've moved this comment from its original location, but the trouble
> is that the comment is 100% false. In fact, the whole reason why you
> wrote this patch is *because* this comment is 100% false. In fact it
> is not difficult to create cases where each scan finds many files, and
> the purpose of the patch is precisely to optimize the code that the
> person who wrote this thought didn't need optimizing. Now it may take
> some work to figure out what we want to say here exactly, but
> preserving the comment as it's written here is certainly misleading.

Yes, I agree. We can update the comments here to list the scenarios
where we may need to perform a full directory scan.

I have incorporated these changes and updated a new patch. Please find
the attached patch v5.

Thanks,
Dipesh
From 3e0f690d1b8fd1d07fa503a807ee97cb9ed7377b Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archiv

Re: .ready and .done files considered harmful

2021-07-28 Thread Dipesh Pandit
Hi,

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()?

As of now, shared memory is not attached to the archiver, so the
archiver cannot access ThisTimeLineID or a flag available in shared
memory:

    if (strcmp(argv[1], "--forkbackend") == 0 ||
        strcmp(argv[1], "--forkavlauncher") == 0 ||
        strcmp(argv[1], "--forkavworker") == 0 ||
        strcmp(argv[1], "--forkboot") == 0 ||
        strncmp(argv[1], "--forkbgworker=", 15) == 0)
        PGSharedMemoryReAttach();
    else
        PGSharedMemoryNoReAttach();

This is the reason we have thought of sending a notification to the
archiver when there is a timeline switch. Should we consider attaching
shared memory to the archiver process, or explore the notification
mechanism further to avoid using SIGINT?

Thanks,
Dipesh


Re: .ready and .done files considered harmful

2021-07-27 Thread Dipesh Pandit
> Some minor suggestions:
Thanks for your comments. I have incorporated the changes
and updated a new patch. Please find the attached patch v4.

Thanks,
Dipesh

On Mon, Jul 26, 2021 at 9:44 PM Bossart, Nathan  wrote:

> On 7/26/21, 6:31 AM, "Robert Haas"  wrote:
> > In terms of immediate next steps, I think we should focus on
> > eliminating the O(n^2) problem and not get sucked into a bigger
> > redesign. The patch on the table aims to do just that much and I think
> > that's a good thing.
>
> I agree.  I'll leave further discussion about a redesign for another
> thread.
>
> Nathan
>
>
From c597411b08377ea64634d5198071060ec2b9c524 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimized by
maintaining the log segment number of the file currently being
archived and incrementing it by one to get the next WAL file. The
archiver can check the availability of the next file and, if the file
is not available, fall back to a directory scan to get the oldest WAL
file.

If there is a timeline switch, the backend sends a notification to the
archiver. The archiver registers the timeline switch and performs a
full directory scan to make sure that archiving history files takes
precedence over archiving WAL files.
---
 src/backend/access/transam/xlog.c |   8 +++
 src/backend/postmaster/pgarch.c   | 138 ++
 src/include/postmaster/pgarch.h   |   1 +
 3 files changed, 135 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3479402..2580ce8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -50,6 +50,7 @@
 #include "port/atomics.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/pgarch.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/basebackup.h"
@@ -8130,6 +8131,13 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
+	 * If archiver is active, send notification that timeline has switched.
+	 */
+	if (XLogArchivingActive() && ArchiveRecoveryRequested &&
+		IsUnderPostmaster)
+		PgArchNotifyTLISwitch();
+
+	/*
 	 * If this was a promotion, request an (online) checkpoint now. This isn't
 	 * required for consistency, but the last restartpoint might be far back,
 	 * and in case of a crash, recovering from it might take a longer than is
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..161a5b6 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -90,16 +90,18 @@ static PgArchData *PgArch = NULL;
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t ready_to_stop = false;
+static volatile sig_atomic_t timeline_switch = false;
 
 /* --
  * Local function forward declarations
  * --
  */
 static void pgarch_waken_stop(SIGNAL_ARGS);
+static void pgarch_timeline_switch(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
 static bool pgarch_archiveXlog(char *xlog);
-static bool pgarch_readyXlog(char *xlog);
+static bool pgarch_readyXlog(char *xlog, bool *dirScan, XLogSegNo *nextLogSegNo);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
@@ -169,10 +171,11 @@ PgArchiverMain(void)
 {
 	/*
 	 * Ignore all signals usually bound to some action in the postmaster,
-	 * except for SIGHUP, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT.
+	 * except for SIGHUP, SIGINT, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT.
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, SIG_IGN);
+	/* Archiver is notified by backend if there is a timeline switch */
+	pqsignal(SIGINT, pgarch_timeline_switch);
 	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -221,6 +224,23 @@ PgArchWakeup(void)
 		SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
 }
 
+/*
+ * Called by backend process to notify a timeline switch.
+ */
+void
+PgArchNotifyTLISwitch(void)
+{
+	int			arch_pgprocno = PgArch->pgprocno;
+
+	if (arch_pgprocno != INVALID_PGPROCNO)
+	{
+		int		archiver_pid = ProcGlobal->allProcs[arch_pgprocno].pid;
+
+		if (kill(archiver_pid, SIGINT) < 0)
+			elog(ERROR, "could not notify timeline change to archiver: %m");
+	}
+}
+
 
 /* SIGUSR2 signal handler for archiver process */
 static void
@@ -236,6 +256,19 @@ pgarch_waken_stop(SIGNAL_ARGS)
 }
 
 /*
+ * Interrupt handler for hand

Re: .ready and .done files considered harmful

2021-07-22 Thread Dipesh Pandit
Hi,

> some comments on v2.
Thanks for your comments. I have incorporated the changes
and updated a new patch. Please find the details below.

> On the timeline switch, setting a flag should be enough, I don't think
> that we need to wake up the archiver.  Because it will just waste the
> scan cycle.
Yes, I modified it.

> Why do we need multi level interfaces? I mean instead of calling first
> XLogArchiveNotifyTLISwitch and then calling PgArchNotifyTLISwitch,
> can't we directly call PgArchNotifyTLISwitch()?
Yes, multilevel interfaces are not required. Removed extra interface.

> +if (timeline_switch)
> +{
> +/* Perform a full directory scan in next cycle */
> +dirScan = true;
> +timeline_switch = false;
> +}

> I suggest you can add some comments atop this check.
Added comment to specify the action required in case of a
timeline switch.

> I think you should use %m in the error message so that it also prints
> the OS error code.
Done.

> Why is this a global variable?  I mean whenever you enter the function
> pgarch_ArchiverCopyLoop(), this can be set to true and after that you
> can pass this as inout parameter to pgarch_readyXlog() there in it can
> be conditionally set to false once we get some segment and whenever
> the timeline switch we can set it back to the true.
Yes, It is not necessary to have global scope for "dirScan". Changed
the scope to local for "dirScan" and "nextLogSegNo".

PFA patch v3.

Thanks,
Dipesh
From 76260a2ebf90fd063e06dac701e560a506b7a2b7 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimized by
maintaining the log segment number of the file currently being
archived and incrementing it by one to get the next WAL file. The
archiver can check the availability of the next file and, if the file
is not available, fall back to a directory scan to get the oldest WAL
file.

If there is a timeline switch, the backend sends a notification to the
archiver. The archiver registers the timeline switch and performs a
full directory scan to make sure that archiving history files takes
precedence over archiving WAL files.
---
 src/backend/access/transam/xlog.c |   8 +++
 src/backend/postmaster/pgarch.c   | 131 ++
 src/include/postmaster/pgarch.h   |   1 +
 3 files changed, 128 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c7c928f..baee37b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -50,6 +50,7 @@
 #include "port/atomics.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
+#include "postmaster/pgarch.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
 #include "replication/basebackup.h"
@@ -8130,6 +8131,13 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
+	 * If archiver is active, send notification that timeline has switched.
+	 */
+	if (XLogArchivingActive() && ArchiveRecoveryRequested &&
+		IsUnderPostmaster)
+		PgArchNotifyTLISwitch();
+
+	/*
 	 * If this was a promotion, request an (online) checkpoint now. This isn't
 	 * required for consistency, but the last restartpoint might be far back,
 	 * and in case of a crash, recovering from it might take a longer than is
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..b604966 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -90,16 +90,18 @@ static PgArchData *PgArch = NULL;
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t ready_to_stop = false;
+static volatile sig_atomic_t timeline_switch = false;
 
 /* --
  * Local function forward declarations
  * --
  */
 static void pgarch_waken_stop(SIGNAL_ARGS);
+static void pgarch_timeline_switch(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
 static bool pgarch_archiveXlog(char *xlog);
-static bool pgarch_readyXlog(char *xlog);
+static bool pgarch_readyXlog(char *xlog, bool *dirScan, XLogSegNo *nextLogSegNo);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
@@ -169,10 +171,11 @@ PgArchiverMain(void)
 {
 	/*
 	 * Ignore all signals usually bound to some action in the postmaster,
-	 * except for SIGHUP, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT.
+	 * except for SIGHUP, SIGINT, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT.
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);

Re: .ready and .done files considered harmful

2021-07-19 Thread Dipesh Pandit
Hi,

> I agree, I missed this part. The .history file should be given higher
> preference. I will take care of it in the next patch.

The archiver does not have access to shared memory, and the current
timeline ID is not available to it. In order to keep track of a
timeline switch we have to push a notification from the backend to the
archiver. The backend can send a signal to notify the archiver about
the timeline change. The archiver can register this notification and
perform a full directory scan to make sure that archiving history
files takes precedence over archiving WAL files.

> If a history file is found we are not updating curFileTLI and
> nextLogSegNo, so it will attempt the previously found segment.  This
> is fine because it will not find that segment and it will rescan the
> directory.  But I think we can do better, instead of searching the
> same old segment in the previous timeline we can search that old
> segment in the new TL so that if the TL switch happened within the
> segment then we will find the segment and we will avoid the directory
> search.

This could have been done with the approach mentioned in patch v1, but
now that archiving history files takes precedence over WAL files, we
cannot update curFileTLI whenever a history file is found.

> So everytime archiver will start with searching segno=0 in timeline=0.
> Instead of doing this can't we first scan the directory and once we
> get the first segment to archive then only we can start predicting the
> next wal segment?

Done.

> This comment is a bit confusing with the name of the variable
> nextLogSegNo. I think the name of the variable is appropriate here,
> but maybe we can reword the comment something like:

Done.

I have incorporated these changes and updated a new patch. PFA, patch v2.

Thanks,
Dipesh
From 22b0e8fb9c778fbfdc945a647f82f6bbd8d6ec0a Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimized by
maintaining the log segment number of the file currently being
archived and incrementing it by one to get the next WAL file. The
archiver can check the availability of the next file and, if the file
is not available, fall back to a directory scan to get the oldest WAL
file.

If there is a timeline switch, the backend sends a notification to the
archiver. The archiver registers the timeline switch and performs a
full directory scan to make sure that archiving history files takes
precedence over archiving WAL files.
---
 src/backend/access/transam/xlog.c|   6 ++
 src/backend/access/transam/xlogarchive.c |  10 +++
 src/backend/postmaster/pgarch.c  | 124 ---
 src/include/access/xlogarchive.h |   1 +
 src/include/postmaster/pgarch.h  |   1 +
 5 files changed, 133 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c7c928f..40969a9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8130,6 +8130,12 @@ StartupXLOG(void)
 	WalSndWakeup();
 
 	/*
+	 * If archiver is active, send notification that timeline has switched.
+	 */
+	if (XLogArchivingActive() && ArchiveRecoveryRequested)
+		XLogArchiveNotifyTLISwitch();
+
+	/*
 	 * If this was a promotion, request an (online) checkpoint now. This isn't
 	 * required for consistency, but the last restartpoint might be far back,
 	 * and in case of a crash, recovering from it might take a longer than is
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e..1968872 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -507,6 +507,16 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 }
 
 /*
+ * Signal archiver to notify timeline switch
+ */
+void
+XLogArchiveNotifyTLISwitch(void)
+{
+	if (IsUnderPostmaster)
+		PgArchNotifyTLISwitch();
+}
+
+/*
  * XLogArchiveForceDone
  *
  * Emit notification forcibly that an XLOG segment file has been successfully
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..e23b9bc 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -90,12 +90,22 @@ static PgArchData *PgArch = NULL;
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t ready_to_stop = false;
+static volatile sig_atomic_t timeline_switch = false;
+
+/*
+ * Log segment number to get next WAL file in a sequence.
+ */
+static XLogSegNo nextLogSegNo = 0;
+
+/* Flag to specify a full directory scan to find next log file */
+static bool dirScan = true;
 
 /* --
  * Local function forward declarations
  * --
  */
static void pgarch_waken_stop(SIGNAL_ARGS);

Re: .ready and .done files considered harmful

2021-07-06 Thread Dipesh Pandit
> specifically about history files being given higher priority for
> archiving.  If we go with this change then we'd at least want to rewrite
> or remove those comments, but I don't actually agree that we should
> remove that preference to archive history files ahead of WAL, for the
> reasons brought up previously.

> As was suggested on that subthread, it seems like it should be possible
> to just track the current timeline and adjust what we're doing if the
> timeline changes, and we should even know what the .history file is at
> that point and likely don't even need to scan the directory for it, as
> it'll be the old timeline ID.

I agree, I missed this part. The .history file should be given higher
preference. I will take care of it in the next patch.

Thanks,
Dipesh


Re: .ready and .done files considered harmful

2021-07-06 Thread Dipesh Pandit
Hi,

We have addressed the O(n^2) problem, which involves a directory scan
for archiving each individual WAL file, by maintaining a WAL counter
to identify the next WAL file in a sequence.

The WAL archiver scans the status directory to identify the next WAL
file which needs to be archived. This directory scan can be minimized
by maintaining the log segment number of the file currently being
archived and incrementing it by one to get the next WAL file in the
sequence. The archiver can check the availability of the next file in
the status directory and, if the file is not available, fall back to a
directory scan to get the oldest WAL file.

Please find attached patch v1.

Thanks,
Dipesh

On Fri, May 7, 2021 at 1:31 AM Andres Freund  wrote:

> Hi,
>
> On 2021-05-06 21:23:36 +0200, Hannu Krosing wrote:
> > How are you envisioning the shared-memory signaling should work in the
> > original sample case, where the archiver had been failing for half a
> > year ?
>
> If we leave history files and gaps in the .ready sequence aside for a
> second, we really only need an LSN or segment number describing the
> current "archive position". Then we can iterate over the segments
> between the "archive position" and the flush position (which we already
> know). Even if we needed to keep statting .ready/.done files (to handle
> gaps due to archive command mucking around with .ready/done), it'd still
> be a lot cheaper than what we do today.  It probably would even still be
> cheaper if we just statted all potentially relevant timeline history
> files all the time to send them first.
>
>
> > Or should we perhaps have a system table for ready-to-archive WAL
> > files to get around limitation sof file system to return just the
> > needed files with ORDER BY ... LIMIT as we already know how to make
> > lookups in database fast ?
>
> Archiving needs to work on a standby so that doesn't seem like an
> option.
>
> Regards,
>
> Andres Freund
>
>
>
From ce5591c7ef574a1b039a5de807eb0c566a3456e1 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit 
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL
file that needs to be archived. This directory scan can be minimized by
maintaining the log segment number of the file currently being archived
and incrementing it by one to get the next WAL file. The archiver checks
whether the next file is available; if it is not, the archiver falls
back to a directory scan to find the oldest WAL file.
---
 src/backend/postmaster/pgarch.c | 63 -
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..eb219d7 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -87,6 +87,12 @@ static time_t last_sigterm_time = 0;
 static PgArchData *PgArch = NULL;
 
 /*
+ * Log segment number and timeline ID used to identify the next WAL file.
+ */
+static XLogSegNo nextLogSegNo = 0;
+static TimeLineID curFileTLI = 0;
+
+/*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t ready_to_stop = false;
@@ -411,6 +417,10 @@ pgarch_ArchiverCopyLoop(void)
 /* successful */
 pgarch_archiveDone(xlog);
 
+/* Increment log segment number to point to the next WAL file */
+if (!IsTLHistoryFileName(xlog))
+	nextLogSegNo++;
+
 /*
  * Tell the collector about the WAL file that we successfully
  * archived
@@ -596,29 +606,55 @@ pgarch_archiveXlog(char *xlog)
  * larger ID; the net result being that past timelines are given higher
  * priority for archiving.  This seems okay, or at least not obviously worth
  * changing.
+ *
+ * WAL files are generated in increasing order of log segment number. The
+ * per-file directory scan can be minimized by identifying the next WAL
+ * file in the sequence. This is achieved by maintaining the log segment
+ * number and timeline ID of the WAL file currently being archived. The
+ * log segment number is incremented by '1' upon successful archival to
+ * point to the next WAL file.
  */
 static bool
 pgarch_readyXlog(char *xlog)
 {
-	/*
-	 * open xlog status directory and read through list of xlogs that have the
-	 * .ready suffix, looking for earliest file. It is possible to optimise
-	 * this code, though only a single file is expected on the vast majority
-	 * of calls, so
-	 */
+	char		basename[MAX_XFN_CHARS + 1];
+	char		xlogready[MAXPGPATH];
 	char		XLogArchiveStatusDir[MAXPGPATH];
 	DIR		   *rldir;
 	struct dirent *rlde;
+	struct stat	st;
 	bool		found = false;
 	bool		historyFound = false;
 
+	/*
+	 * Log segment number already p

Re: refactoring basebackup.c

2020-06-29 Thread Dipesh Pandit
Hi,

I have repeated the experiment with 8K block size and found that the
results do not vary much after applying the patch.
Please find the details below.

*Backup type*: local backup using pg_basebackup
*Data size*: Around 200GB (200 tables - each table around 1.05 GB)
*TAR_SEND_SIZE value*: 8kb

*Server details:*
RAM: 500 GB CPU details: Architecture: x86_64 CPU op-mode(s): 32-bit,
64-bit Byte Order: Little Endian CPU(s): 128 Filesystem: ext4

*Results:*

Iteration   Without refactor patch   With refactor patch
1st run     real 10m19.001s          real 9m45.291s
            user 1m37.895s           user 1m23.192s
            sys  8m33.008s           sys  8m14.993s
2nd run     real 9m33.970s           real 9m30.560s
            user 1m19.490s           user 1m22.124s
            sys  8m6.062s            sys  8m0.979s
3rd run     real 9m19.327s           real 8m59.241s
            user 1m21.772s           user 1m19.001s
            sys  7m50.613s           sys  7m32.645s
4th run     real 9m56.873s           real 9m52.290s
            user 1m22.370s           user 1m22.175s
            sys  8m27.054s           sys  8m23.052s
5th run     real 9m45.343s           real 9m49.633s
            user 1m23.113s           user 1m23.122s
            sys  8m15.418s           sys  8m19.240s

Later I checked with Suraj to validate the experiment details and
confirmed that the setup and steps followed in this experiment are
exactly the same as in the previous one.

Thanks,
Dipesh

On Thu, May 14, 2020 at 7:50 AM Suraj Kharage <
suraj.khar...@enterprisedb.com> wrote:

> Hi,
>
> On Wed, May 13, 2020 at 7:49 PM Robert Haas  wrote:
>
>>
>> So the patch came out slightly faster at 8kB and slightly slower in the
>> other tests. That's kinda strange. I wonder if it's just noise. How much do
>> the results vary run to run?
>>
> It is not varying much except for 8kB run. Please see below details for
> both runs of each scenario.
>
>                  8kB            32kB (default)  128kB          1024kB
> Without refactor patch
> 1st run   real   10m50.924s     8m36.245s       7m8.690s       18m16.898s
>           user   1m29.774s      1m8.471s        0m54.840s      1m39.105s
>           sys    9m13.058s      7m21.520s       6m1.725s       9m42.803s
> 2nd run   real   10m22.718s     8m44.455s       6m54.299s      18m3.511s
>           user   1m23.629s      1m7.896s        0m55.690s      1m38.197s
>           sys    8m51.410s      7m28.909s       5m46.502s      9m36.517s
> With refactor patch
> 1st run   real   10m11.350s     8m56.226s       7m26.678s      19m5.218s
>           user   1m25.038s      1m9.774s        0m54.833s      1m44.122s
>           sys    8m39.226s      7m41.032s       6m20.057s      10m17.623s
> 2nd run   real   11m30.500s     9m4.103s        7m26.713s      18m17.230s
>           user   1m45.221s      1m6.893s        0m54.868s      1m42.749s
>           sys    9m37.815s      7m49.393s       6m19.652s      9m53.704s
>
>
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
>


Re: WIP/PoC for parallel backup

2020-04-22 Thread Dipesh Pandit
Hi Asif,

I am reviewing your recent patch and found that it does not apply on the
latest master.

Could you please resolve the conflicts and share an updated patch?

Thanks,
Dipesh
EnterpriseDB: http://www.enterprisedb.com