Re: basebackup/lz4 crash
Hi,

> I think your proposed change is OK, modulo some comments. But I think
> maybe we ought to delete all the stuff related to compressed_bound
> from bbstreamer_lz4_compressor_new() as well, because I don't see that
> there's any point. And then I think we should also add logic similar
> to what you've added here to bbstreamer_lz4_compressor_finalize(), so
> that we're not making the assumption that the buffer will get enlarged
> at some earlier point.
>
> Thoughts?

I agree that we should remove the compression-bound logic from
bbstreamer_lz4_compressor_new() and instead fix
bbstreamer_lz4_compressor_content() and bbstreamer_lz4_compressor_finalize()
to enlarge the buffer whenever it falls short of the compression bound.
Patch attached.

Thanks,
Dipesh

From ee9b5e4739bf78b39dc34c9fcc76fc731b09b788 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Thu, 31 Mar 2022 12:19:31 +0530
Subject: [PATCH] fix crash with lz4

The size of the client's output buffer cannot be fixed in advance during lz4
compression. The client receives chunks from the server, and the size of these
chunks varies depending on the data the server sends. The client's output
buffer capacity should therefore be adjusted based on the size of each chunk
received from the server.
---
 src/bin/pg_basebackup/bbstreamer_lz4.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/src/bin/pg_basebackup/bbstreamer_lz4.c b/src/bin/pg_basebackup/bbstreamer_lz4.c
index 67f841d..2ffe224 100644
--- a/src/bin/pg_basebackup/bbstreamer_lz4.c
+++ b/src/bin/pg_basebackup/bbstreamer_lz4.c
@@ -73,7 +73,6 @@ bbstreamer_lz4_compressor_new(bbstreamer *next, bc_specification *compress)
 	bbstreamer_lz4_frame *streamer;
 	LZ4F_errorCode_t ctxError;
 	LZ4F_preferences_t *prefs;
-	size_t		compressed_bound;
 
 	Assert(next != NULL);
@@ -92,17 +91,6 @@ bbstreamer_lz4_compressor_new(bbstreamer *next, bc_specification *compress)
 	if ((compress->options & BACKUP_COMPRESSION_OPTION_LEVEL) != 0)
 		prefs->compressionLevel = compress->level;
 
-	/*
-	 * Find out the compression bound, it specifies the minimum destination
-	 * capacity required in worst case for the success of compression operation
-	 * (LZ4F_compressUpdate) based on a given source size and preferences.
-	 */
-	compressed_bound = LZ4F_compressBound(streamer->base.bbs_buffer.maxlen, prefs);
-
-	/* Enlarge buffer if it falls short of compression bound. */
-	if (streamer->base.bbs_buffer.maxlen < compressed_bound)
-		enlargeStringInfo(&streamer->base.bbs_buffer, compressed_bound);
-
 	ctxError = LZ4F_createCompressionContext(&streamer->cctx, LZ4F_VERSION);
 	if (LZ4F_isError(ctxError))
 		pg_log_error("could not create lz4 compression context: %s",
@@ -170,7 +158,6 @@ bbstreamer_lz4_compressor_content(bbstreamer *streamer,
 	 * forward the content to next streamer and empty the buffer.
 	 */
 	out_bound = LZ4F_compressBound(len, &mystreamer->prefs);
-	Assert(mystreamer->base.bbs_buffer.maxlen >= out_bound);
 	if (avail_out < out_bound)
 	{
 		bbstreamer_content(mystreamer->base.bbs_next, member,
@@ -178,6 +165,10 @@ bbstreamer_lz4_compressor_content(bbstreamer *streamer,
 						   mystreamer->bytes_written, context);
 
+		/* Enlarge buffer if it falls short of out bound. */
+		if (mystreamer->base.bbs_buffer.maxlen < out_bound)
+			enlargeStringInfo(&mystreamer->base.bbs_buffer, out_bound);
+
 		avail_out = mystreamer->base.bbs_buffer.maxlen;
 		mystreamer->bytes_written = 0;
 		next_out = (uint8 *) mystreamer->base.bbs_buffer.data;
@@ -218,7 +209,6 @@ bbstreamer_lz4_compressor_finalize(bbstreamer *streamer)
 	/* Find out the footer bound and update the output buffer. */
 	footer_bound = LZ4F_compressBound(0, &mystreamer->prefs);
-	Assert(mystreamer->base.bbs_buffer.maxlen >= footer_bound);
 	if ((mystreamer->base.bbs_buffer.maxlen -
 		 mystreamer->bytes_written) < footer_bound)
 	{
@@ -227,6 +217,10 @@ bbstreamer_lz4_compressor_finalize(bbstreamer *streamer)
 						   mystreamer->bytes_written, BBSTREAMER_UNKNOWN);
 
+		/* Enlarge buffer if it falls short of footer bound. */
+		if (mystreamer->base.bbs_buffer.maxlen < footer_bound)
+			enlargeStringInfo(&mystreamer->base.bbs_buffer, footer_bound);
+
 		avail_out = mystreamer->base.bbs_buffer.maxlen;
 		mystreamer->bytes_written = 0;
 		next_out = (uint8 *) mystreamer->base.bbs_buffer.data;
-- 
1.8.3.1
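The fix follows the same pattern in both call sites: compute the worst-case output size for the next chunk and grow the output buffer when it falls short. Here is a minimal standalone sketch of that pattern; SimpleBuffer, enlarge_buffer(), and fake_compress_bound() are simplified stand-ins for PostgreSQL's StringInfo, enlargeStringInfo(), and liblz4's LZ4F_compressBound(), not the real implementations.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for PostgreSQL's StringInfo buffer. */
typedef struct
{
	char	   *data;
	int			maxlen;
} SimpleBuffer;

/* Grow the buffer geometrically until it can hold "needed" bytes. */
static void
enlarge_buffer(SimpleBuffer *buf, int needed)
{
	if (buf->maxlen >= needed)
		return;
	while (buf->maxlen < needed)
		buf->maxlen *= 2;
	buf->data = realloc(buf->data, buf->maxlen);
}

/* Placeholder worst-case bound; the real formula lives in liblz4. */
static int
fake_compress_bound(int srclen)
{
	return srclen + srclen / 255 + 64;
}

/*
 * The pattern applied in both bbstreamer_lz4_compressor_content() and
 * _finalize(): before compressing a chunk, compute the worst-case output
 * size and enlarge the output buffer if it falls short.
 */
static void
prepare_output_buffer(SimpleBuffer *buf, int chunklen)
{
	int			out_bound = fake_compress_bound(chunklen);

	if (buf->maxlen < out_bound)
		enlarge_buffer(buf, out_bound);
}
```

Doing the check per chunk (rather than once in the constructor) is what removes the assumption that the buffer was enlarged at some earlier point.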
Re: fixing a few backup compression goofs
Hi,

The changes look good to me.

Thanks,
Dipesh
Re: refactoring basebackup.c
Hi,

I tried to implement support for parallel ZSTD compression. The library
provides an option (ZSTD_c_nbWorkers) to specify the number of compression
workers. The number of parallel workers can be set as part of the compression
parameters, and if this option is specified then the library performs parallel
compression with the specified number of workers. The user can specify the
number of parallel workers as part of the --compress option by appending an
integer value after an at sign (@):

    -Z, --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL][@WORKERS]

Please find attached patch v1 with the above changes.

Note: ZSTD library versions 1.5.x support parallel compression by default; for
versions lower than 1.5.x, parallel compression is enabled only if the library
is compiled with the build macro ZSTD_MULTITHREAD. If the linked library
version doesn't support parallel compression, then setting ZSTD_c_nbWorkers to
a value other than 0 is a no-op and returns an error.
Thanks, Dipesh From 688ad1e3f9b43bf911e8c3837497a874e4a6937f Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Mon, 14 Mar 2022 18:39:02 +0530 Subject: [PATCH] support parallel zstd compression --- doc/src/sgml/ref/pg_basebackup.sgml | 11 ++- src/backend/replication/basebackup.c | 14 +++- src/backend/replication/basebackup_zstd.c | 15 - src/bin/pg_basebackup/bbstreamer.h| 3 +- src/bin/pg_basebackup/bbstreamer_zstd.c | 11 ++- src/bin/pg_basebackup/pg_basebackup.c | 97 +-- src/bin/pg_verifybackup/t/008_untar.pl| 8 +++ src/bin/pg_verifybackup/t/010_client_untar.pl | 8 +++ src/include/replication/basebackup_sink.h | 2 +- 9 files changed, 142 insertions(+), 27 deletions(-) diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index 4a630b5..87feca0 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -399,9 +399,9 @@ PostgreSQL documentation -Z level - -Z [{client|server}-]method[:level] + -Z [{client|server}-]method[:level][@workers] --compress=level - --compress=[{client|server}-]method[:level] + --compress=[{client|server}-]method[:level][@workers] Requests compression of the backup. If client or @@ -428,6 +428,13 @@ PostgreSQL documentation the level is 0. +Compression workers can be specified optionally by appending the +number of workers after an at sign (@). It +defines the degree of parallelism while compressing the archive. +Currently, parallel compression is supported only for +zstd compressed archives. 
+ + When the tar format is used with gzip, lz4, or zstd, the suffix .gz, .lz4, or diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 2378ce5..8217fa9 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -82,6 +82,7 @@ typedef struct backup_manifest_option manifest; basebackup_compression_type compression; int compression_level; + int compression_workers; pg_checksum_type manifest_checksum_type; } basebackup_options; @@ -718,6 +719,7 @@ parse_basebackup_options(List *options, basebackup_options *opt) char *target_str = "compat"; /* placate compiler */ bool o_compression = false; bool o_compression_level = false; + bool o_compression_workers = false; MemSet(opt, 0, sizeof(*opt)); opt->target = BACKUP_TARGET_CLIENT; @@ -925,6 +927,15 @@ parse_basebackup_options(List *options, basebackup_options *opt) opt->compression_level = defGetInt32(defel); o_compression_level = true; } + else if (strcmp(defel->defname, "compression_workers") == 0) + { + if (o_compression_workers) +ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("duplicate option \"%s\"", defel->defname))); + opt->compression_workers = defGetInt32(defel); + o_compression_workers = true; + } else ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), @@ -1030,7 +1041,8 @@ SendBaseBackup(BaseBackupCmd *cmd) else if (opt.compression == BACKUP_COMPRESSION_LZ4) sink = bbsink_lz4_new(sink, opt.compression_level); else if (opt.compression == BACKUP_COMPRESSION_ZSTD) - sink = bbsink_zstd_new(sink, opt.compression_level); + sink = bbsink_zstd_new(sink, opt.compression_level, + opt.compression_workers); /* Set up progress reporting. 
*/ sink = bbsink_progress_new(sink, opt.progress); diff --git a/src/backend/replication/basebackup_zstd.c b/src/backend/replication/basebackup_zstd.c index e3f9b1d..54b91eb 100644 --- a/src/backend/replication/basebackup_zstd.c +++ b/src/backend/replication/basebackup_zstd.c @@ -28,6 +28,9 @@ typedef struct bbsink_zstd /* Compression lev
Re: refactoring basebackup.c
Hi,

> > It will be good if we can also fix
> > CreateWalTarMethod to support LZ4 and ZSTD.
> Ok we will see, either Dipesh or I will take care of it.

I took a look at CreateWalTarMethod with an eye to supporting LZ4 compression
for WAL files. The current implementation involves three steps to back up a
WAL file into a tar archive. For each file:

1. It first writes the header in tar_open_for_write(), flushes the contents
   of the tar to disk, and stores the header offset.
2. Next, the contents of the WAL file are written to the tar archive.
3. Finally, it recalculates the checksum in tar_close() and overwrites the
   header at the offset stored in step #1.

The need to overwrite the header in CreateWalTarMethod is mainly related to
partial WAL files, where the size of the WAL file is less than WalSegSize. The
file is padded, and the checksum is recalculated after adding the pad bytes.

If we implement LZ4 support for CreateWalTarMethod, step #3 becomes a problem.
To achieve a better compression ratio, compressed LZ4 blocks are linked to
each other, and these blocks are decoded sequentially. If we overwrite the
header as part of step #3, it corrupts the link between compressed LZ4 blocks.
LZ4 does provide an option to write compressed blocks independently (with the
blockMode option set to LZ4F_blockIndependent), but that is still a problem,
because we don't know whether overwriting the header after recalculating the
checksum will overlap the boundary of the next block.

GZIP manages to overcome this problem because it provides a way to turn
compression on and off on the fly while writing a compressed archive, via the
zlib library function deflateParams(). The current gzip implementation of
CreateWalTarMethod uses this function to turn off compression just before
step #1 and writes the uncompressed header, of size equal to TAR_BLOCK_SIZE.
It uses the same function to turn compression back on for writing the contents
of the WAL file in step #2, and turns it off again just before step #3 to
overwrite the header at the same offset, again with size TAR_BLOCK_SIZE.
Since GZIP can enable and disable compression this way, it is possible to
control the size of the data written to the compressed archive: even if we
overwrite an already-written block, there is no risk of overlapping the
boundary of the next block. This mechanism is not available in LZ4 or ZSTD.

To support LZ4 and ZSTD compression in CreateWalTarMethod we may need to
refactor this code, unless I am missing something. We need to add the padding
bytes for partial WAL files before sending them to the compressed archive.
That ensures that no file being compressed requires any padding, since its
size is always equal to WalSegSize; there is then no need to recalculate the
checksum, and we can avoid overwriting the header in step #3.

Thoughts?

Thanks,
Dipesh
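The padding proposal above can be sketched as follows. WAL_SEG_SIZE and pad_wal_segment() are illustrative stand-ins, not PostgreSQL code: the point is only that zero-padding a partial segment to the full segment size before compression lets the tar header be written once, with the final size, and never rewritten.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Small stand-in for the real WalSegSize (normally 16 MB). */
#define WAL_SEG_SIZE 4096

/*
 * Pad a partial WAL file with zero bytes up to the full segment size
 * *before* handing it to the compressed archive.  Returns the (possibly
 * grown) length; a complete segment is returned unchanged.
 */
static size_t
pad_wal_segment(char **buf, size_t len)
{
	if (len >= WAL_SEG_SIZE)
		return len;				/* complete segment, nothing to do */

	*buf = realloc(*buf, WAL_SEG_SIZE);
	memset(*buf + len, 0, WAL_SEG_SIZE - len);
	return WAL_SEG_SIZE;
}
```

With every file presented at exactly WAL_SEG_SIZE bytes, the checksum computed for the tar header in step #1 is already final, so step #3's header overwrite (the step that corrupts linked LZ4 blocks) is no longer needed.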
Re: refactoring basebackup.c
> Sure, please find the rebased patch attached.

Thanks, I have validated the v2 patch on top of the rebased patch.

Thanks,
Dipesh
Re: refactoring basebackup.c
Hi, Thanks for the feedback, I have incorporated the suggestions and updated a new patch. PFA v2 patch. > I think similar to bbstreamer_lz4_compressor_content() in > bbstreamer_lz4_decompressor_content() we can change len to avail_in. In bbstreamer_lz4_decompressor_content(), we are modifying avail_in based on the number of bytes decompressed in each iteration. I think we cannot replace it with "len" here. Jeevan, Your v12 patch does not apply on HEAD, it requires a rebase. I have applied it on commit 400fc6b6487ddf16aa82c9d76e5cfbe64d94f660 to validate my v2 patch. Thanks, Dipesh From 47a0ef4348747ffa61eccd7954e00f3cf5fc7222 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Thu, 3 Feb 2022 18:31:03 +0530 Subject: [PATCH] support client side compression and decompression using LZ4 --- src/bin/pg_basebackup/Makefile| 1 + src/bin/pg_basebackup/bbstreamer.h| 3 + src/bin/pg_basebackup/bbstreamer_lz4.c| 431 ++ src/bin/pg_basebackup/pg_basebackup.c | 32 +- src/bin/pg_verifybackup/t/009_extract.pl | 7 +- src/bin/pg_verifybackup/t/010_client_untar.pl | 111 +++ src/tools/msvc/Mkvcbuild.pm | 1 + 7 files changed, 580 insertions(+), 6 deletions(-) create mode 100644 src/bin/pg_basebackup/bbstreamer_lz4.c create mode 100644 src/bin/pg_verifybackup/t/010_client_untar.pl diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile index ada3a5a..1d0db4f 100644 --- a/src/bin/pg_basebackup/Makefile +++ b/src/bin/pg_basebackup/Makefile @@ -43,6 +43,7 @@ BBOBJS = \ bbstreamer_file.o \ bbstreamer_gzip.o \ bbstreamer_inject.o \ + bbstreamer_lz4.o \ bbstreamer_tar.o all: pg_basebackup pg_receivewal pg_recvlogical diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h index fe49ae3..c2de77b 100644 --- a/src/bin/pg_basebackup/bbstreamer.h +++ b/src/bin/pg_basebackup/bbstreamer.h @@ -206,6 +206,9 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath, void (*report_output_file) (const char *)); extern bbstreamer 
*bbstreamer_gzip_decompressor_new(bbstreamer *next); +extern bbstreamer *bbstreamer_lz4_compressor_new(bbstreamer *next, + int compresslevel); +extern bbstreamer *bbstreamer_lz4_decompressor_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next); diff --git a/src/bin/pg_basebackup/bbstreamer_lz4.c b/src/bin/pg_basebackup/bbstreamer_lz4.c new file mode 100644 index 000..f0bc226 --- /dev/null +++ b/src/bin/pg_basebackup/bbstreamer_lz4.c @@ -0,0 +1,431 @@ +/*- + * + * bbstreamer_lz4.c + * + * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/bin/pg_basebackup/bbstreamer_lz4.c + *- + */ + +#include "postgres_fe.h" + +#include + +#ifdef HAVE_LIBLZ4 +#include +#endif + +#include "bbstreamer.h" +#include "common/logging.h" +#include "common/file_perm.h" +#include "common/string.h" + +#ifdef HAVE_LIBLZ4 +typedef struct bbstreamer_lz4_frame +{ + bbstreamer base; + + LZ4F_compressionContext_t cctx; + LZ4F_decompressionContext_t dctx; + LZ4F_preferences_t prefs; + + size_t bytes_written; + bool header_written; +} bbstreamer_lz4_frame; + +static void bbstreamer_lz4_compressor_content(bbstreamer *streamer, + bbstreamer_member *member, + const char *data, int len, + bbstreamer_archive_context context); +static void bbstreamer_lz4_compressor_finalize(bbstreamer *streamer); +static void bbstreamer_lz4_compressor_free(bbstreamer *streamer); + +const bbstreamer_ops bbstreamer_lz4_compressor_ops = { + .content = bbstreamer_lz4_compressor_content, + .finalize = bbstreamer_lz4_compressor_finalize, + .free = bbstreamer_lz4_compressor_free +}; + +static void bbstreamer_lz4_decompressor_content(bbstreamer *streamer, +bbstreamer_member *member, +const char *data, int len, +bbstreamer_archive_context context); +static void bbstreamer_lz4_decompressor_finalize(bbstreamer 
*streamer); +static void bbstreamer_lz4_decompressor_free(bbstreamer *streamer); + +const bbstreamer_ops bbstreamer_lz4_decompressor_ops = { + .content = bbstreamer_lz4_decompressor_content, + .finalize = bbstreamer_lz4_decompressor_finalize, + .free = bbstreamer_lz4_decompressor_free +}; +#endif + +/* + * Create a new base backup streamer that performs lz4 compression of tar + * blocks. + */ +bbstreamer * +bbstreamer_lz4_compressor_new(bbstreamer *next, int compresslevel) +{ +#ifdef HAVE_LIBLZ4 + bbstreamer_lz4_frame *streamer; + LZ4F_errorCode_t ctxEr
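The len-versus-avail_in point made earlier in this message can be shown with a toy loop. fake_decompress_step() is a stand-in for LZ4F_decompress(), which may consume only part of its input on each call; the caller must therefore track the remaining bytes in avail_in rather than reusing the constant chunk length len.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for LZ4F_decompress(): consumes at most 8 input bytes per
 * call (placeholder behavior) and reports how many it consumed.
 */
static size_t
fake_decompress_step(const char *in, size_t avail_in, char *out, size_t outlen)
{
	size_t		consumed = avail_in < 8 ? avail_in : 8;

	memcpy(out, in, consumed < outlen ? consumed : outlen);
	return consumed;
}

/*
 * The caller tracks the unconsumed bytes in avail_in, which shrinks on
 * every iteration; the original chunk length "len" is left untouched.
 * Returns the number of iterations needed to drain the input.
 */
static size_t
drain_input(const char *data, size_t len, char *out, size_t outlen)
{
	const char *next_in = data;
	size_t		avail_in = len;
	size_t		iterations = 0;

	while (avail_in > 0)
	{
		size_t		consumed = fake_decompress_step(next_in, avail_in,
													out, outlen);

		next_in += consumed;
		avail_in -= consumed;
		iterations++;
	}
	return iterations;
}
```

Because avail_in is mutated inside the loop while len keeps its original meaning, the two cannot be merged into a single variable.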
Re: refactoring basebackup.c
Hi,

> On Mon, Jan 31, 2022 at 4:41 PM Jeevan Ladhe <jeevan.la...@enterprisedb.com> wrote:
> Hi Robert,
>
> I had an offline discussion with Dipesh, and he will be working on the
> lz4 client side decompression part.

Please find the attached patch to support client side compression and
decompression using lz4.

Added a new lz4 bbstreamer to compress the archive chunks at the client if the
user has specified the --compress=client-lz4:[LEVEL] option in pg_basebackup.
The new streamer accepts archive chunks, compresses them, and forwards them to
the plain-writer.

Similarly, if the user has requested a server-compressed lz4 archive with a
plain format (-F p) backup, the compressed archive chunks must be decompressed
before being forwarded to the tar extractor. Added a new bbstreamer to
decompress the compressed archive and forward it to the tar extractor.

Note: This patch can be applied on Jeevan Ladhe's v12 patch for lz4
compression.

Thanks,
Dipesh

From 67e47579e119897c66e6f5f7a5e5e9542399072f Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Thu, 3 Feb 2022 18:31:03 +0530
Subject: [PATCH] support client side compression and decompression using LZ4

---
 src/bin/pg_basebackup/Makefile                |   1 +
 src/bin/pg_basebackup/bbstreamer.h            |   3 +
 src/bin/pg_basebackup/bbstreamer_lz4.c        | 436 ++++++++++++++++++
 src/bin/pg_basebackup/pg_basebackup.c         |  32 +-
 src/bin/pg_verifybackup/t/009_extract.pl      |   7 +-
 src/bin/pg_verifybackup/t/010_client_untar.pl | 111 +++++
 src/tools/msvc/Mkvcbuild.pm                   |   1 +
 7 files changed, 585 insertions(+), 6 deletions(-)
 create mode 100644 src/bin/pg_basebackup/bbstreamer_lz4.c
 create mode 100644 src/bin/pg_verifybackup/t/010_client_untar.pl

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index ada3a5a..1d0db4f 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -43,6 +43,7 @@ BBOBJS = \
 	bbstreamer_file.o \
 	bbstreamer_gzip.o \
 	bbstreamer_inject.o \
+	bbstreamer_lz4.o \
 	bbstreamer_tar.o
 
 all: pg_basebackup pg_receivewal pg_recvlogical

diff --git
a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h index fe49ae3..c2de77b 100644 --- a/src/bin/pg_basebackup/bbstreamer.h +++ b/src/bin/pg_basebackup/bbstreamer.h @@ -206,6 +206,9 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath, void (*report_output_file) (const char *)); extern bbstreamer *bbstreamer_gzip_decompressor_new(bbstreamer *next); +extern bbstreamer *bbstreamer_lz4_compressor_new(bbstreamer *next, + int compresslevel); +extern bbstreamer *bbstreamer_lz4_decompressor_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next); diff --git a/src/bin/pg_basebackup/bbstreamer_lz4.c b/src/bin/pg_basebackup/bbstreamer_lz4.c new file mode 100644 index 000..9055a23 --- /dev/null +++ b/src/bin/pg_basebackup/bbstreamer_lz4.c @@ -0,0 +1,436 @@ +/*- + * + * bbstreamer_lz4.c + * + * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/bin/pg_basebackup/bbstreamer_lz4.c + *- + */ + +#include "postgres_fe.h" + +#include + +#ifdef HAVE_LIBLZ4 +#include +#endif + +#include "bbstreamer.h" +#include "common/logging.h" +#include "common/file_perm.h" +#include "common/string.h" + +#ifdef HAVE_LIBLZ4 +typedef struct bbstreamer_lz4_frame +{ + bbstreamer base; + + LZ4F_compressionContext_t cctx; + LZ4F_decompressionContext_t dctx; + LZ4F_preferences_t prefs; + + size_t bytes_written; + bool header_written; +} bbstreamer_lz4_frame; + +static void bbstreamer_lz4_compressor_content(bbstreamer *streamer, + bbstreamer_member *member, + const char *data, int len, + bbstreamer_archive_context context); +static void bbstreamer_lz4_compressor_finalize(bbstreamer *streamer); +static void bbstreamer_lz4_compressor_free(bbstreamer *streamer); + +const bbstreamer_ops bbstreamer_lz4_compressor_ops = { + .content = 
bbstreamer_lz4_compressor_content, + .finalize = bbstreamer_lz4_compressor_finalize, + .free = bbstreamer_lz4_compressor_free +}; + +static void bbstreamer_lz4_decompressor_content(bbstreamer *streamer, +bbstreamer_member *member, +const char *data, int len, +bbstreamer_archive_context context); +static void bbstreamer_lz4_decompressor_finalize(bbstreamer *streamer); +static void bbstreamer_lz4_decompressor_free(bbstreamer *streamer); + +const bbstreamer_ops bbstreamer_lz4_decompressor_ops = { + .content = bbstreamer_lz4_decompressor_content, + .finalize = bbstre
Re: refactoring basebackup.c
Hi,

> I made a pass over these patches today and made a bunch of minor
> corrections. New version attached. The two biggest things I changed
> are (1) s/gzip_extractor/gzip_compressor/, because I feel like you
> extract an archive like a tarfile, but that is not what is happening
> here, this is not an archive and (2) I took a few bits out of the
> test case that didn't seem to be necessary. There wasn't any reason
> that I could see why testing for PG_VERSION needed to be skipped when
> the compression method is 'none', so my first thought was to just take
> out the 'if' statement around that, but then after more thought that
> test and the one for pg_verifybackup are certainly going to fail if
> those files are not present, so why have an extra test? It might make
> sense if we were only conditionally able to run pg_verifybackup and
> wanted to have some test coverage even when we can't, but that's not
> the case here, so I see no point.

Thanks. This makes sense.

+#ifdef HAVE_LIBZ
+	/*
+	 * If the user has requested a server compressed archive along with archive
+	 * extraction at client then we need to decompress it.
+	 */
+	if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
+		compressloc == COMPRESS_LOCATION_SERVER)
+		streamer = bbstreamer_gzip_decompressor_new(streamer);
+#endif

I think the HAVE_LIBZ check is not required in pg_basebackup.c when creating a
new gzip writer/decompressor. The check is already in place in
bbstreamer_gzip_writer_new() and bbstreamer_gzip_decompressor_new(), which
throw an error if the build lacks the required library support. I have removed
this check from pg_basebackup.c and attached a delta patch; it can be applied
on top of the v5 patch.
Thanks,
Dipesh

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 46ab60d..1f81bbf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1199,7 +1199,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 			 compressloc != COMPRESS_LOCATION_CLIENT)
 		streamer = bbstreamer_plain_writer_new(archive_filename,
 											   archive_file);
-#ifdef HAVE_LIBZ
 	else if (compressmethod == COMPRESSION_GZIP)
 	{
 		strlcat(archive_filename, ".gz", sizeof(archive_filename));
@@ -1207,7 +1206,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 											  archive_file,
 											  compresslevel);
 	}
-#endif
 	else
 	{
 		Assert(false);			/* not reachable */
@@ -1256,7 +1254,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 	else if (expect_unterminated_tarfile)
 		streamer = bbstreamer_tar_terminator_new(streamer);
 
-#ifdef HAVE_LIBZ
 	/*
 	 * If the user has requested a server compressed archive along with archive
 	 * extraction at client then we need to decompress it.
@@ -1264,7 +1261,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 	if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
 		compressloc == COMPRESS_LOCATION_SERVER)
 		streamer = bbstreamer_gzip_decompressor_new(streamer);
-#endif
 
 	/* Return the results. */
 	*manifest_inject_streamer_p = manifest_inject_streamer;
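The reasoning behind this delta patch — keep the library-availability check inside the constructor so callers need no #ifdef — can be sketched as follows. HAVE_FAKE_ZLIB and both function names are illustrative stand-ins, not PostgreSQL's actual macros or APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * The constructor is the single place that knows whether library support
 * was compiled in.  HAVE_FAKE_ZLIB is an illustrative build macro, not
 * PostgreSQL's HAVE_LIBZ.
 */
static int
gzip_writer_new(void)
{
#ifdef HAVE_FAKE_ZLIB
	return 0;					/* library available: create the streamer */
#else
	/* The real constructors report this error and exit. */
	fprintf(stderr, "this build does not support gzip compression\n");
	return -1;
#endif
}

/* Caller: dispatches on the method with no HAVE_* check at this level. */
static int
create_streamer(const char *method)
{
	if (strcmp(method, "gzip") == 0)
		return gzip_writer_new();
	return 0;					/* uncompressed path */
}
```

Centralizing the check keeps pg_basebackup.c free of conditional compilation while still producing a clear error on builds without the library.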
Re: refactoring basebackup.c
Hi, > It only needed trivial rebasing; I have committed it after doing that. I have updated the patches to support server compression (gzip) for plain format backup. Please find attached v4 patches. Thanks, Dipesh From 4d0c84d6fac841aafb757535cc0e48334a251581 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Mon, 24 Jan 2022 15:28:48 +0530 Subject: [PATCH 1/2] Support for extracting gzip compressed archive pg_basebackup can support server side compression using gzip. In order to support plain format backup with option '-Fp' we need to add support for decompressing the compressed blocks at client. This patch addresses the extraction of gzip compressed blocks at client. --- doc/src/sgml/ref/pg_basebackup.sgml | 8 +- src/bin/pg_basebackup/Makefile | 1 + src/bin/pg_basebackup/bbstreamer.h | 1 + src/bin/pg_basebackup/bbstreamer_file.c | 182 src/bin/pg_basebackup/bbstreamer_gzip.c | 376 src/bin/pg_basebackup/pg_basebackup.c | 19 +- 6 files changed, 401 insertions(+), 186 deletions(-) create mode 100644 src/bin/pg_basebackup/bbstreamer_gzip.c diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index 1d0df34..19849be 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -428,8 +428,12 @@ PostgreSQL documentation When the tar format is used, the suffix .gz will -automatically be added to all tar filenames. Compression is not -available in plain format. +automatically be added to all tar filenames. + + +Server compression can be specified with plain format backup. It +enables compression of the archive at server and extract the +compressed archive at client. 
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile index 5b18851..78d96c6 100644 --- a/src/bin/pg_basebackup/Makefile +++ b/src/bin/pg_basebackup/Makefile @@ -38,6 +38,7 @@ OBJS = \ BBOBJS = \ pg_basebackup.o \ bbstreamer_file.o \ + bbstreamer_gzip.o \ bbstreamer_inject.o \ bbstreamer_tar.o diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h index fc88b50..270b0df 100644 --- a/src/bin/pg_basebackup/bbstreamer.h +++ b/src/bin/pg_basebackup/bbstreamer.h @@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath, const char *(*link_map) (const char *), void (*report_output_file) (const char *)); +extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next); diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c index 77ca222..d721f87 100644 --- a/src/bin/pg_basebackup/bbstreamer_file.c +++ b/src/bin/pg_basebackup/bbstreamer_file.c @@ -11,10 +11,6 @@ #include "postgres_fe.h" -#ifdef HAVE_LIBZ -#include -#endif - #include #include "bbstreamer.h" @@ -30,15 +26,6 @@ typedef struct bbstreamer_plain_writer bool should_close_file; } bbstreamer_plain_writer; -#ifdef HAVE_LIBZ -typedef struct bbstreamer_gzip_writer -{ - bbstreamer base; - char *pathname; - gzFile gzfile; -} bbstreamer_gzip_writer; -#endif - typedef struct bbstreamer_extractor { bbstreamer base; @@ -62,22 +49,6 @@ const bbstreamer_ops bbstreamer_plain_writer_ops = { .free = bbstreamer_plain_writer_free }; -#ifdef HAVE_LIBZ -static void bbstreamer_gzip_writer_content(bbstreamer *streamer, - bbstreamer_member *member, - const char *data, int len, - bbstreamer_archive_context context); -static void bbstreamer_gzip_writer_finalize(bbstreamer *streamer); -static void 
bbstreamer_gzip_writer_free(bbstreamer *streamer); -static const char *get_gz_error(gzFile gzf); - -const bbstreamer_ops bbstreamer_gzip_writer_ops = { - .content = bbstreamer_gzip_writer_content, - .finalize = bbstreamer_gzip_writer_finalize, - .free = bbstreamer_gzip_writer_free -}; -#endif - static void bbstreamer_extractor_content(bbstreamer *streamer, bbstreamer_member *member, const char *data, int len, @@ -196,159 +167,6 @@ bbstreamer_plain_writer_free(bbstreamer *streamer) } /* - * Create a bbstreamer that just compresses data using gzip, and then writes - * it to a file. - * - * As in the case of bbstreamer_plain_writer_new, pathname is always used - * for error reporting purposes; if file is NULL, it is also the opened and - * closed so that the data may be written there. - */ -bbstreamer * -bbstreamer_gzip_writer_new(char *pathname, FILE *file, int compresslevel) -{ -#ifdef HAVE_LIBZ - bbstreamer_gzip_writer *streamer; - - streamer = palloc0(sizeof(bbstreamer_gzip_writer)); -
Re: refactoring basebackup.c
Hi,

> Here is a more detailed review.

Thanks for the feedback, I have incorporated the suggestions and updated a new
version of the patch (v3-0001). The required documentation changes are also
incorporated in the updated patch (v3-0001).

> Interesting approach. This unfortunately has the effect of making that
> test case file look a bit incoherent -- the comment at the top of the
> file isn't really accurate any more, for example, and the plain_format
> flag does more than just cause us to use -Fp; it also causes us NOT to
> use --target server:X. However, that might be something we can figure
> out a way to clean up. Alternatively, we could have a new test case
> file that is structured like 002_algorithm.pl but looping over
> compression methods rather than checksum algorithms, and testing each
> one with --server-compress and -Fp. It might be easier to make that
> look nice (but I'm not 100% sure).

Added a new test case file "009_extract.pl" to test server compressed plain
format backup (v3-0002).

> I committed the base backup target patch yesterday, and today I
> updated the remaining code in light of Michael Paquier's commit
> 5c649fe153367cdab278738ee4aebbfd158e0546. Here is the resulting patch.

The v13 patch does not apply on the latest HEAD; it requires a rebase. I have
applied it on commit dc43fc9b3aa3e0fa9c84faddad6d301813580f88 to validate the
gzip decompression patches.

Thanks,
Dipesh

From 9ec2efcc908e988409cd9ba19ea64a50012163a2 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Mon, 24 Jan 2022 15:28:48 +0530
Subject: [PATCH 1/2] Support for extracting gzip compressed archive

pg_basebackup can support server side compression using gzip. In order to
support plain format backup with option '-Fp' we need to add support for
decompressing the compressed blocks at client. This patch addresses the
extraction of gzip compressed blocks at client.
--- doc/src/sgml/ref/pg_basebackup.sgml | 8 +- src/bin/pg_basebackup/Makefile | 1 + src/bin/pg_basebackup/bbstreamer.h | 1 + src/bin/pg_basebackup/bbstreamer_file.c | 182 src/bin/pg_basebackup/bbstreamer_gzip.c | 376 src/bin/pg_basebackup/pg_basebackup.c | 19 +- 6 files changed, 401 insertions(+), 186 deletions(-) create mode 100644 src/bin/pg_basebackup/bbstreamer_gzip.c diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index 1d0df34..19849be 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -428,8 +428,12 @@ PostgreSQL documentation When the tar format is used, the suffix .gz will -automatically be added to all tar filenames. Compression is not -available in plain format. +automatically be added to all tar filenames. + + +Server compression can be specified with plain format backup. It +enables compression of the archive at server and extract the +compressed archive at client. diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile index 5b18851..78d96c6 100644 --- a/src/bin/pg_basebackup/Makefile +++ b/src/bin/pg_basebackup/Makefile @@ -38,6 +38,7 @@ OBJS = \ BBOBJS = \ pg_basebackup.o \ bbstreamer_file.o \ + bbstreamer_gzip.o \ bbstreamer_inject.o \ bbstreamer_tar.o diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h index fc88b50..270b0df 100644 --- a/src/bin/pg_basebackup/bbstreamer.h +++ b/src/bin/pg_basebackup/bbstreamer.h @@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath, const char *(*link_map) (const char *), void (*report_output_file) (const char *)); +extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next); diff --git a/src/bin/pg_basebackup/bbstreamer_file.c 
b/src/bin/pg_basebackup/bbstreamer_file.c index 77ca222..d721f87 100644 --- a/src/bin/pg_basebackup/bbstreamer_file.c +++ b/src/bin/pg_basebackup/bbstreamer_file.c @@ -11,10 +11,6 @@ #include "postgres_fe.h" -#ifdef HAVE_LIBZ -#include <zlib.h> -#endif - #include <unistd.h> #include "bbstreamer.h" @@ -30,15 +26,6 @@ typedef struct bbstreamer_plain_writer bool should_close_file; } bbstreamer_plain_writer; -#ifdef HAVE_LIBZ -typedef struct bbstreamer_gzip_writer -{ - bbstreamer base; - char *pathname; - gzFile gzfile; -} bbstreamer_gzip_writer; -#endif - typedef struct bbstreamer_extractor { bbstreamer base; @@ -62,22 +49,6 @@ const bbstreamer_ops bbstreamer_plain_writer_ops = { .free = bbstreamer_plain_writer_free }; -#ifdef HAVE_LIBZ -static void bbstreamer_gzip_writer_content(bbstreamer *streamer, -
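[Editor's note] The core idea of the patch above — the server compresses the archive with a gzip header, and the client decompresses it to produce a plain-format backup — hinges on zlib's windowBits convention: adding 16 to the window size requests gzip framing instead of a raw zlib stream. A minimal, hypothetical sketch of that round trip using Python's stdlib zlib bindings (the patch itself is C, calling the same zlib library):

```python
import zlib

data = b"base backup block" * 512

# Server side (deflateInit2 analogue): windowBits = 15 + 16 emits a gzip header.
co = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=15 + 16)
gz = co.compress(data) + co.flush()
assert gz[:2] == b"\x1f\x8b"  # gzip magic bytes

# Client side (inflateInit2 analogue): the same windowBits convention decodes it.
do = zlib.decompressobj(wbits=15 + 16)
assert do.decompress(gz) + do.flush() == data
```

Passing a windowBits value smaller than the one used for compression would make inflate fail, which is why the patch's comment insists on "greater than or equal to".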
Re: refactoring basebackup.c
Hi,

Thanks for the feedback. I have incorporated the suggestions and posted an updated patch (v2).

> I spent some time thinking about test coverage for the server-side
> backup code today and came up with the attached (v12-0003). It does an
> end-to-end test that exercises server-side backup and server-side
> compression and then untars the backup and validity-checks it using
> pg_verifybackup. In addition to being good test coverage for these
> patches, it also plugs a gap in the test coverage of pg_verifybackup,
> which currently has no test case that untars a tar-format backup and
> then verifies the result. I couldn't figure out a way to do that back
> at the time I was working on pg_verifybackup, because I didn't think
> we had any existing precedent for using 'tar' from a TAP test. But it
> was pointed out to me that we do, so I used that as the model for this
> test. It should be easy to generalize this test case to test lz4 and
> zstd as well, I think. But I guess we'll still need something
> different to test what your patch is doing.

I have added test coverage for server-side gzip compression with plain-format backup using pg_verifybackup. The test uses a flag specific to the plain format: if this flag is set, the test takes a plain-format backup (with server compression enabled) and verifies it using pg_verifybackup. The test coverage is included in v2-0002.

> It's going to need some documentation changes, too.

Yes, I am working on it.

Note: Before applying the patches, please apply Robert's v12 version of the patches 0001, 0002 and 0003.

Thanks,
Dipesh

From 826a1cbb639afb7e10a20955d3ec64b1bab1fa80 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Thu, 20 Jan 2022 16:38:36 +0530
Subject: [PATCH 1/2] Support for extracting gzip compressed archive

pg_basebackup can support server-side compression using gzip.
In order to support plain-format backup with option '-Fp', we need to add support for decompressing the compressed blocks at the client. This patch addresses the extraction of gzip-compressed blocks at the client. --- src/bin/pg_basebackup/Makefile | 1 + src/bin/pg_basebackup/bbstreamer.h | 1 + src/bin/pg_basebackup/bbstreamer_file.c | 182 --- src/bin/pg_basebackup/bbstreamer_gzip.c | 377 src/bin/pg_basebackup/pg_basebackup.c | 43 +++- 5 files changed, 419 insertions(+), 185 deletions(-) create mode 100644 src/bin/pg_basebackup/bbstreamer_gzip.c diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile index 5b18851..78d96c6 100644 --- a/src/bin/pg_basebackup/Makefile +++ b/src/bin/pg_basebackup/Makefile @@ -38,6 +38,7 @@ OBJS = \ BBOBJS = \ pg_basebackup.o \ bbstreamer_file.o \ + bbstreamer_gzip.o \ bbstreamer_inject.o \ bbstreamer_tar.o diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h index fc88b50..270b0df 100644 --- a/src/bin/pg_basebackup/bbstreamer.h +++ b/src/bin/pg_basebackup/bbstreamer.h @@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath, const char *(*link_map) (const char *), void (*report_output_file) (const char *)); +extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next); diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c index 77ca222..d721f87 100644 --- a/src/bin/pg_basebackup/bbstreamer_file.c +++ b/src/bin/pg_basebackup/bbstreamer_file.c @@ -11,10 +11,6 @@ #include "postgres_fe.h" -#ifdef HAVE_LIBZ -#include <zlib.h> -#endif - #include <unistd.h> #include "bbstreamer.h" @@ -30,15 +26,6 @@ typedef struct bbstreamer_plain_writer bool should_close_file; } bbstreamer_plain_writer; -#ifdef HAVE_LIBZ -typedef struct bbstreamer_gzip_writer -{ -
bbstreamer base; - char *pathname; - gzFile gzfile; -} bbstreamer_gzip_writer; -#endif - typedef struct bbstreamer_extractor { bbstreamer base; @@ -62,22 +49,6 @@ const bbstreamer_ops bbstreamer_plain_writer_ops = { .free = bbstreamer_plain_writer_free }; -#ifdef HAVE_LIBZ -static void bbstreamer_gzip_writer_content(bbstreamer *streamer, - bbstreamer_member *member, - const char *data, int len, - bbstreamer_archive_context context); -static void bbstreamer_gzip_writer_finalize(bbstreamer *streamer); -static void bbstreamer_gzip_writer_free(bbstreamer *streamer); -static const char *get_gz_error(gzFile gzf); - -const bbstreamer_ops bbstreamer_gzip_writer_ops = { - .content = bbstreamer_gzip_writer_content, - .finalize = bbstreamer_gzip_writer_finalize, - .free = bbstreamer_gzip_writer_free -}; -#endif - static void bbstreamer_extractor_content(bbstreamer *stream
Re: refactoring basebackup.c
Hi, I have added support for decompressing a gzip compressed tar file at client. pg_basebackup can enable server side compression for plain format backup with this change. Added a gzip extractor which decompresses the compressed archive and forwards it to the next streamer. I have done initial testing and working on updating the test coverage. Note: Before applying the patch, please apply Robert's v11 version of the patches 0001 and 0002. Thanks, Dipesh From 737badce26ed05b5cdb64d9ffd1735fef9acbbf8 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 19 Jan 2022 17:11:45 +0530 Subject: [PATCH] Support for extracting gzip compressed archive pg_basebackup can support server side compression using gzip. In order to support plain format backup with option '-Fp' we need to add support for decompressing the compressed blocks at client. This patch addresses the extraction of gzip compressed blocks at client. --- src/bin/pg_basebackup/bbstreamer.h | 1 + src/bin/pg_basebackup/bbstreamer_file.c | 175 src/bin/pg_basebackup/pg_basebackup.c | 58 +-- 3 files changed, 225 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_basebackup/bbstreamer.h b/src/bin/pg_basebackup/bbstreamer.h index fc88b50..270b0df 100644 --- a/src/bin/pg_basebackup/bbstreamer.h +++ b/src/bin/pg_basebackup/bbstreamer.h @@ -205,6 +205,7 @@ extern bbstreamer *bbstreamer_extractor_new(const char *basepath, const char *(*link_map) (const char *), void (*report_output_file) (const char *)); +extern bbstreamer *bbstreamer_gzip_extractor_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_parser_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_terminator_new(bbstreamer *next); extern bbstreamer *bbstreamer_tar_archiver_new(bbstreamer *next); diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c index 77ca222..350af1d 100644 --- a/src/bin/pg_basebackup/bbstreamer_file.c +++ b/src/bin/pg_basebackup/bbstreamer_file.c @@ -37,6 +37,13 @@ typedef struct 
bbstreamer_gzip_writer char *pathname; gzFile gzfile; } bbstreamer_gzip_writer; + +typedef struct bbstreamer_gzip_extractor +{ + bbstreamer base; + z_stream zstream; + size_t bytes_written; +} bbstreamer_gzip_extractor; #endif typedef struct bbstreamer_extractor @@ -76,6 +83,21 @@ const bbstreamer_ops bbstreamer_gzip_writer_ops = { .finalize = bbstreamer_gzip_writer_finalize, .free = bbstreamer_gzip_writer_free }; + +static void bbstreamer_gzip_extractor_content(bbstreamer *streamer, + bbstreamer_member *member, + const char *data, int len, + bbstreamer_archive_context context); +static void bbstreamer_gzip_extractor_finalize(bbstreamer *streamer); +static void bbstreamer_gzip_extractor_free(bbstreamer *streamer); +static void *gzip_palloc(void *opaque, unsigned items, unsigned size); +static void gzip_pfree(void *opaque, void *address); + +const bbstreamer_ops bbstreamer_gzip_extractor_ops = { + .content = bbstreamer_gzip_extractor_content, + .finalize = bbstreamer_gzip_extractor_finalize, + .free = bbstreamer_gzip_extractor_free +}; #endif static void bbstreamer_extractor_content(bbstreamer *streamer, @@ -349,6 +371,159 @@ get_gz_error(gzFile gzf) #endif /* + * Create a new base backup streamer that performs decompression of gzip + * compressed blocks. + */ +bbstreamer * +bbstreamer_gzip_extractor_new(bbstreamer *next) +{ +#ifdef HAVE_LIBZ + bbstreamer_gzip_extractor *streamer; + z_stream *zs; + + Assert(next != NULL); + + streamer = palloc0(sizeof(bbstreamer_gzip_extractor)); + *((const bbstreamer_ops **) &streamer->base.bbs_ops) = + &bbstreamer_gzip_extractor_ops; + + streamer->base.bbs_next = next; + initStringInfo(&streamer->base.bbs_buffer); + + /* Initialize internal stream state for decompression */ + zs = &streamer->zstream; + zs->zalloc = gzip_palloc; + zs->zfree = gzip_pfree; + zs->next_out = (uint8 *) streamer->base.bbs_buffer.data; + zs->avail_out = streamer->base.bbs_buffer.maxlen; + + /* + * Data compression was initialized using deflateInit2 to request a gzip + * header.
Similarly, we are using inflateInit2 to initialize data + * decompression. + * "windowBits" must be greater than or equal to the "windowBits" value + * provided to deflateInit2 while compressing. + */ + if (inflateInit2(zs, 15 + 16) != Z_OK) + { + pg_log_error("could not initialize compression library"); + exit(1); + } + + return &streamer->base; +#else + pg_log_error("this build does not support compression"); + exit(1); +#endif +} + +#ifdef HAVE_LIBZ +/* + * Decompress the input data to the output buffer until we run out of input + * data. Each time the output buffer is full, invoke bbstreamer_content to + * pass on the decompressed data to the next streamer. + */ +static void +bbstreamer_gzip_extractor_content(bbstreamer *streamer, + bbstreamer_member *member, +
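[Editor's note] The extractor's content callback has to handle the output buffer filling up before an input chunk is exhausted: it repeatedly inflates into a fixed-size buffer and forwards each full buffer to the next streamer. A hypothetical Python analogue of that loop (zlib's `max_length`/`unconsumed_tail` plays the role of `avail_out`/`next_in` bookkeeping in the C z_stream; function name invented for illustration):

```python
import zlib

def gzip_extract_chunks(chunks, bufsize=8192):
    """Decompress gzip-framed input chunks, yielding output in buffers of
    at most bufsize bytes while input lasts (mirrors the bounded bbs_buffer).
    The final flush may exceed bufsize; the C code loops in finalize instead."""
    zs = zlib.decompressobj(wbits=15 + 16)
    for chunk in chunks:
        data = chunk
        while data:
            out = zs.decompress(data, bufsize)   # cap output, like avail_out
            if out:
                yield out                        # forward to the "next streamer"
            data = zs.unconsumed_tail            # input left over when capped
    tail = zs.flush()
    if tail:
        yield tail

# Round trip: compress with a gzip header, then extract in small pieces.
raw = b"wal segment payload " * 4000
co = zlib.compressobj(wbits=15 + 16)
gz = co.compress(raw) + co.flush()
pieces = [gz[i:i + 512] for i in range(0, len(gz), 512)]
assert b"".join(gzip_extract_chunks(pieces, bufsize=1024)) == raw
```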
Re: .ready and .done files considered harmful
Hi,

> 1. I've removed several calls to PgArchForceDirScan() in favor of
> calling it at the top of pgarch_ArchiverCopyLoop(). I believe
> there is some disagreement about this change, but I don't think
> we gain enough to justify the complexity. The main reason we
> exit pgarch_ArchiverCopyLoop() should ordinarily be that we've
> run out of files to archive, so incurring a directory scan the
> next time it is called doesn't seem like it would normally be too
> bad. I'm sure there are exceptions (e.g., lots of .done files,
> archive failures), but the patch is still not making things any
> worse than they presently are for these cases.

Yes, I think that when the archiver is lagging behind, a call to force a directory scan at the top of pgarch_ArchiverCopyLoop() does not have any impact. This may result in a directory scan in the next cycle only when the archiver is ahead or in sync, but in that case a directory scan may not incur too much cost either, since the archiver is ahead. I agree that we can replace the separate calls that force a directory scan in failure scenarios with a single call to PgArchForceDirScan() at the top of pgarch_ArchiverCopyLoop().

> 2. I removed all the logic that attempted to catch out-of-order
> .ready files. Instead, XLogArchiveNotify() only forces a
> directory scan for files other than regular WAL files, and we
> depend on our periodic directory scans to pick up anything that's
> been left behind.
> 3. I moved the logic that forces directory scans every once in a
> while. We were doing that in the checkpoint/restartpoint logic,
> which, upon further thought, might not be the best idea. The
> checkpoint interval can vary widely, and IIRC we won't bother
> creating checkpoints at all if database activity stops. Instead,
> I've added logic in pgarch_readyXlog() that forces a directory
> scan if one hasn't happened in a few minutes.
> 4. Finally, I've tried to ensure comments are clear and that the
> logic is generally easy to reason about.
>
> What do you think?
I agree. If we force a periodic directory scan, then we may not require any special logic for handling scenarios where a .ready file is created out of order in XLogArchiveNotify(); we need to force a directory scan only for a non-regular WAL file there. Overall, I think the periodic directory scan simplifies the patch and makes sure that any missing file gets archived within a few minutes.

Thanks,
Dipesh
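[Editor's note] The time-based fallback discussed above — force a directory scan if one hasn't happened in a few minutes — reduces to a small amount of archiver state. A hypothetical sketch of the decision logic (class and interval are invented for illustration; the real patch keeps this in the archiver's C code):

```python
FORCE_SCAN_INTERVAL = 180.0  # seconds; illustrative, not the patch's value

class ArchiverScanState:
    """Tracks when the archiver last scanned archive_status and whether
    a scan has been explicitly forced (e.g. by XLogArchiveNotify)."""

    def __init__(self):
        self.force_dir_scan = True   # first call always scans
        self.last_scan_time = None

    def need_directory_scan(self, now):
        if self.force_dir_scan:
            return True
        return (now - self.last_scan_time) >= FORCE_SCAN_INTERVAL

    def note_scan_done(self, now):
        self.force_dir_scan = False
        self.last_scan_time = now
```

Any file whose .ready notification was lost or created out of order is therefore picked up within one interval, which is the bound "archived within a few minutes" refers to.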
Re: .ready and .done files considered harmful
Hi,

Thanks for the feedback.

> I wonder if this can be simplified even further. If we don't bother
> trying to catch out-of-order .ready files in XLogArchiveNotify() and
> just depend on the per-checkpoint/restartpoint directory scans, we can
> probably remove lastReadySegNo from archiver state completely.

If we agree that some extra delay in archiving these files is acceptable, then we don't require any special handling for this scenario; otherwise, we may need to handle it separately.

> + /* Initialize the current state of archiver */
> + xlogState.lastSegNo = MaxXLogSegNo;
> + xlogState.lastTli = MaxTimeLineID;
>
> It looks like we have two ways to force a directory scan. We can
> either set force_dir_scan to true, or lastSegNo can be set to
> MaxXLogSegNo. Why not just set force_dir_scan to true here so that we
> only have one way to force a directory scan?

Makes sense; I have updated it.

> Don't we also need to force a directory scan in the other cases we
> return early from pgarch_ArchiverCopyLoop()? We will have already
> advanced the archiver state in pgarch_readyXlog(), so I think we'd end
> up skipping files if we didn't. For example, if archive_command isn't
> set, we'll just return, and the next call to pgarch_readyXlog() might
> return the next file.

I agree; we should do it for all early-return paths.

> nitpick: I think we should just call PgArchForceDirScan() here.

Yes, that's right.

> > > This is an interesting idea, but the "else" block here seems prone to
> > > race conditions. I think we'd have to hold arch_lck to prevent that.
> > > But as I mentioned above, if we are okay with depending on the
> > > fallback directory scans, I think we can remove the "else" block
> > > completely.

Oh, I didn't realize the race condition here. The competing processes can read the same value of lastReadySegNo.

> > Thinking further, we probably need to hold a lock even when we are
> > creating the .ready file to avoid race conditions.
> > The race condition surely happens, but even if that happens, all > competing processes except one of them detect out-of-order and will > enforce directory scan. But I'm not sure how it behaves under more > complex situation so I'm not sure I like that behavior. > > We could just use another lock for the logic there, but instead > couldn't we merge PgArchGetLastReadySegNo and PgArchSetLastReadySegNo > into one atomic test-and-(check-and-)set function? Like this. I agree that we can merge the existing "Get" and "Set" functions into an atomic test-and-check-and-set function to avoid a race condition. I have incorporated these changes and updated a new patch. PFA patch. Thanks, Dipesh From f8983c7b80ff0f8cbc415d4d9a3ad4992925e775 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 8 Sep 2021 21:47:16 +0530 Subject: [PATCH] keep trying the next file approach --- src/backend/access/transam/xlog.c| 20 +++ src/backend/access/transam/xlogarchive.c | 16 ++ src/backend/postmaster/pgarch.c | 249 +++ src/include/access/xlogdefs.h| 2 + src/include/postmaster/pgarch.h | 4 + 5 files changed, 259 insertions(+), 32 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a7..7dd4b96 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); /* + * Force the archiver to perform a directory scan. + * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per checkpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. (Do this after recycling old log * segments, since that may supply some of the needed files.) 
*/ @@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr); /* + * Force the archiver to perform a directory scan. + * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per restartpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. (Do this after recycling old log * segments, since that may supply some of the needed files.) */ diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e..110bee4 1006
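[Editor's note] The merged "atomic test-and-(check-and-)set" function agreed on above can be sketched as a single critical section: the comparison against the stored segment number and the update happen under one lock, so no competing backend can interleave between the read and the write. A hypothetical Python analogue (a threading.Lock stands in for the spinlock arch_lck; names are illustrative):

```python
import threading

class LastReadySegNo:
    """Single-lock analogue of merging PgArchGetLastReadySegNo and
    PgArchSetLastReadySegNo into one atomic check-and-set."""

    def __init__(self):
        self._lck = threading.Lock()
        self._last = 0  # 0 = not yet initialized

    def check_and_set(self, segno):
        """Record segno; return True if it arrived in order
        (i.e. it is exactly the successor of the last one)."""
        with self._lck:
            in_order = (self._last == 0) or (segno == self._last + 1)
            if segno > self._last:
                self._last = segno
            return in_order
```

With separate Get and Set calls, two backends could both read the same `_last` and both conclude their file is in order; holding the lock across both steps removes that window.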
Re: .ready and .done files considered harmful
Thanks for the feedback. > The latest post on this thread contained a link to this one, and it > made me want to rewind to this point in the discussion. Suppose we > have the following alternative scenario: > > Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist > yet. The following directory scan ends up finding 12.ready. Just > before we update the PgArch state, XLogArchiveNotify() is called and > creates 11.ready. However, pg_readyXlog() has already decided to > return WAL segment 12 and update the state to look for 13 next. > > Now, if I'm not mistaken, using <= doesn't help at all. > > In my opinion, the problem here is that the natural way to ask "is > this file being archived out of order?" is to ask yourself "is the > file that I'm marking as ready for archiving now the one that > immediately follows the last one I marked as ready for archiving?" and > then invert the result. That is, if I last marked 10 as ready, and now > I'm marking 11 as ready, then it's in order, but if I'm now marking > anything else whatsoever, then it's out of order. But that's not what > this does. Instead of comparing what it's doing now to what it did > last, it compares what it did now to what the archiver did last. I agree that when we are creating a .ready file we should compare the current .ready file with the last .ready file to check if this file is created out of order. We can store the state of the last .ready file in shared memory and compare it with the current .ready file. I believe that archiver specific shared memory area can be used to store the state of the last .ready file unless I am missing something and this needs to be stored in a separate shared memory area. With this change, we have the flexibility to move the current archiver state out of shared memory and keep it local to archiver. I have incorporated these changes and updated a new patch. > > And it's really not obvious that that's correct. 
I think that the > > above argument actually demonstrates a flaw in the logic, but even if > > not, or even if it's too small a flaw to be a problem in practice, it > > seems a lot harder to reason about. > > I certainly agree that it's harder to reason about. If we were to go > the keep-trying-the-next-file route, we could probably minimize a lot > of the handling for these rare cases by banking on the "fallback" > directory scans. Provided we believe these situations are extremely > rare, some extra delay for an archive every once in a while might be > acceptable. +1. We are forcing a directory scan at the checkpoint and it will make sure that any missing file gets archived within the checkpoint boundaries. Please find the attached patch. Thanks, Dipesh From f05b223b368e40594f0ed8440c0704fb7b970ee0 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 8 Sep 2021 21:47:16 +0530 Subject: [PATCH] keep trying the next file approach --- src/backend/access/transam/xlog.c| 20 +++ src/backend/access/transam/xlogarchive.c | 25 src/backend/postmaster/pgarch.c | 234 ++- src/include/access/xlogdefs.h| 2 + src/include/postmaster/pgarch.h | 5 + 5 files changed, 254 insertions(+), 32 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a7..7dd4b96 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); /* + * Force the archiver to perform a directory scan. + * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per checkpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. 
(Do this after recycling old log * segments, since that may supply some of the needed files.) */ @@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr); /* + * Force the archiver to perform a directory scan. + * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per restartpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. (Do this after recycling old log * segments, since that may supply some of the needed files.) */ diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e..7756b87 100644 --- a/sr
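[Editor's note] The notify-side rule described above — compare the .ready file being created with the last one created, and fall back to a directory scan for anything out of order or non-regular — can be sketched as a small decision function. This is an illustrative Python analogue (names invented; the real code parses the file name with XLogFromFileName and uses archiver shared memory):

```python
def should_force_dir_scan(is_regular_wal, this_segno, last_notified_segno):
    """Decide whether XLogArchiveNotify must force a directory scan.

    is_regular_wal: False for .history, .backup, .partial files, which the
    fallback scan must pick up. this_segno: segment being marked ready now.
    last_notified_segno: segment marked ready previously (None at startup)."""
    if not is_regular_wal:
        return True
    if last_notified_segno is None:
        return False  # nothing to compare against yet
    # In order means "exactly the successor of the last one we marked ready";
    # anything else is out of order.
    return this_segno != last_notified_segno + 1
```

The key point from the discussion is that the comparison is against the last file *this side* marked ready, not against whatever the archiver happened to pick up last.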
Re: .ready and .done files considered harmful
Hi,

Thanks for the feedback.

> + * by checking the availability of next WAL file. "xlogState" specifies the
> + * segment number and timeline ID corresponding to the next WAL file.
>
> "xlogState" probably needs to be updated here.

Yes, I updated the comment.

> As noted before [0], I think we need to force a directory scan at the
> beginning of pgarch_MainLoop() and when pgarch_ArchiverCopyLoop()
> returns before we exit the "while" loop. Else, there's probably a
> risk that we skip archiving a file until the next directory scan. IMO
> forcing a directory scan at the beginning of pgarch_ArchiverCopyLoop()
> is a simpler way to do roughly the same thing. I'm skeptical that
> persisting the next-anticipated state between calls to
> pgarch_ArchiverCopyLoop() is worth the complexity.

I think that if we force a directory scan in pgarch_ArchiverCopyLoop() when it returns before we exit the "while" loop, or outside the loop, then it may result in a directory scan for every WAL file in one of the scenarios I can think of. There are two possible scenarios: a first scenario in which the archiver is always lagging, and a second scenario in which the archiver is in sync with, or ahead of, the rate at which WAL files are generated. If we focus on the second scenario, consider a case where the archiver has just archived file 1.ready and is about to check the availability of 2.ready, but the file 2.ready is not yet present in the archive_status directory. The archiver performs a directory scan as a fallback mechanism and goes to the wait state. (The current implementation relies on notifying the archiver by creating a .ready file on disk. It may happen that a file is ready for archival, but due to a slow notification mechanism there is a delay, and the archiver goes to the wait state.) When file 2.ready is created on disk, the archiver is notified; it wakes up and calls pgarch_ArchiverCopyLoop().
Now, if we unconditionally force a directory scan in pgarch_ArchiverCopyLoop(), it may result in a directory scan for every WAL file in this scenario, even though we have the next anticipated log segment number and could prevent the additional scan. I have tested this with a small setup by creating ~2000 WAL files, and it resulted in a directory scan for each file. I agree that the failure scenario discussed in [0] will require a WAL file to wait until the next directory scan. However, this can be avoided by forcing a directory scan in pgarch_ArchiverCopyLoop() only in the failure scenario. This will make sure that when the archiver wakes up for the next cycle it performs a full directory scan, leaving out any risk of missing a file due to archive failure. Additionally, it avoids the extra directory scans described in the scenario above. I have incorporated the changes and updated the patch. PFA patch.

Thanks,
Dipesh

[0] https://www.postgresql.org/message-id/AC78607B-9DA6-41F4-B253-840D3DD964BF%40amazon.com

From 7bb0794f3a41dacfada4863c26189ec01e3bee63 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Wed, 8 Sep 2021 21:47:16 +0530
Subject: [PATCH] keep trying the next file approach

--- src/backend/access/transam/xlog.c| 20 +++ src/backend/access/transam/xlogarchive.c | 23 src/backend/postmaster/pgarch.c | 203 +++ src/include/access/xlogdefs.h| 2 + src/include/postmaster/pgarch.h | 4 + 5 files changed, 225 insertions(+), 27 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a7..7dd4b96 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); /* + * Force the archiver to perform a directory scan.
+ * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per checkpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. (Do this after recycling old log * segments, since that may supply some of the needed files.) */ @@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr); /* + * Force the archiver to perform a directory scan. + * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per restartpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. (Do this after recycling old log * segments, since that may supply some of th
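[Editor's note] The fast path being defended above — use the next anticipated segment number to pick the next file without a readdir, and fall back to a full scan only when that file isn't there or a scan has been forced — can be sketched as follows. This is an illustrative Python analogue of pgarch_readyXlog() (a set of segment numbers stands in for the .ready files on disk):

```python
def pgarch_ready_xlog(ready_files, state):
    """Pick the next segment to archive.

    ready_files: set of segment numbers that currently have a .ready file.
    state: dict with 'next_segno' (anticipated segment, or None) and
    'force_dir_scan'. Returns the chosen segment number, or None."""
    nxt = state["next_segno"]
    if not state["force_dir_scan"] and nxt is not None and nxt in ready_files:
        state["next_segno"] = nxt + 1       # cheap path: no readdir needed
        return nxt
    # Fallback: full directory scan for the oldest .ready file.
    state["force_dir_scan"] = False
    if not ready_files:
        return None
    oldest = min(ready_files)
    state["next_segno"] = oldest + 1
    return oldest
```

In the "archiver in sync" scenario described above, each wakeup finds the anticipated segment on the cheap path; only gaps (e.g. an out-of-order or missing .ready file) fall through to a scan.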
Re: .ready and .done files considered harmful
> > I guess we still have to pick one or the other, but I don't really > > know how to do that, since both methods seem to be relatively fine, > > and the scenarios where one is better than the other all feel a little > > bit contrived. I guess if no clear consensus emerges in the next week > > or so, I'll just pick one and commit it. Not quite sure yet how I'll > > do the picking, but we seem to all agree that something is better than > > nothing, so hopefully nobody will be too sad if I make an arbitrary > > decision. And if some clear agreement emerges before then, even > > better. > > I will be happy to see this fixed either way. +1 > > I agree, but it should probably be something like DEBUG3 instead of > > LOG. > > I will update it in the next patch. Updated log level to DEBUG3 and rebased the patch. PFA patch. Thanks, Dipesh From e0ce2b85f9122963d820d005ca093e6c667ca7e6 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 8 Sep 2021 21:47:16 +0530 Subject: [PATCH] keep trying the next file approach --- src/backend/access/transam/xlog.c| 20 src/backend/access/transam/xlogarchive.c | 23 src/backend/postmaster/pgarch.c | 195 ++- src/include/access/xlogdefs.h| 2 + src/include/postmaster/pgarch.h | 4 + 5 files changed, 217 insertions(+), 27 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a7..7dd4b96 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -9285,6 +9285,16 @@ CreateCheckPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr); /* + * Force the archiver to perform a directory scan. + * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per checkpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. 
(Do this after recycling old log * segments, since that may supply some of the needed files.) */ @@ -9650,6 +9660,16 @@ CreateRestartPoint(int flags) RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr); /* + * Force the archiver to perform a directory scan. + * + * Ordinarily, this should not be needed, but it seems like a good idea + * to make sure we scan the archive_status directory every once in a + * while to make sure we haven't left anything behind. Calling it here + * ensures we do a directory scan at least once per restartpoint. + */ + PgArchForceDirScan(); + + /* * Make more log segments if needed. (Do this after recycling old log * segments, since that may supply some of the needed files.) */ diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e..0969f42 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -489,6 +489,29 @@ XLogArchiveNotify(const char *xlog) return; } + /* Force a directory scan if we are archiving anything but a regular + * WAL file or if this WAL file is being created out-of-order. + */ + if (!IsXLogFileName(xlog)) + PgArchForceDirScan(); + else + { + TimeLineID tli; + XLogSegNo this_segno; + XLogSegNo arch_segno; + + XLogFromFileName(xlog, &tli, &this_segno, wal_segment_size); + arch_segno = PgArchGetCurrentSegno(); + + /* + * We must use <= because the archiver may have just completed a + * directory scan and found a later segment (but hasn't updated + * shared memory yet).
+ */ + if (this_segno <= arch_segno) + PgArchForceDirScan(); + } + /* Notify archiver that it's got something to do */ if (IsUnderPostmaster) PgArchWakeup(); diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..84ba1e0 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -47,6 +47,7 @@ #include "storage/proc.h" #include "storage/procsignal.h" #include "storage/shmem.h" +#include "storage/spin.h" #include "utils/guc.h" #include "utils/ps_status.h" @@ -76,6 +77,20 @@ typedef struct PgArchData { int pgprocno; /* pgprocno of archiver process */ + + /* + * Forces a directory scan in pgarch_readyXlog(). Protected by + * arch_lck. + */ + bool force_dir_scan; + + /* + * Current archiver state. Protected by arch_lck. + */ + TimeLineID lastTli; + XLogSegNo lastSegNo; + + slock_t arch_lck; } PgArchData; @@ -103,6 +118,7 @@ static bool pgarch_readyXlog(char *xlog); static void pgarch_archiveDone(char *xlog); static void pgarch_die(int code, Datum arg); static void HandlePgArchInterrupts(void); +static bool higher_arch_priority(const char *a, const char
Re: .ready and .done files considered harmful
Hi,

Thanks for the feedback.

> Which approach do you think we should use? I think we have decent
> patches for both approaches at this point, so perhaps we should see if
> we can get some additional feedback from the community on which one we
> should pursue further.

In my opinion, both approaches have benefits over the current implementation. I think the keep-trying-the-next-file approach handles all the rare and specific scenarios that require us to force a directory scan to archive the desired files. In addition, with the recent change to force a directory scan at checkpoint, we can avoid an infinite wait for a file that is still missed despite the handling of those special scenarios. It is also more efficient in extreme scenarios, as discussed in this thread. However, the multiple-files-per-readdir approach is cleaner, with the resilience of the current implementation. I agree that we should decide which approach to pursue further based on additional feedback from the community.

> The problem I see with this is that pgarch_archiveXlog() might end up
> failing. If it does, we won't retry archiving the file until we do a
> directory scan. I think we could try to avoid forcing a directory
> scan outside of these failure cases and archiver startup, but I'm not
> sure it's really worth it. When pgarch_readyXlog() returns false, it
> most likely means that there are no .ready files present, so I'm not
> sure we are gaining a whole lot by avoiding a directory scan in that
> case. I guess it might help a bit if there are a ton of .done files,
> though.

Yes, I think it will be useful when we have a bunch of .done files and the frequency of .ready files is such that the archiver goes to the wait state before the next WAL file is ready for archival.

> I agree, but it should probably be something like DEBUG3 instead of
> LOG.

I will update it in the next patch.

Thanks,
Dipesh
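[Editor's note] The ".done files" point above is about the cost a full scan pays: every directory entry, including the potentially large population of .done files, must be examined just to find the oldest .ready file. A hypothetical sketch of that filtering step (for equal-length hex WAL names, lexicographic order matches numeric order, which is what makes `min` work here):

```python
def oldest_ready_by_scan(entries):
    """Full directory scan: filter .ready entries out of a listing that may
    be dominated by .done files, and return the oldest WAL name (no suffix).
    Cost is O(directory size), which the fast path avoids."""
    ready = [e[:-len(".ready")] for e in entries if e.endswith(".ready")]
    return min(ready) if ready else None
```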
Re: .ready and .done files considered harmful
Hi,

Thanks for the feedback.

> I attached two patches that demonstrate what I'm thinking this change
> should look like. One is my take on the keep-trying-the-next-file
> approach, and the other is a new version of the multiple-files-per-
> readdir approach (with handling for "cheating" archive commands). I
> personally feel that the multiple-files-per-readdir approach winds up
> being a bit cleaner and more resilient than the keep-trying-the-next-
> file approach. However, the keep-trying-the-next-file approach will
> certainly be more efficient (especially for the extreme cases
> discussed in this thread), and I don't have any concrete concerns with
> this approach that seem impossible to handle.

I agree that multiple-files-per-readdir is cleaner and has the resilience of the current implementation. However, I have a few suggestions on the keep-trying-the-next-file approach patch shared in the previous thread.

+	/* force directory scan the first time we call pgarch_readyXlog() */
+	PgArchForceDirScan();
+

We should not force a directory scan in pgarch_ArchiverCopyLoop(). This gets called whenever the archiver wakes up from its wait state, so forcing a scan here makes the archiver perform a full directory scan despite having accurate information about the next anticipated log segment. Instead, we can check in pgarch_readyXlog() whether lastSegNo is initialized and continue with directory scans until it is.

+	return lastSegNo;

We should return "true" here.

I am also thinking we could add a log message for files that are archived as part of a directory scan. This would be useful for diagnostic purposes, to check whether the desired files get archived by the directory scan in the special scenarios. I also think that we should add a few comments in pgarch_readyXlog().

I have incorporated these changes and attached a patch v1-0001-keep-trying-the-next-file-approach.patch.
+	/*
+	 * We must use <= because the archiver may have just completed a
+	 * directory scan and found a later segment (but hasn't updated
+	 * shared memory yet).
+	 */
+	if (this_segno <= arch_segno)
+		PgArchForceDirScan();

I still think that we should use the '<' operator here, because arch_segno represents the segment number of the most recent .ready file found by the archiver, and it gets updated in shared memory only after the archiver has successfully found a .ready file. In a normal scenario this_segno will be greater than arch_segno, whereas when a .ready file is created out of order this_segno may be less than arch_segno. I am wondering whether there is a scenario where arch_segno can equal this_segno, unless I am missing something.

Thanks,
Dipesh

From 4fb41105ba4ed9d02a2a551bb4f6cf693ec5c31e Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Thu, 2 Sep 2021 17:16:20 +0530
Subject: [PATCH] keep trying the next file approach

---
 src/backend/access/transam/xlog.c        | 20
 src/backend/access/transam/xlogarchive.c | 23
 src/backend/postmaster/pgarch.c          | 195 ++-
 src/include/access/xlogdefs.h            | 1 +
 src/include/postmaster/pgarch.h          | 4 +
 5 files changed, 216 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab..49caa61 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9483,6 +9483,16 @@ CreateCheckPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind. Calling it here
+	 * ensures we do a directory scan at least once per checkpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed. (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
@@ -9848,6 +9858,16 @@ CreateRestartPoint(int flags)
 	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
 
 	/*
+	 * Force the archiver to perform a directory scan.
+	 *
+	 * Ordinarily, this should not be needed, but it seems like a good idea
+	 * to make sure we scan the archive_status directory every once in a
+	 * while to make sure we haven't left anything behind. Calling it here
+	 * ensures we do a directory scan at least once per restartpoint.
+	 */
+	PgArchForceDirScan();
+
+	/*
 	 * Make more log segments if needed. (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index b9c19b2..44630e7 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -492,
Re: .ready and .done files considered harmful
> If a .ready file is created out of order, the directory scan logic
> will pick it up about as soon as possible based on its priority. If
> the archiver is keeping up relatively well, there's a good chance such
> a file will have the highest archival priority and will be picked up
> the next time the archiver looks for a file to archive. With the
> patch proposed in this thread, an out-of-order .ready file has no such
> guarantee. As long as the archiver never has to fall back to a
> directory scan, it won't be archived. The proposed patch handles the
> case where RemoveOldXlogFiles() creates missing .ready files by
> forcing a directory scan, but I'm not sure this is enough. I think we
> have to check the archiver state each time we create a .ready file to
> see whether we're creating one out-of-order.

We can handle the scenario where a .ready file is created out of order in XLogArchiveNotify() itself. This way we avoid making explicit calls to enable a directory scan from every code path that might create an out-of-order .ready file. The archiver can store the segment number of the most recent .ready file it found. When a .ready file is created in XLogArchiveNotify(), the log segment number of the new .ready file can be compared with the segment number of the last .ready file found by the archiver to detect whether this file was created out of order, and a directory scan can be forced if required. I have incorporated these changes in patch v11.

> While this may be an extremely rare problem in practice, archiving
> something after the next checkpoint completes seems better than never
> archiving it at all. IMO this isn't an area where there is much space
> to take risks.

An alternate approach could be to force a directory scan at each checkpoint to break the infinite wait for a .ready file that is missed because it was created out of order. This would make sure the file gets archived within the checkpoint boundaries. Thoughts?
Please find attached patch v11.

Thanks,
Dipesh

From 9392fd1b82ade933e8127845013bb2940239af68 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Tue, 24 Aug 2021 12:17:34 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of the current WAL file being archived
and incrementing it by '1' to get the next WAL file. The archiver can
check the availability of this next file; if the file is not available, it
falls back to a directory scan to get the oldest WAL file.

There are some special scenarios, like a timeline switch, a backup, or a
.ready file created out of order, which require the archiver to perform a
full directory scan, as archiving these files takes precedence over
regular WAL files.
---
 src/backend/access/transam/xlogarchive.c | 23
 src/backend/postmaster/pgarch.c          | 211 ---
 src/backend/storage/lmgr/lwlocknames.txt | 1 +
 src/include/postmaster/pgarch.h          | 4 +
 4 files changed, 224 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index b9c19b2..cb73895 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -465,12 +465,16 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
  *
  * Optionally, nudge the archiver process so that it'll notice the file we
  * create.
+ *
+ * Also, notify the archiver to enable a directory scan to handle a few
+ * special scenarios.
  */
 void
 XLogArchiveNotify(const char *xlog, bool nudge)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	FILE	   *fd;
+	bool		fileOutOfOrder = false;
 
 	/* insert an otherwise empty file called <XLOG>.ready */
 	StatusFilePath(archiveStatusPath, xlog, ".ready");
@@ -492,6 +496,25 @@ XLogArchiveNotify(const char *xlog, bool nudge)
 		return;
 	}
 
+	/* Check if .ready file is created out of order */
+	if (IsXLogFileName(xlog))
+	{
+		XLogSegNo	curSegNo;
+		TimeLineID	tli;
+
+		XLogFromFileName(xlog, &tli, &curSegNo, wal_segment_size);
+
+		fileOutOfOrder = PgArchIsBrokenReadyFileOrder(curSegNo);
+	}
+
+	/*
+	 * History files or a .ready file created out of order require the
+	 * archiver to perform a full directory scan.
+	 */
+	if (IsTLHistoryFileName(xlog) || IsBackupHistoryFileName(xlog) ||
+		fileOutOfOrder)
+		PgArchEnableDirScan();
+
 	/* If caller requested, let archiver know it's got work to do */
 	if (nudge)
 		PgArchWakeup();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 74a7d7c..a33648a 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -76,8 +76,31 @@ typedef struct PgArchData
 {
 	int			pgprocno;		/* pgprocno of archiver process */
+
+	/*
+	 * Flag to enable/disab
Re: .ready and .done files considered harmful
Thanks for the feedback. > > > IIUC partial WAL files are handled because the next file in the > > > sequence with the given TimeLineID won't be there, so we will fall > > > back to a directory scan and pick it up. Timeline history files are > > > handled by forcing a directory scan, which should work because they > > > always have the highest priority. Backup history files, however, do > > > not seem to be handled. I think one approach to fixing that is to > > > also treat backup history files similarly to timeline history files. > > > If one is created, we force a directory scan, and the directory scan > > > logic will consider backup history files as higher priority than > > > everything but timeline history files. > > > > Backup history files are (currently) just informational and they are > > finally processed at the end of a bulk-archiving performed by the fast > > path. However, I feel that it is cleaner to trigger a directory scan > > every time we add an other-than-a-regular-WAL-file, as base-backup or > - promotion are not supposed happen so infrequently. > + promotion are not supposed happen so frequently. I have incorporated the changes to trigger a directory scan in case of a backup history file. Also, updated archiver to prioritize archiving a backup history file over regular WAL files during directory scan to make sure that backup history file gets archived before the directory scan gets disabled as part of archiving a regular WAL file. > > I've been looking at the v9 patch with fresh eyes, and I still think > > we should be able to force the directory scan as needed in > > XLogArchiveNotify(). Unless the file to archive is a regular WAL file > > that is > our stored location in archiver memory, we should force a > > directory scan. 
I think it needs to be > instead of >= because we
> don't know if the archiver has just completed a directory scan and
> found a later segment to use to update the archiver state (but hasn't
> yet updated the state in shared memory).
>
> I'm afraid that it can be seen as a violation of modularity. I feel
> that wal-emitter side should not be aware of that datail of
> archiving. Instead, I would prefer to keep directory scan as far as it
> found an smaller segment id than the next-expected segment id ever
> archived by the fast-path (if possible). This would be
> less-performant in the case out-of-order segments are frequent but I
> think the overall objective of the original patch will be kept.

The archiver selects the file with the lowest segment number during a directory scan, and the next segment number gets reset based on this file. It starts a new sequence from there and checks the availability of the next file. If there are holes, it will keep falling back to a directory scan until it finds the next in-order sequence. I think this is already handled, unless I am missing something.

> Also, I think we need to make sure to set PgArch->dirScan back to true
> at the end of pgarch_readyXlog() unless we've found a new regular WAL
> file that we can use to reset the archiver's stored location. This
> ensures that we'll keep doing directory scans as long as there are
> timeline/backup history files to process.

Right. Done.

Please find the attached patch v10.

Thanks,
Dipesh

From c0a4bf3937934e576db49f7a30530dcf71a068d6 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Tue, 24 Aug 2021 12:17:34 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of the current WAL file being archived
and incrementing it by '1' to get the next WAL file.
The archiver can check the availability of this next file; if the file is
not available, it falls back to a directory scan to get the oldest WAL
file. There are some special scenarios, like a timeline switch, a backup,
or a .ready file created out of order, which require the archiver to
perform a full directory scan, as archiving these files takes precedence
over regular WAL files.
---
 src/backend/access/transam/xlogarchive.c | 23 -
 src/backend/postmaster/pgarch.c          | 170 ---
 src/include/postmaster/pgarch.h          | 1 +
 3 files changed, 178 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index b9c19b2..ececd10 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -465,6 +465,8 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
  *
  * Optionally, nudge the archiver pro
Re: .ready and .done files considered harmful
Hi, Thanks for the feedback. > Should we have XLogArchiveNotify(), writeTimeLineHistory(), and > writeTimeLineHistoryFile() enable the directory scan instead? Else, > we have to exhaustively cover all such code paths, which may be > difficult to maintain. Another reason I am bringing this up is that > my patch for adjusting .ready file creation [0] introduces more > opportunities for .ready files to be created out-of-order. XLogArchiveNotify() notifies Archiver when a log segment is ready for archival by creating a .ready file. This function is being called for each log segment and placing a call to enable directory scan here will result in directory scan for each log segment. We can have writeTimeLineHistory() and writeTimeLineHistoryFile() to enable directory scan to handle the scenarios related to timeline switch. However, in other scenarios, I think we have to explicitly call PgArchEnableDirScan() to enable directory scan. PgArchEnableDirScan() takes care of waking up archiver so that the caller of this function need not have to nudge the archiver. > +/* > + * This is a fall-back path, check if we are here due to the unavailability > + * of next anticipated log segment or the archiver is being forced to > + * perform a full directory scan. Reset the flag in shared memory only if > + * it has been enabled to force a full directory scan and then proceed with > + * directory scan. > + */ > +if (PgArch->dirScan) > +PgArch->dirScan = false; > Why do we need to check that the flag is set before we reset it? I > think we could just always reset it since we are about to do a > directory scan anyway Yes, I agree. > > If there is a race condition here with setting the flag, then an > > alternative design would be to use a counter - either a plain old > > uint64 or perhaps pg_atomic_uint64 - and have the startup process > > increment the counter when it wants to trigger a scan. 
In this design, > > the archiver would never modify the counter itself, but just remember > > the last value that it saw. If it later sees a different value it > > knows that a full scan is required. I think this kind of system is > > extremely robust against the general class of problems that you're > > talking about here, but I'm not sure whether we need it, because I'm > > not sure whether there is a race with just the bool. > I'm not sure, either. Perhaps it would at least be worth adding a > pg_memory_barrier() after setting dirScan to false to avoid the > scenario I mentioned (which may or may not be possible). IMO this > stuff would be much easier to reason about if we used a lock instead, > even if the synchronization was not strictly necessary. However, I > don't want to hold this patch up too much on this point. There is one possible scenario where it may run into a race condition. If archiver has just finished archiving all .ready files and the next anticipated log segment is not available then in this case archiver takes the fall-back path to scan directory. It resets the flag before it begins directory scan. Now, if a directory scan is enabled by a timeline switch or .ready file created out of order in parallel to the event that the archiver resets the flag then this might result in a race condition. But in this case also archiver is eventually going to perform a directory scan and the desired file will be archived as part of directory scan. Apart of this I can't think of any other scenario which may result into a race condition unless I am missing something. I have incorporated the suggestions and updated a new patch. PFA patch v9. Thanks, Dipesh From 04312fa668a56d9860547a02260d68c339747fa8 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 30 Jun 2021 14:05:58 +0530 Subject: [PATCH] mitigate directory scan for WAL archiver WAL archiver scans the status directory to identify the next WAL file that needs to be archived. 
This directory scan can be minimised by maintaining the log segment number of current file which is being archived and incrementing it by '1' to get the next WAL file. Archiver can check the availability of next file and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file. If there is a timeline switch then archiver performs a full directory scan to make sure that archiving history file takes precedence over archiving WAL files on older timeline. --- src/backend/access/transam/timeline.c| 10 ++ src/backend/access/transam/xlogarchive.c | 11 +++ src/backend/postmaster/pgarch.c | 162 --- src/include/postmaster/pgarch.h | 1 + 4 files changed, 171 insertions(+), 13 deletions(-) diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 8d0903c..f70cede 100644 -
Re: .ready and .done files considered harmful
Hi, Thanks for the feedback. I have incorporated the suggestion to use an unsynchronized boolean flag to force directory scan. This flag is being set if there is a timeline switch or .ready file is created out of order. Archiver resets this flag in case if it is being set before it begins directory scan. PFA patch v8. Thanks, Dipesh From 1248da2cd96394b632e9853d8ff1f23f123a7813 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 30 Jun 2021 14:05:58 +0530 Subject: [PATCH] mitigate directory scan for WAL archiver WAL archiver scans the status directory to identify the next WAL file that needs to be archived. This directory scan can be minimised by maintaining the log segment number of current file which is being archived and incrementing it by '1' to get the next WAL file. Archiver can check the availability of next file and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file. If there is a timeline switch then archiver performs a full directory scan to make sure that archiving history file takes precedence over archiving WAL files on older timeline. --- src/backend/access/transam/xlog.c| 15 +++ src/backend/access/transam/xlogarchive.c | 12 +++ src/backend/postmaster/pgarch.c | 155 --- src/include/postmaster/pgarch.h | 1 + 4 files changed, 170 insertions(+), 13 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e51a7a7..c30ef16 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -50,6 +50,7 @@ #include "port/atomics.h" #include "port/pg_iovec.h" #include "postmaster/bgwriter.h" +#include "postmaster/pgarch.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/basebackup.h" @@ -7564,6 +7565,13 @@ StartupXLOG(void) */ if (AllowCascadeReplication()) WalSndWakeup(); + + /* + * Switched to a new timeline, notify archiver to enable + * directory scan. 
+ */ + if (XLogArchivingActive()) + PgArchEnableDirScan(); } /* Exit loop if we reached inclusive recovery target */ @@ -7806,6 +7814,13 @@ StartupXLOG(void) EndRecPtr, reason); /* + * Switched to a new timeline, notify archiver to enable directory + * scan. + */ + if (XLogArchivingActive()) + PgArchEnableDirScan(); + + /* * Since there might be a partial WAL segment named RECOVERYXLOG, get * rid of it. */ diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e..94c74f8 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -609,6 +609,18 @@ XLogArchiveCheckDone(const char *xlog) /* Retry creation of the .ready file */ XLogArchiveNotify(xlog); + + /* + * This .ready file is created out of order, notify archiver to perform + * a full directory scan to archive corresponding WAL file. + */ + StatusFilePath(archiveStatusPath, xlog, ".ready"); + if (stat(archiveStatusPath, _buf) == 0) + { + PgArchEnableDirScan(); + PgArchWakeup(); + } + return false; } diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..9bfcfda 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -76,8 +76,25 @@ typedef struct PgArchData { int pgprocno; /* pgprocno of archiver process */ + + /* + * Flag to enable/disable directory scan. If this flag is set then it + * forces archiver to perform a full directory scan to get the next log + * segment. It is not required to synchronize this flag as it guarantees + * directory scan for the next cycle even if it is being missed in current + * cycle. 
+ */ + bool dirScan; } PgArchData; +/* + * Segment number and timeline ID to identify the next file in a WAL sequence + */ +typedef struct readyXLogState +{ + XLogSegNo lastSegNo; + TimeLineID lastTLI; +} readyXLogState; /* -- * Local data @@ -97,9 +114,9 @@ static volatile sig_atomic_t ready_to_stop = false; */ static void pgarch_waken_stop(SIGNAL_ARGS); static void pgarch_MainLoop(void); -static void pgarch_ArchiverCopyLoop(void); +static void pgarch_ArchiverCopyLoop(readyXLogState *xlogState); static bool pgarch_archiveXlog(char *xlog); -static bool pgarch_readyXlog(char *xlog); +static bool pgarch_readyXlog(char *xlog, readyXLogState *xlogState); static void pgarch_archiveDone(char *xlog); static void pgarch_die(int code, Datum arg); static void HandlePgArchInterrupts(void); @@ -221,6 +238,15 @@ PgArchWakeup(void) SetLatch(>allProcs[arch_pgprocno].procLatch); } +/* + * Set dirScan flag in shared memory. Backend notifies archiver in case if an + * action requires full directory scan to get the next log segment. + */ +void +PgArchEnableDirS
Re: .ready and .done files considered harmful
Thanks for the feedback.

> + StatusFilePath(archiveStatusPath, xlog, ".ready");
> + if (stat(archiveStatusPath, &stat_buf) == 0)
> + PgArchEnableDirScan();
>
> We may want to call PgArchWakeup() after setting the flag.

Yes, added a call to wake up the archiver.

> > + * - The next anticipated log segment is not available.
> >
> > I wonder if we really need to perform a directory scan in this case.
> > Unless there are other cases where the .ready files are created out of
> > order, I think this just causes an unnecessary directory scan every
> > time the archiver catches up.
>
> Thinking further, I suppose this is necessary for when lastSegNo gets
> reset after processing an out-of-order .ready file.

Also, this is necessary when lastTLI gets reset after switching to a new timeline.

> + pg_atomic_flag dirScan;
>
> I personally don't think it's necessary to use an atomic here. A
> spinlock or LWLock would probably work just fine, as contention seems
> unlikely. If we use a lock, we also don't have to worry about memory
> barriers.

A history file should be archived as soon as it gets created. The atomic flag makes sure that there is no reordering of read/write instructions while accessing the flag in shared memory. The archiver needs to read this flag at the beginning of each cycle; writes to the atomic flag are synchronized, and reads are lockless. I think an atomic flag is an efficient choice here, unless I am missing something.

Please find the attached patch v7.

Thanks,
Dipesh

From 55c42f851176a75881a55b1c75d624248169b876 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of the current file being archived and
incrementing it by '1' to get the next WAL file.
Archiver can check the availability of next file and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file. If there is a timeline switch then archiver performs a full directory scan to make sure that archiving history file takes precedence over archiving WAL files on older timeline. --- src/backend/access/transam/xlog.c| 15 +++ src/backend/access/transam/xlogarchive.c | 12 +++ src/backend/postmaster/pgarch.c | 163 --- src/include/postmaster/pgarch.h | 1 + 4 files changed, 178 insertions(+), 13 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f84c0bb..088ab43 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -50,6 +50,7 @@ #include "port/atomics.h" #include "port/pg_iovec.h" #include "postmaster/bgwriter.h" +#include "postmaster/pgarch.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/basebackup.h" @@ -7555,6 +7556,13 @@ StartupXLOG(void) */ if (AllowCascadeReplication()) WalSndWakeup(); + + /* + * Switched to a new timeline, notify archiver to enable + * directory scan. + */ + if (XLogArchivingActive()) + PgArchEnableDirScan(); } /* Exit loop if we reached inclusive recovery target */ @@ -7797,6 +7805,13 @@ StartupXLOG(void) EndRecPtr, reason); /* + * Switched to a new timeline, notify archiver to enable directory + * scan. + */ + if (XLogArchivingActive()) + PgArchEnableDirScan(); + + /* * Since there might be a partial WAL segment named RECOVERYXLOG, get * rid of it. 
*/ diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e..94c74f8 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -609,6 +609,18 @@ XLogArchiveCheckDone(const char *xlog) /* Retry creation of the .ready file */ XLogArchiveNotify(xlog); + + /* + * This .ready file is created out of order, notify archiver to perform + * a full directory scan to archive corresponding WAL file. + */ + StatusFilePath(archiveStatusPath, xlog, ".ready"); + if (stat(archiveStatusPath, _buf) == 0) + { + PgArchEnableDirScan(); + PgArchWakeup(); + } + return false; } diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..e5ea7a6 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -76,8 +76,23 @@ typedef struct PgArchData { int pgprocno; /* pgprocno of archiver process */ + + /* + * Flag to enable/disable directory scan. If this flag is set then it + * forces archiver to perform a full directory sc
Re: .ready and .done files considered harmful
Hi,

Thanks for the feedback.

The archiver can take one of two paths in each cycle: a fast path, which checks the availability of the next anticipated log segment and makes it the next archival target, or a fall-back path, which performs a full directory scan to get the next log segment. We need a mechanism that lets the archiver select the desired path for each cycle. This can be achieved with a shared memory flag: if the flag is set, the archiver takes the fall-back path; otherwise it continues on the fast path. The flag can be set by a backend when an action such as a timeline switch, a .ready file created out of order, etc. requires the archiver to perform a full directory scan.

I have incorporated these changes into an updated patch. PFA patch v6.

Thanks,
Dipesh

From c636dffab006c15cf3a8656982679fbe6a6c6440 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

The WAL archiver scans the status directory to identify the next WAL file
that needs to be archived. This directory scan can be minimised by
maintaining the log segment number of the current file being archived and
incrementing it by '1' to get the next WAL file. The archiver can check
the availability of this next file; if the file is not available, it falls
back to a directory scan to get the oldest WAL file. If there is a
timeline switch, the archiver performs a full directory scan to make sure
that archiving the history file takes precedence over archiving WAL files
on the older timeline.
--- src/backend/access/transam/xlog.c| 15 +++ src/backend/access/transam/xlogarchive.c | 9 ++ src/backend/postmaster/pgarch.c | 163 --- src/include/postmaster/pgarch.h | 1 + 4 files changed, 175 insertions(+), 13 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f84c0bb..088ab43 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -50,6 +50,7 @@ #include "port/atomics.h" #include "port/pg_iovec.h" #include "postmaster/bgwriter.h" +#include "postmaster/pgarch.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/basebackup.h" @@ -7555,6 +7556,13 @@ StartupXLOG(void) */ if (AllowCascadeReplication()) WalSndWakeup(); + + /* + * Switched to a new timeline, notify archiver to enable + * directory scan. + */ + if (XLogArchivingActive()) + PgArchEnableDirScan(); } /* Exit loop if we reached inclusive recovery target */ @@ -7797,6 +7805,13 @@ StartupXLOG(void) EndRecPtr, reason); /* + * Switched to a new timeline, notify archiver to enable directory + * scan. + */ + if (XLogArchivingActive()) + PgArchEnableDirScan(); + + /* * Since there might be a partial WAL segment named RECOVERYXLOG, get * rid of it. */ diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e..bccbbb6 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -609,6 +609,15 @@ XLogArchiveCheckDone(const char *xlog) /* Retry creation of the .ready file */ XLogArchiveNotify(xlog); + + /* + * This .ready file is created out of order, notify archiver to perform + * a full directory scan to archive corresponding WAL file. 
+ */ + StatusFilePath(archiveStatusPath, xlog, ".ready"); + if (stat(archiveStatusPath, _buf) == 0) + PgArchEnableDirScan(); + return false; } diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..e5ea7a6 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -76,8 +76,23 @@ typedef struct PgArchData { int pgprocno; /* pgprocno of archiver process */ + + /* + * Flag to enable/disable directory scan. If this flag is set then it + * forces archiver to perform a full directory scan to get the next log + * segment. + */ + pg_atomic_flag dirScan; } PgArchData; +/* + * Segment number and timeline ID to identify the next file in a WAL sequence + */ +typedef struct readyXLogState +{ + XLogSegNo lastSegNo; + TimeLineID lastTLI; +} readyXLogState; /* -- * Local data @@ -97,12 +112,13 @@ static volatile sig_atomic_t ready_to_stop = false; */ static void pgarch_waken_stop(SIGNAL_ARGS); static void pgarch_MainLoop(void); -static void pgarch_ArchiverCopyLoop(void); +static void pgarch_ArchiverCopyLoop(readyXLogState *xlogState); static bool pgarch_archiveXlog(char *xlog); -static bool pgarch_readyXlog(char *xlog); +static bool pgarch_readyXlog(char *xlog, readyXLogState *xlogState); static void pgarch_archiveDone(char *xlog); static void pgarch_d
Re: .ready and .done files considered harmful
> I'm not sure. I think we need the value to be accurate during
> recovery, so I'm not sure whether replayEndTLI would get us there.
> Another approach might be to set ThisTimeLineID on standbys also.
> Actually just taking a fast look at the code I'm not quite sure why
> that isn't happening already. Do you have any understanding of that?

During the investigation I found that the current timeline ID gets stored in XLogCtl's ThisTimeLineID once it is finalised as part of archive recovery:

    /*
     * Write the timeline history file, and have it archived. After this
     * point (or rather, as soon as the file is archived), the timeline
     * will appear as "taken" in the WAL archive and to any standby
     * servers. If we crash before actually switching to the new
     * timeline, standby servers will nevertheless think that we switched
     * to the new timeline, and will try to connect to the new timeline.
     * To minimize the window for that, try to do as little as possible
     * between here and writing the end-of-recovery record.
     */

In the case of a standby, this happens only when it gets promoted. While a standby is in recovery, replayEndTLI points to the most recent TLI corresponding to the replayed records, and if replaying a record causes a timeline switch then replayEndTLI is updated with the new timeline. So, as long as the standby is in recovery mode, replayEndTLI should point to its current timeline ID.

Thoughts?

Thanks,
Dipesh
Re: .ready and .done files considered harmful
Hi,

> I don't really understand why you are storing something in shared
> memory specifically for the archiver. Can't we use XLogCtl's
> ThisTimeLineID instead of storing another copy of the information?

Yes, we can avoid storing another copy of the information. We can use XLogCtl's ThisTimeLineID on the primary. However, XLogCtl's ThisTimeLineID is not set to the current timeline ID on a standby server; its value is set to '0'. Can we use XLogCtl's replayEndTLI on the standby server to get the current timeline ID?

Thanks,
Dipesh
Re: .ready and .done files considered harmful
Hi,

> I think what you are saying is true before v14, but not in v14 and master.

Yes, we can use archiver-specific shared memory. Thanks.

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()? I guess there could be a concern about the
> expense of that, because we'd probably be taking a spinlock or an
> lwlock for every cycle, but I don't think it's probably that bad,
> because I doubt we can archive much more than a double-digit number of
> files per second even with a very fast archive_command, and contention
> on a lock generally requires a five digit number of acquisitions per
> second. It would be worth testing to see if we can see a problem here,
> but I'm fairly hopeful that it's not an issue. If we do feel that it's
> important to avoid repeatedly taking a lock, let's see if we can find
> a way to do it without dedicating a signal to this purpose.

We can maintain the current timeline ID in archiver-specific shared memory. When we switch to a new timeline, the backend process updates the timeline ID in shared memory. The archiver keeps track of the current timeline ID, and if it detects a timeline switch it performs a full directory scan to make sure that archiving history files takes precedence over archiving WAL files. Access to the shared memory area can be protected by adding a WALArchiverLock. With this approach we don't need a dedicated signal to notify a timeline switch.
> The problem with all this is that you can't understand either function
> in isolation. Unless you read them both together and look at all of
> the ways these three variables are manipulated, you can't really
> understand the logic. And there's really no reason why that needs to
> be true. The job of clearing timeline_switch and setting dirScan could
> be done entirely within pgarch_readyXlog(), and so could the job of
> incrementing nextLogSegNo, because we're not going to again call
> pgarch_readyXlog() unless archiving succeeded.
>
> Also note that the TLI which is stored in curFileTLI corresponds to
> the segment number stored in nextLogSegNo, yet one of them has "cur"
> for "current" in the name and the other has "next". It would be easier
> to read the code if the names were chosen more consistently.
>
> My tentative idea as to how to clean this up is: declare a new struct
> with a name like readyXlogState and members lastTLI and lastSegNo.
> Have pgarch_ArchiverCopyLoop() declare a variable of this type, zero
> it, pass it as a parameter to pgarch_readyXlog(), and otherwise leave
> it alone. Then let pgarch_readyXlog() do all of the manipulation of
> the values stored therein.

Makes sense; we can move the entire logic into a single function, pgarch_readyXlog(), and declare a new struct readyXLogState. However, I think we cannot declare a variable of this type in pgarch_ArchiverCopyLoop(), because that function is called every time the archiver wakes up; initializing readyXLogState there would reset the next anticipated log segment number whenever the archiver wakes up from a wait state. We can instead declare and initialize it in pgarch_MainLoop() to avoid that reset.

> You've moved this comment from its original location, but the trouble
> is that the comment is 100% false. In fact, the whole reason why you
> wrote this patch is *because* this comment is 100% false. In fact it
> is not difficult to create cases where each scan finds many files, and
> the purpose of the patch is precisely to optimize the code that the
> person who wrote this thought didn't need optimizing. Now it may take
> some work to figure out what we want to say here exactly, but
> preserving the comment as it's written here is certainly misleading.

Yes, I agree. We can update the comment to list the scenarios where we may need to perform a full directory scan.

I have incorporated these changes and updated a new patch. Please find the attached patch v5.

Thanks,
Dipesh

From 3e0f690d1b8fd1d07fa503a807ee97cb9ed7377b Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file that needs to be archiv
Re: .ready and .done files considered harmful
Hi,

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()?

As of now, shared memory is not attached to the archiver process, so the archiver cannot access ThisTimeLineID or a flag in shared memory:

    if (strcmp(argv[1], "--forkbackend") == 0 ||
        strcmp(argv[1], "--forkavlauncher") == 0 ||
        strcmp(argv[1], "--forkavworker") == 0 ||
        strcmp(argv[1], "--forkboot") == 0 ||
        strncmp(argv[1], "--forkbgworker=", 15) == 0)
        PGSharedMemoryReAttach();
    else
        PGSharedMemoryNoReAttach();

This is the reason we thought of sending a notification to the archiver when there is a timeline switch. Should we consider attaching shared memory to the archiver process, or explore the notification mechanism further to avoid using SIGINT?

Thanks,
Dipesh
Re: .ready and .done files considered harmful
> Some minor suggestions:

Thanks for your comments. I have incorporated the changes and updated a new patch. Please find the attached patch v4.

Thanks,
Dipesh

On Mon, Jul 26, 2021 at 9:44 PM Bossart, Nathan wrote:
> On 7/26/21, 6:31 AM, "Robert Haas" wrote:
> > In terms of immediate next steps, I think we should focus on
> > eliminating the O(n^2) problem and not get sucked into a bigger
> > redesign. The patch on the table aims to do just that much and I think
> > that's a good thing.
>
> I agree. I'll leave further discussion about a redesign for another
> thread.
>
> Nathan

From c597411b08377ea64634d5198071060ec2b9c524 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file that needs to be archived. This directory scan can be minimised by maintaining the log segment number of current file which is being archived and incrementing it by '1' to get the next WAL file. Archiver can check the availability of next file and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file. If there is a timeline switch then backend sends a notification to archiver.
Archiver registers the timeline switch and performs a full directory scan to make sure that archiving history files takes precedence over archiving WAL files --- src/backend/access/transam/xlog.c | 8 +++ src/backend/postmaster/pgarch.c | 138 ++ src/include/postmaster/pgarch.h | 1 + 3 files changed, 135 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 3479402..2580ce8 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -50,6 +50,7 @@ #include "port/atomics.h" #include "port/pg_iovec.h" #include "postmaster/bgwriter.h" +#include "postmaster/pgarch.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/basebackup.h" @@ -8130,6 +8131,13 @@ StartupXLOG(void) WalSndWakeup(); /* + * If archiver is active, send notification that timeline has switched. + */ + if (XLogArchivingActive() && ArchiveRecoveryRequested && + IsUnderPostmaster) + PgArchNotifyTLISwitch(); + + /* * If this was a promotion, request an (online) checkpoint now. This isn't * required for consistency, but the last restartpoint might be far back, * and in case of a crash, recovering from it might take a longer than is diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..161a5b6 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -90,16 +90,18 @@ static PgArchData *PgArch = NULL; * Flags set by interrupt handlers for later service in the main loop. 
*/ static volatile sig_atomic_t ready_to_stop = false; +static volatile sig_atomic_t timeline_switch = false; /* -- * Local function forward declarations * -- */ static void pgarch_waken_stop(SIGNAL_ARGS); +static void pgarch_timeline_switch(SIGNAL_ARGS); static void pgarch_MainLoop(void); static void pgarch_ArchiverCopyLoop(void); static bool pgarch_archiveXlog(char *xlog); -static bool pgarch_readyXlog(char *xlog); +static bool pgarch_readyXlog(char *xlog, bool *dirScan, XLogSegNo *nextLogSegNo); static void pgarch_archiveDone(char *xlog); static void pgarch_die(int code, Datum arg); static void HandlePgArchInterrupts(void); @@ -169,10 +171,11 @@ PgArchiverMain(void) { /* * Ignore all signals usually bound to some action in the postmaster, - * except for SIGHUP, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT. + * except for SIGHUP, SIGINT, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT. */ pqsignal(SIGHUP, SignalHandlerForConfigReload); - pqsignal(SIGINT, SIG_IGN); + /* Archiver is notified by backend if there is a timeline switch */ + pqsignal(SIGINT, pgarch_timeline_switch); pqsignal(SIGTERM, SignalHandlerForShutdownRequest); /* SIGQUIT handler was already set up by InitPostmasterChild */ pqsignal(SIGALRM, SIG_IGN); @@ -221,6 +224,23 @@ PgArchWakeup(void) SetLatch(>allProcs[arch_pgprocno].procLatch); } +/* + * Called by backend process to notify a timeline switch. + */ +void +PgArchNotifyTLISwitch(void) +{ + int arch_pgprocno = PgArch->pgprocno; + + if (arch_pgprocno != INVALID_PGPROCNO) + { + int archiver_pid = ProcGlobal->allProcs[arch_pgprocno].pid; + + if (kill(archiver_pid, SIGINT) < 0) + elog(ERROR, "could not notify timeline change to archiver: %m"); + } +} + /* SIGUSR2 signal handler for archiver process */ static void @@ -236,6 +256,19 @@ pgarch_waken_stop(SIGNAL_ARGS) } /* + * Interrupt handler for hand
Re: .ready and .done files considered harmful
Hi,

> some comments on v2.

Thanks for your comments. I have incorporated the changes and updated a new patch. Please find the details below.

> On the timeline switch, setting a flag should be enough, I don't think
> that we need to wake up the archiver. Because it will just waste the
> scan cycle.

Yes, I modified it.

> Why do we need multi level interfaces? I mean instead of calling first
> XLogArchiveNotifyTLISwitch and then calling PgArchNotifyTLISwitch,
> can't we directly call PgArchNotifyTLISwitch()?

Yes, multilevel interfaces are not required. Removed the extra interface.

> +if (timeline_switch)
> +{
> +/* Perform a full directory scan in next cycle */
> +dirScan = true;
> +timeline_switch = false;
> +}
> I suggest you can add some comments atop this check.

Added a comment to specify the action required in case of a timeline switch.

> I think you should use %m in the error message so that it also prints
> the OS error code.

Done.

> Why is this a global variable? I mean whenever you enter the function
> pgarch_ArchiverCopyLoop(), this can be set to true and after that you
> can pass this as inout parameter to pgarch_readyXlog() there in it can
> be conditionally set to false once we get some segment and whenever
> the timeline switch we can set it back to the true.

Yes, it is not necessary for "dirScan" to have global scope. Changed the scope of "dirScan" and "nextLogSegNo" to local.

PFA patch v3.

Thanks,
Dipesh

From 76260a2ebf90fd063e06dac701e560a506b7a2b7 Mon Sep 17 00:00:00 2001
From: Dipesh Pandit
Date: Wed, 30 Jun 2021 14:05:58 +0530
Subject: [PATCH] mitigate directory scan for WAL archiver

WAL archiver scans the status directory to identify the next WAL file that needs to be archived. This directory scan can be minimised by maintaining the log segment number of current file which is being archived and incrementing it by '1' to get the next WAL file.
Archiver can check the availability of next file and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file. If there is a timeline switch then backend sends a notification to archiver. Archiver registers the timeline switch and performs a full directory scan to make sure that archiving history files takes precedence over archiving WAL files --- src/backend/access/transam/xlog.c | 8 +++ src/backend/postmaster/pgarch.c | 131 ++ src/include/postmaster/pgarch.h | 1 + 3 files changed, 128 insertions(+), 12 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c7c928f..baee37b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -50,6 +50,7 @@ #include "port/atomics.h" #include "port/pg_iovec.h" #include "postmaster/bgwriter.h" +#include "postmaster/pgarch.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" #include "replication/basebackup.h" @@ -8130,6 +8131,13 @@ StartupXLOG(void) WalSndWakeup(); /* + * If archiver is active, send notification that timeline has switched. + */ + if (XLogArchivingActive() && ArchiveRecoveryRequested && + IsUnderPostmaster) + PgArchNotifyTLISwitch(); + + /* * If this was a promotion, request an (online) checkpoint now. This isn't * required for consistency, but the last restartpoint might be far back, * and in case of a crash, recovering from it might take a longer than is diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..b604966 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -90,16 +90,18 @@ static PgArchData *PgArch = NULL; * Flags set by interrupt handlers for later service in the main loop. 
*/ static volatile sig_atomic_t ready_to_stop = false; +static volatile sig_atomic_t timeline_switch = false; /* -- * Local function forward declarations * -- */ static void pgarch_waken_stop(SIGNAL_ARGS); +static void pgarch_timeline_switch(SIGNAL_ARGS); static void pgarch_MainLoop(void); static void pgarch_ArchiverCopyLoop(void); static bool pgarch_archiveXlog(char *xlog); -static bool pgarch_readyXlog(char *xlog); +static bool pgarch_readyXlog(char *xlog, bool *dirScan, XLogSegNo *nextLogSegNo); static void pgarch_archiveDone(char *xlog); static void pgarch_die(int code, Datum arg); static void HandlePgArchInterrupts(void); @@ -169,10 +171,11 @@ PgArchiverMain(void) { /* * Ignore all signals usually bound to some action in the postmaster, - * except for SIGHUP, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT. + * except for SIGHUP, SIGINT, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT. */ pqsignal(SIGHUP, SignalHandler
Re: .ready and .done files considered harmful
Hi,

> I agree, I missed this part. The .history file should be given higher preference.
> I will take care of it in the next patch.

The archiver does not have access to shared memory, so the current timeline ID is not available to it. In order to keep track of a timeline switch, we have to push a notification from the backend to the archiver. The backend can send a signal to notify the archiver about the timeline change; the archiver registers this notification and performs a full directory scan to make sure that archiving history files takes precedence over archiving WAL files.

> If a history file is found we are not updating curFileTLI and
> nextLogSegNo, so it will attempt the previously found segment. This
> is fine because it will not find that segment and it will rescan the
> directory. But I think we can do better, instead of searching the
> same old segment in the previous timeline we can search that old
> segment in the new TL so that if the TL switch happened within the
> segment then we will find the segment and we will avoid the directory
> search.

This could have been done with the approach mentioned in patch v1, but now that archiving history files takes precedence over WAL files, we cannot update "curFileTLI" whenever a history file is found.

> So everytime archiver will start with searching segno=0 in timeline=0.
> Instead of doing this can't we first scan the directory and once we
> get the first segment to archive then only we can start predicting the
> next wal segment?

Done.

> This comment is a bit confusing with the name of the variable nextLogSegNo.
> I think the name of the variable is appropriate here, but maybe we can reword
> the comment something like:

Done.

I have incorporated these changes and updated a new patch. PFA, patch v2.
Thanks, Dipesh From 22b0e8fb9c778fbfdc945a647f82f6bbd8d6ec0a Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 30 Jun 2021 14:05:58 +0530 Subject: [PATCH] mitigate directory scan for WAL archiver WAL archiver scans the status directory to identify the next WAL file that needs to be archived. This directory scan can be minimised by maintaining the log segment number of current file which is being archived and incrementing it by '1' to get the next WAL file. Archiver can check the availability of next file and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file. If there is a timeline switch then backend sends a notification to archiver. Archiver registers the timeline switch and performs a full directory scan to make sure that archiving history files takes precedence over archiving WAL files --- src/backend/access/transam/xlog.c| 6 ++ src/backend/access/transam/xlogarchive.c | 10 +++ src/backend/postmaster/pgarch.c | 124 --- src/include/access/xlogarchive.h | 1 + src/include/postmaster/pgarch.h | 1 + 5 files changed, 133 insertions(+), 9 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c7c928f..40969a9 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8130,6 +8130,12 @@ StartupXLOG(void) WalSndWakeup(); /* + * If archiver is active, send notification that timeline has switched. + */ + if (XLogArchivingActive() && ArchiveRecoveryRequested) + XLogArchiveNotifyTLISwitch(); + + /* * If this was a promotion, request an (online) checkpoint now. 
This isn't * required for consistency, but the last restartpoint might be far back, * and in case of a crash, recovering from it might take a longer than is diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 26b023e..1968872 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -507,6 +507,16 @@ XLogArchiveNotifySeg(XLogSegNo segno) } /* + * Signal archiver to notify timeline switch + */ +void +XLogArchiveNotifyTLISwitch(void) +{ + if (IsUnderPostmaster) + PgArchNotifyTLISwitch(); +} + +/* * XLogArchiveForceDone * * Emit notification forcibly that an XLOG segment file has been successfully diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..e23b9bc 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -90,12 +90,22 @@ static PgArchData *PgArch = NULL; * Flags set by interrupt handlers for later service in the main loop. */ static volatile sig_atomic_t ready_to_stop = false; +static volatile sig_atomic_t timeline_switch = false; + +/* + * Log segment number to get next WAL file in a sequence. + */ +static XLogSegNo nextLogSegNo = 0; + +/* Flag to specify a full directory scan to find next log file */ +static bool dirScan = true; /* -- * Local function forward declarations * -- */ static voi
Re: .ready and .done files considered harmful
> specifically about history files being given higher priority for
> archiving. If we go with this change then we'd at least want to rewrite
> or remove those comments, but I don't actually agree that we should
> remove that preference to archive history files ahead of WAL, for the
> reasons brought up previously.
>
> As was suggested on that subthread, it seems like it should be possible
> to just track the current timeline and adjust what we're doing if the
> timeline changes, and we should even know what the .history file is at
> that point and likely don't even need to scan the directory for it, as
> it'll be the old timeline ID.

I agree, I missed this part. The .history file should be given higher preference. I will take care of it in the next patch.

Thanks,
Dipesh
Re: .ready and .done files considered harmful
Hi,

We have addressed the O(n^2) problem, which involves a directory scan for archiving each individual WAL file, by maintaining a WAL counter to identify the next WAL file in the sequence.

The WAL archiver scans the status directory to identify the next WAL file which needs to be archived. This directory scan can be minimized by maintaining the log segment number of the current file being archived and incrementing it by '1' to get the next WAL file in the sequence. The archiver can check the availability of the next file in the status directory, and if the file is not available it falls back to a directory scan to get the oldest WAL file.

Please find attached patch v1.

Thanks,
Dipesh

On Fri, May 7, 2021 at 1:31 AM Andres Freund wrote:
> Hi,
>
> On 2021-05-06 21:23:36 +0200, Hannu Krosing wrote:
> > How are you envisioning the shared-memory signaling should work in the
> > original sample case, where the archiver had been failing for half a
> > year ?
>
> If we leave history files and gaps in the .ready sequence aside for a
> second, we really only need an LSN or segment number describing the
> current "archive position". Then we can iterate over the segments
> between the "archive position" and the flush position (which we already
> know). Even if we needed to keep statting .ready/.done files (to handle
> gaps due to archive command mucking around with .ready/done), it'd still
> be a lot cheaper than what we do today. It probably would even still be
> cheaper if we just statted all potentially relevant timeline history
> files all the time to send them first.
>
> > Or should we perhaps have a system table for ready-to-archive WAL
> > files to get around limitations of file system to return just the
> > needed files with ORDER BY ... LIMIT as we already know how to make
> > lookups in database fast ?
>
> Archiving needs to work on a standby so that doesn't seem like an
> option.
> > Regards, > > Andres Freund > > > From ce5591c7ef574a1b039a5de807eb0c566a3456e1 Mon Sep 17 00:00:00 2001 From: Dipesh Pandit Date: Wed, 30 Jun 2021 14:05:58 +0530 Subject: [PATCH] mitigate directory scan for WAL archiver WAL archiver scans the status directory to identify the next WAL file which needs to be archived. This directory scan can be minimized by maintaining the log segment number of current file which is being archived and incrementing it by '1' to get the next WAL file. Archiver can check the availability of next file and in case if the file is not available then it should fall-back to directory scan to get the oldest WAL file. --- src/backend/postmaster/pgarch.c | 63 - 1 file changed, 56 insertions(+), 7 deletions(-) diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c index 74a7d7c..eb219d7 100644 --- a/src/backend/postmaster/pgarch.c +++ b/src/backend/postmaster/pgarch.c @@ -87,6 +87,12 @@ static time_t last_sigterm_time = 0; static PgArchData *PgArch = NULL; /* + * Log segment number and timeline ID to get next WAL file in a sequence. + */ +static XLogSegNo nextLogSegNo = 0; +static TimeLineID curFileTLI = 0; + +/* * Flags set by interrupt handlers for later service in the main loop. */ static volatile sig_atomic_t ready_to_stop = false; @@ -411,6 +417,10 @@ pgarch_ArchiverCopyLoop(void) /* successful */ pgarch_archiveDone(xlog); +/* Increment log segment number to point to the next WAL file */ +if (!IsTLHistoryFileName(xlog)) + nextLogSegNo++; + /* * Tell the collector about the WAL file that we successfully * archived @@ -596,29 +606,55 @@ pgarch_archiveXlog(char *xlog) * larger ID; the net result being that past timelines are given higher * priority for archiving. This seems okay, or at least not obviously worth * changing. + * + * WAL files are generated in a specific order of log segment number. The + * directory scan for each WAL file can be minimized by identifying the next + * WAL file in the sequence. 
This can be achieved by maintaining log segment + * number and timeline ID corresponding to WAL file currently being archived. + * The log segment number of current WAL file can be incremented by '1' upon + * successful archival to point to the next WAL file. */ static bool pgarch_readyXlog(char *xlog) { - /* - * open xlog status directory and read through list of xlogs that have the - * .ready suffix, looking for earliest file. It is possible to optimise - * this code, though only a single file is expected on the vast majority - * of calls, so - */ + char basename[MAX_XFN_CHARS + 1]; + char xlogready[MAXPGPATH]; char XLogArchiveStatusDir[MAXPGPATH]; DIR *rldir; struct dirent *rlde; + struct stat st; bool found = false; bool historyFound = false; + /* + * Log segment number already p
Re: refactoring basebackup.c
Hi,

I have repeated the experiment with 8K block size and found that the results do not vary much after applying the patch. Please find the details below.

*Backup type*: local backup using pg_basebackup
*Data size*: around 200 GB (200 tables, each around 1.05 GB)
*TAR_SEND_SIZE value*: 8kB

*Server details:*
RAM: 500 GB
CPU details: Architecture x86_64, CPU op-mode(s) 32-bit/64-bit, Byte Order Little Endian, CPU(s) 128
Filesystem: ext4

*Results:*

Iteration   Without refactor patch   With refactor patch
1st run     real 10m19.001s          real 9m45.291s
            user 1m37.895s           user 1m23.192s
            sys  8m33.008s           sys  8m14.993s
2nd run     real 9m33.970s           real 9m30.560s
            user 1m19.490s           user 1m22.124s
            sys  8m6.062s            sys  8m0.979s
3rd run     real 9m19.327s           real 8m59.241s
            user 1m21.772s           user 1m19.001s
            sys  7m50.613s           sys  7m32.645s
4th run     real 9m56.873s           real 9m52.290s
            user 1m22.370s           user 1m22.175s
            sys  8m27.054s           sys  8m23.052s
5th run     real 9m45.343s           real 9m49.633s
            user 1m23.113s           user 1m23.122s
            sys  8m15.418s           sys  8m19.240s

Later I connected with Suraj to validate the experiment details and found that the setup and steps followed in this experiment are exactly the same as in the previous experiment.

Thanks,
Dipesh

On Thu, May 14, 2020 at 7:50 AM Suraj Kharage <suraj.khar...@enterprisedb.com> wrote:
> Hi,
>
> On Wed, May 13, 2020 at 7:49 PM Robert Haas wrote:
>>
>> So the patch came out slightly faster at 8kB and slightly slower in the
>> other tests. That's kinda strange. I wonder if it's just noise. How much do
>> the results vary run to run?
>>
> It is not varying much except for the 8kB run. Please see below details for
> both runs of each scenario.
>
>                 8kB               32kB (default)    128kB             1024kB
> Without refactor patch
>   1st run       real 10m50.924s   real 8m36.245s    real 7m8.690s     real 18m16.898s
>                 user 1m29.774s    user 1m8.471s     user 0m54.840s    user 1m39.105s
>                 sys  9m13.058s    sys  7m21.520s    sys  6m1.725s     sys  9m42.803s
>   2nd run       real 10m22.718s   real 8m44.455s    real 6m54.299s    real 18m3.511s
>                 user 1m23.629s    user 1m7.896s     user 0m55.690s    user 1m38.197s
>                 sys  8m51.410s    sys  7m28.909s    sys  5m46.502s    sys  9m36.517s
> With refactor patch
>   1st run       real 10m11.350s   real 8m56.226s    real 7m26.678s    real 19m5.218s
>                 user 1m25.038s    user 1m9.774s     user 0m54.833s    user 1m44.122s
>                 sys  8m39.226s    sys  7m41.032s    sys  6m20.057s    sys  10m17.623s
>   2nd run       real 11m30.500s   real 9m4.103s     real 7m26.713s    real 18m17.230s
>                 user 1m45.221s    user 1m6.893s     user 0m54.868s    user 1m42.749s
>                 sys  9m37.815s    sys  7m49.393s    sys  6m19.652s    sys  9m53.704s
>
> --
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
Re: WIP/PoC for parallel backup
Hi Asif,

I am reviewing your recent patch and found that it does not apply on the latest master. Could you please resolve the conflicts and post an updated patch?

Thanks,
Dipesh
EnterpriseDB: http://www.enterprisedb.com